What constitutes business-critical data for an enterprise today? A decade ago, the answer would have been straightforward: the data associated with business-critical applications such as inventory management, email systems, financials, etc. IT was focused primarily on transforming business processes into efficient electronic workflows.
Today, globalization is forcing enterprises to better optimize and personalize workflows to remain competitive (and relevant). As such, enterprise data is no longer limited to application-centric state; a new class of data that I refer to as “activity-related data” is growing exponentially: click logs, IoT sensor readings, supply-chain logistics, and other industry-vertical data. The Big Data revolution has democratized the analytics that enterprises employ to explore and extract patterns, correlations, and other characteristics that can improve their top line and bottom line. This paradigm is also referred to as the “data-driven enterprise”.
Given the changing face of enterprise data, many of the emerging solutions in the storage industry amount to “old wine in a new bottle”: traditional data management solutions such as backup, DR, and tiering applied to the new breed of data. To clarify, traditional application-centric data will continue to be critical; the question we are trying to tackle is whether the same data management solutions apply to the new class of data, and what criteria storage solutions must satisfy. In this post, I share my insights on how the new face of enterprise data maps to enterprise storage solutions.
To make our discussion concrete, let’s take the example of a hypothetical durable-goods manufacturing enterprise. The enterprise has IT applications for payroll, email, inventory management, etc. As its business expands into global manufacturing and distribution, efficiencies in production, procurement, warehousing, marketing, and so on become the new differentiators (in addition to the original manufacturing technology). To achieve these efficiencies, the enterprise starts collecting activity-related data: supply-chain logistics, IoT sensors on machinery, geographic and social media sentiment, conversion rates of marketing channels, and so on. The key characteristic of this data is that its inherent value is not predefined; it requires data exploration techniques to extract the signal from the noise and derive business rules.
There is no one-size-fits-all when it comes to data exploration or Big Data analytics. In traditional relational databases, exploration was interactive, via SQL queries. Hadoop democratized batch analytics for unstructured data with the MapReduce programming model. Today, streaming analytics, machine learning packages, distributed search and indexing, and more are among the plethora of open-source solutions available for analyzing data. Most enterprises employ several of these techniques together, given the different sources and forms of data being collected; there is no silver bullet.
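To make the batch/streaming contrast concrete, here is a minimal sketch in plain Python (no framework; the click-log records and field names are hypothetical) answering the same question two ways: batch style over the whole dataset at once, and streaming style, folding each record into running state as it arrives:

```python
from collections import Counter

# Hypothetical click-log records; the field names are illustrative only.
clicks = [
    {"user": "u1", "page": "/pricing"},
    {"user": "u2", "page": "/docs"},
    {"user": "u1", "page": "/pricing"},
]

def batch_page_counts(records):
    # Batch (MapReduce-like) style: one answer, computed over everything.
    return Counter(r["page"] for r in records)

def stream_page_counts(records):
    # Streaming style: an updated answer after every event.
    counts = Counter()
    for r in records:
        counts[r["page"]] += 1
        yield dict(counts)

print(batch_page_counts(clicks))
for snapshot in stream_page_counts(clicks):
    print(snapshot)
```

Neither style subsumes the other: batch gives complete answers late, streaming gives partial answers early, and most enterprises need both.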
To understand storage solutions that complement the new face of data, let’s compare and contrast data lifecycle management for traditional application-related data and the new activity-related data. The data lifecycle can be roughly divided into three phases: ingestion, access (read/write), and retention/deletion:
- Data Ingestion: Traditionally, data ingestion equated to data generated and maintained by business-critical applications; the focus was on best serving the application’s QoS and data availability requirements. In contrast, activity-related data is by definition collected from multiple sources, generated continuously, and subject to significant variance in traffic volume. The ingestion process differs in two respects:
- Data Aggregation model: Activity-related data is collected from a large number of sources. Aggregating those sources into a coherent stream of records is a new data ingestion requirement.
- Data Persistence model: Application-related data was persisted as named records, each identified by a unique ID: files, rows in a relational database, objects, etc. In contrast, activity-related data is a stream of records, so storage systems need to be optimized not for persistence and retrieval of individual records, but for streams of correlated records (a minimal sketch of this stream model appears after these lists).
- Data Access: In the past, access activity was driven mainly by the application. As storage developers, we used to sweat over IO access patterns (read/write, random/sequential, block size, etc.) and optimize the storage system for those characteristics. Activity-related data is far less predictable in its access characteristics, because access is driven by wherever value is discovered in the data. There are two new aspects to consider:
- Data-centric APIs: Traditionally, storage was all about store/retrieve APIs. Today, storage needs to support APIs for higher-order functions: search and indexing, data governance, and data structures such as key-value, graph, and documents (sketched after these lists). I refer to this phenomenon as the blurring of storage and databases, and have covered it in detail in some of my previous posts.
- Messaging APIs: The micro-services paradigm is transforming application development from monolithic applications to “LEGO-style” pipeline programming, where micro-services cooperate to analyze different aspects of the data. Storage needs to support messaging semantics as a first-class service; these interactions can be generically represented as a pub-sub model (see the stream sketch after these lists). In other words, application-centric data is optimized for a single-producer, single-consumer access model, while activity-related data is by definition multiple-producer, multiple-consumer.
- Data Retention: While data continues to grow exponentially for most enterprises, storage capacity typically grows only linearly, so it is not possible to retain all the data. Application-centric data is subject to well-defined retention policies (e.g., archive all inventory operations older than six months; retain medical records for three years). How do we handle retention of activity-related data? Streams are worth retaining if they yield useful correlations or actionable business rules. As enterprises continually grapple with this capacity deficit, streaming analytics will play an increasingly important role in summarizing data in real time or near real time (a windowed-summarization sketch follows these lists). Activity-related data is also far more perishable: its relevance diminishes rapidly with time.
Beyond the lifecycle phases, two criteria stand out for storage solutions serving this new class of data:
- Exploration Yoga: Activity-related data has no inherent value unless it is analyzed. I believe we are in the early phases of a Big Data evolution, with significant innovation and adoption in the coming years. It would be a fallacy to envision a one-size-fits-all model for data analytics; instead, enterprises require the ability to plug in different analytics to analyze data in batch, interactively, or in real time. Flexibility and agility of analysis are the key requirements now! With micro-services becoming the de facto programming model for developing applications, the storage system needs to be thought of more as a clearinghouse that supports persistence and communication between services, rather than a safe-deposit box optimized for single producer-consumer access. The phenomenon is visible in Web 2.0 innovations such as Kafka (and its variants) and the Lambda architecture.
- First-class Analytics Support: In recent years, enterprise storage solutions have extended POSIX block/file semantics to support Hadoop, Spark, and other popular analytics frameworks. I believe storage needs to support analytics as a first-class, out-of-the-box service. There is no one-size-fits-all analytics, but there is no reason why a single programming model cannot represent batch, streaming, and interactive analytics, with constructs that integrate deeply with storage architectures. Google’s Dataflow is a recent innovation in combining batch and streaming analytics (the final sketch below illustrates this unified-pipeline idea). Similarly, Hadoop’s shift towards YARN aims to provide a platform for diverse analytics running on HDFS.
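To ground the ingestion and messaging points above, here is a minimal in-memory sketch of the append-only, multi-producer/multi-consumer stream model. The `StreamLog` class and its record fields are hypothetical; production systems such as Kafka layer partitioning, durability, and consumer groups on top of this basic idea:

```python
import threading

class StreamLog:
    """Append-only log: records persist as an ordered stream,
    not as individually named objects."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record):
        # Many producers call this concurrently; the return value is an
        # offset into the stream, not a per-record name.
        with self._lock:
            self._records.append(record)
            return len(self._records) - 1

    def read_from(self, offset):
        # Each consumer tracks its own offset and reads at its own pace.
        with self._lock:
            return list(self._records[offset:])

# Multiple producers (sources) aggregate into one coherent stream...
log = StreamLog()
log.append({"src": "iot", "machine": "press-4", "temp_c": 71})
log.append({"src": "supply_chain", "shipment": "SC-19", "status": "delayed"})

# ...and multiple independent consumers replay it from their own offsets.
print("consumer-a saw:", log.read_from(0))
print("consumer-b saw:", log.read_from(1))
```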
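Next, a sketch of the data-centric API idea from the Data Access discussion: a store that exposes key-value put/get plus search over a toy inverted index as first-class operations, rather than raw store/retrieve alone. The class and field names are illustrative assumptions, not any product’s API:

```python
class DataCentricStore:
    """Storage exposing higher-order APIs (key-value plus
    search/indexing) rather than only raw store/retrieve."""

    def __init__(self):
        self._kv = {}      # key -> document
        self._index = {}   # term -> set of keys (toy inverted index)

    def put(self, key, doc):
        self._kv[key] = doc
        for term in doc.get("tags", []):
            self._index.setdefault(term, set()).add(key)

    def get(self, key):
        return self._kv.get(key)

    def search(self, term):
        # Search as a first-class storage API, not a bolt-on layer.
        return [self._kv[k] for k in self._index.get(term, set())]

store = DataCentricStore()
store.put("rec-1", {"body": "conveyor vibration spike", "tags": ["iot", "alert"]})
store.put("rec-2", {"body": "Q3 shipment manifest", "tags": ["supply_chain"]})
print(store.search("iot"))
```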
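For retention, a sketch of tumbling-window summarization: perishable raw readings are folded into compact per-window summaries, so the summaries can be retained while the raw stream is aged out. The 60-second window and the (timestamp, value) record shape are assumptions for illustration:

```python
from statistics import mean

def summarize_windows(readings, window_s=60):
    # Tumbling windows: emit one compact summary per window so the
    # raw records can be discarded once the window closes.
    window, start, summaries = [], None, []
    for ts, value in readings:  # (timestamp_seconds, sensor_value)
        if start is None:
            start = ts
        if ts - start >= window_s:
            summaries.append({"window_start": start, "avg": mean(window), "n": len(window)})
            window, start = [], ts
        window.append(value)
    if window:
        summaries.append({"window_start": start, "avg": mean(window), "n": len(window)})
    return summaries

readings = [(0, 70.1), (20, 70.4), (65, 71.0), (80, 70.8)]
print(summarize_windows(readings))
# Two summaries survive; the four raw readings need not.
```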
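Finally, a toy illustration of the unified batch/streaming idea behind systems like Google’s Dataflow. This is not Dataflow’s actual API; it only demonstrates the principle that one set of declared transforms can run unchanged over a bounded source (batch) and an unbounded one (streaming):

```python
import itertools

def pipeline(source, *transforms):
    # One declaration, two execution modes: the same transforms apply
    # whether `source` is bounded (batch) or unbounded (streaming).
    stream = source
    for t in transforms:
        stream = t(stream)
    return stream

def parse(records):
    for r in records:
        yield r.strip().lower()

def keep_errors(records):
    return (r for r in records if "error" in r)

# Batch: a finite, already-collected log (a list stands in for a file).
print(list(pipeline(["OK boot", "ERROR disk full", "ok ping"], parse, keep_errors)))

# Streaming: the identical pipeline over an unbounded source.
live = (f"ERROR event {i}" for i in itertools.count())
for rec in itertools.islice(pipeline(live, parse, keep_errors), 2):
    print(rec)
```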
In summary, IT today faces a different challenge: the search for business differentiators within existing electronic workflows. Enterprise data is shifting from structured, named records to large-scale, unnamed, schema-less record streams, whose value must be derived by applying different analytics. Application-centric data will remain critical and will continue to fuel the existing innovation curve for storage solutions. But instead of continuing to apply techniques from application-centric data to the new class of data, it is time for the storage industry to get out of its comfort zone and provide clean-slate, holistic solutions!