On the Balkanization of Data Storage
2014 marked the centennial of the beginning of the First World War (WWI): an event that had a terrible human cost and paved the way to major political and social changes across the globe. The centennial prompted a wave of retrospection around the events and causes that led to WWI (ref. The Sleepwalkers: How Europe went to War in 1914 by Christopher Clark, The Guns of August by Barbara W. Tuchman). Historians have suggested technological advances, especially in engines and communications, as both a reason for the geopolitical instability that preceded the war and the main cause of the resulting unprecedented casualty rate.
Credit: The National Archives (UK)
WWI turned Europe from a land of powerful empires to a patchwork of small, often weak nation states. No area in Europe reflects that change more dramatically than the Balkans, where the Austro-Hungarian and Ottoman empires gave way to dozens of small states (ref. The Balkans: A Short History by Mark Mazower). It was only through economic unification and close collaboration amongst the resulting nation states that Europe eventually put tension and armed conflicts behind her, and that states and people prospered (ref. World Order by Henry Kissinger).
Without intending to trivialize the importance and terrible toll of the Great War, I cannot help but notice some parallels with the data storage industry today. Until only a few years ago, this industry was dominated by a few major players (the equivalent of Empires) that provided storage products and solutions to Enterprise IT using stringent and prescriptive approaches: tightly controlled hardware, well-tested recipes for software, two data-access protocols – block and file. That was the model for providing storage for the so-called “second platform”—enterprise applications and automation that followed the typical client-server model. It was a simple world, and the control of the big players absolute.
The Third Platform
That was the world order until the 2000s, when a wave of rapid technological evolution emerged to meet the needs of cloud-scale applications and use cases that ranged from search engines to new ecommerce models and social media. This is the era of ubiquitous web services, big data and analytics. The combustion engine and the electric telegraph changed the socioeconomic order of the world at the turn of the 20th century. In a similar fashion, the turn of the 21st century witnessed the rapid introduction of new technologies for application-oriented storage, including MapReduce, GFS, and their open-source counterparts (Hadoop, HDFS), as well as a myriad of new-generation databases that deviate from the traditional RDBMS model (e.g., HBase, Hive, Cassandra, MongoDB).
The third platform, a loosely-defined term used to describe the collection of such technologies, made its debut in the data centers of big Internet companies that are at the forefront of the third platform evolution (Google, Facebook, etc.). Soon, it started making its way into more typical enterprise environments thanks to companies that made that their business model (Cloudera, Hortonworks).
However, unlike traditional storage systems which are owned and run by central IT organizations, new data storage platforms are installed and run by individual lines of business. They are used to run applications and workloads that are central to the needs of individual product groups and functional teams, whether R&D, market analysis, marketing, or finance. Even in the rare cases where they are owned by central IT organizations, there is a multitude of different, disassociated platforms, one for each application and data set.
The result is the Balkanization of data storage systems: a multitude of data pools spread both geographically and organizationally across the enterprise. Each platform has its own requirements for hardware configurations, ranging from the traditional SAN models to the more contemporary distributed architectures. The platforms have completely different models of management, anything from manual management to exclusively programmatic APIs. They require different types of skills and expertise. Even worse, huge volumes of data need to migrate regularly between different pools depending on how the data are consumed at different points in their life cycle. For example, data are siphoned out of the corporate Oracle DBs into Hadoop clusters where they undergo different levels of filtering and analysis, only for the results to be further piped into NoSQL databases to be used for research and marketing purposes. As businesses and applications evolve, the number of different data pools proliferates and the situation becomes unmanageable. Capex and Opex spin out of control and, more importantly for many businesses, there is serious impact on data governance and regulatory compliance.
As it was for the world at the turn of the 20th century, technology becomes a double-edged sword.
The Hope of Reunification
In my opinion, the industry needs to follow two principles to address the problem of data storage Balkanization.
1. Multifaceted storage platforms
By this, I mean storage platforms that can support different data abstractions and access protocols: block, file, and key-value data stores, to name a few. Even more importantly, the architecture of such platforms must accommodate multiple workloads with varied characteristics, all of them sharing the same physical resources (CPU, buses, memory, network, storage controllers and devices).
Envision, for example, a storage cluster based on commodity hardware on which one can run their legacy relational CRM database, and a virtual HDFS cluster used by the market analytics team, and a Cassandra database used for a new customer-facing mobile application. The recipe for building such platforms includes three main ingredients:
- A virtualization technology through which the platform can identify and control the workloads of the different applications and “users” that share the platform.
- Adaptive storage resource management and data path controls (scheduling) to meet the QoS goals of the different workloads without interference amongst each other.
- Management tools and APIs for provisioning, configuration, monitoring and troubleshooting that are orthogonal to the different data abstractions exposed by the storage platform.
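To make the second ingredient more concrete, here is a minimal sketch of one well-known way to share a data path among workloads: weighted fair queuing over I/O requests. It is an illustration under stated assumptions, not a description of any particular product. The class name WeightedFairScheduler, the workload names and the weights are all hypothetical, and a real platform would also have to adapt the weights over time and account for device behavior.

```python
import heapq
from collections import defaultdict
from fractions import Fraction


class WeightedFairScheduler:
    """Toy weighted fair queuing over I/O requests (hypothetical sketch)."""

    def __init__(self, weights):
        # workload name -> integer weight (relative share of the data path)
        self.weights = dict(weights)
        self.last_finish = defaultdict(Fraction)  # per-workload finish tag
        self.virtual_time = Fraction(0)
        self._queue = []  # heap of (finish, seq, start, workload, cost)
        self._seq = 0     # tie-breaker that preserves submission order

    def submit(self, workload, cost):
        """Queue a request; cost is an integer, e.g. the number of blocks."""
        start = max(self.virtual_time, self.last_finish[workload])
        # A higher weight makes each request advance the finish tag less,
        # so that workload is dispatched more often.
        finish = start + Fraction(cost, self.weights[workload])
        self.last_finish[workload] = finish
        heapq.heappush(self._queue, (finish, self._seq, start, workload, cost))
        self._seq += 1

    def dispatch(self):
        """Return the (workload, cost) of the request with the smallest finish tag."""
        _finish, _, start, workload, cost = heapq.heappop(self._queue)
        self.virtual_time = max(self.virtual_time, start)
        return workload, cost


# Usage: the legacy CRM database is given three times the share of the
# analytics cluster, and both have a backlog of equal-sized requests.
sched = WeightedFairScheduler({"crm": 3, "hdfs": 1})
for _ in range(8):
    sched.submit("crm", 1)
for _ in range(8):
    sched.submit("hdfs", 1)
first = [sched.dispatch()[0] for _ in range(8)]
# "crm" receives 6 of the first 8 dispatch slots and "hdfs" receives 2.
```

The point of the sketch is that the scheduler needs to know which workload each request belongs to, which is exactly what the first ingredient (workload identification) provides.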
2. Advanced data services
Platform consolidation is necessary but not sufficient. What use is a platform that can natively serve instances of RDBMS, HDFS and key-value stores, if it mandates elaborate and slow data copying between those instances? For example, copying large volumes of data from filers and DBs to HDFS for analytics and then to some NoSQL DB results in waste of system resources that would have otherwise been available to run application workloads.
The new generation of storage platforms needs to offer powerful data services for efficient manipulation and transformation of large data sets. Examples include:
- Snapshot technologies that can create snapshots of data objects (volumes, files, DBs) instantaneously and cheaply. Such snapshots can be used either as read-only data sets for processing by analytics engines or as mutable copies (clones) for test, dev, and research purposes. Reducing the capacity footprint of data using mechanisms such as deduplication or compression is of paramount importance in this context.
- Point-in-time copies of data that can be distributed and stored in different physical locations for data protection and disaster recovery purposes. These should be accompanied by metadata services that ensure that data are traceable and auditable.
- Unified security primitives for access control, data integrity and encryption, which are applicable for different data abstractions and access protocols.
- Application-enabling data services, such as Amazon’s recent AWS Lambda feature, which allows developers to write dynamic, event-driven applications that adapt to the changes and transformations of their data sets. One can envision similar services around metadata for data provenance, security and auditing. Moreover, the concept of a lambda function could be extended to support a model of code execution at the physical location of the data, for performance or data sovereignty purposes.
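As an illustration of why snapshots and clones can be instantaneous and cheap, the following is a minimal copy-on-write sketch. It assumes a toy in-memory block volume; the Volume class and its methods are hypothetical and do not correspond to any vendor's implementation. Taking a snapshot copies no data: the current block map is frozen and shared, and only blocks written afterwards consume new space.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots and clones (hypothetical)."""

    def __init__(self, parent=None):
        self._blocks = {}      # block number -> bytes written at this layer
        self._parent = parent  # frozen layer underneath, or None

    def read(self, n):
        # Walk down the chain of layers until some layer holds the block.
        vol = self
        while vol is not None:
            if n in vol._blocks:
                return vol._blocks[n]
            vol = vol._parent
        return b""  # block was never written

    def write(self, n, data):
        # Writes only ever touch the top layer; lower layers stay frozen.
        self._blocks[n] = data

    def snapshot(self):
        """Freeze the current state in O(1) and return it as a snapshot."""
        frozen = Volume(parent=self._parent)
        frozen._blocks = self._blocks  # hand over the block map, no copy
        self._blocks = {}
        self._parent = frozen
        return frozen  # treated as read-only by convention

    def clone(self):
        """Call on a snapshot: return a mutable volume layered on top of it."""
        return Volume(parent=self)


# Usage: snapshot a volume, keep writing to it, and branch a clone for testing.
vol = Volume()
vol.write(0, b"a")
snap = vol.snapshot()
vol.write(0, b"b")      # the live volume diverges from the snapshot
clone = snap.clone()
clone.write(1, b"c")    # the clone diverges too, without touching the snapshot
```

An analytics engine could read from snap while production continues to write to vol, and a test team could mutate clone freely; all three share the unmodified blocks.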
Such data services shall be offered through programmatic APIs that will be usable not only by automation tools but even more importantly in this new era by applications that are designed around data transformations and processing.
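A rough sketch of what such an application-facing API might look like is given below: a store that lets applications register functions to be run whenever data under a given key prefix changes. This is not the AWS Lambda API; EventfulStore, on_put and the key layout are hypothetical names chosen purely for illustration.

```python
from collections import defaultdict


class EventfulStore:
    """Toy key-value store that runs registered handlers on writes (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._handlers = defaultdict(list)  # key prefix -> list of callables

    def on_put(self, prefix, handler):
        """Register handler(store, key, value) for keys starting with prefix."""
        self._handlers[prefix].append(handler)

    def put(self, key, value):
        self._data[key] = value
        # Iterate over a copy so that handlers may register further handlers.
        for prefix, handlers in list(self._handlers.items()):
            if key.startswith(prefix):
                for handler in handlers:
                    handler(self, key, value)

    def get(self, key):
        return self._data.get(key)


def summarize(store, key, value):
    # Derive a word count from each raw document and store it alongside.
    store.put("derived/" + key[len("raw/"):], len(value.split()))


# Usage: a data transformation is attached to the data, not to a batch job.
store = EventfulStore()
store.on_put("raw/", summarize)
store.put("raw/report.txt", "storage is software defined")
word_count = store.get("derived/report.txt")  # 4
```

In this model the derived data set is kept up to date by the platform itself, rather than by a separate pipeline that periodically copies data from one pool to another.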
Predicting the Future
It took the best part of the 20th century for the European nation states to reach an equilibrium wherein individual national identities and interests co-exist within a unified economy and market. (One can argue that the Balkans still have some way to go.)
Similarly, the path to storage consolidation and unification will not be fast or easy. The public cloud players are blazing the trail. They teach the industry valuable lessons by their well-publicized successes and blunders. In the enterprise space (or on-premise cloud as some like to call the private cloud-like IaaS built in big enterprises) we will have to live with islands of data storage for a very long time. If anything, the trend is accelerating.
However, I predict that within the next couple of years, IT organizations will start reconsidering the situation. Operational costs and data governance requirements will force them to rein in the sprawling pools of data.
I see a few obvious ways in which things will evolve. As I predicted in the past, the future of storage is software defined. “Big iron” disk arrays are going the way of the 19th century European Empires. Hardware, including emerging storage technologies, is commoditized: its properties and interfaces are mature and stable. The added value comes from software. The third platform storage solutions are all implemented in software on commodity compute and network. Distributed software is the key ingredient of storage platforms with unprecedented levels of scalability and fault tolerance. Data services (outlined above) and the programmatic APIs to manage them are all about software that allows legacy and new-generation applications to co-exist on the same consolidated platforms; platforms that support a variety of data abstractions and protocols.
I wish to take the point on programmatic APIs a step further. New generation applications will be built around a model of dynamic data composition and transformation. Such APIs are not just the interface to the “storage control plane”. They are essential primitives of the third platform. They will facilitate the development of data-oriented applications of unprecedented scale and sophistication. Lambda functions are a harbinger of things to come.
What does all this mean for the storage platforms of the future? Storage vendors are currently split into distinct groups: a) traditional block/file storage platforms that are superficially extended to export the occasional REST API or HDFS protocol; b) a hotchpotch of different, often competing open-source projects, each of them aimed at specific use cases. Few existing platforms have architectures that can efficiently support different data abstractions and generic data services. Some of them use a generic internal object abstraction around which data services are implemented; different data “personalities” and protocols are layered on top of objects (e.g., RADOS, VSAN, Atmos). Others build on the basic abstraction of a file (e.g., GFS, Isilon), without necessarily the traditional namespace and POSIX semantics that file systems imply.
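The idea of layering data “personalities” on top of a generic object abstraction can be sketched as follows. This is a deliberately simplified illustration, not how RADOS, VSAN or Atmos are implemented; every class name and the object naming scheme are assumptions made for the example.

```python
class ObjectStore:
    """Generic internal abstraction: named, opaque byte objects (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, oid, data):
        self._objects[oid] = bytes(data)

    def get(self, oid):
        return self._objects.get(oid)

    def count(self):
        return len(self._objects)


class KeyValuePersonality:
    """Key-value protocol expressed in terms of objects."""

    def __init__(self, store, namespace):
        self._store, self._ns = store, namespace

    def set(self, key, value):
        self._store.put(f"kv/{self._ns}/{key}", value.encode("utf-8"))

    def get(self, key):
        data = self._store.get(f"kv/{self._ns}/{key}")
        return None if data is None else data.decode("utf-8")


class BlockPersonality:
    """Fixed-size block protocol expressed in terms of the same objects."""

    BLOCK_SIZE = 4096

    def __init__(self, store, volume):
        self._store, self._volume = store, volume

    def write_block(self, lba, data):
        if len(data) != self.BLOCK_SIZE:
            raise ValueError("a block must be exactly BLOCK_SIZE bytes")
        self._store.put(f"blk/{self._volume}/{lba}", data)

    def read_block(self, lba):
        data = self._store.get(f"blk/{self._volume}/{lba}")
        # Unwritten blocks read back as zeros, as on a freshly created volume.
        return data if data is not None else bytes(self.BLOCK_SIZE)


# Usage: two protocols share one pool of objects.
pool = ObjectStore()
kv = KeyValuePersonality(pool, "mobile-app")
blk = BlockPersonality(pool, "crm-db")
kv.set("user:42", "alice")
blk.write_block(0, b"x" * BlockPersonality.BLOCK_SIZE)
```

Because both personalities reduce to the same put and get operations, a data service implemented once at the object layer (snapshots, replication, encryption) would apply to every protocol layered above it.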
Vendors have a long way to go to evolve their platforms for the emerging model of Third Platform storage. Lessons will be learned from cloud providers and from the open-source community. In some cases, open-source technologies will be integrated with more traditional platforms. In other cases, platforms will evolve natively. Ultimately, consolidated products will be built for the generic IT populace one way or another. It took several decades after WWI and many different treaties to establish the political and economic framework for a peaceful and prosperous Europe. Similarly, this is the time for the IT industry to invest in the right storage architectures and get ready for the new era of data storage federations.