Enterprises today employ a variety of specialized NoSQL/NewSQL data platforms for managing documents, key-values, graphs, sparse tables, etc. Developers love the specialized APIs, as they increase agility and velocity when delivering their applications. Ops is on the receiving side of this equation: it is required to manage developer-defined software stacks in production, and is accountable for security, compliance, availability, governance, and operational efficiency. As an industry, we have oscillated from data silos with Direct-Attached Storage (DAS) in the mid-’80s, to centralization with networked storage in the late ’90s, and now back to silos with multiple specialized data management pools. I refer to the current state of affairs with data silos as the “CAP aftermath,” as Eric Brewer’s CAP formulation has been the poster child for specialized distributed solutions that implemented the various design trade-offs required by Web 2.0-specific use-cases. This blog post delves deeper into this topic and aims to articulate the road ahead for the storage and data management community.
Enterprise data today is becoming increasingly fragmented across silos of specialized interfaces. So, what’s the pain-point with silos? For Web 2.0 companies with an army of engineers specializing in individual NoSQL/NewSQL technologies, these are not pain-points. But for mainstream adopters, especially SMB enterprises that aspire to adopt cutting-edge data analytics to extract insights from their data, managing multiple technologies is a nightmare. To illustrate the complexity, compare the situation to managing traditional storage arrays from multiple vendors. Even with limited and fairly standardized block/file semantics, it took several years to build single pane-of-glass solutions that hide the technology specifics. The complexity today is not in building a pane-of-glass with all the bells and whistles, but rather in normalizing non-POSIX semantics that remain untamed like the Wild West: behaviors such as read-write (RW) coherence or write-write (WW) serialization have no standardized semantics. The key point here is that Ops needs expertise in each of the varied solutions; they cannot be managed simply as black boxes. Alternatively, the APIs can be consumed as a cloud PaaS, but that is not the right answer for Ops teams looking for more control, freedom from vendor lock-in, compliance, and cost effectiveness.
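To make the “no standardized semantics” point concrete, here is a minimal sketch of how read-your-writes behavior surfaces in one such system. It is illustrative only: it assumes a local Cassandra node reachable through the DataStax Python driver, and a hypothetical `demo_ks` keyspace with a `user_profile` table. The application has to opt into RW coherence through system-specific consistency knobs; no interface standard spells out the contract.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumptions: a local Cassandra node and a hypothetical keyspace/table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_ks")

# Write acknowledged by a single replica: fast, but weakly visible.
write = SimpleStatement(
    "UPDATE user_profile SET email = %s WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, ("new@example.com", "u42"))

# Even a QUORUM read may miss a write acknowledged at ONE (with replication
# factor 3, W=1 and R=2 gives R + W = N, not R + W > N). Read-your-writes
# only holds when the application picks consistency levels so that
# R + W > N, and that contract lives in per-system knobs, not in the API.
read = SimpleStatement(
    "SELECT email FROM user_profile WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("u42",)).one()
print(row.email if row else "write not visible yet")
```

Each NoSQL/NewSQL platform expresses the equivalent choice through its own, differently shaped knobs, which is precisely what Ops ends up having to understand system by system.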
The CAP aftermath has not gone unnoticed — there are products across both incumbents and startups aiming to address this space (I will resist listing names to avoid playing favorites). I refer to these efforts as syntactic sugar, in contrast to fundamental semantics — they fall into three broad categories:
- Platform-agnostic Consolidation Tools: The focus here is to provide a global view of data spread across multiple on-prem and cloud silos. These tools will increasingly mature from single pane-of-glass visibility to orchestration and provenance.
- Multi-head Storage Controllers: These are traditional controllers that aim to support multiple interfaces, such as block, file, object, and HDFS, as first-class entities. Oftentimes, the limitation of these approaches has been that they were built on POSIX constructs. Thus, while the interfaces are interoperable with open-source equivalents, the motivation to trade off consistency for performance and scale is lost.
- Interoperable Cloud Data Services: Most cloud-based services increasingly offer data export/import capabilities. The interoperability is typically limited to a single cloud provider rather than being industry-wide. While this helps, especially with data lifecycle management as data moves across different solutions, the boundary of the programming environment remains that of a single provider.
So, what can be done that is fundamentally different from existing initiatives? Well, as an industry, we have fallen into the trap of having a solution per use-case. While specialized solutions are good, I am arguing that the hypothesis of treating them as different systems needs to be revisited. There are two aspects to this argument. First, trade-offs are not water-tight boundaries. They bite only in a narrow range of conditions, chiefly during failures; for the majority of common operating conditions, the design is a balancing act across different dimensions. Eric Brewer’s article “CAP Twelve Years Later” clarifies this point quite eloquently. Second, existing system designs are starting to encompass a broader range of behaviors as they get applied to an increasing number of use-cases. As a case in point, the concurrency control in Cassandra today supports merged updates, atomic compare-and-swap, and batch transactions. From an application standpoint, different flavors of WW serialization are now available within a single system, each with a different performance cost, instead of the trade-offs being treated as water-tight at design time and implemented as separate systems, as the sketch below illustrates.
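The following sketch shows those flavors side by side within one system. It is a rough illustration, not a reference implementation: it assumes a local Cassandra node, the DataStax Python driver, and a hypothetical `accounts` table in a `demo_ks` keyspace. It contrasts a plain upsert (concurrent writes merged, last write wins), a lightweight transaction (compare-and-swap serialized through a Paxos round), and a logged batch (a group of mutations applied all-or-nothing).

```python
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

# Assumptions: a local Cassandra node and a hypothetical 'accounts' table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_ks")

# 1. Plain upsert: concurrent writers are merged per column, last write wins.
session.execute(
    "UPDATE accounts SET balance = %s WHERE id = %s",
    (100, "acct-1"),
)

# 2. Lightweight transaction: compare-and-swap guarded by a Paxos round,
#    which serializes conflicting writers at the cost of extra round trips.
result = session.execute(
    "UPDATE accounts SET balance = %s WHERE id = %s IF balance = %s",
    (90, "acct-1", 100),
)
# The returned row reports whether the conditional update was applied.

# 3. Logged batch: the grouped mutations are applied all-or-nothing,
#    again at higher coordination cost than a single upsert.
batch = BatchStatement()
batch.add(SimpleStatement("UPDATE accounts SET balance = %s WHERE id = %s"),
          (80, "acct-1"))
batch.add(SimpleStatement("UPDATE accounts SET balance = %s WHERE id = %s"),
          (20, "acct-2"))
session.execute(batch)
```

Whether the extra coordination cost of the compare-and-swap or the batch is acceptable becomes a per-operation choice the application makes inside one system, rather than a reason to stand up a separate system per trade-off.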
In summary, developers will continue to gravitate towards specialized data management solutions that provide agility for application delivery. Given the mushrooming growth of available specialized solutions, consuming APIs will inevitably replace the implement-your-own paradigm. From an Ops standpoint, instead of resisting the CAP aftermath, the industry needs to explore fundamental architectures where specialized interfaces can be realized without the silos, i.e., have the cake and eat it too.