What is the first thing that comes to mind when you think of a storage system? Well, rotating disks! Storage has traditionally been consumed either in the form of block storage or as files. Irrespective of the consumption model, data gets persisted to and retrieved from physical disks or flash using protocols such as Small Computer System Interface (‘SCSI’) or NVM Express (‘NVMe’), which have been standardized over decades. Traditionally, the storage system has sat lowest in the food chain, with very little visibility into or understanding of the data contents. Instead of talking to the storage system in terms of “Retrieve record X,” we talk to it in terms of “Retrieve block address Y.” File systems provide an incremental layer of abstraction in the form of “Retrieve file Z created by Alice.” File-related metadata and extended attributes provide limited visibility into the data, and aid data search based on key-value metadata attributes.
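To make the contrast concrete, here is a minimal sketch of what “block address Y” versus “file Z” access looks like from an application’s point of view. The device and file paths are purely illustrative assumptions.

```python
import os

BLOCK_DEVICE = "/dev/sdb"              # assumption: a raw, unmounted block device
FILE_PATH = "/data/alice/report.txt"   # assumption: a file on a mounted file system

def read_block(device_path, block_address, block_size=4096):
    """Block interface: ask the device for 'block address Y'; the bytes are opaque."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        return os.pread(fd, block_size, block_address * block_size)
    finally:
        os.close(fd)

def read_file(path):
    """File interface: ask for 'file Z'; still opaque bytes, but named and owned."""
    with open(path, "rb") as f:
        return f.read()
```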
Database systems have existed since the 1960s. Most people remember databases as data organized in a relational model with ACID (Atomicity, Consistency, Isolation, Durability) read/write properties. A database, in some sense, can be considered a specialized storage system that can persist, retrieve, and analyze data records. Enterprise data was broadly categorized into structured and unstructured data – databases were geared towards structured data that could be normalized into a relational form, while a generic storage system catered to the majority of the data in the enterprise, which is typically unstructured. Analogous to block and file protocols, databases created the SQL language to allow applications to store, retrieve, and analyze data. In the late 1990s/early 2000s, databases differentiated themselves on query optimization, handling of complex transaction processing, and efficient caching, buffering, and mutual exclusion implementations. The actual persistence layer for data was mostly handled using standard volume managers for block storage or, less commonly, by running on top of file systems.
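Continuing the sketch above, the record interface lets the application ask for “record X” and leave the mapping onto physical pages to the engine. The table and values below are invented for illustration; only the standard-library sqlite3 module is used.

```python
import sqlite3

# Record interface: the database understands the shape of the data,
# so we can ask for "record X" rather than "block address Y".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, owner TEXT, body TEXT)")
conn.execute("INSERT INTO records VALUES (?, ?, ?)", (42, "alice", "quarterly numbers"))
conn.commit()

row = conn.execute("SELECT body FROM records WHERE id = ?", (42,)).fetchone()
print(row[0])   # the engine maps the logical record onto physical pages for us
```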
Fast forward to today. Data continues to grow at a phenomenal rate, with medium-sized organizations operating at petabyte scale! Over the years, two key shifts have become mainstream within enterprises (thanks to Web 2.0 companies leading the way in their journey to becoming “data marts”):
- Mainstream adoption of semi-structured data: a new category of data that could not be represented in the traditional relational model, yet had enough internal structure that it was not optimal to simply store, retrieve, and analyze it through a generic block/file interface (see the sketch after this list). This category represents a rapidly growing corpus of data in the form of documents (MongoDB), sparse multi-column (Cassandra), graph (Neo4j), data structures (Redis), etc.
- Relaxation of SQL semantics: this was led by the classic CAP theorem, which legitimized the idea of trade-offs in data processing at scale. The NoSQL revolution revisited the semantics (not necessarily the syntax, as is commonly misunderstood) to explore areas where applications could gain better performance and scaling in exchange for coarser transaction granularity, optimistic concurrency control, limited support for table joins, etc.
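As a concrete illustration of the first point above, here is a hypothetical document-model record: nested, sparse, and schema-flexible, so it normalizes awkwardly into fixed relational tables, yet a document store can keep and index it as a single record. The data is made up purely for illustration.

```python
import json

# A single "document": nested, sparse, and schema-flexible.
order = {
    "order_id": "A-1001",
    "customer": {"name": "Alice", "tier": "gold"},
    "items": [
        {"sku": "disk-4tb", "qty": 2},
        {"sku": "nvme-1tb", "qty": 1, "note": "rush"},   # field present on only some items
    ],
}

print(json.dumps(order, indent=2))
```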
Complementing these shifts in data processing, enterprise storage systems have also evolved from their traditional monolithic roots into a scale-out, policy-driven architecture commonly referred to as Software Defined Storage (‘SDS’). Scale-out architectures across both the storage and database domains inherit a common set of asynchronous distributed coordination problems, and they often leverage well-known techniques from distributed systems research.
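One such shared technique is consistent hashing for placing data across nodes, used in various forms by scale-out stores such as Cassandra. The toy ring below is only a sketch of the idea, not any particular product’s implementation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: each key maps to the next node clockwise."""

    def __init__(self, nodes, vnodes=64):
        # Multiple virtual points per node smooth out the key distribution.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("record-42"))   # adding or removing a node remaps only ~1/N of the keys
```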
The combined result of these shifts is that the boundaries between a modern NoSQL/NewSQL database and an enterprise scale-out storage system have blurred. The main focus of both systems is a scale-out architecture, with the database systems additionally specializing in interfaces for storing, retrieving, and analyzing data records defined in specific formats. A NoSQL system such as Cassandra, MongoDB, or Neo4j invests significant effort in developing a custom scale-out persistence layer. HBase, for instance, clearly delineates the scale-out storage layer and leverages HDFS (Hadoop Distributed File System); this is not necessarily optimal compared to a custom implementation, but I cite it as an example of separation of concerns with respect to the data interface and the scale-out persistence functionality. The differentiated functionality of the data interface layer (namely indexing, query optimization, transaction support, etc.) has become less of a differentiator at scale.
In summary, the distinction between a data management system and a storage system is blurring. Currently, there is an interplay of two forces within the industry:
a) Databases continuing to develop specialized data interfaces. While the majority of the data corpus within the enterprise is unstructured, the industry will increasingly leverage systems that offer specialized data record formats as and when they become viable.
b) Storage systems continuing to evolve into content analysis platforms. With the co-location of compute and storage in modern scale-out storage architectures, it has become feasible to push compute to where the data is located. In other words, data analysis micro-services (packaged as containers) can now extend the storage system’s role from a mere persistence layer to a customizable data management platform.
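To make this concrete, here is a hypothetical sketch of such a micro-service: a small analysis function that could be packaged in a container and scheduled on the storage node, scanning only locally held data and returning a small aggregate instead of shipping the raw bytes elsewhere. The data directory path is an assumption for illustration.

```python
import json
import os
import socket

DATA_DIR = "/mnt/local-shard"   # assumption: the slice of data stored on this node

def local_word_count(data_dir=DATA_DIR):
    """Scan only the objects stored locally and return a small aggregate,
    so the bulk data never leaves the storage node."""
    total = 0
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", errors="ignore") as f:
            total += sum(len(line.split()) for line in f)
    return json.dumps({"node": socket.gethostname(), "words": total})

if __name__ == "__main__":
    print(local_word_count())
```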