The amount of data stored within an enterprise is growing at a phenomenal rate. In the late ’90s, a storage system with one petabyte of capacity was considered a great accomplishment: you got your name in news articles and were invited to give conference keynotes! Today, the storage industry talks in terms of exabytes and zettabytes. Unless you have been living under a rock for the last few years, this should come as no surprise. Growth rates of 50-60% CAGR, coupled with the growing need to harness data-driven insights for business decision-making (commonly referred to as ‘Big Data’), are redefining how enterprises manage the lifecycle of data.
While overall data growth has been exponential, not all data is active at the same time. In fact, the actual working set size (i.e., the data actively being used by applications) is growing in line with memory densities. Check out Stanford’s RAMCloud project, which is implementing an in-memory scale-out layer to handle the active working set (inspired by the model employed by Web 2.0 companies such as Facebook, Wikipedia, Twitter and YouTube). Holistically, within a large exabyte store, datasets vary in the probability that they will be accessed, as well as in the corresponding latency expectation. In industry parlance, this is illustrated by an analogy to data at different temperatures: hot, warm, cold, and archive (frozen). The essence is to handle data differently based on its temperature. This is not a new concept; in the good old days it was referred to as Hierarchical Storage Management (HSM). The difference is in the number of tiers: traditional HSM is aimed at moving data from hot to frozen, which is much more black-and-white. Today, tiering is more involved given the different shades of gray between hot, warm, cold, and frozen. Awareness is certainly catching up, with Hadoop recently announcing support for data tiering.
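To make the temperature analogy concrete, here is a minimal sketch of how a tiering policy might classify a dataset by how long it has been idle. The thresholds (7, 90, and 365 days) and the names `TIER_THRESHOLDS` and `classify_temperature` are assumptions made up for illustration; real systems typically also weigh access frequency, size, and latency expectations.

```python
from datetime import datetime, timedelta

# Hypothetical idle-time limits per tier; anything idle longer is "frozen".
TIER_THRESHOLDS = [
    ("hot", timedelta(days=7)),
    ("warm", timedelta(days=90)),
    ("cold", timedelta(days=365)),
]


def classify_temperature(last_access: datetime, now: datetime) -> str:
    """Return the tier name for a dataset based on its idle time."""
    idle = now - last_access
    for tier, limit in TIER_THRESHOLDS:
        if idle <= limit:
            return tier
    return "frozen"
```

A background job could run such a classifier periodically and migrate datasets whose tier has changed, which is roughly what distinguishes multi-tier placement from the two-state hot/frozen model of traditional HSM.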
So, do we really need a new interface to handle data that is not accessed frequently, and does not require low latency whenever it is accessed? Today, enterprise storage is built on block (SCSI, FC, iSCSI, FCoE) and file (local FS, NFS, CIFS/SMB) interfaces. While these interfaces can certainly be used, they provide richer functionality than bulk data requires. For instance, instead of organizing billions of bulk data objects in directories, can a flat namespace with collocated data and metadata be used to avoid the overheads of managing separate metadata structures? Also, for infrequent updates, do we truly need POSIX atomic updates and read-write mutual exclusion? Given the relaxed latency expectation, can we optimize for throughput instead of latency, and rely on a stateless implementation? These questions led to the emergence of a new interface known as the Object Storage interface: a REST-based flat namespace with a security model and with collocated data and metadata. The interface has become popular mainly in the context of accessing storage from the cloud (via the internet), but it is also used in OpenStack Swift, as well as in a growing number of storage arrays that are adding object as an additional interface alongside block and file. There are also ongoing efforts to standardize the interface via SNIA’s CDMI initiative. Overall, the barebones simplicity of the Object Storage interface, combined with its extensibility, has been a fundamental driver of rapid adoption.
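The following is a minimal in-memory sketch of the ideas in this paragraph: a flat key namespace, metadata stored alongside the data, and whole-object writes instead of POSIX-style partial updates. The class and method names (`FlatObjectStore`, `put`, `get`, `head`, `list_keys`) are hypothetical and only loosely mirror the PUT/GET/HEAD verbs of REST-based object APIs; it does not model the security layer or any specific product.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredObject:
    data: bytes
    metadata: dict   # user metadata lives next to the data, not in a separate tree
    etag: str        # content hash, useful for integrity checks


class FlatObjectStore:
    """Toy object store: one flat dict of keys, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes, metadata: dict = None) -> str:
        # A put replaces the whole object; there are no in-place partial writes.
        etag = hashlib.sha256(data).hexdigest()
        self._objects[key] = StoredObject(data, dict(metadata or {}), etag)
        return etag

    def get(self, key: str) -> bytes:
        return self._objects[key].data

    def head(self, key: str) -> dict:
        # Metadata lookup without returning the payload.
        obj = self._objects[key]
        return {"etag": obj.etag, "size": len(obj.data), **obj.metadata}

    def list_keys(self, prefix: str = "") -> list:
        # "Directories" are only a naming convention expressed as key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))
```

For example, `store.put("backups/2014/db.tar", payload, {"tier": "cold"})` stores data and metadata in one step, and `store.list_keys("backups/")` recovers a directory-like view without the store maintaining any directory structure. Because each request carries everything needed to serve it, this style of interface lends itself to stateless implementations.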
Object Storage is an overloaded term that is typically used in more than one context. Before continuing, I would like to clarify it. So far, we have referred to Object Storage as an interface that applications use to communicate with the storage system. There is another common usage of the term: scale-out systems such as Lustre, Ceph, and VSAN use objects, instead of blocks or files, as the unit of resource management. Addressing resources as objects allows for higher scalability, because the resource management APIs have richer semantics that enable decentralized execution of tasks such as replication, data integrity checks, and recovery. I will defer several other interesting aspects of object-based storage to a future post.
Back to the topic of Object Storage as an interface! While a majority of applications have traditionally been written for block and file, there are three emerging ways to consume the Object Storage interface within the enterprise:
- Gateway appliance model: Think of this as a translator, exposing the traditional block and file interfaces to applications while talking to object storage on the other end (typically in the cloud). Gateways 1.0 emerged a few years back, mainly as standalone appliances, with limited success. Gateways 2.0 are now emerging as an integrated capability within storage arrays. Gateways are also more than translators: given their visibility into the IO path, they have been extended to provide other services such as caching, global data synchronization, and consistent snapshots.
- Retooling of existing data services: Existing solutions such as those for backup, archiving, and Disaster Recovery are evolving to natively leverage Object Storage interfaces. Essentially, solutions that deal with cold and frozen data are good candidates to leverage Object Storage interfaces, especially in the cloud.
- Greenfield applications: These represent a broad range of new applications that are written to consume storage via the object interface. Instead of retrofitting data model semantics for update, synchronization, authentication and authorization, these applications are developed from the ground up with the assumptions of the Object Storage interface and the latencies of the cloud service in mind.
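As a rough illustration of the gateway model from the list above, the sketch below translates file-style paths into object keys and adds a read-through cache, one of the extra services gateways provide. Everything here is an assumption for illustration: `ObjectGateway`, its methods, and the use of a plain `dict` as a stand-in for the remote object store are hypothetical, and a real gateway would also handle block access, partial writes, eviction, and consistency.

```python
class ObjectGateway:
    """Toy gateway: file-style paths in front, object keys behind."""

    def __init__(self, backend: dict, bucket: str):
        self._backend = backend   # stand-in for a remote object store
        self._bucket = bucket
        self._cache = {}          # local copies of recently used objects
        self.backend_reads = 0    # counts round trips to the backend

    def _key(self, path: str) -> str:
        # Map "/etc/app.conf" to the flat key "<bucket>/etc/app.conf".
        return f"{self._bucket}/{path.lstrip('/')}"

    def write_file(self, path: str, data: bytes) -> None:
        key = self._key(path)
        self._backend[key] = data   # write through as a whole object
        self._cache[key] = data

    def read_file(self, path: str) -> bytes:
        key = self._key(path)
        if key not in self._cache:
            self.backend_reads += 1
            self._cache[key] = self._backend[key]
        return self._cache[key]
```

Reading the same path twice reaches the backend only once, which is the point of placing a cache in the IO path when the object store sits across a high-latency link.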
In summary, the data deluge is pushing Object Storage as a first-class interface within enterprises. Object Storage gained momentum at the bottom of the value chain with frozen and cold data, and has been moving up the chain towards warm data. While there is no prescriptive one-size-fits-all answer, it is clear that the $/GB economics (driven mainly by cloud economies of scale) will keep the Object Storage interface on the radar of a growing number of enterprises, especially those retaining a large corpus of data to harness business value via the fast-maturing Big Data stack. VMware continues to deliver a best-of-breed, cutting-edge platform for enterprises. We are also working on enabling seamless integration for Object Storage interfaces; check out our past collaboration announcement with Google on Object Storage and other services.