As enterprises grapple with exponential data growth, they are looking for new models to efficiently store, analyze, and archive data. Software-defined storage (SDS) and the hybrid cloud model are shifting wasteful “just-in-case” provisioning toward seamless “just-in-time” provisioning and storage management for optimal total cost of ownership (TCO). The hybrid cloud model lets enterprises use their existing data-center deployments and the cloud in a seamless fashion, leveraging well-established workflows and management tools. The focus of this blog post is the emerging intersection points between traditional data lifecycle management and hybrid cloud storage.
Let’s level-set on the lowest common denominator for this discussion: storage hardware trends and their implications for data management. A decade ago, storage tiers were mainly rotating disks of different speeds and sequential media such as tape. The industry’s research focus was mainly on closing the latency gap between main memory, with nanosecond service times, and disks operating at milliseconds. Today, storage tiers span the entire latency spectrum, from Non-Volatile Memory (NVM) with expected service times of hundreds of nanoseconds, to NAND flash operating on the order of microseconds, to innovations in high-density disk technologies such as shingled disks with higher access times. An interesting observation is the emergence of distinct $/IOPS and $/GB tiers: flash-based storage excels in $/IOPS, while disk-based technologies are optimal in $/GB. So, depending on the IOPS/GB of the data and the application’s latency constraints, the data needs to be placed on the appropriate tier to best utilize storage resources.
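To make that placement logic concrete, here is a minimal sketch in Python; the tier names, latency figures, $/GB numbers, and the IOPS/GB threshold are illustrative assumptions, not vendor specifications.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    latency_us: float        # typical access latency in microseconds (assumed)
    usd_per_gb_month: float  # illustrative $/GB/month (assumed)

# Ordered from fastest/most expensive to slowest/cheapest.
TIERS = [
    Tier("nvm",   0.5,  1.00),  # non-volatile memory
    Tier("flash", 100,  0.25),  # NAND flash
    Tier("disk",  8000, 0.04),  # high-density / shingled disk
]

def place(iops_per_gb: float, max_latency_us: float) -> Tier:
    """Pick a tier: hot data (high IOPS/GB) gets the fastest tier that
    meets the latency constraint; cold data gets the cheapest such tier."""
    candidates = [t for t in TIERS if t.latency_us <= max_latency_us]
    if not candidates:
        raise ValueError("no tier meets the latency constraint")
    return candidates[0] if iops_per_gb > 1.0 else candidates[-1]

print(place(iops_per_gb=5.0, max_latency_us=500).name)     # -> nvm
print(place(iops_per_gb=0.01, max_latency_us=50000).name)  # -> disk
```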
Data lifecycle management deals with data placement and management from the initial creation of data to its eventual deletion. The key point is that the quality-of-service (QoS) requirements for data are time-varying: the latency, throughput, durability, and recovery point objective (RPO)/recovery time objective (RTO) requirements at initial creation may differ from those for data that has aged and is being retained mainly for batch processing, archiving, or compliance. By combining data lifecycle management with the differentiated properties and pricing of storage tiers across on-premises and cloud resources, enterprises can exploit storage hardware innovations to derive the lowest TCO for their specific deployment and usage models.
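As a minimal sketch of how time-varying QoS can be encoded, consider an age-driven policy table; the tier names, age thresholds, and RPO/RTO values below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# (min_age, target_tier, rpo, rto) -- requirements relax as data ages.
LIFECYCLE_RULES = [
    (timedelta(days=0),   "on_prem_flash", timedelta(minutes=5), timedelta(minutes=15)),
    (timedelta(days=30),  "on_prem_disk",  timedelta(hours=1),   timedelta(hours=4)),
    (timedelta(days=365), "cloud_archive", timedelta(hours=24),  timedelta(days=2)),
]

def current_rule(created_at: datetime):
    """Return the last (most relaxed) rule whose age threshold has been passed."""
    age = datetime.now(timezone.utc) - created_at
    matched = LIFECYCLE_RULES[0]
    for rule in LIFECYCLE_RULES:
        if age >= rule[0]:
            matched = rule
    return matched

created = datetime.now(timezone.utc) - timedelta(days=90)
min_age, tier, rpo, rto = current_rule(created)
print(tier, rpo, rto)  # -> on_prem_disk 1:00:00 4:00:00
```

A placement engine would periodically re-evaluate each data set against such a table and migrate it whenever its current tier no longer matches the rule.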
Enterprises are increasingly adopting the hybrid cloud model, which lets them consume cloud resources as a natural extension of their on-premises data-center deployments. Specifically in the context of storage, the cloud offers a variety of protocols (block, object, etc.) and tiers with varying $/GB/month pricing. It also offers a growing number of scale-out services for structured and semi-structured data, blurring the lines between traditional storage and data management solutions. It is important to realize that cloud storage is more than just storage resources: virtually unlimited compute can be colocated with that storage in the cloud. Also, given the economies of scale and competitive cloud wars, enterprise customers benefit from ever-decreasing prices of cloud resources, also referred to as “the race to zero.”
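To see how that differentiated $/GB/month pricing drives placement decisions, here is a back-of-envelope cost sketch; all prices are hypothetical placeholders, not quotes from any provider.

```python
DATASET_GB = 100_000  # a 100 TB data set (assumed)

# Assumed monthly prices per GB; real prices vary by provider and region.
USD_PER_GB_MONTH = {
    "on_prem_primary": 0.10,   # amortized hardware + operations (assumed)
    "cloud_standard":  0.023,  # standard object storage (assumed)
    "cloud_archive":   0.004,  # cold/archive tier (assumed)
}

for tier, price in USD_PER_GB_MONTH.items():
    print(f"{tier}: ${DATASET_GB * price:,.0f}/month")
# on_prem_primary: $10,000/month
# cloud_standard: $2,300/month
# cloud_archive: $400/month
```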
Customers commonly think of hybrid cloud storage as a tier used mainly for infrequently accessed data copies or archive data. While those have been the founding use cases, there is a growing number of intersection points between data lifecycle management and hybrid cloud storage. The following list of recipes is not meant to be comprehensive, but rather to get you thinking about the possibilities!
- Data copy offloading recipe: Keep the primary data on-premises, close to the application, while copies of the data, such as point-in-time (PiT) snapshots and replicas, reside in the cloud.
- Hot-cold data-tiering recipe: In this model, primary data that is relatively cold gets moved to the cloud. There is a tradeoff between the aggressiveness of tiering and the tolerable access latencies. As cloud storage latencies continue to improve, tiering can become increasingly aggressive, in some cases including even warm data; a sketch of such a tiering rule follows this list.
- Ingestion funnel recipe: For applications consuming a large number of data sources in real time, the cloud tier can provide the initial ingestion layer for data crunching and insight extraction; only a small subset of business-critical data gets copied back on-premises for deeper analysis.
- Cloud bursting recipe: In periods of high load, the incoming load on on-premises resources can be burst to VMs in the cloud. This model works when the data requests are either for static data (such as shopping catalogs) or eventually consistent data (such as inventory data), leveraging on-premises data that is replicated into the cloud.
- Batch analytics recipe: This is an obvious one: running MapReduce and other big data analytics on the data copies in the cloud. Variants of this recipe include verifying the integrity of backups/replicas and building indexes for searching application-specific metadata.
- Fossilized data recipe: To meet compliance requirements for specific industry verticals, the primary data stays on-premises, while versions of the original data and audit trails can be offloaded to cloud storage.
- Global filesystem recipe: The idea is to have storage gateway appliances across multiple on-premises data centers and branch offices provide a single global filesystem namespace. These appliances in turn use cloud storage as a centralized data and metadata repository. Examples of such gateway products include Panzura and Nasuni.
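As an example of the hot-cold tiering recipe above, here is a minimal sketch that sets a lifecycle rule on an AWS S3 bucket via boto3, transitioning cold objects to an archive storage class; the bucket name, prefix, and 30-day threshold are assumptions for illustration, and other clouds expose similar policies.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the (hypothetical) "cold/" prefix to an
# archive class after 30 days; tune the threshold to your latency tolerance.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-hybrid-tier",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-data",
                "Filter": {"Prefix": "cold/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```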
In summary, disruptions in storage technologies are not just redefining emerging storage architectures but also influencing traditional workflows and IT processes within the data center. The hybrid cloud model comes naturally to storage use cases, motivated by exponential data growth and limited IT budgets. While hybrid cloud storage emerged mainly for data archive and availability use cases, it is increasingly playing an important role in traditional data lifecycle management. The hybrid cloud provides not just another storage tier for on-premises applications, but a combination of on-demand compute capabilities, differentiated tiers for cost, performance, and durability, and a globally accessible (object) storage namespace.