Cloud-scale Blob is a new category of storage designed to meet a deliberately relaxed set of requirements, selected to match the needs of the majority of bytes of data for new media types. Gone are the complex constraints of strict consistency, the ability to randomly update an object, and any sort of concurrency or locking semantics. In fact, the majority of Blob stores today support just create/append/delete, and most have no namespace.
| Capability | POSIX File System | Cloud Blob |
|---|---|---|
| Namespace | Yes | No |
| Create | Yes | Yes |
| Read | Yes | Yes |
| Update | Yes | Most offerings NO |
| Append | Yes | Yes |
| Delete | Yes | Yes |
| Locking | Yes | No |
| Read-after-write guarantee | Yes | No |
| Access control lists | Yes | Yes |
| Authentication | Yes | Yes |
| Meta-data per object | * | Yes |
| Search/Query | No | Yes |
Most applications access objects by a unique ID assigned to the object at create time, eliminating the need for a namespace. Cloud blob storage also brings some new capabilities for finding objects, most notably the ability to store metadata along with each object and a search interface over that metadata.
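To make these semantics concrete, here is a minimal in-memory sketch. The class name `BlobStore` and its method names are made up for illustration and do not correspond to any of the products mentioned here; it only shows the shape of the interface: the store assigns the ID at create time, there is no namespace or in-place update, and objects can be found again through their metadata.

```python
import uuid


class BlobStore:
    """Toy in-memory store exposing only create/append/read/delete/search."""

    def __init__(self):
        self._data = {}      # object ID -> bytearray holding the blob
        self._metadata = {}  # object ID -> dict of metadata

    def create(self, payload, **metadata):
        # The store, not the caller, assigns the ID: no paths, no namespace.
        object_id = uuid.uuid4().hex
        self._data[object_id] = bytearray(payload)
        self._metadata[object_id] = dict(metadata)
        return object_id

    def append(self, object_id, payload):
        # Appending is allowed; overwriting bytes in the middle is not offered.
        self._data[object_id].extend(payload)

    def read(self, object_id, offset=0, length=None):
        blob = self._data[object_id]
        end = len(blob) if length is None else offset + length
        return bytes(blob[offset:end])

    def delete(self, object_id):
        del self._data[object_id]
        del self._metadata[object_id]

    def search(self, **criteria):
        # Return the IDs of objects whose metadata matches every criterion.
        return [
            object_id
            for object_id, meta in self._metadata.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]


store = BlobStore()
photo_id = store.create(b"first half, ", owner="alice", kind="photo")
store.append(photo_id, b"second half")
matches = store.search(owner="alice")  # a list containing only photo_id
```

A real service would persist the data and index the metadata, but the calling pattern stays the same: keep the returned ID, or search for it later.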
Examples of product- or project-based implementations of cloud blob storage include EMC’s Atmos, Amazon S3, Facebook’s Haystack, Rackspace CloudFiles, Azure Blob, Nirvanix, Synaptic, Walrus, and Park Place.
The case for cloud-scale Blob
Over the past few years, I’ve spent some time with Facebook, eBay, MySpace, QQ.com, and Smugmug.com learning about their requirements and use of distributed blob storage. In all these cases, the number of bytes stored as blob-style objects is several orders of magnitude higher than the entity data stored in the existing database systems, and is growing at a much faster rate. The scale of bytes stored makes cost a primary concern, but there are other capabilities that are unique to Cloud Blob storage:
- It’s easy to scale to very large size: exabyte scale is easily attainable
- Access is via HTTP/REST, so objects can be easily reached from either side of the firewall
- Access can be done from any application-level language, with no need for complex kernel-based operating system clients
- Implementations have infinite snapshot and recovery capabilities
- Objects are easily replicated across multiple physical locations
- Access can be federated across multiple locations: objects can be created and accessed from multiple locations
- CDN-like capabilities: for example, objects can be exposed through simple HTTP methods, allowing applications to embed read-only copies of objects directly in their web content
- Policy can be easily attached and acted on, including encryption, compression, transformation, etc.
| Capability | POSIX File System | Cloud Blob |
|---|---|---|
| Scale | Typically One Server or NAS | 100PB+ |
| Firewall Friendly | No | Yes |
| Replication | Rare/Complex | Yes |
| CDN Features | No | Yes |
| Typical Read/Write Latency | 0.5 – 10ms | 100ms local/500ms public |
| Built-in Snapshot/Clone | Some (ZFS) | Yes |
| Geographic Replication | Seldom, Very Hard | Yes |
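As a rough illustration of the HTTP access and embedding points, the sketch below builds requests using nothing but Python's standard library. The endpoint `https://blobs.example.com/objects` and the URL layout are assumptions made up for this example; each real service defines its own URLs, headers and authentication.

```python
import urllib.request

BASE_URL = "https://blobs.example.com/objects"  # hypothetical endpoint


def build_create(payload, content_type="application/octet-stream"):
    """Build (but do not send) a request that creates a new object."""
    return urllib.request.Request(
        BASE_URL,
        data=payload,
        method="POST",
        headers={"Content-Type": content_type},
    )


def build_read(object_id):
    """Build (but do not send) a request that reads an object by its ID."""
    return urllib.request.Request(f"{BASE_URL}/{object_id}", method="GET")


def build_delete(object_id):
    """Build (but do not send) a request that deletes an object by its ID."""
    return urllib.request.Request(f"{BASE_URL}/{object_id}", method="DELETE")


def embed_tag(object_id):
    """Return an HTML tag that lets a browser fetch the object directly."""
    return f'<img src="{BASE_URL}/{object_id}">'
```

Passing any of these requests to `urllib.request.urlopen` would send it. No kernel client or mounted file system is involved, which is what makes this style of access easy to use from any language and across a firewall.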
In each of these cases, the internal project teams started by building their own simple cloud-scale blob systems; however, most have now migrated to off-the-shelf vendor solutions.
An example: Facebook
Facebook published a great summary of their architecture change away from traditional NFS storage. They note that they need only a few features of POSIX, and could replace their NetApp NAS with a simpler, lower-cost scale-out solution for the majority of their data (photos):
“Only the top three POSIX requirements matter to a file system such as Facebook. Its servers care where the file is located and its total length but have little concern for file system owners, access rights, timestamps, or the possibility of linked references. The additional overhead of POSIX-compliant metadata storage and lookup on NetApp Filers led to 3 disk I/O operations for each photo read. Facebook simply needs a fast blob store but was stuck inside a file system.”
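The idea in the quote, that a photo read needs only a location and a length, can be sketched as one large append-only file plus an in-memory index. This is a simplified illustration under my own assumptions, not Facebook's actual implementation; the class and method names are invented.

```python
import os


class AppendOnlyBlobFile:
    """Keep many blobs in one file and an (offset, length) index in memory."""

    def __init__(self, path):
        self._path = path
        self._index = {}          # blob ID -> (offset, length)
        open(path, "ab").close()  # create the file if it does not exist yet

    def put(self, blob_id, payload):
        with open(self._path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # new blobs always go at the end
            f.write(payload)
        self._index[blob_id] = (offset, len(payload))

    def get(self, blob_id):
        # The index lives in memory, so the only disk access is the read
        # of the blob itself: no directory or inode lookups per photo.
        offset, length = self._index[blob_id]
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Owners, access rights, timestamps and links are simply absent from this layout. A production system would also need to persist the index and reclaim space from deleted blobs, which this sketch ignores.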
Cloud-scale economics
One of the companies I visited had 1PB of blob storage and needed to double that capacity every 12-18 months. On economics alone, putting this data on the same tier-1 storage used for their databases would have cost about $5 million at today’s storage rates. Instead, they built their own distributed storage system based on the very simple semantics required: create/store a file, and read a file starting at an offset (delete wasn’t even implemented, as it wasn’t deemed necessary). Their implementation used a distributed set of small servers, each with 24 disks; with today’s hardware that would be just 20 commodity servers, costing about $200k.
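The figures above can be turned into per-gigabyte numbers with a little arithmetic. The sketch assumes decimal units (1 PB = 1,000,000 GB) and raw capacity with no replication overhead, neither of which is stated in the original figures.

```python
PB_IN_GB = 1_000_000  # assumption: decimal units
PB_IN_TB = 1_000

capacity_pb = 1
tier1_cost = 5_000_000    # dollars, tier-1 storage
commodity_cost = 200_000  # dollars, 20 commodity servers
servers = 20
disks_per_server = 24

tier1_per_gb = tier1_cost / (capacity_pb * PB_IN_GB)
commodity_per_gb = commodity_cost / (capacity_pb * PB_IN_GB)
cost_ratio = tier1_cost / commodity_cost
total_disks = servers * disks_per_server
tb_per_disk = capacity_pb * PB_IN_TB / total_disks  # raw, no replicas

print(f"tier-1:    ${tier1_per_gb:.2f} per GB")      # $5.00 per GB
print(f"commodity: ${commodity_per_gb:.2f} per GB")  # $0.20 per GB
print(f"ratio:     {cost_ratio:.0f}x")               # 25x
print(f"disks:     {total_disks}")                   # 480
print(f"per disk:  {tb_per_disk:.2f} TB")            # 2.08 TB
```

Any replication or parity would increase the raw capacity needed, so the 2.08 TB per disk is a lower bound for this configuration rather than a sizing recommendation.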
Cost model for cloud storage
There are different models for the cost of Blob storage. Traditional storage costs are typically measured in dollars per gigabyte, based on the purchase price of the storage system. To be more complete, this should also factor in the operational expense of location, power, cooling, people management and backup. The typical cost of NAS (which could be used to store Blob-like objects) is in the range of $3-$10 per gigabyte of storage. This cost typically triples by the time the storage is fully operated and backed up, putting it in the range of $9-$30 per gigabyte.
The cost of storage in a cloud Blob service is a combination of several factors:
-
The cost to store each byte per month
-
The cost to upload the data to the service
-
The cost to retrieve the data from the service
Cloud storage is typically priced in dollars per gigabyte per month, fully managed. The services available today range from 5c to 25c per GB per month. To compare on the same scale, the NAS cost has to be expressed in the same units: our fully operated NAS cost of $9-$30 per GB, spread over 36 months, comes to 25-83c per GB per month.
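Both sides of this comparison fit in two small functions. This is only a sketch: the function names are mine, and the upload and retrieval rates in the example are placeholders, since no figures for them are given here.

```python
def nas_monthly_cost_per_gb(purchase_per_gb, operating_multiplier=3, months=36):
    """Fully operated NAS cost per GB, spread over its service life."""
    return purchase_per_gb * operating_multiplier / months


def cloud_monthly_cost(stored_gb, uploaded_gb, retrieved_gb,
                       store_rate, upload_rate, retrieve_rate):
    """Sum the three cost factors of a cloud blob service for one month."""
    return (stored_gb * store_rate
            + uploaded_gb * upload_rate
            + retrieved_gb * retrieve_rate)


# NAS at $3 and $10 per GB purchase price, tripled, over 36 months.
print(f"{nas_monthly_cost_per_gb(3):.2f}")   # 0.25
print(f"{nas_monthly_cost_per_gb(10):.2f}")  # 0.83

# Cloud: 1,000 GB stored at 10c/GB; the transfer rates are placeholders.
example_bill = cloud_monthly_cost(
    stored_gb=1000, uploaded_gb=100, retrieved_gb=200,
    store_rate=0.10, upload_rate=0.05, retrieve_rate=0.10,
)
print(f"{example_bill:.2f}")  # 125.00
```

The per-GB storage rate alone understates the cloud bill for data that is read or written often, which is why the transfer terms are kept separate.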
We often associate cloud Blob with public cloud blob services. However, similar value propositions are often available with a private or public blob cloud. For a cloud Blob storage service, if your compute is close to the data or the data is seldom accessed, then the public cloud blob is the easiest and most cost-effective solution. If your compute is remote or there are other constraints like privacy, then it may be more economical to build a private cloud blob system.
Options for private Blob include EMC Atmos, OpenStack Swift and Ceph. Lately I’m also seeing Riak used for small to medium Blob storage.
Cost to build a cloud-storage system
The economics of cloud storage are affected by relaxed semantics and performance requirements, and are scaled largely by the size of the deployment.
The semantics of Blob make it possible to use lower-cost storage hardware. Blob doesn’t offer the high-performance read/write/update that we might expect from a NAS or SAN device, since most Blob services don’t even offer an update capability. This means that the performance requirement is mostly bandwidth and space driven, which eliminates the need for expensive high-RPM enterprise SAS or Fibre Channel disks. In addition to the relaxed performance requirements, Blob systems are able to cope easily with individual component or complete node failures by using RAID-across-the-datacenter approaches. Most Blob implementations are in fact able to use PC disks, giving access to the high-volume, low-cost PC disk market, where today’s costs are below $100/TB.
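The way a Blob system rides through node failures can be sketched with plain replication. Note the substitution: this uses whole-object copies placed by a hash ranking, not the parity-style RAID schemes a real system might use, and every name in it is invented for illustration.

```python
import hashlib


class ReplicatedStore:
    """Write each blob to several nodes; read from any node still alive."""

    def __init__(self, node_names, replicas=3):
        self._nodes = {name: {} for name in node_names}  # node -> {id: bytes}
        self._down = set()
        self._replicas = replicas

    def _placement(self, blob_id):
        # Rank nodes by a hash of (node, blob ID) and take the first few,
        # so every client computes the same placement without a directory.
        ranked = sorted(
            self._nodes,
            key=lambda n: hashlib.sha256(f"{n}:{blob_id}".encode()).hexdigest(),
        )
        return ranked[: self._replicas]

    def put(self, blob_id, payload):
        for node in self._placement(blob_id):
            if node not in self._down:
                self._nodes[node][blob_id] = payload

    def fail(self, node):
        """Simulate the loss of a complete node."""
        self._down.add(node)

    def get(self, blob_id):
        for node in self._placement(blob_id):
            if node not in self._down and blob_id in self._nodes[node]:
                return self._nodes[node][blob_id]
        raise IOError(f"all replicas of {blob_id} are unavailable")
```

With three replicas, a blob stays readable after any two of its nodes fail. Because the redundancy sits above the individual disks and servers, cheap components can fail without taking data offline.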
Figure: Cost per GB Deployed vs. Scale of Deployment in Petabytes
In addition to being able to use low-cost components, the cost to operate a cloud storage system at scale is driven by the size of the configuration. Typical observations today are that the cost per gigabyte to purchase and operate the system falls as the number of bytes stored increases. In other words, the cost of building and operating your private blob cloud varies with the scale of your cloud Blob system.
Summary
The main choice going forward is figuring out the best option among private blob, public blob, or a hybrid of both. The best option will depend on several key criteria, including your data latency needs, the amount of bandwidth needed, and the scale of your data.