
Is Cloud Storage going to disrupt Traditional Storage? – Part 2: What is Blob storage?

Cloud-scale Blob is a new category of storage designed to meet a deliberately relaxed set of requirements, chosen to match the needs of the bulk of the data in new media types. Gone are the complex constraints of strict consistency, the ability to randomly update an object, and any sort of concurrency or locking semantics. In fact, the majority of Blob stores today support just create, read, append and delete, and most have no namespace.

 

 

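| Capability                 | POSIX File System | Cloud Blob        |
|----------------------------|-------------------|-------------------|
| Namespace                  | Yes               | No                |
| Create                     | Yes               | Yes               |
| Read                       | Yes               | Yes               |
| Update                     | Yes               | Most offerings no |
| Append                     | Yes               | Yes               |
| Delete                     | Yes               | Yes               |
| Locking                    | Yes               | No                |
| Read-after-write guarantee | Yes               | No                |
| Access control lists       | Yes               | Yes               |
| Authentication             | Yes               | Yes               |
| Metadata per object        | *                 | Yes               |
| Search/Query               | No                | Yes               |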

 

Most applications access objects by a unique ID assigned to the object at create time, eliminating the need for a namespace. Cloud blob storage also brings some new capabilities for finding objects, most notably the ability to store metadata along with each object and a search interface over that metadata.
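To make these semantics concrete, here is a minimal in-memory sketch of the operations in the table above: create, read, append, delete, plus a metadata search. It is only an illustration. The class name `ToyBlobStore` and its method names are hypothetical and do not correspond to any particular product's API.

```python
import uuid


class ToyBlobStore:
    """In-memory sketch of blob semantics: objects are addressed by ID only."""

    def __init__(self):
        self._data = {}      # blob ID -> bytearray of content
        self._metadata = {}  # blob ID -> dict of metadata

    def create(self, content=b"", **metadata):
        # The store assigns the ID at create time; there is no path or namespace.
        blob_id = uuid.uuid4().hex
        self._data[blob_id] = bytearray(content)
        self._metadata[blob_id] = dict(metadata)
        return blob_id

    def append(self, blob_id, content):
        # Append is the only way to change an object; no in-place update exists.
        self._data[blob_id].extend(content)

    def read(self, blob_id, offset=0, length=None):
        data = self._data[blob_id]
        end = len(data) if length is None else offset + length
        return bytes(data[offset:end])

    def delete(self, blob_id):
        del self._data[blob_id]
        del self._metadata[blob_id]

    def search(self, **criteria):
        # Return IDs of blobs whose metadata matches every key/value given.
        return [
            blob_id
            for blob_id, meta in self._metadata.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        ]


store = ToyBlobStore()
photo_id = store.create(b"JPEG...", owner="alice", kind="photo")
store.append(photo_id, b"more bytes")
print(store.read(photo_id))
print(store.search(owner="alice") == [photo_id])
```

Note that the application keeps the returned ID (typically in its own database) rather than looking the object up by path, and that there is deliberately no update or lock method.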

 

Examples of product or project based implementations of cloud-blob storage are EMC’s Atmos, Amazon S3, Facebook’s Haystack, Rackspace CloudFiles, Azure Blob, Nirvanix, Synaptic, Walrus, and Park Place.

 

 

The case for cloud-scale Blob

 

Over the past few years, I’ve spent some time with Facebook, eBay, MySpace, QQ.com, and Smugmug.com learning about their requirements and use of distributed blob storage. In all these cases, the observation is that the number of bytes stored for blob style objects is several orders of magnitude higher than the entity data stored in the existing database systems, and growing at a much faster rate. The scale of bytes stored makes the cost economics a primary concern, but there are other capabilities that are unique to Cloud Blob storage:

 

  • It’s easy to scale to very large size: Exabyte scale is easily attainable

  • Access is via HTTP/REST, so objects can be easily accessed from either side of the firewall

  • Access can be done via any application-level language, no need for complex kernel-based operating system clients

  • Implementations have infinite snapshot and recovery capabilities

  • Objects are easily replicated across multiple physical locations

  • Access can be federated across multiple locations – objects can be created and accessed from multiple locations

  • CDN-like capabilities: for example, objects can be exposed through simple HTTP methods, allowing applications to embed read-only copies of objects directly in their web content.

  • Policy can be easily attached and acted on: including encryption, compression, transformation etc.
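As a rough illustration of the HTTP access point in the list above, the sketch below builds a PUT and a GET using only the Python standard library, with no special client or kernel driver. The endpoint `blobs.example.com`, the URL layout and the object ID are all hypothetical; real services each define their own URLs, headers and authentication. The requests are built but not sent.

```python
import urllib.request

# Hypothetical endpoint and object ID, for illustration only.
BASE_URL = "https://blobs.example.com/v1/objects"
OBJECT_ID = "4f2a9c"


def build_put(object_id, payload, content_type="application/octet-stream"):
    """Build (but do not send) an HTTP PUT that would store an object."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{object_id}",
        data=payload,
        headers={"Content-Type": content_type},
        method="PUT",
    )


def build_get(object_id):
    """Build (but do not send) an HTTP GET that would fetch an object."""
    return urllib.request.Request(url=f"{BASE_URL}/{object_id}", method="GET")


put_req = build_put(OBJECT_ID, b"photo bytes", "image/jpeg")
get_req = build_get(OBJECT_ID)
print(put_req.get_method(), put_req.full_url)
print(get_req.get_method(), get_req.full_url)
# Against a real service, urllib.request.urlopen(put_req) would send the request.
```

Because the read side is a plain GET on a URL, the same address can be embedded directly in a web page, which is where the CDN-like behaviour comes from.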

 

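| Capability                 | POSIX File System           | Cloud Blob                |
|----------------------------|-----------------------------|---------------------------|
| Scale                      | Typically one server or NAS | 100PB+                    |
| Firewall Friendly          | No                          | Yes                       |
| Replication                | Rare/Complex                | Yes                       |
| CDN Features               | No                          | Yes                       |
| Typical Read/Write Latency | 0.5 – 10ms                  | 100ms local, 500ms public |
| Built-in Snapshot/Clone    | Some (ZFS)                  | Yes                       |
| Geographic Replication     | Seldom, Very Hard           | Yes                       |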

 

In each of these cases the internal project teams started by building their own simple cloud-scale blob systems. However, most have now migrated to off-the-shelf vendor solutions.

 

An example: Facebook

 

Facebook published a great summary of their architecture change away from traditional NFS storage. They note that they need only a few features of POSIX, and could replace their NetApp NAS with a simpler, lower cost scale-out solution for the majority of their data (Photos):

 

 

“Only the top three POSIX requirements matter to a file system such as Facebook. Its servers care where the file is located and its total length but have little concern for file system owners, access rights, timestamps, or the possibility of linked references. The additional overhead of POSIX-compliant metadata storage and lookup on NetApp Filers led to 3 disk I/O operations for each photo read. Facebook simply needs a fast blob store but was stuck inside a file system.”

 

 

Cloud-scale economics

 

One of the companies I visited stated that they had 1PB of blob storage and needed to double the available space every 12-18 months. On economics alone, putting this data on the same tier-1 storage used for their databases would have cost about $5 million at today’s storage rates. Instead, they built their own distributed storage system based on the very simple semantics required: namely create/store a file, and read a file starting at an offset (note that delete wasn’t even implemented, as it wasn’t deemed necessary). Their implementation used a distributed set of small servers, each with 24 disks; at today’s disk capacities that would be just 20 commodity servers, costing about $200k.
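The arithmetic behind that comparison is easy to check. The sketch below uses only the figures quoted above ($5 million, $200k, 1PB, 20 servers with 24 disks each). It assumes decimal units (1 PB = 1,000,000 GB) and ignores any replication overhead, which is an assumption of this sketch and not something the company stated.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB

capacity_gb = 1 * GB_PER_PB
tier1_cost = 5_000_000    # dollars, tier-1 storage figure from the text
commodity_cost = 200_000  # dollars, commodity server figure from the text
servers, disks_per_server = 20, 24

tier1_per_gb = tier1_cost / capacity_gb
commodity_per_gb = commodity_cost / capacity_gb
total_disks = servers * disks_per_server

# Raw capacity each disk must supply, ignoring replication (assumption).
tb_per_disk = (capacity_gb / 1000) / total_disks

print(f"tier-1:    ${tier1_per_gb:.2f} per GB")
print(f"commodity: ${commodity_per_gb:.2f} per GB")
print(f"ratio:     {tier1_cost / commodity_cost:.0f}x")
print(f"disks:     {total_disks} at about {tb_per_disk:.2f} TB each")
```

In other words, the home-built system comes in at roughly 20c per GB against $5 per GB for tier-1 storage, a 25x difference in purchase cost.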

 

Cost model for cloud storage

 

There are different models for the cost of Blob storage. Traditional storage costs are typically measured in dollars per gigabyte, based on the purchase price of the storage system. To be more complete, this should also factor in the operational expense of location, power, cooling, people management and backup. The typical cost of NAS (which could be used to store Blob-like objects) is in the range of $3-$10 per gigabyte of storage. This cost is typically tripled by the time that storage is fully operated and backed up, putting it in the range of $9-$30 per gigabyte over the life of the system.

The cost of storage in a cloud Blob service is a combination of several factors:

 

  • The cost to store each byte per month

  • The cost to upload the data to the service

  • The cost to retrieve the data from the service

 

Cloud storage is typically priced in dollars per gigabyte per month, fully managed. The services available today range from 5c to 25c per GB per month. To compare on the same basis, the fully operated NAS cost needs to be expressed per month as well: spread over 36 months, $9-$30 per GB works out to roughly 25-83c per GB per month.
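The normalization can be sketched in a few lines. The NAS figures ($3-$10 per GB, tripled, over 36 months) and the three cloud cost components come from the text. The function names are hypothetical, and the upload and retrieval rates in the example bill are placeholders chosen for illustration, not quoted prices. With these inputs the upper NAS bound works out to about 83c per GB per month.

```python
def nas_monthly_cost_per_gb(purchase_per_gb, operating_multiplier=3,
                            lifetime_months=36):
    """Fully operated NAS cost, expressed in dollars per GB per month."""
    return purchase_per_gb * operating_multiplier / lifetime_months


def cloud_monthly_bill(stored_gb, uploaded_gb, retrieved_gb,
                       store_rate, upload_rate, retrieve_rate):
    """Cloud blob bill: storage plus upload plus retrieval charges."""
    return (stored_gb * store_rate
            + uploaded_gb * upload_rate
            + retrieved_gb * retrieve_rate)


nas_low = nas_monthly_cost_per_gb(3)    # low end of the $3-$10 range
nas_high = nas_monthly_cost_per_gb(10)  # high end of the $3-$10 range
print(f"NAS: ${nas_low:.2f} to ${nas_high:.2f} per GB per month")

# Example bill: 1,000 GB stored at 10c per GB per month (inside the 5c-25c
# range above), with placeholder transfer rates of 5c up and 10c down.
bill = cloud_monthly_bill(1000, 100, 200,
                          store_rate=0.10, upload_rate=0.05, retrieve_rate=0.10)
print(f"Cloud bill: ${bill:.2f} per month")
```

The point of separating the three terms is that data which is rarely read is dominated by the storage rate, while heavily accessed data can be dominated by retrieval charges.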

 

We often associate Cloud blob with public cloud blob services. However, a similar value proposition is often available from a private blob cloud. If your compute is close to the data or the data is seldom accessed, then a public cloud blob service is the easiest and most cost effective solution. If your compute is remote or there are other constraints like privacy, then it may be more economical to build a private cloud blob system.

 

Options for private Blob include EMC Atmos, OpenStack Swift and Ceph. I’m also seeing Riak being used for small to medium Blob storage lately.

 

Cost to build a cloud-storage system

 

The economics of cloud storage are affected by relaxed semantics and performance requirements, and are scaled largely by the size of the deployment.

 

The semantics of Blob make it possible to use lower cost storage hardware. Blob doesn’t offer the high performance read/write/update that we might expect from a NAS or SAN device, since most Blob services don’t even offer an update capability. This means that the performance requirement is mostly bandwidth and space driven, which eliminates the need for expensive high-RPM enterprise SAS or Fibre Channel disks. In addition to the relaxed performance requirements, Blob systems are able to cope easily with individual component or complete node failures by using RAID-across-the-datacenter approaches. Most Blob implementations are in fact able to use PC disks, giving access to the high-volume, low cost PC disk market, with today’s costs below $100/TB.
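One simple way to picture redundancy spread across nodes is plain replication: keep several copies of each object on different machines, so that losing a disk or a whole node leaves the object readable. The sketch below is a simplification chosen for illustration. Real systems may use erasure coding or more elaborate placement, and the function names, node names and three-copy default are all assumptions.

```python
import hashlib


def place_replicas(blob_id, nodes, copies=3):
    """Pick `copies` distinct nodes for a blob, starting from a hash of its ID.

    Assumes copies <= len(nodes), so the chosen nodes never repeat.
    """
    start = int(hashlib.sha256(blob_id.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


def readable(blob_id, nodes, failed, copies=3):
    """A blob stays readable while at least one of its replica nodes is up."""
    return any(n not in failed for n in place_replicas(blob_id, nodes, copies))


nodes = [f"node-{i}" for i in range(6)]
replicas = place_replicas("photo-123", nodes)
print(replicas)
print(readable("photo-123", nodes, failed={replicas[0]}))   # one node lost
print(readable("photo-123", nodes, failed=set(replicas)))   # all copies lost
```

Because any single copy is enough to serve a read, the individual disks and servers do not need to be highly reliable, which is what allows commodity hardware to be used.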

 

Figure: Cost per GB Deployed vs. Scale of Deployment in Petabytes

 

In addition to being able to use low cost components, the cost to operate a cloud storage system at scale is driven by the size of the configuration. Typical observations today are that the cost per GB to purchase and operate the system falls as the number of bytes stored increases. As you can see, the cost of building and operating your private blob cloud varies with the scale of your cloud Blob system.

 

Summary

 

The main choice going forward is figuring out the best option among private blob, public blob, or a hybrid of both. The best option will depend on several key criteria, including your data latency needs, the amount of bandwidth needed and the scale of your data.
