This week is Big Data Week in NY, and if you’re into big data it’s one of the coolest places to be right now. At center stage are O’Reilly Strata Conf and Hadoop World, with an array of other data events happening around the venue: a data sensing lab, a NY big data meet-up, and many others listed here.
I’m happy to see continued momentum behind virtualization as a platform for Hadoop and big data. At VMworld 2013, I shared the stage with FedEx, who described the significant benefits they are seeing from their virtual Hadoop platform: simplified systems management, the ability to rapidly deploy new Hadoop environments on demand, and the ability to use VMs as containers to ensure their environment is secure and auditable using known, validated mechanisms.
My talk today at Strata Conf poses the question: “Is your Cloud Ready for Big Data?” I’d like to provide a short primer to the main talk here, highlighting some of the main areas we’ll be discussing.
The Runtime Platform – Provisioning, Multi-tenancy and Resource Management
We’ve learned a lot from the great conversations we’ve had with customers over the last two years about creating various types of platforms to run big data solutions, ranging from the use of Mesos at Twitter, to virtualization at FedEx, to YARN at Yahoo. The general theme is a strong need to bring various aspects of a Cloud platform to Hadoop and the broader family of big data applications, with different technologies solving different parts of the puzzle.
The technologies all have their strengths depending on the use cases. The objectives however are somewhat aligned across the use cases, and include:
- Making it dead simple to deploy new big data environments, through automation and self-service.
- The ability to run mixed workloads – more than just MapReduce, including some Hadoop ecosystem apps, SQL services and other generic apps on the same platform.
- Security Isolation – being able to secure data sets and environments when sensitive information needs to be walled off.
- Resource Isolation – the ability to provide performance isolation to prevent the noisy neighbor problem, and to provide enough resources to meet throughput SLAs.
- Version freedom – the ability to run different versions of the runtime environments for different users and developers.
- Availability – making the whole system robust and enterprise grade.
In addition to the runtime platform, we’ve learned a number of lessons about the networking and storage platforms we choose for our big data systems.
The Networking Challenge
For networking there are both challenges and opportunities. Today’s two-layer core/aggregation switch networks aren’t able to provide the necessary bandwidth across racks in the datacenter. Bandwidth inside a rack can be good, up to several hundred gigabits, but is very limited between racks. This makes it important to optimize locality, and pretty much mandates a consolidated data and compute model.
Fortunately, there are new network topologies that provide great solutions, using Clos or leaf-and-spine designs. With these topologies, we’ve reached a point where a well-designed network infrastructure can provide large bandwidth with flat latency across the cluster, on the order of a terabit both inside the rack and between racks. This also allows storage and compute to be separated, opening the door to topologies different from those first seen when Hadoop was designed (circa 2005, with gigabit Ethernet).
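To make the cross-rack bottleneck concrete, here is a rough back-of-the-envelope sketch. The rack sizes, link speeds and uplink counts are hypothetical numbers I picked for illustration, not measurements from any particular network; the point is only how the oversubscription ratio of a rack switch determines what each server gets when it has to talk off-rack.

```python
def oversubscription_ratio(servers, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth for one rack switch."""
    return (servers * server_gbps) / (uplinks * uplink_gbps)


def cross_rack_gbps_per_server(servers, server_gbps, uplinks, uplink_gbps):
    """Bandwidth each server gets if every server in the rack talks off-rack at once."""
    fair_share = (uplinks * uplink_gbps) / servers
    # A server can never exceed its own NIC speed, however fat the uplinks are.
    return min(server_gbps, fair_share)


# Hypothetical racks (assumed values, for illustration only).
legacy = dict(servers=40, server_gbps=1, uplinks=2, uplink_gbps=1)
leaf_spine = dict(servers=40, server_gbps=10, uplinks=4, uplink_gbps=40)

for name, rack in (("legacy", legacy), ("leaf-spine", leaf_spine)):
    ratio = oversubscription_ratio(**rack)
    share = cross_rack_gbps_per_server(**rack)
    print(f"{name}: {ratio:.1f}:1 oversubscribed, {share:.2f} Gbps per server off-rack")
```

With these assumed numbers, the legacy rack is 20:1 oversubscribed, which is why data locality matters so much there, while the leaf-and-spine rack is close enough to flat that separating storage from compute becomes a reasonable option.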
Big Data Storage Platforms
The storage options are also heating up, with a growing set of platform choices for storing and processing big data sets. Of course, Hadoop’s HDFS is at center stage, but there are other strong storage platforms that provide plug-and-play compatibility with Hadoop and offer some interesting benefits. Some of the key options include:
- Using your existing SAN or NAS. This is often the most desirable starting point, since it’s easy to get started and already offers a variety of data services, including snapshots, archives and replication. Most of these solutions are however challenged in the areas of cost and bandwidth scalability.
- Software-defined storage on local disks in commodity servers. Here we can see HDFS as the primary candidate, but there are other options in this space, including Ceph, Gluster and MapR, which also have file system presentation options to simplify getting data in and out of the system.
- Scale-out hardware solutions. In this space there are some credible new entrants, which offer to replace HDFS with a supported scale-out hardware storage solution, making an impact on the cost and bandwidth challenges mentioned above. A great example is the Isilon scale-out storage solution, which scales from 3 to 144 nodes, up to 40PB at 40Gbytes/sec.
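One way to compare storage options on the bandwidth axis is to ask how long a full pass over the data would take at the platform’s aggregate throughput. The sketch below does that arithmetic for the 40PB and 40Gbytes/sec figures quoted above. It assumes decimal units (1PB = 10^15 bytes) and an ideal, perfectly parallel scan, so treat the result as a lower bound rather than a benchmark.

```python
PB = 10**15  # decimal petabyte, assumed
GB = 10**9   # decimal gigabyte, assumed


def full_scan_seconds(capacity_bytes, bandwidth_bytes_per_sec):
    """Time to read every byte once at the given aggregate bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_sec


seconds = full_scan_seconds(40 * PB, 40 * GB)
days = seconds / 86_400  # seconds in a day
print(f"Full scan: {seconds:,.0f} s, about {days:.1f} days")
```

The same function can be fed the capacity and throughput of an existing SAN or NAS to see how the scan time changes as bandwidth fails to scale with capacity.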
Big Data on vSphere
Since we launched Big Data Extensions, we’ve been busy working on complementary solutions.
We’ve complemented BDE with vCloud Automation Center to provide self-service access to Hadoop cluster creation. The aim of vCAC is to allow end users (developers, business units) to easily provision applications from a software catalog, without having to use the lower level VM tools in vCenter. vCAC automates the steps required to provision entire applications using workflow automation. In the case of Hadoop, a user with appropriate privileges can pick from a set of Hadoop cluster templates, answer a few questions and deploy their own cluster. There’s a great demo here.
We’ve also had a great partnership with Intel to build and analyze a large consolidation cluster. Using Hadoop and vSphere, Intel has tested an 808 virtual node Hadoop cluster, using 110 Dell PowerEdge servers. The results show two clusters consolidated onto one, leveraging vSphere resource management and Hadoop auto-elasticity to balance the compute resources across the virtual Hadoop nodes as needed. I’ll post the official results and paper as soon as they are available.
Update: The paper is available here.
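For a rough sense of the density in the Intel test, the snippet below works out the average number of virtual Hadoop nodes per physical server from the two figures quoted above. It assumes the nodes were spread evenly across hosts, which the post does not state, so it is only an average.

```python
virtual_nodes = 808      # virtual Hadoop nodes, from the Intel test described above
physical_servers = 110   # Dell PowerEdge servers, from the same test

# Average density, assuming an even spread across hosts (an assumption).
nodes_per_server = virtual_nodes / physical_servers
print(f"About {nodes_per_server:.2f} virtual Hadoop nodes per server")
```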
Enjoy the Conference
Our Big Data team is onsite at Strata/Hadoop World, and we also have a booth at the show. Please feel free to stop by and say hello!