This week marks the first anniversary of Project Serengeti, which we first released at the Hadoop Summit in 2012. Today we’re announcing the vSphere Big Data Extensions Beta. I’d also like to take this opportunity to share an update on our big data program and highlight our support for Big Data in a virtual environment.
Our overall mission is to make the combination of virtualization and Hadoop the premier platform for Big Data. Through virtualization, a variety of big data workloads can be deployed on a common infrastructure, enabling rapid provisioning and cost savings through shared hardware clusters.
We realize these capabilities by optimizing and enabling the vSphere platform for big data, by building a layer of extensions between Hadoop and virtualization, and by working with the partners behind the key Hadoop distributions. This brings virtualized Big Data to the broadest possible customer and application base.
The key facets of the vSphere Big Data platform to date include:
- Performance characterization of vSphere for Hadoop, including documented best practices and tuning
- Deploying each Hadoop cluster in its own set of virtual machines, providing strong resource and security isolation between tenants
- Leveraging vSphere HA and FT to provide enterprise-level availability
- Project Serengeti – Enabling rapid provisioning of Hadoop clusters
- Project Serengeti – Elastic clusters through Serengeti’s dynamic grow/shrink capability
- Virtualization awareness through the Hadoop Virtualization Extensions (HVE) – contributions to the Hadoop scheduler and file system for virtual topology placement (see the sketch after this list)
- Joint testing and validation of Hadoop in a virtual environment with Cloudera, MapR, Pivotal and Hortonworks
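To make the virtual topology placement above a bit more concrete, below is a minimal sketch of a Hadoop rack-awareness topology script in Python. Hadoop calls the script configured via net.topology.script.file.name with a batch of host names or addresses and expects one network path per argument; the extra node-group level in the example paths (grouping virtual machines that share a physical host) is the kind of information the Hadoop Virtualization Extensions expose to the scheduler and HDFS. The addresses and path layout here are purely illustrative assumptions, not output from Serengeti.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop topology script (illustrative only).
# Hadoop passes a batch of host names/IPs as arguments and reads one
# network path per argument from stdout. The node-group level in these
# paths (the physical ESXi host shared by several VMs) is an example of
# the virtual topology that HVE makes Hadoop aware of.
import sys

# Hypothetical mapping of VM addresses to /rack/node-group paths.
TOPOLOGY = {
    "10.0.1.11": "/rack1/esx-host-a",
    "10.0.1.12": "/rack1/esx-host-a",   # same physical host as 10.0.1.11
    "10.0.1.21": "/rack1/esx-host-b",
    "10.0.2.31": "/rack2/esx-host-c",
}
DEFAULT = "/default-rack/default-nodegroup"

for host in sys.argv[1:]:
    print(TOPOLOGY.get(host, DEFAULT))
```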
The Big News: vSphere Big Data Extensions
Today we’re announcing the VMware vSphere Big Data Extensions Beta. Through Big Data Extensions, we are providing an easy-to-use management tool to provision, manage, and monitor your enterprise Hadoop clusters on vSphere through the vCenter user interface. The new client provides a single management console to enact configuration changes across your cluster, and incorporates reporting and diagnostic tools to help you optimize performance and utilization.
Big Data Extensions builds on the Serengeti core and integrates it into the vSphere client.
Provisioning a Hadoop cluster is as simple as clicking through the guided interface, which lets you specify the Hadoop distribution, the resources to assign, and details about the topology of the Hadoop cluster.
Working with the Community and Partners
We’re happy to continue working with the open source community and users by introducing this new version of Serengeti as part of Big Data Extensions. The latest version incorporates support for the leading distributions of Apache Hadoop 1.2: Cloudera Distribution Including Apache Hadoop (CDH) 4.2, CDH 3, Greenplum HD 1.2, MapR Distribution for Hadoop 2.1.3, and Hortonworks Data Platform (HDP) 1.3. We are also adding support for Hadoop YARN, the next generation of MapReduce. Pivotal HD 1.0, Pivotal’s Hadoop distribution, is the first YARN-based distribution supported by Project Serengeti.
Automatic Elasticity
Big Data Extensions can effectively isolate each Hadoop cluster in its own dedicated resource pool, allowing you to control cluster resource usage using vSphere shares, limits, and reservations.
To facilitate elasticity, Big Data Extensions can automatically scale the number of compute virtual machines in a Hadoop cluster based on contention from other workloads running on the same shared physical infrastructure. Compute virtual machines are added to or removed from the Hadoop cluster as needed, giving Hadoop the best performance when it’s needed and making resources available to other applications or Hadoop clusters at other times. This allows you to efficiently share resources across multiple Hadoop clusters.
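For readers curious about what those per-cluster settings look like at the vSphere API level, here is a minimal sketch using the open-source pyVmomi Python bindings to create a dedicated resource pool with explicit shares, limits, and reservations. This is only an illustration of the underlying vSphere primitives, not how Big Data Extensions implements them; the vCenter address, credentials, cluster name, and sizing values are all placeholder assumptions.

```python
# Minimal sketch: create a dedicated resource pool for a Hadoop cluster.
# Assumptions: the vCenter address/credentials, the cluster name
# "Cluster01", and all CPU/memory sizing values are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the compute cluster whose root resource pool will hold the new pool.
cluster = None
for dc in content.rootFolder.childEntity:
    if not isinstance(dc, vim.Datacenter):
        continue
    for entity in dc.hostFolder.childEntity:
        if isinstance(entity, vim.ClusterComputeResource) and entity.name == "Cluster01":
            cluster = entity
if cluster is None:
    raise SystemExit("cluster not found")

def allocation(reservation, limit, level):
    # Shares, limit, and reservation are the three controls mentioned above.
    return vim.ResourceAllocationInfo(
        reservation=reservation,        # MHz for CPU, MB for memory
        limit=limit,                    # -1 means unlimited
        expandableReservation=True,
        shares=vim.SharesInfo(level=level))

spec = vim.ResourceConfigSpec(
    cpuAllocation=allocation(4000, -1, vim.SharesInfo.Level.high),
    memoryAllocation=allocation(16384, 32768, vim.SharesInfo.Level.normal))

# One dedicated pool per Hadoop cluster keeps its resource usage isolated.
pool = cluster.resourcePool.CreateResourcePool(name="hadoop-cluster-01", spec=spec)
print("created resource pool: " + pool.name)

Disconnect(si)
```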
Additional Functionality
There are also several enhancements to the Big Data Extensions and Serengeti functionality, including:
- Modification of Compute/Memory: The ability to modify the number of vCPUs and amount of memory of a running cluster (a rough illustration of the underlying vSphere operation follows this list)
- YARN/MapReduce 2: Provisioning and control of YARN/MR2-based clusters
- Disk Failure Support: Ability to recover from the failure of a single disk automatically
- Custom OS Environments: Ability to deploy from CentOS 6.x templates with customizations to the Linux environment
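As a rough illustration of the vSphere operation behind the first item above, the sketch below reconfigures a single running virtual machine’s vCPU count and memory with the open-source pyVmomi bindings. The VM name and target sizes are placeholder assumptions, CPU and memory hot-add must already be enabled on the VM, and Big Data Extensions orchestrates this kind of change across a whole cluster rather than one virtual machine at a time.

```python
# Minimal sketch: grow a running VM to 4 vCPUs and 8 GB of memory.
# Assumptions: the vCenter address/credentials and the VM name
# "hadoop-worker-01" are placeholders; hot-add must be enabled.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Look up the VM by name with a container view over the whole inventory.
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "hadoop-worker-01")
view.DestroyView()

# Ask vSphere to reconfigure the running VM and wait for the task.
spec = vim.vm.ConfigSpec(numCPUs=4, memoryMB=8192)
WaitForTask(vm.ReconfigVM_Task(spec=spec))
print("reconfigured " + vm.name)

Disconnect(si)
```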
Early access to vSphere Big Data Extensions is available at http://communities.vmware.com/community/vmtn/beta/vsphere_bde. VMware vSphere Big Data Extensions is expected to be generally available by the end of 2013.
Looking back over the last year, it’s rewarding to see the number of customers engaging in virtualized Hadoop deployments, and I look forward to working with you on this new release. Please feel free to share your thoughts.