Since the release of Serengeti, VMware has learned a tremendous amount from our customers about using virtualization as the platform for big data workloads and Hadoop. These customer conversations provided us with solid reasons to virtualize Hadoop and other big data workloads.
Today, we're introducing the third significant release of project Serengeti. In this blog I cover the headline features of the work accomplished since the Hadoop Summit in June, which include:
- Support for Dynamic Elastic Scaling
- Hive JDBC connections
- Data upload/download interface
- Ability to configure infrastructure topologies
- Placement controls for Hadoop nodes on physical hosts
- Hadoop configuration tuning
- A community-contributed UI
With this October release, we're also supporting Cloudera CDH3u3, Greenplum GPHD 1.2.0.0, Hortonworks HW 1.0.7, and Apache Hadoop 1.0.1.
The Hadoop Users
There are two key users that interact with the Hadoop system:
- The Data Scientist: This is typically the line-of-business person tasked with analyzing the data and providing intelligent insights from it. Their primary concern is quick time to insight: getting access to a cluster on demand and getting reasonable performance from that cluster.
- The IT Guy: This includes the architect, the admin, and the CIO. They run the infrastructure, and their primary concerns are keeping up with the demands of the business, cost efficiency, and keeping services available.
Virtual Hadoop helps the data scientist get rapid access to new virtual clusters on demand, reducing the time taken to provision new data environments. For the IT guy, we can provide a common infrastructure on which to run many big data workloads, removing the need for silos of different solutions and reducing the cost and complexity of the systems required to host them.
Multi-tenancy, Mixed Workloads and Elastic Hadoop
During our investigations of how to run Hadoop on virtualization, we found that there are several new ways to configure Hadoop that can take advantage of the software-definable underlying resources. Consider that today Hadoop is typically configured on a relatively static set of physical resources, and consumes 100% of those resources. This makes it hard or nearly impossible to share them with other applications. Typically, a whole physical host is assigned either to Hadoop or to something else, leading us to build vertical silos of application-specific clusters.
In a virtual configuration, it’s of course easy to share those resources amongst multiple applications. But first, we should discuss what might motivate us to want to do that.
While discussing big data with our customers, we've seen a continuous pattern of multiple components and applications that make up a full big-data application. For example, there is typically Hadoop, layered Hadoop components (such as Hive), several flavors of NoSQL (including Cassandra, HBase, and MongoDB), fast databases (in-memory technologies such as GemFire, Redis, and Memcache), scale-out column-oriented SQL databases, and then other non-map-reduce applications that are part of data integration and visualization. Some of these are batch oriented, like map-reduce, but many are long-running apps that need the traditional services of an application deployment and life cycle management system.
The main point is that rarely do I just see Hadoop. It’s always a mix of different application types. The other types of applications that complement Hadoop should be able to share the same cluster as Hadoop, but may leverage existing application deployment and management frameworks, such as those provided by virtual infrastructure.
Since the resource demands of each workload shift over time, we can make much better use of the infrastructure if we can mix these workloads on the same cluster. In the ideal scenario, we can dynamically vary the amount of resources available to each.
Frameworks for Mixed Big Data Workloads
Given the need to support mixed types of big data workloads, there is an industry level focus on creating a shared infrastructure platform that can support concurrent operation of these different application types. There are several efforts underway in the community to address this, with a diverse set of approaches. Some notable examples are:
- Apache YARN – which adds the capability to run non-map-reduce workloads inside the Hadoop framework. YARN allows multiple application types to run in YARN containers, of which map-reduce is just one. The initial work shows map-reduce and MPI jobs running alongside each other on the same cluster. The Hadoop resource manager is used to partition the resources according to policy. Arun discusses the need to run mixed workloads on the same cluster in his InfoQ article. He cites examples of map-reduce, Twitter Storm, Apache S4 and OpenMPI sharing the same platform environment.
- MESOS – a resource-brokering framework, which allows Hadoop to share resources with other application frameworks. Started originally at UC Berkeley, MESOS allows multiple Hadoop frameworks to run alongside non-Hadoop workloads on the same cluster, integrating with the native resource and configuration management capabilities of the underlying OS to provide some level of isolation between frameworks.
- Serengeti – VMware's mixed-workload framework that leverages virtualization to allow Hadoop to run alongside other applications in the same cluster. We can use virtualization to provide the runtime containers for many types of applications concurrently on the same platform, and leverage the strong isolation that virtual machines provide.
Common to all of these approaches is a need to provide the three standard levels of isolation:
- Configuration Isolation: the ability to allow each application type to be configured independently of the others. This extends beyond mixed workloads, and is useful even for running different versions of the Hadoop environment on the same cluster – commonly to support development, test and production.
- Resource Isolation: the ability to independently control the quality of service of application environments. This is most commonly used to prevent the “noisy neighbor problem”, so that a busy application doesn’t impact the performance of another. A resource management framework is used to set priorities or amounts of resource available to each application.
- Fault Isolation: the ability to prevent the failure of one environment from impacting another. For example, if someone runs an experimental version of Hadoop in test, we don't want that to impact a production version that is running daily operations.
If we contrast the approaches, we see that YARN is bringing some new workloads to Hadoop, but will continue to be quite restrictive in what it can run and control, since it lacks some of the key foundational primitives to run generic applications (e.g. block or POSIX storage, generic networking, etc.). MESOS is heading toward a place similar to managed virtualization, i.e. a common infrastructure approach to running different workload types, with an emphasis on resource management. In the end, a full virtualization platform has a strong blend of the provisioning, infrastructure services (e.g. storage), management and distributed resource controls required for a wide variety of applications.
The Need for Elasticity
Most big data workloads utilize a scale-out distributed architecture. This means that if we mix these workloads, we will ultimately need to vary the amount of resource allocated to each by scaling the number of nodes in each workload. In one recent discussion with a public social application provider, the need for differentiated quality of service between their production HBase and map-reduce workloads was called out. Often, the map-reduce workloads would impact the performance of HBase, so they wanted stronger isolation between the two. Both of these workloads are scale-out architectures, and the actual number of nodes is a function of the application demand at a given time of day and the stage of the application's growth. Another customer cited the need to support a scale-out SAS environment for numerical analysis alongside map-reduce. In a third customer example, there is a need to run more Hadoop map-reduce on a large standby cluster that is mostly idle when it's not being used for fail-over workloads. They want Hadoop to be able to shrink, but still be available for some access to the data, and then grow again.
Scaling workloads up and down is not a new science – it's something we've done for years in virtual and cloud environments for stateless applications. The hard part is scaling an environment which contains a large amount of distributed state. For example, if we want to scale Hadoop, each node may contain upwards of 20 terabytes of data, which means that growing Hadoop onto a new node will require re-balancing and replication of that data onto the node. Likewise, shrinking the environment will require evacuation of those 20 terabytes and re-replication onto other nodes. This can be done, but it's expensive and not something that could be done often. This is why Amazon Elastic MapReduce has limited purpose: when you rent the cluster, it's empty, and when you finish using it, you have to do something with the data, since the cluster goes away. The blob service provides somewhere to store some of the state, but its limited bandwidth constrains the size of the data that can be practically used by the map-reduce cluster.
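To make the cost concrete, here is a rough sketch of what shrinking and growing looks like with stock Hadoop 1.x tooling (the exclude-file path and host name are illustrative, and dfs.hosts.exclude is assumed to already point at that file in the HDFS configuration):

# Shrinking: decommission a data node so its blocks are re-replicated elsewhere
echo "datanode-07.example.com" >> /etc/hadoop/conf/dfs.exclude
hadoop dfsadmin -refreshNodes     # namenode starts copying the node's blocks away
hadoop dfsadmin -report           # wait until the node reports as decommissioned
# Growing: after adding a data node, redistribute existing blocks onto it
hadoop balancer -threshold 10     # long-running, network- and disk-intensive

Every block on the departing node (up to the 20 terabytes in our example) must be copied before the node can actually be removed, which is why this is not something you want to do on a daily schedule.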
Enabling Elasticity and Multi-tenancy in Hadoop
The ideal elastic map-reduce would allow us to rapidly grow and shrink the amount of map-reduce on the cluster, without the constraint of moving massive amounts of data. The data would always be available from a resident service, and the amount of map-reduce can vary as required.
As VMware explored these needs, an interesting set of deployment options appeared for how Hadoop can be configured in a virtual environment to support them. Since Hadoop is a combination of a distributed file system and a scale-out distributed runtime system, we are able to consider these layers separately, which opens up new options for scaling and multi-tenancy.
We considered three primary ways to run Hadoop in a virtual environment, which would allow varying amounts of elasticity and multi-tenancy.
- Traditional Hadoop – In this model, we can easily virtualize Hadoop the same way it runs on physical systems today. We would run one large Linux instance on each host, containing both the runtime portion (the task-tracker) and the distributed file system node (the data-node). Both of these services run as JVMs inside the same Linux instance, and communicate with each other via domain sockets. This model allows us to provision a virtual Hadoop cluster rapidly, but has the limitation that it's hard to scale up/down because of the data in each VM (the 20 terabytes in our example). It would also be difficult to scale the amount of memory up/down if we want to share the hosts' resources with other applications.
- Separated Compute and Storage – In this model, we place the runtime nodes (task-tracker) and distributed file system nodes (data-node) in separate virtual machines. Surprisingly, this model is quite easy to configure, and does not require changes to the Hadoop source. We run the Hadoop file system as a distributed service on the cluster, to manage the disks on each node and present an always-available file system service to the Hadoop runtime nodes. We can then run any number of runtime nodes (virtual machines with the task-tracker inside) against that data service. To expand the size of the virtual Hadoop cluster, we simply add more runtime virtual machines, and to contract, we simply power off some of the virtual machines. The data stays resident in the data layer. (A minimal node-group sketch follows this list.)
- Separated Compute and Storage with Multiple Tenants – Taking this model to the next step, it's possible to run multiple Hadoop runtime clusters against the same data service. This allows us to build a multi-tenant Hadoop service, by offering virtual Hadoop clusters, each with its own set of virtual resources. Each tenant can run their own version of Hadoop and Linux, fully isolated with distinct versions, resources and configuration.
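As a minimal sketch of the separated model (group names and counts here are assumptions; the full spec syntax appears in the Placement Policies section below), the data and compute layers simply become separate node groups in the Serengeti cluster spec:

"nodeGroups": [
  { "name": "data",    "roles": [ "hadoop_datanode" ],    "instanceNum": 4 },
  { "name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8 }
]

Because only the compute group is grown or shrunk, no HDFS data has to move when the cluster is resized.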
Virtual Hadoop and Serengeti
We have been configuring and exploring these architectures in the lab, and extending project Serengeti to support these models.
We have added several new capabilities to Serengeti, including the ability to specify separate compute and data clusters, the ability to pass Hadoop tuning parameters into the configuration, and the ability to deploy a Hive server.
Separated Compute/Data Clusters in Serengeti
In milestone 2, we introduced the capability of specifying a deployment with separated compute and data-node virtual machines. This lays the foundation for building elastically scalable Hadoop clusters.
A blog by my colleague Tariq Magdon-Ismail discusses the architectural implications of separating compute and data for Hadoop, and some performance measurements that answer key questions about deploying Hadoop inside virtual machines in model 2.
Dynamic Cluster Resizing
The big theme for this week’s introduction of Serengeti milestone 3 is dynamic scaling. In this release, we are fully embracing virtual resources as the backing for Hadoop, by allowing Hadoop to hot add/remove Virtual Hadoop nodes on the fly. This gives us the capability to rapidly grow or shrink a Virtual Hadoop cluster, based on the changing needs of the application and user mix being serviced.
See here for a demo of Serengeti dynamic Hadoop Scaling.
New to milestone 3 is a Virtual Hadoop Manager (VHM), which provides a dynamic interface between Serengeti and vSphere vCenter. The VHM allows Serengeti to power up/down Hadoop nodes on the cluster, which allows dynamic grow/shrink of the Hadoop cluster while jobs are running.
We leverage Serengeti's capability to deploy Hadoop in a separated mode, so that we have multiple virtual machines for the task-trackers and data-nodes. By default, Hadoop will use all of these provisioned nodes to run jobs. With VHM, we interface with Hadoop's job tracker to vary the number of active virtual machines dynamically. It does this by using an algorithm to determine the correct number of powered-on task-trackers per host, and then updating the exclude list on the job tracker to change the active scheduler configuration.
The architecture of VHM is shown in the diagram above. The VHM interfaces with vCenter to gather statistics about each host, information from Hadoop about the current task-tracker configurations, and inputs from the Serengeti user interface. These inputs are processed by a pluggable algorithm, and output as actions against vCenter (for power on/off) and to Hadoop’s job tracker (for node activate/deactivate).
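To make the mechanism concrete, the following is roughly the stock Hadoop 1.x decommission path that VHM automates on the compute side (the exclude-file path and host name are illustrative, and mapred.hosts.exclude is assumed to point at that file in mapred-site.xml):

# Deactivate a task-tracker, then tell the job tracker to re-read the exclude list
echo "compute-node-07.example.com" >> /etc/hadoop/conf/mapred.exclude
hadoop mradmin -refreshNodes   # the job tracker stops scheduling tasks on that node
# VHM can then power the corresponding virtual machine off through vCenter;
# removing the entry, refreshing again and powering the VM back on grows the cluster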
Changing the size of the cluster is done through a new command and a user-interface component:
serengeti> cluster limit --name myHadoop --activeComputeNodeNum 8
To enable all the TaskTrackers in the “myHadoop” cluster, use the “cluster unlimit” command:
serengeti> cluster unlimit --name myHadoop
The VHM is open sourced in Serengeti M3.
Several other key features are included in Serengeti M3, including the ability to upload infrastructure topologies, hints for placement control, and a new user interface.
Hadoop using External Distributed File System
There are a growing number of environments where the compute portion of Hadoop connects to a distributed NAS solution, which replaces HDFS. An example of this is the Isilon scale-out NAS system from EMC. In this configuration, just the job-tracker and task-trackers (the compute portion) of Hadoop are deployed, and the scale-out NAS provides the Hadoop compatible storage. There’s a great demo of the Isilon-Serengeti integration here.
To facilitate these configurations, Serengeti now supports deploying Hadoop with its compute nodes (task-trackers) in separate virtual machines.
To support the Isilon example, we want to have just a compute cluster. In the cluster configuration file, a compute-only cluster can be described by specifying the Hadoop master node to contain just a job-tracker, and the worker nodes to contain just the task-trackers. Here is an example of this configuration:
{ "externalHDFS": "hdfs://hostname-of-namenode:8020", "nodeGroups": [ { "name": "master", "roles": [ "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 2048, }, { "name": "worker", "roles": [ "hadoop_tasktracker", ], "instanceNum": 4, "cpuNum": 2, "memCapacityMB": 1024, "storage": { "type": "LOCAL", "sizeGB": 10 }, }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 10 }, } ], “configuration” : { } }
Placement Policies
In milestone 2, we also added support for directives that affect the placement of virtual Hadoop nodes on specific hosts. This is important when creating specific relationships between compute and data nodes, now that we allow separation of compute from data. For example, we often want to create one data node per physical host, and then a variable number of compute nodes per host. To allow this to be described, we provide the placementPolicies directive.
{ "nodeGroups":[ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 4, "memCapacityMB": 2048, }, { "name": "data", "roles": [ "hadoop_datanode" ], "instanceNum": 4, "cpuNum": 2, "memCapacityMB": 1024, "storage": { "type": "LOCAL", "sizeGB": 50 }, "placementPolicies": { "instancePerHost": 1 } }, { "name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 1024, "storage": { "type": "LOCAL", "sizeGB": 10 }, "placementPolicies": { "instancePerHost": 2, "groupAssociations": [ { "reference": "data", "type": "STRICT" }] } }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 10 } } ], "configuration": { } }
Hadoop Configuration Tunings
Serengeti now allows tuning of the site and instance configurations. Serengeti provides a simple way to tune the Hadoop cluster configuration, including attributes in core-site.xml, hdfs-site.xml, mapred-site.xml, and hadoop-env.sh. You can set the configuration when provisioning a Hadoop cluster or modify it afterwards. Configuration changes can be made at the cluster level or at the node-group level. At the cluster level, changes apply to all the nodes in the cluster; at the node-group level, changes only apply to the nodes in that group.
This is done through the spec file with the cluster config command:
{ "nodeGroups" : [ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "instanceType": "LARGE", "cpuNum": 2, "memCapacityMB":4096, "storage": { "type": "SHARED", "sizeGB": 20 }, "haFlag": 'ft', }, { "name": "worker", "roles": [ "hadoop_datanode", "hadoop_tasktracker" ], "instanceNum": 3, "instanceType": "MEDIUM", "cpuNum": 2, "memCapacityMB":2048, "storage": { "type": "LOCAL", "sizeGB": 30 } }, { "name": "client", "roles": [ "hadoop_client", "hive", "hive_server", "pig" ], "instanceNum": 1, "instanceType": "SMALL", "memCapacityMB": 2048, "storage": { "type": "LOCAL", "sizeGB": 10 } } ], "configuration": { "hadoop": { "core-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample: // "io.file.buffer.size": "4096" }, "hdfs-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html }, "mapred-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html }, "hadoop-env.sh": { // "HADOOP_HEAPSIZE": "", // "HADOOP_NAMENODE_OPTS": "", // "HADOOP_DATANODE_OPTS": "", // "HADOOP_SECONDARYNAMENODE_OPTS": "", // "HADOOP_JOBTRACKER_OPTS": "", // "HADOOP_TASKTRACKER_OPTS": "", // "JAVA_HOME": "", // "PATH": "", }, "log4j.properties": { // "hadoop.root.logger": "DEBUG,console", // "hadoop.security.logger": "DEBUG,console", } } }
HA and FT improvements
As of milestone 2, you no longer need to enable Fault Tolerance (FT) in the vSphere client; Serengeti applies the setting automatically if you request it. When you enable High Availability (HA) for the namenode or jobtracker, the service is also configured to be monitored. This makes HA and FT setup extremely easy.
The haFlag attribute now accepts “on” for HA protection, “off” for no protection, or “ft” for FT protection.
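For example, a master node group requesting FT protection (abridged from the spec above) looks like:

{
  "name": "master",
  "roles": [ "hadoop_namenode", "hadoop_jobtracker" ],
  "instanceNum": 1,
  "haFlag": "ft"
}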
Hive server
A Hive server can be deployed and configured automatically, providing remote JDBC/ODBC connections. Just add "hive_server" as one of the roles on the node you want it to run on.
Rack Configuration Profiles
There are now fully customizable configuration profiles for flexible topology descriptions, which let you:
- Set up the physical rack-to-hosts mapping topology in Serengeti.
- Fully control the placement of Hadoop nodes, including group association and rack policies.
- Define the Hadoop topology, including RACK_AS_RACK, HOST_AS_RACK and HVE.
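As a rough illustration only (both the file format and command syntax here are assumptions; see the Serengeti M3 documentation for the exact details), a rack-to-hosts mapping might be described and uploaded as:

# rack topology file: one rack per line, followed by the hosts in that rack
rack1: esx-host-01.example.com, esx-host-02.example.com
rack2: esx-host-03.example.com, esx-host-04.example.com

serengeti> topology upload --fileName /path/to/rack-topology.txt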
With this week's Serengeti release, VMware now supports several new capabilities for big data in a virtual environment. We look forward to interacting on some of the new proofs of concept and deployments around the latest release of Serengeti!