Machine Learning (ML) applications are increasingly being embraced by organizations to accelerate business growth. As the scale of ML applications grows, IT infrastructure struggles to meet the requirements of ML workloads. Infrastructure must be flexible enough to keep ML developers productive, and cloud native platforms like Kubernetes provide that flexibility. Nowadays, more businesses are leveraging Kubernetes to deploy and manage their ML workloads.

Bitfusion allows more applications to gain access to shared GPUs via the network

A Kubernetes cluster usually consists of a set of worker nodes, and an ML workload can be scheduled to any of them. Many ML use cases need hardware accelerators such as GPUs, which traditionally requires each worker node to have at least one accelerator installed locally. These accelerators are expensive pieces of infrastructure. Fortunately, VMware vSphere 7 comes with a feature called Bitfusion, which can create pools of hardware accelerators that different nodes across the network can share. vSphere Bitfusion increases the utilization of GPUs and eliminates the need for local hardware accelerators on every node.

Since Bitfusion provides the compute horsepower, Kubernetes is a great partner to manage the workloads that run ML applications. However, out of the box Kubernetes does not offer a way to let workloads consume Bitfusion's remote GPU resources. Therefore, we need some glue to make the two work with each other.

Extending Kubernetes' capabilities, making it easy for any pod to gain access to remote GPUs

Kubernetes provides a device plugin framework that lets developers advertise system hardware resources to the kubelet. The Office of the CTO, Cloud Native Lab at China R&D created a device plugin that monitors Bitfusion GPU resources and properly allocates them to Kubernetes workloads (i.e. pods). Because device plugins are the standard way to expose custom hardware resources to Kubernetes, the plugin supports advanced Kubernetes features such as resource quotas and stays fully aligned with the Kubernetes ecosystem.
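
For example, a cluster administrator could cap how many Bitfusion GPUs a namespace may request with a standard ResourceQuota. The sketch below is only illustrative; the quota name, namespace, and limit are assumptions, while bitfusion.io/gpu is the resource name the plugin advertises.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: bitfusion-gpu-quota          # hypothetical quota name
  namespace: ml-team                 # hypothetical namespace
spec:
  hard:
    requests.bitfusion.io/gpu: "4"   # this namespace may request at most 4 Bitfusion GPUs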

The Bitfusion device plugin implements Kubernetes' device plugin framework and periodically updates the kubelet about the available Bitfusion GPU resources. Kubernetes then uses this information when it schedules workloads with GPU requirements. The Bitfusion device plugin can be installed as a Kubernetes DaemonSet so that every worker node runs a copy of the device plugin and reports GPU resources from the Bitfusion pool.
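
A minimal sketch of what such a DaemonSet declaration could look like follows. The names, namespace, and container image are illustrative assumptions (the actual manifest ships with the plugin); the host path is the standard directory where device plugins register their socket with the kubelet.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bitfusion-device-plugin        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: bitfusion-device-plugin
  template:
    metadata:
      labels:
        app: bitfusion-device-plugin
    spec:
      containers:
      - name: device-plugin
        image: example.com/bitfusion-device-plugin:latest   # illustrative image reference
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins        # kubelet's device plugin registration directory
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins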

Bitfusion device plugin interactions with kubelet and GPU pool

Example of Bitfusion Device Plugin in Action

When the Bitfusion device plugin is ready, GPU requirements can be specified in the pod's declaration file. In the example below, bitfusion.io/gpu is the name of the Bitfusion GPU resource. The pod runs a TensorFlow application and asks for two (2) GPU resources from Bitfusion.
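
A minimal sketch of such a pod declaration is shown here; the pod name, container image, and command are illustrative assumptions, while bitfusion.io/gpu and the count of two come from the scenario above.

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-benchmark                   # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu    # illustrative TensorFlow image
    command: ["python", "train.py"]            # illustrative training command
    resources:
      limits:
        bitfusion.io/gpu: 2                    # request two GPUs from the Bitfusion pool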

 

Suppose the pod is defined in a file called job.yml. We can run it on Kubernetes with the following command:

$ kubectl apply -f job.yml

When Kubernetes schedules this pod, it looks at its worker nodes and finds out which of them can allocate at least two Bitfusion GPU resources. It then picks one of those nodes to run the pod. In our case, since the Bitfusion GPU resources come from a remote pool, every worker node reports the same quantity of available GPU resources, and the Kubernetes scheduler determines which worker node runs the pod based on its own algorithm.
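
To see what each worker node is advertising, a standard kubectl query is enough; this is just one way to check, and the exact counts will depend on the size of the Bitfusion pool:

# Show the Bitfusion GPU capacity and allocation each worker node currently reports
$ kubectl describe nodes | grep bitfusion.io/gpu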

The node that is assigned to run the pod allocates GPU resources to the pod from the Bitfusion server. When the pod finishes, the GPU resources are released back to the pool and can be reused by other workloads. In this way, multiple Kubernetes workloads can share the same GPU pool created by a Bitfusion server.

The advantages of using the Bitfusion device plugin to share GPU resources are clear. The device plugin removes the requirement for local GPUs on each worker node: GPUs in a pool are shared by all nodes of the cluster, which increases their utilization. The device plugin approach is also a non-intrusive way to extend Kubernetes to leverage Bitfusion's GPU resources.

What’s Next

We are currently fine-tuning our solution and exploring enhancements such as fractional GPUs. When it is ready, we plan to open source the Bitfusion device plugin so that partners and users can take advantage of our solution. Stay tuned.