VMware’s AI/ML Direction and Hardware Acceleration with vSphere Bitfusion
Machine Learning (ML) workloads are becoming increasingly important for our customers as the competitive value of predictive modeling becomes manifest. We see these workloads from edge to data center to cloud, depending on a host of variables and customer requirements. CPU-based ML is quite common, and microprocessor vendors continue to enhance their processors with new instructions and data types specifically designed to accelerate ML workloads, extending the reach of CPUs for these workloads. However, there are important cases in which additional and more powerful hardware acceleration is required. This is the realm of hardware accelerators like GPUs, FPGAs, and an increasing number of domain-specific ASICs (DSAs) from an array of startups. One can see this need for additional acceleration acknowledged, for example, in Intel’s acquisition of Habana Labs, a producer of DSAs for both ML training and inference.
The emergence of these hardware accelerators, with GPUs leading the way, has been an enabler of some of the most impressive advances in AI over the last several years. The term Deep Learning has come to refer to the subset of ML in which the models being trained and the data sets used for training are both so large that, practically speaking, hardware accelerators become a requirement for creating and deploying these models.
VMware has long supported vSphere VMs accessing hardware accelerator devices for High Performance Computing and now for ML workloads. Longstanding work by VMware’s Office of the CTO, VMware engineers, and partners has created a rich set of offerings in this space, focused on delivering accelerator access to VMs running on hosts with installed hardware accelerators. In August of 2019, VMware acquired Bitfusion to further enhance our accelerator options and provide even more flexibility to customers.
Bitfusion Device Pooling and Sharing
Unlike our other supported mechanisms, Bitfusion allows Deep Learning applications running anywhere in the data center to consume single, multiple, or fractional GPU resources on other hosts, allowing physical GPUs to be aggregated into a centralized hardware pool. This pooling allows customers to drive up overall utilization of expensive GPU resources by avoiding situations in which under-utilized GPUs lie scattered across an organization, dedicated to separate teams who may not be using their resources optimally.
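To make the pooling idea concrete, here is a minimal sketch of fractional GPU allocation from a shared pool. It is a toy model and not Bitfusion’s actual scheduler or API: the names PooledGpu, allocate, and utilization, the first-fit placement policy, and the client names are all assumptions made for illustration. It also handles only requests of up to one GPU; multi-GPU requests are left out.

```python
from dataclasses import dataclass, field


@dataclass
class PooledGpu:
    """One physical GPU in the shared pool (hypothetical model)."""
    name: str
    free: float = 1.0  # fraction of the device still unallocated
    tenants: list = field(default_factory=list)


def allocate(pool, client, fraction):
    """First-fit placement: use the first GPU with enough free capacity."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    for gpu in pool:
        # Small tolerance guards against floating-point rounding.
        if gpu.free + 1e-9 >= fraction:
            gpu.free = round(gpu.free - fraction, 6)
            gpu.tenants.append((client, fraction))
            return gpu.name
    return None  # pool exhausted; a real system might queue the request


def utilization(pool):
    """Share of total pooled GPU capacity that is currently allocated."""
    return sum(1.0 - gpu.free for gpu in pool) / len(pool)


# Illustrative requests from clients that have no local GPU.
pool = [PooledGpu("gpu-0"), PooledGpu("gpu-1")]
requests = [
    ("checkout-1", 0.25),
    ("checkout-2", 0.25),
    ("student-vm", 0.5),
    ("training-job", 1.0),
]
placements = {client: allocate(pool, client, frac) for client, frac in requests}
print(placements)
print(f"pool utilization: {utilization(pool):.0%}")
```

In this sketch three small consumers share one device while a larger job takes the second device whole, which is the utilization argument made above: capacity that would otherwise sit idle on dedicated GPUs is handed out from a common pool.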
A few scenarios might make the value more obvious. Imagine you are a retailer deploying next-generation product scanners that augment barcode scanning with ML-based image classification to combat fraud. Rather than deploying a dedicated GPU for each checkout device, you could use Bitfusion to provide Deep Learning inference for multiple checkout devices by sharing access to a single powerful GPU, while the checkout logic runs across multiple VMs and hosts for fault resilience. Or perhaps you are an educational institution that wants to give students in a class access to fractional GPUs from their assigned generic VMs, which have no local GPUs.
As the leader of VMware’s Machine Learning Program Office, I was on the Bitfusion acquisition team last year and was a strong proponent for bringing the technology into VMware. As VMware Bitfusion nears its first release, I thought I’d share some broad thoughts on how the product will likely grow and morph over time.
One theme that will be evident is a deepening integration into vSphere, primarily in the management area. An obvious first step is providing visibility into Bitfusion’s GPU resource allocations via a vCenter plugin. But there are other opportunities. Consider, for example, that Bitfusion has its own resource management capability for mapping user requests for GPU resources to available GPUs. In vSphere, that type of resource management falls under the purview of DRS and so one might expect to see tighter ties between these two components in the future.
While GPUs are currently the most popular Deep Learning hardware acceleration option, both FPGAs and the veritable Cambrian explosion of DSAs that have begun to emerge have the potential to bring significant value to Machine Learning. Happily, Bitfusion is well-positioned to embrace these emerging technologies through tooling it had developed prior to acquisition that helps ease the burden of supporting APIs for new devices. For example, Bitfusion did early work enabling remote access to FPGAs via the Open Programmable Acceleration Engine (OPAE) interface. Similar work was done to enable access to OpenCL-based devices. While the initial release focuses solely on GPU access, one should expect to see additional device enablement over time.
It’s an exciting time for Machine Learning as customers begin to adopt these approaches in earnest. With our acquisition of Bitfusion, VMware has doubled down on providing the most agile, secure, and flexible infrastructure for ML workloads to our customers, whether running on the edge, in the data center, or in the cloud.