The emergence of GPUs as the preeminent AI/ML accelerators was fraught with growing pains, starting in about 2015. End users faced unstable software stacks, buggy drivers, and uneven performance. Furthermore, in their quest for speed, more companies began to develop specialized accelerators, such as AWS Inferentia, Google’s Cloud TPU, and Intel Habana. But the landscape beyond CPUs and GPUs was an unfamiliar one — a befuddling jumble of hardware and graph compilers requiring performance engineering. NVIDIA’s CUDA ecosystem quickly improved and became one of the most complete and robust software stacks available. In fact, CUDA continues to be the dominant API upon which many AI apps are built today.
But there were other problems. In 2016, GPU servers were being built with increased density, such as the NVIDIA DGX-1. Scale-out solutions emerged, such as the DGX POD, in 2019. As the number of teraflops per server grew, end users were unable to fully utilize and share them, and overall GPU utilization rates began to decrease. The economics of GPUs did not make sense without a change in this infrastructure equation.
Bitfusion: Changing the infrastructure equation
At VMware, we recognized the problems mentioned above. In 2020, we responded with the introduction of vSphere Bitfusion (based on technology acquired by VMware in 2019), which allows GPU pooling and sharing over a network. This functionality allows a shared pool of GPUs to be used across users, use cases, time scales, and time zones. With Bitfusion, we have seen the number of users per GPU increase, resulting in large infrastructure savings.
As the NVIDIA CUDA stack has evolved over time, it has become more complete and more performant … but also more proprietary, closed, and coupled. Over the past decade, GPUs have morphed from secondary offload devices into devices that can operate with increasing autonomy. Demand paging from host memory (NVIDIA Pascal in 2016 and AMD HSA in 2013), unified virtual addressing (UVA) for interacting directly with host memory, and unified virtual memory (UVM) allow GPUs to execute with little to no CPU intervention. With the recent releases of CUDA 10 and 11, GPUs even gained the ability to execute complete computational graphs with complex control flow. On the hardware side, scale-out solutions often call for proprietary bus protocols (e.g., NVIDIA NVLink/NVSwitch) or high-performance networking with remote direct memory access (RDMA) or Partitioned Global Address Space (PGAS)-like verbs for data transfers between systems. We believe this trend will continue: more vendor-integrated solutions will need to be accommodated as performance demands increase.
Herein lies the riddle: in a world of high-performance and vendor-optimized (and proprietary) hardware solutions, how is it possible to make them all available to existing AI/ML applications without decades of system enabling and ecosystem evolution? Is such a thing possible?
We believe the answer is yes, and the key word is virtualization.
I will share more about this later in this post. First, let us complete the story and lay out the future requirements for AI infrastructure. After watching the evolution of AI/ML accelerators, taking stock of where we are today, and thinking about the challenges that remain, the following seems clear:
- Execution of the computational graph continues to be hardware-architecture dependent.
- Distribution of the computational graph continues to be hardware and fabric-topology dependent.
- Hardware abstraction is a strong requirement for end users. Users will shun managing drivers, maintaining accelerated library stacks, or tuning models for a particular architecture.
- Hardware choice is required to pick the right architecture for the right model. Though GPUs are great general-purpose accelerators, they can easily be constrained in memory capacity, memory bandwidth, and ultimately performance (compared to architectures with higher memory proximity to compute elements, such as systolic arrays).
- Hardware reuse is important to best make use of currently available hardware, even CPUs. Virtual infrastructure (VI) operators will shun solutions that underutilize their existing hardware investment.
Introducing Project Radium
Over the years, as new high-performance AI accelerators have entered the market, customers have often asked if we would support other vendors and architectures. Today, the answer is “yes.” Project Radium (illustrated in Figure 3 and demonstrated in this video) is an xLabs project coming out of the Advanced Technologies Group in the Office of the CTO. Radium builds upon Bitfusion and expands its feature set beyond NVIDIA GPUs to hardware from other vendors, including AMD, Graphcore, Intel, and others. This is a completely accelerator-agnostic approach to device virtualization and remoting, allowing enablement of new hardware architectures without explicit software support.
Just like Bitfusion, Project Radium will let you dynamically attach to accelerators over a standard network (10GbE and above). Because this is also a transparent virtualization, no code or workflow changes are needed. Your AI engineers can remain focused on models and not worry about compilers, vendor drivers, or having to tune each model for a different device.
Elevating the Hypervisor: Application monitor
Radium works through an application-level monitor that introduces virtualization services in much the same way a Virtual Machine Monitor (VMM) virtualizes a virtual machine. Because the application monitor operates within the context of an application, we can dynamically split the application in half and run each half on physically different systems (see Figure 3). The application monitor is responsible for maintaining application state (memory, files, loading of application libraries) as well as virtualizing interactions with the system, such as system calls and inter-process communication (IPC). By introducing ESXi-like features within a user-space application, a new form of (de)composability can be brought about. Theoretically, applications could be split across several machines, each with its own unique physical resources. All that is required is to keep the application fragments coherent as they independently execute, which is the primary role of the application monitor.
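The initiator/acceptor split can be caricatured in a few lines of Python. This is only an illustrative sketch, not Radium code: a worker thread with message queues stands in for a remote acceptor machine, and the names `backend_matmul` and `run_split_app` are invented for illustration. Radium's actual monitor operates at the memory-page and system-call level; this only shows the shape of the idea.

```python
# Sketch: "top half" application code delegates heavy work to an
# acceptor, while results flow back coherently to the initiator.
import queue
import threading

def backend_matmul(tasks, results):
    """Acceptor side: executes device-dependent work on demand."""
    while True:
        task = tasks.get()
        if task is None:               # shutdown sentinel
            return
        a, b = task
        # Stand-in for accelerator work: a naive matrix multiply.
        n, m, p = len(a), len(b), len(b[0])
        results.put([[sum(a[i][k] * b[k][j] for k in range(m))
                      for j in range(p)] for i in range(n)])

def run_split_app():
    """Initiator side: ordinary application code; heavy ops are delegated."""
    tasks, results = queue.Queue(), queue.Queue()
    acceptor = threading.Thread(target=backend_matmul, args=(tasks, results))
    acceptor.start()
    tasks.put(([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # delegate the work
    out = results.get()                               # coherent result returns
    tasks.put(None)
    acceptor.join()
    return out
```

In a real deployment the two halves would run on different machines, with the application monitor keeping their memory and state coherent rather than an in-process queue.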
For the purposes of accelerating AI/ML applications, let us consider the top-half fragment to be an application or script written over TensorFlow, and the bottom half (or backend) to be the portion of the application stack that contains device-dependent code.
Once the application has been split, each half can execute independently, while the application monitor continuously keeps the data, code, and execution environment coherent. In effect, normal application code runs on the local client (initiator) side, while code requiring high performance runs in a virtual appliance with accelerators (acceptor). We can select portions of the application for remoting, either by exported library function or by scripted module, via a user-editable configuration script. In the example below (Figure 4), we configure Radium to remote the entire imported Python TensorFlow module to the remote side. This is a step toward disaggregated computation, optimized for AI/ML.
Once the application starts, a language-specific interposing layer wraps around Python functions or modules labeled for remoting and delegates execution to the acceptor processes. The acceptor processes, when executing, will initially be empty of any data, code, or state. Instead, as the acceptor processes encounter missing dependencies, the application monitor on the acceptor side will demand memory pages with the required data from the client side. Interestingly, we can reverse the use case by hosting application code on the server side and demand-paging to the client. This has an interesting implication for how AI/ML applications could be served in the future. Imagine a frustration-free, zero-install approach to delivering apps on demand.
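The interposition described above can be sketched with a minimal module proxy. To be clear, `RemoteProxy` and `execute_remotely` are illustrative inventions, not Radium APIs: the proxy wraps a module marked for remoting, and every call is routed through a dispatcher that resolves the target only when it is first needed, loosely analogous to demand-paging missing code on first use.

```python
# Sketch of language-level interposition: wrap a module so that calls
# into it are delegated to a dispatcher instead of running directly.
import math

def execute_remotely(fn_name, module, args, kwargs, log):
    """Stand-in for the acceptor: fetch the function only when first
    needed ("demand-paged"), then execute it and return the result."""
    log.append(fn_name)                  # record which calls were remoted
    fn = getattr(module, fn_name)        # resolved lazily, on first use
    return fn(*args, **kwargs)

class RemoteProxy:
    """Wraps a module so attribute access returns delegating stubs."""
    def __init__(self, module, log):
        self._module, self._log = module, log
    def __getattr__(self, name):
        target = getattr(self._module, name)
        if callable(target):
            return lambda *a, **kw: execute_remotely(
                name, self._module, a, kw, self._log)
        return target                    # plain data passes through

calls = []
remote_math = RemoteProxy(math, calls)   # as if `import math` were remoted
```

Here `remote_math.sqrt(16.0)` behaves exactly like `math.sqrt(16.0)`, but the call passes through the dispatcher first; in Radium the dispatcher would ship the call to an acceptor process rather than execute it locally.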
The result is that for the first time, users will be able to take advantage of a multitude of AI accelerators and tackle new, more industrial-strength use cases (Figure 5). As hardware vendors offer new hardware, all that is required is a version of TensorFlow/PyTorch/etc. with support for the new device type. Another way to think about this is that each hardware vendor would have a different optimized backend implementation, and the user would be able to dynamically choose which implementation to use. Independent Hardware Vendors (IHVs) can introduce most of their latest hardware features from day one. In fact, unlike today’s Bitfusion, which requires comprehension of new CUDA features as they are released, Radium will support the broad range of new CUDA features implicitly, by virtue of the layer at which it operates. In short, we will be able to enable efficient virtualization and remoting to GPUs, as well as other accelerators.
What might the performance look like? Stay tuned for comprehensive benchmarks later in the year. In the meantime, we have presented encouraging results in the VMworld 2021 Solution Keynote, as well as a live demo at our Meet the Experts breakout session (VI1297).
Radium: raising the bar for modern AI infrastructure
All future AI infrastructure will need to provide next-level performance, without requiring application changes. This includes:
- Execution leveraging high-performance and highly specialized vendor devices, PODs, and clusters.
- Distribution over many devices and nodes, with backends that require novel and high-performance networking.
- Hardware abstraction, so all AI/ML applications can be simplified and free of vendor knobs, high-level graph tuning, or backend device compilation.
- Hardware choice, so end users can select any vendor or backend implementation.
- Hardware reuse to get more done with available hardware.
Radium is not only for exotic hardware options; it is also useful for deployments with CPU-only servers, where it can target software backends for improved performance on your existing infrastructure. We have teamed up with ThirdAI to accelerate large-scale models on today’s conventional CPU systems. Their runtime graph compiler takes advantage of inherent model sparsity to achieve order-of-magnitude training and inference throughput increases on normal x86 systems. Software backends can drastically increase the availability of acceleration in a cost-effective manner. Apache TVM is another compiler framework that can optimize and run computational graphs more efficiently on a growing set of supported hardware. You can learn more about ThirdAI at thirdai.com and TVM at tvm.ai.
More to come!
Radium is in active development. Following vSphere Bitfusion, we plan for Radium to have integrations for Jupyter Notebook and Kubernetes and be runnable via console. You will hear more from our hardware partners, including Graphcore and AMD, as well as software partners, such as ThirdAI, on how they are plugging into Radium to evolve vSphere into a more powerful, open, and modern AI infrastructure. There are more exciting updates coming soon, so stay tuned.
For more information about Project Radium, email me at firstname.lastname@example.org.