
Using Xilinx FPGA on VMware vSphere for High-throughput, Low-latency Machine Learning Inference

VMware is committed to helping customers build intelligent infrastructure and optimize workload execution. With the rapidly growing interest in Machine Learning (ML) and High Performance Computing (HPC), hardware accelerators are increasingly being adopted in private, public, and hybrid cloud environments to accelerate these compute-intensive workloads. As part of facilitating this IT infrastructure transformation, VMware is collaborating with partners to ensure accelerated computing capabilities are available on vSphere for our customers.

Previously, we demonstrated the enablement of high-end GPUs and high-speed interconnects on vSphere with close-to-native performance for various ML and HPC workloads; we shared those results here and here. As part of our continuous effort to better serve our customers’ needs, we recently collaborated with Xilinx to test FPGA acceleration on vSphere, and we present the results in this article. Given the focus of Xilinx FPGAs on ML inference, we will show how to use a Xilinx FPGA with VMware vSphere to achieve high-throughput, low-latency ML inference.

Xilinx FPGA and its advantages

FPGAs are adaptable devices that can be re-programmed to meet the processing and functionality requirements of different applications, including but not limited to ML, video, database, life science, and finance workloads. This reprogrammability distinguishes FPGAs from GPUs and ASICs. FPGAs also offer high energy efficiency and low latency compared to other hardware accelerators, which makes them especially suitable for ML inference tasks. Unlike GPUs, which fundamentally rely on a large number of parallel processing cores to achieve high throughput, FPGAs can achieve high throughput and low latency simultaneously through customized hardware kernels and dataflow pipelining.

Vitis AI is Xilinx’s unified development stack for ML inference on Xilinx hardware platforms, from edge to cloud. It consists of optimized tools, libraries, models, and examples. Vitis AI supports mainstream frameworks, including Caffe and TensorFlow, as well as the latest deep learning models for a diverse range of tasks. In addition, Vitis AI is open source and available on GitHub.

Vitis AI software stack [1]
We use a Xilinx Alveo U250 datacenter card in our lab for testing, and we provision ML models quickly using the Docker containers provided with Vitis AI. Before presenting the test results, let’s first discuss how to enable the FPGA on vSphere.
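For reference, provisioning the environment generally follows the container workflow from the Vitis AI GitHub repository. The sketch below assumes that workflow; the image name and tag vary by release (we used v1.1), so treat them as illustrative and check the repository for your version.

    # Clone the Vitis AI repository and launch its prebuilt container.
    # The image name/tag are illustrative; consult the repo for your release.
    git clone https://github.com/Xilinx/Vitis-AI.git
    cd Vitis-AI
    ./docker_run.sh xilinx/vitis-ai-cpu:latest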

Configure Xilinx FPGA on vSphere

Currently, Xilinx FPGAs can be enabled on vSphere in DirectPath I/O (passthrough) mode. In this mode, applications running inside a VM access the FPGA card directly, bypassing the hypervisor layer, which maximizes performance and minimizes latency. Configuring an FPGA in DirectPath I/O mode is a straightforward two-step process: first, enable the device at the host level on ESXi; then add the device to the target VM. Detailed instructions can be found in this VMware KB article. Note that a VM using a DirectPath I/O device must reserve all of its guest memory, and that if you are running vSphere 7, a host reboot is no longer required.
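As a sketch of the host-level step, recent ESXi releases also expose passthrough control via esxcli. The PCI address below is a placeholder, and the vSphere Client workflow described in the KB article remains the documented path:

    # On the ESXi host: locate the FPGA's PCI address, then enable passthrough.
    # 0000:3b:00.0 is a placeholder; substitute the address from the list output.
    esxcli hardware pci list
    esxcli hardware pci pcipassthru set -d 0000:3b:00.0 -e true
    # Then add the device to the VM (Edit Settings > Add New Device > PCI Device).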

Performance

We evaluate the throughput and latency of the Xilinx Alveo U250 FPGA in DirectPath I/O mode by running inference with four CNN models: Inception_v1, Inception_v2, Resnet50, and VGG16. These models differ in their number of parameters and thus in processing complexity.

The server used in this test is a Dell PowerEdge R740 with two 10-core Intel Xeon Silver 4114 CPUs and 192 GB of DDR4 memory. The hypervisor is ESXi 7.0, and end-to-end performance results for each model are compared against bare metal as the baseline. For a fair comparison, Ubuntu 16.04 (kernel 4.4.0-116) is used as both the guest and the native OS. In addition, Vitis AI v1.1 and Docker CE 19.03.4 are used throughout the tests.

We use a 50k-image dataset derived from ImageNet2012. To avoid a disk I/O bottleneck when reading images, we create a RAM disk to store the 50k images. With these settings, we achieve performance similar to Xilinx’s published results here. The following two figures compare virtual and bare-metal performance, one for throughput and the other for latency. In both graphs, the y-axis is the ratio of virtual to bare-metal performance, so y=1.0 means the two are identical. Note that for throughput higher is better, while for latency lower is better.
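For illustration, the RAM disk inside the guest can be created with a standard tmpfs mount, as sketched below; the mount point, size, and dataset path are placeholders:

    # Back the image set with RAM so disk I/O stays out of the measurement.
    # Mount point, size, and source path are illustrative placeholders.
    mkdir -p /mnt/ramdisk
    mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
    cp -r /data/imagenet2012_50k /mnt/ramdisk/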

Throughput performance comparison between bare metal and virtual for ML inference on Xilinx Alveo U250 FPGA
Latency performance comparison between bare metal and virtual for ML inference on Xilinx Alveo U250 FPGA

The figures show that, for these tests, the gap between virtual and bare-metal performance stays within 2% for both throughput and latency, indicating that the Alveo U250 on vSphere performs very close to the bare-metal baseline.

The Future

The adoption of hardware accelerators is expected to keep growing in the near future to meet the rising demand for computing power. At VMware, we invest significant R&D effort to ensure our customers can take advantage of these advanced technologies on the vSphere platform. Our FPGA tests on vSphere for ML inference, conducted in partnership with Xilinx, demonstrated close-to-native performance with DirectPath I/O mode.

References

[1] Vitis AI GitHub page. https://github.com/Xilinx/Vitis-AI.


Michael Cui is a Member of Technical Staff in the VMware Office of the CTO, focusing on virtualizing High Performance Computing. His expertise spans distributed systems and parallel computing. His daily work ranges from integrating various software and hardware solutions, to conducting proof-of-concept studies, to performance testing and tuning, to publishing technical papers. In addition, Michael serves on Hyperion’s HPC Advisory Panel and reviews papers for several international conferences and journals, such as IPCCC, TC, and TSC. Previously, he was a research assistant and part-time instructor at the University of Pittsburgh. He holds a PhD and a Master’s degree in Computer Science from the University of Pittsburgh.