Using NVIDIA GPGPU Compute Acceleration with ESX 6.0

As was mentioned in a previous blog post, accelerator cards are becoming increasingly common in HPC environments and so it is important that we assess the performance of such cards with ESX. Thank you very much to Na Zhang for all of her work on the following.

In this post, we share the results of running the Scalable HeterOgeneous Computing (SHOC) benchmark and a CUDA-enabled version of the LAMMPS molecular dynamics program on ESX 6.0. We used a loaner NVIDIA GRID K2 card for this test to demonstrate how closely we can approximate bare-metal performance. The results are relevant to HPC users even though such users will most often use higher-end K40 and K80 cards instead.

SHOC CUDA benchmark results, showing ratios to bare-metal performance for ESX 6.0u1 and an engineering build of ESX

As the figure shows, ESX 6.0u1 delivers very close to bare-metal performance in all SHOC benchmark categories except those in which data-transfer speed is the determining factor. In those cases, we see up to a 20% drop in performance relative to bare metal. However, with an engineering build of ESX ("Testbuild") that includes support for Intel VT-d large pages, these performance drops are completely eliminated, restoring bare-metal performance in those cases.
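For readers who want to reproduce this kind of comparison, a typical SHOC run on a CUDA device looks roughly like the sketch below. This is an illustrative invocation, not the exact command used in our tests; the harness script name and flags may differ slightly between SHOC releases.

```shell
# Build SHOC with CUDA support (from the SHOC source tree)
./configure --with-cuda
make

# Run the full CUDA benchmark suite via the driver script.
# -s 4 selects the largest problem size; -d 0 targets GPU device 0.
perl tools/driver.pl -cuda -s 4 -d 0
```

Running the identical command inside the VM (with the GPU in passthrough mode) and on the bare-metal host yields the two data sets behind the ratios in the figure.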

Moving beyond low-level benchmarks, we also examined LAMMPS performance when it is accelerated using CUDA/GPGPU.

LAMMPS molecular dynamics performance using CUDA, comparing bare-metal to ESX 6.0u1 and an engineering build of ESX

LAMMPS is a molecular dynamics code, which we tested with four problem sizes of the Atomic Fluid benchmark model. The virtual test used a 16-way MPI job running within a single 16-vCPU VM, with LAMMPS taking advantage of both GPUs on the GRID K2 card. Similarly, the bare-metal case used a 16-way MPI job running within a bare-metal Linux instance, also accessing the two K2 GPUs. As you can see, the results at all problem sizes are almost indistinguishable from bare-metal performance.
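As a rough sketch of how such a run is launched, the command below shows a 16-way MPI LAMMPS job using the GPU package across two GPUs. This is illustrative only: our tests used a CUDA-enabled LAMMPS build of that era, and the binary name (`lmp_mpi`) and input script (`bench/in.lj`, the standard Lennard-Jones atomic-fluid benchmark) are assumptions that depend on how LAMMPS was built.

```shell
# 16 MPI ranks, GPU acceleration enabled via the "gpu" suffix,
# spreading work across the card's 2 GPU devices.
mpirun -np 16 lmp_mpi -sf gpu -pk gpu 2 -in bench/in.lj
```

The same command line runs unchanged in the VM and on bare metal, which is what makes the side-by-side comparison in the figure straightforward.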

The above tests were run on one of our HP ProLiant DL380p Gen8 compute nodes with an NVIDIA GRID K2 installed. We used Red Hat Enterprise Linux 7.1 and CUDA Toolkit v7 for both the virtual and bare-metal configurations, and used a 16-vCPU VM in the virtual case. The ESX versions were as indicated in the charts. The GRID K2 card was passed directly to the guest OS using VMware DirectPath I/O (passthrough mode). To enable use of the GPU card, two BIOS changes were made: PCI Express 64-bit BAR support was enabled, and Video options were set to "Embedded Video Primary, Optional Video Secondary".
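On the ESX side, passthrough is normally configured through the vSphere client, but the resulting VM configuration boils down to a few `.vmx` entries like those sketched below. The device and vendor IDs shown are examples for an NVIDIA card (vendor `0x10de`), not values copied from our test system; the 64-bit MMIO settings are the ones relevant to GPUs with large BARs.

```
pciPassthru0.present = "TRUE"
pciPassthru0.vendorId = "0x10de"     # NVIDIA (example)
pciPassthru0.deviceId = "0x11bf"     # GPU device ID -- check lspci on your host
pciPassthru.use64bitMMIO = "TRUE"    # allow mapping large GPU BARs above 4 GB
pciPassthru.64bitMMIOSizeGB = "64"   # size the MMIO window for the card(s)
```

After the VM boots, `lspci | grep -i nvidia` inside the guest should show the passed-through device, and the NVIDIA driver and CUDA toolkit can then be installed exactly as on bare metal.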
