As mentioned in a previous blog post, accelerator cards are becoming increasingly common in HPC environments, so it is important that we assess the performance of such cards with ESX. Thank you very much to Na Zhang for all of her work on the following.

In this post, we share the results of running the Scalable HeterOgeneous Computing (SHOC) benchmark suite and a CUDA-enabled version of the LAMMPS molecular dynamics program on ESX 6.0. We used a loaner NVIDIA GRID K2 card for these tests to demonstrate how closely we can approximate bare-metal performance. The results are relevant to HPC users even though such users will most often choose higher-end K40 or K80 cards instead.

SHOC CUDA benchmark results, showing ratios to bare-metal performance for ESX 6.0u1 and an engineering build of ESX

As the figure shows, ESX 6.0u1 delivers very close to bare-metal performance in all SHOC benchmark categories except those in which data transfer speed is the determining factor. In those cases, we see up to a 20% drop in performance relative to bare metal. However, with an engineering build of ESX (“Testbuild”) that includes support for Intel VT-d large pages, these performance drops are completely eliminated, restoring bare-metal performance in those cases.
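For readers who want to reproduce the low-level numbers, SHOC is typically built with CUDA support and then driven by its bundled Perl script. The invocation below is a sketch based on SHOC's standard workflow, not taken from the original post; the `-s 4` flag selects the largest of SHOC's standard problem sizes.

```shell
# Build SHOC with CUDA support (assumes nvcc is on the PATH).
# Paths and flags here are illustrative, not from the original post.
cd shoc
./configure --with-cuda
make

# The driver script runs the full CUDA benchmark set and records
# per-test results; -s 4 picks the largest standard problem size.
cd tools
perl driver.pl -cuda -s 4
```

Running the same script on the bare-metal host and inside the VM yields directly comparable per-benchmark numbers, which is how ratios like those in the figure are produced.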

Moving beyond low-level benchmarks, we also examined LAMMPS performance when it is accelerated using CUDA/GPGPU.

LAMMPS molecular dynamics performance using CUDA, comparing bare-metal to ESX 6.0u1 and an engineering build of ESX

LAMMPS is a molecular dynamics code, which we tested with four problem sizes of its Atomic Fluid benchmark model. The virtual test ran a 16-way MPI job within a single 16-vCPU VM, with LAMMPS taking advantage of both GPUs on the NVIDIA GRID K2 card. The bare-metal case similarly ran a 16-way MPI job within a bare-metal Linux instance, also accessing the two GPUs. As you can see, the results at all problem sizes are almost indistinguishable from bare-metal performance.
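As a sketch of how such a run is typically launched: the binary name and input file below are assumptions rather than details from our test, but the flags follow LAMMPS's standard command-line conventions for GPU acceleration, with `in.lj` being LAMMPS's stock Lennard-Jones atomic-fluid benchmark input.

```shell
# Launch LAMMPS as a 16-way MPI job, offloading supported computations
# to both GPUs on the GRID K2 (-pk gpu 2 requests two devices,
# -sf gpu applies the GPU-accelerated style variants).
# Binary name (lmp_mpi) and input file are illustrative.
mpirun -np 16 lmp_mpi -sf gpu -pk gpu 2 -in in.lj
```

The identical command line can be used in the VM and on the bare-metal host, which keeps the two configurations directly comparable.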

The above tests were run on one of our HP DL380p Gen8 compute nodes with an NVIDIA GRID K2 card installed. We used Red Hat Enterprise Linux 7.1 and CUDA Toolkit v7 for both the virtual and bare-metal configurations, with a 16-vCPU VM in the virtual case. The ESX versions were as indicated in the charts. The GRID K2 card was passed directly to the guest OS using VM DirectPath I/O (passthrough mode). To enable use of the GPU card, two BIOS changes were made: PCI Express 64-bit BAR support was enabled, and Video options were set to “Embedded Video Primary, Optional Video Secondary”.
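Once the card is in passthrough mode and assigned to the VM, it appears to the guest as ordinary PCI hardware. A quick way to confirm that the guest sees both K2 GPUs is shown below; these are standard Linux and NVIDIA driver tools, not commands from the original post.

```shell
# Inside the guest OS: confirm the passed-through GPUs are visible
# on the virtual PCI bus.
lspci | grep -i nvidia

# With the NVIDIA driver installed, list the detected devices;
# a correctly passed-through GRID K2 should show two GPUs here.
nvidia-smi -L
```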