[UPDATE, February 2017: This blog entry has been updated to correct an error. To use advanced features like passthrough of large-BAR PCI devices, you must use a UEFI-enabled VM and guest OS.]
As customer interest in running HPC workloads on vSphere continues to increase, I’ve been receiving more questions about whether compute accelerators like the nVidia K80 or Intel Xeon Phi can be used with vSphere VMs.
The answer is that while VMware supports the VMDirectPath I/O feature (i.e., passthrough mode), which allows PCI devices to be made visible to a guest operating system within a VM so that an accelerator can be used for HPC computation, the official support statement for a particular card or card-plus-system configuration must come from nVidia or the appropriate system vendor. That said, we in the Office of the CTO have been keen to performance-test any of these devices we can get access to, and we've published results for the nVidia Grid K2 and for the Intel Xeon Phi. We are keen for two reasons: first, to demonstrate that we can achieve near-native performance with these cards in our virtual environment; and second, to uncover any issues that might prevent a device from being used with vSphere so we can work with our R&D teams to address them.
While the nVidia Grid K2 results were useful, HPC users care more about high-end cards like the nVidia K80. Unfortunately, we had not had any access to a K80 and so were unable to advise customers asking whether it would work. This changed when I was contacted by Duke University a few weeks ago with a request to help them enable a K80 in passthrough mode on ESX 6 (in fact, they wanted to enable four K80s per host) because it was failing. I had talked previously with the folks at Duke about virtualizing research computing workloads and was excited that they were moving forward with this, but concerned that the K80 errors might be a blocker for them.
After about a week of experimenting and debugging and working directly with VMware R&D, we now have a good understanding of what does and does not work relative to using the K80 in passthrough mode with ESX 6. Here is a summary of our findings.
High-end PCI devices like the K80 use very large, multi-gigabyte passthrough MMIO device memory regions to transfer data between the host and the device. ESX 6 can support large memory regions, but it currently has a fixed limit of 32GB per VM for such mappings. A K80 card actually contains two separate (GK210) GPU devices, and each GPU needs to map just over 16GB of memory, which means that only one of the two K80 GPUs can currently be passed through to a given VM. It is possible to pass each of the two GK210 GPUs to a separate VM, but not to pass both through to a single VM. This means that Duke's original goal, passing four K80s (eight GPUs) into a single VM, is definitely not possible currently. Duke has told us, however, that the majority of their users run accelerated applications that take advantage of only a single GPU, though they also have researchers who need more than one GK210, and even more than one K80, to be made available within a single VM.
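The sizing constraint above can be sketched as simple arithmetic. This is an illustrative sketch only; the 16.5GB per-GPU figure is an assumption standing in for the "just over 16GB" mentioned above, not a measured value:

```python
# Sketch of the ESX 6 MMIO sizing constraint described above.
# Numbers are illustrative: each GK210 maps "just over 16GB" (assumed
# 16.5 here), and ESX 6 has a fixed 32GB per-VM limit for such mappings.

ESX6_MMIO_LIMIT_GB = 32      # fixed per-VM passthrough MMIO limit in ESX 6
MMIO_PER_GK210_GB = 16.5     # assumed MMIO footprint of one GK210 GPU

def max_gpus_per_vm(limit_gb=ESX6_MMIO_LIMIT_GB, per_gpu_gb=MMIO_PER_GK210_GB):
    """How many GK210 GPUs fit under the per-VM MMIO mapping limit."""
    return int(limit_gb // per_gpu_gb)

print(max_gpus_per_vm())  # 1: only one of the K80's two GPUs fits per VM
```

Since two GK210s would need more than 32GB of mappings between them, a full K80 (let alone four of them) exceeds the current per-VM limit.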
Consistent with what we saw in our K2 testing, Duke reports that the performance of a single GK210 passed into a VM was very close to bare metal, based on several of the standard CUDA SDK benchmarks as well as some Amber benchmarking.
The debugging session with Duke tripped over a known bug related to logfile error messages and the aforementioned 32GB platform limitation, both of which need to be fixed to enable a full K80 or multiple K80s to be passed into a single VM. While I cannot speak publicly about when these fixes might appear in a shipping version of the product, I'll just note here that the fixes in both cases were low-risk and relatively simple.
To enable a device that uses large PCI MMIO BARs (including the nVidia K40 and K80) to be passed into a UEFI-enabled VM, add the following line to the VM’s VMX file:
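The exact VMX line does not appear in this copy of the post. As a hedged sketch based on VMware's published guidance for large-BAR device passthrough, the relevant settings are typically of the following form; verify the key names against the documentation for your ESX version:

```
# Sketch based on VMware guidance for large-BAR passthrough (verify
# against your ESX version's documentation before relying on it):
firmware = "efi"
pciPassthru.use64bitMMIO = "TRUE"
```

The `firmware = "efi"` entry reflects the UEFI requirement noted in the update at the top of this post.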
These same changes are required to pass an Intel Xeon Phi device into a vSphere VM, as described in a previous post.