Running HPC Applications on vSphere using InfiniBand RDMA

Looking back on 2014, this was the year in which we made significant strides in assessing and addressing High Performance Computing (HPC) performance on vSphere. We’ve shown that throughput or task-parallel applications can be run  with only small or negligible performance degradations (usually well under 5%) when virtualized — a fact that the HPC community seems to now generally accept. The big new thing this year, however, was the progress we are making with MPI applications — specifically those applications that require a high-bandwith, low-latency RDMA interconnect like InfiniBand.

As I mentioned in an earlier blog post, we generated a great deal of RDMA-based performance data about HPC applications and benchmarks this past summer due to the hard work of my intern, Na Zhang. Here is a more detailed look at some of those results.

Test Configuration

The cluster we used for these tests consisted of four DL380p G8 servers, each with two Ivy Bridge 3.30 GHz eight-core processors, 128 GB RAM, and three 600GB 10K SAS disks. The nodes were connected with Mellanox InfiniBand FDR/EN 10/40 Gb dual-port adaptors using a Mellanox 12-port FDR-based switch. We have since doubled the size of this cluster and will report higher-scale results in a later article.

We used RHEL 6.5 (kernel-2.6.32-431.17.1) as our bare-metal OS and as our guest operating system with OFED-2.2-1.0.0. The results reported here were generated on ESXi 5.5u1 (build 1623387) with hyperthreading turned off and tuning for low latency  applied as detailed in Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs. For this set of tests, a single 16-vCPU VM was run on each host and the InfiniBand cards were passed through to the guest OS using VMDirectPath I/O.

OFED Micro Benchmarks

Our ESX 5.5u1 OFED bandwidth and latency results for Read, Write, and Send are shown in the figure below, which compares native performance against both passthrough (VMDirectPath I/O) and SR-IOV. The latter — a feature that allows single PCI devices to be shared among multiple VMs — is useful for supporting multi-VM access to RDMA-connected file systems like Lustre and GPFS.

Read (L), Write (M), and Send (R) OFED RDMA performance results for bandwidth (T) and latency (B) on ESX 5.5u1 for bare-metal, VMDirectPath I/O, and SR-IOV.
Read (L), Write (M), and Send (R) OFED RDMA performance results for bandwidth (T) and latency (B) on ESX 5.5u1 for bare-metal, VMDirectPath I/O, and SR-IOV.

Some issues can be seen in both the bandwidth (top) and latency (bottom) cases across all three transfer types. The bandwidth curves show a shift and a drop — a shift that indicates reduced bandwidth for mid-range message sizes in the passthrough case, and a drop that shows reduced bandwidth for both passthrough and SR-IOV at very large message sizes. The latency curves all show roughly constant overheads for each transfer type at small message sizes with an additional anomalous blip at 128 byte message sizes for Write and Send.

Based on an ESX engineering build supplied to us by our colleagues in R&D, we now know that the passthrough latency overheads shown in these graphs can be completely eliminated by implementing support for write combining (WC) in the platform. That support also eliminates the passthrough bandwidth shift. Another ESX build has demonstrated that the passthrough bandwidth drop can also be removed. Together, these two changes will enable us to achieve essentially bare-metal passthrough latency and bandwidth performance across the entire measured range of message sizes. We are in the process — this week, in fact — of evaluating whether write combining support will improve SR-IOV performance in the same way.

The above performance enhancements lie in the future since they are not incorporated into the shipping versions of ESX. In the meantime, we’ve done higher-level performance measurements using ESX 5.5u1 with the above-documented overheads included to understand how well the current product can handle MPI workloads that depend on RDMA for performance. Some of these results are presented in the next section.

Application Benchmarks

We’ve run a set of strong scaling tests using a set of standard HPC benchmarks and other applications to assess  performance at a higher level in the presence of the overheads documented in the previous section. Strong scaling — keeping the problem size constant as the amount of parallelism is increased — causes the communication-to-computation ratio to increase as less work is done per process and more communication is required among more endpoints over the course of the computation. In our tests on the four-node cluster, we increase the amount of parallelism from 4 processes (one per node) to 64 processes (16 per node). With only four nodes available, scaling trends are present, but difficult to see; we expect our upcoming eight-nodes tests to provide additional insights.

We’ve run the HPC Challenge (HPCC) benchmarks and NAS Parallel Benchmarks (NPB) and seen results that are generally within five percent of bare-metal performance. The graph below shows our results for NPB Class C. While IS performance at np=4 shows a 7% degradation, the rest of the overheads are well under 5%. A slight decrease in performance as MPI process count increases can be seen, especially in the cases of CG and MG, though in both cases the performance drops by only a few percentage points and remains close to bare-metal performance.

Strong-scaling results for NPB Class C on the four-node test cluster. MPI process count varies from 4 to 64.
Strong-scaling results for NPB Class C on the four-node test cluster. MPI process count varies from 4 to 64.

Real applications are ultimately more interesting than benchmarks, and so we’ve spent considerable time looking at several Life Sciences MPI codes — NAMD, LAMMPS, and NWCHEM. (We have focused on Life Sciences codes due to high customer interest in using virtualization and cloud computing for these workloads. If you have suggestions about other specific applications in other disciplines, please contact me at simons at vmware dot com.) NAMD (v2.9) results are shown below for APOA1, a protein-coding gene, and for F1ATPASE, an enzyme involved in ATP synthesis. The maximum degradation shown is about 3% for the 64-process analysis of APOA1.

NAMD 2.9 results for Apoa1 and F1atpase
NAMD 2.9 results for Apoa1 and F1atpase

At this point, you may be wondering how a platform with 20%+ latency overheads at the OFED level can deliver performance close to native on higher-level benchmarks and applications? The answer is that while the HPC community often uses “half roundtrip latency for one-byte messages” as a figure of merit for  HPC systems, there are many real-world applications that do not exchange significant numbers of small messages. These applications therefore remain relatively unaffected by latency overheads at small message sizes — as has been shown by the results above. Let’s look at a more challenging example.

NWCHEM is a computational chemistry code that does exchange a significant number of small messages — and that number increases as the degree of parallelism increases. The graph below shows that as MPI process count increases, virtualized performance initially surpasses that of bare metal (we have not delved into this) and then, as communication becomes an increasingly significant performance component, the degradation rises to about 16% in the 64-way case. Other HPC applications have more significant dependencies on very small message sizes and could therefore experience significantly higher degradations.

NWCHEM MP2 gradient calculations for H2O7
NWCHEM MP2 gradient calculations for H2O7

When evaluating whether current virtualization platforms will deliver adequate MPI performance it is important to understand the messaging characteristics of one’s applications. If  that information isn’t available — and sometimes even if it is — we  recommend benchmarking your important applications using data sets representative of your own work to assess the acceptability of the approach for your own use. When performing such tests it is critical that our platform be tuned to handle these latency-sensitive workloads.

There is one other rule of thumb to keep in mind. Old performance data are often bad performance data. This is especially true in the case of virtualization since both hardware and software support for virtualization continue to advance, increasing efficiency and rendering old performance data obsolete. As a case in point, consider the engineering builds of ESX described above that remove latency overheads for small messages and increase bandwidth for very large message sizes. When those features appear in shipping product, they will improve MPI performance beyond currently achievable levels. These enhancements are just a few in the long line of improvements that have enabled virtualization to embrace an increasingly large fraction of the world’s IT workloads. As we’ve done for Enterprise IT, so we do now for HPC.