Paravirtual RDMA for Low Latency and Flexibility

The Office of the CTO has been exploring how best to enable application access to RDMA for those applications that require the ultimate in high-bandwidth, low-latency communication, a category that includes many HPC MPI applications as well as many scale-out databases and Big Data approaches.

Passthrough mode is the most straightforward way to enable guest-level RDMA. With passthrough (which we call VM DirectPath I/O), a physical PCI device can be made directly visible to the guest operating system running within the virtual machine. We published a research note showing that this approach delivers very good InfiniBand latencies (under 2 µs) and excellent bandwidth over a wide range of message sizes. There is a downside, however: punching through the virtual machine abstraction in this way disables several platform features, most notably vMotion (live migration) and Snapshots.

Many of the HPC customers I’ve talked with about this aren’t too concerned by these limitations, primarily because their bare-metal environments for the most part don’t offer these features, so they aren’t losing capabilities when they transition to a virtual environment. In the Office of the CTO, however, we take a longer view; that’s our job. And what we see is that both vMotion and Snapshots can be used to offer new capabilities in virtualized HPC environments that are difficult or impossible to implement on bare metal, features like reactive or proactive fault tolerance and dynamic resource management. There is a full description of those features in the first part of this presentation, for those interested. In addition, it is clear that if RDMA is to be deployed in Enterprise datacenters (using RoCE, InfiniBand, or iWARP), then enabling a widely used feature like vMotion is going to be very important.

My colleague Bhavesh Davda and our intern, Adit Ranadive, worked closely together this summer to design a solution to this problem, which they discussed in a video interview back in August. More recently, they have described their work in a paper titled Toward a Paravirtual vRDMA Device for VMware ESXi Guests, which appears in the Winter 2012 VMware Technical Journal, released just last week. The paper describes the design of a virtual device that supports standard, Verbs-level access to RDMA from within a guest operating system while preserving the ability to perform vMotion and Snapshots, and that enables direct datapath access to the hardware, which is needed to deliver high performance. Development of the prototype is underway; watch this space for performance results and other updates.
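For readers who haven’t programmed RDMA directly, it may help to see what “Verbs-level access” means in practice. The sketch below shows the standard guest-side resource setup (device, protection domain, registered memory region, completion queue, and queue pair) using the open libibverbs API. This is purely illustrative and is not code from the vRDMA prototype: the buffer size and queue depths are arbitrary choices, and error handling is abbreviated.

```c
/*
 * Minimal sketch of guest-side Verbs resource setup with libibverbs.
 * Illustrative only -- not the vRDMA implementation. Link with -libverbs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    /* Open the first RDMA device the guest sees. */
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the device can DMA to/from it directly. */
    size_t len = 4096;                 /* arbitrary size for illustration */
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    /* Completion queue and a reliable-connection queue pair. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
    printf("QP %u created on %s\n", qp->qp_num,
           ibv_get_device_name(dev_list[0]));

    /* Teardown in reverse order of creation. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```

The appeal of the paravirtual approach is that standard Verbs code like this runs unchanged in the guest, while the device beneath it remains compatible with vMotion and Snapshots.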
