Paravirtual RDMA for Low Latency and Flexibility

The Office of the CTO has been exploring how best to give applications access to RDMA when they require the ultimate in high-bandwidth, low-latency communication. That group includes many HPC MPI applications as well as many scale-out databases and Big Data approaches.

Passthrough mode is the most straightforward way to enable guest-level RDMA. With passthrough (which we call VM DirectPath I/O), a physical PCI device can be made directly visible to the guest operating system running within the virtual machine. We published a research note showing that this approach delivers very good InfiniBand latencies (under 2 µs) and excellent bandwidths over a wide range of message sizes. There is a downside, however: punching through the virtual machine abstraction in this way disables several platform features, most notably vMotion (live migration) and Snapshots.

Many of the HPC customers I’ve talked with about this aren’t too concerned with these limitations, primarily because their bare-metal environments mostly don’t offer these features, so they aren’t losing capabilities when they transition to a virtual environment. However, in the Office of the CTO we take a longer view; that’s our job. What we see is that both vMotion and Snapshots can be used to offer new capabilities in virtualized HPC environments that are either difficult or impossible to implement in bare-metal environments, such as reactive or proactive fault tolerance and dynamic resource management. Those features are described in full in the first part of this presentation, for those interested. In addition, it is clear that if RDMA is to be deployed in Enterprise datacenters (using RoCE, InfiniBand, or iWARP), then enabling a widely used feature like vMotion is going to be very important.

My colleague Bhavesh Davda and our intern, Adit Ranadive, worked closely together this summer to design a solution to this problem, which they discussed in a video interview back in August. More recently, they’ve described their work in a paper titled “Toward a Paravirtual vRDMA Device for VMware ESXi Guests,” which is included in the Winter 2012 VMware Technical Journal released last week. The paper describes the design of a virtual device that supports standard, Verbs-level access to RDMA within a guest operating system while preserving the ability to perform vMotion and Snapshots and still allowing direct datapath access to the hardware, which is needed to deliver high performance. Development of the prototype is underway; watch this space for performance results and other updates.

