Paravirtual RDMA for Low Latency and Flexibility
The Office of the CTO has been exploring how best to give applications access to RDMA (remote direct memory access). The focus is on applications that require very high-bandwidth, low-latency communication, which includes many HPC MPI applications as well as many scale-out databases and Big Data approaches.
Passthrough mode is the most straightforward way to enable guest-level RDMA. With passthrough (which we call VM DirectPath I/O), a physical PCI device can be made directly visible to the guest operating system running within the virtual machine. We published a research note showing that this approach delivers very good InfiniBand latencies (under 2 µs) and excellent bandwidths over a wide range of message sizes. There is a downside, however: punching through the virtual machine abstraction in this way disables several platform features, most notably vMotion (live migration) and Snapshots.
Many of the HPC customers I’ve talked with about this aren’t too concerned with these limitations, primarily because their bare-metal environments mostly don’t offer these features, so they aren’t losing capabilities when they transition to a virtual environment. However, in the Office of the CTO we take a longer view; that’s our job. And what we see is that both vMotion and Snapshots can be used to offer new capabilities in virtualized HPC environments that are either difficult or impossible to implement in bare-metal environments, such as reactive or proactive fault tolerance and dynamic resource management. There is a full description of those features in the first part of this presentation, for those interested. In addition, it is clear that if RDMA is to be deployed in Enterprise datacenters (using RoCE, InfiniBand, or iWARP), then enabling a widely used feature like vMotion is going to be very important.
My colleague Bhavesh Davda and our intern, Adit Ranadive, worked closely together this summer to design a solution to this problem, which they discussed in a video interview back in August. More recently, they’ve described their work in a paper titled “Toward a Paravirtual vRDMA Device for VMware ESXi Guests,” which is included in the Winter 2012 VMware Technical Journal that was just released last week. The paper describes the design of a virtual device that supports standard, Verbs-level access to RDMA within a guest operating system. The device preserves the ability to perform vMotion and Snapshots while still allowing direct datapath access to the hardware, which is needed to deliver high performance. Development of the prototype is underway; watch this space for performance results and other updates.