Summer of RDMA
Those involved in HPC understand the need for low-latency communication, both for many parallel distributed applications and for applications whose storage requirements generate lots of small-message traffic.
But it would be a mistake to assume low latency is an HPC-only issue, for two reasons. First, as enterprise software architectures become more horizontally scaled (also referred to as “scale-out”), the performance of the link connecting communicating endpoints becomes a first-class determinant of overall application performance, much as it is in HPC. Middleware examples include memcached, vFabric GemFire, and Hadoop. Oracle’s Exadata and Exalogic products are another example: these appliances use an InfiniBand interconnect (the most popular high-speed interconnect technology used in HPC) internally because low latency matters so much. In fact, Oracle thinks this is so important to their enterprise strategy that they bought a stake in Mellanox, the primary supplier of InfiniBand products and technology.
The second reason latency should not be viewed as an HPC-only issue is that we’ve seen that low-latency interconnects can improve performance within our own platform, most notably with multi-host services like vMotion and FT, among others. Consider vMotion. If the VM can be transferred more quickly during a vMotion, we can see two speedup effects: a first-order improvement from simply moving the data faster, and a second-order effect in which the faster transfers give the source VM less time to scribble on memory during vMotion, which further reduces the overall vMotion time by reducing the total amount of data that must be transferred. This improvement relies on both increased bandwidth and reduced latencies.
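To make the two effects concrete, here is a toy model in Python. It is only a sketch: it assumes an iterative pre-copy scheme in which each round resends the memory dirtied during the previous round, it assumes a constant dirty rate, and it models bandwidth only, not latency. The function name, parameters, and numbers are all hypothetical and are not taken from vMotion itself.

```python
def precopy_migration(memory_gb, bandwidth_gb_s, dirty_rate_gb_s,
                      stop_threshold_gb=0.05, max_rounds=30):
    """Toy iterative pre-copy model (hypothetical, for illustration only).

    Returns (total_seconds, total_gb_sent).
    """
    to_send = memory_gb          # the first round copies all of guest memory
    total_time = 0.0
    total_sent = 0.0

    for _ in range(max_rounds):
        round_time = to_send / bandwidth_gb_s
        total_time += round_time
        total_sent += to_send

        # Memory the source VM dirtied while this round was in flight
        # has to be sent again; it can never exceed the VM's memory size.
        to_send = min(memory_gb, dirty_rate_gb_s * round_time)

        if to_send <= stop_threshold_gb:
            break

    # Final copy of whatever is still dirty (the VM is assumed paused here).
    total_time += to_send / bandwidth_gb_s
    total_sent += to_send
    return total_time, total_sent


if __name__ == "__main__":
    # Assumed example values: 16 GB guest, 0.5 GB/s dirty rate.
    for bw in (1.0, 2.0):
        seconds, sent = precopy_migration(16.0, bw, 0.5)
        print(f"bandwidth {bw} GB/s: {seconds:.2f} s, {sent:.2f} GB sent")
```

In this model, doubling the link bandwidth cuts the migration time by more than half, because the faster link also shrinks the total amount of data sent. That shrinkage is the second-order effect described above.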
Which brings us to RDMA: remote direct memory access. While space does not permit a long description of RDMA here, in brief this technique allows data to be transferred across an interconnect and placed into a remote machine’s memory without the help of host CPUs. The transfer is mediated by the HCAs (host channel adapters) themselves. This allows for very efficient data motion with very small latencies. For native InfiniBand, “small” means a small number of microseconds. There are several technologies vying for prominence in this space, chiefly native InfiniBand, iWARP, and RoCE. Thankfully, there is a software layer, OpenFabrics Verbs, that offers a standard interface on top of all three of these transport mechanisms.
Why is this the Summer of RDMA? For two reasons. First, I am thrilled to have Adit Ranadive, a third-year PhD student from Georgia Tech with deep HPC experience, here with us this summer looking very closely at RDMA performance on ESXi. He is using VMware VMDirectPath I/O with QDR InfiniBand cards lent to us by Mellanox to examine point-to-point bandwidth and latency for native InfiniBand and RoCE, comparing ESXi to Xen, KVM, and bare-metal performance. He has also selected an array of higher-level benchmarks, including the Presta, NAS, and OSU MPI benchmarks as well as GROMACS, HPCC, SPECMPI 2007, and some commercially oriented messaging tests (IBM WebSphere MQ Low Latency Messaging and Informatica Ultra Messaging Platform). By using VMDirectPath and removing as much of our software as possible from the execution path, we hope to understand where we may be adding overhead. So far, we are seeing excellent performance relative to native when polling for message completions. With event-based completions, we are seeing some significant overheads and are working with networking and monitor (VMM) experts to understand the causes. I’ll share some graphs once we get further into this. It is definitely a work in progress and one that I am excited to see within VMware, as it will help position us to better address requirements for latency-sensitive applications and frameworks.
Bhavesh Davda is the second reason this is the Summer of RDMA. Bhavesh has recently joined the Office of the CTO from the VMware network engineering team to focus on in-guest RDMA issues. He has decided to give up the joys of engineering management and get back to some hard technical issues. Specifically, he will be looking at how we might enable RDMA access within a guest while also maintaining our ability to perform vMotion and snapshot operations, two important requirements for both enterprise and HPC environments. Given his deep experience with our networking stack and with virtualizing network devices, and his knowledge of all flavors of passthrough (including VMDirectPath), I can’t think of a better person to be working on this technology exploration. Though he prefers to be heads down on the project, he has promised to write an occasional guest blog entry here to keep everyone informed of his thinking and his progress.