In a multi-CPU server, memory modules are local to the CPU to which they are connected, forming a non-uniform memory access (NUMA) architecture. Because remote accesses are slower than local accesses, this type of architecture can degrade application performance. Similar slowdowns occur when an I/O device issues non-uniform DMA (NUDMA) operations because the device is connected to memory via a single CPU.
Our VMware research team, along with a group of collaborators from Dell, the network device maker NVIDIA, Technion – Israel Institute of Technology, and Tel Aviv University, has developed a solution to this NUDMA problem for I/O devices. IOctopus, as it is called, is a device architecture that eliminates NUDMA by unifying multiple physical PCIe functions — one per CPU — so they appear as a single device both to system software and to the world outside the server.
IOctopus prevents NUDMA effects in I/O devices and makes all node-device interactions local. The team implemented IOctopus on existing hardware and demonstrated that it improved throughput by as much as 2.7x and latency by as much as 1.28x. Our technical paper, “IOctopus: Outsmarting Nonuniform DMA,” won the prestigious Best Paper Award at the ASPLOS 2020 conference in Lausanne, Switzerland.
Igor Smolyar, a PhD candidate at Technion who served as a research intern in the VMware Research Group (VRG), led the IOctopus project. Smolyar was introduced to the company by VRG researcher Dan Tsafrir, his advisor at Technion.
“People have been aware of the NUDMA problem and had tried to solve it using different workarounds,” Smolyar explained. “The problem with the other solutions is that they are just mitigations.”
One such workaround we frequently see with VMware customers: they take a two-socket machine, connect a NIC to one of the sockets, and, in an effort to optimize performance, use only that socket. But this is a poor solution, since it leaves half the machine unused. Such static resource allocation breaks down when the workload is dynamic and may require more resources than a single NUMA node can provide, and it also reduces workload consolidation.
A more elegant solution
IOctopus removes this limitation. When an I/O thread migrates between NUMA nodes, the IOctopus driver automatically switches to use the local manifestation of the IOctopus NIC and programs the device to steer this I/O thread’s incoming packets accordingly. From the OS perspective, IOctopus appears as a single local network device on all NUMA nodes.
“We discovered that you can practically eliminate the effect if your NIC is connected to all your NUMA nodes in the machine. With a regular NIC today, you get bad performance on remote NUMA nodes,” Smolyar said.
The team found that IOctopus can effectively prevent NUDMA for I/O devices. It is based on the idea that multiple physical PCIe functions may serve as internal logical entities within a single device, in a manner that makes them transparent both to the external world and to system software layers above the IOctopus device driver in the I/O stack.
The IOctopus team proposed both architectural and device-driver changes to mitigate the NUDMA problem. Like the eight-limbed sea creature, IOctopus spreads “limbs” to each CPU in the system. PCIe lanes connect an IOctopus NIC to each NUMA node, which eliminates the need to go over the interconnect. IOctopus DMAs are always local (and therefore fast).
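To make the architectural idea concrete, here is a minimal Python sketch of it: one physical PCIe function per NUMA node, unified behind a single logical device, with every I/O issued through the function local to the caller's node. All class and method names here are illustrative, not taken from the actual IOctopus implementation.

```python
# Illustrative model of the IOctopus idea: each NUMA node gets its own
# physical PCIe function, but the set of functions is presented as one
# logical device. A DMA is always issued via the function attached to
# the caller's node, so it never crosses the inter-CPU interconnect.
# All names are hypothetical.

class PcieFunction:
    def __init__(self, node):
        self.node = node  # NUMA node this function is wired to

    def dma(self, issuing_node):
        # A DMA is "local" when the issuing node owns this function.
        return "local" if issuing_node == self.node else "remote"

class IOctopusDevice:
    """One logical device backed by a per-node physical function."""
    def __init__(self, num_nodes):
        self.functions = {n: PcieFunction(n) for n in range(num_nodes)}

    def issue_io(self, issuing_node):
        # Pick the manifestation local to the caller's NUMA node.
        return self.functions[issuing_node].dma(issuing_node)

dev = IOctopusDevice(num_nodes=2)
print(dev.issue_io(0))  # local
print(dev.issue_io(1))  # local — no remote DMA regardless of node
```

The point of the sketch is the selection step in `issue_io`: with a conventional NIC there is only one function, so threads on the other node are forced into the "remote" path.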
The development process
The team implemented an IOctopus prototype using the NVIDIA ConnectX-5 Socket Direct NIC. The collaborators from NVIDIA provided an API to program the internal switch on the device. The current model of the ConnectX-5 can split its x16 PCIe connector into two x8 connectors, so the IOctopus team was able to connect the device to both CPUs on their dual-socket server.
Dell also took part in the IOctopus project, building a riser cage adapter that connects the network interface to each CPU socket.
Because the adapter is already wired, system engineers don’t need a cable extender. They simply plug in the network interface, flip a few jumpers on the board, and it’s ready for use.
The IOctopus device driver is based on the Linux team driver, with an added IOctopus mode. The driver presents an IOctopus NIC to the OS as a single device and creates only local ring buffers. When an I/O thread migrates between NUMA nodes, the driver reprograms the steering table of the NIC hardware to redirect packets to the correct socket, as illustrated below in the thread-migration graph.
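The receive-path steering described above can be sketched in a few lines of Python. This is a toy model, not the real driver code: the class, function, and parameter names are all hypothetical stand-ins for the NIC's steering-table API.

```python
# Toy sketch of receive-path steering: when an I/O thread migrates to
# another NUMA node, the driver reprograms the NIC's steering table so
# the thread's incoming packets are DMA'd into a ring buffer on the new
# node. All names are illustrative, not the real driver API.

class SteeringTable:
    def __init__(self):
        self.flow_to_node = {}  # flow id -> NUMA node of target ring

    def steer(self, flow, node):
        self.flow_to_node[flow] = node

    def deliver(self, flow):
        # Packets for `flow` land in the ring buffer on this node.
        return self.flow_to_node[flow]

def on_thread_migration(table, flow, new_node):
    """Driver callback: redirect the thread's flow to its new node."""
    table.steer(flow, new_node)

table = SteeringTable()
table.steer(flow=42, node=0)       # thread starts on NUMA node 0
on_thread_migration(table, 42, 1)  # scheduler moves thread to node 1
print(table.deliver(42))           # 1 — packets now arrive locally
```

Because only the steering entry changes, the redirect is cheap: the thread keeps using whichever ring buffer is local to its current node, and remote DMA never occurs.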
Co-innovation is “business as usual” for VRG
VRG frequently collaborates with academics and students who are working on topics of interest to VMware. It forges these connections through various channels, including an academic program (via grants and sponsorships), as well as by hiring affiliated researchers, scholars in residence, and interns.
In addition to the researchers mentioned in this post, other participants in the IOctopus project included Alex Markuze (Technion – Israel Institute of Technology), Boris Pismenny (Technion – Israel Institute of Technology and NVIDIA), Haggai Eran (Technion – Israel Institute of Technology and NVIDIA), Austin Bolen (Dell), Liran Liss (NVIDIA), Adam Morrison (Tel Aviv University), and Dan Tsafrir (Technion – Israel Institute of Technology and VMware Research).