This year, three papers from the VMware Research Group (VRG) are being presented at the highly competitive ASPLOS conference*. ASPLOS (Architectural Support for Programming Languages and Operating Systems) is the premier venue for work that spans computer architecture, programming languages, and operating systems. The conference is known for its technical excellence, practicality, and cross-disciplinary nature, underscoring some of the strengths of the research done in VRG. All three VRG papers explore advances in the memory subsystem, which significantly impacts the performance of existing and emerging workloads. Another exciting commonality is that all three papers are collaborations with academia. The first work, Copy-on-Pin, extends the widely popular Copy-on-Write technique to correctly handle pinned memory, avoiding numerous bugs that have previously compromised security. The second work, uBFT, shows that Byzantine Fault Tolerance can be extremely cheap: by leveraging disaggregated memory in data centers, it replicates requests in only 10 microseconds. The third work, Mosaic Pages, shows how to significantly compress Translation Lookaside Buffer (TLB) entries, achieving a much larger TLB reach for efficient virtual memory accesses.
Copy-on-Pin: The Missing Piece for Correct Copy-on-Write
Memory management is a complex and intricate area in operating systems, involving various techniques that interact in complicated ways. Two such mechanisms are Copy-on-Write (COW) and Memory Pinning. COW saves memory by letting logically distinct virtual pages share a single physical page: the shared page is write-protected, and when a write triggers a page fault, the OS copies the page and remaps the writer to the copy. Memory Pinning enables direct memory access by the OS or I/O devices and prevents pages from being moved.
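To make the COW mechanism concrete, here is a toy sketch in Python (our own illustration, not OS code): two "address spaces" share one physical page until a write occurs, at which point the writer gets a private copy.

```python
# Toy model of Copy-on-Write: a shared page is copied on the first write.

class PhysicalPage:
    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 0

class AddressSpace:
    def __init__(self):
        self.page_table = {}  # virtual page number -> PhysicalPage

    def map(self, vpn, page):
        page.refcount += 1
        self.page_table[vpn] = page

    def write(self, vpn, offset, value):
        page = self.page_table[vpn]
        if page.refcount > 1:
            # Shared page: the "page fault" path copies and remaps it.
            page.refcount -= 1
            copy = PhysicalPage(page.data)
            copy.refcount = 1
            self.page_table[vpn] = copy
            page = copy
        page.data[offset] = value

parent, child = AddressSpace(), AddressSpace()
shared = PhysicalPage(b"hello")
parent.map(0, shared)
child.map(0, shared)          # e.g., after fork(): both map the same page

child.write(0, 0, ord("J"))   # triggers a copy; parent's page is untouched
assert bytes(parent.page_table[0].data) == b"hello"
assert bytes(child.page_table[0].data) == b"Jello"
assert parent.page_table[0] is not child.page_table[0]
```

The refcount plays the role of the OS's sharing metadata: as long as it is greater than one, any write must first break the sharing.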
However, these two techniques can conflict: COW's remapping of pages violates Memory Pinning's guarantee that pages stay in place. Because the interaction between COW and Memory Pinning was never clearly specified, this conflict has produced many OS bugs over the years, which were patched with inadequate ad-hoc fixes.
To address this issue systematically, researchers David Hildenbrand and Prof. Martin Schulz from Technical University of Munich, and Nadav Amit from VRG, conducted a detailed analysis of the problem. Their paper, “Copy-on-Pin: The Missing Piece for Correct Copy-on-Write,” describes their findings, including previously unknown bugs in Linux, FreeBSD, and NetBSD resulting from the interaction of COW and Memory Pinning.
Their analysis shows that while it is possible in some instances for a memory page to be both shared using COW and pinned, commodity OSes cannot support such scenarios due to the complexity and high tail latency they introduce. Therefore, COW and Memory Pinning must be mutually exclusive. However, previous attempts to make them mutually exclusive have failed due to the memory overhead, complexity, and synchronization required to track shared and pinned pages.
Their insight is that exact tracking is unnecessary as long as COW and Memory Pinning remain mutually exclusive. Therefore, they suggest relaxing the tracking policy, sacrificing some memory-sharing opportunities, and causing some unnecessary memory copying events to simplify the OS code and reduce the memory overhead.
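A minimal sketch of this mutual-exclusion idea, in our own hypothetical terms (the names and the single "maybe_shared" bit are our simplification, not the paper's Linux implementation): a page may be COW-shared or pinned, never both, and the sharing bit is tracked only approximately, trading a few unnecessary copies for simpler bookkeeping.

```python
# Hypothetical sketch of mutually exclusive COW sharing and pinning.

class Page:
    def __init__(self, data):
        self.data = bytearray(data)
        self.maybe_shared = False  # conservative: may stay set after sharing ends
        self.pinned = False

def share_cow(page):
    """Try to COW-share a page; pinned pages get a copy instead."""
    if page.pinned:
        return Page(page.data)     # never share a pinned page
    page.maybe_shared = True       # approximate tracking: just set a bit
    return page

def pin(page):
    """Pin a page; possibly-shared pages are copied first ("copy on pin")."""
    if page.maybe_shared:
        page = Page(page.data)     # may copy unnecessarily -- that's the trade-off
    page.pinned = True
    return page                    # caller remaps to the returned page

p = Page(b"data")
q = share_cow(p)   # p and q alias the same page, marked maybe_shared
r = pin(q)         # pinning breaks the (possible) sharing first
assert r is not p and r.pinned and not p.pinned
```

Because the `maybe_shared` bit is never cleared eagerly, the policy can copy a page that is no longer shared, but it never pins a page that might be, which is exactly the invariant the paper argues is the one worth preserving.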
David Hildenbrand implemented the relaxed tracking policy in Linux and upstreamed it, showing that it simplifies the code without requiring additional memory for tracking and causing very few unnecessary memory copying and missed memory sharing opportunities. The relaxed tracking policy is now part of Linux 5.19 and newer versions.
uBFT: Microsecond-scale BFT using Disaggregated Memory
Data center applications (social networks, web search, e-commerce, banking, etc.) increasingly need microsecond-scale performance while tolerating general failures beyond simple crashes. Protecting against such failures has traditionally required slow and expensive Byzantine Fault Tolerant (BFT) protocols for state machine replication. These protocols have a number of drawbacks: they incur milliseconds of latency, they require a large number of replicas (3f+1 to tolerate f failures), they consume unbounded memory, and they sometimes require a large trusted computing base. These drawbacks might explain why BFT protocols have seen no adoption in data centers.
In this paper, Marcos K. Aguilera and Naama Ben-David from VRG, together with Rachid Guerraoui, Antoine Murat, and Athanasios Xygkis from EPFL, as well as Igor Zablotchi from MIT, propose uBFT. uBFT is the first BFT system that can replicate requests in just 10 microseconds, at least 50 times faster than several alternatives. Furthermore, uBFT has several other characteristics that make it practical: it requires only 2f+1 replicas, bounded memory, and a small, non-tailored trusted computing base. In the common case, uBFT leverages unanimity to replicate requests quickly without invoking the trusted computing base or expensive cryptographic primitives. In the slow path, when there are failures or slowness in the network, uBFT uses a novel protocol that combines digital signatures with judicious use of the trusted computing base. That trusted computing base is non-tailored and small: rather than trusted enclaves with arbitrary logic (such as SGX) or trusted hypervisors, which have large attack surfaces due to their complexity, uBFT relies solely on disaggregated memory, a technology increasingly present in data centers thanks to the availability of RDMA (Remote Direct Memory Access) today and CXL (Compute Express Link) in a few years.
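The fast-path idea can be sketched with a toy model (our own simplification, not uBFT's actual protocol): with n = 2f+1 replicas, a request commits on the fast path only when every replica returns an identical acknowledgment; any missing or conflicting reply forces a fallback to the slower, signed path.

```python
# Toy model of a unanimity-based fast path for BFT replication.

F = 1            # Byzantine failures to tolerate
N = 2 * F + 1    # replicas (uBFT needs only 2f+1, not 3f+1)

def fast_path_commit(acks):
    """acks: list of (replica_id, digest) replies; None means no reply.

    Returns the committed digest, or None to signal a slow-path fallback.
    """
    if len(acks) != N or any(a is None for a in acks):
        return None                      # missing replies: fall back
    digests = {digest for _, digest in acks}
    if len(digests) == 1:
        # Unanimity: commit without signatures or the trusted component.
        return digests.pop()
    return None                          # conflicting replies: fall back

assert fast_path_commit([(0, "h1"), (1, "h1"), (2, "h1")]) == "h1"
assert fast_path_commit([(0, "h1"), (1, "h2"), (2, "h1")]) is None
```

The point of the sketch is the asymmetry: the common, failure-free case needs only cheap equality checks, while cryptography and the trusted disaggregated memory are reserved for the rare slow path.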
uBFT is notable in two aspects. First, it demonstrates that Byzantine Fault Tolerance can be achieved much faster than previously believed, making it suitable for microsecond-scale applications. Second, uBFT requires relatively few resources to do so, making it a realistic alternative to protocols such as Paxos or Raft that can tolerate only crash failures.
Mosaic Pages: Big TLB Reach with Small Pages
Virtual memory has been a crucial feature of computer architectures for decades, enabling each program to run in its own private address space and preventing one program from corrupting the memory of other programs on the system. Virtual memory has a cost, though. Every time a program accesses RAM, the CPU must translate, on the fly, a virtual address into a physical one. Programs can make billions of memory accesses every second, so this translation process must be extremely fast. The standard solution for decades has been to keep a small address-translation cache, called the Translation Lookaside Buffer (TLB), in the CPU.
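The role of the TLB can be illustrated with a toy translation loop (illustrative only; real hardware TLBs are set-associative and managed in silicon): a hit avoids the expensive page-table walk, and the small capacity is what limits its "reach."

```python
# Toy model of virtual-to-physical address translation with a tiny TLB.

PAGE_SIZE = 4096
TLB_CAPACITY = 4

page_table = {0: 7, 1: 3, 2: 9, 5: 12, 8: 4}  # virtual page -> physical frame
tlb = {}                                       # cache of recent translations
stats = {"hits": 0, "misses": 0}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                        # fast path: TLB hit
        stats["hits"] += 1
        frame = tlb[vpn]
    else:                                 # slow path: page-table walk
        stats["misses"] += 1
        frame = page_table[vpn]
        if len(tlb) >= TLB_CAPACITY:      # naive eviction when full
            tlb.pop(next(iter(tlb)))
        tlb[vpn] = frame
    return frame * PAGE_SIZE + offset

for vaddr in [0, 100, 4096, 4200, 0]:
    translate(vaddr)
assert stats == {"hits": 3, "misses": 2}
```

With only a handful of entries, any working set larger than `TLB_CAPACITY` pages thrashes the cache, which is precisely the problem the next paragraph describes.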
Unfortunately, the number of translations a TLB can hold is sharply limited by power, heat, and chip area constraints. TLBs today are not much larger than they were a decade ago, even though RAM, program, and dataset sizes have exploded. As a result, TLBs have become much less effective caches, increasing the cost of translating virtual addresses to physical addresses. In fact, address translation can now be up to 80% of the total execution time of some programs. Prior work has proposed to solve this problem by increasing the granularity of translations through a technique called “huge pages,” but this requires the OS to find large contiguous regions of physical memory for new mappings, which is frequently not possible due to memory fragmentation.
In Mosaic Pages, an interdisciplinary team of theoreticians, operating systems researchers, and computer architecture researchers propose a way to increase TLB cache hit rates without requiring the OS to find large contiguous memory regions, as is necessary with huge pages. The core idea in Mosaic Pages is to compress translations so that each TLB entry can hold several translations at once. Mosaic Pages compresses translations by restricting where each virtual address can be mapped in physical memory, so that only a few bits are needed to indicate which mapping was chosen for a given virtual address. The primary challenge with restricting virtual-to-physical mappings is that it may cause associativity conflicts: if n virtual pages all need to be mapped to n-1 physical locations, then one of those pages will have to be evicted from RAM, incurring swapping I/O. Mosaic Pages solves this by structuring the virtual-to-physical mappings using a new hashing scheme called Iceberg hashing, which ensures that associativity conflicts will be rare. Iceberg hashing was developed for Mosaic Pages but is of independent interest and is described in a forthcoming SIGMOD paper.
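The mapping-restriction idea can be sketched as follows (our own simplification using a plain hash, not the paper's Iceberg-hashing construction): each virtual page is allowed to live in only K candidate physical frames, so a translation can be stored as a log2(K)-bit index instead of a full frame number.

```python
# Illustrative sketch of translation compression via restricted mappings.

import hashlib

K = 4                 # candidate physical frames per virtual page
NUM_FRAMES = 1024

def candidates(vpn):
    """The K physical frames where virtual page `vpn` is allowed to live."""
    out = []
    for i in range(K):
        h = hashlib.sha256(f"{vpn}:{i}".encode()).digest()
        out.append(int.from_bytes(h[:4], "big") % NUM_FRAMES)
    return out

def compress(vpn, frame):
    """Store only the index of the chosen candidate: log2(4) = 2 bits."""
    return candidates(vpn).index(frame)

def decompress(vpn, index):
    """Recover the physical frame by rehashing the candidates."""
    return candidates(vpn)[index]

vpn = 42
frame = candidates(vpn)[2]   # suppose the OS picked candidate #2
idx = compress(vpn, frame)   # a 2-bit value instead of a 10-bit frame number
assert decompress(vpn, idx) == frame
```

Because the index is so small, several such compressed translations fit in the space of one conventional TLB entry, which is how Mosaic Pages extends TLB reach; the hard part, handled by Iceberg hashing in the paper, is choosing the candidate sets so that pages rarely fight over the same frames.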
This paper shows experimentally that (1) Mosaic Pages can substantially reduce address translation costs across a wide range of benchmarks, (2) Mosaic Pages TLBs can translate addresses fast enough to keep up with modern CPUs, and (3) Mosaic Pages does not increase swapping. Thus, Mosaic Pages is a promising direction for future TLB designs.
We hope the above summaries pique your interest in VMware’s research, spanning computer architecture and operating systems, among many other areas that VRG covers. You can find out more about the VMware Research Group and the topics we’re exploring here.
*UPDATE: During ASPLOS ’23, “Mosaic Pages: Big TLB Reach with Small Pages” won a Distinguished Paper Award, and “Copy-on-Pin: The Missing Piece for Correct Copy-on-Write” received a Distinguished Artifact Award. Check out the full list of winning papers here.
Copy-on-Pin: The Missing Piece for Correct Copy-on-Write. Authors: David Hildenbrand, Martin Schulz (Technical Univ. of Munich); Nadav Amit (VMware Research Group). https://dl.acm.org/doi/10.1145/3575693.3575716
uBFT: Microsecond-scale BFT using Disaggregated Memory. Authors: Marcos K. Aguilera, Naama Ben-David (VMware Research); Rachid Guerraoui, Antoine Murat, Athanasios Xygkis (EPFL); Igor Zablotchi (Massachusetts Inst. of Technology). https://dl.acm.org/doi/10.1145/3575693.3575732
Mosaic Pages: Big TLB Reach with Small Pages. Authors: Krishnan Gosakan (Rutgers), Jaehyun Han (UNC), William Kuszmaul (MIT), Ibrahim N. Mubarek (CMU), Nirjhar Mukherjee (CMU), Karthik Sriram (Yale), Guido Tagliavini (Rutgers), Evan West (Stony Brook University), Michael A. Bender (Stony Brook University), Abhishek Bhattacharjee (Yale), Alex Conway (VMware Research), Martin Farach-Colton (Rutgers), Jayneel Gandhi (Meta), Rob Johnson (VMware Research), Sudarsun Kannan (Rutgers), Donald E. Porter (UNC). https://dl.acm.org/doi/10.1145/3575693.3575716