As distributed systems and network communications increase in speed, our need for highly precise time measurement has grown increasingly urgent. Whereas systems in the past could tolerate clock errors measured in seconds, modern systems are beginning to require clock errors no larger than microseconds, or even nanoseconds. Similar accuracy is needed to identify latency bottlenecks in end-host stacks as end-host latencies approach the microsecond range.
Until now, it has been assumed that computer clocks are unreliable and therefore require frequent synchronization. System builders, especially at hyperscale, have typically resorted to building bespoke, expensive time-measurement solutions to compensate for these perceived limitations. Our research shows that, with a few corrections, computer clocks are in fact reliable and that frequent synchronization is unnecessary.
Local clocks are more reliable than we think
One research paper we presented at the USENIX Symposium on Networked Systems Design and Implementation (NSDI) — co-written by Michael and Ali Najafi, who is now at Meta — demonstrates, in a principled manner, how computer clocks are built and how they differ from each other. We have been able to show that they are far more useful than previously assumed: local clocks can be characterized precisely enough to build basic distributed-systems primitives on top of them, and information from individual clocks can be combined to create an accurate processing timeline. These findings make it possible to run a low-overhead diagnostic tool that can identify the source of latencies within an end-host stack.
The paper (which, we are pleased to share, won “Best Paper” at the symposium) describes significant improvements in computer time measurement that promise new opportunities for scalable sub-microsecond clock synchronization and an order-of-magnitude reduction of end-host latencies. Our teams achieved these results using commodity hardware and software, demonstrating opportunities for increased system performance, without requiring purpose-built, high-cost interventions. This research was inspired by the growing need for faster basic system primitives, such as locks, consistency control, and fault tolerance.
The paper, “Graham: Synchronizing Clocks by Leveraging Local Clock Properties,” describes a new system for accurately characterizing the local clock on a computer, allowing it to be reliably synchronized with remote clocks at the sub-microsecond level — without the use of specialized hardware.
We posited that while any specific clock will drift, that drift can be reliably characterized. Once we had that characterization available to us, we were able to reduce the rate of synchronization, tolerate synchronization failures, and reduce network congestion and synchronization overhead — all while avoiding the use of specialized hardware.
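To make that reasoning concrete, here is a back-of-the-envelope sketch of why characterizing drift lets a system synchronize less often. The numbers are illustrative choices of our own, not figures from the paper:

```python
# Back-of-the-envelope sketch: how well drift is characterized determines
# how often a clock must be synchronized. Numbers are illustrative only.

def max_sync_interval(error_budget_s, drift_uncertainty_ppm):
    """Longest interval between synchronizations that keeps the worst-case
    accumulated clock error within the error budget."""
    drift_uncertainty = drift_uncertainty_ppm * 1e-6  # seconds of error per second
    return error_budget_s / drift_uncertainty

budget = 1e-6  # tolerate at most 1 microsecond of accumulated error

# Uncharacterized oscillator: drift could be anywhere within +/- 10 ppm.
print(max_sync_interval(budget, 10))    # 0.1 s -> must sync ~10 times per second

# After characterization, only the residual uncertainty matters, say +/- 0.1 ppm.
print(max_sync_interval(budget, 0.1))   # 10 s -> roughly 100x fewer synchronizations
```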
Graham builds on insights from physics and radio engineering about the predictable impact of temperature on processor operations, including clock performance. Since all modern commodity servers incorporate multiple temperature sensors, we were able to use this temperature data to characterize a local clock’s performance against an accurate reference clock, such as a global positioning system (GPS) receiver. Graham takes this characterization and builds a synchronization model that determines how frequently the system must be synchronized and how many synchronization failures are tolerable. The result is a 10x to 100x improvement in synchronization, in line with the performance achieved by top-of-the-line specialized synchronization hardware.
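The sketch below shows the kind of characterization described here, heavily simplified. It is our own illustration, not Graham's actual model or code: fit the clock's drift rate as a function of the temperature reported by the server's sensors, using drift measured against a reference clock, then use the fit to compensate the local clock between synchronizations.

```python
import numpy as np

# Hypothetical calibration data: temperature readings from the server's sensors
# and the drift rate measured against a reference clock (e.g., GPS) at each
# reading, in parts per million.
temps_c = np.array([35.0, 40.0, 45.0, 50.0, 55.0, 60.0])
drift_ppm = np.array([1.8, 2.1, 2.5, 2.8, 3.2, 3.5])

# Fit a simple linear model drift(T) = a*T + b. Real oscillator-vs-temperature
# curves are closer to polynomial; a line keeps the sketch short.
a, b = np.polyfit(temps_c, drift_ppm, deg=1)

def predicted_drift_ppm(temp_c):
    """Expected drift rate at the current temperature, from the fitted model."""
    return a * temp_c + b

def corrected_time(raw_clock_s, elapsed_s, temp_c):
    """Compensate the local clock reading using the predicted drift."""
    return raw_clock_s - elapsed_s * predicted_drift_ppm(temp_c) * 1e-6

# After correction, only the residual error of the fit accumulates between
# synchronizations, which is what allows far less frequent synchronization.
residual_ppm = np.max(np.abs(drift_ppm - predicted_drift_ppm(temps_c)))
print(f"residual drift uncertainty: {residual_ppm:.3f} ppm")
```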
Revealing sources of latency in end hosts
The second paper we presented, “How to diagnose nanosecond network latencies in rich end-host stacks,” was co-written by Radhika with Lalith Suresh, Gerd Zellweger, Bo Gan, Timothy Merrifield, and Sujata Banerjee of VMware, and Roni Haecki and Timothy Roscoe of ETH Zurich. It introduces NSight, the first latency-diagnosis tool that can both confirm and identify the sources of network latency within end-host stacks. These latencies can significantly impact application performance.
At the heart of what inspired Radhika and her colleagues’ NSDI paper was a mystery: what causes high network latencies within end-host stacks and how can we identify where they are coming from? While today’s networking packets arrive at nanosecond speeds, as soon as they enter the end host, they are processed in a “best-effort” way that typically operates far more slowly.
Numerous existing end-host profilers try to identify the sources of these slowdowns. But they fail to capture latency deviations caused by aspects of the end-host stack that are critical to network performance — including the interface between the network interface card (NIC) and the host, head-of-line blocking, and interference. Their overheads are also so high that they are too heavyweight to apply to the entire end-host stack at once. As a result, these tools have mostly been used to confirm hypotheses rather than to diagnose latency problems.
We figured that we could identify the causes of latency in the end host if we could first construct a highly accurate timeline of system events that impact messages in different parts of the end-host stack. But how? Knowing that every component and subsystem in the end host has its own clock, we looked for a way to reconcile and then combine the information that these clocks were separately timestamping. The solution, we found, lay in stitching together two pieces of profiling that hadn’t been stitched together before: CPU profiling, which profiles the cores within the end host, and message profiling, which precisely profiles packets from the point where they enter the end-host NIC to when they are processed by applications running on end-host cores (and vice versa). We could use these to timestamp message paths in a way that gave us a view into the progression of messages through end-host stacks with nanosecond granularity, which was previously impossible. We could also compare them to identify anomalous processing paths that lead to latency deviations.
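The sketch below illustrates the stitching idea as we have described it. The event names, clock names, and offsets are hypothetical and the fixed per-clock offsets are a simplification (in practice the offset and rate between clocks must themselves be estimated); it is not NSight's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Event:
    msg_id: int      # which message this event belongs to
    stage: str       # e.g., "nic_rx", "irq", "app_recv"
    clock: str       # which local clock produced the timestamp
    raw_ts_ns: int   # timestamp in that clock's own timebase

# Per-clock translation into a single reference timeline (hypothetical offsets).
CLOCK_OFFSET_NS = {"nic": 1_250, "tsc_core0": 0, "tsc_core3": 8}

def to_reference_ns(ev: Event) -> int:
    """Translate a component's timestamp onto the shared reference timeline."""
    return ev.raw_ts_ns + CLOCK_OFFSET_NS[ev.clock]

def message_timeline(events, msg_id):
    """Ordered (stage, reference-time) pairs for one message."""
    evs = [e for e in events if e.msg_id == msg_id]
    return sorted(((e.stage, to_reference_ns(e)) for e in evs), key=lambda x: x[1])

def stage_latencies(timeline):
    """Nanoseconds spent between consecutive stages of one message."""
    return {f"{a[0]}->{b[0]}": b[1] - a[1] for a, b in zip(timeline, timeline[1:])}

# Example: one received message observed by the NIC clock and two core clocks.
events = [
    Event(7, "nic_rx", "nic", 100_000),
    Event(7, "irq", "tsc_core0", 102_400),
    Event(7, "app_recv", "tsc_core3", 161_000),
]
print(stage_latencies(message_timeline(events, msg_id=7)))
# {'nic_rx->irq': 1150, 'irq->app_recv': 58608}
```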
Based on this understanding, we built NSight — a diagnosis tool that can identify the source of latencies introduced within an end host, without imposing high overhead. The potential impact on latency reduction is significant. For example, we were able to systematically identify and remove performance overheads in memcached, reducing 99.9th percentile latency by a factor of 40, from 2.2 ms to 51 μs.
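To illustrate how such a per-stage timeline can point at the source of a latency deviation, here is one simple way to do it (again a hypothetical sketch with made-up numbers, not NSight's algorithm): compare a slow message's per-stage latencies against the typical path and flag the stage with the largest excess.

```python
from statistics import median

def find_latency_culprit(all_stage_latencies, slow_msg_latencies):
    """Compare one slow message's per-stage latencies (stage -> ns) against the
    per-stage medians across many messages; return the stage contributing the
    largest excess latency."""
    medians = {
        stage: median(m[stage] for m in all_stage_latencies)
        for stage in slow_msg_latencies
    }
    excess = {s: slow_msg_latencies[s] - medians[s] for s in slow_msg_latencies}
    return max(excess, key=excess.get)

# Hypothetical per-stage latencies (ns) for typical messages and one outlier:
typical = [{"nic_rx->irq": 900, "irq->app_recv": 2_000},
           {"nic_rx->irq": 950, "irq->app_recv": 2_100}]
slow = {"nic_rx->irq": 1_000, "irq->app_recv": 160_000}
print(find_latency_culprit(typical, slow))  # "irq->app_recv"
```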
Next steps
At present, NSight is a prototype that works on any Intel system running Linux or on VMA stacks but is easily transferable to other kinds of systems. We plan, for example, to build a hardened prototype within VMware for ESX and will test it out on a variety of other network stacks.
We are also looking into how Graham can be leveraged in VMware’s virtualized stack. We want to leverage our understanding of local clocks to build new protocols on top of them for synchronization with adjacent nodes.
The two projects may converge. While our initial clock-synchronization research operated within the scope of a single machine, we may be able to extend it to identify clock errors across multiple distributed machines, allowing us to create accurate processing timelines and diagnose system latencies (among other useful tasks). Ultimately, the goal is to perform these functions across commodity distributed systems, something that has largely been assumed to be impossible without custom-built, prohibitively expensive, time-management solutions. Stay tuned as we update our research and share our findings.