Cluster managers, such as Kubernetes, Borg, and Omega, are essential components of many modern mission-critical IT systems. However, the individual management controllers they rely upon to function face reliability issues that can lead to data loss, security vulnerabilities, and resource leaks. This underscores the importance of controller reliability tests, which are typically controller-specific and require expert guidance in the form of formal specifications or carefully crafted test inputs.
In a paper presented at this week’s 2022 USENIX Symposium on Operating Systems Design and Implementation (OSDI), we describe Sieve, the first automatic and generalizable reliability-testing technique for cluster-management controllers. The paper was co-written with lead author Xudong Sun, a VMware Research intern and PhD student at the University of Illinois at Urbana-Champaign (UIUC), along with Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, and me from VMware and Wenqing Luo, Jiawei Tyler Gu, and Tianyin Xu from UIUC.
Sieve is controller-agnostic, doesn’t require formal specification of either the cluster manager or controller, and doesn’t need to be directed at a specific area of vulnerability. With only a manifest for building the controller image and a set of basic test workloads, Sieve can automatically and efficiently test controllers for otherwise hidden reliability issues. In an initial evaluation, we ran Sieve on 10 popular open-source Kubernetes controllers and found a total of 46 bugs. Since reporting them, we have confirmed 35 of these bugs. Notably, many were deep semantic bugs with potentially severe consequences for system reliability, data loss, and security. Sieve detected them all without expert guidance.
Exploiting the state-reconciliation principle
Cluster-management controllers are generally responsible for one specific function in their corresponding cluster manager. They rely on the state-reconciliation principle, where each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. Unfortunately, because each controller is a single component in a massively complex distributed system, it’s all but impossible to predict every situation to which any specific controller might be required to respond. That makes it extremely difficult to create controllers in which we can have total confidence, even when they are responsible for mission-critical functions. In turn, this amplifies the importance of regular reliability testing, which has typically been hard to direct, generalize, or automate.
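To make the state-reconciliation principle concrete, here is a minimal Python sketch of a reconciliation loop. It is a toy model, not code from Kubernetes or any real controller: the ClusterState class, the pod-naming scheme, and the reconcile, apply, and converge helpers are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    # Hypothetical stand-in for the cluster state: pod name -> phase.
    pods: dict = field(default_factory=dict)


def reconcile(state, desired_replicas, prefix="app"):
    """One pass: compare observed state to desired state, return corrective actions."""
    running = sorted(name for name in state.pods if name.startswith(prefix))
    actions = []
    if len(running) < desired_replicas:
        existing = set(running)
        i = 0
        # Fill the lowest unused names until the desired count is reached.
        while len(running) + len(actions) < desired_replicas:
            name = f"{prefix}-{i}"
            if name not in existing:
                actions.append(("create", name))
            i += 1
    elif len(running) > desired_replicas:
        for name in running[desired_replicas:]:
            actions.append(("delete", name))
    return actions


def apply(state, actions):
    """Apply corrective actions to the (toy) cluster state."""
    for op, name in actions:
        if op == "create":
            state.pods[name] = "Running"
        else:
            state.pods.pop(name, None)


def converge(state, desired_replicas, max_rounds=10):
    """Repeat reconcile/apply until no corrective action is needed."""
    for rounds in range(max_rounds):
        actions = reconcile(state, desired_replicas)
        if not actions:
            return rounds
        apply(state, actions)
    raise RuntimeError("did not converge")
```

The controller never assumes how the cluster got into its current state; it only observes and corrects. If a pod disappears between passes, the next pass recreates it. The hard part in practice is everything this toy leaves out: concurrent writers, delayed or missing notifications, and crashes partway through a pass.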
Sieve’s design is powered by a fundamental observation about state-reconciliation systems: they rely on relatively simple and highly transparent state-centric interfaces between the controllers and the core cluster manager. These interfaces perform semantically simple operations on the cluster state (e.g., reads and writes) and deliver notifications about cluster-state changes. Their simplicity and transparency allow us to build a single tool capable of autonomously testing many controllers, and automatically detecting a wide range of bugs, without needing to know what the controllers are doing.
Here’s how it works: we run a set of test workloads and trace the resulting activity at these interface boundaries, and subsequently identify promising locations for deliberately injecting a single fault into the run. Sieve then runs the same test workload again, this time with the fault strategically injected into the execution. When that injection creates a different resulting trace, we have a strong indicator of both the existence and likely location of a potential bug, without needing any semantic information about the workload we’re tracing.
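The trace, inject, and compare cycle can be sketched in a few lines of Python. This is a toy under stated assumptions, not Sieve's implementation: the two controllers, the event names, and the choice of fault (hiding one notification from the controller) are hypothetical, and the comparison is simplified to the end state implied by each trace of writes.

```python
def end_state(trace):
    """Replay the writes recorded in a trace to get the volumes left behind."""
    volumes = set()
    for op, name in trace:
        if op == "create_volume":
            volumes.add(name)
        else:
            volumes.discard(name)
    return volumes


def edge_controller(kind, name, read_pods, trace):
    # Hypothetical controller that trusts every notification to arrive.
    if kind == "pod_added":
        trace.append(("create_volume", name))
    elif kind == "pod_deleted":
        trace.append(("delete_volume", name))


def level_controller(kind, name, read_pods, trace):
    # Hypothetical controller that re-reads the cluster state on every wake-up.
    wanted, have = read_pods(), end_state(trace)
    for volume in sorted(wanted - have):
        trace.append(("create_volume", volume))
    for volume in sorted(have - wanted):
        trace.append(("delete_volume", volume))


def run(controller, events, drop_index=None):
    """Run a workload and return the trace of the controller's writes."""
    pods, trace = set(), []
    for i, (kind, name) in enumerate(events):
        # The cluster state always changes...
        if kind == "pod_added":
            pods.add(name)
        elif kind == "pod_deleted":
            pods.discard(name)
        # ...but the injected fault hides one notification from the controller.
        if i != drop_index:
            controller(kind, name, lambda: set(pods), trace)
    # A final wake-up with no specific event, as a periodic resync would give.
    controller("resync", None, lambda: set(pods), trace)
    return trace


def divergent_injections(controller, events):
    """Inject one fault per run; report the injection points that change the outcome."""
    reference = end_state(run(controller, events))
    return [
        i for i in range(len(events))
        if end_state(run(controller, events, drop_index=i)) != reference
    ]
```

For the workload `[("pod_added", "a"), ("pod_added", "b"), ("pod_deleted", "a")]`, the edge-style controller diverges at injection points 1 and 2 (a volume that is never created and a volume that leaks), while the level-style controller shows no divergence. Nothing in the harness knows what a volume is; it only compares runs.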
As such, Sieve works without needing to formally specify the controller or the cluster manager, hypothesize where in the code bugs may lie, or use highly specialized test inputs. Nor does it rely on expert-written assertions. All it needs is a manifest for building the controller image and basic test workloads. After that, Sieve’s testing is fully automated and reproducible. This degree of usability is key to making reliability testing broadly accessible to the rapidly increasing number of custom controllers.
We evaluated Sieve on 10 popular controllers from the Kubernetes ecosystem for managing widely used cloud systems, including Cassandra, Elasticsearch, and MongoDB, employing between two and five basic test workloads for each controller. It took us an average of three hours to apply Sieve to each controller, although much of that time was spent understanding how to build the controller. In that evaluation, Sieve found new bugs in every single controller, for a total of 46 safety and liveness bugs (35 already confirmed and 22 fixed), with a low false-positive rate of 3.5%.
It’s worth re-emphasizing that these bugs had severe potential consequences, ranging from application outages to security vulnerabilities, resource leaks, and data loss.
A new tool for controller development
While it was important for us to be able to detect previously unidentified bugs in existing and widely deployed controllers (and while we envision Sieve being used to regularly test the reliability of controllers already in operation), our wider goal was to aid the controller-development process, increasing reliability as controllers are written and before they start running critical workloads.
To that end, we’ve made Sieve’s code and test workloads publicly available (see https://github.com/sieve-project/sieve), along with instructions for how to reproduce all of the bugs we discovered.
This work began as a research project devised with Xudong when he first interned at VMware Research in 2020. It’s a great example of the kind of substantive work that can spring from VMware’s academic/industry collaborations. I’m also delighted that Xudong is interning with us again this summer, looking to extend this work by making Sieve easier and faster to use and more easily integrated into development pipelines.
VMware Research is always interested in projects that explore new classes of systems in promising areas that are both challenging to tackle and haven’t yet received as much attention as they deserve. While our focus here has been on controllers that work with Kubernetes, we’re also thinking about how we can use a similar methodology in the context of other modern state-centric interfaces, and how what we have been learning can be used to improve the reliability of other kinds of applications.
If you’d like to know more about the research project that led to Sieve, check out Xudong’s and my presentation to the 2021 North America KubeCon on “Automated, Distributed Systems testing for Kubernetes Controllers.” I also had a great conversation on the topic (and how it fits into our broader research interests) with Sudesh Girdhari of VMware’s CloudStream YouTube channel.