SLOs – The Emerging Universal Language in the Enterprise
To remain relevant in an information-driven society, enterprises across the globe undergo digital transformation projects to improve their agility and velocity. These changes affect people, processes and the technology stack. Application architectures are becoming inherently distributed, leveraging modern cloud-native design principles and hybrid cloud deployments. Cross-team collaboration and breaking down communication silos are mandatory to realize the benefits of digital transformation. Flexible contracts and an easy-to-understand common language are required to address these concerns and set expectations right.
Service level objectives (SLOs), well established in the world of cloud and service providers, have the potential to become this new language. SLOs represent key metrics on which the different actors, or more generally service consumers and producers, agree. However, they are not yet considered a first-class concept in the enterprise. We predict that over the next few years, SLOs will see significant adoption across the enterprise to tame complexity and reduce misunderstandings. Ultimately, SLOs will become the needed universal language (lingua franca) within and between businesses.
How do we decide whether a product is of good or bad quality? It turns out that this is a very subjective decision-making process, involving aspects such as look and feel, price, durability, usability, maintenance, personal preferences and experiences, etc. Thus, it’s very likely that consumers of a product or service will not come to the same conclusion about its quality.
This presents a challenge for the producer (product) or service provider, unless everyone agrees on the same definition of “good quality”. Let’s use a car manufacturer as an example. A car is a complex technical masterpiece made up of millions of different parts that wear out over time and eventually will fail. To keep customers happy in a highly competitive market, it is imperative for a car producer to deliver as fast as possible, be efficient (i.e. reduce cost and waste) and minimize the risk of failures. One approach to achieving these contradictory goals, popularized by Toyota, is generally known as lean manufacturing. Later, these principles paved their way into the enterprise, also known as the lean enterprise.
An important aspect behind the “lean” movement is a clear understanding of what value and quality the customer wants for a product or service. Streamlined processes, shortened feedback loops and continuous improvement aim to minimize risk, waste and overproduction on the producer side, directly translating into increased value and quality for its consumers.
Contracts Between Producers and Consumers
To set customer expectations right, a producer publishes parameters and thresholds within which the product is guaranteed to perform correctly. Continuing with the car example above, these could be ambient temperature, pressure, maximum speed, emissions, fuel consumption, etc. Within an industry, these metrics are typically standardized globally and legally binding. Eventually, they become part of the contract between producer and consumer.
How does this relate to information technology (IT), one might ask? With the rise of software-as-a-service (SaaS), and cloud computing in general, customer expectations and demands have never been higher. The velocity to deliver new features and bug fixes as quickly as possible, 24×7 availability and global scale are key to succeeding in extremely competitive markets where the winner often takes it all.
However, purely focusing on shipping new features is shortsighted, because every new feature introduces change, uncertainty and ultimately risk to the service (or platform) itself. This could potentially impact performance and availability. Blindly throwing resources at the problem won’t help either. For every business these resources are finite and put pressure on the margin which feeds innovation. You’ll quickly find yourself in a dilemma, trying to achieve contradictory goals in order not to fail in the market.
Introducing Service Level Objectives
Pioneers in the public cloud space, such as Amazon Web Services and Google, as well as early adopters of private cloud infrastructures were in the same situation. How should they balance cost, risk and continuous improvement, both internally between organizational boundaries and externally to their customers? Directly related to this is the question of how to avoid misinterpretations or misunderstandings, for example around the performance or reliability of a specific service. A lack of common means for communication quickly leads to confusion and frustration. The lines of responsibility become blurry, and blaming becomes the status quo.
Service providers found a way out. A key element in their highly dynamic and complex IT landscapes are service level objectives (SLOs). In simple terms, a service level objective is a contract between a service provider and consumer. It defines the quality of service the consumer can expect (or demand) and against which the provider is bound to deliver and operate, while also reducing misunderstandings through transparency.
A SLO should be carefully selected by the service owner and represent a meaningful property or quality attribute of this particular service to the consumer. For example:
- “99.99% availability (successfully processed requests) measured over a 30-day period”
- “99% of GET requests over a 5-min window will return within 200ms”
Internally to the service owner, developers and operations teams, the SLO is tied to key service metrics, i.e. service level indicators (SLIs), to track SLO compliance. When a SLO is legally binding between two parties, it becomes a service level agreement (SLA). Our VMware Wavefront colleagues recently published a good blog post on these terms for further reading.
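To make the relationship between SLIs and SLOs concrete, here is a minimal sketch of computing an availability SLI from request counts and checking it against an SLO target. All names and numbers are illustrative, not part of any specific product:

```python
# Minimal sketch: deriving an availability SLI from request counts and
# checking it against an SLO target. All numbers are illustrative.

def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of successfully processed requests."""
    if total == 0:
        return 1.0  # no traffic means no observed failures
    return successful / total

SLO_TARGET = 0.9999  # "99.99% availability measured over a 30-day period"

sli = availability_sli(successful=999_950, total=1_000_000)
print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}")  # SLO met: True
```

In practice the counts would come from a monitoring system aggregating over the SLO’s measurement window, but the comparison stays this simple.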
If you operate in the public cloud or in the managed services space, it is almost a requirement to live and breathe SLOs. But this should not be seen as a burden or added complexity, because SLOs bring many benefits to providers and consumers alike. Let’s summarize the main advantages before we discuss the benefits for different teams within an enterprise:
- Clear and easy to understand metrics to set service consumer expectations right (internally/externally)
- Denote shared responsibility between teams, e.g. service owner, developers and operations
- Flexibility in which metric (SLI) a SLO can represent, e.g. business-centric, user-facing or platform-specific
- Transparency for top line (revenue)/bottom line (margin) effects related to a SLO (discussed below)
- Incorporates safety margins, also known as “error budget” (discussed below)
- Might optionally enforce penalties (SLA)
Benefits for the Enterprise Organization
When we see enterprises adopting SLOs, it often becomes a purely technical discussion between development and operations teams. This is a first step, but it leaves behind the main benefits these metrics can have for the overall business.
Information technology is the enabler to drive and increase economic value in modern businesses. Thus, SLOs should be defined with business metrics in mind, relevant to their stakeholders or customers. The immediate benefit is transparency, i.e. top line (revenue) and bottom line (margin) figures can be linked to them. This allows us to answer important questions. Are customers having a good experience with the service, or are we impacting revenue? How do we compare to the market? Can we reduce the footprint (cost) of the underlying resources, e.g. with code optimizations or outsourcing (SaaS), without impacting service quality?
Many business and related enterprise IT concerns (perspectives) can be put in context by sharing a common vocabulary, expressed via a SLO. It also helps the different IT teams to select and link their internal team-specific metrics to a shared goal, avoiding metrics fragmentation.
The developers who are responsible for translating business requirements into code, be it new features or bug fixes, will have their own metrics defined to continuously identify and improve software quality issues. For example, specific page load times, caching behavior, transaction times, network call delays, etc. By correlating these metrics with a business-relevant SLO, such as new customer registrations, successful conversions or active time spent on the platform, individual development teams or service managers can make educated decisions. These insights are especially important when it comes to (re)prioritizing new features vs. stability improvements in the continuous integration and delivery (CI/CD) pipeline. The section on “error budgets” will discuss this further.
For IT Operations, the shift to embracing SLOs goes along with an organizational change to support a high-velocity software release lifecycle. Traditionally, “IT operations” used to be an umbrella term, covering multiple distinct and often isolated teams. For example, infrastructure, middleware, operating systems, legacy (Mainframe), etc. A model that worked well in times of “big bang” waterfall release processes where continuous change was the exception rather than the norm. These days, continuous delivery and shared responsibility, i.e. embracing DevOps principles, require a different operations mindset though.
Instead of siloed server, storage and network teams, you’ll find cross-functional teams for specific infrastructure-related concerns such as workload orchestration (VMs, containers), monitoring, databases, logging, etc. Operations teams become platform teams and thus service owners, empowering their internal customers, e.g. developers, with self-service APIs rather than cumbersome and manual ticket processing. Expressing and measuring service quality in terms of SLOs becomes mandatory to achieve transparency, as these platform services become critical dependencies for other lines of business that build applications and services on top.
Similar to how developers can relate their specific metrics to a SLO, platform teams follow the same principle. For example, a workload orchestration platform team, leveraging VMware vSphere technology, can answer whether a high vCPU ready time or a vSphere HA initiated restart of a virtual machine had a measurable impact on a particular business transaction. It would be impossible to draw such conclusions across teams without a shared understanding and agreement, i.e. common service level objectives.
Making Dependencies Explicit
The number of dependencies grows with the increasing interaction between existing and new services, the latter including internal platforms, in-house developed microservices or services consumed via a software-as-a-service (SaaS) subscription model. SLOs are a perfect fit here, as they force us to reason about these dependencies. SLOs make dependencies explicit.
In this regard, Skyscanner, a travel company, recently spoke about the success they had implementing SLOs across the organization:
“It’s a way of us being clear about what to expect inter-service. If Squads (editor’s note: teams) are building services […], Squads will be reasonably autonomous, so they may build services separately. If one service is depending on another, it’s a good way of defining that relationship and what to expect from one another.
We’re seeing really good behaviors from that, and it’s taught us a tremendous amount about our dependencies. For the first time in a long while, we’ve actually looked at what we’re dependent on in order to make our SLIs and SLOs and we’ve found (editor’s note: discovered and resolved) circular dependencies.”
To further explore this, the following diagram shows an application with an associated SLO (blue) for its consumers. Let us assume this is a booking application composed of multiple (micro-) services depicted in grey. For data persistence it consumes a remote service, say a database, from the database platform team with an associated SLO (orange).
Immediately, this makes dependencies between services and their individual consumers explicit. Note that the users of the booking application (blue) do not care about internal (grey) or transitive (orange) dependencies. All they need to know and rely on is the SLO associated with the booking application.
Such a dependency analysis allows the individual service teams to correctly calculate and derive their own SLO metrics. For example, let us define the database with an availability SLO of 99%. If it is a critical dependency of service blue, i.e. can render it unusable, the blue SLO must not be defined with an availability target beyond 99%. Obviously, all these services need to consume infrastructure resources (not explicitly shown), e.g. compute, network and storage, with their own availability characteristics. These resources could be provided by a distinct platform team responsible for providing virtual machines and containers via a common API (VMware Project Pacific), backed with a minimum availability SLO of 99% or higher for this scenario.
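The reasoning above can be sketched numerically. Assuming critical (serial) dependencies fail independently, a service's achievable availability is bounded by the product of its dependencies' availabilities; the figures below are illustrative, not tied to any real deployment:

```python
# Sketch: a service's availability is bounded by its critical (serial)
# dependencies. Assumes independent failures; all figures are illustrative.

def composed_availability(*availabilities: float) -> float:
    """Upper bound when every listed dependency must be up for the service to work."""
    bound = 1.0
    for a in availabilities:
        bound *= a
    return bound

database = 0.99          # "orange" database platform SLO
infrastructure = 0.999   # assumed infrastructure availability SLO
bound = composed_availability(database, infrastructure)
print(f"Achievable availability bound: {bound:.3%}")  # below the database's 99%
```

This is why a team should never promise an availability target higher than that of any critical dependency, unless it adds redundancy or graceful degradation to break the serial chain.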
Service owners are free to choose which metric(s) best reflect the quality of their service. As a general rule, they should be clearly defined and documented, meaningful to their users and of course achievable. A service typically defines only one or a small number of SLOs, often related to performance (response time, throughput) and availability. In the case of the database (orange), its API endpoint could, in addition to availability, be defined with a write delay of 200ms at the 95th percentile (P95). The booking application developers can now decide whether they need to incorporate additional techniques, such as caching and graceful degradation, to account for failure and slow writes.
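As an illustration of checking such a latency SLO, the following sketch computes a nearest-rank 95th percentile (one of several common percentile definitions) over made-up write-delay samples:

```python
import math

# Sketch: checking a latency SLO ("writes complete within 200 ms at P95")
# against observed samples, using the nearest-rank percentile definition.
# The sample delays below are made up for illustration.

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: smallest value covering at least p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

write_delays_ms = [110, 115, 120, 125, 130, 135, 140, 145, 150, 155,
                   160, 165, 170, 175, 180, 185, 190, 192, 195, 900]
p95 = percentile(write_delays_ms, 95)
print(f"P95 write delay: {p95} ms, SLO met: {p95 <= 200}")  # 195 ms, SLO met: True
```

Note how the single 900ms outlier does not violate the P95 objective: percentile-based SLOs deliberately tolerate a small fraction of slow requests.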
Error Budgets to Manage Uncertainty and Risk
Another major benefit SLOs can bring is related to continuous integration and delivery (CI/CD), the capability to innovate and ship at an ever-increasing pace. There is no such thing as perfect software, thus every change incorporates a certain amount of uncertainty and risk that must be accounted for in the release and decision-making process.
When business, development and operations (platform) teams are aligned to work against a shared goal, it becomes much simpler to agree on whether new product features (which increase risk) or critical bug fixes (which reduce risk) should be prioritized. This can be decided based on the associated error budget a SLO inherently incorporates. Error budgets help balance velocity and stability holistically throughout the software development lifecycle.
For example, considering a SLO of 99.99% service availability over 30 days, the error budget is 0.01%, i.e. roughly 4.3 minutes. Whenever the service is not available (according to its definition of “service availability”), this counts against the error budget. Usually, if the service is close to or has fully drained its error budget, no new features are rolled out to production. The focus for all teams is purely on getting the service back to normal, i.e. bug fixing, until the situation is resolved and the business and its customers are no longer impacted.
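The arithmetic above can be sketched as follows; the recorded downtime figure is illustrative:

```python
# Sketch: turning an availability SLO into an error budget over a rolling
# window, and deciding whether to freeze feature releases.

def error_budget_minutes(slo: float, window_days: float) -> float:
    """Total allowed downtime, in minutes, for the given SLO and window."""
    return (1.0 - slo) * window_days * 24 * 60

budget = error_budget_minutes(slo=0.9999, window_days=30)  # ~4.3 minutes
downtime = 3.5                                             # minutes consumed so far (illustrative)
remaining = budget - downtime
print(f"Budget: {budget:.2f} min, remaining: {remaining:.2f} min, "
      f"feature freeze: {remaining <= 0}")
```

A real policy would typically act before the budget is fully drained, e.g. slowing the release cadence once a certain fraction is consumed, but the underlying calculation is exactly this simple.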
Service and platform teams should always include some buffer (error budget) in their SLOs, i.e. never aim for 100%. This has two immediate benefits. First, it forces the consumer to reason about the “error” case, i.e. when the dependent service is unavailable or running slowly and thus draining its error budget. Second, it gives the service owner more control over product enhancements, maintenance and cost efficiency. Without a safety margin, the costs of maintaining such a system quickly become prohibitive. If in doubt, look at some key service metrics (SLIs) and error rates to derive a minimal SLO consumers would still accept. Learn from this initial experience and improve the SLO over time, if required.
Innovation at VMware
The race for customers with continuous innovation, faster time to market and “API-first” thinking will continue to change the structures and boundaries within the enterprise. Small, autonomous and service-oriented teams, empowered to decide and act as fast as possible, are required to deliver on these promises. This will drive heterogeneity and ultimately increase complexity within the enterprise organization, infrastructure and software landscape.
We believe that SLOs will become the universal language in modern businesses to manage complexity, reduce misunderstandings and, more importantly, directly measure business impact. As IT organizations evolve into embracing a service-first mentality, applications and services spanning public and private clouds will be the norm.
Existing workloads and modern cloud-native applications will be meshed together, also known as hybrid cloud architectures. For this very reason, the Hybrid Cloud Runtime (HCR) has been defined to create a connective tissue between private and public cloud, capable of delivering service mesh and application optimization functionality at runtime. Refer to the OCTO Application Platforms Positioning paper for further information.
At the recent VMworld US Day 2 Keynote, in close collaboration between VMware OCTO and the VMware NSX Service Mesh team, an early prototype demonstrating the benefits of this SLO-first mentality was shown. You can watch the keynote here (the link jumps directly to the demo part).
In summary, to fully realize the overall vision, we first introduce a new persona referred to as the Application Platforms Architect (APA), second the SLO as a common language, and third a runtime like the HCR that can interpret SLOs, as shown in the diagram below. We are also actively working on further blog posts and exciting demos on this topic. Stay tuned.
Michael Gasch is an Application Platform Architect and Distributed Systems Engineer in the Office of the CTO at VMware. He works closely with our customers and VMware R&D to advance all things Kubernetes on the VMware Software-Defined Data Center.
Emad Benjamin is currently the Sr. Director and Chief Technologist of Application Platforms with Office of the CTO at VMware, focusing on building hybrid cloud distributed runtimes that are application-aware.