From service mesh to cloud runtimes as a new emerging pattern.

We often meet with CIOs who are struggling to maintain desired Service Level Objectives (SLOs) for their users. Their systems suffer from lackluster reliability and have a complicated performance posture. When they ask their teams for an explanation of why systems are not meeting SLOs, it turns into a complicated and convoluted discussion of, “Yes, we have more knobs for that problem too.” But what CIOs really want are fewer knobs, and more intelligent systems that can observe application performance and replace the multitude of knobs with automation. Many want their systems to behave in accordance with the prescribed SLO, where an SLO is interpreted and adhered to by the system in an automated way.

Today, we are far from this ideal world in the enterprise. Many SLOs in the enterprise are defined in spreadsheets, lying dormant, with various manual processes to stitch them together in order to hopefully achieve the desired systems behavior. In many cases, organizations forget what the defined SLOs are because they are contained in a spreadsheet that is not actively maintained or referenced. This flawed approach creates an organizational disconnect and fails to encourage collaboration among teams.

In this blog series, we examine the CIO conundrum, look at how we can make multi-cloud platforms more reliable, leverage SLO-based mechanisms to tame multi-cloud services, look at what new abstractions are emerging, and offer a market perspective on what the cloud native community is dabbling in. This is not the first time our team has written about SLOs; you may also want to read SLOs – The Emerging Universal Language in the Enterprise and Application Platforms Position Paper.

The CIO Conundrum

In Figure-1 we depict the vicious cycle CIOs are caught in, where a missed SLO results in a response to overprovision hardware, which is necessary because there is a gap in knowledge regarding how to properly implement an application-level scalability solution. Upon closer inspection of this knowledge gap, we often find that it is due to various organizational silos that exist in the enterprise, and the fact that no particular silo owns the end-to-end SLO. The typical debate is: are the problems in the application code, application runtime, or platform/compute space? The solution could be in any of these three parts, but often the various organizations involved have varying perspectives on the solution. These varied perspectives widen the organizational silos and the gaps between them even further, which in turn makes it harder to reach a proper solution.

If CIOs were only overprovisioning cloud infrastructure, they would just have to explain to their CFO why such stable systems cost so much to run. However, these systems are both unreliable and more expensive to run, which complicates the conversation CIOs can have with their CFOs. Missed SLOs have a direct impact on the business, leading to poor customer experiences that complicate a CIO's life. There must be a better way!

 

Figure-1 CIOs stuck in a vicious cycle

What CIOs really need is continuous sharing of common information across the silos

In a recent conversation with a customer, they acknowledged that these silos complicate their system designs, implementations, and operations, and widen the knowledge gap between their teams. However, what is even more interesting is that the customer CIO highlighted how, when there is an outage, all teams come together to share information in real time, using data to drive immediate decisions. But once the outage is resolved, all the teams fall back to their silos.

This behavior indicates that organizational silos will continue to exist, as they provide a useful structure for getting work done. What is really critical, though, is the ability to share information across the silos: having systems that act as an overlay to provide just-in-time data for making decisions all the time, regardless of the organizational silos. That is the key. It is not that you really want to break down silos; after all, if you break down silos today and wait a few months, new silos will appear. What is important is having a free flow of common information between the organizational silos to get them to collaborate in real time. It is this shared, real-time information that can help drive better reliability across cloud platforms and services in the enterprise.

But what does it take to make the silos more efficient? This can mean many things: first it means a more continuous flow of information, less debate over which metric makes sense, and a drive towards better root cause analysis. CIOs are looking for their teams to learn from these root cause reports and build more capable systems. CIOs need a way to robotically automate their systems to interpret SLOs, measure and analyze metrics that affect the SLOs, and automatically act to achieve the SLOs. Ideally, via a unified system.
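The interpret, measure, and act loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a description of any particular product: the `LatencySLO` type, the `decide_replicas` function, the one-replica step size, and the 50% scale-down threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: the given latency percentile must stay under target_ms."""
    percentile: int   # e.g. 95 for p95
    target_ms: float


def observed_percentile(samples_ms, percentile):
    """Measure: estimate the requested latency percentile from raw samples."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    return quantiles(samples_ms, n=100)[percentile - 1]


def decide_replicas(slo, samples_ms, replicas, min_replicas=1, max_replicas=20):
    """Analyze and act: return the replica count the loop would request next."""
    observed = observed_percentile(samples_ms, slo.percentile)
    if observed > slo.target_ms:
        # SLO is being missed: add capacity, within the allowed ceiling.
        return min(replicas + 1, max_replicas)
    if observed < 0.5 * slo.target_ms:
        # Comfortable headroom (assumed threshold): release capacity.
        return max(replicas - 1, min_replicas)
    # Within the SLO without excess headroom: leave the system alone.
    return replicas


if __name__ == "__main__":
    slo = LatencySLO(percentile=95, target_ms=200.0)
    slow_window = [300.0] * 50  # made-up latency samples, in milliseconds
    print(decide_replicas(slo, slow_window, replicas=3))
```

The point of the sketch is that the SLO itself is the input to the system, rather than a spreadsheet entry that people translate into manual knob changes. A real controller would also need smoothing, cooldown periods, and a real metrics source.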

Just as robots in the auto industry use real-time information and encoded steps to improve the throughput of automobile manufacturing, CIOs expect their cloud application platforms to deliver application SLOs in a reliable, robotic fashion (refer to Figure-2). SLOs provide an understanding of what one system can reliably expect from another and, based on this, create a cascading system of reliability trust on which to build more complex features that will accelerate the business.

Figure 2: Robots rely on real-time information and the SLOs of other robots to complete a chain of automated work: a distributed system of trust gated by a cascade of SLOs that everyone relies on. Distributed cloud services with SLOs must behave in a similar manner.
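To make the cascade of SLOs concrete, here is a small worked example of how the availability promised by each service in a chain limits what the chain as a whole can promise. It is a sketch under a simplifying assumption: failures are treated as independent, so the composite availability is the product of the individual availabilities. The service names and the 99.9% figures are made up for illustration.

```python
from math import prod


def chained_availability(availabilities):
    """Availability of a request path that needs every service in the chain.

    Assumes failures are independent, which real systems only approximate.
    """
    return prod(availabilities)


def supports_target(availabilities, target):
    """True if the chain's composite availability can meet the target SLO."""
    return chained_availability(availabilities) >= target


if __name__ == "__main__":
    # Hypothetical chain: gateway -> orders -> payments, each promising 99.9%.
    chain = [0.999, 0.999, 0.999]
    print(f"composite availability: {chained_availability(chain):.4%}")
    print("meets 99.9% target:", supports_target(chain, 0.999))
    print("meets 99.5% target:", supports_target(chain, 0.995))
```

Three services that each meet 99.9% cannot, on their own, back a 99.9% promise for the whole chain, which is why each system needs to know what it can reliably expect from the systems it depends on.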

Recent KubeCon Sessions

So, are we alone in our way of thinking about SLOs, and how they can drive improved reliability?

In our recent visit to KubeCon 2019, as shown in Figure-3, we saw that approximately 33% of the sessions had something to do with SLO/policy/intent-based approaches to controlling distributed services. Most of these sessions discussed how, with large multi-cloud distributed services involving huge scale-out factors, comes the realization that you need effective real-time telemetry with dynamic control actions to achieve the stated SLO/policy/intent. No doubt these systems are like graphs, and it is becoming harder to make sense of them, so the best we can do is measure metrics in real time, make some interpretations of these metrics, and then build systems that can understand these metrics and take appropriate control actions. Essentially, we are building robotic systems, where all systems share the same real-time information and, in a coordinated fashion, act in concert to meet their agreed-upon objectives.

Figure-3 33% of the recent KubeCon sessions were focused on SLO/policy/intent auto-scaling and reliability mechanisms

In this blog, we examined the CIO conundrum, and briefly touched on the notion of SLOs and how to use them to tame large, complex systems. Next, in this series of blogs, we will address how customers are solving these challenges today, how VMware has built an SLO-capable controller that does auto-scaling across a spectrum of multi-cloud services, and then conclude the series with a new emerging trend related to service mesh known as distributed cloud runtime.


Emad has spent the past 27 years in various software engineering positions involving software development of application platforms and distributed systems for a wide range of industries, such as finance, health, IT, and heavy industry, across the globe. Emad is currently the Sr. Director and Chief Technologist of Cloud Application Platforms with the Office of the CTO at VMware, leading a team focused on product innovations in the space of Cloud Runtimes, Service Mesh, and Multi-Cloud services that are application runtime aware.

 

Mark Schweighardt is Director of Product Management for VMware Tanzu Service Mesh. Mark works closely with VMware enterprise customers to address the connectivity, security, and operational challenges surrounding their cloud native applications. Mark has worked with hundreds of enterprises from a variety of industries, including banking, insurance, high tech, healthcare, government, and many others. Prior to VMware, Mark held various product management and product marketing positions, mostly at Silicon Valley start-ups focused on the security space, including identity and access management and data protection.