From service mesh to cloud runtimes as a new emerging pattern (Part 2 of a 4-part Series)

In the first post of this 4-part blog series, we talked about the CIO conundrum and how SLOs can help break the vicious cycle CIOs find themselves in, a cycle brought about by organizational and technology design silos. In this post, we look at the team-related technology challenges that CIOs must address to deliver end-to-end solutions that guarantee Quality of Service (QoS) for applications and compelling experiences for customers. We will also discuss how such systems are built today to guarantee QoS, from the time a user initiates a transaction in an application until it is fully executed across a set of multi-cloud distributed services.

How Is the End-to-End User Experience Handled in the Enterprise Today?

To deliver on end-to-end QoS and SLOs today, we cobble together metrics from three different teams: the Application Platforms Team (APT), the Data Platforms Team (DPT), and the User Experience Tools Team (UETT) (see Figure 1). These are the three main teams within the enterprise that have some level of responsibility for delivering QoS for applications (for simplicity, we have omitted other teams that may play a role).

Figure 1: This diagram illustrates how three key teams are involved in measuring end-to-end quality of service for multi cloud applications, including managing availability, resiliency, reliability, and feature velocity at scale.

Let us assume these three teams are working together on running a mobile banking application that provides various personal banking transactions, such as depositing checks, transferring funds between accounts, paying bills, etc. To guarantee end-to-end SLOs of this banking application, one has to build a system that can provide an answer to the following question: What response times are users located in New York experiencing at any point in time (or any other location where the bank does business)?

Systems capable of providing such answers must gather telemetry, metrics, and tracing data from the time a user taps in their mobile application to initiate a transaction to the time the transaction completes. The tracing provides visibility into the entire transaction as it propagates across distributed services and APIs running in different clouds. This implies gathering spans and metrics from all three teams mentioned earlier (see Figure 1): the UETT, APT, and DPT. These teams need to collaborate to stitch together dozens or hundreds of spans into end-to-end tracing and tracking information, with the ability to analyze it and take corrective actions that improve the user experience in near real time. Not an easy undertaking.
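To make the stitching step concrete, here is a minimal Python sketch. It assumes every span carries a shared trace ID and that all timestamps come from synchronized clocks, which is itself a hard problem in practice. The `Span` fields and operation names are hypothetical and are not taken from any particular tracing product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str  # shared by every span of one user transaction (assumption)
    team: str      # "UETT", "APT", or "DPT", the team that emitted the span
    name: str      # hypothetical operation name
    start_ms: int  # start time in milliseconds on a common clock (assumption)
    end_ms: int    # end time on the same clock


def stitch(spans):
    """Group spans by trace_id and order each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return {tid: sorted(group, key=lambda s: s.start_ms)
            for tid, group in traces.items()}


def end_to_end_ms(trace):
    """Elapsed time from the earliest start to the latest end in one trace."""
    return max(s.end_ms for s in trace) - min(s.start_ms for s in trace)


# Illustrative spans arriving out of order from the three teams.
spans = [
    Span("t1", "APT", "transfer-api", 20, 180),
    Span("t1", "UETT", "mobile-click", 0, 200),
    Span("t1", "DPT", "ledger-write", 60, 150),
    Span("t2", "UETT", "mobile-click", 5, 95),
]
traces = stitch(spans)
latency_t1 = end_to_end_ms(traces["t1"])
```

In a real deployment, the trace ID has to be propagated across every service hop and cloud boundary, which is where much of the plumbing effort goes.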

For example, the UETT is concerned with using edge-based services to ping the banking application from multiple geographic locations (as shown in Figure 1). Tools such as UptimeRobot do this by simulating clicks and initiating synthetic transactions to gather response times or other metrics deemed important to the SLOs being tracked.
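As a rough illustration of how synthetic probe results could answer the New York question posed earlier, the sketch below groups response-time samples by location and compares the 95th percentile against a target. The sample values, the 300 ms target, and the choice of the nearest-rank percentile are all assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p95_by_location(samples):
    """samples: iterable of (location, response_time_ms) pairs from synthetic probes."""
    grouped = defaultdict(list)
    for location, response_ms in samples:
        grouped[location].append(response_ms)
    return {loc: percentile(vals, 95) for loc, vals in grouped.items()}


# Hypothetical probe results; a real system would stream these continuously.
samples = [("New York", ms) for ms in (120, 135, 110, 480, 125)]
samples += [("London", ms) for ms in (90, 95, 100)]

SLO_MS = 300  # hypothetical response-time target
report = p95_by_location(samples)
violations = [loc for loc, p95 in report.items() if p95 > SLO_MS]
```

With so few samples, a single slow probe dominates the percentile, so a production system would aggregate over a larger window before acting on the result.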

The UETT then gathers the response time telemetry and shares it with the APT for further analysis. It is here, with the APT, that things start to become interesting as they try to dissect and correlate what the response time information means. This is no easy task: it requires a lot of plumbing (internal tools combined with external vendor tools and services) to fully trace transactions. The APT also needs to trace back to the data backing services provided by the DPT. Finally, the data may need to be correlated for additional analytics and to drive control actions that improve the end user experience, such as scaling, circuit breaking, redirecting, and other resiliency patterns, all in near real time.

Having worked on many such systems with our customers, we know there is tight integration between the APT and DPT, and a high degree of use case continuity across the various solutions and products available in the market. However, the gap between the APT and UETT is where things start to fall apart. Based on our observations working with customers, this gap is wide, largely due to differing expertise and skill sets, processes, and tools.

The gap between the APT and UETT makes it hard to implement elegant solutions at speed and scale. For example, how do you accommodate delays in receiving metrics from various systems, convert differing metric formats, work with multiple scripting languages, and issue real-time control actions that deliver on the QoS and SLOs? Today, these challenges are addressed unsatisfactorily, with many tradeoffs and compromises, and only through additional manual and costly effort.
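Two of these challenges, differing metric formats and delayed metrics, can be illustrated with a small Python sketch. Both input formats and every field name below are hypothetical and invented for the example. The point is only that each team's records get mapped into one common shape, and that records too old to act on are filtered out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    value_ms: float     # latency, always in milliseconds
    timestamp_s: float  # observation time, always in epoch seconds
    source: str         # team that produced the record


def from_probe(record):
    """Hypothetical UETT probe format: latency in seconds, time in epoch milliseconds."""
    return Metric("response_time", record["latency_s"] * 1000.0,
                  record["ts_ms"] / 1000.0, "UETT")


def from_platform(record):
    """Hypothetical APT format: latency in milliseconds, time in epoch seconds."""
    return Metric("response_time", float(record["duration_ms"]),
                  float(record["time"]), "APT")


def fresh(metrics, now_s, max_age_s):
    """Keep only metrics recent enough to drive a near-real-time decision."""
    return [m for m in metrics if now_s - m.timestamp_s <= max_age_s]


metrics = [
    from_probe({"latency_s": 0.25, "ts_ms": 1_000_000}),  # observed at t = 1000 s
    from_platform({"duration_ms": 180, "time": 940}),      # observed at t = 940 s
]
usable = fresh(metrics, now_s=1010.0, max_age_s=30.0)
```

Here the APT record is 70 seconds old and is dropped, which shows how delays in handing off metrics directly reduce the data available for control decisions.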

How Have Systems Been Coping and Delivering End-to-End SLOs?

Adding to the complexity of handing off telemetry and metrics between teams in near real time is the fact that you need to do this in a geographically distributed manner. Most organizations do not have cloud infrastructure at every possible geographic edge, so the logical solution has been to leverage a cloud and/or CDN service to take the application services and associated SLO measurement tools to the edge. For example, Lyft runs Envoy proxies distributed all around the world and leverages VMware Tanzu Observability by Wavefront to handle the telemetry stitching, which provides near-real-time visibility into its ridership metrics that it uses to optimize the application experience.

Gathering metrics and telemetry is a critical part of the solution, but the other aspect is to use these metrics to issue control actions that can help improve the user experience.  This is where companies such as Lyft have spent quite a lot of time developing specific custom control layers that analyze telemetry data and take policy-based actions to ensure applications meet desired SLO targets.
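To give a feel for what a policy-based control layer does, here is a deliberately simplified Python sketch of one decision step. It is not a description of how Lyft or any product implements this. The `Policy` fields, the thresholds, and the action names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    target_ms: float               # response-time SLO target
    max_error_rate: float          # error-rate ceiling before breaking the circuit
    scale_down_below: float = 0.5  # scale down when p95 is under this fraction of target


def decide(policy, p95_ms, error_rate, replicas, min_replicas=1, max_replicas=10):
    """Return (action, desired_replicas) for one control-loop iteration."""
    if error_rate > policy.max_error_rate:
        # Failing fast protects callers while the service recovers.
        return ("open_circuit", replicas)
    if p95_ms > policy.target_ms and replicas < max_replicas:
        return ("scale_up", replicas + 1)
    if p95_ms < policy.target_ms * policy.scale_down_below and replicas > min_replicas:
        return ("scale_down", replicas - 1)
    return ("hold", replicas)


policy = Policy(target_ms=300.0, max_error_rate=0.05)
slow = decide(policy, p95_ms=480.0, error_rate=0.01, replicas=3)
failing = decide(policy, p95_ms=480.0, error_rate=0.20, replicas=3)
idle = decide(policy, p95_ms=100.0, error_rate=0.0, replicas=3)
```

A real controller would run this decision continuously against fresh telemetry and would add safeguards such as cooldown periods to avoid oscillating between scaling up and down.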

What Could these Types of Control Systems Look Like in the Future?

So far, we have talked about the complexity of setting up end-to-end metrics, traceability, and SLOs. Is it possible to improve the user experience with a purpose-built, SLO-driven system that can trace user transactions from the edge across a distributed application service graph, and then take control actions such as scaling, circuit breaking, redirecting, and other resiliency patterns in near real time? How can we abstract these control functions into a common reusable layer that is consumable by applications?

One way of abstracting these concerns away from the application business logic and the underlying cloud infrastructure is to leverage a sidecar proxy pattern and service mesh controllers. However, a service mesh alone is not enough; we also need specialized control layers on top of the service mesh to abstract a common set of application resiliency services. This notion of a common layer, referred to as a cloud runtime, was first mentioned in the VMware OCTO Application Platforms Position Paper.

In the next blog post of this series, we will cover how we built a specialized controller within VMware Tanzu Service Mesh focused on autoscaling microservices distributed across multiple clouds. We call this capability the Predictable Response Time Controller (PRTC), a dependency- and topology-aware autoscaler that maintains one or more service-level objectives (SLOs) across cloud boundaries.


Emad has spent the past 27 years in various software engineering positions involving software development of application platforms and distributed systems for a wide range of industries, such as finance, health, IT, and heavy industry, across the globe. Emad is currently the Sr. Director and Chief Technologist of Cloud Application Platforms with the Office of the CTO at VMware, leading a team focused on product innovations in the space of Cloud Runtimes, Service Mesh, and Multi-Cloud services that are application-runtime aware.

 

Mark Schweighardt is Director of Product Management for VMware Tanzu Service Mesh. Mark works closely with VMware enterprise customers to address the connectivity, security, and operational challenges surrounding their cloud native applications. Mark has worked with hundreds of enterprises from a variety of industries, including banking, insurance, high tech, healthcare, government, and many others. Prior to VMware, Mark held various product management and product marketing positions, mostly at Silicon Valley start-ups focused on the security space, including identity and access management and data protection.