Because modern applications serve traffic across geographies and at scale, they must be deployed in multi-cloud environments, with transactions traversing edges, network PoPs, and multi-cloud backends. In this blog post, we will examine some of the hurdles modern applications face as a result of scale and distribution, as well as how distributed tracing can help overcome them.
Figure 1. A simplified illustration of a modern application with services distributed across geographies and platforms
To understand the problem, let’s start with an example: an online retail platform. Each product search is likely to span tens of services across different platforms, each of which involves microservices and database accesses. To meet quality-of-service (QoS) latency demands for critical transactions (such as payment services that involve communication between multiple modules, databases, etc.), it becomes imperative to monitor transactions quantitatively. Traditional methods of logging and monitoring individual services fail to provide a logical end-to-end view. This is where modern approaches to application instrumentation and data collection can help.
What’s an instrumented app and what can you do with it?
Figure 2. Application instrumentation
Driven by microservice architectures, modern applications’ services are deployed across different infrastructure platforms. One of the best ways to instrument such applications is a method called “distributed tracing.” Distributed tracing involves tagging critical application tasks with unique IDs and tracing requests through each service or module to provide an end-to-end view of the transaction.
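The idea of tagging a transaction with a unique ID at its entry point can be sketched in a few lines of Python. This is a minimal illustration, not any particular SDK’s API; the function names and dictionary shape are hypothetical.

```python
import uuid

def start_trace():
    """Generate a unique trace ID for a new transaction."""
    return uuid.uuid4().hex

def handle_search(query):
    """Entry point of a transaction: tag the incoming request with a trace ID."""
    trace_id = start_trace()
    # The trace ID travels with the request through every downstream service,
    # letting the backend stitch all the pieces back into one transaction.
    return {"trace_id": trace_id, "query": query}

request = handle_search("running shoes")
print(len(request["trace_id"]))  # prints 32 (hex characters)
```

In a real SDK the ID generation and propagation are handled for you; the point here is simply that one identifier is minted per transaction and carried along with the request.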
Understanding the concepts involved in distributed tracing
The basic building block of distributed tracing is called a “trace.” Every instance of a transaction in an instrumented application will generate a trace. Other components include the mode of transport and the trace-collector service.
Each operation or task is referred to as a “span,” which is identified within a service by a unique identifier (ID) called a “span ID.”
Figure 3. Context of a trace
Once a span ID is generated, it is forwarded to the next downstream service. When a request traverses through the downstream services, the relationship between the spans is established by associating the new spans with a “parent ID.” The parent ID is the span ID of the immediate upstream caller.
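The parent–child relationship between spans can be sketched as follows. This is an illustrative model under assumed names (`new_span`, the dictionary keys), not a real tracing library’s data structure.

```python
import uuid

def new_span(trace_id, parent_id=None):
    """Create a span; a child span records its caller's span ID as parent_id."""
    return {
        "trace_id": trace_id,               # shared by every span in the transaction
        "span_id": uuid.uuid4().hex[:16],   # unique per operation
        "parent_id": parent_id,             # span ID of the immediate upstream caller
    }

trace_id = uuid.uuid4().hex
frontend = new_span(trace_id)                               # root span: no parent
checkout = new_span(trace_id, parent_id=frontend["span_id"])
payment = new_span(trace_id, parent_id=checkout["span_id"])

# The parent IDs encode the call chain frontend -> checkout -> payment.
assert payment["parent_id"] == checkout["span_id"]
```

Walking the parent IDs from any span back to the root reconstructs the call path of the request.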
Figure 4. Traversal of trace context between services
The metadata, which includes the trace information from all the operations or tasks, is reported independently to a backend collector. The collector uses a common identifier (known as a “trace ID”) to associate all the span data generated by the tasks of a single request. Together, these elements comprise the context of a trace.
Mode of transport between services
When a trace is generated, its information, along with the application’s data, is sent to downstream services in RPC requests. To ensure that a complete trace of the transaction is built, it is necessary that each service involved in the transaction can attach its own span and forward it to the downstream service. The trace data can be bundled along with the headers of the RPC requests. The same trace data must be sent to a collector service.
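Bundling the trace context into request headers can be sketched like this. The header layout below follows the shape of the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`); the `inject`/`extract` function names are illustrative, not a specific SDK’s API.

```python
import re
import uuid

def inject(headers, trace_id, span_id):
    """Attach the trace context to an outgoing request's headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers

def extract(headers):
    """Recover the trace context on the receiving service, if present."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}",
                     headers.get("traceparent", ""))
    return (m.group(1), m.group(2)) if m else None

trace_id = uuid.uuid4().hex         # 32 hex chars, shared across the transaction
span_id = uuid.uuid4().hex[:16]     # 16 hex chars, this service's span
headers = inject({"content-type": "application/json"}, trace_id, span_id)
assert extract(headers) == (trace_id, span_id)
```

The receiving service extracts the context, creates its own span with the extracted span ID as the parent, and repeats the injection for the next hop.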
The collector service is responsible for collecting span data, assembling it into traces, and making those traces available for post-processing operations. The trace data, in JSON or binary format, is transported to collectors over HTTP or gRPC.
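The collector’s core job of grouping independently reported spans back into traces can be sketched as follows. The `Collector` class and span fields here are hypothetical simplifications of what real backends do.

```python
from collections import defaultdict

class Collector:
    """Groups independently reported spans into complete traces by trace ID."""

    def __init__(self):
        self.traces = defaultdict(list)

    def ingest(self, span):
        # Spans arrive out of order, from different services and platforms.
        self.traces[span["trace_id"]].append(span)

    def get_trace(self, trace_id):
        # Order spans by start time to reconstruct the transaction timeline.
        return sorted(self.traces[trace_id], key=lambda s: s["start"])

collector = Collector()
collector.ingest({"trace_id": "t1", "span_id": "b", "start": 2, "name": "db-query"})
collector.ingest({"trace_id": "t1", "span_id": "a", "start": 1, "name": "search"})
print([s["name"] for s in collector.get_trace("t1")])  # prints ['search', 'db-query']
```

Production collectors add buffering, sampling, and storage on top of this grouping step, but the trace ID remains the key that reunites the spans.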
Implementations of distributed tracing
Over the last few years, there have been quite a few implementations of distributed tracing. Some of the most popular providers, such as Zipkin, Jaeger, OpenTelemetry, etc., have SDKs for a wide array of programming languages. All the implementations are built around the concepts explained above but come with their own keywords to identify the constructs of a trace. Each has its own way of bundling the trace data and sending it to backend collector services. (Note: when instrumenting applications that span across platforms, you must use the same trace provider throughout the application.)
Choosing a library for your application
When choosing a library, it is important to understand the platforms upon which the application will be deployed. For example, if your application is deployed in a Kubernetes environment involving a service mesh such as Istio, you should look at the supported SDKs and trace data formats.
Distributed tracing with VMware for multi-cloud distributed applications
With VMware’s Tanzu Observability by Wavefront, you get a comprehensive backend collector service capable of ingesting trace data from most of the popular tracing libraries on the market. Tanzu Observability provides the Wavefront proxy service, which can run alongside your services on each of the platforms across which your application is distributed. The Wavefront proxy forwards the data to the Wavefront cloud service for post-processing.
Figure 5. Illustration of multi-cloud platforms leveraging Wavefront services for distributed tracing
The Tanzu Observability platform does much more than ingest traces (learn more). You can also read more about distributed tracing and how VMware is leveraging it to build more resilient application platforms in our Cloud Runtime project.