This post provides a summary of the OCTO white paper: Observability for Modern Application Platforms.
Observability is a new technology trend that is gaining industry traction as part of the enterprise migration towards the cloud. Together with data analytics and automation, Observability enables implementation of actionable feedback loops for effectively managing cloud-native infrastructure and applications. This post is relevant for application architects, SRE/operations leads, and business decision makers who want to better understand how Observability can help organizations and IT processes transition towards agile cloud operations.
If you want to go straight to the reference framework, see the Cloud Observability Framework section below.
Drivers for Observability
In the last decade, we have seen a continued drive for increasing agility in enterprise IT to support the digitization of business processes and services. This trend has triggered several technological innovations like infrastructure virtualization and containerization.
New orchestration frameworks like Kubernetes, together with cloud infrastructure services, have made it easier to provision new IT deployment environments and deploy application runtimes. At the same time, the underlying complexity has increased significantly, which has become especially apparent in how these new deployment environments need to be monitored and managed.
The traditionally siloed ecosystem of IT monitoring tools is not adequate to handle the complexity and the explosion of telemetry data associated with new containerized microservices deployments. The developments described above are all trends within the cloud-native movement. Within this movement, observability is the “new monitoring”.
A System-Oriented Approach to IT Monitoring
Observability focuses on a system-oriented approach to IT monitoring. It takes a holistic end-to-end view of monitoring endpoints by aggregating and processing different types of telemetry data feeds, focusing on generating actionable insights as output.
This trend in IT monitoring is not much different from similar trends in other industry verticals, like the airline and automobile industries. As systems became more complex, implementing actionable monitoring helped control the impacts of this complexity in real time.
For example, commercial airplane cockpits now have advanced computer systems that aggregate and process information from thousands of measurement sensors. Similarly, car manufacturers have replaced traditional dashboard gauges with centralized computer displays in new Electric Vehicles (EVs), not only for displaying sensor data but also for providing advanced driver assist and self-driving functions.
In short, the process of telemetry data collection for IT monitoring is now becoming a means to an end. The focus is shifting from just collecting (and observing) data towards processing heterogeneous telemetry data to derive actionable information that can be used for value-added insights and automation, creating a ‘driver assist’ for IT.
The terms observability and monitoring are often used interchangeably, but they have different meanings. Observability has its origin in control theory, where it refers to the ability to derive the (internal) state of a system based on its external outputs. The goal is to externalize the system state based on sensor data outputs. The concept has been popularized by cloud engineering and operations teams, typically focused on managing large distributed application deployments. On the other hand, monitoring originated in the operations and production support world. It describes the discipline of collecting metrics and alerts to monitor the health and performance of discrete IT infrastructure components (e.g., servers, storage, and network devices).
Observability creates actionable feedback loops using telemetry data. It expands the scope of traditional monitoring by collecting different types of data. By aggregating and correlating this data, observability is focused on deriving actionable system-level insights.
Deployment engineers and SREs can use telemetry data insights to implement feedback loops for making deployment changes. Various levels of automation can be used for implementing these feedback loops, as shown in Figure 1. Depending on the complexity level, deployment changes can be fully automated by controllers or may require semi-manual support.
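To make the idea of a feedback loop concrete, here is a minimal Python sketch of a single observe, decide, act step. It is not taken from the white paper: the `Deployment` class, the function names, the 60% CPU target, the replica cap, and the `auto_limit` threshold are all hypothetical values chosen for illustration. Small changes are applied automatically, while larger ones are only returned as a recommendation, mirroring the fully automated and semi-manual levels described above.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical deployment state, reduced to a replica count."""
    replicas: int


def decide_replicas(current: int, cpu_pct: int,
                    target_pct: int = 60, max_replicas: int = 10) -> int:
    """Proportional rule: size the deployment so that average CPU
    utilization moves toward the target. All numbers are illustrative."""
    # Integer ceiling division avoids floating-point rounding surprises.
    desired = -(-(current * cpu_pct) // target_pct)
    return max(1, min(max_replicas, desired))


def feedback_step(deployment: Deployment, cpu_pct: int,
                  auto_limit: int = 2) -> str:
    """One pass of the loop: observe (cpu_pct), decide, act."""
    desired = decide_replicas(deployment.replicas, cpu_pct)
    delta = desired - deployment.replicas
    if delta == 0:
        return "no-op"
    if abs(delta) <= auto_limit:
        # Small change: the controller applies it without human input.
        deployment.replicas = desired
        return "applied"
    # Large change: leave the deployment untouched and ask an operator.
    return f"needs-approval: scale to {desired}"
```

In a real system the observed utilization would come from the telemetry pipeline and the action would call an orchestration API; both are left out here to keep the sketch self-contained.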
Observability Data Types
Figure 2 shows telemetry data that is used for observability, generalized by the observability data type. Metrics, traces, logs, and events can be considered separate implementations of this data type, characterized by specific syntax and temporal characteristics.
Metadata is configurable information with several benefits:
- It can be added (as an attribute) to telemetry data.
- It enables grouping and aggregating different sets of data, which is useful when performing data analytics such as search aggregations.
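As a rough illustration of a generalized observability data type with metadata attributes, the following Python sketch models a telemetry record and groups records by a metadata key. The `TelemetryRecord` class, its field names, and the `aggregate_by` helper are assumptions made for this example and do not come from the white paper or from any specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TelemetryRecord:
    """Hypothetical generalized observability data type."""
    kind: str          # "metric", "trace", "log", or "event"
    name: str
    value: float
    timestamp: float   # seconds since the epoch
    metadata: dict = field(default_factory=dict)  # configurable attributes


def aggregate_by(records, key):
    """Group records by one metadata key and sum their values."""
    totals = defaultdict(float)
    for record in records:
        # Records without the key are collected under "unknown".
        totals[record.metadata.get(key, "unknown")] += record.value
    return dict(totals)


records = [
    TelemetryRecord("metric", "http_errors", 3.0, 1000.0, {"service": "checkout"}),
    TelemetryRecord("metric", "http_errors", 2.0, 1001.0, {"service": "checkout"}),
    TelemetryRecord("metric", "http_errors", 5.0, 1002.0, {"service": "cart"}),
]
errors_per_service = aggregate_by(records, "service")
```

Because the grouping key is just a metadata attribute, the same records could equally be aggregated by region, team, or release, provided that attribute was attached at collection time.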
Open source frameworks like Prometheus and Jaeger, combined with new observability metrics and tracing libraries in the Spring Boot Java development framework, facilitate the implementation of fine-grained telemetry data collection from cloud infrastructure and distributed applications.
We also see leading cloud-native enterprises starting to expand the scope of observability. They align deployment insights with higher-level business metrics by combining business transactional data (such as cost and sales volume metrics) with telemetry data.
Cloud Observability Framework
Figure 3 introduces a framework for cloud observability. The framework serves as:
- A reference for evaluating existing monitoring architectures
- An implementation guide for new observability solutions using monitoring tools and cloud monitoring services
The cloud observability framework, shown in Figure 3, outlines the key functional areas that need to be considered for creating actionable observability feedback loops:
- Personas and SLOs – Roles and responsibilities for stakeholders of telemetry data.
- Actions and Analytics – Data management tools and automation for implementing actionable data intelligence.
- Policy and Controls – Data governance policies for security and data lifecycle management.
- Telemetry Data Aggregation – Collection and storage of telemetry data from instrumentation sources.
- Instrumentation – Telemetry instrumentation, including OS agents, SDKs, web hooks, etc.
Increase Agility in IT Organizations with Observability
IT agility is about deploying infrastructure and applications faster in a consistent, secure, reliable, and repeatable way. IT organizations can achieve this goal with ongoing feedback and insights about the state and health of applications and underlying IT infrastructure. Observability provides this type of feedback to stakeholders across the organization.
Figure 4 shows how organizations can evolve towards a target state with observability. With IT organizations adopting cloud-native deployments to support application deployment agility, Observability is becoming an essential solution for bridging the gap between developers and operations support staff: it provides both groups with similar toolsets, which reduces friction and enables organizational alignment.
Service Level Objectives (SLOs) can be leveraged to establish mutually understood and agreed upon service availability and quality objectives for specific IT infrastructure services and applications. For additional background, see SLOs – The Emerging Universal Language in the Enterprise.
By leveraging observability, telemetry data can be used for monitoring Service Level Indicators (SLIs) to validate whether the agreed SLOs are being met. In this case, Observability offers a methodology for monitoring SLO compliance between teams and organizations.
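The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not a method from the white paper: it assumes an availability SLI defined as the fraction of successful requests, a hypothetical SLO target of 99.9%, and a hypothetical `slo_report` helper. `Fraction` is used so that the comparison against the target is exact.

```python
from fractions import Fraction


def slo_report(total: int, failed: int,
               target: Fraction = Fraction(999, 1000)) -> dict:
    """Compare an availability SLI with an SLO target.

    total  -- requests observed in the evaluation window
    failed -- requests that did not meet the success criteria
    target -- SLO target (assumed 99.9% here, for illustration only)
    """
    # SLI: fraction of successful requests; an empty window counts as healthy.
    sli = Fraction(total - failed, total) if total else Fraction(1)
    # Error budget: the number of failures the SLO tolerates in this window.
    budget = total * (1 - target)
    return {
        "sli": sli,
        "slo_met": sli >= target,
        "budget_remaining": budget - failed,
    }


report = slo_report(total=100_000, failed=50)
```

With these sample numbers the SLI is 99.95%, which meets the 99.9% target and leaves half of the error budget (50 of 100 tolerated failures) unspent. A negative `budget_remaining` would signal that the SLO has been missed for the window.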
Observability is an evolving technology focus area that will increase in importance due to the need for better monitoring visibility and control automation for distributed microservice applications and cloud infrastructure.
To learn more:
Observability for Modern Application Platforms
Attend our upcoming VMworld session: Cloud Observability Frameworks for Modern Application Platforms – OCTO3016