Root-cause analysis (RCA) is a holistic process used to investigate the symptoms of a problem and diagnose its underlying causes. In IT operations, DevOps (development and operations) and site reliability engineering (SRE) teams perform RCA when systems go down or when the end-user experience is unsatisfactory (such as slow websites or services). They are notified of problems via application performance monitoring (APM) or observability tools, such as VMware Tanzu Observability, which monitor system health with relevant metrics and trigger alerts.
But RCA can be extremely manual and time-consuming. Alerts provide context, but they must still be correlated with other telemetry, such as logs and traces, to pinpoint the problem. Finding the relevant logs and traces is tedious: an SRE has to examine the data from various perspectives, filtering and sorting it to find what they’re looking for.
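To make that manual correlation concrete, here is a minimal sketch of the kind of filtering an SRE does by hand: given an alert, narrow the logs down to error entries from the same service within a time window before the alert fired. The record shapes and field names are illustrative assumptions, not the product's actual data model.

```python
from datetime import datetime, timedelta

# Hypothetical alert and log records; real APM tools expose similar
# fields (service, timestamp, severity) through their query APIs.
alert = {"service": "checkout", "fired_at": datetime(2022, 3, 1, 12, 5), "metric": "error_rate"}

logs = [
    {"service": "checkout", "ts": datetime(2022, 3, 1, 12, 4), "level": "ERROR", "msg": "db timeout"},
    {"service": "cart",     "ts": datetime(2022, 3, 1, 12, 4), "level": "INFO",  "msg": "ok"},
    {"service": "checkout", "ts": datetime(2022, 3, 1, 11, 0), "level": "ERROR", "msg": "old failure"},
]

def correlate(alert, logs, window=timedelta(minutes=10)):
    """Keep only error logs from the alerting service near the alert time."""
    start = alert["fired_at"] - window
    return [
        log for log in logs
        if log["service"] == alert["service"]
        and log["level"] == "ERROR"
        and start <= log["ts"] <= alert["fired_at"]
    ]

relevant = correlate(alert, logs)
# Only the "db timeout" entry is from the right service and inside the window.
```

Each extra dimension (traces, events, deployments) multiplies this filtering work, which is exactly what the team set out to automate.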
The Tanzu Observability team wanted to help accelerate the RCA process for our SREs. This blog post explains how we used design methods to build the best possible experience for our users, as well as our results.
Today’s enterprise software runs as complex distributed systems to ensure a consistent user experience on a global scale. These applications are composed of connected, autonomous microservices running across multiple machines that divide the work among themselves and keep the system available whenever users need it. Teams of people work behind the scenes to build and maintain these systems, making this complex setup possible. As I mentioned earlier, monitoring and observability tools track data and signals from various sources, generate alerts when things go wrong, and help teams troubleshoot these complicated, interconnected systems.
There’s urgency involved with all this alerting, analysis, and problem solving. DevOps teams track how quickly problems are detected and resolved as their key metrics, and organizations may have service-level agreements with customers to maintain a certain uptime. Whether the problem is an error in a new deployment or a malfunctioning Kubernetes pod, service issues not only cause bad end-user experiences but can also cost companies millions of dollars, so quick resolution is critically important.
Taking a human-centered design approach
The first problem we wanted to tackle was the manual work involved in searching for relevant logs and traces. Our engineering and data-science team worked together to build proprietary machine learning (ML) and rule-based algorithms to automate these processes by finding the probable root causes of service issues.
The UX team was tasked with integrating the functionality into the Tanzu Observability product itself. As part of that team, I got involved to ensure that these capabilities were intuitively available to users.
After working with the product manager to break down the problem for the pilot release, I met with the data-science team to gain an understanding of the algorithms they built and the insights they generated. Next, I worked with the developers to learn how they use these insights in their troubleshooting process.
I wanted to take a human-centered design approach — engaging with our users to understand their current processes and figuring out how to enhance their experience. Instead of drastically changing their workflows, we decided to augment the current experience with the probable RCA insights.
Because our success lies in our customers’ success, I worked with the product manager to define key metrics at the outset of the project. We focused on two: user happiness, measured through the quality of insights, and ease of use, measured via adoption and retention.
Research always forms the basis of our design assumptions, so I began by researching the competitive landscape. Next, I looked internally for relevant previous research; VMware design does a phenomenal job of documenting past research and making those learnings easy to access. Building on earlier research into how developers troubleshoot with metrics and traces, I mapped out various user troubleshooting workflows. While users began at different points and followed a variety of paths, every workflow involved correlating dimensions such as metrics, events, alerts, logs, and traces.
I believe that design should be democratic, especially in our highly technical domain. I organized a brainstorming session with product management, development, data science, and leadership stakeholders to collect tribal knowledge, get people on the same page, and prioritize feature development. We focused on bringing the maximum value to users by integrating ML insights at the pivotal point in their troubleshooting workflows. We limited ourselves to one dimension (errors), planning to add more as we validated the feature.
Putting it all together
I worked with a writer from our information-experience team to translate the algorithms’ JSON output into easily consumable language. We focused on shaping each insight into a simple, easy-to-read, actionable piece of information.
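As a rough illustration of that translation step, a template can turn a structured insight into a human-readable sentence. The JSON fields and template text below are hypothetical, chosen only to show the pattern, and are not the product's actual schema or copy.

```python
import json

# Hypothetical insight payload; field names are illustrative assumptions.
raw = '{"type": "error_spike", "service": "payments", "operation": "POST /charge", "confidence": 0.92}'

# One plain-language template per machine-generated insight type.
TEMPLATES = {
    "error_spike": (
        "Errors on {operation} in the {service} service spiked; "
        "this is a probable root cause ({confidence:.0%} confidence)."
    ),
}

def to_sentence(payload: str) -> str:
    """Render a JSON insight as a single actionable sentence."""
    insight = json.loads(payload)
    return TEMPLATES[insight["type"]].format(**insight)

print(to_sentence(raw))
```

The design work was deciding what each sentence should say and omit; the rendering itself is mechanical, as above.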
We wanted to keep things simple by showing only critical insights, so I worked with the team to reduce the number of insights and decrease the cognitive load. Initially, with all the algorithms running, a service could surface more than 20 insights. I brought that number down to fewer than eight by consolidating all rule-based insights into one, visualized as a chart for easy viewing, and by restricting ML-based insights to only high-confidence ones.
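The two reduction steps described above can be sketched as a small filtering function: fold all rule-based findings into one summary entry and keep only ML insights above a confidence threshold. The insight structure and the 0.8 threshold are assumptions for illustration, not the product's actual logic.

```python
# Hypothetical insight list; "source" and "confidence" fields are illustrative.
insights = [
    {"source": "rule", "text": "CPU saturated"},
    {"source": "rule", "text": "Memory pressure"},
    {"source": "ml", "text": "Error spike on /charge", "confidence": 0.95},
    {"source": "ml", "text": "Latency drift on /cart", "confidence": 0.40},
]

def reduce_insights(insights, threshold=0.8):
    """Consolidate rule-based insights into one summary entry and keep
    only high-confidence ML insights, shrinking the list a user scans."""
    rules = [i for i in insights if i["source"] == "rule"]
    ml = [i for i in insights if i["source"] == "ml" and i["confidence"] >= threshold]
    summary = {"source": "rule", "text": f"{len(rules)} rule-based findings (see chart)"}
    return ([summary] if rules else []) + ml

shown = reduce_insights(insights)  # two entries remain instead of four
```

Raising or lowering the threshold is the lever that trades insight coverage against cognitive load.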
I believe that our users are experts. Using this as my guiding principle led me to augment the troubleshooting experience rather than dramatically alter it. I introduced a “Perform RCA” action in the most common action area and built pathways directly from the insights to important next steps (the Traces and Service dashboards) to make navigation easier.
While research is the basis for design, early validation is a crucial element of the design process to avoid unnecessary cycles. We recruited internal customers who were interested in the feature to participate in our alpha process. We built in a feedback mechanism for insights, where users could provide both quantitative and qualitative feedback about the feature. A Slack channel gathered all qualitative feedback, ensuring our team had immediate access to users’ comments.
We are working closely with internal users to understand how we can improve the feature before the external release. ML-powered troubleshooting is the future of application monitoring. I look forward to UX design becoming and remaining an integral part of AI and ML product development.