Running modern production workloads is stressful. No big surprise here. New technologies like Kubernetes help tremendously to tackle things like reliability and scalability, but – for many real-world workloads – they do little to ease workload complexity or help with in-depth monitoring. Worse still, too often, they make these things even more complicated.
Many DevOps and Security teams struggle with this combination of complexity and lack of observability, and securing modern workloads is becoming a serious challenge.
In this three-part blog series, we dive into why security remains a struggle despite the modern technologies backing our applications, and we show a way out of this dilemma. In this first post in the series, we dig more deeply into why modern applications are hard to monitor in depth, and we introduce Project Trinidad from VMware, which allows DevOps and Security teams to regain control of their clusters’ security.
Prelude: There is not enough coffee in the world for a day like this
For those of us working in DevOps, security, or really most types of development, the following scenario might, unfortunately, seem all too familiar. We come into the office in the morning, grab a coffee from our beloved coffee maker in the kitchen, and open our email and Slack just to discover dozens of high-urgency messages. The internal #security Slack channel seems to be going wild this morning, and our inbox is bombarded with emails from our boss, their boss, and the boss’s boss – all asking the same questions: “Have you seen the news on CVE-1234-ABCDE and are we vulnerable? And have we been attacked?”
Great. Not even able to enjoy a proper sip of coffee yet, we already know this is the beginning of several long and stressful days. But – nothing we can do, so it is time to dive into the two questions everyone seems to want to know: Are we vulnerable? And have we been attacked?
As the professional engineers we are, we find it (at least somewhat) straightforward to answer the first question: using our GitOps workflows, we can go back in time and review which versions of container images have been deployed in production, and we use our image catalog to identify which versions of software have been deployed lately. And, even if our catalog is not that pristine, a combination of docker/git/bash magic allows us to grep through all the packages installed in those container images. And – damn – yep, we have the vulnerable library version deployed.
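The docker/git/bash magic above boils down to a simple check: given a per-image inventory of installed packages, flag every image shipping an affected version. A minimal sketch of that check, with entirely hypothetical image names, package names, and version numbers:

```python
# Sketch: check package inventories extracted from container images for a
# vulnerable library version. The image tags, package lists, and the affected
# versions below are hypothetical placeholders, not a real advisory.

VULNERABLE_PACKAGE = "libexample"
VULNERABLE_VERSIONS = {"2.14.0", "2.14.1"}  # assumed affected versions

def find_vulnerable_images(image_packages):
    """Return (image, version) pairs that ship an affected package version.

    image_packages maps an image tag to a {package: version} dict, e.g. as
    produced by parsing `dpkg -l` or `apk list` output from each image.
    """
    hits = []
    for image, packages in image_packages.items():
        version = packages.get(VULNERABLE_PACKAGE)
        if version in VULNERABLE_VERSIONS:
            hits.append((image, version))
    return hits

inventory = {
    "registry.local/frontend:1.4.2": {"libexample": "2.15.0", "openssl": "3.0.7"},
    "registry.local/billing:0.9.1": {"libexample": "2.14.1", "curl": "7.88.1"},
}

print(find_vulnerable_images(inventory))
# only the billing image ships an affected libexample version
```

In practice the inventory would come from your image catalog or from scripting `docker` over the deployed tags; the membership check itself stays this simple.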
After calling in reinforcements to work on fixing the production deployments, we focus on the second question: have we been attacked? This question is typically significantly more challenging to answer with reasonable confidence – if we can answer it at all. For our story, let us assume that we are lucky and the vulnerability is triggered by injecting a specific header into an API request. Since we have an excellent log collector deployed in our production Kubernetes cluster (okay, going back only 90 days, but that’s already something), a quick peek at our regexp cheat sheet allows us to cook up a filter for our logs.
Stressful minutes of staring at the progress bar go by while our logs are filtered – too stressful even to get up and grab that desperately needed next cup of coffee – until the answer comes back: 0 results found. Too good to be true, so we triple-check the regex and make it ever so slightly looser just to be on the safe side. Search, wait, and: 2 results – but they are clearly not what we are looking for.
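The two passes over the logs – a strict filter first, then a slightly looser one to be safe – can be sketched in a few lines. The log format and the header name that triggers the vulnerability are assumptions made for illustration:

```python
import re

# Sketch: filter collected access logs for requests carrying the (hypothetical)
# header that triggers the vulnerability.

# First pass: exact header name, case-insensitive.
strict = re.compile(r"x-evil-header:", re.IGNORECASE)
# Second pass: slightly looser, tolerating alternate separators and spacing.
loose = re.compile(r"x[-_]?\s*evil[-_]?\s*header", re.IGNORECASE)

logs = [
    '10.0.3.7 - - [12/Oct/2022:09:14:02] "GET /api/v1/orders HTTP/1.1" 200',
    '10.0.3.9 - - [12/Oct/2022:09:14:05] "POST /api/v1/login HTTP/1.1" 200 '
    'hdr="X-Evil-Header: $(payload)"',
]

def filter_logs(lines, pattern):
    """Return the log lines matching the given compiled pattern."""
    return [line for line in lines if pattern.search(line)]

print(len(filter_logs(logs, strict)))
print(len(filter_logs(logs, loose)))
```

The catch, of course, is that this only works when the attack leaves a recognizable trace in logs we actually retained – which is exactly the limitation the rest of this post is about.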
Wow, we were indeed lucky. So, let’s get that patch out into production and update our boss, their boss and, of course, the boss’s boss – who, by the way, has stopped by about seven times since this morning and is now standing behind us, literally breathing down our neck as we quadruple-check the logs. Still, all good. It seems we were lucky and have reasonable confidence that we weren’t hit by an attack this time. Phew.
Let us fast-forward to a few hours later: post-mortem time. Everyone is clearly relieved that this played out so smoothly, and the mood is good. But then, the third question of the day is raised:
What if the attackers had known of this vulnerability a year ago, or what would we have done if it had never been disclosed publicly?
The mood changes. Yes, what would we have done? Blank stares in the room. No one knows. What could one ever do about this?
The problem in many production clusters
Before we dive deeper into this third question, let us take a step back and understand why answering the first two questions is often not as easy as our story made it seem.
With the rising popularity of Kubernetes and container runtimes, we are now at a place where installing a plethora of components is just a kubectl apply, a helm install, or a kapp apply away. While this is incredibly simple, it has turned many of our workload environments into a highly heterogeneous system containing:
- Many different applications – some free or open source, others commercial products,
- A wide range of frameworks and programming languages, from Java/Spring and Go to Python and other dynamic languages, and
- Diverse container images using different packaging mechanisms, ranging from stripped-down Alpine-based images to Debian/apt-based images, just to name a few.
We all love diversity, but combining these technologies makes it virtually impossible to obtain a unified way to instrument all these components or to get the detailed, security-relevant logs that we might need when investigating a scenario like the one described in our intro above.
And, to make the situation even more complicated, these individual services typically interact with each other in many non-trivial ways. Having a good understanding of all these services in a cluster and how they interact with each other (and the outside world) is challenging; maintaining an accurate view as services evolve is almost impossible.
At the end of the day, we are facing complex and highly interconnected systems, and maintaining control of them from a security perspective has become quite challenging.
With the inherent complexity of our production workloads, we need a way to monitor our systems’ security independent of the framework/runtime/language on which these systems are built. But monitoring is not enough – we must be able to reason about whether our (micro)services are secure, and we need to understand how these systems behave.
At the scale of these systems and, more importantly, the rate at which they change, it is clear that we need an automated solution that monitors and reasons about the activity in our workloads. Fortunately, to tackle this very problem, VMware recently announced Project Trinidad:
Project Trinidad is an API Security and Analytics Platform. Project Trinidad leverages machine learning to learn normal East-West API traffic patterns between microservices in modern applications, which enables rapid detection and quarantining of anomalous activity.
While the various services that comprise modern application workloads are quite difficult for humans to understand, the nature of these microservices typically gives them very predictable – or at least consistent – behavior. This predictability and consistency are two fundamental properties for Project Trinidad: they allow us to observe how the deployed microservices interact on the network, and they enable us to learn and understand what is normal for our workload.
Understanding what is normal is a big leap toward answering the question of what is not normal. And, once we know when things are not behaving as they should, we are finally back in a place to answer the question of whether our workloads may be under attack. This, ultimately, answers our third question above.
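To make the idea concrete, the core intuition – learn a baseline of normal East-West interactions, then flag anything outside it – can be reduced to a toy set-membership check. The service and endpoint names are hypothetical, and Project Trinidad's actual machine-learning models are far more sophisticated than this sketch:

```python
# Toy illustration of baselining "normal" East-West API traffic: record the
# (caller, callee, endpoint) tuples seen during a learning period, then flag
# any live interaction that never occurred during learning. All names below
# are made up for illustration.

def learn_baseline(observed_calls):
    """Collect the set of API interactions seen during the learning phase."""
    return set(observed_calls)

def detect_anomalies(baseline, live_calls):
    """Return live interactions that never occurred during learning."""
    return [call for call in live_calls if call not in baseline]

training = [
    ("frontend", "orders", "GET /api/v1/orders"),
    ("frontend", "users", "POST /api/v1/login"),
    ("orders", "billing", "POST /api/v1/charge"),
]

live = [
    ("frontend", "orders", "GET /api/v1/orders"),     # seen in training: normal
    ("orders", "users", "GET /api/v1/users/export"),  # never seen: suspicious
]

baseline = learn_baseline(training)
print(detect_anomalies(baseline, live))
```

The payoff of this framing is that it needs no knowledge of any specific CVE: a novel attack that makes a service talk to an endpoint it never normally touches stands out even if the underlying vulnerability was never publicly disclosed.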
Wrapping up (for now)
Modern workloads are complex and heterogeneous, a problematic combination for monitoring and securing them. In this first post in the series, we have looked into why this is the case and introduced Project Trinidad, which helps DevOps and Security teams secure their clusters.
In upcoming posts, we will take an in-depth look into how Project Trinidad is able to understand our diverse set of microservices and how it learns their expected behavior. We will dive into how we capture network data without changing anything about our deployed services and how we can automatically learn what is normal and alert on what is not.