Recently I did a Q&A with our Radius team on the platform engineering (or platform ops) teams that are emerging within many companies. This trend is driven by app modernization and the need for a standard stack that helps development teams accelerate application delivery (while building in important capabilities like security). I’ve gotten a lot of questions about the Radius Q&A, and one of the most common is: what exactly is the difference between the site reliability engineering (SRE) team and the platform engineering team? This is a great question. They’re quite distinct teams, but some context is required to understand the difference. To start, let’s get clear on what the platform engineering team does.
Evolution of the Platform Engineering Team
Well, before we can define the platform engineering team, we must define what a platform is! When we use the word “platform” here, we’re really talking about all the decisions an app team needs to make about how it will build and run its app and which tools and services it will use. These include:
- Language runtimes supported
- Infrastructure choice: VMs, containers (Kubernetes?), OS choice
- Data management: Object store, KV store, relational (or not) database, backups
- Networking: Service mesh (or not?), (global) load balancing, CDN
- Security: IAM, secrets store, certificate management, auditing
- Runtime: Monitoring, alerting, ticketing, troubleshooting, runbook automation
- Consensus system, availability model, fault domains
- CI/CD system
The list goes on and on. All these decisions taken together define the “platform”.
Previously, most of these choices were made by the IT team and most development teams didn’t have much say in the matter. Your persistence store was an Oracle DB so you just used it! Moreover, most requests for any new resources (or DB instances, for example) were made by filing tickets. Therefore, app teams used the platform that was defined for them, rather than by them. Let us call this the old way.
Then cloud ushered in the new way. It spurred a groundswell of innovation as app teams could now get self-service access to all sorts of new capabilities that could augment or replace components of the existing platform. This meant that rather than have a single platform for the entire company (delivered centrally by IT), each app team was essentially creating its own platform. They shared some components but went in totally different directions on others. For instance, there were new NoSQL DBs releasing what felt like every day and different app teams (often within the same company) were taking advantage of different NoSQL DBs. Massive proliferation in underlying platform technology followed, leaving a company dependent on these different platform components to deliver its apps to users.
Many companies simply had no idea of the scale of deviation this proliferation would cause. They hadn’t put in place any guardrails to protect against the unnecessary complexity that came with the growth. Other companies were aware of the issue but felt it was a worthwhile tradeoff, as it allowed each app team to go faster, no longer relying on a central IT team for a single solution. There was another tradeoff here though: app teams rolling their own had to take ownership of, and accountability for, the production availability of the new platform technologies they had chosen. Rather than a single, well-funded IT team running these platform services for all app teams, each app team now needed to fund engineers to run its own unique services just for itself.
To manage this expanding diversity and the accompanying cost increase, companies began instituting cloud centers of excellence (CCOEs). The idea was to try to drive some standardization through best practices. Rather than let teams do whatever they wanted for any reason, the CCOE defined some guardrails and pointed app teams toward well-understood solutions. In other words, standards were recreated as they existed in the old days, but this time rather than directly delivering the services themselves, the CCOE defined best practices for using cloud services and other vendor solutions. This helped to alleviate the duplication challenge, but each app team was still forced to stitch together its own platform, which took valuable cycles away from its core mission of building a great app that delighted customers.
It’s because of this that the modern platform engineering team emerged. Their goal is to focus on the following areas:
- Tying all the platform components into a single, integrated whole.
- Enabling self-service, API-based consumption of these components.
- Where possible (and desired), enabling additional cloud services to be part of the platform should an app team choose to use them.
- Driving some degree of standardization across the platform services offered.
In other words, the platform engineering team should focus on making the platform “just work” so the app team can focus on building the app. But in addition to ensuring speed and simplicity for developers, the platform engineering team also makes sure that the platform is secure and compliant. Put differently, the easy and fast choice is also the safe and secure choice. In this way, the company gets the best of both worlds: faster app delivery, but also the enterprise security and compliance it requires. This is enabled by the modern platform engineering team.
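To make “self-service, API-based consumption” with built-in guardrails a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `DatabaseRequest` type, the `provision` function, and the `APPROVED_ENGINES` catalog are assumptions, not part of any real platform or product. The idea it tries to show is that an app team states only what it needs, while the platform fills in secure defaults and rejects requests that fall outside the standard.

```python
from dataclasses import dataclass

# Hypothetical catalog of engines the platform team has standardized on.
APPROVED_ENGINES = {"postgres", "redis"}


@dataclass(frozen=True)
class DatabaseRequest:
    """What an app team asks for; the platform defaults everything else."""
    team: str
    engine: str
    encrypted: bool = True        # secure default: encryption is on
    backups_enabled: bool = True  # secure default: backups are on


def provision(request: DatabaseRequest) -> dict:
    """Check a self-service request against the platform guardrails.

    Returns a description of the resource that would be created. A real
    platform would call its infrastructure APIs here instead.
    """
    if request.engine not in APPROVED_ENGINES:
        raise ValueError(f"engine {request.engine!r} is not in the platform catalog")
    if not request.encrypted:
        raise ValueError("unencrypted databases are not allowed by policy")
    return {
        "name": f"{request.team}-{request.engine}",
        "engine": request.engine,
        "encrypted": request.encrypted,
        "backups_enabled": request.backups_enabled,
    }


if __name__ == "__main__":
    # The app team specifies two fields; the safe settings come for free.
    print(provision(DatabaseRequest(team="checkout", engine="postgres")))
```

The design choice worth noticing is that the shortest request is also the compliant one: a team has to go out of its way to ask for something unsafe, and even then the platform says no.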
At VMware, Tanzu has been focused on delivering a “platform as product”. What we mean by this is that rather than cobbling together a whole bunch of different components yourself, Tanzu wants to provide you with a pre-integrated platform that you can consume as easily as a product. However, each of its components is swappable so that you can switch them out as needed in case your company already has an alternative such as a public cloud service or preferred OSS option:
What is an SRE?
Before we can get into a comparison of platform engineering and SRE, let’s level set and make sure we can align on an understanding of site reliability engineering. As its name suggests, SRE is focused on ensuring reliability and availability, typically measured through service level objectives (SLOs). This is done in two ways: first through “ops” work such as handling escalations, being on-call to respond to production issues, and manually fixing problems. Second, they focus on automation. As they notice themselves spending more and more manual time on something, they automate it. They do this by writing code. In this way, an SRE is indeed an engineer, but one focused on building tooling and automation “around” the app to ensure it maintains its SLO.
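As a rough illustration of what measuring reliability against an SLO can look like, here is a small Python sketch. It assumes a simple request-based availability SLO (good requests divided by total requests). The “error budget” idea it uses is a common SRE practice rather than something defined in this post, and the function names and numbers are made up for the example.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; no traffic counts as fully available."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int, slo_target: float) -> bool:
    """True if measured availability is at or above the SLO target."""
    return availability(good_requests, total_requests) >= slo_target


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no failures yet, 0.0 means the budget is exactly used up,
    and a negative value means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% target.
    print(meets_slo(999_750, 1_000_000, 0.999))
    print(round(error_budget_remaining(999_750, 1_000_000, 0.999), 2))
```

A number like the remaining budget gives the SRE and app teams a shared, objective signal for when to slow down releases and when there is room to move faster.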
SREs and application developers are distinct roles focused on different but complementary areas, working in a shared responsibility model to ensure both that app updates can be delivered quickly and that this can be done reliably while maintaining SLOs. App developers focus on the “business logic” of their apps but also think about operational concerns such as failure modes and what can be done to reduce the cost of failures and ensure quick remediation. SREs drive greater operational maturity and automation while working closely with app developers to provide feedback on operational improvements.
Platform Engineering vs SRE
Now that we understand platform engineering and SRE, what exactly is the difference between them? Well, if we look closely, we see that they’re actually quite different. Platform engineering is about building a platform to support apps through their full lifecycle, making the experience seamless for developers. But these platform tools also include operational capabilities (metric systems, runbook management systems, alerting systems, etc.), so they also serve SREs. The platform team is about creating the underlying infrastructure, about creating the pipes. The app and SRE teams then build the app on top. The app team writes the business logic for the app while the SRE adds on operational automation leveraging the primitives exposed by the platform team, improving exactly what metrics are collected, how they are alerted on, what actions are automatically taken when an alert is triggered and so forth.
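The division of labor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `MetricStore` stands in for a metrics primitive the platform team might expose, while `AlertRule` and `evaluate` stand in for the automation an SRE would layer on top. None of these names come from a real product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class MetricStore:
    """Hypothetical platform-provided primitive: stores the latest metric values."""

    def __init__(self) -> None:
        self._values: Dict[str, float] = {}

    def record(self, name: str, value: float) -> None:
        self._values[name] = value

    def latest(self, name: str) -> float:
        # Metrics that were never recorded read as 0.0 in this sketch.
        return self._values.get(name, 0.0)


@dataclass
class AlertRule:
    """SRE-defined rule: which metric, what threshold, what to do about it."""
    metric: str
    threshold: float
    action: Callable[[], str]


def evaluate(store: MetricStore, rules: List[AlertRule]) -> List[str]:
    """Run the action of every rule whose metric is above its threshold."""
    results = []
    for rule in rules:
        if store.latest(rule.metric) > rule.threshold:
            results.append(rule.action())
    return results


if __name__ == "__main__":
    store = MetricStore()
    store.record("checkout_error_rate", 0.07)
    rules = [AlertRule("checkout_error_rate", 0.05, lambda: "restarted checkout")]
    print(evaluate(store, rules))
```

Here the platform team owns `MetricStore`, and the SRE owns the rules: which metrics matter, where the thresholds sit, and what happens automatically when one is crossed.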
We talked above about how developers and SREs differ in their focus and roles. But both roles are supported by the platform engineering team, as the two work together to support an app through its build, run, and manage lifecycle. A simplification of the above diagram looks like:
However, this isn’t quite correct. As it turns out, the platform engineering team is itself composed of developers and SREs, because platform engineering needs both to build the platform and to operate it for the app-focused teams. Thus, a more accurate diagram looks more like:
Note that the above isn’t necessarily an org chart. For instance, there could be a platform engineering team composed of developers and SREs or there could be an SRE team where some of the SREs work on different apps while others work on the platform. The point is that both developers and SREs can focus on either the app or the platform, but apps and the underlying platform all need both the dev and SRE role to be successful.
In the end, a platform engineering team and SREs are categorically different: the former is a layer of the stack while the latter is a role. Yet both are necessary for success in the application modernization journey. Indeed, it’s the two working together that enables that success.