Project Hamlet: Secure Multi-Vendor Multi-Mesh Federation in the Open Source

The week before VMworld US in August, we announced a new open source service mesh interoperation project, a collaboration between VMware, Google Cloud’s Anthos, HashiCorp, and Pivotal. The project recognizes that service mesh has become a vital part of microservices infrastructure, and that its long-term success depends on customers being able to seamlessly manage meshes across multiple IT environments, mesh products and vendors, or organizational boundaries.

Now, for VMworld EU, we are excited to announce that the service mesh interoperability project is open for community contribution and has been given a name: Hamlet!

Hamlet facilitates federation of service discovery between different service meshes of potentially different vendors. Through an API, service meshes can be interconnected to deliver the associated benefits of observability, control, and security across different organizational unit boundaries, and potentially across different products and vendors. 

Hamlet is open to the community, and we encourage contributors to get in touch and help develop the project further. The initial purpose of Hamlet is to enable customers to discover services, and communicate with them securely, across different service meshes of potentially different vendors. Moving forward, Hamlet will extend federation to other service mesh functionality such as traffic management and policy.

All aboard!

Mesh Interoperation: Single vs. Multiple Organizational Unit Boundaries

VMware is a committed contributor to a number of Open Source service mesh projects including Istio and Envoy. NSX Service Mesh is a forthcoming SaaS offering, built on Istio, that will extend the service mesh to include any platform, cloud, or workload type in addition to Kubernetes. VMware is not only a VM company or a Kubernetes company; we are also a networking and a hybrid cloud company. While we have started with support for different flavors of Kubernetes, we are uniquely positioned to also support VMs, API gateways, and multiple clouds in our data plane as we continue to develop the NSX Service Mesh solution.

There is a clear industry need to interconnect different workloads within the same organizational unit boundary (or administrative domain) or across different ones. Each of these use cases requires features that are associated with different types of service meshes.

  • Tightly coupled workloads. In this case, the owners of the workloads are constrained by a higher authority. This authority forms an organizational unit boundary and establishes conventions for network addressing, workload namespacing, identity and security policies. Workloads under the same organizational unit boundary naturally share the same certificate authority, as they are controlled by the same authority. The main reason for operators to adopt these conventions is to ease administration.
  • Loosely coupled workloads. In this case, the owners of the workloads are not necessarily constrained by a single higher authority, nor do they necessarily share the same certificate authority. There can be groups of workloads under different organizational unit boundaries, each governed by a different authority and within a different trust domain. Each of these authorities establishes conventions for network addressing, workload namespacing, identity and security policies, but has no influence over the other authorities: there is no uniformity across different organizational unit boundaries.

Figure 1. Tightly coupled vs loosely coupled workloads and organizational unit boundaries

Much has been discussed about multi-cluster deployments of both Kubernetes and Istio. In Kubernetes, the main multi-cluster use case is configuration and resource replication across clusters for either disaster recovery or high availability. With Istio you can also expand a service mesh to include services running on VMs or bare metal hosts, or combine services from more than one cluster into a single composite service mesh. While these use cases are sound and needed, both presume that all of the clusters are under the same organizational unit boundary and administrative control, with tightly coupled workloads. Using Kubernetes and Istio in a multi-cluster configuration in this way is typically referred to as cluster federation in Kubernetes nomenclature.

Figure 2. Cluster federation within one organizational unit boundary

However, very little has been said about service mesh interoperation, where each service mesh exists within a different and untrusted organizational unit boundary (and hence workloads are loosely coupled). In this scenario, each mesh can be of the same or different vendors, can have the same or different control and data plane implementations, be single or multi-cluster, and can provide the same or different functionality as a product. This is the problem that Hamlet solves.

Figure 3. Service mesh interoperation across several different organizational unit boundaries

Hamlet makes some assumptions when two service meshes interoperate:

  1. Each service mesh can and will continue operating as a standalone service mesh, with its own application service lifecycle.
  2. Each service mesh can provide services to other service meshes which consume them.
  3. Each service mesh can consume services from other service meshes that expose them.
  4. Each service mesh is considered a black box to the others and only API interoperability can be guaranteed. No details of the infrastructure used, or the underlying control or data planes are shared.
  5. Workloads in each service mesh can and potentially will run in any infrastructure (containers, VMs, physical servers, etc.) and this must be transparent to the others.

Inter-Mesh Service Discovery Federation

To discover workloads and communicate with them securely across different organizational unit boundaries, and potentially across different service mesh products and vendors, we need protocols and data models to share and reconcile service names, service identities, and security policies. Through these protocols and data models, service meshes can be interconnected to deliver the associated benefits of observability, control, security, etc. across different organizational unit boundaries, and potentially across different products and vendors.

There is an Open Source proposal to federate trust domains and identities across potentially different organizational unit boundaries: the SPIFFE Trust Domain and Bundle, hosted within SPIFFE (the Secure Production Identity Framework for Everyone) under the CNCF umbrella.
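To make the idea of a trust domain concrete, here is a minimal sketch in Python. It relies only on the fact that a SPIFFE ID is a URI of the form spiffe://<trust domain>/<path>. The function names, the example IDs, and the dictionary of federated bundles are our own illustrative assumptions; real verification is performed by a SPIFFE implementation using the actual bundle contents, which are treated as opaque placeholders here.

```python
from urllib.parse import urlsplit


def trust_domain(spiffe_id: str) -> str:
    """Extract the trust domain (the URI authority) from a SPIFFE ID."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parts.netloc


def can_authenticate(peer_id: str, local_domain: str, federated_bundles: dict) -> bool:
    """Hypothetical check: a peer is verifiable if it belongs to our own trust
    domain, or if we hold a bundle for its (foreign) trust domain."""
    domain = trust_domain(peer_id)
    return domain == local_domain or domain in federated_bundles


# Illustrative data only: bundle contents are opaque placeholders.
bundles = {"mesh-b.example": "<bundle for mesh-b.example>"}
local_peer = "spiffe://mesh-a.example/ns/shop/sa/cart"
foreign_peer = "spiffe://mesh-b.example/ns/pay/sa/payments"
unknown_peer = "spiffe://mesh-c.example/ns/x/sa/y"
```

With this data, the local and the federated peer would be accepted, while a peer from a trust domain with no bundle on file would not.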

The VMware NSX Service Mesh team is now leading an Open Source proposal to federate service discovery across different service meshes of the same or different vendors, and in potentially different organizational unit boundaries, with no uniform network addressing, workload namespacing, identity or security policies.

Figure 4. Inter-mesh federated service discovery

The primary goal of this proposal is to allow different loosely coupled services in different administrative domains to discover each other and to create a secure communication channel.  

Even though there is no official definition of what microservices are, a consensus view has evolved over time in the industry. We are going to adopt the definition provided by Martin Fowler and other experts:

“services in a microservice architecture are often processes that communicate over a network to fulfil a goal using technology-agnostic protocols such as HTTP.”

When different service meshes interoperate, application workloads are located in different service meshes depending on the functionality that the organization requires and that each service mesh provides. Alternatively, application workloads might need external services located in another service mesh, for example for data compliance reasons.

A federated service describes the properties that an owner service mesh needs to expose to a consumer service mesh so that the consumer can discover, reach, authenticate, and securely communicate with the service. Each federated service adds an entry to the consumer service mesh registry, creating a composite service registry, so that auto-discovered services in the consumer service mesh can reach and route to the federated services in the owner service mesh.
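As a rough illustration of a composite service registry, the sketch below models a federated service entry and a consumer-side registry in Python. None of this is Hamlet's actual data model: the class names, the field names, and the rule that local services take precedence over federated ones are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FederatedService:
    """Hypothetical federated service entry; field names are illustrative."""
    name: str             # service name as exposed by the owner mesh
    owner_mesh: str       # identifier of the mesh that owns the service
    fqdn: str             # name consumers use to address the service
    ingress_address: str  # owner-mesh entry point, "host:port"
    identity: str         # identity the consumer expects during mTLS
    protocols: tuple = ("https",)


class CompositeRegistry:
    """Consumer-side registry: auto-discovered local services plus federated entries."""

    def __init__(self):
        self._local = {}      # fqdn -> endpoint
        self._federated = {}  # fqdn -> FederatedService

    def add_local(self, fqdn, endpoint):
        self._local[fqdn] = endpoint

    def add_federated(self, entry):
        self._federated[entry.fqdn] = entry

    def resolve(self, fqdn):
        """Return (kind, target). Assumption: local services win on a name clash."""
        if fqdn in self._local:
            return ("local", self._local[fqdn])
        if fqdn in self._federated:
            return ("federated", self._federated[fqdn].ingress_address)
        raise KeyError(fqdn)


registry = CompositeRegistry()
registry.add_local("cart.shop.local", "10.0.0.12:8080")
registry.add_federated(FederatedService(
    name="payments",
    owner_mesh="mesh-b",
    fqdn="payments.mesh-b.example",
    ingress_address="ingress.mesh-b.example:443",
    identity="spiffe://mesh-b.example/payments",
))
```

A lookup for the federated name resolves to the owner mesh's entry point rather than to a workload address, which keeps the owner mesh a black box to the consumer.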

Every service mesh can thus have a single role (service owner or service consumer) or both, and this role can change over time. Operators are unlikely to want to expose most of their services to other consumer meshes; more likely they will expose only a small subset. Each service mesh product must therefore provide a mechanism for operators to specify which services are published to which consumer meshes, and when. In VMware NSX Service Mesh, this mechanism is Global Namespaces.
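One way to picture such a mechanism is as a list of export rules, each naming a service, a consumer mesh, and an optional time window. The Python sketch below is a generic illustration under that assumption; it is not how Global Namespaces works, and ExportRule and services_for are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ExportRule:
    """Hypothetical rule: publish `service` to `consumer_mesh` within a time window."""
    service: str
    consumer_mesh: str
    not_before: Optional[datetime] = None  # None means no lower bound
    not_after: Optional[datetime] = None   # None means no upper bound

    def active(self, now: datetime) -> bool:
        if self.not_before is not None and now < self.not_before:
            return False
        if self.not_after is not None and now >= self.not_after:
            return False
        return True


def services_for(rules, consumer_mesh, now):
    """Names of the services currently published to one consumer mesh."""
    return sorted({
        r.service for r in rules
        if r.consumer_mesh == consumer_mesh and r.active(now)
    })


rules = [
    ExportRule("payments", "mesh-a"),
    ExportRule("invoices", "mesh-a",
               not_before=datetime(2030, 1, 1, tzinfo=timezone.utc)),
    ExportRule("payments", "mesh-c"),
]
```

Anything not named in a rule stays private to the owner mesh, which matches the expectation that only a small subset of services is exposed.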

When two service meshes federate service discovery, each must run a federated service discovery agent that implements Hamlet’s specification. The agent has a dual role:

  • It runs the control protocol with the service meshes it is interoperating with. The protocol is service mesh product and vendor-neutral and has been implemented in the specification as a gRPC stream.
  • It configures the local service mesh to (1) allow local services to be published to consumer meshes if the local mesh is an owner and (2) program the local service registry catalog with externally published services when the local mesh is a consumer.

This federated service discovery agent is also responsible for authenticating each interoperating mesh and creating a secure mTLS channel. The channel carries the control protocol that synchronizes the service catalog from the owner service mesh to its consumers. The protocol follows a publish-subscribe model: when an owner service mesh publishes a service, its consumers are notified automatically and the new federated service entry is inserted in their service registries. Likewise, when an owner service mesh decommissions a service, it is removed from its consumers’ service registries.
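The publish-subscribe behavior can be sketched in a few lines of Python. This is an in-memory toy, not the Hamlet protocol: there is no gRPC stream, no mTLS, and no authentication, and the class names, the event names ("create", "delete"), and the replay-on-subscribe behavior are assumptions made for illustration.

```python
class OwnerCatalog:
    """Owner-side catalog that notifies subscribed consumers of changes."""

    def __init__(self):
        self._services = {}     # service name -> entry (opaque dict here)
        self._subscribers = []  # callbacks taking (kind, name, entry)

    def subscribe(self, callback):
        # Assumption: replay the current catalog so a late subscriber converges.
        for name, entry in list(self._services.items()):
            callback("create", name, entry)
        self._subscribers.append(callback)

    def publish(self, name, entry):
        self._services[name] = entry
        self._notify("create", name, entry)

    def decommission(self, name):
        entry = self._services.pop(name, None)
        if entry is not None:
            self._notify("delete", name, entry)

    def _notify(self, kind, name, entry):
        for callback in self._subscribers:
            callback(kind, name, entry)


class ConsumerRegistry:
    """Consumer-side view of the federated services it has been told about."""

    def __init__(self):
        self.federated = {}

    def on_event(self, kind, name, entry):
        if kind == "create":
            self.federated[name] = entry
        elif kind == "delete":
            self.federated.pop(name, None)


owner = OwnerCatalog()
consumer = ConsumerRegistry()
owner.publish("payments", {"ingress": "ingress.mesh-b.example:443"})
owner.subscribe(consumer.on_event)   # consumer learns about "payments"
owner.publish("invoices", {"ingress": "ingress.mesh-b.example:443"})
owner.decommission("payments")       # consumer drops "payments"
```

After these calls the consumer's registry holds only the "invoices" entry, mirroring the owner's current catalog.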

In a similar way, when a service in a consumer service mesh is going to access a federated service published by an owner service mesh, the federated service entry in the consumer mesh catalog (which has been inserted by the owner service mesh) contains all the required data for the consumer service mesh to create an mTLS channel with the federated service.

These federated service entries must contain enough information for consumer service meshes to be able to discover, reach and securely communicate with these federated services in owner service meshes. Naturally, federated service entries have parts in common with commonly used service entries, including the Istio Service Entry. We looked for a product- and technology-neutral service entry, valid for the service discovery mechanisms of all the vendors and products that have collaborated on the proposal and for others that might want to use it in the future. We expect the current federated service entry to evolve in the near future, as other vendors start to collaborate and service mesh products add functionality and innovate at their own pace.

Finally, each mesh has an ingress and an egress gateway. An owner service mesh uses its ingress to provide an entry point for the federated services it has published. Consumer service meshes use an egress to route outgoing traffic to the ingress of the owner service mesh.
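The resulting traffic path can be written down as a list of hops. The Python sketch below is only an illustration of the gateway arrangement described above; plan_hops, the registry dictionaries, and all addresses are hypothetical, and what happens behind the owner's ingress is deliberately left out because the owner mesh is a black box to the consumer.

```python
def plan_hops(source, destination, local_services, federated_services, egress_gateway):
    """Return the ordered hops a request takes from `source` to `destination`.

    local_services:     service name -> local endpoint
    federated_services: service name -> ingress address of the owner mesh
    egress_gateway:     address of the consumer mesh's egress gateway
    """
    if destination in local_services:
        # Local traffic stays inside the mesh and never touches the gateways.
        return [source, local_services[destination]]
    if destination in federated_services:
        # Federated traffic leaves through the egress and enters via the owner's ingress.
        return [source, egress_gateway, federated_services[destination], destination]
    raise LookupError(f"unknown service: {destination}")


local_services = {"cart": "10.0.0.12:8080"}
federated_services = {"payments.mesh-b.example": "ingress.mesh-b.example:443"}
egress = "egress.mesh-a.example:443"

hops = plan_hops("checkout", "payments.mesh-b.example",
                 local_services, federated_services, egress)
```

Here the request from "checkout" would pass through the consumer's egress and then the owner's ingress before reaching the federated service.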

Figure 5. Inter-mesh mTLS service to service communication

Closing Remarks

Through Hamlet and the collaborative effort of the teams involved within Istio, Google Cloud’s Anthos, HashiCorp Consul, Pivotal, and VMware NSX Service Mesh, service meshes can now interoperate, enabling customers to discover services, and communicate with them securely, across these meshes.

Although Hamlet’s initial purpose is to enable customers to discover services and communicate with them securely across different service meshes of potentially different vendors, Hamlet will continue its journey by enabling the federation of other service mesh functionality, such as traffic management and policy, with the aim of becoming the standard for mesh interoperation. Hamlet is open to the community, and the VMware NSX Service Mesh team encourages contributors to get in touch and to contribute.

All aboard!

Check out the following NSX Service Mesh sessions at VMworld:

  • NSX Service Mesh: The Link to Federate and Secure a Multi-Cloud Future [CNET2741BE]
  • Introduction to NSX Service Mesh [CNET1033BE]
  • Cross-Cluster and Cross-Cloud Service Mesh Architecture and Use Cases [KUB1939BE]

Sergio Pozo is Staff Senior Solutions Engineer in the NSX Service Mesh group. He helps customers materialize the use cases to embrace NSX Service Mesh and solve their business problems, describing the structure, characteristics, behavior, and other aspects of the final solution, including integrations with other products. He also works on Service Mesh Interoperation building cross-vendor solutions, both open and closed source, with strategic technology partners.
