
Cloud-Native Federated Learning and Projects

In my previous post, Federated Machine Learning: Overcoming Data Silos and Strengthening Privacy, I introduced the basic concept and categories of federated learning, as well as some typical use cases. In this post, I pick up where I left off, exploring the way we do federated learning and how to solve its complexities using cloud-native technologies.

Federated AI Technology Enabler (FATE)

FATE is an open-source project hosted by the Linux Foundation, with key contributions from WeBank, VMware, Tencent, UnionPay, and other companies. It is focused on providing a secure computing framework to support the federated AI ecosystem. It implements secure computation protocols based on homomorphic encryption and multi-party computation to support various machine-learning algorithms.
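To make the idea of homomorphic encryption concrete, here is a minimal sketch of an additively homomorphic scheme (textbook Paillier) in Python. It is a toy under stated assumptions, not FATE's implementation: the primes are far too small to be secure, and the helper names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are chosen purely for illustration.

```python
import math
import secrets


def keygen(p, q):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is fixed to n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under the public key n."""
    n_squared = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding factor in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext with the private key (lam, mu)."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


n, private_key = keygen(104723, 104729)
encrypted_sum = add_encrypted(n, encrypt(n, 12), encrypt(n, 30))
total = decrypt(n, private_key, encrypted_sum)
print(total)  # 42
```

The party that adds the two ciphertexts never sees 12 or 30, which is the property that lets one participant aggregate encrypted intermediate values on behalf of another.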

FATE is designed for industrial applications and differs from other open-source federated-learning frameworks in the following ways:

  1. Out of the box, the framework provides common and frequently used horizontal and vertical federated algorithms for data engineering and machine learning (ML). It includes a workflow engine to construct a customized full-lifecycle machine-learning task.
  2. It applies various security protocols — homomorphic encryption, secret sharing, RSA, Diffie-Hellman, and more — to different algorithms to comply with the requirements of security, audit, and law.
  3. It provides a self-developed distributed computing, transmission, and storage engine for large-scale applications. It can also integrate with major open-source computing, transmission, and storage engines, such as Spark, Pulsar, RabbitMQ, and HDFS.
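As an illustration of one of the protocols named above, the following is a minimal sketch of additive secret sharing, in which several parties learn the sum of their private values without revealing any individual value. It is not taken from FATE; the modulus, the function names, and the three example inputs are assumptions made for the sake of the example.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus agreed on by all parties (arbitrary choice here)


def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Combine shares (or partial sums of shares) back into the secret."""
    return sum(shares) % MODULUS


# Three parties, each holding one private number.
private_inputs = [15, 27, 8]

# Each party splits its input; party j receives the j-th share from every party.
all_shares = [share(value, len(private_inputs)) for value in private_inputs]

# Each party adds up the shares it received. A single partial sum looks random.
partial_sums = [sum(column) % MODULUS for column in zip(*all_shares)]

# Publishing only the partial sums reveals the total and nothing else.
joint_total = reconstruct(partial_sums)
print(joint_total)  # 50
```

The same pattern extends to vectors of model updates, which is why secret sharing is a natural fit for aggregating contributions in horizontal federated learning.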

The following diagram shows FATE’s basic architecture:

Source: WeBank

  • FederatedML is the component containing the federated-learning algorithms. Each algorithm is built as a module that can be plugged into the workflow engine as a component, which improves scalability.
  • FATE-Flow is what it sounds like — FATE’s workflow. It schedules and manages the lifecycle to build an end-to-end flow of federated-learning production services.
  • FATE-Board provides federated-learning modeling tools to visualize and measure the entire training process.
  • FATE-Serving is a high-performance and scalable online federated-learning model-serving service that supports vertical federated learning cases.

For more details of FATE’s architecture, please refer to .  

Cloud-native federated learning solutions

Like other modular systems, FATE allows us to isolate dependencies and make wider use of small, well-tuned components. However, this also introduces the challenges involved with managing the complex configurations of each component and unifying all components into a single system. The federated-learning algorithms depend on different mathematical libraries, optimized instruction libraries, device drivers, multi-party computing libraries, encryption libraries, and other libraries. Compared to traditional ML, the network configuration is inherently more complex because it must consider parties from different organizations. Most of the time, some components need to be deployed in a DMZ network. The ports connected to the federation or the collaborating parties are usually limited by IT policies.

AI models, computing power, and data are the three core pillars of ML. But in deploying ML applications in the enterprise, there are further challenges, including:

  1. How to build a fully functional ML delivery pipeline
  2. How to manage the larger and more complex infrastructure environment, which hosts the ML applications
  3. How to optimize the efficiency and flexibility of the infrastructure for ML applications
  4. How to ensure resiliency and provide self-healing in the event of failure

These challenges have led to the proposal of a new discipline, AIOps or MLOps. If adopted, it would become ML’s fourth pillar.

The VMware OCTO team is working on solving the operations challenges for federated learning, based on our expertise in virtualization and infrastructure. As the main contributor to the FATE project, we proposed the concept of cloud-native federated learning, which treats the federated-learning system as a modern cloud application, then exploits the advantages of the cloud-computing delivery model.

We contributed two major projects for cloud-native federated-learning initiatives: KubeFATE and FATE-Operator.


KubeFATE

KubeFATE is designed to provision, orchestrate, operate, and manage FATE-based federated-learning systems on Kubernetes in datacenters or multi-cloud environments.

KubeFATE supports two deployment environments, Docker Compose and Kubernetes, for experiments and production, respectively:

For a quick trial or algorithm verification, we can deploy a preset two-party FATE environment on three machines that have Docker Compose installed. (Refer to this step-by-step guide.)

For production or serious experiments, we strongly recommend a Kubernetes deployment with KubeFATE, which provides the following advantages:

  • Declarative-style deployment on Kubernetes
  • Flexible customizable deployments
  • Deployment version management
  • Cluster management
  • Operations features, such as log aggregation
  • Support for different engines

KubeFATE supports the following engines with a simple declaration in deployment YAML:

  • Computing engine: EggRoll (roll-pair), Spark
  • Storage engine: EggRoll (egg-pair), HDFS
  • Transmission engine: EggRoll (roll-site), RabbitMQ, Pulsar

Kubernetes environment support

The Kubernetes deployment is based on Helm. KubeFATE provides a similar declarative YAML format to define what the system should look like. The following is a YAML example of deploying one party’s FATE cluster:

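# Parent keys such as imagePullSecrets, istio, podSecurityPolicy, modules, rollsite,
# partyList, and python are assumed from KubeFATE's example cluster.yaml; confirm
# them against the chart version you deploy.
name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.6.0
partyId: 9999
registry: ""
imageTag: "1.6.0-release"
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

backend: eggroll

rollsite:
  type: NodePort
  nodePort: 30091
  partyList:
  - partyId: 10000
    partyPort: 30101

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092

servingPort: 30095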

The YAML consists of several key sections:

  • The metadata of the cluster. It includes the name of the cluster, the Kubernetes namespace to which the cluster will be deployed, the party ID, whether Istio is enabled, whether the cluster data is persistent, etc.
  • The metadata of the chart. This includes the chart name and chart version. These two attributes determine the cluster type and version to be deployed. KubeFATE can deploy and manage different types of clusters organized by charts, such as FATE, FATE-Serving, FATE with Spark, and FATE with Spark and Pulsar. This section also covers the image registry, the pull policy, the pull secret, etc.
  • The modules to deploy in this cluster. To accommodate different infrastructure management policies, KubeFATE can deploy a subset of the chart's components as one cluster, so a deployment can be distributed across multiple namespaces or even across different Kubernetes instances.
  • The detailed configurations of each component.
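Because each party deploys its own cluster from a definition of this shape, the definitions for a federation can be generated programmatically. The sketch below builds one spec per party as a plain Python dictionary. It is only an illustration: `build_party_spec` is a hypothetical helper, the `rollsite` and `partyList` keys and the set of accepted `backend` values are assumptions, and a real definition needs further fields (such as peer addresses) that are omitted here.

```python
import json

# Assumed values for the `backend` field; only "eggroll" appears in the example above.
SUPPORTED_BACKENDS = {"eggroll", "spark"}


def build_party_spec(party_id, peer_ports, backend="eggroll", chart_version="v1.6.0"):
    """Return a cluster definition for one party as a dictionary.

    peer_ports maps each collaborating party ID to the port it exposes.
    """
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    return {
        # Cluster metadata
        "name": f"fate-{party_id}",
        "namespace": f"fate-{party_id}",
        "partyId": party_id,
        # Chart metadata
        "chartName": "fate",
        "chartVersion": chart_version,
        # Engine selection
        "backend": backend,
        # Per-component configuration (key names assumed for illustration)
        "rollsite": {
            "type": "NodePort",
            "partyList": [
                {"partyId": peer_id, "partyPort": port}
                for peer_id, port in sorted(peer_ports.items())
            ],
        },
    }


# Two parties; every party lists all the others as peers.
exposed_ports = {9999: 30091, 10000: 30101}
specs = {
    party_id: build_party_spec(
        party_id,
        {peer: port for peer, port in exposed_ports.items() if peer != party_id},
    )
    for party_id in exposed_ports
}

print(json.dumps(specs[9999], indent=2))
```

Generating the definitions from a single table of parties keeps the peer lists consistent, which is easy to get wrong when each party's file is edited by hand.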

More details about the definitions can be found here. We will discuss how to customize the clusters’ deployment in subsequent blog posts.

Below is a high-level architecture diagram of KubeFATE:

Source: Cloud Native Federated Machine Learning with KubeFATE (Office of the CTO Blog)

It includes two parts:

  1. The KubeFATE command-line tool (CLI), which offers the most common management operations for the FATE cluster.
  2. The KubeFATE service, which is deployed as an application in Kubernetes. It exposes REST APIs that are designed in a way that can be easily extended and integrated into existing cloud-management systems.

More details here.


FATE-Operator

A Kubernetes operator is a method of packaging, deploying, and managing Kubernetes applications. The operator pattern automates the care of a specific application or service: it encodes how the application ought to behave, how to deploy it, and how to react when problems occur. An operator consists of a Kubernetes custom resource definition (CRD) and an associated controller. FATE-Operator is another important project we contributed, as an official sub-project of Kubeflow, and it enables federated learning on cloud-native platforms.
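The heart of the operator pattern is a reconcile loop: the controller compares the desired state declared in a custom resource with the state it observes, then acts to close the gap. The following is a conceptual sketch in plain Python, with no Kubernetes client involved. `ClusterState` and `reconcile` are hypothetical names, and this is not how FATE-Operator itself is written.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical stand-in for what is actually running in the cluster."""
    running_modules: set = field(default_factory=set)


def reconcile(desired_modules, state):
    """Move the observed state toward the desired state and report the actions taken."""
    actions = []
    # Start whatever is declared but not yet running.
    for module in sorted(desired_modules - state.running_modules):
        state.running_modules.add(module)
        actions.append(f"start {module}")
    # Stop whatever is running but no longer declared.
    for module in sorted(state.running_modules - desired_modules):
        state.running_modules.remove(module)
        actions.append(f"stop {module}")
    return actions


desired = {"rollsite", "python", "fateboard"}
state = ClusterState(running_modules={"python", "mysql"})

first_pass = reconcile(desired, state)
second_pass = reconcile(desired, state)

print(first_pass)   # ['start fateboard', 'start rollsite', 'stop mysql']
print(second_pass)  # []
```

The second pass returns no actions because reconciliation is idempotent: once the observed state matches the declaration, running the loop again changes nothing. A real controller runs this comparison whenever the custom resource or the cluster changes.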

Kubeflow is a de facto cloud-native ML platform for developing and deploying ML applications on Kubernetes. Operating Kubeflow in an increasingly multi-cloud and hybrid-cloud environment will be a key topic as the market (and Kubernetes adoption) grows. Kubeflow provides a way to operate the full ML lifecycle. Developing and deploying an ML application typically involves several stages, such as identifying the problem, data engineering, choosing algorithms and coding the model, experimenting with data, tuning the model, training the model, and serving the trained model. Kubeflow provides components for each stage, as well as a pipeline for building, deploying, and managing the workflow.

Kubeflow training is a group of Kubernetes operators that add support to Kubeflow for distributed training of ML models using different frameworks. The FATE-Operator is what we added to Kubeflow to enable federated-learning capability.

The FATE-Operator contains three CRDs:

  1. Kubefate: for deploying a management service for FATE
  2. FateCluster: for deploying a FATE cluster
  3. FateJob: for submitting and running a federated-learning job to a deployed FATE cluster

The typical use cases of FATE-Operator are:

  • Enabling federated learning in Kubeflow and deploying the KubeFATE with the Kubefate CRD
  • Deploying a FATE cluster with the FateCluster CRD when a federated-learning job arrives that involves a new collaborating party
  • Submitting and running a federated-learning job with FateJob CRD

Outside of Kubeflow, the Kubefate and FateCluster CRDs can also be used directly to deploy and manage KubeFATE and FATE in a Kubernetes cluster, such as a Tanzu Kubernetes Grid cluster.

Next steps

We have explored the advantages of and potential for federated learning and briefly discussed the projects based on cloud-native technologies to enable it in production. In subsequent posts, we will delve deeper into the practical details of KubeFATE and how we can use it to provision and manage the FATE cluster.

