Federated Machine Learning: Overcoming Data Silos and Strengthening Privacy

The success of artificial intelligence relies heavily on the quantity and quality of the data used to train effective prediction models. In practice, data often sits inside individual organizations as isolated silos. Organizations cannot share it, usually because of business competition or privacy-protecting laws and regulations. As a result, many organizations and departments have limited or poor-quality data, which obstructs the training of meaningful machine-learning (ML) models. Federated learning is one of the most promising ML technologies for overcoming data silos: it strengthens data privacy and security while still complying with laws and regulations such as the General Data Protection Regulation (GDPR).

Concepts and categories

Following are the key characteristics of federated learning:

  1. Data from all parties is stored locally, ensuring data privacy and compliance with laws and regulations.
  2. Multiple parties contribute data to develop a global model from which they can mutually benefit.
  3. All parties in the federation are of equal status.
  4. The modeling performance of federated learning is the same as the result achieved by aggregating all datasets or, when user alignment or feature alignment of the data is required, only slightly different from it.

Imagine two enterprises, A and B, each with its own unique data. Because of GDPR, the two enterprises cannot simply merge their data. Federated learning can build a global model from parameters exchanged under an encryption mechanism while staying compliant with data-privacy laws and regulations. The resulting model performs comparably to one built by aggregating both enterprises’ data. The difference is that the data does not move during the training of the global model. In other words, in contrast to traditional machine learning, federated learning does not gather data from another organization: each organization trains a partial model on its local data in its own datacenter.
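To make the parameter exchange concrete, here is a minimal sketch of weighted federated averaging between two parties, written in plain Python. It is an illustration under stated assumptions, not the method of any particular platform: the linear model, the helper names (local_update, federated_average, make_data), the learning rate, and the synthetic datasets are all hypothetical, and the encryption of the exchanged parameters is deliberately left out.

```python
import random


def local_update(weights, data, lr=0.5, epochs=5):
    """Run a few epochs of gradient descent on one party's private data.

    The model is y = w0 + w1 * x. Only the updated weights are returned;
    the raw (x, y) records never leave this function.
    """
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        grad0 = sum((w0 + w1 * x - y) for x, y in data) / n
        grad1 = sum((w0 + w1 * x - y) * x for x, y in data) / n
        w0 -= lr * grad0
        w1 -= lr * grad1
    return (w0, w1)


def federated_average(updates, sizes):
    """Combine local weights, weighting each party by its sample count."""
    total = sum(sizes)
    return tuple(
        sum(w[i] * n for w, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    )


def make_data(rng, n):
    """Hypothetical private dataset drawn from y = 1 + 2x plus small noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 1.0 + 2.0 * x + rng.gauss(0, 0.05)))
    return data


rng = random.Random(0)
data_a = make_data(rng, 40)  # stays in enterprise A's datacenter
data_b = make_data(rng, 60)  # stays in enterprise B's datacenter

global_weights = (0.0, 0.0)
for _ in range(50):  # communication rounds
    update_a = local_update(global_weights, data_a)
    update_b = local_update(global_weights, data_b)
    # Only the two weight tuples are exchanged (unencrypted in this sketch).
    global_weights = federated_average(
        [update_a, update_b], [len(data_a), len(data_b)]
    )

print(global_weights)  # should land close to the true (1.0, 2.0)
```

Weighting by sample count is one common design choice; it keeps a party with more data from being drowned out by a smaller one. A production system would also protect the exchanged weights, for example with the encryption mechanism mentioned above.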

In practice, siloed data has different distribution types. Depending on the differences in features and sample data space, federated learning may be classified into horizontal federated learning and vertical federated learning.

Horizontal federated learning is also called “sample-partitioned federated learning” or “homogeneous federated learning,” which means that datasets share the same feature space but differ in samples. With horizontal federated learning, parties that each hold only a relatively small or partial dataset can still improve the performance of the trained model without pooling their data into one large dataset.

Source: Yang, Qiang & Liu, Yang & Chen, Tianjian & Tong, Yongxin. (2019). Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology. 10. 1-19. 10.1145/3298981.

Vertical federated learning is also called “feature-partitioned federated learning” or “heterogeneous federated learning.” It applies to cases in which two or more datasets with different feature spaces share the same sample IDs. With vertical federated learning, we can train a model on attributes held by different organizations to build a fuller profile of each sample.

Source: Yang, Qiang & Liu, Yang & Chen, Tianjian & Tong, Yongxin. (2019). Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology. 10. 1-19. 10.1145/3298981.

In some academic papers, you may see “federated transfer learning,” a term that applies to scenarios in which two or more datasets differ in both feature space and samples. However, federated transfer learning is still a developing technology in academic circles.
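The three categories above can be told apart by comparing feature names and sample IDs. The following sketch shows one simplified way to do that; the party names, records, and helper functions are hypothetical, and real datasets usually overlap only partially, so treat the rules as a rough illustration rather than a formal definition.

```python
def feature_space(party):
    """Return the set of feature names that appear in a party's records."""
    return set().union(*(row.keys() for row in party.values()))


def partition_type(p, q):
    """Classify two datasets by how their features and sample IDs overlap."""
    if feature_space(p) == feature_space(q):
        return "horizontal"  # same features, different samples
    if set(p) & set(q):
        return "vertical"    # different features, shared sample IDs
    return "transfer"        # different features, different samples


def aligned_ids(p, q):
    """Sample IDs present in both datasets, as vertical training would need."""
    return sorted(set(p) & set(q))


# Hypothetical toy records keyed by sample ID.
bank_a = {"u1": {"age": 34, "income": 52}, "u2": {"age": 45, "income": 61}}
bank_b = {"u3": {"age": 29, "income": 48}, "u4": {"age": 51, "income": 75}}
retailer = {"u2": {"basket": 3}, "u3": {"basket": 7}}
insurer = {"u9": {"claims": 1}}

print(partition_type(bank_a, bank_b))    # horizontal
print(partition_type(bank_a, retailer))  # vertical
print(partition_type(bank_a, insurer))   # transfer
print(aligned_ids(bank_a, retailer))     # ['u2']
```

The plain set intersection in aligned_ids is for illustration only. It would reveal each party's IDs to the other, so real systems typically align samples with privacy-preserving techniques instead.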

Use cases

As federated learning matures, we see a growing number of industrial use cases for it. More and more application scenarios are built on Federated AI Technology Enabler (FATE), an open-source federated learning platform hosted by the Linux Foundation, including auto-insurance pricing, credit-risk management, sales forecasting, smart security, assisted diagnosis, smart advertising, and autonomous driving.

Federated learning-based smart security

Smart security is a key component of smart cities. In traditional security scenarios, cameras collect basic data, and IT systems and multiprocessors process it. Control rooms are set up for monitoring, supplemented by manual detection of dangerous behaviors. However, this approach has disadvantages, such as:

  1. The processes are lengthy, which leads to high labor costs and low efficiency.
  2. Existing anomaly definitions rely on subjective considerations, which may lead to errors and misjudgments in an early-warning system.
  3. The data collected comes from cameras, access cards, and other sources (which are not correlated with each other). This data is siloed, lowering its value.

By utilizing federated learning and multi-community data to build a security model, data can interconnect and communicate across multiple communities, forming smart security networks with overlapping dimensions. With cloud computing and big-data analysis, smart security systems engage in continuous post-incident summary and self-learning.

Federated data from videos, sensors, and information software is collected, sorted, and analyzed to provide more secure, accurate risk-prediction services. In this case, the federated learning model (based on data collected from 10 communities) was better than single-community models in all aspects, including accuracy, precision, and the receiver operating characteristic (ROC) curve. Even when federated learning was applied to just two communities with fewer available samples, the accuracy of the federated learning model was about three percent higher than that of a single-community model.

Federated learning-based credit-risk management

For credit-risk management, the cost of a credit review for a single customer is relatively high, because it calls different data APIs throughout the whole process. (Examples are the cost of calling APIs for identity verification and credit checks in consumer finance and micro and small enterprises, also known as MSEs.) Moreover, when they receive MSE credit requests, banks and other financial organizations often lack useful data about enterprise operations, which complicates and delays financing.

Federated learning and a federated data network can help credit-risk management organizations simplify pre-approval procedures. Starting from risk sources, the solutions help companies filter out blacklisted or invalid samples to further reduce credit-review costs in the later stages of the loan-approval process.
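The filtering step described above can be sketched as a simple gate in front of the paid review APIs. This is a hedged illustration only: the application records, the required fields, and the flagged_ids set (standing in for the output of a federated risk model) are all hypothetical, and the function name pre_approve is not taken from any real product.

```python
def pre_approve(applications, flagged_ids, required_fields=("id", "amount")):
    """Split applications into those worth a paid review and those rejected early."""
    to_review, rejected = [], []
    for app in applications:
        # An application is invalid if any required field is missing or empty.
        invalid = any(app.get(field) in (None, "") for field in required_fields)
        if invalid or app.get("id") in flagged_ids:
            rejected.append(app)
        else:
            to_review.append(app)
    return to_review, rejected


# Hypothetical loan applications from small enterprises.
applications = [
    {"id": "e1", "amount": 10_000},
    {"id": "e2", "amount": 25_000},
    {"id": "e3", "amount": None},  # invalid: missing amount
    {"id": "e4", "amount": 8_000},
    {"id": "e5", "amount": 40_000},
]
flagged_ids = {"e2"}  # hypothetical output of a federated risk model

to_review, rejected = pre_approve(applications, flagged_ids)
calls_saved = len(rejected)  # one paid API call avoided per early rejection
print(f"paid reviews needed: {len(to_review)}, avoided: {calls_saved}")
```

Only the applications in to_review would go on to the identity-verification and credit-check APIs, which is where the cost saving comes from.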

The second challenge of MSE credit-risk management is poor data quality, including a low volume of labeled (Y) samples, poorly differentiated samples, and sample distributions that deviate from the normal distribution. Federated learning enables consumer-finance and credit institutions to continuously accumulate business data, optimize their models from a cold start, and then apply closed-loop AI modeling, small-sample modeling, continuous model optimization, and other advanced solutions.

Another challenge of MSE credit-risk management is scarce and incomplete historical source data. Federated learning enables a multi-source data-fusion mechanism to include transaction data, taxation, reputation, finance, intangible assets, and other MSE data to help financial institutions enrich their feature spaces without compromising data privacy or security. Vertical federated learning prevents data leakage and helps obtain an equivalent (or as close as possible) performance of the full-data model.

Using FATE, WeBank, the first digital bank in China, trained a federated model with a customer’s invoice data. It found that the performance of the risk-management model improved by around 12%, which reduced the expected credit-review costs of consumer-finance institutions by 5% to 10% and improved their credit-risk management capabilities. Expected API call costs fell by 20% to 30% during pre-approval because blacklisted and invalid samples were eliminated.

What’s next

An emerging ML technology, federated learning has generated significant interest because of its potential. In our next post, we will introduce the open-source federated learning platform and the cloud-native projects that help us kickstart and manage the lifecycle of federated learning.
