A challenging problem for the machine-learning community is how to leverage data generated at the edge to build better models. As individuals, we produce plenty of valuable data — both through institutions (medical or financial, for example) and via personal devices. This data could be harnessed and used for multiple applications, from the mundane (such as building better auto-complete apps) to the important (such as the creation of tools to prevent fraud or save lives) to the futuristic (such as building safe self-driving cars or more intelligent “smart homes”).
While centralizing the data could help, this approach is often impossible, due to issues with data privacy, data access rights, or compliance regulations. It can also be expensive. These concerns have given rise to the development of decentralized federated machine learning (FML) — harnessing data to build models without compromising the security or privacy of data belonging to separate entities.
Of course, FML has its own challenges, as well. In this blog post, we detail our recent progress in addressing one of them, namely, improving FML communication efficiency by reducing the unprecedented amounts of traffic required to train models in a federated setup. For example, our experiments revealed that training a ResNet-152 model using a decentralized CIFAR-100 dataset over 10 participants may require more than 26 TB of data exchange over the network to reach acceptable accuracy.
Deconstructing communication bottlenecks
To understand the source of such communication overheads, we must dive deeper into the FML training procedure. Usually, these training procedures consist of thousands of rounds, wherein each round’s participants obtain the updated model, perform local optimization steps (e.g., using stochastic gradient descent), and then exchange the resulting model updates (also known as gradients) over the network for the next round. The specific nature of participants (silos vs. devices) and how they communicate, obtain the updated model, and exchange updates depend on the particular setup (e.g., cross-silo or cross-device, synchronous or asynchronous, via a centralized coordinator — such as a parameter server — or in a distributed manner). However, most of these setups have two things in common:
- The size of a gradient is proportional to the number of parameters in the trained model (which may be many millions, or even billions, of parameters).
- The underlying building block is distributed mean estimation (DME), where the goal is to estimate the (possibly weighted) mean of several/many large vectors (for example, the gradients) using a limited communication budget.
Distributed mean estimation
In DME, we have n clients, each with a d-dimensional vector, that it sends over the network to a receiver using a message of limited size. The goal of the receiver is to estimate the average of these vectors in a way that optimizes some optimization criteria. While the optimization criteria may differ among setups, the receiver’s goal is often to produce an unbiased estimate of the mean (the expected value of the resulting estimate must equal the true mean) in a way that minimizes the estimation variance. How can we obtain such good estimates without overwhelming the communication network? Compression.
In particular, we focus on quantization techniques that reduce the number of bits allocated for each vector coordinate. Quantization can also be used in conjunction with additional techniques, such as sparsification and entropy encoding, but this is outside of the scope of this post.
When using quantization, each vector’s coordinate is associated with one of a small number of quantization values. This information can be encoded using a number of bits that is logarithmic with respect to the number of quantization values. This information — along with the quantization values themselves — is sufficient for the receiver to recover an estimate of the original vector.
For example, 1-bit stochastic quantization (i.e., stochastic quantization using two quantization values) can be done as follows:
For a vector v with d entries. We can send each coordinate vi as:
where it is reconstructed by the receiver as
This simple technique results in an unbiased quantization, which is the intention. However, it yields a high quantization variance that depends on the vector’s extreme (i.e., minimum and maximum) values. This generally results in slower convergence and lower final model accuracy.
The above problem persists in a k-bit stochastic quantization where, similarly to the 1-bit case, each coordinate is quantized in an unbiased manner to one of a 2k quantization values that are uniformly spaced between the vector’s extreme values (i.e., uniform quantization levels).
Can this be improved? Yes. For example, we can reduce the vector’s dynamic range (i.e., max(v)-min(v)) by transforming the vector prior to quantization to reduce the dynamic range which, in turn, reduces the quantization variance. Suresh et al. does exactly that. They leverage shared randomness to transform a vector prior to stochastically quantizing it. This allows the receiver to use this shared randomness to generate the inverse transform. These days, obtaining shared randomness is as easy as agreeing on a seed to a pseudo-random number generator. (Suresh et al. admits efficient PyTorch and TensorFlow implementations.)
So, is that the entire solution? No. While achieving significant progress over the vanilla stochastic quantization variants and providing improved theoretical error guarantees, such methods still mainly employ the same uniform quantization levels that are often significantly sub-optimal, since they are distribution-agnostic. Knowing the vector’s distribution ahead of time could enable designing quantization that would be optimized for that distribution, significantly reducing the estimation variance.
Our first breakthrough — DRIVE
In “DRIVE: One-bit Distributed Mean Estimation”, published in NeurIPS 2021, we note a few observations that allowed us to design a new quantization algorithm that uses a single bit per vector entry (plus a small bit of overhead information), and has remarkable performance and theoretical guarantees that outperforms other quantization schemes — even when those are allowed to use a larger communication budget. In particular, we observed that for high dimensional vectors, there exists a fairly simple and well-known linear transformation (known as a random rotation) with remarkable properties not previously utilized in the context of FML and DME. In particular:
- It produces a vector of the same dimension, but its entries are approximately distributed as i.i.d. normal random variables in ways that we can make use of in both our compression algorithm and its theoretical analysis. Most importantly, this guarantee is independent of the input vector distribution (i.e., it depends only on the vector’s norm, which is preserved by the random rotation).
- It is possible to use a deterministic quantization (instead of a stochastic one) to quantize a randomly rotated vector and still produce unbiased estimates by properly scaling a vector before or after making the inverse rotation at the receiver.
These two properties led us to design DRIVE, which stands for “Deterministically RoundIng randomly rotated VEctors”. In DRIVE, a sender randomly rotates its vector and sends only the sign of the rotated coordinates to the receiver, along with a single carefully chosen float termed as the vector’s scale. The receiver simply rotates back the sign vector using the shared randomness and multiplies the result by the scale. Importantly, the senders generate their rotations independently. That’s it!
We have evaluated DRIVE on a collection of distributed and federated-learning tasks using a variety of datasets and have observed a consistent improvement over the state-of-the-art DME techniques — even when those are allowed to use a larger communication budget. Intuitively, the source of DRIVE‘s efficiency is in producing unbiased estimates using deterministic quantization instead of a stochastic one and doing so on a vector whose entries are normally distributed and so are strongly concentrated around their mean. This significantly reduces the quantization variance.
Our second breakthrough — EDEN
Motivated by the success of DRIVE, in “EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning,” accepted to ICML 2022, we were able to further build on the intuition behind DRIVE to design a new algorithm — EDEN, which stands for “Efficient DME for diverse Networks,” that generalizes to arbitrary bandwidth constraints, heterogeneous clients, and lossy networks.
Similar to DRIVE, EDEN also uses random rotations. To enable its design, we developed a new formalization of the quantization framework. While previous works define the quantization via a set of quantization values, our solution requires choosing a set of intervals (whose union covers the full set of real numbers). Next, each point is quantized to the center of mass of its interval and not to the closest quantization value, which is counterintuitive. To minimize EDEN’s estimation error, we optimize the choice of such intervals for a rotated vector, where each coordinate value is assumed to take on a value from a normal distribution.
We have extensively evaluated EDEN using different scenarios, bandwidth constraints, and network conditions and observed a consistent improvement over the state-of-the-art DME techniques. For example, consider a federated scenario with 10 participants training a ResNet18 model using the CIFAR100 dataset. In our experiments (see EDEN for complete details), we demonstrated that EDEN achieves a competitive accuracy with the uncompressed baseline using only 0.5 bits per coordinate (a compression ratio of 64). No other tested DME techniques converged with such a low-bandwidth constraint.
You’ll find many exciting details in our papers presenting DRIVE and EDEN, as well as in our open-source repository for DRIVE and EDEN (also, see recent DRIVE implementation by Google research). For example, we efficiently rotate a vector using the randomized Hadamard transform (similarly to the technique by Suresh et al.), perform quantization in linear time for any number of quantization values, and more.
Figure 1 illustrates a high-level diagram of DRIVE and EDEN and how they utilize shared randomness to perform coordinated random transformations (i.e., rotations) among the participants and the parameter server.
Figure 2 focuses on a single participant and how we utilize our knowledge of the vector’s distribution after the random rotation to produce accurate estimations relying on deterministic quantization and carefully chosen scaling to ensure unbiased estimations at the receiver.
Stay tuned: unleashing the true power of shared randomness
Both DRIVE and EDEN are solid steps toward a more network-efficient FML. Both utilize shared randomness to perform the transformation at the senders and the inverse transformation at the receivers. But we can do even more with shared randomness.
In a related prior investigation (leading to a paper published in ICALP 2021), we discovered that shared randomness could be used to produce lightweight and more accurate quantization schemes than ever before, even for the very simple scenario of sending an estimate of a real number using a single bit and some shared randomness (see this blog post). Motivated by those findings and by our progress with DRIVE and EDEN, we are now investigating a new research question. One problem with DRIVE and EDEN is that each sender uses its own rotation and the receiving server must invert each rotation to obtain the vectors used to compute the estimated vector mean. If we could instead use the same rotation (derived from the same shared randomness) for all the senders in a round, the receiving server could first add the compressed vectors and then do a single inverse rotation on this sum to get the estimate, reducing the computation significantly — one rotation per round, instead of one rotation per round per sender. But DRIVE and EDEN lose their guarantees if all the participants use the same rotation. Can we achieve accuracy guarantees similar to DRIVE’s and EDEN’s by using the same random rotation’s shared randomness for all senders, reducing the computational complexity at the receiver?
While beyond the scope of this blog post, we invite the reader to peek at our most recent work, QUIC-FL, which tackles this challenge and reveals another new and exciting approach for quantization with different tradeoffs and competitive performance. Figure 3 illustrates the high-level block diagram of QUIC-FL that leverages the shared randomness not only for making the rotations but also to perform an efficient stochastic quantization that is optimized for the rotated vector’s distribution.
We look forward to seeing these compression schemes used in FML systems and continuing to look further into improving such schemes for FML.
Acknowledgments: We would like to thank our co-authors and collaborators (in alphabetical order): Ran Ben Basat (UCL), Gil Einziger (BGU), Gal Mendelson (Stanford), Michael Mitzenmacher (Harvard), and Amit Portnoy (BGU).