In the upcoming era of edge computing, the capability of an edge device to perform local and fast training and classification is of increasing need. Indeed, whether it is a health application on a mobile phone, a sensor on a refrigerator, or a camera on a robot vacuum cleaner, a local computation is often desired due to many reasons, such as the need for a quick response time, enhanced security, data privacy, and even profitability concerns.
Whether done in a federation (i.e., Federated Machine Learning) or a centralized manner, having heterogeneity, limited connectivity, and limited hardware resources for such devices is an ongoing challenge that we must address. In fact, one is often faced with contradicting requirements where an edge device has to locally carry out a substantial amount of computation, storage, and communication while adhering to resource constraints such as limited memory, network connectivity, and compute. This is often due to timing or power constraints and increasing amounts of data and available information.
In this blog post, we focus on compute and memory efficient local (i.e., on-device) training of and classification by decision-tree-based ensemble methods, which are the de-facto standard for tabular data. To provide additional context, we invite you to look at our recent blog post discussing bandwidth reduction techniques for federated machine learning.
Decision-tree-based ensemble methods
Indeed, tree-based ensemble methods, such as a Random Forest (RF) and XGBoost, are often used to classify tabular data due to their robustness, ease of use, and generalization properties. In turn, such classification tasks are commonly used as a subroutine in many application domains, such as finance, fraud detection, health care, and intrusion detection. Therefore, such methods’ efficiency and accuracy-to-utility tradeoffs are of major importance.
Let’s focus specifically on RF and XGBoost and understand their advantages:
- RF, arguably the most popular bagging method, grows each decision tree using a random subsample of the data, resulting in different and weakly correlated trees. Then, a majority vote is used to determine the classification. RFs have several advantages, including robustness, fast training, capability to deal with imbalanced datasets, embedded feature selection, handling missing, categorical, and continuous features, and interpretability for advanced human analysis or whenever required by regulations.
- XGBoost, a well-known and popular boosting algorithm, grows a small decision tree (e.g., with 8–32 terminal nodes) at each iteration. Each such tree is designed to reduce the misclassifications of previous trees. XGBoost shares most of the advantages of RF and often achieves a higher accuracy due to its controlled bias.
Nevertheless, these methods also introduce some resource-related drawbacks:
- RFs tend to be memory bound and slower at classification. Moreover, due to their large memory footprint, RFs often cannot be deployed over edge devices with limited memory, where classification tasks are often desired to take place.
- XGBoost models often require less memory than RFs, but they are compute-intensive, resulting in slower training.
In this blog post, we tackle the resource consumption drawbacks of both RF and XGBoost. In particular, we introduce a new hybrid approach that inherits the good properties of both bagging and boosting methods with a comparable machine learning (ML) performance while being significantly more resource efficient.
Redundancy in datasets
It is fairly common knowledge that the resource consumption of ML models is highly correlated with the size of the dataset used for their training. Thus, reducing the dataset size is desirable. Accordingly, we would like to raise the following questions:
- Are all the data instances in a dataset identically important for the training of a tree-based ensemble model?
- If no, how should we differentiate the data instances during training to save resources? How would that affect classification?
Indeed, often, a dataset holds many data instances that are simpler (e.g., 90%) in the sense that they are easily recognizable and therefore easy to classify; and data instances that are rare or more unique and therefore are harder to classify.
Intuitively, if such differentiation is available prior to training, one should be able to utilize this knowledge to save resources without a meaningful impact on accuracy. One such idea is to use fewer “easy” data instances during training. The challenge is how to produce such schemes and do so in an efficient manner.
RADE – resource-efficient anomaly detection model
RADE targets the supervised anomaly detection use case; namely, the dataset has only two classes where most instances are Normal or benign (e.g., 99%).
RADE leverages the above observation in the following manner: it first builds a small (coarse-grained) model using the entire dataset. It then uses this model to classify all the data instances used for its training. Correctly classified instances with high confidence are tagged as easy (often, most of the Normal instances), while all other instances are tagged as hard (often, most of the anomalous instances)
As illustrated in Figure 1, RADE introduces a high-level architecture that can be used with different classification models as long as the coarse-grained model provides a meaningful classification confidence level along with its classification.
Intuitively, since the coarse-grained model is sufficient to classify easy queries correctly, we are only left with hard queries. Using those, we build two expert (fine-grained) models that are designed to address two different cases concerning the classification result of the coarse-grained model: fine-grained model 1 is responsible for (low-confidence) Normal
classifications and fine-grained model 2 for (low-confidence) Anomaly classifications. These models might have a larger memory footprint than the coarse-grained model, yet significantly lower than that of monolithic tree-based models.
Training efficiency: As described, RADE is trained in two phases. First, the coarse-grained model is trained using the entire dataset. This training phase is relatively fast since we use a small model. Then, we classify the entire training dataset using the coarse-grained model and, based on the resulting classifications and confidence levels, produce two data subsets for the training of the fine-grained models. This training phase is also relatively fast since we train each fine-grained model with only a subset of the training data (e.g., 10%).
Classification efficiency: the classification of RADE also has two phases. First, an instance is classified by the coarse-grained model. If the resulting classification confidence is high (e.g., 0.9), then the classification is completed. Otherwise, the query is forwarded for re-classification by one of the fine-grained models according to the coarse-grained model classification result. This means that the classification time of RADE equals a weighted average of classification times: classifications that are served only by the coarse-grained model and those that are served by both the coarse-grained model and a fine-grained model. Intuitively, since the aim is
to serve most of the queries only by the coarse-grained model (e.g., 90%), the average classification time is expected to improve significantly over standard monolithic models.
Duet – resource-efficient multi-class classification model
RADE is designed for binary classification and therefore uses two fine-grained models. Extending RADE’s architecture, as is, for multi-class classification use cases is not a scalable solution as it would require K fine-grained models for K classes, and thus a new architecture is needed.
To that end, we developed Duet. Duet follows RADE’s principles, i.e., it uses a coarse-grained model that is trained over the entire dataset and classifies the easy queries. However, unlike RADE, Duet uses only a single fine-grained classifier that is trained over a subset of the training dataset and classifies the hard (low-confidence) queries.
Duet’s high-level architecture is illustrated in Figure 2. Essentially, it utilizes two classifiers: a small bagging classifier (RF) that offers meaningful classification confidence, controlled variance, and memory bound, and a boosting classifier (XGBoost) that offers controlled bias and is compute bound. Overall, Duet introduces a different and often better system/ML performance tradeoff compared to monolithic classifiers.
The main question is how to determine the subset of the training data for the boosting classifier.
Using the confidence metric as done in RADE is insufficient in this case since we have a multi-dimensional problem. So instead, we use the class probability distribution vector (by the coarse-grained model) to determine how vital an instance is for the training procedure.
Consider two classification results with identical (top-1) classification confidence for a classification task with six classes. The first with a class distribution of (almost) equal probabilities over two classes (e.g., [0.5, 0.5, 0, 0, 0, 0]), and the second with a higher probability for the correct class and much lower probabilities for all the remaining classes (e.g., [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]). Obviously, the first query is harder and may be more beneficial for the training of the boosting classifier.
To capture this property, we defined a new metric – predictability that is given by the euclidian distance function that measures how far the resulting class probability distribution vector is from the perfect distribution vector with respect to the true label of the instance. A perfect distribution vector should have a probability of one in the correct label (i.e., class).
Notice that predictability is a specific metric for a given trained bagging classifier and a training dataset and therefore is an integrated step in the training process of Duet. Also, using
predictability is different than sampling methods that only depend on the dataset properties (e.g., stratified sampling, which preserves the percentage of instances per class).
Further potential and materials
The above design principles can potentially be adapted to other ML domains. While Duet mainly uses the predictability measure to reduce the computation cost of the training and classification procedures, one can use the predictability measure to reduce a training dataset’s size and therefore reduce storage and communication costs.
Also, RADE and Duet can potentially be adapted for distributed and federated learning settings. For example, the participants can jointly train a coarse-grained model, then use this model to choose a subset(s) from their local datasets, and finally train global fine-grained model(s) using these subsets.
The challenge here is determining the criteria by which a participant chooses its subset. For example, an advanced approach would be to use secure computation to collect some statistical information about the global training data. Such statistics may result in a better subset selection by each participant that, overall, has a higher training value for distributed/federated training of the fine-grained model(s).
If you’re interested in learning more, including our usage of confidence thresholds, predictability, and multiple evaluation results over servers and resource-constrained devices such as a Raspberry Pi, take a look at the following papers presenting RADE and Duet.
If you want to get your hands on the code, both RADE and Duet are implemented as sci-kit classifiers and are publicly available on GitHub.