Analogous to the role of the software-development lifecycle (SDLC), the machine learning model-development lifecycle (MDLC) guides the activities of ML model development from inception through retirement. In this article, we outline the key phases of the MDLC — including data ingestion, exploratory data analysis, model creation, and model operation.
The MDLC is depicted in the following diagram, with the typical ordering of steps:
Data ingestion
Training a predictive model is about learning from past events using captured data. Having high-quality data is, therefore, a prerequisite for any machine learning activities. Data is scattered across many persistent locations (such as log files, event streams, or databases), often in a highly dispersed, geographically distributed environment. Data ingestion is the process of gathering data from its original sources into a central data repository with strong analytic-processing capabilities to allow for extensive data processing and analysis.
Data ingestion is typically done with a data pipeline that performs the following steps.
- Extract raw data from its originating source. There are two modes:
  - In the polling mode, data extraction is initiated by the pipeline at certain time intervals. Data deltas (new data since last extraction) will be extracted by the pipeline.
  - In the pushing mode, data sources upload new data continuously into a data stream managed by the pipeline.
- Transform the extracted data into a form suitable for subsequent processing.
- Validate the data against predefined quality criteria, such as data structure (for example, each record has the correct number of attributes), data types (for example, a number is not received as a string), data values (for example, 200 is rejected as an invalid human age), and missing values (for example, a mandatory attribute is absent). Data should also be assessed to determine whether it exhibits any bias. This is especially important when handling data that represents personally identifiable information (PII), since biased data can lead to biased models.
- Data passing the validation process will be transformed based on the schema defined in the data warehouse. Irrelevant attributes can be discarded and additional attributes can be added to enrich the uploaded data.
- Store the data in an appropriate format optimized for access. Some common types of data are listed below. In addition, it is common to store image and video data in a file system or object store, such as S3.
  - Tabular data is probably the most common and is organized as rows and columns, where each row represents an instance (or sample) and each column represents an attribute of the instance. SQL databases are typically used.
  - Time-series data is a special case of tabular data, with one column representing a time dimension. Specialized indexing mechanisms are available to accelerate processing when using the time dimension.
  - Free-form text (such as email, online chat, and news websites) is also very common, especially when performing natural language processing (NLP). It is typically stored as a forward index (from document ID to document content) and an inverted index (from words to document IDs) to accelerate keyword queries.
  - Graph data is common when the data represents different types of entities with complex interrelationships. Entities are typically represented as “nodes” and relationships are represented as “edges.” Specialized graph databases can be used to accelerate access (for example, navigating through edges to explore neighboring nodes).
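To make the validation step more concrete, here is a minimal sketch of a record validator that covers the four kinds of checks mentioned above (structure, data types, data values, and missing values). The schema, the attribute names, and the age bound of 130 are assumptions made up for illustration; a real pipeline would derive them from the data warehouse schema and business rules.

```python
from typing import Any

# Hypothetical schema for illustration: attribute name -> expected Python type.
SCHEMA = {"customer_id": str, "age": int, "country": str}
# Hypothetical set of attributes that must be present and non-null.
MANDATORY = {"customer_id", "age"}


def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    # Structure: the record must not carry attributes outside the schema.
    unexpected = set(record) - set(SCHEMA)
    if unexpected:
        errors.append(f"unexpected attributes: {sorted(unexpected)}")
    # Missing values: every mandatory attribute must be present and not None.
    for name in sorted(MANDATORY):
        if record.get(name) is None:
            errors.append(f"missing mandatory attribute: {name}")
    # Data types: each value that is present must match the expected type.
    for name, expected in SCHEMA.items():
        value: Any = record.get(name)
        if value is not None and not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Data values: reject implausible values, such as a human age of 200.
    # The upper bound of 130 is an assumed business rule.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 130:
        errors.append(f"age out of range: {age}")
    return errors
```

Records that return a non-empty error list could be routed to a quarantine area for inspection rather than silently dropped, so that recurring quality problems at a data source become visible.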
Exploratory data analysis
With exploratory data analysis (EDA), we begin our exploration of the data we have collected, understand what is going on, combine our findings with business domain knowledge and generate innovative ideas for products and services. EDA is commonly used as a first qualifying step before investing in the effort of developing models.
EDA typically involves the following activities:
- Understand the story the data is telling. Visualize the data to understand data distributions and correlations from different angles via various plots. For example, a bar plot can be used to understand how customers are distributed across different regions, a line graph can be used to understand revenue change over the last six months, and a scatter plot can be used to understand how company size is related to purchase volume.
  We combine the collected data with business domain knowledge to tell a story. For example, we explain why a certain phenomenon is seen, understand its underlying driving factors, assess what future opportunities and risks lie ahead, and determine the follow-up actions.
- Understand whether the data is aligned with (or contradicts) our prior business sense. We state our beliefs in terms of null hypotheses and apply statistical significance tests on the collected data to assess whether our beliefs still hold. Conduct A/B testing (randomized controlled experiments) to test the significance of the hypotheses.
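As an illustration of the significance testing described above, the following sketch runs a two-sided permutation test on the difference in means between two groups, such as the two arms of an A/B test. The permutation test is only one of several tests that could be used here (a t-test is a common alternative), and the function name, the sample values, and the number of permutations are assumptions chosen for the example.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both samples come from the same distribution, so the
    group labels are exchangeable. Returns the estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into two groups of the original sizes.
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up order values for two page variants (illustrative data only).
variant_a = [20.1, 22.4, 19.8, 21.0, 20.5, 23.1, 19.9, 21.7]
variant_b = [24.9, 26.3, 25.1, 27.0, 24.2, 26.8, 25.5, 26.0]
p_value = permutation_test(variant_a, variant_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value means that a difference as large as the observed one would rarely arise if the null hypothesis were true, which is evidence against the belief stated in that hypothesis.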
Model creation
With a good understanding of our data and the problems we want to solve, we are ready to create our models, often through an iterative process.
In ML, there are many types of models, including causal models that find cause/effect relationships and optimization models that decide what action to take under specific circumstances. In this article, we focus on the lifecycle of prediction models, which are the most common model type in machine learning.
A prediction model is based on supervised learning, where each data record includes a labeled (ground truth) output. The model is created through a training process in which we feed data holding both input and labeled output attributes to a machine learning algorithm, which incrementally adjusts its parameters so that its predictions come close to the labeled outputs.
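To show what "incrementally adjusts its parameters" can look like, here is a minimal sketch that fits a one-input linear model by batch gradient descent on the mean squared error. Gradient descent is only one possible training algorithm, and the function name, learning rate, epoch count, and toy data are assumptions chosen for illustration.

```python
def fit_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy labeled data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each pass moves the parameters a small step in the direction that lowers the loss, so the predictions drift toward the labeled outputs over many iterations.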
The following eight steps describe the training process in detail:
- Define the metrics used in measuring the model’s performance.
  - R², RMSE, and MAPE are commonly used metrics for regression models that output numeric values.
  - Precision, Recall, F1, and AUC are commonly used for classification models that output categorical values.
  - You also need to choose a loss function to reflect your choice of model performance metrics.
- Clean the data set.
  - Remove duplicated records.
  - Handle data with missing values by imputing (filling in) the missing values or by removing problematic rows or columns.
  - Find any potential outliers and remove them, but only if they represent errors; true outliers are often important data points.
- Create input features from existing attributes.
  - Create additional input attributes by combining different raw input data attributes.
  - For numeric input attributes, normalize each attribute’s values to fall within a uniform range (for example, between -1 and +1) with zero mean.
  - Check for the skewness of numeric attributes and transform them appropriately into a bell-curve shape, if possible.
  - For categorical attributes, encode strings into numbers (using one-hot encoding, class-mean encoding, or embedding).
  - If there are too many input attributes in each record, consider reducing their number with dimensionality reduction techniques, such as principal component analysis (PCA), which project the attributes onto a smaller set of derived components. Feature selection techniques can also be used to pick a small set of the more important input attributes.
- Split data into three sets (the percentages can vary according to the nature of the problem and data availability).
  - The training set (commonly 70% of available data) is used to train the model.
  - The validation set (commonly 10%) is used to optimize model hyperparameters (for example, the regularization weight of a linear regression model).
  - The test set (commonly 20%) is used only to measure the performance of the final trained model on unseen data.
- If the output classes are highly imbalanced, resample the data to balance the output classes.
  - Over-sample records from the rare classes and under-sample records from the frequent classes.
  - Optionally, create synthetic data records for the rare classes.
- Choose the model structure. Two popular models are:
  - Neural networks (the basis of deep learning)
  - Gradient-boosting decision trees
- Tune the hyperparameters of the model (for example, the number of trees in the gradient-boosting decision tree or the learning rate in the neural network).
  - For each hyperparameter configuration, a model is trained on training data and performance is evaluated with the validation data.
  - The model with the best performance will typically be selected as the final model. However, depending on requirements, we may opt to trade performance for a model that is more explainable, fairer, or one that supports faster inference.
- Evaluate the final model’s performance based on metrics defined in step 1 (“Define the metrics used in measuring the model’s performance”) using the test set.
  - If we are happy with the performance of the final model, we can save the final model into the model repository and deploy the model into production.
  - If not, we need to go back to do more data exploration, create new features, and repeat the training cycle.
- Follow company policies on model governance and check to make sure the model is not using protected attributes (such as gender or age) in making decisions or predictions.
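The split, tune, and evaluate steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the data is synthetic, the model is a one-input linear regression with an L2 regularization weight (ridge regression) solved in closed form rather than trained iteratively, and all function names and candidate weights are invented for the example.

```python
import math
import random


def split(records, train=0.7, validation=0.1, seed=0):
    """Shuffle and split records into training, validation, and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_ridge(data, weight):
    """Closed-form ridge regression for one input; `weight` is the L2 penalty."""
    mean_x = sum(x for x, _ in data) / len(data)
    mean_y = sum(y for _, y in data) / len(data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / (sxx + weight)
    return slope, mean_y - slope * mean_x


def rmse(model, data):
    """Root mean squared error of a (slope, intercept) model on a data set."""
    slope, intercept = model
    return math.sqrt(
        sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)
    )


# Synthetic data, y = 3x + 2 plus Gaussian noise (illustrative only).
rng = random.Random(42)
inputs = [rng.uniform(0, 10) for _ in range(200)]
records = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in inputs]

train_set, validation_set, test_set = split(records)

# Train one model per hyperparameter value; select on the validation set only.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_weight = min(
    candidates, key=lambda w: rmse(fit_ridge(train_set, w), validation_set)
)
final_model = fit_ridge(train_set, best_weight)

# The test set is touched once, to estimate performance on unseen data.
test_rmse = rmse(final_model, test_set)
print(f"best weight: {best_weight}, test RMSE: {test_rmse:.3f}")
```

Keeping the test set out of hyperparameter selection is the point of the three-way split: if the test data influenced the choice of the regularization weight, the final performance estimate would be optimistic.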
Model operation
After the model graduates from its training stage, it enters the model operation phase and becomes ready to be used in production. The model operation phase includes the following tasks.
- Deploy the model into a production serving environment.
  - There are two ways to expose the model: package the model as a library function called by application code or package the model as a RESTful API service hosted by the model serving platform.
  - Optionally, we can deploy the latest model incrementally to a limited percentage of users and, in parallel, run an A/B test to compare it with the currently deployed production model. Based on the performance difference, we can decide whether we should roll forward to the latest model.
- Monitor the model’s performance to detect whether there is drift in the following:
  - Distribution of input variables: if new data deviates significantly from the training data, it is a good indication that something in the environment has changed.
  - Model performance metrics: if the model shows degraded performance on new data relative to its performance during the training phase, model retraining may be necessary.
- If drift is detected, retrain the model with the latest refreshed data. Model retraining may also be scheduled in some cases. For example, you may choose to retrain the model every week on the latest data.
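One way to check for drift in the distribution of a numeric input variable is to compare recent production values against the training values with the two-sample Kolmogorov-Smirnov (KS) statistic. The sketch below is a minimal, assumption-laden example: the KS statistic is only one of several possible drift measures, and the threshold of 0.3 and the function names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical cumulative distribution functions of the two samples."""
    ref, cur = sorted(reference), sorted(current)
    largest_gap = 0.0
    for value in ref + cur:
        # Fraction of each sample that is less than or equal to value.
        ref_cdf = bisect_right(ref, value) / len(ref)
        cur_cdf = bisect_right(cur, value) / len(cur)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Hypothetical alert threshold; in practice it would be tuned per feature
# or replaced by a p-value derived from the KS distribution.
DRIFT_THRESHOLD = 0.3


def has_drifted(reference, current, threshold=DRIFT_THRESHOLD):
    """Flag drift when the KS statistic exceeds the chosen threshold."""
    return ks_statistic(reference, current) > threshold
```

A monitoring job could call `has_drifted` for each input variable on a schedule, passing the training values as `reference` and a recent window of production values as `current`, and trigger retraining when the check fires.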
In this post, we provided an overview of the MDLC and activities involved in different phases. In future articles, we will further expand on the details of each phase and the underlying design considerations.