Machine learning (ML) is all about learning patterns from data. That’s why effectively processing data, moving data, and storing data at various stages along its lifecycle — from creation to consumption — are such important parts of AI/ML development. In this post, I’ll examine the overall data flow through the pipeline of the ML model-development process and identify some common issues that can cause degradation of model performance. Next, I will introduce the concept of a feature store and explore its core capabilities, including how it can be used to mitigate these issues.
Data used for analytics often originates either from existing enterprise databases or from applications interacting with users to capture their activities into an operational data store (ODS). For analytics purposes, there will be a data pipeline that moves data periodically or continuously from one of those locations into a data warehouse. From there, the data will be transformed into an optimized structure for downstream analytics processing.
ML projects often come from new business ideas that create a need to explore data and model capabilities. First, a data-science team will create a project-specific workspace where they will work to pull data from the data warehouse, exploring and visualizing data to create input features that are useful for training the model. Feature engineering is a highly iterative process in which data scientists try many combinations of features to determine which combination yields the best model performance. It typically consumes a major portion of the project’s time and effort.
Below is a look at some common issues frequently encountered in the model-development process.
Lack of feature sharing

Since each project creates its own workspace to store the features it has created, features are not shared across projects. Input features are created through a highly exploratory process and are usually kept within the project team.
Feature engineering is one of the most time-consuming activities in ML model development. When features are created on a per-project basis (and therefore not shared), other projects must duplicate the engineering effort involved with designing and implementing these same features. This increases project-development costs and results in similar features being implemented in different ways, with potentially inconsistent levels of quality. This also leads to higher maintenance costs and may lower the overall quality of an organization’s data-science efforts.
Inconsistent feature creation
In many project settings, data scientists are responsible for training the model, while MLOps engineers are responsible for deploying it into production. However, the data flows in the training and inference stages are quite different: the training flow creates features by applying transformations to data from the data warehouse, whereas the inference flow creates features from application requests and data in online databases. This often results in two separate codebases implementing essentially the same feature creation, causing bugs and inconsistencies that can hurt model performance at inference time.
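One common remedy is to keep feature logic in a single function that both the batch training path and the online inference path call. The sketch below illustrates the idea; all names (compute_features, the raw-record fields) are hypothetical, not from any particular system.

```python
def compute_features(raw: dict) -> dict:
    """Single source of truth for feature logic, shared by both paths."""
    return {
        "trips_per_day": raw["total_trips"] / max(raw["active_days"], 1),
        "is_high_rated": 1 if raw["avg_rating"] >= 4.5 else 0,
    }

# Training path: apply the same function to warehouse rows in batch.
warehouse_rows = [
    {"total_trips": 120, "active_days": 30, "avg_rating": 4.8},
    {"total_trips": 10, "active_days": 0, "avg_rating": 3.9},
]
training_features = [compute_features(r) for r in warehouse_rows]

# Inference path: apply the identical function to a single online request.
request = {"total_trips": 7, "active_days": 2, "avg_rating": 4.6}
online_features = compute_features(request)
```

Because both paths share one implementation, a change to the feature definition cannot silently diverge between training and serving.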
Signal leakage

Signal leakage can occur when a training set is constructed from temporally inconsistent values. For example, imagine you want to train a model to detect churn (customers who unsubscribe from a service). Imagine further that CustomerSpend is one of several pieces of information available for each customer and you’d like to include it in your training set. To create training examples, you look back at historical data to find customers who did or did not churn, giving you positive and negative examples. If your organization stores only the latest value of CustomerSpend for each customer, you will introduce signal leakage: each training example that includes CustomerSpend will contain information “from the future,” relative to the time the rest of the example’s data was taken. Instead, each training example should use the customer’s spend as of the example’s timestamp, so that only self-consistent values are included. This problem can be avoided by using the feature store to keep timestamped values of all features that change over time.
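The leak-free lookup amounts to "the last recorded value at or before the example's timestamp." Here is a minimal sketch with an illustrative CustomerSpend history; the data and field names are made up for the example.

```python
from bisect import bisect_right

# Timestamped history of CustomerSpend for one customer, sorted by time.
# Storing only the latest value would force every training example to use
# spend_history[-1], leaking future information into older examples.
spend_history = [(1, 100.0), (5, 250.0), (9, 40.0)]

def spend_as_of(t: int) -> float:
    """Return the last CustomerSpend value recorded at or before time t."""
    times = [ts for ts, _ in spend_history]
    i = bisect_right(times, t)
    if i == 0:
        raise ValueError(f"no spend recorded at or before time {t}")
    return spend_history[i - 1][1]

# A training example labeled at time 6 must see the value from time 5;
# the later value from time 9 would be a leak.
assert spend_as_of(6) == 250.0
```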
Feature stores were designed to address all of the issues above. They elevate features to be first-class entities in the model-development process by explicitly providing a mechanism to store and extract features. Features are no longer kept within the project itself — they can be shared across different projects. This significantly increases the reusability of features and reduces the engineering effort of each project. Feature stores effectively provide more derivatives of data beyond what the data warehouse provides.
With an explicit storage of features, we can centrally monitor the data quality of the wide variety of features created by different project teams. In addition, we can detect any drift in the distribution of feature values over time.
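Drift monitoring can be as simple as comparing a feature's current distribution against a baseline. Below is a sketch using the Population Stability Index (PSI); the bucket count and the 0.2 alert threshold are common conventions, not something prescribed by any particular feature store.

```python
import math

def psi(baseline: list, current: list, bins: int = 10) -> float:
    """Population Stability Index between two samples of a feature."""
    lo = min(baseline + current)
    hi = max(baseline + current)
    width = (hi - lo) / bins or 1.0

    def frac(xs, b):
        n = sum(1 for x in xs if lo + b * width <= x < lo + (b + 1) * width)
        # small floor so empty buckets don't blow up the logarithm
        return max(n / len(xs), 1e-6)

    return sum(
        (frac(current, b) - frac(baseline, b))
        * math.log(frac(current, b) / frac(baseline, b))
        for b in range(bins)
    )

baseline = [x / 100 for x in range(100)]       # roughly uniform on [0, 1)
shifted = [x / 200 + 0.5 for x in range(100)]  # mass moved to [0.5, 1)
print(psi(baseline, shifted))                  # large value signals drift
```

A scheduled job can compute this per feature and alert when the index crosses the chosen threshold.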
Feature stores enforce a cleaner separation between feature engineering and model training. Decoupling these activities enables both to evolve independently with increasing reusability.
Since the feature engineering logic is explicitly separated from the training process, the same feature-extraction code can be used in both model training and model inference. This eliminates the deviation of feature-creation logic mentioned earlier.
Conceptually, a feature store is organized as a container of entities. Each entity is uniquely identifiable by its ID and has one or more associated feature sets. Each feature set contains features with values that are tagged with an update timestamp.
In this example, there is one entity type driver with two entities with ID values of id1 and id2. Each entity has two features: TripsToday and Rating. Entity id1 was updated at 1:15 and at 1:30. Since the feature store records timestamps along with feature values, it supports point-in-time queries with which we can construct a snapshot of feature values at any past moment. Given a moment in time, a point-in-time query will filter out all entries with a timestamp later than the chosen time, then extract the entry with the latest remaining timestamp. This represents a snapshot of all feature values seen up to the given time and provides a nice solution to the leakage problem by preventing future feature values from leaking into the training data.
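The point-in-time query described above can be sketched in a few lines over in-memory rows. The schema mirrors the driver example; times are given as minutes past 1:00 for simplicity, and the row layout is illustrative rather than any real feature store's storage format.

```python
rows = [
    {"entity_id": "id1", "ts": 15, "TripsToday": 3, "Rating": 4.5},
    {"entity_id": "id1", "ts": 30, "TripsToday": 4, "Rating": 4.6},
    {"entity_id": "id2", "ts": 20, "TripsToday": 1, "Rating": 4.9},
]

def point_in_time(entity_id, t):
    """Latest row for entity_id with ts <= t; None if nothing precedes t."""
    seen = [r for r in rows if r["entity_id"] == entity_id and r["ts"] <= t]
    return max(seen, key=lambda r: r["ts"]) if seen else None

# At 1:20, id1's snapshot is the 1:15 entry; the 1:30 update is filtered
# out, so no future value can leak into an example built at 1:20.
assert point_in_time("id1", 20)["TripsToday"] == 3
assert point_in_time("id1", 30)["TripsToday"] == 4
```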
Notice that there are two timestamps: EventTime represents when the feature value actually changed, and ReceiveTime represents when the feature value was stored into the feature store. Ideally, ReceiveTime would equal EventTime, meaning there was no latency in updating the feature store; in practice, feature-store updates always lag. Since up-to-date feature values carry a fresher, stronger signal than stale ones, a model usually performs better with access to fresher features. Therefore, comparing model performance when features are joined on EventTime (the freshest possible values) against ReceiveTime (the values that were actually available) gives a good sense of how sensitive the model is to feature-update latency. This tells data engineers how to prioritize pipeline-optimization work to reduce that latency.
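The effect of the two timestamps can be seen by running the same point-in-time lookup against each. In this illustrative log, the 1:30 update only reached the feature store at 1:40, a 10-minute latency; all names are hypothetical.

```python
log = [
    {"value": 3, "event_time": 15, "receive_time": 16},
    {"value": 4, "event_time": 30, "receive_time": 40},
]

def value_as_of(t, ts_field):
    """Latest value whose chosen timestamp is <= t."""
    seen = [r for r in log if r[ts_field] <= t]
    return max(seen, key=lambda r: r[ts_field])["value"] if seen else None

# At prediction time 1:35 the true (EventTime) value is already 4, but a
# lookup keyed on ReceiveTime still returns the stale value 3. Training
# one model on each view and comparing their metrics estimates how much
# the feature-update latency costs in model performance.
assert value_as_of(35, "event_time") == 4
assert value_as_of(35, "receive_time") == 3
```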
With a feature store in place, the upper half of the following diagram represents the data flow at the training stage, while the lower half represents the data flow at the inference stage.
A feature store is used in both model training and inferencing in the following way:
- To enable loading data into the feature store, data engineers define a feature store schema, including the definition of entities and corresponding IDs, the definition of each feature of the entity, the grouping of features into feature sets, and the external data source holding the features.
- Based on the feature store’s update requirement, tasks are scheduled to extract, transform, and load the latest updated data from external data sources into the feature store or continuously stream data to update the feature store according to the schema definition in step 1. After the features are loaded, they are tagged with two timestamps — the event time, when the feature was updated by the application, and the receive time, when the feature is stored into the feature store.
- In the training phase, to prepare the model-training data, a point-in-time query is issued to extract features with the appropriate time, corresponding to a prediction time before the output is revealed. Notice that the point-in-time query ensures the validity of training data — it won’t contain future feature values.
- Based on the training data, a prediction model is trained, evaluated, and stored in the model repository.
- After the model is ready to be deployed, it is loaded into the serving cluster, where it handles online prediction requests through the prediction API service.
- The application sends a prediction request containing the entity ID to the prediction API service.
- The service queries the feature store to extract the latest version of features corresponding to the entity ID, constructs the final features, and sends them to the model-serving cluster.
- The model-serving cluster performs an inference on the final features and produces a prediction.
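The steps above can be sketched end to end with an in-memory stand-in for the feature store. All names here (FeatureStore, the driver features) are hypothetical; a real deployment would put a system such as Feast behind the same two access patterns.

```python
class FeatureStore:
    def __init__(self):
        self.rows = []  # (entity_id, event_time, features)

    def ingest(self, entity_id, event_time, features):
        # Step 2: load timestamped feature values into the store.
        self.rows.append((entity_id, event_time, dict(features)))

    def point_in_time(self, entity_id, t):
        """Training path (step 3): features as of time t, no leakage."""
        seen = [r for r in self.rows if r[0] == entity_id and r[1] <= t]
        return max(seen, key=lambda r: r[1])[2] if seen else None

    def latest(self, entity_id):
        """Serving path (steps 6-7): freshest features for a request."""
        return self.point_in_time(entity_id, float("inf"))

store = FeatureStore()
store.ingest("id1", 15, {"TripsToday": 3, "Rating": 4.5})
store.ingest("id1", 30, {"TripsToday": 4, "Rating": 4.6})

# Training: build an example for a prediction made at time 20.
train_features = store.point_in_time("id1", 20)

# Serving: the prediction API pulls the latest features for the entity.
serve_features = store.latest("id1")
```

The key property is that both paths read from the same store and the same feature definitions; only the timestamp filter differs.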
In this post, we’ve discussed some common issues related to handling data-processing flows in the training and inference stages, as well as how a feature store can mitigate these issues. In sum:
- Feature stores facilitate sharing and reuse of features created from multiple data-science projects, reducing the overall effort for model development.
- Feature stores unify the logic of feature creation in both model training and inferencing, reducing potential bugs that may cause lower model inference-time performance.
- Feature stores record timestamps along with feature values, preventing signal-leakage problems that may cause significant model-performance degradation in production.
Feast is a popular open-source implementation of the feature store concept. It includes many of the capabilities described above. Consider integrating Feast into your organization’s ML pipeline.