We have trained a machine learning model that meets or exceeds the performance metrics defined by our business requirements.
We have deployed this model to production after converting our Jupyter notebook into a scalable end-to-end training pipeline, including CI/CD and infrastructure-as-code.
This deployment could be a SageMaker endpoint for live inference, or a Lambda function that launches a batch transform job from the model artifacts in S3 as needed (on a trigger or a schedule).
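As a minimal sketch of the second option, the Lambda handler below assembles a SageMaker batch transform request from model artifacts already registered as a SageMaker Model. The model name, S3 URIs, and instance type are hypothetical placeholders; the actual `create_transform_job` call is left commented out so the sketch runs locally without AWS credentials.

```python
# Sketch of a Lambda handler that launches a SageMaker batch transform job.
# Model name, buckets, and instance type below are hypothetical placeholders.
import time


def build_transform_request(model_name, input_s3_uri, output_s3_uri):
    """Build the request dict for boto3's sagemaker.create_transform_job."""
    return {
        "TransformJobName": f"{model_name}-batch-{int(time.time())}",
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }


def lambda_handler(event, context):
    # import boto3  # available in the Lambda runtime
    # sagemaker = boto3.client("sagemaker")
    request = build_transform_request(
        model_name="purchase-propensity",              # hypothetical model
        input_s3_uri="s3://my-bucket/batch-input/",    # hypothetical bucket
        output_s3_uri="s3://my-bucket/batch-output/",
    )
    # sagemaker.create_transform_job(**request)  # commented out for local runs
    return {"submitted": request["TransformJobName"]}
```

The same handler works for both invocation styles mentioned above: wire it to an EventBridge schedule, or to an S3 event trigger when new input data lands.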
However, given the dynamic nature of a marketplace or business environment, our deployed model's performance is guaranteed to deteriorate over time: feature distributions will shift, supply and demand will fluctuate, customer preferences will evolve, and so on.
Also, if our deployed model is actively used to make decisions at scale, the machine learning solution itself will change the data distributions – hopefully in the desired direction due to better business outcomes.
In machine learning, these inevitable data distribution shifts are called drift, and a few important questions arise upon model deployment:
- “What is our model’s ongoing performance on production data?”
- “Under what conditions should we trigger re-training?”
- “What are the proper model evaluation metrics to compare new models against the current model in production?”
Let’s take the example of OLTP transactions in a relational database, such as e-commerce events.
Using a dataset of historical transactions, we trained and deployed a machine learning model that predicts the probability of a given customer purchasing a specific product. We then use this model to help inform product recommendations.
We can assess our model’s ongoing performance on production data by comparing the prediction to the actual outcome, per transaction. This monitoring can be done daily, weekly, or monthly, depending on the specific business domain.
Then, we can trigger our training pipeline (re-training) if our deployed model’s objective metric drops below a desired threshold on the latest batch of new records.
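The monitoring-and-trigger logic above can be sketched in a few lines. The metric here (precision of purchase predictions on one batch) and the threshold are illustrative stand-ins; a real pipeline would use the model's actual objective metric against the threshold agreed with the business.

```python
# Minimal sketch of a drift check on the latest batch of production records.
# The metric (batch precision) and the 0.70 threshold are illustrative.

def batch_precision(records, cutoff=0.5):
    """Precision on one batch of (predicted_probability, actual_outcome) pairs:
    of the transactions we predicted as purchases, how many really were?"""
    predicted_pos = [(p, y) for p, y in records if p >= cutoff]
    if not predicted_pos:
        return 0.0
    return sum(y for _, y in predicted_pos) / len(predicted_pos)


def should_retrain(latest_metric, min_acceptable=0.70):
    """Trigger the training pipeline when the metric drops below threshold."""
    return latest_metric < min_acceptable


# One day's (or week's, or month's) batch: prediction vs. actual, per transaction.
batch = [(0.9, 1), (0.8, 1), (0.7, 0), (0.2, 0)]
metric = batch_precision(batch)      # 2 of 3 predicted purchases happened
retrain = should_retrain(metric)     # below 0.70, so trigger re-training
```

The cadence of this check (daily, weekly, monthly) is just how often the batch is assembled and the function pair is run.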
Back when we trained our first model, performance was evaluated on a randomly sampled held-out test set (or through cross-validation). Randomly sampled test sets contain records from the entire time period of the training dataset.
However, it is more important for our model to perform well on the most recent records than records farther back in time. This ensures we meet the present business needs.
Therefore, model evaluation puts a premium (higher weight) on performance on the most recent records. For example, take a model's objective metric (such as mean average precision) computed on each week of records: multiply the most recent week's metric by a factor gamma, where gamma is a number between 0 and 1, the previous week's metric by gamma squared, and so on, decaying exponentially the further back in time a week lies. The sum of these weighted metrics gives us the final model evaluation score.
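A sketch of that time-decayed evaluation score, assuming the weekly metrics are ordered newest first (the example metric values and gamma below are illustrative):

```python
# Sketch of an exponentially time-decayed model evaluation score.
# weekly_metrics[0] is the most recent week's objective metric; the week
# k weeks back is weighted by gamma ** (k + 1), so weights decay
# exponentially the further back in time a week lies.

def decayed_evaluation_score(weekly_metrics, gamma=0.9):
    return sum(gamma ** k * m for k, m in enumerate(weekly_metrics, start=1))


# Example: mean average precision over the last four weeks, newest first.
score = decayed_evaluation_score([0.82, 0.80, 0.78, 0.75], gamma=0.9)
```

Smaller values of gamma concentrate the score on the most recent weeks; gamma close to 1 approaches a plain sum over the whole window.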
This is one way to decide whether to replace the current production model with the brand-new one produced by re-training.
If the newly trained model achieves a higher model evaluation score, replace the deployed model. Otherwise, simply store metadata about the training pipeline run and wait for the next run triggered by drift monitoring.
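That champion/challenger decision can be sketched as follows; the field names in the metadata record are hypothetical, and in practice the record would be persisted to a model registry or metadata store rather than just returned.

```python
# Sketch of the champion/challenger decision after a re-training run.
# Metadata field names are hypothetical illustrations.
from datetime import datetime, timezone


def decide_promotion(champion_score, challenger_score, run_id):
    """Replace the production model only if the newly trained (challenger)
    model scores higher; always record the pipeline run's metadata."""
    promote = challenger_score > champion_score
    metadata = {
        "run_id": run_id,
        "champion_score": champion_score,
        "challenger_score": challenger_score,
        "promoted": promote,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
    return promote, metadata
```

Both scores here are the time-decayed evaluation scores computed on the same recent records, so the comparison is apples to apples.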
There are many ways to slice a dataset or feature space during model evaluation, and each slice is a different representation of the business with a different impact. Always collaborate with business domain and subject matter experts when deciding model evaluation metrics.
What is your approach to drift monitoring, re-training, and model evaluation? Comment below! I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.
If your data science team needs help deploying models to production through scalable end-to-end ML engineering pipelines in AWS, reach out and I will be happy to help you.
Connect with me on LinkedIn: https://www.linkedin.com/in/CarlosLaraAI/