Is continuous training (CT) a machine learning operations (MLOps) best practice?
It depends on what we mean by CT.
Suppose it means continuously invoking training pipelines to ensure models ‘stay fresh’ as new production data lands in the data lake.
The training pipeline could be executed automatically on a schedule: once per month, once per week, once per day, multiple times per day, or even continuously in a loop.
This is one way to prevent stale production models, but running ML training pipelines costs money. The more frequently we run them, the higher the cumulative cost over any given time window.
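To make the cost point concrete, here is a minimal sketch of that linear scaling. The per-run cost and window length are hypothetical numbers for illustration, not figures from our workloads:

```python
def cumulative_training_cost(cost_per_run: float, runs_per_day: float, days: int) -> float:
    """Cumulative pipeline cost over a time window.

    Every run bills the full cluster/HPT/storage cost, so total cost
    scales linearly with run frequency.
    """
    return cost_per_run * runs_per_day * days

# Hypothetical $40 per pipeline run, over a 30-day window:
monthly_schedule = cumulative_training_cost(40.0, 1 / 30, 30)  # ~ $40
daily_schedule = cumulative_training_cost(40.0, 1.0, 30)       # $1200
```

Thirty times the run frequency means thirty times the bill, whether or not any of those runs produced a better model.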
Imagine continuously provisioning Spark clusters and EC2 instances, running massive hyperparameter tuning (HPT) jobs, logging metadata to S3 or DynamoDB, and more, with every resource billed for the full duration of each run.
This form of ‘naive’ continuous training is an anti-pattern under the AWS Well-Architected Framework, specifically from the standpoint of its Cost Optimization pillar.
Is there a better way?
Consider event-driven ‘discrete’ (re)training in response to data/concept drift events:
My team and I deploy machine learning models to production through CI/CD as CloudFormation stacks (check out our article on infrastructure-as-code to learn more). For each model moving to prod, we deploy two serverless Step Functions workflows: one for the main training pipeline and one for drift monitoring.
The drift monitoring Step Function is invoked once per day per production model. It extracts an inference dataset from our data lake containing the latest records that have ground truth, then runs batch transform inference on it.
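The extraction step can be sketched in plain Python. This is a simplified stand-in for the real data lake query: record shape (`timestamp`, `features`, `label` keys) and the one-day lookback are assumptions for illustration:

```python
from datetime import datetime, timedelta

def extract_drift_dataset(records, now, lookback_days=1):
    """Select the latest records that already have ground truth attached.

    `records` is an iterable of dicts with 'timestamp', 'features', and an
    optional 'label' (ground truth) key. A production implementation would
    query the data lake instead of filtering an in-memory list.
    """
    cutoff = now - timedelta(days=lookback_days)
    return [
        r for r in records
        if r.get("label") is not None and r["timestamp"] >= cutoff
    ]
```

Records without ground truth are excluded because the drift score below compares predictions against known labels.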
Next, we slice this ‘drift dataset’ and perform a weighted sum evaluation based on prod model performance on each slice (check out our article titled “Custom ML Model Evaluation For Production Deployments” to learn more).
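A minimal sketch of that weighted-sum evaluation is below. The slice names and weights are hypothetical; the full slicing strategy is covered in the referenced article:

```python
def weighted_slice_score(slice_scores, slice_weights):
    """Weighted-sum evaluation of a model across dataset slices.

    slice_scores:  {slice_name: metric on that slice, e.g. accuracy}
    slice_weights: {slice_name: importance weight}
    Weights are normalized so the result stays on the metric's own scale.
    """
    total_weight = sum(slice_weights[name] for name in slice_scores)
    return sum(
        slice_scores[name] * slice_weights[name] for name in slice_scores
    ) / total_weight

# Hypothetical slices: weight new customers twice as heavily as returning ones.
score = weighted_slice_score(
    {"new_customers": 0.81, "returning_customers": 0.92},
    {"new_customers": 2.0, "returning_customers": 1.0},
)
```

Weighting slices this way lets a drop on a business-critical segment pull the overall score down even when aggregate performance looks healthy.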
If the final drift dataset score is X% below the production model’s score, we automatically invoke the training pipeline Step Function for re-training. This training pipeline execution may or may not yield a better model, but we have done our job in monitoring, measuring, and re-training with purpose.
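The trigger condition itself is a one-liner. `threshold_pct` below stands in for the X% placeholder above, and the Step Functions call is shown only as a comment since the ARN and client setup are deployment-specific:

```python
def should_retrain(prod_score: float, drift_score: float, threshold_pct: float) -> bool:
    """Trigger re-training when the drift-dataset score falls more than
    threshold_pct percent below the production model's reference score."""
    return drift_score < prod_score * (1 - threshold_pct / 100)

# If drift is confirmed, the monitoring workflow starts the training
# pipeline, e.g. via boto3 (not executed here):
# sfn = boto3.client("stepfunctions")
# sfn.start_execution(stateMachineArn=TRAINING_PIPELINE_ARN)
```

With a 5% threshold and a production reference score of 0.90, a drift score of 0.80 triggers re-training while 0.88 does not.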
Why do we monitor drift on a batch basis once per day?
In our domain, analytics shows that drift never happens faster than on a daily or weekly cadence. Fine-tune your drift monitoring frequency based on your own domain, because running it frequently also comes with a cost.
This intentional, purpose-driven (re)training minimizes CT cost and sheds light on the nature of the prediction problem’s domain.
In some domains, drift and re-training may happen on a daily frequency. In others, it may be once per week. In some cases, once per month or longer. Or, as my team and I learned, there could also be drift seasonality where it’s faster at certain times of the year and slower in others.
All these learnings led us to evolve from schedule-based continuous training to event-driven discrete training for all production ML models.
This article is meant to provide a different perspective on continuous training and various best practices to consider. We recommend adopting the approach that works best for you, your team, and the specific prediction problems you are solving.
Do you prefer schedule-based continuous training or event-driven discrete training for production ML models? Let us know in the comments!
Subscribe to our weekly LinkedIn newsletter: Machine Learning In Production
Reach out if you need help:
- Maximizing the business value of your data to improve core business KPIs
- Deploying & monetizing your ML models in production
- Building Well-Architected production ML software solutions
- Implementing cloud-native MLOps
- Training your teams to systematically take models from research to production
- Identifying new DS/ML opportunities in your company or taking existing projects to the next level
Would you like me to speak at your event? Email me at email@example.com
Follow our blog: https://gradientgroup.ai/blog/