Custom ML Model Evaluation For Production Deployments

My team and I built a cloud-native recommender system that matches open jobs with people who are looking for work.

We trained machine learning models to power the system, following the tried-and-true process:

  1. Set up an end-to-end data science workflow in a Jupyter notebook
  2. Use domain knowledge to create the feature space through feature engineering
  3. Train parallel models through hyperparameter tuning jobs
  4. Evaluate final model performance on the holdout test set using the appropriate objective metric
  5. Convert the Jupyter notebook into scalable training pipeline components within a serverless microservice architecture
  6. Deploy the solution using infrastructure-as-code through a modular CI/CD pipeline
  7. Monitor model performance on production traffic

Following the initial shadow deployment of the best trained model, we rolled it out to production in stages, starting with the lowest-risk cohort of jobs and workers.

A few days into this deployment, our daily drift-monitoring workflow measured a drop in model performance for this cohort. The drift event triggered an execution of the training pipeline to retrain the model.
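
As a minimal sketch of the idea (not our production workflow), a daily drift check can compare the model's recent performance on labeled production feedback against a baseline captured at deployment time and kick off retraining when the drop exceeds a tolerance. The threshold value and the `trigger_training_pipeline` stub below are hypothetical.

```python
# Hypothetical daily drift check: compare recent performance on the rollout cohort
# against a baseline recorded at deployment time and retrain when the drop is too large.

BASELINE_SCORE = 0.62      # assumed baseline metric recorded at deployment time
MAX_RELATIVE_DROP = 0.05   # assumed tolerance: retrain on a >5% relative drop

def drift_detected(recent_score: float, baseline: float = BASELINE_SCORE) -> bool:
    """Return True when the relative performance drop exceeds the tolerance."""
    return (baseline - recent_score) / baseline > MAX_RELATIVE_DROP

def trigger_training_pipeline() -> None:
    # Placeholder: in practice this would start a training pipeline execution
    # in whatever orchestration service you use; shown here as a stub.
    print("Drift detected: starting training pipeline execution")

if __name__ == "__main__":
    recent_score = 0.55  # metric measured on recent labeled traffic for the cohort
    if drift_detected(recent_score):
        trigger_training_pipeline()
```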

Our objective metric during training was mean average precision (MAP) on the validation set. The best trained model from the hyperparameter tuning job was the one with the highest validation MAP.

Next, we calculated MAP on the test set and confirmed the model generalized well to previously unseen (job, worker) matches. This new “challenger” model’s MAP beat the current “champion” model’s MAP, so we replaced the production model with the challenger through a blue/green deployment.
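
For readers who want the metric made concrete, here is a minimal sketch of mean average precision for (job, worker) matches: average precision is computed per job over its ranked list of candidate workers, then averaged across jobs. The toy data structures below are illustrative, not our production code.

```python
from typing import Dict, List, Set

def average_precision(ranked_workers: List[str], relevant: Set[str]) -> float:
    """Average precision for one job's ranked list of candidate workers."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, worker in enumerate(ranked_workers, start=1):
        if worker in relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at each relevant hit
    return precision_sum / len(relevant)

def mean_average_precision(rankings: Dict[str, List[str]],
                           relevance: Dict[str, Set[str]]) -> float:
    """MAP: mean of per-job average precision."""
    scores = [average_precision(ranked, relevance.get(job, set()))
              for job, ranked in rankings.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: two jobs, the model's ranked workers vs. the workers who actually matched.
rankings = {"job_1": ["w3", "w1", "w7"], "job_2": ["w2", "w9", "w4"]}
relevance = {"job_1": {"w1", "w7"}, "job_2": {"w9"}}
print(round(mean_average_precision(rankings, relevance), 3))  # 0.542
```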

As we monitored the new inference results, we noticed the new model performed slightly worse on the rollout cohort than the previous model.

What happened? Didn’t the challenger model beat the champion model on test set MAP?

Let’s think of a dataset as composed of several smaller datasets, each representing a slice of the business.

For our dataset of (job, worker) matches, the slices might include:

  • Low-activity workers
  • Medium-activity workers
  • High-activity workers
  • Common job types
  • Rare job types
  • Geographic locations
  • Times of the year
  • Various combinations of all the above

As the team brainstorms in collaboration with product managers and domain experts, this list of dataset slices grows quickly.
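
Slices like the ones above can be expressed as simple, named filters over the test set. Here is a minimal pandas sketch; the column names (`worker_activity`, `job_type_frequency`, `region`) are hypothetical stand-ins for whatever features describe your records.

```python
import pandas as pd

# Hypothetical test set of (job, worker) match records.
test_df = pd.DataFrame({
    "worker_activity":    ["low", "high", "medium", "high", "low"],
    "job_type_frequency": ["common", "rare", "common", "common", "rare"],
    "region":             ["us-east", "us-west", "us-east", "eu", "us-west"],
})

# Each slice is a named boolean mask over the test set; combinations compose with `&`.
slices = {
    "low_activity_workers":  test_df["worker_activity"] == "low",
    "high_activity_workers": test_df["worker_activity"] == "high",
    "rare_job_types":        test_df["job_type_frequency"] == "rare",
    "rare_jobs_us_west":     (test_df["job_type_frequency"] == "rare")
                             & (test_df["region"] == "us-west"),
}

for name, mask in slices.items():
    print(f"{name}: {int(mask.sum())} records")
```

Evaluating the model on `test_df[mask]` for each named slice is what surfaces the per-slice picture discussed below.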

What is the point of this exercise? We learned that:

  1. Some dataset slices are more valuable to the business than others.
  2. The number of records per slice varies, leading to an imbalanced dataset.
  3. Average model performance over the entire test set obscures the true picture of model performance on individual slices of the dataset.

The conclusion became clear: a new machine learning model may perform better overall on the test set, yet its performance can vary substantially from one dataset slice to another.

In our case, the retrained model did beat the production model on top-level test-set MAP, as well as on several dataset slices. However, it performed worse on the slice corresponding to the portion of traffic the previous champion model had been receiving during that initial production rollout.

How do we solve this problem?

Evaluate the new model’s performance on each distinct slice of the test set individually. Assign each slice a weight between 0 and 1, then take the weighted sum of the per-slice scores to produce the challenger model’s final score.

Finally, we compare the challenger’s score to the champion’s score to determine whether the newly trained model will replace the current production model.
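
Here is a minimal sketch of that comparison, assuming a per-slice metric (for example, per-slice MAP) has already been computed for both models; the slice names, scores, and weights are illustrative only.

```python
from typing import Dict

def weighted_score(per_slice_metric: Dict[str, float],
                   slice_weights: Dict[str, float]) -> float:
    """Weighted sum of per-slice scores, normalized by the total weight."""
    total_weight = sum(slice_weights.values())
    return sum(per_slice_metric[s] * w for s, w in slice_weights.items()) / total_weight

# Illustrative per-slice MAP for the current champion and the newly trained challenger.
champion_map   = {"high_activity_workers": 0.71, "rare_job_types": 0.58, "rollout_cohort": 0.64}
challenger_map = {"high_activity_workers": 0.74, "rare_job_types": 0.61, "rollout_cohort": 0.60}

# Illustrative business-driven weights (between 0 and 1), agreed on with product/domain experts.
weights = {"high_activity_workers": 0.3, "rare_job_types": 0.2, "rollout_cohort": 0.5}

champion_score   = weighted_score(champion_map, weights)
challenger_score = weighted_score(challenger_map, weights)
print(f"champion={champion_score:.3f} challenger={challenger_score:.3f}")
print("Promote challenger" if challenger_score > champion_score else "Keep champion")
```

In this toy example the challenger wins on most slices but loses on the heavily weighted rollout cohort, so it is not promoted. That is exactly the failure mode the weighted evaluation is designed to catch.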

How do you know how much weight to give each slice’s performance?

This is where data scientists and machine learning engineers collaborate with product managers and domain experts to determine which slices have the highest business impact.

My team and I weight model performance on high-value slices more heavily. These high-value slices include the most recent records, the highest revenue-generating workers, the most active customers, and so on. Lower-value slices, such as workers who only work one day every six months or records from two years ago, receive lower weights.
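
One simple way to encode that judgment, shown purely as an illustration (the slice names and tier values here are hypothetical), is to group slices into business-value tiers and derive the weights from the tiers:

```python
# Hypothetical slice weights grouped by business value; the exact values come from
# collaboration with product managers and domain experts, not from the data alone.
TIER_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.1}

slice_business_value = {
    "records_last_30_days":         "high",
    "high_revenue_workers":         "high",
    "most_active_customers":        "high",
    "workers_one_day_per_6_months": "low",
    "records_older_than_2_years":   "low",
}

slice_weights = {name: TIER_WEIGHTS[tier] for name, tier in slice_business_value.items()}
print(slice_weights)
```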

We also took into account the business impact of false positives and false negatives per slice before converging on the final model evaluation strategy for our prediction problem.
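
Where match predictions can be reduced to a binary accept/reject decision, one way to fold those error costs in is to compute a business cost per slice from its confusion matrix, with per-slice false-positive and false-negative unit costs supplied by the business. Everything in this sketch (labels, predictions, costs) is illustrative rather than our actual evaluation code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def slice_error_cost(y_true, y_pred, fp_cost: float, fn_cost: float) -> float:
    """Business cost of errors on one slice: FP and FN counts times their unit costs."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * fp_cost + fn * fn_cost

# Illustrative labels and predictions for a single slice, with assumed per-slice costs.
y_true = np.array([1, 0, 1, 1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1])
print(slice_error_cost(y_true, y_pred, fp_cost=5.0, fn_cost=20.0))  # 1 FP*5 + 1 FN*20 = 25.0
```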

A model evaluation strategy is entirely domain and problem-dependent. This strategy should emerge through a collaboration between business leaders/stakeholders, domain experts, product managers, and DS/MLE teams.

How does your team evaluate new models before proceeding to production deployments? Do you have a custom model evaluation strategy? Let us know in the comments!

Check out the various links in this article to learn more about the highlighted topics.

Contact us or send me a message if you need help implementing cloud-native MLOps, Well-Architected production ML software solutions, or training/inference pipelines; if you want to monetize your ML models in production; if you have specific solution architecture questions; or if you would simply like us to review your solution architecture and provide feedback based on your goals.

Subscribe to my blog at Gradient Group: https://gradientgroup.ai/blog/
