From a business leadership standpoint, it always feels risky to deploy a new machine learning model within a production application.
- “What if the model makes wrong predictions and disrupts stable business operations?”
- “Will our users be negatively impacted by inaccurate model predictions?”
- “How do we minimize the revenue impact of false positives or false negatives in production?”
These are fair questions, and it’s our job to address them, have a plan to minimize risk, and give our business leaders and stakeholders confidence.
Even if our model’s objective metric (e.g. recall) is high enough, as determined by product managers, it can still make business leaders and stakeholders nervous to put the model in front of users. After all, ML in production is still relatively new for most organizations.
There are multiple ways to deploy machine learning models to production:
- Blue / green deployments
- Canary deployments
- A/B testing
- Shadow deployments
Your deployment strategy depends on the lifecycle of the project, inference sensitivity to model changes, end-user impact, and business risk tolerance, among other factors.
For new models, we always execute shadow deployments first.
A shadow deployment is where we release a new machine learning model in “stealth mode” to assess how it performs on production traffic, but without actually utilizing the predictions within the business process.
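In an online-serving setting, the same idea can be sketched as calling both models on each request but returning only the live prediction to the caller; the shadow prediction is merely logged for later comparison. This is a minimal illustration with hypothetical model callables and an in-memory log (our own setup, described below, uses daily batch jobs instead of per-request dual serving):

```python
def predict_with_shadow(features, live_model, shadow_model, shadow_log):
    """Serve the live model's prediction; record the shadow model's
    prediction for offline comparison without affecting the caller."""
    live_pred = live_model(features)
    try:
        # The shadow path is best-effort only: any failure here must
        # never impact the live response.
        shadow_log.append({
            "features": features,
            "live_pred": live_pred,
            "shadow_pred": shadow_model(features),
        })
    except Exception:
        pass
    return live_pred
```

The key design choice is that the shadow model sits entirely off the critical path: its output is stored, never acted on.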
For example, my team and I recently deployed a new model that predicts conversion rates. These predictions now inform business decisions automatically, but we did not act on them until the model was “proven” after a couple of weeks.
We executed a shadow deployment in which the model generated inferences at scale for two weeks; we stored the predictions behind the scenes and compared them against existing business performance on a daily basis.
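The daily comparison step can be sketched as follows. This is a simplified illustration, not our production code: it compares the shadow model's error on one day of traffic against a naive baseline that always predicts the historical business conversion rate (the function name and metric choice are assumptions for the example):

```python
from statistics import mean

def daily_shadow_report(predictions, actual_outcomes, baseline_rate):
    """Compare one day of shadow-model predictions against the
    existing business benchmark using mean absolute error."""
    # Error of the shadow model's predicted conversion probabilities
    model_mae = mean(abs(p - a) for p, a in zip(predictions, actual_outcomes))
    # Naive baseline: predict the historical business rate for everyone
    baseline_mae = mean(abs(baseline_rate - a) for a in actual_outcomes)
    return {
        "model_mae": round(model_mae, 4),
        "baseline_mae": round(baseline_mae, 4),
        "model_beats_baseline": model_mae < baseline_mae,
    }
```

Running this report every day over the shadow period builds the evidence base for (or against) promoting the model.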
This is our main serverless workflow, triggered by EventBridge and orchestrated through Step Functions:
We performed daily SageMaker Batch Transform jobs on all the relevant production data for the day. Shadow deployments don’t usually have latency SLAs because the purpose is to measure model performance, so daily batch transforms work well.
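A daily batch transform like this can be kicked off from a Lambda step in the workflow via the SageMaker `CreateTransformJob` API. The sketch below only builds the request parameters; the model name, S3 bucket, and instance type are hypothetical placeholders, and the actual boto3 call is shown in a comment:

```python
from datetime import date

def build_shadow_transform_request(run_date: date) -> dict:
    """Build the parameters for one day's SageMaker Batch Transform
    job over that day's production features (paths are assumptions)."""
    day = run_date.isoformat()
    return {
        "TransformJobName": f"conversion-shadow-{day}",
        "ModelName": "conversion-rate-model",  # hypothetical model name
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    # hypothetical bucket holding the day's feature rows
                    "S3Uri": f"s3://my-bucket/daily-features/{day}/",
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        "TransformOutput": {
            "S3OutputPath": f"s3://my-bucket/shadow-predictions/{day}/"
        },
        "TransformResources": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
        },
    }

# Inside the Step Functions-triggered Lambda:
# boto3.client("sagemaker").create_transform_job(
#     **build_shadow_transform_request(date.today()))
```

Because the job runs offline, instance sizing can be tuned purely for cost and throughput rather than latency.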
Once we confirmed the model was meeting or exceeding existing business performance (our initial project success benchmark), we started using its predictions within production applications to help make better business decisions.
Even after a model is proven preliminarily through a shadow deployment, we still do not expose it to 100% of production traffic immediately.
We begin with the slice of traffic that presents the least risk to business operations. Once the model’s predictions prove their value on that small, low-impact slice of production data, we gradually roll it out until we reach the full scale of its application.
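One common way to implement this gradual rollout (a sketch under assumptions, not our exact mechanism) is deterministic hash-based bucketing: each entity hashes to a stable bucket, and raising the rollout percentage widens the slice without reshuffling who is already on the new model:

```python
import hashlib

def use_new_model(entity_id: str, rollout_pct: int) -> bool:
    """Route a fixed, gradually increasing slice of traffic to the
    new model. Hashing the entity ID keeps each entity's assignment
    stable across requests as rollout_pct grows."""
    digest = hashlib.sha256(entity_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_pct

# Example schedule: start at 5%, widen to 25%, then go to 100%
# once the low-impact slice confirms the model's value.
```

Deterministic routing also makes results easier to analyze, since the same entity never flips back and forth between models mid-experiment.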
Shadow deployments allow us to test new model versions with production data without impacting users or business operations. We can do this for deploying new models where the existing benchmark is a legacy business operation, as well as for updating production models with newly trained ones.
Have you executed shadow deployments prior to going “all-in” with production traffic? How do you know when it’s time to use the new model on all the production traffic? Comment below!
If you need help implementing cloud-native MLOps, building Well-Architected production ML software solutions or training/inference pipelines, monetizing your ML models in production, or if you have specific solution architecture questions or would like us to review your architecture and provide feedback based on your goals, contact us or send me a message and we will be happy to help.
Subscribe to my blog at: https://gradientgroup.ai/blog/
Follow me on LinkedIn: https://linkedin.com/in/carloslaraai