Have you released a machine learning solution to production, only to find yourself pulling KPI metrics manually every day to keep updating stakeholders on results?
Or, have you found yourself manually updating Lambda code in the AWS console to quickly fix a production bug a few hours into the release?
Both of these common scenarios illustrate the lower degrees of automation for production machine learning solutions.
The dream is to have fully automated end-to-end ML solutions requiring minimal (if any) developer intervention throughout the course of operations. This is a core principle of the AWS Well-Architected Framework’s Operational Excellence pillar.
One of the biggest challenges of post-production machine learning engineering is managing live operations of each solution. As my team and I discovered, even “basic” things like BI reporting and analytics can take up all of our time post-release, preventing us from extending the solutions or building new ones.
“Automate everything” and “run what you build” are now both part of our company culture and core values.
These are the 3 degrees of automation for machine learning solutions, using AWS as an example:
1) Zero Automation
The 1st degree of automation is what we call the “ClickOps” or “zero automation” phase.
There is no formal DevOps implementation, and all changes happen manually through the AWS console. Code changes cannot be tracked because everything is ad hoc, and there are no testing gates or QA for accepting changes into the production environment.
MLOps may be in its early phases, such as using MLflow to track experiments, but few software engineering best practices are in place, such as converting Jupyter notebooks into serverless microservice orchestration workflows.
Machine learning models are typically produced directly from executing notebooks, without a formal process for custom evaluation, production deployments, and drift monitoring. The datasets themselves are commonly extracted ad-hoc from a data lake without a feature store or automated workflow to generate versioned point-in-time snapshots.
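Even a lightweight versioning step goes a long way toward fixing the ad-hoc extraction problem. The sketch below is a minimal illustration, not tied to any particular feature store or data lake; the function name and file layout are hypothetical. It tags each extracted dataset with its point-in-time date and a content hash, so a training run can later be traced back to the exact snapshot it used:

```python
import hashlib
import json
from datetime import date, datetime, timezone
from pathlib import Path

def save_snapshot(records: list[dict], out_dir: str, as_of: date) -> Path:
    """Write a point-in-time dataset snapshot, versioned by date and content hash."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]  # short content hash for versioning
    path = Path(out_dir) / f"snapshot_{as_of.isoformat()}_{digest}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    # A sidecar metadata file makes the snapshot auditable and reproducible
    meta = {
        "as_of": as_of.isoformat(),
        "content_hash": digest,
        "rows": len(records),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    path.with_suffix(".meta.json").write_text(json.dumps(meta, indent=2))
    return path
```

In a real pipeline this step would run inside an automated extraction workflow rather than being called by hand, with the snapshot location registered wherever your training jobs look up their inputs.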
2) DevOps Automation
The 2nd degree of automation happens when DevOps practices are implemented from dev all the way to prod. This involves the use of:
- Source control for all code changes, such as CodeCommit or GitLab
- Infrastructure-as-code for all the components and resources that make up a solution, such as CloudFormation or Cloud Development Kit (CDK)
- An event-driven, cross-account CI/CD pipeline that systematically promotes merged changes from dev to testing/staging to production (a separate environment/account for each stage)
- Observability of the key operational metrics for each component in the solution, with threshold- or anomaly-based alarms, using tools such as CloudWatch or 3rd-party vendors
- Modern software engineering and MLOps best practices, such as microservices architecture for both training and inference pipelines, controlled model deployment & rollout, drift monitoring, versioning & rollback capabilities, feature store, and much more
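As one concrete example of the controlled deployment and rollout practice above, a CI/CD pipeline can run an automated evaluation gate before promoting a candidate model to the next stage. The sketch below is a minimal, framework-agnostic illustration; the metric names, regression tolerance, and function name are all hypothetical, and a real pipeline would pull these values from a model registry or an evaluation job:

```python
def passes_promotion_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    max_regression: float = 0.01,
) -> bool:
    """Return True if the candidate model may be promoted to the next stage.

    Every tracked metric (higher is better here) must land within
    `max_regression` of the current production baseline.
    """
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric)
        if cand_value is None:
            return False  # candidate is missing a required metric
        if cand_value < base_value - max_regression:
            return False  # unacceptable regression on this metric
    return True
```

A pipeline stage would call this check after the evaluation job completes and fail the build on False, blocking promotion until a human investigates or a better candidate is produced.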
For data science and machine learning teams, DevOps really means DevOps + MLOps.
3) Maximum Automation
This 3rd degree of automation happens when production ML solutions have become automated cash-flowing assets (or drivers of any other tangible business-value KPI), and everything that supports them is also automated. This involves adding:
- Analytics workflows that log immutable event data, join it with other sources in the data lake/lakehouse, and feed reporting dashboards for business leaders and stakeholders (no more endless post-release SQL querying in response to tap-on-the-shoulder requests; the BI team handles everything with maximum automation)
- Event-driven model re-training in response to drift events, plus fully automated error handling, fault tolerance/failure recovery, and operational reliability for every piece of the end-to-end ML deployment lifecycle
- Business stakeholders owning the end-user experience of interacting with the ML solutions, collecting feedback, and logging key learnings on a weekly basis so the ML team can learn and iterate (there is often highly valuable user feedback that analytics/logging alone cannot capture)
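To make drift-triggered re-training concrete: one common (though by no means the only) drift signal is the Population Stability Index between a training-time baseline distribution and recent production data. The pure-Python sketch below assumes both distributions are already binned with the same bin edges; the 0.2 threshold is a conventional rule of thumb, not a universal constant. An alarm on this value could publish the event that kicks off an automated re-training workflow:

```python
import math

def population_stability_index(expected: list[int], actual: list[int]) -> float:
    """PSI between two binned distributions (same bin edges assumed).

    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
    """
    eps = 1e-6  # guard against empty bins (division by zero, log of zero)
    e_total, a_total = sum(expected), sum(actual)
    psi = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

def drift_detected(expected: list[int], actual: list[int], threshold: float = 0.2) -> bool:
    """Decide whether to emit a re-training event for downstream automation."""
    return population_stability_index(expected, actual) > threshold
```

In the maximum-automation picture, this check runs on a schedule or in response to fresh inference logs, and a True result publishes an event that the re-training pipeline subscribes to, with no developer in the loop.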
As you take a machine learning project from research to production, invest time upfront to devise an operational strategy that involves maximum automation and separation of responsibilities.
For example, align with the BI team for reporting needs so they have time to plan and prioritize sprint work. Align with the data engineering team for upstream data ingestion, transformation, and monitoring needs. Align with business stakeholders in field operations and set clear expectations on how to manage end user experiences.
It may not be possible to achieve everything at once on the first release. However, plan for iterations and have an evolution roadmap that incrementally delivers increased automation in small batches every sprint. Eventually, each production machine learning solution will reach a point of maximum automation while delivering consistent business value at scale.
How do you approach end-to-end automation for production solutions? Is DevOps a core element of your company culture? Let us know in the comments!
Check out the various links throughout this article for additional information on those topics.
Subscribe to our weekly LinkedIn newsletter: Machine Learning In Production
Reach out if you need help:
- Maximizing the business value of your data to improve core business KPIs
- Deploying & monetizing your machine learning models in production
- Building Well-Architected production software solutions
- Implementing cloud-native DevOps & MLOps
- Training your teams to systematically take models from research to production
- Identifying new DS/ML opportunities in your company or taking existing projects to the next level
Would you like me to speak at your event? Email me at email@example.com
Subscribe to our blog: https://gradientgroup.ai/blog/