CI/CD Pipelines for Machine Learning Solutions in AWS

All machine learning projects within AWS begin in a development environment – usually SageMaker Studio notebooks, or Glue notebooks (when a cluster is needed during PySpark development). Both environments support CodeCommit for source control.

Suppose we developed an end-to-end data science workflow in a Jupyter notebook using a dataset extracted from S3, Redshift, Aurora, or any other data source(s).

Result: We trained a machine learning model that meets or exceeds the evaluation-metric targets we set based on business objectives.

Next, we want to deploy this model into production to help improve business KPIs.

However, working with Jupyter notebooks in a development environment is not scalable, reliable, or automated enough to be sustainable – especially if we have a growing data science team with multiple ML solutions.

We need a systematic and fully automated way to continuously test and integrate code changes across team members from dev to master and deploy these changes to production.

We call this process CI/CD: Continuous Integration and Continuous Deployment.

A successful CI/CD pipeline implementation yields the following capabilities:

  • Automatically and seamlessly integrate code changes into the master branch in response to commits to the dev branch
  • Ability to test code changes (unit testing, integration testing, acceptance testing, etc.) prior to production deployments
  • Ability to update production ML solutions reliably and systematically through infrastructure-as-code (see my previous post for more information)
  • ML pipeline component versioning
  • Minimize production deployment failures
  • Ability to roll back to the previous working version of any ML pipeline (or pipeline component) in case of failure
  • Minimize/eliminate the amount of error-prone, manual labor required to move a new piece of code from a dev environment into a prod environment
  • Provide continuous value to users as fast as possible, in small batches, ideally multiple times per day (for example, my team deploys to production 3+ times per day)

Within AWS, the most commonly used service for CI/CD is CodePipeline (in conjunction with CodeCommit, CodeBuild, CodeDeploy, and CloudFormation).

A CI/CD pipeline is composed of 3 major phases: Build, Test, and Deploy.

For machine learning solutions, the build phase packages Lambda function code (either as a zip file in S3 or as a container image in ECR), containerizes custom ML models, builds a test version of your ML solution via CloudFormation, and handles anything else needed to prepare your code for testing and deployment. This phase can be executed using CodeBuild through a buildspec.yml build configuration file.
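As a rough illustration of the packaging step, here is a minimal Python sketch that zips a Lambda source directory. The function name `package_lambda` and the directory layout are assumptions made for this example, not part of any AWS tooling; it uses only the standard library.

```python
import zipfile
from pathlib import Path


def package_lambda(source_dir: str, zip_path: str) -> list[str]:
    """Zip every file under source_dir and return the archive member names.

    Hypothetical helper for illustration only.
    """
    root = Path(source_dir)
    out = Path(zip_path).resolve()
    # Collect files up front and skip the output archive itself,
    # in case an older copy already lives inside source_dir.
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.resolve() != out
    )
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to source_dir so the handler module
            # sits at the root of the archive.
            archive.write(path, arcname=path.relative_to(root).as_posix())
    return [p.relative_to(root).as_posix() for p in files]
```

In a CodeBuild project, a script like this could be called from a buildspec.yml command, with the resulting zip file uploaded to S3 in a later command. The exact wiring depends on your repository layout.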

Next, the test phase triggers unit testing per component, integration testing of the pipeline as a whole (to ensure that an update to a single component does not break the pipeline), and any other forms of testing needed. This testing can be performed using Lambda functions or ECS Fargate tasks.
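To make per-component unit testing concrete, here is a small sketch using Python's built-in `unittest`. The preprocessing handler below is a made-up component (it drops records with missing values), and `run_tests` is a hypothetical entry point that a test stage could call.

```python
import unittest


def lambda_handler(event, context):
    """Hypothetical preprocessing component: drop records with missing values."""
    records = event.get("records", [])
    clean = [r for r in records if all(v is not None for v in r.values())]
    return {"records": clean, "dropped": len(records) - len(clean)}


class PreprocessingHandlerTest(unittest.TestCase):
    def test_drops_records_with_missing_values(self):
        event = {"records": [{"x": 1.0, "y": 2.0}, {"x": None, "y": 3.0}]}
        result = lambda_handler(event, context=None)
        self.assertEqual(result["records"], [{"x": 1.0, "y": 2.0}])
        self.assertEqual(result["dropped"], 1)

    def test_empty_event_is_handled(self):
        result = lambda_handler({}, context=None)
        self.assertEqual(result, {"records": [], "dropped": 0})


def run_tests() -> bool:
    """Run the component's tests and report whether all of them passed."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(PreprocessingHandlerTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A test stage would fail the pipeline whenever `run_tests()` returns `False`, for example by exiting with a non-zero status. Integration tests follow the same idea but exercise the deployed test version of the whole pipeline rather than a single function.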

Finally, the deploy phase updates the production components of the machine learning solution via CodeBuild and CloudFormation. This phase can also include an automatic merge of the dev or test branch into the master branch in CodeCommit via a Lambda function.
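The automatic merge could look roughly like the sketch below. It assumes the Lambda function is invoked as a CodePipeline action and that a fast-forward merge is acceptable. The repository name `ml-solution` and the helper `merge_and_report` are placeholders for this example; the boto3 clients are passed in as arguments so the logic can be tested without AWS access.

```python
def merge_and_report(job_id, codecommit, codepipeline,
                     repo="ml-solution", source="dev", target="master"):
    """Fast-forward `target` to `source`, then report the result to CodePipeline.

    `repo`, `source`, and `target` defaults are placeholders for this sketch.
    Returns the merged commit id, or None if the merge failed.
    """
    try:
        result = codecommit.merge_branches_by_fast_forward(
            repositoryName=repo,
            sourceCommitSpecifier=source,
            destinationCommitSpecifier=target,
            targetBranch=target,
        )
    except Exception as exc:  # e.g. the branches have diverged
        codepipeline.put_job_failure_result(
            jobId=job_id,
            # Truncate so an unusually long error cannot exceed the API limit.
            failureDetails={"type": "JobFailed", "message": str(exc)[:500]},
        )
        return None
    # CodePipeline waits until the job is explicitly marked as succeeded.
    codepipeline.put_job_success_result(jobId=job_id)
    return result["commitId"]


def lambda_handler(event, context):
    import boto3  # imported here so the helper above has no AWS dependency

    job_id = event["CodePipeline.job"]["id"]
    return merge_and_report(
        job_id, boto3.client("codecommit"), boto3.client("codepipeline")
    )
```

A fast-forward merge only succeeds when master has no commits that dev lacks, which fits a workflow where all changes flow through dev. Teams that commit directly to master would need a different merge strategy.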

The following solution architecture diagram illustrates how CI/CD works for serverless training pipelines in AWS:

Note how the entire CI/CD process is fully automated from initial commit into the dev branch all the way through to production deployment.

If any test phase fails, we receive a notification for the CodePipeline failure, debug via CloudWatch, CloudTrail, X-Ray, or error logs, identify the issue, fix it in the development environment, commit, and the process starts over. Broken code rarely makes it into production. And if it does, we learn from that failure by improving our testing system accordingly.
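One possible way to wire up the failure notification is an EventBridge rule that matches failed pipeline executions and triggers a small Lambda function that publishes to an SNS topic. The sketch below assumes that setup. `notify_on_failure` and the topic ARN are hypothetical, and the SNS client is passed in so the function can be exercised without AWS access.

```python
# EventBridge event pattern matching failed CodePipeline executions.
FAILURE_EVENT_PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {"state": ["FAILED"]},
}


def format_failure(event):
    """Build a (subject, message) pair from a pipeline state-change event."""
    detail = event["detail"]
    subject = f"CodePipeline FAILED: {detail['pipeline']}"
    message = (
        f"Pipeline: {detail['pipeline']}\n"
        f"Execution: {detail['execution-id']}\n"
        f"Region: {event.get('region', 'unknown')}\n"
        "Check the CloudWatch logs for the failing stage."
    )
    return subject, message


def notify_on_failure(event, sns, topic_arn):
    """Publish a notification if the event reports a failed execution.

    Returns True when a message was published, False otherwise.
    """
    if event.get("detail", {}).get("state") != "FAILED":
        return False
    subject, message = format_failure(event)
    # SNS subjects are limited to 100 characters.
    sns.publish(TopicArn=topic_arn, Subject=subject[:100], Message=message)
    return True
```

In a real deployment, `sns` would be `boto3.client("sns")` and the rule, topic, and function would themselves be defined in CloudFormation alongside the rest of the pipeline.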

What is your approach to CI/CD for machine learning solutions in AWS? Comment below!

I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering
