Infrastructure-as-Code for Machine Learning Pipelines in AWS

We all start our AWS journey in the console. We do everything there.

We manually create and configure Lambda functions, Step Functions, IAM roles, S3 buckets, EMR clusters, and any other service we need as we implement a machine learning solution.

We use source control for occasional commits to the dev branch to keep track of our code in general. But the code living in the AWS console (e.g., Lambda handlers) and the source code in CodeCommit stay in sync only through manual intervention.

This is acceptable during a POC/pilot phase as we prove the business value of machine learning on a small scale.

However, as the DS/ML team grows and pipelines grow in complexity, several questions arise:

  • “How do we keep track of all the code changes to our ML pipelines systematically?”
  • “How do we know that the console version of a component matches the one in our git repo?”
  • “How do we know if a machine learning engineer updated a pipeline component?”
  • “If a code update breaks a component or pipeline, how do we roll back to the working version?”
  • “How do we automate service provisioning and subsequent updates?”
  • “If an availability zone goes down, how do we make sure our ML pipelines continue to run?”
  • “How do we keep the DS/ML team organized in code vs a free-for-all in the AWS console?”

This is just the tip of the iceberg. Many more questions and concerns emerge on the path from a successful pilot to production deployments at scale.

Fortunately, AWS has a service that helps address all these questions: CloudFormation.

CloudFormation allows us to define, configure, and create our ML pipelines using code files (YAML or JSON). These pipeline definition code files are called CloudFormation templates.

This approach to creating end-to-end AWS solutions using CloudFormation templates is called infrastructure-as-code.

Here is a minimal sketch of what such a template looks like for an end-to-end serverless training pipeline. Every resource name, runtime, and parameter below is an illustrative placeholder, not production-ready configuration:
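```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: End-to-end serverless ML training pipeline (illustrative sketch)

Parameters:
  # Bucket that already holds the zipped Lambda deployment packages
  DeploymentBucket:
    Type: String

Resources:
  # S3 bucket where the pipeline writes model artifacts
  ModelArtifactsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'ml-model-artifacts-${AWS::AccountId}'

  # IAM role assumed by the training-trigger Lambda function
  TrainingLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

  # Lambda function that kicks off the training workflow
  StartTrainingFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: start-training
      Runtime: python3.12
      Handler: handler.lambda_handler
      Role: !GetAtt TrainingLambdaRole.Arn
      Code:
        S3Bucket: !Ref DeploymentBucket
        S3Key: lambda/start-training.zip
```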

A complete template also contains definitions for Step Functions (serverless workflow orchestration), Lambda triggers, IAM, CloudWatch, EventBridge rules, and more. CloudFormation templates can be as long and comprehensive as you need.

CloudFormation templates are created/updated in our IDE and always kept in a git repo. For machine learning engineers, SageMaker Studio provides a CodeCommit UI where we can pull, commit, and push code to any branch with just a few clicks (I like this much better than using a system terminal).

Once we have our Lambda functions, Glue jobs, IAM roles, Step Functions, and anything else we need for our ML solutions defined in a CloudFormation template, the next step is to deploy it. A deployed template is called a CloudFormation stack.

This deployment can be done manually with the AWS CLI (aws cloudformation deploy) in a SageMaker Studio terminal, but I recommend setting up a CI/CD workflow in CodePipeline that performs it automatically every time you commit code.
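As a sketch of the automated option, a CodeBuild buildspec along these lines can validate, package, and deploy the template on every push. The stack name and bucket are assumptions, not a prescribed setup:

```yaml
# buildspec.yml -- executed by CodeBuild inside a CodePipeline stage
version: 0.2

phases:
  pre_build:
    commands:
      # Fail fast if the template is malformed
      - aws cloudformation validate-template --template-body file://template.yaml
  build:
    commands:
      # Upload local Lambda code to S3 and rewrite the template references
      - aws cloudformation package --template-file template.yaml --s3-bucket my-deployment-bucket --output-template-file packaged.yaml
      # Create the stack if it does not exist, update it if it does
      - aws cloudformation deploy --template-file packaged.yaml --stack-name ml-training-pipeline --capabilities CAPABILITY_NAMED_IAM
```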

  • What happens if the deployment fails?

The CloudFormation stack automatically rolls back to the previous (working) version. You can see this visually in the CloudFormation console.

  • What happens if we need to update a Lambda function without going to the console?

Update the code in CodeCommit, commit and push, and let CodePipeline and CloudFormation update the Lambda code automatically (as a zipped deployment package in S3 or a container image in ECR, depending on your preference).

This approach guarantees our source code and console views stay in sync.
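For the container-image variant, the function definition simply points at an ECR image instead of a zip. A hedged sketch, where the repository name and tag are placeholders:

```yaml
# Excerpt: image-based Lambda -- pushing a new image tag and updating the
# stack replaces the function code, no console edits involved
TrainingTriggerFunction:
  Type: AWS::Lambda::Function
  Properties:
    PackageType: Image
    Role: !GetAtt TrainingLambdaRole.Arn
    Code:
      ImageUri: !Sub '${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/ml-lambdas:v2'
```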

  • If an availability zone goes down, how do we make sure our ML pipelines continue to run?

We can set up CloudWatch alarms and automated responses that trigger on AZ failures (or any type of failure that brings down our pipelines); a sketch of such an alarm follows below.

If anything happens, we simply redeploy our CloudFormation template(s) against a healthy availability zone (for AZ-pinned resources, the subnet is just a template parameter), and within minutes our pipelines are back up.
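Here is a hedged sketch of the alarm piece, watching for failed Step Functions executions. The state machine reference, threshold, and SNS topic are assumptions:

```yaml
# Excerpt: alarm that fires when any pipeline execution fails
PipelineFailureAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: A training pipeline execution failed
    Namespace: AWS/States
    MetricName: ExecutionsFailed
    Dimensions:
      - Name: StateMachineArn
        Value: !Ref TrainingStateMachine    # assumes the state machine lives in this template
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref OpsNotificationTopic           # placeholder SNS topic driving the automated response
```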

  • How do we keep track of all the code changes to our ML pipelines?

The combination of CodeCommit, CodePipeline, and CloudFormation keeps track of all changes systematically. Additionally, test stages within CodePipeline ensure new code does not make it into production unless all tests pass.

This enables us to always have a working version of our pipelines in production while allowing for incremental changes, updates, and improvements via commits.
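For reference, here is a hedged sketch of the Stages section of such a pipeline, where the deploy stage only runs if the test stage succeeds. Repository, project, and role names are placeholders:

```yaml
# Excerpt: Stages of an AWS::CodePipeline::Pipeline resource
Stages:
  - Name: Source
    Actions:
      - Name: PullSource
        ActionTypeId: { Category: Source, Owner: AWS, Provider: CodeCommit, Version: '1' }
        Configuration: { RepositoryName: ml-pipelines, BranchName: main }
        OutputArtifacts: [ { Name: SourceOutput } ]
  - Name: Test
    Actions:
      - Name: RunTests
        ActionTypeId: { Category: Test, Owner: AWS, Provider: CodeBuild, Version: '1' }
        Configuration: { ProjectName: ml-pipeline-tests }
        InputArtifacts: [ { Name: SourceOutput } ]
  - Name: Deploy
    Actions:
      - Name: DeployStack
        ActionTypeId: { Category: Deploy, Owner: AWS, Provider: CloudFormation, Version: '1' }
        Configuration:
          ActionMode: CREATE_UPDATE
          StackName: ml-training-pipeline
          TemplatePath: SourceOutput::template.yaml
          Capabilities: CAPABILITY_NAMED_IAM
          RoleArn: !GetAtt CloudFormationDeployRole.Arn   # deployment role defined elsewhere
        InputArtifacts: [ { Name: SourceOutput } ]
```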

Note how we are able to automate everything within AWS. This automation through source control, CI/CD, and infrastructure-as-code is mandatory for sustainable machine learning deployments at scale.

No more running around the console trying to manage everything manually. Treat the AWS console as read-only as much as possible (for example, only going there to analyze CloudWatch logs when debugging pipeline failures).

What is your approach to source control, CI/CD, and infrastructure-as-code within AWS? Comment below! I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If your data science team needs help deploying models to production through scalable end-to-end ML engineering pipelines in AWS, reach out and I will be happy to help you.

Connect with me on LinkedIn: https://www.linkedin.com/in/CarlosLaraAI/
