Drift Monitoring for Machine Learning Models in AWS

We have trained a machine learning model that meets or exceeds performance metrics as a function of business requirements.

We have deployed this model to production after converting our Jupyter notebook into a scalable end-to-end training pipeline, including CI/CD and infrastructure-as-code.

This deployment could be a SageMaker endpoint for live inference, or a Lambda function that creates a batch transform job out of the model artifacts in S3 as needed (trigger or schedule).

However, given the dynamic nature of a marketplace or business environment, our deployed model’s performance will almost certainly deteriorate over time: feature distributions will shift, supply and demand will fluctuate, customer preferences will evolve, etc.

Also, if our deployed model is actively used to make decisions at scale, the machine learning solution itself will change the data distributions – hopefully in the desired direction due to better business outcomes.

In machine learning, these inevitable data distribution shifts are called drift, and a few important questions arise upon model deployment:

  • “What is our model’s ongoing performance on production data?”
  • “Under what conditions should we trigger re-training?”
  • “What are the proper model evaluation metrics to compare new models against the current model in production?”

Let’s take the example of OLTP transactions in a relational database, such as e-commerce events.

Using a dataset of historical transactions, we trained and deployed a machine learning model that predicts the probability of a given customer purchasing a specific product. We then use this model to help inform product recommendations.

We can assess our model’s ongoing performance on production data by comparing the prediction to the actual outcome, per transaction. This monitoring can be done daily, weekly, or monthly, depending on the specific business domain.

Then, we can trigger our training pipeline (re-training) if our deployed model’s objective metric drops below a desired threshold on the latest batch of new records.
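The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric is plain accuracy at a 0.5 cutoff (swap in whatever objective metric your business case calls for), and the names check_batch and BatchReport, as well as the 0.8 threshold, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BatchReport:
    metric: float   # objective metric measured on the latest batch
    retrain: bool   # True when the metric fell below the threshold


def accuracy(predicted_probs: Sequence[float], outcomes: Sequence[int],
             cutoff: float = 0.5) -> float:
    """Fraction of transactions where the thresholded prediction matched the outcome."""
    if len(predicted_probs) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("need equally sized, non-empty prediction and outcome lists")
    hits = sum(int(p >= cutoff) == y for p, y in zip(predicted_probs, outcomes))
    return hits / len(outcomes)


def check_batch(predicted_probs: Sequence[float], outcomes: Sequence[int],
                threshold: float) -> BatchReport:
    """Compare predictions to actual outcomes and flag re-training if needed."""
    metric = accuracy(predicted_probs, outcomes)
    return BatchReport(metric=metric, retrain=metric < threshold)


# Latest batch: predicted purchase probabilities vs. what customers actually did.
report = check_batch([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0], threshold=0.8)
print(report)
```

In a real pipeline, the retrain flag would be what starts the training pipeline, for example from a scheduled job that runs daily, weekly, or monthly.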

Back when we trained our first model, we evaluated performance on a randomly sampled test set or through cross-validation. Randomly sampled test sets contain records from the entire time period of the training dataset.

However, it is more important for our model to perform well on the most recent records than records farther back in time. This ensures we meet the present business needs.

Therefore, model evaluation puts a premium (higher weight) on performance on the most recent records. For example, a model’s objective metric (such as mean average precision) can be computed per week and multiplied by a weight that decays exponentially with age: the metric on records from the last 7 days gets the largest weight, and each preceding week’s weight shrinks by a factor gamma, where gamma is a number between 0 and 1. The weighted sum gives us the final model evaluation score.

This is one way to decide whether to replace the current production model with the brand-new one produced by re-training.

If the newly trained model achieves a higher model evaluation score, replace the deployed model. Otherwise, simply store metadata about the training pipeline run and wait for the next run as a function of drift monitoring.
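The recency-weighted score and the replacement rule can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes the metric is computed per week, that the most recent week gets weight 1 and the week k weeks back gets weight gamma**k, and the function names and sample numbers are hypothetical.

```python
from typing import Sequence


def decayed_score(weekly_metrics: Sequence[float], gamma: float) -> float:
    """Weighted sum of weekly metrics; weekly_metrics[0] is the most recent week.

    The week k weeks back is weighted by gamma**k (an assumed weighting scheme).
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    if len(weekly_metrics) == 0:
        raise ValueError("need at least one weekly metric")
    return sum(metric * gamma ** k for k, metric in enumerate(weekly_metrics))


def should_promote(candidate_weekly: Sequence[float],
                   production_weekly: Sequence[float],
                   gamma: float = 0.5) -> bool:
    """Replace the production model only if the candidate scores strictly higher."""
    return decayed_score(candidate_weekly, gamma) > decayed_score(production_weekly, gamma)


# Both models are evaluated on the same three weeks, most recent first.
candidate = [0.80, 0.72, 0.60]
production = [0.70, 0.76, 0.80]
print(should_promote(candidate, production))
```

With these sample numbers the production model has the higher plain average, but the candidate is stronger on the most recent week, so the decayed score favors the candidate. Both models must be scored on the same weeks for the comparison to be meaningful.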

There are many ways to slice a dataset or feature space during model evaluation, and each slice is a different representation of the business with a different impact. Always collaborate with business domain and subject matter experts when deciding model evaluation metrics.

What is your approach to drift monitoring, re-training, and model evaluation? Comment below! I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If your data science team needs help deploying models to production through scalable end-to-end ML engineering pipelines in AWS, reach out and I will be happy to help you.

Connect with me on LinkedIn: https://www.linkedin.com/in/CarlosLaraAI/


Machine Learning Product Success Metrics

When you are building an AI/ML product, it’s paramount that you define clear success metrics from the beginning.

These metrics will help guide the AI product development lifecycle and ensure that your team converges on the right product that solves business problems/user needs.

There are two ways to assess AI/ML product success:

1) Business outcomes

Business outcomes are the most important success metrics for AI products (and AI adoption in general). These are business objectives that result in tangible value created and captured by AI/ML products.


CI/CD Pipelines for Machine Learning Solutions in AWS

All machine learning projects within AWS begin in a development environment – usually SageMaker Studio notebooks, or Glue notebooks (when a cluster is needed during PySpark development). Both environments support CodeCommit for source control.

Suppose we developed an end-to-end data science workflow in a Jupyter notebook using a dataset extracted from S3, Redshift, Aurora, or any other data source(s).

Result: We trained a machine learning model that meets or exceeds model evaluation metrics as a function of business objectives.

Next, we want to deploy this model into production to help improve business KPIs.

However, working with Jupyter notebooks in a development environment is not scalable, reliable, or automated enough to be sustainable – especially if we have a growing data science team with multiple ML solutions.

We need a systematic and fully automated way to continuously test and integrate code changes across team members from dev to master and deploy these changes to production.

We call this process CI/CD: Continuous Integration and Continuous Deployment.

A successful CI/CD pipeline implementation yields the following capabilities:

  • Automatically and seamlessly integrate code changes into the master branch in response to commits to the dev branch
  • Ability to test code changes (unit testing, integration testing, acceptance testing, etc.) prior to production deployments
  • Ability to update production ML solutions reliably and systematically through infrastructure-as-code (see my previous post for more information)
  • ML pipeline component versioning
  • Minimize production deployment failures
  • Ability to roll back to the previous working version of any ML pipeline [component] in case of any failures
  • Minimize/eliminate the amount of error-prone, manual labor required to move a new piece of code from a dev environment into a prod environment
  • Provide continuous value to users as fast as possible, in small batches, ideally multiple times per day (for example, my team deploys to production 3+ times per day)

Within AWS, the most commonly used service for CI/CD is CodePipeline (in conjunction with CodeCommit, CodeBuild, CodeDeploy, and CloudFormation).

A CI/CD pipeline is composed of 3 major phases: Build, Test, and Deploy.

For machine learning solutions, the build phase packages up Lambda function code (either as a zip file in S3 or as an ECR image), containerizes custom ML models, builds a test version of your ML solution via CloudFormation, and handles anything else needed to prepare your code for testing and deployment. This phase can be executed using CodeBuild through a buildspec.yml build configuration file.

Next, the test phase triggers unit testing per component, integration testing of the pipeline as a whole (to ensure the single component update does not break the pipeline), and any other forms of testing needed. This testing can be performed using Lambda functions or ECS Fargate tasks.

Finally, the deploy phase updates the production components of the machine learning solution via CodeBuild and CloudFormation. This phase can also include an automatic merge of the dev or test branch into the master branch in CodeCommit via a Lambda function.
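As a small illustration of the per-component unit testing mentioned in the test phase, here is a sketch using Python's built-in unittest module. The handler and its dataset_uri field are hypothetical stand-ins for a real pipeline component; in practice the handler would be imported from the component's own module.

```python
import unittest


def lambda_handler(event, context):
    """Hypothetical pipeline component: validates a training request."""
    if "dataset_uri" not in event:
        raise ValueError("event must include 'dataset_uri'")
    return {"status": "ok", "dataset_uri": event["dataset_uri"]}


class LambdaHandlerTest(unittest.TestCase):
    def test_valid_event_passes_through(self):
        result = lambda_handler({"dataset_uri": "s3://your_s3_bucket/train.csv"}, None)
        self.assertEqual(result["status"], "ok")

    def test_missing_dataset_is_rejected(self):
        with self.assertRaises(ValueError):
            lambda_handler({}, None)


# Run the suite programmatically so the calling build step can inspect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(LambdaHandlerTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", outcome.wasSuccessful())
```

A test stage would typically fail the pipeline run when wasSuccessful() returns False, which is what keeps a broken component from reaching the deploy phase.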

The following solution architecture diagram illustrates how CI/CD works for serverless training pipelines in AWS:

Note how the entire CI/CD process is fully automated from initial commit into the dev branch all the way through to production deployment.

If any test phase fails, we receive a notification for the CodePipeline failure, debug via CloudWatch, CloudTrail, X-Ray, or error logs, identify the issue, fix it in the development environment, commit, and the process starts over. Broken code rarely makes it into production. And if it does, we learn from that failure by improving our testing system accordingly.

What is your approach to CI/CD for machine learning solutions in AWS? Comment below!

I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If your data science team needs help deploying ML solutions to production to maximize business value from models, reach out with any questions, schedule a call with me, and I will be happy to help you.


Infrastructure-as-Code for Machine Learning Pipelines in AWS

We all start our AWS journey in the console. We do everything there.

We manually create and configure Lambda functions, Step Functions, IAM roles, S3 buckets, EMR clusters, and any other service we need as we implement a machine learning solution.

We use source control for occasional commits to the dev branch to keep track of our code in general. But the code in the AWS console (e.g. Lambda handlers) and the CodeCommit source code are not in sync except through manual intervention.

This is acceptable during a POC/pilot phase as we prove the business value of machine learning on a small scale.

However, as the DS/ML team grows and pipelines grow in complexity, several questions arise:

  • “How do we keep track of all the code changes to our ML pipelines systematically?”
  • “How do we know that the console version of a component matches the one in our git repo?”
  • “How do we know if a machine learning engineer updated a pipeline component?”
  • “If a code update breaks a component or pipeline, how do we roll back to the working version?”
  • “How do we automate service provisioning and subsequent updates?”
  • “If an availability zone goes down, how do we make sure our ML pipelines continue to run?”
  • “How do we keep the DS/ML team organized in code vs a free-for-all in the AWS console?”

This is just the tip of the iceberg. There are many questions/concerns that emerge between successful pilots and production deployments at scale.

Fortunately, AWS has a service that helps address all these questions: CloudFormation.

CloudFormation allows us to define, configure, and create our ML pipelines using code files (YAML or JSON). These pipeline definition code files are called CloudFormation templates.

This approach to creating end-to-end AWS solutions using CloudFormation templates is called infrastructure-as-code.
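To make the idea concrete, here is a minimal sketch that builds a tiny CloudFormation template as a Python dictionary and renders it as JSON. It is not taken from any real pipeline: the logical ID ExtractorFunction, the LambdaRoleArn parameter, the bucket, key, and handler names are all hypothetical placeholders.

```python
import json


def build_template(function_name: str, code_bucket: str, code_key: str) -> dict:
    """Return a minimal CloudFormation template with a single Lambda function."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal sketch: one Lambda function for an ML pipeline step",
        "Parameters": {
            # The IAM role is passed in rather than created here, to keep the sketch short.
            "LambdaRoleArn": {
                "Type": "String",
                "Description": "ARN of an existing IAM role for the function",
            }
        },
        "Resources": {
            "ExtractorFunction": {
                "Type": "AWS::Lambda::Function",
                "Properties": {
                    "FunctionName": function_name,
                    "Runtime": "python3.12",
                    "Handler": "handler.lambda_handler",
                    "Role": {"Ref": "LambdaRoleArn"},
                    "Code": {"S3Bucket": code_bucket, "S3Key": code_key},
                    "Timeout": 60,
                },
            }
        },
    }


template_json = json.dumps(
    build_template("dataset-extractor", "your_s3_bucket", "lambda/extractor.zip"),
    indent=2,
)
print(template_json)
```

The same structure can be written directly in YAML. Generating it from Python is just a convenient way to show that a template is ordinary, versionable data.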

Here is an excerpt of a CloudFormation template for an end-to-end serverless training pipeline:

This template contains additional definitions for Step Functions (serverless workflow orchestration), Lambda triggers, IAM, CloudWatch, EventBridge rules, and more. CloudFormation templates can be as long and comprehensive as you want.

CloudFormation templates are created/updated in our IDE and always kept in a git repo. For machine learning engineers, SageMaker Studio provides a CodeCommit UI where we can pull, commit, and push code to any branch with just a few clicks (I like this much better than using a system terminal).

Once we have our Lambda functions, Glue jobs, IAM roles, Step Functions, and anything else we need for our ML solutions defined in a CloudFormation template, the next step is to deploy it. A deployed template is called a CloudFormation stack.

This deployment can be done manually through the AWS CLI (for example, aws cloudformation deploy) in a SageMaker Studio terminal, but I recommend setting up a CI/CD CodePipeline workflow that performs this automatically every time you commit code.

  • What happens if the deployment fails?

The CloudFormation stack automatically rolls back to the previous (working) version. You can see this visually in the CloudFormation console.

  • What happens if we need to update a Lambda function without going to the console?

Update the code in CodeCommit, commit/push, and let CodePipeline and CloudFormation update the Lambda code automatically (zipped file in S3 or ECR image, depending on your preference).

This approach guarantees our source code and console views stay in sync.

  • If an availability zone goes down, how do we make sure our ML pipelines continue to run?

We can set up CloudWatch alarms and automatic response triggers to AZ failures (or any type of failure that brings down our pipelines).

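If anything happens, we simply redeploy our CloudFormation template(s) automatically against healthy infrastructure (for example, subnets in a different availability zone, or a different region, since CloudFormation stacks are regional), and within minutes our pipelines are back up.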

  • How do we keep track of all the code changes to our ML pipelines?

The combination of CodeCommit, CodePipeline, and CloudFormation keeps track of all changes systematically. Additionally, test phases within CodePipeline ensure new code does not make it into production unless all tests pass.

This enables us to always have a working version of our pipelines in production while allowing for incremental changes, updates, and improvements via commits.

Note how we are able to automate everything within AWS. This automation through source control, CI/CD, and infrastructure-as-code is mandatory for sustainable machine learning deployments at scale.

No more running around the console trying to manage everything manually. Treat the AWS console as read-only as much as possible (such as analyzing CloudWatch logs when debugging pipeline failures).

What is your approach to source control, CI/CD, and infrastructure-as-code within AWS? Comment below! I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If your data science team needs help deploying models to production through scalable end-to-end ML engineering pipelines in AWS, reach out and I will be happy to help you.


Delta Lake for Machine Learning Pipelines in AWS

Machine learning pipelines begin with data extraction – whether training or inference.

After all, we need a dataset to begin any ML workflow.

Most of us begin by querying OLTP/OLAP tables from an on-premises relational database, such as SQL Server. When our query completes, we save the results locally as CSV and then upload the file manually to S3.

From there, we load the [relatively small] dataset into a Pandas DataFrame within a Jupyter notebook in SageMaker Studio. This approach is manual, inefficient, and does not scale past a certain dataset size.

Machine learning is a big reason why many organizations have migrated their databases to AWS using AWS Database Migration Service (DMS). This is typically part of a larger plan to create a data lake in S3. This way, data can be queried efficiently by big data and analytics tools such as EMR, Glue, Athena, Redshift, and more.

Database migration is just the beginning, though. When S3 is selected as the target of DMS, the data is partitioned into 2 major categories:

  1. Initial full load (all the records in a given table at the time of migration)
  2. Change data capture (new records/transactions as they are generated by applications)

The challenge is that these migrated tables are not in a queryable state, because we still need to combine the DMS full load with the CDC records to obtain the “live” version of a given table at any given time.

Do we do this every time we need to run a query, for every table we need, before we can even SELECT * FROM small_table?

It’s not sustainable because the number of CDC records is increasing continuously, written to different S3 partitions every minute, making this “combine-every-time” job more and more expensive in compute, memory, and time.

We need an efficient approach to “upsert” (update, insert, delete) CDC records into a live version of our tables, automatically as new records are created. We call the collection of these live tables the delta lake, and just like data lakes, it’s typically built on S3.
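The upsert semantics can be illustrated with a small pure-Python sketch that applies CDC records to a full-load snapshot. It assumes each CDC record carries an operation flag of I, U, or D (DMS writes such a flag in an Op column for CDC files on S3) and a primary key named id; the function name and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def apply_cdc(live: Dict[object, Row],
              cdc_records: Iterable[Tuple[str, Row]],
              key: str = "id") -> Dict[object, Row]:
    """Apply CDC records in arrival order and return the new live table.

    Each record is (op, row) where op is 'I' (insert), 'U' (update) or 'D' (delete).
    """
    table = dict(live)  # copy so the input snapshot is left untouched
    for op, row in cdc_records:
        if op == "D":
            table.pop(row[key], None)   # deleting a missing key is a no-op
        elif op in ("I", "U"):
            table[row[key]] = row       # insert a new row or overwrite the old one
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return table


full_load = {
    1: {"id": 1, "product": "book", "qty": 1},
    2: {"id": 2, "product": "lamp", "qty": 3},
}
cdc = [
    ("U", {"id": 1, "product": "book", "qty": 5}),
    ("I", {"id": 3, "product": "desk", "qty": 1}),
    ("D", {"id": 2}),
]
live_table = apply_cdc(full_load, cdc)
print(live_table)
```

Doing this once per incoming batch, instead of replaying the full load plus every CDC record at query time, is the point of keeping live tables.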

One approach to building a delta lake from our data lake is to set up an SQS queue that receives CDC event notifications from S3, for the tables of interest. Then, each SQS message triggers a Lambda function, which submits an “upsert job” (PySpark script) to an EMR cluster. The code then takes care of the upsert (using the MERGE command) to incrementally update a given table.
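A minimal sketch of the Lambda function in this flow might look like the following. It assumes the SQS message body is a standard S3 event notification, and that the upsert script accepts the S3 path of the new CDC file as its only argument; the cluster ID, the script location, and that argument convention are hypothetical.

```python
import json
from urllib.parse import unquote_plus

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # hypothetical EMR cluster ID
UPSERT_SCRIPT = "s3://your_s3_bucket/emr-scripts/upsert.py"  # hypothetical script path


def cdc_objects(event: dict) -> list:
    """Extract (bucket, key) pairs from S3 notifications wrapped in SQS messages."""
    found = []
    for message in event.get("Records", []):
        body = json.loads(message["body"])
        # S3 test notifications carry no "Records" list, so default to empty.
        for record in body.get("Records", []):
            s3 = record["s3"]
            # Object keys arrive URL-encoded in S3 notifications.
            found.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return found


def build_step(bucket: str, key: str) -> dict:
    """Describe one EMR step that runs the upsert script on a single CDC file."""
    return {
        "Name": f"upsert-{key}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     UPSERT_SCRIPT, f"s3://{bucket}/{key}"],
        },
    }


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be exercised without AWS access

    steps = [build_step(bucket, key) for bucket, key in cdc_objects(event)]
    if steps:
        boto3.client("emr").add_job_flow_steps(JobFlowId=CLUSTER_ID, Steps=steps)
    return {"submitted": len(steps)}
```

Keeping the event parsing and step construction in plain functions makes them easy to unit test; only lambda_handler touches AWS.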

This delta lake approach allows us to avoid combining DMS full load and CDC records every time we need to run a query on migrated tables in our S3 data lake. We always have live tables already combined, with additional functionality to query based on “what has changed recently” (for example, incremental feature engineering; hence the name “delta” lake).

All you need in your PySpark script for data extraction for a given table is the following:

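from delta.tables import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the session Glue or EMR already provides
your_table = "your_table"  # placeholder table name
table_path = f"s3a://your_s3_bucket/your_prefix/{your_table}"
df = spark.read.format("delta").load(table_path)  # read the live delta table
df.createOrReplaceTempView(your_table)  # Available to query using Spark SQL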

Sample Glue job definition for your CloudFormation template:

GlueDatasetExtractor:  # logical ID (placeholder)
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      PythonVersion: "3"
      ScriptLocation: "local_folder/script.py"
    Connections:
      Connections:
        - "VPC"
    DefaultArguments:
      "--extra-py-files": "s3://your_s3_bucket/spark-scripts/delta-core_2.11-0.6.1.jar"
      "--extra-jars": "s3://your_s3_bucket/emr-scripts/delta-core_2.11-0.6.1.jar"
      "--conf": "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
      "--enable-metrics": true
    Description: "Extractor component of ML training pipeline"
    ExecutionProperty:
      MaxConcurrentRuns: 1
    GlueVersion: "2.0"
    MaxRetries: 1
    Name: "Glue-Dataset-Extractor"
    NumberOfWorkers: 10
    Role: !Ref YourRoleArn
    Timeout: 2880
    WorkerType: "G.1X"

The DefaultArguments allow us to import delta.tables and read from our delta lake. With this capability, our ML pipelines can proceed all the way through to model training and deployment at scale.

Where are you in your data engineering infrastructure journey? Comment below! Also, let me know in the comments if you found this useful, or if you have a different approach to achieve delta lake capabilities for your ML pipelines.

If your team needs help building, testing, deploying, and scaling machine learning solutions to maximize business value from ML models, send me a LinkedIn message or email to info@carloslaraai and I will be happy to help you.


How To Scope Out A Dataset From Scratch (Enterprise ML)

Every machine learning solution requires a dataset that encapsulates the business problem to be solved.

A machine learning system will ingest this dataset, learn its complex patterns/relationships, and output a set of business predictions that help solve a specific business problem.

This sounds great, but how do you acquire this dataset?


AI/ML Product Management Fundamentals

How do you build products that leverage machine learning?

Machine learning is using data to answer valuable business questions.

Answering these business questions should lead to the creation of tangible business value. This could be increased revenue, decreased costs, increased retention rate, increased operational efficiency, etc.

Therefore, always focus on the business impact of artificial intelligence when building AI/ML products.


How To Identify Unknown Features In Machine Learning

What is a feature in machine learning?

A feature is a measurable property or characteristic of an event you want to predict.

But what happens if you have missing or unknown features for the event you want to predict? What if these features are crucially important to make accurate predictions?

Let’s look at a concrete example:

Suppose your goal is to predict whether a pipe will break/collapse due to erosion.


Bias In Artificial Intelligence

You may have heard the term “bias” in artificial intelligence. It usually refers to machine learning algorithms that make biased predictions.

Biased predictions are a sign of underperforming machine learning models that were not trained with the proper datasets.

Most people know that the performance of a machine learning model is directly proportional to the quantity and quality of the dataset used to train it.


The Most Important Element Of AI Adoption

What is the most important element that will determine the success or failure of an AI/ML project?

Most people, including technical professionals in the field, would think it’s the datasets: quality, quantity, and a data engineering pipeline to produce them. This is because machine learning algorithms perform only as well as the data used to train them.

However, business leaders are quickly realizing that the most important element of AI adoption is actually defining the business problem(s) correctly.
