Infrastructure-as-Code for Machine Learning Pipelines in AWS

We all start our AWS journey in the console. We do everything there.

We manually create and configure Lambda functions, Step Functions, IAM roles, S3 buckets, EMR clusters, and any other service we need as we implement a machine learning solution.

We use source control for occasional commits to the dev branch to keep general track of our code. But the code in the AWS console (e.g., Lambda handlers) and the source code in CodeCommit stay in sync only through manual intervention.

This is acceptable during a POC/pilot phase as we prove the business value of machine learning on a small scale.

However, as the DS/ML team grows and pipelines grow in complexity, several questions arise:

  • “How do we keep track of all the code changes to our ML pipelines systematically?”
  • “How do we know that the console version of a component matches the one in our git repo?”
  • “How do we know if a machine learning engineer updated a pipeline component?”
  • “If a code update breaks a component or pipeline, how do we roll back to the working version?”
  • “How do we automate service provisioning and subsequent updates?”
  • “If an availability zone goes down, how do we make sure our ML pipelines continue to run?”
  • “How do we keep the DS/ML team organized in code vs a free-for-all in the AWS console?”

This is just the tip of the iceberg. There are many questions/concerns that emerge between successful pilots and production deployments at scale.

Fortunately, AWS has a service that helps address all these questions: CloudFormation.

CloudFormation allows us to define, configure, and create our ML pipelines using code files (YAML or JSON). These pipeline definition code files are called CloudFormation templates.

This approach to creating end-to-end AWS solutions using CloudFormation templates is called infrastructure-as-code.

Here is an excerpt of a CloudFormation template for an end-to-end serverless training pipeline:
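
Below is an illustrative sketch of what such an excerpt can look like (resource names, runtime, and paths are placeholders; the IAM roles it references are assumed to be defined elsewhere in the same template):

Parameters:
  ArtifactBucket:
    Type: String                                  # S3 bucket holding zipped Lambda code

Resources:
  ExtractorLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: "training-dataset-extractor"
      Runtime: python3.9
      Handler: lambda_function.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn       # IAM role defined elsewhere in the template
      Timeout: 900
      Code:
        S3Bucket: !Ref ArtifactBucket
        S3Key: "lambda/extractor.zip"

  TrainingStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: "ml-training-pipeline"
      RoleArn: !GetAtt StatesExecutionRole.Arn    # IAM role defined elsewhere in the template
      DefinitionString: !Sub |
        {
          "StartAt": "ExtractDataset",
          "States": {
            "ExtractDataset": {
              "Type": "Task",
              "Resource": "${ExtractorLambda.Arn}",
              "End": true
            }
          }
        }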

The full template contains additional definitions for Step Functions (serverless workflow orchestration), Lambda triggers, IAM, CloudWatch, EventBridge rules, and more. CloudFormation templates can be as long and comprehensive as you want.

CloudFormation templates are created/updated in our IDE and always kept in a git repo. For machine learning engineers, SageMaker Studio provides a CodeCommit UI where we can pull, commit, and push code to any branch with just a few clicks (I like this much better than using a system terminal).

Once we have our Lambda functions, Glue jobs, IAM roles, Step Functions, and anything else we need for our ML solutions defined in a CloudFormation template, the next step is to deploy it. A deployed template is called a CloudFormation stack.

This deployment can be done manually with the AWS CLI (aws cloudformation deploy) in a SageMaker Studio terminal, but I recommend setting up a CI/CD CodePipeline workflow that performs it automatically every time you commit code.

  • What happens if the deployment fails?

The CloudFormation stack automatically rolls back to the previous (working) version. You can see this visually in the CloudFormation console.

  • What happens if we need to update a Lambda function without going to the console?

Update the code in CodeCommit, commit/push, and let CodePipeline and CloudFormation update the Lambda code automatically (zipped file in S3 or ECR image, depending on your preference).

This approach guarantees our source code and console views stay in sync.

  • If an availability zone goes down, how do we make sure our ML pipelines continue to run?

We can set up CloudWatch alarms and automatic response triggers to AZ failures (or any type of failure that brings down our pipelines).

If anything happens, we automatically redeploy our CloudFormation template(s) into a different availability zone, and within minutes our pipelines are back up.

  • How do we keep track of all the code changes to our ML pipelines?

The combination of CodeCommit, CodePipeline, and CloudFormation keeps track of all changes systematically. Additionally, test phases within CodePipeline ensure new code does not make it into production unless all tests pass.

This enables us to always have a working version of our pipelines in production while allowing for incremental changes, updates, and improvements via commits.

Note how we are able to automate everything within AWS. This automation through source control, CI/CD, and infrastructure-as-code is mandatory for sustainable machine learning deployments at scale.

No more running around the console trying to manage everything manually. Treat the AWS console as read-only as much as possible (for example, to analyze CloudWatch logs when debugging pipeline failures).

What is your approach to source control, CI/CD, and infrastructure-as-code within AWS? Comment below! I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

Drift Monitoring for Machine Learning Models in AWS

We have trained a machine learning model that meets or exceeds performance metrics as a function of business requirements.

We have deployed this model to production after converting our Jupyter notebook into a scalable end-to-end training pipeline, including CI/CD and infrastructure-as-code.

This deployment could be a SageMaker endpoint for live inference, or a Lambda function that creates a batch transform job out of the model artifacts in S3 as needed (trigger or schedule).
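
For the batch option, a sketch of such a Lambda handler might look like this (the model name, S3 locations, and instance type are placeholders, not a reference implementation):

import boto3
import time

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Triggered by an EventBridge schedule or an S3 event carrying the input path
    job_name = f"churn-batch-transform-{int(time.time())}"

    sagemaker.create_transform_job(
        TransformJobName=job_name,
        ModelName="churn-model",                         # SageMaker Model created from the artifacts in S3
        TransformInput={
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": event["input_s3_uri"],      # batch of records to score
                }
            },
            "ContentType": "text/csv",
        },
        TransformOutput={"S3OutputPath": "s3://your-bucket/batch-predictions/"},
        TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    )
    return {"transform_job_name": job_name}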

However, given the dynamic nature of a marketplace or business environment, it is guaranteed that our deployed model’s performance will deteriorate over time: Feature distributions will shift, supply and demand will fluctuate, customer preferences will change/evolve, etc.

Also, if our deployed model is actively used to make decisions at scale, the machine learning solution itself will change the data distributions – hopefully in the desired direction due to better business outcomes.

In machine learning, these inevitable data distribution shifts are called drift, and a few important questions arise upon model deployment:

  • “What is our model’s ongoing performance on production data?”
  • “Under what conditions should we trigger re-training?”
  • “What are the proper model evaluation metrics to compare new models against the current model in production?”

Let’s take the example of OLTP transactions in a relational database, such as e-commerce events.

Using a dataset of historical transactions, we trained and deployed a machine learning model that predicts the probability of a given customer purchasing a specific product. We then use this model to help inform product recommendations.

We can assess our model’s ongoing performance on production data by comparing the prediction to the actual outcome, per transaction. This monitoring can be done daily, weekly, or monthly, depending on the specific business domain.

Then, we can trigger our training pipeline (re-training) if our deployed model’s objective metric drops below a desired threshold on the latest batch of new records.
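
A minimal sketch of that check (the metric, threshold, and training pipeline state machine ARN are placeholders, and it assumes predictions and actual outcomes have already been joined for the latest batch):

import boto3
from sklearn.metrics import average_precision_score

METRIC_THRESHOLD = 0.60        # agreed with the business (placeholder value)
TRAINING_PIPELINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:training-pipeline"  # placeholder

def check_drift_and_maybe_retrain(y_true, y_score):
    """Evaluate the deployed model on the latest batch of production records
    (actual outcomes vs. predicted probabilities) and trigger re-training if needed."""
    metric = average_precision_score(y_true, y_score)

    if metric < METRIC_THRESHOLD:
        # Objective metric dropped below the agreed threshold: kick off the training pipeline
        boto3.client("stepfunctions").start_execution(
            stateMachineArn=TRAINING_PIPELINE_ARN,
            input='{"trigger": "drift-monitor"}',
        )
    return metric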

Back when we trained our first model, performance evaluation was performed on a test set through cross-validation. Randomly sampled test sets contain records from the entire time period of the training dataset.

However, it is more important for our model to perform well on the most recent records than records farther back in time. This ensures we meet the present business needs.

Therefore, model evaluation puts a premium (higher weight) on performance over the most recent records. For example, compute the model's objective metric (such as mean average precision) for each week of records, multiply each week's metric by a factor gamma raised to the age of that week (where gamma is a number between 0 and 1, so the weight decays exponentially for each preceding week), and sum the weighted metrics to get the final model evaluation score.
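
Here is a minimal sketch of that weighted score, assuming the metric has already been computed per week (most recent week first) and treating the choice of gamma as a tunable assumption:

def decayed_evaluation_score(weekly_metrics, gamma=0.9):
    # weekly_metrics[0] is the metric over the most recent 7 days,
    # weekly_metrics[1] the week before that, and so on.
    return sum(gamma ** age * metric for age, metric in enumerate(weekly_metrics))

# Example: recent performance dominates the final score
score = decayed_evaluation_score([0.71, 0.68, 0.74, 0.70], gamma=0.9)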

This is one way to decide whether to replace the current production model with the brand-new one produced by re-training.

If the newly trained model achieves a higher model evaluation score, replace the deployed model. Otherwise, simply store metadata about the training pipeline run and wait for the next run as a function of drift monitoring.

There are many ways to slice a dataset or feature space during model evaluation, and each slice is a different representation of the business with a different impact. Always collaborate with business domain and subject matter experts when deciding model evaluation metrics.

What is your approach to drift monitoring, re-training, and model evaluation? Comment below! I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

Serverless Delta Lake Compaction in AWS

Setting up a transactional data lake that continuously “upserts” (updates/inserts) change data capture (CDC) from relational databases into table snapshots is only the beginning.

You may have noticed that the number of Parquet files (partitions) in S3 keeps increasing over time for each table. This is always the case, whether you use Apache Hudi with EMR or Delta Lake with Glue.

Each Parquet file is relatively small, usually in the KB or MB range, while an entire table can be GBs, TBs, or more in size. To address this, my team implemented a serverless compaction workflow, described below.

If you are using Glue to create derived views from these raw OLTP tables, you will notice that Glue job latency keeps increasing gradually (usually on a scale of days).

This happens because downstream Spark jobs load full table snapshots in preparation for JOINs to produce various views. As the number of Parquet files increases per source table, more worker node tasks are required to execute the DAGs. This leads to growing latency, unless we are willing to increase the number of DPUs allocated per Glue job.

This is unacceptable because we may have latency SLAs and budget constraints (Glue jobs cost $0.44 USD per DPU hour, which adds up over time for frequent jobs).

In fact, based on the AWS Well-Architected Cost Optimization pillar, we should always architect solutions in a way that cost is actively reduced or kept at a minimum.

The problem of growing latency and cost in transactional data lakes leads to a solution called compaction.

Compaction is the process of re-partitioning tables into a smaller number of Parquet files of larger size (such as going from many KB/MB sized files to a few GB sized files). Here is sample PySpark code for use within a Delta Lake compaction Glue job:

for table in bronze_tables:

    path = f"s3://{delta_lake_bucket}/bronze/{database}/{table}/"
    num_files = 20  # target number of Parquet files per table (tune per table group)

    # Rewrite the latest snapshot into fewer, larger files.
    # dataChange=false tells Delta Lake this rewrite contains no new data.
    (spark.read
        .format("delta")
        .load(path)
        .repartition(num_files)
        .write
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(path))

    print(f"Table: {table} re-partitioned into {num_files} files")

When the Glue job completes, you can verify in S3 that the latest snapshot for each table has been re-partitioned into a smaller number of larger Parquet files. Older snapshots are left unchanged, and we recommend setting up S3 lifecycle policies to gradually move them into IA, then Glacier.

We also recommend experimenting with the number of partitions to converge on the optimal number for your tables. We use “selective compaction” to re-partition groups of tables differently to achieve optimal performance across our Delta Lake zones (Bronze, Silver, and Gold).

As always, monitor every aspect of your workload, particularly rolling average Glue job latency for this use case. This will help you determine the optimal EventBridge schedule to trigger your Delta Lake compaction jobs/workflows.
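
One way to wire that up (a sketch with placeholder names, assuming the compaction logic lives in a Glue job called delta-lake-compaction): an EventBridge schedule rule invokes a small Lambda function that starts the job.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Invoked by an EventBridge schedule rule, e.g. cron(0 3 * * ? *)
    response = glue.start_job_run(
        JobName="delta-lake-compaction",           # placeholder Glue job name
        Arguments={"--table_group": "bronze"},     # example of selective compaction per table group
    )
    return {"job_run_id": response["JobRunId"]}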

Have you encountered this challenge before? If so, how did you approach the solution?

Hit like if you found this blog post helpful! Also, let us know in the comments if there are specific AWS topics you are interested in. We work across data science, data engineering & analytics, machine learning engineering, and DevOps, all cloud-native within AWS.

We are happy to share our learnings, implementations, insights, and best practices to help you build AWS Well-Architected solutions that create maximum business value for your organization. Does a topic or current challenge come to mind? Comment below!

The Real Difference Between Data Science & Machine Learning Engineering In The Enterprise

When it comes to creating business value, what is the real difference between data science and machine learning engineering?

Data science helps answer a specific question, such as “Why do we have an X% customer churn rate month over month?”

This is highly valuable because data scientists help shed light on the root cause of a business problem, as well as propose potential solutions. However, it only yields a one-time result: the answer to the question at one point in time.

On the other hand, machine learning engineering builds a production system of ongoing answers to recurring questions at scale, such as “What are the top N product recommendations for a given user?”

This is especially important when the answers to a recurring question may be changing/evolving over time (see “Drift Monitoring for ML Models” to learn more).

Data science and ML engineering are not mutually exclusive; they are sequential. Data science first (foundation in dev), ML engineering second (scale in prod). 

Top data scientists follow a methodology to systematically answer business questions. These are common steps:

  • Break down the business problem/question into its components
  • Identify the components that can be materially impacted and yield the highest ROI within a reasonable time frame
  • Develop hypotheses for root-cause analysis
  • Perform data analysis to quantify components and validate/invalidate hypotheses (this is usually a combination of SQL queries and domain expert interviews)
  • Refine hypotheses and determine whether machine learning is the right solution to help address/solve the problem
  • Convert the core business question into a machine learning question and execute end-to-end ML workflow in a dev environment
  • Present results, conclusions, and next steps to business leaders and stakeholders (statistics, plots, [engineered] features with the highest SHAP values, solution recommendations, etc.)

If the business question is recurring at scale, machine learning engineers will then take the output from data scientists (typically an end-to-end Jupyter notebook) and convert it into a scalable production software solution. These are common elements that ML engineers work on:

  • Scalability, Extensibility, Modularity, & Testability
  • Consistency & Reproducibility
  • Security
  • Logging & Monitoring
  • [Serverless] Microservices Architecture
  • Infrastructure-As-Code & Configuration Management
  • Continuous Integration & Continuous Deployment/Delivery (CI/CD)
  • Versioning & Rollback
  • Fault Tolerance & Failure Recovery
  • Containerization & Container Orchestration
  • Model Deployment, Model Drift Monitoring, Model [Re]Training, Model Evaluation, Model Explainability
  • Multi-Model Management & Model Registry
  • Pipeline Metadata Management
  • Experiment Tracking & Management
  • Feature Store

Clearly, ML engineering = software engineering with a focus on ML (see “Why Software Engineering Is King In Enterprise ML & DE” to learn more).

Machine learning engineering extracts maximum business value from data science by realizing its full potential in production.

Data science and machine learning engineering are both important for an organization. Both teams work together to achieve a business outcome, aligned on the same mission.

When allocating headcount budget, ask yourself “Do we need data scientists or machine learning engineers?” based on product management’s top business priorities for DS/ML and the lifecycle stage of each project. Work with your tech lead to gain additional clarity and verify the teams’ needs (sometimes the answer may be “we need more data engineers”).

As a rule of thumb, have 2 machine learning engineers for every 1 data scientist in your team. Alternatively, build a team of “full-stack” ML engineers where they can take either DS or MLE stories/tasks based on priorities. My team loves the latter because there is constant variety in the work and skill acquisition.

What do you see as the biggest difference between data science and ML engineering? Comment below!

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

How To Deploy Lambda Functions As Docker Containers Through CI/CD

How do you deploy Lambda functions as Docker containers through CI/CD?

CloudFormation gives us two options for Lambda deployments:

  1. Zip the code, copy it to S3, and pass the S3 path into the CloudFormation template
  2. Containerize the code, push it to Elastic Container Registry (ECR), and pass the ECR image URI into the template

Zip deployments are the most straightforward and are where we all start.

Why, then, would we choose to deploy Lambda functions as Docker containers?

When importing external libraries during Lambda execution, you might have come across the Lambda deployment package size limit: 250 MB unzipped, including all Lambda Layers.

This is especially common in machine learning pipeline components, where we may need to import several libraries such as Pandas, Scikit-learn, or TensorFlow. Since certain dependencies are mandatory, Docker containers (Lambda container images can be up to 10 GB) provide an excellent solution.

Within SageMaker Studio, we start by creating a requirements.txt file with all the required libraries for the given Lambda function:

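As an illustration (the actual libraries, and whether you pin versions, depend on your function), a requirements.txt can be as simple as:

pandas
scikit-learn
xgboost
boto3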

Specifying versions is optional; if you leave them out, pip resolves a compatible combination of library versions when CodeBuild builds the image.

Next, create a Dockerfile for the Lambda function:

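Here is a sketch of what that Dockerfile might look like, using the public AWS Lambda Python base image (the Python version, folder name, and handler module are placeholders):

FROM public.ecr.aws/lambda/python:3.9

# Install the function's dependencies into the image
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the Lambda code (the "lambda" folder mentioned below)
COPY lambda/ ${LAMBDA_TASK_ROOT}/

# module.function that the Lambda runtime invokes
CMD ["lambda_function.lambda_handler"]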

The lambda folder in this same directory contains all the Lambda code. If you are transitioning from zip deployments to container deployments, you don’t need to modify your Lambda code at all because it is deployment agnostic.

Pull, commit, and push the changes to CodeCommit. Once code review is complete and CI/CD CodePipeline is triggered, CodeBuild will perform the deployment:

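Here is a hedged sketch of such a buildspec.yml (account ID, region, repository, bucket, stack, and parameter names are placeholders); it includes both an image-based and a zip-based Lambda deployment:

version: 0.2

phases:
  pre_build:
    commands:
      # Authenticate Docker against ECR
      - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com
  build:
    commands:
      # Container-based Lambda: build the image and push it to ECR
      - docker build -t ml-lambda:latest .
      - docker tag ml-lambda:latest 111122223333.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
      - docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
      # Zip-based Lambda: zip the code and copy it to S3
      - cd other_lambda && zip -r ../other_lambda.zip . && cd ..
      - aws s3 cp other_lambda.zip s3://your-artifact-bucket/lambda/other_lambda.zip
  post_build:
    commands:
      # Deploy/update the CloudFormation stack with the new artifacts
      - >
        aws cloudformation deploy
        --template-file template.yml
        --stack-name ml-training-pipeline
        --capabilities CAPABILITY_NAMED_IAM
        --parameter-overrides
        EcrImageUri=111122223333.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
        LambdaS3Key=lambda/other_lambda.zip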

A buildspec.yml like this covers both the Docker container deployment and the traditional zip deployment. You can mix and match because the deployment decision (zip vs. image) is made on a per-Lambda basis.

During aws cloudformation deploy (CLI), we pass in the S3 path or ECR image URI through --parameter-overrides to update the CloudFormation template's Parameters section. The parameter is then referenced by the Lambda function in the Resources section of the template:

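For illustration (parameter, resource, and role names are placeholders), the relevant pieces of the template could look like this:

Parameters:
  EcrImageUri:
    Type: String                              # passed in via --parameter-overrides from CodeBuild

Resources:
  FeatureEngineeringLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: "feature-engineering"
      PackageType: Image
      Code:
        ImageUri: !Ref EcrImageUri
      Role: !GetAtt LambdaExecutionRole.Arn   # IAM role defined elsewhere in the template
      MemorySize: 1024
      Timeout: 900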

You can go to the console and verify the Lambda container has been deployed successfully. You will see the ECR image URI and a message saying your function is deployed as a container image.

However, you will not be able to see the actual code in the console. My team likes this because it forces us to make changes through CI/CD instead of manually in the console. This is the habit CloudFormation has created, and Lambda container deployments enforce it further.

Deploying Lambda functions as Docker containers also has the added benefit of an easy transition to ECS Fargate. This would be needed if a Lambda function needs more than 15 minutes to complete, or if it requires more memory or vCPUs than Lambda allows.

How do you deploy your Lambda functions? Comment below!

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

Why Software Engineering Is King In Enterprise ML & DE Projects

There seems to be a disconnect around hiring data engineers.

The industry has split into two different kinds of roles:

1) Traditional data engineer roles require mostly SQL and orchestration

Whereas there are plenty of roles out there that are really a better fit for:

2) Software engineer with a focus in data

What type of data engineer is ideal to help machine learning teams set up custom data pipelines? Services involved could be Redshift, Kinesis, EMR, Glue, EKS, etc.

Definitely the second one.

In fact, this is how it works at Amazon Prime ML. It is mostly software engineering work with a focus on data engineering projects.

We see the same thing in data science and machine learning (job descriptions and working professionals):

1) Traditional data scientist or ML engineer roles require mostly SQL and Jupyter notebooks

Whereas there are plenty of projects out there that need:

2) Software engineers with a focus in production machine learning engineering

This is exactly what I see in my team.

What we do as ML engineers is modern cloud software engineering with a focus on productionizing ML models through Well-Architected inference pipelines and an MLOps platform.

With proper domain-based feature engineering, training good models is straightforward – especially when leveraging pre-built algorithm containers with a proven track record of success, such as XGBoost.

The real work that creates value for the business is everything that supports the production deployment of trained models at scale.

And fundamentally, it’s all about software engineering – with a focus on ML:

  • Scalability, Extensibility, Modularity, & Testability
  • Consistency & Reproducibility
  • Security
  • Logging & Monitoring
  • [Serverless] Microservices Architecture
  • Infrastructure-As-Code & Configuration Management
  • Continuous Integration & Continuous Deployment/Delivery (CI/CD)
  • Versioning & Rollback
  • Fault Tolerance & Failure Recovery
  • Containerization & Container Orchestration
  • Model Deployment
  • Model Drift Monitoring
  • Model [Re]Training
  • Model Evaluation
  • Model Explainability
  • Multi-Model Management
  • Pipeline Metadata Management
  • Feature Store
  • Experiment Tracking & Management

This short list gives you an idea of what is truly involved in successful enterprise machine learning projects. My team is composed of this type of “full-stack ML engineer” (myself included), and we work on every single point above (and much more).

Many people in the DS/ML world like to say data science is not software engineering.

The more accurate statement is that most data scientists are not software engineers.

And this is a big problem, as evidenced by the high failure rate of enterprise ML projects.

Data science skills are certainly mandatory, but they should be a given for any team member. You can safely assume that anyone with enough experience has them.

However, if you want to create real business value from ML and yield an ROI, place higher focus on software engineering skills when building machine learning and data engineering teams.

What do you think? Comment below with your thoughts or questions.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

AWS Cross-Account Deployments for Production ML Pipelines

How do you deploy a machine learning training pipeline as a CloudFormation stack from a dev AWS account to a prod AWS account?

Suppose you added feature engineering steps to a component of your machine learning training pipeline (within your development environment).

If you are using CodeCommit, CodePipeline, and CodeBuild for CI/CD, follow these steps to deploy the changes to your production account:

  1. Within your dev account, commit changes within SageMaker Studio or Glue Notebooks and push them to a CodeCommit feature branch
  2. Pull request to dev branch, code review, and merge feature branch into dev branch
  3. Have EventBridge capture the event and trigger CodePipeline, with CodeCommit as the source phase
  4. Proceed to CodeBuild test phase to perform builds, unit testing, integration testing, and other steps (this could happen in the same dev account or a separate staging/test account)
  5. If the test stage succeeds, proceed to CodeBuild deploy phase
  6. Assume the prod IAM role within the CodeBuild buildspec.yml file using the STS CLI (make sure the prod role has the non-prod account as a trusted entity; see the sketch after this list)
  7. Execute your required steps for CloudFormation deployments, such as containerizing Lambda code and pushing it to ECR (or zipping the code and copying it to S3), and deploy the CloudFormation template with the appropriate parameter overrides (again, using the CLI for the respective services involved). All these commands are executed in the prod account from within the non-prod account inside CodeBuild – how cool is that?!
  8. Write production deployment metadata to DynamoDB (whether CodePipeline succeeded or failed – keep track of everything for analytics!)
  9. (Bonus) Log into your prod account and verify all changes were deployed successfully
  10. If the production deployment succeeded, merge the last pull request into the master branch in CodeCommit as the final step in CodePipeline
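
Here is a hedged sketch of the assume-role step (step 6) inside buildspec.yml; the role ARN is a placeholder and it assumes jq is available in the build image:

version: 0.2

phases:
  build:
    commands:
      # Assume the prod deployment role from the non-prod CodeBuild project
      - CREDS=$(aws sts assume-role --role-arn arn:aws:iam::999999999999:role/prod-deploy-role --role-session-name codebuild-prod-deploy)
      - export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r .Credentials.AccessKeyId)
      - export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r .Credentials.SecretAccessKey)
      - export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r .Credentials.SessionToken)
      # From here on, AWS CLI commands (ECR push, S3 copy, cloudformation deploy) run against the prod account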

This workflow makes cross-account production deployments in AWS seamless, reliable, and consistent. It's also a great experience to have the entire process unified within one platform without external dependencies (e.g., GitHub or GitLab + Jenkins).

This cross-account deployment process also satisfies common security requirements for account isolation because no data is transferred across accounts directly. Everything happens securely within CodeBuild by assuming the production role to deploy code directly into the production account.

What do you think about this deployment process? Comment below! Feel free to suggest improvements, as well. Our solutions are constantly evolving and my team is always looking for improvement opportunities.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

CI/CD Pipelines for Machine Learning Solutions in AWS

All machine learning projects within AWS begin in a development environment – usually SageMaker Studio notebooks, or Glue notebooks (when a cluster is needed during PySpark development). Both environments support CodeCommit for source control.

Suppose we developed an end-to-end data science workflow in a Jupyter notebook using a dataset extracted from S3, Redshift, Aurora, or any other data source(s).

Result: We trained a machine learning model that meets or exceeds model evaluation metrics as a function of business objectives.

Next, we want to deploy this model into production to help improve business KPIs.

However, working with Jupyter notebooks in a development environment is not scalable, reliable, or automated enough to be sustainable – especially if we have a growing data science team with multiple ML solutions.

We need a systematic and fully automated way to continuously test and integrate code changes across team members from dev to master and deploy these changes to production.

We call this process CI/CD: Continuous Integration and Continuous Deployment.

A successful CI/CD pipeline implementation yields the following capabilities:

  • Automatically and seamlessly integrate code changes into the master branch in response to commits to the dev branch
  • Ability to test code changes (unit testing, integration testing, acceptance testing, etc.) prior to production deployments
  • Ability to update production ML solutions reliably and systematically through infrastructure-as-code (see my previous post for more information)
  • ML pipeline component versioning
  • Minimize production deployment failures
  • Ability to roll back to the previous working version of any ML pipeline [component] in case of any failures
  • Minimize/eliminate the amount of error-prone, manual labor required to move a new piece of code from a dev environment into a prod environment
  • Provide continuous value to users as fast as possible, in small batches, ideally multiple times per day (for example, my team deploys to production 3+ times per day)

Within AWS, the most commonly used service for CI/CD is CodePipeline (in conjunction with CodeCommit, CodeBuild, CodeDeploy, and CloudFormation).

A CI/CD pipeline is composed of 3 major phases: Build, Test, and Deploy.

For machine learning solutions, the build phase packages up Lambda function code (either as a zip file in S3 or an image in ECR), containerizes custom ML models, builds a test version of your ML solution via CloudFormation, and handles anything else you need to prepare your code for testing and deployment. This phase can be executed using CodeBuild through a buildspec.yml build configuration file.

Next, the test phase triggers unit testing per component, integration testing of the pipeline as a whole (to ensure a single component update does not break the pipeline), and any other forms of testing needed. This testing can be performed using Lambda functions or ECS Fargate tasks.

Finally, the deploy phase updates the production components of the machine learning solution via CodeBuild and CloudFormation. This phase can also include an automatic merge of the dev or test branch into the master branch in CodeCommit via Lambda function.
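
As an illustrative sketch (repository and branch names are placeholders), a merge Lambda invoked as a CodePipeline action can call the CodeCommit API directly:

import boto3

codecommit = boto3.client("codecommit")
codepipeline = boto3.client("codepipeline")

def lambda_handler(event, context):
    # Invoked as a Lambda action in the final stage of CodePipeline
    job_id = event["CodePipeline.job"]["id"]
    try:
        codecommit.merge_branches_by_fast_forward(
            repositoryName="ml-training-pipeline",   # placeholder repository name
            sourceCommitSpecifier="dev",             # branch that was just deployed
            destinationCommitSpecifier="master",
            targetBranch="master",
        )
        codepipeline.put_job_success_result(jobId=job_id)
    except Exception as exc:
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(exc)},
        )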

For a serverless training pipeline in AWS, the CI/CD flow looks like this: a commit to the dev branch in CodeCommit triggers CodePipeline, which runs the build phase in CodeBuild, then the test phase, and finally the deploy phase that updates the production stack through CloudFormation.

Note how the entire CI/CD process is fully automated from initial commit into the dev branch all the way through to production deployment.

If any test phase fails, we receive a notification for the CodePipeline failure, debug via CloudWatch, CloudTrail, X-Ray, or error logs, identify the issue, fix it in the development environment, commit, and the process starts over. Broken code rarely makes it into production. And if it does, we learn from that failure by improving our testing system accordingly.

What is your approach to CI/CD for machine learning solutions in AWS? Comment below!

I would love to hear your thoughts so we can all learn from each other how to build better machine learning engineering solutions.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

Delta Lake for Machine Learning Pipelines in AWS

Machine learning pipelines begin with data extraction – whether training or inference.

After all, we need a dataset to begin any ML workflow.

Most of us begin by querying OLTP/OLAP tables from an on-premises relational database, such as SQL Server. When our query completes, we save the results locally as CSV and then upload the file manually to S3.

From there, we load the [relatively small] dataset into a Pandas DataFrame within a Jupyter notebook in SageMaker Studio. This approach is manual, inefficient, and does not scale past a certain dataset size.

Machine learning is a big reason why many organizations have migrated their databases to AWS using the Database Migration Service (DMS). This is typically part of a larger plan to create a data lake in S3, where data can be queried efficiently by big data and analytics tools such as EMR, Glue, Athena, Redshift, and more.

Database migration is just the beginning, though. When S3 is selected as the target of DMS, the data is partitioned into 2 major categories:

  1. Initial full load (all the records in a given table at the time of migration)
  2. Change data capture (new records/transactions as they are generated by applications)

The challenge is that these migrated tables are not in a queryable state.

Why?

Because we still need to combine the DMS full load with the CDC records to obtain the “live” version of a given table, at any given time.

Do we do this every time we need to run a query, for every table we need, before we can even SELECT * FROM small_table?

We could, but it is not sustainable: the number of CDC records increases continuously and is written to different S3 partitions every minute, making this “combine-every-time” job more and more expensive in compute, memory, and time.

We need an efficient approach to “upsert” (update, insert, delete) CDC records into a live version of our tables, automatically as new records are created. We call the collection of these live tables the delta lake, and just like data lakes, it’s typically built on S3.

One approach to build a delta lake from our data lake is to set up an SQS queue that receives CDC event notifications from S3, for the tables of interest. Then, each SQS message triggers a Lambda function, which submits an “upsert job” (PySpark script) to an EMR cluster. The code then takes care of the upsert (using the MERGE command) to incrementally update a given table.
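
The MERGE itself can be written either as Spark SQL (MERGE INTO) or through the Delta Lake Python API. Here is a hedged sketch of the latter, assuming each table has a primary key column named id and that cdc_df holds the latest CDC record per key, already projected onto the table's schema (deletes can be handled with an additional whenMatchedDelete clause keyed off the DMS Op column):

from delta.tables import DeltaTable

table_path = "s3a://your_s3_bucket/your_prefix/your_table"   # placeholder path

live_table = DeltaTable.forPath(spark, table_path)

(live_table.alias("t")
    .merge(cdc_df.alias("c"), "t.id = c.id")
    .whenMatchedUpdateAll()        # apply updates to existing rows
    .whenNotMatchedInsertAll()     # insert brand-new rows
    .execute())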

This delta lake approach allows us to avoid combining DMS full load and CDC records every time we need to run a query on migrated tables in our S3 data lake. We always have live tables already combined, with additional functionality to query based on “what has changed recently” (for example, incremental feature engineering; hence the name “delta” lake).

All you need in your PySpark script for data extraction for a given table is the following:

from delta.tables import *

table_path = f"s3a://your_s3_bucket/your_prefix/{your_table}"
df = spark.read.format("delta").load(table_path)
df.createOrReplaceTempView(str(your_table))  # Available to query using Spark SQL

Sample Glue job definition for your CloudFormation template:

ExtractorGlueJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      PythonVersion: "3"
      ScriptLocation: "local_folder/script.py"
    Connections:
      Connections:
        - "VPC"
    DefaultArguments:
      "--extra-py-files": "s3://your_s3_bucket/spark-scripts/delta-core_2.11-0.6.1.jar"
      "--extra-jars": "s3://your_s3_bucket/emr-scripts/delta-core_2.11-0.6.1.jar"
      "--conf": "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
      "--enable-metrics": true
    Description: "Extractor component of ML training pipeline"
    ExecutionProperty:
      MaxConcurrentRuns: 1
    GlueVersion: "2.0"
    MaxRetries: 1
    Name: "Glue-Dataset-Extractor"
    NumberOfWorkers: 10
    Role: !Ref YourRoleArn
    Timeout: 2880
    WorkerType: "G.1X"

The DefaultArguments allow us to import delta.tables and read from our delta lake. With this capability, our ML pipelines can proceed all the way through to model training and deployment at scale.

Where are you in your data engineering infrastructure journey? Comment below! Also, let me know in the comments if you found this useful, or if you have a different approach to achieve delta lake capabilities for your ML pipelines.

If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message and we will be happy to help you.

Written by Carlos Lara, Director of Data Science & Machine Learning Engineering

Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/

How To Scope Out A Dataset From Scratch (Enterprise ML)

Every machine learning solution requires a dataset that encapsulates the business problem to be solved.

A machine learning system will ingest this dataset, learn its complex patterns/relationships, and output a set of business predictions that help solve a specific business problem.

This sounds great, but how do you acquire this dataset?
