Serverless Delta Lake Compaction in AWS

Setting up a transactional data lake that continuously “upserts” (updates/inserts) change data capture (CDC) from relational databases into table snapshots is only the beginning. You may have noticed that the number of Parquet files (partitions) in S3 keeps increasing over time, for each table. This is always the case, whether you use Apache Hudi with…

The Real Difference Between Data Science & Machine Learning Engineering In The Enterprise

When it comes to creating business value, what is the real difference between data science and machine learning engineering? Data science helps answer a specific question, such as “Why do we have an X% customer churn rate month over month?” This is highly valuable because data scientists help shed light on the root cause of a business…

How To Deploy Lambda Functions As Docker Containers Through CI/CD

How do you deploy Lambda functions as Docker containers through CI/CD? CloudFormation gives us two options for Lambda deployments:
1) Zip the code, copy it to S3, and pass the S3 path into the CF template
2) Containerize the code, push it to Elastic Container Registry (ECR), and pass the ECR image URI into the…

Why Software Engineering Is King In Enterprise ML & DE Projects

There seems to be a disconnect around hiring data engineers. The industry has split into two different fields:
1) Traditional data engineer roles, which require mostly SQL and orchestration
whereas plenty of roles out there are really a better fit for:
2) Software engineers with a focus on data
What type of data…

AWS Cross-Account Deployments for Production ML Pipelines

How do you deploy a machine learning training pipeline as a CloudFormation stack from a dev AWS account to a prod AWS account? Suppose you added feature engineering steps to a component of your machine learning training pipeline (within your development environment). If you are using CodeCommit, CodePipeline, and CodeBuild for CI/CD, follow these steps…

CI/CD Pipelines for Machine Learning Solutions in AWS

All machine learning projects within AWS begin in a development environment – usually SageMaker Studio notebooks, or Glue notebooks (when a cluster is needed during PySpark development). Both environments support CodeCommit for source control. Suppose we developed an end-to-end data science workflow in a Jupyter notebook using a dataset extracted from S3, Redshift, Aurora, or…

Infrastructure-as-Code for Machine Learning Pipelines in AWS

We all start our AWS journey in the console. We do everything there. We manually create and configure Lambda functions, Step Functions, IAM roles, S3 buckets, EMR clusters, and any other service we need as we implement a machine learning solution. We use source control for occasional commits to the dev branch to keep track…

Drift Monitoring for Machine Learning Models in AWS

We have trained a machine learning model that meets or exceeds the performance metrics set by business requirements. We have deployed this model to production after converting our Jupyter notebook into a scalable end-to-end training pipeline, including CI/CD and infrastructure-as-code. This deployment could be a SageMaker endpoint for live inference, or a Lambda function…

Delta Lake for Machine Learning Pipelines in AWS

Machine learning pipelines begin with data extraction – whether training or inference. After all, we need a dataset to begin any ML workflow. Most of us begin by querying OLTP/OLAP tables from an on-premises relational database, such as SQL Server. When our query completes, we save the results locally as CSV and then upload the…

How To Scope Out A Dataset From Scratch (Enterprise ML)

Every machine learning solution requires a dataset that encapsulates the business problem to be solved. A machine learning system will ingest this dataset, learn its complex patterns/relationships, and output a set of business predictions that help solve a specific business problem. This sounds great, but how do you acquire this dataset?