Monitoring & Reliability of Production ML Workloads in AWS

My team and I released a new machine learning solution for our users this week.

There is nothing more exciting than seeing all our business KPIs exceed targets. After all, business value is the reason we build, deploy, and scale ML solutions.

Model performance metrics are great to optimize, but if you ask me, I would rather focus on business metrics, such as gross revenue and conversion rates.

Just like there is nothing more exciting than seeing the value of our ML solutions in production, there is nothing more stressful than waking up hoping nothing failed overnight.

The failure could come from a Lambda function, a Step Functions state machine, a Glue job, a SageMaker training job, a SageMaker endpoint, or any other component in our solution.

How do we monitor our production workloads in AWS and have operational peace of mind?

To begin with, we include a CloudWatch Alarm in our CloudFormation template for every major component in the solution.

For example, each Lambda function gets its own alarm to monitor errors and throttles. Composite alarms are also helpful to avoid excessive alarm clutter in our templates and environment.
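We define these alarms in our CloudFormation templates, but as a rough boto3-based sketch of what one per-function alarm looks like (the function name and SNS topic ARN below are placeholders, not values from our solution):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One alarm per Lambda function on its Errors metric; an alarm on the
# Throttles metric looks the same. Names and ARNs are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="inference-handler-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "inference-handler"}],
    Statistic="Sum",
    Period=600,                  # one 10-minute evaluation window
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-ops-alerts"],
)
```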

Each CloudWatch Alarm also contains a list of Actions to execute when a pre-defined threshold is crossed, such as the sum of errors reaching 1 within a 10-minute period. For example, we maintain a list of email addresses that receive a notification whenever a failure takes place.
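The notification side is typically an SNS topic with email subscriptions, whose ARN goes into each alarm's Actions. A minimal sketch with boto3, assuming a hypothetical topic name and on-call address:

```python
import boto3

sns = boto3.client("sns")

# Topic name and email address are placeholders for this sketch.
topic = sns.create_topic(Name="ml-ops-alerts")

# Each on-call address confirms the subscription once, then receives an
# email whenever an alarm publishes to this topic.
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="oncall@example.com",
)

# This topic ARN is what we pass in each alarm's AlarmActions list.
print(topic["TopicArn"])
```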

Email notifications are great, but how do we resolve the issues? What if we are asleep when a failure happens?

This is where error handling becomes vital for automating remedies. We always come across a variety of errors during thorough testing before going live in production.

Build error handling and failure recovery into each component based on known possible failures. Don’t worry about anticipating every possibility from the start. This is always an iterative process, and we can always expand our error handling capabilities as needed.
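As a sketch of component-level handling, here is a hypothetical Lambda handler that deals with a known, possibly transient failure by re-raising it under a specific error name, so the orchestrator can retry or catch it by name later. The event fields, bucket/key layout, and exception name are assumptions for illustration:

```python
import json
import boto3
from botocore.exceptions import ClientError, ConnectionError as BotoConnectionError

s3 = boto3.client("s3")


class TransientDataSourceError(Exception):
    """Known, possibly transient failure that the orchestrator may retry."""


def handler(event, context):
    # 'bucket' and 'key' in the event payload are assumptions for this sketch.
    try:
        obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
        records = json.loads(obj["Body"].read())
    except (ClientError, BotoConnectionError) as exc:
        # A failure mode discovered during testing: surface it under a
        # specific error name so Step Functions can match on it.
        raise TransientDataSourceError(str(exc)) from exc
    return {"recordCount": len(records)}
```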

If you are using Step Functions to orchestrate the execution of your ML workloads, include Retry and Catch blocks for each component to handle possible errors. For example, you can catch an error and revert to a fallback state. Depending on the workload, there are many options to handle errors gracefully and keep operations running smoothly.
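A rough sketch of what the Retry and Catch configuration for a single task state might look like in Amazon States Language, expressed here as a Python dict. The state names, function name, retry settings, and custom error name are placeholders, not the exact configuration from our solution:

```python
import json

# ASL fragment for one task state. Retries known transient errors with
# exponential backoff; everything else falls through to a fallback state.
train_model_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "train-model"},
    "Retry": [
        {
            "ErrorEquals": ["TransientDataSourceError", "Lambda.ServiceException"],
            "IntervalSeconds": 10,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {
            # Record the failure and continue via a fallback state.
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "HandleTrainingFailure",
        }
    ],
    "Next": "EvaluateModel",
}

print(json.dumps(train_model_state, indent=2))
```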

We recommend handling errors at the orchestrator level rather than at the component level. It’s more efficient, robust, customizable, and systematic than having each component “blindly” retry itself based on generic failures.

There are other errors, such as availability zone failures, that require a different form of error handling and failure recovery. We will expand on the Reliability pillar of the AWS Well-Architected Framework in a future article.

To learn more about monitoring the performance of machine learning models in production, check out our article titled “Drift Monitoring for Machine Learning Models in AWS.”

How do you monitor your production workloads in AWS and ensure reliability? Comment below!

If you need help implementing cloud-native MLOps, building Well-Architected production ML software solutions or training/inference pipelines, or monetizing your ML models in production, or if you have specific solution architecture questions or would like us to review your architecture and provide feedback based on your goals, contact us or send me a message and we will be happy to help you.

Subscribe to my blog at: https://gradientgroup.ai/blog/

Follow me on LinkedIn: https://linkedin.com/in/carloslaraai
