“Should we use Kubernetes or go serverless first for new software solutions?”
This is a common question among technology teams across the world. Based on a recent LinkedIn survey, the answer seems to be an even split between the two approaches, with most people flexible based on the project.
Common arguments in favor of Kubernetes include portability, scalability, low latency, low cost, open-source support, and DevOps maturity.
Common arguments in favor of serverless include simplicity, maintainability, shorter lead times, developer experience, talent / skill set availability, native integration with other cloud services, and existing commitment to the cloud.
Is there a way to combine the best of both worlds and create cloud-native, serverless container-based solutions?
Let’s look at machine learning training pipelines in AWS as an example.
Suppose we are in stage 2 of the ML Model Deployment Lifecycle, where we converted our Jupyter notebook into a microservice-based training pipeline:
This architecture is fully serverless and AWS-native, leveraging PySpark Glue jobs, SageMaker, Lambda functions, and Step Functions for component orchestration and error handling. This workflow is typically triggered by EventBridge Rules in response to drift events from production machine learning models.
What if we wanted to lower the cost of the Data Validator component, leverage open-source libraries (such as TensorFlow Extended), add better DevOps capabilities, and make the component more portable?
To do this, we replaced the Glue job with an ECS Fargate Task that uses TensorFlow Extended’s Data Validation library. ECS Fargate allows us to build Docker containers from our code repo and run them as serverless microservices – either standalone or as part of a larger production workflow.
In the case of ML training pipelines where we can afford higher latency batch jobs, Fargate Spot lowers our runtime cost to about 1 cent per vCPU per hour and a tenth of a cent per GB per hour (40x+ lower cost than AWS Glue jobs).
How do we implement this?
Let’s start in the SageMaker Studio IDE.
The first step is to create a requirements.txt file to install TFX’s Data Validation library in the container, and a Dockerfile to copy the data validation component code into the container:
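A minimal sketch of these two files, assuming the component code lives in a hypothetical `src/data_validator.py`:

```text
# requirements.txt -- installs TFX's Data Validation library
tensorflow-data-validation
```

```dockerfile
# Dockerfile -- illustrative layout; file paths are assumptions
FROM python:3.9-slim

WORKDIR /app

# Install TFDV and its dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the data validation component code into the container
COPY src/data_validator.py .

# Run the component when the container starts
ENTRYPOINT ["python", "data_validator.py"]
```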
Next, we git commit and trigger our CI/CD pipeline to containerize and push the Docker image to ECR. The build code is provided in CodeBuild’s buildspec.yml file:
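A hedged sketch of what such a buildspec.yml could look like, assuming `IMAGE_REPO_NAME` and `AWS_ACCOUNT_ID` are configured as CodeBuild environment variables:

```yaml
# buildspec.yml -- illustrative; variable names are assumptions
version: 0.2

phases:
  pre_build:
    commands:
      # Authenticate Docker against the account's ECR registry
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
  build:
    commands:
      # Build and tag the image with the resolved git commit hash
      - docker build -t $IMAGE_REPO_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker tag $IMAGE_REPO_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION
  post_build:
    commands:
      # Push the tagged image to ECR for the CloudFormation deployment to reference
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION
```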
Once the build is complete, we pass the ECR image URI into the parameter overrides of the CloudFormation deploy CLI command. Here is the basic CloudFormation template resource definition for ECS Fargate Tasks:
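A minimal sketch of such a resource definition; the resource and parameter names (DataValidatorTask, TaskExecutionRoleArn, etc.) are illustrative, apart from ValidatorImageURI, which matches the parameter override described above:

```yaml
# Illustrative AWS::ECS::TaskDefinition resource for a Fargate task
DataValidatorTask:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: data-validator
    RequiresCompatibilities:
      - FARGATE
    NetworkMode: awsvpc
    Cpu: "1024"      # 1 vCPU
    Memory: "4096"   # 4 GB
    ExecutionRoleArn: !Ref TaskExecutionRoleArn
    TaskRoleArn: !Ref TaskRoleArn
    ContainerDefinitions:
      - Name: data-validator
        Image: !Ref ValidatorImageURI   # ECR image URI passed in as a parameter override
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/data-validator
            awslogs-region: !Ref "AWS::Region"
            awslogs-stream-prefix: validator
```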
The ValidatorImageURI and the other referenced variables are defined in the Parameters section of the CloudFormation template. You can define static values, or pass them in dynamically through the CloudFormation deployment (as we did for ValidatorImageURI).
Finally, we include this ECS Fargate Task within our Step Function workflow (also defined in the same CloudFormation template):
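A sketch of the corresponding Step Functions state in Amazon States Language, shown in YAML for readability; in practice it would sit inside the state machine's definition, with the `${...}` placeholders filled in (e.g. via DefinitionSubstitutions). State and variable names are assumptions:

```yaml
# Illustrative state using the ecs:runTask.sync integration, which waits
# for the Fargate task to finish before moving to the next state
Data Validator:
  Type: Task
  Resource: "arn:aws:states:::ecs:runTask.sync"
  Parameters:
    Cluster: "${ECSClusterArn}"
    TaskDefinition: "${DataValidatorTaskArn}"
    CapacityProviderStrategy:
      - CapacityProvider: FARGATE_SPOT   # run on Fargate Spot for lower cost
        Weight: 1
    NetworkConfiguration:
      AwsvpcConfiguration:
        Subnets:
          - "${PrivateSubnetId}"
    Overrides:
      ContainerOverrides:
        - Name: data-validator
          Environment:
            # Pass workflow parameters into the container as env vars
            - Name: RUN_DATE
              "Value.$": "$.run_date"
            - Name: RUN_ID
              "Value.$": "$.run_id"
  Next: Model Trainer
```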
We pass parameters dynamically into the container through environment variables, such as Run Date and Run Id (initialized by the init component of the Step Function). These environment variables can then be read inside the container code through os.environ["RUN_DATE"], etc.
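Inside the container, reading those parameters is a one-liner per variable. A small sketch (the env var values below only simulate what ECS would inject at runtime):

```python
import os

def get_run_params():
    """Read the parameters the Step Function passed in as container
    environment variables (names match the ContainerOverrides)."""
    return os.environ["RUN_DATE"], os.environ["RUN_ID"]

# Simulate what ECS sets via ContainerOverrides at task launch
os.environ["RUN_DATE"] = "2024-01-15"
os.environ["RUN_ID"] = "run-001"

run_date, run_id = get_run_params()
print(f"Validating data for run {run_id} on {run_date}")
```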
Extending or updating this ECS Fargate component is as simple as editing the relevant files and/or the CloudFormation template resource configuration, followed by a git commit > pull request > code review > merge > CI/CD > production.
Given the modular microservice architecture of this ML training pipeline, any changes to this component will not break any other component in the pipeline and vice-versa.
Additional benefits of ECS Fargate Tasks can be unlocked through the CloudFormation template configuration:
- Granular control over deployments
- Versioning and rollback
- Inference accelerators
- Mixed strategy of on-demand and spot EC2 instances
- Native integration with AWS Batch for running concurrent batch jobs at scale
Have you built fully serverless microservice architectures in the cloud? What benefits did you obtain? What capabilities did you give up? Let us know in the comments!
Subscribe to our weekly LinkedIn newsletter: Machine Learning In Production
Reach out if you need help:
- Maximizing the business value of your data to improve core business KPIs
- Deploying & monetizing your ML models in production
- Building Well-Architected production ML software solutions
- Implementing cloud-native MLOps
- Training your teams to systematically take models from research to production
- Identifying new DS/ML opportunities in your company or taking existing projects to the next level
- Anything else we can help you with
Would you like me to speak at your event? Email me at firstname.lastname@example.org
Subscribe to our blog: https://gradientgroup.ai/blog/