Creating a machine learning software system is like constructing a building. If the foundation is not solid, structural problems can undermine the integrity and function of the building.
MLOps considerations, such as systematically building, training, deploying, and monitoring machine learning models, are only a subset of all the elements required for end-to-end production software solutions.
This is because a machine learning model is not deployed to production in a vacuum. It is integrated within a larger software application, which itself is integrated within a larger business process with the goal of achieving specific business outcomes.
As we covered previously, production ML solutions require modern software engineering design and best practices across the entire stack.
There are 5 pillars of architecture design for production software solutions in the cloud:
- Operational Excellence
- Performance Efficiency
- Reliability
- Cost Optimization
- Security
These 5 pillars encapsulate the AWS Well-Architected Framework, but they are so fundamental to production software solutions that they apply regardless of cloud provider.
Incorporating these pillars into your architecture will help you produce stable and efficient systems. This will also allow you to focus on the other aspects of design, such as functional requirements to create great end user experiences.
Let’s go through each pillar individually (courtesy of AWS):
The Operational Excellence pillar includes the ability to support development and run workloads effectively, gain insight into their operations, and to continuously improve supporting processes and procedures to deliver business value.
There are 5 design principles for operational excellence in the cloud:
- Perform operations as code: In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure) as code and update it with code. You can implement your operations procedures as code and automate their execution by triggering them in response to events. By performing operations as code, you limit human error and enable consistent responses to events. Check out our article titled “Infrastructure-as-Code for ML Pipelines” for more information.
- Make frequent, small, reversible changes: Design workloads to allow components to be updated regularly. Make changes in small increments that can be reversed if they fail (without affecting customers when possible). Check out our article titled “Modular Deployments in CI/CD Pipelines” for more information.
- Refine operations procedures frequently: As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.
- Anticipate failure: Perform “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated. Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure that they are effective, and that teams are familiar with their execution. Set up regular game days to test workloads and team responses to simulated events.
- Learn from all operational failures: Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
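To make "perform operations as code" concrete, here is a minimal sketch of an event-driven runbook: remediation procedures written as ordinary functions, registered against event types, and dispatched automatically. The event types and remediation steps are hypothetical, and a real system would call your cloud provider's APIs instead of returning strings.

```python
# Operations as code: runbook steps expressed as Python functions,
# dispatched automatically in response to events. Event names and
# remediation actions are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Event:
    source: str   # e.g., the affected resource
    detail: str   # human-readable description

# Registry mapping event types to automated remediation procedures.
RUNBOOK: Dict[str, Callable[[Event], str]] = {}

def handles(event_type: str):
    """Register a remediation function for an event type."""
    def decorator(fn):
        RUNBOOK[event_type] = fn
        return fn
    return decorator

@handles("disk_pressure")
def expand_volume(event: Event) -> str:
    # In practice this would invoke an infrastructure API.
    return f"expanded volume on {event.source}"

@handles("unhealthy_instance")
def replace_instance(event: Event) -> str:
    return f"replaced {event.source}"

def dispatch(event_type: str, event: Event) -> str:
    """Run the registered procedure, or escalate to a human."""
    handler = RUNBOOK.get(event_type)
    return handler(event) if handler else f"escalate: {event.detail}"
```

Because the runbook lives in version control alongside application code, every remediation step is reviewable, testable, and consistent, which is exactly how operations-as-code limits human error.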
The Performance Efficiency pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
There are 5 design principles for performance efficiency in the cloud:
- Democratize advanced technologies: Make advanced technology implementation easier for your team by delegating complex tasks to your cloud vendor. Rather than asking your IT team to learn about hosting and running a new technology, consider consuming the technology as a service. This allows your team to focus on product development rather than resource provisioning and management.
- Deploy globally: Deploying your workload in multiple regions around the world allows you to provide lower latency and a better experience for your customers at minimal cost.
- Use serverless architectures: Serverless architectures remove the need for you to run and maintain physical servers for traditional compute activities. This removes the operational burden, lowers total cost of ownership, and can reduce transactional costs because managed services operate at cloud scale.
- Experiment more often: With virtual and automatable resources, you can quickly carry out comparative testing using different types of instances, storage, or configurations.
- Leverage services based on intended use: Understand how cloud services are consumed and always use the technology approach that aligns best with your workload goals. For example, consider data access patterns when you select database or storage approaches.
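The "experiment more often" principle can be sketched as a small comparison harness: run the same workload under several candidate configurations and let measured results, not guesses, pick the winner. The workload and configurations below are toy stand-ins for real instance or storage choices.

```python
# Comparative testing sketch: time a toy workload under different
# batch-size configurations and report the median runtime of each.
# The workload itself is an illustrative placeholder.
import time
import statistics

def run_workload(batch_size: int, n_items: int = 10_000) -> float:
    """Time one run of a toy batched workload under one configuration."""
    start = time.perf_counter()
    for i in range(0, n_items, batch_size):
        batch = range(i, min(i + batch_size, n_items))
        sum(x * x for x in batch)  # stand-in for real work
    return time.perf_counter() - start

def compare(configs, trials: int = 3):
    """Median runtime per configuration: a simple, comparable scorecard."""
    return {
        bs: statistics.median(run_workload(bs) for _ in range(trials))
        for bs in configs
    }

results = compare([64, 512, 4096])
best = min(results, key=results.get)  # lowest median runtime wins
```

The same pattern scales up in the cloud: because resources are virtual and automatable, the "configurations" can be whole instance types or storage classes, provisioned and torn down per trial.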
The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
There are 5 design principles for reliability in the cloud:
- Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can trigger automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation of the service. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.
- Test recovery procedures: In an on-premises environment, testing is often conducted to prove that the workload works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This approach exposes failure pathways that you can test and ﬁx before a real failure scenario occurs, thus reducing risk.
- Scale horizontally to increase aggregate workload availability: Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure.
- Stop guessing capacity: A common cause of failure in on-premises workloads is resource saturation, when the demands placed on a workload exceed the capacity of that workload (this is often the objective of denial of service attacks). In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed.
- Manage change in automation: Changes to your infrastructure should be made using automation. The changes that need to be managed include changes to the automation, which then can be tracked and reviewed.
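The "automatically recover from failure" principle can be sketched as a KPI monitor: track a rolling window of a business-value metric (here, a hypothetical order success rate) and fire an automated recovery action when it breaches a threshold. The threshold, window size, and recovery action are all illustrative.

```python
# Automatic recovery sketch: watch a business KPI over a rolling
# window and trigger a recovery action on threshold breach.
# Threshold and recovery action are illustrative placeholders.
from collections import deque

class KpiMonitor:
    def __init__(self, threshold: float, window: int, recover):
        self.threshold = threshold            # minimum acceptable KPI
        self.samples = deque(maxlen=window)   # rolling window of outcomes
        self.recover = recover                # automated recovery action
        self.recoveries = 0

    def rate(self) -> float:
        return sum(self.samples) / len(self.samples)

    def record(self, success: bool) -> None:
        self.samples.append(1.0 if success else 0.0)
        # Only evaluate a full window, to avoid triggering on noise.
        if len(self.samples) == self.samples.maxlen and self.rate() < self.threshold:
            self.recover()
            self.recoveries += 1
            self.samples.clear()  # reset after remediation

actions = []
monitor = KpiMonitor(threshold=0.9, window=10,
                     recover=lambda: actions.append("restart"))
for ok in [True] * 5 + [False] * 5:   # success rate drops to 50%
    monitor.record(ok)
```

Note the monitor watches order success rate, a proxy for business value, rather than a technical metric like CPU utilization, matching the guidance above that KPIs should measure business value.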
The Cost Optimization pillar includes the ability to run systems to deliver business value at the lowest price point.
There are 5 design principles for cost optimization in the cloud:
- Implement Cloud Financial Management: To achieve financial success and accelerate business value realization in the cloud, we must invest in Cloud Financial Management for cost optimization. The organization needs to dedicate time and resources to build capability in this new domain of technology and usage management. We build this capability through knowledge building, programs, resources, and processes to become a cost-efficient organization.
- Adopt a consumption model: Pay only for the computing resources that you require and increase or decrease usage depending on business requirements, not by using elaborate forecasting. For example, development and test environments are typically only used for eight hours a day during the work week. You can stop these resources when they are not in use for a potential cost savings of 75% (40 hours versus 168 hours).
- Measure overall efficiency: Measure the business output of the workload and the costs associated with delivering it. Use this measure to know the gains you make from increasing output and reducing costs.
- Stop spending money on undifferentiated heavy lifting: Cloud providers do the heavy lifting of data center operations like racking, stacking, and powering servers. They also remove the operational burden of managing operating systems and applications with managed services. This allows you to focus on your customers and business projects rather than on IT infrastructure.
- Analyze and attribute expenditure: The cloud makes it easier to accurately identify the usage and cost of systems, which then allows transparent attribution of IT costs to individual workload owners. This helps measure return on investment (ROI) and gives workload owners an opportunity to optimize their resources and reduce costs.
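The consumption-model arithmetic from the dev/test example above can be sketched directly: 40 hours of actual weekly use versus 168 always-on hours. The hourly rate is an illustrative placeholder; only the ratio matters.

```python
# Consumption model sketch: cost of an always-on dev/test environment
# versus one stopped outside working hours. The hourly rate is illustrative.
HOURS_PER_WEEK = 24 * 7   # 168 hours always-on
USED_HOURS = 8 * 5        # 40 hours of actual use (8h/day, 5 days)

def weekly_cost(hourly_rate: float, hours: float) -> float:
    return hourly_rate * hours

always_on = weekly_cost(0.50, HOURS_PER_WEEK)
on_demand = weekly_cost(0.50, USED_HOURS)
savings_pct = 100 * (1 - on_demand / always_on)  # just over 75%
```

Stopping idle resources is the simplest consumption-model win because it needs no forecasting at all, only a schedule.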
The Security pillar encompasses the ability to protect data, systems, and assets, and to take advantage of cloud technologies to improve your security.
There are 7 design principles for security in the cloud:
- Implement a strong identity foundation: Implement the principle of least privilege access and enforce separation of duties with appropriate authorization for each interaction with your resources. Centralize identity management and aim to eliminate reliance on long-term static credentials.
- Enable traceability: Monitor, alert, and audit actions and changes to your environment in real time. Integrate log and metric collection with systems to automatically investigate and take action.
- Apply security at all layers: Apply a defense in depth approach with multiple security controls. Apply to all layers (for example, edge of network, VPC, load balancing, every instance and compute service, operating system, application, and code).
- Automate security best practices: Automated software-based security mechanisms improve your ability to securely scale more rapidly and cost-effectively. Create secure architectures, including the implementation of controls that are defined and managed as code in version-controlled templates.
- Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.
- Keep people away from data: Use mechanisms and tools to reduce or eliminate the need for direct access or manual processing of data. This reduces the risk of mishandling or modification and human error when handling sensitive data.
- Prepare for security events: Prepare for an incident by having incident management and investigation policy and processes that align to your organizational requirements. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.
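The "protect data" and "keep people away from data" principles can be sketched with field-level tokenization: sensitive values are replaced with deterministic keyed tokens before records reach analysts, so raw values never require direct human handling. The field names, token scheme, and hardcoded key are illustrative; a real key would come from a secrets manager.

```python
# Tokenization sketch: replace sensitive fields with deterministic,
# keyed tokens so downstream consumers never see raw values.
# Field names and the token scheme are illustrative placeholders.
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative; fetch from a secrets manager

def tokenize(value: str) -> str:
    """Deterministic keyed token: same input always maps to same token,
    so joins and aggregations still work on the tokenized data."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256)
    return "tok_" + digest.hexdigest()[:16]

SENSITIVE_FIELDS = {"email", "phone"}

def redact_record(record: dict) -> dict:
    """Tokenize sensitive fields; pass everything else through unchanged."""
    return {
        k: tokenize(v) if k in SENSITIVE_FIELDS else v
        for k, v in record.items()
    }

safe = redact_record({"email": "a@b.com", "plan": "pro"})
```

Because the tokens are deterministic, analysts can still count and join on them without ever touching the underlying sensitive values, reducing the risk of mishandling and human error.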
As you can see, there are many additional factors that go into production ML software solutions beyond deploying & monitoring machine learning models.
It is not possible to implement everything all at once. Therefore, we recommend establishing a baseline and iteratively improving the solutions over time. You can organize and prioritize work by creating epics, stories, and technical tasks, and executing them over 2-week sprints.
Getting the most ROI out of your models and overall investment in data science requires integration of your models into production software applications within existing business processes. These software applications must be well-architected for reliable and sustainable business value long-term.
This article especially applies to the 4th and final stage of the ML model deployment lifecycle. Check out our article titled “Lifecycle of ML Model Deployments to Production” for more information.
How does your team design production ML software solutions? Let us know in the comments!
Subscribe to our weekly LinkedIn newsletter: Machine Learning In Production
Reach out if you need help:
- Maximizing the business value of your data to improve core business KPIs
- Deploying & monetizing your ML models in production
- Building Well-Architected production ML software solutions
- Implementing cloud-native MLOps
- Training your teams to systematically take models from research to production
- Identifying new DS/ML opportunities in your company or taking existing projects to the next level
- Anything else we can help you with
Would you like me to speak at your event? Email me at firstname.lastname@example.org
Subscribe to Gradient Group: https://gradientgroup.ai/blog/