How do we ensure machine learning pipeline components produce the exact result we expect, especially prior to production deployments?
We could sanity-check by inspecting a few output records by hand, but how do we know that all output records are correct every time? This manual “ClickOps” approach (stage 1 automation) is not scalable, consistent, or reliable.
We previously covered how to convert an end-to-end Jupyter notebook into a serverless microservice architecture by modularizing and decoupling ML pipeline components.
Before deploying these ML microservices to production, we need a systematic and automated testing process to verify each component works as expected.
Let’s split testing requirements into 2 major categories:
- Data Validation
- Unit Testing
We will focus on testing individual ML pipeline components and, for now, exclude integration testing, end-to-end testing, etc. Furthermore, we will use a training pipeline for this example, but the same concepts apply to inference pipelines.
The first components of a training pipeline are usually:
- Data extraction microservice to read data from a data lake
- Data validation microservice to identify upstream data issues early
- Data transformation microservice for pre-processing and feature engineering (this component’s output becomes the input for the model training microservice)
The first component extracts a training dataset composed of “base” features (which will be used to engineer more predictive features) using PySpark. In general, production training pipelines require distributed computing environments to handle big data.
This training dataset may be extracted from a data lake, a data warehouse, or a relational database, or built by joining data from multiple streaming sources, among other options. These upstream data pipelines are built by data engineers, for example as a serverless delta lake (check out the AWS YouTube channel’s “This Is My Architecture” series to learn more about it).
Before proceeding to pre-processing and feature engineering, it’s a good idea to test this training dataset and verify that upstream data pipelines are producing data as expected. As many of us have experienced, broken data pipelines are among the most common sources of bugs in ML pipelines.
Start with a baseline understanding based on domain knowledge and early exploratory data analysis (EDA) of the relevant features. Common questions include:
- Does the extracted dataset schema match the expected schema?
- Does each column contain distinct values to allow a machine learning model to learn from it?
- Do any columns contain all null values?
- Are numeric feature ranges within expected statistical bounds?
- Are distinct categorical feature values what we expect, or are there missing or new categories coming in?
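As one illustration of the statistical-bounds question above, here is a minimal sketch in regular Python. It assumes that the null ratio of a feature has been recorded on earlier pipeline runs; the helper names and the historical values are hypothetical and only serve the example.

```python
from statistics import mean, stdev


def null_ratio(values):
    """Return the fraction of entries in `values` that are None."""
    values = list(values)
    if not values:
        raise ValueError("cannot compute a null ratio for an empty column")
    return sum(v is None for v in values) / len(values)


def within_three_sigma(observed, history):
    """Check that `observed` lies within mean +/- 3 standard deviations of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 points
    return abs(observed - mu) <= 3 * sigma


# Hypothetical null ratios for one feature, recorded on earlier pipeline runs.
historical_null_ratios = [0.020, 0.025, 0.018, 0.022, 0.021]

# Today's column: 1 null out of 50 rows, in line with the history.
todays_column = [1.5, None, 2.0, 3.1] + [4.0] * 46
assert within_three_sigma(null_ratio(todays_column), historical_null_ratios)
```

A column whose null ratio jumps well outside the historical band would fail the same check, which is usually a stronger signal than only testing for fully empty columns.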
This is a sample of what the data validation microservice code might look like in a PySpark script:

    from pyspark.sql import functions as f
    from pyspark.sql.types import (
        ArrayType, DateType, DecimalType, IntegerType,
        StringType, StructField, StructType,
    )

    # Maximum share of rows (in percent) allowed to violate a range check,
    # taken from the 0.5% stated in the docstring below.
    MAX_INVALID_PCT = 0.5


    def verify_input_schema(df) -> None:
        '''
        Verify the input dataset's schema matches the expected base feature schema.
        '''
        assert df.schema == StructType([
            StructField("Feature_1", IntegerType(), True),
            StructField("Feature_2", IntegerType(), True),
            StructField("Feature_3", IntegerType(), True),
            StructField("Feature_4", DateType(), True),
            StructField("Feature_5", StringType(), True),
            StructField("Feature_6", StringType(), True),
            StructField("Feature_7", StringType(), True),
            StructField("Feature_8", DecimalType(9, 2), True),
            StructField("Feature_9", ArrayType(StringType(), True), True),
            StructField("Feature_10", ArrayType(StringType(), True), True),
            StructField("Feature_11", StringType(), True),
            StructField("Feature_12", StringType(), True),
            StructField("Feature_13", ArrayType(StringType(), True), True),
            StructField("Feature_14", DecimalType(9, 2), True),
            StructField("Feature_15", DecimalType(9, 2), True),
            StructField("Label", IntegerType(), True)
        ])
        print("Input schema is correct.\n")


    def verify_useful_columns(df) -> None:
        '''
        Verify all columns contain distinctive information for ML models.
        '''
        # A column with a single distinct value carries no signal.
        constant_columns = [c for c in df.columns
                            if df.select(c).distinct().count() == 1]
        assert not constant_columns
        print("All columns contain distinct values.\n")


    def verify_no_empty_columns(df, baseline_counts) -> None:
        '''
        Verify we get no empty columns in the input dataset.
        A percentage of null values per feature makes more sense based on
        known statistics, such as 3 standard deviations outside the mean
        (pending implementation).
        '''
        for column in df.columns:
            null_counts = df.filter(f.col(column).isNull()).count()
            # A column is empty when every row is null.
            assert baseline_counts > null_counts
        print("No empty columns.\n")


    def verify_numeric_features(df, baseline_counts) -> None:
        '''
        Verify numeric feature ranges are valid based on domain knowledge.
        Check if more than 0.5% of rows have negative values
        (it would indicate an upstream data problem).
        Binary classification label can only have 2 distinct values,
        1 and 0, with no nulls.
        '''
        # Placeholder names: the integer and decimal columns of the schema above.
        numeric_columns = ["Feature_1", "Feature_2", "Feature_3",
                           "Feature_8", "Feature_14", "Feature_15"]
        for column in numeric_columns:
            output_counts = df.filter(f.col(column) < 0).count()
            ratio = output_counts / baseline_counts * 100
            assert ratio < MAX_INVALID_PCT

        # Placeholder name: the date column of the schema above.
        date_columns = ["Feature_4"]
        for column in date_columns:
            cutoff = f.lit("2015-01-01").cast("date")
            output_counts = df.filter(f.col(column) < cutoff).count()
            ratio = output_counts / baseline_counts * 100
            assert ratio < MAX_INVALID_PCT

        # Counting groups alone would accept {1, null}; compare the values.
        labels = {row["Label"] for row in df.select("Label").distinct().collect()}
        assert labels == {0, 1}
        print("Numeric feature ranges are valid.")


    # test_df is the DataFrame returned by the data extraction step.
    # baseline_counts is assumed here to be its total row count.
    baseline_counts = test_df.count()

    print("Verifying input dataset schema...\n")
    verify_input_schema(test_df)

    print("Verifying useful columns...\n")
    verify_useful_columns(test_df)

    print("Verifying no empty columns...\n")
    verify_no_empty_columns(test_df, baseline_counts)

    print("Verifying numeric feature ranges...\n")
    verify_numeric_features(test_df, baseline_counts)

    #print("Verifying distinct categorical features...\n")
    #verify_categorical_features(df, baseline)

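The script above leaves the categorical check commented out. A minimal sketch of what such a check might look like in regular Python follows. The function name, its signature, and the baseline categories are assumptions made for illustration, not the original implementation.

```python
def verify_categorical_values(observed, expected):
    """Assert that the observed categories are exactly the expected ones.

    Raises AssertionError naming the missing and the unexpected categories.
    """
    observed, expected = set(observed), set(expected)
    missing = expected - observed
    unexpected = observed - expected
    assert not missing and not unexpected, (
        f"missing categories: {sorted(missing, key=str)}; "
        f"unexpected categories: {sorted(unexpected, key=str)}"
    )


# Hypothetical baseline captured during EDA.
EXPECTED_CATEGORIES = {"bronze", "silver", "gold"}

# Passes: every expected category is present and nothing new appears.
verify_categorical_values(["gold", "silver", "bronze", "gold"], EXPECTED_CATEGORIES)
```

With PySpark, the observed values could be gathered with something like `[row[0] for row in df.select(column).distinct().collect()]` before calling the function, which is reasonable as long as the column has a small number of distinct values.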
You may use PySpark, TensorFlow Data Validation, or regular Python depending on your use case requirements and preferences. The Data Validation microservice, like all other ML pipeline microservices, is usually a Docker container that is built, tested, and deployed independently through CI/CD pipelines.
To illustrate the decoupled nature of the training pipeline microservice architecture, we could replace the AWS Glue Data Validation job with a TensorFlow Data Validation serverless container using ECS Fargate.
In the code above, if any assertion fails, the data validation microservice fails. This stops the training pipeline execution, so the detected data issues do not make it into production. This is the first systematic quality gate between data engineering and machine learning engineering in the staging environment of the deployment process.
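One way to make that gate explicit is to wrap the checks in a small runner that reports the first failure and returns an exit code. The sketch below is in regular Python and assumes the pipeline orchestrator treats a non-zero exit code as a failed step; the runner and the two checks are hypothetical.

```python
import sys


def run_checks(checks):
    """Run each (name, check) pair in order; return 0 if all pass, 1 on the first failure."""
    for name, check in checks:
        try:
            check()
        except AssertionError as error:
            print(f"FAILED {name}: {error}", file=sys.stderr)
            return 1
        print(f"passed {name}")
    return 0


def check_row_count():
    # Hypothetical check: the extracted dataset must not be empty.
    row_count = 1200
    assert row_count > 0, "dataset is empty"


def check_label_values():
    # Hypothetical check: a binary label may only contain 0 and 1.
    labels = {0, 1, 2}
    assert labels == {0, 1}, f"unexpected label values: {sorted(labels - {0, 1})}"


exit_code = run_checks([
    ("row count", check_row_count),
    ("label values", check_label_values),
])
# In the microservice entry point, this value would be passed to sys.exit().
```

Here the second check fails, so `exit_code` is 1 and the later pipeline steps would not run.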
Why would we let the Data Validation component cause the entire training pipeline execution to fail? Couldn’t the pre-processing and feature engineering “Transformer” microservice handle nulls, impute, clip outliers, and more?
If there is a fundamental problem with the way the data is generated upstream, naively handling it through pre-processing would not solve the root problem; it would only address a symptom. What’s worse, the training pipeline would probably complete successfully but yield an inferior model because of the hidden data quality issues.
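A small sketch in regular Python shows how imputation can hide such a problem. The column values and the `impute_with_mean` helper are made up for the example.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the non-null entries."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


# Hypothetical feature column where an upstream job stopped populating most rows.
raw = [10.0, None, None, 12.0, None, None, None, 11.0, None, None]

null_ratio_before = sum(v is None for v in raw) / len(raw)
imputed = impute_with_mean(raw)
null_ratio_after = sum(v is None for v in imputed) / len(imputed)

print(null_ratio_before)  # 0.7
print(null_ratio_after)   # 0.0
```

After imputation the column looks complete, yet 70% of its values are the same filler. A validation step that runs before the Transformer would still see the 70% null ratio and could stop the pipeline.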
The Data Validation microservice is vital for detecting potential upstream data pipeline problems, alerting the data engineering team accordingly, and stopping production deployments until further investigation. We use the learnings from these experiences to improve our data engineering and data validation processes.
The functions above must be unit tested, as well. After all, if we have a data validation function that checks if a column contains only null values, how do we know the function itself always behaves exactly that way?
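As a taste of what that might look like, here is a minimal sketch using Python's built-in `unittest` module. `column_is_all_null` is a hypothetical regular-Python stand-in for the null check, not one of the PySpark functions above.

```python
import unittest


def column_is_all_null(values):
    """Return True when every entry in a non-empty column is None."""
    values = list(values)
    return len(values) > 0 and all(v is None for v in values)


class ColumnIsAllNullTest(unittest.TestCase):
    def test_all_null_column_is_detected(self):
        self.assertTrue(column_is_all_null([None, None, None]))

    def test_partially_null_column_is_not_flagged(self):
        self.assertFalse(column_is_all_null([None, 3, None]))

    def test_falsy_values_are_not_treated_as_null(self):
        self.assertFalse(column_is_all_null([0, "", False]))

    def test_empty_column_is_not_flagged(self):
        self.assertFalse(column_is_all_null([]))


# Run the tests programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ColumnIsAllNullTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The edge cases (falsy values, an empty column) are where a validation function is most likely to behave differently from what we assume.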
We will cover unit testing in the next article, specifically for common functions in ML microservices.
Feel free to check out the various links throughout this article to learn more about those topics.
How do you test and validate datasets for machine learning pipelines early in the process before they reach production? Let us know in the comments!
Subscribe to our blog: https://gradientgroup.ai/blog/