You may have heard the term “bias” in artificial intelligence. It usually refers to machine learning algorithms that make biased predictions.
Biased predictions are a sign of underperforming machine learning models that were not trained with the proper datasets.
The performance of a machine learning model depends heavily on the quantity and quality of the dataset used to train it.
Quantity of data is intuitive and self-explanatory. In general, the more relevant data we have, the better the models tend to perform, because they learn from more examples. This helps machine learning models generalize well once deployed in production.
The quality of the dataset refers to:
- How representative the dataset is of the real-life situations the ML model will encounter.
- In the case of classification datasets, whether binary or multi-class, whether the categories are evenly distributed, with equal (or near-equal) numbers of examples in each.
- Having enough variation of examples, including edge cases (these are extremely common in production).
- In the practical case of supervised learning, making sure the data has been labeled properly and correctly (humans even have their own bias during labeling).
- Having all the features necessary to correctly address the business problem you are trying to solve, while eliminating redundant/unnecessary features (this is part of feature engineering).
You also want to consider the freshness and temporal relevance of the data. If you are using historical data to make predictions about the future, make sure the historical factors/features are still relevant today and for the coming months.
If either the business problem or dataset changes over time, you need to account for this within your machine learning process to make sure the models maintain acceptable performance over time.
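One simple way to watch for this kind of change is to compare how a categorical feature is distributed in the training data versus recent production data. The sketch below is only an illustration: the `device` feature, the sample counts, and the 0.1 threshold are made-up assumptions, and total variation distance is just one of several measures you could use. It also assumes both lists are non-empty.

```python
from collections import Counter


def proportions(values):
    """Return the share of each category as a fraction between 0 and 1."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(reference, recent):
    """Half the sum of absolute differences between two category distributions.

    0.0 means the distributions are identical, 1.0 means they do not overlap.
    """
    p, q = proportions(reference), proportions(recent)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical feature values: device type at training time vs. last month.
training_device = ["desktop"] * 700 + ["mobile"] * 300
recent_device = ["desktop"] * 400 + ["mobile"] * 600

DRIFT_THRESHOLD = 0.1  # assumed value; tune it for your own product

drift = total_variation_distance(training_device, recent_device)
needs_review = drift > DRIFT_THRESHOLD

print(f"drift score: {drift:.2f}, needs review: {needs_review}")
```

If the score crosses your threshold, that is a signal to look at the data and consider retraining, not proof that the model has degraded.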
All of these quality factors of a given dataset greatly influence the performance of machine learning models.
In practice, bias in artificial intelligence often originates from low-quality training datasets, and it usually involves issues with one or more of the factors mentioned above.
The most common factor leading to biased machine learning models is the second one on the list above: class imbalance. The different categories within a training dataset are simply not evenly distributed, and the machine learning model learns this skewed distribution.
For example, suppose you have a dataset of 1,000 images labeled as either dog or cat. 900 images are of dogs and 100 images are of cats.
During training, the machine learning model will learn the characteristics/features of dogs a lot more than those of cats. Therefore, in its own little universe, the model will be biased towards identifying everything as a dog, including cats, because it’s seen a lot more of that category than the other. This extends to multi-class classification, as well.
Uneven distribution of categories in the training dataset = biased ML model
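To make the dog/cat example concrete, here is a minimal sketch in plain Python. It measures the class distribution, shows why a model that always answers "dog" still looks accurate, and computes inverse-frequency class weights, which is one common way (not the only one) to compensate for imbalance during training. The function names are my own for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def inverse_frequency_weights(labels):
    """Weight each class by total / (number_of_classes * class_count).

    Rare classes get larger weights, so their errors count more in training.
    """
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# The dataset from the example: 900 dog images and 100 cat images.
labels = ["dog"] * 900 + ["cat"] * 100

shares = class_distribution(labels)
weights = inverse_frequency_weights(labels)

# A "model" that always predicts the majority class.
predictions = ["dog"] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
cat_recall = (
    sum(p == y == "cat" for p, y in zip(predictions, labels)) / labels.count("cat")
)

print(shares)      # share of each class in the dataset
print(weights)     # per-class training weights
print(accuracy)    # overall accuracy of the always-"dog" model
print(cat_recall)  # fraction of cats the model actually finds
```

The always-"dog" model scores 90% accuracy while finding zero cats, which is why overall accuracy alone can hide this kind of bias. Look at per-class metrics as well.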
Here are some examples of biased machine learning models in practice:
- Self-driving vehicles being biased towards identifying certain demographics vs others.
- HR/recruitment algorithms being biased towards ‘selecting’ certain applications vs others.
- Speech recognition algorithms being biased towards identifying certain accents vs others.
- Custom computer vision models being biased towards identifying certain objects (in certain locations) vs others.
Whether you are an AI/ML product manager or VP of AI/ML Products, make sure you and your team are on the lookout for sources of bias within datasets. A healthy level of paranoia is helpful because these models will (hopefully) end up in production, affecting users, customers, clients, internal stakeholders, and the overall business. Over time, they may also affect your overall industry/sector.
Make sure you address bias and dataset quality upfront before getting too deep into AI/ML product development. Iterate and test constantly with production data to squeeze out the performance blind spots in your models.
When you identify the scenarios where your model is not performing well, incorporate more of those examples within your training dataset. Again, iterate, test, and improve until you hit your target KPIs.
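One way to find those scenarios is to break evaluation results down by slice instead of looking at a single overall number. The sketch below assumes you can tag each evaluated example with a scenario name. The slice names, records, and 0.7 target are hypothetical and only there to show the idea.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice.

    records: iterable of (slice_name, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, truth, predicted in records:
        total[slice_name] += 1
        correct[slice_name] += int(truth == predicted)
    return {name: correct[name] / total[name] for name in total}


def underperforming_slices(records, target):
    """Return the names of slices whose accuracy falls below the target."""
    scores = accuracy_by_slice(records)
    return sorted(name for name, acc in scores.items() if acc < target)


# Hypothetical evaluation results from production traffic.
records = [
    ("daylight", "cat", "cat"),
    ("daylight", "dog", "dog"),
    ("daylight", "dog", "dog"),
    ("daylight", "cat", "dog"),
    ("low_light", "cat", "dog"),
    ("low_light", "dog", "dog"),
    ("low_light", "cat", "dog"),
    ("low_light", "cat", "dog"),
]

scores = accuracy_by_slice(records)
weak = underperforming_slices(records, target=0.7)  # assumed target KPI

print(scores)
print(weak)
```

The slices that come back as weak are the ones to collect and label more examples for before the next training round.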
If you need help to accelerate your company’s machine learning efforts, or if you need help getting started with enterprise AI adoption, send me a LinkedIn message or email me at firstname.lastname@example.org and I will be happy to help you.
Client case studies and testimonials: https://gradientgroup.ai/enterprise-case-studies/
Follow me on LinkedIn for more content: linkedin.com/in/CarlosLaraAI