How To Scope Out A Dataset From Scratch (Enterprise ML)

Every machine learning solution requires a dataset that encapsulates the business problem to be solved.

A machine learning system will ingest this dataset, learn its complex patterns/relationships, and output a set of business predictions that help solve a specific business problem.

This sounds great, but how do you acquire this dataset?

Here are the three steps to follow when scoping out the dataset for a machine learning solution:

1) Identify machine learning use cases

Enterprise machine learning always begins with a set of candidate use cases, especially at the early stages of AI adoption. Each use case aims to solve a specific business problem, which is part of a larger business process.

Therefore, always begin by identifying as many business problems as possible. Don’t even ask yourself yet whether machine learning is the right solution – simply make a list of business problems.

It takes a collaboration between subject matter experts and machine learning experts to determine whether ML is the right approach to solve a given business problem.

An effective practice is to assess each business problem individually and prioritize implementation based on dataset requirements and availability, complexity, timeline, resources, and other parameters.
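One lightweight way to make this prioritization concrete is a weighted scoring sheet over the assessment criteria. The criteria names, weights, and ratings below are illustrative assumptions, not a prescribed rubric:

```python
# Illustrative use-case prioritization: rate each candidate business
# problem on a few criteria (1 = poor fit, 5 = strong fit) and rank.
# Criteria names and weights are assumptions for this sketch.

CRITERIA_WEIGHTS = {
    "data_availability": 0.35,  # does a usable historical dataset exist?
    "business_impact": 0.30,
    "complexity": 0.20,         # higher score = lower complexity
    "timeline_fit": 0.15,
}

def score_use_case(ratings: dict) -> float:
    """Weighted average of the 1-5 ratings for a single use case."""
    return sum(CRITERIA_WEIGHTS[c] * ratings[c] for c in CRITERIA_WEIGHTS)

candidates = {
    "customer churn prediction": {"data_availability": 5, "business_impact": 4,
                                  "complexity": 4, "timeline_fit": 4},
    "demand forecasting":        {"data_availability": 3, "business_impact": 5,
                                  "complexity": 2, "timeline_fit": 3},
}

# Rank candidates from most to least promising.
ranked = sorted(candidates, key=lambda name: score_use_case(candidates[name]),
                reverse=True)
print(ranked)
```

The point is not the particular numbers but forcing subject matter experts and ML experts to rate each use case on the same explicit criteria, with dataset availability weighted heavily.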

Speaking of dataset requirements and availability, each candidate ML use case requires a specific dataset that depends on the corresponding business problem we want to solve.

The next step is to determine the composition of this dataset.

2) Select features based on domain knowledge

Once we have selected a specific business problem as a promising candidate for a machine learning solution, the next step is to identify the most important features to include in our dataset.

Think of a dataset as a set of rows and columns resembling a relational database table. Each column represents a characteristic or attribute of the business problem we are trying to solve. These columns are called features of the dataset.

For example, a dataset for customer churn prediction may contain one row for each customer the company has ever had. Features may include geographic location, last time of purchase, frequency of purchase, service ratings, number of customer service calls, product usage, and many more.
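Concretely, a couple of rows of such a churn dataset might look like the following. All column names and values here are made up for illustration:

```python
# Toy churn dataset: each dict is one customer (a row); the keys are
# the characteristics of the problem (the columns, i.e. the features).
# All names and values are illustrative.
churn_rows = [
    {"customer_id": 1, "region": "EMEA", "days_since_last_purchase": 12,
     "purchases_per_month": 3.5, "service_rating": 4, "support_calls": 1,
     "churned": 0},
    {"customer_id": 2, "region": "APAC", "days_since_last_purchase": 190,
     "purchases_per_month": 0.2, "service_rating": 2, "support_calls": 7,
     "churned": 1},
]

# The feature columns exclude the row identifier and the prediction target.
feature_columns = [c for c in churn_rows[0]
                   if c not in ("customer_id", "churned")]
print(feature_columns)
```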

In general, a structured dataset may be composed of multiple SQL joins across tables in multiple databases, whether OLTP or OLAP. This depends on the company, but enterprises will generally have legacy databases and tables from which we can extract useful features.

It would be a waste of historical data if we enforced the need to modernize data infrastructure and collect new, ‘fully clean’ data before getting business value from it. Therefore, we always do our best to leverage existing historical data unless the technical debt is truly making it impossible to implement a specific ML use case.

For a given machine learning use case, how do you know which features to include in the corresponding dataset?

This is entirely domain-dependent. Machine learning experts must collaborate with people who deeply understand the business domain (e.g., churn in the example above), and also with database experts who understand the schemas: what specific columns mean, how the data was collected, which columns are reliable, which columns are missing, and so on.

This collaboration will shed light on which features we need to include in our dataset to properly encapsulate the business problem. This is why domain knowledge is the most important element of AI adoption.

3) Export the dataset

Once we have identified the most important features for our dataset and gained access to the corresponding databases, the next step is to join the tables (usually a combination of fact and dimension tables) on IDs while only keeping the important features.
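As a minimal sketch of that join step, here is an in-memory SQLite example joining a toy fact table to a dimension table on an ID while keeping only the selected features. The table names, columns, and values are invented for this illustration:

```python
import sqlite3

# In-memory toy schema: a fact table of orders and a customer dimension
# table, joined on customer_id. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER,
                              amount REAL);
    CREATE TABLE dim_customer (customer_id INTEGER, region TEXT);
    INSERT INTO fact_orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 11, 5.0);
    INSERT INTO dim_customer VALUES (10, 'EMEA'), (11, 'APAC');
""")

# Join on the ID and keep only the features we scoped out:
# region plus total spend per customer.
rows = conn.execute("""
    SELECT c.customer_id, c.region, SUM(f.amount) AS total_spend
    FROM fact_orders f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(rows)
```

In a real enterprise setting the same query shape runs against the warehouse, typically spanning several fact and dimension tables, and its result is the raw dataset described below.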

This ‘joined’ dataset becomes our raw dataset for our machine learning use case.

This raw dataset may be composed of billions of rows, and the size can easily exceed 1 TB. Therefore, we need a process to export the SQL query results from the data warehouse to a cloud storage location, such as an S3 bucket. This process will export the dataset in chunks – usually up to 5 GB each.
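The chunking logic can be sketched independently of any particular warehouse or cloud SDK. In this sketch the 5 GB limit is scaled down for demonstration, and the upload function is a stub standing in for a real S3 client call; the bucket path is hypothetical:

```python
from typing import Iterable, Iterator, List

CHUNK_MAX_BYTES = 5 * 1024**3  # ~5 GB per exported file (assumed limit)

def chunk_rows(rows: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group serialized rows into chunks of at most max_bytes each."""
    chunk, size = [], 0
    for row in rows:
        if chunk and size + len(row) > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(row)
        size += len(row)
    if chunk:
        yield chunk

def upload_chunk(chunk: List[bytes], part: int) -> str:
    """Stub for the real upload (e.g. an S3 client call); returns the key."""
    return f"s3://my-bucket/raw-dataset/part-{part:05d}"  # hypothetical path

# Tiny demo: 100 serialized rows, chunked with a 64-byte limit.
rows = (f"row-{i}\n".encode() for i in range(100))
keys = [upload_chunk(c, i) for i, c in enumerate(chunk_rows(rows, max_bytes=64))]
print(len(keys))
```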

This is not the end, of course, as we still need to preprocess the dataset and get it truly ready for machine learning – usually using a distributed computing system such as Apache Spark on AWS EMR. The most common steps are data preprocessing and feature engineering, which we will cover in another post.

Scoping out a dataset from scratch and selecting the right features is not easy. It takes creativity, a deep understanding of the specific business domain, a technical understanding of an overall machine learning solution, and passion for the actual work.

This combination is what positions a machine learning use case to succeed.

If you need help to accelerate your company’s machine learning efforts, or if you need help getting started with enterprise AI adoption, send me a LinkedIn message or email me at and I will be happy to help you.

Subscribe to this blog to get the latest tactics and strategies to thrive in this new era of AI and machine learning.

Subscribe to my YouTube channel for business AI video tutorials and technical hands-on tutorials.

Client case studies and testimonials:

Follow me on LinkedIn for more content:
