16 Data Preprocessing

Now that we have a good idea of what prediction with Machine Learning looks like, let’s start applying these methods to real-world examples.

Can you, today, take data on something you want to predict and apply a KNN or a Decision Tree algorithm? Most likely not, because we are still missing a step.

This chapter introduces the step that bridges the gap between the big ideas and real-world ML applications: data preprocessing.

You may have noticed that all the data used in this book was synthetic, i.e., generated by a quick Python script. There are several reasons for this:

Ensuring that the data has the required properties
Copyright, legal, and ethical issues
Convenience

This generated data had some characteristics you may have noted:

Only numeric features, no categorical or date fields
Features on the same scale (most of the time)
No missing data

In other words, every example so far used a clean table of numbers, mostly on the same scale.

This will not be the case for most of the datasets you will come across. It is a messy world out there. Some data will be missing, and some will not even be numeric.
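To make this concrete, here is a minimal sketch of what a messy table can look like and how to take a first inventory of its problems. The records, the column names and the `audit` helper are all invented for illustration; they do not come from a real dataset or library.

```python
# Hypothetical records: column names and values are invented for illustration.
# None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Paris", "signup_date": "2021-03-14"},
    {"age": None, "income": 61000.0, "city": "Lyon", "signup_date": "2020-11-02"},
    {"age": 45, "income": None, "city": "Paris", "signup_date": "2019-07-30"},
    {"age": 29, "income": 38000.0, "city": None, "signup_date": "2022-01-05"},
]


def audit(rows):
    """Summarise each column: missing count, numeric or not, and value range."""
    report = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        # A column counts as numeric only if every non-missing value is a number.
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        report[col] = {
            "missing": len(values) - len(present),
            "numeric": numeric,
            "range": (min(present), max(present)) if numeric else None,
        }
    return report


report = audit(rows)
for col, info in report.items():
    print(col, info)
```

In this toy table, `age` and `income` are numeric but sit on very different scales and each has a missing value, while `city` and `signup_date` are not numbers at all. These are exactly the three issues listed above.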

The following chapters will explore ways to deal with these issues one by one. This last part of the book will allow you to apply Machine Learning models to any regression or classification problem you come across.

Note: Preprocessing is a common source of information leakage between the training set and the test set. If the idea of train/test split is not clear, I would recommend reading through the Model Evaluation chapter once more. The concept of information leakage will be explored further in this section.
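As a first intuition for how leakage can happen, consider rescaling a feature. The sketch below uses made-up numbers and two hypothetical helpers, `fit_scaler` and `transform`; it is one possible illustration, not the only form leakage can take. The safe version computes the mean and standard deviation on the training set only, then reuses them on the test set. The leaky version computes them on all the data, so the test set influences the preprocessing.

```python
from statistics import mean, pstdev

# Made-up feature values, already split into training and test sets.
train = [10.0, 12.0, 14.0, 16.0]
test = [30.0, 50.0]


def fit_scaler(values):
    """Learn the scaling parameters (mean and standard deviation) from values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardise values using previously learned parameters."""
    return [(v - mu) / sigma for v in values]


# Safe: the parameters are learned from the training set only...
mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
# ...and then applied unchanged to the test set.
test_scaled = transform(test, mu, sigma)

# Leaky: the parameters are learned from training and test data together,
# so information about the test set leaks into the preprocessing step.
leaky_mu, leaky_sigma = fit_scaler(train + test)

print("train-only mean:", mu)
print("leaky mean:", leaky_mu)
```

The two means differ because the leaky version has already "seen" the test values, which is what the train/test split is meant to prevent.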