15  Data Preprocessing

Now that we have a good idea of what prediction with Machine Learning looks like, let’s start applying these methods to real-world examples.

Can you, today, take data on something you want to predict and apply a KNN or a Decision Tree algorithm? Most likely not, because we are still missing a step.

This chapter introduces data preprocessing, the step that bridges the gap between the big ideas described in the previous chapters and real-world ML applications.

You may have noticed that all the data used in this book was synthetic, i.e., generated by a quick Python script. There are several reasons for this:

This generated data had some characteristics you may have noted:

In other words, in all of the examples, the data was a clean table of numbers on the same scale.

This will not be the case for most of the datasets you will come across. It is a messy world out there: some data will be missing, and some will not even be numbers.
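To make this concrete, here is a minimal sketch of what a messy table might look like and how you could count its problems. The records, the column names (`age`, `income`, `city`) and the helper `summarize` are all hypothetical and only serve as an illustration; nothing here comes from a real dataset.

```python
# Hypothetical raw records: made-up columns and values, for illustration only.
rows = [
    {"age": 34, "income": 52000.0, "city": "Paris"},
    {"age": None, "income": 61000.0, "city": "Lyon"},
    {"age": 45, "income": None, "city": "Paris"},
    {"age": 29, "income": 1200000.0, "city": None},
]


def summarize(rows):
    """Count missing (None) and non-numeric values in each column."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            stats = summary.setdefault(column, {"missing": 0, "non_numeric": 0})
            if value is None:
                stats["missing"] += 1
            elif not isinstance(value, (int, float)):
                stats["non_numeric"] += 1
    return summary


summary = summarize(rows)
for column, stats in summary.items():
    print(column, stats)
```

Even in this tiny table, every column has a missing value, `city` is not numeric, and `age` and `income` live on very different scales. These are the kinds of issues the following chapters address.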

The following chapters will explore ways to deal with these issues one by one. This last part of the book will allow you to apply Machine Learning models to any regression or classification problem you come across.

Note: Preprocessing is a common source of information leakage between the training set and the test set. If the idea of train/test split is not clear, I would recommend reading through the Model Evaluation chapter once more. The concept of information leakage will be explored further in this section.
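As a first intuition for how leakage can happen, consider scaling a feature by its mean and standard deviation. The sketch below uses made-up numbers and two hypothetical helpers, `fit_scaler` and `transform`, written with the standard library only. It is one possible illustration of the idea, not the method used later in the book.

```python
from statistics import mean, pstdev

# Hypothetical feature values, already split into a training and a test set.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]


def fit_scaler(values):
    """Learn the scaling statistics (mean and standard deviation)."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Apply previously learned statistics to a list of values."""
    return [(v - mu) / sigma for v in values]


# Without leakage: statistics are learned from the training set only,
# then reused as-is on the test set.
mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)

# With leakage: statistics are learned from train and test together,
# so the test value influences how the training data is scaled.
leaky_mu, leaky_sigma = fit_scaler(train + test)

print("train-only mean:", mu)
print("leaky mean:", leaky_mu)
```

The two means differ because the leaky version has already "seen" the test value. The model would then be trained on data shaped by information it is not supposed to have.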
