20  Dealing with Missing Categorical Data

Categorical values can also be missing. As a reminder, categorical features represent groups or categories. In the Berlin property pricing example, the neighbourhood of the property (e.g., “Neukölln”, “Kreuzberg” or “Charlottenburg”) is a categorical feature. As neighbourhood is such an important feature in property pricing, it may make sense to filter out observations for which it is missing.

As with numerical features, a categorical value may be missing for many reasons.

It is important to understand why data may be missing before coming up with an imputation strategy.
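A first step towards understanding why values are missing is to measure how much is missing and where. The following is a minimal sketch using pandas; the DataFrame, its column names and its values are made up for illustration and do not come from the chapter's dataset.

```python
import pandas as pd

# Hypothetical toy data; Python None marks a missing categorical value.
df = pd.DataFrame({
    "neighbourhood": ["Neukölln", "Kreuzberg", None, "Kreuzberg"],
    "outdoor_space": ["Balcony", None, None, "Garden"],
})

# Number and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Share of missing "outdoor_space" values within each neighbourhood.
# Rows with a missing neighbourhood are left out of the grouping.
missing_by_area = df["outdoor_space"].isna().groupby(df["neighbourhood"]).mean()

print(missing_count)
print(missing_share)
print(missing_by_area)
```

If missing values cluster in one group, this may hint at a data collection or processing issue rather than random gaps.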

20.1 Excluding Rows with Missing Observations

If a missing value is critical to the prediction task, filtering out rows with missing values is the safest strategy. You would not want to price a property without knowing its neighbourhood or its post code.
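As an illustration, rows with a missing critical feature could be filtered out as follows. This is a sketch under assumptions: the column names and values are hypothetical, and "neighbourhood" is assumed to be the only critical column.

```python
import pandas as pd

# Hypothetical toy data; one property has no neighbourhood recorded.
df = pd.DataFrame({
    "surface_area": [85, 120, 95, 70],
    "neighbourhood": ["Neukölln", None, "Charlottenburg", "Kreuzberg"],
    "sell_price": [420, 610, 390, 370],
})

# Columns assumed to be critical for the prediction task.
critical_columns = ["neighbourhood"]

# Keep only the rows where every critical column is present.
filtered = df.dropna(subset=critical_columns)
n_dropped = len(df) - len(filtered)

print(f"Dropped {n_dropped} of {len(df)} rows")
```

Tracking how many rows were dropped helps to notice when filtering removes a large part of the dataset.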

20.2 Excluding Columns with Missing Observations

If a column contains many missing values, removing the column or reviewing the data processing script should be the preferred strategy. This is similar to the method used for numerical missing values.
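One possible way to apply this is to drop any column whose share of missing values exceeds a threshold. The sketch below uses hypothetical data, and the 50% threshold is an arbitrary choice made for illustration, not a recommendation from this book.

```python
import pandas as pd

# Hypothetical toy data; "outdoor_space" is mostly missing.
df = pd.DataFrame({
    "neighbourhood": ["Neukölln", "Kreuzberg", "Charlottenburg", "Kreuzberg"],
    "outdoor_space": ["Balcony", None, None, None],
})

# Illustrative threshold: drop columns with more than 50% missing values.
max_missing_share = 0.5

missing_share = df.isna().mean()
columns_to_drop = missing_share[missing_share > max_missing_share].index.tolist()
reduced = df.drop(columns=columns_to_drop)

print(columns_to_drop)
```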

20.3 Excluding Rows with Missing Observations

Likewise, if the missing categorical value is critical to the prediction task, excluding the observation is the safest approach. Could you consider a property without knowing if it is a house or a flat? Or without knowing its neighbourhood?

20.4 Creating a New Null Category

In other cases, missing values could be replaced by a “missing” category and treated as another value for that categorical variable. Let’s make this more concrete with this example:

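| Surface Area (sq m) | Neighb.   | Energy Rating | Outdoor Space | Sell Price (K€) | Years since Build |
|---------------------|-----------|---------------|---------------|-----------------|-------------------|
| 85                  | Neukölln  | B             | Balcony       | 420             |                   |
| 120                 | Kreuzberg | A             | Terrace       | 610             | 8                 |
| 95                  | Charlott. | C             | None          | 390             |                   |
| 70                  | Kreuzberg | D             | Garden        | 370             | 22                |
| 110                 | Charlott. | B             | Balcony       | 700             | 12                |
| 60                  | Neukölln  | F             |               | 310             |                   |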

Here, one of the “Outdoor Space” values is missing. As mentioned above, the missing value can be replaced by a new category, labelled “Unknown” in this example.

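| Surface Area (sq m) | Neighb.   | Energy Rating | Outdoor Space | Sell Price (K€) | Years since Build |
|---------------------|-----------|---------------|---------------|-----------------|-------------------|
| 85                  | Neukölln  | B             | Balcony       | 420             |                   |
| 120                 | Kreuzberg | A             | Terrace       | 610             | 8                 |
| 95                  | Charlott. | C             | None          | 390             |                   |
| 70                  | Kreuzberg | D             | Garden        | 370             | 22                |
| 110                 | Charlott. | B             | Balcony       | 700             | 12                |
| 60                  | Neukölln  | F             | Unknown       | 310             |                   |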

This value can then be treated in the same way as all the other categorical values, with methods like One-Hot Encoding or Target Encoding.
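The replacement and the subsequent encoding could be sketched with pandas as follows. The column names are assumptions, and only two columns of the example table are reproduced. Note that the string "None" is a real category here (no outdoor space), whereas Python's `None` marks the missing value.

```python
import pandas as pd

# Two columns from the example table; the last outdoor space is missing.
df = pd.DataFrame({
    "surface_area": [85, 120, 95, 70, 110, 60],
    "outdoor_space": ["Balcony", "Terrace", "None", "Garden", "Balcony", None],
})

# Replace the missing value with an explicit "Unknown" category.
df["outdoor_space"] = df["outdoor_space"].fillna("Unknown")

# One-Hot Encode the column; "Unknown" gets its own indicator column.
encoded = pd.get_dummies(df, columns=["outdoor_space"])

print(encoded.columns.tolist())
```

The model can then learn whether a missing outdoor space carries any signal of its own.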

20.5 Information Leakage

Just like categorical variable encoding, handling missing categorical values can create information leakage.

The set of known categories, including the null category, should be determined from the training data only and then applied to the test set. Any category that appears only in the test set should be either excluded or labelled as missing.
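The sketch below shows one way this could look in practice, taking the option of labelling unseen categories as missing. The data is hypothetical, and the label `"Unknown"` and the variable names are illustrative choices.

```python
import pandas as pd

# Hypothetical training and test values for one categorical feature.
train = pd.Series(["Balcony", "Terrace", None, "Garden"], name="outdoor_space")
test = pd.Series(["Balcony", "Rooftop", None], name="outdoor_space")

NULL_LABEL = "Unknown"  # illustrative name for the null category

# Learn the known categories from the training data only.
train_clean = train.fillna(NULL_LABEL)
known_categories = set(train_clean.unique())
known_categories.add(NULL_LABEL)  # keep the label even if train has no gaps

# Apply to the test set: missing values and unseen categories
# are both mapped to the null category.
test_clean = test.fillna(NULL_LABEL)
test_clean = test_clean.where(test_clean.isin(known_categories), NULL_LABEL)

print(test_clean.tolist())
```

Here, "Rooftop" never appears in the training data, so it is treated in the same way as a missing value.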

20.6 Final Thoughts

This chapter walked through handling missing values with the following steps:

  • Understand why the values are missing
  • Explore whether filtering is needed
  • Flag missing values with a new categorical value

It is important to avoid information leakage when handling null values. This can be done by splitting the data into training and test sets before any preprocessing.

The next chapter will go back to numerical data and explore the issue of scaling.
