20 Dealing with Missing Categorical Data

Categorical values can also be missing. As a reminder, categorical features represent groups or categories. In the Berlin property pricing example, the neighbourhood of the property (e.g., “Neukölln”, “Kreuzberg” or “Charlottenburg”) is a categorical feature. As neighbourhood is such an important feature in property pricing, it may make sense to filter out observations for which it is missing.

As with numerical features, there are many reasons why a categorical value may be missing:

Missing Completely at Random: the observation is missing for an unknown reason, without any apparent pattern
Not Collected: optional question in a survey
Not Applicable: as a man donating blood, I leave the answer to the question “are you pregnant?” blank, as this is not applicable
Missing Measurement: a missing energy efficiency grade reading could mean that the property still needs an inspection

It is important to understand why data may be missing before coming up with an imputation strategy.
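A first step towards understanding why values are missing is to measure how much is missing and whether the gaps follow a pattern. The sketch below assumes the data sits in a pandas DataFrame; the column names and sample values are hypothetical and only mirror the Berlin property example.

```python
import pandas as pd

# Hypothetical sample; None marks a missing value.
df = pd.DataFrame({
    "neighbourhood": ["Neukölln", "Kreuzberg", None, "Kreuzberg"],
    "energy_rating": ["B", None, None, "D"],
    "outdoor_space": ["Balcony", "Terrace", "Garden", None],
})

# Count and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))

# Share of missing energy ratings per neighbourhood. Large differences
# between groups would suggest the values are not missing purely by chance.
rating_missing_by_area = (
    df["energy_rating"].isna().groupby(df["neighbourhood"], dropna=False).mean()
)
print(rating_missing_by_area)
```

A summary like this does not explain the cause by itself, but it shows which columns deserve a closer look before choosing a strategy.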

20.1 Excluding Rows with Missing Observations

If a feature is critical to the prediction task, filtering out the rows in which it is missing is the safest strategy. You would not want to price a property without knowing its neighbourhood or its post code.
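As a minimal sketch, assuming a pandas DataFrame with hypothetical column names, rows can be filtered on the critical columns only, so that gaps in less important columns do not cause rows to be dropped.

```python
import pandas as pd

# Hypothetical sample; None marks a missing value.
df = pd.DataFrame({
    "surface_area": [85, 120, 95, 70],
    "neighbourhood": ["Neukölln", None, "Charlottenburg", "Kreuzberg"],
    "outdoor_space": ["Balcony", "Terrace", None, "Garden"],
})

# Only columns listed here trigger row removal (an illustrative choice).
critical_columns = ["neighbourhood"]
filtered = df.dropna(subset=critical_columns)

print(f"Dropped {len(df) - len(filtered)} of {len(df)} rows")
```

The row with a missing outdoor space is kept here, because that column is not listed as critical.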

20.2 Excluding Columns with Missing Observations

If a column contains many missing values, removing the column or reviewing the data processing script that produces it should be the preferred strategy. This mirrors the approach used for missing numerical values.
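One way to make "many missing values" explicit is a threshold on the share of missing values per column. The sketch below assumes pandas; the 50% threshold and the `heating_type` column are illustrative assumptions, not recommendations from this chapter.

```python
import pandas as pd

# Hypothetical sample; heating_type is an invented, mostly empty column.
df = pd.DataFrame({
    "neighbourhood": ["Neukölln", "Kreuzberg", "Charlottenburg", "Kreuzberg"],
    "energy_rating": ["B", None, "C", "D"],
    "heating_type": [None, None, None, "Gas"],
})

MAX_MISSING_SHARE = 0.5  # illustrative threshold, to be tuned per project

# Share of missing values per column, then the columns above the threshold.
missing_share = df.isna().mean()
columns_to_drop = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()

reduced = df.drop(columns=columns_to_drop)
print("Dropped columns:", columns_to_drop)
```

Before dropping a column, it is worth checking whether the gaps come from a bug upstream, in which case fixing the processing script recovers the feature.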

20.3 Excluding Rows with Missing Observations

As with numerical features, if the missing categorical value is critical to the prediction task, excluding the observation is the safest approach. Could you price a property without knowing whether it is a house or a flat, or without knowing its neighbourhood?

20.4 Creating a New Null Category

In other cases, missing values can be replaced by a dedicated category (for example “Unknown”) and treated as just another value of that categorical variable. The following example makes this more concrete:

Surface Area (sq m)	Neighb.	Energy Rating	Outdoor Space	Sell Price (K€)	Years since Build
85	Neukölln	B	Balcony	420
120	Kreuzberg	A	Terrace	610	8
95	Charlott.	C	None	390
70	Kreuzberg	D	Garden	370	22
110	Charlott.	B	Balcony	700	12
60	Neukölln	F		310

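Here, the “Outdoor Space” value in the last row is missing. This is different from the “None” value in the third row, which indicates that the property has no outdoor space. As mentioned above, the missing value can be replaced by an “Unknown” value.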

Surface Area (sq m)	Neighb.	Energy Rating	Outdoor Space	Sell Price (K€)	Years since Build
85	Neukölln	B	Balcony	420
120	Kreuzberg	A	Terrace	610	8
95	Charlott.	C	None	390
70	Kreuzberg	D	Garden	370	22
110	Charlott.	B	Balcony	700	12
60	Neukölln	F	Unknown	310
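The replacement shown in the table above can be sketched in pandas as follows. The column is reproduced as a standalone Series for brevity, and the variable names are hypothetical.

```python
import pandas as pd

# The "Outdoor Space" column from the example table.
# The string "None" is a real category; the Python None is a missing value.
outdoor_space = pd.Series(
    ["Balcony", "Terrace", "None", "Garden", "Balcony", None],
    name="outdoor_space",
)

# Replace missing values with a dedicated category.
filled = outdoor_space.fillna("Unknown")
print(filled.tolist())
```

Only the genuinely missing entry is relabelled; the "None" category is left untouched.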

This value can then be treated in the same way as all the other categorical values, with methods like One-Hot Encoding or Target Encoding.
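As an illustration, One-Hot Encoding the filled column gives the new category its own indicator column. This is a minimal sketch using `pandas.get_dummies`; the `outdoor` prefix is an arbitrary choice.

```python
import pandas as pd

# "Outdoor Space" column after missing values were replaced by "Unknown".
outdoor_space = pd.Series(
    ["Balcony", "Terrace", "None", "Garden", "Balcony", "Unknown"],
    name="outdoor_space",
)

# One indicator column per category, including "Unknown".
encoded = pd.get_dummies(outdoor_space, prefix="outdoor", dtype=int)
print(encoded)
```

A model can then learn whether the fact that a value was missing carries any signal of its own.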

20.5 Information Leakage

As with categorical variable encoding, handling missing categorical values can create information leakage.

The handling of missing values should be defined on the training data only and then applied unchanged to the test set. Any category that appears only in the test set should be either excluded or labelled as missing.
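A minimal sketch of this separation, assuming pandas and the second option (labelling unseen categories as missing): the set of known categories is learned from the training data alone and then reused for the test data. The sample values, including "Rooftop", are hypothetical.

```python
import pandas as pd

MISSING_LABEL = "Unknown"

# Hypothetical training and test columns; "Rooftop" appears only in the test set.
train = pd.Series(["Balcony", "Terrace", None, "Garden"], name="outdoor_space")
test = pd.Series(["Balcony", None, "Rooftop"], name="outdoor_space")

# Learn the known categories from the training data only.
train_filled = train.fillna(MISSING_LABEL)
known_categories = sorted(train_filled.unique())

# Apply to the test data: fill missing values, then relabel unseen categories.
test_filled = test.fillna(MISSING_LABEL)
test_clean = test_filled.where(test_filled.isin(known_categories), MISSING_LABEL)

print(test_clean.tolist())
```

Nothing about the test set influences `known_categories`, so the test data stays a fair stand-in for data the model has never seen.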

20.6 Final Thoughts

This chapter walked through handling missing categorical values with the following steps:

Understand why the values are missing
Explore whether filtering is needed
Flag missing values with a new categorical value

It is important to avoid information leakage when handling null values. This can be done by splitting the data into training and test sets before any preprocessing takes place.

The next chapter will go back to numerical data and explore the issue of scaling.