Robust One-Hot Encoding. Production grade one-hot encoding… | by Hans Christian Ekne | Apr, 2024



The way we build traditional machine learning models is to first train the models on a “training dataset” — typically a dataset of historic values — and then later we generate predictions on a new dataset, the “inference dataset.” If the columns of the training dataset and the inference dataset don’t match, your machine learning algorithm will usually fail. This is primarily due to either missing or new factor levels in the inference dataset.

The first problem: Missing factors

For the following examples, assume that you used the dataset above to train your machine learning model. You one-hot encoded the dataset into dummy variables, and your fully transformed training data looks like the table below:

Transformed training dataset with pd.get_dummies / image by author

Now, let’s introduce the inference dataset, which is what you would use for making predictions. Let’s say it is given as below:

# Creating the inference_data DataFrame in Python
inference_data = pd.DataFrame({
    'numerical_1': [11, 12, 13, 14, 15, 16, 17, 18],
    'color_1_': ['black', 'blue', 'black', 'green',
                 'green', 'black', 'black', 'blue'],
    'color_2_': ['orange', 'orange', 'black', 'orange',
                 'black', 'orange', 'orange', 'orange']
})
Inference data with 3 columns / image by author

Using the same naive one-hot encoding strategy as above (pd.get_dummies):

# Converting categorical columns in inference_data to
# dummy variables with integers
inference_data_dummies = pd.get_dummies(inference_data,
    columns=['color_1_', 'color_2_']).astype(int)

This transforms your inference dataset in the same way, and you obtain the dataset below:

Transformed inference dataset with pd.get_dummies / image by author

Do you notice the problems? The first problem is that the inference dataset is missing the columns:

missing_columns = ['color_1__red', 'color_2__pink',
                   'color_2__blue', 'color_2__purple']

If you fed this into a model trained on the “training dataset,” it would usually crash.
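In a real pipeline you typically would not list these columns by hand. One quick way to surface the mismatch is to compare the column sets of the two transformed DataFrames. The snippet below is only a diagnostic sketch; training_data_dummies is an assumed variable name for the pd.get_dummies output of the training data shown earlier.

# Diagnostic sketch: which training columns are absent from the inference data?
# 'training_data_dummies' is an assumed name for the transformed training data.
missing_columns = set(training_data_dummies.columns) \
    - set(inference_data_dummies.columns)
print(missing_columns)
# e.g. {'color_1__red', 'color_2__pink', 'color_2__blue', 'color_2__purple'}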

The second problem: New factors

The other problem that can occur with one-hot encoding arises when your inference dataset includes new and unseen factor levels. Consider again the same datasets as above. If you examine them closely, you will see that the transformed inference dataset now has a new column: color_2__orange.

This is the opposite of the previous problem: our inference dataset contains new columns that our training dataset didn’t have. This is actually a common occurrence and can happen if one of your factor variables changes. For example, if the colours above represent colours of a car and a car producer suddenly started making orange cars, this data might not be available in the training data but could nonetheless show up in the inference data. In this case you need a robust way of dealing with the issue.

One could argue: why not simply list all the columns of the transformed training dataset as the columns required for your inference dataset? The problem here is that you often don’t know all the factor levels in the training data upfront.

For example, new levels could be introduced regularly, which makes such a list difficult to maintain. On top of that comes the process of matching your inference dataset with the training data: you would need to check all the transformed column names that actually went into the training algorithm and match them against the transformed inference dataset. Any missing columns would need to be inserted with 0 values, and any extra columns, like the color_2__orange column above, would need to be deleted. This is a rather cumbersome way of solving the issue, and thankfully there are better options available.
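For illustration, that manual alignment might look something like the sketch below. It assumes the list of transformed training column names is available as training_columns (an assumed, hand-maintained list); pandas’ reindex then adds any missing dummy columns filled with 0 and drops extras such as color_2__orange.

# Manual alignment sketch: 'training_columns' is an assumed, hand-maintained
# list of the transformed column names the model was trained on.
aligned_inference = inference_data_dummies.reindex(
    columns=training_columns, fill_value=0)
# Missing dummy columns are added as 0s; extra columns like
# 'color_2__orange' are dropped.

Even with this trick, you still have to persist and version training_columns yourself, which is exactly the bookkeeping the fitted-encoder approach below removes.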

The solution to this problem is rather straightforward; however, many of the packages and libraries that attempt to streamline the process of creating prediction models fail to implement it well. The key lies in having a function or class that is first fitted on the training data, and then using that same instance of the function or class to transform both the training dataset and the inference dataset. Below we explore how this is done using both Python and R.

In Python

Python is arguably one of the best programming languages to use for machine learning, largely due to its extensive network of developers, its mature package libraries, and its ease of use, which promotes rapid development.

The one-hot encoding issues we described above can be mitigated by using the widely available and well-tested scikit-learn library, and more specifically the sklearn.preprocessing.OneHotEncoder class. So, let’s see how we can use it on our training and inference datasets to create a robust one-hot encoding.

from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder
enc = OneHotEncoder(handle_unknown='ignore')

# Define columns to transform
trans_columns = ['color_1_', 'color_2_']

# Fit and transform the data
enc_data = enc.fit_transform(training_data[trans_columns])

# Get feature names
feature_names = enc.get_feature_names_out(trans_columns)

# Convert to DataFrame
enc_df = pd.DataFrame(enc_data.toarray(),
                      columns=feature_names)

# Concatenate with the numerical data
final_df = pd.concat([training_data[['numerical_1']],
                      enc_df], axis=1)

This produces a final DataFrame of transformed values, as shown below:

Transformed training dataset with sklearn / image by author

If we break down the code above, we see that the first step is to initialize an instance of the encoder class. We use the option handle_unknown='ignore' so that we avoid issues with unknown values in these columns when we later use the encoder to transform our inference dataset.

After that, we combine the fit and transform actions into one step with the fit_transform method. Finally, we create a new DataFrame from the encoded data and concatenate it with the rest of the original dataset.
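The crucial part is that the same fitted encoder is then reused on the inference data. The snippet below is a sketch of what that step could look like, reusing the enc, feature_names and inference_data objects from above. Note that we call transform rather than fit_transform, so the output always has exactly the columns seen during training, and an unseen level such as orange is simply encoded as all zeros thanks to handle_unknown='ignore'.

# Transform the inference data with the encoder fitted on the training data
inference_encoded = enc.transform(inference_data[trans_columns])

# Rebuild a DataFrame using the training-time feature names
inference_enc_df = pd.DataFrame(inference_encoded.toarray(),
                                columns=feature_names)

# Concatenate with the numerical column, mirroring the training pipeline
final_inference_df = pd.concat([inference_data[['numerical_1']],
                                inference_enc_df], axis=1)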
