Encoders#

sklearn.preprocessing.OneHotEncoder is the scikit-learn implementation of one-hot encoding. This page covers some details of working with it and with the related OrdinalEncoder.

import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

NaN handling#

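OneHotEncoder treats missing values in a column as a separate category. OrdinalEncoder, by default, leaves a real NaN as NaN in its output.

The following cells show how the encoders process a categorical column that contains np.nan. Note that NumPy coerces np.nan to the string 'nan' when it builds this string array, so here both encoders see 'nan' as an ordinary string category.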

example = np.array([np.nan, "a", "b", "b"])[:, None]
example
array([['nan'],
       ['a'],
       ['b'],
       ['b']], dtype='<U32')
OneHotEncoder(sparse_output=False).fit_transform(example)
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.]])
OrdinalEncoder().fit_transform(example)
array([[2.],
       [0.],
       [1.],
       [1.]])

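Both encoders assign a separate code to the rows holding 'nan': OneHotEncoder puts them in their own column, and OrdinalEncoder gives them the code 2, because the string 'nan' sorts after 'a' and 'b'.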

categories#

The categories argument lets you choose which categories become output columns. It takes one list of categories per input column.

The following cell generates and displays the example data used below.

np.random.seed(10)
categories = [chr(i) for i in range(ord("a"), ord("e"))]
sample_size = 10

test_frame = pd.DataFrame({
    "col1" : np.random.choice(categories, sample_size), 
    "col2" : np.random.choice(categories, sample_size)
})
test_frame
  col1 col2
0    b    a
1    b    b
2    a    b
3    d    c
4    a    a
5    b    b
6    d    a
7    a    c
8    b    a
9    b    c

The following code applies sklearn.preprocessing.OneHotEncoder, keeping only the a and b categories for the first column and the a and c categories for the second column. Values that are not listed are encoded as all zeros because handle_unknown='ignore' is set.

ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b"],
        ["a", "c"]
    ],
    handle_unknown='ignore'
)

pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
)
   col1_a  col1_b  col2_a  col2_c
0     0.0     1.0     1.0     0.0
1     0.0     1.0     0.0     0.0
2     1.0     0.0     0.0     0.0
3     0.0     0.0     0.0     1.0
4     1.0     0.0     1.0     0.0
5     0.0     1.0     0.0     0.0
6     0.0     0.0     1.0     0.0
7     1.0     0.0     0.0     1.0
8     0.0     1.0     1.0     0.0
9     0.0     1.0     0.0     1.0

Note: if some categories are omitted from categories and drop='first' is also set, the encoder can lose information silently. On its own, drop='first' loses nothing, because the dropped column can be reconstructed from the remaining ones: a row of all zeros means the first category. When values are also excluded through categories, however, an unlisted value is encoded as all zeros too, so it can no longer be told apart from the dropped first category. In this case sklearn emits a warning, although its text mentions only the unknown categories and not this ambiguity.

The following cell shows an example of such a case.

ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b", "c", "d"],
        ["a", "c"]
    ],
    handle_unknown='ignore',
    drop="first"
)

pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
)
/home/fedor/.virtualenvs/knowledge/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning: Found unknown categories in columns [1] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
   col1_b  col1_c  col1_d  col2_c
0     1.0     0.0     0.0     0.0
1     1.0     0.0     0.0     0.0
2     0.0     0.0     0.0     0.0
3     0.0     0.0     1.0     1.0
4     0.0     0.0     0.0     0.0
5     1.0     0.0     0.0     0.0
6     0.0     0.0     1.0     0.0
7     0.0     0.0     0.0     1.0
8     1.0     0.0     0.0     0.0
9     1.0     0.0     0.0     1.0

Extra categories#

You can also list categories that never occur in the data used for fitting. Such a category still gets its own output column, which contains only zeros for that data.


The next cell applies a transformation that references the m category for the first column of the input.

ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b", "m"],
        ["a", "b", "c"]
    ],
    handle_unknown='ignore'
)

pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
)
   col1_a  col1_b  col1_m  col2_a  col2_b  col2_c
0     0.0     1.0     0.0     1.0     0.0     0.0
1     0.0     1.0     0.0     0.0     1.0     0.0
2     1.0     0.0     0.0     0.0     1.0     0.0
3     0.0     0.0     0.0     0.0     0.0     1.0
4     1.0     0.0     0.0     1.0     0.0     0.0
5     0.0     1.0     0.0     0.0     1.0     0.0
6     0.0     0.0     0.0     1.0     0.0     0.0
7     1.0     0.0     0.0     0.0     0.0     1.0
8     0.0     1.0     0.0     1.0     0.0     0.0
9     0.0     1.0     0.0     0.0     0.0     1.0