Encoders#
sklearn.preprocessing.OneHotEncoder is the scikit-learn implementation of one-hot encoding. Here are some details about working with this tool.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
NaN handling#
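Encoders treat missing values in a column as a separate category.
The following cells show how different encoders process a categorical column that contains one. Note that NumPy converts np.nan to the string 'nan' when the array also holds strings, so in this example the encoders receive an ordinary string value.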
example = np.array([np.nan, "a", "b", "b"])[:, None]
example
array([['nan'],
       ['a'],
       ['b'],
       ['b']], dtype='<U32')
OneHotEncoder(sparse_output=False).fit_transform(example)
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.]])
OrdinalEncoder().fit_transform(example)
array([[2.],
       [0.],
       [1.],
       [1.]])
Both encoders assign a separate code to the position that holds the missing-value marker.
categories#
The categories argument lets you choose which categories become output columns.
The following cell generates and displays the example data used below.
np.random.seed(10)
categories = [chr(i) for i in range(ord("a"), ord("e"))]
sample_size = 10
test_frame = pd.DataFrame({
    "col1": np.random.choice(categories, sample_size),
    "col2": np.random.choice(categories, sample_size)
})
test_frame
| | col1 | col2 |
|---|---|---|
| 0 | b | a |
| 1 | b | b |
| 2 | a | b |
| 3 | d | c |
| 4 | a | a |
| 5 | b | b |
| 6 | d | a |
| 7 | a | c |
| 8 | b | a |
| 9 | b | c |
The following code applies sklearn.preprocessing.OneHotEncoder, keeping the a and b categories for the first column and the a and c categories for the second column. Values that are not listed are encoded as all zeros because handle_unknown='ignore' is set.
ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b"],
        ["a", "c"]
    ],
    handle_unknown='ignore'
)
pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns=ohe_transformer.get_feature_names_out()
)
| | col1_a | col1_b | col2_a | col2_c |
|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 1.0 | 0.0 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 1.0 | 1.0 | 0.0 |
| 9 | 0.0 | 1.0 | 0.0 | 1.0 |
Note that if some values are omitted from categories and drop='first' is also set, the encoder can lose information silently. On its own, drop='first' loses nothing, because the dropped column can be reconstructed from the remaining ones. But values excluded through categories are encoded as all zeros, exactly like the dropped first category, so the two can no longer be told apart. In this case sklearn emits a warning, although its text does not describe this problem directly.
The following cell shows an example of such a case: in the second column, b is not listed, so it gets the same encoding as the dropped a.
ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b", "c", "d"],
        ["a", "c"]
    ],
    handle_unknown='ignore',
    drop="first"
)
pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns=ohe_transformer.get_feature_names_out()
)
/home/fedor/.virtualenvs/knowledge/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning: Found unknown categories in columns [1] during transform. These unknown categories will be encoded as all zeros
warnings.warn(
| | col1_b | col1_c | col1_d | col2_c |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 1.0 | 0.0 | 0.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 | 1.0 |
Extra categories#
You can specify categories that do not occur in the data used for fitting. Such a category still gets its own output column, which contains only zeros.
The next cell applies a transformation that lists the m category for the first column of the input.
ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b", "m"],
        ["a", "b", "c"]
    ],
    handle_unknown='ignore'
)
pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns=ohe_transformer.get_feature_names_out()
)
| | col1_a | col1_b | col1_m | col2_a | col2_b | col2_c |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |