Column-specific processing#
Sometimes different columns need to be transformed in different ways. The most obvious example is the different treatment of categorical and numerical columns:
For numeric columns, you typically apply normalisation techniques;
For categorical columns, you typically apply encoding (one-hot, mean encoding, etc.).
It's easy to build such a transformation yourself, but it's convenient that sklearn
has an out-of-the-box solution that can be easily integrated into sklearn pipelines - sklearn.compose.ColumnTransformer.
Learn more here.
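Since the point of ColumnTransformer is that it plugs into sklearn pipelines like any other step, here is a minimal sketch of that integration. The toy data, column names, and the choice of LogisticRegression are illustrative assumptions, not part of the original example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy frame with one numeric and one categorical column (made up for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num": rng.normal(size=100),
    "cat": rng.choice(["a", "b", "c"], size=100),
})
y = rng.integers(0, 2, size=100)

# The ColumnTransformer drops straight into a Pipeline as a preprocessing step
model = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scale", StandardScaler(), ["num"]),
        ("encode", OneHotEncoder(), ["cat"]),
    ])),
    ("classifier", LogisticRegression()),
])
model.fit(df, y)
print(model.predict(df[:5]))
```

Calling `fit` on the pipeline fits the column transformer and the classifier in one go, so the per-column preprocessing is learned on exactly the data the model trains on.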
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
OneHotEncoder,
StandardScaler,
FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from IPython.display import HTML
header_template = "<text style='font-size:17px'>{}</text>"
Basic example#
So in the next cell a random data frame is generated, with some categorical and some numerical columns. Let's show how to build a pipeline component that processes categorical columns one way and numeric columns another.
sample_size = 500
np.random.seed(10)
generate_word = lambda: "".join([
chr(val) for val in
np.random.randint(ord("a"), ord("z") + 1, 10)
])
get_cat_var = lambda: np.random.choice(
[
generate_word() for i in
range(np.random.randint(2,7))
],
sample_size
)
get_num_var = lambda: np.random.normal(
np.random.uniform(-1,1),
np.random.uniform(1,10),
sample_size
)
variables_generator = [get_cat_var, get_num_var]
data_frame = pd.concat(
{
f"var {i}" : \
pd.Series(np.random.choice(variables_generator)())
for i in range(20)
},
axis = 1
)
data_frame.head()
var 0 | var 1 | var 2 | var 3 | var 4 | var 5 | var 6 | var 7 | var 8 | var 9 | var 10 | var 11 | var 12 | var 13 | var 14 | var 15 | var 16 | var 17 | var 18 | var 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.352738 | ghfmmekjzz | ewvspmvrkg | -3.916784 | 0.579251 | 3.078876 | jljighbmio | iieafcivri | -3.503851 | hwadgiwzth | zderdinjyy | -0.851043 | -7.137998 | -0.990391 | 4.128471 | lduutwjjin | -6.858011 | -3.455499 | kdzpmsglss | fjogwgrkig |
1 | -1.562264 | ghfmmekjzz | dlfjbofnbr | -1.458950 | 0.755219 | -0.498048 | phrxnjsbae | iieafcivri | -5.212578 | yxickhmgkp | kpqepphruh | -6.878257 | -1.712574 | -7.783903 | -3.623413 | lduutwjjin | 3.198987 | -7.290196 | eywzqkuzza | fjogwgrkig |
2 | -2.453819 | booaisyeuj | dlfjbofnbr | -0.124566 | 4.070167 | -2.271910 | lzsssmsaim | vhfoucvgil | -3.504522 | pdzajvgbzz | ynhwdgvtke | -0.838181 | 1.898630 | -6.632060 | -1.394765 | zghwqxiakd | -14.830121 | 10.490557 | irrdfszbwf | voumadgklp |
3 | -0.042513 | booaisyeuj | kkagxtgiko | -6.897858 | -0.065287 | -3.459478 | phrxnjsbae | yfmijifvmo | 0.742066 | wectjxhbio | kpqepphruh | -0.087694 | -1.808818 | 0.053985 | 0.494845 | lduutwjjin | -0.341344 | 4.539596 | eywzqkuzza | dzlpowvufa |
4 | -5.946806 | xmtwmxfxpz | dlfjbofnbr | 7.453730 | -3.450039 | 0.091773 | jljighbmio | vhfoucvgil | 1.830545 | hwadgiwzth | ynhwdgvtke | -0.218426 | 0.492733 | -2.954776 | -2.614179 | zghwqxiakd | -2.672298 | 6.436154 | kdzpmsglss | dzlpowvufa |
To prepare a transformer that handles different columns in different ways, you need to pass a list of your transformers to the transformers
parameter of the sklearn.compose.ColumnTransformer
constructor.
Each element of the transformers list should be a tuple of the form (<transformer name>, <transformer instance>, <columns that will use this transformer>).
So the following cell creates such an object, shows how it looks in the Jupyter output, and shows a possible result of this transformation for the data frame described above.
numeric_columns = list(data_frame.select_dtypes("number").columns)
categorical_columns = list(set(data_frame.columns) - set(numeric_columns))
my_transformer = ColumnTransformer(
transformers = [
("one_hot_encoder", OneHotEncoder(), categorical_columns),
("standart_scaler", StandardScaler(), numeric_columns)
]
)
display(HTML(header_template.format("Class display in jupyter")))
display(my_transformer)
display(HTML(header_template.format("Fit and transfrom result")))
display(
pd.DataFrame(
my_transformer.fit_transform(data_frame)
).head()
)
ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), ['var 1', 'var 18', 'var 2', 'var 15', 'var 10', 'var 9', 'var 7', 'var 19', 'var 6']), ('standart_scaler', StandardScaler(), ['var 0', 'var 3', 'var 4', 'var 5', 'var 8', 'var 11', 'var 12', 'var 13', 'var 14', 'var 16', 'var 17'])])
['var 1', 'var 18', 'var 2', 'var 15', 'var 10', 'var 9', 'var 7', 'var 19', 'var 6']
OneHotEncoder()
['var 0', 'var 3', 'var 4', 'var 5', 'var 8', 'var 11', 'var 12', 'var 13', 'var 14', 'var 16', 'var 17']
StandardScaler()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | -1.169878 | 0.128906 | 1.319012 | -0.962918 | -0.316245 | -1.378868 | -0.008420 | 1.003985 | -0.811118 | -0.795706 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.391908 | 0.160042 | -0.407646 | -1.467765 | -2.575318 | -0.360064 | -1.728754 | -1.025626 | 0.436845 | -1.531170 |
2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.030460 | 0.746601 | -1.263927 | -0.963117 | -0.311424 | 0.318060 | -1.437070 | -0.442118 | -1.800370 | 1.879037 |
3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -2.113468 | 0.014859 | -1.837191 | 0.291547 | -0.030132 | -0.378137 | 0.256049 | 0.052623 | -0.002472 | 0.737690 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 2.429194 | -0.584051 | -0.122927 | 0.613140 | -0.079132 | 0.054056 | -0.505865 | -0.761387 | -0.291717 | 1.101435 |
5 rows × 47 columns
No transformations#
If you need to build a transformer that doesn't change some of the columns in any way, there are two options:
Use the "passthrough"
literal instead of a transformer;
Use a so-called "dummy transformer",
FunctionTransformer(lambda x: x),
which simply returns its input.
The following example shows both options.
np.random.seed(10)
sample_size = 10
df = pd.DataFrame({
"col1" : np.random.uniform(5, 10, sample_size),
"col2" : np.random.normal(5, 10, sample_size)
})
display(HTML(header_template.format("Input frame")))
display(df)
col_transform = ColumnTransformer(
transformers = [
("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
("passthrough", "passthrough", ["col1", "col2"]),
("standart_scaler", StandardScaler(), ["col1", "col2"]),
]
)
col_transform.set_output(transform="pandas")
df = col_transform.fit_transform(df)
display(HTML(header_template.format("Transformation result")))
display(df)
col1 | col2 | |
---|---|---|
0 | 8.856603 | 7.655116 |
1 | 5.103760 | 6.085485 |
2 | 8.168241 | 5.042914 |
3 | 8.744019 | 3.253998 |
4 | 7.492535 | 9.330262 |
5 | 6.123983 | 17.030374 |
6 | 5.990314 | -4.650657 |
7 | 8.802654 | 15.282741 |
8 | 5.845554 | 7.286301 |
9 | 5.441699 | 9.451376 |
dummy__col1 | dummy__col2 | passthrough__col1 | passthrough__col2 | standart_scaler__col1 | standart_scaler__col2 | |
---|---|---|---|---|---|---|
0 | 8.856603 | 7.655116 | 8.856603 | 7.655116 | 1.258270 | 0.013588 |
1 | 5.103760 | 6.085485 | 5.103760 | 6.085485 | -1.365599 | -0.258714 |
2 | 8.168241 | 5.042914 | 8.168241 | 5.042914 | 0.776989 | -0.439580 |
3 | 8.744019 | 3.253998 | 8.744019 | 3.253998 | 1.179555 | -0.749924 |
4 | 7.492535 | 9.330262 | 7.492535 | 9.330262 | 0.304557 | 0.304194 |
5 | 6.123983 | 17.030374 | 6.123983 | 17.030374 | -0.652291 | 1.640020 |
6 | 5.990314 | -4.650657 | 5.990314 | -4.650657 | -0.745748 | -2.121234 |
7 | 8.802654 | 15.282741 | 8.802654 | 15.282741 | 1.220550 | 1.336838 |
8 | 5.845554 | 7.286301 | 5.845554 | 7.286301 | -0.846960 | -0.050395 |
9 | 5.441699 | 9.451376 | 5.441699 | 9.451376 | -1.129323 | 0.325205 |
Pandas MultiIndex#
Let's see how sklearn.compose.ColumnTransformer
works with a pandas.DataFrame
whose columns form a MultiIndex.
The following cell creates such a pandas.DataFrame.
sample_size = 1000
np.random.seed(10)
test_frame = pd.DataFrame({
("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()
numeric | categorial | |||
---|---|---|---|---|
var1 | var2 | var1 | var2 | |
0 | 9 | 2 | b | b |
1 | 4 | 0 | b | a |
2 | 0 | 2 | a | a |
3 | 1 | 0 | b | a |
4 | 9 | 7 | b | a |
Basic case#
Here we build a ColumnTransformer
which explicitly specifies, as its columns
argument, the columns of the corresponding sub-sample. The transformer itself is displayed in HTML so you can check exactly what was passed as the column names.
numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
("numeric transformer", StandardScaler(), numeric_columns),
("categorial transformer", OneHotEncoder(), categorial_columns)
])
display(HTML(header_template.format(
"Transformer representation in HTML"
)))
display(transformer)
display(HTML(header_template.format("Transforamtion result")))
display(transformer.fit_transform(test_frame).round(3))
display(HTML(header_template.format("Features names out")))
display(transformer.get_feature_names_out())
ColumnTransformer(transformers=[('numeric transformer', StandardScaler(), MultiIndex([('numeric', 'var1'), ('numeric', 'var2')], )), ('categorial transformer', OneHotEncoder(), MultiIndex([('categorial', 'var1'), ('categorial', 'var2')], ))])
MultiIndex([('numeric', 'var1'), ('numeric', 'var2')], )
StandardScaler()
MultiIndex([('categorial', 'var1'), ('categorial', 'var2')], )
OneHotEncoder()
array([[ 1.517, -0.853, 0. , 1. , 0. , 1. ],
[-0.193, -1.547, 0. , 1. , 1. , 0. ],
[-1.562, -0.853, 1. , 0. , 1. , 0. ],
...,
[-0.878, 0.188, 1. , 0. , 1. , 0. ],
[-0.193, -0.853, 1. , 0. , 1. , 0. ],
[ 0.491, 1.229, 0. , 1. , 0. , 1. ]])
array(['numeric transformer__x0', 'numeric transformer__x1',
'categorial transformer__x2_a', 'categorial transformer__x2_b',
'categorial transformer__x3_a', 'categorial transformer__x3_b'],
dtype=object)
It works fine. But the output feature names use x0, x1, x2, ...
instead of the input column names.
Passing upper levels#
It would be great if sklearn
could select the whole sub-frame under a given value of the upper index level.
try:
ColumnTransformer([
("numeric transformer", StandardScaler(), ["numeric"]),
("categorial transformer", OneHotEncoder(), ["categorial"])
]).fit_transform(
test_frame
)
except Exception as e:
print(e)
Selected columns, ['numeric'], are not unique in dataframe
It raises an error, so this isn't possible yet.
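A workaround consistent with the basic case above, where a sliced MultiIndex was accepted as the column specification, is to expand each upper-level value into its full column index yourself:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(10)
test_frame = pd.DataFrame({
    ("numeric", "var1"): np.random.randint(0, 10, size=100),
    ("numeric", "var2"): np.random.randint(0, 10, size=100),
    ("categorial", "var1"): np.random.choice(["a", "b"], size=100),
})

# Slice the MultiIndex by its upper level instead of passing ["numeric"]
select = lambda level: test_frame.columns[
    test_frame.columns.get_level_values(0) == level
]

result = ColumnTransformer([
    ("numeric transformer", StandardScaler(), select("numeric")),
    ("categorial transformer", OneHotEncoder(), select("categorial")),
]).fit_transform(test_frame)
print(result.shape)
# (100, 4)
```

The selected object is a MultiIndex of full `(level0, level1)` pairs, which is unambiguous for ColumnTransformer, unlike the bare upper-level label that caused the error above.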