Specific processing for columns#

Sometimes different columns need to be transformed in different ways. The most obvious example is the different processing of categorical and numerical columns:

  • For numeric columns, you need to apply normalisation techniques;

  • For categorical columns, you need to apply encoding (one-hot, mean encoding, etc.).

It’s easy to build such a transformation yourself, but it’s convenient that sklearn has an out-of-the-box solution that can easily be integrated into sklearn-style pipelines: sklearn.compose.ColumnTransformer.

Learn more here.
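As a point of reference, a hand-rolled version of the same idea might look like the following sketch. It uses plain pandas, and the toy column names are made up for illustration:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
# (column names are illustrative)
df = pd.DataFrame({
    "age": [25.0, 35.0, 45.0],
    "city": ["a", "b", "a"],
})

# Numeric part: standardise to zero mean / unit variance
numeric = (df[["age"]] - df[["age"]].mean()) / df[["age"]].std(ddof=0)

# Categorical part: one-hot encode
categorical = pd.get_dummies(df[["city"]])

manual_result = pd.concat([numeric, categorical], axis=1)
```

This works, but every new column type means more manual bookkeeping, which is exactly what ColumnTransformer takes over.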

import numpy as np
import pandas as pd

from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from IPython.display import HTML
header_template = "<text style='font-size:17px'>{}</text>"

Basic example#

So in the next cell a random data frame is generated, with some categorical and some numerical columns. Let’s show how to build a pipeline component that processes categorical columns in one way and numeric columns in another.

sample_size = 500
np.random.seed(10)

generate_word = lambda: "".join([
    chr(val) for val in 
    np.random.randint(ord("a"), ord("z") + 1, 10)
])
get_cat_var = lambda: np.random.choice(
    [
        generate_word() for i in 
        range(np.random.randint(2,7))
    ], 
    sample_size
)
get_num_var = lambda: np.random.normal(
    np.random.uniform(-1,1), 
    np.random.uniform(1,10),
    sample_size
)

variables_generator = [get_cat_var, get_num_var]

data_frame = pd.concat(
    {
        f"var {i}" : \
        pd.Series(np.random.choice(variables_generator)())
        for i in range(20)
    },
    axis = 1
)

data_frame.head()
var 0 var 1 var 2 var 3 var 4 var 5 var 6 var 7 var 8 var 9 var 10 var 11 var 12 var 13 var 14 var 15 var 16 var 17 var 18 var 19
0 6.352738 ghfmmekjzz ewvspmvrkg -3.916784 0.579251 3.078876 jljighbmio iieafcivri -3.503851 hwadgiwzth zderdinjyy -0.851043 -7.137998 -0.990391 4.128471 lduutwjjin -6.858011 -3.455499 kdzpmsglss fjogwgrkig
1 -1.562264 ghfmmekjzz dlfjbofnbr -1.458950 0.755219 -0.498048 phrxnjsbae iieafcivri -5.212578 yxickhmgkp kpqepphruh -6.878257 -1.712574 -7.783903 -3.623413 lduutwjjin 3.198987 -7.290196 eywzqkuzza fjogwgrkig
2 -2.453819 booaisyeuj dlfjbofnbr -0.124566 4.070167 -2.271910 lzsssmsaim vhfoucvgil -3.504522 pdzajvgbzz ynhwdgvtke -0.838181 1.898630 -6.632060 -1.394765 zghwqxiakd -14.830121 10.490557 irrdfszbwf voumadgklp
3 -0.042513 booaisyeuj kkagxtgiko -6.897858 -0.065287 -3.459478 phrxnjsbae yfmijifvmo 0.742066 wectjxhbio kpqepphruh -0.087694 -1.808818 0.053985 0.494845 lduutwjjin -0.341344 4.539596 eywzqkuzza dzlpowvufa
4 -5.946806 xmtwmxfxpz dlfjbofnbr 7.453730 -3.450039 0.091773 jljighbmio vhfoucvgil 1.830545 hwadgiwzth ynhwdgvtke -0.218426 0.492733 -2.954776 -2.614179 zghwqxiakd -2.672298 6.436154 kdzpmsglss dzlpowvufa

To prepare a transformer that handles different columns in different ways, you need to pass a list of your transformers to the transformers parameter of the sklearn.compose.ColumnTransformer constructor.

Each element of the transformers list should be of the form (<transformer name>, <transformer instance>, <columns this transformer is applied to>).

So in the following cell we have created such an object, showing how it looks in the Jupyter output and a possible result of this transformation for the data frame described above.

numeric_columns = list(data_frame.select_dtypes("number").columns)
categorical_columns = list(set(data_frame.columns) - set(numeric_columns))

my_transformer = ColumnTransformer(
    transformers = [
        ("one_hot_encoder", OneHotEncoder(), categorical_columns),
        ("standart_scaler", StandardScaler(), numeric_columns)
    ]
)


display(HTML(header_template.format("Class display in jupyter")))
display(my_transformer)
display(HTML(header_template.format("Fit and transform result")))
display(
    pd.DataFrame(
        my_transformer.fit_transform(data_frame)
    ).head()
)
Class display in jupyter
ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(),
                                 ['var 1', 'var 18', 'var 2', 'var 15',
                                  'var 10', 'var 9', 'var 7', 'var 19',
                                  'var 6']),
                                ('standart_scaler', StandardScaler(),
                                 ['var 0', 'var 3', 'var 4', 'var 5', 'var 8',
                                  'var 11', 'var 12', 'var 13', 'var 14',
                                  'var 16', 'var 17'])])
Fit and transform result
0 1 2 3 4 5 6 7 8 9 ... 37 38 39 40 41 42 43 44 45 46
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... -1.169878 0.128906 1.319012 -0.962918 -0.316245 -1.378868 -0.008420 1.003985 -0.811118 -0.795706
1 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... -0.391908 0.160042 -0.407646 -1.467765 -2.575318 -0.360064 -1.728754 -1.025626 0.436845 -1.531170
2 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.030460 0.746601 -1.263927 -0.963117 -0.311424 0.318060 -1.437070 -0.442118 -1.800370 1.879037
3 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... -2.113468 0.014859 -1.837191 0.291547 -0.030132 -0.378137 0.256049 0.052623 -0.002472 0.737690
4 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 ... 2.429194 -0.584051 -0.122927 0.613140 -0.079132 0.054056 -0.505865 -0.761387 -0.291717 1.101435

5 rows × 47 columns

No transformations#

If you need to build a transformer that leaves some of the columns unchanged, there are two options:

  • Use the literal string "passthrough" instead of a transformer;

  • Use a so-called “dummy transformer”, FunctionTransformer(lambda x: x), which simply returns its input.

The following example shows both options.

np.random.seed(10)
sample_size = 10

df = pd.DataFrame({
    "col1" : np.random.uniform(5, 10, sample_size),
    "col2" : np.random.normal(5, 10, sample_size)
})

display(HTML(header_template.format("Input frame")))
display(df)

col_transform = ColumnTransformer(
    transformers = [
        ("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
        ("passthrough", "passthrough", ["col1", "col2"]),
        ("standart_scaler", StandardScaler(), ["col1", "col2"]),
    ]
)
col_transform.set_output(transform="pandas")
df = col_transform.fit_transform(df)
display(HTML(header_template.format("Transformation result")))
display(df)
Input frame
col1 col2
0 8.856603 7.655116
1 5.103760 6.085485
2 8.168241 5.042914
3 8.744019 3.253998
4 7.492535 9.330262
5 6.123983 17.030374
6 5.990314 -4.650657
7 8.802654 15.282741
8 5.845554 7.286301
9 5.441699 9.451376
Transformation result
dummy__col1 dummy__col2 passthrough__col1 passthrough__col2 standart_scaler__col1 standart_scaler__col2
0 8.856603 7.655116 8.856603 7.655116 1.258270 0.013588
1 5.103760 6.085485 5.103760 6.085485 -1.365599 -0.258714
2 8.168241 5.042914 8.168241 5.042914 0.776989 -0.439580
3 8.744019 3.253998 8.744019 3.253998 1.179555 -0.749924
4 7.492535 9.330262 7.492535 9.330262 0.304557 0.304194
5 6.123983 17.030374 6.123983 17.030374 -0.652291 1.640020
6 5.990314 -4.650657 5.990314 -4.650657 -0.745748 -2.121234
7 8.802654 15.282741 8.802654 15.282741 1.220550 1.336838
8 5.845554 7.286301 5.845554 7.286301 -0.846960 -0.050395
9 5.441699 9.451376 5.441699 9.451376 -1.129323 0.325205

Pandas MultiIndex#

Let’s see how sklearn.compose.ColumnTransformer works with a pandas.DataFrame whose columns are a MultiIndex.

The following cell creates a pandas.DataFrame with MultiIndex columns.

sample_size = 1000
np.random.seed(10)

test_frame = pd.DataFrame({
    ("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
    ("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
    ("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
    ("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()
numeric categorial
var1 var2 var1 var2
0 9 2 b b
1 4 0 b a
2 0 2 a a
3 1 0 b a
4 9 7 b a

Basic case#

Here we build a ColumnTransformer that literally specifies as its columns the columns attribute of the corresponding subsample. The transformer itself is displayed in HTML so you can check exactly what was passed as the column names.

numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), numeric_columns),
    ("categorial transformer", OneHotEncoder(), categorial_columns)
])

display(HTML(header_template.format(
    "Transformer representation in HTML"
)))
display(transformer)
display(HTML(header_template.format("Transformation result")))
display(transformer.fit_transform(test_frame).round(3))
display(HTML(header_template.format("Features names out")))
display(transformer.get_feature_names_out())
Transformer representation in HTML
ColumnTransformer(transformers=[('numeric transformer', StandardScaler(),
                                 MultiIndex([('numeric', 'var1'),
            ('numeric', 'var2')],
           )),
                                ('categorial transformer', OneHotEncoder(),
                                 MultiIndex([('categorial', 'var1'),
            ('categorial', 'var2')],
           ))])
Transformation result
array([[ 1.517, -0.853,  0.   ,  1.   ,  0.   ,  1.   ],
       [-0.193, -1.547,  0.   ,  1.   ,  1.   ,  0.   ],
       [-1.562, -0.853,  1.   ,  0.   ,  1.   ,  0.   ],
       ...,
       [-0.878,  0.188,  1.   ,  0.   ,  1.   ,  0.   ],
       [-0.193, -0.853,  1.   ,  0.   ,  1.   ,  0.   ],
       [ 0.491,  1.229,  0.   ,  1.   ,  0.   ,  1.   ]])
Features names out
array(['numeric transformer__x0', 'numeric transformer__x1',
       'categorial transformer__x2_a', 'categorial transformer__x2_b',
       'categorial transformer__x3_a', 'categorial transformer__x3_b'],
      dtype=object)

It works fine, but the output feature names use x0, x1, x2, ... instead of the input column names.

Passing upper levels#

It would be great if sklearn could select the whole sub-frame under a given value of the top index level.

try:
    ColumnTransformer([
        ("numeric transformer", StandardScaler(), ["numeric"]),
        ("categorial transformer", OneHotEncoder(), ["categorial"])
    ]).fit_transform(
        test_frame
    )
except Exception as e:
    print(e)
Selected columns, ['numeric'], are not unique in dataframe

It raises an error, so this isn’t possible directly.
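As a possible workaround, the columns entry of a transformers tuple may also be a callable that receives the input frame at fit time. The following sketch selects the columns under a given top-level value this way; the top_level helper is made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    ("numeric", "var1"): [1.0, 2.0, 3.0],
    ("categorial", "var1"): ["a", "b", "a"],
})

def top_level(name):
    # Callable selectors are evaluated on the input frame at fit
    # time; return the sub-MultiIndex under the given top level
    return lambda frame: frame.columns[
        frame.columns.get_level_values(0) == name
    ]

transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), top_level("numeric")),
    ("categorial transformer", OneHotEncoder(), top_level("categorial")),
])
result = transformer.fit_transform(df)
```

The callable still resolves to explicit (top, sub) column tuples, so this sidesteps the "not unique" error rather than truly passing a sub-frame.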