Specific processing for columns#

Sometimes different columns need to be transformed in different ways. The most obvious example is the different processing of categorical and numerical columns:

  • For numeric columns, you need to apply normalisation techniques;

  • For categorical columns, you need to apply encoding (one-hot, mean encoding, etc.).

It’s easy to build such a transformation yourself, but it’s convenient that sklearn has an out-of-the-box solution that can easily be integrated into sklearn-style pipelines: sklearn.compose.ColumnTransformer.

Learn more here.
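As a point of reference, a hand-rolled version of the same idea might look like the following sketch. It uses plain pandas, and the toy column names are made up for illustration:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
# (column names are illustrative)
df = pd.DataFrame({
    "age": [25.0, 35.0, 45.0],
    "city": ["a", "b", "a"],
})

# Numeric part: standardise to zero mean / unit variance
numeric = (df[["age"]] - df[["age"]].mean()) / df[["age"]].std(ddof=0)

# Categorical part: one-hot encode
categorical = pd.get_dummies(df[["city"]])

manual_result = pd.concat([numeric, categorical], axis=1)
```

This works, but every new column type means more manual bookkeeping, which is exactly what ColumnTransformer takes over.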

import numpy as np
import pandas as pd

from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from IPython.display import HTML
header_template = "<text style='font-size:17px'>{}</text>"

Basic example#

So in the next cell a random data frame is generated, with some categorical and some numerical columns. Let’s show how to build a pipeline component that processes categorical columns in one way and numeric columns in another.

sample_size = 500
np.random.seed(10)

generate_word = lambda: "".join([
    chr(val) for val in 
    np.random.randint(ord("a"), ord("z") + 1, 10)
])
get_cat_var = lambda: np.random.choice(
    [
        generate_word() for i in 
        range(np.random.randint(2,7))
    ], 
    sample_size
)
get_num_var = lambda: np.random.normal(
    np.random.uniform(-1,1), 
    np.random.uniform(1,10),
    sample_size
)

variables_generator = [get_cat_var, get_num_var]

data_frame = pd.concat(
    {
        f"var {i}" : \
        pd.Series(np.random.choice(variables_generator)())
        for i in range(20)
    },
    axis = 1
)

data_frame.head()
var 0 var 1 var 2 var 3 var 4 var 5 var 6 var 7 var 8 var 9 var 10 var 11 var 12 var 13 var 14 var 15 var 16 var 17 var 18 var 19
0 6.352738 ghfmmekjzz ewvspmvrkg -3.916784 0.579251 3.078876 jljighbmio iieafcivri -3.503851 hwadgiwzth zderdinjyy -0.851043 -7.137998 -0.990391 4.128471 lduutwjjin -6.858011 -3.455499 kdzpmsglss fjogwgrkig
1 -1.562264 ghfmmekjzz dlfjbofnbr -1.458950 0.755219 -0.498048 phrxnjsbae iieafcivri -5.212578 yxickhmgkp kpqepphruh -6.878257 -1.712574 -7.783903 -3.623413 lduutwjjin 3.198987 -7.290196 eywzqkuzza fjogwgrkig
2 -2.453819 booaisyeuj dlfjbofnbr -0.124566 4.070167 -2.271910 lzsssmsaim vhfoucvgil -3.504522 pdzajvgbzz ynhwdgvtke -0.838181 1.898630 -6.632060 -1.394765 zghwqxiakd -14.830121 10.490557 irrdfszbwf voumadgklp
3 -0.042513 booaisyeuj kkagxtgiko -6.897858 -0.065287 -3.459478 phrxnjsbae yfmijifvmo 0.742066 wectjxhbio kpqepphruh -0.087694 -1.808818 0.053985 0.494845 lduutwjjin -0.341344 4.539596 eywzqkuzza dzlpowvufa
4 -5.946806 xmtwmxfxpz dlfjbofnbr 7.453730 -3.450039 0.091773 jljighbmio vhfoucvgil 1.830545 hwadgiwzth ynhwdgvtke -0.218426 0.492733 -2.954776 -2.614179 zghwqxiakd -2.672298 6.436154 kdzpmsglss dzlpowvufa

To prepare a transformer that handles different columns in different ways, you need to pass a list of your transformers to the transformers parameter of the sklearn.compose.ColumnTransformer constructor.

Each element of the transformers list should be of the form (<transformer name>, <transformer instance>, <columns this transformer is applied to>).

So in the following cell we have created such an object, showing how it looks in the Jupyter output and a possible result of this transformation for the data frame described above.

numeric_columns = list(data_frame.select_dtypes("number").columns)
categorical_columns = list(set(data_frame.columns) - set(numeric_columns))

my_transformer = ColumnTransformer(
    transformers = [
        ("one_hot_encoder", OneHotEncoder(), categorical_columns),
        ("standart_scaler", StandardScaler(), numeric_columns)
    ]
)


display(HTML(header_template.format("Class display in jupyter")))
display(my_transformer)
display(HTML(header_template.format("Fit and transform result")))
display(
    pd.DataFrame(
        my_transformer.fit_transform(data_frame)
    ).head()
)
Class display in jupyter
ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(),
                                 ['var 1', 'var 18', 'var 2', 'var 15',
                                  'var 10', 'var 9', 'var 7', 'var 19',
                                  'var 6']),
                                ('standart_scaler', StandardScaler(),
                                 ['var 0', 'var 3', 'var 4', 'var 5', 'var 8',
                                  'var 11', 'var 12', 'var 13', 'var 14',
                                  'var 16', 'var 17'])])
Fit and transform result
0 1 2 3 4 5 6 7 8 9 ... 37 38 39 40 41 42 43 44 45 46
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... -1.169878 0.128906 1.319012 -0.962918 -0.316245 -1.378868 -0.008420 1.003985 -0.811118 -0.795706
1 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... -0.391908 0.160042 -0.407646 -1.467765 -2.575318 -0.360064 -1.728754 -1.025626 0.436845 -1.531170
2 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.030460 0.746601 -1.263927 -0.963117 -0.311424 0.318060 -1.437070 -0.442118 -1.800370 1.879037
3 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... -2.113468 0.014859 -1.837191 0.291547 -0.030132 -0.378137 0.256049 0.052623 -0.002472 0.737690
4 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 ... 2.429194 -0.584051 -0.122927 0.613140 -0.079132 0.054056 -0.505865 -0.761387 -0.291717 1.101435

5 rows × 47 columns

No transformations#

If you need to build a transformer that leaves some of the columns unchanged, there are two options:

  • Use the literal string "passthrough" instead of a transformer;

  • Use a so-called “dummy transformer”, FunctionTransformer(lambda x: x), which simply returns its input.

The following example shows both options.

np.random.seed(10)
sample_size = 10

df = pd.DataFrame({
    "col1" : np.random.uniform(5, 10, sample_size),
    "col2" : np.random.normal(5, 10, sample_size)
})

display(HTML(header_template.format("Input frame")))
display(df)

col_transform = ColumnTransformer(
    transformers = [
        ("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
        ("passthrough", "passthrough", ["col1", "col2"]),
        ("standart_scaler", StandardScaler(), ["col1", "col2"]),
    ]
)
col_transform.set_output(transform="pandas")
df = col_transform.fit_transform(df)
display(HTML(header_template.format("Transformation result")))
display(df)
Input frame
col1 col2
0 8.856603 7.655116
1 5.103760 6.085485
2 8.168241 5.042914
3 8.744019 3.253998
4 7.492535 9.330262
5 6.123983 17.030374
6 5.990314 -4.650657
7 8.802654 15.282741
8 5.845554 7.286301
9 5.441699 9.451376
Transformation result
dummy__col1 dummy__col2 passthrough__col1 passthrough__col2 standart_scaler__col1 standart_scaler__col2
0 8.856603 7.655116 8.856603 7.655116 1.258270 0.013588
1 5.103760 6.085485 5.103760 6.085485 -1.365599 -0.258714
2 8.168241 5.042914 8.168241 5.042914 0.776989 -0.439580
3 8.744019 3.253998 8.744019 3.253998 1.179555 -0.749924
4 7.492535 9.330262 7.492535 9.330262 0.304557 0.304194
5 6.123983 17.030374 6.123983 17.030374 -0.652291 1.640020
6 5.990314 -4.650657 5.990314 -4.650657 -0.745748 -2.121234
7 8.802654 15.282741 8.802654 15.282741 1.220550 1.336838
8 5.845554 7.286301 5.845554 7.286301 -0.846960 -0.050395
9 5.441699 9.451376 5.441699 9.451376 -1.129323 0.325205

Pandas MultiIndex#

Let’s see how sklearn.compose.ColumnTransformer works with a pandas.DataFrame whose columns are a MultiIndex.

The following cell creates a pandas.DataFrame with MultiIndex columns.

sample_size = 1000
np.random.seed(10)

test_frame = pd.DataFrame({
    ("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
    ("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
    ("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
    ("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()
numeric categorial
var1 var2 var1 var2
0 9 2 b b
1 4 0 b a
2 0 2 a a
3 1 0 b a
4 9 7 b a

Basic case#

Here we build a ColumnTransformer that literally specifies as its columns the columns attribute of the corresponding subsample. The transformer itself is displayed in HTML so you can check exactly what was passed as the column names.

numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), numeric_columns),
    ("categorial transformer", OneHotEncoder(), categorial_columns)
])

display(HTML(header_template.format(
    "Transformer representation in HTML"
)))
display(transformer)
display(HTML(header_template.format("Transformation result")))
display(transformer.fit_transform(test_frame).round(3))
display(HTML(header_template.format("Features names out")))
display(transformer.get_feature_names_out())
Transformer representation in HTML
ColumnTransformer(transformers=[('numeric transformer', StandardScaler(),
                                 MultiIndex([('numeric', 'var1'),
            ('numeric', 'var2')],
           )),
                                ('categorial transformer', OneHotEncoder(),
                                 MultiIndex([('categorial', 'var1'),
            ('categorial', 'var2')],
           ))])
Transformation result
array([[ 1.517, -0.853,  0.   ,  1.   ,  0.   ,  1.   ],
       [-0.193, -1.547,  0.   ,  1.   ,  1.   ,  0.   ],
       [-1.562, -0.853,  1.   ,  0.   ,  1.   ,  0.   ],
       ...,
       [-0.878,  0.188,  1.   ,  0.   ,  1.   ,  0.   ],
       [-0.193, -0.853,  1.   ,  0.   ,  1.   ,  0.   ],
       [ 0.491,  1.229,  0.   ,  1.   ,  0.   ,  1.   ]])
Features names out
array(['numeric transformer__x0', 'numeric transformer__x1',
       'categorial transformer__x2_a', 'categorial transformer__x2_b',
       'categorial transformer__x3_a', 'categorial transformer__x3_b'],
      dtype=object)

It works fine, but the output feature names use x0, x1, x2, ... instead of the input column names.

Passing upper levels#

It would be great if sklearn could select the whole sub-frame under a given value of the top index level.

try:
    ColumnTransformer([
        ("numeric transformer", StandardScaler(), ["numeric"]),
        ("categorial transformer", OneHotEncoder(), ["categorial"])
    ]).fit_transform(
        test_frame
    )
except Exception as e:
    print(e)
Selected columns, ['numeric'], are not unique in dataframe

It raises an error, so this isn’t possible directly.
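As a possible workaround, the columns entry of a transformers tuple may also be a callable that receives the input frame at fit time. The following sketch selects the columns under a given top-level value this way; the top_level helper is made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    ("numeric", "var1"): [1.0, 2.0, 3.0],
    ("categorial", "var1"): ["a", "b", "a"],
})

def top_level(name):
    # Callable selectors are evaluated on the input frame at fit
    # time; return the sub-MultiIndex under the given top level
    return lambda frame: frame.columns[
        frame.columns.get_level_values(0) == name
    ]

transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), top_level("numeric")),
    ("categorial transformer", OneHotEncoder(), top_level("categorial")),
])
result = transformer.fit_transform(df)
```

The callable still resolves to explicit (top, sub) column tuples, so this sidesteps the "not unique" error rather than truly passing a sub-frame.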