Columns transformer

Columns transformer#

A column transformer is a typical component of a sklearn pipeline that applies different transformations to different columns of the input dataframe.

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer,
    StandardScaler,
    OneHotEncoder
)

No transformations#

If you need to build a transformer that doesn’t change some of the columns in any way, there are two options:

Use the "passthrough" literal instead of a transformer;
Use a so-called "dummy transformer", FunctionTransformer(lambda x: x), which just returns its input.

The following cell generates a dataset with two columns. Both columns are later processed by a ColumnTransformer that applies each of the two mechanisms to them.

np.random.seed(10)
sample_size = 10

df = pd.DataFrame({
    "col1" : np.random.uniform(5, 10, sample_size),
    "col2" : np.random.normal(5, 10, sample_size)
})

df

	col1	col2
0	8.856603	7.655116
1	5.103760	6.085485
2	8.168241	5.042914
3	8.744019	3.253998
4	7.492535	9.330262
5	6.123983	17.030374
6	5.990314	-4.650657
7	8.802654	15.282741
8	5.845554	7.286301
9	5.441699	9.451376

The following cell builds a column transformer that passes both columns through unchanged using each of the two options. It also applies a standard scaler to the same columns to show that the unchanged outputs are not affected by other transformers.

col_transform = ColumnTransformer(
    transformers = [
        ("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
        ("passthrough", "passthrough", ["col1", "col2"]),
        ("standard_scaler", StandardScaler(), ["col1", "col2"]),
    ]
)
col_transform.set_output(transform="pandas")
col_transform.fit_transform(df)

	dummy__col1	dummy__col2	passthrough__col1	passthrough__col2	standard_scaler__col1	standard_scaler__col2
0	8.856603	7.655116	8.856603	7.655116	1.258270	0.013588
1	5.103760	6.085485	5.103760	6.085485	-1.365599	-0.258714
2	8.168241	5.042914	8.168241	5.042914	0.776989	-0.439580
3	8.744019	3.253998	8.744019	3.253998	1.179555	-0.749924
4	7.492535	9.330262	7.492535	9.330262	0.304557	0.304194
5	6.123983	17.030374	6.123983	17.030374	-0.652291	1.640020
6	5.990314	-4.650657	5.990314	-4.650657	-0.745748	-2.121234
7	8.802654	15.282741	8.802654	15.282741	1.220550	1.336838
8	5.845554	7.286301	5.845554	7.286301	-0.846960	-0.050395
9	5.441699	9.451376	5.441699	9.451376	-1.129323	0.325205

Pandas MultiIndex#

Let’s see how sklearn.compose.ColumnTransformer works with a pandas.DataFrame whose columns are a MultiIndex.

The following cell creates a pandas.DataFrame with MultiIndex columns.

sample_size = 1000
np.random.seed(10)

test_frame = pd.DataFrame({
    ("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
    ("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
    ("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
    ("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()

	numeric		categorial
	var1	var2	var1	var2
0	9	2	b	b
1	4	0	b	a
2	0	2	a	a
3	1	0	b	a
4	9	7	b	a

Basic case#

Here we build a ColumnTransformer whose column specifications are the MultiIndex labels of the corresponding sub-frame, passed as-is. The transformer itself is displayed so you can check exactly what was passed as the column names.

numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), numeric_columns),
    ("categorial transformer", OneHotEncoder(), categorial_columns)
])

display(transformer)
display(transformer.fit_transform(test_frame).round(3))
display(transformer.get_feature_names_out())

ColumnTransformer(transformers=[('numeric transformer', StandardScaler(),
                                 MultiIndex([('numeric', 'var1'),
            ('numeric', 'var2')],
           )),
                                ('categorial transformer', OneHotEncoder(),
                                 MultiIndex([('categorial', 'var1'),
            ('categorial', 'var2')],
           ))])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[19], line 9
      3 transformer = ColumnTransformer([
      4     ("numeric transformer", StandardScaler(), numeric_columns),
      5     ("categorial transformer", OneHotEncoder(), categorial_columns)
      6 ])
      8 display(transformer)
----> 9 display(transformer.fit_transform(test_frame).round(3))
     10 display(transformer.get_feature_names_out())

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/utils/_set_output.py:316, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    314 @wraps(f)
    315 def wrapped(self, X, *args, **kwargs):
--> 316     data_to_wrap = f(self, X, *args, **kwargs)
    317     if isinstance(data_to_wrap, tuple):
    318         # only wrap the first output for cross decomposition
    319         return_tuple = (
    320             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    321             *data_to_wrap[1:],
    322         )

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/base.py:1365, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1358     estimator._validate_params()
   1360 with config_context(
   1361     skip_parameter_validation=(
   1362         prefer_skip_nested_validation or global_skip_validation
   1363     )
   1364 ):
-> 1365     return fit_method(estimator, *args, **kwargs)

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:989, in ColumnTransformer.fit_transform(self, X, y, **params)
    986 n_samples = _num_samples(X)
    988 self._validate_column_callables(X)
--> 989 self._validate_remainder(X)
    991 if _routing_enabled():
    992     routed_params = process_routing(self, "fit_transform", **params)

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:554, in ColumnTransformer._validate_remainder(self, X)
    552 remaining = sorted(set(range(self.n_features_in_)) - cols)
    553 self._transformer_to_input_indices["remainder"] = remaining
--> 554 remainder_cols = self._get_remainder_cols(remaining)
    555 self._remainder = ("remainder", self.remainder, remainder_cols)

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:571, in ColumnTransformer._get_remainder_cols(self, indices)
    569 dtype = self._get_remainder_cols_dtype()
    570 if dtype == "str":
--> 571     return list(self.feature_names_in_[indices])
    572 if dtype == "bool":
    573     return [i in indices for i in range(self.n_features_in_)]

AttributeError: 'ColumnTransformer' object has no attribute 'feature_names_in_'

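With the scikit-learn version used for the run above, fit_transform fails with an AttributeError: the transformer tries to read feature_names_in_, which scikit-learn sets only when all column names are strings, and MultiIndex column labels are tuples. In earlier versions the same cell ran without error, but the output feature names were x0, x1, x2, ... instead of the input names.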

Passing upper levels#

It would be convenient if sklearn could select the whole sub-frame under a given top-level label of the column MultiIndex, just by naming that label.

try:
    ColumnTransformer([
        ("numeric transformer", StandardScaler(), ["numeric"]),
        ("categorial transformer", OneHotEncoder(), ["categorial"])
    ]).fit_transform(
        test_frame
    )
except Exception as e:
    print(e)

Selected columns, ['numeric'], are not unique in dataframe

This raises an error, so selecting columns by top-level label alone isn’t supported yet.
