Column transformer#

A column transformer is a typical component of a sklearn pipeline that applies different transformations to different columns of the input dataframe.

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer,
    StandardScaler,
    OneHotEncoder
)
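
Before the examples, a note on where this component usually sits: a column transformer is typically the first step of a Pipeline, followed by an estimator. The cell below is a minimal sketch of that pattern; the column names ("age", "income", "city") and the LogisticRegression estimator are illustrative assumptions, not part of the examples that follow.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical preprocessing: scale numeric columns, one-hot encode a categorical one
preprocessing = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(), ["city"]),
])

# the column transformer feeds its output to the downstream estimator
model = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", LogisticRegression()),
])
model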

No transformations#

If you need to build a transformer that doesn’t change some of the columns in any way, there are two options:

  • Use the "passthrough" string literal in place of a transformer;

  • Use a “dummy” transformer such as FunctionTransformer(lambda x: x), which simply returns its input.


The following cell generates a dataset with two columns that will be processed by a ColumnTransformer using both of these mechanisms.

np.random.seed(10)
sample_size = 10

df = pd.DataFrame({
    "col1" : np.random.uniform(5, 10, sample_size),
    "col2" : np.random.normal(5, 10, sample_size)
})

df
col1 col2
0 8.856603 7.655116
1 5.103760 6.085485
2 8.168241 5.042914
3 8.744019 3.253998
4 7.492535 9.330262
5 6.123983 17.030374
6 5.990314 -4.650657
7 8.802654 15.282741
8 5.845554 7.286301
9 5.441699 9.451376

The following cell shows a column transformer that passes both columns through unchanged, once with each of the two options. It also adds a standard scaler to show that the other outputs remain unaffected.

col_transform = ColumnTransformer(
    transformers = [
        ("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
        ("passthrough", "passthrough", ["col1", "col2"]),
        ("standard_scaler", StandardScaler(), ["col1", "col2"]),
    ]
)
col_transform.set_output(transform="pandas")
col_transform.fit_transform(df)
dummy__col1 dummy__col2 passthrough__col1 passthrough__col2 standard_scaler__col1 standard_scaler__col2
0 8.856603 7.655116 8.856603 7.655116 1.258270 0.013588
1 5.103760 6.085485 5.103760 6.085485 -1.365599 -0.258714
2 8.168241 5.042914 8.168241 5.042914 0.776989 -0.439580
3 8.744019 3.253998 8.744019 3.253998 1.179555 -0.749924
4 7.492535 9.330262 7.492535 9.330262 0.304557 0.304194
5 6.123983 17.030374 6.123983 17.030374 -0.652291 1.640020
6 5.990314 -4.650657 5.990314 -4.650657 -0.745748 -2.121234
7 8.802654 15.282741 8.802654 15.282741 1.220550 1.336838
8 5.845554 7.286301 5.845554 7.286301 -0.846960 -0.050395
9 5.441699 9.451376 5.441699 9.451376 -1.129323 0.325205
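
A practical aside (not part of the original notebook): a FunctionTransformer built from a lambda usually cannot be serialized with the standard pickle module, which is one reason to prefer the "passthrough" literal or a plain FunctionTransformer() (whose default func=None is also the identity). A small sketch, assuming pickling from an interactive session:

import pickle

# a lambda cannot be pickled, so a lambda-based dummy transformer fails to serialize
try:
    pickle.dumps(FunctionTransformer(lambda x: x))
except Exception as e:
    print("pickling failed:", e)

# FunctionTransformer() with the default func=None is also an identity and pickles fine
print(len(pickle.dumps(FunctionTransformer())), "bytes")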

Pandas MultiIndex#

Let’s see how sklearn.compose.ColumnTransformer works with a pandas.DataFrame that uses a MultiIndex for its columns.

The following cell creates a pandas.DataFrame with a MultiIndex in its columns.

sample_size = 1000
np.random.seed(10)

test_frame = pd.DataFrame({
    ("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
    ("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
    ("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
    ("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()
numeric categorial
var1 var2 var1 var2
0 9 2 b b
1 4 0 b a
2 0 2 a a
3 1 0 b a
4 9 7 b a

Basic case#

Here we build a ColumnTransformer that literally specifies, as its columns argument, the columns of the corresponding sub-frame (i.e. the MultiIndex entries). The transformer itself is displayed so you can check what exactly was passed as the column names.

numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), numeric_columns),
    ("categorial transformer", OneHotEncoder(), categorial_columns)
])

display(transformer)
display(transformer.fit_transform(test_frame).round(3))
display(transformer.get_feature_names_out())
ColumnTransformer(transformers=[('numeric transformer', StandardScaler(),
                                 MultiIndex([('numeric', 'var1'),
                                             ('numeric', 'var2')],
                                            )),
                                ('categorial transformer', OneHotEncoder(),
                                 MultiIndex([('categorial', 'var1'),
                                             ('categorial', 'var2')],
                                            ))])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[19], line 9
      3 transformer = ColumnTransformer([
      4     ("numeric transformer", StandardScaler(), numeric_columns),
      5     ("categorial transformer", OneHotEncoder(), categorial_columns)
      6 ])
      8 display(transformer)
----> 9 display(transformer.fit_transform(test_frame).round(3))
     10 display(transformer.get_feature_names_out())

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/utils/_set_output.py:316, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    314 @wraps(f)
    315 def wrapped(self, X, *args, **kwargs):
--> 316     data_to_wrap = f(self, X, *args, **kwargs)
    317     if isinstance(data_to_wrap, tuple):
    318         # only wrap the first output for cross decomposition
    319         return_tuple = (
    320             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    321             *data_to_wrap[1:],
    322         )

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/base.py:1365, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1358     estimator._validate_params()
   1360 with config_context(
   1361     skip_parameter_validation=(
   1362         prefer_skip_nested_validation or global_skip_validation
   1363     )
   1364 ):
-> 1365     return fit_method(estimator, *args, **kwargs)

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:989, in ColumnTransformer.fit_transform(self, X, y, **params)
    986 n_samples = _num_samples(X)
    988 self._validate_column_callables(X)
--> 989 self._validate_remainder(X)
    991 if _routing_enabled():
    992     routed_params = process_routing(self, "fit_transform", **params)

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:554, in ColumnTransformer._validate_remainder(self, X)
    552 remaining = sorted(set(range(self.n_features_in_)) - cols)
    553 self._transformer_to_input_indices["remainder"] = remaining
--> 554 remainder_cols = self._get_remainder_cols(remaining)
    555 self._remainder = ("remainder", self.remainder, remainder_cols)

File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:571, in ColumnTransformer._get_remainder_cols(self, indices)
    569 dtype = self._get_remainder_cols_dtype()
    570 if dtype == "str":
--> 571     return list(self.feature_names_in_[indices])
    572 if dtype == "bool":
    573     return [i in indices for i in range(self.n_features_in_)]

AttributeError: 'ColumnTransformer' object has no attribute 'feature_names_in_'

In the sklearn version used here this fails: because the MultiIndex column labels are tuples rather than strings, the fitted transformer never gets a feature_names_in_ attribute, and validating the remainder columns raises the AttributeError shown above. In versions where this call succeeded, the output feature names fell back to x0, x1, x2, ... instead of the input names.
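
A possible workaround (a sketch, not something from the original notebook) is to flatten the MultiIndex into plain string names before building the transformer, so sklearn sees ordinary string columns and can set feature_names_in_:

# flatten the two index levels into single string names (hypothetical helper frame)
flat_frame = test_frame.copy()
flat_frame.columns = ["_".join(col) for col in flat_frame.columns]

flat_transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), ["numeric_var1", "numeric_var2"]),
    ("categorial transformer", OneHotEncoder(), ["categorial_var1", "categorial_var2"])
])
flat_transformer.fit_transform(flat_frame)
flat_transformer.get_feature_names_out()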

Passing upper levels#

It would be convenient if sklearn could take the whole sub-frame under a given value of the upper column level, by passing just that upper-level value as the column specification.

try:
    ColumnTransformer([
        ("numeric transformer", StandardScaler(), ["numeric"]),
        ("categorial transformer", OneHotEncoder(), ["categorial"])
    ]).fit_transform(
        test_frame
    )
except Exception as e:
    print(e)
Selected columns, ['numeric'], are not unique in dataframe

It raises an error, so this isn’t possible yet.
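
With the flattened column names sketched above, something close to “take everything under one upper level” can still be expressed, for example with make_column_selector and a name prefix (again an assumption, not behaviour shown in the original notebook):

from sklearn.compose import make_column_selector

# select columns via the prefix that encodes the former upper index level
ColumnTransformer([
    ("numeric transformer", StandardScaler(),
     make_column_selector(pattern="^numeric_")),
    ("categorial transformer", OneHotEncoder(),
     make_column_selector(pattern="^categorial_"))
]).fit_transform(flat_frame)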