Column transformer#
A column transformer is a typical component of a sklearn pipeline that applies different transformations to different columns of the input dataframe.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
FunctionTransformer,
StandardScaler,
OneHotEncoder
)
No transformations#
If you need a transformer that leaves some of the columns unchanged, there are two options:
- Use the "passthrough" string literal in place of a transformer.
- Use a so-called "dummy transformer", FunctionTransformer(lambda x: x), which simply returns its input.
The following cell generates a dataset with two columns. Both columns are later passed through a ColumnTransformer using each of these mechanisms.
np.random.seed(10)
sample_size = 10
df = pd.DataFrame({
"col1" : np.random.uniform(5, 10, sample_size),
"col2" : np.random.normal(5, 10, sample_size)
})
df
 | col1 | col2 |
---|---|---|
0 | 8.856603 | 7.655116 |
1 | 5.103760 | 6.085485 |
2 | 8.168241 | 5.042914 |
3 | 8.744019 | 3.253998 |
4 | 7.492535 | 9.330262 |
5 | 6.123983 | 17.030374 |
6 | 5.990314 | -4.650657 |
7 | 8.802654 | 15.282741 |
8 | 5.845554 | 7.286301 |
9 | 5.441699 | 9.451376 |
The following cell shows a column transformer that passes both columns through unchanged using both options. It also adds a standard scaler to demonstrate that the pass-through outputs are unaffected by the other transformers.
col_transform = ColumnTransformer(
transformers = [
("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
("passthrough", "passthrough", ["col1", "col2"]),
("standart_scaler", StandardScaler(), ["col1", "col2"]),
]
)
col_transform.set_output(transform="pandas")
col_transform.fit_transform(df)
 | dummy__col1 | dummy__col2 | passthrough__col1 | passthrough__col2 | standart_scaler__col1 | standart_scaler__col2 |
---|---|---|---|---|---|---|
0 | 8.856603 | 7.655116 | 8.856603 | 7.655116 | 1.258270 | 0.013588 |
1 | 5.103760 | 6.085485 | 5.103760 | 6.085485 | -1.365599 | -0.258714 |
2 | 8.168241 | 5.042914 | 8.168241 | 5.042914 | 0.776989 | -0.439580 |
3 | 8.744019 | 3.253998 | 8.744019 | 3.253998 | 1.179555 | -0.749924 |
4 | 7.492535 | 9.330262 | 7.492535 | 9.330262 | 0.304557 | 0.304194 |
5 | 6.123983 | 17.030374 | 6.123983 | 17.030374 | -0.652291 | 1.640020 |
6 | 5.990314 | -4.650657 | 5.990314 | -4.650657 | -0.745748 | -2.121234 |
7 | 8.802654 | 15.282741 | 8.802654 | 15.282741 | 1.220550 | 1.336838 |
8 | 5.845554 | 7.286301 | 5.845554 | 7.286301 | -0.846960 | -0.050395 |
9 | 5.441699 | 9.451376 | 5.441699 | 9.451376 | -1.129323 | 0.325205 |
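The equality of the pass-through outputs and the input can also be checked programmatically instead of by eye. The following is a minimal sketch under a few assumptions: the names check_df and check_transform are illustrative, the data is regenerated with a different random generator (so the values differ from the table above), and FunctionTransformer() is used without a function, which sklearn treats as the identity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Illustrative data; values differ from the table above.
rng = np.random.default_rng(10)
check_df = pd.DataFrame({
    "col1": rng.uniform(5, 10, 10),
    "col2": rng.normal(5, 10, 10),
})

check_transform = ColumnTransformer(
    transformers=[
        # FunctionTransformer() with no function acts as the identity.
        ("dummy", FunctionTransformer(), ["col1", "col2"]),
        ("passthrough", "passthrough", ["col1", "col2"]),
        ("scaler", StandardScaler(), ["col1", "col2"]),
    ]
)
# Without set_output, the result is a NumPy array with 6 columns.
out = np.asarray(check_transform.fit_transform(check_df))

raw = check_df.to_numpy()
dummy_matches = np.allclose(out[:, 0:2], raw)        # dummy block equals input
passthrough_matches = np.allclose(out[:, 2:4], raw)  # passthrough block equals input
scaled_means = out[:, 4:6].mean(axis=0)              # scaled block is centred

print(dummy_matches, passthrough_matches)
print(scaled_means.round(6))
```

Using FunctionTransformer() without a lambda has a practical side benefit: lambdas cannot be serialized with the standard pickle module, so a transformer built around one is harder to save.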
Pandas MultiIndex#
Let’s see how sklearn.compose.ColumnTransformer works with a pandas.DataFrame whose columns are a MultiIndex.
The following cell builds such a pandas.DataFrame.
sample_size = 1000
np.random.seed(10)
test_frame = pd.DataFrame({
("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()
 | (numeric, var1) | (numeric, var2) | (categorial, var1) | (categorial, var2) |
---|---|---|---|---|
0 | 9 | 2 | b | b |
1 | 4 | 0 | b | a |
2 | 0 | 2 | a | a |
3 | 1 | 0 | b | a |
4 | 9 | 7 | b | a |
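Before handing such a frame to sklearn, it helps to see what the column labels actually are. The sketch below uses a small hand-written frame (the name frame and its values are illustrative, not the test_frame above) to show that each column label is a tuple and that pandas itself can select everything under one top-level value.

```python
import pandas as pd

# Small illustrative frame with the same column layout as above.
frame = pd.DataFrame({
    ("numeric", "var1"): [9, 4, 0],
    ("numeric", "var2"): [2, 0, 2],
    ("categorial", "var1"): ["b", "b", "a"],
    ("categorial", "var2"): ["b", "a", "a"],
})

# Each column label is a (top level, sub level) tuple, not a string.
column_labels = list(frame.columns)

# Distinct values of the top level.
top_levels = list(frame.columns.get_level_values(0).unique())

# Selecting by a top-level value returns a sub-frame with plain string columns.
numeric_sub_frame = frame["numeric"]

print(column_labels)
print(top_levels)
print(list(numeric_sub_frame.columns))
```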
Basic case#
Here we build a ColumnTransformer whose column specifications are taken literally from the columns attribute of the corresponding sub-frame. The transformer itself is displayed in HTML so you can check exactly what was passed as the column names.
numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
("numeric transformer", StandardScaler(), numeric_columns),
("categorial transformer", OneHotEncoder(), categorial_columns)
])
display(transformer)
display(transformer.fit_transform(test_frame).round(3))
display(transformer.get_feature_names_out())
ColumnTransformer(transformers=[('numeric transformer', StandardScaler(), MultiIndex([('numeric', 'var1'), ('numeric', 'var2')], )), ('categorial transformer', OneHotEncoder(), MultiIndex([('categorial', 'var1'), ('categorial', 'var2')], ))])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[19], line 9
3 transformer = ColumnTransformer([
4 ("numeric transformer", StandardScaler(), numeric_columns),
5 ("categorial transformer", OneHotEncoder(), categorial_columns)
6 ])
8 display(transformer)
----> 9 display(transformer.fit_transform(test_frame).round(3))
10 display(transformer.get_feature_names_out())
File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/utils/_set_output.py:316, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
314 @wraps(f)
315 def wrapped(self, X, *args, **kwargs):
--> 316 data_to_wrap = f(self, X, *args, **kwargs)
317 if isinstance(data_to_wrap, tuple):
318 # only wrap the first output for cross decomposition
319 return_tuple = (
320 _wrap_data_with_container(method, data_to_wrap[0], X, self),
321 *data_to_wrap[1:],
322 )
File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/base.py:1365, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1358 estimator._validate_params()
1360 with config_context(
1361 skip_parameter_validation=(
1362 prefer_skip_nested_validation or global_skip_validation
1363 )
1364 ):
-> 1365 return fit_method(estimator, *args, **kwargs)
File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:989, in ColumnTransformer.fit_transform(self, X, y, **params)
986 n_samples = _num_samples(X)
988 self._validate_column_callables(X)
--> 989 self._validate_remainder(X)
991 if _routing_enabled():
992 routed_params = process_routing(self, "fit_transform", **params)
File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:554, in ColumnTransformer._validate_remainder(self, X)
552 remaining = sorted(set(range(self.n_features_in_)) - cols)
553 self._transformer_to_input_indices["remainder"] = remaining
--> 554 remainder_cols = self._get_remainder_cols(remaining)
555 self._remainder = ("remainder", self.remainder, remainder_cols)
File ~/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/compose/_column_transformer.py:571, in ColumnTransformer._get_remainder_cols(self, indices)
569 dtype = self._get_remainder_cols_dtype()
570 if dtype == "str":
--> 571 return list(self.feature_names_in_[indices])
572 if dtype == "bool":
573 return [i in indices for i in range(self.n_features_in_)]
AttributeError: 'ColumnTransformer' object has no attribute 'feature_names_in_'
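With the sklearn version used for the output above, fit_transform raises an AttributeError: the ColumnTransformer has no feature_names_in_ attribute, apparently because the tuple labels of a MultiIndex are not recorded as feature names. In sklearn versions where the call succeeds, the output feature names are x0, x1, x2, ... rather than the input names.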
Passing upper levels#
It would be convenient if sklearn could select the whole sub-frame under a given value of the upper column level.
try:
ColumnTransformer([
("numeric transformer", StandardScaler(), ["numeric"]),
("categorial transformer", OneHotEncoder(), ["categorial"])
]).fit_transform(
test_frame
)
except Exception as e:
print(e)
Selected columns, ['numeric'], are not unique in dataframe
It raises an error, so this isn’t possible yet.
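A similar effect can be approximated once the column labels are flattened to strings: sklearn.compose.make_column_selector accepts a regular expression, so a prefix pattern can stand in for the top-level value. This is a sketch under assumptions, not a built-in MultiIndex feature: the frame, the "__" separator and the names "num" and "cat" are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small illustrative frame with MultiIndex columns.
frame = pd.DataFrame({
    ("numeric", "var1"): [9, 4, 0, 1],
    ("numeric", "var2"): [2, 0, 2, 0],
    ("categorial", "var1"): ["b", "b", "a", "b"],
    ("categorial", "var2"): ["b", "a", "a", "a"],
})

# Flatten the labels so that the top level becomes a string prefix.
flat_frame = frame.copy()
flat_frame.columns = ["__".join(label) for label in frame.columns]

# Each selector picks every column whose name starts with the given prefix.
numeric_selector = make_column_selector(pattern="^numeric__")
categorial_selector = make_column_selector(pattern="^categorial__")

group_transformer = ColumnTransformer([
    ("num", StandardScaler(), numeric_selector),
    ("cat", OneHotEncoder(), categorial_selector),
])
group_output = group_transformer.fit_transform(flat_frame)

# Calling a selector on the frame shows which columns it resolves to.
numeric_selected = numeric_selector(flat_frame)
categorial_selected = categorial_selector(flat_frame)

print(numeric_selected)
print(categorial_selected)
print(group_output.shape)
```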