# Column transformer

A column transformer is a typical component of a sklearn pipeline that applies different transformations to different columns of the input dataframe.

In [None]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer,
    StandardScaler,
    OneHotEncoder
)

## No transformations

If you need to build a transformer that doesn't change some of the columns in any way, there are two options:

- Pass the string literal `"passthrough"` in place of a transformer;
- Use an identity ("dummy") transformer, `FunctionTransformer(lambda x: x)`, which simply returns its input.
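
As a minimal sketch of the two options, the cell below applies each of them to one column of a small, hypothetical frame (`demo` is not part of the original dataset). It assumes that `FunctionTransformer()` with no `func` argument behaves as an identity transformer, which also avoids the lambda (a lambda cannot be serialized by the standard `pickle` module).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy frame, used only for this illustration.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

identity_ct = ColumnTransformer([
    # FunctionTransformer() with func=None acts as an identity transformer.
    ("identity", FunctionTransformer(), ["a"]),
    # The string literal tells ColumnTransformer to forward the column as-is.
    ("passthrough", "passthrough", ["b"]),
])

# With the default output configuration the result is a NumPy array.
out = identity_ct.fit_transform(demo)

# Compare the transformed values with the original ones.
unchanged = np.array_equal(out, demo.to_numpy())
print(unchanged)
```

Either option should leave the values untouched; the choice is mostly a matter of readability.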

---

The following cell generates a dataset with two columns. Both mechanisms are then applied to both columns by a single `ColumnTransformer`.

In [None]:
np.random.seed(10)
sample_size = 10

df = pd.DataFrame({
    "col1" : np.random.uniform(5, 10, sample_size),
    "col2" : np.random.normal(5, 10, sample_size)
})

df

Unnamed: 0,col1,col2
0,8.856603,7.655116
1,5.10376,6.085485
2,8.168241,5.042914
3,8.744019,3.253998
4,7.492535,9.330262
5,6.123983,17.030374
6,5.990314,-4.650657
7,8.802654,15.282741
8,5.845554,7.286301
9,5.441699,9.451376


The following cell builds a column transformer that passes both columns through unchanged using each of the two options. It also applies a standard scaler to the same columns to show that the unchanged outputs are not affected by it.

In [None]:
col_transform = ColumnTransformer(
    transformers = [
        ("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
        ("passthrough", "passthrough", ["col1", "col2"]),
        ("standard_scaler", StandardScaler(), ["col1", "col2"]),
    ]
)
col_transform.set_output(transform="pandas")
col_transform.fit_transform(df)

Unnamed: 0,dummy__col1,dummy__col2,passthrough__col1,passthrough__col2,standard_scaler__col1,standard_scaler__col2
0,8.856603,7.655116,8.856603,7.655116,1.25827,0.013588
1,5.10376,6.085485,5.10376,6.085485,-1.365599,-0.258714
2,8.168241,5.042914,8.168241,5.042914,0.776989,-0.43958
3,8.744019,3.253998,8.744019,3.253998,1.179555,-0.749924
4,7.492535,9.330262,7.492535,9.330262,0.304557,0.304194
5,6.123983,17.030374,6.123983,17.030374,-0.652291,1.64002
6,5.990314,-4.650657,5.990314,-4.650657,-0.745748,-2.121234
7,8.802654,15.282741,8.802654,15.282741,1.22055,1.336838
8,5.845554,7.286301,5.845554,7.286301,-0.84696,-0.050395
9,5.441699,9.451376,5.441699,9.451376,-1.129323,0.325205


## Pandas MultiIndex

Let's see how `sklearn.compose.ColumnTransformer` handles a `pandas.DataFrame` whose columns are a `MultiIndex`.

The following cell creates such a `pandas.DataFrame`.

In [None]:
sample_size = 1000
np.random.seed(10)

test_frame = pd.DataFrame({
    ("numeric", "var1") : np.random.randint(0, 10, size=sample_size),
    ("numeric", "var2") : np.random.randint(0, 10, size=sample_size),
    ("categorial", "var1") : np.random.choice(["a", "b"], size=sample_size),
    ("categorial", "var2") : np.random.choice(["a", "b"], size=sample_size)
})
test_frame.head()

Unnamed: 0_level_0,numeric,numeric,categorial,categorial
Unnamed: 0_level_1,var1,var2,var1,var2
0,9,2,b,b
1,4,0,b,a
2,0,2,a,a
3,1,0,b,a
4,9,7,b,a
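
For orientation, here is a short pandas-only sketch of how such a frame is addressed. It rebuilds the first three rows shown above under the hypothetical name `frame`, so that it runs on its own.

```python
import pandas as pd

# First three rows of the frame above, rebuilt by hand for illustration.
frame = pd.DataFrame({
    ("numeric", "var1"): [9, 4, 0],
    ("numeric", "var2"): [2, 0, 2],
    ("categorial", "var1"): ["b", "b", "a"],
    ("categorial", "var2"): ["b", "a", "a"],
})

# Each column label is a tuple: (top level, second level).
labels = frame.columns.tolist()
print(labels)

# Indexing by a top-level label returns the sub-frame under it,
# with the top level dropped from the column labels.
numeric_part = frame["numeric"]
print(list(numeric_part.columns))
```

In pandas a top-level label selects a whole sub-frame, while a full column label is a tuple.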


### Basic case

Here we build a `ColumnTransformer` whose column selectors are taken directly from the `columns` attribute of the corresponding sub-frame (numeric or categorical). The transformer itself is displayed as HTML so you can check exactly what was passed as the column names.

In [None]:
numeric_columns = test_frame.select_dtypes("number").columns
categorial_columns = test_frame.select_dtypes("O").columns
transformer = ColumnTransformer([
    ("numeric transformer", StandardScaler(), numeric_columns),
    ("categorial transformer", OneHotEncoder(), categorial_columns)
])

display(transformer)
display(transformer.fit_transform(test_frame).round(3))
display(transformer.get_feature_names_out())

ColumnTransformer parameters

parameter,value
transformers,"[('numeric transformer', ...), ('categorial transformer', ...)]"
remainder,'drop'
sparse_threshold,0.3
n_jobs,
transformer_weights,
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

StandardScaler parameters

parameter,value
copy,True
with_mean,True
with_std,True

OneHotEncoder parameters

parameter,value
categories,'auto'
drop,
sparse_output,True
dtype,<class 'numpy.float64'>
handle_unknown,'error'
min_frequency,
max_categories,
feature_name_combiner,'concat'


AttributeError: 'ColumnTransformer' object has no attribute 'feature_names_in_'

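When the call succeeds, the output feature names are generic (`x0`, `x1`, `x2`, ...) rather than derived from the input column names. Note that the output recorded above ends in an `AttributeError` about `feature_names_in_`, so whether the call succeeds appears to depend on the installed scikit-learn version.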

### Passing upper levels

It would be convenient if `sklearn` could select a whole sub-frame from just its top-level column label.

In [None]:
try:
    ColumnTransformer([
        ("numeric transformer", StandardScaler(), ["numeric"]),
        ("categorial transformer", OneHotEncoder(), ["categorial"])
    ]).fit_transform(
        test_frame
    )
except Exception as e:
    print(e)

Selected columns, ['numeric'], are not unique in dataframe


The call raises an error, so this is not supported yet.