Features names

On this page, I want to focus on how feature names flow through sklearn.pipeline.Pipeline.

The content of this page is closely related to the feature names out page, so check that out as well.

In short, by default the steps of a sklearn.pipeline.Pipeline know nothing about the column names produced by the previous steps. But when sklearn.pipeline.Pipeline.get_feature_names_out is called, the pipeline calls get_feature_names_out on each step in turn, passing it the names returned by the previous step (or the names of the input data, for the first step). The following experiments illustrate this.

import pandas as pd
from IPython.display import display

from sklearn.datasets import make_classification
from sklearn.preprocessing import (
    PolynomialFeatures, FunctionTransformer
)
from sklearn.pipeline import Pipeline

Here’s the data frame that will be used in the following examples:

X, y = make_classification(
    n_features=3,
    n_samples=10000,
    n_informative=3,
    n_redundant=0,
    random_state=0
)
X = pd.DataFrame(
    X,
    columns=[f"x_{i+1}" for i in range(X.shape[1])]
)
display(X.head())

	x_1	x_2	x_3
0	1.766138	1.603858	1.550204
1	1.703640	1.146002	1.084877
2	-0.691141	-1.720920	-1.593803
3	0.845986	-1.062863	-1.060188
4	0.175248	-0.676483	-0.743816

Here are a couple of sklearn.preprocessing.FunctionTransformer instances with a property that will be useful in the following steps: each time their get_feature_names_out method is called, they print a message saying so.

def feature_names(transformer, in_columns):
    print("="*50)
    print("features_names_out of the", transformer.name, "is called.")
    print("my in columns", in_columns)
    print("="*50)
    return [f"{transformer.name}__{col}" for col in in_columns]

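# Identity transformers: they pass the data through unchanged and only
# report when their feature names are requested.
first = FunctionTransformer(
    lambda X: X,
    feature_names_out=feature_names
)
first.name = "first"
second = FunctionTransformer(
    lambda X: X,
    feature_names_out=feature_names
)
second.name = "second"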

NumPy output

Here a sklearn.pipeline.Pipeline is defined that applies sklearn.preprocessing.PolynomialFeatures and then passes the result to the function transformers, each of which prints a message when its get_feature_names_out method is called.

my_pipe=Pipeline([
    (
        "transf", 
        PolynomialFeatures(
            interaction_only=True, 
            include_bias=False
        )
    ),
    ("first", first),
    ("second", second)
])
my_pipe.fit(X)

Pipeline(steps=[('transf',
                 PolynomialFeatures(include_bias=False, interaction_only=True)),
                ('first',
                 FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>,
                                     func=<function <lambda> at 0x7feeefdf4af0>)),
                ('second',
                 FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>,
                                     func=<function <lambda> at 0x7feeefdf4b80>))])

After fitting, the transformers that follow the first step have no information about feature names, because they received plain NumPy arrays. The next cell shows it:

try:
    my_pipe["first"].feature_names_in_
except Exception as e:
    print(e)

'FunctionTransformer' object has no attribute 'feature_names_in_'

But you can still call the get_feature_names_out method of the pipeline, which simply calls get_feature_names_out of its components along the chain:

my_pipe.get_feature_names_out()

==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the second is called.
my in columns ['first__x_1' 'first__x_2' 'first__x_3' 'first__x_1 x_2' 'first__x_1 x_3'
 'first__x_2 x_3']
==================================================

array(['second__first__x_1', 'second__first__x_2', 'second__first__x_3',
       'second__first__x_1 x_2', 'second__first__x_1 x_3',
       'second__first__x_2 x_3'], dtype=object)
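
The chain of calls seen in this output can be approximated in plain Python. This is a simplified sketch and not the actual scikit-learn source: `ToyStep` and `chain_feature_names` are hypothetical names, and real pipelines also handle details such as "passthrough" steps.

```python
class ToyStep:
    """Hypothetical stand-in for a fitted transformer."""

    def __init__(self, prefix):
        self.prefix = prefix

    def get_feature_names_out(self, input_features=None):
        # Build the output names only from the names that are passed in.
        return [f"{self.prefix}__{name}" for name in input_features]


def chain_feature_names(steps, input_features=None):
    """Feed the names returned by each step into the next step."""
    names = input_features
    for _, step in steps:
        names = step.get_feature_names_out(names)
    return names


steps = [("first", ToyStep("first")), ("second", ToyStep("second"))]
names = chain_feature_names(steps, ["x_1", "x_2"])
print(names)
```

No step needs to remember anything from fitting here: the names are computed on demand and handed from one step to the next, which is why the prefixes stack up as second__first__.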

Pandas output

You can call set_output(transform="pandas") on your pipeline. This makes each step return a pandas.DataFrame as the output of its transformation, which guarantees that each following step knows the output columns of the previous step. So you can use the column names in the transformation logic of the following steps.

The following example demonstrates the creation of such a pipeline. Note in the output that get_feature_names_out of the first transformer is already called during fitting.

my_pipe=Pipeline([
    (
        "transf", 
        PolynomialFeatures(
            interaction_only=True, 
            include_bias=False
        )
    ),
    ("first", first),
    ("second", second)
])
my_pipe.set_output(transform="pandas")
my_pipe.fit(X)

==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================

Pipeline(steps=[('transf',
                 PolynomialFeatures(include_bias=False, interaction_only=True)),
                ('first',
                 FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>,
                                     func=<function <lambda> at 0x7feeefdf4af0>)),
                ('second',
                 FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>,
                                     func=<function <lambda> at 0x7feeefdf4b80>))])

Each step now stores the column names of the previous step's output in its feature_names_in_ attribute:

display(my_pipe["first"].feature_names_in_)
display(my_pipe["second"].feature_names_in_)

array(['x_1', 'x_2', 'x_3', 'x_1 x_2', 'x_1 x_3', 'x_2 x_3'], dtype=object)

array(['first__x_1', 'first__x_2', 'first__x_3', 'first__x_1 x_2',
       'first__x_1 x_3', 'first__x_2 x_3'], dtype=object)

And calling get_feature_names_out on the sklearn.pipeline.Pipeline results in the same chain of calls to its steps:

my_pipe.get_feature_names_out()

==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the second is called.
my in columns ['first__x_1' 'first__x_2' 'first__x_3' 'first__x_1 x_2' 'first__x_1 x_3'
 'first__x_2 x_3']
==================================================

array(['second__first__x_1', 'second__first__x_2', 'second__first__x_3',
       'second__first__x_1 x_2', 'second__first__x_1 x_3',
       'second__first__x_2 x_3'], dtype=object)
