Feature names
On this page, I want to focus on how feature names flow through sklearn.pipeline.Pipeline.
The content of this page is strongly related to the content of the feature names out page, so check it out.
In short, the steps of a sklearn.pipeline.Pipeline do not, by default, know anything about the column names produced by the previous steps. But when the sklearn.pipeline.Pipeline.get_feature_names_out method is called, the pipeline calls this method on each step in turn, passing it the feature names returned by the previous step (the first step works from the names of the input data). The following experiments show this.
from copy import copy
import numpy as np
import pandas as pd
from IPython.display import display  # built in to notebooks; needed in plain scripts
from sklearn.datasets import make_classification
from sklearn.preprocessing import (
    PolynomialFeatures, FunctionTransformer
)
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
Here’s the data frame that will be used in the following examples:
X, y = make_classification(
    n_features=3,
    n_samples=10000,
    n_informative=3,
    n_redundant=0,
    random_state=0
)
X = pd.DataFrame(
    X,
    columns=[f"x_{i+1}" for i in range(X.shape[1])]
)
display(X.head())
| | x_1 | x_2 | x_3 |
|---|---|---|---|
| 0 | 1.766138 | 1.603858 | 1.550204 |
| 1 | 1.703640 | 1.146002 | 1.084877 |
| 2 | -0.691141 | -1.720920 | -1.593803 |
| 3 | 0.845986 | -1.062863 | -1.060188 |
| 4 | 0.175248 | -0.676483 | -0.743816 |
Here are a couple of sklearn.preprocessing.FunctionTransformer instances with a property that will be useful in the following steps: each time their get_feature_names_out method is called, they print a message saying that they have been called, together with the input column names they received.
def feature_names(transformer, in_columns):
    # Announce the call and show the input column names that were passed in.
    print("="*50)
    print("features_names_out of the", transformer.name, "is called.")
    print("my in columns", in_columns)
    print("="*50)
    # Prefix every column with the name of the transformer.
    return [f"{transformer.name}__{col}" for col in in_columns]

# Identity transformers: they leave the data untouched and only rename columns.
first = FunctionTransformer(
    lambda X: X,
    feature_names_out=feature_names
)
first.name = "first"

second = FunctionTransformer(
    lambda X: X,
    feature_names_out=feature_names
)
second.name = "second"
NumPy output
A sklearn.pipeline.Pipeline is defined here, which uses sklearn.preprocessing.PolynomialFeatures and then passes the results to the function transformers, which will signal in the output whenever their get_feature_names_out method has been called.
my_pipe = Pipeline([
    (
        "transf",
        PolynomialFeatures(
            interaction_only=True,
            include_bias=False
        )
    ),
    ("first", first),
    ("second", second)
])
my_pipe.fit(X)
Pipeline(steps=[('transf', PolynomialFeatures(include_bias=False, interaction_only=True)), ('first', FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>, func=<function <lambda> at 0x7feeefdf4af0>)), ('second', FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>, func=<function <lambda> at 0x7feeefdf4b80>))])
With the default output, PolynomialFeatures returns a NumPy array, so the transformers that follow it did not receive any information about feature names during fitting. The next cell shows it:
try:
    my_pipe["first"].feature_names_in_
except AttributeError as e:
    # The attribute is only set when the step was fitted on data with column names.
    print(e)
'FunctionTransformer' object has no attribute 'feature_names_in_'
But you can still call the pipeline's get_feature_names_out method, which will just call get_feature_names_out on each of its components in the chain:
my_pipe.get_feature_names_out()
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the second is called.
my in columns ['first__x_1' 'first__x_2' 'first__x_3' 'first__x_1 x_2' 'first__x_1 x_3'
'first__x_2 x_3']
==================================================
array(['second__first__x_1', 'second__first__x_2', 'second__first__x_3',
'second__first__x_1 x_2', 'second__first__x_1 x_3',
'second__first__x_2 x_3'], dtype=object)
Pandas output
You can call the method set_output(transform="pandas") on your pipeline. This makes each step return a pandas.DataFrame as the output of its transformation, which guarantees that each following step knows the output columns of the previous step. So you can use the names of the columns in the transformation logic of the following steps.
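The following example demonstrates the creation of such a pipeline. Note that the method get_feature_names_out is already called during fitting: in the output below the first transformer reports its calls, while the last step is only fitted (its transform is not invoked by fit), so it prints nothing.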
my_pipe = Pipeline([
    (
        "transf",
        PolynomialFeatures(
            interaction_only=True,
            include_bias=False
        )
    ),
    ("first", first),
    ("second", second)
])
my_pipe.set_output(transform="pandas")
my_pipe.fit(X)
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
Pipeline(steps=[('transf', PolynomialFeatures(include_bias=False, interaction_only=True)), ('first', FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>, func=<function <lambda> at 0x7feeefdf4af0>)), ('second', FunctionTransformer(feature_names_out=<function feature_names at 0x7feeefdf4a60>, func=<function <lambda> at 0x7feeefdf4b80>))])
Each step now stores the column names of the output of the previous step in its feature_names_in_ attribute:
display(my_pipe["first"].feature_names_in_)
display(my_pipe["second"].feature_names_in_)
array(['x_1', 'x_2', 'x_3', 'x_1 x_2', 'x_1 x_3', 'x_2 x_3'], dtype=object)
array(['first__x_1', 'first__x_2', 'first__x_3', 'first__x_1 x_2',
'first__x_1 x_3', 'first__x_2 x_3'], dtype=object)
And calling get_feature_names_out on the sklearn.pipeline.Pipeline will result in the same chain of calls to its steps:
my_pipe.get_feature_names_out()
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3' 'x_1 x_2' 'x_1 x_3' 'x_2 x_3']
==================================================
==================================================
features_names_out of the second is called.
my in columns ['first__x_1' 'first__x_2' 'first__x_3' 'first__x_1 x_2' 'first__x_1 x_3'
'first__x_2 x_3']
==================================================
array(['second__first__x_1', 'second__first__x_2', 'second__first__x_3',
'second__first__x_1 x_2', 'second__first__x_1 x_3',
'second__first__x_2 x_3'], dtype=object)