Pipeline

Pipeline#

The sklearn.pipeline.Pipeline class lets you chain a sequence of scikit-learn transformers, optionally followed by a final estimator, into a single object. This page covers the details of the sklearn pipeline, as well as issues and tricks for working with it.

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import Lasso
from sklearn.preprocessing import (
    PolynomialFeatures, FunctionTransformer
)

Caching#

scikit-learn provides a mechanism that avoids recomputing the transform stages of a pipeline on every fit. Setting the memory argument makes sklearn.pipeline.Pipeline cache the fitted transformers in the given directory.

If a transformer receives the same parameters and the same input data as in one of the previous runs, its cached result is loaded instead of refitting that step. The last step of the pipeline is never cached.

For more details check:

Example from the sklearn pipeline guide.
Example from the official sklearn site “selecting dimensionality reduction with Pipeline and GridSearchCV”.
Stack Overflow question “using scikit Pipeline for testing models but preprocessing data only once”.

The following cells fit two almost identical pipelines; the only difference is that the second one passes the memory argument. Let’s see which one runs faster.

X, y = make_regression(n_features=10, random_state=10, n_samples=100_000)

steps = [
    ("pca", PCA(n_components=3)),
    ("lasso", Lasso()),
]

The following cell uses the %%timeit magic command to measure the fit time of the pipeline that doesn’t use caching.

%%timeit
Pipeline(steps=steps).fit(X, y)

155 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The same code, but with memory="/tmp" specified:

%%timeit
Pipeline(steps=steps, memory="/tmp").fit(X, y)

21.1 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
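
One consequence of caching is worth illustrating: when memory is set, the pipeline clones each transformer before fitting it, so the instance originally passed in stays unfitted and the fitted one has to be read from the pipeline. The following is a minimal sketch of that behaviour. The names X_small, y_small, cache_dir and original_pca are illustrative, and a temporary directory is used instead of "/tmp" so that the cache can be removed afterwards.

```python
import shutil
import tempfile

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Small illustrative data set (assumption: much smaller than the one timed above).
X_small, y_small = make_regression(n_features=10, n_samples=1_000, random_state=10)

# Temporary cache directory, removed at the end.
cache_dir = tempfile.mkdtemp()
try:
    original_pca = PCA(n_components=3)
    cached_pipe = Pipeline(
        steps=[("pca", original_pca), ("lasso", Lasso())],
        memory=cache_dir,
    )
    cached_pipe.fit(X_small, y_small)
    # A second fit with identical data and parameters can reuse the cached PCA.
    cached_pipe.fit(X_small, y_small)

    # The instance passed in was cloned, so it has no fitted attributes.
    original_is_fitted = hasattr(original_pca, "components_")
    # The fitted clone is stored inside the pipeline.
    fitted_pca = cached_pipe.named_steps["pca"]
    clone_is_fitted = hasattr(fitted_pca, "components_")
    components_shape = fitted_pca.components_.shape
finally:
    shutil.rmtree(cache_dir)

print(original_is_fitted, clone_is_fitted, components_shape)
```

In practice this means fitted attributes should be inspected through cached_pipe.named_steps (or cached_pipe["pca"]) rather than through the object that was passed to the constructor.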

Features names#

The content of this section is closely related to the content of the feature names out page, so check it out.

The steps of a sklearn.pipeline.Pipeline do not know anything about the column names produced by the previous steps. But when the sklearn.pipeline.Pipeline.get_feature_names_out method is called, each step’s get_feature_names_out is called in turn with the names returned by the previous step, or with the input names in the case of the first step. The following experiments indicate this.

Here’s the data frame that will be used in the following examples:

X,y = make_classification(
    n_features=3,
    n_samples=10000,
    n_informative=3,
    n_redundant=0,
    random_state=0
)
X = pd.DataFrame(
    X,
    columns = [f"x_{i+1}" for i in range(X.shape[1])]
)
display(X.head())

	x_1	x_2	x_3
0	1.766138	1.603858	1.550204
1	1.703640	1.146002	1.084877
2	-0.691141	-1.720920	-1.593803
3	0.845986	-1.062863	-1.060188
4	0.175248	-0.676483	-0.743816

Here are a couple of sklearn.preprocessing.FunctionTransformer objects with properties that will be useful in the following steps. Each time their get_feature_names_out is called, they print the information they received as input.

def feature_names(transformer, in_columns):
    print("="*50)
    print("features_names_out of the", transformer.name, "is called.")
    print("my in columns", in_columns)
    print("="*50)
    return [f"{transformer.name}__{col}" for col in in_columns]

first = FunctionTransformer(
    lambda X:X, 
    feature_names_out=feature_names
)
first.name = "first"

second = FunctionTransformer(
    lambda X:X, 
    feature_names_out=feature_names
)
second.name = "second"

The following cell shows how it is supposed to work: the component prints the columns it receives, which are the same as the data frame columns.

first.fit(X).get_feature_names_out()

==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================

array(['first__x_1', 'first__x_2', 'first__x_3'], dtype=object)

The transformer prints its input columns and returns them with a prefix indicating that the output has passed through this transformer.

A sklearn.pipeline.Pipeline is defined and fitted here:

my_pipe=Pipeline([
    ("first", first),
    ("second", second)
])
my_pipe.fit(X)

==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================

Pipeline(steps=[('first',
                 FunctionTransformer(feature_names_out=<function feature_names at 0x7a6ee07e4180>,
                                     func=<function <lambda> at 0x7a6ee07e4220>)),
                ('second',
                 FunctionTransformer(feature_names_out=<function feature_names at 0x7a6ee07e4180>,
                                     func=<function <lambda> at 0x7a6ee07e44a0>))])

It’s interesting that the first transformer’s feature_names_out was called three times.

After fitting, the transformers in the pipeline hold no information about feature names. The next cell shows it:

try:
    my_pipe["first"].feature_names_in_
except Exception as e:
    print(e)

'FunctionTransformer' object has no attribute 'feature_names_in_'

But you can still call the pipeline’s get_feature_names_out method, which simply calls get_feature_names_out of its components one after another:

my_pipe.get_feature_names_out()

==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
==================================================
features_names_out of the second is called.
my in columns ['first__x_1' 'first__x_2' 'first__x_3']
==================================================

array(['second__first__x_1', 'second__first__x_2', 'second__first__x_3'],
      dtype=object)

Each transformer prints its input columns, and the output columns of the entire transformation are returned.

Pandas output#

To make each step return its output as a pandas.DataFrame, just call pipeline.set_output(transform="pandas").

The following cell shows that by default the output of the Pipeline is a numpy array.

my_pipe=Pipeline([
    ("PCA", PCA(n_components=2)),
    ("poly", PolynomialFeatures())
])
my_pipe.fit(X)
my_pipe.transform(X)

Pipeline(steps=[('PCA', PCA(n_components=2)), ('poly', PolynomialFeatures())])

However, it returns a pandas.DataFrame once the transform output is set to "pandas".

my_pipe.set_output(transform="pandas")
my_pipe.transform(X)

/home/fedor/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/utils/validation.py:2742: UserWarning: X has feature names, but PolynomialFeatures was fitted without feature names
  warnings.warn(

	1	x0	x1	x0^2	x0 x1	x1^2
0	1.0	1.442495	2.064890	2.080791e+00	2.978593	4.263771
1	1.0	1.484650	1.415801	2.204186e+00	2.101969	2.004492
2	1.0	0.000436	-2.657027	1.897801e-07	-0.001158	7.059795
3	1.0	1.209246	-1.690244	1.462277e+00	-2.043921	2.856925
4	1.0	0.512068	-1.274820	2.622138e-01	-0.652795	1.625165
...	...	...	...	...	...	...
9995	1.0	-1.682327	-0.658349	2.830223e+00	1.107558	0.433424
9996	1.0	0.777373	0.563872	6.043086e-01	0.438339	0.317951
9997	1.0	0.498767	1.841249	2.487688e-01	0.918355	3.390197
9998	1.0	2.123719	2.954422	4.510183e+00	6.274363	8.728611
9999	1.0	-1.477795	-0.275411	2.183877e+00	0.407001	0.075851

10000 rows × 6 columns
