Pipeline#
The sklearn.pipeline.Pipeline class allows you to combine a sequence of scikit-learn transformations into a single pipeline. This page covers the details of the sklearn pipeline, as well as common issues and tricks for working with it.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import Lasso
from sklearn.preprocessing import (
    PolynomialFeatures, FunctionTransformer
)
Caching#
scikit-learn provides a mechanism that avoids recomputing the transformer stages of a pipeline on every fit. By setting the memory argument, you make sklearn.pipeline.Pipeline cache its fitted transformers on disk.
If a transformer is fitted again with the same parameters and the same input data, the cached result is loaded instead of refitting it. The final step of the pipeline is never cached.
For more details check:
Example from the official sklearn site: “Selecting dimensionality reduction with Pipeline and GridSearchCV”.
Stack Overflow question: “Using scikit Pipeline for testing models but preprocessing data only once”.
The following cells fit two almost identical pipelines; the only difference is that the second one passes the memory argument. Let’s see which one runs faster.
X, y = make_regression(n_features=10, random_state=10, n_samples=100_000)
steps = [
    ("pca", PCA(n_components=3)),
    ("lasso", Lasso()),
]
The following cell uses the timeit magic command to estimate the performance of the pipeline that doesn’t use caching.
%%timeit
Pipeline(steps=steps).fit(X, y)
155 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The same code, but with memory="/tmp" specified:
%%timeit
Pipeline(steps=steps, memory="/tmp").fit(X, y)
21.1 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
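One side effect of caching is worth illustrating. As far as the scikit-learn documentation describes it, setting memory makes the pipeline clone its transformers before fitting, so the instance you passed in stays unfitted and the fitted copy has to be read from named_steps. The sketch below is an illustration under assumptions not present in the original cells: it uses a temporary directory from tempfile instead of "/tmp", a smaller dataset, and svd_solver="full" to keep PCA deterministic.

```python
import shutil
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Smaller dataset than in the timing cells above (an assumption for speed).
X, y = make_regression(n_features=10, n_samples=1_000, random_state=10)

# A throwaway cache directory that is removed at the end.
cache_dir = tempfile.mkdtemp()
try:
    pca = PCA(n_components=3, svd_solver="full")
    cached_pipe = Pipeline(
        [("pca", pca), ("lasso", Lasso())],
        memory=cache_dir,
    )
    cached_pipe.fit(X, y)
    # Same data and parameters: the PCA step should now come from the cache,
    # while the final Lasso step is fitted again.
    cached_pipe.fit(X, y)

    # Reference pipeline without caching.
    plain_pipe = Pipeline(
        [("pca", PCA(n_components=3, svd_solver="full")), ("lasso", Lasso())]
    ).fit(X, y)

    same_predictions = np.allclose(
        cached_pipe.predict(X), plain_pipe.predict(X)
    )
    # The instance passed to the pipeline was cloned, so it stays unfitted.
    original_pca_fitted = hasattr(pca, "components_")
    # The fitted clone lives inside the pipeline.
    pipeline_pca_fitted = hasattr(cached_pipe.named_steps["pca"], "components_")
finally:
    shutil.rmtree(cache_dir, ignore_errors=True)

print(same_predictions, original_pca_fitted, pipeline_pca_fitted)
```

Caching changes where the fitted transformer lives, not what the pipeline predicts, so inspect cached_pipe.named_steps["pca"] rather than the original pca object.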
Features names#
This section is closely related to the feature names out page, so check it out as well.
The steps of a sklearn.pipeline.Pipeline do not know anything about the column names produced by the previous steps. But when the sklearn.pipeline.Pipeline.get_feature_names_out method is called, each step calls its own get_feature_names_out with the names returned by the previous step, or with the input feature names in the case of the first step. The following experiments demonstrate this.
Here’s the data frame that will be used in the following examples:
X, y = make_classification(
    n_features=3,
    n_samples=10000,
    n_informative=3,
    n_redundant=0,
    random_state=0
)
X = pd.DataFrame(
    X,
    columns=[f"x_{i+1}" for i in range(X.shape[1])]
)
display(X.head())
display(X.head())
 | x_1 | x_2 | x_3 |
---|---|---|---|
0 | 1.766138 | 1.603858 | 1.550204 |
1 | 1.703640 | 1.146002 | 1.084877 |
2 | -0.691141 | -1.720920 | -1.593803 |
3 | 0.845986 | -1.062863 | -1.060188 |
4 | 0.175248 | -0.676483 | -0.743816 |
Here are a couple of sklearn.preprocessing.FunctionTransformer objects with a property that will be useful in the following steps: each time their get_feature_names_out method is called, they print the information they received as input.
def feature_names(transformer, in_columns):
    # Report which transformer was asked for names and what it received.
    print("="*50)
    print("features_names_out of the", transformer.name, "is called.")
    print("my in columns", in_columns)
    print("="*50)
    # Prefix every input column with the transformer's name.
    return [f"{transformer.name}__{col}" for col in in_columns]

# Two identity transformers that differ only in the name attached below.
first = FunctionTransformer(
    lambda X: X,
    feature_names_out=feature_names
)
first.name = "first"
second = FunctionTransformer(
    lambda X: X,
    feature_names_out=feature_names
)
second.name = "second"
The following cell shows how it is supposed to work: the component prints the columns it receives, which are the same as the DataFrame columns.
first.fit(X).get_feature_names_out()
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
array(['first__x_1', 'first__x_2', 'first__x_3'], dtype=object)
The transformer prints the input columns it received and returns them with a prefix indicating that the output has passed through this transformer.
A sklearn.pipeline.Pipeline is defined and fitted here:
my_pipe = Pipeline([
    ("first", first),
    ("second", second)
])
my_pipe.fit(X)
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
Pipeline(steps=[('first', FunctionTransformer(feature_names_out=<function feature_names at 0x7a6ee07e4180>, func=<function <lambda> at 0x7a6ee07e4220>)), ('second', FunctionTransformer(feature_names_out=<function feature_names at 0x7a6ee07e4180>, func=<function <lambda> at 0x7a6ee07e44a0>))])
It’s interesting that the first transformer’s feature_names_out was called three times.
After fitting, the transformers still do not hold any information about feature names. The next cell shows it:
try:
    my_pipe["first"].feature_names_in_
except Exception as e:
    print(e)
'FunctionTransformer' object has no attribute 'feature_names_in_'
But you can still call the get_feature_names_out method of the pipeline, which will just call get_feature_names_out on each of its components in the chain:
my_pipe.get_feature_names_out()
==================================================
features_names_out of the first is called.
my in columns ['x_1' 'x_2' 'x_3']
==================================================
==================================================
features_names_out of the second is called.
my in columns ['first__x_1' 'first__x_2' 'first__x_3']
==================================================
array(['second__first__x_1', 'second__first__x_2', 'second__first__x_3'],
dtype=object)
Each transformer prints its input columns, and the output columns of the entire transformation are returned.
Pandas output#
To make each step return its output as a pandas.DataFrame, just call pipeline.set_output(transform="pandas").
The following cell shows that by default the output of the Pipeline is a numpy array.
my_pipe = Pipeline([
    ("PCA", PCA(n_components=2)),
    ("poly", PolynomialFeatures())
])
my_pipe.fit(X)
my_pipe.transform(X)
Pipeline(steps=[('PCA', PCA(n_components=2)), ('poly', PolynomialFeatures())])
However, it begins to return a pandas.DataFrame when transform is set to "pandas".
my_pipe.set_output(transform="pandas")
my_pipe.transform(X)
/home/fedor/.virtualenvironments/python/lib/python3.13/site-packages/sklearn/utils/validation.py:2742: UserWarning: X has feature names, but PolynomialFeatures was fitted without feature names
warnings.warn(
 | 1 | x0 | x1 | x0^2 | x0 x1 | x1^2 |
---|---|---|---|---|---|---|
0 | 1.0 | 1.442495 | 2.064890 | 2.080791e+00 | 2.978593 | 4.263771 |
1 | 1.0 | 1.484650 | 1.415801 | 2.204186e+00 | 2.101969 | 2.004492 |
2 | 1.0 | 0.000436 | -2.657027 | 1.897801e-07 | -0.001158 | 7.059795 |
3 | 1.0 | 1.209246 | -1.690244 | 1.462277e+00 | -2.043921 | 2.856925 |
4 | 1.0 | 0.512068 | -1.274820 | 2.622138e-01 | -0.652795 | 1.625165 |
... | ... | ... | ... | ... | ... | ... |
9995 | 1.0 | -1.682327 | -0.658349 | 2.830223e+00 | 1.107558 | 0.433424 |
9996 | 1.0 | 0.777373 | 0.563872 | 6.043086e-01 | 0.438339 | 0.317951 |
9997 | 1.0 | 0.498767 | 1.841249 | 2.487688e-01 | 0.918355 | 3.390197 |
9998 | 1.0 | 2.123719 | 2.954422 | 4.510183e+00 | 6.274363 | 8.728611 |
9999 | 1.0 | -1.477795 | -0.275411 | 2.183877e+00 | 0.407001 | 0.075851 |
10000 rows × 6 columns