Frozen steps#

Sometimes it’s useful to have components that are part of a pipeline but don’t change when the pipeline is fitted. This page focuses on ways to achieve that.

import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    random_state=10
)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, random_state=10, test_size=10)

Problem with an example#

Suppose we have a regression task and want to build a pipeline that feeds a PCA transformation into a linear regression. We assume that PCA doesn’t contribute to overfitting, so we want to fit it once on the whole dataset and then reuse it inside the pipeline. As a result, the PCA transformation stays fixed for every cross-validation fold and for the test data, while the linear regression is re-fitted each time, as sketched in the cell below.
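The following is only a hand-written sketch of that workflow, assuming a plain 5-fold loop (the fixed_pca name and the loop itself are illustrative, not part of the solution built later): the PCA is fitted once on the full data, while the regression is re-fitted inside every fold.

from sklearn.model_selection import KFold

# Fit PCA once on the full dataset; it stays fixed from now on.
fixed_pca = PCA(n_components=3).fit(X)

for train_idx, val_idx in KFold(n_splits=5).split(X_train):
    # The frozen PCA only transforms; it is never re-fitted.
    X_tr = fixed_pca.transform(X_train[train_idx])
    X_val = fixed_pca.transform(X_train[val_idx])

    # The regression is re-fitted on every fold.
    model = LinearRegression().fit(X_tr, y_train[train_idx])
    print(model.score(X_val, y_train[val_idx]))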

With that goal in mind, let’s fit the PCA transformation on the whole dataset and check its components.

pca_transformer = PCA(n_components=3).fit(X)
pd.DataFrame(pca_transformer.components_).T
0 1 2
0 -0.029002 -0.472792 0.677339
1 -0.447752 0.023186 -0.103205
2 -0.024158 0.413848 0.094252
3 -0.152105 0.180614 0.032731
4 0.699786 -0.278504 -0.148883
5 -0.168125 -0.012485 0.002570
6 0.007655 0.492975 0.469945
7 0.302708 0.375889 -0.305682
8 0.389330 0.313503 0.416662
9 -0.117147 0.108241 -0.102644

Now build a pipeline that uses this fitted PCA as a step. If you look at the PCA components after fitting the pipeline, they differ from the ones we got by fitting PCA on the full dataset: the pipeline has re-fitted the transformer on the training data, and that is exactly what we need to prevent.

my_pipe = Pipeline([
    ("pca_transformer", pca_transformer),
    ("model", LinearRegression())
])
my_pipe.fit(X_train, y_train)
pd.DataFrame(my_pipe["pca_transformer"].components_).T
0 1 2
0 -0.036841 -0.518257 0.651796
1 -0.451093 0.037510 -0.098019
2 -0.039621 0.418203 0.113842
3 -0.164741 0.173786 0.035227
4 0.674812 -0.275627 -0.182952
5 -0.170362 -0.055523 0.007950
6 0.010807 0.439391 0.548820
7 0.328344 0.426361 -0.200186
8 0.396116 0.232198 0.408454
9 -0.131220 0.130733 -0.098516

Transformer-wrapper#

The option offered in this section is to create a class inherited from sklearn.base.BaseEstimator that takes an already fitted transformer in __init__, does nothing during fit, and in transform simply calls transform on the wrapped object.

This solution may seem a bit sketchy, but in fact the same option is shown as an example in the official sklearn guide “Developing scikit-learn estimators”.

The following cell defines such a wrapper:

from sklearn.base import BaseEstimator

class FrozenTransformer(BaseEstimator):
    def __init__(self, fitted_transformer):
        # Keep a reference to the already fitted transformer.
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # Fitting changes nothing: the wrapped transformer stays as it is.
        return self

    def transform(self, X):
        # Delegate to the wrapped, already fitted transformer.
        return self.fitted_transformer.transform(X)

    def fit_transform(self, X, y=None):
        # "Fitting" only transforms the data.
        return self.transform(X)
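One caveat before using the wrapper: utilities such as cross_val_score clone the estimator before fitting, and cloning would replace the wrapped, already fitted transformer with a fresh unfitted copy. On scikit-learn 1.3 or newer this can be worked around by overriding the __sklearn_clone__ hook so that cloning returns the wrapper itself; the subclass below is only a sketch under that assumption (the CloneSafeFrozenTransformer name is made up here).

class CloneSafeFrozenTransformer(FrozenTransformer):
    def __sklearn_clone__(self):
        # Cloning returns the very same object, so the fitted
        # transformer inside survives cross-validation untouched.
        return self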

The following cell creates and fits the PCA transformation, shows its components, and finally wraps it in a FrozenTransformer.

pca_transformer = PCA(n_components=3).fit(X)
display(pd.DataFrame(pca_transformer.components_).T)
frozen_transformer = FrozenTransformer(pca_transformer)
0 1 2
0 -0.029002 -0.472792 0.677339
1 -0.447752 0.023186 -0.103205
2 -0.024158 0.413848 0.094252
3 -0.152105 0.180614 0.032731
4 0.699786 -0.278504 -0.148883
5 -0.168125 -0.012485 0.002570
6 0.007655 0.492975 0.469945
7 0.302708 0.375889 -0.305682
8 0.389330 0.313503 0.416662
9 -0.117147 0.108241 -0.102644

Now let’s use the FrozenTransformer instance as a step in the Pipeline. This time, fitting the pipeline leaves the PCA components unchanged.

my_pipe = Pipeline([
    ("pca_transformer", frozen_tranformer),
    ("model", LinearRegression())
])
my_pipe.fit(X_train, y_train)
pd.DataFrame(
    my_pipe["pca_transformer"]
    .fitted_transformer.components_
).T
0 1 2
0 -0.029002 -0.472792 0.677339
1 -0.447752 0.023186 -0.103205
2 -0.024158 0.413848 0.094252
3 -0.152105 0.180614 0.032731
4 0.699786 -0.278504 -0.148883
5 -0.168125 -0.012485 0.002570
6 0.007655 0.492975 0.469945
7 0.302708 0.375889 -0.305682
8 0.389330 0.313503 0.416662
9 -0.117147 0.108241 -0.102644
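Since the frozen step behaves like any other transformer at prediction time, the fitted pipeline can be scored as usual, for example:

# The frozen PCA only transforms the features; prediction and scoring work as usual.
print(my_pipe.score(X_test, y_test))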