Frozen steps#
Sometimes it’s useful to have components that are part of a pipeline but don’t change when the pipeline is fitted. This page covers options for achieving that.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=10, test_size=10
)
Problem with an example#
Suppose we have a regression task and want to build a pipeline that feeds a PCA transformation into a linear regression. We assume the PCA step doesn’t contribute to overfitting, so we want to fit it once on the whole dataset and then reuse it inside the pipeline. As a result, the PCA transformation stays fixed for every cross-validation fold and for the test data, while the linear regression is re-fitted each time.
Let’s first apply the PCA transformation to the whole dataset and check its components.
pca_transformer = PCA(n_components=3).fit(X)
pd.DataFrame(pca_transformer.components_).T
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.029002 | -0.472792 | 0.677339 |
| 1 | -0.447752 | 0.023186 | -0.103205 |
| 2 | -0.024158 | 0.413848 | 0.094252 |
| 3 | -0.152105 | 0.180614 | 0.032731 |
| 4 | 0.699786 | -0.278504 | -0.148883 |
| 5 | -0.168125 | -0.012485 | 0.002570 |
| 6 | 0.007655 | 0.492975 | 0.469945 |
| 7 | 0.302708 | 0.375889 | -0.305682 |
| 8 | 0.389330 | 0.313503 | 0.416662 |
| 9 | -0.117147 | 0.108241 | -0.102644 |
Now, if we look at the PCA components after training the pipeline below, they will differ from those obtained by fitting PCA on the full dataset separately. The fitting process has happened again, and we need to prevent it.
my_pipe = Pipeline([
    ("pca_transformer", pca_transformer),
    ("model", LinearRegression())
])
my_pipe.fit(X_train, y_train)
pd.DataFrame(my_pipe["pca_transformer"].components_).T
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.036841 | -0.518257 | 0.651796 |
| 1 | -0.451093 | 0.037510 | -0.098019 |
| 2 | -0.039621 | 0.418203 | 0.113842 |
| 3 | -0.164741 | 0.173786 | 0.035227 |
| 4 | 0.674812 | -0.275627 | -0.182952 |
| 5 | -0.170362 | -0.055523 | 0.007950 |
| 6 | 0.010807 | 0.439391 | 0.548820 |
| 7 | 0.328344 | 0.426361 | -0.200186 |
| 8 | 0.396116 | 0.232198 | 0.408454 |
| 9 | -0.131220 | 0.130733 | -0.098516 |
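The mismatch can also be confirmed programmatically rather than by eyeballing the tables. Below is a self-contained sketch that regenerates the same data and compares the two component matrices with `numpy.allclose` (the names `pca_full` and `pipe` are local to this sketch):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_regression(
    n_samples=1000, n_features=10, n_informative=3, random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=10, test_size=10
)

# PCA fitted on the whole dataset
pca_full = PCA(n_components=3).fit(X)

# A PCA step inside a pipeline is re-fitted on X_train only
pipe = Pipeline([
    ("pca_transformer", PCA(n_components=3)),
    ("model", LinearRegression())
])
pipe.fit(X_train, y_train)

# The component matrices differ because the pipeline re-fitted PCA
print(np.allclose(pca_full.components_, pipe["pca_transformer"].components_))
# prints False
```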
Transformer-wrapper#
The option offered in this section is to create a class that inherits from `sklearn.base.BaseEstimator`, takes an already fitted transformer in `__init__`, and does nothing during `fit`. In `transform` it simply delegates to the `transform` of the wrapped object.
This solution may seem a bit sketchy, but in fact the same option is shown as an example in the official sklearn guide “Developing scikit-learn estimators”.
The following cell defines such a wrapper:
from sklearn.base import BaseEstimator

class FrozenTransformer(BaseEstimator):
    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # Do nothing: the wrapped transformer is already fitted
        return self

    def transform(self, X, y=None):
        return self.fitted_transformer.transform(X)

    def fit_transform(self, X, y=None):
        return self.transform(X)
The following cell creates and fits a `PCA` transformation and shows its components. Finally, a `FrozenTransformer` is created based on this `PCA` transformation.
pca_transformer = PCA(n_components=3).fit(X)
display(pd.DataFrame(pca_transformer.components_).T)
frozen_tranformer = FrozenTransformer(pca_transformer)
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.029002 | -0.472792 | 0.677339 |
| 1 | -0.447752 | 0.023186 | -0.103205 |
| 2 | -0.024158 | 0.413848 | 0.094252 |
| 3 | -0.152105 | 0.180614 | 0.032731 |
| 4 | 0.699786 | -0.278504 | -0.148883 |
| 5 | -0.168125 | -0.012485 | 0.002570 |
| 6 | 0.007655 | 0.492975 | 0.469945 |
| 7 | 0.302708 | 0.375889 | -0.305682 |
| 8 | 0.389330 | 0.313503 | 0.416662 |
| 9 | -0.117147 | 0.108241 | -0.102644 |
Now let’s use an instance of the `FrozenTransformer` as a step in the `Pipeline`, so that the components aren’t changed during the pipeline’s `fit`.
my_pipe = Pipeline([
    ("pca_transformer", frozen_tranformer),
    ("model", LinearRegression())
])
my_pipe.fit(X_train, y_train)
pd.DataFrame(
    my_pipe["pca_transformer"]
    .fitted_transformer.components_
).T
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.029002 | -0.472792 | 0.677339 |
| 1 | -0.447752 | 0.023186 | -0.103205 |
| 2 | -0.024158 | 0.413848 | 0.094252 |
| 3 | -0.152105 | 0.180614 | 0.032731 |
| 4 | 0.699786 | -0.278504 | -0.148883 |
| 5 | -0.168125 | -0.012485 | 0.002570 |
| 6 | 0.007655 | 0.492975 | 0.469945 |
| 7 | 0.302708 | 0.375889 | -0.305682 |
| 8 | 0.389330 | 0.313503 | 0.416662 |
| 9 | -0.117147 | 0.108241 | -0.102644 |
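One caveat for cross-validation: `cross_val_score` clones the pipeline for each fold, and cloning would normally reset the wrapped transformer to an unfitted state. The “Developing scikit-learn estimators” guide referenced above handles this by overriding `__sklearn_clone__` to return the instance itself (this requires scikit-learn >= 1.3). A self-contained sketch with that addition, checking that the frozen components survive cross-validation unchanged:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class FrozenTransformer(BaseEstimator):
    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def __sklearn_clone__(self):
        # Return self so cloning (e.g. in cross_val_score) keeps
        # the already fitted transformer instead of resetting it
        return self

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return self.fitted_transformer.transform(X)

    def fit_transform(self, X, y=None):
        return self.transform(X)


X, y = make_regression(
    n_samples=1000, n_features=10, n_informative=3, random_state=10
)

pca_transformer = PCA(n_components=3).fit(X)
components_before = pca_transformer.components_.copy()

pipe = Pipeline([
    ("pca_transformer", FrozenTransformer(pca_transformer)),
    ("model", LinearRegression())
])

# Each of the 5 folds re-fits the pipeline on its training split ...
scores = cross_val_score(pipe, X, y, cv=5)

# ... but the frozen PCA components never change
print(np.allclose(components_before, pca_transformer.components_))
# prints True
```

Without `__sklearn_clone__`, the cloned wrapper would hold an unfitted copy of the PCA and `transform` would raise a `NotFittedError` inside the cross-validation loop.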