memory parameter (caching)#
Sources
Example from the official sklearn site: “Selecting dimensionality reduction with Pipeline and GridSearchCV”;
Stack Overflow question about using a scikit-learn Pipeline for testing models while preprocessing the data only once.
sklearn implements a mechanism that avoids recomputing the transform stages of a pipeline on every fit. By setting the memory argument, you make sklearn.pipeline.Pipeline cache its fitted transformers, so a transformer that has already been fitted with the same parameters on the same data is loaded from the cache instead of being refit.
It is still not entirely clear how this works internally, but here are a few cases that show what actually can, and cannot, be improved with caching.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.linear_model import Lasso, LinearRegression
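Before the experiments, here is a minimal sketch of the two forms the memory argument accepts: a directory path, or an explicit joblib.Memory object. The directory name caching_pipeline_files is just an arbitrary choice that is reused below.

import joblib
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

demo_steps = [("pca", PCA(n_components=3)), ("lasso", Lasso())]

# A plain directory path: sklearn wraps it in a joblib.Memory for you.
pipe_from_path = Pipeline(demo_steps, memory="caching_pipeline_files")

# An explicit joblib.Memory gives control over location and verbosity.
cache = joblib.Memory(location="caching_pipeline_files", verbose=1)
pipe_from_memory = Pipeline(demo_steps, memory=cache)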
Example of improvement#
The following cells define two almost identical GridSearchCV experiments; the only difference is that the second one passes the memory argument. Let’s see which one runs faster.
X, y = make_regression(n_features=10, random_state=10, n_samples=1000)

# The grid is intentionally redundant
# to maximise the usefulness of caching
param_grid = {"lasso__alpha": np.arange(0.1, 1, 0.001)}
steps = [
    ("pca", PCA(n_components=3)),
    ("lasso", Lasso()),
]
%%timeit
just_pipe = Pipeline(steps).fit(X, y)
GridSearchCV(
    just_pipe,
    param_grid=param_grid
).fit(X, y)
12.4 s ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
frozen_pipe = Pipeline(
    steps,
    memory="caching_pipeline_files"
).fit(X, y)
GridSearchCV(
    frozen_pipe,
    param_grid=param_grid
).fit(X, y)
301 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the cell that uses caching is much faster: with the cache in place, the PCA step is fitted only once per cross-validation fold and then reused for every candidate value of lasso__alpha, instead of being refit for every candidate-fold pair.
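To make the fit counts visible, here is a small sketch; the CountingPCA class and FIT_CALLS counter are illustrative names, not part of sklearn, and the expected counts in the comments assume sklearn’s default 5-fold CV plus the final refit on the full data.

import shutil
import tempfile
from sklearn.decomposition import PCA

FIT_CALLS = {"n": 0}  # module-level so it survives cloning inside GridSearchCV

class CountingPCA(PCA):
    # Illustrative subclass: counts how many times fitting really happens.
    # Pipeline caching memoizes the fit-transform of intermediate steps,
    # so on a cache hit this method is never called.
    def fit_transform(self, X, y=None):
        FIT_CALLS["n"] += 1
        return super().fit_transform(X, y)

def count_fits(memory):
    FIT_CALLS["n"] = 0
    pipe = Pipeline(
        [("pca", CountingPCA(n_components=3)), ("lasso", Lasso())],
        memory=memory
    )
    GridSearchCV(
        pipe, param_grid={"lasso__alpha": [0.1, 0.5, 1.0]}
    ).fit(X, y)
    return FIT_CALLS["n"]

print("without cache:", count_fits(None))  # 3 candidates * 5 folds + final refit = 16
cache_dir = tempfile.mkdtemp()
print("with cache:", count_fits(cache_dir))  # 5 folds + final refit = 6
shutil.rmtree(cache_dir)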
How it doesn’t work#
At first I thought that this argument could be used to freeze fitted stages of sklearn pipelines. This is a topic of great interest to me, and there is a separate page for it.
But caching of sklearn.pipeline.Pipeline is useless for this purpose. The following cells show why. Here a PCA + LinearRegression pipeline is fitted on the whole dataset and then refitted on separate cross-validation folds of the same data.
def get_pca_components(pipe):
    # Fitted PCA components as a (feature x component) frame
    return pd.DataFrame(
        pipe["pca"].components_.T,
        columns=range(1, pipe["pca"].n_components + 1),
        index=range(1, len(pipe["pca"].components_.T) + 1)
    )

def get_reg_coefficients(pipe):
    # Fitted regression coefficients, indexed from 1
    return pd.Series(
        pipe["regression"].coef_.T,
        index=range(1, len(pipe["regression"].coef_) + 1)
    )

def cv_fit_pipe(pipe, X, y):
    # Refit the pipeline on each KFold training set and collect
    # the fitted PCA components and regression coefficients
    pca = {}
    reg = {}
    for i, (train, test) in enumerate(KFold(n_splits=3).split(X)):
        pipe.fit(X[train], y[train])
        pca[f"split {i+1}"] = get_pca_components(pipe)
        reg[f"split {i+1}"] = get_reg_coefficients(pipe)
    return pca, reg
X, y = make_regression(n_features=3, random_state=10)

frozen_pipe = Pipeline(
    [
        ("pca", PCA(n_components=3)),
        ("regression", LinearRegression()),
    ],
    memory="caching_pipeline_files"
).fit(X, y)
pca, reg = cv_fit_pipe(frozen_pipe, X, y)

pd.concat(
    {
        "PCA components": pd.concat({
            "initial_fit": get_pca_components(frozen_pipe),
            **pca
        }),
        "Regression coefficients": pd.concat({
            "initial_fit": get_reg_coefficients(frozen_pipe),
            **reg
        }).rename("").to_frame()
    },
    axis=1
)
| fit | feature | PCA component 1 | PCA component 2 | PCA component 3 | Regression coefficient |
|---|---|---|---|---|---|
| initial_fit | 1 | -0.672216 | 0.738179 | -0.056720 | -32.308331 |
| | 2 | -0.433684 | -0.454703 | -0.777922 | -22.777824 |
| | 3 | 0.600037 | 0.498333 | -0.625795 | -67.793876 |
| split 1 | 1 | -0.348919 | -0.857311 | 0.378514 | -68.506909 |
| | 2 | -0.892296 | 0.427377 | 0.145455 | 28.807335 |
| | 3 | 0.286469 | 0.286994 | 0.914095 | 25.211267 |
| split 2 | 1 | -0.675871 | -0.476143 | -0.562571 | 49.580387 |
| | 2 | 0.732309 | -0.347679 | -0.585528 | -21.533917 |
| | 3 | -0.083201 | 0.807717 | -0.583670 | -56.892379 |
| split 3 | 1 | -0.672216 | 0.738179 | -0.056720 | -32.308331 |
| | 2 | -0.433684 | -0.454703 | -0.777922 | -22.777824 |
| | 3 | 0.600037 | 0.498333 | -0.625795 | -67.793876 |
I was hoping the fitted PCA parameters wouldn’t change after the first fit. But each new fit on different data produces new components: the cache only returns a stored result when the transformer is refit with exactly the same parameters on exactly the same data. (The initial_fit rows coincide with split 3 simply because the summary table reads the components from frozen_pipe after cv_fit_pipe has already refit it on the last split.)
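This behaviour follows from how the cache is keyed. The memory argument is backed by joblib.Memory, which stores results keyed on the function arguments, including the training data itself. Here is a rough standalone sketch of that rule; fit_pca is a hypothetical helper for illustration, not sklearn API.

import numpy as np
from joblib import Memory
from sklearn.decomposition import PCA

memory = Memory("caching_pipeline_files", verbose=0)

@memory.cache
def fit_pca(X):
    return PCA(n_components=2).fit(X)

data = np.random.RandomState(0).normal(size=(100, 3))

first = fit_pca(data[:50])   # computed and written to the cache
again = fit_pca(data[:50])   # same arguments: loaded from the cache, no refit
other = fit_pca(data[50:])   # different data: cache miss, PCA is refit

print(np.allclose(first.components_, again.components_))  # True
print(np.allclose(first.components_, other.components_))  # False

So a pipeline step is only ever “frozen” with respect to inputs it has already seen; any new training subset triggers a genuine refit.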