memory parameter (caching)

scikit-learn implements a mechanism that avoids recomputing the transform stages of a pipeline every time it is fitted. By setting the memory argument, you make sklearn.pipeline.Pipeline cache the fitted transformers, so that repeated fits on the same data can reuse them.
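The memory argument accepts either a path to a cache directory or a joblib.Memory object. A minimal sketch of both forms (the directory name here is just an illustration):

from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

steps = [("pca", PCA(n_components=3)), ("lasso", Lasso())]

# Either pass a directory path ...
pipe = Pipeline(steps, memory="some_cache_dir")
# ... or an explicit joblib.Memory object
pipe = Pipeline(steps, memory=Memory("some_cache_dir", verbose=0))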

It is still not entirely clear how exactly it works, but here are a few cases that show what can actually be improved with caching.

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import (
    Lasso, LinearRegression
)
from sklearn.model_selection import GridSearchCV

Example of improvement

The following cells define two almost identical GridSearchCV experiments; the only difference is that the second one passes the memory argument. Let’s see which one runs faster.

X,y = make_regression(n_features=10, random_state=10, n_samples=1000)
# The grid is intentionally redundant 
# to maximise the usefulness of caching
param_grid = {"lasso__alpha":np.arange(0.1,1, 0.001)}
steps = [
    ("pca", PCA(n_components=3)),
    ("lasso", Lasso()),
]
%%timeit
just_pipe = Pipeline(steps).fit(X, y)

GridSearchCV(
    just_pipe,
    param_grid=param_grid
).fit(X,y)
12.4 s ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
frozen_pipe = Pipeline(
    steps,
    memory="chaching_pipeline_files"
).fit(X, y)

GridSearchCV(
    frozen_pipe,
    param_grid={"lasso__alpha":np.arange(0.1,1, 0.05)}
).fit(X,y)
301 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the cell that uses caching is much faster. The reason is that the pca step has no parameters in the search grid: its fit result is identical for every candidate, so with memory set it is computed once per training fold and loaded from the cache for all the remaining candidates.
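Keep in mind that the cached results accumulate on disk and survive between runs. To reclaim the space, the directory can be wiped through joblib; a minimal sketch, reusing the directory name from the cells above:

from joblib import Memory

# Drop everything cached by the pipelines above
Memory("chaching_pipeline_files").clear()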

How it doesn’t work

At first I thought that this argument could be used to freeze fitted stages of an sklearn pipeline. This is a topic of great interest to me, and there is a separate page for it.

But the caching of sklearn.pipeline.Pipeline is useless for this purpose. The following cells show why: a PCA + LinearRegression pipeline is fitted on the whole dataset and then refitted on separate parts of the data.

def get_pca_components(pipe):
    # Components of the fitted "pca" step, one column per component
    return pd.DataFrame(
        pipe["pca"].components_.T,
        columns=range(1, pipe["pca"].n_components+1),
        index=range(1, len(pipe["pca"].components_.T)+1)
    )

def get_reg_coefficients(pipe):
    # Coefficients of the fitted "regression" step
    return pd.Series(
        pipe["regression"].coef_.T,
        index=range(1, len(pipe["regression"].coef_)+1)
    )

def cv_fit_pipe(pipe, X, y):
    # Refit the pipeline on each KFold training subset and record
    # the fitted PCA components and regression coefficients
    pca = {}
    reg = {}
    for i, (train, test) in enumerate(KFold(n_splits=3).split(X)):
        pipe.fit(X[train], y[train])
        pca[f"split {i+1}"] = get_pca_components(pipe)
        reg[f"split {i+1}"] = get_reg_coefficients(pipe)
    return pca, reg

X,y = make_regression(n_features=3, random_state=10)

frozen_pipe = Pipeline(
    [
        ("pca", PCA(n_components=3)),
        ("regression", LinearRegression()),
    ],
    memory="chaching_pipeline_files"
).fit(X, y)

pca, reg = cv_fit_pipe(frozen_pipe, X, y)
pd.concat(
    {
        "PCA components":pd.concat({
            "initial_fit" : get_pca_components(frozen_pipe),
            **pca
        }),
        "Regression cofficients" : pd.concat({
            "initial_fit" : get_reg_coefficients(frozen_pipe),
            **reg
        }).rename("").to_frame()
    },
    axis=1
)
                    PCA components                     Regression coefficients
                        1          2          3
initial_fit   1  -0.672216   0.738179  -0.056720                  -32.308331
              2  -0.433684  -0.454703  -0.777922                  -22.777824
              3   0.600037   0.498333  -0.625795                  -67.793876
split 1       1  -0.348919  -0.857311   0.378514                  -68.506909
              2  -0.892296   0.427377   0.145455                   28.807335
              3   0.286469   0.286994   0.914095                   25.211267
split 2       1  -0.675871  -0.476143  -0.562571                   49.580387
              2   0.732309  -0.347679  -0.585528                  -21.533917
              3  -0.083201   0.807717  -0.583670                  -56.892379
split 3       1  -0.672216   0.738179  -0.056720                  -32.308331
              2  -0.433684  -0.454703  -0.777922                  -22.777824
              3   0.600037   0.498333  -0.625795                  -67.793876

I was hoping the fitted PCA parameters wouldn’t change after the first fit, but every run on new data produces new coefficients: the cache is keyed on the input data, so a fit on a different subset is a cache miss and triggers a genuine refit. (Note that the initial_fit rows coincide with split 3: cv_fit_pipe refits the pipeline in place, so by the time the table is built, frozen_pipe holds the parameters of the last split.)
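If the goal is actually to freeze a fitted stage, one option is sklearn.frozen.FrozenEstimator rather than caching. A minimal sketch, assuming scikit-learn >= 1.6 (where FrozenEstimator is available) and reusing X, y and the imports from above:

from sklearn.frozen import FrozenEstimator  # requires scikit-learn >= 1.6

# Fit PCA once, outside the pipeline
fitted_pca = PCA(n_components=3).fit(X)

# FrozenEstimator turns fit into a no-op, so refitting the pipeline
# refits only the regression step
pipe = Pipeline([
    ("pca", FrozenEstimator(fitted_pca)),
    ("regression", LinearRegression()),
])
pipe.fit(X[:50], y[:50])  # fitted_pca.components_ stays unchanged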