Pipeline#

The sklearn.pipeline.Pipeline is a tool that allows you to create objects containing all the necessary stages of model fitting. In other words, you can build your own estimator as a chain of objects that perform data processing and model fitting.
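For example, a minimal sketch of the idea (the step names here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Two stages combined into one estimator-like object:
# fit(X, y) scales X and then fits the regression;
# predict(X_new) applies the same scaling before predicting.
sketch_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("regress", LinearRegression())
])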

import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.metrics import mean_squared_error

Demonstration of benefits#

Imagine that you need to build a modelling pipeline that standardises the data and then fits a model. In this section I want to show the difference in code length and convenience between coding everything yourself and using the sklearn.pipeline.Pipeline class.

Data generation#

In the following cell I generate a random regression problem to use in the example: every feature gets its own random mean and scale, and the target is a linear combination of the features plus noise.

sample_size = 1000
features_count = 20
np.random.seed(50)

# Each feature is drawn from a normal distribution with its own
# randomly chosen mean and standard deviation, so the features
# live on very different scales.
X = []
for i in range(features_count):
    mean = np.random.uniform(0, 100)
    std = np.abs(np.random.normal(0, 50))
    X.append(np.random.normal(mean, std, [sample_size, 1]))
X = np.concatenate(X, axis=1)

# The target is a linear combination of the features plus noise.
# ravel() keeps the dot product one-dimensional: adding a
# (sample_size,) noise vector to a (sample_size, 1) column would
# broadcast to a (sample_size, sample_size) matrix.
theoretical_coefs = np.random.normal(0, 20, [features_count, 1])
y = np.dot(X, theoretical_coefs).ravel() + np.random.normal(0, 500, sample_size)
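A quick shape check is cheap insurance against the broadcasting mistake described in the comment above:

# X is a (sample_size, features_count) matrix; y is one-dimensional.
assert X.shape == (sample_size, features_count)
assert y.shape == (sample_size,)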

Coding it yourself#

So here is code that:

  • performs 10-fold cross-validation for the pipeline described above (standardisation, then linear regression);

  • displays the cross-validation results;

  • fits the model to the full data sample;

  • computes the mean prediction over the entire data sample.

You have to write a loop that fits a StandardScaler on the training part of each split (so that no information from the test fold leaks into the preprocessing) and then fits the model to the standardised data. After the loop, when fitting the model to the whole data set, you have to spell out the entire pipeline again!

my_split = KFold(n_splits=10)
train_errors = []
test_errors = []

for train_ind, test_ind in my_split.split(X):
    # The scaler is fit on the training fold only and then applied to both folds.
    this_scaler = StandardScaler().fit(X[train_ind, :])

    train_X = this_scaler.transform(X[train_ind, :])
    train_y = y[train_ind]

    test_X = this_scaler.transform(X[test_ind, :])
    test_y = y[test_ind]

    model = LinearRegression().fit(train_X, train_y)
    train_errors.append(mean_squared_error(train_y, model.predict(train_X)))
    test_errors.append(mean_squared_error(test_y, model.predict(test_X)))

print("Train error:", np.mean(train_errors))
print("Test error:", np.mean(test_errors))

# To fit the final model, the scale-then-fit sequence has to be written out again.
standardised_X = StandardScaler().fit_transform(X)
final_model = LinearRegression().fit(standardised_X, y)
print("Mean predict", np.mean(final_model.predict(standardised_X)))
Train error: 3.5097084970944476e-22
Test error: 3.4625913023012295e-22
Mean predict 19010.954067406543

Using sklearn.pipeline#

In the following cell, I perform exactly the same calculations using only sklearn.pipeline.Pipeline.

You just need to define a my_pipe object, where the steps of the pipeline are described in the format [(<name of step 1>, <object performing step 1>), (<name of step 2>, <object performing step 2>), ...], and then use it as a normal estimator - it performs all the steps automatically.

So in the following cell it is used in combination with the cross_validate function to perform cross-validation, and after that fit(...).predict(...) is simply called to run the entire sample through the pipeline.

The results are exactly the same, up to the sign: cross_validate reports the neg_mean_squared_error score, so the errors come out negated.

Less code! Easier to manage!

my_split = KFold(n_splits=10)

# Each step is a (name, estimator) pair; the resulting pipeline
# behaves like a single estimator.
my_pipe = Pipeline([
    ("test_scaler", StandardScaler()),
    ("my_model", LinearRegression())
])

# The scaler is automatically refit on the training part of every split.
cv_results = cross_validate(
    estimator=my_pipe,
    X=X, y=y,
    scoring="neg_mean_squared_error",
    cv=my_split,
    return_train_score=True
)

print("Train error:", np.mean(cv_results["train_score"]))
print("Test error:", np.mean(cv_results["test_score"]))
print("Mean predict", np.mean(my_pipe.fit(X, y).predict(X)))
Train error: -3.5097084970944476e-22
Test error: -3.4625913023012295e-22
Mean predict 19010.954067406543
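
One more convenience: after fitting, each step of the pipeline can be inspected by name through the named_steps attribute. A small sketch using the step names defined above:

# Access the fitted sub-estimators by the names given in the step list.
fitted_scaler = my_pipe.named_steps["test_scaler"]
fitted_model = my_pipe.named_steps["my_model"]
print("Feature means learned by the scaler:", fitted_scaler.mean_[:3])
print("First regression coefficients:", fitted_model.coef_[:3])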