Pack object

Pack object#

You can pack all the transformations under the datafame and the final model into a single object. With this approach, you will benefit during the model deployment phase because the entire pipeline is packed as a single object and can be easily deployed.

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler
)
from sklearn.compose import ColumnTransformer
from IPython.display import HTML

Pipeline#

Imagine that you need to create a model building pipeline that includes data standardisation and then model fitting. In this section I want to show the difference in code length and convenience of coding all by yourself and using the sklean.pipeline.Pipeline class. Check more on pipelines page.

Data generation#

In the following cell I just generate a random regression task for use in the example.

sample_size = 1000
features_count = 20
np.random.seed(50)

X = []

for i in range(features_count):

    mean = np.random.uniform(0,100)
    std = np.abs(np.random.normal(0, 50))
    
    X.append(np.random.normal(mean, std, [sample_size, 1]))

X = np.concatenate(X, axis=1)
theoretical_coefs = np.random.normal(0, 20, [features_count, 1])
y = np.dot(X, theoretical_coefs) + np.random.normal(0, 500, sample_size)

Self coding#

So here is code that does:

10-fold split cross-validation for the named pipeline;
Display cross-validation results;
Fit model to full data sample;
Compute the mean prediction over the entire data sample.

You need to create a cycle that fits StandardScaler for current split and fit model to standardised data. After the cycle at the step of fitting the model to the whole data, you need to describe the whole pipeline again!

my_split = KFold(n_splits = 10)
train_errors = []
test_errors = []


for train_ind, test_ind in my_split.split(X):
    
    this_scaler = StandardScaler().fit(X[train_ind, :])
    
    train_X = this_scaler.transform(X[train_ind, :])
    train_y = y[train_ind]
    
    test_X = this_scaler.transform(X[test_ind,:])
    test_y = y[test_ind]

    model = LinearRegression().fit(train_X, train_y)
    train_errors.append(mean_squared_error(train_y, model.predict(train_X)))
    test_errors.append(mean_squared_error(test_y, model.predict(test_X)))

print("Train error:", np.mean(np.array(train_errors)))
print("Test error:", np.mean(np.array(test_errors)))

standart_X = StandardScaler().fit_transform(X)
final_model = LinearRegression().fit(standart_X, y)
print("Mean predict", np.mean(final_model.predict(standart_X)))

Train error: 3.3319604786449884e-22
Test error: 3.301381336530025e-22
Mean predict 19010.954067406547

Using `sklearn.pipeline`#

In the following cell, I perform exactly the same calculations using only sklearn.pipeline.Pipeline.

You just need to define a my_pipline object where I describe the steps of the pipeline in the format [(<name of step 1>,<object performing step 1>), (<name of step 2>,<object performing step 2>), ...] and then just use it as a normal estimator - it will perform all the steps automatically.

So in the following cell it used in combination with cross_validate function to perform cross-validation, and after that just called fit(...).predict(...) to run the entire sample through the pipeline.

The results are exactly the same.

Less code! Easier to manage!

my_split = KFold(n_splits = 10)

my_pipe = Pipeline([
    ("test_scaler", StandardScaler()),
    ("my_model", LinearRegression())
])

cv_results = cross_validate(
    estimator=my_pipe,
    X=X, y=y,
    scoring="neg_mean_squared_error",
    cv=my_split,
    return_train_score=True
)

print("Train error:", np.mean(cv_results["train_score"]))
print("Test error:", np.mean(cv_results["test_score"]))
print("Mean predict", np.mean(my_pipe.fit(X,y).predict(X)))

Train error: -3.3319573813131387e-22
Test error: -3.301340242197193e-22
Mean predict 19010.954067406547

Columns transformer#

Sometimes different columns need to be transformed in different ways. The most obvious example is the different processing of categorical and numerical columns:

For numeric columns, you need to apply normalisation techniques;
For categorical columns, you need to apply encoding (a hot, mean, etc.).

It’s easy to build such a transformation yourself, but it’s convenient that sklearn has an out-of-the-box solution that can be easily integrated into sklearn type pipelines - sklearn.compose.ColumnTransformer.

For more details check:

Sklearn API reference.
Columns transformer page.

In the next cell a random data frame is generated, with some categorical and some numerical columns. Let’s show how can be builded component of the pipeline that process categorical columns in one way and numeric in other.

sample_size = 500
np.random.seed(10)

generate_word = lambda: "".join([
    chr(val) for val in 
    np.random.randint(ord("a"), ord("z") + 1, 10)
])
get_cat_var = lambda: np.random.choice(
    [
        generate_word() for i in 
        range(np.random.randint(2,7))
    ], 
    sample_size
)
get_num_var = lambda: np.random.normal(
    np.random.uniform(-1,1), 
    np.random.uniform(1,10),
    sample_size
)

variables_generator = [get_cat_var, get_num_var]

data_frame = pd.concat(
    {
        f"var {i}" : \
        pd.Series(np.random.choice(variables_generator)())
        for i in range(20)
    },
    axis = 1
)

data_frame.head()

	var 0	var 1	var 2	var 3	var 4	var 5	var 6	var 7	var 8	var 9	var 10	var 11	var 12	var 13	var 14	var 15	var 16	var 17	var 18	var 19
0	6.352738	ghfmmekjzz	ewvspmvrkg	-3.916784	0.579251	3.078876	jljighbmio	iieafcivri	-3.503851	hwadgiwzth	zderdinjyy	-0.851043	-7.137998	-0.990391	4.128471	lduutwjjin	-6.858011	-3.455499	kdzpmsglss	fjogwgrkig
1	-1.562264	ghfmmekjzz	dlfjbofnbr	-1.458950	0.755219	-0.498048	phrxnjsbae	iieafcivri	-5.212578	yxickhmgkp	kpqepphruh	-6.878257	-1.712574	-7.783903	-3.623413	lduutwjjin	3.198987	-7.290196	eywzqkuzza	fjogwgrkig
2	-2.453819	booaisyeuj	dlfjbofnbr	-0.124566	4.070167	-2.271910	lzsssmsaim	vhfoucvgil	-3.504522	pdzajvgbzz	ynhwdgvtke	-0.838181	1.898630	-6.632060	-1.394765	zghwqxiakd	-14.830121	10.490557	irrdfszbwf	voumadgklp
3	-0.042513	booaisyeuj	kkagxtgiko	-6.897858	-0.065287	-3.459478	phrxnjsbae	yfmijifvmo	0.742066	wectjxhbio	kpqepphruh	-0.087694	-1.808818	0.053985	0.494845	lduutwjjin	-0.341344	4.539596	eywzqkuzza	dzlpowvufa
4	-5.946806	xmtwmxfxpz	dlfjbofnbr	7.453730	-3.450039	0.091773	jljighbmio	vhfoucvgil	1.830545	hwadgiwzth	ynhwdgvtke	-0.218426	0.492733	-2.954776	-2.614179	zghwqxiakd	-2.672298	6.436154	kdzpmsglss	dzlpowvufa

To prepare a transformer that handles different columns in different ways, you need to pass a list of your transformers to the transformers parameter of the sklearn.compose.ColumnTransformer constructor.

Each element of the transformers list should be of the form (<transformer name>, <transformer class>, <columns that will use this transformer>).

In the following cell we have created such an object, showing how it will look in the Jupyter output and possible results of this transformation for the data frame described above.

numeric_columns = list(data_frame.select_dtypes("number").columns)
categorical_columns = list(set(data_frame.columns) - set(numeric_columns))

my_transformer = ColumnTransformer(
    transformers = [
        ("one_hot_encoder", OneHotEncoder(), categorical_columns),
        ("standart_scaler", StandardScaler(), numeric_columns)
    ]
)


display(my_transformer)
display(
    pd.DataFrame(
        my_transformer.fit_transform(data_frame)
    ).head()
)

ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(),
                                 ['var 10', 'var 18', 'var 7', 'var 2',
                                  'var 15', 'var 9', 'var 6', 'var 19',
                                  'var 1']),
                                ('standart_scaler', StandardScaler(),
                                 ['var 0', 'var 3', 'var 4', 'var 5', 'var 8',
                                  'var 11', 'var 12', 'var 13', 'var 14',
                                  'var 16', 'var 17'])])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

ColumnTransformer

?Documentation for ColumnTransformeriNot fitted

Parameters

	transformers	[('one_hot_encoder', ...), ('standart_scaler', ...)]
	remainder	'drop'
	sparse_threshold	0.3
	n_jobs	None
	transformer_weights	None
	verbose	False
	verbose_feature_names_out	True
	force_int_remainder_cols	'deprecated'

one_hot_encoder

['var 10', 'var 18', 'var 7', 'var 2', 'var 15', 'var 9', 'var 6', 'var 19', 'var 1']

OneHotEncoder

?Documentation for OneHotEncoder

Parameters

	categories	'auto'
	drop	None
	sparse_output	True
	dtype	<class 'numpy.float64'>
	handle_unknown	'error'
	min_frequency	None
	max_categories	None
	feature_name_combiner	'concat'

standart_scaler

['var 0', 'var 3', 'var 4', 'var 5', 'var 8', 'var 11', 'var 12', 'var 13', 'var 14', 'var 16', 'var 17']

StandardScaler

?Documentation for StandardScaler

Parameters

	copy	True
	with_mean	True
	with_std	True

	0	1	2	3	4	6	8	9	...	37	38	39	40	41	42	43	44	45	46
0	0.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	...	-1.169878	0.128906	1.319012	-0.962918	-0.316245	-1.378868	-0.008420	1.003985	-0.811118	-0.795706
1	1.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	...	-0.391908	0.160042	-0.407646	-1.467765	-2.575318	-0.360064	-1.728754	-1.025626	0.436845	-1.531170
2	0.0	1.0	0.0	0.0	1.0	0.0	0.0	1.0	...	0.030460	0.746601	-1.263927	-0.963117	-0.311424	0.318060	-1.437070	-0.442118	-1.800370	1.879037
3	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	-2.113468	0.014859	-1.837191	0.291547	-0.030132	-0.378137	0.256049	0.052623	-0.002472	0.737690
4	0.0	1.0	0.0	0.0	0.0	1.0	0.0	1.0	...	2.429194	-0.584051	-0.122927	0.613140	-0.079132	0.054056	-0.505865	-0.761387	-0.291717	1.101435

5 rows × 47 columns

Pack object

Contents

Pack object#

Pipeline#

Data generation#

Self coding#

Using sklearn.pipeline#

Columns transformer#

Using `sklearn.pipeline`#