Pack object#
You can pack all the transformations of the data frame, together with the final model, into a single object. This approach pays off during model deployment: the entire pipeline is packed as one object and can be deployed as a unit.
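As a preview of what this buys you, here is a minimal sketch of packing and shipping a fitted pipeline with joblib. The demo data and the model.joblib file name are purely illustrative choices, not part of the example developed below.

import numpy as np
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# illustrative data, just to have something to fit
X_demo = np.random.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

# scaler and model travel together as a single estimator
packed = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])
packed.fit(X_demo, y_demo)

# one file now carries both the scaler statistics and the model coefficients
joblib.dump(packed, "model.joblib")
restored = joblib.load("model.joblib")
print(restored.predict(X_demo[:3]))

At serving time, restored.predict(...) is all that is needed; none of the preprocessing has to be reimplemented.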
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import (
OneHotEncoder,
StandardScaler
)
from sklearn.compose import ColumnTransformer
from IPython.display import display
Pipeline#
Imagine that you need to create a model-building pipeline that standardises the data and then fits a model. In this section I want to show the difference in code length and convenience between coding everything yourself and using the sklearn.pipeline.Pipeline
class. Check the pipelines page for more details.
Data generation#
In the following cell I just generate a random regression task for use in the example.
sample_size = 1000
features_count = 20
np.random.seed(50)
X = []
for i in range(features_count):
mean = np.random.uniform(0,100)
std = np.abs(np.random.normal(0, 50))
X.append(np.random.normal(mean, std, [sample_size, 1]))
X = np.concatenate(X, axis=1)
theoretical_coefs = np.random.normal(0, 20, [features_count, 1])
# ravel to 1-D so the noise is added per sample rather than broadcast into a matrix
y = np.dot(X, theoretical_coefs).ravel() + np.random.normal(0, 500, sample_size)
Self-coding#
So here is code that does the following:
10-fold cross-validation for the described pipeline;
Display of the cross-validation results;
Fitting of the model to the full data sample;
Computation of the mean prediction over the entire data sample.
You need to write a loop that fits a StandardScaler
on each training split and then fits the model to the standardised data; the scaler has to be refit per split, otherwise test-set statistics would leak into training. After the loop, at the step of fitting the model to the whole data, you have to describe the whole pipeline again!
my_split = KFold(n_splits = 10)
train_errors = []
test_errors = []
for train_ind, test_ind in my_split.split(X):
this_scaler = StandardScaler().fit(X[train_ind, :])
train_X = this_scaler.transform(X[train_ind, :])
train_y = y[train_ind]
test_X = this_scaler.transform(X[test_ind,:])
test_y = y[test_ind]
model = LinearRegression().fit(train_X, train_y)
train_errors.append(mean_squared_error(train_y, model.predict(train_X)))
test_errors.append(mean_squared_error(test_y, model.predict(test_X)))
print("Train error:", np.mean(np.array(train_errors)))
print("Test error:", np.mean(np.array(test_errors)))
standard_X = StandardScaler().fit_transform(X)
final_model = LinearRegression().fit(standard_X, y)
print("Mean predict", np.mean(final_model.predict(standard_X)))
Train error: 3.3319604786449884e-22
Test error: 3.301381336530025e-22
Mean predict 19010.954067406547
Using sklearn.pipeline#
In the following cell, I perform exactly the same calculations using sklearn.pipeline.Pipeline.
You just need to define a my_pipe
object that describes the steps of the pipeline in the format [(<name of step 1>, <object performing step 1>), (<name of step 2>, <object performing step 2>), ...]
and then use it as a normal estimator: it performs all the steps automatically.
So in the following cell it is used in combination with the cross_validate
function to perform cross-validation, and after that fit(...).predict(...)
is called to run the entire sample through the pipeline.
The results are exactly the same.
Less code! Easier to manage!
my_split = KFold(n_splits = 10)
my_pipe = Pipeline([
("test_scaler", StandardScaler()),
("my_model", LinearRegression())
])
cv_results = cross_validate(
estimator=my_pipe,
X=X, y=y,
scoring="neg_mean_squared_error",
cv=my_split,
return_train_score=True
)
print("Train error:", np.mean(cv_results["train_score"]))
print("Test error:", np.mean(cv_results["test_score"]))
print("Mean predict", np.mean(my_pipe.fit(X,y).predict(X)))
Train error: -3.3319573813131387e-22
Test error: -3.301340242197193e-22
Mean predict 19010.954067406547
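The step names are not just labels: they let you reach into a fitted pipeline and set nested parameters. A short sketch, assuming my_pipe has been fitted as above:

# pull the fitted model out of the pipeline by its step name
fitted_model = my_pipe.named_steps["my_model"]
print(fitted_model.coef_[:3])

# the <step name>__<parameter> syntax sets a parameter inside a step,
# which is also how pipelines are tuned with grid search
my_pipe.set_params(my_model__fit_intercept=False)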
Column transformer#
Sometimes different columns need to be transformed in different ways. The most obvious example is the different processing of categorical and numerical columns:
For numeric columns, you need to apply normalisation techniques;
For categorical columns, you need to apply encoding (one-hot, mean, etc.).
It’s easy to build such a transformation yourself, but it’s convenient that sklearn
has an out-of-the-box solution that can be easily integrated into sklearn-type pipelines: sklearn.compose.ColumnTransformer.
In the next cell, a random data frame is generated with some categorical and some numerical columns. Let’s show how to build a pipeline component that processes categorical columns in one way and numeric columns in another.
sample_size = 500
np.random.seed(10)
generate_word = lambda: "".join([
chr(val) for val in
np.random.randint(ord("a"), ord("z") + 1, 10)
])
get_cat_var = lambda: np.random.choice(
[
generate_word() for i in
range(np.random.randint(2,7))
],
sample_size
)
get_num_var = lambda: np.random.normal(
np.random.uniform(-1,1),
np.random.uniform(1,10),
sample_size
)
variables_generator = [get_cat_var, get_num_var]
data_frame = pd.concat(
{
f"var {i}" : \
pd.Series(np.random.choice(variables_generator)())
for i in range(20)
},
axis = 1
)
data_frame.head()
var 0 | var 1 | var 2 | var 3 | var 4 | var 5 | var 6 | var 7 | var 8 | var 9 | var 10 | var 11 | var 12 | var 13 | var 14 | var 15 | var 16 | var 17 | var 18 | var 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.352738 | ghfmmekjzz | ewvspmvrkg | -3.916784 | 0.579251 | 3.078876 | jljighbmio | iieafcivri | -3.503851 | hwadgiwzth | zderdinjyy | -0.851043 | -7.137998 | -0.990391 | 4.128471 | lduutwjjin | -6.858011 | -3.455499 | kdzpmsglss | fjogwgrkig |
1 | -1.562264 | ghfmmekjzz | dlfjbofnbr | -1.458950 | 0.755219 | -0.498048 | phrxnjsbae | iieafcivri | -5.212578 | yxickhmgkp | kpqepphruh | -6.878257 | -1.712574 | -7.783903 | -3.623413 | lduutwjjin | 3.198987 | -7.290196 | eywzqkuzza | fjogwgrkig |
2 | -2.453819 | booaisyeuj | dlfjbofnbr | -0.124566 | 4.070167 | -2.271910 | lzsssmsaim | vhfoucvgil | -3.504522 | pdzajvgbzz | ynhwdgvtke | -0.838181 | 1.898630 | -6.632060 | -1.394765 | zghwqxiakd | -14.830121 | 10.490557 | irrdfszbwf | voumadgklp |
3 | -0.042513 | booaisyeuj | kkagxtgiko | -6.897858 | -0.065287 | -3.459478 | phrxnjsbae | yfmijifvmo | 0.742066 | wectjxhbio | kpqepphruh | -0.087694 | -1.808818 | 0.053985 | 0.494845 | lduutwjjin | -0.341344 | 4.539596 | eywzqkuzza | dzlpowvufa |
4 | -5.946806 | xmtwmxfxpz | dlfjbofnbr | 7.453730 | -3.450039 | 0.091773 | jljighbmio | vhfoucvgil | 1.830545 | hwadgiwzth | ynhwdgvtke | -0.218426 | 0.492733 | -2.954776 | -2.614179 | zghwqxiakd | -2.672298 | 6.436154 | kdzpmsglss | dzlpowvufa |
To prepare a transformer that handles different columns in different ways, you need to pass a list of transformers to the transformers
parameter of the sklearn.compose.ColumnTransformer
constructor.
Each element of that list should be of the form (<transformer name>, <transformer object>, <columns that will use this transformer>).
In the following cell such an object is created; the output shows how it renders in Jupyter and a possible result of the transformation applied to the data frame described above.
numeric_columns = list(data_frame.select_dtypes("number").columns)
categorical_columns = list(set(data_frame.columns) - set(numeric_columns))
my_transformer = ColumnTransformer(
transformers = [
("one_hot_encoder", OneHotEncoder(), categorical_columns),
("standart_scaler", StandardScaler(), numeric_columns)
]
)
display(my_transformer)
display(
pd.DataFrame(
my_transformer.fit_transform(data_frame)
).head()
)
ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), ['var 10', 'var 18', 'var 7', 'var 2', 'var 15', 'var 9', 'var 6', 'var 19', 'var 1']), ('standard_scaler', StandardScaler(), ['var 0', 'var 3', 'var 4', 'var 5', 'var 8', 'var 11', 'var 12', 'var 13', 'var 14', 'var 16', 'var 17'])])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | -1.169878 | 0.128906 | 1.319012 | -0.962918 | -0.316245 | -1.378868 | -0.008420 | 1.003985 | -0.811118 | -0.795706 |
1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | -0.391908 | 0.160042 | -0.407646 | -1.467765 | -2.575318 | -0.360064 | -1.728754 | -1.025626 | 0.436845 | -1.531170 |
2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.030460 | 0.746601 | -1.263927 | -0.963117 | -0.311424 | 0.318060 | -1.437070 | -0.442118 | -1.800370 | 1.879037 |
3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -2.113468 | 0.014859 | -1.837191 | 0.291547 | -0.030132 | -0.378137 | 0.256049 | 0.052623 | -0.002472 | 0.737690 |
4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 2.429194 | -0.584051 | -0.122927 | 0.613140 | -0.079132 | 0.054056 | -0.505865 | -0.761387 | -0.291717 | 1.101435 |
5 rows × 47 columns
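Since ColumnTransformer is itself a regular estimator, it drops straight into a Pipeline, which closes the loop with the packed-object idea from the top of this page. A minimal sketch; the target y_demo is purely hypothetical, generated only so that fit has something to learn:

full_pipe = Pipeline([
    ("preprocessing", my_transformer),
    ("model", LinearRegression())
])

# hypothetical target, for demonstration purposes only
y_demo = np.random.normal(0, 1, sample_size)
full_pipe.fit(data_frame, y_demo)
print(full_pipe.predict(data_frame.head()))

If listing the column names by hand feels fragile, sklearn.compose.make_column_selector can select them by dtype instead, e.g. make_column_selector(dtype_include="number").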