Data transform#

scikit-learn provides many tools for organizing the flow of data through a sequence of transformations. They are extremely convenient and save a lot of time in real tasks. This section focuses on them.

Find out more in the relevant section of the Sklearn User Guide.

import sklearn
from sklearn import preprocessing
import numpy as np

Encoders#

An encoder is a transformation that converts categorical data into numeric form. The following table lists the encoders available in scikit-learn:

  • sklearn.preprocessing.OneHotEncoder: converts categorical variables into a one-hot numeric array (sparse or dense).

  • sklearn.preprocessing.OrdinalEncoder: encodes categorical features as integers (0 to n_categories - 1).

  • sklearn.preprocessing.LabelEncoder: encodes target labels with values between 0 and n_classes - 1; intended for labels, not features.

Check more details on the Encoders page.


The following cell creates a column of categorical features that will serve as an example.

inp = np.array(["a", "b", "c", "a", "c"])[:, None]
inp
array([['a'],
       ['b'],
       ['c'],
       ['a'],
       ['c']], dtype='<U1')

The following cell shows the application of the one-hot encoding to the example column.

(
    preprocessing.
    OneHotEncoder(sparse_output=False).
    fit_transform(inp)
)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])

The following code shows how to apply the sklearn.preprocessing.OrdinalEncoder to the example.

preprocessing.OrdinalEncoder().fit_transform(inp)
array([[0.],
       [1.],
       [2.],
       [0.],
       [2.]])
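Unlike the two encoders above, LabelEncoder is meant for a one-dimensional array of target labels rather than a feature matrix. A small sketch (the example labels here are made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["cat", "dog", "cat", "bird"])

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# Classes are sorted alphabetically: bird -> 0, cat -> 1, dog -> 2
print(encoded)                 # [1 2 1 0]
print(label_encoder.classes_)  # ['bird' 'cat' 'dog']

# The mapping is invertible
print(label_encoder.inverse_transform(encoded))
```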

Pack object#

The sklearn objects follow a common API. This API provides special abstractions that compose several objects into a single one. The resulting object includes all the required transformations and, as a final step, a model. The classes that compose other sklearn objects are:

  • The sklearn.pipeline.Pipeline: allows you to apply transformations one by one.

  • The sklearn.compose.ColumnTransformer: lets you define which transformations are applied to which columns.

Check more on the Pack object page.


Let’s consider an example of how to construct such a solution. The following cell generates the data that we will use in the example.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    random_state=42  
)

Suppose that for some columns you have to use discretization, and for others, standardization. You can combine the discretizer and the standard scaler into a single object that concatenates the outputs of both, as shown in the following cell:

discretizer = KBinsDiscretizer(
    n_bins=3,
    quantile_method='linear'
)
standard_scaler = StandardScaler()

column_transformer = ColumnTransformer([
    ("discretizer",  discretizer, [0, 3, 5]),
    ("standard_scaler", standard_scaler, [1, 2, 4])
])
column_transformer
ColumnTransformer(transformers=[('discretizer',
                                 KBinsDiscretizer(n_bins=3,
                                                  quantile_method='linear'),
                                 [0, 3, 5]),
                                ('standard_scaler', StandardScaler(),
                                 [1, 2, 4])])

The output of these transformations should be fed into the logistic regression. To chain the column transformer and the model, use a pipeline as shown in the following cell.

pipeline = Pipeline([
    ("column_transformer", column_transformer),
    ("log_regression", LogisticRegression())
])
pipeline
Pipeline(steps=[('column_transformer',
                 ColumnTransformer(transformers=[('discretizer',
                                                  KBinsDiscretizer(n_bins=3,
                                                                   quantile_method='linear'),
                                                  [0, 3, 5]),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  [1, 2, 4])])),
                ('log_regression', LogisticRegression())])

Since everything you need is composed into the pipeline, you can fit and evaluate your model in just one line:

pipeline.fit(X, y).score(X, y)
0.778
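Because the composed pipeline behaves like any other estimator, it also plugs directly into model-selection utilities, which refit all its steps inside each fold. A sketch using cross_val_score (the quantile_method argument is omitted here so the sketch also runs on older scikit-learn releases; the exact scores will vary):

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, n_classes=2, random_state=42
)

# Same composition as above; quantile_method is left at its default
pipeline = Pipeline([
    ("column_transformer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(n_bins=3), [0, 3, 5]),
        ("standard_scaler", StandardScaler(), [1, 2, 4]),
    ])),
    ("log_regression", LogisticRegression()),
])

# The whole pipeline is cloned and refit on each training split, so the
# discretizer bins and scaler statistics never leak from the test folds
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```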