Data transform#

Scikit-learn provides many tools for organising the "flow of data" through preprocessing and modelling steps. They are extremely convenient and save a lot of time in real tasks. This section focuses on them.

Find out more in the relevant section of the scikit-learn User Guide.

import sklearn
from sklearn import preprocessing
import numpy as np

Encoders#

An encoder is a transformer that converts categorical data into numeric values. The following table lists the encoders available in scikit-learn:

| Encoder | Description |
| --- | --- |
| `sklearn.preprocessing.OneHotEncoder` | Converts categorical variables into a one-hot numeric array (sparse or dense). |
| `sklearn.preprocessing.OrdinalEncoder` | Encodes categorical features as integers (0 to n_categories - 1). |
| `sklearn.preprocessing.LabelEncoder` | Encodes target labels with values between 0 and n_classes - 1. Intended for labels, not features. |

Check the [Encoders](data_transform/encoders.ipynb) page for more details.


The following cell creates a column of categorical features that will serve as an example.

inp = np.array(["a", "b", "c", "a", "c"])[:, None]
inp
array([['a'],
       ['b'],
       ['c'],
       ['a'],
       ['c']], dtype='<U1')

The following cell applies one-hot encoding to the example column.

(
    preprocessing.
    OneHotEncoder(sparse_output=False).
    fit_transform(inp)
)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])
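
A fitted encoder also remembers the categories it has seen and can map encoded rows back to the original labels. The following is a minimal sketch (not part of the original example) using the same input column:

encoder = preprocessing.OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(inp)

# categories learned during fitting define the column order
print(encoder.categories_)

# inverse_transform recovers the original labels
print(encoder.inverse_transform(encoded))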

The following code shows how to apply the sklearn.preprocessing.OrdinalEncoder to the example.

preprocessing.OrdinalEncoder().fit_transform(inp)
array([[0.],
       [1.],
       [2.],
       [0.],
       [2.]])
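
`LabelEncoder` is intended for target labels rather than features, so it expects a one-dimensional array. A minimal sketch, assuming the same category values as above:

labels = np.array(["a", "b", "c", "a", "c"])
preprocessing.LabelEncoder().fit_transform(labels)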