Data transform#

scikit-learn provides many tools for organizing the flow of data through a sequence of transformations. They are extremely convenient and save a lot of time in real tasks. This section focuses on them.

Find out more in the relevant section of the Sklearn User Guide.

import sklearn
from sklearn import preprocessing
import numpy as np

Encoders#

An encoder is a transformation that converts categorical data into numeric form. The following table lists the encoders available in scikit-learn:

  • sklearn.preprocessing.OneHotEncoder: converts categorical variables into a one-hot numeric array (sparse or dense).

  • sklearn.preprocessing.OrdinalEncoder: encodes categorical features as integers (0 to n_categories - 1).

  • sklearn.preprocessing.LabelEncoder: encodes target labels with values between 0 and n_classes - 1; intended for labels, not features.

Check more details on the Encoders page.


The following cell creates a column of categorical features that will serve as an example.

inp = np.array(["a", "b", "c", "a", "c"])[:, None]
inp
array([['a'],
       ['b'],
       ['c'],
       ['a'],
       ['c']], dtype='<U1')

The following cell shows the application of the one-hot encoding to the example column.

(
    preprocessing.
    OneHotEncoder(sparse_output=False).
    fit_transform(inp)
)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])

The following code shows how to apply the sklearn.preprocessing.OrdinalEncoder to the example.

preprocessing.OrdinalEncoder().fit_transform(inp)
array([[0.],
       [1.],
       [2.],
       [0.],
       [2.]])
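Unlike the two encoders above, LabelEncoder is meant for a one-dimensional array of target labels rather than a feature matrix. A small sketch (the example labels here are made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["cat", "dog", "cat", "bird"])

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# Classes are sorted alphabetically: bird -> 0, cat -> 1, dog -> 2
print(encoded)                 # [1 2 1 0]
print(label_encoder.classes_)  # ['bird' 'cat' 'dog']

# The mapping is invertible
print(label_encoder.inverse_transform(encoded))
```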

Pack object#

The sklearn objects follow a common API. This API provides special abstractions that compose several objects into a single one. The resulting object includes all the required transformations and, as a final step, a model. The classes that compose other sklearn objects are:

  • The sklearn.pipeline.Pipeline: allows you to apply transformations one by one.

  • The sklearn.compose.ColumnTransformer: lets you define which transformations are applied to which columns.

Check more on the Pack object page.


Let’s consider an example of how to construct such a solution. The following cell generates the data that we will use in the example.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    random_state=42  
)

Suppose that for some columns you have to use discretization, and for others, standardization. You can combine the discretizer and the standard scaler into a single object that concatenates the outputs of both, as shown in the following cell:

discretizer = KBinsDiscretizer(
    n_bins=3,
    quantile_method='linear'
)
standard_scaler = StandardScaler()

column_transformer = ColumnTransformer([
    ("discretizer",  discretizer, [0, 3, 5]),
    ("standard_scaler", standard_scaler, [1, 2, 4])
])
column_transformer
ColumnTransformer(transformers=[('discretizer',
                                 KBinsDiscretizer(n_bins=3,
                                                  quantile_method='linear'),
                                 [0, 3, 5]),
                                ('standard_scaler', StandardScaler(),
                                 [1, 2, 4])])

The output of these transformations should be fed into the logistic regression. To chain the column transformer and the model, use a pipeline as shown in the following cell.

pipeline = Pipeline([
    ("column_transformer", column_transformer),
    ("log_regression", LogisticRegression())
])
pipeline
Pipeline(steps=[('column_transformer',
                 ColumnTransformer(transformers=[('discretizer',
                                                  KBinsDiscretizer(n_bins=3,
                                                                   quantile_method='linear'),
                                                  [0, 3, 5]),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  [1, 2, 4])])),
                ('log_regression', LogisticRegression())])

Since everything you need is composed into the pipeline, you can fit and evaluate your model in just one line:

pipeline.fit(X, y).score(X, y)
0.778
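Because the composed pipeline behaves like any other estimator, it also plugs directly into model-selection utilities, which refit all its steps inside each fold. A sketch using cross_val_score (the quantile_method argument is omitted here so the sketch also runs on older scikit-learn releases; the exact scores will vary):

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, n_classes=2, random_state=42
)

# Same composition as above; quantile_method is left at its default
pipeline = Pipeline([
    ("column_transformer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(n_bins=3), [0, 3, 5]),
        ("standard_scaler", StandardScaler(), [1, 2, 4]),
    ])),
    ("log_regression", LogisticRegression()),
])

# The whole pipeline is cloned and refit on each training split, so the
# discretizer bins and scaler statistics never leak from the test folds
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```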