Data transform
scikit-learn offers many tools for organising the “flow of data” from raw inputs to a model. They are convenient and save a lot of time in real tasks, and this section focuses on them.
Find out more in the relevant section of the Sklearn User Guide.
import sklearn
from sklearn import preprocessing
import numpy as np
Encoders
An encoder is a transformation that converts categorical data into numeric form. The following table lists the ones available in scikit-learn:
Encoder | Description |
---|---|
OneHotEncoder | Converts categorical variables into a one-hot numeric array (sparse or dense). |
OrdinalEncoder | Encodes categorical features as integers (0 to n_categories - 1). |
LabelEncoder | Encodes target labels with values between 0 and n_classes - 1. |
See the [Encoders](data_transform/encoders.ipynb) page for more details.
The following cell creates a column of categorical features that will serve as an example.
inp = np.array(["a", "b", "c", "a", "c"])[:, None]
inp
array([['a'],
['b'],
['c'],
['a'],
['c']], dtype='<U1')
The following cell shows the application of the one-hot encoding to the example column.
(
preprocessing.
OneHotEncoder(sparse_output=False).
fit_transform(inp)
)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.]])
The following code shows how to apply sklearn.preprocessing.OrdinalEncoder to the example.
preprocessing.OrdinalEncoder().fit_transform(inp)
array([[0.],
[1.],
[2.],
[0.],
[2.]])
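The third encoder, sklearn.preprocessing.LabelEncoder, is intended for target labels rather than feature columns, so it expects a one-dimensional input. The following is a minimal sketch, assuming the same example values are treated as labels; the names `labels` and `label_encoder` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Same example column as above
inp = np.array(["a", "b", "c", "a", "c"])[:, None]

# LabelEncoder works on 1-D arrays, so the column is flattened first
label_encoder = preprocessing.LabelEncoder()
labels = label_encoder.fit_transform(inp.ravel())
print(labels)

# classes_ stores the learned labels; the position of each label is its code
print(label_encoder.classes_)
```

Unlike OrdinalEncoder, which returns a two-dimensional float array, this returns a one-dimensional integer array.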