Developing transformer#
This page focuses on building your own sklearn transformers.
For more details, check the "Developing scikit-learn estimators" guide.
The following cell makes the imports used in the examples on this page; the example data frame is generated in the next section.
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.utils.estimator_checks import check_estimator
Creating Example#
This cell creates the example data frame that is used on the pages in this section.
np.random.seed(10)
char_1 = ["a", "b"]
char_2 = ["x", "y"]
sample_size = 200
example_data = pd.DataFrame({
    f"{c1} {c2}": np.random.normal(size=sample_size)
    for c1 in char_1 for c2 in char_2
})
display(example_data.head())
example_data.to_parquet(
    Path("developing_transformer")/"example_frame.parquet"
)
| | a x | a y | b x | b y |
|---|---|---|---|---|
| 0 | 1.331587 | 0.133137 | 0.462386 | 0.570693 |
| 1 | 0.715279 | 1.202744 | -1.219856 | -0.512875 |
| 2 | -1.545400 | -1.024753 | 0.192573 | 0.275782 |
| 3 | -0.008384 | 0.160399 | 0.435450 | -0.389282 |
| 4 | 0.621336 | -1.130475 | -1.634944 | 0.648529 |
Minimum setup#
For a minimal setup you need to:
- Inherit the class from BaseEstimator and TransformerMixin;
- Implement __init__, which defines the hyperparameters of the transformer;
- Implement the fit and transform methods.
The following cell defines a transformer that subtracts the given columns of the input data from each other.
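from sklearn.base import BaseEstimator, TransformerMixin

class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns: list, B_columns: list):
        # Hyperparameters are stored unchanged under the same names.
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns the instance.
        return self

    def transform(self, X):
        # Element-wise difference between the A columns and the B columns.
        return (
            X[self.A_columns].to_numpy() -
            X[self.B_columns].to_numpy()
        )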
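As a rough illustration of what the two base classes contribute, the sketch below redefines the same transformer (so that the cell is self-contained) and calls helpers that the class never implements itself: get_params, fit_transform, and sklearn.base.clone. The `toy` frame is a hypothetical two-row input made up for this illustration and is not part of the page's example data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns, B_columns):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.A_columns].to_numpy() - X[self.B_columns].to_numpy()


# Hypothetical toy input, used only in this illustration.
toy = pd.DataFrame({"a x": [1.0, 2.0], "b x": [0.5, 4.0]})
transformer = ColumnsSubtraction(["a x"], ["b x"])

# Inherited from BaseEstimator: reads the hyperparameters named in __init__.
params = transformer.get_params()
print(params)

# Inherited from TransformerMixin: calls fit and then transform.
result = transformer.fit_transform(toy)
print(result)

# clone relies on get_params to build a new instance with the same settings.
copied = clone(transformer)
print(copied.get_params() == params)
```

This appears to be why __init__ stores each argument unchanged under its own name: get_params and clone look the hyperparameters up by the names in the __init__ signature.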
The following cell shows that instances of such a transformer can transform data.
display(
    ColumnsSubtraction(["a x", "a y"], ["b x", "b y"])
    .transform(example_data)[:5, :]
)
display(
    ColumnsSubtraction(["a x", "b x"], ["a y", "b y"])
    .transform(example_data)[:5, :]
)
array([[ 0.86920095, -0.43755596],
[ 1.93513451, 1.71561907],
[-1.73797313, -1.30053539],
[-0.44383424, 0.54968071],
[ 2.25628011, -1.7790047 ]])
array([[ 1.19844901, -0.1083079 ],
[-0.4874649 , -0.70698034],
[-0.52064733, -0.08320958],
[-0.16878301, 0.82473193],
[ 1.75181126, -2.28347355]])
You can also use them in sklearn.pipeline.FeatureUnion:
test_union = FeatureUnion([
    ("a-b", ColumnsSubtraction(["a x", "a y"], ["b x", "b y"])),
    ("x-y", ColumnsSubtraction(["a x", "b x"], ["a y", "b y"]))
])
test_union.fit_transform(example_data)[:5,:]
array([[ 0.86920095, -0.43755596, 1.19844901, -0.1083079 ],
[ 1.93513451, 1.71561907, -0.4874649 , -0.70698034],
[-1.73797313, -1.30053539, -0.52064733, -0.08320958],
[-0.44383424, 0.54968071, -0.16878301, 0.82473193],
[ 2.25628011, -1.7790047 , 1.75181126, -2.28347355]])
The sklearn.pipeline.FeatureUnion from the previous cell also works as a step in a sklearn.pipeline.Pipeline:
test_pipeline = Pipeline([
    ("test_union", test_union),
    ("pca", PCA())
])
np.round(test_pipeline.fit_transform(example_data)[:5,:], 3)
array([[ 1.221, 0.587, -0.502, 0. ],
[ 0.056, 1.582, 2.072, -0. ],
[-0.522, -2.11 , -0.93 , -0. ],
[-1.084, 0.235, -0.449, -0. ],
[ 3.915, -0.036, 0.554, -0. ]])
check_estimator#
Such a transformer doesn't pass sklearn.utils.estimator_checks.check_estimator. It looks like this happens because the check runs transform on the instance and passes a random numpy.array as input, while this transformer expects a pandas.DataFrame.
The following example shows this error:
from sklearn.utils.estimator_checks import check_estimator
try:
    check_estimator(ColumnsSubtraction(["a x", "a y"], ["b x", "b y"]))
except Exception as e:
    print(e)
only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
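The same kind of failure can be reproduced without scikit-learn. The sketch below applies label-based selection, as used in transform, to a plain numpy.array. The array `X` is a hypothetical stand-in for the input that the check generates; its real shape and values are not known here.

```python
import numpy as np

# Hypothetical stand-in for the array the check passes in: positions, no labels.
X = np.arange(8.0).reshape(2, 4)

try:
    # Label-based selection, the same operation the transformer performs.
    X[["a x", "a y"]]
    error_type = None
except IndexError as error:
    error_type = type(error).__name__
    print(error)

# An array only supports position-based selection.
first_two = X[:, [0, 1]]
print(first_two)
```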
But the transformer class itself is not the problem. The problem lies in the type of transformation it performs: its logic is based on column names, which numpy.array doesn't have. The following cell shows that a FunctionTransformer wrapping the same transformation in a function fails check_estimator in the same way.
from sklearn.preprocessing import FunctionTransformer
try:
    check_estimator(FunctionTransformer(
        lambda X: X[["a x", "a y"]] - X[["b x", "b y"]]
    ))
except Exception as e:
    print(e)
only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
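One possible way to avoid depending on column names is to select columns by integer position. The sketch below is a hypothetical variant: the class name PositionsSubtraction, its parameters, and the `frame` input are all made up for illustration. It only shows that the same subtraction then accepts both a pandas.DataFrame and a numpy.array. It is not claimed to pass check_estimator, which imposes further requirements that are not covered on this page.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PositionsSubtraction(BaseEstimator, TransformerMixin):
    """Hypothetical variant that selects columns by integer position."""

    def __init__(self, A_positions=(0,), B_positions=(1,)):
        self.A_positions = A_positions
        self.B_positions = B_positions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.asarray accepts both DataFrame and ndarray input.
        X = np.asarray(X)
        return X[:, list(self.A_positions)] - X[:, list(self.B_positions)]


# Hypothetical toy input, used only in this illustration.
frame = pd.DataFrame({
    "a x": [1.0, 2.0], "a y": [3.0, 5.0],
    "b x": [0.5, 0.5], "b y": [1.0, 1.0],
})
transformer = PositionsSubtraction([0, 1], [2, 3])

from_frame = transformer.fit_transform(frame)
from_array = transformer.fit_transform(frame.to_numpy())
print(np.array_equal(from_frame, from_array))
```

The trade-off is readability: positions such as 0 and 2 say less than names such as "a x" and "b x", and they silently select different data if the column order changes.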