Developing transformer#

This page focuses on building your own sklearn transformers.

For more details, check the Developing scikit-learn estimators guide.

The following cell makes the imports that will be used for the examples on this page.

import numpy as np
import pandas as pd

from pathlib import Path

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.utils.estimator_checks import check_estimator

Creating Example#

This cell creates the example data frame that's used on the pages of this section.

np.random.seed(10)
char_1 = ["a", "b"]
char_2 = ["x", "y"]
sample_size = 200
example_data = pd.DataFrame({
    f"{c1} {c2}": np.random.normal(size=sample_size)
    for c1 in char_1 for c2 in char_2
})
display(example_data.head())
example_data.to_parquet(
    Path("developing_transformer")/"example_frame.parquet"
)
        a x       a y       b x       b y
0  1.331587  0.133137  0.462386  0.570693
1  0.715279  1.202744 -1.219856 -0.512875
2 -1.545400 -1.024753  0.192573  0.275782
3 -0.008384  0.160399  0.435450 -0.389282
4  0.621336 -1.130475 -1.634944  0.648529

Minimum setup#

For the minimum setup you need to:

  • Inherit your class from BaseEstimator and TransformerMixin;

  • Implement __init__, which defines the hyperparameters of the transformer;

  • Implement the fit and transform methods.

The following cell defines a ColumnsSubtraction transformer, which subtracts the given columns of the input data from each other.

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    """Subtracts the B_columns of the input data frame from the A_columns."""

    def __init__(self, A_columns: list, B_columns: list):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        # Nothing to learn from the data - the transformation is stateless
        return self

    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() -
            X[self.B_columns].to_numpy()
        )

The following cell shows that instances of such a transformer can transform data.

display(
    ColumnsSubtraction(["a x", "a y"], ["b x", "b y"])
    .transform(example_data)[:5, :]
)
display(
    ColumnsSubtraction(["a x", "b x"], ["a y", "b y"])
    .transform(example_data)[:5, :]
)
array([[ 0.86920095, -0.43755596],
       [ 1.93513451,  1.71561907],
       [-1.73797313, -1.30053539],
       [-0.44383424,  0.54968071],
       [ 2.25628011, -1.7790047 ]])
array([[ 1.19844901, -0.1083079 ],
       [-0.4874649 , -0.70698034],
       [-0.52064733, -0.08320958],
       [-0.16878301,  0.82473193],
       [ 1.75181126, -2.28347355]])
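
Since __init__ only stores the hyperparameters as attributes, the get_params/set_params machinery inherited from BaseEstimator works without any extra code. The following cell is a small illustration of that, reusing the ColumnsSubtraction transformer defined above:

transformer = ColumnsSubtraction(["a x", "a y"], ["b x", "b y"])
# get_params reads the attributes that were set in __init__
display(transformer.get_params())
# set_params updates them and returns the same instance
display(transformer.set_params(A_columns=["a x", "b x"]).A_columns)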

You can also use such transformers in sklearn.pipeline.FeatureUnion:

test_union = FeatureUnion([
    ("a-b", ColumnsSubtraction(["a x", "a y"], ["b x", "b y"])),
    ("x-y", ColumnsSubtraction(["a x", "b x"], ["a y", "b y"]))
])
test_union.fit_transform(example_data)[:5,:]
array([[ 0.86920095, -0.43755596,  1.19844901, -0.1083079 ],
       [ 1.93513451,  1.71561907, -0.4874649 , -0.70698034],
       [-1.73797313, -1.30053539, -0.52064733, -0.08320958],
       [-0.44383424,  0.54968071, -0.16878301,  0.82473193],
       [ 2.25628011, -1.7790047 ,  1.75181126, -2.28347355]])

Using the sklearn.pipeline.FeatureUnion from the previous cell as a step of a sklearn.pipeline.Pipeline also works:

test_pipeline = Pipeline([
    ("test_union", test_union),
    ("pca", PCA())
])
np.round(test_pipeline.fit_transform(example_data)[:5,:], 3)
array([[ 1.221,  0.587, -0.502,  0.   ],
       [ 0.056,  1.582,  2.072, -0.   ],
       [-0.522, -2.11 , -0.93 , -0.   ],
       [-1.084,  0.235, -0.449, -0.   ],
       [ 3.915, -0.036,  0.554, -0.   ]])
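
Because get_params also works on nested estimators, the custom transformer's hyperparameters are reachable through the pipeline with the usual step__transformer__param naming, which is what utilities like GridSearchCV rely on. A small sketch, assuming the standard parameter-name convention applies to the test_pipeline defined above:

# nested parameter of the "a-b" transformer inside the "test_union" step
display(test_pipeline.get_params()["test_union__a-b__A_columns"])
# the same name can be used to change the hyperparameter in place
test_pipeline.set_params(**{"test_union__a-b__A_columns": ["a x", "b x"]})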

check_estimator#

Such a transformer doesn't pass sklearn.utils.estimator_checks.check_estimator. This appears to happen because the checks call transform with a randomly generated numpy.array as input, while this transformer expects a pandas.DataFrame.

The following example shows the error:

from sklearn.utils.estimator_checks import check_estimator
try:
    check_estimator(ColumnsSubtraction(["a x", "a y"], ["b x", "b y"]))
except Exception as e:
    print(e)
only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

But it's not the transformer itself that's the problem. The problem lies in the type of transformation it performs: its logic is based on column names, which a numpy.array doesn't have. The following cell shows that a FunctionTransformer applying the same transformation fails check_estimator with the same error.

from sklearn.preprocessing import FunctionTransformer
try:
    check_estimator(FunctionTransformer(
        lambda X: X[["a x", "a y"]] - X[["b x", "b y"]]
    ))
except Exception as e:
    print(e)
only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
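
One possible way around this is to make the transformation independent of column names: reference columns by position and validate the input with sklearn.utils.validation.check_array. The following cell is only a sketch of that idea - the IndexSubtraction class is a hypothetical example and is not guaranteed to satisfy every check of check_estimator in your scikit-learn version:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array

class IndexSubtraction(BaseEstimator, TransformerMixin):
    """Hypothetical variant of ColumnsSubtraction based on positional indices."""

    def __init__(self, A_indices=(0,), B_indices=(1,)):
        self.A_indices = A_indices
        self.B_indices = B_indices

    def fit(self, X, y=None):
        # check_array converts data frames and lists to a validated numpy array
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = check_array(X)
        return X[:, list(self.A_indices)] - X[:, list(self.B_indices)]

# works on the example data frame as well, since check_array accepts it
IndexSubtraction().fit_transform(example_data)[:5]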