Estimator to transformer#

I faced a problem where I needed an sklearn estimator to behave as an sklearn transformer so that I could use it as an intermediate step in an sklearn pipeline. Here I describe the issues and possible solutions associated with this case.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

Suppose you need to build a pipeline in which some columns go through a RandomForestClassifier while others go only through a StandardScaler, and the outputs of both are then used as features for a LogisticRegression.

Problem#

The main problem is that sklearn distinguishes two kinds of objects in pipelines: estimators and transformers. A RandomForestClassifier is an estimator, so you can't use it inside a ColumnTransformer the way you would use a transformer. The following cell tries to pass a RandomForestClassifier instance as a transformer, and fit raises an error.

model = RandomForestClassifier(
    n_estimators=10, 
    max_depth=10
)

transformer = ColumnTransformer([
    ("model", model, [0, 1, 2, 3]),
    ("standard_scaler", StandardScaler(), [4, 5, 6, 7, 8, 9])
])
try:
    transformer.fit(X, y)
except Exception as e:
    print(e)
All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'RandomForestClassifier(max_depth=10, n_estimators=10)' (type <class 'sklearn.ensemble._forest.RandomForestClassifier'>) doesn't.
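
The error message says it all: anything used inside a ColumnTransformer has to implement fit and transform. As a quick illustrative check (not part of the original cells), you can verify which of the two objects actually exposes a transform method:

# RandomForestClassifier is an estimator and has no transform method,
# while StandardScaler does - exactly what ColumnTransformer checks for.
print(hasattr(RandomForestClassifier(), "transform"))  # False
print(hasattr(StandardScaler(), "transform"))          # True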

Solution#

The best solution I've found so far is to define a custom transformer (read more about custom transformers in sklearn) that wraps the estimator you need, so the estimator behaves like a transformer. The following cell implements this idea:

model = RandomForestClassifier(
    n_estimators=10, 
    max_depth=10
)

class ClassifierToTransformer(BaseEstimator, TransformerMixin):
    '''
    Wraps an sklearn classifier so it can be used as a transformer:
    transform() returns the predicted probability of the positive class
    as a single column.
    '''
    def __init__(self, estimator):
        self.estimator = estimator
        super().__init__()

    def fit(self, X, y=None):
        # Fit the wrapped classifier on the given data.
        self.estimator.fit(X, y)
        return self

    def transform(self, X, y=None):
        # Keep the output two-dimensional: an (n_samples, 1) column
        # holding the probability of class 1.
        return self.estimator.predict_proba(X)[:, [1]]

transformer = ColumnTransformer([
    ("model", ClassifierToTransformer(model), [0, 1, 2, 3]),
    ("standard_scaler", StandardScaler(), [4, 5, 6, 7, 8, 9])
])

test_model = Pipeline([
    ("class_to_transform", transformer),
    ("log_reg", LogisticRegression())
])

test_model.fit(X, y)
Pipeline(steps=[('class_to_transform',
                 ColumnTransformer(transformers=[('model',
                                                  ClassifierToTransformer(estimator=RandomForestClassifier(max_depth=10,
                                                                                                           n_estimators=10)),
                                                  [0, 1, 2, 3]),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  [4, 5, 6, 7, 8, 9])])),
                ('log_reg', LogisticRegression())])
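
The pipeline now fits without errors. As a short usage sketch (the step name "class_to_transform" comes from the pipeline defined above, the slicing is purely illustrative), you can check that the ColumnTransformer step outputs one probability column from the wrapped forest plus six scaled columns, and that the full pipeline predicts as usual:

# The fitted ColumnTransformer step produces 7 columns:
# 1 probability column from the wrapped classifier + 6 scaled features.
transformed = test_model.named_steps["class_to_transform"].transform(X)
print(transformed.shape)  # expected: (1000, 7)

# The whole pipeline behaves like a regular classifier.
print(test_model.predict(X[:5]))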