Feature names out

See the scikit-learn developer API documentation for set_output for more details.

In practice, it is important to be able to preserve column names through transformations.

import numpy as np
import pandas as pd
example_data = pd.read_parquet("example_frame.parquet")

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

The class from the minimum setup in the previous section doesn't define get_feature_names_out by default, so calling that method raises an error. The following cell shows this:

class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns : list, B_columns : list):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() - 
            X[self.B_columns].to_numpy()
        )


transformer = (
    ColumnsSubtraction(
        ["a x", "a y"],
        ["b x", "b y"]
    )
    .fit(example_data)
)
try:
    transformer.get_feature_names_out()
except AttributeError as e:
    # The method is missing, so attribute lookup fails.
    print(e)
'ColumnsSubtraction' object has no attribute 'get_feature_names_out'

As a result, a composite transformer that contains such a component (a FeatureUnion, for example) cannot provide the feature-name interfaces either.
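
To illustrate this, here is a minimal sketch. It assumes a small randomly generated frame in place of example_frame.parquet, and the names demo_data and NoNamesSubtraction are made up for this illustration. The extra is_fitted_ attribute is also an assumption, added only so that scikit-learn's fitted checks pass regardless of version. The sketch wraps a transformer without get_feature_names_out in a FeatureUnion and shows that the union still transforms data but fails when asked for feature names.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

# Hypothetical stand-in for example_frame.parquet: random values, same column names.
rng = np.random.default_rng(0)
demo_data = pd.DataFrame(
    rng.normal(size=(5, 4)), columns=["a x", "a y", "b x", "b y"]
)


class NoNamesSubtraction(BaseEstimator, TransformerMixin):
    """Same subtraction logic as ColumnsSubtraction; no get_feature_names_out."""

    def __init__(self, A_columns, B_columns):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        # Assumption: mark the estimator as fitted for scikit-learn's checks.
        self.is_fitted_ = True
        return self

    def transform(self, X):
        return X[self.A_columns].to_numpy() - X[self.B_columns].to_numpy()


union = FeatureUnion([
    ("a-b", NoNamesSubtraction(["a x", "a y"], ["b x", "b y"])),
])
union.fit(demo_data)

# Transforming still works; only the feature-name interface is unavailable.
print(union.transform(demo_data).shape)

try:
    union.get_feature_names_out()
    union_error = None
except AttributeError as e:
    union_error = e
    print(e)
```

The union reports which of its components lacks the method, which is usually the quickest way to find the transformer that needs a get_feature_names_out implementation.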

Defining get_feature_names_out

Here, for our example transformer that subtracts columns, we define get_feature_names_out so that it returns column names describing the transformation: one name per output column, built from the pair of input columns that were subtracted.

class ColumnsSubtractionNames(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns : list, B_columns : list):
        # Each minuend column needs exactly one subtrahend column.
        if len(A_columns) != len(B_columns):
            raise ValueError(
                "A_columns and B_columns must contain "
                "the same number of columns."
            )
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() - 
            X[self.B_columns].to_numpy()
        )

    def get_feature_names_out(self, input_features=None):
        return np.array([
            f"'{a_col}'-'{b_col}'"
            for a_col, b_col in
            zip(self.A_columns, self.B_columns)
        ])

Instances of ColumnsSubtractionNames now return column names, so any composite transformer built from them can use the get_feature_names_out method as well.

test_union = FeatureUnion([
    ("a-b", ColumnsSubtractionNames(["a x", "a y"], ["b x", "b y"])),
    ("x-y", ColumnsSubtractionNames(["a x", "b x"], ["a y", "b y"]))
])
test_union.fit(example_data)
test_union.get_feature_names_out()
array(["a-b__'a x'-'b x'", "a-b__'a y'-'b y'", "x-y__'a x'-'a y'",
       "x-y__'b x'-'b y'"], dtype=object)

Note: Calling set_output(transform="pandas") on a transformer instance that defines get_feature_names_out makes it return pandas data frames.

test_union.set_output(transform="pandas")
test_union.fit_transform(example_data).head()
   a-b__'a x'-'b x'  a-b__'a y'-'b y'  x-y__'a x'-'a y'  x-y__'b x'-'b y'
0          0.869201         -0.437556          1.198449         -0.108308
1          1.935135          1.715619         -0.487465         -0.706980
2         -1.737973         -1.300535         -0.520647         -0.083210
3         -0.443834          0.549681         -0.168783          0.824732
4          2.256280         -1.779005          1.751811         -2.283474

Note: The built-in scikit-learn transformers return a numpy.ndarray from get_feature_names_out, so, all else being equal, your implementation should return a numpy.ndarray too.