Feature names out#

See the scikit-learn developer API documentation for set_output for more details.

In practice, it is very important to preserve column names through transformations.

import numpy as np
import pandas as pd
example_data = pd.read_parquet("example_frame.parquet")

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

The class from the minimum setup in the previous section doesn't define get_feature_names_out by default, so calling it raises an error. The following cell shows it:

class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns : list, B_columns : list):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() - 
            X[self.B_columns].to_numpy()
        )


transformer = (
    ColumnsSubtraction(
        ["a x", "a y"],
        ["b x", "b y"]
    )
    .fit(example_data)
)
try:
    transformer.get_feature_names_out()
except Exception as e:
    print(e)

'ColumnsSubtraction' object has no attribute 'get_feature_names_out'

As a result, a composite transformer (for example, a FeatureUnion) that contains such a component cannot use the interfaces that rely on feature names either.
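To illustrate how the missing method propagates, here is a minimal sketch. It assumes a small hand-made frame (toy) instead of example_data, and NoNamesSubtraction is a hypothetical stand-in for the ColumnsSubtraction class above. Asking the FeatureUnion for feature names is expected to fail because its only component does not provide them.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class NoNamesSubtraction(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ColumnsSubtraction (no get_feature_names_out)."""

    def __init__(self, A_columns, B_columns):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.A_columns].to_numpy() - X[self.B_columns].to_numpy()


# Toy data: an assumption made for illustration, not the original example_data.
toy = pd.DataFrame({"a x": [1.0, 2.0], "b x": [0.5, 0.5]})
union = FeatureUnion([("a-b", NoNamesSubtraction(["a x"], ["b x"]))]).fit(toy)

try:
    union.get_feature_names_out()
    union_failed = False
except AttributeError as error:
    # The union cannot build names because its component has none to offer.
    union_failed = True
    print(error)
```

Transforming data with the union still works; only the name-related interface is unavailable.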

Defining `get_feature_names_out`#

Here, for our example transformer that subtracts columns, we define get_feature_names_out so that it returns column names describing what the transformation does.

class ColumnsSubtractionNames(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns : list, B_columns : list):
        if len(A_columns) != len(B_columns):
            raise ValueError(
                "A_columns (minuends) and B_columns (subtrahends) "
                "must have the same length."
            )
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() - 
            X[self.B_columns].to_numpy()
        )

    def get_feature_names_out(self, input_features=None):
        return np.array([
            f"'{a_col}'-'{b_col}'"
            for a_col, b_col in
            zip(self.A_columns, self.B_columns)
        ])

Instances of ColumnsSubtractionNames return column names, so any composite transformer that includes them can use get_feature_names_out as well.

test_union = FeatureUnion([
    ("a-b", ColumnsSubtractionNames(["a x", "a y"], ["b x", "b y"])),
    ("x-y", ColumnsSubtractionNames(["a x", "b x"], ["a y", "b y"]))
])
test_union.fit(example_data)
test_union.get_feature_names_out()

array(["a-b__'a x'-'b x'", "a-b__'a y'-'b y'", "x-y__'a x'-'a y'",
       "x-y__'b x'-'b y'"], dtype=object)

Note: Calling set_output(transform="pandas") on an instance of a transformer that defines get_feature_names_out makes it return pandas data frames.

test_union.set_output(transform="pandas")
test_union.fit_transform(example_data).head()

	a-b__'a x'-'b x'	a-b__'a y'-'b y'	x-y__'a x'-'a y'	x-y__'b x'-'b y'
0	0.869201	-0.437556	1.198449	-0.108308
1	1.935135	1.715619	-0.487465	-0.706980
2	-1.737973	-1.300535	-0.520647	-0.083210
3	-0.443834	0.549681	-0.168783	0.824732
4	2.256280	-1.779005	1.751811	-2.283474

Note: All currently known transformers return a NumPy array (numpy.ndarray) from get_feature_names_out, so, all else being equal, your transformer should return one too.
