Feature names out
See the developer API for set_output for more details.
In practice, it is extremely important to be able to preserve column names through transformations.
import numpy as np
import pandas as pd
example_data = pd.read_parquet("example_frame.parquet")
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
The class from the minimum setup in the previous section doesn't define get_feature_names_out
by default, so calling that method raises an error. The following cell shows it:
class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns : list, B_columns : list):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() -
            X[self.B_columns].to_numpy()
        )
transformer = (
    ColumnsSubtraction(
        ["a x", "a y"],
        ["b x", "b y"]
    )
    .fit(example_data)
)
try:
    transformer.get_feature_names_out()
except Exception as e:
    print(e)
'ColumnsSubtraction' object has no attribute 'get_feature_names_out'
As a result, a composite transformer that contains such a component cannot use the interfaces associated with feature names.
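To make this consequence concrete, the sketch below places the transformer without get_feature_names_out inside a FeatureUnion. It assumes a small synthetic frame, demo_frame, which is a hypothetical stand-in for example_data with the same column names. Under that assumption, fitting and transforming should still work, while asking the union for feature names should fail.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

# Hypothetical stand-in for example_data: random values, same column names.
rng = np.random.default_rng(0)
demo_frame = pd.DataFrame(
    rng.normal(size=(5, 4)),
    columns=["a x", "a y", "b x", "b y"],
)


class ColumnsSubtraction(BaseEstimator, TransformerMixin):
    """Same transformer as above, without get_feature_names_out."""

    def __init__(self, A_columns, B_columns):
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.A_columns].to_numpy() - X[self.B_columns].to_numpy()


union = FeatureUnion([
    ("a-b", ColumnsSubtraction(["a x", "a y"], ["b x", "b y"])),
])

# Fitting and transforming do not depend on feature names.
values = union.fit_transform(demo_frame)
print(values.shape)

# The feature-name interface needs every component to provide names.
names_error = None
try:
    union.get_feature_names_out()
except AttributeError as e:
    names_error = e
    print(e)
```

The numeric output is unaffected; only the name-related functionality is lost.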
Defining get_feature_names_out
Here, for our example transformer that subtracts columns, we define get_feature_names_out,
which returns column names that describe the transformation.
class ColumnsSubtractionNames(BaseEstimator, TransformerMixin):
    def __init__(self, A_columns : list, B_columns : list):
        # Columns are subtracted pairwise, so both lists need the same length.
        if len(A_columns) != len(B_columns):
            raise ValueError(
                "A_columns and B_columns must contain "
                "the same number of columns."
            )
        self.A_columns = A_columns
        self.B_columns = B_columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (
            X[self.A_columns].to_numpy() -
            X[self.B_columns].to_numpy()
        )

    def get_feature_names_out(self, input_features=None):
        # One name per output column, for example "'a x'-'b x'".
        return np.array([
            f"'{a_col}'-'{b_col}'"
            for a_col, b_col in
            zip(self.A_columns, self.B_columns)
        ])
Instances of ColumnsSubtractionNames
return column names, so any composite transformer that contains them can use the get_feature_names_out
method as well.
test_union = FeatureUnion([
    ("a-b", ColumnsSubtractionNames(["a x", "a y"], ["b x", "b y"])),
    ("x-y", ColumnsSubtractionNames(["a x", "b x"], ["a y", "b y"]))
])
test_union.fit(example_data)
test_union.get_feature_names_out()
array(["a-b__'a x'-'b x'", "a-b__'a y'-'b y'", "x-y__'a x'-'a y'",
"x-y__'b x'-'b y'"], dtype=object)
Note: Calling set_output(transform="pandas")
on an instance of a transformer that defines get_feature_names_out
makes it return pandas data frames.
test_union.set_output(transform="pandas")
test_union.fit_transform(example_data).head()
| | a-b__'a x'-'b x' | a-b__'a y'-'b y' | x-y__'a x'-'a y' | x-y__'b x'-'b y' |
|---|---|---|---|---|
| 0 | 0.869201 | -0.437556 | 1.198449 | -0.108308 |
| 1 | 1.935135 | 1.715619 | -0.487465 | -0.706980 |
| 2 | -1.737973 | -1.300535 | -0.520647 | -0.083210 |
| 3 | -0.443834 | 0.549681 | -0.168783 | 0.824732 |
| 4 | 2.256280 | -1.779005 | 1.751811 | -2.283474 |
Note: All currently known transformers return a NumPy array (numpy.ndarray)
from get_feature_names_out,
so, all else being equal, your transformer should return one too.