Intro#
Is a library that implements many tools for machine learning and advanced data analysis.
Data transform#
The Sklearn provides a set of tools for data processing. The following table lists some of these classes.
Area of Preprocessing |
Class Name |
Description |
---|---|---|
Scaling & Standardization |
|
Standardizes features by removing the mean and scaling to unit variance (\(z\)-score). |
|
Scales features to a given range, typically \([0, 1]\). |
|
|
Scales each feature by its maximum absolute value, typically to the range \([-1, 1]\). Useful for sparse data. |
|
|
Scales features using statistics that are robust to outliers (e.g., median and interquartile range). |
|
Normalization |
|
Normalizes samples individually to have unit norm (e.g., \(l_1\) or \(l_2\) norm). |
Missing Values Imputation |
|
Imputes missing values using a simple strategy (e.g., mean, median, most frequent, or a constant value). |
|
Imputes missing values using the K-Nearest Neighbors approach. |
|
|
Imputes missing values by modeling each feature with missing values as a function of other features and using that estimate for imputation. |
|
Categorical Encoding |
|
Encodes categorical features as a one-hot or dummy feature matrix. |
|
Encodes categorical features into ordinal integers. |
|
Feature Transformation |
|
Generates polynomial and interaction features (e.g., \(x_1^2, x_1 x_2, x_2^2\)). |
|
Allows creating a transformer from an arbitrary function. |
|
|
Transforms features using quantiles, forcing the data to follow a uniform or normal distribution. |
|
|
Applies power transforms (e.g., Yeo-Johnson or Box-Cox) to make data more Gaussian-like. |
|
Discretization / Binning |
|
Bins continuous data into \(k\) discrete intervals. |
Pipeline/Composition |
|
Sequentially applies a list of transformers and a final estimator. |
|
Applies different transformers to different subsets of features. |
For more details, check the Data Transform page.