# Metrics

This section covers metrics for numerically evaluating regression, classification, and clustering models, as well as approaches for interpreting model decisions.

```python
import numpy as np
import matplotlib.pyplot as plt
```

## Regression

Regression metrics quantify how close the predicted values are to the true values. The following table lists the common regression metrics.

| Metric | Formula (simplified) | Range | Interpretation | Pros | Cons |
|---|---|---|---|---|---|
| MSE (Mean Squared Error) | \( \frac{1}{N}\sum (y - \hat{y})^2 \) | \([0, \infty)\) | Average squared difference between predicted and true values. | Penalizes large errors heavily, differentiable. | Not robust to outliers, less interpretable. |
| RMSE (Root Mean Squared Error) | \( \sqrt{\frac{1}{N}\sum (y - \hat{y})^2} \) | \([0, \infty)\) | Square root of MSE, in same units as target. | Interpretable (same units), penalizes large errors. | Still sensitive to outliers. |
| MAE (Mean Absolute Error) | \(\frac{1}{N}\sum \vert y - \hat{y} \vert \) | \([0, \infty)\) | Average absolute difference. | Robust to outliers, easy to interpret. | Less sensitive to large errors, not differentiable at 0. |
| MAPE (Mean Absolute Percentage Error) | \( \frac{100}{N}\sum \frac{\vert y - \hat{y} \vert}{\vert y \vert} \) | \([0, \infty)\) | Avg. relative error in percentage. | Scale-independent, interpretable. | Undefined at \(y=0\), biased toward small values. |
| R² (Coefficient of Determination) | \(1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}\) | \((-\infty, 1]\) | Proportion of variance explained by model. | Intuitive measure of fit. | Can be negative, misleading for non-linear models. |
| Adj. R² (Adjusted R²) | \(1 - (1-R^2)\frac{N-1}{N-p-1}\) | \((-\infty, 1]\) | R² adjusted for number of predictors. | Penalizes overfitting, better for model comparison. | Interpretation less direct than R². |
| MedAE (Median Absolute Error) | \(\text{median}(\vert y - \hat{y} \vert)\) | \([0, \infty)\) | Median absolute difference. | Very robust to outliers. | Ignores distribution of other errors. |
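The cell below shows one way to compute these metrics directly with numpy. The `y` and `y_hat` arrays are made-up values used only for illustration; `sklearn.metrics` provides equivalent ready-made functions.

```python
import numpy as np

# Made-up true values and predictions, purely for illustration
y = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_hat = np.array([2.5, 0.0, 2.1, 7.8, 4.0])

mse = np.mean((y - y_hat) ** 2)                       # Mean Squared Error
rmse = np.sqrt(mse)                                   # Root Mean Squared Error
mae = np.mean(np.abs(y - y_hat))                      # Mean Absolute Error
mape = 100 * np.mean(np.abs(y - y_hat) / np.abs(y))   # MAPE (undefined if any y == 0)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # R²
medae = np.median(np.abs(y - y_hat))                  # Median Absolute Error

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
print(f"MAPE={mape:.1f}%  R²={r2:.3f}  MedAE={medae:.3f}")
```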

## Classification

Classification metrics evaluate how well predicted classes or class scores correspond to the true classes. The following table lists the most important ones.

| Metric | Formula (simplified) | Range | Interpretation | Pros | Cons |
|---|---|---|---|---|---|
| Accuracy | \(\frac{TP + TN}{TP + TN + FP + FN}\) | \([0, 1]\) | Fraction of correct predictions. | Easy to interpret. | Misleading with imbalanced data. |
| Precision | \(\frac{TP}{TP + FP}\) | \([0, 1]\) | How many predicted positives are correct. | Good when FP cost is high. | Ignores false negatives. |
| Recall (Sensitivity, TPR) | \(\frac{TP}{TP + FN}\) | \([0, 1]\) | How many actual positives are detected. | Good when FN cost is high. | Ignores false positives. |
| F1-score | \(2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\) | \([0, 1]\) | Harmonic mean of precision & recall. | Balances FP and FN. | Harder to interpret than accuracy. |
| Specificity (TNR) | \(\frac{TN}{TN + FP}\) | \([0, 1]\) | How many actual negatives are detected. | Complements recall. | Ignores false negatives. |
| ROC-AUC | Area under ROC curve | \([0, 1]\) | Probability model ranks positive > negative. | Threshold-independent. | Can be misleading with class imbalance. |
| Log-Loss (Cross-Entropy) | \(-\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log(\hat{p}_{i,c})\) | \([0, \infty)\) | Penalizes wrong confident predictions. | Works with probabilities, smooth. | Harder to interpret, sensitive to mislabeled data. |
| MCC (Matthews Corr. Coeff.) | \(\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\) | \([-1, 1]\) | Correlation between predictions and truth. | Works well with imbalanced data. | Less intuitive than accuracy/precision. |
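As a quick illustration, the cell below computes several of these metrics for a toy binary problem using scikit-learn (assumed to be available); the labels and predicted probabilities are made up.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss, matthews_corrcoef)

# Made-up ground truth and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels from a 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))      # uses scores, not hard labels
print("Log-loss :", log_loss(y_true, y_prob))           # uses probabilities
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```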

Check more details on the Classification page.

## Clustering

Clustering metrics typically evaluate how internally consistent the clusters are and how well they are separated from each other. The following table lists popular clustering metrics with a short description of each:

| Metric | Formula (simplified) | Range | Interpretation | Pros | Cons |
|---|---|---|---|---|---|
| Average distance to cluster center | \(\frac{\sum_{j=1}^{n^i} d(c_i, x_j)}{n^i}\) | \([0, \infty)\) | The closer observations are to the \(i\)-th cluster center, the more consistent the cluster is. | Simple. | Does not account for possible relationships with other clusters. |
| Average distance to other cluster center | \(\frac{\sum_{j=1}^{n^i} d(c_k, x_j)}{n^i}\) | \([0, \infty)\) | The greater the distance to the other clusters' centers, the better the clusters are separated. | Simple. | Does not account for how consistent the cluster itself is. |
| Maximum distance to cluster center | \(\max_j \left( d(c_i, x_j) \right)\) | \([0, \infty)\) | The smaller the maximum distance within the cluster, the more compact the cluster is. | Extremely simple. | Sensitive to outliers. |
| Silhouette Coefficient | \(S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) | \([-1, 1]\) | Measures how well an object is matched to its own cluster compared to other clusters. A high value indicates the object is well-clustered. | Works for any number of dimensions; provides an intuitive measure of cluster quality; can be used to compare different clustering algorithms or parameter settings. | Can be computationally expensive for large datasets; can be misleading when clusters are not well-separated or have irregular shapes. |
| Davies-Bouldin Index | \(DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)\) | \([0, \infty)\) | A lower value indicates better clustering. It measures the ratio of within-cluster scatter to between-cluster separation. | Fast to compute; simple to understand; works with various cluster shapes. | Tends to favor spherical clusters; sensitive to the number of clusters; may not be suitable for clusters with complex or overlapping structures. |
| Calinski-Harabasz Index | \(CH = \frac{\text{tr}(B_k)}{\text{tr}(W_k)} \frac{N-k}{k-1}\) | \([0, \infty)\) | Higher values correspond to a better partitioning. It is a ratio of the between-cluster variance to the within-cluster variance. | Fast to compute; no assumption on cluster shapes; works for various cluster sizes. | Tends to favor spherical and compact clusters; sensitive to outliers. |
| Dunn Index | \(DI = \min_{i=1..k} \left\{ \min_{j=i+1..k} \left( \frac{d(c_i, c_j)}{\max_{l=1..k} \text{diam}(c_l)} \right) \right\}\) | \([0, \infty)\) | Higher values indicate better clustering, as it represents well-separated and compact clusters. | Does not require assumptions on cluster shapes or density. | Computationally expensive due to the need to calculate all pairwise distances; sensitive to noisy data or outliers; may not be effective for overlapping clusters. |
| Inertia (Within-Cluster Sum of Squares) | \(\sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2\) | \([0, \infty)\) | A lower value is better, as it indicates more compact clusters. Measures the sum of squared distances of samples to their closest cluster center. | Simple to calculate; widely used and easy to interpret. | Assumes clusters are spherical and of equal size; not a good measure for clusters with irregular shapes or densities; decreases with the number of clusters, so it’s not useful for comparing different numbers of clusters. |
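The cell below sketches how some of these indices can be computed with scikit-learn on synthetic blob data; the dataset, the use of KMeans, and the number of clusters are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic data with three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = km.labels_

print("Silhouette       :", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin   :", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("Inertia (WCSS)   :", km.inertia_)                         # lower is better
```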

## Model interpretation

There is a set of approaches that allow us to understand why machine learning models make certain decisions. The following table summarizes some of these approaches.

Check the Interpretable Machine Learning book for more details on interpreting machine learning models.

| Technique | Scope | Type | Description | When to Use | Pros | Cons |
|---|---|---|---|---|---|---|
| Feature Importance (coefficients) | Global | Model-specific | Uses model parameters (e.g., regression coefficients, tree splits). | Linear/logistic regression, decision trees, random forests. | Simple, direct interpretation. | Not reliable for correlated features; model-dependent. |
| Permutation Importance | Global | Model-agnostic | Measures drop in performance after shuffling a feature. | Any black-box model. | Intuitive, works with any model. | Expensive to compute; biased with correlated features. |
| Partial Dependence Plot (PDP) | Global | Model-agnostic | Shows marginal effect of one/two features on predictions. | Any model where feature effect needs visualization. | Easy to interpret, visual. | Misleading if features are correlated. |
| Accumulated Local Effects (ALE) | Global | Model-agnostic | Corrected PDPs that handle correlated features. | When feature correlation is present. | More reliable than PDPs under correlation. | Less intuitive than PDPs. |
| Global Surrogate Models | Global | Model-agnostic | Train a simple interpretable model to approximate a complex model. | Explaining black-box models globally. | Produces an interpretable model (tree, linear). | Approximation may be poor; loses fidelity. |
| LIME | Local | Model-agnostic | Fits a local interpretable model around one prediction. | Explaining individual predictions. | Widely used, intuitive. | Instability; explanations may vary across runs. |
| SHAP | Local | Model-agnostic | Shapley values distribute contributions fairly among features. | Both global + local explanation; any model. | Theoretically solid; consistent; widely adopted. | Computationally expensive; approximations needed. |
| Counterfactual Explanations | Local | Model-agnostic | Shows what small change in input flips the prediction. | Explaining actionable changes in decisions (credit, fraud). | Intuitive, user-friendly. | May suggest unrealistic changes; hard in high dimensions. |
| Individual Conditional Expectation (ICE) | Local | Model-agnostic | Shows how varying one feature changes the prediction for one instance. | When detailed instance-level effect is needed. | Complements PDPs; shows heterogeneity. | Hard to interpret with many features. |
| Decision Tree Visualization | Global | Model-specific | Shows splits and decision paths. | Tree-based models. | Easy to visualize, transparent. | Only works for small trees; large ones are messy. |
| Linear Model Coefficients | Global | Model-specific | Coefficients represent feature weights. | Linear/logistic regression. | Simple, mathematical clarity. | Sensitive to scaling and multicollinearity. |
| Generalized Additive Models (GAMs) | Global | Model-specific | Each feature has an interpretable function. | When you need a balance between accuracy and interpretability. | Captures nonlinearity while staying interpretable. | Limited flexibility compared to black-box models. |
| Attention Mechanisms | Local | Model-specific | Attention weights highlight important parts of the input (NLP, vision). | Sequence/transformer models. | Natural for sequence data; intuitive visualization. | Attention ≠ explanation (debated). |
| Grad-CAM / Saliency Maps | Local | Model-specific | Highlights image regions influencing CNN output. | Computer vision models. | Strong visual explanations. | Sometimes noisy, hard to interpret precisely. |
| Anchors | Local | Model-agnostic | High-precision if-then rules explaining a prediction. | Explaining individual decisions clearly. | Produces human-readable rules. | Can be computationally expensive; rules may not always exist. |
| Integrated Gradients | Local | Model-specific | Attributes importance by integrating gradients w.r.t. inputs. | Deep learning models. | Theoretically sound; avoids gradient saturation. | Requires baseline choice; complex math. |
| SmoothGrad | Local | Model-specific | Averages noisy gradients for stable saliency maps. | Deep learning interpretability. | Produces clearer explanations. | Adds computation overhead. |
| Feature Interaction (SHAP int.) | Both | Model-agnostic | Identifies feature interaction effects in predictions. | When interactions matter (complex models). | Highlights hidden relationships. | Hard to visualize for many features. |
| TCAV (Concept Activation Vectors) | Global | Model-specific | Explains predictions with human concepts, not raw features. | Interpreting deep nets with domain concepts. | Aligns with human reasoning. | Needs well-defined concepts; not trivial to design. |
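As an example of one model-agnostic technique from the table, the cell below sketches permutation importance with scikit-learn; the synthetic regression data and the random forest model are assumptions chosen only to make the example runnable end to end.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic regression problem and a black-box model to explain
X, y = make_regression(n_samples=500, n_features=5, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the drop in the model's score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: importance = {mean:.3f} ± {std:.3f}")
```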

Check the corresponding page for more details on some of these methods.