# Metrics
This section covers metrics for numerically evaluating machine learning models, as well as approaches to interpreting them.
```python
import numpy as np
import matplotlib.pyplot as plt
```
## Regression
Regression metrics measure how close predicted values are to the true values. The following table lists the common regression metrics; a short code sketch follows it.
| Metric | Formula (simplified) | Range | Interpretation | Pros | Cons |
|---|---|---|---|---|---|
| MSE (Mean Squared Error) | \( \frac{1}{N}\sum (y - \hat{y})^2 \) | \([0, \infty)\) | Average squared difference between predicted and true values. | Penalizes large errors heavily, differentiable. | Not robust to outliers, less interpretable. |
| RMSE (Root Mean Squared Error) | \( \sqrt{\frac{1}{N}\sum (y - \hat{y})^2} \) | \([0, \infty)\) | Square root of MSE, in same units as target. | Interpretable (same units), penalizes large errors. | Still sensitive to outliers. |
| MAE (Mean Absolute Error) | \( \frac{1}{N}\sum \vert y - \hat{y} \vert \) | \([0, \infty)\) | Average absolute difference. | Robust to outliers, easy to interpret. | Less sensitive to large errors, not differentiable at 0. |
| MAPE (Mean Absolute Percentage Error) | \( \frac{100}{N}\sum \frac{\vert y - \hat{y} \vert}{\vert y \vert} \) | \([0, \infty)\) | Avg. relative error in percentage. | Scale-independent, interpretable. | Undefined at \(y=0\), biased toward small values. |
| R² (Coefficient of Determination) | \( 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \) | \((-\infty, 1]\) | Proportion of variance explained by model. | Intuitive measure of fit. | Can be negative, misleading for non-linear models. |
| Adj. R² (Adjusted R²) | \( 1 - (1-R^2)\frac{N-1}{N-p-1} \) | \((-\infty, 1]\) | R² adjusted for number of predictors. | Penalizes overfitting, better for model comparison. | Interpretation less direct than R². |
| MedAE (Median Absolute Error) | \( median(\vert y - \hat{y} \vert) \) | \([0, \infty)\) | Median absolute difference. | Very robust to outliers. | Ignores distribution of other errors. |
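None of these need to be implemented by hand: scikit-learn covers all of them except adjusted R², which follows directly from the formula above. A minimal sketch, assuming scikit-learn is available; the arrays are toy values chosen only for illustration:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Toy data: true targets and model predictions (illustrative values only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE:  ", mse)
print("RMSE: ", np.sqrt(mse))  # same units as the target
print("MAE:  ", mean_absolute_error(y_true, y_pred))
print("MAPE: ", mean_absolute_percentage_error(y_true, y_pred))  # a fraction, not a percent
print("MedAE:", median_absolute_error(y_true, y_pred))

# Adjusted R² has no scikit-learn helper; compute it from the table's formula
# (p is the number of predictors, assumed to be 1 for this toy example)
n, p = len(y_true), 1
r2 = r2_score(y_true, y_pred)
print("R2:    ", r2)
print("Adj R2:", 1 - (1 - r2) * (n - 1) / (n - p - 1))
```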
## Classification
Classification metrics evaluate how well predicted classes or class scores correspond to the true classes. The following table lists the most important ones; a short code sketch follows it.
| Metric | Formula (simplified) | Range | Interpretation | Pros | Cons |
|---|---|---|---|---|---|
| Accuracy | \( \frac{TP + TN}{TP + TN + FP + FN} \) | \([0, 1]\) | Fraction of correct predictions. | Easy to interpret. | Misleading with imbalanced data. |
| Precision | \( \frac{TP}{TP + FP} \) | \([0, 1]\) | How many predicted positives are correct. | Good when FP cost is high. | Ignores false negatives. |
| Recall (Sensitivity, TPR) | \( \frac{TP}{TP + FN} \) | \([0, 1]\) | How many actual positives are detected. | Good when FN cost is high. | Ignores false positives. |
| F1-score | \( 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \) | \([0, 1]\) | Harmonic mean of precision & recall. | Balances FP and FN. | Harder to interpret than accuracy. |
| Specificity (TNR) | \( \frac{TN}{TN + FP} \) | \([0, 1]\) | How many actual negatives are detected. | Complements recall. | Ignores false negatives. |
| ROC-AUC | Area under ROC curve | \([0, 1]\) | Probability model ranks positive > negative. | Threshold-independent. | Can be misleading with class imbalance. |
| Log-Loss (Cross-Entropy) | \( -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log(\hat{p}_{i,c}) \) | \([0, \infty)\) | Penalizes confident wrong predictions. | Works with probabilities, smooth. | Harder to interpret, sensitive to mislabeled data. |
| MCC (Matthews Corr. Coeff.) | \( \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \) | \([-1, 1]\) | Correlation between predictions and truth. | Works well with imbalanced data. | Less intuitive than accuracy/precision. |
See the Classification page for more details.
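Most of these are a single call in scikit-learn; specificity has no dedicated helper but can be read off the confusion matrix. A minimal sketch with toy labels and probabilities; scikit-learn is an assumed dependency:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    log_loss,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy binary task: true labels, hard predictions, and predicted probabilities
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.7, 0.9])

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_prob))  # needs scores, not hard labels
print("Log-loss: ", log_loss(y_true, y_prob))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))

# Specificity: read TN and FP off the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
```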
## Clustering
Clustering metrics typically evaluate how internally consistent the clusters are and how well they are separated from each other. The following table lists popular clustering metrics with short descriptions; a short code sketch follows it.
| Metric | Formula (simplified) | Range | Interpretation | Pros | Cons |
|---|---|---|---|---|---|
| Average distance to cluster center | \( \frac{1}{n_i}\sum_{j=1}^{n_i} d(c_i, x_j) \) | \(\left[0, \infty \right)\) | The closer observations are to the \(i\)-th cluster center, the more consistent the cluster is. | Simple. | Does not account for possible relationships with other clusters. |
| Average distance to other cluster center | \( \frac{1}{n_i}\sum_{j=1}^{n_i} d(c_k, x_j), \; k \neq i \) | \(\left[0, \infty \right)\) | The greater the distance from other cluster centers, the better the clusters are separated. | Simple. | Does not account for how consistent the cluster is. |
| Maximum distance to cluster center | \( \max_j d(c_i, x_j) \) | \(\left[0, \infty \right)\) | The smaller the maximum distance within the cluster, the more compact it is. | Extremely simple. | Sensitive to outliers. |
| Silhouette Coefficient | \( S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \) | \([-1, 1]\) | Measures how well an object is matched to its own cluster compared to other clusters. A high value indicates the object is well-clustered. | Works for any number of dimensions; provides an intuitive measure of cluster quality; can be used to compare different clustering algorithms or parameter settings. | Can be computationally expensive for large datasets; can be misleading when clusters are not well-separated or have irregular shapes. |
| Davies-Bouldin Index | \( DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \) | \([0, \infty)\) | A lower value indicates better clustering. It measures the ratio of within-cluster scatter to between-cluster separation. | Fast to compute; simple to understand; works with various cluster shapes. | Tends to favor spherical clusters; sensitive to the number of clusters; may not be suitable for clusters with complex or overlapping structures. |
| Calinski-Harabasz Index | \( CH = \frac{\text{tr}(B_k)}{\text{tr}(W_k)} \frac{N-k}{k-1} \) | \([0, \infty)\) | Higher values correspond to a better partitioning. It is the ratio of the between-cluster variance to the within-cluster variance. | Fast to compute; no assumption on cluster shapes; works for various cluster sizes. | Tends to favor spherical and compact clusters; sensitive to outliers. |
| Dunn Index | \( DI = \min_{i=1..k} \left\{ \min_{j=i+1..k} \left( \frac{d(c_i, c_j)}{\max_{l=1..k} \text{diam}(c_l)} \right) \right\} \) | \([0, \infty)\) | Higher values indicate better clustering, as it represents well-separated and compact clusters. | Does not require assumptions on cluster shapes or density. | Computationally expensive due to the need to calculate all pairwise distances; sensitive to noisy data or outliers; may not be effective for overlapping clusters. |
| Inertia (Within-Cluster Sum of Squares) | \( \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2 \) | \([0, \infty)\) | A lower value is better, as it indicates more compact clusters. Measures the sum of squared distances of samples to their closest cluster center. | Simple to calculate; widely used and easy to interpret. | Assumes clusters are spherical and of equal size; not a good measure for clusters with irregular shapes or densities; decreases with the number of clusters, so it's not useful for comparing different numbers of clusters. |
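The Silhouette, Davies-Bouldin, and Calinski-Harabasz scores ship with scikit-learn, and k-means exposes its inertia directly. A minimal sketch on synthetic blobs (the distance-based metrics at the top of the table and the Dunn index have no scikit-learn helpers and would need a short manual implementation):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data: three well-separated blobs, clustered with k-means
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = km.labels_

print("Silhouette:       ", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("Inertia:          ", km.inertia_)  # within-cluster sum of squares
```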
## Model interpretation
There is a set of approaches that allow us to understand why machine learning models make the decisions they do. The following table summarizes some of these approaches; a runnable sketch of two of them closes this section.

Check the Interpretable Machine Learning book for more details on interpreting machine learning models.
| Technique | Scope | Type | Description | When to Use | Pros | Cons |
|---|---|---|---|---|---|---|
| Feature Importance (coefficients) | Global | Model-specific | Uses model parameters (e.g., regression coefficients, tree splits). | Linear/logistic regression, decision trees, random forests. | Simple, direct interpretation. | Not reliable for correlated features; model-dependent. |
| Permutation Importance | Global | Model-agnostic | Measures drop in performance after shuffling a feature. | Any black-box model. | Intuitive, works with any model. | Expensive to compute; biased with correlated features. |
| Partial Dependence Plot (PDP) | Global | Model-agnostic | Shows marginal effect of one/two features on predictions. | Any model where feature effect needs visualization. | Easy to interpret, visual. | Misleading if features are correlated. |
| Accumulated Local Effects (ALE) | Global | Model-agnostic | Corrected PDPs that handle correlated features. | When feature correlation is present. | More reliable than PDPs under correlation. | Less intuitive than PDPs. |
| Global Surrogate Models | Global | Model-agnostic | Train a simple interpretable model to approximate the complex model. | Explaining black-box models globally. | Produces interpretable model (tree, linear). | Approximation may be poor; loses fidelity. |
| LIME | Local | Model-agnostic | Fits a local interpretable model around one prediction. | Explaining individual predictions. | Widely used, intuitive. | Instability; explanations may vary across runs. |
| SHAP | Local | Model-agnostic | Shapley values distribute contributions fairly among features. | Both global + local explanation; any model. | Theoretically solid; consistent; widely adopted. | Computationally expensive; approximations needed. |
| Counterfactual Explanations | Local | Model-agnostic | Shows what small change in input flips prediction. | Explaining actionable changes in decisions (credit, fraud). | Intuitive, user-friendly. | May suggest unrealistic changes; hard in high dimensions. |
| Individual Conditional Expectation (ICE) | Local | Model-agnostic | Shows how varying one feature changes prediction for one instance. | When detailed instance-level effect is needed. | Complements PDPs; shows heterogeneity. | Hard to interpret with many features. |
| Decision Tree Visualization | Global | Model-specific | Shows splits and decision paths. | Tree-based models. | Easy to visualize, transparent. | Only works for small trees; large ones are messy. |
| Linear Model Coefficients | Global | Model-specific | Coefficients represent feature weights. | Linear/logistic regression. | Simple, mathematical clarity. | Sensitive to scaling and multicollinearity. |
| Generalized Additive Models (GAMs) | Global | Model-specific | Each feature has an interpretable function. | When you need balance between accuracy and interpretability. | Captures nonlinearity while interpretable. | Limited flexibility compared to black-box models. |
| Attention Mechanisms | Local | Model-specific | Attention weights highlight important parts of input (NLP, vision). | Sequence/transformer models. | Natural for sequence data; intuitive visualization. | Attention ≠ explanation (debated). |
| Grad-CAM / Saliency Maps | Local | Model-specific | Highlights image regions influencing CNN output. | Computer vision models. | Strong visual explanations. | Sometimes noisy, hard to interpret precisely. |
| Anchors | Local | Model-agnostic | High-precision if-then rules explaining a prediction. | Explaining individual decisions clearly. | Produces human-readable rules. | Can be computationally expensive; rules may not always exist. |
| Integrated Gradients | Local | Model-specific | Attributes importance by integrating gradients w.r.t. inputs. | Deep learning models. | Theoretically sound; avoids gradient saturation. | Requires baseline choice; complex math. |
| SmoothGrad | Local | Model-specific | Averages noisy gradients for stable saliency maps. | Deep learning interpretability. | Produces clearer explanations. | Adds computation overhead. |
| Feature Interaction (SHAP int.) | Both | Model-agnostic | Identifies feature interaction effects in predictions. | When interactions matter (complex models). | Highlights hidden relationships. | Hard to visualize for many features. |
| TCAV (Concept Activation Vectors) | Global | Model-specific | Explains predictions with human concepts, not raw features. | Interpreting deep nets with domain concepts. | Aligns with human reasoning. | Needs well-defined concepts; not trivial to design. |
See the corresponding pages for more details on some of these methods.
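Of the techniques above, permutation importance and PDP/ICE plots ship with scikit-learn; most of the others need dedicated libraries (e.g., shap, lime, captum). A minimal sketch, assuming scikit-learn and matplotlib are available; the dataset and model are arbitrary stand-ins:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

# Fit a black-box model on a small tabular dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Permutation importance: mean drop in R² after shuffling each feature
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, mean, std in sorted(
    zip(X.columns, result.importances_mean, result.importances_std),
    key=lambda item: -item[1],
):
    print(f"{name:>6}: {mean:.3f} +/- {std:.3f}")

# PDP and ICE curves for a single feature (kind="both" overlays them)
PartialDependenceDisplay.from_estimator(model, X_test, features=["bmi"], kind="both")
plt.show()
```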