Grid search CV#
sklearn.model_selection.GridSearchCV is an extremely useful tool that lets you try out your model with different combinations of hyperparameters.
Use one train/test split#
Sometimes full cross-validation grid search is unnecessary, so here is how to fit on the train data and validate on the test data once per parameter combination using sklearn.model_selection.GridSearchCV.
The GridSearchCV constructor has a cv argument. One of the options it accepts is an iterable object where each element is a tuple-like object containing train and test subsample indexes. So to achieve our goal we can pass a one-element list containing a specific train/test split.
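For illustration, here is a minimal sketch of that format with made-up index arrays (a hypothetical six-observation sample, rows 0-3 as train and rows 4-5 as test):

import numpy as np

# One-element cv iterable: a single (train indexes, test indexes) pair.
# GridSearchCV will then evaluate each parameter combination exactly once
# on this split instead of cross-validating.
single_split_cv = [(np.array([0, 1, 2, 3]), np.array([4, 5]))]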
The following example shows that a hand-coded solution and GridSearchCV used in the described way lead to the same results, but the GridSearchCV option requires much less code and gives you all the features of GridSearchCV out of the box.
This cell:

- Generates a sample;
- Performs a train/test split. Note that train_test_split is also passed an array matching the indices of the observations in the original sample, so it returns a train/test split of the sample indices as well;
- Defines the hyperparameter values that will be tried.
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Generate a toy binary classification sample
X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_redundant=0,
    n_classes=2,
    random_state=1
)

# Passing np.arange(len(X)) as an extra array makes train_test_split
# also return the train/test split of the sample indices
X_train, X_test, y_train, y_test, train_inds, test_inds = \
    train_test_split(X, y, np.arange(len(X)))

# Hyperparameter values to try
param_grid = {
    "max_leaf_nodes": [5, 10, 20, 50, 100],
    "max_depth": [3, 7, 10]
}
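As a quick sanity check (a sketch reusing the variables just defined), the returned index arrays do point back into the original sample:

# train_test_split shuffles all passed arrays consistently, so indexing
# the original sample with the returned indices reproduces the split
assert np.array_equal(X[train_inds], X_train)
assert np.array_equal(y[test_inds], y_test)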
Here is a hand-coded enumeration of all possible combinations of hyperparameters. The result is an array of \(AUC_{roc}\) estimates on the test sample.
from sklearn.metrics import roc_auc_score

my_tree = DecisionTreeClassifier(random_state=1)

roc_aucs = []
# Try every combination of the grid: max_leaf_nodes in the outer loop,
# max_depth in the inner one
for max_leaf_nodes in param_grid["max_leaf_nodes"]:
    for max_depth in param_grid["max_depth"]:
        my_tree.set_params(
            max_leaf_nodes=max_leaf_nodes,
            max_depth=max_depth
        )
        my_tree.fit(X_train, y_train)
        roc_aucs.append(roc_auc_score(
            y_test,
            my_tree.predict_proba(X_test)[:, 1]
        ))
Here the trick described above is applied. It does exactly the same as the previous cell, but in much less code.
from sklearn.model_selection import GridSearchCV

# A one-element list passed to cv makes GridSearchCV evaluate each
# parameter combination on this single train/test split
grid_search_cv = GridSearchCV(
    estimator=my_tree,
    scoring="roc_auc",
    param_grid=param_grid,
    cv=[(train_inds, test_inds)]
).fit(X, y)
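Since this is an ordinary fitted GridSearchCV object, the usual conveniences are available out of the box, for example the best parameter combination and its score:

# Best hyperparameter combination and its test-split ROC AUC
print(grid_search_cv.best_params_)
print(grid_search_cv.best_score_)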
Finally, let's compare the results of the hand-coded solution with those produced by GridSearchCV. Both columns contain the same set of scores, but GridSearchCV enumerates the parameter combinations in its own order, so the rows do not line up one-to-one.
import pandas as pd

pd.DataFrame({
    "Self code": roc_aucs,
    "GridSearchCV": grid_search_cv.cv_results_["mean_test_score"]
})
|    | Self code | GridSearchCV |
|----|-----------|--------------|
| 0  | 0.966872  | 0.966872     |
| 1  | 0.966872  | 0.963919     |
| 2  | 0.966872  | 0.963919     |
| 3  | 0.963919  | 0.963919     |
| 4  | 0.929122  | 0.963919     |
| 5  | 0.929122  | 0.966872     |
| 6  | 0.963919  | 0.929122     |
| 7  | 0.932717  | 0.932717     |
| 8  | 0.921161  | 0.932717     |
| 9  | 0.963919  | 0.932717     |
| 10 | 0.932717  | 0.966872     |
| 11 | 0.911531  | 0.929122     |
| 12 | 0.963919  | 0.921161     |
| 13 | 0.932717  | 0.911531     |
| 14 | 0.911531  | 0.911531     |
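To confirm the scores really coincide combination by combination, here is a sketch (reusing the objects built above) that matches each GridSearchCV score to the corresponding hand-coded one through cv_results_["params"]:

import itertools

import numpy as np

# The hand-coded loop iterates max_leaf_nodes in the outer loop and
# max_depth in the inner one; itertools.product reproduces that order,
# so zipping it with roc_aucs keys each score by its parameters
self_scores = dict(zip(
    itertools.product(param_grid["max_leaf_nodes"], param_grid["max_depth"]),
    roc_aucs
))

# Every combination evaluated by GridSearchCV has the same score as the
# hand-coded run with the same parameters
for params, score in zip(
    grid_search_cv.cv_results_["params"],
    grid_search_cv.cv_results_["mean_test_score"]
):
    assert np.isclose(
        self_scores[(params["max_leaf_nodes"], params["max_depth"])],
        score
    )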