Grid search CV

sklearn.model_selection.GridSearchCV is an extremely useful tool that lets you evaluate your model with different combinations of hyperparameters.
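For reference, here is a minimal sketch of the usual cross-validated usage on a toy dataset (the estimator and grid below are only for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=2,
                           n_redundant=0, random_state=1)

# 5-fold cross-validated search over a small grid of tree depths
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 7, 10]},
    scoring="roc_auc",
    cv=5
).fit(X, y)

print(search.best_params_, search.best_score_)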


Use one train/test split

Sometimes full cross-validation is not necessary, so here is how to fit on the train data and validate on the test data exactly once per parameter combination using sklearn.model_selection.GridSearchCV.

There is a cv argument in the GridSearchCV constructor. One of the options it accepts is an iterable object in which each element is a tuple-like object containing train and test subsample indices. So, to achieve our goal, we can pass a single-element list containing one specific train/test split.
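For example, such an iterable could look like this (a minimal sketch with made-up indices, shown only to illustrate the expected shape of the cv argument):

import numpy as np

# one (train_indices, test_indices) pair: GridSearchCV fits on the first array
# and scores on the second, exactly once per parameter combination
cv = [(np.array([0, 1, 2, 3]), np.array([4, 5]))]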

The following example shows that a hand-coded solution and GridSearchCV used in this way lead to the same results, but the GridSearchCV option requires far less code and gives you all of GridSearchCV's features out of the box.

This cell:

  • Generates a sample;

  • Performs a train/test split. Note that train_test_split is also passed an array with the indices of the observations in the original sample, so it returns a train/test split of the sample indices as well;

  • Defines the hyperparameter values that will be tried.

import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples = 500, 
    n_features = 2, 
    n_redundant = 0, 
    n_classes = 2,
    random_state = 1
)

X_train, X_test, y_train, y_test, train_inds, test_inds =\
    train_test_split(X, y, np.arange(len(X)))

param_grid = {
    "max_leaf_nodes" : [5, 10, 20, 50, 100],
    "max_depth" : [3, 7, 10]
}

Here is a hand-coded enumeration of all possible combinations of hyperparameters. The result is a list of \(AUC_{ROC}\) estimates on the test sample.

from sklearn.metrics import roc_auc_score

my_tree = DecisionTreeClassifier(random_state = 1)
roc_aucs = []


# evaluate every (max_leaf_nodes, max_depth) combination on the fixed train/test split
for max_leaf_nodes in param_grid["max_leaf_nodes"]:
    for max_depth in param_grid["max_depth"]:
        my_tree.set_params(
            max_leaf_nodes = max_leaf_nodes,
            max_depth = max_depth
        )
        my_tree.fit(X_train, y_train)
        roc_aucs.append(roc_auc_score(
            y_test,
            my_tree.predict_proba(X_test)[:,1]
        ))

The trick described above is used here. It does literally the same thing as the previous cell, but in much less code.

from sklearn.model_selection import GridSearchCV
grid_search_cv = GridSearchCV(
    estimator = my_tree,
    scoring = "roc_auc",
    param_grid = param_grid,
    # a single predefined train/test split instead of cross-validation
    cv = [(train_inds, test_inds)]
).fit(X, y)

Finally, let's compare the results of the hand-coded solution with the ones produced by GridSearchCV. Both columns contain the same set of scores; the rows are simply in a different order, because GridSearchCV enumerates the parameter combinations with the parameter names sorted alphabetically (max_depth varies slowest), whereas the loops above iterate over max_leaf_nodes first.

import pandas as pd
pd.DataFrame({
    "Self code" : roc_aucs,
    "GridSearchCV" : grid_search_cv.cv_results_["mean_test_score"]
})
    Self code  GridSearchCV
0    0.966872      0.966872
1    0.966872      0.963919
2    0.966872      0.963919
3    0.963919      0.963919
4    0.929122      0.963919
5    0.929122      0.966872
6    0.963919      0.929122
7    0.932717      0.932717
8    0.921161      0.932717
9    0.963919      0.932717
10   0.932717      0.966872
11   0.911531      0.929122
12   0.963919      0.921161
13   0.932717      0.911531
14   0.911531      0.911531
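To check that both approaches really score each parameter combination identically, one option is to re-run the hand-coded evaluation in the exact order stored in cv_results_["params"] and compare element-wise (a sketch assuming the objects from the cells above are still in scope):

import numpy as np

aligned_scores = []
for params in grid_search_cv.cv_results_["params"]:
    my_tree.set_params(**params)
    my_tree.fit(X_train, y_train)
    aligned_scores.append(
        roc_auc_score(y_test, my_tree.predict_proba(X_test)[:, 1])
    )

# element-wise comparison holds because both lists now follow the same ordering
print(np.allclose(aligned_scores, grid_search_cv.cv_results_["mean_test_score"]))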