XGBoost#

XGBoost is a popular package that implements the gradient boosting algorithm.

import xgboost
xgboost.set_config(verbosity=0)  # suppress XGBoost's global logging
from pprint import pformat
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

Evaluation#

XGBoost has built-in evaluation tools. You can:

  • Set the eval_metric for the estimator object. The available metrics are described in the Learning Task Parameters section of the documentation.

  • Pass eval_set to the fit method; these datasets are used to evaluate the model during the fitting process.

  • Call the evals_result method of the fitted model to access the recorded metric values.

Note: the eval_metric doesn’t influence the optimisation problem; that is the role of the objective function (in terms of XGBoost, a loss function with a regularisation component).
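To make the distinction concrete, the hypothetical cell below configures a model that minimises the default squared-error objective while only reporting mae during training; both parameter values are purely illustrative.

model = xgboost.XGBRegressor(
    objective='reg:squarederror',  # the loss that is actually optimised
    eval_metric='mae'  # only reported during training, never optimised
)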


The following cell defines a model that is evaluated using rmse and mae.

X, y = make_regression(n_samples=100, n_features=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y)
model = xgboost.XGBRegressor(
    n_estimators=10,
    eval_metric=['rmse', 'mae']
).fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False,
)

The following code retrieves the validation results.

model.evals_result()
{'validation_0': OrderedDict([('rmse',
               [88.36185611431438,
                85.5619274553268,
                82.15710350923845,
                83.82585462742627,
                85.4184908720323,
                87.2008393703865,
                87.60270608669752,
                87.69609884330428,
                87.66542240780693,
                88.09251860390917]),
              ('mae',
               [68.94008346557617,
                64.97411354064941,
                59.32854522705078,
                61.621359786987306,
                64.64699867248535,
                67.44699798583984,
                68.72293533325195,
                69.23456924438477,
                69.49240661621094,
                70.03242248535156])])}
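
As a side note, the same eval_set mechanism powers early stopping. The following is a minimal sketch, assuming an XGBoost version (>= 1.6) where early_stopping_rounds is accepted by the estimator's constructor:

es_model = xgboost.XGBRegressor(
    n_estimators=100,
    eval_metric='rmse',
    early_stopping_rounds=5  # stop if rmse hasn't improved for 5 rounds
).fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False,
)
es_model.best_iteration  # index of the best boosting round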

Save model#

Use the save_model method of the model object.

For more details, check the official Introduction to Model IO tutorial.


The following cell fits an XGBoost model and saves it.

X, y = make_regression(n_samples=100, n_features=5)
model = xgboost.XGBRegressor(n_estimators=10, random_state=42)
model.fit(X, y)

model.save_model("/tmp/xgb_model.json")

The next cell displays the first lines of the resulting JSON file.

import json
with open("/tmp/xgb_model.json", "r") as f:
    model_json = json.load(f)
print(pformat(model_json)[:1000])
{'learner': {'attributes': {'scikit_learn': '{"_estimator_type": "regressor"}'},
             'feature_names': [],
             'feature_types': [],
             'gradient_booster': {'model': {'gbtree_model_param': {'num_parallel_tree': '1',
                                                                   'num_trees': '10'},
                                            'iteration_indptr': [0,
                                                                 1,
                                                                 2,
                                                                 3,
                                                                 4,
                                                                 5,
                                                                 6,
                                                                 7,
                                                                 8,
                                                           

Get just the configuration of the booster with the save_config method.

out = model.get_booster().save_config()
out[:50] + " ... "+ out[-50:]
'{"learner":{"generic_param":{"device":"cpu","fail_ ... ram":{"scale_pos_weight":"1"}}},"version":[3,0,5]}'

Booster type#

You can specify the booster argument, which determines the type of base learner used at each boosting round. The options are:

  • gbtree: typical tree-based boosting.

  • gblinear: each estimator is a linear regression model.

  • dart: tree boosting with a dropout mechanism that helps control overfitting (a fitting sketch is shown at the end of this section).


The following cell defines the dataset that will be used as an example.

X, y = make_regression(n_samples=100, n_features=5)

The next cell fits a tree-based model and prints the text representation of its first tree.

tree_model = xgboost.XGBRegressor(
    max_depth=2,
    n_estimators=10,
    booster='gbtree'
).fit(X, y)

booster = tree_model.get_booster()
print(booster.get_dump()[0])
0:[f3<-0.299755722] yes=1,no=2,missing=2
	1:[f2<-0.612705946] yes=3,no=4,missing=4
		3:leaf=-47.1132851
		4:leaf=-12.5197821
	2:[f2<0.0449830927] yes=5,no=6,missing=6
		5:leaf=-5.050488
		6:leaf=31.3139687

The result is the set of split rules the tree uses to make decisions.

In contrast, the following cell fits a gblinear booster and shows the bias and coefficients of the resulting linear model.

linear_model = xgboost.XGBRegressor(
    n_estimators=10,
    booster='gblinear'
).fit(X, y)

booster = linear_model.get_booster()
print(booster.get_dump()[0])
bias:
-7.45684
weight:
46.7333
25.2602
58.0306
67.4508
33.0191
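
For completeness, a dart booster is fitted the same way. The sketch below uses illustrative values for rate_drop and skip_drop, the parameters that control the dropout behaviour:

dart_model = xgboost.XGBRegressor(
    n_estimators=10,
    booster='dart',
    rate_drop=0.1,  # probability of dropping trees at each round
    skip_drop=0.5  # probability of skipping the dropout procedure
).fit(X, y)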