XGBoost#
XGBoost is a popular package that implements the gradient boosting algorithm.
import xgboost
xgboost.set_config(verbosity=0)  # suppress xgboost log messages globally
from pprint import pformat
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
Evaluation#
XGBoost has built-in evaluation tools. You can:
- Set the eval_metric on the object that implements the model. See its description in the Learning Task Parameters.
- Pass the eval_set to the fit method; it will be used to evaluate the model during the fitting process.
- Use the evals_result method of the fitted model to access the results.
Note: the eval_metric doesn’t influence the optimisation problem; that is driven by the objective function (in XGBoost terms, a loss function with a regularisation component).
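To make the distinction explicit, the objective and the evaluation metrics can be set independently; a minimal sketch (the parameter values here are only illustrative):
# The objective defines the loss that is optimised during training;
# eval_metric only reports scores on the evaluation sets.
model = xgboost.XGBRegressor(
    objective='reg:squarederror',
    eval_metric=['rmse', 'mae'],
)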
The following cell defines a model that will be evaluated using the rmse and mae metrics.
X, y = make_regression(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = xgboost.XGBRegressor(
n_estimators=10,
eval_metric=['rmse', 'mae']
).fit(
X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=False,
)
The following code retrieves the evaluation results.
model.evals_result()
{'validation_0': OrderedDict([('rmse',
[88.36185611431438,
85.5619274553268,
82.15710350923845,
83.82585462742627,
85.4184908720323,
87.2008393703865,
87.60270608669752,
87.69609884330428,
87.66542240780693,
88.09251860390917]),
('mae',
[68.94008346557617,
64.97411354064941,
59.32854522705078,
61.621359786987306,
64.64699867248535,
67.44699798583984,
68.72293533325195,
69.23456924438477,
69.49240661621094,
70.03242248535156])])}
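The returned dictionary can be indexed like any other; for example, the last recorded value of a particular metric (the keys follow the output above):
results = model.evals_result()
# Final mae on the evaluation set
results['validation_0']['mae'][-1]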
Save model#
Use the save_model method of the model object.
For more details check the Introduction to Model IO official tutorial.
The following cell simply fits an XGBoost model and saves it.
X, y = make_regression(n_samples=100, n_features=5)
model = xgboost.XGBRegressor(n_estimators=10, random_state=42)
model.fit(X, y)
model.save_model("/tmp/xgb_model.json")
The next cell displays the beginning of the resulting JSON file.
import json
with open("/tmp/xgb_model.json", "r") as f:
model_json = json.load(f)
print(pformat(model_json)[:1000])
{'learner': {'attributes': {'scikit_learn': '{"_estimator_type": "regressor"}'},
'feature_names': [],
'feature_types': [],
'gradient_booster': {'model': {'gbtree_model_param': {'num_parallel_tree': '1',
'num_trees': '10'},
'iteration_indptr': [0,
1,
2,
3,
4,
5,
6,
7,
8,
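Since the saved file is plain JSON, its fields can be read like any other dictionary. For example, the number of trees stored under gbtree_model_param in the structure above:
# Number of boosted trees recorded in the saved model
model_json['learner']['gradient_booster']['model']['gbtree_model_param']['num_trees']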
You can also retrieve just the configuration of the booster.
out = model.get_booster().save_config()
out[:50] + " ... "+ out[-50:]
'{"learner":{"generic_param":{"device":"cpu","fail_ ... ram":{"scale_pos_weight":"1"}}},"version":[3,0,5]}'
Booster type#
For gradient boosting, you can specify the booster argument, which determines the algorithm used for each estimator. The following options are available:
- gbtree: typical tree-based boosting.
- gblinear: each estimator is a linear regression model.
- dart: tree-based boosting with a dropout mechanism that helps to control overfitting (see the sketch at the end of this section).
The following cell defines the dataset that will be used as an example.
X, y = make_regression(n_samples=100, n_features=5)
The next cell fits a tree-based model and prints its text representation.
tree_model = xgboost.XGBRegressor(
max_depth=2,
n_estimators=10,
booster='gbtree'
).fit(X, y)
booster = tree_model.get_booster()
print(booster.get_dump()[0])
0:[f3<-0.299755722] yes=1,no=2,missing=2
1:[f2<-0.612705946] yes=3,no=4,missing=4
3:leaf=-47.1132851
4:leaf=-12.5197821
2:[f2<0.0449830927] yes=5,no=6,missing=6
5:leaf=-5.050488
6:leaf=31.3139687
The result is the set of decision rules that the tree uses to make its predictions.
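If you prefer a tabular view of the same splits, the booster can export them as a pandas DataFrame (this assumes pandas is installed):
# One row per node: tree id, split feature, threshold, gain, leaf value, etc.
booster.trees_to_dataframe().head()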
In contrast, the following cell fits a gblinear booster and shows the coefficients of the first estimator.
linear_model = xgboost.XGBRegressor(
n_estimators=10,
booster='gblinear'
).fit(X, y)
booster = linear_model.get_booster()
print(booster.get_dump()[0])
bias:
-7.45684
weight:
46.7333
25.2602
58.0306
67.4508
33.0191
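The dart booster is not demonstrated above; the following is a minimal sketch of fitting it, where rate_drop and skip_drop are DART-specific parameters with illustrative values:
dart_model = xgboost.XGBRegressor(
    n_estimators=10,
    booster='dart',
    rate_drop=0.1,  # fraction of trees dropped at each boosting round
    skip_drop=0.5,  # probability of skipping the dropout for a round
).fit(X, y)
print(dart_model.get_booster().get_dump()[0])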