Asad Ullah Masood
Boosting Algorithms


In a previously published article, we reviewed ensemble learning techniques and explained how they combine groups of learners to build a more robust model. In this article, we discuss another family of ensemble learning techniques, the boosting algorithms, and demonstrate the most powerful of them, XGBoost, which has been used extensively in Kaggle competitions.

Unlike other ensemble learning techniques, most boosting methods train predictors sequentially, with each new predictor trying to correct its predecessor. One way of doing that is to pay more attention to the training instances that the predecessor misclassified, so that each new predictor focuses on the weak points of the previous one. This approach is used by AdaBoost and Gradient Boosting, two of the most popular boosting techniques.

AdaBoost (Adaptive Boosting)

As said before, AdaBoost uses a series of weak classifiers. The first one is trained on a random sample of the training set. It is then used to make predictions on the training set, separating the instances that were correctly classified from the ones that were not. The training instances are then assigned weights that determine their probability of being selected by the following classifier, and another sample is drawn to serve as training data. In this second sample, the instances that were misclassified by the first classifier carry larger weights than the ones that were correctly classified and therefore have a higher probability of being selected. The instances that are not selected are then discarded from the training sample, and this process continues until the last classifier is trained.

After the whole training process, predictions are made similarly to Bagging and Pasting; however, the predictors have different weights based on their overall accuracy on the weighted training set.
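As a quick illustration of the idea (this toy example is ours and is not part of the article's case study), scikit-learn's AdaBoostClassifier can be trained with decision stumps as the weak learners:

# minimal AdaBoost sketch on synthetic data (illustrative only)
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# weak learner: a decision stump (tree of depth 1);
# on scikit-learn < 1.2 the argument is called base_estimator instead of estimator
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200,
                         learning_rate=0.5,
                         random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))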

Gradient Boosting

Even though Gradient Boosting also trains predictors sequentially, it operates differently from AdaBoost. Instead of adjusting the instance weights at every iteration, it trains each new predictor on the residuals (the difference between the predicted and the observed values) of the previous one.

First of all, the algorithm makes its initial prediction by computing the average of the labels. Second, it builds a new predictor to predict the residuals (called pseudo-residuals in Gradient Boosting) of the previous one. Third, the prediction from the previous iteration (for the first predictor, the mean) is added to the predicted residuals multiplied by a learning rate, giving a new prediction. The learning rate is important so that the process moves in small steps in the right direction. In summary, Gradient Boosting works in the following steps (sketched in code after the list):

(1) Calculate the residuals (the difference between the current predictions and the observed values);

(2) Train a predictor to estimate those residuals;

(3) Add the predicted residuals, multiplied by the learning rate, to the current predictions to obtain the new predictions.
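To make these steps concrete, here is a minimal hand-rolled sketch on toy regression data (the data and variable names below are ours, purely for illustration):

# toy sketch of the gradient-boosting loop for regression (illustrative only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(10).reshape(-1, 1)                              # toy feature
y = np.array([3., 4., 6., 7., 9., 11., 12., 14., 15., 18.])  # toy target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # first prediction: the labels' average

for _ in range(100):
    residuals = y - prediction                      # (1) compute the residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                          # (2) predict the residuals
    prediction += learning_rate * tree.predict(X)   # (3) update in small steps

print("Final training MSE:", np.mean((y - prediction) ** 2))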

XGBoost (eXtreme Gradient Boosting)

XGBoost is an implementation of gradient boosting designed for speed and performance. It generally performs better and runs faster than ordinary gradient boosting, and it also compares favorably with other ensemble methods such as Random Forest.
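As a minimal usage sketch (on synthetic data, not the income data set used below), a default XGBClassifier behaves like any scikit-learn estimator:

# default XGBoost classifier on synthetic data (illustrative only)
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=42)

clf = XGBClassifier(random_state=42)   # default hyperparameters
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))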

To demonstrate its performance, we used the same case study as in the previous article and compared the default XGBoost with the Random Forest classifier. In this case study, we used the income data set, which is commonly used to classify people into low income (those who earn less than $50k/year) and high income (those who earn $50k/year or more). The code used to run a grid search for this comparison is shown below; you can check the preprocessing steps for building full_pipeline_preprocessing in the previous article:

# imports used in this snippet (the preprocessing objects come from the previous article)
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# the full pipeline as a step in another pipeline with an estimator as the final step
pipe = Pipeline(steps=[("full_pipeline", full_pipeline_preprocessing),
                       ("fs", SelectKBest()),
                       ("clf", XGBClassifier())])

# create a list of dictionaries with the hyperparameters
search_space = [
    {"clf": [RandomForestClassifier()],
     "clf__n_estimators": [200],
     "clf__criterion": ["entropy"],
     "clf__max_leaf_nodes": [128],
     "clf__random_state": [seed],
     "fs__score_func": [chi2],
     "fs__k": [13]},
    {"clf": [XGBClassifier()],
     "clf__random_state": [seed],
     "fs__score_func": [chi2],
     "fs__k": [13]}
]

# create the cross-validation folds
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

# setting up the grid search
grid = GridSearchCV(estimator=pipe,
                    param_grid=search_space,
                    cv=kfold,
                    scoring=scoring,
                    return_train_score=True,
                    n_jobs=-1,
                    refit="AUC")
tmp = time.time()

# fit the grid search
best_model = grid.fit(X_train, y_train)
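After fitting, the standard GridSearchCV attributes can be inspected, for example as follows (a sketch, assuming, as in the previous article, that scoring is a dict containing an "AUC" scorer):

# inspect the fitted grid search (standard scikit-learn attributes)
import pandas as pd

print("Best parameters:", best_model.best_params_)
print("Best mean CV AUC:", best_model.best_score_)

# full cross-validation results as a table
results = pd.DataFrame(best_model.cv_results_)
print(results[["params", "mean_test_AUC", "rank_test_AUC"]])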
The Grid Search’s results are presented in the following tables:

From the tables, one can conclude that the default XGBoost classifier performed better than the best Random Forest configuration. We also used the following search space to look for better values of the XGBoost hyperparameters:

search_space = [
    {"clf": [XGBClassifier()],
     "clf__n_estimators": [100, 200, 300],
     "clf__learning_rate": [0.05, 0.1, 0.3, 0.5],
     "clf__random_state": [seed],
     "fs__score_func": [chi2],
     "fs__k": [13]}
]
The grid search found that the XGBoost classifier with 300 estimators and a 0.1 learning rate had the best AUC among them, outperforming the default XGBoost:

Using Google Colab’s GPU

Google Colaboratory provides a way to use a GPU or TPU to run our code. To set this up, go to Runtime > Change runtime type > Hardware accelerator and choose between None, GPU, and TPU. We tested the best XGBoost hyperparameter configuration both without a hardware accelerator and with the GPU option to see how it performs.
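For reference, this is roughly how XGBoost can be told to train on the GPU (a sketch, not necessarily the exact code we ran; the parameter names depend on the XGBoost version installed in Colab):

# hypothetical sketch: training XGBoost on the Colab GPU
# XGBoost >= 2.0 uses device="cuda"; older versions use tree_method="gpu_hist"
xgb_gpu = XGBClassifier(n_estimators=300,
                        learning_rate=0.1,
                        tree_method="hist",
                        device="cuda",
                        random_state=seed)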

# printing the training time for each hardware accelerator
print("Training Time: %s seconds" % (str(time.time() - tmp)))

The training time results seem surprising, since training on the GPU was slower than training on the CPU, possibly because the data set is small enough that the overhead of moving data to the GPU outweighs the gains in computation. More research will be carried out to understand why this happened.
