Understanding Boosted Trees Models
In the previous post, we learned about tree based learning methods  basics of tree based models and the use of bagging to reduce variance. We also looked at one of the most famous learning algorithms based on the idea of bagging random forests.
In this post, we will look into the details of yet another type of treebased learning algorithms: boosted trees.
Boosting
Boosting, similar to Bagging, is a general class of learning algorithm where a set of weak learners are combined to get strong learners.
Recall that bagging involves creating multiple copies of the original training data set via bootstrapping, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees.
In contrast, in Boosting, the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, a decision tree is fit to the residuals from the model. That is, a tree is fit using the current residuals, rather than the outcome as the response.
The above concept of additively fitting weak learners to get a strong learner is not limited to tree based methods. In fact, we can choose these weak learners to be any model of our choice. In practice, however, decision trees or sometimes linear models are the most common choices. In this post, we will be focusing mainly on boosted trees  i.e. our choice of weak learners would be decision trees. Boosted trees are one of the most used models in online machine learning competitions like Kaggle.
Let us consider the case of boosted trees more rigorously. In particular, we will look at the case of boosted decision trees for regression problems. To begin, set the resultant model $\hat{F}(x) = 0$ and residual of the training response as the training target $y$. Now Sequentially, for steps $b=1,2\ldots,B$ update $\hat{F}(x)$ as follows: $\hat{F}(x)\leftarrow \hat{F}(x) + \lambda_b \hat{f}^{b}(x)$, where $\hat{f}^{b}(x)$ is a decision tree learner fit on $(X, r)$ with $d$ splits (or $d+1$ terminal nodes) and $\lambda_b$ is a weighting parameter. Also update residuals, $r$ as, $r \leftarrow r  \lambda_b \hat{f}^{b}(x)$. At the end of $B$ steps, we will get the final model as $\hat{F}(x) = \sum_{b=1}^{B}\lambda_b\hat{f}^{b}(x)$.
The above algorithm of boosted is quite generic. One can come with several strategies for how the decision trees are grown and fit on residuals, and for the choice of the weighting parameter $\lambda$. Therefore, there are various boosting methods. Some common examples are:
In this post, we will focus specifically on two of the most common boosting algorithms, AdaBoost and Gradient Boosting.
Adaptive Boosting (AdaBoost)
Adaptive Boosting or AdaBoost refers to a particular formulation of boosting where the idea of residual learning is implemented through the concept of sample weights. In particular, at any iteration, $b$, all training samples have a weight $w_i^b$ $\forall i = 1, \ldots N$, which is equal to the current error $E(\hat{f}(x_i))$. The decision tree $\hat{f}^{b}(x)$ is fit to minimize the error $E_b = \sum_{i=1}^{N}{E\big(\hat{F}(x) + \lambda_b \hat{f}^{b}(x)\big)}$. AdaBoost uses an exponential error/loss function: $E(\hat{F}(x_i)) = e^{y_i \hat{F}(x_i)}$. The choice of the error function, $E()$, also determines the way weights $w_i^b$ and the weighting parameter $\lambda_b$ are updated at different steps. A neat mathematical derivation of these quantities for the case of binary classification can be found at the Wiki article.
In particular case of the classification problems, we can also think of AdaBoost to be an algorithm where at any step, the misclassified samples get a higher weight than the correctly classified samples. The weights attached to samples are used to inform the training of the weak learner, in this case, decision trees to be grown in a way that favor splitting the sets of samples with high weights.
In practice, the above definition of the AdaBoost model is modified to include a learning rate term, $\gamma$, as a regularization term that shrinks the contribution of all of the models. In particular, the model is updated as, $\hat{F}(x) \leftarrow \hat{F}(x) + \gamma \lambda_b \hat{f}^{b}(x)$. The main tuning parameters of AdaBoost are, the learning rate $\gamma$, number of estimators $B$, and decision tree related parameters like depth of trees $d$, and number of samples per split etc.
AdaBoost Classifier in Python
Recall the US income data that we used in the
previous based post on tree based methods. In summary, in this dataset, we
are required to predict the income range of adults (<=50K or >50K) based on following features:
Race
, Sex
, Education
, Work Class
, Capital Loss
, Capital Gain
, Relationship
,
Marital Status
, Age Group
, Occupation
and Hours of Work per Week
. We have already seen
that, with the best decision tree model, we were able to achieve a prediction accuracy of 85.9%.
With the use of random forest models, we were able to increase the accuracy to 86.6%.
Let us try to solve the same problem using AdaBoost classifier from scikitlearn module. Please have a look at the previous post on treebased methods to understand the EDA and preparation of the data.


Without any parameter tuning, we see an accuracy of 85.54%!
Once the number of hyperparameters and the range of those parameter in models becomes large, the time required to find the best parameters becomes exponentially large using simple grid search. Instead of trying out every possible combination of parameters, we can evolve only the combinations that give the best results using Genetic Algorithms. A scikitlearn friendly implementation of this can be found in the sklearndeap library. In many of the examples below we will be using this library to search for best model parameters.
The genetic algorithm
itself has three main parameters: population size
, tournament size
and no. of generations
.
Typically, for the problem of parameter search, we can use a population of 1015 per
hyperparameter. The tournament size parameter affects the diversity of population being
considered in future generation. This roughly translates into global vs local optimization for
parameter search. In other words, if the tournament size is too large, we will have higher chance
of getting a solution that is from local minima. A typical value of 0.1 times the population
works well for the exercise of finding optimal model parameters. The number of generation decides
the convergence of genetic algorithms. A larger value leads to better convergence but requires
larger computation time.
Let us try using genetic algorithm to find optimal model parameters for AdaBoost classifier.


This should take about 50 minutes on a reasonable desktop machine! We can now use the best parameters from this and create a new AdaBoost classifier.


We see a significant improvement in our results with an accuracy of 87.06% on the testing data!
Given our data is highly imbalanced, let us look at the confusion matrix of our model on the test
data. Note that we are using the confusion_matrix()
method from the
previous post on tree based methods.


We find that our model has a much better predictive power (94.1%) for the dominant class (<=50K), while it has prediction rate of only 64.2% for the other class.
Gradient Boosting
The approach used in the case of AdaBoost can be also viewed as an exponential loss minimization approach. Let us look at this mathematically for the case of a generic loss function, $L\big(y,\hat{F}(x)\big)$. The goal of the method is to find an $\hat{F}(x)$ that minimizes the average value of the loss function $L\big(y,\hat{F}(x)\big)$ on the training set. This can be achieved by starting with a constant function $\hat{F_0}(x)$, and incrementally expanding it in a greedy fashion as follows:
$$ \begin{gathered} \hat{F_0}(x) = \mathop{\arg\min}\limits_{\gamma} \sum_{i=1}^{N} L\big(y,\gamma \big) \cr \hat{F_b}(x) = \hat{F}_{b1}(x) + \cr \mathop{\arg\min}\limits_{f \in \mathcal{H}} \sum_{i=1}^{N} L\big(y,\hat{F}_{b1}(x) + f(x) \big) \cr \text{, for } b=1,2,\ldots, B \end{gathered} $$
where $f(x) \in \mathcal{H}$ refers to base learners, in this case tree models. The problem with this set up is that, it is computationally infeasible to choose optimal $f(x)$ at every step for an arbitrary loss function $L\big(y,\hat{F}(x)\big)$.
However, this can be simplified using a steepest descent approach. In this approach, at any step the decision tree model is trained on the pseudoresiduals, rather than residuals. The approach can be described in the following algorithm:
 Initialize model as, $\hat{F_0}(x) = \mathop{\arg\min}\limits_{\gamma} \sum_{i=1}^{N} L\big(y,\gamma \big)$
 for steps $b=1,2,\ldots, B$:
 compute pseudoresiduals as: $r_{ib} = \Bigg[ \frac{\partial{L\big(y_i, \hat{F}_{b1}(x_i)\big)}}{\partial{\hat{F}_{b1}(x_i)}} \Bigg]$
 Fit a decision tree $\hat{f^b}(x)$ to pseudoresiduals, i.e. train it using the training set $(x_i, r_{ib})_{i=1}^N$
 Compute multiplier $\gamma_b$ using line search, where $0<\nu<1$ is the learning rate parameter $\gamma_b = \mathop{\arg\min}\limits_{\gamma} \sum_{i=1}^N L\big(y_i, \hat{F}_{b1}(x) + \nu \gamma \hat{f^b}(x)\big)$
 update the model $\hat{F_b}(x) = \hat{F}_{b1}(x) + \nu \gamma_b \hat{f^b}(x)$
In most real implementations of gradient boosted trees, rather than an individual tree weighing parameter $\gamma_b$, different parameters are used at different splits, $\gamma_{jb}$. If you recall from the last post, a decision tree model corresponds to diving the feature space in multiple rectangular regions and hence it can be represented as,
$$ \hat{f^b}(x) = \sum_{j=1}^{J} k_{jb} I\big(x\in R_{jb}\big) $$
where, $J$ is the number of terminal nodes (leaves), $I\big(x\in R_{jb}\big)$ is an indicator function which is 1 if $x\in R_{jb}$ and $k_{jb}$ is the prediction in $j^{th}$ leaf. Now, we can replace $\hat{f^b}(x)$ in above algorithm and replace $\gamma_b$ for the whole tree by $\gamma_{jb}$ per terminal node (leaf).
$$ \hat{F_b}(x) = \hat{F}_{b1}(x) + \nu \sum_{j=1}^{J_b} \gamma_{jb} I\big(x\in R_{jb}\big) $$
where $\gamma_{jb}$ is given by the following line search,
$$ \gamma_{jb} = \mathop{\arg\min}\limits_{\gamma} \sum_{x_i \in R_{jb}} L\big(y_i, \hat{F}_{b1}(x) + \gamma \big) $$
Here $J$ refers to the number of terminal nodes (leaves) in any of constituent decision trees. A value of $J_b =2$, i.e. decision stumps means no interactions among feature variables are considered. With a value of $J_b=3$ the model may include effects of the interaction between up to two variables, and so on. Typically a value of $4 \le J_b \le 8$ work well for boosting.
Regularization of Gradient Boosted Trees
Gradient Boosted Trees can be regularized by multiple approaches. Some common approaches are:
 Shrinkage / Learning Rate: For each gradient step, the step magnitude is multiplied by a factor between 0 and 1 called a learning rate ($0 <\nu < 1$). In other words, each gradient step is shrunken by some factor $\nu$. The rational for this to work as a regularization parameter has never been clear to me. My personal take is the shrinkage enables us to use a different prior. Telgarsky et al. provide a mathematical proof that shrinkage makes gradient boosting to produce an approximate maximum margin classifier, i.e. a classifier which is able to maximize the associated distance from the decision boundary for each example.
 SubSampling: Motivated by the bagging method, at each iteration of the algorithm, a decision tree is fit on a subsample of the training set drawn at random without replacement. Also, like in bagging, subsampling allows one to define an outofbag error of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Outofbag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.
 Minimum sample size for splitting trees, and Minimum sample size for tree leaves: It is used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances. Imposing this limit helps to reduce variance in predictions at leaves.
 The number of trees or Boosting Iterations, $B$ Increasing $B$ reduces the error on training set, but setting it too high may lead to overfitting. An optimal value of $B$ is often selected by monitoring prediction error on a separate validation data set.
 Sampling Features: We can apply the idea of randomly choosing small subsets of features for different trees, as in the case of Random Forest models.
scikitlearn Implementation
scikitlearn provides a simple and generic implementation of the above described algorithm that is valid of different types of loss functions. Below is a simple implementation for the case of income data.


Turns out we have got a quite good model just by chance! This has an accuracy of 87.11% on our test data!
We could try finding optimal parameters for this as before. However, in my experience this is a very academic implementation of boosted trees. Other implementation like XGBoost, LightGBM and CatBoost are more optimized implementations and hence we will focus our tuning for some of these libraries only.
XGBoost
The biggest drawback of the gradient boosting trees is that the algorithm is quite sequential in nature and hence are very slow and they can not take advantage of advanced computing features for parallelization like multiple threads and cores, GPU etc.
XGBoost is a python (C++ and R as well) library that provides an optimized implementation of gradient boosted trees. It uses various tricks like regularization of number of leaves and leaf weights, sparsity aware split finding, column block for parallel learning and cacheaware access. Using these tricks the implementation provides a parallelized efficient version of the algorithm. The details of these can be found here. XGBoost is one of the most famous machine learning libraries used in online machine learning competitions like Kaggle.
Regularization Apart from regular gradient boosted trees, XGBoost provides two additional types of regularization, by adding $L_1$ constraints on number of leaves ($J_b$) and $L_2$ constraints on the leaf weights ($\gamma_{jb}$) to the loss function. Mathematically, The loss function is modified as follows (Loss Term at step b):
$$ \sum_{i=1}^{N} L\big( y_i, \hat{F_b}(x_i) \big) + \sum_{k=1}^b \Big (\eta J_k + \frac{1}{2} \lambda \left\lVert \gamma_{jk} \right\rVert^2 \Big ) $$
Here, the second term in the loss function, penalizes the complexity of the model, i.e. decision tree functions.
Additional Weak Learners: Apart from decision trees, XGBoost also supports linear models and DART (decision trees with dropout) as weak learners. In the DART algorithm, only a subset of available trees are considered in calculating the pseudoresiduals on which the new trees are fit.
XGBoost has many parameters that control the fitting of the model. Below are some of the relevant parameters and tuning them would be helpful in the most common cases. Please note that original XGBoost library parameters might have a different name than before, since I am using the scikitlearn API parameter names below.
 max_depth (default=3) : Maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting. A value of 0 indicates no limit. A Typical Value in the range of 210 can be used for model optimization.
 n_estimators (default=100) : The number of boosting steps to perform. This can also be seen
as number of trees to be used in the boosted model. This number is inversely proportional to the
learning*rate (eta) parameter, i.e. if we use a smaller value of learning_rate,
n_estimators
has to be made larger. A Typical Value in the range of $\ge 100$ can be used for model optimization. However, it is best to used XGBoost builtincv()
method to find this parameter (See the example ahead).  min_child_weight (default=1) : The minimum sum of instance weight needed in a child node. If
the tree partition step results in a leaf node with the sum of instance weight less than the
min_child_weight
, then any further partitioning will be stopped. The larger the value of this parameter, the more conservative the algorithm will be. A Typical Value in the range of 110 can be used for model optimization*.  learning_rate (default=0.1) : The step size shrinkage used to prevent overfitting. After
each boosting step, we can directly get the weights of new features. and the
learning_rate
parameter (also referred as eta in regular XGBoost API) shrinks the feature weights to make the boosting process more conservative (See above formulation of Gradient Boosted Trees for mathematical details). A Typical Value in the range of 0.0010.3 can be used for model optimization.  gamma (default=0) : (Also referred as min_split_loss in regular XGBoost API) The minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be. The value of this parameter depends on the type of loss function being used. A Typical Value in the range of 0.00.7 can be used for model optimization.
 reg_alpha (default=0) : (Also referred as alpha in regular XGBoost API) L1 regularization term on weights, increasing this value will make model more conservative (see XGBoost paper for mathematical details). A Typical Value in the range of 01 can be used for model optimization.
 reg_lambda (default=1) : (Also referred as lambda in regular XGBoost API) L2 regularization term on weights, increasing this value will make model more conservative (see XGBoost paper for mathematical details). A Typical Value in the range of 01 can be used for model optimization.
 subsample (default=1) : Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees. This parameter is used to prevent overfitting. A Typical Value in the range of 0.51.0 can be used for model optimization.
 colsample_bytree (default=1) : Subsample ratio of columns when constructing each tree. Similar to Random Forest models, models tend to be more generalizable, if a number between 0 and 1 is used. A Typical Value in the range of 0.51.0 can be used for model optimization.
 scale_pos_weight (default=1) : Controls the balance of positive and negative class weights, useful for unbalanced class problems. A typical value to consider: sum(negative cases) / sum(positive cases).
Apart from above set of parameters, there are several parameters that should also be considered while tuning. Some examples of such parameters are: objective (objective function/ loss function to use, depends on problem, for eg. binary vs. multiclass classification), tree_method (The tree construction algorithm used in XGBoost(see description in the paper), random_state (seed for random number generator), n_jobs (number of threads to use to train the model) and many other parameters related to different types of tree methods used. This article can be used as a good general reference for tuning XGBoost models.
Let us tune XGBoost model for our problem of income prediction. A simple sklearn API implementation can be used as below.


With this reasonable guess of parameters based on previous models, we already see an accuracy of 86.75%.
Let us try to find the optimal parameters for the XGBoost model. If we simply try to do a brute force grid search, it can be computationally very expensive and unreasonable on a desktop. Here is a sample parameters list that can give us an idea of what such a grid search could look like.


If we try to do a grid search of this with 5fold cross validation, it will involve a whopping 0.81 million model training calls! And, even this will not be enough, as we will need additional model training steps to finetune our parameter search for finer and/or different range of parameters. Clearly, we need a different approach to solve this.
I will take an approach of optimizing different set of parameters in batches. To begin, we will choose a fixed learning rate of 0.1, and n_estimators=200. We will try to find only tree related parameters (i.e. max_depth, gamma, subsample and colsample_bytree) using grid search or genetic algorithm.


This gives us the following optimal values for different parameters:
Best individual is: {'max_depth': 6, 'subsample': 1.0, 'colsample_bytree': 0.8, 'gamma': 0.2}
with fitness: 0.8710727557507447
We can do a finer grid search to get more precise values. For this exercise, let us move on to the next stage of parameter tuning of XGBoost.
XGBoost provides an optimized version of cross validation using cv()
method which supports early
stopping to give us optimal value of n_estimators.


This gives us a value of n_estimators = 206. We will now use these parameters to search for the next set of tree building parameters: max_depth and min_child_weight.


The optimal set of parameters found by this are:
Best individual is: {'max_depth': 7, 'min_child_weight': 1}
with fitness: 0.8712877368631184
We can now use these parameters as fixed values and optimize regularization parameters: reg_alpha and reg_lambda.


The optimal set of parameters found by this search are:
Best individual is: {'reg_alpha': 0.001, 'reg_lambda': 1}
with fitness: 0.8714720063880101
We can now decrease the learning rate by an order to magnitude to get a more stable model. However,
we will also need to find the optimal value of number of estimators again using the cv()
method.


We get an optimal value of n_estimators = 1559. Let us use now all of these optimized values to make a final XGBoost model.


We get test accuracy of 86.99%. Now, this seems odd. After we did all this computation and we still have got a test accuracy that is smaller than an optimized version of scikitlearn’s Gradient Boosting Tree implementation. However, if you have paid attention to the metric, you can notice  for cross validation, I started using ‘auc’ as metric, instead of ‘accuracy’. This will give us better accuracy for the less abundant class (> 50K salary) but at the cost of slight decrease in the accuracy of the more abundant class (<= 50K salary). We can verify this by the following confusion matrix plot.


We see the largest accuracy of the less abundant class with an accuracy of 64.9%, compared to the previous best of 64.2%. We can also look the importance of different features for this XGBoost model.


LightGBM
Similar to XGBoost, LightGBM is another optimized implementation of Gradient Boosting developed by Microsoft (similar to XGBoost available in Python, C++ and R). The main difference between LightGBM and other Gradient boosted trees (like XGBoost) implementations is in the way trees are grown. The details of this can be found on their features page. Briefly, LightGBM splits the tree leafwise with the best fit whereas other boosting algorithms split the tree depthwise or levelwise. The two approaches can be best visualized in the following illustrations:
Levelwise Splits:
Leafwise Splits:
Leafwise splits lead to increase in complexity and may lead to overfitting, and hence extra caution needs to be taken in tuning. Some of the biggest advantages of LightGBM over XGBoost is in terms of extremely fast training speed, lower memory usage, compatibility with large datasets and highly parallel computational support using threads, MPI and GPUs. LightGBM also has inbuilt support for categorical variables, unlike XGBoost, where one has to preprocess the data to convert all of the categorical features to integer ones using onehot encoding or label encoding.
Since the trees are grown differently in LightGBM, its tuning procedure is quite different than XGBoost. Note that the latest version of XGBoost also provides a tree building strategy (depthwise) which is quite similar to LightGBM.
All the parameters described above for XGBoost are also valid for LightGBM library. However, some parameters can have different names. Given the strategy of growing trees is different in LightGBM, an additional parameter needs to be tuned as well.
num_leaves: Maximum tree leaves for base learners. Note that since trees are grown depthfirst, this parameters is independent of the max_depth parameter and has to be tuned independently. A typical value for starting should be much less than 2^{max _depth}.
When using the scikitlearn API of LightGBM, one should keep in mind that some of the parameter names are not standard ones (even though described in the API reference). In particular, I found that seed and nthreads as parameters, instead of random_state and n_jobs, respectively.
Let us tune a LightGBM model for the problem of Income prediction.


This results in test accuracy of 86.8%.
Given this library also has many parameters, similar to XGBoost, we need to use a similar strategy of tuning in stages.
First we will fix learning rate to a reasonable value of 0.1 and number of estimators = 200, and
tune only the major tree building parameters: max_depth
, subsample
, colsample_bytree
and
num_leaves
. We will use the genetic algorithm to search for optimal values of these parameters.


This gives the following set of optimal parameters:
Best individual is: {'max_depth': 6, 'subsample': 1.0, 'colsample_bytree': 0.75, 'num_leaves': 54}
with fitness: 0.870888486225853
Now, we can use grid search to fine tune the search of number of leaves parameter.


Similar to XGBoost, LightGBM also provides a cv()
method that can be used to find optimal value
of number of estimators using early stopping strategy. Another strategy would be to search for this
parameter as well. In the following, I want to use this grid search strategy to find best value of
number of estimators, just to show how tedious this can be!


We find that the optimal value of n_estimators
to be 327.
Now, we can use the similar strategy to find and finetune the best regularization parameters.


Finally, we can decrease the learning rate to 0.01 and find the optimal value of n_estimators
.


We find the optimal n_estimators
to be equal to 3327 for a learning rate of 0.01. We can now
built a final LightGBM model using these parameters and evaluate the test data.


We get a test accuracy of 87.01%. Similar to previous cases, we can again look at the accuracy of individual classes using the following confusion matrix plot:


We find the results to be slightly better than XGBoost with 65% accuracy of the less abundant >50K salary class. We can also look the importance of different features in this model:


Concluding Remarks
XGBoost has been one of the most famous libraries used to win several machine learning competitions on Kaggle and similar sites. Slowly, LightGBM is also gaining traction due to its speed and parallelization capabilities. CatBoost is another boosting library from Yandex that has been shown to be quite efficient. I have used it personally yet though. If I find it to be worth making a move to, I will write about it in a future post.