*Gradient Boosting is a machine learning algorithm, used for both classification and regression problems. It works on the principle that many weak learners(eg: shallow trees) can together make a more accurate predictor.*

## How does Gradient Boosting Work?

Gradient boosting works by building simpler (weak) prediction models sequentially where each model tries to predict the error left over by the previous model. Because of this, the algorithm tends to overfit rather quick. But, what is a weak learning model? A model that does slightly better than random predictions. I will show you the exact formula shortly. But for clearly understanding the underlying principles and working of GBT, it’s important to first learn the basic concepts of decision trees and ensemble learning. This tutorial will take you through the concepts behind gradient boosting and also through two practical implementations of the algorithm:

- Gradient Boosting from scratch
- Using the scikit-learn in-built function.

# Contents:

- Decision Trees
- Ensemble Learning
- Why boosting?
- Gradient Boosting Decision Trees
- Gradient Boosting in Action
- Implementation
- Improving perfomance of gradient boosted decision trees
- References

## Decision trees

A decision tree is a machine learning model that builds upon iteratively asking questions to partition data and reach a solution. It is the most intuitive way to zero in on a classification or label for an object. Visually too, it resembles and upside down tree with protruding branches and hence the name. For example if you went hiking, and saw a animal that you couldnt immediately recognise through its features. You could later come home and ask yourself a set of questions about its features which could help you decide what excat species of animal did you notice. A decision tree for this problem would look something like this. A decision tree is a flowchart-like tree structure where each node is used to denote feature of the dataset, each branch is used to denote a decision, and each leaf node is used to denote the outcome. The topmost node in a decision tree is known as the **root node**. It learns to partition on the basis of the feature value. It partitions the tree in a recursive manner, also call recursive partitioning. This flowchart-like structure helps in decision making. It’s visualization, as shown above, is like a flowchart diagram which easily mimics the human level thinking. That is why decision trees are easy to understand and interpret. The major problem with decision trees is overfitting, which is why they will perform well on the validation dataset but will have poor accuracy on the test dataset. To overcome this issue posed by decision trees, data scientists came up with *ensemble learning*. To understand how exactly decision trees divide the data recursively, you can go through this article.

## Ensemble Learning

Ensemble learning, in general, is a model that makes predictions based on a number of different models. By a combining a number of different models, an ensemble learning tends to be more flexible (less bias) and less data sensitive (less variance). The two most popular ensemble learning methods are bagging and boosting. 1. *Bagging* : Training a bunch of models in parallel way. Each model learns from a random subset of the data.

*Boosting*: Training a bunch of models sequentially. Each model learns from the mistakes of the previous model.

```
Image('bagboost.png')
```

The application of bagging is found in **Random Forests**. Random forests are a parallel combination of decision trees. Each tree is trained on random subset of the same data and the results from all trees are averaged to find the classification. The application of boosting is found in **Gradient Boosting Decision Trees**, about which we are going to discuss in more detail.

## Why Boosting?

Boosting works on the principle of improving mistakes of the previous learner through the next learner. In boosting, weak learner are used which perform only slightly better than a random chance. Boosting focuses on sequentially adding up these weak learners and filtering out the observations that a learner gets correct at every step. Basically the stress is on developing new weak learners to handle the remaining difficult observations at each step. One of the very first boosting algorithms developed was Adaboost. Gradient boosting improvised upon some of the features of Adaboost to create a stronger and more efficient algorithm. Let’s look at a brief overview of Adaboost.

### AdaBoost

Adaboost used decision stumps as weak learners. Decision stumps are decision trees with only a single split. It also attached weights to observations, adding more weight to difficult to classify instances and less weight to easy to classify instances. The aim was to put stress on the difficult to classify instances for every new weak learner. Further, the final result was average of weighted outputs from all individual learners. The weights associated with outputs were proprotional to their accuracy. Gradient boosting algorithm is slightly different from Adaboost. Instead of using the weighted average of individual outputs as the final outputs, it uses a loss function to minimize loss and converge upon a final output value. The loss function optimization is done using gradient descent, and hence the name gradient boosting. Further, gradient boosting uses short, less-complex decision trees instead of decision stumps. To understand this in more detail, let’s see how exactly a new weak learner in gradient boosting algorithm learns from the mistakes of previous weak learners.

## Gradient Boosting Decision Trees

In gradient boosting decision trees, we combine many weak learners to come up with one strong learner. The weak learners here are the individual decision trees. All the trees are conncted in series and each tree tries to minimise the error of the previous tree. Due to this sequential connection, boosting algorithms are usually slow to learn, but also highly accurate. In statistical learning, models that learn slowly perform better.

```
Image('residual.png')
```

The weak learners are fit in such a way that each new learner fits into the residuals of the previous step so as the model improves. The final model aggregates the result of each step and thus a strong learner is achieved. A loss function is used to detect the residuals. For instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) can be used for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added. The added decision tree fits the residuals from the current model.

### Learning rate and n_estimators (Hyperparameters)

Hyperparemetes are key parts of learning algorithms which effect the performance and accuracy of a model. Learning rate and n_estimators are two critical hyperparameters for gradient boosting decision trees. Learning rate, denoted as α, simply means how fast the model learns. Each tree added modifies the overall model. The magnitude of the modification is controlled by learning rate. The lower the learning rate, the slower the model learns. The advantage of slower learning rate is that the model becomes more robust and efficient. In statistical learning, models that learn slowly perform better. However, learning slowly comes at a cost. It takes more time to train the model which brings us to the other significant hyperparameter. n_estimator is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, we need to be very careful at selecting the number of trees. It creates a high risk of overfitting to use too many trees.

### Note

One problem that we may encounter in gradient boosting decision trees but not random forests is overfitting due to addition of too many trees. In random forests, addition of too many tress wont cause overfitting. The accuracy of the model doesn’t improve after a certain point but no problem of overfitting is faced. On the other hand, in gradient boosting decision trees we have to be careful about the number of trees we select, because having too many weak learners in the model may lead to overfitting of data. Therefore, gradient boositng decision trees require a very careful tuning of the hyperparameters.

## Gradient Boosting – In Action

Till now, we have seen how gradient boosting works in theory. Now, we will dive into the maths and logic behind it, discuss the algorithm of gradient boosting and make a python program that applies this algorithm to real time data. First let’s go over the basic principle behind gradient boosting once again. We are given data with features (say x) and targets (say y). Our main aim is to predict a y given a set of x. The difference between our prediction and the actual value is known as the residual (or in this case, pseudo residuals), on the basis of which the gradient boosting builds successive trees.

### Algorithm

Let’s say the output model $y$ when fit to only 1 decision tree, is given by $$A_1 + B_1*x +e_1$$ where $e_1$ is the residual from this decision tree. In gradient boosting, we fit the consecutive decision trees on the residual from the last one. so when gradient boosting is applied to this model, the consecutive decision trees will be mathematically represented as: $$ e_1 = A_2 + B_2 x + e_2$$ $$ e_2 = A_3 + B_3x + e_3$$ Note that here we stop at 3 decision trees, but in an actual gradient boosting model, the number of learners or decision trees is much more. The final model of the decision tree will be given by: $$ y = A_1 + A_2 + A_3 + B_1x + B_2x + B_3*x + e_3 $$

## Implementation

### Implementation from Scratch

Let’s consider simulated data as shown in scatter plot below with 1 input (x) and 1 output (y) variables.

```
# The above plot has been generated using the following code
x = np.arange(0, 50)
x = pd.DataFrame({'x': x})
# just random uniform distributions in differnt range
y1 = np.random.uniform(10, 15, 10)
y2 = np.random.uniform(20, 25, 10)
y3 = np.random.uniform(0, 5, 10)
y4 = np.random.uniform(30, 32, 10)
y5 = np.random.uniform(13, 17, 10)
y = np.concatenate((y1, y2, y3, y4, y5))
y = y[:, None]
```

- Fit a decision tree on data. [call x as input and y as output]

```
xi = x # initialization of input
yi = y # initialization of target
# x,y --> use where no need to change original y
ei = 0 # initialization of error
n = len(yi) # number of rows
predf = 0 # initial prediction 0
for i in range(30): # loop will make 30 trees (n_estimators).
# DecisionTree scratch code can be found on www.kaggle.com/grroverpr/gradient-boosting-simplified
tree = DecisionTree(xi, yi)
# It just create a single decision tree with provided min. sample leaf
# For selected input variable, this splits (<n and >n) data so that std. deviation of
tree.find_better_split(0)
# target variable in both splits is minimum as compared to all other splits
# finds index where this best split occurs
r = np.where(xi == tree.split)[0][0]
left_idx = np.where(xi <= tree.split)[0] # index lhs of split
right_idx = np.where(xi > tree.split)[0] # index rhs of split
```

- Calculate error residuals. Actual target value, minus predicted target value [e1= y – y_predicted1 ]
- Fit a new model on error residuals as target variable with same input variables [call it e1_predicted]
- Add the predicted residuals to the previous predictions [y_predicted2 = y_predicted1 + e1_predicted]
- Fit another model on residuals that is still left. i.e. [e2 = y – y_predicted2] and repeat steps 2 to 5 until it starts overfitting or the sum of residuals become constant. Overfitting can be controlled by consistently checking accuracy on validation data.

```
# predictions by ith decisision tree
predi = np.zeros(n)
# replace left side mean y
np.put(predi, left_idx, np.repeat(np.mean(yi[left_idx]), r))
np.put(predi, right_idx, np.repeat(
np.mean(yi[right_idx]), n-r)) # right side mean y
predi = predi[:, None] # make long vector (nx1) in compatible with y
# final prediction will be previous prediction value + new prediction of residual
predf = predf + predi
ei = y - predf # needed originl y here as residual always from original y
yi = ei # update yi as residual to reloop
```

The code above is a very basic implementation of gradient boosting trees. The actual libraries have a lot of hyperparameters that can be tuned for better results. This can be better understood by using the gradient boosting algorithm on a real dataset.

### Implementation using scikit-learn

For implementation on a dataset, we will be using the PIMA Indians Diabetes dataset, which has information about a an individual’s health parameters and an output of 0 or 1, depending on whether or not he has diabates. The task here is classify a individual as diabetic, when given the required inputs about his health. First, let’s import all required libraries.

```
# Import all relevant libraries
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")
```

Now load the dataset and look at the columns to understand the given information better.

```
# Load the dataset
pima = pd.read_csv('diabetes.csv')
pima.head()
```

```
# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(pima.drop('Outcome', axis=1),
pima['Outcome'], test_size=0.2)
# Scale the data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
```

Now that the data has been sufficiently preprocessed, let’s go ahead with defining the Gradient Boosting Classifier along with it’s hyperparameters. Next, we will fit this model on the training data.

```
# Define Gradient Boosting Classifier with hyperparameters
gbc=GradientBoostingClassifier(n_estimators=500,learning_rate=0.05,random_state=100,max_features=5 )
# Fit train data to GBC
gbc.fit(X_train_transformed, y_train)
```

```
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.05, loss='deviance', max_depth=3,
max_features=5, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_iter_no_change=None, presort='auto',
random_state=100, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
```

The model has been trained and we can now observe the outputs as well. Below, you can see the confusion matrix of the model, which gives a report of the number of classifications and misclassifications.

```
# Confusion matrix will give number of correct and incorrect classifications
print(confusion_matrix(y_test, gbc.predict(X_test_transformed)))
```

```
[[82 17]
[25 30]]
```

The number of misclassifications by the Gradient Boosting Classifier are 42, comapared to 112 correct classifications. The model has performed decently. Let’s check the accuracy

```
# Accuracy of model
print("GBC accuracy is %2.2f" % accuracy_score(
y_test, gbc.predict(X_test_transformed)))
```

```
GBC accuracy is 0.73
```

The accuracy is 73%, which is average. This can be improved by tuning the hyperparameters or processing the data to remove outliers.| The discussion above is just the tip of iceberg when it comes to gradient boosting. The underlying concepts can be understood in more detail by starting with the very basics of machine learning algorithms and understanding the working of python code. This however gives us the basic idea behind gradient boosting and its underlying working principles.

## Improving perfomance of gradient boosted decision trees

As we already discussed above, gradient boosting algorithms are prone to overfitting and consequently poor perfomance on test dataset. There are some pointers you can keep in mind to improve the perfomance of gradient boosting algorithm.

### Stochastic Gradient Boosting

Stochastic gradient boosting involves subsampling the training dataset and training individual learners on random samples created by this subsampling. This reduces the correlation between results from individual learners and combining results with low correlation provides us with a better overall result. A few variants of stochastic boosting that can be used: * Subsample rows before creating each tree. * Subsample columns before creating each tree * Subsample columns before considering each split.

### Shrinkage

The predictions of each tree are added together sequentially. Instead, the contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called a shrinkage or a learning rate. Using a low learning rate can dramatically improve the perfomance of your gradient boosting model. Usually a learning rate in the range of 0.1 to 0.3 gives the best results. Keep in mind that a low learning rate can significantly drive up the training time, as your model will require more number of iterations to converge to a final loss value.

### Regularization

L1 and L2 regularization penalties can be implemented on leaf weight values to slow down learning and prevent overfitting. Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees’ terminal nodes

### Tree Constraints

There are a number of ways in which a tree can be constrained to improve perfomance.

- Number of trees : Adding excessive number of trees can lead to overfitting, so it is important to stop at the point where the loss value converges.
- Tree depth : Shorter trees are preferred over more complex trees. Limit the number of levels in a tree to 4-8.
- Minimum improvement in loss : If you have gone over the process of decision making in decision trees, you will know that there is a loss associated at each level of a decision tree. A minimum improvement in loss required to build a new level in a tree can be predecided. This helps in pruning the tree to keep it short.
- Number of observations per split :This imposes a minimum constraint on the amount of training data at a training node before a split can be considered