Boosting
Boosting refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.
AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfit. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.
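The reweighting idea can be sketched from scratch. The block below is a simplified version of the classic discrete AdaBoost update for binary labels, not scikit-learn's exact implementation. The names `X_demo`, `y_demo`, `stumps` and `alphas`, the 10 rounds, and the moons dataset settings are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Illustrative data (assumed settings): a noisy two-class moons dataset
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X_demo)

weights = np.full(m, 1 / m)   # start with equal instance weights that sum to 1
stumps, alphas = [], []

for _ in range(10):
    # Train a decision stump on the currently weighted training set
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    miss = stump.predict(X_demo) != y_demo

    # Weighted error rate of this stump, clipped to avoid division by zero
    err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
    # Predictor weight: larger when the stump is more accurate
    alpha = 0.5 * np.log((1 - err) / err)

    # Boost the weights of misclassified instances, shrink the others, renormalize
    weights[miss] *= np.exp(alpha)
    weights[~miss] *= np.exp(-alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote, with labels mapped from {0, 1} to {-1, +1}
votes = sum(a * (2 * s.predict(X_demo) - 1) for a, s in zip(alphas, stumps))
y_hat = (votes > 0).astype(int)
train_acc = np.mean(y_hat == y_demo)
print("Training accuracy of the 10-stump sketch:", train_acc)
```

Each round, the instances the previous stump got wrong carry more weight, so the next stump is pushed toward the hard cases.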
VC NOTE: The AdaBoost algorithm trains an ensemble of predictors such that each new predictor corrects its predecessor by weighting the instances the predecessor underfit more heavily. This notion is depicted below:
The following figure shows the decision boundary of an AdaBoostClassifier trained on the moons dataset with a DecisionTreeClassifier as the base estimator:
# When training an AdaBoostClassifier, the algorithm first trains a base classifier
# (such as a DecisionTreeClassifier) and uses it to make predictions on the training set.
# A decision stump is a decision tree with max_depth=1,
# which means a single split (one parent node, two leaf nodes).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Instantiate an AdaBoostClassifier whose base classifier is a decision stump;
# the ensemble contains 200 stumps.
# Note: recent scikit-learn releases deprecate algorithm="SAMME.R"; drop the argument if it raises an error.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
# Train the ensemble on the training set
ada_clf.fit(X_train, y_train)
output:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), learning_rate=0.5, n_estimators=200, random_state=42)
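To see what the ensemble buys over a single weak learner, here is a self-contained sketch that compares one decision stump with an AdaBoost ensemble of stumps on a held-out split. The dataset settings (`make_moons` with 500 samples and noise 0.30) and the `*_demo` names are assumptions, since the original data preparation is not shown in this section. The `algorithm` argument is left at its default so the code does not depend on a specific scikit-learn version.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed data preparation: noisy moons, split into training and test sets
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# A single decision stump (one split) as the weak learner
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_tr, y_tr)
stump_acc = accuracy_score(y_te, stump.predict(X_te))

# An AdaBoost ensemble of 200 such stumps
ada_demo = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5, random_state=42)
ada_demo.fit(X_tr, y_tr)
ada_acc = accuracy_score(y_te, ada_demo.predict(X_te))

print("Single stump test accuracy:", stump_acc)
print("AdaBoost test accuracy:    ", ada_acc)
```

On data like this the ensemble typically scores higher than the lone stump, although the exact numbers depend on the split and the library version.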
plot_decision_boundary(ada_clf, X, y) #This is now the full dataset
output:
The following figure shows the decision boundaries of five consecutive SVM predictors that are highly regularized and trained on the moons dataset. This code illustrates a custom implementation that is similar to the AdaBoost algorithm, but not identical.
from sklearn.svm import SVC
m = len(X_train)
fig, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)
for subplot, learning_rate in ((0, 1), (1, 0.5)):
    # Start each experiment with equal weights for all training instances
    sample_weights = np.ones(m)
    plt.sca(axes[subplot])
    for i in range(5):
        # Instantiate a highly regularized SVM classifier
        svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
        # Fit the model using the current instance weights
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        # Where the predicted class differs from the true class, increase the instance weight
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate), fontsize=16)
    if subplot == 0:
        # Number the five consecutive decision boundaries on the left plot
        plt.text(-0.7, -0.65, "1", fontsize=14)
        plt.text(-0.6, -0.10, "2", fontsize=14)
        plt.text(-0.5, 0.10, "3", fontsize=14)
        plt.text(-0.4, 0.55, "4", fontsize=14)
        plt.text(-0.3, 0.90, "5", fontsize=14)
    else:
        plt.ylabel("")
save_fig("boosting_plot")
plt.show()
output:
Gradient Boosting
Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
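The link between residuals and gradients can be made explicit for the squared-error loss. The short derivation below assumes that loss; the symbol $F_k$ for the ensemble after $k$ trees is notation introduced here, while $h_k$ is the $k$-th tree:

```latex
\begin{aligned}
L\bigl(y, F(x)\bigr) &= \tfrac{1}{2}\,\bigl(y - F(x)\bigr)^2 \\
-\frac{\partial L}{\partial F(x)} &= y - F(x) \\
F_k(x) &= F_{k-1}(x) + h_k(x), \qquad h_k \text{ fit to } y - F_{k-1}(x)
\end{aligned}
```

Under this loss the residual is the negative gradient of the loss with respect to the current prediction, so fitting each new tree to the residuals amounts to a gradient descent step in the space of predictions.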
VC: The blue points are the dataset and the red line is the model's fit. The residual error of each instance is the distance between the blue point (the true value) and the red line (what the model predicts).
Let’s go through a simple regression example using Decision Trees as the base predictors. This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).
# Generate a noisy quadratic dataset from random numbers
np.random.seed(42)
# Feature matrix X: one column of 100 random numbers (100 rows, 1 column).
# rand() draws values in [0, 1); subtracting 0.5 shifts them to [-0.5, 0.5).
X = np.random.rand(100, 1) - 0.5
# Targets: y = 3 * x^2 plus Gaussian noise with standard deviation 0.05
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)
First, let’s fit a DecisionTreeRegressor to the training set:
from sklearn.tree import DecisionTreeRegressor
# Instantiate a DecisionTreeRegressor with max_depth=2 (two levels of splits)
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
# Train the tree on the data X and the labels y
tree_reg1.fit(X, y)
output:
DecisionTreeRegressor(max_depth=2, random_state=42)
Next, we’ll train a second DecisionTreeRegressor on the residual errors made by the first predictor:
# To train a second DecisionTreeRegressor on the residual errors of the first predictor,
# take the first tree's predictions on X and subtract them from the true labels:
#   y2 (labels for the second tree) = y (true labels) - tree_reg1.predict(X)
# so y2 holds the residual errors of the first tree.
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
# Train the second tree on the data X and the residual errors y2
tree_reg2.fit(X, y2)
output:
DecisionTreeRegressor(max_depth=2, random_state=42)
Then we train a third regressor on the residual errors made by the second predictor:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)
output:
DecisionTreeRegressor(max_depth=2, random_state=42)
Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:
#Instance (data point) to test on
X_new = np.array([[0.8]])
#To get the final prediction - sum prediction of all the models
y_pred = sum(tree.predict(X_new)
for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred
output:
array([0.75026781])
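The three manual steps generalize to a loop. The following is a minimal sketch of that idea; the helper names `fit_residual_ensemble` and `predict_ensemble` are hypothetical, and the data is regenerated under separate names (`X_demo`, `y_demo`) with the same recipe so the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Regenerate a noisy quadratic dataset (same recipe as above, separate names)
rng = np.random.RandomState(42)
X_demo = rng.rand(100, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.randn(100)

def fit_residual_ensemble(X, y, n_trees=3, max_depth=2):
    """Fit each tree to the residual errors left by the trees before it."""
    trees = []
    residual = y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, residual)
        # What this tree could not explain becomes the next tree's target
        residual = residual - tree.predict(X)
        trees.append(tree)
    return trees

def predict_ensemble(trees, X):
    """Ensemble prediction: the sum of the predictions of all trees."""
    return sum(tree.predict(X) for tree in trees)

trees = fit_residual_ensemble(X_demo, y_demo, n_trees=3)

# Training MSE after using the first 1, 2 and 3 trees
train_mse = [np.mean((y_demo - predict_ensemble(trees[:k], X_demo)) ** 2)
             for k in range(1, 4)]
print("Training MSE with 1, 2, 3 trees:", train_mse)
```

Printing the training error after each added tree shows how every new tree removes part of what its predecessors left unexplained.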
The following figure represents the predictions of these three trees in the left column, and the ensemble’s predictions in the right column.
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    # Evaluate the summed predictions of all regressors on a 500-point grid
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)
plt.figure(figsize=(11,11))
plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)
plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)
plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)
plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_3(x_1)$", style="g-", data_style="k+")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)
plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)
save_fig("gradient_boosting_plot")
plt.show()
output:
A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor class.
The following code and figure show two GBRT ensembles. The first uses learning_rate=1 with only three trees, which is not enough to fit the training set closely. The second uses a low learning rate (0.1) with 200 trees, which is too many here and overfits the training set.
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 3, learning_rate=1, random_state=42)
gbrt.fit(X,y)
output:
GradientBoostingRegressor(learning_rate=1, max_depth=2, n_estimators=3, random_state=42)
# Low learning rate (shrinkage): the contribution of each tree is scaled by the learning rate,
# so the ensemble needs more trees to fit the training set, but it usually generalizes better
# provided the number of trees is not too large.
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)
output:
GradientBoostingRegressor(max_depth=2, n_estimators=200, random_state=42)
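Shrinkage can be checked directly on a fitted model. The sketch below assumes the default squared-error loss, for which the ensemble starts from the mean of the training targets and then adds each tree's prediction scaled by the learning rate. The names `gbrt_demo`, `X_demo` and `y_demo` are illustrative, and the data is regenerated so the block is self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Regenerate a noisy quadratic dataset under separate names
rng = np.random.RandomState(42)
X_demo = rng.rand(100, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.randn(100)

gbrt_demo = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                      learning_rate=0.1, random_state=42)
gbrt_demo.fit(X_demo, y_demo)

# estimators_ has shape (n_estimators, 1) for regression: one tree per stage
tree_sum = sum(tree.predict(X_demo) for tree in gbrt_demo.estimators_[:, 0])

# Initial prediction (mean of y, assuming squared-error loss) plus shrunken tree outputs
manual_pred = y_demo.mean() + gbrt_demo.learning_rate * tree_sum

matches = np.allclose(manual_pred, gbrt_demo.predict(X_demo))
print("Manual reconstruction matches predict():", matches)
```

This also explains why a small learning rate needs many trees: each tree only moves the prediction by a fraction of its fitted correction.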
fig, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)
plt.sca(axes[0])
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(gbrt.learning_rate, gbrt.n_estimators), fontsize=14)
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.sca(axes[1])
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)
plt.xlabel("$x_1$", fontsize=16)
save_fig("gbrt_learning_rate_plot")
plt.show()
output:
Using XGBoost
There is an optimized implementation of Gradient Boosting available in the popular Python library XGBoost, which stands for Extreme Gradient Boosting.
It is not required to download the XGBoost library for this class, however.
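try:
    import xgboost
except ImportError as ex:
    print("Error: the xgboost library is not installed.")
    xgboost = None
from sklearn.metrics import mean_squared_error
# X_train, y_train, X_val and y_val are assumed to come from an earlier train/validation split
if xgboost is not None:  # not shown in the book
    xgb_reg = xgboost.XGBRegressor(random_state=42)
    xgb_reg.fit(X_train, y_train)
    y_pred = xgb_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)  # Not shown
    print("Validation MSE:", val_error)            # Not shown
# Refit with early stopping: training stops once the validation error has not improved for 2 rounds.
# Note: newer XGBoost releases may expect early_stopping_rounds in the XGBRegressor constructor instead of fit().
if xgboost is not None:  # not shown in the book
    xgb_reg.fit(X_train, y_train,
                eval_set=[(X_val, y_val)], early_stopping_rounds=2)
    y_pred = xgb_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)  # Not shown
    print("Validation MSE:", val_error)            # Not shown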
%timeit xgboost.XGBRegressor().fit(X_train, y_train) if xgboost is not None else None
%timeit GradientBoostingRegressor().fit(X_train, y_train)