Introduction
Remember we've talked about random forests and how they improve on a single decision tree classifier. The idea of fitting a number of classifiers on various sub-samples of the dataset and combining their predictions to improve the predictive accuracy can be applied to other algorithms as well; this general approach is called ensemble learning. There are several ensemble techniques we can use to improve our algorithm; we'll cover two of the most widely used ones: AdaBoost, a boosting method that refits the classifier on reweighted data, and Bagging, which aggregates classifiers trained on random sub-samples.
AdaBoost
An AdaBoost classifier begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
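To make the reweighting idea concrete, here is a minimal sketch of discrete AdaBoost using decision stumps as the base classifier. It illustrates the principle only and is not scikit-learn's internal implementation; the toy dataset, the number of rounds and names like y_signed and stump are illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y_signed = 2 * y - 1                     # labels in {-1, +1}
weights = np.full(len(y), 1 / len(y))    # start with uniform sample weights

for _ in range(10):
    # Fit a weak learner (a decision stump) on the weighted data
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = 2 * stump.predict(X) - 1
    # Weighted error and the resulting weight of this estimator
    err = weights[pred != y_signed].sum() / weights.sum()
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # Increase the weights of misclassified samples so the next stump focuses on them
    weights *= np.exp(-alpha * y_signed * pred)
    weights /= weights.sum()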
Bagging
A Bagging classifier fits base classifiers, each on a random subset of the original dataset (drawn with replacement by default), and then aggregates their individual predictions, by voting or averaging, to form a final prediction.
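For reference, the main knobs of scikit-learn's BaggingClassifier control how those random subsets are drawn. The values below are illustrative, not tuned; the base estimator is passed positionally, since its keyword name differs between scikit-learn versions.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,   # number of base classifiers
                        max_samples=0.8,   # fraction of samples per classifier
                        bootstrap=True,    # sample with replacement
                        random_state=0)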
Implementation
All of these ensemble methods live in the sklearn.ensemble module. Let's first see how to improve an SVM using a BaggingClassifier.
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_gaussian_quantiles
from sklearn.svm import SVC

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
                                 n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
                                 n_samples=300, n_features=2,
                                 n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

# Create and fit a bagged SVM
bdt = BaggingClassifier(SVC())
bdt.fit(X, y)

plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
                c=c, cmap=plt.cm.Paired,
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')
plt.show()
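To check whether bagging actually helps on this dataset rather than just eyeballing the plot, one option is to hold out part of the data and compare a single SVM against the bagged ensemble. This is a rough sketch reusing X and y from the script above; the 70/30 split and random_state are arbitrary choices.

from sklearn.model_selection import train_test_split

# Hold out 30% of the data and compare test accuracy
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
print("Single SVM accuracy: %.3f" % SVC().fit(X_tr, y_tr).score(X_te, y_te))
print("Bagged SVM accuracy: %.3f" % BaggingClassifier(SVC()).fit(X_tr, y_tr).score(X_te, y_te))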
Now, let's swap the ensemble: replace the bdt = BaggingClassifier(SVC()) line with an AdaBoostClassifier built on a shallow decision tree.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
...
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
...
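The max_depth=1 trees (decision stumps) are the classic weak learner for AdaBoost: each one is barely better than chance, but boosting reweights the data so that later stumps concentrate on the examples earlier ones got wrong, while algorithm="SAMME" selects the discrete variant that reweights using the hard class predictions. As with the bagging example, a quick sanity check is to compare a single stump against the boosted ensemble on the same X and y; this is a sketch, with an arbitrary 5-fold split.

from sklearn.model_selection import cross_val_score

# Reuses X, y from the dataset construction above and bdt from the change above
stump = DecisionTreeClassifier(max_depth=1)
print("Single stump accuracy: %.3f" % cross_val_score(stump, X, y, cv=5).mean())
print("AdaBoost accuracy:     %.3f" % cross_val_score(bdt, X, y, cv=5).mean())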
Conclusion
Ensemble techniques such as bagging and boosting can dramatically improve a base algorithm and are an indispensable tool in any data scientist's toolkit.