DataScience Deep Dive: Model Ensembles

Introduction

Remember we've talked about random forest and how it was used to improve the performance of a single Decision Tree classifier. The idea of fitting a number of decision tree classifiers on various sub-samples of the dataset and using averaging to improve the predictive accuracy can be used to other algorithms as well and it's called boosting. There are several boosting techniques, which can be used to improve our algorithm, we'll cover the most used ones: AdaBoost and Bagging boost.

AdaBoost

An AdaBoost classifier begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

Bagging boost

A Bagging classifier fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions to form a final prediction.

Implementation

All boosting methods are located under sklearn.ensemble module. Let's see first how to boost SVM using bagging classifier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_gaussian_quantiles
from sklearn.svm import SVC

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
                                 n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
                                 n_samples=300, n_features=2,
                                 n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

# Create and fit an boosted svm
bdt = BaggingClassifier(SVC())

bdt.fit(X, y)

plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
                c=c, cmap=plt.cm.Paired,
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')

Now, let's change the line 18th to AdaBoost and Decision Tree

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
...
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=200)
...

Conclusion

Boosting techniques may dramatically improve the base algorithm and serve an irreplaceable tool in any data scientists toolkit.

Random Forest with Python

Introduction

In the article about decision tree we've talked about it's drawbacks of being sensitive to small variations or noise in the data. Today we'll see how to deal with them by introducing a random forest. It belongs to a larger class of machine learning algorithms called ensemble methods, which use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. So which models does random forest aggregate? You might already know the answer - the decision trees. It fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.

Implementation

Scikit-learn provides us with two classes RandomForestClassifier and RandomForestRegressor for classification and regression problems respectively. Let's use the code from the previous example and see how the result will different, using random forest with 100 trees.

...
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
...

As you can see the edges are much smoother, that is less overfitted, than using a single decision tree.

Conclusion

Despite it's relative simplicity, random forest performs the job remarkably well. According to empirical comparisons, even better than SVMs.

Classifier Boosting with Python

Introduction

AdaBoost

Bagging boost

Implementation

Conclusion

Random Forest with Python

Introduction

Implementation

Conclusion

Blog Archive

Popular Posts

Labels