Wednesday, October 7, 2015

Random Forest with Python


Introduction

In the article about decision tree we've talked about it's drawbacks of being sensitive to small variations or noise in the data. Today we'll see how to deal with them by introducing a random forest. It belongs to a larger class of machine learning algorithms called ensemble methods, which use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. So which models does random forest aggregate? You might already know the answer - the decision trees. It fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.

Implementation

Scikit-learn provides us with two classes RandomForestClassifier and RandomForestRegressor for classification and regression problems respectively. Let's use the code from the previous example and see how the result will different, using random forest with 100 trees.

...
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
...

As you can see the edges are much smoother, that is less overfitted, than using a single decision tree.

Conclusion

Despite it's relative simplicity, random forest performs the job remarkably well. According to empirical comparisons, even better than SVMs.