Tuesday, October 27, 2015

# Python for Data Scientists - Rodeo

## Introduction

I love Python, I really do, and that goes for IPython as well - it's a great tool and it simplifies the work a lot. But... there is always a but, isn't there? RStudio is so much better, and until recently we, the Python data enthusiasts, could only look enviously at RStudio while working with its somewhat beloved, somewhat limping brother IPython. Well, no more. Let me introduce you to Rodeo.

The IDE is free and super easy to use; it's very similar to RStudio, and after you watch the introduction video above, you'll be ready to go.

Wednesday, October 7, 2015

# Random Forest with Python

## Introduction

In the article about decision trees we talked about their drawback of being sensitive to small variations or noise in the data. Today we'll see how to deal with this by introducing the random forest. It belongs to a larger class of machine learning algorithms called ensemble methods, which combine multiple learning algorithms to obtain better predictive performance than any of the constituent algorithms could achieve alone. So which models does a random forest aggregate? You might already know the answer - decision trees. It fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
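The averaging idea is simple enough to sketch by hand: fit several trees, each on a bootstrap sample of the data, and aggregate their votes. A minimal illustration with scikit-learn (the dataset and the number of trees here are arbitrary choices for demonstration, not the setup from the original example):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Fit each tree on a bootstrap sample (drawn with replacement)
trees = []
for _ in range(25):
    idx = rng.randint(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate by majority vote across the ensemble
votes = np.mean([t.predict(X) for t in trees], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
```

Each individual tree overfits its own sample, but their errors are partly independent, so the vote smooths them out - which is exactly what RandomForestClassifier automates (plus random feature selection at each split).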

## Implementation

Scikit-learn provides us with two classes, RandomForestClassifier and RandomForestRegressor, for classification and regression problems respectively. Let's use the code from the previous example and see how the result differs when we use a random forest with 100 trees.

```python
# ... (same data and plotting code as in the decision tree example)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
# ...
```


As you can see, the class boundaries are much smoother - that is, less overfitted - than those produced by a single decision tree.
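The smoothing effect can also be checked numerically by comparing held-out accuracy. A sketch on a noisy synthetic dataset (make_moons is a stand-in here, since the original example's data isn't reproduced in this post):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown single tree vs. a forest of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("tree test accuracy:  ", tree.score(X_test, y_test))
print("forest test accuracy:", forest.score(X_test, y_test))
```

On noisy data like this, the single tree typically chases the noise while the forest's averaged boundary generalizes better.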

## Conclusion

Despite its relative simplicity, the random forest performs the job remarkably well - according to some empirical comparisons, even better than SVMs.
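That claim depends on the dataset, of course, but it's easy to probe on any problem at hand with cross-validation. A quick sanity check (the dataset and hyperparameters here are arbitrary choices, not the published comparison):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# SVMs are scale-sensitive, so standardize features first
svm = make_pipeline(StandardScaler(), SVC())

rf_score = cross_val_score(rf, X, y, cv=5).mean()
svm_score = cross_val_score(svm, X, y, cv=5).mean()
print(f"random forest: {rf_score:.3f}, SVM: {svm_score:.3f}")
```

Either model can win on a given dataset; the point is that the forest gets competitive results with almost no tuning.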