Monday, September 28, 2015

Decision Tree with Python


Last time we talked about non-linear classifier using Support Vector Machines or SVM. Today we'll be discussing another non-linear classifier and regressor called decision tree. The way decision tree works is by creating a model, which predicts the value of a target variable by learning simple decision rules inferred from the data features.

Since trees can be visualized and is something we're all used to, decision trees can easily be explained, visualized and manipulated the non-linearity in an intuitive manner. Surely there are some disadvantages as well, but we'll note them a bit later, firstly let's see them in action.


It won't come as a complete surprise to you, that scikit package has already taken initiative and implemented the whole thing using DecisionTreeRegressor and DecisionTreeClassifier classes. What is left for us is to bear the fruits of someone else's hard labour.

import StringIO
import numpy as np
import matplotlib.pyplot as plt
import pydot
from IPython.display import Image
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X =[:, pair]
    y =

    # Shuffle
    idx = np.arange(X.shape[0])
    X = X[idx]
    y = y[idx]

    # Standardize
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    X = (X - mean) / std

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.pcolormesh(xx, yy, Z,


    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color,

plt.legend(loc="upper left")

First we normalize the data and then draw decision boundaries and at last the data itself.

Take a look at the first subplot and let's compare it with one, where we used logistic regression. Back then we couldn't classify the data using these features, now using the decision tree we surely can.


Since decision tree uses a tree data-structure, wouldn't it be cool to visualize it.
Notice: You'll need to install GraphViz package to run this example

dot_data = StringIO.StringIO()  
tree.export_graphviz(clf, out_file=dot_data,  
                     filled=True, rounded=True,  
graph = pydot.graph_from_dot_data(dot_data.getvalue())  

Now, when we have our tree in place, let's see how the decisions were made with Gini coefficient attached to each node. What Gini coefficient measures is the inequality among values of a frequency distribution, in our case iris species. A Gini coefficient of zero expresses perfect equality, where all values are the same - all iris are of the same species.

So the first check is against septal length being lesser than -0.7442, and from Gini coefficient 0.6667 we can deduce that it splits the data with one third going into a single category, setosa. The process continues until we reach Gini coefficient 0, that is the remaining data is of a single category.


One of the notable advantages of using decision trees is it's prediction performance, which is logarithmic in the number of data points used to train the tree. But there is are some downsides as well, which are needed to be considered as well. The first one is can be seen from our first subplot example - overfitting. Trees tend to perform incredibly well at the top, but at the same time tend to overfit at the bottom. This is a major downsite and therefore trees should be pruned! Also decision trees can be unstable because small variations or noise in the data might result in a completely different tree being generated. However this problem is mitigated by using decision trees within an ensemble, of which we'll talk next time.

Tuesday, September 8, 2015

Support Vector Machines with Python


Having learnt different optimization and classification methods, you must feel quit confident to start exploring the datasets of interest. However you might very quickly run into a dataset, which no matter how hyper-parameters are set, is just not linearly separable. Surely there is no place for despair, especially since we have just a classifier to deal with these situation, called Support Vector Machine.

How it works

Support Vector Machine, or SVM, are a set of supervised learning methods used for classification and with a slight change for regression. The core idea of it is to linearly separate the hyper-space of features. The prefix hyper is not occasional, as SVM increases the dimension of feature space to achieve it's goal. The power of the method comes from using kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates, thus allowing SVM to operate on datasets with much bigger feature space, like image recognition. Consider the following example:

There's no linear decision boundary for this dataset, which will separate observations of two classes. If you apply linear classifier, you'll just receive an "arbitrary" line throughout the space crossing both of the classes - you just cannot do it correctly with logistic regression.

The way SVM approaches the problem, is by augmenting the dataset with additional dimensions and trying to maximize the margin between the classes. It places the separation so that the distance to closest misclassified entity is the widest. Have a look at the following illustration depicting the process:

Applying a kernel function, it transforms the observations to a space, where they can be linearly separable, resulting the following separation:


It won't come as a big surprise to you, that scikit-learn has a fully implemented and optimized SVM support, so let's have a look how it deals with various situations. We'll create three demo datasets, one in shape of moons, one of circles and the last will contain linearly separable observations. Then we'll apply SVM with two different kernel functions: linear and radial based function. In the end we'll plot the observations and color the background according to distance from boundary hyperplane using diverging palette, that is the farther it's from center the darker the color (red or blue depending on the sign). Take a closer look into the example code and make sure you understand the process.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.svm import SVC

h = .02  # step size in the mesh
names = ["Linear SVM", "RBF SVM"]
classifiers = [
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1)]

X, y = make_classification(n_features=2, n_redundant=0,
        random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),

figure = plt.figure(figsize=(9, 9))
i = 1
# iterate over datasets
for ds in datasets:
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y,

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm =
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
      cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i), y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. We'll assign a color to
        # each point in the mesh [x_min, m_max]x[y_min, y_max].
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
          cmap=cm_bright, alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.text(xx.max() - .3, yy.min() + .3, 
          ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')
        i += 1

The results, as anticipated, are much more promising than using a linear classifier. For moon and circle shaped datasets, there is tremendous increase classification score, where as for linearly separable observations the result doesn't change. This is of course understandable as linearly separable dataset can be easily dealt with using linear classifier.


SVM is a powerful tool, however from a practical point of view perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic with the number of samples, which makes it hard to scale to dataset with more than a couple of 10000 samples.