Tuesday, September 8, 2015

Support Vector Machines with Python


Introduction

Having learnt different optimization and classification methods, you must feel quite confident to start exploring datasets of interest. However, you might very quickly run into a dataset which, no matter how the hyper-parameters are set, is just not linearly separable. There is no reason for despair, though, since we have a classifier designed to deal with exactly this situation, called the Support Vector Machine.

How it works

Support Vector Machines, or SVMs, are a set of supervised learning methods used for classification and, with a slight change, for regression. The core idea is to linearly separate the feature hyper-space. The prefix hyper is not accidental, as an SVM increases the dimension of the feature space to achieve its goal. The power of the method comes from using kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data points in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates, which allows an SVM to operate on datasets with a much bigger feature space, as in image recognition. Consider the following example:

There is no linear decision boundary for this dataset that separates the observations of the two classes. If you apply a linear classifier, you'll just get an "arbitrary" line running through the space and crossing both classes - you simply cannot classify this dataset correctly with logistic regression.
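
To see this concretely, here is a minimal sketch (my own addition, not part of the figures above) that fits scikit-learn's LogisticRegression to a circles-shaped dataset; since no straight line can separate an inner circle from the ring around it, the test accuracy typically lands close to chance level.

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# one class forms a ring around the other - hopeless for a linear model
X, y = make_circles(noise=0.2, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
  test_size=0.4, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically around 0.5, i.e. guessing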

The way an SVM approaches the problem is by augmenting the dataset with additional dimensions and trying to maximize the margin between the classes: it places the separating hyperplane so that its distance to the closest observations (the support vectors) is as wide as possible. Have a look at the following illustration depicting the process:

By applying a kernel function, it transforms the observations into a space where they are linearly separable, resulting in the following separation:
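
The claim above - that the kernel only needs inner products and never the high-dimensional coordinates themselves - can be checked numerically. The following sketch (a hand-written illustration, not something the SVM computes explicitly) builds the explicit degree-2 polynomial feature map for 2-D points and verifies that its inner products match the polynomial kernel (x·y + 1)² evaluated directly on the original points.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 2)  # five random 2-D points

def phi(x):
    # explicit degree-2 polynomial feature map for a 2-D point
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

Phi = np.array([phi(x) for x in X])

K_explicit = Phi.dot(Phi.T)       # inner products in the explicit 6-D space
K_kernel = (X.dot(X.T) + 1) ** 2  # polynomial kernel on the original 2-D points

print(np.allclose(K_explicit, K_kernel))  # True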

Implementation

It won't come as a big surprise that scikit-learn has fully implemented and optimized SVM support, so let's have a look at how it deals with various situations. We'll create three demo datasets: one in the shape of moons, one of circles, and a last one containing linearly separable observations. Then we'll apply SVM with two different kernel functions: linear and radial basis function (RBF). Finally, we'll plot the observations and color the background according to the distance from the separating hyperplane using a diverging palette, that is, the farther a point is from the boundary, the darker the color (red or blue depending on the sign). Take a closer look at the example code and make sure you understand the process.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.svm import SVC

h = .02  # step size in the mesh
names = ["Linear SVM", "RBF SVM"]
classifiers = [
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1)]

X, y = make_classification(n_features=2, n_redundant=0,
        random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(9, 9))
i = 1
# iterate over datasets
for ds in datasets:
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
      test_size=0.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
      cmap=cm_bright)
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
      cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. We'll assign a color to
        # each point in the mesh [x_min, x_max]x[y_min, y_max].
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
          cmap=cm_bright)
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
          cmap=cm_bright, alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, 
          ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')
        i += 1

plt.show()

The results, as anticipated, are much more promising than with a linear classifier. For the moon- and circle-shaped datasets there is a tremendous increase in classification score, whereas for the linearly separable observations the result doesn't change. This is understandable, of course, since a linearly separable dataset can easily be dealt with by a linear classifier.

Conclusion

SVM is a powerful tool; however, from a practical point of view perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming, whose fit time grows more than quadratically with the number of samples, making it hard to scale to datasets with more than a couple of tens of thousands of samples.
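
As a rough, machine-dependent illustration of that scaling behaviour, the following sketch (my own addition) times SVC training on progressively larger synthetic datasets; the exact numbers will vary from machine to machine, but the growth is typically much faster than linear.

import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# double the sample count a few times and watch the fit time grow
for n_samples in [1000, 2000, 4000, 8000]:
    X, y = make_classification(n_samples=n_samples, n_features=20,
      random_state=0)
    start = time.time()
    SVC(C=1).fit(X, y)
    print(n_samples, "samples:", round(time.time() - start, 2), "seconds")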