Wednesday, October 7, 2015

Random Forest with Python


Introduction

In the article about decision tree we've talked about it's drawbacks of being sensitive to small variations or noise in the data. Today we'll see how to deal with them by introducing a random forest. It belongs to a larger class of machine learning algorithms called ensemble methods, which use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. So which models does random forest aggregate? You might already know the answer - the decision trees. It fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.

Implementation

Scikit-learn provides us with two classes RandomForestClassifier and RandomForestRegressor for classification and regression problems respectively. Let's use the code from the previous example and see how the result will different, using random forest with 100 trees.

...
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
...

As you can see the edges are much smoother, that is less overfitted, than using a single decision tree.

Conclusion

Despite it's relative simplicity, random forest performs the job remarkably well. According to empirical comparisons, even better than SVMs.

Monday, September 28, 2015

Decision Tree with Python


Introduction

Last time we talked about non-linear classifier using Support Vector Machines or SVM. Today we'll be discussing another non-linear classifier and regressor called decision tree. The way decision tree works is by creating a model, which predicts the value of a target variable by learning simple decision rules inferred from the data features.

Since trees can be visualized and is something we're all used to, decision trees can easily be explained, visualized and manipulated the non-linearity in an intuitive manner. Surely there are some disadvantages as well, but we'll note them a bit later, firstly let's see them in action.

Implementation

It won't come as a complete surprise to you, that scikit package has already taken initiative and implemented the whole thing using DecisionTreeRegressor and DecisionTreeClassifier classes. What is left for us is to bear the fruits of someone else's hard labour.

import StringIO
import numpy as np
import matplotlib.pyplot as plt
import pydot
from IPython.display import Image
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Shuffle
    idx = np.arange(X.shape[0])
    np.random.seed(13)
    np.random.shuffle(idx)
    X = X[idx]
    y = y[idx]

    # Standardize
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    X = (X - mean) / std

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis()

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color,
                    label=iris.target_names[i],
                    cmap=plt.cm.Paired)
    plt.axis()

plt.legend(loc="upper left")
plt.show()

First we normalize the data and then draw decision boundaries and at last the data itself.

Take a look at the first subplot and let's compare it with one, where we used logistic regression. Back then we couldn't classify the data using these features, now using the decision tree we surely can.

Visualization

Since decision tree uses a tree data-structure, wouldn't it be cool to visualize it.
Notice: You'll need to install GraphViz package to run this example

...
dot_data = StringIO.StringIO()  
tree.export_graphviz(clf, out_file=dot_data,  
                     feature_names=iris.feature_names,  
                     class_names=iris.target_names,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

Now, when we have our tree in place, let's see how the decisions were made with Gini coefficient attached to each node. What Gini coefficient measures is the inequality among values of a frequency distribution, in our case iris species. A Gini coefficient of zero expresses perfect equality, where all values are the same - all iris are of the same species.

So the first check is against septal length being lesser than -0.7442, and from Gini coefficient 0.6667 we can deduce that it splits the data with one third going into a single category, setosa. The process continues until we reach Gini coefficient 0, that is the remaining data is of a single category.

Conclusion

One of the notable advantages of using decision trees is it's prediction performance, which is logarithmic in the number of data points used to train the tree. But there is are some downsides as well, which are needed to be considered as well. The first one is can be seen from our first subplot example - overfitting. Trees tend to perform incredibly well at the top, but at the same time tend to overfit at the bottom. This is a major downsite and therefore trees should be pruned! Also decision trees can be unstable because small variations or noise in the data might result in a completely different tree being generated. However this problem is mitigated by using decision trees within an ensemble, of which we'll talk next time.

Tuesday, September 8, 2015

Support Vector Machines with Python


Introduction

Having learnt different optimization and classification methods, you must feel quit confident to start exploring the datasets of interest. However you might very quickly run into a dataset, which no matter how hyper-parameters are set, is just not linearly separable. Surely there is no place for despair, especially since we have just a classifier to deal with these situation, called Support Vector Machine.

How it works

Support Vector Machine, or SVM, are a set of supervised learning methods used for classification and with a slight change for regression. The core idea of it is to linearly separate the hyper-space of features. The prefix hyper is not occasional, as SVM increases the dimension of feature space to achieve it's goal. The power of the method comes from using kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates, thus allowing SVM to operate on datasets with much bigger feature space, like image recognition. Consider the following example:

There's no linear decision boundary for this dataset, which will separate observations of two classes. If you apply linear classifier, you'll just receive an "arbitrary" line throughout the space crossing both of the classes - you just cannot do it correctly with logistic regression.

The way SVM approaches the problem, is by augmenting the dataset with additional dimensions and trying to maximize the margin between the classes. It places the separation so that the distance to closest misclassified entity is the widest. Have a look at the following illustration depicting the process:

Applying a kernel function, it transforms the observations to a space, where they can be linearly separable, resulting the following separation:

Implementation

It won't come as a big surprise to you, that scikit-learn has a fully implemented and optimized SVM support, so let's have a look how it deals with various situations. We'll create three demo datasets, one in shape of moons, one of circles and the last will contain linearly separable observations. Then we'll apply SVM with two different kernel functions: linear and radial based function. In the end we'll plot the observations and color the background according to distance from boundary hyperplane using diverging palette, that is the farther it's from center the darker the color (red or blue depending on the sign). Take a closer look into the example code and make sure you understand the process.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.svm import SVC

h = .02  # step size in the mesh
names = ["Linear SVM", "RBF SVM"]
classifiers = [
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1)]

X, y = make_classification(n_features=2, n_redundant=0,
        random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(9, 9))
i = 1
# iterate over datasets
for ds in datasets:
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
      test_size=0.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
      cmap=cm_bright)
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
      cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. We'll assign a color to
        # each point in the mesh [x_min, m_max]x[y_min, y_max].
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
          cmap=cm_bright)
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
          cmap=cm_bright, alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, 
          ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')
        i += 1

plt.show()

The results, as anticipated, are much more promising than using a linear classifier. For moon and circle shaped datasets, there is tremendous increase classification score, where as for linearly separable observations the result doesn't change. This is of course understandable as linearly separable dataset can be easily dealt with using linear classifier.

Conclusion

SVM is a powerful tool, however from a practical point of view perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic with the number of samples, which makes it hard to scale to dataset with more than a couple of 10000 samples.

Tuesday, August 25, 2015

Dimension reduction with Python


Introduction

Sometimes, no matter how good an algorithm is, it just doesn’t work. Or worse, it doesn’t pick up anything. Data can be quite noisy, and sometimes it’s just about impossible to figure out what went wrong.

It's worth noticing that the most interesting machine learning challenges always involve some sort of feature engineering, where we try to use our insight into the problem to carefully craft additional features, that the machine learner hopefully picks up.

Garbage in, garbage out, that's what we know from real life. Not surprisingly this pattern also holds true, when applying machine learning methods to training data. To tackle the issue, we will go in the opposite direction with dimensionality reduction involving cutting away features that are irrelevant or redundant. There are several good reasons to trim down the dimensions as much as possible:

  • Most of the models hate high-dimensional spaces and superfluous features, which often irritate or mislead the learner.
  • More features means more parameters to tune and a higher risk of overfitting.
  • Linearly correlated data is common phenomenon in datasets, which as a result might just have artificial high dimensions, whereas the real dimension might be small.
  • Less dimensions means faster training and more variations to try out, resulting in better end results.

Dimension reduction techniques

Before diving into more sophisticated algorithms, let's try to get an educated guess how to get rid of the garbage within our data, while keeping the valuable part of it.

Missing Values Ratio

Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. Naturally, the higher the threshold, the more aggressive the reduction.

Low Variance

Another candidates are columns with little changes, that is low variance. Don't forget to normalize the data before comparing against the threshold, as variance is range dependent.

High Correlation

Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical and nominal columns as the Coefficient and the Pearson’s chi square value respectively. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.

Backward Feature Elimination

This great technique allows us to remove least important features. At a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. Each iteration k produces a model trained on n-k features and an error rate (k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.

Forward Feature Construction

This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance.

Principal Component Analysis

Principal Component Analysis, also known as the Karhunen-Loeve Transform, is a technique used to search for patterns in high-dimensional data. PCA reduces a set of possibly-correlated, high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. The technique orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components by reducing the dimensions of a data set projecting the data onto a lower-dimensional subspace. The lower-dimensional data will preserve as much of the variance of the original data as possible. For example, a two dimensional data set could be reduced by projecting the points onto a line; each instance in the data set would then be represented by a single value rather than a pair of values. A three-dimensional dataset could be reduced to two dimensions by projecting the variables onto a plane. In general, an n-dimensional dataset can be reduced by projecting the dataset onto a k-dimensional subspace, where k is less than n. Let's have a look at two dimensional example of four points, projected onto on dimensional space - a single line.

Implementation

Some of the presented technique, while being very effective, unfortunately not always can be used in practice. For instance both Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns. PCA is much less computationally expensive algorithm and is commonly used to explore and visualize high-dimensional data sets. It can also be used to compress data, and process data before it is used by another estimator.

Many implementations of PCA, including the one of scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:

The columns of U are called left singular vectors of the data matrix, the columns of conjugately transposed matrix V are its right singular vectors, and Σ the diagonal entries of are its singular values. To make the algebra work, the matrices have to follow the dimensions rules, as depicted in the following illustration

While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, we are only interested in them as they relate to the eigenvectors and eigenvalues of the data matrix. Specifically, the left singular vectors are the eigenvectors of the covariance matrix and the diagonal elements of Σ are the square roots of the eigenvalues of the covariance matrix. Eigenvectors found using SVD should be similar to those derived from a covariance matrix.

Let's see how it's done in practice. As you remember Iris dataset consists of four dimensions: sepal length, sepal width, petal length and lastly petal width. Let's apply PCA on the dataset and plot the results in 3 dimensional space.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
    
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.prism)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()

Conclusion

Indeed, in the era of big data, when more is axiomatically better, we have re-discovered that too many noisy or even faulty input data columns often lead to a less than desirable algorithm performance. Removing un-informative or even worse dis-informative input attributes might help build a model on more extensive data regions, with more general classification rules, and overall with better performances on new unseen data.

Tuesday, August 11, 2015

Hyperparameter optimization with Python


Introduction

In the previous articles we introduced several linear techniques, where as you have probably noticed, we provided the algorithms with several parameters. The dependence of machine learning algorithm upon learning parameters is a common case though and one has to check the performance of various parameters to achieve the best results. The task of course is no trifle and is called hyperparameter optimization or model selection. It is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set.

Implementation

Grid Search

The traditional way of performing hyperparameter optimization is a grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. Scikit-learn provides us with a class GridSearchCV implementing the technique. Let's try to find the best learning rate, regularization technique and batch size for stochastic gradient descent from the previous article.

import numpy as np
from time import time
from operator import itemgetter
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = SGDClassifier()

# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}\n".format(score.parameters))

# use a full grid over all parameters
param_grid = {"n_iter": [1, 5, 10],
              "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
              "penalty": ["none", "l1", "l2"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)

# GridSearchCV took 4.85 seconds for 63 candidate parameter settings.
# Model with rank: 1
# Mean validation score: 0.929 (std: 0.020)
# Parameters: {'penalty': 'l1', 'alpha': 0.0001, 'n_iter': 5}
 
# Model with rank: 2
# Mean validation score: 0.924 (std: 0.016)
# Parameters: {'penalty': 'l2', 'alpha': 1, 'n_iter': 10}

# Model with rank: 3
# Mean validation score: 0.922 (std: 0.018)
# Parameters: {'penalty': 'none', 'alpha': 0.0001, 'n_iter': 5}

The output of the procedure is a list, grid_scores_, consisting of cross evaluation score of all provided parameters. To get the best parameters, we order the list in a descending order by score mean resulting the best parameters for the provided dataset using Lasso regularization, learning rate of 0.0001 in batches of 5. For more details about regularization types, have a look a broader explanation in linear regression article.

Randomized Grid Search

Since grid searching is an exhaustive and therefore potentially expensive method, there is a random variation of the technique and is implemented by RandomizedSearchCV class in scikit-learn. It doesn't try out all parameter values, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter parameter.

from sklearn.grid_search import RandomizedSearchCV
from scipy.stats import randint, norm

...

# specify parameters and distributions to sample from
param_dist = {"n_iter": randint(1, 11),
              "alpha": uniform(scale=0.01),
              "penalty": ["none", "l1", "l2"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

...

# RandomizedSearchCV took 1.18 seconds for 20 candidates parameter settings.
# Model with rank: 1
# Mean validation score: 0.920 (std: 0.021)
# Parameters: {'penalty': 'l1', 'alpha': 0.0006507721384509924, 'n_iter': 7}
 
# Model with rank: 2
# Mean validation score: 0.919 (std: 0.016)
# Parameters: {'penalty': 'none', 'alpha': 0.0008117236028203323, 'n_iter': 9}
 
# Model with rank: 3
# Mean validation score: 0.918 (std: 0.018)
# Parameters: {'penalty': 'l1', 'alpha': 0.0012165253638165819, 'n_iter': 5}

Here, instead of providing the procedure with list of values, we gave a desired distribution from which the values would be drawn. As we set the n_iter to be 20, the procedure evaluated 20 random variation of the parameters and following the previous logic, the best appeared to be using Lasso regularization, learning rate of 0.00065 in batches of 7.

Conclusion

The randomized search and the grid search explore exactly the same space of parameters. The result in parameter settings is quite similar, while the run time for randomized search is drastically lower. Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.

Wednesday, July 22, 2015

Gradient Descent with Python


Introduction

In the previous two articles, we've talked about linear and logistic regression, covering by it most linear models. And that would be it, if only these models could process data with numerous features, that is x1, x2... xn-1, xn. Let's us see how our classification model deals with Olivetti faces classification dataset:

from sklearn import datasets, metrics, linear_model, cross_validation
import matplotlib.pyplot as plt
import time

#Load the digits dataset
faces = datasets.fetch_olivetti_faces()

start = time.time()

X = faces.data
Y = faces.target

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)

# Create a classifier: a support vector classifier
model = linear_model.LogisticRegression(C=10000)

# We learn the faces of train set
model.fit(x_train, y_train)

# Now predict the person on test set
expected = y_test
predicted = model.predict(x_test)

end = time.time()
print("%.2f seconds" % (end - start)) # 8.85 seconds

# Reshape data into images
x_train_images = x_train.reshape(len(x_train), 64, 64)
x_test_images = x_test.reshape(len(x_test), 64, 64)

# Plot train images
for index, (image, label) in enumerate(
    zip(x_train_images, y_train)[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('Train: %i' % label)

# Plot test images    
for index, (image, prediction, expectation) in enumerate(
    zip(x_test_images, predicted, expected)[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('Predict: %i/%i' % (prediction, expectation))

The dataset contains 64x64 different images of 40 distinct subjects, resulting each item of the dataset containing 4096 features. Despite the good performance of our classifier, we won't dive into measures now, it took almost 9 seconds to process this toy dataset, which is not good enough if we want to learn really big datasets.

The reason for this delay, is the way the classifier solves the optimization problem. Our models analytically solved the values of the model's parameters, which minimize the cost function with the following equation:

X is the matrix of the values of the explanatory variables for each training example. The dot product inside the brackets results a square matrix with dimensions nxn, where n is the number of features. The computational complexity of inverting this square matrix is nearly cubic in the number of explanatory variables. Thus the more features we have, the more complex the computation will be, consequently taking more time.

Gradient Descent

Given a function defined by a set of parameters, gradient descent iteratively moves toward a set of parameter values, which minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient. As though a blindfolded man, who is trying to find his way from somewhere on a mountainside to the lowest point of the valley. He takes a step in the direction with the steepest decline and the sizes of his steps are proportional to the steepness of the terrain. If he were to take large steps, he may have accidentally stepped over the valley's lowest point. In opposite case, while he would reach the desired point, however since the steps are too small, it would take him a considerable amount of time. The proportion is specified by the learning rate provided to the model. When used to minimize the function, a standard (or "batch") gradient descent method would perform the following iterations until it converges:

Stochastic Gradient Descent

Standard gradient descent evaluates the sum-gradient, which may require expensive evaluations of the gradients from all sums and functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single example:

We compute an estimate or approximation to this direction. The most simple way is to just look at one training example (or subset of training examples) and compute the direction to move only on this approximation. It is called as Stochastic because the approximate direction that is computed at every step can be thought of a random variable of a stochastic process. Stochastic method converges much faster compared to standard, but the error function is not as well minimized as in the case of the latter. Often in most cases, the close approximation that you get in stochastic method for the parameter values are enough because they reach the optimal values and keep oscillating there.

Minibatch Gradient Descent

A compromise between the two forms called "mini-batches" computes the gradient against more than one training examples at each step. This can perform significantly better than true stochastic gradient descent, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.

Implementation

Scikit-learn provides us with both linear and logistic regression implementation using gradient descent through SGDRegressor and SGDClassifier classes respectively. By default the models use stochastic method with batches of 5 examples. Since all scikit-learn models follow the same API, the only needed change to the previous example, is to change the model instantiation.

...
model = linear_model.SGDClassifier(alpha=0.00001)
...

After we change the classifier, the code finished within less then a second!

Conclusion

The important point to remember is that gradient descent looks for local minima, which sometimes may differ from the global one. The outcome of the process is significantly dependent upon where it starts, as illustrated below. To gain more confidence in the results, one must run it several times.

Hope you've enjoyed the article and feel free to comment below. In the next article we'll look into techniques to choose the parameters provided to the models, to achieve better results.

Wednesday, July 8, 2015

Logistic regression with Python


Introduction

After having mastered linear regression in the previous article, let's take a look at logistic regression. Despite its name, it is not that different from linear regression, but rather a linear model for classification achieved by using sigmoid function instead of polynomial one. Specifically we'll be using logistic function, hence the name of the regression. Don't worry, in the future articles we'll be covering the usage of other sigmoid functions as well.

Implementation

The implementation of logistic regression in scikit-learn can be accessed from class LogisticRegression. This implementation can fit a multiclass logistic regression with optional L1 or L2 regularization. In the next example we'll classify iris flowers according to their sepal length and width:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

model = linear_model.LogisticRegression(C=10000) # C = 1/alpha
model.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]:[y_min, y_max]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()

Firstly, pay attention to the inversed regularization parameter, C, provided to the classifier. As with linear regression, scikit provides class, LogisticRegressionCV, to evaluate different learning rates. Secondly, though not related directly to the topic, have a look at meshgrid feature of numpy library and how it is used to plot a decision boundary.

Performance metrics

As opposed to linear regression, classification results cannot be evaluated with R-squared measure. A variety of metrics exist to evaluate the performance of classifiers against labels. The most common metrics are accuracy, precision, recall, F1 score and ROC AUC score. All of these measures depend on the concepts of true positives, true negatives, false positives and false negatives. All these apply to a binary classifier, but with a little trick of setting a positive label, the measures can be used on multiclass models as well.

True positives (TP)

The amount of labels, which were correctly identified by the classifier, that is assigned point labeled A to class A.

True negatives (TN)

The amount of labels, which were correctly rejected by the classifier, that is assigned point labeled A' to class A'.

False positives (FP)

The amount of labels, which were incorrectly identified by the classifier, that is assigned point labeled A' to class A.

False negatives (FN)

The amount of labels, which were incorrectly rejected by the classifier, that is assigned point labeled A to class A'.

Accuracy

Accuracy measures a fraction of the classifier's predictions that are correct, that is the number of correct assessments divided by the number of all assessments - (TN + TP)/(TN + TP + FN + FP).

Precision (P)

Precision is the fraction of positive predictions that are correct - TP/(TP + FP). Be careful as classifier predicting only a single positive instance, that happens to be correct, will achieve perfect precision.

Recall (R)

Recall, sometimes called sensitivity in medical domains, measures the fraction of the truly positive instances. A score of 1 indicates, that no false negative were present - TP/(TP + FN). Be careful as classifier predicting positive for every example will achieve a recall of 1.

F1 score

Both precision and recall scores provide an incomplete view on the classifier performance and sometimes may provide skewed results. The F1 measure provides a better view by calculating weighted average of the scores - 2*P*R/(P + R). A model with perfect precision and recall scores will achieve an F1 score of one.

ROC AUC

ROC curves plot the classifier's recall against its fall-out, false positive rate, is the number of false positives divided by the total number of negatives - FP/(TN + FP). AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier. A classifier that predicts classes randomly, has an AUC of 0.5. Any number above this, outperforms the random guessing. Unfortunately scikit-learn supports only binary classifier, when it comes to ROC, however iterating over each label and setting it as positive one, results what we need:
import numpy as np
from sklearn import linear_model, datasets, cross_validation, metrics

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
model = linear_model.LogisticRegression(C=10000)

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Accuracy: %2f" % metrics.accuracy_score(y_test, y_pred))
print("Precision: %2f" % metrics.precision_score(y_test, y_pred, average="macro"))
print("F1: %2f" % metrics.f1_score(y_test, y_pred, average="macro"))
 
for label in np.arange(3):
    false_positive_rate, recall, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=label)
    roc_auc = metrics.auc(false_positive_rate, recall)
    plt.plot(false_positive_rate, recall, label='AUC(%d) = %0.2f' % (label, roc_auc))

plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
# Accuracy: 0.815789
# Precision: 0.817460
# F1: 0.813789

Conclusion

This concludes the classification topic and with linear regression, covers linear models. Next time we'll talk about techniques used to handle numerous explanatory variables.