Wednesday, July 22, 2015

Gradient Descent with Python


In the previous two articles, we've talked about linear and logistic regression, covering by it most linear models. And that would be it, if only these models could process data with numerous features, that is x1, x2... xn-1, xn. Let's us see how our classification model deals with Olivetti faces classification dataset:

from sklearn import datasets, metrics, linear_model, cross_validation
import matplotlib.pyplot as plt
import time

#Load the digits dataset
faces = datasets.fetch_olivetti_faces()

start = time.time()

X =
Y =

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)

# Create a classifier: a support vector classifier
model = linear_model.LogisticRegression(C=10000)

# We learn the faces of train set, y_train)

# Now predict the person on test set
expected = y_test
predicted = model.predict(x_test)

end = time.time()
print("%.2f seconds" % (end - start)) # 8.85 seconds

# Reshape data into images
x_train_images = x_train.reshape(len(x_train), 64, 64)
x_test_images = x_test.reshape(len(x_test), 64, 64)

# Plot train images
for index, (image, label) in enumerate(
    zip(x_train_images, y_train)[:4]):
    plt.subplot(2, 4, index + 1)
    plt.title('Train: %i' % label)

# Plot test images    
for index, (image, prediction, expectation) in enumerate(
    zip(x_test_images, predicted, expected)[:4]):
    plt.subplot(2, 4, index + 5)
    plt.title('Predict: %i/%i' % (prediction, expectation))

The dataset contains 64x64 different images of 40 distinct subjects, resulting each item of the dataset containing 4096 features. Despite the good performance of our classifier, we won't dive into measures now, it took almost 9 seconds to process this toy dataset, which is not good enough if we want to learn really big datasets.

The reason for this delay, is the way the classifier solves the optimization problem. Our models analytically solved the values of the model's parameters, which minimize the cost function with the following equation:

X is the matrix of the values of the explanatory variables for each training example. The dot product inside the brackets results a square matrix with dimensions nxn, where n is the number of features. The computational complexity of inverting this square matrix is nearly cubic in the number of explanatory variables. Thus the more features we have, the more complex the computation will be, consequently taking more time.

Gradient Descent

Given a function defined by a set of parameters, gradient descent iteratively moves toward a set of parameter values, which minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient. As though a blindfolded man, who is trying to find his way from somewhere on a mountainside to the lowest point of the valley. He takes a step in the direction with the steepest decline and the sizes of his steps are proportional to the steepness of the terrain. If he were to take large steps, he may have accidentally stepped over the valley's lowest point. In opposite case, while he would reach the desired point, however since the steps are too small, it would take him a considerable amount of time. The proportion is specified by the learning rate provided to the model. When used to minimize the function, a standard (or "batch") gradient descent method would perform the following iterations until it converges:

Stochastic Gradient Descent

Standard gradient descent evaluates the sum-gradient, which may require expensive evaluations of the gradients from all sums and functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single example:

We compute an estimate or approximation to this direction. The most simple way is to just look at one training example (or subset of training examples) and compute the direction to move only on this approximation. It is called as Stochastic because the approximate direction that is computed at every step can be thought of a random variable of a stochastic process. Stochastic method converges much faster compared to standard, but the error function is not as well minimized as in the case of the latter. Often in most cases, the close approximation that you get in stochastic method for the parameter values are enough because they reach the optimal values and keep oscillating there.

Minibatch Gradient Descent

A compromise between the two forms called "mini-batches" computes the gradient against more than one training examples at each step. This can perform significantly better than true stochastic gradient descent, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.


Scikit-learn provides us with both linear and logistic regression implementation using gradient descent through SGDRegressor and SGDClassifier classes respectively. By default the models use stochastic method with batches of 5 examples. Since all scikit-learn models follow the same API, the only needed change to the previous example, is to change the model instantiation.

model = linear_model.SGDClassifier(alpha=0.00001)

After we change the classifier, the code finished within less then a second!


The important point to remember is that gradient descent looks for local minima, which sometimes may differ from the global one. The outcome of the process is significantly dependent upon where it starts, as illustrated below. To gain more confidence in the results, one must run it several times.

Hope you've enjoyed the article and feel free to comment below. In the next article we'll look into techniques to choose the parameters provided to the models, to achieve better results.