Wednesday, July 22, 2015

Gradient Descent with Python


In the previous two articles, we've talked about linear and logistic regression, covering by it most linear models. And that would be it, if only these models could process data with numerous features, that is x1, x2... xn-1, xn. Let's us see how our classification model deals with Olivetti faces classification dataset:

from sklearn import datasets, metrics, linear_model, cross_validation
import matplotlib.pyplot as plt
import time

#Load the digits dataset
faces = datasets.fetch_olivetti_faces()

start = time.time()

X =
Y =

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)

# Create a classifier: a support vector classifier
model = linear_model.LogisticRegression(C=10000)

# We learn the faces of train set, y_train)

# Now predict the person on test set
expected = y_test
predicted = model.predict(x_test)

end = time.time()
print("%.2f seconds" % (end - start)) # 8.85 seconds

# Reshape data into images
x_train_images = x_train.reshape(len(x_train), 64, 64)
x_test_images = x_test.reshape(len(x_test), 64, 64)

# Plot train images
for index, (image, label) in enumerate(
    zip(x_train_images, y_train)[:4]):
    plt.subplot(2, 4, index + 1)
    plt.title('Train: %i' % label)

# Plot test images    
for index, (image, prediction, expectation) in enumerate(
    zip(x_test_images, predicted, expected)[:4]):
    plt.subplot(2, 4, index + 5)
    plt.title('Predict: %i/%i' % (prediction, expectation))

The dataset contains 64x64 different images of 40 distinct subjects, resulting each item of the dataset containing 4096 features. Despite the good performance of our classifier, we won't dive into measures now, it took almost 9 seconds to process this toy dataset, which is not good enough if we want to learn really big datasets.

The reason for this delay, is the way the classifier solves the optimization problem. Our models analytically solved the values of the model's parameters, which minimize the cost function with the following equation:

X is the matrix of the values of the explanatory variables for each training example. The dot product inside the brackets results a square matrix with dimensions nxn, where n is the number of features. The computational complexity of inverting this square matrix is nearly cubic in the number of explanatory variables. Thus the more features we have, the more complex the computation will be, consequently taking more time.

Gradient Descent

Given a function defined by a set of parameters, gradient descent iteratively moves toward a set of parameter values, which minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient. As though a blindfolded man, who is trying to find his way from somewhere on a mountainside to the lowest point of the valley. He takes a step in the direction with the steepest decline and the sizes of his steps are proportional to the steepness of the terrain. If he were to take large steps, he may have accidentally stepped over the valley's lowest point. In opposite case, while he would reach the desired point, however since the steps are too small, it would take him a considerable amount of time. The proportion is specified by the learning rate provided to the model. When used to minimize the function, a standard (or "batch") gradient descent method would perform the following iterations until it converges:

Stochastic Gradient Descent

Standard gradient descent evaluates the sum-gradient, which may require expensive evaluations of the gradients from all sums and functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single example:

We compute an estimate or approximation to this direction. The most simple way is to just look at one training example (or subset of training examples) and compute the direction to move only on this approximation. It is called as Stochastic because the approximate direction that is computed at every step can be thought of a random variable of a stochastic process. Stochastic method converges much faster compared to standard, but the error function is not as well minimized as in the case of the latter. Often in most cases, the close approximation that you get in stochastic method for the parameter values are enough because they reach the optimal values and keep oscillating there.

Minibatch Gradient Descent

A compromise between the two forms called "mini-batches" computes the gradient against more than one training examples at each step. This can perform significantly better than true stochastic gradient descent, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.


Scikit-learn provides us with both linear and logistic regression implementation using gradient descent through SGDRegressor and SGDClassifier classes respectively. By default the models use stochastic method with batches of 5 examples. Since all scikit-learn models follow the same API, the only needed change to the previous example, is to change the model instantiation.

model = linear_model.SGDClassifier(alpha=0.00001)

After we change the classifier, the code finished within less then a second!


The important point to remember is that gradient descent looks for local minima, which sometimes may differ from the global one. The outcome of the process is significantly dependent upon where it starts, as illustrated below. To gain more confidence in the results, one must run it several times.

Hope you've enjoyed the article and feel free to comment below. In the next article we'll look into techniques to choose the parameters provided to the models, to achieve better results.

Wednesday, July 8, 2015

Logistic regression with Python


After having mastered linear regression in the previous article, let's take a look at logistic regression. Despite its name, it is not that different from linear regression, but rather a linear model for classification achieved by using sigmoid function instead of polynomial one. Specifically we'll be using logistic function, hence the name of the regression. Don't worry, in the future articles we'll be covering the usage of other sigmoid functions as well.


The implementation of logistic regression in scikit-learn can be accessed from class LogisticRegression. This implementation can fit a multiclass logistic regression with optional L1 or L2 regularization. In the next example we'll classify iris flowers according to their sepal length and width:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X =[:, :2]  # we only take the first two features.
Y =

model = linear_model.LogisticRegression(C=10000) # C = 1/alpha, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]:[y_min, y_max]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z,

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

Firstly, pay attention to the inversed regularization parameter, C, provided to the classifier. As with linear regression, scikit provides class, LogisticRegressionCV, to evaluate different learning rates. Secondly, though not related directly to the topic, have a look at meshgrid feature of numpy library and how it is used to plot a decision boundary.

Performance metrics

As opposed to linear regression, classification results cannot be evaluated with R-squared measure. A variety of metrics exist to evaluate the performance of classifiers against labels. The most common metrics are accuracy, precision, recall, F1 score and ROC AUC score. All of these measures depend on the concepts of true positives, true negatives, false positives and false negatives. All these apply to a binary classifier, but with a little trick of setting a positive label, the measures can be used on multiclass models as well.

True positives (TP)

The amount of labels, which were correctly identified by the classifier, that is assigned point labeled A to class A.

True negatives (TN)

The amount of labels, which were correctly rejected by the classifier, that is assigned point labeled A' to class A'.

False positives (FP)

The amount of labels, which were incorrectly identified by the classifier, that is assigned point labeled A' to class A.

False negatives (FN)

The amount of labels, which were incorrectly rejected by the classifier, that is assigned point labeled A to class A'.


Accuracy measures a fraction of the classifier's predictions that are correct, that is the number of correct assessments divided by the number of all assessments - (TN + TP)/(TN + TP + FN + FP).

Precision (P)

Precision is the fraction of positive predictions that are correct - TP/(TP + FP). Be careful as classifier predicting only a single positive instance, that happens to be correct, will achieve perfect precision.

Recall (R)

Recall, sometimes called sensitivity in medical domains, measures the fraction of the truly positive instances. A score of 1 indicates, that no false negative were present - TP/(TP + FN). Be careful as classifier predicting positive for every example will achieve a recall of 1.

F1 score

Both precision and recall scores provide an incomplete view on the classifier performance and sometimes may provide skewed results. The F1 measure provides a better view by calculating weighted average of the scores - 2*P*R/(P + R). A model with perfect precision and recall scores will achieve an F1 score of one.


ROC curves plot the classifier's recall against its fall-out, false positive rate, is the number of false positives divided by the total number of negatives - FP/(TN + FP). AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier. A classifier that predicts classes randomly, has an AUC of 0.5. Any number above this, outperforms the random guessing. Unfortunately scikit-learn supports only binary classifier, when it comes to ROC, however iterating over each label and setting it as positive one, results what we need:
import numpy as np
from sklearn import linear_model, datasets, cross_validation, metrics

# import some data to play with
iris = datasets.load_iris()
X =[:, :2]  # we only take the first two features.
Y =
model = linear_model.LogisticRegression(C=10000)

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y), y_train)
y_pred = model.predict(x_test)
print("Accuracy: %2f" % metrics.accuracy_score(y_test, y_pred))
print("Precision: %2f" % metrics.precision_score(y_test, y_pred, average="macro"))
print("F1: %2f" % metrics.f1_score(y_test, y_pred, average="macro"))
for label in np.arange(3):
    false_positive_rate, recall, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=label)
    roc_auc = metrics.auc(false_positive_rate, recall)
    plt.plot(false_positive_rate, recall, label='AUC(%d) = %0.2f' % (label, roc_auc))

plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
# Accuracy: 0.815789
# Precision: 0.817460
# F1: 0.813789


This concludes the classification topic and with linear regression, covers linear models. Next time we'll talk about techniques used to handle numerous explanatory variables.