Wednesday, July 8, 2015

Logistic regression with Python


Introduction

After having mastered linear regression in the previous article, let's take a look at logistic regression. Despite its name, it is not that different from linear regression: it is a linear model used for classification, obtained by passing the linear combination of the features through a sigmoid function instead of using it directly. Specifically, we'll be using the logistic function, hence the name of the regression. Don't worry, in future articles we'll cover the use of other sigmoid functions as well.
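
To make the idea concrete, here is a minimal sketch (plain numpy, toy numbers of my own, not from the example below) of the logistic function squeezing a linear combination of features into the (0, 1) range:

import numpy as np

def logistic(z):
    # maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# a toy linear combination of two features, w.x + b, turned into a probability
w, b = np.array([1.5, -2.0]), 0.3
x = np.array([0.8, 0.1])
print(logistic(np.dot(w, x) + b))  # roughly 0.79, read as P(class = 1 | x)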

Implementation

The scikit-learn implementation of logistic regression is available through the LogisticRegression class. It can fit a multiclass logistic regression with optional L1 or L2 regularization. In the next example we'll classify iris flowers according to their sepal length and width:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

model = linear_model.LogisticRegression(C=10000) # C = 1/alpha
model.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()

Firstly, pay attention to the inverse regularization parameter, C, passed to the classifier. As with linear regression, scikit-learn provides a class, LogisticRegressionCV, to evaluate different regularization strengths via cross-validation. Secondly, though not directly related to the topic, have a look at the meshgrid function of the numpy library and how it is used to plot a decision boundary.
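
For the curious, here is a minimal sketch of how LogisticRegressionCV could be used on the same data; the Cs grid and the number of folds below are arbitrary choices of mine, not part of the example above:

from sklearn import linear_model, datasets

iris = datasets.load_iris()
X, Y = iris.data[:, :2], iris.target

# cross-validate over several inverse regularization strengths
cv_model = linear_model.LogisticRegressionCV(Cs=[0.01, 0.1, 1, 10, 100, 10000], cv=5)
cv_model.fit(X, Y)
print(cv_model.C_)  # the C value selected for each class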

Performance metrics

As opposed to linear regression, classification results cannot be evaluated with the R-squared measure. A variety of metrics exist to evaluate the performance of classifiers against the true labels. The most common are accuracy, precision, recall, F1 score and ROC AUC score. All of these measures depend on the concepts of true positives, true negatives, false positives and false negatives. They are defined for a binary classifier, but with the little trick of designating one label as positive, they can be used on multiclass models as well.

True positives (TP)

The number of instances correctly identified by the classifier, that is, points labeled A that were assigned to class A.

True negatives (TN)

The number of instances correctly rejected by the classifier, that is, points labeled A' that were assigned to class A'.

False positives (FP)

The number of instances incorrectly identified by the classifier, that is, points labeled A' that were assigned to class A.

False negatives (FN)

The number of instances incorrectly rejected by the classifier, that is, points labeled A that were assigned to class A'.
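
To make the four counts concrete, here is a small sketch with toy binary labels of my own; scikit-learn's confusion_matrix gathers them all in one call:

import numpy as np
from sklearn import metrics

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# rows of the confusion matrix are true classes, columns are predicted classes
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1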

Accuracy

Accuracy measures the fraction of the classifier's predictions that are correct, that is, the number of correct assessments divided by the number of all assessments: (TN + TP)/(TN + TP + FN + FP).
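
On the toy labels from the previous sketch the formula agrees with metrics.accuracy_score:

from sklearn import metrics

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# (TN + TP) / (TN + TP + FN + FP) = (3 + 3) / 8
print(metrics.accuracy_score(y_true, y_pred))  # 0.75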

Precision (P)

Precision is the fraction of positive predictions that are correct: TP/(TP + FP). Be careful: a classifier that predicts only a single positive instance, which happens to be correct, will achieve perfect precision.
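
A toy illustration of that trap (labels of my own): one correct positive prediction is enough for perfect precision, while recall reveals the missed positives:

from sklearn import metrics

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]  # a single positive prediction, which is correct

print(metrics.precision_score(y_true, y_pred))  # 1.0  = TP / (TP + FP) = 1 / 1
print(metrics.recall_score(y_true, y_pred))     # 0.25 - three positives were missed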

Recall (R)

Recall, sometimes called sensitivity in medical domains, measures the fraction of the truly positive instances that the classifier recognizes: TP/(TP + FN). A score of 1 indicates that no false negatives were present. Be careful: a classifier that predicts positive for every example will achieve a recall of 1.
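
And the mirror-image trap (again toy labels of my own): predicting positive everywhere yields perfect recall, while precision exposes the false positives:

from sklearn import metrics

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]  # positive for every example

print(metrics.recall_score(y_true, y_pred))     # 1.0   = TP / (TP + FN) = 3 / 3
print(metrics.precision_score(y_true, y_pred))  # 0.375 - five false positives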

F1 score

Both precision and recall provide an incomplete view of the classifier's performance and can sometimes give skewed results. The F1 measure provides a better view by calculating their harmonic mean: 2*P*R/(P + R). A model with perfect precision and recall scores will achieve an F1 score of one.
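
A quick check on the same toy labels as above shows the formula matching metrics.f1_score; the harmonic mean stays low whenever either component is low:

from sklearn import metrics

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]

p = metrics.precision_score(y_true, y_pred)  # 0.375
r = metrics.recall_score(y_true, y_pred)     # 1.0
print(2 * p * r / (p + r))                   # ~0.545
print(metrics.f1_score(y_true, y_pred))      # same value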

ROC AUC

ROC curves plot the classifier's recall against its fall-out, the false positive rate, which is the number of false positives divided by the total number of negatives: FP/(TN + FP). AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier. A classifier that predicts classes randomly has an AUC of 0.5; anything above that outperforms random guessing. Unfortunately, when it comes to ROC, scikit-learn supports only binary classifiers, but iterating over each label and setting it as the positive one gives us what we need:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets, cross_validation, metrics

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
model = linear_model.LogisticRegression(C=10000)

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Accuracy: %2f" % metrics.accuracy_score(y_test, y_pred))
print("Precision: %2f" % metrics.precision_score(y_test, y_pred, average="macro"))
print("F1: %2f" % metrics.f1_score(y_test, y_pred, average="macro"))
 
for label in np.arange(3):
    false_positive_rate, recall, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=label)
    roc_auc = metrics.auc(false_positive_rate, recall)
    plt.plot(false_positive_rate, recall, label='AUC(%d) = %0.2f' % (label, roc_auc))

plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
# Accuracy: 0.815789
# Precision: 0.817460
# F1: 0.813789
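
One caveat about the curves above: roc_curve is fed the hard class predictions, so each curve consists of only a couple of points. If smoother curves are desired, the predicted class probabilities can be used instead; here is a minimal sketch reusing the fitted model and the test split from the example above (and assuming the column order of predict_proba follows model.classes_, i.e. labels 0, 1, 2):

y_proba = model.predict_proba(x_test)  # one column of probabilities per class

for label in np.arange(3):
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_proba[:, label], pos_label=label)
    print(label, metrics.auc(fpr, tpr))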

Conclusion

This concludes the classification topic and, together with linear regression, covers linear models. Next time we'll talk about techniques for handling numerous explanatory variables.