Monday, December 28, 2015

K-nearest Neighbors (KNN) in Python


Introduction

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
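To make the voting idea concrete, here is a minimal sketch in plain NumPy (the toy points and labels are hypothetical) that classifies a single query point by a majority vote among its k nearest neighbors:

import numpy as np
from collections import Counter

# Toy training data: 2D points and their class labels (hypothetical values)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [2.8, 3.1], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1, 1])

def knn_predict(query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([2.9, 3.0])))  # prints 1, the majority class among the 3 closest points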


Implementation

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors)
clf.fit(X, y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()

Now let's see how RadiusNeighborsClassifier works. After playing a bit with its hyperparameters, we arrive at the following plot:

...
clf = neighbors.RadiusNeighborsClassifier(3.0, weights='distance')
...

Conclusion

Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits or satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

Thursday, December 3, 2015

Classifier Boosting with Python


Introduction

Remember how we talked about random forest and how it was used to improve the performance of a single decision tree classifier. The idea of fitting a number of base classifiers on various sub-samples of the dataset and combining their predictions can be applied to other algorithms as well; such techniques are known as ensemble methods. There are several of them which can be used to improve our algorithm; we'll cover the two most used ones: AdaBoost and Bagging.

AdaBoost

An AdaBoost classifier begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
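The re-weighting loop can be sketched in a few lines. The following is a minimal version of the classic discrete (binary) AdaBoost with decision stumps, assuming labels in {-1, +1}; it is only an illustration, not scikit-learn's implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_estimators=50):
    # y is assumed to take values in {-1, +1}
    weights = np.ones(len(X)) / len(X)
    stumps, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum()
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        # Increase the weights of misclassified samples so the next stump focuses on them
        weights *= np.exp(-alpha * y * pred)
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted vote of all the stumps
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # diagonal decision boundary
stumps, alphas = adaboost_fit(X, y)
print((adaboost_predict(stumps, alphas, X) == y).mean())  # training accuracy, well above a single stump's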

Bagging boost

A Bagging classifier fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions to form a final prediction.

Implementation

All of these ensemble methods are located under the sklearn.ensemble module. Let's first see how to improve an SVM using the bagging classifier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_gaussian_quantiles
from sklearn.svm import SVC

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
                                 n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
                                 n_samples=300, n_features=2,
                                 n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

# Create and fit a bagged SVM
bdt = BaggingClassifier(SVC())

bdt.fit(X, y)

plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
                c=c, cmap=plt.cm.Paired,
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')

Now, let's swap the BaggingClassifier for AdaBoost with a decision tree base estimator:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
...
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=200)
...

Conclusion

Boosting and bagging techniques may dramatically improve a base algorithm and serve as an indispensable tool in any data scientist's toolkit.

Thursday, November 19, 2015

K-means clustering with Python


Introduction

K-means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain K number of clusters. The main idea is to define K centroids, one for each cluster. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. At this point we need to re-calculate K new centroids of the clusters resulting from the previous step. After we have these K new centroids, a new binding has to be done between the same data set points and the nearest new centroid. As a result of this loop we may notice that the K centroids change their location step by step until no more changes are done.
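The loop described above can be sketched in a few lines of NumPy. The following is a simplified illustration with hypothetical blob data, not scikit-learn's optimized implementation, and it assumes every cluster keeps at least one point:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # 1. Pick K initial centroids at random from the data
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # 2. Associate every point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Re-calculate the centroids as the mean of the points assigned to them
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)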

Implementation

Scikit-learn provides a full implementation of the K-means algorithm through the KMeans class. Let's have a look at several interesting situations which might occur during data clustering:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropicly distributed data
transformation = [[ 0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_filtered)

plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()

At first we wrongly assume that there are only 2 clusters in the data (it actually contains 3 blobs) and make K-means find them. Later we transform the data anisotropically. Since K-means uses Euclidean distance to associate points with clusters, it does not work well with non-globular clusters. The last two datasets introduce blobs of different variance and size, but this makes no difference and the clustering succeeds.

Conclusion

K-means provides us with an easy-to-use clustering algorithm. It's fast, easy to follow and widely used in various fields, including vector quantization.

Sunday, November 8, 2015

Naïve Bayes with Python


Introduction

The Naive Bayes algorithm is based on conditional probabilities. It uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. If B represents the dependent event and A represents the prior event, Bayes' theorem can be stated as P(B|A) = P(A and B) / P(A). In other words, to calculate the probability of B given A, the algorithm counts the number of cases where A and B occur together and divides it by the number of cases where A occurs alone.
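As a toy illustration (the observations below are hypothetical), counting the co-occurrences directly gives the conditional probability:

# Each row records whether the prior event A and the dependent event B were observed
observations = [(True, True), (True, False), (True, True),
                (False, False), (True, True), (False, True)]

count_a = sum(1 for a, b in observations if a)
count_a_and_b = sum(1 for a, b in observations if a and b)

# P(B|A) = cases where A and B occur together / cases where A occurs
p_b_given_a = count_a_and_b / float(count_a)
print(p_b_given_a)  # 3 / 4 = 0.75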

Implementation

Scikit-learn provides implementations of the Naïve Bayes algorithm in 3 flavors: MultinomialNB, implementing the naive Bayes algorithm for multinomially distributed data; GaussianNB, implementing the Gaussian Naive Bayes algorithm for classification; and BernoulliNB, implementing the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions.

Let's take a look at the Naïve Bayes algorithm at work classifying the Iris data. Since the iris measurements are continuous and roughly normally distributed, GaussianNB is the appropriate class to use here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
 
# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]
 
# Load data
iris = load_iris()
 
for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target
 
    # Shuffle
    idx = np.arange(X.shape[0])
    np.random.seed(13)
    np.random.shuffle(idx)
    X = X[idx]
    y = y[idx]
 
    # Train
    clf = GaussianNB().fit(X, y)
 
    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)
 
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
 
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
 
    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis()
 
    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color,
                    label=iris.target_names[i],
                    cmap=plt.cm.Paired)
    plt.axis()
 
plt.legend(loc="upper left")
plt.show()

Pretty cool, isn't it?

Conclusion

The Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows. You'll need, however, a big data set in order to make reliable estimations of the probability of each class. You can use the Naïve Bayes classification algorithm with a small data set, but precision and recall will stay very low. For a brief reminder about what those are, have a look at the performance metrics section in the logistic regression article.

Tuesday, October 27, 2015

Python for Data Scientists - Rodeo

Introduction

I love Python, I really do, and that goes for IPython as well; it's a great tool and simplifies the work by a lot. But... there is always a but, isn't there? RStudio is so much better, and until recently we, the Python data enthusiasts, could only look enviously at RStudio while working with its somewhat beloved, somewhat limping brother IPython. Well, no more. Let me introduce you to Rodeo.

The IDE is free and super easy to use; it's very similar to RStudio, and after you watch the introduction video, you'll be ready to go.

Wednesday, October 7, 2015

Random Forest with Python


Introduction

In the article about decision trees we talked about their drawback of being sensitive to small variations or noise in the data. Today we'll see how to deal with it by introducing the random forest. It belongs to a larger class of machine learning algorithms called ensemble methods, which use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. So which models does a random forest aggregate? You might already know the answer: decision trees. It fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Implementation

Scikit-learn provides us with two classes, RandomForestClassifier and RandomForestRegressor, for classification and regression problems respectively. Let's use the code from the previous example and see how the result differs when using a random forest with 100 trees.

...
from sklearn.ensemble import RandomForestClassifier
...
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
...

As you can see, the edges are much smoother, that is, less overfitted, than when using a single decision tree.

Conclusion

Despite its relative simplicity, random forest performs the job remarkably well; according to some empirical comparisons, even better than SVMs.

Monday, September 28, 2015

Decision Tree with Python


Introduction

Last time we talked about a non-linear classifier, the Support Vector Machine or SVM. Today we'll be discussing another non-linear classifier and regressor, the decision tree. A decision tree works by building a model which predicts the value of a target variable by learning simple decision rules inferred from the data features.

Since trees can be visualized and are something we're all used to, decision trees make it easy to explain, visualize and handle non-linearity in an intuitive manner. Surely there are some disadvantages as well, but we'll note them a bit later; first, let's see them in action.

Implementation

It won't come as a complete surprise to you that scikit-learn has already taken the initiative and implemented the whole thing in the DecisionTreeRegressor and DecisionTreeClassifier classes. What is left for us is to bear the fruits of someone else's hard labour.

import StringIO
import numpy as np
import matplotlib.pyplot as plt
import pydot
from IPython.display import Image
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Shuffle
    idx = np.arange(X.shape[0])
    np.random.seed(13)
    np.random.shuffle(idx)
    X = X[idx]
    y = y[idx]

    # Standardize
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    X = (X - mean) / std

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis()

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color,
                    label=iris.target_names[i],
                    cmap=plt.cm.Paired)
    plt.axis()

plt.legend(loc="upper left")
plt.show()

First we standardize the data, then draw the decision boundaries and finally plot the data points themselves.

Take a look at the first subplot and let's compare it with one, where we used logistic regression. Back then we couldn't classify the data using these features, now using the decision tree we surely can.

Visualization

Since a decision tree uses a tree data structure, wouldn't it be cool to visualize it?
Notice: you'll need to install the GraphViz package to run this example.

...
dot_data = StringIO.StringIO()  
tree.export_graphviz(clf, out_file=dot_data,  
                     feature_names=iris.feature_names,  
                     class_names=iris.target_names,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

Now that we have our tree in place, let's see how the decisions were made, with a Gini coefficient attached to each node. What the Gini coefficient measures is the inequality among values of a frequency distribution, in our case the iris species. A Gini coefficient of zero expresses perfect equality, where all values are the same: all irises in the node are of the same species.

So the first check is whether the (standardized) sepal length is less than -0.7442, and from the Gini coefficient of 0.6667 we can deduce that the node holds the three species in equal proportions and that the split sends one third of the data, the setosa species, into a branch of its own. The process continues until we reach a Gini coefficient of 0, that is, the remaining data belongs to a single category.
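The 0.6667 value follows directly from the impurity formula Gini = 1 - sum of squared class proportions; a quick check:

def gini(proportions):
    # Gini impurity: 1 minus the sum of squared class proportions
    return 1 - sum(p ** 2 for p in proportions)

print(gini([1 / 3.0, 1 / 3.0, 1 / 3.0]))  # 0.6667 - three equally frequent species
print(gini([1.0, 0.0, 0.0]))              # 0.0    - a pure node, a single species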

Conclusion

One of the notable advantages of decision trees is their prediction performance, which is logarithmic in the number of data points used to train the tree. But there are some downsides as well, which need to be considered. The first one can be seen from our first subplot example: overfitting. Trees tend to perform incredibly well at the top, but at the same time tend to overfit at the bottom. This is a major downside, and therefore trees should be pruned. Also, decision trees can be unstable, because small variations or noise in the data might result in a completely different tree being generated. However, this problem is mitigated by using decision trees within an ensemble, of which we'll talk next time.

Tuesday, September 8, 2015

Support Vector Machines with Python


Introduction

Having learnt different optimization and classification methods, you must feel quite confident to start exploring the datasets of interest. However, you might very quickly run into a dataset which, no matter how the hyper-parameters are set, is just not linearly separable. Surely there is no place for despair, especially since we have just the classifier to deal with these situations, called the Support Vector Machine.

How it works

Support Vector Machines, or SVMs, are a set of supervised learning methods used for classification and, with a slight change, for regression. The core idea is to linearly separate the observations in feature space with a hyperplane. The prefix hyper is not accidental, as an SVM may increase the dimension of the feature space to achieve its goal. The power of the method comes from using kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates, thus allowing SVMs to operate on datasets with a much bigger feature space, like image recognition. Consider the following example:

There's no linear decision boundary for this dataset which would separate the observations of the two classes. If you apply a linear classifier, you'll just receive an "arbitrary" line through the space, crossing both of the classes; you just cannot do it correctly with logistic regression.

The way SVM approaches the problem is by augmenting the dataset with additional dimensions and trying to maximize the margin between the classes: it places the separating hyperplane so that the distance to the closest training points, the support vectors, is the widest. Have a look at the following illustration depicting the process:

Applying a kernel function, it transforms the observations into a space where they are linearly separable, resulting in the following separation:
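Before moving on to the implementation, it's worth noting that the kernel trick is easy to verify numerically. For the degree-2 polynomial kernel k(x, z) = (x . z)^2, computing the kernel in the original space gives the same number as the inner product of the explicitly expanded feature vectors; a small sketch with hypothetical 2-D points:

import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel in two dimensions
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # inner product in the expanded space
implicit = np.dot(x, z) ** 2       # kernel computed in the original space
print(np.allclose(explicit, implicit))  # True - both equal 16.0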

Implementation

It won't come as a big surprise to you that scikit-learn has fully implemented and optimized SVM support, so let's have a look at how it deals with various situations. We'll create three demo datasets: one in the shape of moons, one of circles, and a last one containing linearly separable observations. Then we'll apply SVM with two different kernel functions: linear and radial basis function (RBF). In the end we'll plot the observations and color the background according to the distance from the separating hyperplane using a diverging palette, that is, the farther a point is from the boundary, the darker the color (red or blue depending on the sign). Take a closer look into the example code and make sure you understand the process.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.svm import SVC

h = .02  # step size in the mesh
names = ["Linear SVM", "RBF SVM"]
classifiers = [
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1)]

X, y = make_classification(n_features=2, n_redundant=0,
        random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(9, 9))
i = 1
# iterate over datasets
for ds in datasets:
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
      test_size=0.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
      cmap=cm_bright)
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
      cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. We'll assign a color to
        # each point in the mesh [x_min, x_max]x[y_min, y_max].
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
          cmap=cm_bright)
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
          cmap=cm_bright, alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, 
          ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')
        i += 1

plt.show()

The results, as anticipated, are much more promising than using a linear classifier. For the moon- and circle-shaped datasets there is a tremendous increase in classification score, whereas for the linearly separable observations the result doesn't change. This is of course understandable, as a linearly separable dataset can easily be dealt with using a linear classifier.

Conclusion

SVM is a powerful tool; however, from a practical point of view, perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming, whose cost grows at least quadratically with the number of samples. This makes it hard to scale to datasets with more than a couple of tens of thousands of samples.

Tuesday, August 25, 2015

Dimension reduction with Python


Introduction

Sometimes, no matter how good an algorithm is, it just doesn’t work. Or worse, it doesn’t pick up anything. Data can be quite noisy, and sometimes it’s just about impossible to figure out what went wrong.

It's worth noting that the most interesting machine learning challenges always involve some sort of feature engineering, where we try to use our insight into the problem to carefully craft additional features that the learner will hopefully pick up.

Garbage in, garbage out: that's what we know from real life. Not surprisingly, this pattern also holds true when applying machine learning methods to training data. To tackle the issue, we can also go in the opposite direction: dimensionality reduction, cutting away features that are irrelevant or redundant. There are several good reasons to trim down the dimensions as much as possible:

  • Most of the models hate high-dimensional spaces and superfluous features, which often irritate or mislead the learner.
  • More features means more parameters to tune and a higher risk of overfitting.
  • Linearly correlated data is a common phenomenon in datasets, which as a result might have an artificially high number of dimensions, whereas the real dimensionality might be small.
  • Fewer dimensions mean faster training and more variations to try out, resulting in better end results.

Dimension reduction techniques

Before diving into more sophisticated algorithms, let's try to get an educated guess how to get rid of the garbage within our data, while keeping the valuable part of it.

Missing Values Ratio

Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. Naturally, the higher the threshold, the more aggressive the reduction.
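With pandas this filter is essentially a one-liner; a small sketch (the frame and the 0.4 threshold are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical frame with a mostly empty column
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [np.nan, np.nan, np.nan, 4],
                   "c": [1, np.nan, 3, 4]})

threshold = 0.4                     # drop columns missing more than 40% of their values
missing_ratio = df.isnull().mean()  # fraction of missing values per column
reduced = df.loc[:, missing_ratio <= threshold]
print(reduced.columns.tolist())     # ['a', 'c']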

Low Variance

Other candidates are columns with little change, that is, low variance. Don't forget to normalize the data before comparing against the threshold, as variance is range-dependent.
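scikit-learn ships a ready-made transformer for this, VarianceThreshold; a minimal sketch (the data and the 0.01 cut-off are hypothetical, and the data is assumed to be already normalized):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The middle column barely changes
X = np.array([[0.0, 0.50, 1.0],
              [0.3, 0.51, 0.2],
              [0.9, 0.50, 0.7],
              [0.6, 0.49, 0.1]])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(selector.get_support())  # [ True False  True] - the near-constant column is dropped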

High Correlation

Data columns with very similar trends are also likely to carry very similar information, in which case only one of them will suffice to feed the machine learning model. Here we calculate the correlation between columns, e.g. Pearson's product-moment coefficient for numerical columns and Pearson's chi-square statistic for nominal columns. Pairs of columns with a correlation higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.
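For numerical columns, a pandas-based sketch of this filter might look as follows (the frame and the 0.9 threshold are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.randn(100)
df = pd.DataFrame({"a": a,
                   "b": a * 2 + rng.randn(100) * 0.01,  # almost a copy of "a"
                   "c": rng.randn(100)})

threshold = 0.9
corr = df.corr().abs()
# Keep only the upper triangle so every pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)                 # ['b']
reduced = df.drop(to_drop, axis=1)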

Backward Feature Elimination

This great technique allows us to remove the least important features. At a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features, n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. Each iteration k produces a model trained on n-k features and an error rate e(k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.
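A naive version of the procedure can be written directly on top of cross_val_score; the following is only a sketch (Iris data, logistic regression as the base model), and a real dataset would need a more careful error estimate and stopping rule:

import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
features = list(range(X.shape[1]))
clf = LogisticRegression()

while len(features) > 1:
    candidates = []
    for f in features:
        remaining = [c for c in features if c != f]
        # Cross-validation score of the model trained without feature f
        score = cross_val_score(clf, X[:, remaining], y).mean()
        candidates.append((score, f))
    best_score, dropped = max(candidates)  # the removal that hurts the least
    print("dropping feature %d, score without it %.3f" % (dropped, best_score))
    features.remove(dropped)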

Forward Feature Construction

This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance.

Principal Component Analysis

Principal Component Analysis, also known as the Karhunen-Loeve Transform, is a technique used to search for patterns in high-dimensional data. PCA reduces a set of possibly-correlated, high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. The technique orthogonally transforms the original n coordinates of a data set into a new set of n coordinates, the principal components, and then reduces the dimensionality by projecting the data onto a lower-dimensional subspace spanned by the leading components. The lower-dimensional data will preserve as much of the variance of the original data as possible. For example, a two-dimensional data set could be reduced by projecting the points onto a line; each instance in the data set would then be represented by a single value rather than a pair of values. A three-dimensional dataset could be reduced to two dimensions by projecting the variables onto a plane. In general, an n-dimensional dataset can be reduced by projecting it onto a k-dimensional subspace, where k is less than n. Let's have a look at a two-dimensional example of four points projected onto a one-dimensional space, a single line.

Implementation

Some of the presented techniques, while very effective, unfortunately cannot always be used in practice. For instance, both Backward Feature Elimination and Forward Feature Construction are quite time- and computationally expensive; they are practically applicable only to a data set with an already relatively low number of input columns. PCA is a much less computationally expensive algorithm and is commonly used to explore and visualize high-dimensional data sets. It can also be used to compress data, and to process data before it is used by another estimator.

Many implementations of PCA, including the one in scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD factors the data matrix X as:

X = U Σ V*

The columns of U are called the left singular vectors of the data matrix, the columns of V (whose conjugate transpose V* appears in the factorization) are its right singular vectors, and the diagonal entries of Σ are its singular values. To make the algebra work, the matrices have to follow the dimension rules, as depicted in the following illustration.

While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, we are only interested in them as they relate to the eigenvectors and eigenvalues of the data matrix. Specifically, the left singular vectors are the eigenvectors of the covariance matrix and the diagonal elements of Σ are the square roots of the eigenvalues of the covariance matrix. Eigenvectors found using SVD should be similar to those derived from a covariance matrix.
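The relationship can be checked numerically on a small, mean-centered random matrix, up to the 1/(n - 1) normalization used by the sample covariance; a quick sketch with hypothetical data:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X -= X.mean(axis=0)  # PCA assumes mean-centered data

# Eigenvalues of the covariance matrix, sorted in descending order
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Singular values of the data matrix itself
singular_values = np.linalg.svd(X, compute_uv=False)

# Squared singular values, divided by (n - 1), match the covariance eigenvalues
print(np.allclose(singular_values ** 2 / (len(X) - 1), eigenvalues))  # True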

Let's see how it's done in practice. As you remember, the Iris dataset consists of four dimensions: sepal length, sepal width, petal length and, lastly, petal width. Let's apply PCA to the dataset and plot the results in three-dimensional space.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
    
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.prism)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()

Conclusion

Indeed, in the era of big data, when more is axiomatically better, we have re-discovered that too many noisy or even faulty input data columns often lead to less than desirable algorithm performance. Removing un-informative, or even worse dis-informative, input attributes might help build a model on more extensive data regions, with more general classification rules, and overall with better performance on new, unseen data.

Tuesday, August 11, 2015

Hyperparameter optimization with Python


Introduction

In the previous articles we introduced several linear techniques, and as you have probably noticed, we provided the algorithms with several parameters. This dependence of a machine learning algorithm upon its learning parameters is the common case, and one has to check the performance of various parameter values to achieve the best results. The task, of course, is no trifle and is called hyperparameter optimization or model selection. It is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set.

Implementation

Grid Search

The traditional way of performing hyperparameter optimization is a grid search, or parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. Scikit-learn provides us with a class, GridSearchCV, implementing the technique. Let's try to find the best regularization type (penalty), regularization strength (alpha) and number of iterations (n_iter) for the stochastic gradient descent classifier from the previous article.

import numpy as np
from time import time
from operator import itemgetter
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = SGDClassifier()

# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}\n".format(score.parameters))

# use a full grid over all parameters
param_grid = {"n_iter": [1, 5, 10],
              "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
              "penalty": ["none", "l1", "l2"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)

# GridSearchCV took 4.85 seconds for 63 candidate parameter settings.
# Model with rank: 1
# Mean validation score: 0.929 (std: 0.020)
# Parameters: {'penalty': 'l1', 'alpha': 0.0001, 'n_iter': 5}
 
# Model with rank: 2
# Mean validation score: 0.924 (std: 0.016)
# Parameters: {'penalty': 'l2', 'alpha': 1, 'n_iter': 10}

# Model with rank: 3
# Mean validation score: 0.922 (std: 0.018)
# Parameters: {'penalty': 'none', 'alpha': 0.0001, 'n_iter': 5}

The output of the procedure is a list, grid_scores_, consisting of the cross-validation scores of all the provided parameter combinations. To get the best parameters, we order the list in descending order by mean score, which for this dataset yields L1 (Lasso) regularization with an alpha of 0.0001 and 5 iterations. For more details about regularization types, have a look at the broader explanation in the linear regression article.

Randomized Grid Search

Since grid searching is an exhaustive and therefore potentially expensive method, there is a random variation of the technique and is implemented by RandomizedSearchCV class in scikit-learn. It doesn't try out all parameter values, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter parameter.

from sklearn.grid_search import RandomizedSearchCV
from scipy.stats import randint, uniform

...

# specify parameters and distributions to sample from
param_dist = {"n_iter": randint(1, 11),
              "alpha": uniform(scale=0.01),
              "penalty": ["none", "l1", "l2"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

...

# RandomizedSearchCV took 1.18 seconds for 20 candidates parameter settings.
# Model with rank: 1
# Mean validation score: 0.920 (std: 0.021)
# Parameters: {'penalty': 'l1', 'alpha': 0.0006507721384509924, 'n_iter': 7}
 
# Model with rank: 2
# Mean validation score: 0.919 (std: 0.016)
# Parameters: {'penalty': 'none', 'alpha': 0.0008117236028203323, 'n_iter': 9}
 
# Model with rank: 3
# Mean validation score: 0.918 (std: 0.018)
# Parameters: {'penalty': 'l1', 'alpha': 0.0012165253638165819, 'n_iter': 5}

Here, instead of providing the procedure with lists of values, we gave desired distributions from which the values are drawn. As we set n_iter to 20, the procedure evaluated 20 random combinations of the parameters and, following the previous logic, the best turned out to use L1 (Lasso) regularization with an alpha of about 0.00065 and 7 iterations.

Conclusion

The randomized search and the grid search explore exactly the same space of parameters. The resulting parameter settings are quite similar, while the run time for the randomized search is drastically lower. Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.

Wednesday, July 22, 2015

Gradient Descent with Python


Introduction

In the previous two articles, we talked about linear and logistic regression, covering most of the linear models. And that would be it, if only these models could efficiently process data with numerous features, that is x1, x2, ..., xn-1, xn. Let's see how our classification model deals with the Olivetti faces dataset:

from sklearn import datasets, metrics, linear_model, cross_validation
import matplotlib.pyplot as plt
import time

# Load the Olivetti faces dataset
faces = datasets.fetch_olivetti_faces()

start = time.time()

X = faces.data
Y = faces.target

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)

# Create a classifier: a logistic regression model
model = linear_model.LogisticRegression(C=10000)

# We learn the faces of train set
model.fit(x_train, y_train)

# Now predict the person on test set
expected = y_test
predicted = model.predict(x_test)

end = time.time()
print("%.2f seconds" % (end - start)) # 8.85 seconds

# Reshape data into images
x_train_images = x_train.reshape(len(x_train), 64, 64)
x_test_images = x_test.reshape(len(x_test), 64, 64)

# Plot train images
for index, (image, label) in enumerate(
    zip(x_train_images, y_train)[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('Train: %i' % label)

# Plot test images    
for index, (image, prediction, expectation) in enumerate(
    zip(x_test_images, predicted, expected)[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('Predict: %i/%i' % (prediction, expectation))

The dataset contains 64x64-pixel images of 40 distinct subjects, so each item of the dataset contains 4096 features. Despite the good performance of our classifier (we won't dive into measures now), it took almost 9 seconds to process this toy dataset, which is not good enough if we want to learn really big datasets.

The reason for this delay is the way the classifier solves the optimization problem. Our models solved analytically for the values of the parameters that minimize the cost function, using the normal equation:

w = (X^T X)^-1 X^T y

X is the matrix of the values of the explanatory variables for each training example. The dot product inside the brackets results in a square matrix with dimensions n x n, where n is the number of features. The computational complexity of inverting this square matrix is nearly cubic in the number of explanatory variables. Thus, the more features we have, the more complex the computation becomes, and consequently the more time it takes.
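To see the point concretely, here is the closed-form solution computed with plain NumPy on hypothetical toy data; the inversion of the n x n matrix X^T X is the step whose cost blows up as the number of features grows:

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 200, 5
X = rng.randn(n_samples, n_features)
true_w = rng.randn(n_features)
y = X.dot(true_w) + rng.randn(n_samples) * 0.1

# Normal equation: w = (X^T X)^-1 X^T y
# Inverting the n_features x n_features matrix costs roughly O(n_features^3)
w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(np.allclose(w, true_w, atol=0.1))  # True - the weights are recovered up to the noise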

Gradient Descent

Given a function defined by a set of parameters, gradient descent iteratively moves toward a set of parameter values which minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient. Think of a blindfolded man who is trying to find his way from somewhere on a mountainside to the lowest point of the valley. He takes steps in the direction of the steepest decline, and the sizes of his steps are proportional to the steepness of the terrain. If he takes large steps, he may accidentally step over the valley's lowest point; if his steps are too small, he will reach the desired point, but it will take him a considerable amount of time. The proportion is specified by the learning rate provided to the model. When used to minimize a cost function Q(w) that is a sum over the training examples, Q(w) = Σ_i Q_i(w), a standard (or "batch") gradient descent method would perform the following iteration until it converges: w := w - η Σ_i ∇Q_i(w).

Stochastic Gradient Descent

Standard gradient descent evaluates the sum-gradient, which may require expensive evaluations of the gradients of all the summand functions. When the training set is enormous and no simple formulas exist, evaluating the sum of gradients becomes very expensive, because it requires evaluating every summand function's gradient. In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by the gradient at a single example: w := w - η ∇Q_i(w).

We compute an estimate or approximation of this direction. The simplest way is to just look at one training example (or a subset of training examples) and compute the direction to move from this approximation alone. It is called stochastic because the approximate direction computed at every step can be thought of as a random variable of a stochastic process. The stochastic method converges much faster than the standard one, but the error function is not as well minimized as in the case of the latter. In most cases, though, the close approximation of the parameter values that you get with the stochastic method is enough, because they reach the neighborhood of the optimal values and keep oscillating there.

Minibatch Gradient Descent

A compromise between the two forms called "mini-batches" computes the gradient against more than one training examples at each step. This can perform significantly better than true stochastic gradient descent, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.
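A compact sketch of the three variants in a single function follows (hypothetical least-squares objective and toy data); a batch_size equal to the dataset size gives standard batch gradient descent, 1 gives pure stochastic descent, and anything in between is a mini-batch:

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, batch_size=10, n_epochs=100, seed=0):
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        indices = rng.permutation(len(X))  # visit the examples in a random order
        for start in range(0, len(X), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared-error cost, computed on this batch only
            error = X[batch].dot(w) - y[batch]
            gradient = X[batch].T.dot(error) / len(batch)
            w -= learning_rate * gradient
    return w

rng = np.random.RandomState(1)
X = rng.randn(500, 3)
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.randn(500) * 0.1
print(gradient_descent(X, y))  # close to [1.0, -2.0, 0.5]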

Implementation

Scikit-learn provides us with both linear and logistic regression implementations using gradient descent, through the SGDRegressor and SGDClassifier classes respectively. By default the models use the stochastic method with 5 passes (n_iter=5) over the training data. Since all scikit-learn models follow the same API, the only change needed to the previous example is the model instantiation.

...
model = linear_model.SGDClassifier(alpha=0.00001)
...

After we change the classifier, the code finishes in less than a second!

Conclusion

The important point to remember is that gradient descent looks for a local minimum, which may sometimes differ from the global one. The outcome of the process is significantly dependent upon where it starts, as illustrated below. To gain more confidence in the results, one should run it several times.

Hope you've enjoyed the article and feel free to comment below. In the next article we'll look into techniques to choose the parameters provided to the models, to achieve better results.

Wednesday, July 8, 2015

Logistic regression with Python


Introduction

After having mastered linear regression in the previous article, let's take a look at logistic regression. Despite its name, it is not that different from linear regression, but is rather a linear model for classification, obtained by passing the linear combination of features through a sigmoid function. Specifically, we'll be using the logistic function, σ(t) = 1 / (1 + e^(-t)), hence the name of the regression. Don't worry, in future articles we'll cover the usage of other sigmoid functions as well.

Implementation

The implementation of logistic regression in scikit-learn can be accessed from class LogisticRegression. This implementation can fit a multiclass logistic regression with optional L1 or L2 regularization. In the next example we'll classify iris flowers according to their sepal length and width:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

model = linear_model.LogisticRegression(C=10000) # C = 1/alpha
model.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()

Firstly, pay attention to the inverse regularization parameter, C, provided to the classifier. As with linear regression, scikit-learn provides a class, LogisticRegressionCV, to evaluate different values of the regularization parameter. Secondly, though not related directly to the topic, have a look at the meshgrid feature of the numpy library and how it is used to plot the decision boundary.

Performance metrics

As opposed to linear regression, classification results cannot be evaluated with R-squared measure. A variety of metrics exist to evaluate the performance of classifiers against labels. The most common metrics are accuracy, precision, recall, F1 score and ROC AUC score. All of these measures depend on the concepts of true positives, true negatives, false positives and false negatives. All these apply to a binary classifier, but with a little trick of setting a positive label, the measures can be used on multiclass models as well.

True positives (TP)

The number of instances correctly identified by the classifier, that is, a point labeled A assigned to class A.

True negatives (TN)

The number of instances correctly rejected by the classifier, that is, a point labeled A' assigned to class A'.

False positives (FP)

The number of instances incorrectly identified by the classifier, that is, a point labeled A' assigned to class A.

False negatives (FN)

The number of instances incorrectly rejected by the classifier, that is, a point labeled A assigned to class A'.

Accuracy

Accuracy measures the fraction of the classifier's predictions that are correct, that is, the number of correct assessments divided by the number of all assessments - (TN + TP)/(TN + TP + FN + FP).

Precision (P)

Precision is the fraction of positive predictions that are correct - TP/(TP + FP). Be careful, as a classifier predicting only a single positive instance that happens to be correct will achieve perfect precision.

Recall (R)

Recall, sometimes called sensitivity in medical domains, measures the fraction of truly positive instances that the classifier recognizes - TP/(TP + FN). A score of 1 indicates that no false negatives were present. Be careful, as a classifier predicting positive for every example will achieve a recall of 1.

F1 score

Both precision and recall provide an incomplete view of the classifier's performance and taken separately may give a skewed picture. The F1 measure provides a better view by calculating the harmonic mean of the two scores - 2*P*R/(P + R). A model with perfect precision and recall scores will achieve an F1 score of one.
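As a quick numerical sanity check (the counts below are hypothetical), all of these metrics follow directly from the four quantities defined above:

tp, tn, fp, fn = 40, 45, 10, 5

accuracy = float(tn + tp) / (tn + tp + fn + fp)     # 0.85
precision = float(tp) / (tp + fp)                   # 0.80
recall = float(tp) / (tp + fn)                      # 0.889
f1 = 2 * precision * recall / (precision + recall)  # 0.842
print("accuracy %.2f precision %.2f recall %.3f f1 %.3f" % (accuracy, precision, recall, f1))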

ROC AUC

ROC curves plot the classifier's recall against its fall-out (false positive rate), which is the number of false positives divided by the total number of negatives - FP/(TN + FP). AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier. A classifier that predicts classes randomly has an AUC of 0.5, and any number above this outperforms random guessing. Unfortunately, when it comes to ROC, scikit-learn supports only binary classifiers; however, iterating over each label and setting it as the positive one gives us what we need:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets, cross_validation, metrics

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
model = linear_model.LogisticRegression(C=10000)

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Accuracy: %2f" % metrics.accuracy_score(y_test, y_pred))
print("Precision: %2f" % metrics.precision_score(y_test, y_pred, average="macro"))
print("F1: %2f" % metrics.f1_score(y_test, y_pred, average="macro"))
 
for label in np.arange(3):
    false_positive_rate, recall, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=label)
    roc_auc = metrics.auc(false_positive_rate, recall)
    plt.plot(false_positive_rate, recall, label='AUC(%d) = %0.2f' % (label, roc_auc))

plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
# Accuracy: 0.815789
# Precision: 0.817460
# F1: 0.813789

Conclusion

This concludes the classification topic and with linear regression, covers linear models. Next time we'll talk about techniques used to handle numerous explanatory variables.

Wednesday, June 24, 2015

Linear Regression with Python


Introduction

Our first insight into machine learning will be through the simplest model - linear regression. The goal in regression problems is to predict the value of a continuous response variable. First we'll examine linear regression, which models the relationship between a response variable and one explanatory variable. Next, we will discuss polynomial regression and regularization methods.

Simple model

Linear regression tries to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Mathematically, it solves a problem of the form: minimize over w the squared L2 norm of the residuals, ‖Xw - y‖².

We'll demonstrate the process using the toy diabetes dataset, included in scikit-learn. For more details about the loading process, take a look at the previous article about loading datasets in Python.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.cross_validation import train_test_split
 
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
 
# Use only one feature
x = diabetes.data[:, np.newaxis]
x_temp = x[:, :, 2]
y = diabetes.target
 
# Split the data into training/testing sets
x_train, x_test, y_train, y_test = train_test_split(x_temp, y)
 
# Create linear regression object
model = linear_model.LinearRegression()
 
# Train the model using the training sets
model.fit(x_train, y_train)
 
# The coefficients
model.coef_                       # array([ 938.23786125])
 
# The mean square error
np.mean((model.predict(x_test) - y_test) ** 2) # 4017.22
 
# Explained variance score: 1 is perfect prediction
model.score(x_test, y_test)       # 0.31
 
# Plot outputs
plt.scatter(x_test, y_test,  color='black')
plt.plot(x_test, model.predict(x_test), color='blue', linewidth=3)
 
plt.xticks(())
plt.yticks(())
 
plt.show()

Cross validation

We used the train_test_split function to randomly partition the data into training and test sets. The proportions of the data for both partitions can be specified using keyword arguments; by default, 25 percent of the data is assigned to the test set. Finally, we trained the model and evaluated it on the test set. The resulting r-squared score of 0.31 indicates that 31 percent of the variance in the test set is explained by the model. However, the performance might have changed if different data had been chosen as the training set.

To obtain a better understanding of the estimator's performance, we can use cross-validation, whose goal is to define a dataset to "test" the model in the training phase, in order to limit problems like overfitting, and to give an insight on how the model will generalize to an independent dataset.

from sklearn.cross_validation import cross_val_score

...

scores = cross_val_score(model, x_temp, diabetes.target)
scores        # array([0.2861453, 0.39028236, 0.33343477])
scores.mean() # 0.3366

cross_val_score by default uses three-fold cross validation, that is, each instance will be randomly assigned to one of the three partitions. Each partition will be used to train and test the model. It returns the value of the estimator's score method for each round. The r-squared scores range from 0.29 to 0.39! The mean of the scores, 0.34, is a better estimate of the estimator's predictive power than the r-squared score produced from a single train/test split.

Polynomial regression

In the previous examples, we assumed that the real relationship between the explanatory variables and the response variable is linear. This assumption is rarely true. Let's discuss polynomial regression, which adds terms with degrees greater than one to the model.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# function to approximate by polynomial interpolation
def f(x): return x * np.sin(x)

# generate points used to plot
x_plot = np.linspace(0, 10, 100)

# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)

# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

plt.figure(figsize=(20,10))
plt.plot(x_plot, f(x_plot), label="ground truth")
plt.scatter(x, y, label="training points")

for degree in [1, 3, 5, 8]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, label="degree %d" % degree)

plt.xlim([-0.25,10.25])
plt.ylim([-15.5,9])
plt.legend(loc='lower left')

plt.show()

Have a look at the chart above and how the different polynomial curves try to estimate the "ground truth" line, colored in blue. The linear green line is nowhere close to what we seek, but as the degree grows, the curves come closer and closer to the one covering all the points, colored in purple.

Regularization

Introduction

In the previous section, one must have felt suspicious of the purple curve covering all the points. And one should be; as the degree of polynomial approximation grows, so does the risk of falling into overfitting. The most common technique to tackle this is regularization.

Regularization is the process of adding a penalty associated with the coefficient values to the error of the hypothesis. This way, an accurate hypothesis with unlikely, large coefficients is penalized, while a somewhat less accurate but more conservative hypothesis with small coefficients is not penalized as much.

Two popular regularization methods are L1 and L2. Instead of minimizing a loss function E(X, Y) directly, they minimize E(X, Y) + α‖w‖, where w is the model's weight vector, ‖·‖ is either the L1 norm or the squared L2 norm, and α is a free parameter that has to be tuned empirically. L1 regularization is often preferred because it produces sparse models and thus performs feature selection within the learning algorithm, but since the L1 norm is not differentiable, it may require changes to the learning algorithm. The L2 penalty, which discourages large weights, is preferable for data that is not sparse, i.e. where you do not expect the regression coefficients to show a decaying property; incorrectly using an L1 penalty on such data will give a large estimation error. When applied to linear regression, the resulting models are called Lasso and Ridge regression respectively.
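
As a rough sketch of the idea (not how scikit-learn computes it internally), taking E(X, Y) to be the squared error, the two penalized errors could be written with numpy as follows, where X, y and w are assumed to be the design matrix, the targets and the weight vector:

import numpy as np

def l1_penalized_error(X, y, w, alpha):
    # squared error plus the L1 norm of the weights (Lasso-style penalty)
    return np.sum((X.dot(w) - y) ** 2) + alpha * np.sum(np.abs(w))

def l2_penalized_error(X, y, w, alpha):
    # squared error plus the squared L2 norm of the weights (Ridge-style penalty)
    return np.sum((X.dot(w) - y) ** 2) + alpha * np.sum(w ** 2)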

Implementation

Among other regularization methods, scikit-learn implements both Lasso (L1) and Ridge (L2) in the linear_model package. Let's see the plots after applying each method to the previous code example:

from sklearn.linear_model import Lasso, Ridge

...

for degree in [1, 3, 5, 8]:
    model = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=0.1))

...
from sklearn.linear_model import Lasso, Ridge

...

for degree in [1, 3, 5, 8]:
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=0.1))

...

I've arbitrarily chosen the regularization parameter, alpha, to be 0.1; however, this choice may well be a poor one, and usually a more thorough analysis is required. Luckily, scikit-learn provides methods to do exactly that: LassoCV and RidgeCV use the cross-validation technique described earlier to find the best-fitting alpha parameter for the Lasso and Ridge methods respectively.

from sklearn.linear_model import LassoCV

...

for degree in [1, 3, 5, 8]:
    poly = PolynomialFeatures(degree)
    X_ = poly.fit_transform(X)
    X_plot_ = poly.transform(X_plot)  # reuse the already fitted transformer
    
    model = LassoCV(cv=20)
    model.fit(X_, y)
    print "chosen alpha %d" % model.alpha_
    y_plot = model.predict(X_plot_)
    plt.plot(x_plot, y_plot, label="degree %d" % degree)

...

# chosen alpha 2
# chosen alpha 265
# chosen alpha 16
# chosen alpha 4783

Notice how all the curves were smoothed compared to the previous results with alpha 0.1, and how a different alpha was chosen on each iteration.
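
RidgeCV works analogously: one can swap LassoCV for RidgeCV inside the same loop. A minimal sketch, where the candidate alphas are chosen arbitrarily for illustration:

from sklearn.linear_model import RidgeCV

...

    # pick the best alpha from an arbitrary candidate grid via cross validation
    model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
    model.fit(X_, y)
    print "chosen alpha %f" % model.alpha_

...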

Conclusion

Now you know how to apply linear regression and the different techniques that are used along with it. In the next article we'll describe logistic regression, which is used for classification problems.

Friday, June 5, 2015

Loading datasets using Python


Introduction

Before we proceed with either kind of machine learning problem, we need to get the data on which we'll operate. We can of course generate data by hand, but this course of action won't get us far, as it is too tedious and lacks the diversity we may require. There are numerous sources of real data we can use, and if none of them satisfies one's needs, there are some popular artificial generators that create datasets according to preset parameters. scikit-learn provides plenty of methods to load and fetch popular datasets as well as to generate artificial data. All of these can be found in the sklearn.datasets package.

Toy Datasets

scikit-learn embeds some small toy datasets, which give data scientists a playground to experiment with a new algorithm and to check the correctness of their code before applying it to real-world-sized data. Let's load and render one of the most common ones, the iris dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # only take the first two features.
Y = iris.target

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()
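
Other toy datasets ship with scikit-learn and are loaded in the same way, for example:

from sklearn import datasets

digits = datasets.load_digits()      # handwritten digits as 8x8 images
boston = datasets.load_boston()      # Boston house prices (regression)
diabetes = datasets.load_diabetes()  # diabetes progression (regression)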

Real Size Datasets

Popular datasets

scikit-learn provides loaders that will automatically download, cache and parse the metadata files of several popular real-size datasets:

  • 20 newsgroups dataset
  • Labeled faces in the wild dataset
  • Olivetti faces dataset from AT&T
  • California housing dataset from StatLib
  • Forest cover type dataset

For example, fetching the training subset of the 20 newsgroups dataset restricted to two categories:

from sklearn.datasets import fetch_20newsgroups

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
list(newsgroups_train.target_names) # ['alt.atheism', 'sci.space']
newsgroups_train.filenames.shape    # (1073,)
newsgroups_train.target.shape       # (1073,)

mldata.org repository

The sklearn.datasets package is able to download datasets directly from the mldata.org repository using the fetch_mldata function. For example, to download the MNIST digit recognition database, which contains a total of 70000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to 9:

import numpy as np
from sklearn.datasets import fetch_mldata

# data_home is the directory where the downloaded data will be cached
mnist = fetch_mldata('MNIST original', data_home=some_path)
mnist.data.shape        # (70000, 784)
mnist.target.shape      # (70000,)
np.unique(mnist.target) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Datasets in svmlight/libsvm format

scikit-learn includes utility functions for loading datasets in the svmlight/libsvm format. In this format each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ..., which makes it well suited to sparse data:

from sklearn.datasets import load_svmlight_file

X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
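
The loader returns the features as a scipy.sparse matrix. When a separate test file is loaded, it is worth fixing the number of features so that both matrices line up; a minimal sketch, where the test file path is a placeholder:

# reuse the training dimensionality so the train and test matrices match
X_test, y_test = load_svmlight_file("/path/to/test_dataset.txt",
                                    n_features=X_train.shape[1])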

Load datasets from web

More often, though, one will have to load datasets from the internet, from various repositories. The UC Irvine Machine Learning Repository is one such repository; it contains several hundred datasets, donated as far back as the 1980s.

import numpy as np
import urllib
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
raw_data = urllib.urlopen(url)                 # download the file
dataset = np.loadtxt(raw_data, delimiter=",")  # load the CSV file as a numpy matrix
# separate the eight feature columns from the target attribute (the ninth column)
X = dataset[:,0:8]
y = dataset[:,8]

Generated Datasets

Sometimes real datasets are not enough: either one needs data that follows specific patterns, or a diversity that cannot be achieved through real datasets. scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity, covering single- and multi-label data, regression, classification, clustering and more. Let's create several datasets for a classification problem:

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_gaussian_quantiles

plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)

plt.subplot(321)
plt.title("One informative feature, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
                             n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(322)
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(323)
plt.title("Two informative features, two clusters per class", fontsize='small')
X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)


plt.subplot(324)
plt.title("Multi-class, two informative features, one cluster",
          fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(325)
plt.title("Three blobs", fontsize='small')
X1, Y1 = make_blobs(n_features=2, centers=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(326)
plt.title("Gaussian divided into three quantiles", fontsize='small')
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.show()

Conclusion

Now that we've discussed how to get our data, we're ready to dive into data analysis.