Tuesday, August 25, 2015

Dimension reduction with Python


Introduction

Sometimes, no matter how good an algorithm is, it just doesn’t work. Or worse, it doesn’t pick up anything. Data can be quite noisy, and sometimes it’s just about impossible to figure out what went wrong.

It's worth noting that the most interesting machine learning challenges always involve some sort of feature engineering, where we try to use our insight into the problem to carefully craft additional features that the learner can hopefully pick up.

Garbage in, garbage out: that's what we know from real life, and not surprisingly the same pattern holds when applying machine learning methods to training data. To tackle the issue, dimensionality reduction goes in the opposite direction of feature engineering, cutting away features that are irrelevant or redundant. There are several good reasons to trim down the dimensions as much as possible:

  • Most models struggle with high-dimensional spaces, and superfluous features can confuse or mislead the learner.
  • More features mean more parameters to tune and a higher risk of overfitting.
  • Linearly correlated data is a common phenomenon in datasets; as a result, a dataset may have an artificially high number of dimensions while its intrinsic dimensionality is small.
  • Fewer dimensions mean faster training and more variations to try out, resulting in better end results.

Dimension reduction techniques

Before diving into more sophisticated algorithms, let's make a few educated guesses about how to get rid of the garbage in our data while keeping the valuable part of it.

Missing Values Ratio

Data columns with too many missing values are unlikely to carry much useful information. Thus, data columns whose fraction of missing values exceeds a given threshold can be removed. Naturally, the lower the threshold, the more aggressive the reduction.
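As a minimal sketch, assuming the data lives in a pandas DataFrame with hypothetical column names, dropping columns by missing-values ratio might look like this:

import numpy as np
import pandas as pd

# toy data frame with some missing values (hypothetical columns)
df = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                   'b': [np.nan, np.nan, np.nan, 1.0],
                   'c': [5.0, 6.0, 7.0, 8.0]})

threshold = 0.5                        # tolerate at most 50% missing values
missing_ratio = df.isnull().mean()     # fraction of missing values per column
reduced = df.loc[:, missing_ratio <= threshold]

print(reduced.columns.tolist())        # ['a', 'c'], column 'b' is dropped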

Low Variance

Other candidates are columns that change very little, that is, columns with low variance. Don't forget to normalize the data before comparing variances against the threshold, as variance is range dependent.
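Scikit-learn ships a VarianceThreshold selector for exactly this purpose. A small sketch, with the normalization step included and using made-up values, could look as follows:

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0, 2.0, 0.1],
              [0.0, 1.0, 0.2],
              [0.0, 3.0, 0.1],
              [0.0, 2.0, 0.2]])

# scale each feature to [0, 1] first, since variance is range dependent
X_scaled = MinMaxScaler().fit_transform(X)

# drop features whose scaled variance falls below the threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_scaled)

print(selector.get_support())   # [False  True  True]: the constant column goes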

High Correlation

Data columns with very similar trends are also likely to carry very similar information, in which case only one of them is needed to feed the machine learning model. Here we calculate the correlation between pairs of numerical columns (e.g. Pearson's product-moment coefficient) and between pairs of nominal columns (e.g. Pearson's chi-square value). Pairs of columns with a correlation higher than a threshold are reduced to a single column. A word of caution: depending on the measure used, the correlation values can be scale sensitive, so column normalization may be required for a meaningful comparison.
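For numerical columns, a hedged sketch with pandas might compute the pairwise Pearson correlations and drop one column out of each highly correlated pair (the column names and the 0.95 cut-off below are made up for illustration):

import numpy as np
import pandas as pd

# toy data: column 'b' is almost a linear function of column 'a'
rng = np.random.RandomState(0)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a,
                   'b': 2 * a + rng.normal(scale=0.01, size=100),
                   'c': rng.normal(size=100)})

corr = df.corr().abs()                               # pairwise |correlation|
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)                                       # ['b']
df_reduced = df.drop(columns=to_drop)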

Backward Feature Elimination

This technique allows us to remove the least important features. At a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features, n times. The input feature whose removal produces the smallest increase in the error rate is dropped, leaving us with n-1 input features. Each iteration k thus produces a model trained on n-k features with an error rate e(k). By selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.
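A minimal sketch of a single elimination round, assuming a recent scikit-learn and using cross-validated accuracy as the (inverse) error measure, might look like this:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
features = list(range(X.shape[1]))

# train on all n features, then on every (n-1)-sized subset
baseline = cross_val_score(clf, X[:, features], y, cv=5).mean()
scores = {}
for f in features:
    remaining = [g for g in features if g != f]
    scores[f] = cross_val_score(clf, X[:, remaining], y, cv=5).mean()

# drop the feature whose removal hurts the score the least
worst = max(scores, key=scores.get)
features.remove(worst)
print('baseline %.3f, dropped feature %d, score without it %.3f'
      % (baseline, worst, scores[worst]))

Repeating this round until the error exceeds the maximum tolerable rate yields the reduced feature set.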

Forward Feature Construction

This is the inverse of Backward Feature Elimination: we start with one feature only and progressively add one feature at a time, namely the feature that produces the highest increase in performance.
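Mirroring the sketch above, and under the same assumptions, a greedy forward selection loop could be written as:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# start with no features and greedily add the most helpful one each round
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print('added feature %d, cross-validated accuracy %.3f'
          % (best, scores[best]))

In practice one would stop adding features as soon as the score no longer improves by a meaningful amount.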

Principal Component Analysis

Principal Component Analysis (PCA), also known as the Karhunen-Loeve Transform, is a technique used to search for patterns in high-dimensional data. PCA reduces a set of possibly correlated, high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. It orthogonally transforms the original n coordinates of a data set into a new set of coordinates, the principal components, and reduces the dimensionality by projecting the data onto a lower-dimensional subspace. The lower-dimensional data preserves as much of the variance of the original data as possible. For example, a two-dimensional data set could be reduced by projecting the points onto a line; each instance in the data set would then be represented by a single value rather than a pair of values. A three-dimensional dataset could be reduced to two dimensions by projecting the variables onto a plane. In general, an n-dimensional dataset can be reduced by projecting it onto a k-dimensional subspace, where k is less than n. Let's have a look at a two-dimensional example of four points projected onto a one-dimensional space, a single line.
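As a minimal sketch of that idea, here are four hypothetical two-dimensional points projected onto a single principal component with scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

# four hypothetical two-dimensional points, lying roughly along a line
points = np.array([[1.0, 1.0],
                   [2.0, 2.2],
                   [3.0, 2.9],
                   [4.0, 4.1]])

# keep only the single direction of largest variance
pca = PCA(n_components=1)
projected = pca.fit_transform(points)

print(pca.components_)    # the direction (line) the points are projected onto
print(projected.ravel())  # one coordinate per point instead of two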

Implementation

Some of the presented techniques, while very effective, unfortunately cannot always be used in practice. For instance, both Backward Feature Elimination and Forward Feature Construction are quite time and computationally expensive; they are practically applicable only to data sets with an already relatively low number of input columns. PCA is a much less computationally expensive algorithm and is commonly used to explore and visualize high-dimensional data sets. It can also be used to compress data and to preprocess data before it is fed to another estimator.

Many implementations of PCA, including the one in scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:

X = U Σ V*

The columns of U are called the left singular vectors of the data matrix, the columns of V (appearing above as the conjugate transpose V*) are its right singular vectors, and the diagonal entries of Σ are its singular values. To make the algebra work, the dimensions of the three matrices have to match accordingly.

While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, here we are only interested in how they relate to the eigenvectors and eigenvalues of the covariance matrix of the data. With the observations arranged as columns of a mean-centered data matrix, the left singular vectors are the eigenvectors of the covariance matrix and the diagonal elements of Σ are proportional to the square roots of its eigenvalues. Eigenvectors found using SVD should therefore match those derived directly from the covariance matrix.
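As a small numerical check (this sketch assumes the observations are arranged as columns of a mean-centered matrix, with made-up values), the eigendecomposition of the covariance matrix and the SVD of the data matrix can be compared directly:

import numpy as np

# toy data: 3 features (rows) x 5 observations (columns)
X = np.array([[2.0, 0.5, -1.5, 1.0, -2.0],
              [1.0, -0.5, 0.5, -1.5, 0.5],
              [0.5, 1.5, -0.5, -1.0, -0.5]])
X = X - X.mean(axis=1, keepdims=True)   # mean-center each feature

# eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X))

# singular value decomposition of the data matrix itself
U, s, Vt = np.linalg.svd(X)

# the left singular vectors match the covariance eigenvectors (up to sign),
# and s**2 / (n - 1) recovers the eigenvalues
n = X.shape[1]
print(np.sort(eig_vals)[::-1])
print(s ** 2 / (n - 1))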

Let's see how this is done in practice. As you may remember, the Iris dataset consists of four dimensions: sepal length, sepal width, petal length and petal width. Let's apply PCA to the dataset and plot the results in three-dimensional space.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

# load the Iris dataset: 150 samples, 4 features, 3 species
iris = datasets.load_iris()
X = iris.data
y = iris.target

# set up a 3D plot
fig = plt.figure(1, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

# project the 4-dimensional data onto its first three principal components
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

# annotate each species at the centroid of its cluster in the projected space
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# scatter the projected samples, coloured by species
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.prism)

# hide the tick labels, the principal component values have no natural units
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()
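How much of the original variance does the projection retain? The fitted PCA object exposes this through its explained_variance_ratio_ attribute:

# fraction of the total variance captured by each principal component;
# for Iris the first component alone accounts for the bulk of the variance
print(pca.explained_variance_ratio_)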

Conclusion

Indeed, in the era of big data, when more is axiomatically better, we have re-discovered that too many noisy or even faulty input data columns often lead to less than desirable algorithm performance. Removing uninformative, or even worse, disinformative input attributes can help build a model over more extensive data regions, with more general classification rules, and overall with better performance on new, unseen data.

Tuesday, August 11, 2015

Hyperparameter optimization with Python


Introduction

In the previous articles we introduced several linear techniques, where, as you have probably noticed, we provided the algorithms with several parameters. The dependence of a machine learning algorithm on its learning parameters is the common case, and one has to check the performance under various parameter settings to achieve the best results. This task is no trifle and is called hyperparameter optimization, or model selection. It is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set.

Implementation

Grid Search

The traditional way of performing hyperparameter optimization is a grid search, or parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. Scikit-learn provides the GridSearchCV class implementing the technique. Let's try to find the best regularization strength (alpha), penalty type and number of iterations for the stochastic gradient descent classifier from the previous article.

import numpy as np
from time import time
from operator import itemgetter
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = SGDClassifier()

# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}\n".format(score.parameters))

# use a full grid over all parameters
param_grid = {"n_iter": [1, 5, 10],
              "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
              "penalty": ["none", "l1", "l2"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)

# GridSearchCV took 4.85 seconds for 63 candidate parameter settings.
# Model with rank: 1
# Mean validation score: 0.929 (std: 0.020)
# Parameters: {'penalty': 'l1', 'alpha': 0.0001, 'n_iter': 5}
 
# Model with rank: 2
# Mean validation score: 0.924 (std: 0.016)
# Parameters: {'penalty': 'l2', 'alpha': 1, 'n_iter': 10}

# Model with rank: 3
# Mean validation score: 0.922 (std: 0.018)
# Parameters: {'penalty': 'none', 'alpha': 0.0001, 'n_iter': 5}

The output of the procedure is a list, grid_scores_, consisting of the cross-validation scores of all evaluated parameter combinations. To get the best parameters, we order the list in descending order by mean score; for this dataset the best combination turns out to be L1 (Lasso) regularization with an alpha of 0.0001 and 5 iterations. For more details about regularization types, have a look at the broader explanation in the linear regression article.
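Instead of sorting grid_scores_ by hand, the best combination is also available directly on the fitted search object through its standard best_score_ and best_params_ attributes:

# best mean cross-validation score and the parameters that achieved it
print(grid_search.best_score_)
print(grid_search.best_params_)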

Randomized Grid Search

Since grid search is exhaustive and therefore potentially expensive, there is a randomized variation of the technique, implemented by the RandomizedSearchCV class in scikit-learn. It doesn't try out all parameter values; instead, a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by the n_iter parameter.

from sklearn.grid_search import RandomizedSearchCV
from scipy.stats import randint, uniform

...

# specify parameters and distributions to sample from
param_dist = {"n_iter": randint(1, 11),
              "alpha": uniform(scale=0.01),
              "penalty": ["none", "l1", "l2"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

...

# RandomizedSearchCV took 1.18 seconds for 20 candidates parameter settings.
# Model with rank: 1
# Mean validation score: 0.920 (std: 0.021)
# Parameters: {'penalty': 'l1', 'alpha': 0.0006507721384509924, 'n_iter': 7}
 
# Model with rank: 2
# Mean validation score: 0.919 (std: 0.016)
# Parameters: {'penalty': 'none', 'alpha': 0.0008117236028203323, 'n_iter': 9}
 
# Model with rank: 3
# Mean validation score: 0.918 (std: 0.018)
# Parameters: {'penalty': 'l1', 'alpha': 0.0012165253638165819, 'n_iter': 5}

Here, instead of providing the procedure with lists of values, we specified distributions from which the values are drawn. As we set n_iter to 20, the procedure evaluated 20 random combinations of the parameters, and following the previous logic, the best turned out to be L1 regularization with an alpha of roughly 0.00065 and 7 iterations.

Conclusion

The randomized search and the grid search explore a very similar space of parameters, and the resulting parameter settings are quite similar, while the run time for the randomized search is drastically lower. Note that in practice one would not search over this many different parameters simultaneously using grid search, but would pick only the ones deemed most important.