Tuesday, August 25, 2015

Dimension reduction with Python


Sometimes, no matter how good an algorithm is, it just doesn’t work. Or worse, it doesn’t pick up anything. Data can be quite noisy, and sometimes it’s just about impossible to figure out what went wrong.

It's worth noticing that the most interesting machine learning challenges always involve some sort of feature engineering, where we try to use our insight into the problem to carefully craft additional features, that the machine learner hopefully picks up.

Garbage in, garbage out, that's what we know from real life. Not surprisingly this pattern also holds true, when applying machine learning methods to training data. To tackle the issue, we will go in the opposite direction with dimensionality reduction involving cutting away features that are irrelevant or redundant. There are several good reasons to trim down the dimensions as much as possible:

  • Most of the models hate high-dimensional spaces and superfluous features, which often irritate or mislead the learner.
  • More features means more parameters to tune and a higher risk of overfitting.
  • Linearly correlated data is common phenomenon in datasets, which as a result might just have artificial high dimensions, whereas the real dimension might be small.
  • Less dimensions means faster training and more variations to try out, resulting in better end results.

Dimension reduction techniques

Before diving into more sophisticated algorithms, let's try to get an educated guess how to get rid of the garbage within our data, while keeping the valuable part of it.

Missing Values Ratio

Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. Naturally, the higher the threshold, the more aggressive the reduction.

Low Variance

Another candidates are columns with little changes, that is low variance. Don't forget to normalize the data before comparing against the threshold, as variance is range dependent.

High Correlation

Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical and nominal columns as the Coefficient and the Pearson’s chi square value respectively. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.

Backward Feature Elimination

This great technique allows us to remove least important features. At a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. Each iteration k produces a model trained on n-k features and an error rate (k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.

Forward Feature Construction

This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance.

Principal Component Analysis

Principal Component Analysis, also known as the Karhunen-Loeve Transform, is a technique used to search for patterns in high-dimensional data. PCA reduces a set of possibly-correlated, high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. The technique orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components by reducing the dimensions of a data set projecting the data onto a lower-dimensional subspace. The lower-dimensional data will preserve as much of the variance of the original data as possible. For example, a two dimensional data set could be reduced by projecting the points onto a line; each instance in the data set would then be represented by a single value rather than a pair of values. A three-dimensional dataset could be reduced to two dimensions by projecting the variables onto a plane. In general, an n-dimensional dataset can be reduced by projecting the dataset onto a k-dimensional subspace, where k is less than n. Let's have a look at two dimensional example of four points, projected onto on dimensional space - a single line.


Some of the presented technique, while being very effective, unfortunately not always can be used in practice. For instance both Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns. PCA is much less computationally expensive algorithm and is commonly used to explore and visualize high-dimensional data sets. It can also be used to compress data, and process data before it is used by another estimator.

Many implementations of PCA, including the one of scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:

The columns of U are called left singular vectors of the data matrix, the columns of conjugately transposed matrix V are its right singular vectors, and Σ the diagonal entries of are its singular values. To make the algebra work, the matrices have to follow the dimensions rules, as depicted in the following illustration

While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, we are only interested in them as they relate to the eigenvectors and eigenvalues of the data matrix. Specifically, the left singular vectors are the eigenvectors of the covariance matrix and the diagonal elements of Σ are the square roots of the eigenvalues of the covariance matrix. Eigenvectors found using SVD should be similar to those derived from a covariance matrix.

Let's see how it's done in practice. As you remember Iris dataset consists of four dimensions: sepal length, sepal width, petal length and lastly petal width. Let's apply PCA on the dataset and plot the results in 3 dimensional space.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

pca = decomposition.PCA(n_components=3)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.prism)




Indeed, in the era of big data, when more is axiomatically better, we have re-discovered that too many noisy or even faulty input data columns often lead to a less than desirable algorithm performance. Removing un-informative or even worse dis-informative input attributes might help build a model on more extensive data regions, with more general classification rules, and overall with better performances on new unseen data.