Monday, December 28, 2015

# K-nearest Neighbors (KNN) in Python

## Introduction

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

## Implementation

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 15

# import some data to play with
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors)
clf.fit(X, y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()


Now let's see how RadiusNeighborsClassifier works. After playing a bit with hyperparameters, we'll achieve the following plot:

...
...


## Conclusion

Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits or satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

Thursday, December 3, 2015

# Classifier Boosting with Python

## Introduction

Remember we've talked about random forest and how it was used to improve the performance of a single Decision Tree classifier. The idea of fitting a number of decision tree classifiers on various sub-samples of the dataset and using averaging to improve the predictive accuracy can be used to other algorithms as well and it's called boosting. There are several boosting techniques, which can be used to improve our algorithm, we'll cover the most used ones: AdaBoost and Bagging boost.

An AdaBoost classifier begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

### Bagging boost

A Bagging classifier fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions to form a final prediction.

## Implementation

All boosting methods are located under sklearn.ensemble module. Let's see first how to boost SVM using bagging classifier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_gaussian_quantiles
from sklearn.svm import SVC

# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
n_samples=200, n_features=2,
n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
n_samples=300, n_features=2,
n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

# Create and fit an boosted svm
bdt = BaggingClassifier(SVC())

bdt.fit(X, y)

plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1],
c=c, cmap=plt.cm.Paired,
label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')


Now, let's change the line 18th to AdaBoost and Decision Tree

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
...
...


## Conclusion

Boosting techniques may dramatically improve the base algorithm and serve an irreplaceable tool in any data scientists toolkit.

Thursday, November 19, 2015

# K-means clustering with Python

## Introduction

K-means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain K number of clusters. The main idea is to define K centroids, one for each cluster. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. At this point we need to re-calculate K new centroids of the clusters resulting from the previous step. After we have these K new centroids, a new binding has to be done between the same data set points and the nearest new centroid. As a result of this loop we may notice that the K centroids change their location step by step until no more changes are done.

## Implementation

Scikit-learn provides with full implementation of K-means algorithm though KMeans class. Let's have a look at several interesting situations, which might occur during data clustering:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropicly distributed data
transformation = [[ 0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
cluster_std=[1.0, 2.5, 0.5],
random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_filtered)

plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()


At first we wrongly assume that there are 3 clusters in the data and make K-means find them. Later we transform the data anisotropicly. Since K-means uses Euclidian distance to to associate points to clusters, it does not work well with non-globular clusters. The last two datasets introduce blobs of different variance and size, but this makes no difference and the classification succeeds.

## Conclusion

K-means provides us with easy to use clustering algorithms. It's fast, easy to follow and is used vastly in various fields including Vector Quantization.

Sunday, November 8, 2015

# Naïve Bayes with Python

## Introduction

The Naive Bayes algorithm is based on conditional probabilities. It uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. If B represents the dependent event and A represents the prior event, Bayes' theorem can be stated as follows. To calculate the probability of B given A, the algorithm counts the number of cases where A and B occur together and divides it by the number of cases where A occurs alone.

## Implementation

Scikit-learn provides implementation of Naïve Bayes algorithm of 3 flavors: MultinomialNB implementing the naive Bayes algorithm for multinomially distributed data; GaussianNB implementing the Gaussian Naive Bayes algorithm for classification; and BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions.

Let's take a look at Naïve Bayes algorithm at work classifying Iris data and since anything the nature produces is distributed according to a Gaussian distribution, we'll be using this appropriate class

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
[1, 2], [1, 3], [2, 3]]):
# We only take the two corresponding features
X = iris.data[:, pair]
y = iris.target

# Shuffle
idx = np.arange(X.shape[0])
np.random.seed(13)
np.random.shuffle(idx)
X = X[idx]
y = y[idx]

# Train
clf = GaussianNB().fit(X, y)

# Plot the decision boundary
plt.subplot(2, 3, pairidx + 1)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

plt.xlabel(iris.feature_names[pair[0]])
plt.ylabel(iris.feature_names[pair[1]])
plt.axis()

# Plot the training points
for i, color in zip(range(n_classes), plot_colors):
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1], c=color,
label=iris.target_names[i],
cmap=plt.cm.Paired)
plt.axis()

plt.legend(loc="upper left")
plt.show()


Pretty cool, isn't it!

## Conclusion

The Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows. You'll need, however, a big data set in order to make reliable estimations of the probability of each class. You can use Naïve Bayes classification algorithm with a small data set, but precision and recall will keep very low. For small reminder about what those are, have a look at performance metrics section here.

Tuesday, October 27, 2015

# Python for Data Scientists - Rodeo

## Introduction

I love Python, I really do and that goes for IPython as well - it's a great tool and simplifies the work by a lot. But.. there is always a but, isn't it? RStudio is so much better and until recently we, the Python data enthusiasts, could only nervously look at RStudio while working on somewhat beloved, somewhat limped brother IPython. Well, no more. Let me introduce you Rodeo

The IDE is free and super easy to use, it's very similar to RStudio and after you watch the introduction video above, you'll be ready to go.

Wednesday, October 7, 2015

# Random Forest with Python

## Introduction

In the article about decision tree we've talked about it's drawbacks of being sensitive to small variations or noise in the data. Today we'll see how to deal with them by introducing a random forest. It belongs to a larger class of machine learning algorithms called ensemble methods, which use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. So which models does random forest aggregate? You might already know the answer - the decision trees. It fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.

## Implementation

Scikit-learn provides us with two classes RandomForestClassifier and RandomForestRegressor for classification and regression problems respectively. Let's use the code from the previous example and see how the result will different, using random forest with 100 trees.

...
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
...


As you can see the edges are much smoother, that is less overfitted, than using a single decision tree.

## Conclusion

Despite it's relative simplicity, random forest performs the job remarkably well. According to empirical comparisons, even better than SVMs.

Monday, September 28, 2015

# Decision Tree with Python

## Introduction

Last time we talked about non-linear classifier using Support Vector Machines or SVM. Today we'll be discussing another non-linear classifier and regressor called decision tree. The way decision tree works is by creating a model, which predicts the value of a target variable by learning simple decision rules inferred from the data features.

Since trees can be visualized and is something we're all used to, decision trees can easily be explained, visualized and manipulated the non-linearity in an intuitive manner. Surely there are some disadvantages as well, but we'll note them a bit later, firstly let's see them in action.

## Implementation

It won't come as a complete surprise to you, that scikit package has already taken initiative and implemented the whole thing using DecisionTreeRegressor and DecisionTreeClassifier classes. What is left for us is to bear the fruits of someone else's hard labour.

import StringIO
import numpy as np
import matplotlib.pyplot as plt
import pydot
from IPython.display import Image
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
[1, 2], [1, 3], [2, 3]]):
# We only take the two corresponding features
X = iris.data[:, pair]
y = iris.target

# Shuffle
idx = np.arange(X.shape[0])
np.random.seed(13)
np.random.shuffle(idx)
X = X[idx]
y = y[idx]

# Standardize
mean = X.mean(axis=0)
std = X.std(axis=0)
X = (X - mean) / std

# Train
clf = DecisionTreeClassifier().fit(X, y)

# Plot the decision boundary
plt.subplot(2, 3, pairidx + 1)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

plt.xlabel(iris.feature_names[pair[0]])
plt.ylabel(iris.feature_names[pair[1]])
plt.axis()

# Plot the training points
for i, color in zip(range(n_classes), plot_colors):
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1], c=color,
label=iris.target_names[i],
cmap=plt.cm.Paired)
plt.axis()

plt.legend(loc="upper left")
plt.show()


First we normalize the data and then draw decision boundaries and at last the data itself.

Take a look at the first subplot and let's compare it with one, where we used logistic regression. Back then we couldn't classify the data using these features, now using the decision tree we surely can.

## Visualization

Since decision tree uses a tree data-structure, wouldn't it be cool to visualize it.
Notice: You'll need to install GraphViz package to run this example

...
dot_data = StringIO.StringIO()
tree.export_graphviz(clf, out_file=dot_data,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())


Now, when we have our tree in place, let's see how the decisions were made with Gini coefficient attached to each node. What Gini coefficient measures is the inequality among values of a frequency distribution, in our case iris species. A Gini coefficient of zero expresses perfect equality, where all values are the same - all iris are of the same species.

So the first check is against septal length being lesser than -0.7442, and from Gini coefficient 0.6667 we can deduce that it splits the data with one third going into a single category, setosa. The process continues until we reach Gini coefficient 0, that is the remaining data is of a single category.

## Conclusion

One of the notable advantages of using decision trees is it's prediction performance, which is logarithmic in the number of data points used to train the tree. But there is are some downsides as well, which are needed to be considered as well. The first one is can be seen from our first subplot example - overfitting. Trees tend to perform incredibly well at the top, but at the same time tend to overfit at the bottom. This is a major downsite and therefore trees should be pruned! Also decision trees can be unstable because small variations or noise in the data might result in a completely different tree being generated. However this problem is mitigated by using decision trees within an ensemble, of which we'll talk next time.

Tuesday, September 8, 2015

# Support Vector Machines with Python

## Introduction

Having learnt different optimization and classification methods, you must feel quit confident to start exploring the datasets of interest. However you might very quickly run into a dataset, which no matter how hyper-parameters are set, is just not linearly separable. Surely there is no place for despair, especially since we have just a classifier to deal with these situation, called Support Vector Machine.

## How it works

Support Vector Machine, or SVM, are a set of supervised learning methods used for classification and with a slight change for regression. The core idea of it is to linearly separate the hyper-space of features. The prefix hyper is not occasional, as SVM increases the dimension of feature space to achieve it's goal. The power of the method comes from using kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates, thus allowing SVM to operate on datasets with much bigger feature space, like image recognition. Consider the following example:

There's no linear decision boundary for this dataset, which will separate observations of two classes. If you apply linear classifier, you'll just receive an "arbitrary" line throughout the space crossing both of the classes - you just cannot do it correctly with logistic regression.

The way SVM approaches the problem, is by augmenting the dataset with additional dimensions and trying to maximize the margin between the classes. It places the separation so that the distance to closest misclassified entity is the widest. Have a look at the following illustration depicting the process:

Applying a kernel function, it transforms the observations to a space, where they can be linearly separable, resulting the following separation:

## Implementation

It won't come as a big surprise to you, that scikit-learn has a fully implemented and optimized SVM support, so let's have a look how it deals with various situations. We'll create three demo datasets, one in shape of moons, one of circles and the last will contain linearly separable observations. Then we'll apply SVM with two different kernel functions: linear and radial based function. In the end we'll plot the observations and color the background according to distance from boundary hyperplane using diverging palette, that is the farther it's from center the darker the color (red or blue depending on the sign). Take a closer look into the example code and make sure you understand the process.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.svm import SVC

h = .02  # step size in the mesh
names = ["Linear SVM", "RBF SVM"]
classifiers = [
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1)]

X, y = make_classification(n_features=2, n_redundant=0,
random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
make_circles(noise=0.2, factor=0.5, random_state=1),
linearly_separable
]

figure = plt.figure(figsize=(9, 9))
i = 1
# iterate over datasets
for ds in datasets:
# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.4)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

# just plot the dataset first
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
cmap=cm_bright, alpha=0.6)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
i += 1

# iterate over classifiers
for name, clf in zip(names, classifiers):
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

# Plot the decision boundary. We'll assign a color to
# each point in the mesh [x_min, m_max]x[y_min, y_max].
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

# Plot also the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
cmap=cm_bright, alpha=0.6)

ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3,
('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')
i += 1

plt.show()


The results, as anticipated, are much more promising than using a linear classifier. For moon and circle shaped datasets, there is tremendous increase classification score, where as for linearly separable observations the result doesn't change. This is of course understandable as linearly separable dataset can be easily dealt with using linear classifier.

## Conclusion

SVM is a powerful tool, however from a practical point of view perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic with the number of samples, which makes it hard to scale to dataset with more than a couple of 10000 samples.

Tuesday, August 25, 2015

# Dimension reduction with Python

## Introduction

Sometimes, no matter how good an algorithm is, it just doesn’t work. Or worse, it doesn’t pick up anything. Data can be quite noisy, and sometimes it’s just about impossible to figure out what went wrong.

It's worth noticing that the most interesting machine learning challenges always involve some sort of feature engineering, where we try to use our insight into the problem to carefully craft additional features, that the machine learner hopefully picks up.

Garbage in, garbage out, that's what we know from real life. Not surprisingly this pattern also holds true, when applying machine learning methods to training data. To tackle the issue, we will go in the opposite direction with dimensionality reduction involving cutting away features that are irrelevant or redundant. There are several good reasons to trim down the dimensions as much as possible:

• Most of the models hate high-dimensional spaces and superfluous features, which often irritate or mislead the learner.
• More features means more parameters to tune and a higher risk of overfitting.
• Linearly correlated data is common phenomenon in datasets, which as a result might just have artificial high dimensions, whereas the real dimension might be small.
• Less dimensions means faster training and more variations to try out, resulting in better end results.

## Dimension reduction techniques

Before diving into more sophisticated algorithms, let's try to get an educated guess how to get rid of the garbage within our data, while keeping the valuable part of it.

### Missing Values Ratio

Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. Naturally, the higher the threshold, the more aggressive the reduction.

### Low Variance

Another candidates are columns with little changes, that is low variance. Don't forget to normalize the data before comparing against the threshold, as variance is range dependent.

### High Correlation

Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical and nominal columns as the Coefficient and the Pearson’s chi square value respectively. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.

### Backward Feature Elimination

This great technique allows us to remove least important features. At a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. Each iteration k produces a model trained on n-k features and an error rate (k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.

### Forward Feature Construction

This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance.

### Principal Component Analysis

Principal Component Analysis, also known as the Karhunen-Loeve Transform, is a technique used to search for patterns in high-dimensional data. PCA reduces a set of possibly-correlated, high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. The technique orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components by reducing the dimensions of a data set projecting the data onto a lower-dimensional subspace. The lower-dimensional data will preserve as much of the variance of the original data as possible. For example, a two dimensional data set could be reduced by projecting the points onto a line; each instance in the data set would then be represented by a single value rather than a pair of values. A three-dimensional dataset could be reduced to two dimensions by projecting the variables onto a plane. In general, an n-dimensional dataset can be reduced by projecting the dataset onto a k-dimensional subspace, where k is less than n. Let's have a look at two dimensional example of four points, projected onto on dimensional space - a single line.

## Implementation

Some of the presented technique, while being very effective, unfortunately not always can be used in practice. For instance both Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns. PCA is much less computationally expensive algorithm and is commonly used to explore and visualize high-dimensional data sets. It can also be used to compress data, and process data before it is used by another estimator.

Many implementations of PCA, including the one of scikit-learn, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:

The columns of U are called left singular vectors of the data matrix, the columns of conjugately transposed matrix V are its right singular vectors, and Σ the diagonal entries of are its singular values. To make the algebra work, the matrices have to follow the dimensions rules, as depicted in the following illustration

While the singular vectors and values of a matrix are useful in some applications of signal processing and statistics, we are only interested in them as they relate to the eigenvectors and eigenvalues of the data matrix. Specifically, the left singular vectors are the eigenvectors of the covariance matrix and the diagonal elements of Σ are the square roots of the eigenvalues of the covariance matrix. Eigenvectors found using SVD should be similar to those derived from a covariance matrix.

Let's see how it's done in practice. As you remember Iris dataset consists of four dimensions: sepal length, sepal width, petal length and lastly petal width. Let's apply PCA on the dataset and plot the results in 3 dimensional space.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

centers = [[1, 1], [-1, -1], [1, -1]]
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
ax.text3D(X[y == label, 0].mean(),
X[y == label, 1].mean() + 1.5,
X[y == label, 2].mean(), name,
horizontalalignment='center',
bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.prism)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()


## Conclusion

Indeed, in the era of big data, when more is axiomatically better, we have re-discovered that too many noisy or even faulty input data columns often lead to a less than desirable algorithm performance. Removing un-informative or even worse dis-informative input attributes might help build a model on more extensive data regions, with more general classification rules, and overall with better performances on new unseen data.

Tuesday, August 11, 2015

# Hyperparameter optimization with Python

## Introduction

In the previous articles we introduced several linear techniques, where as you have probably noticed, we provided the algorithms with several parameters. The dependence of machine learning algorithm upon learning parameters is a common case though and one has to check the performance of various parameters to achieve the best results. The task of course is no trifle and is called hyperparameter optimization or model selection. It is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set.

## Implementation

### Grid Search

The traditional way of performing hyperparameter optimization is a grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. Scikit-learn provides us with a class GridSearchCV implementing the technique. Let's try to find the best learning rate, regularization technique and batch size for stochastic gradient descent from the previous article.

import numpy as np
from time import time
from operator import itemgetter
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier

# get some data
X, y = digits.data, digits.target

# build a classifier
clf = SGDClassifier()

# Utility function to report best scores
def report(grid_scores, n_top=3):
top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
for i, score in enumerate(top_scores):
print("Model with rank: {0}".format(i + 1))
print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
score.mean_validation_score,
np.std(score.cv_validation_scores)))
print("Parameters: {0}\n".format(score.parameters))

# use a full grid over all parameters
param_grid = {"n_iter": [1, 5, 10],
"alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
"penalty": ["none", "l1", "l2"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)

# GridSearchCV took 4.85 seconds for 63 candidate parameter settings.
# Model with rank: 1
# Mean validation score: 0.929 (std: 0.020)
# Parameters: {'penalty': 'l1', 'alpha': 0.0001, 'n_iter': 5}

# Model with rank: 2
# Mean validation score: 0.924 (std: 0.016)
# Parameters: {'penalty': 'l2', 'alpha': 1, 'n_iter': 10}

# Model with rank: 3
# Mean validation score: 0.922 (std: 0.018)
# Parameters: {'penalty': 'none', 'alpha': 0.0001, 'n_iter': 5}


The output of the procedure is a list, grid_scores_, consisting of cross evaluation score of all provided parameters. To get the best parameters, we order the list in a descending order by score mean resulting the best parameters for the provided dataset using Lasso regularization, learning rate of 0.0001 in batches of 5. For more details about regularization types, have a look a broader explanation in linear regression article.

### Randomized Grid Search

Since grid searching is an exhaustive and therefore potentially expensive method, there is a random variation of the technique and is implemented by RandomizedSearchCV class in scikit-learn. It doesn't try out all parameter values, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter parameter.

from sklearn.grid_search import RandomizedSearchCV
from scipy.stats import randint, norm

...

# specify parameters and distributions to sample from
param_dist = {"n_iter": randint(1, 11),
"alpha": uniform(scale=0.01),
"penalty": ["none", "l1", "l2"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
n_iter=n_iter_search)

...

# RandomizedSearchCV took 1.18 seconds for 20 candidates parameter settings.
# Model with rank: 1
# Mean validation score: 0.920 (std: 0.021)
# Parameters: {'penalty': 'l1', 'alpha': 0.0006507721384509924, 'n_iter': 7}

# Model with rank: 2
# Mean validation score: 0.919 (std: 0.016)
# Parameters: {'penalty': 'none', 'alpha': 0.0008117236028203323, 'n_iter': 9}

# Model with rank: 3
# Mean validation score: 0.918 (std: 0.018)
# Parameters: {'penalty': 'l1', 'alpha': 0.0012165253638165819, 'n_iter': 5}


Here, instead of providing the procedure with list of values, we gave a desired distribution from which the values would be drawn. As we set the n_iter to be 20, the procedure evaluated 20 random variation of the parameters and following the previous logic, the best appeared to be using Lasso regularization, learning rate of 0.00065 in batches of 7.

## Conclusion

The randomized search and the grid search explore exactly the same space of parameters. The result in parameter settings is quite similar, while the run time for randomized search is drastically lower. Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.

Wednesday, July 22, 2015

## Introduction

In the previous two articles, we've talked about linear and logistic regression, covering by it most linear models. And that would be it, if only these models could process data with numerous features, that is x1, x2... xn-1, xn. Let's us see how our classification model deals with Olivetti faces classification dataset:

from sklearn import datasets, metrics, linear_model, cross_validation
import matplotlib.pyplot as plt
import time

faces = datasets.fetch_olivetti_faces()

start = time.time()

X = faces.data
Y = faces.target

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)

# Create a classifier: a support vector classifier
model = linear_model.LogisticRegression(C=10000)

# We learn the faces of train set
model.fit(x_train, y_train)

# Now predict the person on test set
expected = y_test
predicted = model.predict(x_test)

end = time.time()
print("%.2f seconds" % (end - start)) # 8.85 seconds

# Reshape data into images
x_train_images = x_train.reshape(len(x_train), 64, 64)
x_test_images = x_test.reshape(len(x_test), 64, 64)

# Plot train images
for index, (image, label) in enumerate(
zip(x_train_images, y_train)[:4]):
plt.subplot(2, 4, index + 1)
plt.axis('off')
plt.imshow(image, cmap=plt.cm.gray)
plt.title('Train: %i' % label)

# Plot test images
for index, (image, prediction, expectation) in enumerate(
zip(x_test_images, predicted, expected)[:4]):
plt.subplot(2, 4, index + 5)
plt.axis('off')
plt.imshow(image, cmap=plt.cm.gray)
plt.title('Predict: %i/%i' % (prediction, expectation))


The dataset contains 64x64 different images of 40 distinct subjects, resulting each item of the dataset containing 4096 features. Despite the good performance of our classifier, we won't dive into measures now, it took almost 9 seconds to process this toy dataset, which is not good enough if we want to learn really big datasets.

The reason for this delay, is the way the classifier solves the optimization problem. Our models analytically solved the values of the model's parameters, which minimize the cost function with the following equation:

X is the matrix of the values of the explanatory variables for each training example. The dot product inside the brackets results a square matrix with dimensions nxn, where n is the number of features. The computational complexity of inverting this square matrix is nearly cubic in the number of explanatory variables. Thus the more features we have, the more complex the computation will be, consequently taking more time.

Given a function defined by a set of parameters, gradient descent iteratively moves toward a set of parameter values, which minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient. As though a blindfolded man, who is trying to find his way from somewhere on a mountainside to the lowest point of the valley. He takes a step in the direction with the steepest decline and the sizes of his steps are proportional to the steepness of the terrain. If he were to take large steps, he may have accidentally stepped over the valley's lowest point. In opposite case, while he would reach the desired point, however since the steps are too small, it would take him a considerable amount of time. The proportion is specified by the learning rate provided to the model. When used to minimize the function, a standard (or "batch") gradient descent method would perform the following iterations until it converges:

Standard gradient descent evaluates the sum-gradient, which may require expensive evaluations of the gradients from all sums and functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single example:

We compute an estimate or approximation to this direction. The most simple way is to just look at one training example (or subset of training examples) and compute the direction to move only on this approximation. It is called as Stochastic because the approximate direction that is computed at every step can be thought of a random variable of a stochastic process. Stochastic method converges much faster compared to standard, but the error function is not as well minimized as in the case of the latter. Often in most cases, the close approximation that you get in stochastic method for the parameter values are enough because they reach the optimal values and keep oscillating there.

A compromise between the two forms called "mini-batches" computes the gradient against more than one training examples at each step. This can perform significantly better than true stochastic gradient descent, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.

## Implementation

Scikit-learn provides us with both linear and logistic regression implementation using gradient descent through SGDRegressor and SGDClassifier classes respectively. By default the models use stochastic method with batches of 5 examples. Since all scikit-learn models follow the same API, the only needed change to the previous example, is to change the model instantiation.

...
model = linear_model.SGDClassifier(alpha=0.00001)
...


After we change the classifier, the code finished within less then a second!

## Conclusion

The important point to remember is that gradient descent looks for local minima, which sometimes may differ from the global one. The outcome of the process is significantly dependent upon where it starts, as illustrated below. To gain more confidence in the results, one must run it several times.

Hope you've enjoyed the article and feel free to comment below. In the next article we'll look into techniques to choose the parameters provided to the models, to achieve better results.

Wednesday, July 8, 2015

# Logistic regression with Python

## Introduction

After having mastered linear regression in the previous article, let's take a look at logistic regression. Despite its name, it is not that different from linear regression, but rather a linear model for classification achieved by using sigmoid function instead of polynomial one. Specifically we'll be using logistic function, hence the name of the regression. Don't worry, in the future articles we'll be covering the usage of other sigmoid functions as well.

## Implementation

The implementation of logistic regression in scikit-learn can be accessed from class LogisticRegression. This implementation can fit a multiclass logistic regression with optional L1 or L2 regularization. In the next example we'll classify iris flowers according to their sepal length and width:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

model = linear_model.LogisticRegression(C=10000) # C = 1/alpha
model.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]:[y_min, y_max]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()


Firstly, pay attention to the inversed regularization parameter, C, provided to the classifier. As with linear regression, scikit provides class, LogisticRegressionCV, to evaluate different learning rates. Secondly, though not related directly to the topic, have a look at meshgrid feature of numpy library and how it is used to plot a decision boundary.

## Performance metrics

As opposed to linear regression, classification results cannot be evaluated with R-squared measure. A variety of metrics exist to evaluate the performance of classifiers against labels. The most common metrics are accuracy, precision, recall, F1 score and ROC AUC score. All of these measures depend on the concepts of true positives, true negatives, false positives and false negatives. All these apply to a binary classifier, but with a little trick of setting a positive label, the measures can be used on multiclass models as well.

### True positives (TP)

The amount of labels, which were correctly identified by the classifier, that is assigned point labeled A to class A.

### True negatives (TN)

The amount of labels, which were correctly rejected by the classifier, that is assigned point labeled A' to class A'.

### False positives (FP)

The amount of labels, which were incorrectly identified by the classifier, that is assigned point labeled A' to class A.

### False negatives (FN)

The amount of labels, which were incorrectly rejected by the classifier, that is assigned point labeled A to class A'.

### Accuracy

Accuracy measures a fraction of the classifier's predictions that are correct, that is the number of correct assessments divided by the number of all assessments - (TN + TP)/(TN + TP + FN + FP).

### Precision (P)

Precision is the fraction of positive predictions that are correct - TP/(TP + FP). Be careful as classifier predicting only a single positive instance, that happens to be correct, will achieve perfect precision.

### Recall (R)

Recall, sometimes called sensitivity in medical domains, measures the fraction of the truly positive instances. A score of 1 indicates, that no false negative were present - TP/(TP + FN). Be careful as classifier predicting positive for every example will achieve a recall of 1.

### F1 score

Both precision and recall scores provide an incomplete view on the classifier performance and sometimes may provide skewed results. The F1 measure provides a better view by calculating weighted average of the scores - 2*P*R/(P + R). A model with perfect precision and recall scores will achieve an F1 score of one.

### ROC AUC

ROC curves plot the classifier's recall against its fall-out, false positive rate, is the number of false positives divided by the total number of negatives - FP/(TN + FP). AUC is the area under the ROC curve; it reduces the ROC curve to a single value, which represents the expected performance of the classifier. A classifier that predicts classes randomly, has an AUC of 0.5. Any number above this, outperforms the random guessing. Unfortunately scikit-learn supports only binary classifier, when it comes to ROC, however iterating over each label and setting it as positive one, results what we need:
import numpy as np
from sklearn import linear_model, datasets, cross_validation, metrics

# import some data to play with
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
model = linear_model.LogisticRegression(C=10000)

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X, Y)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Accuracy: %2f" % metrics.accuracy_score(y_test, y_pred))
print("Precision: %2f" % metrics.precision_score(y_test, y_pred, average="macro"))
print("F1: %2f" % metrics.f1_score(y_test, y_pred, average="macro"))

for label in np.arange(3):
false_positive_rate, recall, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=label)
roc_auc = metrics.auc(false_positive_rate, recall)
plt.plot(false_positive_rate, recall, label='AUC(%d) = %0.2f' % (label, roc_auc))

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
# Accuracy: 0.815789
# Precision: 0.817460
# F1: 0.813789


## Conclusion

This concludes the classification topic and with linear regression, covers linear models. Next time we'll talk about techniques used to handle numerous explanatory variables.

Wednesday, June 24, 2015

# Linear Regression with Python

## Introduction

Our first insight into machine learning will be through the simplest model - linear regression. The goal in regression problems is to predict the value of a continuous response variable. First we'll examine linear regression, which models the relationship between a response variable and one explanatory variable. Next, we will discuss polynomial regression and regularization methods.

## Simple model

Linear regression tries to minimize the residual sum of squares between the observed responses in the dataset, and the responses predicted by the linear approximation. Mathematically it solves a problem of the form:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.cross_validation import train_test_split

# Use only one feature
x = diabetes.data[:, np.newaxis]
x_temp = x[:, :, 2]
y = diabetes.target

# Split the data into training/testing sets
x_train, x_test, y_train, y_test = train_test_split(x_temp, y)

# Create linear regression object
model = linear_model.LinearRegression()

# Train the model using the training sets
model.fit(x_train, y_train)

# The coefficients
model.coef_                       # array([ 938.23786125])

# The mean square error
np.mean((model.predict(x_test) - y_test) ** 2) # 4017.22

# Explained variance score: 1 is perfect prediction
model.score(x_test, y_test)       # 0.31

# Plot outputs
plt.scatter(x_test, y_test,  color='black')
plt.plot(x_test, model.predict(x_test), color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()


## Cross validation

We used the train_test_split function to randomly partition the data into training and test sets. The proportions of the data for both partitions can be specified using keyword arguments. By default, 25 percent of the data is assigned to the test set. Finally, we trained the model and evaluated it on the test set. The resulted r-squared score of 0.31 indicates that 31 percent of the variance in the test set is explained by the model. However the performance might have changed if a different data was chosen as the training set.

To obtain a better understanding of the estimator's performance, we can use cross-validation, whose goal is to define a dataset to "test" the model in the training phase, in order to limit problems like overfitting, and to give an insight on how the model will generalize to an independent dataset.

from sklearn.cross_validation import cross_val_score

...

scores = cross_val_score(model, x_temp, diabetes.target)
scores        # array([0.2861453, 0.39028236, 0.33343477])
scores.mean() # 0.3366


cross_val_score by default uses three-fold cross validation, that is, each instance will be randomly assigned to one of the three partitions. Each partition will be used to train and test the model. It returns the value of the estimator's score method for each round. The r-squared scores range from 0.29 to 0.39! The mean of the scores, 0.34, is a better estimate of the estimator's predictive power than the r-squared score produced from a single train/test split.

## Polynomial regression

In the previous examples, we assumed that the real relationship between the explanatory variables and the response variable is linear. This assumption is rarely true. Let's discuss polynomial regression, that adds terms with degrees greater than one to the model.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# function to approximate by polynomial interpolation
def f(x): return x * np.sin(x)

# generate points used to plot
x_plot = np.linspace(0, 10, 100)

# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)

# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

plt.figure(figsize=(20,10))
plt.plot(x_plot, f(x_plot), label="ground truth")
plt.scatter(x, y, label="training points")

for degree in [1, 3, 5, 8]:
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(X, y)
y_plot = model.predict(X_plot)
plt.plot(x_plot, y_plot, label="degree %d" % degree)

plt.xlim([-0.25,10.25])
plt.ylim([-15.5,9])
plt.legend(loc='lower left')

plt.show()


Have a look at the chart above and how different polynomial curves try to estimate the "ground truth" line, colored in blue. The linear green line is no where close to what we seek, but as the degree grows, the curves become more and more close to the one covering all the points - colored in purple.

## Regularization

### Introduction

In the previous section, one must have felt suspicious of the purple curve covering all the points. And one should be; as the degree of polynomial approximation grows, so does the risk of falling into overfitting. The most common technique to tackle this is regularization.

Regularization is a process of adding the penalty associated with the coefficient values to the error of the hypothesis. This way, an accurate hypothesis with unlikely coefficients would be penalized while a somewhat less accurate but more conservative hypothesis with low coefficients would not be penalized as much.

Two popular regularization methods are L1 and L2, which minimize a loss function E(X, Y) by instead minimizing E(X, Y) + α‖w‖, where w is the model's weight vector, ‖·‖ is either the L1 norm or the squared L2 norm, and α is a free parameter that needs to be tuned empirically. L1 regularization is often preferred because it produces sparse models and thus performs feature selection within the learning algorithm, but since the L1 norm is not differentiable, it may require changes to learning algorithms. L2 penalization is preferable for data that is not at all sparse, i.e. where you do not expect regression coefficients to show a decaying property, which penalizes large weights. In such cases, incorrectly using an L1 penalty for non-sparse data will give you a large estimation error. When applied in linear regression, the resulting models are termed Lasso or Ridge regression respectively.

### Implementation

Among other regularization methods, scikit-learn implements both Lasso, L1, and Ridge, L2, inside linear_model package. Let's see the plots after applying each method to the previous code example:

from sklearn.linear_model import Lasso, Ridge

...

for degree in [1, 3, 5, 8]:
model = make_pipeline(PolynomialFeatures(degree), Lasso(alpha = 0.1))

...

from sklearn.linear_model import Lasso, Ridge

...

for degree in [1, 3, 5, 8]:
model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha = 0.1))

...


I've arbitrary chosen the regularization parameter, alpha, to be 0.1, however this may be incorrect and usually more thorough analysis is required. Luckily scikit-learn provides us with methods to do so, an already described cross validation technique to find the best fitting alpha parameter for both Lasso and Ridge methods, called LassoCV and RidgeCV.

from sklearn.linear_model import LassoCV

...

for degree in [1, 3, 5, 8]:
poly = PolynomialFeatures(degree)
X_ = poly.fit_transform(X)
X_plot_ = poly.fit_transform(X_plot)

model = LassoCV(cv=20)
model.fit(X_, y)
print "chosen alpha %d" % model.alpha_
y_plot = model.predict(X_plot_)
plt.plot(x_plot, y_plot, label="degree %d" % degree)

...

# chosen alpha 2
# chosen alpha 265
# chosen alpha 16
# chosen alpha 4783


Notice how all curves got smoothed, in respect to previous results using alpha 0.1 and notice how in each iteration different parameter was chosen.

## Conclusion

Now you know how to apply linear regression and different techniques used along with it. In the next article we'll describe logistic regression used in classification problems.

Friday, June 5, 2015

## Introduction

Before we proceed with either kind of machine learning problem, we need to get the data on which we'll operate. We can of course generate data by hand, but this course of action won't get us far as is too tedious and lacks the diversity we may require. The are numerous sources of real data we can use and if none of it satisfies ones needs, there are some popular artificial generators, creating datasets according to preset parameters. scikit-learn provides a plenty of methods to load and fetch popular datasets as well as generate artificial data. All these can be found in sklearn.datasets package.

## Toy Datasets

The scikit-learn embeds some small toy datasets, which provide data scientists a playground to experiment a new algorithm and evaluate the correctness of their code before applying it to a real world sized data. Let's load and render one of the most common datasets - iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

X = iris.data[:, :2]  # only take the first two features.
Y = iris.target

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')


## Real Size Datasets

### Popular datasets

• 20 newsgroups dataset
• Labeled faces in the wild dataset
• Olivetti faces dataset from AT&T
• California housing dataset from StatLib
• Forest cover type dataset
from sklearn.datasets import fetch_20newsgroup

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
list(newsgroups_train.target_names) # ['alt.atheism', 'sci.space']
newsgroups_train.filenames.shape    # (1073,)
newsgroups_train.target.shape       # (1073,)


### mldata.org repository

The sklearn.datasets package is able to directly download data sets from the repository using the function fetch_mldata. For example, to download the MNIST digit recognition database, which contains a total of 70000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to 9:

from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original', data_home=some_path)
mnist.data.shape        # (70000, 784)
mnist.target.shape      # (70000,)
np.unique(mnist.target) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


### Datasets in svmlight/libsvm format

scikit-learn includes utility functions for loading datasets in the svmlight/libsvm format. In this format, each line takes the form

from sklearn.datasets import load_svmlight_file



More often, though, one will have to load datasets from the internet from various repositories. UC Irvine Machine Learning Repository is one of such repositories, which contains several hundreds of datasets donated as far as 80s.

import numpy as np
import urllib
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
dataset = np.loadtxt(raw_data, delimiter=",")  # load the CSV file as a numpy matrix
# separate the data from the target attributes
X = dataset[:,0:7]
y = dataset[:,8]


## Generated Datasets

Sometimes real datasets are not enough. Either one needs data following specific patterns or diversity which cannot be achieved through real datasets. scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity. This includes single and multi label data, regression, classifications, clustering and more. Let's create several datasets for classification problem:

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_gaussian_quantiles

plt.figure(figsize=(8, 8))

plt.subplot(321)
plt.title("One informative feature, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(322)
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(323)
plt.title("Two informative features, two clusters per class", fontsize='small')
X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)

plt.subplot(324)
plt.title("Multi-class, two informative features, one cluster",
fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
n_clusters_per_class=1, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(325)
plt.title("Three blobs", fontsize='small')
X1, Y1 = make_blobs(n_features=2, centers=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(326)
plt.title("Gaussian divided into three quantiles", fontsize='small')
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.show()


## Conclusion

Now that we've discussed how to get our data, we're ready to dive into data analysis.

Tuesday, May 26, 2015

# Python for Data Scientists - scikit-learn

## Introduction

In the previous posts we've covered the basics of data analysis. Now it's gloves off and here come the big guns - machine learning library called scikit-learn.

scikit-learn has become one of the most popular open source machine learning libraries for Python. It provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, clustering and many more. It also provides modules for extracting features, processing data and evaluating models.

## Installation

scikit-learn is dependent upon both NumPy and SciPy, of which we've talked. So make sure to upgrade both to latest version prior to installing the package, which is done, of course, using the python package manager.

pip install scikit-learn


## Conclusion

scikit-learn covers a very broad spectrum of data science fields, each deserving a dedicated discussion. And this is exactly what we're going to do for the next couple of sessions, diving deeper into each sphere of data analysis and discovering how scikit-learn assists us in each field.

This article concludes the python for data scientists series and as of now we have enough knowledge to dive deeper into murky waters of data science.

Monday, May 11, 2015

# Python for Data Scientists - Matplotlib

## Introduction

Sure, with both pandas and SciPy you can perform some superb data analysis. And with the IPython, working sure became much easier. But how about presenting your results? Today we'll talk about Matplotlib - our presentation package.

Making plots and static or interactive visualizations is one of the most important tasks in data analysis. It may be a part of the exploratory process; for example, helping identify outliers, needed data transformations, or coming up with ideas for models.

## Installation

Installation of matplotlib is easy. If don't have it preinstalled as part of your Python distribution, just do it manually using python package manager

pip install matplotlib


## Usage

Since we're already familiar with IPython, I'll be only covering it's usage as this is a preferable way of writing data analysis procedures. In console mode graphs are plotted in a separate newly created window, each time you render a plot. In web mode, it's better to put the graphs inside the document, along with it's code and possible documentation. To achieve this, one must add the following line in the beginning of the code.

%matplotlib inline


## Examples

Let's walk through several graph examples for you to acquire a taste of what Matplotlib is all about.

### Bar Chart

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

N = 5
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars

# render men data bar charts with std candle
menMeans = (20, 35, 30, 35, 27)
menStd =   (2, 3, 4, 1, 2)
rects1 = plt.bar(ind, menMeans, width, color='r', yerr=menStd)

# render women data bar charts with std candle
womenMeans = (25, 32, 34, 20, 25)
womenStd =   (3, 5, 2, 3, 3)
rects2 = plt.bar(ind+width, womenMeans, width, color='y', yerr=womenStd)

plt.legend( (rects1[0], rects2[0]), ('Men', 'Women') )

# label bars
def autolabel(rects):
# attach some text labels
for rect in rects:
height = rect.get_height()
plt.text(rect.get_x()+rect.get_width()/2., 1.05*height, '%d'%int(height),
ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

# add some text for labels, title and axes ticks
ax = plt.gca()
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(ind+width)
ax.set_xticklabels( ('G1', 'G2', 'G3', 'G4', 'G5') )

plt.show()    # show the plot


As you can see to render a bar chart is not that big of a deal and the code with some comments is fairly self-explanatory.

The interesting part is the way we drew standard deviation candles, through yerr parameter. The optional arguments color, edgecolor, linewidth, xerr, and yerr can be either scalars or sequences of length equal to the number of bars.

### Pie Chart

Let's take a look at some more interesting charts, like pie chart with an exploding slice:

# The slices will be ordered and plotted counter-clockwise.
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
explode = (0, 0.1, 0, 0) # only "explode" the 2nd slice (i.e. 'Hogs')

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')

plt.show()


### Sub Plots

Sometimes you need to render several plots in one graph. Matplotlib has a notion of subplot, which does exactly this. To do this use the function subplot, which receives number of rows, number of cols and plot number, which is used to identify the particular subplot that this function is to create within the notional grid. Plot number starts at 1, increments across rows first and has a maximum of rows * cols.

x1 = np.linspace(0.0, 5.0)
x2 = np.linspace(0.0, 2.0)

y1 = np.cos(2 * np.pi * x1) * np.exp(-x1)
y2 = np.cos(2 * np.pi * x2)

plt.subplot(2, 1, 1) # # reference 1st plot
plt.plot(x1, y1, 'ko-')
plt.title('A tale of 2 subplots')
plt.ylabel('Damped oscillation')

plt.subplot(2, 1, 2) # reference 2nd plot
plt.plot(x2, y2, 'r.-')
plt.xlabel('time (s)')
plt.ylabel('Undamped')

plt.show()


Notice how we apply the line appearance. Matplotlib uses variation of different codes to determine the plot's styling. Here we styled our line as solid by applying '-' code, colored it in black using 'k' sign and made sure it was a circle marker using 'o' sign. For the full list of the supported codes, have a look here.

### Toolkits

Toolkits are collections of application-specific functions that extend Matplotlib. Some of them come pre-packed with Matplotlib distribution, the bigger ones come as a stand alone packages. Have a look at the most popular ones here.

from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(1, (4., 4.))
grid = ImageGrid(fig, 111,            # similar to subplot(111)
nrows_ncols = (2, 2), # creates 2x2 grid of axes