Thursday, November 19, 2015

K-means clustering with Python


K-means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. The procedure partitions a given data set into a fixed number K of clusters. The main idea is to define K centroids, one for each cluster. The next step is to take each point of the data set and associate it with the nearest centroid. Once every point is assigned, we recalculate the K centroids as the means of the clusters resulting from the previous step. With these K new centroids, each point is again bound to its nearest centroid. As this loop repeats, the K centroids shift their location step by step until they stop moving.
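To make the loop concrete, here is a minimal pure-Python sketch of the assign-and-update cycle described above. It is not scikit-learn's implementation: the helper names (`squared_distance`, `mean_point`, `kmeans`), the random initialization from the data points, and the handling of empty clusters are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Toy K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct points picked at random from the data.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: bind each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # no centroid moved, so the loop has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(sorted(centroids))
print(labels)
```

On this toy data the first three points should end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization.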


Scikit-learn provides a full implementation of the K-means algorithm through the KMeans class. Let's have a look at several interesting situations that might occur during data clustering:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
# make_blobs generates 3 isotropic Gaussian blobs by default
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters (2 requested, 3 blobs in the data)
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropically distributed data
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropically Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_filtered)

plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()

At first we wrongly assume that there are 2 clusters in the data, although it was generated with 3 blobs, and make K-means find them. Next we transform the data anisotropically. Since K-means uses Euclidean distance to associate points with clusters, it does not work well with non-globular clusters. The last two datasets introduce blobs of different variance and size. With unequal variance, points on the fringe of the widest blob tend to be assigned to its tighter neighbors, while the unevenly sized blobs are still clustered in an intuitive way.


K-means gives us an easy-to-use clustering algorithm. It's fast, easy to follow, and widely used in various fields, including vector quantization.

Sunday, November 8, 2015

Naïve Bayes with Python


The Naive Bayes algorithm is based on conditional probabilities. It uses Bayes' theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. Bayes' theorem finds the probability of an event occurring given that another event has already occurred. If B represents the dependent event and A represents the prior event, Bayes' theorem can be stated as follows: P(B | A) = P(A and B) / P(A). To calculate the probability of B given A, the algorithm counts the number of cases where A and B occur together and divides it by the number of cases where A occurs, with or without B.
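The counting recipe can be checked by hand on a few made-up cases. The records below are hypothetical, not taken from any real data set. The sketch estimates the probability of B given A by direct counting, then recomputes it through the more familiar form of Bayes' theorem, P(A given B) * P(B) / P(A), to show that both routes agree.

```python
from fractions import Fraction

# Hypothetical historical records: (a_occurred, b_occurred) for each case.
records = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (True, False),
]

n = len(records)
count_a = sum(1 for a, _ in records if a)
count_b = sum(1 for _, b in records if b)
count_a_and_b = sum(1 for a, b in records if a and b)

# Direct estimate by counting: P(B | A) = count(A and B) / count(A)
p_b_given_a = Fraction(count_a_and_b, count_a)

# Same quantity through Bayes' theorem: P(B | A) = P(A | B) * P(B) / P(A)
p_a = Fraction(count_a, n)
p_b = Fraction(count_b, n)
p_a_given_b = Fraction(count_a_and_b, count_b)
via_bayes = p_a_given_b * p_b / p_a

print(p_b_given_a, via_bayes)  # prints "3/5 3/5"
```

A occurs in 5 of the 8 cases and A and B occur together in 3 of them, so both computations give 3/5.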


Scikit-learn provides three flavors of the Naïve Bayes algorithm: MultinomialNB, which implements naive Bayes for multinomially distributed data; GaussianNB, which implements Gaussian naive Bayes for classification; and BernoulliNB, which implements naive Bayes training and classification for data distributed according to multivariate Bernoulli distributions.
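To give a feel for what the Gaussian flavor computes, here is a rough from-scratch sketch, assuming each feature is independent and normally distributed within a class. It is an illustration only and not scikit-learn's code: the function names, the tiny variance floor, and the toy measurements are all made up for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Assumption: a small constant keeps variances strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the univariate Gaussian density for this feature.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Hypothetical two-feature measurements for two classes.
X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.2), (3.2, 3.8), (2.9, 4.0)]
y = ["small", "small", "small", "large", "large", "large"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.0)))  # prints "small"
print(predict_gaussian_nb(model, (3.1, 4.1)))  # prints "large"
```

Working with log probabilities instead of multiplying raw densities avoids numerical underflow when there are many features.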

Let's take a look at the Naïve Bayes algorithm at work classifying the Iris data. The Iris features are continuous measurements, and many naturally occurring measurements are roughly Gaussian, so we'll use the GaussianNB class:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
plt.rcParams["figure.figsize"] = [12, 8]
# Load data
iris = load_iris()
for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target
    # Shuffle (fixed seed, chosen arbitrarily, so the run is reproducible)
    idx = np.random.RandomState(0).permutation(X.shape[0])
    X = X[idx]
    y = y[idx]
    # Train
    clf = GaussianNB().fit(X, y)
    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.pcolormesh(xx, yy, Z)
    # Plot the training points, one color per class
    for i, color in zip(range(n_classes), plot_colors):
        mask = y == i
        plt.scatter(X[mask, 0], X[mask, 1], c=color,
                    label=iris.target_names[i])
plt.legend(loc="upper left")
plt.show()

Pretty cool, isn't it!


The Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows. You'll need, however, a sufficiently large data set to make reliable estimates of the probability of each class. You can use the Naïve Bayes classification algorithm with a small data set, but precision and recall are likely to suffer. For a short reminder of what those are, have a look at the performance metrics section here.
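As a quick refresher, precision is the share of predicted positives that are truly positive, and recall is the share of true positives that were found. The sketch below computes both for one class. The function name and the label lists are hypothetical and serve only to illustrate the arithmetic.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: report 0.0 when a ratio is undefined (no predictions or no positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels and classifier predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred, positive=1)
print(precision, recall)
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2.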