Friday, June 5, 2015

Loading datasets using Python


Before we proceed with either kind of machine learning problem, we need to get the data on which we'll operate. We can of course generate data by hand, but this course of action won't get us far as is too tedious and lacks the diversity we may require. The are numerous sources of real data we can use and if none of it satisfies ones needs, there are some popular artificial generators, creating datasets according to preset parameters. scikit-learn provides a plenty of methods to load and fetch popular datasets as well as generate artificial data. All these can be found in sklearn.datasets package.

Toy Datasets

The scikit-learn embeds some small toy datasets, which provide data scientists a playground to experiment a new algorithm and evaluate the correctness of their code before applying it to a real world sized data. Let's load and render one of the most common datasets - iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X =[:, :2]  # only take the first two features.
Y =

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

Real Size Datasets

Popular datasets

scikit-learn provides loaders that will automatically download, cache, parse the metadata files of several popular real size datasets:

  • 20 newsgroups dataset
  • Labeled faces in the wild dataset
  • Olivetti faces dataset from AT&T
  • California housing dataset from StatLib
  • Forest cover type dataset
from sklearn.datasets import fetch_20newsgroup

cats = ['alt.atheism', '']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
list(newsgroups_train.target_names) # ['alt.atheism', '']
newsgroups_train.filenames.shape    # (1073,)       # (1073,) repository

The sklearn.datasets package is able to directly download data sets from the repository using the function fetch_mldata. For example, to download the MNIST digit recognition database, which contains a total of 70000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to 9:

from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original', data_home=some_path)        # (70000, 784)      # (70000,)
np.unique( # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Datasets in svmlight/libsvm format

scikit-learn includes utility functions for loading datasets in the svmlight/libsvm format. In this format, each line takes the form

from sklearn.datasets import load_svmlight_file

X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")

Load datasets from web

More often, though, one will have to load datasets from the internet from various repositories. UC Irvine Machine Learning Repository is one of such repositories, which contains several hundreds of datasets donated as far as 80s.

import numpy as np
import urllib
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = ""
raw_data = urllib.urlopen(url)                 # download the file
dataset = np.loadtxt(raw_data, delimiter=",")  # load the CSV file as a numpy matrix
# separate the data from the target attributes
X = dataset[:,0:7]
y = dataset[:,8]

Generated Datasets

Sometimes real datasets are not enough. Either one needs data following specific patterns or diversity which cannot be achieved through real datasets. scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity. This includes single and multi label data, regression, classifications, clustering and more. Let's create several datasets for classification problem:

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_gaussian_quantiles

plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)

plt.title("One informative feature, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.title("Two informative features, two clusters per class", fontsize='small')
X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)

plt.title("Multi-class, two informative features, one cluster",
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.title("Three blobs", fontsize='small')
X1, Y1 = make_blobs(n_features=2, centers=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.title("Gaussian divided into three quantiles", fontsize='small')
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)


Now that we've discussed how to get our data, we're ready to dive into data analysis.