Introduction
Before we proceed with either kind of machine learning problem, we need to get the data on which we'll operate. We can of course generate data by hand, but this course of action won't get us far, as it is too tedious and lacks the diversity we may require. There are numerous sources of real data we can use, and if none of them satisfies our needs, there are popular artificial generators that create datasets according to preset parameters. scikit-learn provides plenty of methods to load and fetch popular datasets as well as to generate artificial data. All of these can be found in the sklearn.datasets package.
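To get oriented, here is a quick way to see which helpers the package exposes. Roughly speaking, load_* functions return the small toy datasets bundled with scikit-learn, fetch_* functions download and cache larger real-world datasets, and make_* functions generate artificial data:

from sklearn import datasets

# the helpers in sklearn.datasets follow a simple naming convention:
#   load_*  - small toy datasets shipped with scikit-learn
#   fetch_* - larger real-world datasets, downloaded and cached locally
#   make_*  - artificial data generators
print([name for name in dir(datasets)
       if name.startswith(('load_', 'fetch_', 'make_'))])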
Toy Datasets
scikit-learn ships with a few small toy datasets, which give data scientists a playground to experiment with a new algorithm and check the correctness of their code before applying it to real-world-sized data. Let's load and plot one of the most common ones, the iris dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # take only the first two features
Y = iris.target

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()
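Beyond the raw arrays, every toy loader returns a Bunch object whose attributes describe the data. A short look at what load_iris actually gives us:

from sklearn import datasets

iris = datasets.load_iris()

# a Bunch behaves like a dictionary with attribute access
print(iris.data.shape)      # (150, 4) - one row per flower, one column per feature
print(iris.target.shape)    # (150,)   - the class label of each flower
print(iris.feature_names)   # names of the four measured features
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']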
Real-Sized Datasets
Popular datasets
scikit-learn provides loaders that automatically download, cache, and parse several popular real-sized datasets:
- 20 newsgroups dataset
- Labeled faces in the wild dataset
- Olivetti faces dataset from AT&T
- California housing dataset from StatLib
- Forest cover type dataset
from sklearn.datasets import fetch_20newsgroups

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

list(newsgroups_train.target_names)  # ['alt.atheism', 'sci.space']
newsgroups_train.filenames.shape     # (1073,)
newsgroups_train.target.shape        # (1073,)
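The other loaders follow the same fetch-and-cache pattern. As one more sketch, pulling the California housing data (the first call downloads the files; subsequent calls read them from the local cache):

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
housing.data.shape    # (20640, 8) - eight numeric features per district
housing.target.shape  # (20640,)   - median house value of each district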
mldata.org repository
The sklearn.datasets package can download datasets directly from the mldata.org repository using the fetch_mldata function. For example, to download the MNIST digit recognition database, which contains a total of 70000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to 9:
import numpy as np
from sklearn.datasets import fetch_mldata

# data_home points to the directory where the download is cached (optional)
mnist = fetch_mldata('MNIST original', data_home=some_path)

mnist.data.shape         # (70000, 784)
mnist.target.shape       # (70000,)
np.unique(mnist.target)  # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Datasets in svmlight/libsvm format
scikit-learn includes utility functions for loading datasets in the svmlight/libsvm format. In this format, each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ..., which makes it especially well suited to sparse data:
from sklearn.datasets import load_svmlight_file

X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
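The format can be produced as well as consumed. A minimal round-trip sketch (the file name tiny_dataset.txt is just an illustration) that writes a small matrix with dump_svmlight_file and reads it back:

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# a tiny dense matrix and labels, round-tripped through the svmlight format
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
y = np.array([1, -1])

dump_svmlight_file(X, y, "tiny_dataset.txt")
X_loaded, y_loaded = load_svmlight_file("tiny_dataset.txt")

print(X_loaded.toarray())  # load_svmlight_file returns a scipy sparse matrix
print(y_loaded)            # [ 1. -1.]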
Loading datasets from the web
More often, though, one will have to load datasets from various repositories on the internet. The UC Irvine Machine Learning Repository is one such repository; it contains several hundred datasets donated as far back as the 1980s.
import numpy as np
from urllib.request import urlopen

# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
raw_data = urlopen(url)                        # download the file
dataset = np.loadtxt(raw_data, delimiter=",")  # load the CSV file as a numpy matrix

# separate the data (first eight columns) from the target attribute (last column)
X = dataset[:, 0:8]
y = dataset[:, 8]
Generated Datasets
Sometimes real datasets are not enough: one may need data that follows specific patterns, or a level of diversity that real datasets cannot provide. scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity. This covers single- and multi-label data, regression, classification, clustering, and more. Let's create several datasets for a classification problem:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_gaussian_quantiles

plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)

plt.subplot(321)
plt.title("One informative feature, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
                             n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(322)
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(323)
plt.title("Two informative features, two clusters per class", fontsize='small')
X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)

plt.subplot(324)
plt.title("Multi-class, two informative features, one cluster", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(325)
plt.title("Three blobs", fontsize='small')
X1, Y1 = make_blobs(n_features=2, centers=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.subplot(326)
plt.title("Gaussian divided into three quantiles", fontsize='small')
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)

plt.show()
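The generators are not limited to classification. As one more illustration, a minimal sketch of a one-dimensional regression problem built with make_regression (the sample count and noise level here are arbitrary):

import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# 100 samples with a single feature and Gaussian noise added to the targets
X, y = make_regression(n_samples=100, n_features=1, noise=10.0)

plt.scatter(X[:, 0], y, marker='o')
plt.xlabel('feature')
plt.ylabel('target')
plt.show()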
Conclusion
Now that we've discussed how to get our data, we're ready to dive into data analysis.