Tuesday, August 11, 2015

# Hyperparameter optimization with Python

## Introduction

In the previous articles we introduced several linear techniques, where as you have probably noticed, we provided the algorithms with several parameters. The dependence of machine learning algorithm upon learning parameters is a common case though and one has to check the performance of various parameters to achieve the best results. The task of course is no trifle and is called hyperparameter optimization or model selection. It is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set.

## Implementation

### Grid Search

The traditional way of performing hyperparameter optimization is a grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. Scikit-learn provides us with a class GridSearchCV implementing the technique. Let's try to find the best learning rate, regularization technique and batch size for stochastic gradient descent from the previous article.

import numpy as np
from time import time
from operator import itemgetter
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier

# get some data
X, y = digits.data, digits.target

# build a classifier
clf = SGDClassifier()

# Utility function to report best scores
def report(grid_scores, n_top=3):
top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
for i, score in enumerate(top_scores):
print("Model with rank: {0}".format(i + 1))
print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
score.mean_validation_score,
np.std(score.cv_validation_scores)))
print("Parameters: {0}\n".format(score.parameters))

# use a full grid over all parameters
param_grid = {"n_iter": [1, 5, 10],
"alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
"penalty": ["none", "l1", "l2"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)

# GridSearchCV took 4.85 seconds for 63 candidate parameter settings.
# Model with rank: 1
# Mean validation score: 0.929 (std: 0.020)
# Parameters: {'penalty': 'l1', 'alpha': 0.0001, 'n_iter': 5}

# Model with rank: 2
# Mean validation score: 0.924 (std: 0.016)
# Parameters: {'penalty': 'l2', 'alpha': 1, 'n_iter': 10}

# Model with rank: 3
# Mean validation score: 0.922 (std: 0.018)
# Parameters: {'penalty': 'none', 'alpha': 0.0001, 'n_iter': 5}


The output of the procedure is a list, grid_scores_, consisting of cross evaluation score of all provided parameters. To get the best parameters, we order the list in a descending order by score mean resulting the best parameters for the provided dataset using Lasso regularization, learning rate of 0.0001 in batches of 5. For more details about regularization types, have a look a broader explanation in linear regression article.

### Randomized Grid Search

Since grid searching is an exhaustive and therefore potentially expensive method, there is a random variation of the technique and is implemented by RandomizedSearchCV class in scikit-learn. It doesn't try out all parameter values, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter parameter.

from sklearn.grid_search import RandomizedSearchCV
from scipy.stats import randint, norm

...

# specify parameters and distributions to sample from
param_dist = {"n_iter": randint(1, 11),
"alpha": uniform(scale=0.01),
"penalty": ["none", "l1", "l2"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
n_iter=n_iter_search)

...

# RandomizedSearchCV took 1.18 seconds for 20 candidates parameter settings.
# Model with rank: 1
# Mean validation score: 0.920 (std: 0.021)
# Parameters: {'penalty': 'l1', 'alpha': 0.0006507721384509924, 'n_iter': 7}

# Model with rank: 2
# Mean validation score: 0.919 (std: 0.016)
# Parameters: {'penalty': 'none', 'alpha': 0.0008117236028203323, 'n_iter': 9}

# Model with rank: 3
# Mean validation score: 0.918 (std: 0.018)
# Parameters: {'penalty': 'l1', 'alpha': 0.0012165253638165819, 'n_iter': 5}


Here, instead of providing the procedure with list of values, we gave a desired distribution from which the values would be drawn. As we set the n_iter to be 20, the procedure evaluated 20 random variation of the parameters and following the previous logic, the best appeared to be using Lasso regularization, learning rate of 0.00065 in batches of 7.

## Conclusion

The randomized search and the grid search explore exactly the same space of parameters. The result in parameter settings is quite similar, while the run time for randomized search is drastically lower. Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.