Tuesday, February 23, 2016

Overview of Machine Learning Metrics


One of the core tasks in building a machine learning model is evaluating its performance. The usual data science pipeline consists of prototyping a model on historical data, arriving at a satisfactory model and deploying it into production, where it goes through further testing on live data. These stages are usually called offline and online evaluation: the former analyses the prototyped model on historical data, the latter the deployed model on live data. Surprisingly to some, evaluation is really hard, because good measurements are often vague or infeasible. In addition, statistical models generally assume that the distribution of the data stays the same over time. In practice, the distribution changes constantly, sometimes drastically. This is called distribution drift.

One way to detect distribution drift is to keep tracking the model's performance on the validation metric using live data. That is why a data science project cannot simply end once the model is written: the model has to be re-evaluated and tweaked on a regular basis.

Different machine learning tasks require different metrics, and there are various metrics for classification, regression, ranking, clustering, topic modeling, etc.

Classification Metrics

We already know what classification is and have discussed it at length using different algorithms. In the logistic regression article we even introduced some of the performance metrics used for classification: Accuracy, Precision, Recall, F1 and AUC. Let us add a few more metrics to those.

Per-Class Accuracy

A variation of accuracy is the average per-class accuracy, that is, the average of the accuracy for each class. Looking at the confusion matrix from Wikipedia, one can clearly tell that the positive, or cat, class has higher accuracy: 5/(5+4) ≈ 0.56, whereas the dog class's accuracy is 2/(2+3) = 0.4. In our example the average per-class accuracy would be (0.56 + 0.4)/2 = 0.48. Note that in this case, the average per-class accuracy is quite different from the overall accuracy, which is (TN + TP)/(TN + TP + FN + FP) = (5+3)/(2+3+4+5) = 0.57. When the classes are imbalanced, meaning there are many more examples of one class than the other, the accuracy gives a very skewed view, since the class with more observations dominates the metric. In that case, we should look at the per-class accuracy, both the average and the individual per-class numbers.
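As a minimal sketch of the idea (the labels and predictions below are made up for illustration and are not the Wikipedia example), per-class accuracy can be computed by counting correct predictions separately for each true class:

```python
def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's examples predicted correctly}."""
    totals, correct = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == pred)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical imbalanced data: 8 cats and only 2 dogs.
y_true = ["cat"] * 8 + ["dog"] * 2
y_pred = ["cat"] * 8 + ["cat", "dog"]

per_class = per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
average = sum(per_class.values()) / len(per_class)

print(per_class, overall, average)
```

Here the overall accuracy is 0.9, yet the dog class is classified correctly only half of the time, which the per-class average of 0.75 makes visible.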


Log-loss is a "soft" measurement of accuracy that incorporates the idea of probabilistic confidence. If the classifier assigns a probability of 0.51 to class A and therefore labels the observation as class A, then even if this turns out to be a mistake, it is a near miss, because the probability is very close to the decision boundary of 0.5.

Log-loss is the cross entropy between the distribution of the true labels and the predictions. Intuitively speaking, entropy measures the unpredictability of something. Minimizing the cross entropy pushes the predicted probabilities towards the true labels, which in turn tends to improve the accuracy of the classifier.
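The near-miss intuition can be sketched with a small binary log-loss function. This is an illustrative implementation using only the standard library; the function name and the clipping constant `eps` are assumptions made for this example:

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log-loss: mean of -[y*log(p) + (1-y)*log(1-p)]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# The true label is 0 in both cases, so both predictions are wrong
# when thresholded at 0.5, but they are not equally wrong.
near_miss = log_loss([0], [0.51])
confident_miss = log_loss([0], [0.99])

print(near_miss, confident_miss)
```

Both predictions count as one error for plain accuracy, but log-loss penalizes the confident mistake far more heavily (roughly 0.71 versus 4.61).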

Precision-Recall vs ROC Curves

ROC curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. For a reminder of what precision and recall are, have a look at the logistic regression article.

The performance of Algorithms 1 and 2 appears to be comparable in ROC space in the image below (left). In PR space (right), however, we can see that Algorithm 2 has a clear advantage over Algorithm 1. This difference exists because in this domain the number of negative examples greatly exceeds the number of positive examples. Consequently, a large change in the number of false positives can lead to only a small change in the false positive rate used in ROC analysis. Precision, on the other hand, compares false positives to true positives rather than to true negatives, and so captures the effect of the large number of negative examples on the algorithm's performance.
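A small arithmetic sketch shows why the two views diverge. The counts below are hypothetical and chosen only to illustrate a dataset with far more negatives than positives:

```python
def false_positive_rate(fp, negatives):
    """FPR = FP / (FP + TN), i.e. FP divided by all actual negatives."""
    return fp / negatives


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical setting: 10,000 negatives, and both classifiers find 80 true positives.
negatives = 10_000
tp = 80

fpr_a = false_positive_rate(100, negatives)  # classifier A: 100 false positives
fpr_b = false_positive_rate(400, negatives)  # classifier B: 400 false positives
prec_a = precision(tp, 100)
prec_b = precision(tp, 400)

print(fpr_a, fpr_b)    # the false positive rates differ by only 0.03
print(prec_a, prec_b)  # the precisions differ by much more
```

Quadrupling the false positives moves the false positive rate from 0.01 to 0.04, which is barely visible on a ROC curve, while precision drops from about 0.44 to about 0.17.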

Regression Metrics

In the linear regression article, we briefly touched on regression metrics. Let's review them.


The most commonly used metric for regression tasks is root mean square error, or RMSE, also known as root mean square deviation, or RMSD, defined as the square root of the average squared distance between the actual score and the predicted one.


Mean absolute error, or MAE, measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.

Since RMSE averages squared errors, it is sensitive to large outliers. If the regressor performs really badly on a single data point, the average error can become very big. To spot this, MAE and RMSE can be used together to diagnose the variation in the errors in a set of forecasts. The RMSE is always greater than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample.
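The relationship between the two metrics can be sketched with the standard library alone. The data below is invented for illustration: both sets of predictions have the same MAE, but one of them concentrates all of its error in a single point:

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 10, 10, 10]
evenly_off = [12, 8, 12, 8]      # every prediction is off by 2
one_outlier = [10, 10, 10, 18]   # three perfect predictions, one off by 8

print(mae(actual, evenly_off), rmse(actual, evenly_off))
print(mae(actual, one_outlier), rmse(actual, one_outlier))
```

With evenly spread errors RMSE equals MAE (both 2.0); with the single outlier MAE stays at 2.0 while RMSE doubles to 4.0, which is the gap the paragraph above describes.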


One of the problems with both RMSE and MAE is that they are not bounded, and different datasets yield different ranges for both metrics. The coefficient of determination, or R-squared, is a number that indicates how well the data fit a statistical model, sometimes simply a line or a curve. A value of 1 indicates that the regression line fits the data perfectly, while 0 indicates that the model does no better than always predicting the mean. The definition of R-squared uses the total sum of squares, SStot, and the residual sum of squares, SSres: R-squared = 1 - SSres/SStot. The difference between SStot and SSres is the improvement in prediction from the regression model, compared to the mean model.
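As a sketch, R-squared can be computed directly from the two sums of squares. The function name and the sample values are assumptions made for this example:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSres / SStot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)                   # error of the mean model
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # error of our model
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(actual, actual)          # the model reproduces the data exactly
mean_model = r_squared(actual, [2.5] * 4)    # always predicts the mean of the data
decent = r_squared(actual, [1.5, 2.0, 3.0, 3.5])

print(perfect, mean_model, decent)
```

The perfect fit scores 1, the mean model scores 0, and the third set of predictions lands at about 0.9.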


The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. It compares a model with no predictors to the model that you specify. A regression model that contains no predictors is also known as an intercept-only model, and it corresponds to an R-squared of zero. A significant F-test indicates that the observed R-squared is reliable and not a spurious result of oddities in the data set.

While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The overall F-test determines whether this relationship is statistically significant.
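For illustration, the overall F statistic can be written in terms of the same sums of squares used by R-squared. The sketch below assumes a linear model with an intercept, `n` observations and `k` predictors (all names are chosen for this example). It computes only the statistic; turning it into a p-value requires the F distribution with (k, n - k - 1) degrees of freedom, which is left out here:

```python
def f_statistic(ss_tot, ss_res, n, k):
    """Overall F statistic for a regression with k predictors and an intercept.

    F = ((SStot - SSres) / k) / (SSres / (n - k - 1))
    """
    explained_per_predictor = (ss_tot - ss_res) / k
    residual_per_dof = ss_res / (n - k - 1)
    return explained_per_predictor / residual_per_dof


# Hypothetical numbers: 4 observations, 1 predictor, SStot = 5.0, SSres = 0.5.
f_value = f_statistic(ss_tot=5.0, ss_res=0.5, n=4, k=1)
print(f_value)
```

The larger the share of variation explained per predictor relative to the residual variation per degree of freedom, the larger the statistic.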

Ranking Metrics

We haven't discussed the topic of ranking yet, but the problem is closely related to binary classification. In an underlying implementation, the classifier may assign a numeric score to each item instead of a categorical class label, and the ranker may simply order the items by the raw score. Since ranking is "sort of" classification, we use the same metrics applied there: Accuracy, Precision, Recall, F1 and AUC. Besides these, there is one additional ranking metric called Normalized Discounted Cumulative Gain, or NDCG.


Precision and recall treat all retrieved items equally; a relevant item in position k counts just as much as a relevant item in position 1. But this is not usually how people think. When we look at the results from a search engine, the top few answers matter much more than answers that are lower down on the list.

NDCG comes to the rescue by introducing a normalized version of discounted cumulative gain, or DCG, which discounts items that are further down the list. NDCG divides the DCG by its ideal score, so that the normalized score always lies between 0.0 and 1.0, where 1.0 represents the ideal ranking of the entities. This metric is commonly used in information retrieval and to evaluate the performance of web search engine algorithms, the most famous of which is PageRank.
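A minimal sketch of the computation follows. It assumes the common formulation in which the relevance at rank i is divided by log2(i + 1); other variants exist (for example with an exponential gain), and the relevance scores below are invented:

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance of the results in the order the ranker returned them.
best_first = [3, 2, 1]
worst_first = [1, 2, 3]

print(ndcg(best_first), ndcg(worst_first))
```

The same three items score 1.0 when the most relevant one comes first and less than 1.0 when the order is reversed, which is exactly the position sensitivity that precision and recall lack.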


It's easy to write down the formula of a metric, but it's a completely different story to interpret the actual metric measured on real data. Always think about what the data looks like and how it affects the metric. In particular, always be on the lookout for data skew. And never, never rely on a single metric, whether it's a classification, regression or ranking problem.

Monday, February 8, 2016

Natural Language Processing with Python


Natural language processing, or NLP, is the process of analyzing text and extracting insights from it. It is used everywhere, from search engines such as Google or Bing to voice interfaces such as Siri or Cortana. The pipeline usually involves tokenization, replacing and correcting words, part-of-speech tagging, named-entity recognition and classification. In this article we'll describe tokenization, using a full example from a Kaggle notebook. The full code can be found in the GitHub repository.


For the purposes of NLP, we'll be using the NLTK Python library, a leading platform for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, and wrappers for industrial-strength NLP libraries. Installing the package is easy using the Python package manager:

pip install nltk


Let's walk through the Kaggle notebook and make sure we understand what is done there. We'll be using packages we have already covered, like Pandas, Scikit-Learn and Matplotlib. In addition we'll be using Seaborn, a Python visualization library based on Matplotlib, which of course can also be installed using the Python package manager:

pip install seaborn

The notebook analyzes US baby names data between 1880 and 2014 and looks into how the frequency of occurrence of names in the Bible correlates with US baby names. First we'll load the data, located in CSV files, using the Pandas read_csv method. To extract the names from the Bible, we use NLTK, taking advantage of the nltk.tokenize package. Since we need all words starting with a capital letter, we'll construct an appropriate regular expression rule. More about regular expression syntax can be found here.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from nltk.tokenize import RegexpTokenizer
from sklearn.preprocessing import MinMaxScaler

nationalNamesDS = pd.read_csv(nationalNamesURL)
stateNamesDS = pd.read_csv(stateNamesURL)
bibleNamesDS = pd.read_csv(bibleNamesURL)

# retrieve all words starting with a capital letter and at least 3 characters long
tokenizer = RegexpTokenizer("[A-Z][a-z]{2,}")

# load the New Testament
with open(newTestamentURL) as file:
    bibleData = file.read()
newTestamentWordsCount = pd.DataFrame(tokenizer.tokenize(bibleData))

# load the Old Testament
with open(oldTestamentURL) as file:
    bibleData = file.read()
oldTestamentWordsCount = pd.DataFrame(tokenizer.tokenize(bibleData))
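To see what this pattern actually captures, here is a small sketch that applies the same regular expression with the standard `re` module. The assumption is that a RegexpTokenizer built from a pattern without groups returns the same matches as `re.findall`. The sample sentence is made up:

```python
import re
from collections import Counter

# Same pattern as above: a capital letter followed by at least two lowercase letters.
pattern = re.compile(r"[A-Z][a-z]{2,}")

sample = "And Moses spoke to Aaron. Then Moses left Egypt."
tokens = pattern.findall(sample)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))
```

Note that sentence-initial words such as "And" and "Then" are captured as well, which is why the capitalized words are later joined with a list of known Bible names.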

NLP is never used by itself, and usually you'll want to do some pre-processing prior to analyzing the text. Using the Pandas drop and merge methods, we'll remove irrelevant columns and join the capitalized words from the Bible with known names from the Bible:

# remove irrelevant columns
stateNamesDS.drop(['Id', 'Gender'], axis=1, inplace=True)
nationalNamesDS.drop(['Id', 'Gender'], axis=1, inplace=True)

# retrieve unique names count of each testament
bibleNames = pd.Series(bibleNamesDS['Name'].unique())
# filtering out Bible names
newTestamentNamesCount = pd.merge(newTestamentWordsCount,
                                  pd.DataFrame(bibleNames), right_on=0, left_index=True)
# keep the first two columns (.iloc replaces the removed .ix indexer)
newTestamentNamesCount = newTestamentNamesCount.iloc[:, 0:2]
newTestamentNamesCount.columns = ['Name', 'BibleCount']

oldTestamentNamesCount = pd.merge(oldTestamentWordsCount,
                                  pd.DataFrame(bibleNames), right_on=0, left_index=True)
oldTestamentNamesCount = oldTestamentNamesCount.iloc[:, 0:2]
oldTestamentNamesCount.columns = ['Name', 'BibleCount']

Great, now that we have our data, let's plot it with Matplotlib:

# plot top TOP_BIBLE_NAMES old testament names
topOldTestamentNamesCount = oldTestamentNamesCount.sort_values('BibleCount', ascending=False).head(TOP_BIBLE_NAMES)
topOldTestamentNamesCount.plot(kind='bar', x='Name', legend=False, title='Old Testament names count')

Data science is not just about applying ready-made algorithms and plotting the results. Insight into the domain is required to make valuable and meaningful decisions. Otherwise we could just use Amazon Machine Learning. Using this knowledge, we understand that the two most frequent names, 'God' and 'Israel', should be removed. 'God' is not really a name, even though a statistically insignificant number of babies in the US have it. 'Israel' is a name, but it is also a country, which the Old Testament is all about.

oldTestamentNamesCount = oldTestamentNamesCount.drop(oldTestamentNamesCount[(oldTestamentNamesCount.Name == 'God') | (oldTestamentNamesCount.Name == 'Israel')].index)

After the pre-processing stage, the analysis starts. We want to see how the frequencies of occurrence correlate, so we'll compute the Pearson correlation with the Pandas corr method and plot the data using the Seaborn package. Why Seaborn? Matplotlib, despite being a great package, doesn't provide a very easy-to-use interface for drawing a scatter plot with colored categories. So, to make our life easier, we'll use another package which supports exactly that. Have a close look at the code in lines 7-9. Since the scatter plot method requires 2-dimensional data, we have to bring our data into that shape by removing a column and flattening the data using the Pandas drop, unstack and reset_index methods.

# scale the counts, calculate the correlations and plot the states with high correlation
def plotStateCorr(stateNamesCount, title):
    stateNamesCount[['Count','BibleCount']] = MinMaxScaler().fit_transform(stateNamesCount[['Count','BibleCount']])
    stateNamesCount = stateNamesCount.groupby(['Year', 'State']).corr()
    stateNamesCount = stateNamesCount[::2]
    highCorrStateNamesCount = stateNamesCount[stateNamesCount.Count > HIGH_CORR_THRESHOLD]
    highCorrStateNamesCount = highCorrStateNamesCount.drop(['BibleCount'], axis=1)
    highCorrStateNamesCount = highCorrStateNamesCount.unstack()
    highCorrStateNamesCount = highCorrStateNamesCount.reset_index()
    # seaborn 0.9 renamed the FacetGrid `size` parameter to `height`
    fg = sns.FacetGrid(data=highCorrStateNamesCount, hue='State', height=5)
    fg.map(pyplot.scatter, 'Year', 'Count').add_legend().set_axis_labels('Year', 'Correlation coefficient')
    pyplot.title(title)

plotStateCorr(newTestamentStateNamesCount, 'Correlation of New Testament and US state names')
plotStateCorr(oldTestamentStateNamesCount, 'Correlation of Old Testament and US state names')
oldTestamentStateNamesCount = None
newTestamentStateNamesCount = None
stateNamesDS = None
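For readers who want to see what the corr step computes, here is a standard-library sketch of the Pearson correlation coefficient, together with a min-max scaling helper. The function names and the sample counts are assumptions made for this example. It also shows that min-max scaling leaves the coefficient unchanged:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def min_max(xs):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical counts for five names: babies per year vs. mentions in the text.
baby_counts = [120, 80, 45, 30, 10]
bible_counts = [300, 150, 90, 20, 5]

raw = pearson(baby_counts, bible_counts)
scaled = pearson(min_max(baby_counts), min_max(bible_counts))

print(raw, scaled)
```

Both values agree up to floating-point error, since Pearson correlation is invariant to shifting and positive rescaling of either variable.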

Similar pre-processing stages are applied on the national scale, without any particularly interesting differences, so we'll end our discussion at this point. You can of course follow the Kaggle notebook code and explanation to the end.


NLP, with the assistance of the NLTK library, provides us with tools that open up a huge spectrum of possibilities previously available only to professional linguists. In this article we've taken a glimpse at what NLTK does by using its tokenization tools. In the next articles we'll cover other aspects of NLP.