Introduction
One of the core tasks in building a machine learning model is to evaluate its performance. The usual data science pipeline consists of prototyping a model on historical data, settling on a satisfactory model, and deploying it into production, where it goes through further testing on live data. These stages are usually called offline and online evaluation: the former analyses the prototyped model on historical data, the latter the deployed model on live data. Surprisingly to some, evaluation is really hard, as good measurements are often vague or infeasible. Moreover, statistical models generally assume that the distribution of the data stays the same over time, but in practice the distribution changes constantly, sometimes drastically. This is called distribution drift.
One way to detect distribution drift is to keep tracking the model's performance on the validation metric on live data. That's why a data science project cannot simply end once the model is written: the model has to be re-evaluated and tweaked on a regular basis.
Different machine learning tasks require different metrics, and there are various metrics for classification, regression, ranking, clustering, topic modeling, and so on.
Classification Metrics
We already know what classification is and have had numerous discussions about it using different algorithms. In the logistic regression article we even introduced some of the performance metrics used for classification. Let us add a few more metrics besides the ones described there (Accuracy, Precision, Recall, F1 and AUC).
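As a quick refresher before moving on, here is a minimal sketch of how those five familiar metrics can be computed with scikit-learn; y_true, y_pred and y_score are hypothetical arrays of true labels, hard predictions and predicted probabilities, not data from any real model.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical binary labels, hard predictions and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.6, 0.1])

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # AUC needs scores, not hard labels
```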
Per-Class Accuracy
A variation of accuracy is the average per-class accuracy: the average of the accuracy for each class. Looking at the confusion matrix from Wikipedia, one can clearly tell that the positive, or cat, class has higher accuracy: 5/(5+4) ≈ 0.55, whereas the dog's accuracy is 2/(2+3) = 0.4. In our example the average per-class accuracy would be (0.55 + 0.4)/2 = 0.475. Note that in this case the average per-class accuracy differs from the overall accuracy, which is (TP + TN)/(TP + TN + FP + FN) = (5 + 2)/(5 + 4 + 2 + 3) = 0.5. When the classes are imbalanced, meaning there are many more examples of one class than the other, accuracy gives a very skewed view, since the class with more observations dominates the metric. In that case, we should look at the per-class accuracy, both the average and the individual per-class accuracy numbers.
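A minimal sketch of the calculation above, assuming the same counts as in the example (5 of 9 cats and 2 of 5 dogs classified correctly):

```python
import numpy as np

# Confusion matrix from the example: rows are actual classes (cat, dog),
# columns are predicted classes (cat, dog)
confusion = np.array([[5, 4],   # 5 cats correct, 4 cats misclassified as dogs
                      [3, 2]])  # 3 dogs misclassified as cats, 2 dogs correct

per_class_accuracy = confusion.diagonal() / confusion.sum(axis=1)
print("Per-class accuracy:", per_class_accuracy)                   # [0.56, 0.4]
print("Average per-class accuracy:", per_class_accuracy.mean())    # ~0.48
print("Overall accuracy:", confusion.diagonal().sum() / confusion.sum())  # 0.5
```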
Log-loss
Log-loss is a “soft” measurement of accuracy that incorporates the idea of probabilistic confidence. If the classifier estimates a probability of 0.51 that an observation belongs to class A, and thus assigns the observation to class A, then even if the true class is not A, the classifier is only making a near miss, because the probability is very close to the decision boundary of 0.5.
Log-loss is the cross entropy between the distribution of the true labels and the predictions. Intuitively speaking, entropy measures the unpredictability of something. By minimizing the cross entropy, we maximize the accuracy of the classifier.
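A minimal sketch of binary log-loss, assuming made-up labels and predicted probabilities p for the positive class; scikit-learn's log_loss gives the same result as the cross-entropy formula written out by hand.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.51, 0.7, 0.4])  # predicted probability of class 1

# Cross entropy between the true labels and the predicted probabilities
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual, log_loss(y_true, p))  # both ~0.37
```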
Precision-Recall vs ROC Curves
ROC curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm’s performance. As a reminder of what precision and recall are, have a look at the logistic regression article.
The performance of Algorithms 1 and 2 appears comparable in ROC space in the image below (left); however, in PR space (right) we can see that Algorithm 2 has a clear advantage over Algorithm 1. This difference exists because in this domain the number of negative examples greatly exceeds the number of positive examples. Consequently, a large change in the number of false positives leads to only a small change in the false positive rate used in ROC analysis. Precision, on the other hand, by comparing false positives to true positives rather than to true negatives, captures the effect of the large number of negative examples on the algorithm’s performance.
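A hedged sketch of how both curves can be drawn with scikit-learn and matplotlib for a single model; the heavily imbalanced dataset and the logistic regression here are simulated stand-ins, not the algorithms from the figure.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve

# Simulated, heavily imbalanced data (~95% negatives) and a simple model
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, _ = roc_curve(y, scores)
precision, recall, _ = precision_recall_curve(y, scores)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
ax_roc.plot(fpr, tpr)
ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC space")
ax_pr.plot(recall, precision)
ax_pr.set(xlabel="Recall", ylabel="Precision", title="PR space")
plt.show()
```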
Regression Metrics
In the linear regression article we briefly touched on regression metrics. Let's revisit them.
RMSE
The most commonly used metric for regression tasks is root mean square error, or RMSE, also known as root mean square deviation, or RMSD, defined as the square root of the average squared distance between the actual score and the predicted one.
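A minimal sketch of that definition, assuming numpy arrays of actual and predicted values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the average squared difference between actual and predicted values."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse(np.array([3.0, 5.0, 2.5]), np.array([2.5, 5.0, 4.0])))  # ~0.91
```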
MAE
Mean absolute error, or MAE, measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.
Since RMSE is an average, it is sensitive to large outliers. If the regressor performs really badly on a single data point, the average error could be very big. To spot this, MAE and RMSE can be used together to diagnose the variation in the errors in a set of forecasts. The RMSE will always be greater than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample.
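A small sketch, with made-up numbers, showing how a single outlier pushes RMSE up much more than MAE:

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true  = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
good    = np.array([10.5, 11.5, 11.0, 13.5, 11.5])  # small, uniform errors
outlier = np.array([10.5, 11.5, 11.0, 13.5, 22.0])  # one badly missed point

print(mae(y_true, good),    rmse(y_true, good))     # 0.4 vs ~0.45: close together
print(mae(y_true, outlier), rmse(y_true, outlier))  # 2.3 vs ~4.5: RMSE blows up
```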
R-squared
One of the problems with both RMSE and MAE is that they are not bounded, and different datasets yield different numbers for these metrics. The coefficient of determination, or R-squared, is a number that indicates how well the data fit a statistical model – sometimes simply a line or a curve. A value of 1 indicates that the regression line perfectly fits the data, while 0 indicates that the line does not fit the data at all. The definition of R-squared uses the sum of squares total, or SStot, and the sum of squares residual, or SSres: R-squared equals 1 - SSres/SStot. The difference between SStot and SSres is the improvement in prediction from the regression model, compared to the mean model.
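A minimal sketch of that definition in terms of SStot and SSres, with made-up numbers; scikit-learn's r2_score agrees with the manual calculation.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.3, 2.9, 6.4, 4.6])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares (mean model)
r2 = 1 - ss_res / ss_tot

print(r2, r2_score(y_true, y_pred))  # both ~0.95
```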
F-test
The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. It compares a model with no predictors to the model that you specify. A regression model that contains no predictors is also known as an intercept-only model, and it is equivalent to an R-squared of zero. A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set.
While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The overall F-test determines whether this relationship is statistically significant.
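A hedged sketch of reading both numbers from an ordinary least squares fit with statsmodels; the two-predictor dataset here is simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                              # two predictors
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)   # noisy linear response

model = sm.OLS(y, sm.add_constant(X)).fit()  # intercept plus the two predictors
print(model.rsquared)                # strength of the relationship
print(model.fvalue, model.f_pvalue)  # overall F-test: all slopes zero vs at least one non-zero
```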
Ranking Metrics
We haven't yet discussed the topic of ranking, but the problem is closely related to binary classification. In an underlying implementation, the classifier may assign a numeric score to each item instead of a categorical class label, and the ranker may simply order the items by that raw score. Since ranking is "sort of" classification, we can use the same metrics applied there: Accuracy, Precision, Recall, F1 and AUC. Besides these, there is one additional ranking metric called Normalized Discounted Cumulative Gain, or NDCG.
NDCG
Precision and recall treat all retrieved items equally; a relevant item in position k counts just as much as a relevant item in position 1. But this is not usually how people think. When we look at the results from a search engine, the top few answers matter much more than answers that are lower down on the list.
NDCG comes to the rescue by introducing a normalized version of discounted cumulative gain, or DCG, which discounts items that are further down the list. NDCG divides the DCG by its ideal score, so that the normalized score always lies between 0.0 and 1.0, with 1.0 representing the ideal ranking of the entities. This metric is commonly used in information retrieval and to evaluate the performance of web search engine algorithms, among them the most famous one, PageRank.
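A minimal sketch of DCG and NDCG for a single ranked list of hypothetical relevance grades, assuming the common log2 position discount:

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: items lower in the list are discounted by log2 of their position."""
    positions = np.arange(2, len(relevances) + 2)  # positions 1..n, shifted so log2(position+1) >= 1
    return np.sum(relevances / np.log2(positions))

def ndcg(relevances):
    """DCG divided by the ideal DCG (the same items sorted from most to least relevant)."""
    ideal = dcg(np.sort(relevances)[::-1])
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking = np.array([3, 2, 3, 0, 1, 2])  # relevance grades in the order the ranker returned them
print(ndcg(ranking))                    # ~0.96; 1.0 would mean a perfect ordering
```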
Conclusion
It’s easy to write down the formula for a metric, but it's a completely different story to interpret the actual metric measured on real data. Always think about what the data looks like and how it affects the metric. In particular, always be on the lookout for data skew. And never, ever rely on a single metric, whether it's a classification, regression or ranking problem.