Friday, March 25, 2016

## Introduction

Gradient Descent is one of the most popular techniques for optimizing machine learning algorithms. We've already discussed Gradient Descent in the Gradient descent with Python article, and gave some intuitions toward its behaviour. We've also given an overview of choosing the learning rate hyper-parameter for the algorithm in the hyperparameter optimization article. So by now, you should have a fair understanding of how it works. Today we'll discuss different ways to optimize the performance of the algorithm itself.

We've already covered three variants of Gradient Descent in the Gradient Descent with Python article: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. What we haven't discussed are the problems that arise when using these techniques. Choosing a proper learning rate is difficult. A learning rate that is too small leads to tremendously slow convergence, while a very large learning rate can cause the loss function to fluctuate around the minimum or even diverge. Additionally, in all these variants of Gradient Descent the same learning rate applies to all parameter updates. However, when the data is sparse, with features having very different frequencies, it would be better to perform larger updates for rarely occurring features.

### Momentum

Gradient Descent struggles to navigate ravines: areas where the surface curves much more steeply in one dimension than in another. Once it falls into a ravine, Gradient Descent oscillates across the slopes of the ravine without making much progress towards the local optimum. The Momentum technique accelerates Gradient Descent in the relevant direction and dampens these oscillations. In the illustrations below, the left one is vanilla Gradient Descent and the right one is Gradient Descent with Momentum.

When the Momentum technique is applied, a fraction of the update vector of the past time step is added to the current update vector:

$v_t = \gamma v_{t-1} - \eta \nabla_\theta J( \theta_{t-1})$
$\theta_t = \theta_{t-1} + v_t$

The momentum parameter $\gamma$ is usually set to 0.9.

The idea behind momentum is that of a ball rolling down a hill: it accumulates speed as it rolls, until it reaches terminal velocity due to air resistance, which corresponds to our parameter $\gamma$. Similarly, momentum increases the updates for dimensions whose gradients point in the same direction and reduces the updates for dimensions whose gradients change direction, gaining faster convergence while reducing oscillation.
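The update rule can be sketched in a few lines of NumPy. The toy "ravine" function, learning rate and step count below are illustrative choices of mine, not from the article:

```python
import numpy as np

def momentum_descent(grad, theta0, eta=0.01, gamma=0.9, steps=500):
    """Gradient descent with momentum:
    v_t = gamma * v_{t-1} - eta * grad(theta_{t-1}); theta_t = theta_{t-1} + v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v - eta * grad(theta)  # decaying accumulation of past gradients
        theta = theta + v                  # apply the combined update
    return theta

# A toy "ravine": J(x, y) = x^2 + 10*y^2, much steeper along y than along x
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
theta = momentum_descent(grad, [1.0, 1.0])
print(theta)
```

The oscillations across the steep y-direction cancel out in the velocity term, while progress along the shallow x-direction accumulates.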

### Nesterov

The formula for Nesterov accelerated gradient is as follows, with the momentum parameter $\gamma$ set to 0.9:

$v_t = \gamma v_{t-1} - \eta \nabla_\theta J( \theta_{t-1} + \gamma v_{t-1} )$
$\theta_t = \theta_{t-1} + v_t$

It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.
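The only change relative to plain momentum is that the gradient is evaluated at the look-ahead point $\theta + \gamma v$. A minimal NumPy sketch, again on an illustrative quadratic of my choosing:

```python
import numpy as np

def nesterov_descent(grad, theta0, eta=0.01, gamma=0.9, steps=500):
    """Nesterov accelerated gradient descent."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        # the gradient is taken at the look-ahead position theta + gamma * v,
        # i.e. roughly where the momentum term is about to carry us anyway
        v = gamma * v - eta * grad(theta + gamma * v)
        theta = theta + v
    return theta

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # J(x, y) = x^2 + 10*y^2
theta = nesterov_descent(grad, [1.0, 1.0])
print(theta)
```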

### Adagrad

All the approaches we've discussed so far manipulate the learning rate globally and equally for all parameters. Adagrad is well suited for dealing with sparse data: it adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. Adagrad uses a different learning rate for every parameter $\theta$ at each time step, rather than one update for all parameters at once, and is given by:

$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot \nabla_\theta J(\theta)$

where $\odot$ denotes element-wise multiplication, $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of 1e−8), and $G_{t}$ is a diagonal matrix whose entries are the sums of the squares of the past gradients, $\nabla_\theta J( \theta)^2$.

One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
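Here is a hedged NumPy sketch of the Adagrad update; the test function and step count are illustrative, and only the learning rate (0.01 scaled up for this toy problem) reflects the defaults mentioned above:

```python
import numpy as np

def adagrad(grad, theta0, eta=0.1, eps=1e-8, steps=1000):
    """Adagrad: per-parameter learning rates from accumulated squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                 # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        G += g * g                           # parameters with large/frequent gradients grow G...
        theta -= eta / np.sqrt(G + eps) * g  # ...and therefore receive smaller effective steps
    return theta

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # J(x, y) = x^2 + 10*y^2
theta = adagrad(grad, [1.0, 1.0])
print(theta)
```

Note how the division by $\sqrt{G}$ makes each coordinate's step size independent of that coordinate's gradient scale, which is exactly why rare (small-$G$) features get relatively larger updates.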

### Adadelta

Adadelta is an improvement over Adagrad which reduces its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$. With Adadelta, we do not even need to set a default learning rate.
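A sketch of the Adadelta idea: the window of past gradients is kept implicitly as an exponentially decaying average, and the ratio of two running RMS values replaces the learning rate entirely. The decay rate $\rho$, $\epsilon$ and the toy function below are illustrative choices of mine:

```python
import numpy as np

def adadelta(grad, theta0, rho=0.95, eps=1e-6, steps=2000):
    """Adadelta: no learning rate; step scale adapts from the RMS of past updates."""
    theta = np.asarray(theta0, dtype=float)
    Eg2 = np.zeros_like(theta)    # decaying average of squared gradients
    Edx2 = np.zeros_like(theta)   # decaying average of squared parameter updates
    for _ in range(steps):
        g = grad(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        # RMS of past updates over RMS of past gradients plays the learning-rate role
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1 - rho) * dx * dx
        theta = theta + dx
    return theta

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # J(x, y) = x^2 + 10*y^2
theta = adadelta(grad, [1.0, 1.0])
print(theta)
```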

### RMSprop

$E[\nabla_\theta J( \theta)^2]_t = \gamma E[\nabla_\theta J( \theta)^2]_{t-1} + (1-\gamma) \nabla_\theta J( \theta)^2_t$
$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{E[\nabla_\theta J( \theta)^2]_t + \epsilon}} \nabla_\theta J( \theta)_{t}$

where $E$ is a running average. RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients. The decay rate $\gamma$ is usually set to 0.9, while a good default value for the learning rate is 0.001.
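The two formulas translate directly into a NumPy sketch. The toy quadratic is illustrative, and the learning rate is bumped above the 0.001 default so the example converges in few steps:

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.01, gamma=0.9, eps=1e-8, steps=2000):
    """RMSprop: divide the step by a decaying RMS of past gradients."""
    theta = np.asarray(theta0, dtype=float)
    Eg2 = np.zeros_like(theta)  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g * g
        theta -= eta / np.sqrt(Eg2 + eps) * g  # per-parameter scaled step
    return theta

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # J(x, y) = x^2 + 10*y^2
theta = rmsprop(grad, [1.0, 1.0])
print(theta)
```

Unlike Adagrad, the denominator here can shrink again when recent gradients are small, so the effective learning rate does not decay monotonically to zero.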

The following two animations provide some intuitions towards the optimization behaviour of the presented optimization algorithms on loss surface contours and saddle point:

It's clearly seen how standard SGD lags behind everyone else.

## Conclusions

There is no single right answer for choosing the optimization algorithm. SGD still holds the crown, while the adaptive methods decrease the learning rate dramatically, sometimes sacrificing performance; in fact, many papers use vanilla SGD without momentum. However, if you care about fast convergence, you should choose one of the adaptive learning rate methods.

Wednesday, March 9, 2016

# Tweets Analysis with Python and NLP

## Introduction

You should already be familiar with the concepts of NLP from our previous post, so today we'll look at a more practical case: analysing tweets and classifying them into marketing and non-marketing tweets. We won't get into the details of tweet retrieval; this can be done with various packages, Tweepy being the most popular one.

## Baseline

For the purpose of this discussion we already have two sets of tweets, separated into files and uploaded to a GitHub folder. First we download the datasets, add a target column set to 1 for marketing tweets, and unite the datasets. Then we'll check the baseline classification results, without any pre-processing. We do this so that later we can tell whether our changes improve the metrics. We'll be using Random Forest for classification, since it doesn't expect linear features or even features that interact linearly, and it handles high-dimensional spaces as well as large numbers of training examples very well. Plus, it doesn't require a lot of configuration. Have a look at the Random Forest and classifier boosting articles for more details.

# -*- coding: utf-8 -*-
import re
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from sklearn.cross_validation import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score

prefix = 'https://github.com/aie0/data/blob/master/tweets/'
goodMarketingTweetsURL = prefix + 'good_marketing_tweets.txt'  # filename assumed
goodNonMarketingTweetsURL = prefix + 'good_non_marketing_tweets.txt'

vectorizer = CountVectorizer(max_features=5000)
tokenizer = WordPunctTokenizer()

goodMarketingTweetsDS = pd.read_csv(goodMarketingTweetsURL,
                                    encoding='utf-8',
                                    names=['ID', 'Content'])
goodNonMarketingTweetsDS = pd.read_csv(goodNonMarketingTweetsURL,
                                       encoding='utf-8',
                                       names=['ID', 'Content'])

# marking target: 1 for marketing tweets, 0 for non-marketing ones
goodMarketingTweetsDS['isMarket'] = pd.Series(np.ones(len(goodMarketingTweetsDS)),
                                              index=goodMarketingTweetsDS.index)
goodNonMarketingTweetsDS['isMarket'] = pd.Series(np.zeros(len(goodNonMarketingTweetsDS)),
                                                 index=goodNonMarketingTweetsDS.index)

X = pd.concat([goodMarketingTweetsDS, goodNonMarketingTweetsDS], ignore_index=True)
y = X['isMarket']
X = X['Content']

def test_RF(transformator, options={}):
    tweets = transformator(X, options)
    features = vectorizer.fit_transform(tweets).toarray()

    y_pred = cross_val_predict(RandomForestClassifier(verbose=3,
                                                      n_estimators=100,
                                                      max_depth=10,
                                                      n_jobs=-1), features, y)
    acc = accuracy_score(y, y_pred)
    roc = roc_auc_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    return acc, roc, f1

# 1. baseline
def transform_tweets1(tweets, options):
    return tweets

print(test_RF(transform_tweets1))
# (0.83827723901882489, 0.83818275302681977, 0.8342105263157894)


Please notice the header of the file, `# -*- coding: utf-8 -*-`. Since tweets contain non-ASCII characters, according to PEP 263 we should mark the file as such.

Our pipeline consists of 3 phases: pre-processing the tweets (currently doing nothing), vectorizing the tweets and training the classifier, using cross-validation to avoid overfitting. For details about cross-validation, read the appropriate paragraph in the linear regression article.

We need vectorization since the classifier cannot work with raw words and requires numeric vectors. To achieve this, we use scikit-learn's CountVectorizer class, which implements the bag-of-words technique.
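As a quick illustration of what CountVectorizer produces (the toy sentences are mine, not from the datasets):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['win a free prize today', 'free coffee with a friend']
vec = CountVectorizer()
bag = vec.fit_transform(docs)   # sparse document-term count matrix

print(sorted(vec.vocabulary_))  # learned vocabulary (single-letter tokens like 'a' are dropped)
print(bag.toarray())            # one row of word counts per document
```

Each tweet becomes a fixed-length vector of word counts over the learned vocabulary, which is exactly the input shape the Random Forest expects.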

Using no pre-processing we achieve 0.83 accuracy. Remember to check F1 and ROC metrics as well to spot skewed datasets, for more details see machine learning metrics article.

## Pre-processing

Let's try several things to see what effect our changes have on the metrics.

def transform_tweets2(tweets, options):
    results = []
    length = len(tweets)
    i = 0
    for tweet in tweets:
        if i % 100 == 0:
            print("%d of %d\n" % (i, length))
        i += 1
        s = tweet.lower()

        if 'markEmoji' in options:
            try:
                s.decode('ascii')
            except (UnicodeDecodeError, UnicodeEncodeError):
                # non-ASCII characters present - replace each with a marker token
                new_str = ''
                for l in s:
                    new_str += (" VGEMOJINAME " if ord(l) > 128 else l)
                s = new_str

        if 'patterns' in options:
            for (pattern, repl) in options['patterns']:
                s = re.sub(pattern, repl, s)

        words = tokenizer.tokenize(s)
        if 'remove_stop_words' in options:
            stops = set(stopwords.words("english"))
            result = " ".join([w for w in words if w not in stops])
        else:
            result = " ".join(words)
        results.append(result)
    return results


## Removing stop words

We've all been taught that removing stop words should be the first step of any NLP pipeline, so it's only natural that we start by doing so. The performance, however, not only hasn't improved, but actually shows a significant decline. Remember that we should always take the total number of instances into account when interpreting performance: a 0.834 - 0.822 = 0.012 decrease over 7000 instances is about 85 cases, which is a lot.

options = {
    'remove_stop_words': True
}
print(test_RF(transform_tweets2, options))
#(0.82943525385054195, 0.8290763420757612, 0.82242424242424241)


Since removing stop words didn't help, let's try something else. All links in Twitter are encoded; they don't provide any additional information and may only worsen performance. Let's replace all links with the hardcoded string VGLINKNAME. The performance increases by nearly 1 percent - a good start!

replacement_patterns = [
    (r"http:\/\/t.co\/[a-zA-Z0-9]+", " VGLINKNAME ") # link
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
#(0.84811751283513981, 0.84782082009277737, 0.85790527018012008)


## Marking money

Marketing content usually contains monetary strings like "Win 100$". Let's try to identify them and mark them with VGMONEYNAME. And the result: another percent up.

replacement_patterns = [
    (r"http:\/\/t.co\/[a-zA-Z0-9]+", " VGLINKNAME "), # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ') # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
# (0.85667427267541363, 0.85637385309532066, 0.86601786428476202)


## Marking Emojis

How about emotions? Do non-marketing tweets contain more emotion through the use of emoji? Let's try to create a feature around this idea by marking all non-ASCII characters as VGEMOJINAME and checking the results. Again, the performance increases, by around half a percent.

replacement_patterns = [
    (r"http:\/\/t.co\/[a-zA-Z0-9]+", " VGLINKNAME "), # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ') # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns],
    'markEmoji': True
}
print(test_RF(transform_tweets2, options))
#(0.85853850541928122, 0.85730503637390198, 0.87023249526899159)


## Conclusions

We could try many more ways to improve the performance; some will work, others won't. The main point you should take from this article is to always rely on the data, never on what people say. Removing stop words may be a good idea in some domains and worsen the performance in others. Validate every assumption and play with the data as much as possible. Good luck!