Friday, March 25, 2016

Optimizations of Gradient Descent


Gradient Descent is one of the most popular techniques for optimizing machine learning algorithms. We've already discussed Gradient Descent in the Gradient descent with Python article and gave some intuition about its behaviour. We also gave an overview of choosing the learning rate hyper-parameter for the algorithm in the hyperparameter optimization article. So by now, you should have a fair understanding of how it works. Today we'll discuss different ways to optimize the performance of the algorithm itself.

Gradient Descent Variants

We've already covered three variants of Gradient Descent in the Gradient Descent with Python article: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. What we haven't discussed are the problems that arise when using these techniques. Choosing a proper learning rate is difficult. A learning rate that is too small leads to tremendously slow convergence, while one that is too large can cause the loss function to fluctuate around the minimum or even to diverge. Additionally, in all these variants of Gradient Descent the same learning rate applies to all parameter updates. However, when the data is sparse and features occur with different frequencies, it would be better to perform larger updates for rarely occurring features.
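To make the learning rate problem concrete, here is a minimal sketch on a toy loss f(x) = x**2. The function name, the toy loss and the three learning rates are our own illustrative choices, not taken from the earlier articles.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Run plain gradient descent and return every visited point."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - learning_rate * grad(xs[-1]))
    return xs


def grad(x):
    """Gradient of the toy loss f(x) = x**2, whose minimum is at x = 0."""
    return 2.0 * x


too_small = gradient_descent(grad, x0=1.0, learning_rate=0.001, steps=50)
reasonable = gradient_descent(grad, x0=1.0, learning_rate=0.1, steps=50)
too_large = gradient_descent(grad, x0=1.0, learning_rate=1.1, steps=50)

print(abs(too_small[-1]))   # has barely moved from the starting point
print(abs(reasonable[-1]))  # very close to the minimum
print(abs(too_large[-1]))   # each step overshoots further: divergence
```

With the largest rate every step flips the sign of x and increases its magnitude, which is the fluctuation and divergence described above.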


Gradient Descent struggles to navigate ravines, areas where the surface curves much more steeply in one dimension than in another. Once it has fallen into a ravine, Gradient Descent oscillates across the slopes of the ravine without making much progress towards the local optimum. The Momentum technique accelerates Gradient Descent in the relevant direction and lessens oscillations. In the illustrations below, the left one is vanilla Gradient Descent and the right one is Gradient Descent with Momentum.

When the Momentum technique is applied, a fraction of the update vector of the past time step is added to the current update vector:

The momentum parameter is usually set to 0.9.
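The momentum update can be sketched in a few lines of plain Python. This assumes the common formulation in which the velocity is the decayed previous velocity plus the learning rate times the gradient; the function names, the toy ravine loss f(x, y) = 0.5 * x**2 + 25 * y**2 and the learning rate of 0.03 are illustrative assumptions.

```python
def sgd_step(theta, grad, learning_rate):
    """Vanilla gradient descent step on a list of parameters."""
    return [t - learning_rate * g for t, g in zip(theta, grad(theta))]


def momentum_step(theta, velocity, grad, learning_rate, momentum=0.9):
    """Add a fraction of the previous update vector to the current one."""
    g = grad(theta)
    velocity = [momentum * v + learning_rate * gi for v, gi in zip(velocity, g)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity


def ravine_grad(theta):
    """Gradient of 0.5 * x**2 + 25 * y**2: flat along x, steep along y."""
    x, y = theta
    return [x, 50.0 * y]


plain = [1.0, 1.0]
with_momentum, velocity = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    plain = sgd_step(plain, ravine_grad, learning_rate=0.03)
    with_momentum, velocity = momentum_step(
        with_momentum, velocity, ravine_grad, learning_rate=0.03)

# Compare how far each run still is from the minimum at (0, 0).
print(plain, with_momentum)
```

On this toy surface the vanilla run crawls along the flat x direction, while the momentum run ends up much closer to the minimum in the same number of steps.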

The idea behind momentum is that of a ball accelerating as it rolls down a hill until, if there is air resistance, it reaches its terminal velocity; the momentum parameter plays the role of that resistance. Similarly, the momentum term increases updates for dimensions whose gradients keep pointing in the same direction and reduces updates for dimensions whose gradients change direction, gaining faster convergence while reducing oscillation.


The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. Nesterov accelerated gradient (NAG) looks ahead: it computes the gradient not at our current parameters but at an approximation of their future position. In the following illustration, instead of evaluating the gradient at the current position (red circle), we use the fact that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore evaluate the gradient at this "looked-ahead" position instead.

The formula for Nesterov accelerated gradient, with the momentum parameter set to 0.9, is as follows:

It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.
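A minimal sketch of the look-ahead idea, assuming the usual formulation in which the gradient is evaluated at the current parameters minus the momentum times the previous velocity. The function name and the toy loss f(x) = x**2 are our own illustrative choices.

```python
def nesterov_step(theta, velocity, grad, learning_rate, momentum=0.9):
    """One NAG update: evaluate the gradient at the looked-ahead position."""
    # Where the accumulated momentum alone is about to carry us.
    lookahead = [t - momentum * v for t, v in zip(theta, velocity)]
    g = grad(lookahead)
    velocity = [momentum * v + learning_rate * gi for v, gi in zip(velocity, g)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity


def toy_grad(theta):
    """Gradient of the toy loss f(x) = x**2."""
    return [2.0 * theta[0]]


theta, velocity = [1.0], [0.0]
for _ in range(3):
    theta, velocity = nesterov_step(theta, velocity, toy_grad, learning_rate=0.1)
    print(theta, velocity)
```

The only difference from the standard momentum update is the point at which the gradient is computed.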


All the approaches we've discussed so far manipulate the learning rate globally and equally for all parameters. Adagrad is well suited for dealing with sparse data: it adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. Adagrad uses a different learning rate for every parameter at each step, rather than one update for all parameters at once, and is given by:

where the gradient is scaled using element-wise multiplication, the smoothing term avoids division by zero (it is usually on the order of 1e−8), and the diagonal matrix holds, for each parameter, the sum of the squares of its past gradients.

One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
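Here is a rough sketch of the per-parameter scaling, assuming the standard Adagrad rule of dividing the learning rate by the square root of the accumulated squared gradients. The function name and the synthetic gradients (one "frequent" feature that fires on every step, one "rare" feature that fires on every tenth step) are hypothetical.

```python
import math


def adagrad_step(theta, grad_sq_sum, g, learning_rate=0.01, eps=1e-8):
    """One Adagrad update; g is the gradient vector for this step."""
    grad_sq_sum = [s + gi * gi for s, gi in zip(grad_sq_sum, g)]
    theta = [t - learning_rate / math.sqrt(s + eps) * gi
             for t, s, gi in zip(theta, grad_sq_sum, g)]
    return theta, grad_sq_sum


theta, grad_sq_sum = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    # Parameter 0 gets a gradient every step, parameter 1 only every 10th step.
    g = [1.0, 1.0 if step % 10 == 0 else 0.0]
    theta, grad_sq_sum = adagrad_step(theta, grad_sq_sum, g)

# Effective learning rate each parameter would see on its next update.
effective_rates = [0.01 / math.sqrt(s + 1e-8) for s in grad_sq_sum]
print(effective_rates)
```

The rarely updated parameter keeps a noticeably larger effective learning rate, and both rates can only shrink as more squared gradients accumulate.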


Adadelta is an improvement over Adagrad that reduces its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size. With Adadelta, we do not even need to set a default learning rate.
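As a sketch only: in the formulation from the original Adadelta paper, the fixed-size window is approximated by exponentially decaying averages of the squared gradients and of the squared updates. The function name and the defaults rho=0.95 and eps=1e-6 are assumptions made for illustration.

```python
import math


def adadelta_step(theta, avg_sq_grad, avg_sq_delta, g, rho=0.95, eps=1e-6):
    """One Adadelta update; note that there is no learning rate argument."""
    # Decaying average of squared gradients (stands in for the window).
    avg_sq_grad = [rho * a + (1 - rho) * gi * gi
                   for a, gi in zip(avg_sq_grad, g)]
    # Step size is the ratio of the RMS of past updates to the RMS of gradients.
    delta = [-math.sqrt(d + eps) / math.sqrt(a + eps) * gi
             for d, a, gi in zip(avg_sq_delta, avg_sq_grad, g)]
    # Decaying average of squared updates, used by the next step.
    avg_sq_delta = [rho * d + (1 - rho) * dx * dx
                    for d, dx in zip(avg_sq_delta, delta)]
    theta = [t + dx for t, dx in zip(theta, delta)]
    return theta, avg_sq_grad, avg_sq_delta


theta, avg_sq_grad, avg_sq_delta = [1.0], [0.0], [0.0]
for _ in range(5):
    g = [2.0 * theta[0]]  # gradient of the toy loss f(x) = x**2
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(
        theta, avg_sq_grad, avg_sq_delta, g)
    print(theta)
```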


RMSprop also tries to overcome the diminishing learning rates of Adagrad and works similarly to Adadelta, as follows:

where E is a running average. Like Adadelta, RMSprop divides the learning rate by an exponentially decaying average of squared gradients. The decay (momentum) rate is usually set to 0.9, while a good default value for the learning rate is 0.001.
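A minimal sketch of this update, using the defaults mentioned above (decay rate 0.9, learning rate 0.001). The function name, the eps value and the constant toy gradient are hypothetical.

```python
import math


def rmsprop_step(theta, avg_sq_grad, g, learning_rate=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update; g is the gradient vector for this step."""
    # Exponentially decaying running average of squared gradients.
    avg_sq_grad = [decay * a + (1 - decay) * gi * gi
                   for a, gi in zip(avg_sq_grad, g)]
    theta = [t - learning_rate / math.sqrt(a + eps) * gi
             for t, a, gi in zip(theta, avg_sq_grad, g)]
    return theta, avg_sq_grad


theta, avg_sq_grad = [1.0], [0.0]
for _ in range(200):
    theta, avg_sq_grad = rmsprop_step(theta, avg_sq_grad, [2.0])

# With a constant gradient the running average settles near the squared
# gradient instead of growing without bound as Adagrad's sum does.
print(avg_sq_grad)
```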

The following two animations provide some intuition about the optimization behaviour of the presented algorithms on loss surface contours and at a saddle point:

It's clear how standard SGD lags behind all the others.


There is no single right answer when choosing an optimization algorithm: SGD still holds the crown, while the others attempt to decrease the learning rate dramatically, sometimes sacrificing performance. In fact, many papers use vanilla SGD without momentum. However, if you care about fast convergence, you should choose one of the adaptive learning rate methods.

Wednesday, March 9, 2016

Tweets Analysis with Python and NLP


You should already be familiar with the concepts of NLP from our previous post, so today we'll look at a more practical case: analysing tweets and classifying them into marketing and non-marketing tweets. We won't go into the details of tweet retrieval; this can be done with various packages, Tweepy being the most popular one.


For the purpose of the discussion we already have 2 sets of tweets, separated into files and uploaded to a GitHub folder. First we download the datasets, add a target column set to 1 for marketing tweets, and unite the datasets. Then we'll check the baseline classification results, without any pre-processing. We do this so that later we can tell whether our changes improve the metrics. We'll be using Random Forest for classification, since it doesn't expect linear features or even features that interact linearly, and it handles high-dimensional spaces and large numbers of training examples very well. Plus it doesn't require a lot of configuration. Have a look at the Random Forest and classifier boosting articles for more details.

# -*- coding: utf-8 -*-
import re
import numpy as np
import pandas as pd
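from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
# sklearn.cross_validation was removed from scikit-learn;
# cross_val_predict now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_predict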
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score

prefix = ''
badMarketingTweetsURL = prefix + 'bad_marketing_tweets.txt'
goodMarketingTweetsURL = prefix + 'good_non_marketing_tweets.txt'

vectorizer = CountVectorizer(max_features=5000)
tokenizer = WordPunctTokenizer()

# load ds
badMarketingTweetsDS = pd.read_csv(badMarketingTweetsURL, sep='\t',
                                   names=['ID', 'Content'])
goodMarketingTweetsDS = pd.read_csv(goodMarketingTweetsURL, sep='\t',
                                    names=['ID', 'Content'])

# marking target: 1 for marketing tweets, 0 for non-marketing tweets
badMarketingTweetsDS['isMarket'] = np.ones(len(badMarketingTweetsDS))
goodMarketingTweetsDS['isMarket'] = np.zeros(len(goodMarketingTweetsDS))

# DataFrame.append was removed from pandas; pd.concat unites the datasets
X = pd.concat([badMarketingTweetsDS, goodMarketingTweetsDS])
y = X['isMarket']
X = X['Content']

def test_RF(transformator, options={}):
    tweets = transformator(X, options)
    features = vectorizer.fit_transform(tweets).toarray()

    y_pred = cross_val_predict(RandomForestClassifier(verbose=3,
                                                      n_jobs=-1), features, y)
    acc = accuracy_score(y, y_pred)
    roc = roc_auc_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    return acc, roc, f1

# 1. baseline: no pre-processing at all
def transform_tweets1(tweets, options):
    return tweets

print(test_RF(transform_tweets1))
# (0.83827723901882489, 0.83818275302681977, 0.8342105263157894)

Please notice the header of the file, # -*- coding: utf-8 -*-. Since tweets contain non-ascii characters, according to PEP 263, we should mark the file as such.

Our pipeline consists of 3 phases: pre-processing the tweets (currently doing nothing), vectorizing the tweets, and training the classifier using cross-validation to avoid overfitting. For details about cross-validation, read the appropriate paragraph in the linear regression article.
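To illustrate the idea behind cross-validated predictions, here is a hand-rolled sketch of plain, unshuffled k-fold splitting. The function name is hypothetical, and scikit-learn's own splitters may additionally stratify or shuffle the data, so this is not a description of what cross_val_predict does internally.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(7, 3):
    print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, so each prediction comes from a model that never saw that sample during training.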

We need vectorization because a classifier cannot work with words and requires numeric vectors. To achieve this, we use the scikit-learn CountVectorizer class, which implements the bag of words technique.
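To show what a bag of words looks like, here is a simplified pure-Python sketch. It is not CountVectorizer itself: the function name is hypothetical, and it splits on whitespace only, whereas CountVectorizer has its own tokenization rules and vocabulary limits.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word; missing words count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["win a free phone",
                                    "a phone call from a friend"])
print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded; each tweet becomes a fixed-length row of counts, which is the kind of numeric input the classifier needs.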

Using no pre-processing we achieve 0.83 accuracy. Remember to check the F1 and ROC metrics as well to spot skewed datasets; for more details see the machine learning metrics article.
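Why accuracy alone can mislead on a skewed dataset is easy to see on made-up labels. In this sketch the helper names and the 5-to-95 class split are invented for illustration and have nothing to do with the tweets dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Skewed toy data: 5 positives, 95 negatives, and a "classifier"
# that always predicts the majority class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # high, although no positive is found
print(f1(y_true, y_pred))        # exposes the problem
```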


Let's try several things to see what effect our changes have on the metrics.

def transform_tweets2(tweets, options):
    results = []
    length = len(tweets)
    # build the stop-word set once; it stays empty unless removal is requested
    stops = set(stopwords.words("english")) if 'remove_stop_words' in options else set()
    for i, tweet in enumerate(tweets):
        if i % 100 == 0:  # progress report every 100 tweets
            print("%d of %d\n" % (i, length))
        s = tweet.lower()

        if 'markEmoji' in options:
            # replace every non-ASCII character with a marker token
            s = ''.join(" VGEMOJINAME " if ord(l) > 127 else l for l in s)

        if 'patterns' in options:
            for (pattern, repl) in options['patterns']:
                s = re.sub(pattern, repl, s)

        words = tokenizer.tokenize(s)
        results.append(" ".join(w for w in words if w not in stops))
    return results

Removing stop words

We've all been taught that removing the stop words should be the first step of any NLP pipeline, so it's only natural that we start by doing so. The performance, however, not only hasn't improved but actually shows a significant decline. Remember that we should always take the total number of instances into account when interpreting the performance: a drop of 0.834 - 0.822 = 0.012 in F1 at 7000 instances is on the order of 84 tweets, which is a lot.

options = {
    'remove_stop_words': True
}
print(test_RF(transform_tweets2, options))
#(0.82943525385054195, 0.8290763420757612, 0.82242424242424241)

Marking links

Since removing the stop words didn't help, let's try something else. All links on Twitter are encoded, so they don't provide any additional information and may only worsen the performance. Let's replace all links with the hardcoded string VGLINKNAME. The performance increases by nearly 1 percent - a good start!

replacement_patterns = [
    (r"http:\/\/\/[a-zA-Z0-9]+", " VGLINKNAME ")
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
#(0.84811751283513981, 0.84782082009277737, 0.85790527018012008)

Marking money

Marketing content usually contains monetary strings like "Win 100$". Let's try to identify them and mark them with VGMONEYNAME. And the result - another percent up.

replacement_patterns = [
    (r"http:\/\/\/[a-zA-Z0-9]+", " VGLINKNAME "),  # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ')  # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
# (0.85667427267541363, 0.85637385309532066, 0.86601786428476202)
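To see what the money pattern actually catches, here is a small standalone check. The helper name and the sample strings are invented for illustration; the regular expression is the one from the listing above.

```python
import re

MONEY_PATTERN = re.compile(r'\$\s{0,3}[0-9,]+')


def mark_money(text):
    """Replace dollar amounts written as '$' followed by digits."""
    return MONEY_PATTERN.sub(' VGMONEYNAME ', text)


print(mark_money('win $1,000 today'))
print(mark_money('only $ 50 left'))
print(mark_money('win 100$ today'))
```

Note that the pattern expects the dollar sign before the digits, so a string such as "100$" is left untouched; a second pattern would be needed if that form should be marked as well.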

Marking Emojis

How about emotions? Do non-marketing tweets contain more emotion through the use of Emoji? Let's try to create a feature around this idea by marking all non-ASCII characters as VGEMOJINAME and check the results. Again performance increases, this time by roughly 0.4 percent in F1.

replacement_patterns = [
    (r"http:\/\/\/[a-zA-Z0-9]+", " VGLINKNAME "),  # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ')  # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns],
    'markEmoji': True
}
print(test_RF(transform_tweets2, options))
#(0.85853850541928122, 0.85730503637390198, 0.87023249526899159)
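The marking step can be tried in isolation. This is a minimal sketch with a hypothetical helper name and made-up sample strings, written for Python 3, where each Emoji is a single character.

```python
def mark_non_ascii(text, marker=" VGEMOJINAME "):
    """Replace every character outside the ASCII range with a marker token."""
    return "".join(marker if ord(ch) > 127 else ch for ch in text)


print(mark_non_ascii("great day \U0001F600"))
print(mark_non_ascii("caf\u00e9"))
```

As the second call shows, accented letters are marked too, so on multilingual tweets this feature measures "non-ASCII characters" rather than Emoji specifically.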


We can try many more ways to improve the performance; some will work, others won't. The main point you should take from this article is to always rely on the data, never on what people say. Removing stop words may be a good idea in some domains and worsen the performance in others. Validate every assumption and play with the data as much as possible. Good luck!