Wednesday, March 9, 2016

Tweets Analysis with Python and NLP


Introduction

You should already be familiar with the concepts of NLP from our previous post, so today we'll look at a more practical case: analyzing tweets and classifying them into marketing and non-marketing ones. We won't go into the details of tweet retrieval; this can be done with various packages, Tweepy being the most popular.

Baseline

For the purpose of this discussion we already have two sets of tweets, separated into files and uploaded to a GitHub folder. First we download the datasets, add a target column set to 1 for marketing tweets, and unite the datasets. Then we check the baseline classification results, without any pre-processing, so that later we can tell whether our changes improve the metrics. We'll be using Random Forest for classification, since it doesn't expect linear features, or even features that interact linearly, and it handles high-dimensional spaces and large numbers of training examples very well. Plus, it doesn't require a lot of configuration. Have a look at the Random Forest and classifier boosting articles for more details.

# -*- coding: utf-8 -*-
import re
import numpy as np
import pandas as pd
from nltk.tokenize import WordPunctTokenizer
from sklearn.cross_validation import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score

# use the raw content URL so pandas gets the plain text file, not GitHub's HTML page
prefix = 'https://raw.githubusercontent.com/aie0/data/master/tweets/'
badMarketingTweetsURL = prefix + 'bad_marketing_tweets.txt'
goodMarketingTweetsURL = prefix + 'good_non_marketing_tweets.txt'

vectorizer = CountVectorizer(max_features=5000)
tokenizer = WordPunctTokenizer()

# load ds
badMarketingTweetsDS = pd.read_csv(badMarketingTweetsURL, sep='\t',
                                   encoding='utf-8',
                                   names=['ID', 'Content'])
goodMarketingTweetsDS = pd.read_csv(goodMarketingTweetsURL, sep='\t',
                                    encoding='utf-8',
                                    names=['ID', 'Content'])

# marking target
badMarketingTweetsDS['isMarket'] = pd.Series(np.ones(len(badMarketingTweetsDS)),
                                             index=badMarketingTweetsDS.index)
goodMarketingTweetsDS['isMarket'] = pd.Series(np.zeros(len(goodMarketingTweetsDS)),
                                              index=goodMarketingTweetsDS.index)

X = badMarketingTweetsDS.append(goodMarketingTweetsDS)
y = X['isMarket']
X = X['Content']

def test_RF(transformator, options={}):
    tweets = transformator(X, options)
    features = vectorizer.fit_transform(tweets).toarray()

    # out-of-fold predictions for every tweet, so the metrics are
    # computed on data the model hasn't been trained on
    y_pred = cross_val_predict(RandomForestClassifier(verbose=3,
                                                      n_estimators=100,
                                                      max_depth=10,
                                                      n_jobs=-1), features, y)
    acc = accuracy_score(y, y_pred)
    roc = roc_auc_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    return acc, roc, f1


# 1. baseline
def transform_tweets1(tweets, options):
    return tweets

print(test_RF(transform_tweets1))
# (0.83827723901882489, 0.83818275302681977, 0.8342105263157894)

Note the header of the file, # -*- coding: utf-8 -*-. Since tweets contain non-ASCII characters, PEP 263 requires us to declare the file encoding this way.

Our pipeline consists of three phases: pre-processing the tweets (currently a no-op), vectorizing them, and training the classifier using cross-validation to avoid overfitting. For details about cross-validation, read the appropriate paragraph in the linear regression article.
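To see the cross-validation step in isolation, here's a minimal, self-contained sketch, with made-up sentences and labels rather than our tweet data: cross_val_predict returns an out-of-fold prediction for every sample, i.e. each sample is predicted by a model that never saw it during training.

import numpy as np
from sklearn.cross_validation import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

docs = ['win a free prize now', 'lunch with friends was fun',
        'limited offer buy today', 'watching the rain outside'] * 10
labels = np.array([1, 0, 1, 0] * 10)

features = CountVectorizer().fit_transform(docs).toarray()
# by default, classifiers get 3-fold stratified splitting
preds = cross_val_predict(RandomForestClassifier(n_estimators=10),
                          features, labels)
print(accuracy_score(labels, preds))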

We need vectorization since the classifier cannot work with words and requires numeric vectors. To achieve this, we use scikit-learn's CountVectorizer class, which implements the bag-of-words technique.
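To make the bag-of-words idea concrete, here's a tiny illustration with two made-up sentences; each row of the resulting matrix counts how many times each vocabulary word appears in the corresponding sentence:

from sklearn.feature_extraction.text import CountVectorizer

toy = ['win a free prize', 'free coffee with a friend']
cv = CountVectorizer()
bag = cv.fit_transform(toy).toarray()
print(cv.get_feature_names())
# ['coffee', 'free', 'friend', 'prize', 'win', 'with']
print(bag)
# [[0 1 0 1 1 0]
#  [1 1 1 0 0 1]]

Note that single-letter tokens like "a" are dropped by CountVectorizer's default token pattern.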

Using no pre-processing we achieve 0.83 accuracy. Remember to check the F1 and ROC metrics as well to spot skewed datasets; for more details see the machine learning metrics article.
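A quick way to check how skewed the dataset actually is, using the y Series defined above, is to look at the class balance:

# fraction of marketing (1.0) vs non-marketing (0.0) tweets
print(y.value_counts(normalize=True))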

Pre-processing

Let's try several things to see what effect our changes have on the metrics.

from nltk.corpus import stopwords

def transform_tweets2(tweets, options):
    results = []
    length = len(tweets)
    i = 0
    for tweet in tweets:
        if i % 100 == 0:  # progress report every 100 tweets
            print("%d of %d\n" % (i, length))
        i += 1
        s = tweet.lower()

        if 'markEmoji' in options:
            try:
                # works for both Python 2 unicode and Python 3 str
                s.encode('ascii')
            except UnicodeEncodeError:
                # replace every non-ASCII character with a marker token
                new_str = ''
                for l in s:
                    new_str += (" VGEMOJINAME " if ord(l) > 128 else l)
                s = new_str

        if 'patterns' in options:
            for (pattern, repl) in options['patterns']:
                s = re.sub(pattern, repl, s)

        words = tokenizer.tokenize(s)
        if 'remove_stop_words' in options:
            stops = set(stopwords.words("english"))
            result = " ".join([w for w in words if w not in stops])
        else:
            result = " ".join(words)
        results.append(result)
    return results

Removing stop words

We've all been taught that removing stop words should be the first step of any NLP pipeline, so it's only natural that we start by doing so. The performance, however, not only fails to improve but actually declines significantly. Remember that we should always take the total number of instances into account when interpreting performance: a 0.834 - 0.822 = 0.012 decrease over 7000 instances is about 90 cases, which is a lot.

options = {
    'remove_stop_words': True
}
print(test_RF(transform_tweets2, options))
#(0.82943525385054195, 0.8290763420757612, 0.82242424242424241)
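To see what this option actually does, we can run the transformer on a single made-up tweet; note which high-frequency function words disappear:

sample = [u'we are giving you a chance to win the prize of your dreams']
# (the progress print fires once before the result)
print(transform_tweets2(sample, {'remove_stop_words': True})[0])
# giving chance win prize dreams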

Marking links

Since removing stop words didn't help, let's try something else. All links in tweets are wrapped by Twitter's t.co shortener, so they carry no additional information and may only worsen the performance. Let's replace all links with the hardcoded string VGLINKNAME. The performance increases by nearly one percent: a good start!

replacement_patterns = [
    # escape the dot and allow https links as well
    (r"https?://t\.co/[a-zA-Z0-9]+", " VGLINKNAME ")
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
#(0.84811751283513981, 0.84782082009277737, 0.85790527018012008)
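A quick sanity check of the pattern on a made-up tweet (the replacement pads the token with spaces):

pattern, repl = options['patterns'][0]
print(pattern.sub(repl, u'new post is up http://t.co/AbC123xyz'))
# new post is up  VGLINKNAME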

Marking money

Marketing content usually contains monetary strings like "Win $100". Let's try to identify them and mark them with VGMONEYNAME. And the result: another percent up.

replacement_patterns = [
    (r"https?://t\.co/[a-zA-Z0-9]+", " VGLINKNAME "),  # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ')  # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
# (0.85667427267541363, 0.85637385309532066, 0.86601786428476202)
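The money pattern matches a dollar sign optionally followed by up to three spaces and then digits (commas allowed); a quick check on made-up strings:

money = re.compile(r'\$\s{0,3}[0-9,]+')
print(money.sub(' VGMONEYNAME ', u'win $100 today'))
# win  VGMONEYNAME  today
print(money.sub(' VGMONEYNAME ', u'only $ 1,000 left'))
# only  VGMONEYNAME  left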

Marking Emojis

How about emotions? Do non-marketing tweets convey more emotion through the use of emoji? Let's try to create a feature around this idea by marking all non-ASCII characters as VGEMOJINAME and check the results. Again, performance increases by around half a percent.

replacement_patterns = [
    (r"https?://t\.co/[a-zA-Z0-9]+", " VGLINKNAME "),  # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ')  # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns],
    'markEmoji': True
}
print(test_RF(transform_tweets2, options))
#(0.85853850541928122, 0.85730503637390198, 0.87023249526899159)
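The markEmoji branch simply replaces every character with a code point above 128, so anything non-ASCII, emoji or not, becomes the token; a quick check on a made-up tweet:

sample = [u'free hugs \u2764']
print(transform_tweets2(sample, {'markEmoji': True})[0])
# free hugs VGEMOJINAME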

Conclusions

There are many more things we could try to improve the performance; some will work, others won't. The main point you should take from this article is to always rely on the data, never on what people say. Removing stop words may be a good idea in some domains and worsen the performance in others. Validate every assumption and play with the data as much as possible. Good luck!