Sunday, April 10, 2016

NLP for Data Scientists - SpaCy


Introduction

It's been a while since we introduced a new library, so today we'll talk a bit about SpaCy, a Natural Language Processing library for Python. Now you'll say: wait a minute, what about NLTK? Yes, we used NLTK in both Natural Language Processing with Python and Tweets analysis with Python and NLP, but from now on - no more. The reason couldn't be described better than in the SpaCy author's own article about why he chose to write the library in the first place:

What NLTK has is a decent tokenizer, some passable stemmers, a good implementation of the Punkt sentence boundary detector, some visualization tools, and some wrappers for other libraries. Nothing else is of any use.

Installation

Starting to work with SpaCy is easy: first install it, then download the model data.

pip install spacy
python -m spacy.en.download

The rest is pretty straightforward: import the library and start using it according to the documentation. Let's see how to use its POS (part-of-speech) tagger.

import spacy

# Loading the English model takes a while, but this is done only once
_spacy = spacy.load('en')

# Parsing returns a Doc object whose tokens carry part-of-speech tags
doc = _spacy("This is just an example")
for token in doc:
    print(str(token) + ": " + token.pos_)

# This: DET
# is: VERB
# just: ADV
# an: DET
# example: NOUN

Loading the models takes some time, but that is of course done only once; after that, the parsing is blazingly fast compared to NLTK's bridge to the Stanford POS tagger.
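To make that explicit, here is a minimal sketch (the sample text and the document count are made up for illustration) that loads the model once, reuses it across many documents, and times only the parsing step:

import time
import spacy

# The expensive part: loading the English model, done only once
nlp = spacy.load('en')

texts = ["This is just an example"] * 10000  # illustrative corpus

start = time.time()
docs = [nlp(text) for text in texts]  # parsing reuses the already loaded model
elapsed = time.time() - start

print("Parsed %d documents in %.2f seconds" % (len(docs), elapsed))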

Conclusions

The reason I didn't show you this library before is that SpaCy used to be under dual licensing, and I personally don't like writing articles about libraries with such restrictions. However, now that it's under the MIT License, feel free to throw away NLTK and use SpaCy instead.