spaCy vs NLTK

While working on my Slack bot that turns business questions into SQL and answers back, I found myself comparing the two most widely used Python libraries for natural language processing: spaCy and NLTK. Here are some differences I found, with examples.

I used this lyrics dataset from Kaggle — I recognize about 5% of the artists present.

We will be using the following helper functions:
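The original helpers didn't survive in this copy of the post; here is a minimal sketch of the timing helper assumed throughout (the name `time_it` and its signature are my own):

```python
import time
from statistics import mean, stdev

def time_it(func, *args, n=100):
    """Call func(*args) n times; return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        func(*args)
        samples.append(time.perf_counter() - start)
    return mean(samples), stdev(samples)
```

All the benchmarks below report the mean over the repetitions, with the spread shown as boxplots.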

Tokenization

First, I compared the running times of tokenizing the lyrics, repeating the experiment 100 times to get statistically meaningful timings.

[Figure: boxplot showing spaCy is much faster at tokenizing phrases]

Notice how this invalidates older comparisons: since spaCy now supports tokenization without analyzing the full semantic structure, it is no longer slower than NLTK.

Part of Speech

This part is tricky, as spaCy does not have a method that computes only POS tags. What you can do is create a Doc object, which carries many different linguistic annotations, including POS. Anyway, let's see the speed difference:

[Figure: boxplot showing NLTK wins]

Yes, NLTK wins, but again, with spaCy you get immediate access to all the other annotations. If you just want POS tags, though, you should definitely go with NLTK.

Entity Recognition

The difference starts to show when computing entity recognition. As you may know, entity recognition needs all the previous steps, so we can expect spaCy to do better relative to the last scenario: it has already calculated everything and stored it in the Doc object.

[Figure: boxplot showing NLTK wins by a far smaller margin, about 8% faster]

NLTK processes strings and always returns strings (or lists and tuples of strings). spaCy, in contrast, takes an object-oriented approach and is much more user-friendly.
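A quick illustration of that difference, using a tokenizer-only pipeline so no model download is needed:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Love is in the air")

# NLTK would hand you back plain strings; spaCy hands you Token objects
# that carry their own attributes and link back to the Doc.
first = doc[0]
print(first.text)      # "Love"
print(first.i)         # position in the Doc: 0
print(first.is_alpha)  # True
print(doc[0:3].text)   # slicing a Doc yields a Span: "Love is in"
```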

So far, NLTK seems to be faster than spaCy at these semantic-analysis tasks. But is that really true?

Usually, when we do text processing, we don’t just want the entities, POS, etc. Real life is more complex than that.

Love is in the air

This lyrics dataset made me wonder what these songs can teach me about love. Let’s find out!

spaCy makes this really easy. The following code lets you match all phrases of the form LOVE [conjugated verb “to be”] [optional adverb] [adjective]

"love being free",
"love is bright",
"love is true",
"love was handmade",
"love was unmoved",
"love's too overrated",
"lovin’ was good"

Yeah, lovin’ was good. The code is also easy to read and easy to write. Interestingly enough, if we repeat the experiment 10 times, it takes longer each time:

[Figure: spaCy matching times over 10 repetitions]

On the other hand, if you want to do this with NLTK, things get a bit complicated. NLTK doesn’t give you an easy way of matching a pattern that contains both tags and literal words. However, a hack that works is to replace love’s tag with a unique ID and then search by that ID: I changed love’s tag to literally “love”, and the tag of the verb “to be” to “BE”.
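A sketch of that hack (the helper names are mine; the input is a tagged sentence of the shape returned by `nltk.pos_tag`):

```python
def hack_tags(tagged):
    """Overwrite the tags of 'love' and conjugated 'to be' with sentinel IDs."""
    be_forms = {"be", "is", "was", "are", "were", "am", "being", "been"}
    result = []
    for word, tag in tagged:
        if word.lower() in ("love", "lovin'", "loved"):
            tag = "love"
        elif word.lower() in be_forms:
            tag = "BE"
        result.append((word, tag))
    return result

def find_love_phrases(tagged):
    """Scan for the sequence: love, BE, zero or more adverbs, one adjective."""
    matches = []
    for i in range(len(tagged)):
        if tagged[i][1] != "love":
            continue
        j = i + 1
        if j < len(tagged) and tagged[j][1] == "BE":
            j += 1
            while j < len(tagged) and tagged[j][1].startswith("RB"):
                j += 1
            if j < len(tagged) and tagged[j][1].startswith("JJ"):
                matches.append(" ".join(w for w, _ in tagged[i:j + 1]))
    return matches
```

The manual index bookkeeping is exactly the kind of code the spaCy pattern spares you from writing.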

"love being free"
"love is true"
"love was unmoved"
"loved is long"

Although NLTK found far fewer matches, it did find something spaCy didn’t: the last phrase, “loved is long”.

I guess it is, if you are lucky! Anyway, NLTK’s code is definitely longer, but it runs in about one third of the time spaCy takes:

[Figure: NLTK matching times over 10 repetitions]

We can compare them with another boxplot:

[Figure: boxplot comparing spaCy and NLTK matching times]

TL;DR: if you really care about speed, then NLTK might be worth checking out, but the more annotations you use, the better spaCy becomes.

This happens because spaCy is object-oriented, more readable, and pre-computes many linguistic annotations in the Doc object.

It comes down to the trade-off between human labor and efficiency. For one-time data science experiments, I’ll go with spaCy. For tools used in production, I may go with NLTK, but if I have the time, I will run a benchmark first.
