While working on my Slack bot that knows how to transform business questions into SQL and answer back, I found myself comparing the 2 most used Python libraries for natural language processing: spaCy and NLTK. Here are some differences I found — with examples.
I used this lyrics dataset from Kaggle — I recognize about 5% of the artists present.
We will be using the following helper functions:
First, I compared running times of tokenizing the lyrics. I repeated the experiment 100 times so the comparison reflects a distribution of running times rather than a single noisy measurement.
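The repeat-and-time procedure described above could be sketched roughly as follows. This is not the original helper code from this post; `time_runs`, `summarize`, and the whitespace-split workload are hypothetical stand-ins, and in a real comparison the lambda would wrap a spaCy or NLTK call instead.

```python
import statistics
import time


def time_runs(fn, repeats=100):
    """Call fn() `repeats` times and return each run's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to a few descriptive statistics."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


# Stand-in workload (an assumption): plain whitespace tokenization of a
# placeholder string, used here only so the sketch runs on its own.
sample_text = "love is in the air " * 1000
runs = time_runs(lambda: sample_text.split(), repeats=100)
stats = summarize(runs)
print(stats)
```

Keeping the raw list of durations, rather than only the mean, is what makes it possible to draw a boxplot per library afterwards.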
Notice how this invalidates the analysis here. As spaCy now supports tokenization without analyzing the semantic structure, it’s not slower than NLTK anymore.
Part of Speech
This part is tricky, as spaCy does not have a method that only computes POS. What you can do is create a Doc object, which carries many different linguistic annotations, including POS. Anyway, let’s see the speed difference:
Yes, NLTK wins, but again, you get immediate access to other annotations with spaCy. Anyway, if you just want POS, then you should definitely go with NLTK.
This starts to show when computing entity recognition. As you may know, entity recognition needs all previous steps to be computed, so we can expect spaCy to perform better than in the last scenario, as it already calculated everything and stored it in the doc object.
NLTK processes strings and always returns strings (or lists and pairs of strings). spaCy, in contrast, takes an object-oriented approach and is much more user friendly.
Until now, we saw that regarding semantic analysis, NLTK seems to be faster than spaCy. But is this really true?
Usually, when we do text processing, we don’t just want the entities, POS, etc. Real life is more complex than that.
Love is in the air
This lyrics dataset made me wonder what these songs can teach me about love. Let’s find out!
spaCy makes this really easy. The following code lets you match all phrases of the form LOVE [conjugated verb “to be”] [optional adverb] [adjective]
"love being free",
"love is bright",
"love is true",
"love was handmade",
"love was unmoved",
"love's too overrated",
"lovin’ was good"
Yeah, lovin’ was good. Also, the code is easy to read, and easy to write. If we repeat the experiment 10 times, interestingly enough, it takes longer each time:
On the other hand, if you want to do this with NLTK, things get a bit complicated. NLTK doesn’t give you an easy way of matching a pattern that contains both tags and literal words. However, a hack that works is to replace love’s tag by a unique ID and then search by that ID. I changed love’s tag to literally “love”, and the tag of the verb to be to “BE”.
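To make the tag-replacement hack concrete, here is a minimal sketch in plain Python. It is not the NLTK code used for the results in this post: it assumes the text has already been tagged into (word, tag) pairs, which is the shape `nltk.pos_tag` returns, and the names `retag`, `find_love_phrases`, `LOVE_FORMS`, and `BE_FORMS`, as well as the word lists, are hypothetical. The input below is hand-tagged with Penn Treebank tags for illustration.

```python
# Hypothetical word lists; a real run would need broader coverage.
LOVE_FORMS = {"love", "loves", "loved", "loving"}
BE_FORMS = {"be", "is", "are", "am", "was", "were", "being", "been"}


def retag(tagged_tokens):
    """Replace the tag of 'love' words and 'to be' forms with unique IDs."""
    retagged = []
    for word, tag in tagged_tokens:
        lower = word.lower()
        if lower in LOVE_FORMS:
            tag = "LOVE"
        elif lower in BE_FORMS:
            tag = "BE"
        retagged.append((word, tag))
    return retagged


def find_love_phrases(tagged_tokens):
    """Match the tag sequence LOVE BE [RB*]? JJ* and return the phrases."""
    tokens = retag(tagged_tokens)
    phrases = []
    for i, (_, tag) in enumerate(tokens):
        if tag != "LOVE":
            continue
        j = i + 1
        # The verb "to be" is mandatory.
        if j >= len(tokens) or tokens[j][1] != "BE":
            continue
        j += 1
        # The adverb (RB, RBR, RBS) is optional.
        if j < len(tokens) and tokens[j][1].startswith("RB"):
            j += 1
        # The adjective (JJ, JJR, JJS) is mandatory.
        if j < len(tokens) and tokens[j][1].startswith("JJ"):
            phrases.append(" ".join(word for word, _ in tokens[i:j + 1]))
    return phrases


# Hand-tagged toy input, not taken from the lyrics dataset.
tagged = [
    ("My", "PRP$"), ("love", "NN"), ("is", "VBZ"), ("true", "JJ"),
    ("and", "CC"), ("love", "NN"), ("was", "VBD"), ("so", "RB"),
    ("good", "JJ"),
]
phrases = find_love_phrases(tagged)
print(phrases)  # ['love is true', 'love was so good']
```

Once the literal words have been turned into their own tags, the whole pattern can be expressed purely in terms of tags, which is what makes the hack work with a tag-based matcher.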
"love being free"
"love is true"
"love was unmoved"
"loved is long"
Although NLTK found far fewer matches, it did find something spaCy didn’t: the last phrase, love is long.
I guess it is, if you are lucky! Anyway, NLTK’s code is definitely long, but it takes about one third of the time spaCy takes:
We can compare them with another boxplot:
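For readers who want the numbers behind such a boxplot, the sketch below computes the five values a box typically draws for each set of timings. The function name `box_stats` is hypothetical, and the durations are made-up placeholder values, not measurements from this post.

```python
import statistics


def box_stats(durations):
    """Return the five-number summary that a boxplot visualizes."""
    q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "min": min(durations),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(durations),
    }


# Placeholder timings in seconds (assumed values for illustration only).
toy_timings = {
    "library_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "library_b": [2.0, 4.0, 6.0, 8.0, 10.0],
}
summary = {name: box_stats(values) for name, values in toy_timings.items()}
for name, stats in summary.items():
    print(name, stats)
```

Comparing medians and interquartile ranges side by side gives the same reading as the plot: which library is faster on a typical run, and how much its timings vary.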
TL;DR: if you really worry about speed, then NLTK might be worth checking out, but the more annotations you use, the better spaCy becomes.
This happens because spaCy is object-oriented, is more readable, and can pre-compute many linguistic annotations through the Doc object.
Analyze the trade-off between human labor and efficiency. For one-time data science experiments, I’ll go with spaCy. For tools used in production, I may go with NLTK, but if I have the time I will run a benchmark first.