NLTK: Natural Language Made Easy

Dealing with text is hard! Thankfully, it’s hard for everyone, so tools exist to make it easier.

NLTK, the Natural Language Toolkit, is a Python package “for building Python programs to work with human language data”. It has many tools for basic language processing (e.g. tokenization, n-grams, etc.) as well as tools for more complicated language processing (e.g. part-of-speech tagging, parse trees, etc.).
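
For a quick taste of the basic tools, here is what bigrams look like with nltk.ngrams (a minimal sketch on a toy sentence; it needs no downloaded data):

import nltk

# nltk.ngrams works on any sequence of tokens; here we split a toy sentence ourselves
words = "the quick brown fox jumps over the lazy dog".split()
list(nltk.ngrams(words, 2))
# -> [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ...]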

NLTK has an associated book about NLP that provides some context for the corpora and models.

Installing NLTK, or “why do I need to download so much data?”

We can conda install nltk to get the package. Then we need to do something somewhat strange: we have to download data.

In [1]:
import nltk
#nltk.download()

This pops up a GUI where we can choose what data to download.

What is this stuff? The data is separated into two categories:

  1. Corpora
    • These are collections of text.
  2. Models
    • These are the data (e.g. weights) for trained models.

NLTK provides several collections of data to make installing easier.

  • all: All corpora and models
  • all-corpora: All corpora, no models
  • all-nltk: Everything plus more data from the website
  • book: Data to run the associated book
  • popular: The most popular packages
  • third-party: Extra data from third parties

Downloading the popular collection is recommended.
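
If you would rather skip the GUI (for example, on a remote machine), the same data can be fetched programmatically by passing a collection or package identifier. A quick sketch using the identifiers above plus the specific packages used later in this notebook:

import nltk

# grab a whole collection by its identifier...
nltk.download("popular")
# ...or just the individual packages we need below
nltk.download("punkt")            # models used by word_tokenize
nltk.download("stopwords")        # stop word lists
nltk.download("twitter_samples")  # the tweet corpus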

Analyzing tweets

First pass

Let’s take a look at one corpus in particular: positive and negative tweets.

In [2]:
# read some twitter data
neg_id = nltk.corpus.twitter_samples.fileids()[0]
neg_tweets = nltk.corpus.twitter_samples.strings(neg_id)
pos_id = nltk.corpus.twitter_samples.fileids()[1]
pos_tweets = nltk.corpus.twitter_samples.strings(pos_id)
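
Indexing fileids() by position assumes the negative file comes before the positive one, so it is worth printing the file IDs to confirm which is which:

# the corpus ships a negative file, a positive file, and a larger unlabeled file
print(nltk.corpus.twitter_samples.fileids())
# expect something like ['negative_tweets.json', 'positive_tweets.json', ...]
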
In [3]:
print(pos_tweets[10])
print()
print(neg_tweets[10])
#FollowFriday @wncer1 @Defense_gouv for being top influencers in my community this week :)

I have a really good m&g idea but I'm never going to meet them :(((

How does the language in positive and negative tweets differ?

We can start by looking at how the words differ. NLTK provides tools for tokenization.

In [4]:
def tokenize_tweets1(tweets):
    """Get all of the tokens in a set of tweets"""
    tokens = [token for tweet in tweets for token in nltk.word_tokenize(tweet)]
    return(tokens)

What does this output?

In [5]:
pos_tokens = tokenize_tweets1(pos_tweets)
neg_tokens = tokenize_tweets1(neg_tweets)
print(pos_tokens[:10])
['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'for', 'being']

We can look at the most common words (like in the first homework) using Python’s Counter class.

In [6]:
from collections import Counter

pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)
In [7]:
pos_count.most_common(20)
Out[7]:
[(':', 6667),
 (')', 5165),
 ('@', 5119),
 ('!', 1920),
 ('you', 1427),
 ('.', 1323),
 ('#', 1292),
 ('I', 1176),
 ('to', 1063),
 ('the', 997),
 (',', 964),
 ('a', 881),
 ('-', 863),
 ('http', 856),
 ('for', 749),
 ('D', 662),
 ('and', 656),
 ('?', 582),
 ('it', 566),
 ('my', 484)]
In [8]:
neg_count.most_common(20)
Out[8]:
[('(', 7076),
 (':', 5959),
 ('@', 3181),
 ('I', 1986),
 ('.', 1078),
 ('to', 1067),
 ('#', 913),
 ('!', 895),
 ('the', 846),
 (',', 733),
 ('you', 707),
 ('i', 684),
 ('?', 650),
 ('my', 629),
 ('a', 626),
 ("n't", 614),
 ('and', 613),
 ('-', 600),
 ('it', 591),
 ('me', 520)]

The two most common tokens for positive tweets are “:” and “)”, and the two most common tokens for negative tweets are “(” and “:”. These are smiley and frowny faces! The basic word tokenizer is treating these as separate tokens, which makes sense in most cases but not for text from social media.

A better tokenizer

We’re not the first people to see this problem, and NLTK actually has a wide set of tokenizers in the nltk.tokenize module (http://www.nltk.org/api/nltk.tokenize.html). In particular, there’s a tokenizer that’s optimized for tweets.

In [9]:
def tokenize_tweets2(tweets):
    """Get all of the tokens in a set of tweets"""
    twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = [token for tweet in tweets for token in twt.tokenize(tweet)]
    return(tokens)
In [10]:
pos_tokens = tokenize_tweets2(pos_tweets)
neg_tokens = tokenize_tweets2(neg_tweets)
pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)
In [11]:
pos_count.most_common(20)
Out[11]:
[(':)', 3691),
 ('!', 1844),
 ('you', 1341),
 ('.', 1341),
 ('to', 1065),
 ('the', 999),
 (',', 964),
 ('I', 890),
 ('a', 888),
 ('for', 749),
 (':-)', 701),
 ('and', 660),
 (':D', 658),
 ('?', 581),
 (')', 525),
 ('my', 484),
 ('in', 481),
 ('it', 460),
 ('is', 418),
 ('of', 403)]
In [12]:
neg_count.most_common(20)
Out[12]:
[(':(', 4585),
 ('I', 1587),
 ('(', 1180),
 ('.', 1092),
 ('to', 1068),
 ('the', 846),
 ('!', 831),
 (',', 734),
 ('you', 660),
 ('?', 644),
 ('my', 629),
 ('a', 627),
 ('i', 620),
 ('and', 614),
 ('me', 524),
 (':-(', 501),
 ('so', 466),
 ('is', 456),
 ('it', 449),
 ('in', 421)]

Much better! This tokenizer strips Twitter handles for us, so no more “@” tokens, and it catches emoticons. However, there are still some questions:

  1. Should we count a capitalized word differently from a non-capitalized word? e.g. should “Thanks” be different from “thanks”?
  2. Do we want to be counting punctuation?
  3. Do we want to count words like “I”, “me”, etc.?

Using a combination of NLTK and basic Python string tools we can address these concerns.

We can easily take a string and get a lowercase version of it.

In [13]:
"ThIS IS a cRaZy sTRing".lower()
Out[13]:
'this is a crazy string'

The string module in base Python provides a string of common ASCII punctuation characters.

In [14]:
import string

string.punctuation
Out[14]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

NLTK has a collection of “stop words” for many languages, including English. This is one of the corpora we downloaded.

In [15]:
from nltk.corpus import stopwords

stopwords.words("english")[:20]
Out[15]:
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers']

We can combine all of these into our tokenizer.

In [16]:
def tokenize_tweets3(tweets):
    """Get all of the tokens in a set of tweets"""
    twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
    # combine stop words and punctuation
    stop = stopwords.words("english") + list(string.punctuation)
    # filter out stop words and punctuation and send to lower case
    tokens = [token.lower() for tweet in tweets
              for token in twt.tokenize(tweet)
              if token.lower() not in stop]
    return(tokens)
In [17]:
pos_tokens = tokenize_tweets3(pos_tweets)
neg_tokens = tokenize_tweets3(neg_tweets)
pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)
In [18]:
pos_count.most_common(20)
Out[18]:
[(':)', 3691),
 (':-)', 701),
 (':d', 658),
 ('thanks', 392),
 ('follow', 304),
 ('...', 290),
 ('love', 273),
 ('thank', 247),
 ('u', 245),
 ('good', 234),
 ('like', 218),
 ('day', 209),
 ('happy', 191),
 ("i'm", 183),
 ('hi', 173),
 ('great', 172),
 ('get', 168),
 ('see', 167),
 ('back', 162),
 ("it's", 162)]
In [19]:
neg_count.most_common(20)
Out[19]:
[(':(', 4585),
 (':-(', 501),
 ("i'm", 343),
 ('...', 332),
 ('please', 274),
 ('miss', 238),
 ('want', 218),
 ('♛', 210),
 ('》', 210),
 ('like', 206),
 ('u', 193),
 ('get', 180),
 ("can't", 180),
 ("it's", 178),
 ("don't", 176),
 ('sorry', 149),
 ('one', 144),
 ('follow', 142),
 ('time', 141),
 ('much', 139)]

Additional processing

How we pre-process text is very important. NLTK provides more tools for pre-processing.

One popular method of pre-processing is stemming. The idea here is to find the “root” of each word.

In [20]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmer.stem("actually")
Out[20]:
'actual'

Does this always work how we want?

In [21]:
print(stemmer.stem("please"), stemmer.stem("pleasing"))
pleas pleas
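
Stemmed forms like “pleas” are not dictionary words. If that matters, NLTK also ships a WordNet-based lemmatizer we could swap in (a small sketch; it needs the wordnet data from the popular collection and works best when told the part of speech):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# with pos="v" (verb) this typically maps "pleasing" back to the dictionary form "please"
print(lemmatizer.lemmatize("pleasing", pos="v"))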

Let’s update the tokenizer.

In [22]:
def tokenize_tweets4(tweets):
    """Get all of the tokens in a set of tweets"""
    twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
    # combine stop words and punctuation
    stop = stopwords.words("english") + list(string.punctuation)
    # create the stemmer
    stemmer = PorterStemmer()
    # filter out stop words and punctuation and send to lower case
    tokens = [ stemmer.stem(token) for tweet in tweets
              for token in twt.tokenize(tweet)
              if token.lower() not in stop]
    return(tokens)
pos_tokens = tokenize_tweets4(pos_tweets)
neg_tokens = tokenize_tweets4(neg_tweets)
pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)
In [23]:
pos_count.most_common(20)
Out[23]:
[(':)', 3691),
 (':-)', 701),
 (':D', 658),
 ('thank', 643),
 ('follow', 443),
 ('love', 398),
 ('...', 290),
 ('day', 245),
 ('good', 238),
 ('like', 232),
 ('u', 228),
 ('get', 209),
 ('happi', 206),
 ('see', 186),
 ("i'm", 183),
 ('great', 172),
 ('back', 163),
 ("it'", 162),
 ('know', 155),
 ('new', 153)]
In [24]:
neg_count.most_common(20)
Out[24]:
[(':(', 4585),
 (':-(', 501),
 ("i'm", 343),
 ('...', 332),
 ('miss', 301),
 ('pleas', 275),
 ('follow', 263),
 ('want', 246),
 ('get', 233),
 ('like', 223),
 ('go', 218),
 ('♛', 210),
 ('》', 210),
 ("can't", 180),
 ("it'", 178),
 ("don't", 176),
 ('time', 166),
 ('u', 164),
 ('feel', 158),
 ('love', 151)]

Runtime and optimizations

How does the runtime change as we add all of these complications?

In [25]:
small_twt =  pos_tweets[:2000]
In [26]:
%%time
# Base NLTK tokenizer
_ = tokenize_tweets1(small_twt)
CPU times: user 344 ms, sys: 0 ns, total: 344 ms
Wall time: 340 ms
In [27]:
%%time
# Twitter optimized tokenizer
_ = tokenize_tweets2(small_twt)
CPU times: user 78.1 ms, sys: 15.6 ms, total: 93.8 ms
Wall time: 84.8 ms
In [28]:
%%time
# Get rid of stop words and lowercase
_ = tokenize_tweets3(small_twt)
CPU times: user 141 ms, sys: 0 ns, total: 141 ms
Wall time: 131 ms
In [29]:
%%time
# Also stemming
_ = tokenize_tweets4(small_twt)
CPU times: user 344 ms, sys: 15.6 ms, total: 359 ms
Wall time: 352 ms

Takeaways:

  • The general NLTK word tokenizer works on many problems, but that generality makes it slow.
  • Using a tokenizer optimized to your problem will be faster.
  • Adding more and more complications adds more and more time.
  • Sometimes you need to put in work to optimize these steps as well.

This optimization really does matter. Here’s a “fast” version of tokenization made for a specific project.

In [30]:
import re

def word_tokenize(words):
    """Faster word tokenization than nltk.word_tokenize
    Input:
        words: a string to be tokenized
    Output:
        tokens: tokenized words
    """
    tokens = re.findall(r"[a-z]+-?[a-z]+", words.lower())
    return(tokens)
In [31]:
small_twt = " ".join(pos_tweets[:10000])
twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)
In [32]:
%%time
_ = nltk.word_tokenize(small_twt)
CPU times: user 500 ms, sys: 15.6 ms, total: 516 ms
Wall time: 507 ms
In [33]:
%%time
_ = twt.tokenize(small_twt)
CPU times: user 188 ms, sys: 15.6 ms, total: 203 ms
Wall time: 194 ms
In [34]:
%%time
_ = word_tokenize(small_twt)
CPU times: user 31.2 ms, sys: 0 ns, total: 31.2 ms
Wall time: 26.6 ms

We can see that optimizing our tokenization can really improve speed. But this tokenizer isn’t tuned to this problem; for instance, it doesn’t pick up emoticons.

In [35]:
Counter(word_tokenize(small_twt)).most_common(20)
Out[35]:
[('you', 1591),
 ('co', 1196),
 ('the', 1096),
 ('to', 1094),
 ('http', 856),
 ('for', 772),
 ('and', 706),
 ('it', 681),
 ('my', 560),
 ('in', 505),
 ('have', 436),
 ('is', 434),
 ('of', 413),
 ('thanks', 393),
 ('me', 364),
 ('that', 343),
 ('https', 336),
 ('your', 333),
 ('on', 326),
 ('follow', 308)]
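
If emoticons matter for the analysis, the fast regex can be extended to capture them too. A minimal sketch, assuming we only care about simple sideways faces like :) :( :-) :D (real tweets contain far more variety):

import re

# hypothetical emoticon pattern: eyes, optional nose, a few simple mouths
EMOTICON = r"[:;=][-']?[()dp]"
FAST_RE = re.compile(EMOTICON + r"|[a-z]+-?[a-z]+")

def word_tokenize_emoticons(words):
    """Like word_tokenize above, but keeps basic emoticons."""
    # we lower-case first, so :D arrives as :d
    return FAST_RE.findall(words.lower())

Rerunning the counts with this version should bring the emoticons back toward the top of the lists, at the cost of a slightly slower regex.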

So we see that NLTK has some pros and cons:

  • Pros
    • Easy to use
    • Fast enough for a one-off analysis on small(ish) data
    • Great when (time to code solution) > (time to run NLTK)
  • Cons
    • Much slower than optimized solutions
    • You really feel the crunch on larger corpora or large analyses

More involved processing

NLTK has many other modules to perform more complicated text processing.

We can get the parts of speech for each word in a sentence.

In [36]:
tokens = word_tokenize(small_twt[:100])
nltk.pos_tag(tokens)
Out[36]:
[('followfriday', 'JJ'),
 ('france', 'NN'),
 ('inte', 'NN'),
 ('pkuchly', 'RB'),
 ('milipol', 'JJ'),
 ('paris', 'NN'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN')]
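
These tags can feed further processing, such as chunking the tagged tokens into a shallow parse tree. Here is a minimal sketch using nltk.RegexpParser (the noun-phrase grammar is an illustrative assumption, not a tuned one):

# group an optional determiner, any adjectives, and one or more nouns into NP chunks
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(nltk.pos_tag(tokens))
print(tree)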