{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK: Natural Language Made Easy\n", "\n", "Dealing with text is hard! Thankfully, it's hard for everyone, so tools exist to make it easier.\n", "\n", "NLTK, the Natural Language Toolkit, is a python package \"for building Python programs to work with human language data\". It has many tools for basic language processing (e.g. tokenization, $n$-grams, etc.) as well as tools for more complicated language processing (e.g. part of speech tagging, parse trees, etc.).\n", "\n", "NLTK has an [associated book about NLP](http://www.nltk.org/book/) that provides some context for the corpora and models.\n", "\n", "## Installing NLTK, or \"why do I need to download so much data?\"\n", "We can `conda install nltk` to get the package. Then we need to do something somewhat strange: we have to download data.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "#nltk.download()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This pops up a GUI where we can choose what data to download.\n", "\n", "---\n", "![nltk download](fig/nltk_download.png)\n", "\n", "---\n", "\n", "What is this stuff? The data is separated into two categories:\n", "\n", "1. Corpora\n", " - These data are a set of collections of text.\n", "1. Models\n", " - These are data (e.g. weights, etc.) for trained models.\n", "\n", "NLTK provides several collections of data to make installing easier.\n", "\n", "- `all`: All corpora and models\n", "- `all-corpora`: All corpora, no models\n", "- `all-nltk`: Everything plus more data from the website\n", "- `book`: Data to run the associated book\n", "- `popular`: The most popular packages\n", "- `third-party`: Extra data from third parties\n", "\n", "Downloading the `popular` collection is recommended.\n", "\n", "## Analyzing tweets\n", "### First pass\n", "Let's take a look at one corpus in particular: positive and negative tweets." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read some twitter data\n", "neg_id = nltk.corpus.twitter_samples.fileids()[0]\n", "neg_tweets = nltk.corpus.twitter_samples.strings(neg_id)\n", "pos_id = nltk.corpus.twitter_samples.fileids()[1]\n", "pos_tweets = nltk.corpus.twitter_samples.strings(pos_id)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#FollowFriday @wncer1 @Defense_gouv for being top influencers in my community this week :)\n", "\n", "I have a really good m&g idea but I'm never going to meet them :(((\n" ] } ], "source": [ "print(pos_tweets[10])\n", "print()\n", "print(neg_tweets[10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the language in positive and negative tweets differ?\n", "\n", "We can start by looking at how the words differ. NLTK provides tools for tokenization." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets1(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " tokens = [token for tweet in tweets for token in nltk.word_tokenize(tweet)]\n", " return(tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does this output?" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'for', 'being']\n" ] } ], "source": [ "pos_tokens = tokenize_tweets1(pos_tweets)\n", "neg_tokens = tokenize_tweets1(neg_tweets)\n", "print(pos_tokens[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the most common words (like in the first homework) using Python's Counter class." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import Counter\n", "\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':', 6667),\n", " (')', 5165),\n", " ('@', 5119),\n", " ('!', 1920),\n", " ('you', 1427),\n", " ('.', 1323),\n", " ('#', 1292),\n", " ('I', 1176),\n", " ('to', 1063),\n", " ('the', 997),\n", " (',', 964),\n", " ('a', 881),\n", " ('-', 863),\n", " ('http', 856),\n", " ('for', 749),\n", " ('D', 662),\n", " ('and', 656),\n", " ('?', 582),\n", " ('it', 566),\n", " ('my', 484)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('(', 7076),\n", " (':', 5959),\n", " ('@', 3181),\n", " ('I', 1986),\n", " ('.', 1078),\n", " ('to', 1067),\n", " ('#', 913),\n", " ('!', 895),\n", " ('the', 846),\n", " (',', 733),\n", " ('you', 707),\n", " ('i', 684),\n", " ('?', 650),\n", " ('my', 629),\n", " ('a', 626),\n", " (\"n't\", 614),\n", " ('and', 613),\n", " ('-', 600),\n", " ('it', 591),\n", " ('me', 520)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two most common tokens for postiive tweets are \":\" and \")\" and the tweo most common tokens for negative tweets are \"(\" and \":\". These are smiley and frowny faces! The basic word tokenizer is treating these as separate tokens, which makes sense in most cases but not for text from social media.\n", "\n", "### A better tokenizer\n", "We're not the first people to see this problem, and NLTK actually has a wide set of tokenizers in the [`nltk.tokenizer` module](http://www.nltk.org/api/nltk.tokenize.html). In particular, there's a [tokenizer that's optimized for tweets](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets2(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)\n", " tokens = [token for tweet in tweets for token in twt.tokenize(tweet)]\n", " return(tokens)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pos_tokens = tokenize_tweets2(pos_tweets)\n", "neg_tokens = tokenize_tweets2(neg_tweets)\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':)', 3691),\n", " ('!', 1844),\n", " ('you', 1341),\n", " ('.', 1341),\n", " ('to', 1065),\n", " ('the', 999),\n", " (',', 964),\n", " ('I', 890),\n", " ('a', 888),\n", " ('for', 749),\n", " (':-)', 701),\n", " ('and', 660),\n", " (':D', 658),\n", " ('?', 581),\n", " (')', 525),\n", " ('my', 484),\n", " ('in', 481),\n", " ('it', 460),\n", " ('is', 418),\n", " ('of', 403)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':(', 4585),\n", " ('I', 1587),\n", " ('(', 1180),\n", " ('.', 1092),\n", " ('to', 1068),\n", " ('the', 846),\n", " ('!', 831),\n", " (',', 734),\n", " ('you', 660),\n", " ('?', 644),\n", " ('my', 629),\n", " ('a', 627),\n", " ('i', 620),\n", " ('and', 614),\n", " ('me', 524),\n", " (':-(', 501),\n", " ('so', 466),\n", " ('is', 456),\n", " ('it', 449),\n", " ('in', 421)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Much better! This tokenizer got rid of twitter handles for us, so no more \"@\" tokens, and catches emoticons. However, there are still some questions:\n", "\n", "1. Should we count a capitalized word differently from a non-capitalized word? e.g. should \"Thanks\" be different from \"thanks\"?\n", "1. Do we want to be counting punctuation?\n", "1. Do we want to count words like \"I\", \"me\", etc.?\n", "\n", "Using a combination of NLTK and basic Python string tools we can address these concerns.\n", "\n", "We can easily take a string and get a lowercase version of it." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'this is a crazy string'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"ThIS IS a cRaZy sTRing\".lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `string` module in base Python has a set of punctuation for the latin alphabet." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "\n", "string.punctuation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK has a collection of \"stop words\" for many languages, including English. This is one of the corpora we downloaded." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['i',\n", " 'me',\n", " 'my',\n", " 'myself',\n", " 'we',\n", " 'our',\n", " 'ours',\n", " 'ourselves',\n", " 'you',\n", " 'your',\n", " 'yours',\n", " 'yourself',\n", " 'yourselves',\n", " 'he',\n", " 'him',\n", " 'his',\n", " 'himself',\n", " 'she',\n", " 'her',\n", " 'hers']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "\n", "stopwords.words(\"english\")[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can combine all of these into our tokenizer" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets3(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)\n", " # combine stop words and punctuation\n", " stop = stopwords.words(\"english\") + list(string.punctuation)\n", " # filter out stop words and punctuation and send to lower case\n", " tokens = [token.lower() for tweet in tweets \n", " for token in twt.tokenize(tweet) \n", " if token.lower() not in stop]\n", " return(tokens)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pos_tokens = tokenize_tweets3(pos_tweets)\n", "neg_tokens = tokenize_tweets3(neg_tweets)\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':)', 3691),\n", " (':-)', 701),\n", " (':d', 658),\n", " ('thanks', 392),\n", " ('follow', 304),\n", " ('...', 290),\n", " ('love', 273),\n", " ('thank', 247),\n", " ('u', 245),\n", " ('good', 234),\n", " ('like', 218),\n", " ('day', 209),\n", " ('happy', 191),\n", " (\"i'm\", 183),\n", " ('hi', 173),\n", " ('great', 172),\n", " ('get', 168),\n", " ('see', 167),\n", " ('back', 162),\n", " (\"it's\", 162)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':(', 4585),\n", " (':-(', 501),\n", " (\"i'm\", 343),\n", " ('...', 332),\n", " ('please', 274),\n", " ('miss', 238),\n", " ('want', 218),\n", " ('♛', 210),\n", " ('》', 210),\n", " ('like', 206),\n", " ('u', 193),\n", " ('get', 180),\n", " (\"can't\", 180),\n", " (\"it's\", 178),\n", " (\"don't\", 176),\n", " ('sorry', 149),\n", " ('one', 144),\n", " ('follow', 142),\n", " ('time', 141),\n", " ('much', 139)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional processing\n", "How we pre-process text is very important. NLTK provides more tools for pre-processing.\n", "\n", "One popular method of pre-processing is **stemming**. The idea here is to find the \"root\" of each word. 
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'actual'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import PorterStemmer\n", "\n", "stemmer = PorterStemmer()\n", "stemmer.stem(\"actually\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this always work how we want?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pleas pleas\n" ] } ], "source": [ "print(stemmer.stem(\"please\"), stemmer.stem(\"pleasing\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's update the tokenizer" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets4(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)\n", " # combine stop words and punctuation\n", " stop = stopwords.words(\"english\") + list(string.punctuation)\n", " # create the stemmer\n", " stemmer = PorterStemmer()\n", " # filter out stop words and punctuation and send to lower case\n", " tokens = [ stemmer.stem(token) for tweet in tweets \n", " for token in twt.tokenize(tweet) \n", " if token.lower() not in stop]\n", " return(tokens)\n", "pos_tokens = tokenize_tweets4(pos_tweets)\n", "neg_tokens = tokenize_tweets4(neg_tweets)\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':)', 3691),\n", " (':-)', 701),\n", " (':D', 658),\n", " ('thank', 643),\n", " ('follow', 443),\n", " ('love', 398),\n", " ('...', 290),\n", " ('day', 245),\n", " ('good', 238),\n", " ('like', 232),\n", " ('u', 228),\n", " ('get', 209),\n", " ('happi', 206),\n", " ('see', 186),\n", " (\"i'm\", 183),\n", " ('great', 172),\n", " ('back', 163),\n", " (\"it'\", 162),\n", " ('know', 155),\n", " ('new', 153)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':(', 4585),\n", " (':-(', 501),\n", " (\"i'm\", 343),\n", " ('...', 332),\n", " ('miss', 301),\n", " ('pleas', 275),\n", " ('follow', 263),\n", " ('want', 246),\n", " ('get', 233),\n", " ('like', 223),\n", " ('go', 218),\n", " ('♛', 210),\n", " ('》', 210),\n", " (\"can't\", 180),\n", " (\"it'\", 178),\n", " (\"don't\", 176),\n", " ('time', 166),\n", " ('u', 164),\n", " ('feel', 158),\n", " ('love', 151)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Runtime and optimizations\n", "How does the runtime change as we add all of these complications?" 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "small_twt = pos_tweets[:2000]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 344 ms, sys: 0 ns, total: 344 ms\n", "Wall time: 340 ms\n" ] } ], "source": [ "%%time\n", "# Base NLTK tokenizer\n", "_ = tokenize_tweets1(small_twt)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 78.1 ms, sys: 15.6 ms, total: 93.8 ms\n", "Wall time: 84.8 ms\n" ] } ], "source": [ "%%time\n", "# Twitter optimized tokenizer\n", "_ = tokenize_tweets2(small_twt)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 141 ms, sys: 0 ns, total: 141 ms\n", "Wall time: 131 ms\n" ] } ], "source": [ "%%time\n", "# Get rid of stop words and lowercase\n", "_ = tokenize_tweets3(small_twt)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 344 ms, sys: 15.6 ms, total: 359 ms\n", "Wall time: 352 ms\n" ] } ], "source": [ "%%time\n", "# Also stemming\n", "_ = tokenize_tweets4(small_twt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Takeaways:\n", "- The general NLTK word tokenizer works on many problems, but that generality makes it slow\n", " - Using a tokenizer optimized to your problem will be faster\n", "- Adding more and more complications adds more and more time\n", " - Sometimes need to work to optimize these also\n", "\n", "This optimization really does matter. Here's a \"fast\" version of tokenization made for a specific project." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "\n", "def word_tokenize(words):\n", " \"\"\"Faster word tokenization than nltk.word_tokenize\n", " Input:\n", " words: a string to be tokenized\n", " Output:\n", " tokens: tokenized words\n", " \"\"\"\n", " tokens = re.findall(r\"[a-z]+-?[a-z]+\", words.lower())\n", " return(tokens)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "small_twt = \" \".join(pos_tweets[:10000])\n", "twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 500 ms, sys: 15.6 ms, total: 516 ms\n", "Wall time: 507 ms\n" ] } ], "source": [ "%%time\n", "_ = nltk.word_tokenize(small_twt)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 188 ms, sys: 15.6 ms, total: 203 ms\n", "Wall time: 194 ms\n" ] } ], "source": [ "%%time\n", "_ = twt.tokenize(small_twt)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 31.2 ms, sys: 0 ns, total: 31.2 ms\n", "Wall time: 26.6 ms\n" ] } ], "source": [ "%%time\n", "_ = word_tokenize(small_twt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that optimizing our tokenization can really help the speed. But this tokenizer isn't optimized for this problem. 
, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('you', 1591),\n", " ('co', 1196),\n", " ('the', 1096),\n", " ('to', 1094),\n", " ('http', 856),\n", " ('for', 772),\n", " ('and', 706),\n", " ('it', 681),\n", " ('my', 560),\n", " ('in', 505),\n", " ('have', 436),\n", " ('is', 434),\n", " ('of', 413),\n", " ('thanks', 393),\n", " ('me', 364),\n", " ('that', 343),\n", " ('https', 336),\n", " ('your', 333),\n", " ('on', 326),\n", " ('follow', 308)]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(word_tokenize(small_twt)).most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we see that NLTK has some pros and cons:\n", "- Pros\n", " - Easy to use\n", " - Fast enough for a one-off analysis on small(ish) data\n", " - Great when (time to code solution) > (time to run NLTK)\n", "- Cons\n", " - Much slower than optimized solutions\n", " - You really feel the crunch on larger corpora or large analyses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More involved processing\n", "NLTK has many other modules to perform more complicated text processing.\n", "\n", "For example, we can get the part of speech for each word in a sentence (here tokenized with our fast tokenizer from above)." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('followfriday', 'JJ'),\n", " ('france', 'NN'),\n", " ('inte', 'NN'),\n", " ('pkuchly', 'RB'),\n", " ('milipol', 'JJ'),\n", " ('paris', 'NN'),\n", " ('for', 'IN'),\n", " ('being', 'VBG'),\n", " ('top', 'JJ'),\n", " ('engaged', 'VBN'),\n", " ('members', 'NNS'),\n", " ('in', 'IN'),\n", " ('my', 'PRP$'),\n", " ('community', 'NN')]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = word_tokenize(small_twt[:100])\n", "nltk.pos_tag(tokens)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }