{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK: Natural Language Made Easy\n", "\n", "Dealing with text is hard! Thankfully, it's hard for everyone, so tools exist to make it easier.\n", "\n", "NLTK, the Natural Language Toolkit, is a python package \"for building Python programs to work with human language data\". It has many tools for basic language processing (e.g. tokenization, $n$-grams, etc.) as well as tools for more complicated language processing (e.g. part of speech tagging, parse trees, etc.).\n", "\n", "NLTK has an [associated book about NLP](http://www.nltk.org/book/) that provides some context for the corpora and models.\n", "\n", "## Installing NLTK, or \"why do I need to download so much data?\"\n", "We can `conda install nltk` to get the package. Then we need to do something somewhat strange: we have to download data.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "#nltk.download()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This pops up a GUI where we can choose what data to download.\n", "\n", "---\n", "![nltk download](fig/nltk_download.png)\n", "\n", "---\n", "\n", "What is this stuff? The data is separated into two categories:\n", "\n", "1. Corpora\n", " - These data are a set of collections of text.\n", "1. Models\n", " - These are data (e.g. weights, etc.) for trained models.\n", "\n", "NLTK provides several collections of data to make installing easier.\n", "\n", "- `all`: All corpora and models\n", "- `all-corpora`: All corpora, no models\n", "- `all-nltk`: Everything plus more data from the website\n", "- `book`: Data to run the associated book\n", "- `popular`: The most popular packages\n", "- `third-party`: Extra data from third parties\n", "\n", "Downloading the `popular` collection is recommended.\n", "\n", "## Analyzing tweets\n", "### First pass\n", "Let's take a look at one corpus in particular: positive and negative tweets." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read some twitter data\n", "neg_id = nltk.corpus.twitter_samples.fileids()[0]\n", "neg_tweets = nltk.corpus.twitter_samples.strings(neg_id)\n", "pos_id = nltk.corpus.twitter_samples.fileids()[1]\n", "pos_tweets = nltk.corpus.twitter_samples.strings(pos_id)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#FollowFriday @wncer1 @Defense_gouv for being top influencers in my community this week :)\n", "\n", "I have a really good m&g idea but I'm never going to meet them :(((\n" ] } ], "source": [ "print(pos_tweets[10])\n", "print()\n", "print(neg_tweets[10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the language in positive and negative tweets differ?\n", "\n", "We can start by looking at how the words differ. NLTK provides tools for tokenization." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets1(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " tokens = [token for tweet in tweets for token in nltk.word_tokenize(tweet)]\n", " return(tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does this output?" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['#', 'FollowFriday', '@', 'France_Inte', '@', 'PKuchly57', '@', 'Milipol_Paris', 'for', 'being']\n" ] } ], "source": [ "pos_tokens = tokenize_tweets1(pos_tweets)\n", "neg_tokens = tokenize_tweets1(neg_tweets)\n", "print(pos_tokens[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the most common words (like in the first homework) using Python's Counter class." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import Counter\n", "\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':', 6667),\n", " (')', 5165),\n", " ('@', 5119),\n", " ('!', 1920),\n", " ('you', 1427),\n", " ('.', 1323),\n", " ('#', 1292),\n", " ('I', 1176),\n", " ('to', 1063),\n", " ('the', 997),\n", " (',', 964),\n", " ('a', 881),\n", " ('-', 863),\n", " ('http', 856),\n", " ('for', 749),\n", " ('D', 662),\n", " ('and', 656),\n", " ('?', 582),\n", " ('it', 566),\n", " ('my', 484)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('(', 7076),\n", " (':', 5959),\n", " ('@', 3181),\n", " ('I', 1986),\n", " ('.', 1078),\n", " ('to', 1067),\n", " ('#', 913),\n", " ('!', 895),\n", " ('the', 846),\n", " (',', 733),\n", " ('you', 707),\n", " ('i', 684),\n", " ('?', 650),\n", " ('my', 629),\n", " ('a', 626),\n", " (\"n't\", 614),\n", " ('and', 613),\n", " ('-', 600),\n", " ('it', 591),\n", " ('me', 520)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two most common tokens for postiive tweets are \":\" and \")\" and the tweo most common tokens for negative tweets are \"(\" and \":\". These are smiley and frowny faces! The basic word tokenizer is treating these as separate tokens, which makes sense in most cases but not for text from social media.\n", "\n", "### A better tokenizer\n", "We're not the first people to see this problem, and NLTK actually has a wide set of tokenizers in the [`nltk.tokenizer` module](http://www.nltk.org/api/nltk.tokenize.html). In particular, there's a [tokenizer that's optimized for tweets](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets2(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)\n", " tokens = [token for tweet in tweets for token in twt.tokenize(tweet)]\n", " return(tokens)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pos_tokens = tokenize_tweets2(pos_tweets)\n", "neg_tokens = tokenize_tweets2(neg_tweets)\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':)', 3691),\n", " ('!', 1844),\n", " ('you', 1341),\n", " ('.', 1341),\n", " ('to', 1065),\n", " ('the', 999),\n", " (',', 964),\n", " ('I', 890),\n", " ('a', 888),\n", " ('for', 749),\n", " (':-)', 701),\n", " ('and', 660),\n", " (':D', 658),\n", " ('?', 581),\n", " (')', 525),\n", " ('my', 484),\n", " ('in', 481),\n", " ('it', 460),\n", " ('is', 418),\n", " ('of', 403)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':(', 4585),\n", " ('I', 1587),\n", " ('(', 1180),\n", " ('.', 1092),\n", " ('to', 1068),\n", " ('the', 846),\n", " ('!', 831),\n", " (',', 734),\n", " ('you', 660),\n", " ('?', 644),\n", " ('my', 629),\n", " ('a', 627),\n", " ('i', 620),\n", " ('and', 614),\n", " ('me', 524),\n", " (':-(', 501),\n", " ('so', 466),\n", " ('is', 456),\n", " ('it', 449),\n", " ('in', 421)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Much better! This tokenizer got rid of twitter handles for us, so no more \"@\" tokens, and catches emoticons. However, there are still some questions:\n", "\n", "1. Should we count a capitalized word differently from a non-capitalized word? e.g. should \"Thanks\" be different from \"thanks\"?\n", "1. Do we want to be counting punctuation?\n", "1. Do we want to count words like \"I\", \"me\", etc.?\n", "\n", "Using a combination of NLTK and basic Python string tools we can address these concerns.\n", "\n", "We can easily take a string and get a lowercase version of it." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'this is a crazy string'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"ThIS IS a cRaZy sTRing\".lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `string` module in base Python has a set of punctuation for the latin alphabet." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "\n", "string.punctuation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK has a collection of \"stop words\" for many languages, including English. This is one of the corpora we downloaded." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['i',\n", " 'me',\n", " 'my',\n", " 'myself',\n", " 'we',\n", " 'our',\n", " 'ours',\n", " 'ourselves',\n", " 'you',\n", " 'your',\n", " 'yours',\n", " 'yourself',\n", " 'yourselves',\n", " 'he',\n", " 'him',\n", " 'his',\n", " 'himself',\n", " 'she',\n", " 'her',\n", " 'hers']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "\n", "stopwords.words(\"english\")[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can combine all of these into our tokenizer" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets3(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)\n", " # combine stop words and punctuation\n", " stop = stopwords.words(\"english\") + list(string.punctuation)\n", " # filter out stop words and punctuation and send to lower case\n", " tokens = [token.lower() for tweet in tweets \n", " for token in twt.tokenize(tweet) \n", " if token.lower() not in stop]\n", " return(tokens)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pos_tokens = tokenize_tweets3(pos_tweets)\n", "neg_tokens = tokenize_tweets3(neg_tweets)\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':)', 3691),\n", " (':-)', 701),\n", " (':d', 658),\n", " ('thanks', 392),\n", " ('follow', 304),\n", " ('...', 290),\n", " ('love', 273),\n", " ('thank', 247),\n", " ('u', 245),\n", " ('good', 234),\n", " ('like', 218),\n", " ('day', 209),\n", " ('happy', 191),\n", " (\"i'm\", 183),\n", " ('hi', 173),\n", " ('great', 172),\n", " ('get', 168),\n", " ('see', 167),\n", " ('back', 162),\n", " (\"it's\", 162)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':(', 4585),\n", " (':-(', 501),\n", " (\"i'm\", 343),\n", " ('...', 332),\n", " ('please', 274),\n", " ('miss', 238),\n", " ('want', 218),\n", " ('♛', 210),\n", " ('》', 210),\n", " ('like', 206),\n", " ('u', 193),\n", " ('get', 180),\n", " (\"can't\", 180),\n", " (\"it's\", 178),\n", " (\"don't\", 176),\n", " ('sorry', 149),\n", " ('one', 144),\n", " ('follow', 142),\n", " ('time', 141),\n", " ('much', 139)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional processing\n", "How we pre-process text is very important. NLTK provides more tools for pre-processing.\n", "\n", "One popular method of pre-processing is **stemming**. The idea here is to find the \"root\" of each word. 
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'actual'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import PorterStemmer\n", "\n", "stemmer = PorterStemmer()\n", "stemmer.stem(\"actually\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this always work how we want?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pleas pleas\n" ] } ], "source": [ "print(stemmer.stem(\"please\"), stemmer.stem(\"pleasing\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's update the tokenizer" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_tweets4(tweets):\n", " \"\"\"Get all of the tokens in a set of tweets\"\"\"\n", " twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)\n", " # combine stop words and punctuation\n", " stop = stopwords.words(\"english\") + list(string.punctuation)\n", " # create the stemmer\n", " stemmer = PorterStemmer()\n", " # filter out stop words and punctuation and send to lower case\n", " tokens = [ stemmer.stem(token) for tweet in tweets \n", " for token in twt.tokenize(tweet) \n", " if token.lower() not in stop]\n", " return(tokens)\n", "pos_tokens = tokenize_tweets4(pos_tweets)\n", "neg_tokens = tokenize_tweets4(neg_tweets)\n", "pos_count = Counter(pos_tokens)\n", "neg_count = Counter(neg_tokens)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':)', 3691),\n", " (':-)', 701),\n", " (':D', 658),\n", " ('thank', 643),\n", " ('follow', 443),\n", " ('love', 398),\n", " ('...', 290),\n", " ('day', 245),\n", " ('good', 238),\n", " ('like', 232),\n", " ('u', 228),\n", " ('get', 209),\n", " ('happi', 206),\n", " ('see', 186),\n", " (\"i'm\", 183),\n", " ('great', 172),\n", " ('back', 163),\n", " (\"it'\", 162),\n", " ('know', 155),\n", " ('new', 153)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos_count.most_common(20)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(':(', 4585),\n", " (':-(', 501),\n", " (\"i'm\", 343),\n", " ('...', 332),\n", " ('miss', 301),\n", " ('pleas', 275),\n", " ('follow', 263),\n", " ('want', 246),\n", " ('get', 233),\n", " ('like', 223),\n", " ('go', 218),\n", " ('♛', 210),\n", " ('》', 210),\n", " (\"can't\", 180),\n", " (\"it'\", 178),\n", " (\"don't\", 176),\n", " ('time', 166),\n", " ('u', 164),\n", " ('feel', 158),\n", " ('love', 151)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neg_count.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Runtime and optimizations\n", "How does the runtime change as we add all of these complications?" 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "small_twt = pos_tweets[:2000]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 344 ms, sys: 0 ns, total: 344 ms\n", "Wall time: 340 ms\n" ] } ], "source": [ "%%time\n", "# Base NLTK tokenizer\n", "_ = tokenize_tweets1(small_twt)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 78.1 ms, sys: 15.6 ms, total: 93.8 ms\n", "Wall time: 84.8 ms\n" ] } ], "source": [ "%%time\n", "# Twitter optimized tokenizer\n", "_ = tokenize_tweets2(small_twt)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 141 ms, sys: 0 ns, total: 141 ms\n", "Wall time: 131 ms\n" ] } ], "source": [ "%%time\n", "# Get rid of stop words and lowercase\n", "_ = tokenize_tweets3(small_twt)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 344 ms, sys: 15.6 ms, total: 359 ms\n", "Wall time: 352 ms\n" ] } ], "source": [ "%%time\n", "# Also stemming\n", "_ = tokenize_tweets4(small_twt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Takeaways:\n", "- The general NLTK word tokenizer works on many problems, but that generality makes it slow\n", " - Using a tokenizer optimized to your problem will be faster\n", "- Adding more and more complications adds more and more time\n", " - Sometimes need to work to optimize these also\n", "\n", "This optimization really does matter. Here's a \"fast\" version of tokenization made for a specific project." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "\n", "def word_tokenize(words):\n", " \"\"\"Faster word tokenization than nltk.word_tokenize\n", " Input:\n", " words: a string to be tokenized\n", " Output:\n", " tokens: tokenized words\n", " \"\"\"\n", " tokens = re.findall(r\"[a-z]+-?[a-z]+\", words.lower())\n", " return(tokens)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "small_twt = \" \".join(pos_tweets[:10000])\n", "twt = nltk.tokenize.TweetTokenizer(strip_handles=True, reduce_len=True)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 500 ms, sys: 15.6 ms, total: 516 ms\n", "Wall time: 507 ms\n" ] } ], "source": [ "%%time\n", "_ = nltk.word_tokenize(small_twt)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 188 ms, sys: 15.6 ms, total: 203 ms\n", "Wall time: 194 ms\n" ] } ], "source": [ "%%time\n", "_ = twt.tokenize(small_twt)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 31.2 ms, sys: 0 ns, total: 31.2 ms\n", "Wall time: 26.6 ms\n" ] } ], "source": [ "%%time\n", "_ = word_tokenize(small_twt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that optimizing our tokenization can really help the speed. But this tokenizer isn't optimized for this problem. 
, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('you', 1591),\n", " ('co', 1196),\n", " ('the', 1096),\n", " ('to', 1094),\n", " ('http', 856),\n", " ('for', 772),\n", " ('and', 706),\n", " ('it', 681),\n", " ('my', 560),\n", " ('in', 505),\n", " ('have', 436),\n", " ('is', 434),\n", " ('of', 413),\n", " ('thanks', 393),\n", " ('me', 364),\n", " ('that', 343),\n", " ('https', 336),\n", " ('your', 333),\n", " ('on', 326),\n", " ('follow', 308)]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(word_tokenize(small_twt)).most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we see that NLTK has some pros and cons:\n", "- Pros\n", " - Easy to use\n", " - Fast enough for a one-off analysis on small(ish) data\n", " - Great when (time to code solution) > (time to run NLTK)\n", "- Cons\n", " - Much slower than optimized solutions\n", " - You really feel the crunch on larger corpora or large analyses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More involved processing\n", "NLTK has many other modules to perform more complicated text processing.\n", "\n", "For example, we can get the part of speech for each word in a sentence (here tokenized with our fast tokenizer from above)." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('followfriday', 'JJ'),\n", " ('france', 'NN'),\n", " ('inte', 'NN'),\n", " ('pkuchly', 'RB'),\n", " ('milipol', 'JJ'),\n", " ('paris', 'NN'),\n", " ('for', 'IN'),\n", " ('being', 'VBG'),\n", " ('top', 'JJ'),\n", " ('engaged', 'VBN'),\n", " ('members', 'NNS'),\n", " ('in', 'IN'),\n", " ('my', 'PRP$'),\n", " ('community', 'NN')]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = word_tokenize(small_twt[:100])\n", "nltk.pos_tag(tokens)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }