

Datasets

We'll use the $ > 1$ GB of email from our Gmail account to train and test our model. First, we'll parse the email corpus to collect subjects, recipient sets, and message text. Then we'll strip punctuation from the text and split the corpus 90/10: $ 90\%$ for training our model and $ 10\%$ for testing. Finally, we'll compute $ n$-grams over the training set for $ n\le 3$ and use these $ n$-grams to estimate next-word likelihood within our model.
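
A minimal sketch of this pipeline, assuming the Gmail corpus has been exported to a local mbox file; the file path, helper names, and shuffling seed below are illustrative assumptions rather than part of the proposal:

import mailbox
import random
import string
from collections import Counter

def parse_corpus(mbox_path):
    """Collect (subject, recipient set, body text) triples from an mbox file."""
    messages = []
    for msg in mailbox.mbox(mbox_path):
        subject = msg.get('Subject', '') or ''
        recipients = {addr.strip() for addr in (msg.get('To', '') or '').split(',')
                      if addr.strip()}
        body = ''
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == 'text/plain':
                    payload = part.get_payload(decode=True)
                    if payload:
                        body += payload.decode('utf-8', errors='ignore')
        else:
            payload = msg.get_payload(decode=True)
            if payload:
                body = payload.decode('utf-8', errors='ignore')
        messages.append((subject, recipients, body))
    return messages

def tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table).lower().split()

def split_corpus(messages, train_frac=0.9, seed=0):
    """Shuffle and split the corpus 90/10 into training and test sets."""
    random.Random(seed).shuffle(messages)
    cut = int(len(messages) * train_frac)
    return messages[:cut], messages[cut:]

def ngram_counts(token_lists, max_n=3):
    """Count all n-grams with n <= max_n over the training tokens."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def next_word_likelihood(counts, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    denom = counts[context]
    return counts[context + (word,)] / denom if denom else 0.0

if __name__ == '__main__':
    # 'gmail_export.mbox' is a placeholder for the exported Gmail archive.
    train, test = split_corpus(parse_corpus('gmail_export.mbox'))
    counts = ngram_counts([tokenize(body) for _, _, body in train])
    print(next_word_likelihood(counts, ('see', 'you'), 'tomorrow'))

This uses unsmoothed relative-frequency estimates for next-word likelihood; the actual model would likely need some form of smoothing or backoff for unseen contexts.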
