next up previous
Next: Preprocessing Up: Increasing the I.Q. of Previous: Introduction


Dataset

We built our dataset from the sent mailbox of our Gmail account. In total, the account contains nearly 1.4 GB of email, approximately 500 MB of which corresponds to sent mail. That volume includes email headers, subject lines, message bodies, attachments, and content typed by others, such as forwarded messages and reply-to text.

We fetched email from Gmail using a Python script and the secure IMAP protocol1. Python includes modules for IMAP, SSL (used to secure IMAP), regular expression processing, and email processing, which we used extensively to build our dataset. Gmail transmitted each message in raw RFC822 format, which our script received and stored in a SQLite database. A second Python script then stripped away all but the message body from each message. A third script computed $n$-grams over the bodies, for $n \in \{1,2,3\}$; Table 1 shows the vocabulary size and number of samples for each $n$-gram set.
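The body-extraction step (the second script above) can be sketched as follows using Python's standard email module. This is our illustrative reconstruction, not the original script: the function name and the sample message are ours, and we assume plain-text bodies with attachments identified by a filename.

```python
import email

def extract_body(raw: bytes) -> str:
    """Return the plain-text body of a raw RFC822 message,
    skipping headers and non-text or attachment parts."""
    msg = email.message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        # Keep only text/plain parts; attachments carry a filename.
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            payload = part.get_payload(decode=True)
            if payload is not None:
                parts.append(payload.decode(errors="replace"))
    return "\n".join(parts)

raw = (b"From: jac@example.com\r\n"
       b"Subject: test\r\n"
       b"Content-Type: text/plain\r\n"
       b"\r\n"
       b"Hello, world.")
print(extract_body(raw))  # prints "Hello, world."
```

Walking the MIME tree with `msg.walk()` handles both single-part and multipart messages uniformly, which matters for real mailboxes where forwarded messages and attachments appear as nested parts.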


Table 1: Dataset statistics for $4978$ emails from our Gmail sent mailbox. The statistics are computed after preprocessing that strips away message headers, attachments, and forwarded or reply-to message components. Preprocessing significantly reduces the size of our dataset, from $\approx 500$ MB to $\approx 6$ MB.
$n$-gram Type    Vocabulary Size    # of Samples
unigram                59386             910812
bigram                268665             905937
trigram               422433             901314
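The per-$n$ statistics in Table 1 (unique vocabulary entries versus total samples) can be computed with a short sketch like the one below. The tokenizer and sample text are our own assumptions; the actual pipeline reads message bodies from the SQLite database.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical body; the real scripts iterate over all stored bodies.
tokens = "the quick brown fox jumps over the lazy dog".split()

for n in (1, 2, 3):
    grams = ngrams(tokens, n)
    vocab = Counter(grams)
    # Vocabulary size = distinct n-grams; # of samples = total occurrences.
    print(n, len(vocab), len(grams))
```

Note that the sample count shrinks by one token per unit increase in $n$, which matches the decreasing sample counts across the rows of Table 1.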




jac 2010-05-11