

Approach

We will employ a Markov model to compute $ P(w_n\vert w_1\cdots w_{n-1})$ , where $ w_n$ is the $ n$ th word of an $ n$ -gram. The literature contains many methods for estimating the likelihood of the $ n$ th word--an initial literature search suggests that this is a classic problem. A popular textbook in natural language processing (NLP) begins by describing maximum likelihood estimation and its pitfalls [2]. The book then moves to more complex models that account for the probability of words not found in the training corpus. We will explore some of these models; in particular, the Good-Turing estimator looks promising (at least in the contexts described there).
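
To make the estimation step concrete, the following Python sketch builds bigram counts by maximum likelihood and applies a simple Good-Turing adjustment, replacing each raw count $ r$ with $ r^* = (r+1) N_{r+1}/N_r$ , where $ N_r$ is the number of bigrams seen exactly $ r$ times. This is only an illustration under a bigram assumption; the function names and the fallback to the raw count when $ N_{r+1} = 0$ are simplifications of our own, not taken from the cited textbook or toolkit.

from collections import Counter

def bigram_counts(tokens):
    """Count bigrams (w1, w2) in a token sequence (maximum likelihood counts)."""
    return Counter(zip(tokens, tokens[1:]))

def good_turing_adjusted(counts):
    """Replace each raw count r with the Good-Turing estimate
    r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of bigrams
    seen exactly r times. Falls back to the raw count when N_{r+1} is
    zero (a simplification for this sketch)."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for ngram, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

def next_word_probs(history_word, counts):
    """P(w | history_word), normalizing over observed continuations only."""
    relevant = {w2: c for (w1, w2), c in counts.items() if w1 == history_word}
    total = sum(relevant.values())
    return {w: c / total for w, c in relevant.items()} if total else {}

# Example usage on a toy corpus.
tokens = "the cat sat on the mat the cat ran".split()
probs = next_word_probs("the", good_turing_adjusted(bigram_counts(tokens)))
print(sorted(probs.items(), key=lambda kv: -kv[1]))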

Time permitting, we will extend the model to a more complex formulation that accounts for the message subject and the message recipient set. We conjecture that both the subject and the recipient set have a telling influence on the next-word estimate--in practice, people tend to discuss certain topics with particular groups of people.
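
As a rough illustration of how such conditioning could work, the sketch below keeps a separate bigram table for each (recipient set, subject topic) context and backs off to a global table when the context provides no evidence for the current history word. The class and its interface are hypothetical, shown only to make the idea concrete.

from collections import Counter, defaultdict

class ContextualBigramModel:
    """Illustrative sketch: bigram counts kept per (recipients, subject topic)
    context, with a back-off to a global bigram table when a context has
    no evidence for the current history word."""

    def __init__(self):
        self.global_counts = Counter()
        self.context_counts = defaultdict(Counter)

    def train(self, tokens, recipients, subject_topic):
        context = (frozenset(recipients), subject_topic)
        for bigram in zip(tokens, tokens[1:]):
            self.global_counts[bigram] += 1
            self.context_counts[context][bigram] += 1

    def predict(self, history_word, recipients, subject_topic):
        context = (frozenset(recipients), subject_topic)
        for table in (self.context_counts[context], self.global_counts):
            candidates = {w2: c for (w1, w2), c in table.items()
                          if w1 == history_word}
            if candidates:
                total = sum(candidates.values())
                return {w: c / total for w, c in candidates.items()}
        return {}

# Hypothetical usage: messages to a "ski-club" list about a "trip" share
# vocabulary that a purely global model would dilute.
model = ContextualBigramModel()
model.train("we should book the cabin soon".split(), ["ski-club"], "trip")
model.train("the report is due friday".split(), ["boss"], "work")
print(model.predict("the", ["ski-club"], "trip"))   # favors "cabin"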

As a further step in complexity, we will adjust the model to respond to individually typed letters. When a prediction is incorrect, the user will begin typing the intended word, and the typed letters, in combination with the history, provide useful evidence about the correct word.
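
A minimal sketch of this letter-by-letter refinement, assuming we already have a history-based next-word distribution: candidates inconsistent with the typed prefix are discarded and the remaining probabilities are renormalized.

def filter_by_prefix(probs, typed_prefix):
    """Restrict a next-word distribution to words consistent with the
    letters the user has typed so far, then renormalize. `probs` maps
    candidate words to probabilities."""
    consistent = {w: p for w, p in probs.items() if w.startswith(typed_prefix)}
    total = sum(consistent.values())
    return {w: p / total for w, p in consistent.items()} if total else {}

# After a wrong prediction the user starts typing "ca"; the history-based
# distribution below is purely illustrative.
history_probs = {"cat": 0.5, "car": 0.2, "dog": 0.3}
print(filter_by_prefix(history_probs, "ca"))  # {'cat': ~0.71, 'car': ~0.29}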

The tutorial on hidden Markov models [4] and chapter 9 of the NLP textbook [2] should apply to these more complex problem formulations. The ``Statistical Language Modeling Toolkit'' might provide useful NLP facilities for efficiently constructing $ n$ -grams [1].

