PS-5
Goal: to build a program, based on Hidden Markov Models, that tackles part of the speech understanding underlying any personal digital assistant.
Background
A part of speech (POS) tagger labels each word in a sentence with its part of speech (noun, verb, etc.). For example:
I fish
Will eats the fish
Will you cook the fish
One cook uses a saw
A saw has many uses
You saw me color a fish
Jobs wore one color
The jobs were mine
The mine has many fish
You can cook many
===>
I/PRO fish/V
Will/NP eats/V the/DET fish/N
Will/MOD you/PRO cook/V the/DET fish/N
One/DET cook/N uses/VD a/DET saw/N
A/DET saw/N has/V many/DET uses/N
You/PRO saw/VD me/PRO color/V a/DET fish/N
Jobs/NP wore/VD one/DET color/N
The/DET jobs/N were/VD mine/PRO
The/DET mine/N has/V many/DET fish/N
You/PRO can/MOD cook/V many/PRO
where the tags are after the slashes, indicating, e.g., that in the first sentence "I" is a pronoun (PRO) and "fish" is a verb (V); in the second sentence, "Will" is a proper noun (NP), "eats" a verb, "the" a determiner (DET), and "fish" a noun (N). As you may notice, I purposely constructed these somewhat strange sentences so that the same word serves as a different part of speech in different places ("fish" as both a verb and a noun, "Will" as both a person's name and the future tense modal, "saw" as both a past tense verb and a tool, etc.). A good tagger makes use of the context to disambiguate the possibilities and determine the correct label. The tagging is in turn a core part of having a "Sudi" understand what a sentence means and how to respond.
The tags that we'll use for this problem set are taken from the book Natural Language Processing with Python:
| Tag | Meaning | Examples |
|---|---|---|
| ADJ | adjective | new, good, high, special, big, local |
| ADV | adverb | really, already, still, early, now |
| CNJ | conjunction | and, or, but, if, while, although |
| DET | determiner | the, a, some, most, every, no |
| EX | existential | there, there's |
| FW | foreign word | dolce, ersatz, esprit, quo, maitre |
| MOD | modal verb | will, can, would, may, must, should |
| N | noun | year, home, costs, time, education |
| NP | proper noun | Alison, Africa, April, Washington |
| NUM | number | twenty-four, fourth, 1991, 14:24 |
| PRO | pronoun | he, their, her, its, my, I, us |
| P | preposition | on, of, at, with, by, into, under |
| TO | the word to | to |
| UH | interjection | ah, bang, ha, whee, hmpf, oops |
| V | verb | is, has, get, do, make, see, run |
| VD | past tense | said, took, told, made, asked |
| VG | present participle | making, going, playing, working |
| VN | past participle | given, taken, begun, sung |
| WH | wh determiner | who, which, when, what, where, how |
In addition, each punctuation symbol has its own tag (written using the same symbol). I left punctuation out of the above example.
We'll be taking the statistical approach to tagging, which means we need data. Fortunately the problem has been studied a great deal, and good datasets exist. One prominent one is the Brown corpus (extra credit: outdo it with a Dartmouth corpus), covering a range of works from news, fiction, science writing, etc. It was annotated with a more complex set of tags, which have subsequently been distilled down to the ones above.
The goal of POS tagging is to take a sequence of words and produce the corresponding sequence of tags.
POS tagging via HMM
We will use a hidden Markov model (HMM) approach, applying the principles covered in class. Recall that in an HMM, the states are the things we don't see (hidden) and are trying to infer, and the observations are what we do see. So the observations are words in a sentence and the states are tags because the text we'll observe is not annotated with its part of speech tag (that is our program's job). We proceed through a model by moving from state to state, producing one observation per state. In this "bigram" model, each tag depends on the previous tag. Then each word depends on the tag. (For extra credit, you can go to a "trigram" model, where each tag depends on the previous two tags.) Let "#" be the tag "before" the start of the sentence.
An HMM is defined by its states (here part of speech tags), transitions (here tag to tag, with weights), and observations (here tag to word, with weights). Here is an HMM that includes tags, transitions, and observations for our example sentences. There is a special start state tag "#" before the start of the sentence. Note that we're not going to force a "stop" state (ending with a period, question mark, or exclamation point) since the texts we'll use from the Brown corpus include headlines that break that rule.
Do you see how the various sentences above follow paths through the HMM? For example, "I fish": transition from start to PRO (score -1.2); observe "I" (score -1.9); transition to V (-0.5); observe "fish" (-2.1). Likewise "Will eats the fish": start to NP (-1.6); "Will" (-0.7); to V (-0.7); "eats" (-2.1); to DET (-0.2); "the" (-1.0); to N (0); "fish" (-1.0).
Of course, we as people know the parts of speech these words play in these sentences, and can figure out which paths are being followed. The goal of tagging is to have an algorithm determine what is the best path through the POS states to have produced the observed words. Given the many different out edges, even in this relatively simple case, along with the ambiguity in how words can be used, a proper implementation of the Viterbi algorithm is required to efficiently find the best path.
It could be fun to take a random walk on the HMM to generate new sentences, mad libs style.
POS Viterbi
Let's assume for now we have the model pictured above, and want to find the best path for a given sentence. We'll soon turn to how to obtain the transition and observation probabilities. The Viterbi algorithm starts at the # (start) state, with a score of 0, before any observation. Then to handle observation i, it propagates forward from each state reached at observation i-1, following each transition. The score for the next state as of observation i is the sum of the score at the current state as of i-1, plus the transition score from current to next, plus the score of observation i in the next state. As with Dijkstra, that score may or may not be better than the score we already have for that state and observation — maybe we could get there from a different state with a better overall score. So we propagate forward from each current state to each next state and check whether we've found something better (as with Dijkstra's relax operation).
Let's start by considering "will eats the fish". Starting from #, we can transition to NP, DET, PRO, or MOD. There's an observation score for "will" only in NP and MOD, so what score do we give for observing it in the others? Maybe we don't want to completely rule out something that we've never seen (e.g., "will" can be a noun too, just not in the sentences used to build this model). So let's just give it a low log probability, call it the "unseen" score. It should be a negative number, perhaps worse than the observed ones but not totally out of the realm. For this example, I'll make it -10. So as of word 0, we could be at:
- DET: 0 (current) -0.9 (transition) -10 (unseen observation) = -10.9
- MOD: 0 (current) -2.3 (transition) -0.7 (observation) = -3.0
- NP: 0 (current) -1.6 (transition) -0.7 (observation) = -2.3
- PRO: 0 (current) -1.2 (transition) -10 (unseen observation) = -11.2
The next word, word 1, is "eats". We consider each transition from each of the current states, adding the current score plus the transition plus the observation. From DET we can go to:
- N: -10.9 (current) +0 (transition) -10 (unseen observation) = -20.9
And from MOD to:
- PRO: -3.0 (current) -0.7 (transition) -10 (unseen observation) = -13.7
- V: -3.0 (current) -0.7 (transition) -2.1 (observation) = -5.8
From NP to:
- V: -2.3 (current) -0.7 (transition) -2.1 (observation) = -5.1
- VD: -2.3 (current) -0.7 (transition) -10 (unseen observation) = -13.0
From PRO to:
- MOD: -11.2 (current) -1.6 (transition) -10 (unseen observation) = -22.8
- V: -11.2 (current) -0.5 (transition) -2.1 (observation) = -13.8
- VD: -11.2 (current) -1.6 (transition) -10 (unseen observation) = -22.8
Note that there are several different ways to get to V for word 1 — from MOD (score -5.8), from NP (score -5.1), or from PRO (score -13.8). We keep just the best score -5.1, along with a backpointer saying that V for word 1 comes from NP for word 0. Similarly VD for word 1 could have come from NP or PRO, and we keep its best score of -13.0 with a backpointer to NP for word 0.
To continue on, I think it's easier to keep track of these things in a table, like the following, where I use strikethrough to indicate scores and backpointers that aren't the best way to get to the next state for the observation. (Apologies if I added anything incorrectly when I manually built this table — that's why we have computers; let me know and I'll fix it.)
| # | observation | next state | curr state | scores: curr + transit + obs | next score |
|---|---|---|---|---|---|
| start | n/a | # | | n/a | 0 |
| 0 | will | DET | # | 0 -0.9 -10 | -10.9 |
| | | MOD | # | 0 -2.3 -0.7 | -3.0 |
| | | NP | # | 0 -1.6 -0.7 | -2.3 |
| | | PRO | # | 0 -1.2 -10 | -11.2 |
| 1 | eats | N | DET | -10.9 +0 -10 | -20.9 |
| | | PRO | MOD | -3.0 -0.7 -10.0 | -13.7 |
| | | V | ~~MOD~~ | ~~-3.0 -0.7 -2.1~~ | |
| | | | ~~PRO~~ | ~~-11.2 -0.5 -2.1~~ | |
| | | | NP | -2.3 -0.7 -2.1 | -5.1 |
| | | VD | NP | -2.3 -0.7 -10 | -13.0 |
| | | | ~~PRO~~ | ~~-11.2 -1.6 -10~~ | |
| | | MOD | PRO | -11.2 -1.6 -10 | -22.8 |
| 2 | the | DET | V | -5.1 -0.2 -1.0 | -6.3 |
| | | | ~~VD~~ | ~~-13.0 -1.1 -1.0~~ | |
| | | MOD | PRO | -13.7 -1.6 -10 | -25.3 |
| | | PRO | V | -5.1 -1.9 -10 | -17.0 |
| | | | ~~VD~~ | ~~-13.0 -0.4 -10~~ | |
| | | | ~~MOD~~ | ~~-22.8 -0.7 -10~~ | |
| | | V | ~~N~~ | ~~-20.9 -0.3 -10~~ | |
| | | | PRO | -13.7 -0.5 -10 | -24.2 |
| | | | ~~MOD~~ | ~~-22.8 -0.7 -10~~ | |
| | | VD | ~~N~~ | ~~-20.9 -1.4 -10~~ | |
| | | | PRO | -13.7 -1.6 -10 | -25.3 |
| 3 | fish | DET | V | -24.2 -0.2 -10 | -34.4 |
| | | | ~~VD~~ | ~~-25.3 -1.1 -10~~ | |
| | | MOD | PRO | -17 -1.6 -10 | -28.6 |
| | | N | DET | -6.3 +0 -1.0 | -7.3 |
| | | PRO | ~~MOD~~ | ~~-25.3 -0.7 -10~~ | |
| | | | ~~V~~ | ~~-24.2 -1.9 -10~~ | |
| | | | VD | -25.3 -0.4 -10 | -35.7 |
| | | V | ~~MOD~~ | ~~-25.3 -0.7 -2.1~~ | |
| | | | PRO | -17.0 -0.5 -2.1 | -19.6 |
| | | VD | PRO | -17.0 -1.6 -10 | -28.6 |
Important: We are not deciding for each word which tag is best. We are deciding for each POS tag that we reach for that word what its best score and preceding state are. This will ultimately yield the best path through the graph to generate the entire sentence, disambiguating the parts of speech of the various observed words along the way.
After working through the sentence, we look at the possible states for the last observation, word 3, to see where we could end up with the best possible score. Here it's N, at -7.3. So we know the tag for "fish" is N. We then backtrace to where this state came from: DET. So now we know the tag for "the" is DET. Backtracing again goes to V, for "eats". Then back to NP, for "will". Then #, so we're done. The most likely tags for "Will eats the fish" are: NP V DET N
Task 1: Creating the HMM (25 pts)
Unfortunately, the magic numbers in the example above are not provided for us and we need to calculate them via training. Let's consider training the pictured model from the example sentences and corresponding tags. Basically, we go through each sentence and count up, for each tag, how many times each word was observed with that tag, and how many times each other tag followed it. We essentially want to fill in tables (or really maps) like these:
Transitions (counts):

| from \ to | DET | MOD | N | NP | PRO | V |
|---|---|---|---|---|---|---|
| # | | 1 | | 1 | 1 | |
| DET | | | 2 | | | |
| MOD | | | | | 1 | |
| NP | | | | | | 1 |
| PRO | | | | | | 2 |
| V | 2 | | | | | |

Observations (counts):

| tag | observed words (counts) |
|---|---|
| DET | the: 2 |
| MOD | will: 1 |
| N | fish: 2 |
| NP | will: 1 |
| PRO | i: 1; you: 1 |
| V | cook: 1; eats: 1; fish: 1 |
The above tables show the frequency counts for transitions and observations after processing the following lines:
i/PRO fish/V
will/NP eats/V the/DET fish/N
will/MOD you/PRO cook/V the/DET fish/N
Finally, we convert these counts to log probabilities, by dividing each by the total for its row and taking the log. With only the limited experience of 3 sentences, this HMM has seen only one type of transition from each state other than the start state, and has only a limited vocabulary of one or two words for each state. So the transition probability from # to each of MOD, NP, and PRO is log (1/3), while that from DET to N is log (2/2) = 0. Likewise, the observation probability of each of "cook", "eats", and "fish" in V is log (1/3), while that of "i" and "you" in PRO is log (1/2). But if we trained it on the rest of the example sentences, we'd have a somewhat more interesting set of transition and observation counts, whose conversion yields the numbers I gave on the figure (rounded off).
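As a quick sanity check of the arithmetic above, here is a tiny snippet computing those log probabilities (the class and method names are just for illustration; like the sample solution, it uses natural log):

```java
public class LogProbCheck {
    // Log probability of a count, given the total count for its row (natural log).
    public static double logProb(int count, int rowTotal) {
        return Math.log((double) count / rowTotal);
    }

    public static void main(String[] args) {
        System.out.println(logProb(1, 3)); // # -> MOD, NP, or PRO: log(1/3), about -1.10
        System.out.println(logProb(2, 2)); // DET -> N: log(2/2) = 0.0
        System.out.println(logProb(1, 2)); // "i" or "you" in PRO: log(1/2), about -0.69
    }
}
```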
Task 1.1: Generating the transition & observations probabilities (20 pts)
The sample solution uses one file for the entire problem set, but divide up your code into classes and functions as you see fit.
While we think of the model as a graph, you need not use the Graph class, and you might in fact find it easier just to keep your own Maps of transition and observation probabilities as we did with finite automata. Recall that you can nest maps (the value for one key is itself a map). The finite automata code might be inspiring, but remember the differences (e.g., HMM observations are on states rather than edges; everything has a log-probability associated with it).
Given a list of words and corresponding list of tags, store in your Maps a count of observations (how many times a tag is matched to a certain word) and transitions (how many times a tag is followed by another tag). Then convert all counts to log-probabilities as described above. (fwiw, the sample solution uses natural log, not log10.)
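To make the idea concrete, here is one possible sketch of the counting pass and the conversion to log probabilities, using nested Maps as suggested (all names here are illustrative, not required; doubles are used for counts so the same maps can later hold log probabilities):

```java
import java.util.*;

public class Training {
    // transitions.get(tag1).get(tag2) = # times tag2 followed tag1 ("#" is the start tag)
    // observations.get(tag).get(word) = # times word was observed with tag
    public static Map<String, Map<String, Double>> transitions = new HashMap<>();
    public static Map<String, Map<String, Double>> observations = new HashMap<>();

    // Bump a count in a nested map, creating the inner map as needed.
    static void bump(Map<String, Map<String, Double>> map, String outer, String inner) {
        map.putIfAbsent(outer, new HashMap<>());
        Map<String, Double> row = map.get(outer);
        row.put(inner, row.getOrDefault(inner, 0.0) + 1);
    }

    // Count one sentence's words and tags, then normalize everything to log probabilities.
    public static void train(List<String> words, List<String> tags) {
        String prev = "#";  // the tag "before" the start of the sentence
        for (int i = 0; i < words.size(); i++) {
            bump(observations, tags.get(i), words.get(i));
            bump(transitions, prev, tags.get(i));
            prev = tags.get(i);
        }
        // Convert each row of counts to log probabilities (count / row total, then log).
        for (Map<String, Map<String, Double>> map : List.of(transitions, observations)) {
            for (Map<String, Double> row : map.values()) {
                double total = 0;
                for (double c : row.values()) total += c;
                for (String key : row.keySet()) row.put(key, Math.log(row.get(key) / total));
            }
        }
    }
}
```

(In a real solution you'd likely count over all sentences first and normalize once at the end; this sketch normalizes right away just to keep it short.)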
Task 1.2: File input (5 pts)
Provided in texts.zip are pairs of files: one file contains the sentences and the other file contains the tags for those sentences. Note these files are nicely formatted such that if you split a line by the space character, String[] tokens = line.split(" "), the i-th token in one file corresponds to the i-th tag in the other file, even for punctuation.
Write a function to read one of these files and turn it into a list or array of tokens (you probably want to lowercase all words as you read them). You can then generate an HMM by calling your function from Task 1.1 on a list of tokens and a list of their tags.
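A possible sketch of such a reader (file and method names are placeholders); it flattens a file into one token list, lowercasing as it goes:

```java
import java.io.*;
import java.util.*;

public class SentenceReader {
    // Split one line on spaces into lowercased tokens, skipping empty strings.
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split(" ")) {
            if (!token.isEmpty()) tokens.add(token.toLowerCase());
        }
        return tokens;
    }

    // Read a whole file into a flat list of lowercased tokens.
    public static List<String> readTokens(String filename) {
        List<String> tokens = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = in.readLine()) != null) {
                tokens.addAll(tokenize(line));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return tokens;
    }
}
```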
Task 2: POS tagging via HMM (40 pts)
Task 2.1: Viterbi (30 pts)
Use the Viterbi algorithm as described above and your trained HMM to find the best POS tags for a given input sentence. Note that we do not want to store all possible paths and scores as we traverse from state to state, as the branching would consume a large amount of memory.
Here again is the pseudocode from the lecture notes:
    currStates = { start }
    currScores = map { start=0 }
    backtrace = empty list
    for i from 0 to # observations - 1
        nextStates = {}
        nextScores = empty map
        backtrace[i] = empty map
        for each currState in currStates
            for each transition currState -> nextState
                add nextState to nextStates
                nextScore = currScores[currState] +
                            transitionScore(currState -> nextState) +
                            observationScore(observations[i] in nextState)
                if nextState isn't in nextScores or nextScore > nextScores[nextState]
                    set nextScores[nextState] to nextScore
                    set backtrace[i][nextState] to currState
        currStates = nextStates
        currScores = nextScores
Remember to use a constant variable for the observation score of an unobserved word, and play with its value.
Once the above algorithm is complete, you can check which state in currStates has the highest score, then backtrace from that state all the way back to the start. Note that if we index the observations by numbers, as suggested here, the backtrace representation is essentially a list of maps.
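Assuming the model is stored in nested Maps of log probabilities as in Task 1, the pseudocode might translate into Java roughly as follows (a sketch, not the required implementation; names are illustrative, and the constant U stands in for the unseen-observation penalty):

```java
import java.util.*;

public class Viterbi {
    static final double U = -10;  // unseen-word penalty; play with this value

    // Find the best tag sequence for the observations, given transition and
    // observation maps of log probabilities ("#" is the start state).
    public static List<String> tag(String[] observations,
                                   Map<String, Map<String, Double>> transitions,
                                   Map<String, Map<String, Double>> emissions) {
        Set<String> currStates = Set.of("#");
        Map<String, Double> currScores = Map.of("#", 0.0);
        List<Map<String, String>> backtrace = new ArrayList<>();

        for (String obs : observations) {
            Set<String> nextStates = new HashSet<>();
            Map<String, Double> nextScores = new HashMap<>();
            Map<String, String> back = new HashMap<>();
            for (String curr : currStates) {
                if (!transitions.containsKey(curr)) continue;  // dead-end state
                for (String next : transitions.get(curr).keySet()) {
                    nextStates.add(next);
                    double score = currScores.get(curr)
                            + transitions.get(curr).get(next)
                            + emissions.getOrDefault(next, Map.of()).getOrDefault(obs, U);
                    // Keep only the best way to reach next for this observation (relax).
                    if (!nextScores.containsKey(next) || score > nextScores.get(next)) {
                        nextScores.put(next, score);
                        back.put(next, curr);
                    }
                }
            }
            backtrace.add(back);
            currStates = nextStates;
            currScores = nextScores;
        }

        // Find the best final state, then walk the backpointers to recover the tags.
        String best = null;
        for (String s : currStates)
            if (best == null || currScores.get(s) > currScores.get(best)) best = s;
        LinkedList<String> tags = new LinkedList<>();
        for (int i = observations.length - 1; i >= 0; i--) {
            tags.addFirst(best);
            best = backtrace.get(i).get(best);
        }
        return tags;
    }
}
```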
Task 2.2: Testing (10 pts)
To assess how good a model is, we can compute how many tags it gets right and how many it gets wrong on some test sentences. It wouldn't be fair to test it on the sentences that we used to train it (though if it did poorly on those, we'd be worried).
Generate an HMM using the "train" files, then apply your HMM to a "test" sentences file and compare the results to the corresponding "test" tags file. Count the numbers of correct vs. incorrect tags.
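The comparison itself is just a parallel walk down the predicted and gold-standard tag lists (a minimal sketch; names are placeholders):

```java
import java.util.List;

public class Accuracy {
    // Count how many predicted tags match the gold-standard tags, position by position.
    public static int countCorrect(List<String> predicted, List<String> gold) {
        int correct = 0;
        for (int i = 0; i < predicted.size(); i++) {
            if (predicted.get(i).equals(gold.get(i))) correct++;
        }
        return correct;
    }
}
```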
The sample solution got 32 tags right and 5 wrong for the "simple" tests and 35109 right vs. 1285 wrong for the "brown" tests, with an unseen word penalty of -100 (Note: these are using natural log; if you use log10 there might be some differences).
In a short report, discuss your overall testing performance, and how it depends on the unseen word penalty (and any other parameters you used).
Extra Credit
- [10 pts] Move from bigrams to trigrams (i.e., a tag depends on the previous two tags). Compare performance.
- [10 pts] Use your model generatively. This can be done randomly (essentially following a path through the HMM, making choices according to the probabilities), or predictively (identify what the best next word would be to continue a sentence).
- [10 pts] Use a different corpus; e.g., from phonemes to morphemes, or from one language to another.
- [10 pts] Be more careful with things that haven't been observed; this is even more necessary with trigrams. One approach is "interpolation": take a weighted sum of the unigram, bigram, and trigram probabilities, so that the less-informative but more-common lower-order terms can still contribute.
- [10 pts] Perform cross-validation. Randomly split the data into different partitions (say 5 different groups, or "folds"). Set aside part 0, train on parts 1-4, and then test on part 0. Then set aside part 1, train on parts 0 & 2-4, and test on 1. Etc. So each part is used as a test set, with the other parts used in training, to construct the model. Note that the Brown corpus is organized by genre (news, fiction, etc.), so deal out the rows among the partitions instead of assigning the first n/5 to one group, the second n/5 to the next, etc.
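Dealing the lines out round-robin, as suggested for the genre-organized Brown corpus, might be sketched like this (names are illustrative):

```java
import java.util.*;

public class Folds {
    // Deal items round-robin into numFolds groups, so that consecutive lines
    // (which share a genre in the Brown corpus) get spread across folds.
    public static List<List<String>> deal(List<String> lines, int numFolds) {
        List<List<String>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < lines.size(); i++) {
            folds.get(i % numFolds).add(lines.get(i));
        }
        return folds;
    }
}
```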
What to Submit
Turn in the following in a single zip file:
- all your code, thoroughly documented;
- and your short report (testing on hard-coded graphs, training/testing on provided datasets) in a document (in PDF format).
Grading rubric
Total of 100 points
Correctness (70 points)
| 5 | Loading data, splitting into lower-case words |
|---|---|
| 15 | Training: counting transitions and observations |
| 5 | Training: log probabilities |
| 15 | Viterbi: computing scores and maintaining back pointers |
| 5 | Viterbi: start; best end |
| 10 | Viterbi: backtrace |
| 7 | Console-driven testing |
| 8 | File-driven testing |
Structure (10 points)
| 4 | Good decomposition of, and within, methods |
|---|---|
| 3 | Proper use of instance and local variables |
| 3 | Proper use of parameters |
Style (10 points)
| 3 | Comments included within methods and for new methods |
|---|---|
| 4 | Consistent and informative names for variables, methods, parameters |
| 3 | Clean and consistent layout (blank lines, indentation, no line wraps, etc.) |
Testing (10 points)
| 5 | Example sentences and discussion |
|---|---|
| 5 | Testing performance and discussion |
Acknowledgment
Thanks to Professor Chris Bailey-Kellogg for a rewrite of this problem that greatly clarifies the details around scoring. Also thanks to Sravana Reddy for consulting in the development of the problem set, providing helpful guidance and discussion as well as the processed dataset.