Instructor: Prof. Dekang Lin
Due Date: 11:59pm Wednesday December 3, 2003
Objectives: The following exercises are intended to further your understanding of Hidden Markov and Natural Language Processing.
Guidelines: Please read these carefully. Not following these overall guidelines may result in point penalty (lower grade) for the assignment.
Total Points: 120
In this part of the assignment, you are to experiment with a C++ HMM Package written by Dekang Lin. The package contains three executables:
| Executable | Purpose |
| vit | Given an HMM and an observation sequence, compute the sequence of the hidden states that has the highest probability using the Viterbi algorithm. |
| genseq | Generate an observation sequence using a HMM model. |
| trainhmm | Using a collection of observation sequences to train a HMM model using the Baum-Welch algorithm. |
The three executables share the same implementation of HMM in hmm.h and hmm.cpp. The executables can be computed by typing the make command in the hmm directory.
A HMM is specified in two files: NAME.trans and NAME.emit where NAME is the name of the HMM. The file NAME.trans contains the transition probabilities between the states. The file NAME.emit contains the emission probabilities. Normally, HMM also needs a set of initial probabilities of the states. We treat them as part of the transition probabilities by adding a special initial state. The transition probabilities from the special initial state to other states correspond to the initial probabilities defined in the Russell and Norvig textbook.

The figure above is the HMM in Figure 15.20 in [Russell&Norvig] (p.573). It can be specified with the files phone.trans and phone.emit in the phone directory.
phone.trans:
INIT INIT Onset 1 Onset Onset 0.3 Onset Mid 0.7 Mid Mid 0.9 Mid End 0.1 End End 0.4 End FINAL 0.6
phone.emit:
Onset C1 0.5 Onset C2 0.2 Onset C3 0.3 Mid C3 0.2 Mid C4 0.7 Mid C5 0.1 End C4 0.1 End C6 0.5 End C7 0.4
The first line of the phone.trans file is the name of the initial state. Each of the subsequent lines is a transition probability. For example, P(Mid|Onset)=0.7, P(C4|End)=0.1. The transition and emission probabilities not listed in these two files are treated as zeros.
The vit program takes the name of a HMM as the command-line argument. It then reads sequences of observations from the standard input and prints the most probable sequences of states as well as their probability on the standard output. For example, the file phone.input contains the observation sequence:
C1 C2 C3 C4 C4 C6 C7 C2 C2 C5 C4 C4 C6 C6
The output of the command
../hmm/vit phone < phone.input
is the following:
P(path)=0.625286 path: C1 Onset C2 Onset C3 Mid C4 Mid C4 Mid C6 End C7 End P(path)=0.936748 path: C2 Onset C2 Onset C5 Mid C4 Mid C4 Mid C6 End C6 End
The genseq program takes two parameters. The
first is the name of a HMM (i.e., NAME.trans and
NAME.emit specifies the transition and emission
probabilities of the HMM). The second is the number of sequences to generate.
The program generates a collection of observation sequences with each sequence
on a line. For example, the outputs of the command
../hmm/genseq phone 10
are the following:
C1 C4 C6 C7 C6 C1 C5 C7 C2 C1 C3 C1 C1 C4 C4 C4 C4 C5 C4 C4 C4 C4 C5 C5 C3 C4 C4 C3 C3 C5 C7 C6 C7 C2 C1 C5 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C6 C3 C4 C7 C7 C2 C3 C4 C3 C4 C3 C6 C6 C3 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C4 C5 C4 C3 C3 C4 C4 C4 C3 C6 C3 C3 C1 C4 C3 C5 C4 C4 C4 C4 C4 C6 C2 C3 C5 C4 C7 C2 C3 C4 C4 C4 C4 C4 C3 C4 C4 C6
One of the beauties of HMM is that the parameters it needs can be estimated
(trained) with sequences of observations. The trainhmm
program does exactly this. It takes three obligatory parameters and one optional
parameter. The three obligatory parameters are: the name of the initial HMM, the
name of the result HMM and the file containing the training sequences. The
optional parameter is the maximum number iterations to run during training. If
the fourth parameter is not provided, the maximum number of iterations is 10.
For example, the command:
../hmm/trainhmm phone-init1 phone-result1 phone.train
will train a HMM with the starting parameters in
the files phone-init1.trans and phone-init1.emit
which contain the following contents:
phone-init1.trans:
INIT INIT Onset 1 Onset Onset 0.5 Onset Mid 0.5 Mid Mid 0.5 Mid End 0.5 End End 0.5 End FINAL 0.5
phone-init1.emit:
Onset C1 0.33 Onset C2 0.33 Onset C3 0.33 Mid C3 0.33 Mid C4 0.33 Mid C5 0.33 End C4 0.33 End C6 0.33 End C7 0.33
As you can see the initial HMM has the right structure, but the probabilities are assumed to be uniformly distributed. The training results will be stored n the files phone-result1.trans and phone-result1.emit.
Having played with toy problems in the previous three questions, we are now ready for a serious application: Par of Speech Tagging. The Part of Speech (POS) of a word is the equivalent classes that a word belongs to such that changing the word to another word in the same equivalence class does not change the grammaticality of a sentence. A natural language word can have multiple Parts of Speech. For example, the word 'can' may be an auxiliary verb (e.g., you can do it), a noun (e.g., a can of fish), a verb (e.g., how to can fresh tomato products). The goal of POS Tagging is to determine the POS of a word in a particular context by assigning POS category (called tag) to the word. For example, here is an example sentence where each word is assigned a POS tag (each word is followed by its tag).
There EX is VBZ no DT asbestos NN in IN our PRP$ products NNS now RB . .
The tags used here come from the Penn Treebank tag set.
One way to build a POS tagger is to use an HMM. The hidden states in HMM are the tags. The observations are the words in a sentence.
Given a sentence, we can the find the tags of the words in it using the Viterbi
algorithm. For example, the directory the file pos.trans and
pos.emit
contain a trained HMM for POS tagging. The file sample.txt
contains a sequence of sentences. We can tag them with the command
../hmm/vit
pos <sample.txt
Here are the outputs of this command:
P(path)=0.218846 path: But CC state NN courts NNS upheld VBD a DT challenge NN by RP consumer NN groups NNS to TO the DT commission NN 's POS rate NN increase NN and CC found VBD the DT rates NNS illegal JJ . . P(path)=0.418471 path: The NNP Illinois NNP Supreme NNP Court NNP ordered VBD the DT commission NN to TO audit VB Commonwealth NNP Edison NNP 's POS construction NN expenses NNS and CC refund VB any DT unreasonable JJ expenses NNS . .
The in addition to the HMM package, you also need the following files in the pos directory:
Your task is to do the following:
You are given the following grammar
| S => NP VP N1 => Adj N1 N1 => N N1 => N N N1 => N N N NP => N1 NP => Det N1 N1 => N1 PP N1 => N1 ReCL NP => Pron NP => Name VP => V VP => V NP VP => VP PP PP => P NP |
ReCL=>
WHN VP ReCL=> WHN S WHN => who|which|that Pron=> she|he|him|that P => in|of|on|with Det => the|a|that N => saw|head|dog|catch|stop|boy|cat Adj => local|big V => saw|head|stop|catch|eat|chased |
Given the following bag of words, can you order them to get a meaningful (or at least grammatical) English sentence?
's . But disagreement do how it over there to
Now, let's try writing a program to do it. You can take either one of the following two approaches:
Please implement either one of the above. Here are some sample bags of words you can test with.
He his is man own PC There contributors many pioneer were Inventories are closely clues for such watched The as counts government is it money spent
The C++ STL includes an algorithm for generating permutations. A example use of it is provide here.