CMPUT 366 Assignment Four

Instructor: Prof. Dekang Lin
Due Date: 11:59pm Wednesday December 3, 2003

Objectives: The following exercises are intended to further your understanding of Hidden Markov and Natural Language Processing.

Guidelines: Please read these carefully. Not following these overall guidelines may result in point penalty (lower grade) for the assignment.

  1. Please download as4.tgz and unpack it with the command tar xzvf as4.tgz.
  2. This assignment includes both essay questions and coding questions. The essay questions and a high-level description of each of the coding question should be submitted by hard copy. Coding questions must be submitted using the 'try' program.
  3. Make sure your name is clearly printed on your submission. For the sake of your privacy, do not include your student ID.
  4. For the essay questions:
  5. For the coding questions:

Total Points: 120


Part I: Hidden Markov Model

In this part of the assignment, you are to experiment with a C++ HMM Package written by Dekang Lin. The package contains three executables:

Executable Purpose
vit Given an HMM and an observation sequence, compute the sequence of the hidden states that has the highest probability using the Viterbi algorithm.
genseq Generate an observation sequence using a HMM model.
trainhmm Using a collection of observation sequences to train a HMM model using the Baum-Welch algorithm.

The three executables share the same implementation of HMM in hmm.h and hmm.cpp. The executables can be computed by typing the make command in the hmm directory.

A HMM is specified in two files: NAME.trans and NAME.emit where NAME is the name of the HMM. The file NAME.trans contains the transition probabilities between the states. The file NAME.emit contains the emission probabilities. Normally, HMM also needs a set of initial probabilities of the states. We treat them as part of the transition probabilities by adding a special initial state. The transition probabilities from the special initial state to other states correspond to the initial probabilities defined in the Russell and Norvig textbook. 

The figure above is the HMM in Figure 15.20 in [Russell&Norvig] (p.573). It can be specified with the files phone.trans and  phone.emit in the phone directory.

phone.trans:

INIT
INIT    Onset   1
Onset   Onset   0.3
Onset   Mid     0.7
Mid     Mid     0.9
Mid     End     0.1
End     End     0.4
End     FINAL   0.6

phone.emit:

Onset   C1      0.5
Onset   C2      0.2
Onset   C3      0.3
Mid     C3      0.2
Mid     C4      0.7
Mid     C5      0.1
End     C4      0.1
End     C6      0.5
End     C7      0.4

The first line of the phone.trans file is the name of the initial state. Each of the subsequent lines is a transition probability. For example, P(Mid|Onset)=0.7, P(C4|End)=0.1. The transition and emission probabilities not listed in these two files are treated as zeros.

Problem 1: Running the Vertibi algorithm [10 points]

The vit program takes the name of a HMM as the command-line argument. It then reads sequences of observations from the standard input and prints the most probable sequences of states as well as their probability on the standard output. For example, the file phone.input contains the observation sequence:

C1 C2 C3 C4 C4 C6 C7
C2 C2 C5 C4 C4 C6 C6

The output of the command 
    ../hmm/vit phone < phone.input
is the following:

P(path)=0.625286
path:
C1      Onset
C2      Onset
C3      Mid
C4      Mid
C4      Mid
C6      End
C7      End
P(path)=0.936748
path:
C2      Onset
C2      Onset
C5      Mid
C4      Mid
C4      Mid
C6      End
C6      End
  1. Give a sequence of observations with 7 symbols such that the most probable state sequence for these observations has the highest probability among all 7 symbol observation sequences. 
  2. Use the vit program to find out the most probable state sequence and its probability.

Problem 2: Generating Observation Sequences [10 points]

The genseq program takes two parameters. The first is the name of a HMM (i.e., NAME.trans and NAME.emit specifies the transition and emission probabilities of the HMM). The second is the number of sequences to generate. The program generates a collection of observation sequences with each sequence on a line. For example, the outputs of the command
     ../hmm/genseq phone 10  
are the following:

C1 C4 C6 C7 C6
C1 C5 C7
C2 C1 C3 C1 C1 C4 C4 C4 C4 C5 C4 C4 C4 C4 C5 C5 C3 C4 C4 C3 C3 C5 C7 C6 C7
C2 C1 C5 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C6
C3 C4 C7 C7
C2 C3 C4 C3 C4 C3 C6 C6
C3 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C4 C5 C4 C3 C3 C4 C4 C4 C3 C6
C3 C3 C1 C4 C3 C5 C4 C4 C4 C4 C4 C6
C2 C3 C5 C4 C7
C2 C3 C4 C4 C4 C4 C4 C3 C4 C4 C6
  1. Use the HMM phone to generate 100 sequences and save the results in the file phone.train (yes, sometimes there are easy marks).
  2. Calculate the expected length of randomly generated observation sequences for this HMM and compare it with the average length of the sequences in phone.train.

Problem 3: Training HMM with Baum-Welch [20 points]

One of the beauties of HMM is that the parameters it needs can be estimated (trained) with sequences of observations. The trainhmm program does exactly this. It takes three obligatory parameters and one optional parameter. The three obligatory parameters are: the name of the initial HMM, the name of the result HMM and the file containing the training sequences. The optional parameter is the maximum number iterations to run during training. If the fourth parameter is not provided, the maximum number of iterations is 10. For example, the command:
        ../hmm/trainhmm phone-init1 phone-result1 phone.train
 
will train a HMM with the starting parameters in the files phone-init1.trans and phone-init1.emit which contain the following contents:

phone-init1.trans:

INIT
INIT    Onset   1
Onset   Onset   0.5
Onset   Mid     0.5
Mid     Mid     0.5
Mid     End     0.5
End     End     0.5
End     FINAL   0.5

phone-init1.emit:

Onset   C1      0.33
Onset   C2      0.33
Onset   C3      0.33
Mid     C3      0.33
Mid     C4      0.33
Mid     C5      0.33
End     C4      0.33
End     C6      0.33
End     C7      0.33

As you can see the initial HMM has the right structure, but the probabilities are assumed to be uniformly distributed. The training results will be stored n the files phone-result1.trans and phone-result1.emit.

  1. Compare the probabilities in phone-result1.* with the probabilities in phone.* from which the training sequences were generated. How many of the probabilities differ by more than 5%?
  2. Experiment with more training sequences and find out what is the smallest number of training sequences that allows you to train a model in which all the parameters differ from the true model by less than 5%.
  3. Although the transitions in phone-init1.trans have different probabilities than the model that generated the data, the transition diagram would have the same structure as the model. Now, suppose we do not know the structure of the transition diagram. We would have to assume any state (including FINAL, but excluding INIT) can follow any other state with equal probability. Construct a transition probability table in phone-init2.trans and copy phone-init1.emit to phone-init2.emit. Rerun the trainhmm program with phone-init2 and see whether the Baum-Welch algorithm can obtain probabilities close to the model in phone.*.  
  4. To make the learning problem even harder, if you change the initial emission probability table so that any state (including FINAL, but excluding INIT) can generate any symbol with equal probability, can Baum-Welch recover the HMM that was used to generate the data in phone.train? Answer this by creating initial parameters in phone-init3.emit and copy phone-init2.trans to phone-init3.trans and try it out.  

Problem 4: Part of Speech Tagging [30 points]

Having played with toy problems in the previous three questions, we are now ready for a serious application: Par of Speech Tagging. The Part of Speech (POS) of a word is the equivalent classes that a word belongs to such that changing the word to another word in the same equivalence class does not change the grammaticality of a sentence. A natural language word can have multiple Parts of Speech. For example, the word 'can' may be an auxiliary verb (e.g., you can do it),  a noun (e.g., a can of fish), a verb (e.g., how to can fresh tomato products). The goal of POS Tagging is to determine the POS of a word in a particular context by assigning POS category (called tag) to the word. For example, here is an example sentence where each word is assigned a POS tag (each word is followed by its tag). 

There EX is VBZ no DT asbestos NN in IN our PRP$ products NNS now RB . . 

The tags used here come from the Penn Treebank tag set.

One way to build a POS tagger is to use an HMM. The hidden states in HMM are the tags. The observations are the words in a sentence.

Given a sentence, we can the find the tags of the words in it using the Viterbi algorithm. For example, the directory the file pos.trans and pos.emit contain a trained HMM for POS tagging. The file sample.txt contains a sequence of sentences. We can tag them with the command
        ../hmm/vit pos <sample.txt
Here are the outputs of this command:

P(path)=0.218846
path:
But     CC
state   NN
courts  NNS
upheld  VBD
a       DT
challenge       NN
by      RP
consumer        NN
groups  NNS
to      TO
the     DT
commission      NN
's      POS
rate    NN
increase        NN
and     CC
found   VBD
the     DT
rates   NNS
illegal JJ
.       .
P(path)=0.418471
path:
The     NNP
Illinois        NNP
Supreme NNP
Court   NNP
ordered VBD
the     DT
commission      NN
to      TO
audit   VB
Commonwealth    NNP
Edison  NNP
's      POS
construction    NN
expenses        NNS
and     CC
refund  VB
any     DT
unreasonable    JJ
expenses        NNS
.       .

The in addition to the HMM package, you also need the following files in the pos directory:

Your task is to do the following:

  1. Use the tags in alltags.txt and the lexicon lexicon.txt to construct the initial transition and emission probabilities. You need to decide how to set the initial probabilities. Please save the probabilities in pos-init.trans and pos-init.emit.
  2. Use the training corpus in  to set the parameters of a HMM and save the results in pos-trained.trans and pos-trained.emit. This might take more than 15 minutes to run. It is recommended that you try a smaller corpus first to make sure you are running the program correctly. 
  3. Find transitions with top 5 transition probability in pos-trained.trans and provides example word sequences corresponding to such transitions.
  4. Use the vit program and your trained HMM to assign tags to the words in test.txt. Inspect the results and find out the errors made by the program.
  5. If you inspect the contents in  pos-trained.trans and pos-trained.emit you will find that many have the probability values are very close to 0. One possible way to speed up the training of HMM is to discard probability values below a certain threshold. Explain why this would work.
  6. Modify the code in hmm.h and hmm.cpp so that the probabilities lower than 1/100000 are discarded. Report the amount of speed up (if any). 

Part II: Natural Language Processing

Problem 6. Grammar and Chart Parsing [25 points]

You are given the following grammar

S  => NP VP 
N1 => Adj N1 
N1 => N
N1 => N N 
N1 => N N N 
NP => N1
NP => Det N1 
N1 => N1 PP 
N1 => N1 ReCL
NP => Pron
NP => Name 
VP => V
VP => V NP 
VP => VP PP
PP => P NP
ReCL=> WHN VP
ReCL=> WHN S

WHN => who|which|that
Pron=> she|he|him|that
P   => in|of|on|with
Det => the|a|that
N   => saw|head|dog|catch|stop|boy|cat
Adj => local|big
V   => saw|head|stop|catch|eat|chased
  1. Draw all the parse trees for the following sentence:
        she saw the cat with a big head that the dog chased
  2. What are the active and complete arcs a chart parser would construct when parsing the following sentence:
       she saw a big head
  3. Given a sentence with at least 10 words that is ungrammatical but is nonetheless allowed by the above grammar.

Problem 7. Language Modeling [25 points]

Given the following bag of words, can you order them to get a meaningful (or at least grammatical) English sentence?

    's . But disagreement do how it over there to

Now, let's try writing a program to do it. You can take either one of the following two approaches:

Please implement either one of the above. Here are some sample bags of words you can test with. 

He his is man own
PC There contributors many pioneer were
Inventories are closely clues for such watched
The as counts government is it money spent

The C++ STL includes an algorithm for generating permutations. A example use of it is provide here.