Assignment 3: Language Modeling
Home ] Up ] Syllabus ] Assignments and Project ] Schedule ] Calendar ] Resources ]

 

 

Due Date: March 10, 2004

In this assignment, you are to experiment with a C++ HMM Package. The files are located at anzac:/usr/scratch/lindek/as3. The HMM package contains three executables:

Executable Purpose
vit Given an HMM and an observation sequence, compute the sequence of the hidden states that has the highest probability using the Viterbi algorithm.
genseq Generate an observation sequence using a HMM model.
trainhmm Using a collection of observation sequences to train a HMM model using the Baum-Welch algorithm.

The three executables share the same implementation of HMM in hmm.h and hmm.cpp. The executables can be computed by typing the make command in the hmm directory.

A HMM is specified in two files: NAME.trans and NAME.emit where NAME is the name of the HMM. The file NAME.trans contains the transition probabilities between the states. The file NAME.emit contains the emission probabilities. Normally, HMM also needs a set of initial probabilities of the states. We treat them as part of the transition probabilities by adding a special initial state. The transition probabilities from the special initial state to other states correspond to the initial probabilities defined in the Russell and Norvig textbook. 

The figure above is the HMM in Figure 15.20 in [Russell&Norvig] (p.573). It can be specified with the files phone.trans and  phone.emit in the phone directory.

phone.trans:

INIT
INIT    Onset   1
Onset   Onset   0.3
Onset   Mid     0.7
Mid     Mid     0.9
Mid     End     0.1
End     End     0.4
End     FINAL   0.6

phone.emit:

Onset   C1      0.5
Onset   C2      0.2
Onset   C3      0.3
Mid     C3      0.2
Mid     C4      0.7
Mid     C5      0.1
End     C4      0.1
End     C6      0.5
End     C7      0.4

The first line of the phone.trans file is the name of the initial state. Each of the subsequent lines is a transition probability. For example, P(Mid|Onset)=0.7, P(C4|End)=0.1. The transition and emission probabilities not listed in these two files are treated as zeros.

Problem 1: Running the Vertibi algorithm [2 points]

The vit program takes the name of a HMM as the command-line argument. It then reads sequences of observations from the standard input and prints the most probable sequences of states as well as their probability on the standard output. For example, the file phone.input contains the observation sequence:

C1 C2 C3 C4 C4 C6 C7
C2 C2 C5 C4 C4 C6 C6

The output of the command 
    ../hmm/vit phone < phone.input
is the following:

P(path)=0.625286
path:
C1      Onset
C2      Onset
C3      Mid
C4      Mid
C4      Mid
C6      End
C7      End
P(path)=0.936748
path:
C2      Onset
C2      Onset
C5      Mid
C4      Mid
C4      Mid
C6      End
C6      End
  1. Give a sequence of observations with 7 symbols such that the most probable state sequence for these observations has the highest probability among all 7 symbol observation sequences. 
  2. Use the vit program to find out the most probable state sequence and its probability.

Problem 2: Generating Observation Sequences [2 points]

The genseq program takes two parameters. The first is the name of a HMM (i.e., NAME.trans and NAME.emit specifies the transition and emission probabilities of the HMM). The second is the number of sequences to generate. The program generates a collection of observation sequences with each sequence on a line. For example, the outputs of the command
     ../hmm/genseq phone 10  
are the following:

C1 C4 C6 C7 C6
C1 C5 C7
C2 C1 C3 C1 C1 C4 C4 C4 C4 C5 C4 C4 C4 C4 C5 C5 C3 C4 C4 C3 C3 C5 C7 C6 C7
C2 C1 C5 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C6
C3 C4 C7 C7
C2 C3 C4 C3 C4 C3 C6 C6
C3 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C4 C5 C4 C3 C3 C4 C4 C4 C3 C6
C3 C3 C1 C4 C3 C5 C4 C4 C4 C4 C4 C6
C2 C3 C5 C4 C7
C2 C3 C4 C4 C4 C4 C4 C3 C4 C4 C6
  1. Use the HMM phone to generate 100 sequences and save the results in the file phone.train (yes, sometimes there are easy marks).
  2. Calculate the expected length of randomly generated observation sequences for this HMM and compare it with the average length of the sequences in phone.train.

Problem 3: Training HMM with Baum-Welch [6 points]

One of the beauties of HMM is that the parameters it needs can be estimated (trained) with sequences of observations. The trainhmm program does exactly this. It takes three obligatory parameters and one optional parameter. The three obligatory parameters are: the name of the initial HMM, the name of the result HMM and the file containing the training sequences. The optional parameter is the maximum number iterations to run during training. If the fourth parameter is not provided, the maximum number of iterations is 10. For example, the command:
        ../hmm/trainhmm phone-init1 phone-result1 phone.train
 
will train a HMM with the starting parameters in the files phone-init1.trans and phone-init1.emit which contain the following contents:

phone-init1.trans:

INIT
INIT    Onset   1
Onset   Onset   0.5
Onset   Mid     0.5
Mid     Mid     0.5
Mid     End     0.5
End     End     0.5
End     FINAL   0.5

phone-init1.emit:

Onset   C1      0.33
Onset   C2      0.33
Onset   C3      0.33
Mid     C3      0.33
Mid     C4      0.33
Mid     C5      0.33
End     C4      0.33
End     C6      0.33
End     C7      0.33

As you can see the initial HMM has the right structure, but the probabilities are assumed to be uniformly distributed. The training results will be stored n the files phone-result1.trans and phone-result1.emit.

  1. Compare the probabilities in phone-result1.* with the probabilities in phone.* from which the training sequences were generated. How many of the probabilities differ by more than 5%?
  2. Experiment with more training sequences and find out what is the smallest number of training sequences that allows you to train a model in which all the parameters differ from the true model by less than 5%.
  3. Although the transitions in phone-init1.trans have different probabilities than the model that generated the data, the transition diagram would have the same structure as the model. Now, suppose we do not know the structure of the transition diagram. We would have to assume any state (including FINAL, but excluding INIT) can follow any other state with equal probability. Construct a transition probability table in phone-init2.trans and copy phone-init1.emit to phone-init2.emit. Rerun the trainhmm program with phone-init2 and see whether the Baum-Welch algorithm can obtain probabilities close to the model in phone.*.  
  4. To make the learning problem even harder, if you change the initial emission probability table so that any state (including FINAL, but excluding INIT) can generate any symbol with equal probability, can Baum-Welch recover the HMM that was used to generate the data in phone.train? Answer this by creating initial parameters in phone-init3.emit and copy phone-init2.trans to phone-init3.trans and try it out.  

Problem 4: Part of Speech Tagging with EM [15 points]

Having played with toy problems in the previous three questions, we are now ready for a serious application: Par of Speech Tagging. The Part of Speech (POS) of a word is the equivalent classes that a word belongs to such that changing the word to another word in the same equivalence class does not change the grammaticality of a sentence. A natural language word can have multiple Parts of Speech. For example, the word 'can' may be an auxiliary verb (e.g., you can do it),  a noun (e.g., a can of fish), a verb (e.g., how to can fresh tomato products). The goal of POS Tagging is to determine the POS of a word in a particular context by assigning POS category (called tag) to the word. For example, here is an example sentence where each word is assigned a POS tag (each word is followed by its tag). 

There EX is VBZ no DT asbestos NN in IN our PRP$ products NNS now RB . . 

The tags used here come from the Penn Treebank tag set.

One way to build a POS tagger is to use an HMM. The hidden states in HMM are the tags. The observations are the words in a sentence.

Given a sentence, we can the find the tags of the words in it using the Viterbi algorithm. For example, the directory the file pos.trans and pos.emit contain a trained HMM for POS tagging. The file sample.txt contains a sequence of sentences. We can tag them with the command
        ../hmm/vit pos <sample.txt
Here are the outputs of this command:

P(path)=0.218846
path:
But     CC
state   NN
courts  NNS
upheld  VBD
a       DT
challenge       NN
by      RP
consumer        NN
groups  NNS
to      TO
the     DT
commission      NN
's      POS
rate    NN
increase        NN
and     CC
found   VBD
the     DT
rates   NNS
illegal JJ
.       .
P(path)=0.418471
path:
The     NNP
Illinois        NNP
Supreme NNP
Court   NNP
ordered VBD
the     DT
commission      NN
to      TO
audit   VB
Commonwealth    NNP
Edison  NNP
's      POS
construction    NN
expenses        NNS
and     CC
refund  VB
any     DT
unreasonable    JJ
expenses        NNS
.       .

The in addition to the HMM package, you also need the following files in the pos directory:

A lexicon that lists the possible parts of speech for words: lexicon.txt. No lexicon will ever be complete. There are bound to be words in the text that are outside the lexicon. If a word is not in the lexicon, then its possible parts of speech are assumed to be the set of all tags.
A list of possible tags: alltags.txt.
A training corpus: corpus.txt, which contains some Wall Street Journal articles. Each line in the text is a sentence, which consists of a space-separated sequence of words.
A testing corpus: test.txt in the same format as corpus.txt.
A file named test.tagged which contains the answer keys for the sentences in test.txt.

Your task is to do the following:

  1. Use the tags in alltags.txt and the lexicon lexicon.txt to construct the initial transition and emission probabilities. You need to decide how to set the initial probabilities. Please save the probabilities in pos-init.trans and pos-init.emit.
  2. Use the training corpus in  to set the parameters of a HMM and save the results in pos-trained.trans and pos-trained.emit. This might take more than 15 minutes to run. It is recommended that you try a smaller corpus first to make sure you are running the program correctly. 
  3. Write a program that uses the HMM to tag English sentences. The standard input to the program has the same format as test.txt, where each line is sentence and tokens are space separated. The program prints the tagged sentences on the standard output, in the same format as test.tagged, where each token is followed by its tag. Report the error rate of the tagger.
  4. The file train.tagged contains a tagged corpus. Use this corpus to construct a HMM for POS tagging. Evaluate the accuracy of the tagger using the same test corpus as (c) above.
 

 

Home ] Up ]

Copyright © 2004 Dekang Lin
Last modified: February 26, 2004