|
|
|
| Due Date: March 10, 2004 In this assignment, you are to experiment with a C++ HMM Package. The files are located at anzac:/usr/scratch/lindek/as3. The HMM package contains three executables:
The three executables share the same implementation of HMM in hmm.h and hmm.cpp. The executables can be computed by typing the make command in the hmm directory. A HMM is specified in two files: NAME.trans and NAME.emit where NAME is the name of the HMM. The file NAME.trans contains the transition probabilities between the states. The file NAME.emit contains the emission probabilities. Normally, HMM also needs a set of initial probabilities of the states. We treat them as part of the transition probabilities by adding a special initial state. The transition probabilities from the special initial state to other states correspond to the initial probabilities defined in the Russell and Norvig textbook.
The figure above is the HMM in Figure 15.20 in [Russell&Norvig] (p.573). It can be specified with the files phone.trans and phone.emit in the phone directory. phone.trans: INIT INIT Onset 1 Onset Onset 0.3 Onset Mid 0.7 Mid Mid 0.9 Mid End 0.1 End End 0.4 End FINAL 0.6 phone.emit: Onset C1 0.5 Onset C2 0.2 Onset C3 0.3 Mid C3 0.2 Mid C4 0.7 Mid C5 0.1 End C4 0.1 End C6 0.5 End C7 0.4 The first line of the phone.trans file is the name of the initial state. Each of the subsequent lines is a transition probability. For example, P(Mid|Onset)=0.7, P(C4|End)=0.1. The transition and emission probabilities not listed in these two files are treated as zeros. Problem 1: Running the Vertibi algorithm [2 points]The vit program takes the name of a HMM as the command-line argument. It then reads sequences of observations from the standard input and prints the most probable sequences of states as well as their probability on the standard output. For example, the file phone.input contains the observation sequence: C1 C2 C3 C4 C4 C6 C7 C2 C2 C5 C4 C4 C6 C6 The output of the command P(path)=0.625286 path: C1 Onset C2 Onset C3 Mid C4 Mid C4 Mid C6 End C7 End P(path)=0.936748 path: C2 Onset C2 Onset C5 Mid C4 Mid C4 Mid C6 End C6 End
Problem 2: Generating Observation Sequences [2 points]The genseq program takes two parameters. The
first is the name of a HMM (i.e., NAME.trans and
NAME.emit specifies the transition and emission
probabilities of the HMM). The second is the number of sequences to generate.
The program generates a collection of observation sequences with each sequence
on a line. For example, the outputs of the command C1 C4 C6 C7 C6 C1 C5 C7 C2 C1 C3 C1 C1 C4 C4 C4 C4 C5 C4 C4 C4 C4 C5 C5 C3 C4 C4 C3 C3 C5 C7 C6 C7 C2 C1 C5 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C6 C3 C4 C7 C7 C2 C3 C4 C3 C4 C3 C6 C6 C3 C4 C4 C4 C4 C4 C4 C4 C5 C4 C4 C4 C5 C4 C3 C3 C4 C4 C4 C3 C6 C3 C3 C1 C4 C3 C5 C4 C4 C4 C4 C4 C6 C2 C3 C5 C4 C7 C2 C3 C4 C4 C4 C4 C4 C3 C4 C4 C6
Problem 3: Training HMM with Baum-Welch [6 points]One of the beauties of HMM is that the parameters it needs can be estimated
(trained) with sequences of observations. The trainhmm
program does exactly this. It takes three obligatory parameters and one optional
parameter. The three obligatory parameters are: the name of the initial HMM, the
name of the result HMM and the file containing the training sequences. The
optional parameter is the maximum number iterations to run during training. If
the fourth parameter is not provided, the maximum number of iterations is 10.
For example, the command: phone-init1.trans: INIT INIT Onset 1 Onset Onset 0.5 Onset Mid 0.5 Mid Mid 0.5 Mid End 0.5 End End 0.5 End FINAL 0.5 phone-init1.emit: Onset C1 0.33 Onset C2 0.33 Onset C3 0.33 Mid C3 0.33 Mid C4 0.33 Mid C5 0.33 End C4 0.33 End C6 0.33 End C7 0.33 As you can see the initial HMM has the right structure, but the probabilities are assumed to be uniformly distributed. The training results will be stored n the files phone-result1.trans and phone-result1.emit.
Problem 4: Part of Speech Tagging with EM [15 points]Having played with toy problems in the previous three questions, we are now ready for a serious application: Par of Speech Tagging. The Part of Speech (POS) of a word is the equivalent classes that a word belongs to such that changing the word to another word in the same equivalence class does not change the grammaticality of a sentence. A natural language word can have multiple Parts of Speech. For example, the word 'can' may be an auxiliary verb (e.g., you can do it), a noun (e.g., a can of fish), a verb (e.g., how to can fresh tomato products). The goal of POS Tagging is to determine the POS of a word in a particular context by assigning POS category (called tag) to the word. For example, here is an example sentence where each word is assigned a POS tag (each word is followed by its tag). There EX is VBZ no DT asbestos NN in IN our PRP$ products NNS now RB . . The tags used here come from the Penn Treebank tag set. One way to build a POS tagger is to use an HMM. The hidden states in HMM are the tags. The observations are the words in a sentence.
Given a sentence, we can the find the tags of the words in it using the Viterbi
algorithm. For example, the directory the file pos.trans and
pos.emit
contain a trained HMM for POS tagging. The file sample.txt
contains a sequence of sentences. We can tag them with the command P(path)=0.218846 path: But CC state NN courts NNS upheld VBD a DT challenge NN by RP consumer NN groups NNS to TO the DT commission NN 's POS rate NN increase NN and CC found VBD the DT rates NNS illegal JJ . . P(path)=0.418471 path: The NNP Illinois NNP Supreme NNP Court NNP ordered VBD the DT commission NN to TO audit VB Commonwealth NNP Edison NNP 's POS construction NN expenses NNS and CC refund VB any DT unreasonable JJ expenses NNS . . The in addition to the HMM package, you also need the following files in the pos directory:
Your task is to do the following:
|
|
Copyright © 2004 Dekang Lin
|