Below are some possible topics for Comput651 projects.
For each, we provide some suggested readings;
these are the references to start with.
Do not consider these references exhaustive; you will be
expected to review the literature in greater depth for your
project.
While you are not forced to choose one of these topics, it
is strongly
advised that you talk to the instructor if you want to deviate
significantly from the topics below. (I marked with ** the ones
that I am particularly interested in pursuing.)
Ajit Singh and Andrew Moore (2005), Finding
Optimal Bayesian Networks
by Dynamic Programming. Tech Report CMU-CALD-05-106
Mikko Koivisto and Kismat Sood (2004), Exact Bayesian Structure
Discovery in Bayesian Networks. JMLR 5.
http://jmlr.csail.mit.edu/papers/volume5/koivisto04a/koivisto04a.pdf
Sridevi Parise and Max Welling (2006) Structure Learning in Markov
Random Fields, NIPS 2006
http://www.ics.uci.edu/~welling/publications/papers/StructLearnMRF-submit.pdf
Max Welling, Tom Minka and Yee Whye Teh (2005)
Structured Region Graphs: Morphing EP into GBP,
UAI 2005
That analysis assumes as given the correct structure over discrete variables, with CPtable values represented as a table. This suggests many extensions:
There are several applications of this technology:
Another application of nonparametric hierarchical Bayes is clustering, where instead of selecting the number of clusters a priori, the model averages over the number of clusters to produce a posterior over clusterings of data points.
Potential projects include
Potential projects include
In multiagent setting finding the Nash equilibrium is hard, but graphical models provide a framework for recursively decomposing the problem (opening up the possibility of a dynamic programming approach). Dynamic programming algorithms like NashProp (Kearns and Ortiz, 2002) are closely related to belief propagation.
Projects include
M. Kearns and L. Ortiz; Nash Propagation for Loopy Graphical Games. Proceedings of NIPS 2002.
Carlos Guestrin, Daphne Koller and Ronald Parr;
Multiagent Planning with Factored MDPs;
NIPS 2001, pp. 1523 - 1530, Vancouver, Canada, December 2001.
Carlos Guestrin,
Planning Under Uncertainty in Complex Structured Environments,
Ph.D. Dissertation, Computer Science Department, Stanford University,
August 2003.
http://www.cs.cmu.edu/~guestrin/Publications/Thesis/thesis.pdf
An example of a domain where this approach works well is handwriting recognition, where the structure encodes the fact that knowing what the previous letter was tells you something about what the next letter is likely to be.
Possible projects include...
A. Krause, C. Guestrin. " Optimal Value of Information in Graphical Models - Efficient Algorithms and Theoretical Limits", IJCAI 2005
Anderson, B. and Moore, A.; Fast Information Value for Graphical Models; NIPS 2005
Possible projects include
D Lizotte, O Madani and R Greiner; Budgeted Learning of Probabilistic Classifiers, UAI'04.
?PAC-learning - Valiant
Possible projects
DRF
C-H.~Lee, R.~Greiner and M.~Schmidt, ``Support Vector Random Fields for Spatial Classification'', PKDD 2005.
References
Peter Carbonetto, Nando de Freitas and Kobus
Barnard.
A Statistical Model for General Contextual Object Recognition. ECCV 2004
http://www.cs.ubc.ca/~nando/papers/mrftranstwo.pdf
Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) for classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:
Functional fMRI measures brain activation over time, which allows
one to measure changes as an activity is performed (eg, looking at a
picture of a cat vs. looking at a picture of a chair). Tasks using this
data are typically of the form "predict cognitive state given fMRI
data". fMRI data is both temporal and spatial: each voxel contains a
time series, each voxel is correlated to voxels near it.
http://multivac.ml.cmu.edu/10708
Data B: Corel Image Data
Images featurized by color histogram, color histogram layout, color moments, and co-occurence texture. Useful for projects on image segementation, especially since there is a large benchmark repository available.
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Data C: Twenty Newsgroups
This data set contains 1000 text articles posted to each of 20
online newgroups, for a total of 20,000
articles. This data is useful for a variety of text
classification and/or clustering projects. The "label" of
each article is which of the 20 newsgroups it belongs to. The
newsgroups (labels) are hierarchically organized (e.g., "sports",
"hockey").
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
http://www-2.cs.cmu.edu/~guestrin/Research/Data/
This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low. Additional data for an intelligent lighting network, which include link quality information between pairs of sensors can is available at
http://www.cs.cmu.edu/~guestrin/Class/10708/Project/lightsensor.zip
Ideas for projects include
· Learn graphical models representing the correlations between measurements at different nodes
· Develop new distributed algorithms for solving a learning task on this data
References
Data E: arXiv Preprints
A collection of preprints in the field of high-energy physics. Includes the raw LaTeX source of each paper (so you can extract either structured sentences or a bag-of-words) along with the graph of citations between papers.
Activity modelling is the task of inferring what the user is doing from observations (eg, motion sensors, microphones). This data set consists of GPS motion data for two subjects tagged with labels like car , working, athome, shopping.
http://www.cs.cmu.edu/~guestrin/Class/10708/Project/gps-labels.zip
An example of a DBN model for this problem is
A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors
(UAI-2006)
http://www.cs.washington.edu/homes/fox/abstracts/gps-msb-uai-06.abstract.html
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www-2.cs.cmu.edu/~webkb/
Ideas for projects: learning classifiers to predict the type of webpage from the text, using web structure to improve page classification.
Data I: Record DeduplicationConsists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.
http://www.cs.cmu.edu/~enron/
http://www.cs.cmu.edu/~einat/datasets.html
The Internet Movie Database makes their data publically available, with certain usage restrictions. It contains tables and links relating movies, actors, directors, box office grosses, and much more. Various slices of the data have been used extensively in research on relational models.
http://www.imdb.com/interfaceA data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.
http://www.cs.toronto.edu/~roweis/data.html
Data N: Character recognition (digits)Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
Data O: Precipitation DataThis dataset has includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels, deciding where to place sensors to best predict rainfall, or active learning in fixed sensor networks.
UC Irvine has a repository that could be useful for your project. Many of these data sets have been used extensively in graphical models research.
http://www.ics.uci.edu/~mlearn/MLRepository.htmlSam Roweis also has a link to several datasets (most ready for use in Matlab):
http://www.cs.toronto.edu/~roweis/data.htmlStolen, with permission, from C Guestrin, CMU: 10708-F06