Topics for Projects for Cmput651

Below are some possible topics for Comput651 projects. For each, we provide some suggested readings; these are the references to start with. Do not consider these references exhaustive; you will be expected to review the literature in greater depth for your project. While you are not forced to choose one of these topics, it is strongly advised that you talk to the instructor if you want to deviate significantly from the topics below. (I marked with ** the ones that I am particularly interested in pursuing.)

COMING SOON:
Abstractions
NetFlixPrize
1. superresolution for MRI data
2. Custom filters for MRI data
Contact dana at cs.ualberta.ca
Combining classifiers, for Proteome Analyst
Structure Learning
The challenge here is finding a good graph structure over a set of variables, in either a directed or undirected graphical model. Potential projects include...
- Comparing structure learning algorithms for Bayesian networks (eg, hillclimbing, PDAGs, optimal reinsertion) in terms of quality of density estimation, sensitivity of the size of the data set, classification performance, etc.
- Structure search given a fixed ordering -- If you are given a total ordering of the variables x₁...x_n where the parents of x_i are a subset of x₁...x_i-1, structure learning becomes simpler than search over the space of directed acyclic graphs (K&F 15.4.2)
- Learning the structure of an undirected graphical model (Abbeel et. al. 2006, Parise and Welling 2006)
- Learning compact representations for conditional probability distributions -- In discrete Bayesian networks having a large number of parents means a node's CPD is large. It is possible that given a particular assignment to a few of the parents, the rest of the parents do not matter (context-specific independence), which can lead to a compact representation of a CPD (K&F 15.6).
- Bayesian model averaging -- insteading of finding the single best structure for a Bayesian network, compute a posterior distribution over structures (K&F 15.5)
- Optimal structure learning -- the naive algorithms are super-exponential in the number of variables, but both the optimal MAP (Singh & Moore 2005) and optimal BMA (Koivisto & Sood 2004) structures can be computed in exponential time at the cost of exponential memory.
- Find the structure the produces the best discriminative model; see (Guo&Greiner 2005).
References
Inference
The most common use of a probabilistic graphical model is computing queries, the conditional distribution of a set of variables given an assignment to a set of evidence variables. As this problem is NP-hard, there are now a large number of algorithms, both exact and approximate. Potential project topics include,,,
- Comparing approximate inference algorithms in terms of accuracy, computational complexity, sensitivity to parameters. Some exact algorithms include Junction trees and Bucket elimination. On larger networks one typically resorts to algorithms that produce approximate solutions, such as sampling (Monte Carlo methods), variational inference, and generalized belief propagation.
- Adaptive Generalized Belief Propagation (Welling 2004) & Expectation Propagation (K&F 11) -- Compare these methods to each other and Gibbs sampling.
- Convex Procedures -- Methods that performance approximate inference by convex relaxation (Wainwright 2002)
- Linear programming methods for approximating the MAP assignment (Wainwright et. al. 2005b, Yanover et. al. 2006)
- Recursive conditioning -- An any-space inference algorithm that recursively decomposes an inference on a general Bayesian network into inferences on a smaller subnetwork. (Darwiche 2001).
References
Variance of Response (**)
Most inference system return a single value; eg, P(+cancer | +headache, -sorethroat) = 0.02. In many cases, however, this response itself is actually a distribution. That is, the response depends on the parameters (CPtables), which are often based on a (random) datasample. Here, these parameters are necessarily random variables, which means the result of the computation, here the response, is also random. The technical note in BN Variance webpage proves that this distribution is asymptotically normal, and provides the associated mean and variance.
That analysis assumes as given the correct structure over discrete variables, with CPtable values represented as a table. This suggests many extensions:
- dealing with a distribution over structures (rather than a single correct one)
- dealing with non-tabular CPtables -- eg, noisy-ors, or decision trees
- dealing with continuous variables
This analysis also indirectly assumes the datasample was complete; it would also be useful to extend these ideas to deal with incomplete data. This would allow us to deal with important cases, like HMM structures.
There are several applications of this technology:
- The Mixture using Variance paper describes a way to use these variance values to combine difference bayesian-net-based classifiers, learned from different subsets of instances. Can this analysis be applied to combine different subsets of features?
- In general, can we use this analysis for selecting features, of a single classifier... eg select only the features that contribute least to the total variance.
- Can we use this analysis to determine the change of "Good decision, bad outcome"? And perhaps for inbalanced data? See KDD Cup 2006.
- Can this help in active learning ... related to Active Learning with Statistical Models
Temporal Models
There are lots of applications where we want to explicitly model time (control, forecasting, online-learning). Hidden Markov Models are one of the simplest discrete-time models, but there are many others: Kalman filters for continuous state-spaces, factorial Hidden Markov models for problems with many hidden variables that allows for efficient variational inference, and dynamic Bayesian networks which allow arbitrarily complex relationships between hidden and observed variables. Projects include
- Comparing the performance of factorial Hidden Markov Models (Ghahramani & Jordan 1997) to dynamic Bayesian networks (K&F 18).
- Expermental evaluation of approximate inference algorithms for DBNs, such as Boyen-Koller, Particle Filtering, and Thin Junction Trees (Paskin 2003). Kevin Murphy's thesis provides a good overview of inference in DBNs.
- Comparing Kalman filters against more general DBN models.
References
Hierarchical Bayes
In text classification, one can view a corpus as generated by a hierarchical process. For example, select an author. Given an author there is a distribution over topics he is interested in. Select a topic according to this distribution. Given a topicm there is a distribution over words used in a document. Finally, generate a bag of words from this distribution. One approach to modelling this process is hierarchical Bayes (often nonparametric hierarchical Bayes).
Another application of nonparametric hierarchical Bayes is clustering, where instead of selecting the number of clusters a priori, the model averages over the number of clusters to produce a posterior over clusterings of data points.

Potential projects include
- Implementing Latent Dirichlet Allocation for document clustering
- Compare nonparametric Bayesian clustering methods such as Dirichlet Processes and Chinese Restaurant Processes.
- Implement a hierarchical clustering model, either for topic modelling or clustering.
References
Build models for metabolomic pathways, etc.,
... perhaps related to HMP, Kyoto Encyclopedia of Genes and Genomes (KEGG), Proteome Analyst
Modeling Protein-Protein Interactions
Implement, then extend: Probabilistic Modeling of Systematic Errors in Two-Hybrid Experiments See also auxiliary data sets.
Relational Models
Almost all of the machine learning / statistics methods you have studied assume that the data is independent or exchangable. In many cases this is not true. For example, knowing the topic of a web page tells you something about the likely topics of pages linked to it. The independence assumption fails on most graph-structured data sets (relational databases, social networks, web pages).
Potential projects include
- Implementing a restricted case of Probabilistic Relational Models (eg, no existence uncertainty) and compare the performance against some baseline non-relational models.
- Implementing Relational Markov Networks and compare the performance against some baseline non-relational models
- Applying these ideas to address the Netflix challenge
References
Hybrid Bayesian Networks
Many real systems contain a combination of discrete and continuous variables, which can be modeled as a hybrid BN. Potential projects include
- Compare inference algorithms for hybrid DBNs against those that first discretize all the continuous variables, and then just use the standard algorithms (variable elimination, junction trees).
References
Influence Diagrams
A Bayesian network models a part of the world, but not decisions taken by agents nor the effect that these decisions can have upon the world. Influence diagrams extend Bayesian networks with nodes that represent actions an agent can take, the costs and utilities of actions, and most importantly the relationships between them.
In multiagent setting finding the Nash equilibrium is hard, but graphical models provide a framework for recursively decomposing the problem (opening up the possibility of a dynamic programming approach). Dynamic programming algorithms like NashProp (Kearns and Ortiz, 2002) are closely related to belief propagation.

Projects include
- Implementing algorithms for selecting a good or optimal strategy in the single-agent case (K&F 21)
- Finding Nash equilibria in multiplayer games (Koller & Milch, 2003)
References
Max-margin Graphical Models
Typically the parameters of a graphical model are learned by maximum likelihood or maximum a posterori. An alternative criteria for parameter estimation is to maximize the margin between classes, which can be thought of as a combination of graphical models (to represent structured relationships between inputs and outputs) with kernel methods. Projects include,
An example of a domain where this approach works well is handwriting recognition, where the structure encodes the fact that knowing what the previous letter was tells you something about what the next letter is likely to be.
- Compare max-margin to likelihood based methods (eg, character recognition, part of speech tagging)
References
Active Classifier / Value of Information (**)
Most classifiers are "passive", in that they can only use the feature values given, even if many of those values are missing. By contrast, an "active classifier" has the option of first obtaining the values of some attributes, before returning a label. Given a cost model that specifies both the costs of purchasing each attribute value, and also the penalty of returning each type of incorrect label, we can seek the optimal "policy", that identifies when to purchase which attribute value and when to return which label; see webpage.
Possible projects include...
- Apply active classification to dynamically determine which sensor (within a network) you should sample from.
- Extend the model to deal with cost models that give discounts when you purchase many attribute values at once; or alternatively, if you purchase the values of some attribute for all instances, etc.
References
Active/Budgeted Learning (**)
Most learning tasks implicitly assume the learner begins with a body of labeled data -- typically a matrix M = (m_ij) where m_ij is the value of the i-th attribute of the j-th instance. Now consider the challenge of dynamically gathering a subset of such data --- ie, sequentially determining which attributes, of which training instances, to "purchase", where the goal is obtaining the subset of data that will allow the learner to produce the best performance system. In particular, the standard active learner sequentially purchases a sequence of k class-labels, for some currently unlabeled (but otherwise completely specified) instances, with the goal of producing the best possible "based on any k labels". The budgeted learning model similarly seeks the best classifier, but here each instance is given only the class-label but NOT the attribute values. Here, the budgeted learner can sequentially purchase attribute values, subject to spending at most a given fixed total budget.

Possible projects include
- Compare optimization criteria (eg, experimental design criteria) [CITE]
- PAC-learning analysis: how many "attribute-instance" probes are necessary to produce a classifier that is (with probability at least 1-delta) within epsilon of optimal
- Extend this analysis to the task of learning a generative model -- ie, where the goal is learning a probabilistic model that is close to the true model (in terms of KL-divergence)
- Active learning that incorporate parameter uncertainty.

References

Simon Tong, Active Learning: Theory and Applications, PhD thesis, Stanford University 2001.

D Lizotte, O Madani and R Greiner; Budgeted Learning of Probabilistic Classifiers, UAI'04.

?PAC-learning - Valiant

>Segmentation A segmentation of an image maps each voxel to a label -- eg, to "healthy" vs "tumor", within an Magnetic Resonance image of a brain. To be effective, a segmentator should base its decision about the label of pixel x on information about the neighboring pixels (inclding perhaps their respective labels) as well as information on that location. There are a variety of technologies for this task, including generative models like Markov Random Fields, and their descriminative cousins, Conditional Random Fields (CRF), and recently Support Vector Random Fields (SVRF).
Possible projects
References
Modeling Text and Images
Images are oftened annotated with text, such as captions or tags, which can be viewed as an additional source of information when clustering images or building topic models. For example a green patch might indicate that there is a plant in the image, until one reads the caption "man in a green shirt". A related problem (Carbonetto et al. 2004) is data association, linking words to segmented objects in an image. For example, if the caption contains the words boat and sea, we would like to be able to associate these words with the segment(s) of the image corresponding to boat and sea.

References
2D CRFs for Visual Texture Classification

Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) for classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:
- labeled training data.
- C++ graphcuts code for approximate inference
- Kevin Murphys Matlab CRF code
- Carl Rasmussen's matlab conjugate gradient minimizer (better than using netlab or matlab optimization toolbox)
- Intro to CRFs by Hanna Wallach
- Maxent page, includes code
- Steerable pyramid matlab code, possibly useful set of image features
- Matlab wavelet toolbox, possibly useful set of image features .
- Paper of CRFs for sign detection, J. Weinman, 2004
- Markov Random Field Modeling in Computer Vision, S. Z. Li, 1995. (I have a hardcopy of the 2001 edition.)
- G. Winkler, "Image Analysis, Random Fields, and MCMC Methods", 2nd edition, 2003.
- Markov random fields and images, P. Perez. CWI Quarterly, 11(4):413-437, 1998. Review article.
2D CRFs for satellite image classification

Data Sets

Below are a number of data sets that could be used for your project. If you want to use a data set that is not on the list it is strongly advised that you talk to either a TA or the instructor before submitting your intial proposal.

Thanks to Dieter Fox, Andreas Krause, Lin Liao, Einat Minkov, Francisco Pereira, Sam Roweis, and Ben Taskar for donating data sets.

Data A: Functional MRI

Functional fMRI measures brain activation over time, which allows one to measure changes as an activity is performed (eg, looking at a picture of a cat vs. looking at a picture of a chair). Tasks using this data are typically of the form "predict cognitive state given fMRI data". fMRI data is both temporal and spatial: each voxel contains a time series, each voxel is correlated to voxels near it.

http://multivac.ml.cmu.edu/10708

Data B: Corel Image Data

Images featurized by color histogram, color histogram layout, color moments, and co-occurence texture. Useful for projects on image segementation, especially since there is a large benchmark repository available.

Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.

http://kdd.ics.uci.edu//databases/CorelFeatures/CorelFeatures.html
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/

Data C: Twenty Newsgroups

This data set contains 1000 text articles posted to each of 20 online newgroups, for a total of 20,000 articles. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

Data D: Sensor Networks
Using this 54-node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28^th 2004.

http://www-2.cs.cmu.edu/~guestrin/Research/Data/

This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low. Additional data for an intelligent lighting network, which include link quality information between pairs of sensors can is available at

http://www.cs.cmu.edu/~guestrin/Class/10708/Project/lightsensor.zip

Ideas for projects include

· Learn graphical models representing the correlations between measurements at different nodes

· Develop new distributed algorithms for solving a learning task on this data

References

http://www-2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf
http://www-2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf
Data E: arXiv Preprints

A collection of preprints in the field of high-energy physics. Includes the raw LaTeX source of each paper (so you can extract either structured sentences or a bag-of-words) along with the graph of citations between papers.

http://www.cs.cornell.edu/projects/kddcup/datasets.html

Data F: TRECVID

A competition for multimedia information retrieval. They keep a fairly large archive of video data sets, along with featurizations of the data.

http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html

Data G: Activity Modelling

Activity modelling is the task of inferring what the user is doing from observations (eg, motion sensors, microphones). This data set consists of GPS motion data for two subjects tagged with labels like car , working, athome, shopping.

http://www.cs.cmu.edu/~guestrin/Class/10708/Project/gps-labels.zip

An example of a DBN model for this problem is

A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors (UAI-2006)
http://www.cs.washington.edu/homes/fox/abstracts/gps-msb-uai-06.abstract.html
Data H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/

Ideas for projects: learning classifiers to predict the type of webpage from the text, using web structure to improve page classification.
Data I: Record Deduplication

The datasets provided below comprise of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of deduplication, identity uncertainty and record linkage.

http://www.cs.utexas.edu/users/ml/riddle/data.html

One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique". Some papers on record deduplication include

www.isi.edu/info-agents/papers/tejada01-is.pdf
http://www.cs.cmu.edu/~pradeepr/papers/kdd03.pdf

Data J: Enron e-mail

Consists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.

http://www.cs.cmu.edu/~enron/
http://www.cs.cmu.edu/~einat/datasets.html
Data K: Internet Movie Database

The Internet Movie Database makes their data publically available, with certain usage restrictions. It contains tables and links relating movies, actors, directors, box office grosses, and much more. Various slices of the data have been used extensively in research on relational models.
http://www.imdb.com/interface

Data L: Netflix

Netflix is running a competition for movie recommendation algorithms. They've released a dataset of 100M ratings from 480K randomly selected users over 17K titles. The data set, and contest details, are available at

http://www.netflixprize.com

A much smaller (but more widely used) movie rating data set is Movielens

http://www.grouplens.org/

Data M: NIPS Corpus

A data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.

http://www.cs.toronto.edu/~roweis/data.html
Data N: Character recognition (digits)

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

Taskar Ocr
Data O: Precipitation Data

This dataset has includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels, deciding where to place sensors to best predict rainfall, or active learning in fixed sensor networks.
Other sources of data
UC Irvine has a repository that could be useful for your project. Many of these data sets have been used extensively in graphical models research.
http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets (most ready for use in Matlab):
http://www.cs.toronto.edu/~roweis/data.html

Stolen, with permission, from C Guestrin, CMU: 10708-F06