Topics for Projects for Cmput651

Below are some possible topics for Comput651 projects. For each, we provide some suggested readings; these are the references to start with. Do not consider these references exhaustive; you will be expected to review the literature in greater depth for your project. While you are not forced to choose one of these topics, it is strongly advised that you talk to the instructor if you want to deviate significantly from the topics below.  (I marked with ** the ones that I am particularly interested in pursuing.)

  1. COMING SOON:
    Abstractions
    NetFlixPrize
  2. 1. superresolution for MRI data
    2. Custom filters for MRI data
    Contact dana at cs.ualberta.ca
    Combining classifiers, for Proteome Analyst
  3. Structure Learning
    The challenge here is finding a good graph structure over a set of variables, in either a directed or undirected graphical model. Potential projects include... References

  4. Inference
    The most common use of a probabilistic graphical model is computing queries, the conditional distribution of a set of variables given an assignment to a set of evidence variables. As this problem is NP-hard, there are now a large number of algorithms, both exact and approximate. Potential project topics include,,, References

  5. Variance of Response (**)
    Most inference system return a single value; eg, P(+cancer | +headache, -sorethroat) = 0.02. In many cases, however, this response itself is actually a distribution. That is, the response depends on the parameters (CPtables), which are often based on a (random) datasample. Here, these parameters are necessarily random variables, which means the result of the computation, here the response, is also random. The technical note in BN Variance webpage proves that this distribution is asymptotically normal, and provides the associated mean and variance.

    That analysis assumes as given the correct structure over discrete variables, with CPtable values represented as a table. This suggests many extensions:

    This analysis also indirectly assumes the datasample was complete; it would also be useful to extend these ideas to deal with incomplete data. This would allow us to deal with important cases, like HMM structures.

    There are several applications of this technology:

  6. Temporal Models
    There are lots of applications where we want to explicitly model time (control, forecasting, online-learning). Hidden Markov Models are one of the simplest discrete-time models, but there are many others: Kalman filters for continuous state-spaces, factorial Hidden Markov models for problems with many hidden variables that allows for efficient variational inference, and dynamic Bayesian networks which allow arbitrarily complex relationships between hidden and observed variables. Projects include References
  7. Hierarchical Bayes
    In text classification, one can view a corpus as generated by a hierarchical process. For example, select an author. Given an author there is a distribution over topics he is interested in. Select a topic according to this distribution. Given a topicm there is a distribution over words used in a document. Finally, generate a bag of words from this distribution. One approach to modelling this process is hierarchical Bayes (often nonparametric hierarchical Bayes). 

    Another application of nonparametric hierarchical Bayes is clustering, where instead of selecting the number of clusters a priori, the model averages over the number of clusters to produce a posterior over clusterings of data points. 

    Potential projects include

    References
  8. Build models for metabolomic pathways, etc.,
    ... perhaps related to HMP, Kyoto Encyclopedia of Genes and Genomes (KEGG), Proteome Analyst

  9. Modeling Protein-Protein Interactions
    Implement, then extend: Probabilistic Modeling of Systematic Errors in Two-Hybrid Experiments See also auxiliary data sets.

  10. Relational Models
    Almost all of the machine learning / statistics methods you have studied assume that the data is independent or exchangable. In many cases this is not true. For example, knowing the topic of a web page tells you something about the likely topics of pages linked to it. The independence assumption fails on most graph-structured data sets (relational databases, social networks, web pages).

    Potential projects include

    References
  11. Hybrid Bayesian Networks
    Many real systems contain a combination of discrete and continuous variables, which can be modeled as a hybrid BN. Potential projects include References
  12. Influence Diagrams
    A Bayesian network models a part of the world, but not decisions taken by agents nor the effect that these decisions can have upon the world. Influence diagrams extend Bayesian networks with nodes that represent actions an agent can take, the costs and utilities of actions, and most importantly the relationships between them.

    In multiagent setting finding the Nash equilibrium is hard, but graphical models provide a framework for recursively decomposing the problem (opening up the possibility of a dynamic programming approach). Dynamic programming algorithms like NashProp (Kearns and Ortiz, 2002) are closely related to belief propagation.

    Projects include

    References
  13. Max-margin Graphical Models
    Typically the parameters of a graphical model are learned by maximum likelihood or maximum a posterori. An alternative criteria for parameter estimation is to maximize the margin between classes, which can be thought of as a combination of graphical models (to represent structured relationships between inputs and outputs) with kernel methods. Projects include,

    An example of a domain where this approach works well is handwriting recognition, where the structure encodes the fact that knowing what the previous letter was tells you something about what the next letter is likely to be.

    References
  14. Active Classifier / Value of Information (**)
    Most classifiers are "passive", in that they can only use the feature values given, even if many of those values are missing. By contrast, an "active classifier" has the option of first obtaining the values of some attributes, before returning a label. Given a cost model that specifies both the costs of purchasing each attribute value, and also the penalty of returning each type of incorrect label, we can seek the optimal "policy", that identifies when to purchase which attribute value and when to return which label; see webpage.

    Possible projects include...


    References
  15. Active/Budgeted Learning (**)
    Most learning tasks implicitly assume the learner begins with a body of labeled data -- typically a matrix M = (m_ij)  where m_ij is the value of the i-th attribute of the j-th instance. Now consider the challenge of dynamically gathering a subset of such data  --- ie, sequentially determining which attributes, of which training instances, to "purchase", where the goal is obtaining the subset of data that will allow the learner to produce the best performance system. In particular, the standard active learner sequentially purchases a sequence of k class-labels, for some currently unlabeled (but otherwise completely specified) instances, with the goal of producing the best possible "based on any k labels".  The budgeted learning model similarly seeks the best classifier, but here each instance is given only the class-label but NOT the attribute values.  Here, the budgeted learner can sequentially purchase attribute values, subject to spending at most a given fixed total budget.

    Possible projects include

  16. References
  17. >Segmentation A segmentation of an image maps each voxel to a label -- eg, to "healthy" vs "tumor", within an Magnetic Resonance image of a brain. To be effective, a segmentator should base its decision about the label of pixel x on information about the neighboring pixels (inclding perhaps their respective labels) as well as information on that location. There are a variety of technologies for this task, including generative models like Markov Random Fields, and their descriminative cousins, Conditional Random Fields (CRF), and recently Support Vector Random Fields (SVRF).

    Possible projects


    References

  18. Modeling Text and Images
    Images are oftened annotated with text, such as captions or tags, which can be viewed as an additional source of information when clustering images or building topic models. For example a green patch might indicate that there is a plant in the image, until one reads the caption "man in a green shirt". A related problem (Carbonetto et al. 2004) is data association, linking words to segmented objects in an image. For example, if the caption contains the words boat and sea, we would like to be able to associate these words with the segment(s) of the image corresponding to boat and sea.

    References

  19. 2D CRFs for Visual Texture Classification

    Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) for classifying image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper. Useful links:

  20. 2D CRFs for satellite image classification

Data Sets

Below are a number of data sets that could be used for your project. If you want to use a data set that is not on the list it is strongly advised that you talk to either a TA or the instructor before submitting your intial proposal.

Thanks to Dieter Fox,
Andreas Krause, Lin Liao, Einat Minkov, Francisco Pereira, Sam Roweis, and Ben Taskar for donating data sets.

Data A: Functional MRI

Functional fMRI measures brain activation over time, which allows one to measure changes as an activity is performed (eg, looking at a picture of a cat vs. looking at a picture of a chair). Tasks using this data are typically of the form "predict cognitive state given fMRI data". fMRI data is both temporal and spatial: each voxel contains a time series, each voxel is correlated to voxels near it. 

http://multivac.ml.cmu.edu/10708

Data B: Corel Image Data

Images featurized by color histogram, color histogram layout, color moments, and co-occurence texture. Useful for projects on image segementation, especially since there is a large benchmark repository available.

Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture.  The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions.  One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments.  Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.

http://kdd.ics.uci.edu//databases/CorelFeatures/CorelFeatures.html
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/

Data C: Twenty Newsgroups

This data set contains 1000 text articles posted to each of 20 online newgroups, for a total of 20,000 articles. This data is useful for a variety of text classification and/or clustering projects.  The "label" of each article is which of the 20 newsgroups it belongs to.  The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

Data D: Sensor Networks
Using this 54-node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th 2004.

http://www-2.cs.cmu.edu/~guestrin/Research/Data/

This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low. Additional data for an intelligent lighting network, which include link quality information between pairs of sensors can is available at

http://www.cs.cmu.edu/~guestrin/Class/10708/Project/lightsensor.zip

Ideas for projects include 

·          Learn graphical models representing the correlations between measurements at different nodes

·          Develop new distributed algorithms for solving a learning task on this data

References

Data F: TRECVID

A competition for multimedia information retrieval. They keep a fairly large archive of video data sets, along with featurizations of the data.

http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html

Data G: Activity Modelling

Activity modelling is the task of inferring what the user is doing from observations (eg, motion sensors, microphones). This data set consists of GPS motion data for two subjects tagged with labels like car , working, athome, shopping. 

http://www.cs.cmu.edu/~guestrin/Class/10708/Project/gps-labels.zip

An example of a DBN model for this problem is

A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors (UAI-2006)
http://www.cs.washington.edu/homes/fox/abstracts/gps-msb-uai-06.abstract.html

Data H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/

Ideas for projects: learning classifiers to predict the type of webpage from the text, using web structure to improve page classification.  

Data I: Record Deduplication

The datasets provided below comprise of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of deduplication, identity uncertainty and record linkage.

http://www.cs.utexas.edu/users/ml/riddle/data.html

One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique". Some papers on record deduplication include

www.isi.edu/info-agents/papers/tejada01-is.pdf
http://www.cs.cmu.edu/~pradeepr/papers/kdd03.pdf

Data J: Enron e-mail

Consists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling. 

http://www.cs.cmu.edu/~enron/
http://www.cs.cmu.edu/~einat/datasets.html

Data K: Internet Movie Database

The Internet Movie Database makes their data publically available, with certain usage restrictions. It contains tables and links relating movies, actors, directors, box office grosses, and much more. Various slices of the data have been used extensively in research on relational models. 

http://www.imdb.com/interface

Data L: Netflix

Netflix is running a competition for movie recommendation algorithms. They've released a dataset of 100M ratings from 480K randomly selected users over 17K titles. The data set, and contest details, are available at

http://www.netflixprize.com

A much smaller (but more widely used) movie rating data set is Movielens

http://www.grouplens.org/

Data M: NIPS Corpus

A data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.

http://www.cs.toronto.edu/~roweis/data.html

Data N: Character recognition (digits)

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

Taskar Ocr

Data O: Precipitation Data

This dataset has includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels, deciding where to place sensors to best predict rainfall, or active learning in fixed sensor networks.

Other sources of data

UC Irvine has a repository that could be useful for your project. Many of these data sets have been used extensively in graphical models research.

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets (most ready for use in Matlab):

http://www.cs.toronto.edu/~roweis/data.html


Stolen, with permission, from C Guestrin, CMU: 10708-F06