Data Mining and Knowledge Discovery (Projects)

Data Mining and Knowledge Discovery

Term Projects (Fall 2007)

Due Date: Variable; see the schedule below.
Percentage of overall grade: 39%
Maximum Marks: 100

Deliverables:

Project proposal

2 pages containing

Description of the topic of the project
Methodology intended to address and solve the issues of the project
Evaluation (if applicable)
Proposed milestones
References (in appendix)

Reasonably documented source code.

Executable of the project

Report explaining the project

"Public" demonstration of the project.

Due Dates:

Activity                      Due Date
Project Proposal              October 9th
Preliminary "Private" Demo    Week of November 19th
Final report submission       December 17th
"Public" Demo                 December 17th and 18th

NB: A second "private" preliminary demo can be scheduled later.

Marking:

Marking will consider the quality of the proposal, the quality of the report, the quality of the demonstration and, to some extent, the quality of the preliminary demo.

List of projects: Talk to me to get more details about these projects.
# Project May Lead to MSc Taken by Brief Description

1 Conference Ranking Yes Yavar + Wei + Camilo Conferences are currently ranked subjectively by reputation. The project consists of devising a metric to rank conferences based on the citations of their papers, the reputation of the authors publishing in them, etc. This information could be extracted from Google Scholar or CiteSeer in addition to the public database DBLP.
NB. There are many possibilities for ranking. We can have either different projects on this topic or a team project testing different approaches.

2 Asymmetric Parallel Frequent Itemset Mining Yes Shuang Typically, a parallel data mining program runs the same algorithm on all processors. This project consists of identifying features of the transactional data that indicate the most appropriate algorithm to run on each processor given its data partition.

3 Emerging Sequences Yes Kang Comparing sets of sequences is relevant to many applications. This project consists of implementing algorithms to identify contrasting sequences among sets of sequences.

4 Contrast Sets Based on Association Rules Yes Yi Contrast sets can be mined by specific algorithms such as STUCCO and CIGAR. They can also be mined by means of association rule mining. The project is to use association rule mining to extract contrast sets and see whether contrast sets can be used to characterize clusters or improve classifiers.

5 Home Page Detection Possible Fariba This project consists of identifying the personal web page of a given person, if such a page exists. This can be done by using a search engine to retrieve pages with the person's name as the query and classifying the resulting pages into home pages and non-home pages.

6 Identity Resolution by Authorship Promiscuity Yes Zhiyu In a social network of paper authors, the same name can represent different authors. Can we resolve this identification issue by means of authorship promiscuity? Authors are "promiscuous" if they co-author with different groups of authors for each context (i.e. field or group of conferences).

7 Query Refinement Using Social Network Analysis Yes Ying The project consists of exploiting community mining in social network analysis to cluster and label documents in search engine results in order to suggest refinements to search queries.

8 Associative Classifier by Row-Enumeration for High Dimensionality Yes Jiaofen Associative classifiers use association rule mining to discover classification rules. However, for a high-dimensional space, association rule mining can be too expensive to be practical. Row-enumeration is a new alternative. Can it be used to build a classifier for data such as microarray data, with many features but few patients?

9 Mammography Classification Yes Farzaneh An associative classifier should be used to classify mammograms into cancerous and normal cases. An interface that receives mammograms and simulates the diagnosis is to be built.

10 Feature Space Selection Yes Naimeh + Mojdeh When classifying records with a large number of attributes (high dimensionality), it is necessary to select the most relevant ones to build the classification model. There are many techniques for dimensionality reduction, feature selection, etc. This project is to explore two new hypotheses for feature selection: one using subspace clustering and the other using random walks in bipartite graphs.
NB. There could be two individual projects or a team project testing these different hypotheses.

11 Question-Answer Recommender System Yes In the context of a search engine, one could send a question instead of keywords. The idea is to expand the question, retrieve pertinent documents using search engines, then summarize these resources and provide the concise summaries as recommended answers to the original question.

12 Classification Explanation Generation for Associative Classifiers Possible Dave When a classifier predicts a class for a new object, a straightforward question is "how was the conclusion drawn?" The project is to build a UI that would explain the results of an associative classifier.
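
Several of the projects above (8, 9 and 12) rely on associative classifiers, so a small illustration of the idea may help when preparing a proposal. The following is only a minimal sketch, not the method expected for any project: it mines rules of the form "itemset -> class" by brute-force enumeration, keeps those above a support and a confidence threshold, and classifies a new record with the first matching rule, which also serves as a simple explanation of the prediction. The function names, thresholds and toy records are all hypothetical, and practical associative classifiers use far more efficient mining and rule pruning.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(records, min_support=0.3, min_confidence=0.7, max_len=2):
    """Mine rules (itemset -> class) by brute-force enumeration.

    records: list of (set_of_items, class_label) pairs.
    Returns (itemset, label, support, confidence) tuples, best rule first.
    """
    n = len(records)
    itemset_counts = Counter()  # how many records contain the itemset
    rule_counts = Counter()     # how many records contain it AND have the label
    for items, label in records:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                itemset_counts[combo] += 1
                rule_counts[(combo, label)] += 1

    rules = []
    for (combo, label), count in rule_counts.items():
        support = count / n
        confidence = count / itemset_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, label, support, confidence))

    # Rank by confidence, then support, then shorter and alphabetical itemsets.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0]), r[0]))
    return rules


def classify(rules, items, default=None):
    """Return (label, matching_rule); the rule doubles as the explanation."""
    for rule in rules:
        if set(rule[0]) <= items:
            return rule[1], rule
    return default, None


# Hypothetical toy data: items are abstract features, labels are classes.
records = [
    ({"a", "b"}, "pos"),
    ({"a", "c"}, "pos"),
    ({"a", "b", "c"}, "pos"),
    ({"b", "d"}, "neg"),
    ({"c", "d"}, "neg"),
    ({"d"}, "neg"),
]
rules = mine_class_rules(records)
label, explanation = classify(rules, {"c", "d"})
print(label, explanation)
```

Returning the matched rule together with the label is a deliberate choice here: it shows in the simplest possible form why a rule-based classifier lends itself to answering "how was the conclusion drawn?".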

Posted on September 26