NSERC Summer Research Projects
Web information retrieval and extraction
The general theme of our work for the past few summers has been
building tools and techniques for Web information retrieval, automated
fact and relationship extraction, and querying and visualization. This has been (and still
is) an important topic because of the large volume of information stored in unstructured formats (such as free text, HTML, links, and tables) and the desire to ask questions about these facts and relationships and to analyze the content.
For this summer (Summer 2019), I have a few projects (as listed here)
and will need 3 undergrad students who can help me with
building tools and/or running experiments. Here is a brief description of
the specific projects I am offering this summer. For more details, please feel free
to talk to me.
- P1. Querying web tables
The Web contains a huge collection of tables (often referred to as Web tables)
embedded in documents. These tables differ considerably from relational tables in that the schema information either is not available or is not known in advance. This project will build tools and run experiments for effectively querying Web tables.
This project builds upon a project developed in Summer 2018.
- P2. Cleaning web tables
Data on the web is generally not clean. In many cases, there is no quality control on what is being published. Another contributing factor to the noise level in web tables is the way content (not necessarily relational data) is presented in tabular form for better rendering. Also, the table schema (e.g., column names and their types) is not available for many tables. Because of these factors, many web tables are not relational or may not contain useful data. This project implements and evaluates a few strategies for cleaning web tables.
The project relates to some of our ongoing work on related topics.
- P3. Searching with sets
Keyword queries are neither effective nor efficient when searching for documents that may contain a subset of keywords from a set. For example, consider searching for articles that mention 3 players from the Edmonton Oilers roster. We are experimenting with a search engine that allows more efficient searches using sets. This project involves building tools and running experiments to efficiently support a search engine that uses and makes references to sets in queries.
This builds on a related project we have worked on lately.
- P4. Placing stories on a map
Consider a map interface to news where articles are placed on a map based on their geographical coverage or relevance. Users can zoom and pan to retrieve information at different levels of granularity. This project will implement the interface and run experiments to evaluate it.
This project relates to our recent work on related topics.
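To make the kind of query in P3 concrete, here is a minimal sketch, not the project's actual search engine. It assumes a tiny in-memory inverted index and single-word set members; the function names (`build_index`, `at_least_k`) and the sample documents and names are hypothetical and used only for illustration.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def at_least_k(index, members, k):
    """Return sorted ids of documents that mention at least k distinct members."""
    counts = defaultdict(int)
    # Deduplicate members so a repeated name is not counted twice.
    for member in {m.lower() for m in members}:
        for doc_id in index.get(member, ()):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c >= k)


# Hypothetical documents and a hypothetical "roster" set.
docs = {
    1: "Alpha and Bravo scored, Charlie assisted",
    2: "Alpha scored twice",
    3: "Bravo, Charlie and Delta played",
}
roster = {"alpha", "bravo", "charlie", "delta"}
index = build_index(docs)
matches = at_least_k(index, roster, 3)
```

Expressing the same request with plain keyword queries would require one query per 3-member subset of the set, which is what makes a set-aware query operator attractive.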
Applicants are expected to have a good CS background (preferably be in their 3rd or
4th year), possess good programming skills (in C/C++ and/or Python), and be motivated to work on these projects.
This is a research scholarship, and as such the project is generally open-ended and there is lots of room for creativity. I would expect the candidates to enjoy writing (robust) code and building prototypes.
If interested, please
- send me your resume including a description of courses taken, grades obtained, and major programming projects done, and
- fill out NSERC Form 202 online.
The NSERC undergrad summer scholarship is open to Canadian citizens and permanent residents of Canada.
Some Past Projects
Data Annotation Through Online Games
Facts and relationships that are extracted from the Web are often erroneous or
inaccurate, and verifying them can be a tedious and sometimes boring task.
What if we turn this task into an online game where, as the users play the game,
the verification happens behind the scenes? This doesn't sound boring anymore.
This is a project Eddie Santos and Stephen Romansky (two summer students)
did over a summer. Here is a
link to the game page. James Moore (another summer student) put together
an Android app for the game.
Extracting Facts from the Web
Data extraction from Web pages can be a tedious job, but this job can often be automated, and the extracted data can be easily queried once it is stored in a relational database. We are looking for a student to help us with building tools for data extraction and filtering. The extraction phase will involve building patterns or templates for extracting various pieces of text or filtering them. The storage part involves storing data in a database and indexing it. A candidate student is someone who
- likes writing robust code,
- knows C, C++, or Perl (or is eager to learn),
- knows the basics of database systems.
Searching the Web Connectivity
The connectivity graph of the Web can be searched for various paths, leading to a better understanding of both the topology of the Web and the connectivity between specific pages. Online processing of such queries can be quite time-consuming. With the help of an undergrad student, we will be looking into collecting various synopses and indexing those synopses within a relational system to speed up the search process. A candidate student for this work is someone who
- is a good programmer,
- enjoys using Linux tools for file processing,
- knows the basics of database systems including database tuning.