NSERC Summer Research Projects
The general theme of our work for the past few summers has been
building tools and techniques for Web information retrieval, automated
fact and relationship extraction, and querying and visualization. This has been (and still
is) an important topic because of the large volume of information stored in unstructured formats (such as free text, HTML, links, and tables) and the desire to ask questions about these facts and relationships and to analyze the content.
For this summer (Summer 2020), I have a few projects (as listed here)
and will need 3 undergrad students who can help me with
building tools and/or running experiments. Here is a brief description of
the specific projects I am offering this summer. For more details, please feel free
to talk to me.
- P1. Querying lists and web tables
The Web contains a huge collection of lists and tables (often referred to as Web tables)
embedded in documents. These lists and tables differ considerably from relational tables in that the schema information is either not available or not known in advance. This project will build tools and run experiments for effectively querying Web tables.
This project builds upon one we developed in past work.
- P2. Semantic search over Wikipedia
Wikipedia contains a huge collection of what is known as "common knowledge" in the form of facts, relationships, descriptions, etc. There are knowledge bases that are extracted from or closely related to Wikipedia, including DBpedia, Wikidata, and YAGO. While these knowledge bases support rich queries (e.g. expressed in SPARQL), searches over Wikipedia are often limited to keyword searches. This project is a step toward supporting richer queries over Wikipedia text.
The project relates to some of our other ongoing projects.
Applicants are expected to have a good CS background (preferably be in their 3rd or
4th year), possess good programming skills (in C/C++ and/or Python), and be motivated to work on these projects. The work is expected to produce tools, products, and resources, and/or to help with conducting experimental work and analysis.
This is a research scholarship, and as such the project is generally open-ended and there is lots of room for creativity. I would expect the candidates to enjoy writing (robust) code and building prototypes.
If interested, please
- send me your resume including a description of courses taken, grades obtained, and major programming projects done, and
- fill out NSERC Form 202 online.
The NSERC undergrad summer scholarship is open to Canadian citizens and permanent residents of Canada.
Some Past Projects
Data Annotation Through Online Games
Facts and relationships that are extracted from the Web are often erroneous or
inaccurate, and verifying them can be a tedious and sometimes boring task.
What if we turn this task into an online game where, as the users play the game,
the verification happens behind the scenes? This doesn't sound boring anymore.
This is a project Eddie Santos and Stephen Romansky (two summer students)
did over a summer. Here is a
link to the game page. James Moore (another summer student) put together
an Android app for the game.
Extracting Facts from the Web
Data extraction from Web pages can be a tedious job, but it can often be automated, and the extracted data can be easily queried once it is stored in a relational database. We are looking for a student to help us with building tools for data extraction and filtering. The extraction phase will involve building patterns or templates for extracting various pieces of text or filtering them. The storage part involves storing data in a database and indexing it. A candidate student is someone who
- likes writing robust code,
- knows C, C++, or Perl (or is eager to learn),
- knows the basics of database systems.
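To give a rough idea of the extract, filter, store, and index steps described above, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the tooling used in the project: the "was born in" pattern, the born_in table, and the function names are all hypothetical.

```python
import re
import sqlite3

# Hypothetical extraction pattern: sentences of the form
# "<Capitalized Name> was born in <4-digit year>".
BORN_IN = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\d{4})")


def extract_facts(text):
    """Return (person, year) pairs matched by the pattern.

    The year-range check is a simple filtering step (an assumption
    made for this sketch) that drops implausible matches.
    """
    facts = []
    for name, year in BORN_IN.findall(text):
        year = int(year)
        if 1000 <= year <= 2100:
            facts.append((name, year))
    return facts


def store_facts(conn, facts):
    """Store the extracted pairs in a relational table and index it."""
    conn.execute("CREATE TABLE IF NOT EXISTS born_in (person TEXT, year INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_born_in_person ON born_in (person)")
    conn.executemany("INSERT INTO born_in (person, year) VALUES (?, ?)", facts)
    conn.commit()


def lookup_year(conn, person):
    """Query the stored facts; the index on person supports this lookup."""
    rows = conn.execute(
        "SELECT year FROM born_in WHERE person = ? ORDER BY year", (person,)
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    sample = ("Alan Turing was born in 1912. Ada Lovelace was born in 1815. "
              "Nobody was born in 9999.")
    conn = sqlite3.connect(":memory:")
    store_facts(conn, extract_facts(sample))
    print(lookup_year(conn, "Ada Lovelace"))
```

A real extractor would need many patterns or learned templates and more careful filtering; the point here is only how pattern matching feeds a queryable, indexed table.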
Searching the Web Connectivity
The connectivity graph of the Web can be searched for various paths, leading to a better understanding of both the topology of the Web and the connectivity between specific pages. Online processing of such queries can be quite time-consuming. With the help of an undergrad student, we will be looking into collecting various synopses and indexing those synopses within a relational system to speed up the search process. A candidate student for this work is someone who
- is a good programmer,
- enjoys using Linux tools for file processing,
- knows the basics of database systems including database tuning.
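As a small illustration of path search over a connectivity graph kept in a relational system, the sketch below stores hyperlinks in an indexed SQLite table and runs a breadth-first search with one indexed lookup per visited page. It does not implement the synopses mentioned above; the links table, its columns, and the function names are assumptions made for this example only.

```python
import sqlite3
from collections import deque


def build_link_table(conn, edges):
    """Store (source page, destination page) pairs and index them by source."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (src TEXT, dst TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_links_src ON links (src)")
    conn.executemany("INSERT INTO links (src, dst) VALUES (?, ?)", edges)
    conn.commit()


def shortest_path(conn, start, goal):
    """Breadth-first search over the links table.

    Returns the list of pages on a shortest path from start to goal,
    or None if goal is not reachable from start.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = parent[page]
            return path[::-1]
        # One indexed lookup per expanded page.
        rows = conn.execute(
            "SELECT dst FROM links WHERE src = ? ORDER BY dst", (page,)
        ).fetchall()
        for (nxt,) in rows:
            if nxt not in parent:
                parent[nxt] = page
                queue.append(nxt)
    return None
```

On a graph the size of the Web, issuing one query per page like this would be slow, which is the kind of cost that precomputed summaries and careful indexing are meant to reduce.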