NSERC Summer Research Projects
The general theme of our work for the past few summers has been
building tools and techniques for Web information retrieval, automated
fact and relationship extraction, and querying and visualization. This has been (and still
is) an important topic because of the large volume of information stored in unstructured formats (such as free text, HTML, links, and tables) and the desire to ask questions about these facts and relationships and to analyze the content.
For this summer (Summer 2021), I have a few projects (listed below)
and will need 3 undergrad students who can help me with
building tools and/or running experiments. Here is a brief description of
the specific projects I am offering this summer. For more details, please feel free
to talk to me.
- P1. Data integration with open data sources
Data in open data repositories may be related to structured data within an organization, but integrating internal data with external sources is a challenge because of differences in the structure of the data, the formatting of column values, and sometimes the lack of a schema. This project will build tools and experiment with different data integration algorithms to address some of those challenges.
- P2. Interactive exploration of open data sources
Web tables and open data resources differ from relational tables in that the schema information is either unavailable or inaccurate. This project will build tools and experiment with different algorithms for exploratory querying and analysis of such data sources.
- P3. Evaluating natural language interfaces to databases
Consider the scenario where queries are expressed in a natural language and need to be mapped to SQL for evaluation. This project will develop a platform (including some datasets and queries) to evaluate some of the existing work in this area.
- P4. Efficiently searching medical text repositories
Consider searching for medical articles using a tool that takes natural language questions and retrieves the relevant articles. We have built one such tool in the past for a COVID-19 dataset. This project will evaluate the tool and develop improvements and extensions.
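To give a flavour of what an evaluation platform for P3 might measure, here is a minimal sketch of one commonly used metric, execution accuracy: a predicted SQL query counts as correct if it returns the same rows as a reference query on the same database. The table `paper`, its rows, the query pairs, and the function names are hypothetical and are here for illustration only; the actual platform may well use other metrics and datasets.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    hits = 0
    for predicted, gold in pairs:
        expected = run(conn, gold)
        got = run(conn, predicted)
        if expected is not None and got == expected:
            hits += 1
    return hits / len(pairs) if pairs else 0.0


# Hypothetical toy database (made-up rows, not a real dataset).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE paper (title TEXT, year INTEGER);
    INSERT INTO paper VALUES ('A', 2019), ('B', 2020), ('C', 2020);
""")

# Hypothetical (predicted, gold) pairs as a system under test might produce.
pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT title FROM paper WHERE year = 2020",
     "SELECT title FROM paper WHERE year >= 2020"),
    # Wrong result: counted as incorrect.
    ("SELECT COUNT(*) FROM paper",
     "SELECT COUNT(*) FROM paper WHERE year = 2019"),
    # Invalid SQL (misspelled column): counted as incorrect.
    ("SELECT titel FROM paper",
     "SELECT title FROM paper"),
    # Identical queries: counted as correct.
    ("SELECT year, COUNT(*) FROM paper GROUP BY year",
     "SELECT year, COUNT(*) FROM paper GROUP BY year"),
]

accuracy = execution_accuracy(conn, pairs)
```

Comparing result rows rather than SQL text lets two differently written but equivalent queries both count as correct. The comparison here ignores row order, which would need to change for queries with ORDER BY.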
Applicants are expected to have a good CS background (preferably be in their 3rd or
4th year), possess good programming skills (in C/C++ and/or Python), and be motivated to work on these projects. The work is expected to produce tools, products, and resources, and/or to support experimental work and analysis.
This is a research scholarship, and as such the project is generally open-ended and there is lots of room for creativity. I would expect the candidates to enjoy writing (robust) code and building prototypes.
If interested, please
- send me your resume including a description of courses taken, grades obtained, and major programming projects done, and
- fill out NSERC Form 202 online.
The NSERC undergrad summer scholarship is open to Canadian citizens and permanent residents of Canada.
Some Past Projects
Data Annotation Through Online Games
Facts and relationships that are extracted from the Web are often erroneous or
inaccurate, and verifying them can be a tedious and sometimes boring task.
What if we turn this task into an online game where, as the users play the game,
the verification happens behind the scenes? This doesn't sound boring anymore.
This is a project Eddie Santos and Stephen Romansky (two summer students)
did over a summer. Here is a
link to the game page. James Moore (another summer student) put together
an Android app for the game.
Extracting Facts from the Web
Data extraction from Web pages can be a tedious job, but it can often be automated, and the extracted data can be easily queried once it is stored in a relational database. We are looking for a student to help us with building tools for data extraction and filtering. The extraction phase will involve building patterns or templates for extracting various pieces of text or filtering them. The storage part involves storing data in a database and indexing it. A candidate student is someone who
- likes writing robust code,
- knows C, C++, or Perl (or is eager to learn),
- knows the basics of database systems.
Searching the Web Connectivity
The connectivity graph of the Web can be searched for various paths, leading to a better understanding of both the topology of the Web and the connectivity between specific pages. Online processing of such queries can be quite time-consuming. With the help of an undergrad student, we will be looking into collecting various synopses and indexing those synopses within a relational system to speed up the search process. A candidate student for this work is someone who
- is a good programmer,
- enjoys using Linux tools for file processing,
- knows the basics of database systems including database tuning.
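As a rough illustration of path search over a link graph stored in a relational system, here is a minimal sketch using SQLite from Python. It is a plain baseline (an indexed edge table and a depth-bounded recursive query), not the synopsis-based approach the project investigates; the table `link`, its toy rows, and the function `hops` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per hyperlink from page src to page dst (toy data).
    CREATE TABLE link (src TEXT NOT NULL, dst TEXT NOT NULL);
    -- Index on the source page so each expansion step is a lookup, not a scan.
    CREATE INDEX link_src ON link (src);
    INSERT INTO link VALUES
        ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'c'), ('d', 'a');
""")


def hops(conn, start, target, max_depth=5):
    """Smallest number of links from start to target.

    Returns None if target is not reachable within max_depth links.
    The depth bound also guarantees termination when the graph has cycles.
    """
    row = conn.execute(
        """
        WITH RECURSIVE reach(page, depth) AS (
            SELECT ?, 0
            UNION
            SELECT link.dst, reach.depth + 1
            FROM reach JOIN link ON link.src = reach.page
            WHERE reach.depth < ?
        )
        SELECT MIN(depth) FROM reach WHERE page = ?
        """,
        (start, max_depth, target),
    ).fetchone()
    return row[0]
```

For example, `hops(conn, 'a', 'd')` follows a → c → d. On a real Web graph this kind of online expansion grows quickly with depth, which is the cost that precomputed synopses are meant to reduce.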