NSERC Summer Research Projects
The general theme of our work over the past few summers has been
building tools and techniques for Web information retrieval, automated
fact and relationship extraction, and querying data in the wild. This has been (and still
is) an important topic because of the large volume of information stored in non-structured formats (such as free text, HTML, links, and tables) and the desire to ask questions about these facts and relationships and to analyze the content.
For this summer (Summer 2024), I have a few projects (listed below)
and will need two undergraduate students who can help me with
building tools and/or running experiments. Here is a brief description of
the specific projects I am offering this summer. They build on some of our recent work. For more details, please feel free
to talk to me.
- P1. Extracting entity groupings from web pages
Groupings of related entities (e.g. NHL teams, car models, etc.) can be found in web pages, usually either in the content of the pages or in the tables within those pages. This data is a useful resource for many applications. This project aims
to build tools and experiment with different algorithms to extract those entity groupings.
- P2. Exploring and integrating open data sources
Tables in open data repositories may be related to structured data within an organization, but finding the relationships and integrating such data with internal data sources is a challenge due to differences in the structure of the data, the formatting of column values, and sometimes the lack of a schema. This project will build tools, construct benchmarks, and experiment with different data integration algorithms to address some of those challenges.
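As a rough illustration of the kind of problem P2 deals with, here is a minimal sketch that ranks column pairs from two tables by the overlap of their normalized values. It is one simple baseline, not the method used in the project. The function names (`normalize`, `jaccard`, `rank_column_pairs`), the table layout (a dict from column name to a list of values), and the toy data are all assumptions made for this example.

```python
def normalize(value):
    """Lowercase and keep only letters and digits, so that formatting
    differences (case, spaces, punctuation) do not block a match."""
    return "".join(ch for ch in str(value).lower() if ch.isalnum())


def jaccard(col_a, col_b):
    """Jaccard similarity between the normalized value sets of two columns."""
    a = {normalize(v) for v in col_a}
    b = {normalize(v) for v in col_b}
    a.discard("")
    b.discard("")
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_column_pairs(table_a, table_b):
    """Score every column pair across two tables, best candidates first.
    Tables are assumed to be dicts: column name -> list of values."""
    pairs = [
        (jaccard(values_a, values_b), name_a, name_b)
        for name_a, values_a in table_a.items()
        for name_b, values_b in table_b.items()
    ]
    return sorted(pairs, reverse=True)


# Toy data, invented for illustration only.
open_table = {"City": ["Edmonton", "Calgary ", "Red Deer"], "Count": [10, 20, 30]}
internal_table = {"office_city": ["edmonton", "CALGARY", "Lethbridge"], "staff": [12, 31, 4]}

for score, col_a, col_b in rank_column_pairs(open_table, internal_table):
    print(f"{col_a} ~ {col_b}: {score:.2f}")
```

Value overlap needs no schema, which is why it makes a convenient starting point here, but it will miss columns that are related without sharing literal values.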
Applicants are expected to have a good CS background (preferably be in their 3rd or
4th year), possess good programming skills (in C/C++ and/or Python), and be motivated to work on these projects. The work is expected to produce tools, products, and resources, and/or to help with conducting experimental work and analysis.
This is a research scholarship, and as such the project is generally open-ended and there is lots of room for creativity. I would expect the candidates to enjoy writing (robust) code and building prototypes.
If interested, please
- send me your resume including a description of courses taken, grades obtained, and major programming projects done, and
- fill out NSERC Form 202 online.
The NSERC undergrad summer scholarship is open to Canadian citizens and permanent residents of Canada.
Some Past Projects
Data Annotation Through Online Games
Facts and relationships that are extracted from the Web are often erroneous or
inaccurate, and verifying them can be a tedious and sometimes boring task.
What if we turn this task into an online game where, as the users play the game,
the verification happens behind the scenes? This doesn't sound boring anymore.
This is a project Eddie Santos and Stephen Romansky (two summer students)
did over a summer. Here is a
link to the game page. James Moore (another summer student) put together
an Android app for the game.
Extracting Facts from the Web
Data extraction from Web pages can be a tedious job, but this job can often be automated, and the extracted data can be easily queried once it is stored in a relational database. We are looking for a student to help us with building tools for data extraction and filtering. The extraction phase will involve building patterns or templates for extracting various pieces of text or filtering them. The storage part involves storing the data in a database and indexing it. A candidate student is someone who
- likes writing robust code,
- knows C, C++, or Perl (or is eager to learn),
- knows the basics of database systems.
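The extract, filter, and store steps described above can be sketched roughly as follows. This is a minimal illustration in Python using the standard `re` and `sqlite3` modules, not the tooling built in the project. The pattern, the `founded` table, the function names, and the sample sentence are all hypothetical.

```python
import re
import sqlite3

# Hypothetical extraction pattern: "<Subject> was founded in <year>".
PATTERN = re.compile(r"(?P<subject>[A-Z][\w ]+?) was founded in (?P<year>\d{4})")


def extract_facts(text):
    """Return (subject, year) pairs for every match of the pattern."""
    return [(m.group("subject").strip(), int(m.group("year")))
            for m in PATTERN.finditer(text)]


def filter_facts(facts, min_year=1000, max_year=2100):
    """Drop facts whose year falls outside an assumed plausible range."""
    return [(s, y) for s, y in facts if min_year <= y <= max_year]


def store_facts(conn, facts):
    """Store facts in a relational table and index the year column."""
    conn.execute("CREATE TABLE IF NOT EXISTS founded (subject TEXT, year INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_founded_year ON founded (year)")
    conn.executemany("INSERT INTO founded VALUES (?, ?)", facts)
    conn.commit()


# Made-up sample text, for illustration only.
text = "Acme Corp was founded in 1990. Later, Example Labs was founded in 2005."
conn = sqlite3.connect(":memory:")
store_facts(conn, filter_facts(extract_facts(text)))
print(conn.execute("SELECT subject FROM founded WHERE year > 2000").fetchall())
```

Once the facts are in a table, questions such as "which subjects were founded after 2000" become ordinary SQL queries.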
Searching the Web Connectivity
The connectivity graph of the Web can be searched for various paths, leading to a better understanding of both the topology of the Web and the connectivity between specific pages. Online processing of such queries can be quite time-consuming. With the help of an undergrad student, we will be looking into collecting various synopses and indexing those synopses within a relational system to speed up the search process. A candidate student for this work is someone who
- is a good programmer,
- enjoys using Linux tools for file processing,
- knows the basics of database systems including database tuning.
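To give a feel for path queries over a connectivity graph kept in a relational system, here is a minimal sketch: links are stored in an indexed SQLite table and a breadth-first search looks up out-links with SQL. It does not show the synopses mentioned above, whose form the description leaves open. The `link` table, the function names, and the toy edges are assumptions made for this example.

```python
import sqlite3
from collections import deque


def build_link_table(edges):
    """Load (source, destination) page links into an indexed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE link (src TEXT, dst TEXT)")
    conn.execute("CREATE INDEX idx_link_src ON link (src)")
    conn.executemany("INSERT INTO link VALUES (?, ?)", edges)
    conn.commit()
    return conn


def shortest_path(conn, start, goal):
    """Breadth-first search; returns one shortest path as a list of pages,
    or None if the goal cannot be reached from the start."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = parent[page]
            return path[::-1]
        # The index on src keeps each out-link lookup cheap.
        for (nxt,) in conn.execute("SELECT dst FROM link WHERE src = ?", (page,)):
            if nxt not in parent:
                parent[nxt] = page
                queue.append(nxt)
    return None


# Toy link graph, invented for illustration only.
conn = build_link_table([("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")])
print(shortest_path(conn, "a", "e"))
```

Each expansion step costs one query, which hints at why online processing of such searches gets slow on a large graph.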