Filipe Mesquita, Yuval Merhav, Mirko Bronzi, Zhaochen Guo, Stephanie Husby


Frank W. Tompa (U. Waterloo), Paolo Merialdo (U. Roma Tre), Raymond Ng (UBC), Greg Kondrak (UofA), Davood Rafiei (UofA).

Project Description:

Business Intelligence (BI) software can typically use only well-structured data, such as data stored in relational databases. However, much of the information needed to make business decisions often exists only in documents (such as emails or reports). The goal of this project is to tap into this vast resource to gather useful BI data. Given a formal description of the data needed (e.g., a relational database schema), a sample of the data needed (e.g., an incomplete database instance), and a document collection, we aim to extract data from the documents and integrate it with the existing relational instance. We envision a system where (i) facts are inherently multidimensional, (ii) documents and tables are the main abstractions for user input and output, and (iii) facts are retrieved from internal and external sources, from structured or unstructured data. Based on the user input, the system infers a multidimensional conceptual model that, if materialized, would satisfy the user's information need.

Many difficulties exist in developing such an information extraction system, ranging from identifying entities in the text (e.g., identifying that Canada is a country) to identifying relationships among pairs of entities (e.g., that Canada and the US are neighbours). Furthermore, multiple relationships often exist between the same pair of entities (e.g., Canada is also a trade partner of the US). We are working on novel information extraction techniques that address these issues by exploiting contextual information provided not only in the documents themselves but also in other sources, such as the Web at large. We also plan to study iterative information extraction methods, in which we start from a set of known and trusted facts and iteratively extract new facts and new relations over time.
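To make the iterative idea more concrete, the following is a minimal sketch of one common form of iterative extraction (pattern bootstrapping from seed facts). It is an illustration only, not the project's actual method: the function name bootstrap, the toy sentences, and the use of the literal text between two entity mentions as a "pattern" are all assumptions made for this example, and the entity mentions are assumed to be already identified.

```python
import re

def bootstrap(sentences, entities, seeds, max_rounds=5):
    """Grow a set of (entity1, entity2) facts for one relation from trusted seeds.

    Hypothetical sketch: a "pattern" is simply the literal text found
    between the two entities of a known fact.
    """
    facts = set(seeds)
    patterns = set()
    for _ in range(max_rounds):
        # Step 1: learn patterns from sentences that mention a known fact.
        for s in sentences:
            for a, b in facts:
                m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), s)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply the learned patterns to propose new entity pairs.
        new_facts = set()
        for s in sentences:
            for p in patterns:
                for a in entities:
                    for b in entities:
                        if a != b and (a + p + b) in s:
                            new_facts.add((a, b))
        # Stop once a round produces nothing that is not already known.
        if new_facts <= facts:
            break
        facts |= new_facts
    return facts, patterns

# Toy input (assumed for illustration).
sentences = [
    "Canada shares a border with the US.",
    "Mexico shares a border with the US.",
    "France shares a border with Spain.",
    "Ottawa is the capital of Canada.",
]
entities = {"Canada", "the US", "Mexico", "France", "Spain", "Ottawa"}
seeds = {("Canada", "the US")}  # the known and trusted starting fact

facts, patterns = bootstrap(sentences, entities, seeds)
print(sorted(facts))
print(patterns)
```

Starting from the single trusted pair, the sketch learns one textual pattern and uses it to propose further pairs from the other sentences. A practical system would also need to score patterns and facts, since a pair of entities can participate in several relations and naive patterns can drift from the intended one.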

We have developed prototypes for the extraction and visualization of information from text:

This project is funded through the NSERC Business Intelligence Network.