Students:
Filipe Mesquita, Yuval Merhav, Mirko Bronzi, Zhaochen Guo, Stephanie Husby
Collaborators:
Frank W. Tompa (U. Waterloo), Paolo Merialdo (U. Roma Tre), Raymond Ng (UBC), Greg Kondrak (UofA), Davood Rafiei (UofA).
Project Description:
Business Intelligence (BI) software can typically use only well-structured data, such as data stored in relational databases. However, most of the information needed to make business decisions often exists only in documents (such as emails or reports). The goal of this project is to tap into this vast resource to gather useful BI data. Given a formal description of the data needed (e.g., a relational database schema), a sample of the data needed (e.g., an incomplete database instance), and a document collection, we aim to extract data from the documents and integrate these new data with the existing relational instance. We envision a system where (i) facts are inherently multidimensional, (ii) documents and tables are the main abstractions for user input and output, and (iii) facts are retrieved from internal and external sources, from structured or unstructured data. Based on the user input, the system infers a multidimensional conceptual model that, if materialized, would satisfy the user’s information need.
Many difficulties exist in developing such an information extraction system, ranging from identifying entities in the text (e.g., identifying that Canada is a country) to identifying relationships among pairs of entities (e.g., that Canada and the US are neighbours). Furthermore, multiple relationships often exist between the same pair of entities (e.g., Canada is also a trade partner of the US). We are working on novel information extraction techniques that address these issues by exploiting contextual information provided not only in the documents themselves but also in other sources, such as the Web at large. We also plan to study iterative information extraction methods, in which we start from a set of known and trusted facts and iteratively extract new facts and new relations over time.
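To make the iterative idea concrete, the following is a minimal sketch of a generic pattern-based bootstrapping loop. It is not the project's actual method: the function name `bootstrap`, the regular-expression entity matcher (a single capitalized token), and the toy sentences are all assumptions made for illustration.

```python
import re

# Hypothetical entity matcher: one capitalized token (an assumption, not real NER).
ENTITY = r"([A-Z]\w+)"

def bootstrap(sentences, seed_facts, max_rounds=3):
    """Grow a set of (entity1, entity2) facts starting from trusted seeds."""
    facts = set(seed_facts)
    patterns = set()
    for _ in range(max_rounds):
        # Step 1: learn the text between the two entities of every known fact.
        for sentence in sentences:
            for e1, e2 in facts:
                match = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), sentence)
                if match:
                    patterns.add(match.group(1))
        # Step 2: apply every learned pattern to find new entity pairs.
        new_facts = set()
        for sentence in sentences:
            for pattern in patterns:
                match = re.search(ENTITY + re.escape(pattern) + ENTITY, sentence)
                if match:
                    new_facts.add((match.group(1), match.group(2)))
        if new_facts <= facts:
            break  # Nothing new was extracted, so stop early.
        facts |= new_facts
    return facts, patterns

# Toy document collection (illustrative only).
sentences = [
    "France shares a border with Spain.",
    "Germany shares a border with Poland.",
    "Germany is a neighbour of Poland.",
    "Portugal is a neighbour of Spain.",
]
facts, patterns = bootstrap(sentences, {("France", "Spain")})
```

Starting from the single seed pair, the first round learns the "shares a border with" pattern and extracts the Germany and Poland pair; the second round uses that new pair to learn the "is a neighbour of" pattern, which in turn yields Portugal and Spain. A loop this naive accepts every pattern it sees, so a practical system would need to score patterns and facts to limit the accumulation of errors over rounds.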
We have developed prototypes for the extraction and visualization of information from text.
Related Publications:
- Shallow Information Extraction for the Knowledge Web. Denilson Barbosa, Haixun Wang and Cong Yu. In IEEE 29th International Conference on Data Engineering. IEEE, 2013.
- Extracting information networks from the blogosphere. Yuval Merhav, Filipe Mesquita, Denilson Barbosa, Wai Gen Yee and Ophir Frieder. ACM Trans. Web, 6(3), pp. 11:1-11:33, October 2012.
- Topic Classification of Blog Posts Using Distant Supervision. Stephanie Husby and Denilson Barbosa. In Proceedings of the Workshop on Semantic Analysis in Social Media, pp. 28-36. Association for Computational Linguistics, Avignon, France, April 2012.
- Extracting Meta Statements from the Blogosphere. Filipe Mesquita and Denilson Barbosa. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, pp. 225-232. AAAI, 2011.
- The Actor-Topic Model for Extracting Social Networks in Literary Narrative. Asli Celikyilmaz, Dilek Hakkani-Tur, Hua He, Greg Kondrak and Denilson Barbosa. In Proceedings of the NIPS 2010 Workshop on Machine Learning for Social Computing, 7 pp., 2010.
- Incorporating global information into named entity recognition systems using relational context. Yuval Merhav, Filipe Mesquita, Denilson Barbosa, Wai Gen Yee and Ophir Frieder. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 883-884. ACM, 2010.
- Labeling Data Extracted from the Web. Altigran Soares da Silva, Denilson Barbosa, Joao M. B. Cavalcanti and Marco A. S. Sevalho. In Proceedings of the 6th International Conference on Ontologies, DataBases, and Applications of Semantics, pp. 1099-1116. Springer, 2007.
Funding:
This project is funded through the NSERC Business Intelligence Network.