Filipe Mesquita, Yuval Merhav, Mirko Bronzi, Zhaochen Guo, Stephanie Husby


Frank W. Tompa (U. Waterloo), Paolo Merialdo (U. Roma Tre), Raymond Ng (UBC), Greg Kondrak (UofA), Davood Rafiei (UofA).

Project Description:

Business Intelligence (BI) software can typically use only well-structured data, such as the contents of relational databases. However, most of the information needed to make business decisions often exists only in documents (such as emails or reports). The goal of this project is to tap into this vast resource to gather useful BI data. Given a formal description of the data needed (e.g., a relational database schema), a sample of the data needed (e.g., an incomplete database instance), and a document collection, we aim to extract data from the documents and integrate these new data with the existing relational instance. We envision a system where (i) facts are inherently multidimensional, (ii) documents and tables are the main abstractions for user input and output, and (iii) facts are retrieved from internal and external sources, from structured or unstructured data. Based on the user input, the system infers a multidimensional conceptual model that, if materialized, would satisfy the user's information need.

Many difficulties exist in developing such an information extraction system, ranging from identifying entities in the text (e.g., identifying that Canada is a country) to identifying relationships among pairs of entities (e.g., that Canada and the US are neighbours). Furthermore, multiple relationships often exist between the same pair of entities (e.g., Canada is also a trade partner of the US). We are working on novel information extraction techniques that address these issues by exploiting contextual information provided not only in the documents themselves but also in other sources, such as the Web at large. We also plan to study iterative information extraction methods, in which we start from a set of known and trusted facts and iteratively extract new facts and new relations over time.
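To make the iterative idea concrete, here is a minimal sketch of pattern-based bootstrapping in Python. It is only an illustration under strong simplifying assumptions, not the method developed in this project: the function name `bootstrap`, the toy `SENTENCES`, and the rule that an entity is a single capitalized word are all hypothetical choices made for the example. Starting from one trusted fact, each round learns the text that connects known entity pairs and then reuses that text as a pattern to find new pairs.

```python
import re

# Toy document collection (hypothetical data for illustration only).
SENTENCES = [
    "France shares a border with Spain.",
    "Germany shares a border with Poland.",
    "Germany is a neighbour of Poland.",
    "Austria is a neighbour of Italy.",
]

def bootstrap(sentences, seed_facts, max_rounds=3):
    """Grow a set of (entity1, entity2) facts from a few trusted seed facts."""
    facts = set(seed_facts)
    patterns = set()
    for _ in range(max_rounds):
        # Step 1: learn patterns, i.e. the text found between the two
        # entities of an already known fact.
        for first, second in facts:
            for sentence in sentences:
                m = re.search(re.escape(first) + r"(.+?)" + re.escape(second), sentence)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply every learned pattern to find candidate entity pairs.
        # Assumption: an entity is a single capitalized word.
        found = set()
        for pattern in patterns:
            regex = re.compile(r"([A-Z]\w*)" + re.escape(pattern) + r"([A-Z]\w*)")
            for sentence in sentences:
                for m in regex.finditer(sentence):
                    found.add((m.group(1), m.group(2)))
        if found <= facts:  # nothing new was extracted, so stop early
            break
        facts |= found
    return facts, patterns

facts, patterns = bootstrap(SENTENCES, {("France", "Spain")})
print(sorted(facts))
# [('Austria', 'Italy'), ('France', 'Spain'), ('Germany', 'Poland')]
```

In this toy run, the first round learns " shares a border with " from the seed and finds (Germany, Poland); the second round learns " is a neighbour of " from that new fact and finds (Austria, Italy). A realistic system would also need to score patterns and facts, since a pattern learned from a pair that stands in several relationships can pull in facts about the wrong relation.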

Related Publications:

  • Shallow Information Extraction for the Knowledge Web. Denilson Barbosa, Haixun Wang and Cong Yu. In IEEE 29th International Conference on Data Engineering, pp. 1264-1267. IEEE, 2013.
  • Effectiveness and Efficiency of Open Relation Extraction. Filipe Mesquita, Jordan Schmidek and Denilson Barbosa. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 447-457. Association for Computational Linguistics, October 2013.
  • Extracting information networks from the blogosphere. Yuval Merhav, Filipe Mesquita, Denilson Barbosa, Wai Gen Yee and Ophir Frieder. ACM Trans. Web, 6(3), pp. 11:1-11:33, October 2012.
  • Open Information Extraction with Tree Kernels. Ying Xu, Mi-Young Kim, Kevin Quinn, Randy Goebel and Denilson Barbosa. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 868-877. Association for Computational Linguistics, Atlanta, Georgia, June 2013.
  • Topic Classification of Blog Posts Using Distant Supervision. Stephanie Husby and Denilson Barbosa. In Proceedings of the Workshop on Semantic Analysis in Social Media, pp. 28-36. Association for Computational Linguistics, Avignon, France, April 2012.
  • Extracting Meta Statements from the Blogosphere. Filipe Mesquita and Denilson Barbosa. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, pp. 225-232. AAAI, 2011.
  • Incorporating global information into named entity recognition systems using relational context. Yuval Merhav, Filipe Mesquita, Denilson Barbosa, Wai Gen Yee and Ophir Frieder. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 883-884. ACM, 2010.
  • Labeling Data Extracted from the Web. Altigran Soares da Silva, Denilson Barbosa, Joao M. B. Cavalcanti and Marco A. S. Sevalho. In Proceedings of the 6th International Conference on Ontologies, DataBases, and Applications of Semantics, pp. 1099-1116. Springer, 2007.


This project is funded through the NSERC Business Intelligence Network.