My current research is at the intersection of database systems and Web information retrieval, where I study scalable methods for search and exploration of large document collections and online resources, querying and analysis of large networks, and storage and indexing of non-traditional data. Here is an overview of my ongoing (and some past) projects:
The Web contains a large volume of structured data in the form of tables. Currently there is no easy way to query this data, mainly because of the lack of a schema and factors such as inaccurate or noisy content and the wide variation in table quality. We are studying some of the building blocks that can make this task possible; these include mechanisms for data annotation and cleaning, indexing of the content, and possibly an algebra for formulating queries.
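As a toy illustration of one such building block, here is a minimal sketch of an inverted index over web tables that maps (column header, cell value) pairs to the ids of tables containing them; the table layout and all names are invented for illustration, not drawn from our actual system:

```python
from collections import defaultdict

def index_tables(tables):
    """Build a simple inverted index over web tables, mapping
    (column header, cell value) pairs to the ids of the tables
    that contain them."""
    index = defaultdict(set)
    for table_id, table in tables.items():
        headers = table["headers"]
        for row in table["rows"]:
            for header, value in zip(headers, row):
                # Normalize case so lookups are not spelling-sensitive.
                index[(header.lower(), value.lower())].add(table_id)
    return index

# Two toy web tables sharing a "Country" column.
tables = {
    "t1": {"headers": ["Country", "Capital"],
           "rows": [["France", "Paris"], ["Japan", "Tokyo"]]},
    "t2": {"headers": ["Country", "Population"],
           "rows": [["France", "67M"]]},
}

index = index_tables(tables)
print(sorted(index[("country", "france")]))  # → ['t1', 't2']
```

A real system would layer annotation and cleaning on top of such an index; this sketch only shows how schema-less cells can still be made searchable by keying on the header/value pair.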
Text documents lack the kind of structure found in relational databases, and this limits the scope of searches and the kinds of queries that can be expressed. We have been studying ways of improving and extending keyword search over documents by exploring better ranking functions, pseudo-relevance feedback, techniques such as result diversification, and queries that better exploit the minimal structure that exists in text. Here are some of our ongoing and past activities:
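To make result diversification concrete, below is a small sketch of Maximal Marginal Relevance (MMR), a standard greedy diversification heuristic; the relevance scores and pairwise similarities are made-up toy values, not output of our system:

```python
def mmr(candidates, relevance, similarity, k=3, lam=0.7):
    """Maximal Marginal Relevance: greedily pick documents that are
    relevant to the query but dissimilar to those already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            # Penalize a candidate by its similarity to the closest
            # already-selected document.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: d1 and d2 are near-duplicates; d3 is less relevant but novel.
relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
sims = {frozenset(("d1", "d2")): 0.95,
        frozenset(("d1", "d3")): 0.1,
        frozenset(("d2", "d3")): 0.1}
similarity = lambda a, b: sims[frozenset((a, b))]

print(mmr(["d1", "d2", "d3"], relevance, similarity, k=2))  # → ['d1', 'd3']
```

With these values, MMR skips the near-duplicate d2 in favor of the novel d3, which is exactly the behavior diversification is after.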
Unlike traditional databases, which store a single and often the most recent snapshot of the modeled real world, a data warehouse can store multiple snapshots. The changes between these snapshots may reveal important information about the processes and events that trigger them. For instance, rising oil prices could be due to OPEC production cuts, uncertainty in the supply market, seasonal rises in demand, etc. Now consider aligning the sequence of oil prices with a stream of news: a positive correlation between the two sequences can pin down the triggering events. Our work studies the issues involved in querying and indexing large volumes of such historical data. In our past and ongoing work, we have focused mainly on similarity queries.
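As a sketch of the alignment idea above, the following computes the Pearson correlation between week-over-week price changes and a parallel count of news articles; all numbers are invented toy data, and the Pearson statistic is just one common choice of similarity measure:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Weekly oil-price snapshots from a hypothetical warehouse, and the
# number of supply-related news articles in the same weeks (toy data).
prices = [70, 72, 75, 74, 80, 85]
deltas = [b - a for a, b in zip(prices, prices[1:])]  # week-over-week change
news   = [3, 5, 1, 9, 8]

r = pearson(deltas, news)
print(round(r, 3))  # close to 1.0: price jumps track the news volume
```

A strong positive r suggests the two sequences move together, which is the signal one would use to shortlist candidate triggering events.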
There are many applications in which data is best modeled as a transient stream rather than a persistent relation. Examples include network monitoring, Web-based systems, financial applications, and security systems. These applications differ from traditional databases in several aspects: (1) the stream size is potentially unbounded and the data may not be stored locally, (2) a fast query response time is often needed, and (3) a quick answer with some allowable error may be preferred over a slow but precise answer. Our work studies various filtering and cleaning algorithms on streaming data with bounds on the error and the space used. In our past and ongoing work, we have been looking into approximate duplicate detection.
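A standard data structure for detecting duplicates in a stream under a fixed space budget is the Bloom filter, which trades a small false-positive rate for bounded memory. The sketch below is a minimal illustration of that trade-off, not our actual algorithm, and the parameter choices are arbitrary:

```python
import hashlib

class BloomFilter:
    """Space-bounded set membership with one-sided error: a 'seen before'
    answer may be a false positive, but 'not seen' is always correct."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a single integer

    def _positions(self, item):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def seen(self, item):
        """Return True if item was possibly seen before, then record it."""
        positions = list(self._positions(item))
        was_seen = all((self.bits >> p) & 1 for p in positions)
        for p in positions:
            self.bits |= 1 << p
        return was_seen

bf = BloomFilter()
stream = ["a", "b", "a", "c", "b"]
duplicates = [x for x in stream if bf.seen(x)]
print(duplicates)  # repeats are flagged; false positives are possible
```

The memory never grows with the stream, which is exactly property (1) above: duplicates can be filtered without storing the stream itself, at the cost of the allowable error in (3).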