7,973 research outputs found
Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information Extraction, data Integration, and uncertain data management are different areas of research that got vast focus in the last two decades. Many researches tackled those areas of research individually. However, information extraction systems should have integrated with data integration methods to make use of the extracted information. Handling uncertainty in extraction and integration process is an important issue to enhance the quality of the data in such integrated systems. This article presents the state of the art of the mentioned areas of research and shows the common grounds and how to integrate information extraction and data integration under uncertainty management cover
Techniques for improving efficiency and scalability for the integration of information retrieval and databases
PhDThis thesis is on the topic of integration of Information Retrieval (IR) and Databases (DB), with
particular focuses on improving efficiency and scalability of integrated IR and DB technology
(IR+DB). The main purpose of this study is to develop efficient and scalable techniques for
supporting integrated IR and DB technology, which is a popular approach today for handling
complex queries over text and structured data.
Our specific interest in this thesis is how to efficiently handle queries over large-scale text
and structured data. The work is based on a technology that integrates probability theory and
relational algebra, where retrievals for text and data are to be expressed in probabilistic logical
programs such as probabilistic relational algebra or probabilistic Datalog. To support efficient
processing of probabilistic logical programs, we proposed three optimization techniques
that focus on aspects covered logical and physical layers, which include: scoring-driven query
optimization using scoring expression, query processing with top-k incorporated pipeline, and
indexing with relational inverted index.
Specifically, scoring expressions are proposed for expressing the scoring or probabilistic semantics
of implied scoring functions of PRA expressions, so that efficient query execution plan
can be generated by rule-based scoring-driven optimizer. Secondly, to balance efficiency and
effectiveness so that to improve query response time, we studied methods for incorporating topk
algorithms into pipelined query execution engine for IR+DB systems. Thirdly, the proposed
relational inverted index integrates IR-style inverted index and DB-style tuple-based index, which
can be used to support efficient probability estimation and aggregation as well as conventional
relational operations.
Experiments were carried out to investigate the performances of proposed techniques. Experimental
results showed that the efficiency and scalability of an IR+DB prototype have been
improved, while the system can handle queries efficiently on considerable large data sets for a
number of IR tasks
- …