Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the Web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms
for processing a text database: either we can scan, or "crawl," the text database or, alternatively,
we can exploit search engine indexes and retrieve the documents of interest via carefully crafted
queries constructed in task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output "completeness" (e.g.,
in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to make the choice of execution
plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to
analyze query- and crawl-based plans in terms of both execution time and output completeness.
We adapt results from random-graph theory and statistics to develop a rigorous cost model for
the execution plans. Our cost model reflects the fact that the performance of the plans depends
on fundamental task-specific properties of the underlying text databases. We identify these
properties and present efficient techniques for estimating the associated parameters of the cost
model. We also present two optimization approaches for text-centric tasks that rely on the cost-model
parameters and select efficient execution plans. Overall, our optimization approaches
help build efficient execution plans for a task, resulting in significant efficiency and output
completeness benefits. We complement our results with a large-scale experimental evaluation
for three important text-centric tasks and over multiple real-life data sets.
Information Systems Working Papers Series
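As a toy illustration of the crawl-versus-query decision described above, one can compare two estimated costs and pick the cheaper plan. This is only a sketch under made-up linear cost formulas and hypothetical constants; the paper's actual cost model is derived from random-graph theory and statistics, not from anything shown here:

```python
# Toy sketch of a cost-based choice between a crawl-based and a
# query-based execution plan. All formulas and constants below are
# illustrative assumptions, not the paper's model.

def crawl_cost(db_size, time_per_doc, target_recall):
    # A scan processes documents until enough useful ones are seen;
    # assuming useful documents are spread uniformly, reaching the
    # recall target means scanning that fraction of the database.
    return db_size * target_recall * time_per_doc

def query_cost(num_queries, time_per_query, docs_per_query, time_per_doc):
    # A query-based plan pays a cost per query issued, plus a cost
    # per document retrieved and processed.
    return num_queries * (time_per_query + docs_per_query * time_per_doc)

def choose_plan(db_size, target_recall,
                time_per_doc=0.05, time_per_query=0.2,
                num_queries=500, docs_per_query=20):
    # Return the cheaper plan together with its estimated cost.
    crawl = crawl_cost(db_size, time_per_doc, target_recall)
    query = query_cost(num_queries, time_per_query, docs_per_query, time_per_doc)
    return ("crawl", crawl) if crawl <= query else ("query", query)

# On a small database a full scan tends to win; on a large one,
# carefully crafted queries become the cheaper way to reach recall.
print(choose_plan(db_size=10_000, target_recall=0.9))
print(choose_plan(db_size=10_000_000, target_recall=0.9))
```

The point of the sketch is only the shape of the trade-off: scan cost grows with database size, while query cost grows with the number of queries issued, so the best plan depends on task-specific database properties, which is exactly what the article's estimation techniques target.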
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Understanding, Estimating, and Incorporating Output Quality Into Join Algorithms For Information Extraction
Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world
applications often require that the output of multiple IE systems be joined to produce the data of interest. To
optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time.
In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution
plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop
a principled approach to understand, estimate, and incorporate output quality into the join optimization process
over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems
used to process the documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual
join algorithm used. Our analysis considers a variety of join algorithms from relational query optimization, and
predicts the output quality (and, of course, the execution time) of the alternative execution plans. We establish
the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a
large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
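The quality-aware optimization described above can be caricatured as constrained plan selection: among candidate join plans, choose the fastest one whose estimated output quality meets a user requirement. The plan names and the time/quality estimates below are made up for illustration; the paper derives such estimates analytically from the IE-system configurations, retrieval strategies, and join algorithms involved:

```python
# Minimal sketch of quality-aware join plan selection over extracted
# relations. Candidate plans and their estimates are hypothetical.

def pick_plan(candidates, min_quality):
    # candidates: list of (name, est_time, est_quality) tuples.
    # Keep only plans meeting the quality requirement, then take
    # the one with the lowest estimated execution time.
    qualifying = [p for p in candidates if p[2] >= min_quality]
    if not qualifying:
        return None  # no plan meets the quality requirement
    return min(qualifying, key=lambda p: p[1])

plans = [
    ("index-nested-loops", 120.0, 0.55),
    ("scan-join",          900.0, 0.95),
    ("hybrid",             400.0, 0.80),
]

print(pick_plan(plans, min_quality=0.75))  # → ('hybrid', 400.0, 0.8)
```

Note how relaxing or tightening `min_quality` changes the winner: a pure time-based optimizer would always pick the fastest plan, whereas here a slower plan can win because, unlike in the relational setting, different join plans over extracted relations produce outputs of genuinely different quality.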