Harnessing the Deep Web: Present and Future
Over the past few years, we have built a system that has exposed large
volumes of Deep-Web content to Google.com users. The content that our system
exposes contributes to more than 1000 search queries per second and spans over
50 languages and hundreds of domains. The Deep Web has long been acknowledged
to be a major source of structured data on the web, and hence accessing
Deep-Web content has long been a problem of interest in the data management
community. In this paper, we report on where we believe the Deep Web provides
value and where it does not. We contrast two very different approaches to
exposing Deep-Web content -- the surfacing approach that we used, and the
virtual integration approach that has often been pursued in the data management
literature. We emphasize where the values of each of the two approaches lie and
caution against potential pitfalls. We outline important areas of future
research and, in particular, emphasize the value that can be derived from
analyzing large collections of potentially disparate structured data on the
web.
Comment: CIDR 200
World-set Decompositions: Expressiveness and Efficient Algorithms
Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of world-set decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important decision problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e., an efficient algorithm for minimizing such representations.
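To make the space-efficiency claim concrete, here is a minimal sketch of the idea behind a decomposition of a finite world-set. It is not the paper's formalism or algorithms. It assumes a toy encoding in which a world-set is a list of independent components, each listing alternative assignments to some fields; the field names, values, and helper functions are all hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical toy encoding (an assumption, not the paper's definition):
# each component lists alternative assignments to a subset of fields,
# and the components are chosen independently of one another.
components = [
    [{"t1.name": "Smith", "t1.ssn": 185}, {"t1.name": "Smith", "t1.ssn": 785}],
    [{"t2.ssn": 185}, {"t2.ssn": 186}],
    [{"t2.name": "Brown"}],
]


def count_worlds(components):
    """Number of represented worlds: the product of the component sizes."""
    return prod(len(component) for component in components)


def enumerate_worlds(components):
    """Yield each world by picking one alternative from every component."""
    for choice in product(*components):
        world = {}
        for local_assignment in choice:
            world.update(local_assignment)
        yield world


worlds = list(enumerate_worlds(components))
```

In this sketch the stored size grows with the sum of the component sizes, while the number of represented worlds grows with their product, which is why a decomposed form can be much smaller than an explicit list of worlds.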
MayBMS: A System for Managing Large Amounts of Uncertain Data
This dissertation presents the foundations for building a scalable database management system for uncertain data, as it appears in data management scenarios such as data integration, data cleaning, scientific data management, and web data management. The result of this work is MayBMS, a scalable open-source database management system for managing large amounts of uncertain data. MayBMS uses so-called U-relational databases to represent uncertainty. U-relational databases store uncertainty and correlations in a purely relational way, and are a complete representation system for finite world sets. Other benefits of this representation model include compact storage and efficient query evaluation. The results of our experimental evaluation clearly show that query evaluation in MayBMS scales up to large data sizes and uncertainty ratios, and that MayBMS consistently outperforms other current systems for managing uncertain data. The dissertation also discusses optimization of queries on vertically partitioned data, efficient confidence computation algorithms, and challenges and solutions in designing an application programming interface for uncertain databases.