
    Harnessing the Deep Web: Present and Future

    Over the past few years, we have built a system that has exposed large volumes of Deep-Web content to Google.com users. The content that our system exposes contributes to more than 1000 search queries per second and spans over 50 languages and hundreds of domains. The Deep Web has long been acknowledged to be a major source of structured data on the web, and hence accessing Deep-Web content has long been a problem of interest in the data management community. In this paper, we report on where we believe the Deep Web provides value and where it does not. We contrast two very different approaches to exposing Deep-Web content -- the surfacing approach that we used, and the virtual integration approach that has often been pursued in the data management literature. We emphasize where the value of each approach lies and caution against potential pitfalls. We outline important areas of future research and, in particular, emphasize the value that can be derived from analyzing large collections of potentially disparate structured data on the web. Comment: CIDR 200

    World-set Decompositions: Expressiveness and Efficient Algorithms

    Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of world-set decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important decision problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e., an efficient algorithm for minimizing such representations.
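
    As a rough illustration of why a decomposition can be space-efficient, the sketch below treats a world-set as the product of independent components, each listing its local alternatives. The field names, the values, and the list-of-dicts encoding are hypothetical choices made for this example; they are not taken from the paper and do not reflect its actual data structures or algorithms.

```python
from itertools import product


def expand_worlds(components):
    """Enumerate every world represented by a product of components.

    Each component is a list of local worlds (dicts mapping field names
    to values). A full world picks one local world per component and
    merges them. This encoding is an assumption made for illustration.
    """
    worlds = []
    for choice in product(*components):
        world = {}
        for local_world in choice:
            world.update(local_world)
        worlds.append(world)
    return worlds


# Hypothetical example: two independent uncertain fields of one tuple.
components = [
    [{"t1.ssn": 185}, {"t1.ssn": 785}],                       # 2 alternatives
    [{"t1.status": 1}, {"t1.status": 2}, {"t1.status": 3}],   # 3 alternatives
]

worlds = expand_worlds(components)
stored_entries = sum(len(c) for c in components)

print(len(worlds))      # number of represented worlds (2 * 3)
print(stored_entries)   # number of stored local worlds (2 + 3)
```

    Under these assumptions, the stored size grows with the sum of the component sizes while the number of represented worlds grows with their product, which is the intuition behind the compactness claim. Explicit enumeration as done here is only feasible for tiny examples.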

    Fast and Simple Relational Processing of Uncertain Data


    MayBMS: A System for Managing Large Amounts of Uncertain Data

    This dissertation presents the foundations for building a scalable database management system for uncertain data, as it appears in data management scenarios such as data integration, data cleaning, scientific data, and web data management. The result of this work is MayBMS, a scalable open-source database management system for managing large amounts of uncertain data. MayBMS uses so-called U-relational databases to represent uncertainty. U-relational databases store uncertainty and correlations in a purely relational way, and are a complete representation system for finite world-sets. Other benefits of this representation model include compact storage and efficient query evaluation. The results of our experimental evaluation show that query evaluation in MayBMS scales up to large data sizes and uncertainty ratios, and that MayBMS consistently outperforms other current systems for managing uncertain data. The dissertation also discusses optimization of queries on vertically partitioned data, efficient confidence computation algorithms, and challenges and solutions when designing an application programming interface for uncertain databases.