7 research outputs found

    Materialized View Selection in XML Databases

    Materialized views, an RDBMS silver bullet, have demonstrated their efficacy in many applications, especially as a data warehousing/decision support system tool. The key to using materialized views efficiently is view selection. Though studied for over thirty years in RDBMSs, the selection is hard to make in the context of XML databases, where both the semi-structured data and the expressiveness of XML query languages add challenges to the view selection problem. We start our discussion by producing minimal XML views (in terms of size) as candidates for a given workload (a query set). To facilitate intuitive view selection, we present a view graph (called vcube) to structurally maintain all generated views. Basing our selection for materialization on vcube, we propose two view selection strategies, targeting space optimization and a space-time tradeoff, respectively. We built our implementation on top of Berkeley DB XML, demonstrating that significant performance improvement can be obtained using our proposed approaches.

    Semantics and efficient evaluation of partial tree-pattern queries on XML

    Current applications export and exchange XML data on the web. Usually, XML data are queried using keyword queries or using the standard structured query language XQuery, the core of which consists of the navigational query language XPath. In this context, one major challenge is the querying of the data when the structure of the data sources is complex or not fully known to the user. Another challenge is the integration of multiple data sources that export data with structural differences and irregularities. In this dissertation, a query language for XML called Partial Tree-Pattern Query (PTPQ) language is considered. PTPQs generalize and strictly contain Tree-Pattern Queries (TPQs) and can express a broad structural fragment of XPath. Because of their expressive power and flexibility, they are useful for querying XML documents whose structure is complex or not fully known to the user, and for integrating XML data sources with different structures. The dissertation focuses on three issues. The first one is the design of efficient non-main-memory evaluation methods for PTPQs. The second one is the assignment of semantics to PTPQs so that they return meaningful answers. The third one is the development of techniques for answering TPQs using materialized views. Non-main-memory XML query evaluation can be done in two modes (which also define two evaluation models). In the first mode, the data is preprocessed and indexes, called inverted lists, are built for it. In the second mode, the data is unindexed and arrives continuously in the form of a stream. Existing algorithms cannot be used directly or indirectly to efficiently compute PTPQs in either mode. Initially, the problem of efficiently evaluating partial path queries in the inverted lists model has been addressed. Partial path queries form a subclass of PTPQs which is not contained in the class of TPQs. Three novel algorithms for evaluating partial path queries, including a holistic one, have been designed.
The analytical and experimental results show that the holistic algorithm outperforms the other two. These results have been extended into holistic and non-holistic approaches for PTPQs in the inverted lists model. The experiments again show the superiority of the holistic approach. The dissertation has also addressed the problem of evaluating PTPQs in the streaming model, and two original efficient streaming algorithms for PTPQs have been designed. Compared to the only known streaming algorithm that supports an extension of TPQs, the experimental results show that the proposed algorithms perform better by orders of magnitude while consuming a much smaller fraction of memory space. An original approach for assigning semantics to PTPQs has also been devised. The novel semantics seamlessly applies to keyword queries and to queries with structural restrictions. In contrast to previous approaches that operate locally on data, the proposed approach operates globally on structural summaries of data to extract tree patterns. An experimental evaluation shows that, unlike previous approaches, our approach has perfect recall both for XML documents with complete and with incomplete data. It also shows better precision compared to approaches with similar recall. Finally, the dissertation has addressed the problem of answering XML queries using exclusively materialized views. An original approach for materializing views in the context of the inverted lists model has been suggested. Necessary and sufficient conditions have been provided for tree-pattern query answerability in terms of view-to-query homomorphisms. A time- and space-efficient algorithm was designed for deciding query answerability, and a technique for computing queries over view materializations using stack-based holistic algorithms was developed.
Further, optimizations were developed which (a) minimize the storage space and avoid redundancy by materializing views as bitmaps, and (b) optimize the evaluation of the queries over the views by applying bitwise operations on view materializations. The experimental results show that the proposed approach obtains considerably higher hit rates than previous approaches, significantly speeds up query evaluation compared to evaluation without views, and scales very smoothly in terms of storage space and computational overhead.
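
The idea of materializing views as bitmaps and combining them with bitwise operations can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a view's materialization is a bitmap over the positions of one label's inverted list; the names `to_bitmap`, `from_bitmap`, and `book_nodes`, as well as the node identifiers, are hypothetical and are not taken from the dissertation.

```python
from typing import Iterable, List


def to_bitmap(inverted_list: List[int], selected: Iterable[int]) -> int:
    """Encode a subset of an inverted list as an integer bitmap.

    Bit i is set when the i-th entry of the inverted list is selected.
    (Assumed encoding, for illustration only.)
    """
    position = {node: i for i, node in enumerate(inverted_list)}
    bitmap = 0
    for node in selected:
        bitmap |= 1 << position[node]
    return bitmap


def from_bitmap(inverted_list: List[int], bitmap: int) -> List[int]:
    """Decode a bitmap back into the node identifiers it marks."""
    return [node for i, node in enumerate(inverted_list) if (bitmap >> i) & 1]


# Hypothetical inverted list for one element label (node identifiers).
book_nodes = [3, 8, 15, 21, 34]

# Two hypothetical view materializations over the same inverted list.
view_a = to_bitmap(book_nodes, [3, 15, 34])
view_b = to_bitmap(book_nodes, [8, 15, 34])

# Nodes that appear in both views: a single bitwise AND.
both = view_a & view_b
result = from_bitmap(book_nodes, both)
print(result)
```

Under this assumed encoding each view costs one bit per inverted-list entry, and intersecting two views is a single integer operation rather than a merge of two node lists.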

    Structural Summaries as a Core Technology for Efficient XML Retrieval

    The Extensible Markup Language (XML) is extremely popular as a generic markup language for text documents with an explicit hierarchical structure. The different types of XML data found in today’s document repositories, digital libraries, intranets and on the web range from flat text with little meaningful structure to be queried, over truly semistructured data with a rich and often irregular structure, to rather rigidly structured documents with little text that would also fit a relational database system (RDBS). Not surprisingly, various ways of storing and retrieving XML data have been investigated, including native XML systems, relational engines based on RDBSs, and hybrid combinations thereof. Over the years a number of native XML indexing techniques have emerged, the most important ones being structure indices and labelling schemes. Structure indices represent the document schema (i.e., the hierarchy of nested tags that occur in the documents) in a compact central data structure so that structural query constraints (e.g., path or tree patterns) can be efficiently matched without accessing the documents. Labelling schemes specify ways to assign unique identifiers, or labels, to the document nodes so that specific relations (e.g., parent/child) between individual nodes can be inferred from their labels alone in a decentralized manner, again without accessing the documents themselves. Since both structure indices and labelling schemes provide compact approximate views on the document structure, we collectively refer to them as structural summaries. This work presents new structural summaries that enable highly efficient and scalable XML retrieval in native, relational and hybrid systems. The key contribution of our approach is threefold. (1) We introduce BIRD, a very efficient and expressive labelling scheme for XML, and the CADG, a combined text and structure index, and combine them as two complementary building blocks of the same XML retrieval system. 
(2) We propose a purely relational variant of BIRD and the CADG, called RCADG, that is extremely fast and scales up to large document collections. (3) We present the RCADG Cache, a hybrid system that enhances the RCADG with incremental query evaluation based on cached results of earlier queries. The RCADG Cache exploits schema information in the RCADG to detect cached query results that can supply some or all matches to a new query with little or no computational and I/O effort. A main-memory cache index ensures that reusable query results are quickly retrieved even in a huge cache. Our work shows that structural summaries significantly improve the efficiency and scalability of XML retrieval systems in several ways. Former relational approaches have largely ignored structural summaries. The RCADG shows that these native indexing techniques are equally effective for XML retrieval in RDBSs. BIRD, unlike some other labelling schemes, achieves high retrieval performance with a fairly modest storage overhead. To the best of our knowledge, the RCADG Cache is the only approach to take advantage of structural summaries for effectively detecting query containment or overlap. Moreover, no other XML cache we know of exploits intermediate results that are produced as a by-product during the evaluation from scratch. These are valuable cache contents that increase the effectiveness of the cache at no extra computational cost. Extensive experiments quantify the practical benefit of all of the proposed techniques, which amounts to a performance gain of several orders of magnitude compared to various other approaches
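
To make the notion of a labelling scheme concrete, here is a minimal Python sketch of a generic interval (start/end/level) labelling, under which ancestor and parent relations can be decided from two labels alone. This is a common textbook scheme chosen for illustration, not BIRD; the tree, the function names, and the label layout are all assumptions.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, int, int]  # (start, end, level)


def label_tree(children: Dict[str, List[str]], root: str) -> Dict[str, Label]:
    """Assign (start, end, level) labels by a depth-first traversal."""
    labels: Dict[str, Label] = {}
    counter = 0

    def visit(node: str, level: int) -> None:
        nonlocal counter
        start = counter
        counter += 1
        for child in children.get(node, []):
            visit(child, level + 1)
        labels[node] = (start, counter, level)
        counter += 1

    visit(root, 0)
    return labels


def is_ancestor(a: Label, d: Label) -> bool:
    """a is a proper ancestor of d iff a's interval strictly encloses d's."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(a: Label, d: Label) -> bool:
    """a is the parent of d iff it is an ancestor exactly one level up."""
    return is_ancestor(a, d) and d[2] == a[2] + 1


# Hypothetical document structure: element name -> child element names.
doc = {"lib": ["book", "journal"], "book": ["title", "author"]}
labels = label_tree(doc, "lib")

print(is_parent(labels["book"], labels["title"]))      # decided from labels only
print(is_ancestor(labels["journal"], labels["title"]))
```

Once the labels are assigned, the two checks never touch the document again, which is the property the abstract attributes to labelling schemes in general.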

    The Piazza Peer Data Management System


    Efficient Cache Answerability for XPath Queries

    The problem of cache answerability has traditionally been studied over conjunctive queries performed on top of a relational database system. However, with the proliferation of semistructured data and, in particular, of XML as the de facto standard for information interchange on the Internet, most of the assumptions and methods used for traditional systems – and cache answerability is no exception – need to be revisited from the point of view of the semistructured data and query model. In this paper, we present a formal framework for the efficient processing of XPath queries over XML documents in a cache environment that is based on the classic rewriting approach. Furthermore, we provide details on the implementation of our formal methods on top of HLCaches, an LDAP-based distributed caching system for XML, and show that our approach is more efficient than traditional query rewriting algorithms while, at the same time, supporting the full expressive power of XPath queries. Keywords: Semistructured data, cache answerability, query rewritability, XML, XPath, LDA