41 research outputs found

    Techniques efficaces basées sur des vues matérialisées pour la gestion des données du Web (algorithmes et systèmes)

    Get PDF
    Le langage XML, proposé par le W3C, est aujourd hui utilisé comme un modèle de données pour le stockage et l interrogation de grands volumes de données dans les systèmes de bases de données. En dépit d importants travaux de recherche et le développement de systèmes efficace, le traitement de grands volumes de données XML pose encore des problèmes des performance dus à la complexité et hétérogénéité des données ainsi qu à la complexité des langages courants d interrogation XML. Les vues matérialisées sont employées depuis des décennies dans les bases de données afin de raccourcir les temps de traitement des requêtes. Elles peuvent être considérées les résultats de requêtes pré-calculées, que l on réutilise afin d éviter de recalculer (complètement ou partiellement) une nouvelle requête. Les vues matérialisées ont fait l objet de nombreuses recherches, en particulier dans le contexte des entrepôts des données relationnelles.Cette thèse étudie l applicabilité de techniques de vues matérialisées pour optimiser les performances des systèmes de gestion de données Web, et en particulier XML, dans des environnements distribués. Dans cette thèse, nos apportons trois contributions.D abord, nous considérons le problème de la sélection des meilleures vues à matérialiser dans un espace de stockage donné, afin d améliorer la performance d une charge de travail des requêtes. Nous sommes les premiers à considérer un sous-langage de XQuery enrichi avec la possibilité de sélectionner des noeuds multiples et à de multiples niveaux de granularités. La difficulté dans ce contexte vient de la puissance expressive et des caractéristiques du langage des requêtes et des vues, et de la taille de l espace de recherche de vues que l on pourrait matérialiser.Alors que le problème général a une complexité prohibitive, nous proposons et étudions un algorithme heuristique et démontrer ses performances supérieures par rapport à l état de l art.Deuxièmement, nous considérons la gestion de grands corpus XML dans des réseaux pair à pair, basées sur des tables de hachage distribuées. Nous considérons la plateforme ViP2P dans laquelle des vues XML distribuées sont matérialisées à partir des données publiées dans le réseau, puis exploitées pour répondre efficacement aux requêtes émises par un pair du réseau. Nous y avons apporté d importantes optimisations orientées sur le passage à l échelle, et nous avons caractérisé la performance du système par une série d expériences déployées dans un réseau à grande échelle. Ces expériences dépassent de plusieurs ordres de grandeur les systèmes similaires en termes de volumes de données et de débit de dissémination des données. Cette étude est à ce jour la plus complète concernant une plateforme de gestion de contenus XML déployée entièrement et testée à une échelle réelle.Enfin, nous présentons une nouvelle approche de dissémination de données dans un système d abonnements, en présence de contraintes sur les ressources CPU et réseau disponibles; cette approche est mise en oeuvre dans le cadre de notre plateforme Delta. Le passage à l échelle est obtenu en déchargeant le fournisseur de données de l effort de répondre à une partie des abonnements. Pour cela, nous tirons profit de techniques de réécriture de requêtes à l aide de vues afin de diffuser les données de ces abonnements, à partir d autres abonnements.Notre contribution principale est un nouvel algorithme qui organise les vues dans un réseau de dissémination d information multi-niveaux ; ce réseau est calculé à l aide d outils techniques de programmation linéaire afin de passer à l échelle pour de grands nombres de vues, respecter les contraintes de capacité du système, et minimiser les délais de propagation des information. L efficacité et la performance de notre algorithme est confirmée par notre évaluation expérimentale, qui inclut l étude d un déploiement réel dans un réseau WAN.XML was recommended by W3C in 1998 as a markup language to be used by device- and system-independent methods of representing information. XML is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, many performance problems are raised by processing very large amounts of XML data. Materialized views have long been used in databases to speed up queries. Materialized views can be seen as precomputed query results that can be re-used to evaluate (part of) another query, and have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized views techniques to optimize the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.We first consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery. The challenges we face stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.Second, we consider the management of large XML corpora in peer-to-peer networks, based on distributed hash tables (or DHTs, in short). We consider a platform leveraging distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. This thesis has contributed important scalability oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments outgrow by orders of magnitude similar competitor systems in terms of data volumes and data dissemination throughput. Thus, they are the most advanced in understanding the performance behavior of DHT-based XML content management in real settings.Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub, in short) in the presence of constraints on the available computational resources of data publishers. We achieve scalability by off-loading subscriptions from the publisher, and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm for organizing subscriptions in a multi-level dissemination network in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.PARIS11-SCD-Bib. électronique (914719901) / SudocSudocFranceF

    CRIS-IR 2006

    Get PDF
    The recognition of entities and their relationships in document collections is an important step towards the discovery of latent knowledge as well as to support knowledge management applications. The challenge lies on how to extract and correlate entities, aiming to answer key knowledge management questions, such as; who works with whom, on which projects, with which customers and on what research areas. The present work proposes a knowledge mining approach supported by information retrieval and text mining tasks in which its core is based on the correlation of textual elements through the LRD (Latent Relation Discovery) method. Our experiments show that LRD outperform better than other correlation methods. Also, we present an application in order to demonstrate the approach over knowledge management scenarios.Fundação para a Ciência e a Tecnologia (FCT) Denmark's Electronic Research Librar

    An Algebraic Approach to XQuery Optimization

    Get PDF
    As more data is stored in XML and more applications need to process this data, XML query optimization becomes performance critical. While optimization techniques for relational databases have been developed over the last thirty years, the optimization of XML queries poses new challenges. Query optimizers for XQuery, the standard query language for XML data, need to consider both document order and sequence order. Nevertheless, algebraic optimization proved powerful in query optimizers in relational and object oriented databases. Thus, this dissertation presents an algebraic approach to XQuery optimization. In this thesis, an algebra over sequences is presented that allows for a simple translation of XQuery into this algebra. The formal definitions of the operators in this algebra allow us to reason formally about algebraic optimizations. This thesis leverages the power of this formalism when unnesting nested XQuery expressions. In almost all cases unnesting nested queries in XQuery reduces query execution times from hours to seconds or milliseconds. Moreover, this dissertation presents three basic algebraic patterns of nested queries. For every basic pattern a decision tree is developed to select the most effective unnesting equivalence for a given query. Query unnesting extends the search space that can be considered during cost-based optimization of XQuery. As a result, substantially more efficient query execution plans may be detected. This thesis presents two more important cases where the number of plan alternatives leads to substantially shorter query execution times: join ordering and reordering location steps in path expressions. Our algebraic framework detects cases where document order or sequence order is destroyed. However, state-of-the-art techniques for order optimization in cost-based query optimizers have efficient mechanisms to repair order in these cases. The results obtained for query unnesting and cost-based optimization of XQuery underline the need for an algebraic approach to XQuery optimization for efficient XML query processing. Moreover, they are applicable to optimization in relational databases where order semantics are considered

    A New Approach for Fast Processing of SPARQL Queries on RDF Quadruples

    Get PDF
    Title from PDF of title page, viewed on July 7, 2015Dissertation advisor: Praveen R. RaoVitaIncludes bibliographic references (pages 87-92)Thesis (Ph.D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2015The Resource Description Framework (RDF) is a standard model for representing data on the Web. It enables the interchange and machine processing of data by considering its semantics. While RDF was first proposed with the vision of enabling the Semantic Web, it has now become popular in domain-specific applications and the Web. Through advanced RDF technologies, one can perform semantic reasoning over data and extract knowledge in domains such as healthcare, biopharmaceuticals, defense, and intelligence. Popular approaches like RDF-3X perform poorly on RDF datasets containing billions of triples when the queries are large and complex. This is because of the large number of join operations that must be performed during query processing. Moreover, most of the scalable approaches were designed to operate on RDF triples instead of quads. To address these issues, we propose to develop a new approach for fast and cost-effective processing of SPARQL queries on large RDF datasets containing RDF quadruples (or quads). Our approach employs a decrease-and-conquer strategy: Rather than indexing the entire RDF dataset, it identifies groups of similar RDF graphs and indexes each group separately. During query processing, it uses a novel filtering index to first identify candidate groups that may contain matches for the query. On these candidates, it executes queries using a conventional SPARQL processor to produce the final results. A query optimization strategy using the candidate groups to further improve the query processing performance is also used.Introduction -- Background and motivations -- The design of RIQ -- Implementation of RIQ -- Evaluation -- Conclusion and future work -- Appendix A. Queries -- Appendix B. SPARQL gramma

    TopX : efficient and versatile top-k query processing for text, structured, and semistructured data

    Get PDF
    TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonous score aggregation function with respect to a multidimensional query. The main contributions of the thesis unfold into four main points, confirmed by previous publications at international conferences or workshops: • Top-k query processing with probabilistic guarantees. • Index-access optimized top-k query processing. • Dynamic and self-tuning, incremental query expansion for top-k query processing. • Efficient support for ranked XML retrieval and full-text search. Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work for a broad variety of retrieval scenarios.TopX ist eine Top-k Suchmaschine für Text und XML Daten. Im Gegensatz zu Boole\u27; schen Suchmaschinen terminiert TopX die Anfragebearbeitung, sobald die k besten Ergebnisobjekte im Hinblick auf eine mehrdimensionale Anfrage gefunden wurden. Die Hauptbeiträge dieser Arbeit teilen sich in vier Schwerpunkte basierend auf vorherigen Veröffentlichungen bei internationalen Konferenzen oder Workshops: • Top-k Anfragebearbeitung mit probabilistischen Garantien. • Zugriffsoptimierte Top-k Anfragebearbeitung. • Dynamische und selbstoptimierende, inkrementelle Anfrageexpansion für Top-k Anfragebearbeitung. • Effiziente Unterstützung für XML-Anfragen und Volltextsuche. Unsere Experimente bestätigen die Vielseitigkeit und gesteigerte Effizienz unserer Verfahren gegenüber existierenden, führenden Ansätzen für eine weite Bandbreite von Anwendungen in der Informationssuche

    Distributed XML Query Processing

    Get PDF
    While centralized query processing over collections of XML data stored at a single site is a well understood problem, centralized query evaluation techniques are inherently limited in their scalability when presented with large collections (or a single, large document) and heavy query workloads. In the context of relational query processing, similar scalability challenges have been overcome by partitioning data collections, distributing them across the sites of a distributed system, and then evaluating queries in a distributed fashion, usually in a way that ensures locality between (sub-)queries and their relevant data. This thesis presents a suite of query evaluation techniques for XML data that follow a similar approach to address the scalability problems encountered by XML query evaluation. Due to the significant differences in data and query models between relational and XML query processing, it is not possible to directly apply distributed query evaluation techniques designed for relational data to the XML scenario. Instead, new distributed query evaluation techniques need to be developed. Thus, in this thesis, an end-to-end solution to the scalability problems encountered by XML query processing is proposed. Based on a data partitioning model that supports both horizontal and vertical fragmentation steps (or any combination of the two), XML collections are fragmented and distributed across the sites of a distributed system. Then, a suite of distributed query evaluation strategies is proposed. These query evaluation techniques ensure locality between each fragment of the collection and the parts of the query corresponding to the data in this fragment. Special attention is paid to scalability and query performance, which is achieved by ensuring a high degree of parallelism during distributed query evaluation and by avoiding access to irrelevant portions of the data. For maximum flexibility, the suite of distributed query evaluation techniques proposed in this thesis provides several alternative approaches for evaluating a given query over a given distributed collection. Thus, to achieve the best performance, it is necessary to predict and compare the expected performance of each of these alternatives. In this work, this is accomplished through a query optimization technique based on a distribution-aware cost model. The same cost model is also used to fine-tune the way a collection is fragmented to the demands of the query workload evaluated over this collection. To evaluate the performance impact of the distributed query evaluation techniques proposed in this thesis, the techniques were implemented within a production-quality XML database system. Based on this implementation, a thorough experimental evaluation was performed. The results of this evaluation confirm that the distributed query evaluation techniques introduced here lead to significant improvements in query performance and scalability both when compared to centralized techniques and when compared to existing distributed query evaluation techniques

    Proceedings of the 9th International Workshop on Information Retrieval on Current Research Information Systems

    Get PDF
    The recognition of entities and their relationships in document collections is an important step towards the discovery of latent knowledge as well as to support knowledge management applications. The challenge lies on how to extract and correlate entities, aiming to answer key knowledge management questions, such as; who works with whom, on which projects, with which customers and on what research areas. The present work proposes a knowledge mining approach supported by information retrieval and text mining tasks in which its core is based on the correlation of textual elements through the LRD (Latent Relation Discovery) method. Our experiments show that LRD outperform better than other correlation methods. Also, we present an application in order to demonstrate the approach over knowledge management scenarios

    A Framework for the Organization and Discovery of Information Resources in a WWW Environment Using Association, Classification and Deduction

    Get PDF
    The Semantic Web is envisioned as a next-generation WWW environment in which information is given well-defined meaning. Although the standards for the Semantic Web are being established, it is as yet unclear how the Semantic Web will allow information resources to be effectively organized and discovered in an automated fashion. This dissertation research explores the organization and discovery of resources for the Semantic Web. It assumes that resources on the Semantic Web will be retrieved based on metadata and ontologies that will provide an effective basis for automated deduction. An integrated deduction system based on the Resource Description Framework (RDF), the DARPA Agent Markup Language (DAML) and description logic (DL) was built. A case study was conducted to study the system effectiveness in retrieving resources in a large Web resource collection. The results showed that deduction has an overall positive impact on the retrieval of the collection over the defined queries. The greatest positive impact occurred when precision was perfect with no decrease in recall. The sensitivity analysis was conducted over properties of resources, subject categories, query expressions and relevance judgment in observing their relationships with the retrieval performance. The results highlight both the potentials and various issues in applying deduction over metadata and ontologies. Further investigation will be required for additional improvement. The factors that can contribute to degraded performance were identified and addressed. Some guidelines were developed based on the lessons learned from the case study for the development of Semantic Web data and systems

    Doctor of Philosophy

    Get PDF
    dissertationLinked data are the de-facto standard in publishing and sharing data on the web. To date, we have been inundated with large amounts of ever-increasing linked data in constantly evolving structures. The proliferation of the data and the need to access and harvest knowledge from distributed data sources motivate us to revisit several classic problems in query processing and query optimization. The problem of answering queries over views is commonly encountered in a number of settings, including while enforcing security policies to access linked data, or when integrating data from disparate sources. We approach this problem by efficiently rewriting queries over the views to equivalent queries over the underlying linked data, thus avoiding the costs entailed by view materialization and maintenance. An outstanding problem of query rewriting is the number of rewritten queries is exponential to the size of the query and the views, which motivates us to study problem of multiquery optimization in the context of linked data. Our solutions are declarative and make no assumption for the underlying storage, i.e., being store-independent. Unlike relational and XML data, linked data are schema-less. While tracking the evolution of schema for linked data is hard, keyword search is an ideal tool to perform data integration. Existing works make crippling assumptions for the data and hence fall short in handling massive linked data with tens to hundreds of millions of facts. Our study for keyword search on linked data brought together the classical techniques in the literature and our novel ideas, which leads to much better query efficiency and quality of the results. Linked data also contain rich temporal semantics. To cope with the ever-increasing data, we have investigated how to partition and store large temporal or multiversion linked data for distributed and parallel computation, in an effort to achieve load-balancing to support scalable data analytics for massive linked data

    Content based dissemination of XML data

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH
    corecore