80 research outputs found

    Answering Tag-Term Keyword Queries over XML Documents in DHT Networks

    Get PDF
    Abstract. The emergence of Peer-to-Peer (P2P) computing model and the popularity of Extensible Markup Language (XML) as the web data format have fueled the extensive research on retrieving XML data in P2P networks. In this paper, we developed an efficient and effective keyword search framework that can support tag-term keyword queries in Distributed Hash Table (DHT) networks. We employed a concise Bloom-Filter data structure to index XML meta-data in the DHT repository. We also developed an effective algorithm that supports tag-term keyword queries over our Bloom-Filter encoded XML meta-data in the DHT network. We conducted extensive experiments to demonstrate the efficiency of indexing scheme, the effectiveness of our keyword query algorithm and the system scalability of our framework

    The ViP2P Platform: XML Views in P2P

    Get PDF
    The growing volumes of XML data sources on the Web or produced by enterprises, organizations etc. raise many performance challenges for data management applications. In this work, we are concerned with the distributed, peer-to-peer management of large corpora of XML documents, based on distributed hash table (or DHT, in short) overlay networks. We present ViP2P (standing for Views in Peer-to-Peer), a distributed platform for sharing XML documents based on a structured P2P network infrastructure (DHT). At the core of ViP2P stand distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. ViP2P allows user queries to be evaluated over XML documents published by peers in two modes. First, a long-running subscription mode, when a query can be registered in the system and receive answers incrementally when and if published data matches the query. Second, queries can also be asked in an ad-hoc, snapshot mode, where results are required immediately and must be computed based on the results of other long-running, subscription queries. ViP2P innovates over other similar DHT-based XML sharing platforms by using a very expressive structured XML query language. This expressivity leads to a very flexible distribution of XML content in the ViP2P network, and to efficient snapshot query execution. ViP2P has been tested in real deployments of hundreds of computers. We present the platform architecture, its internal algorithms, and demonstrate its efficiency and scalability through a set of experiments. Our experimental results outgrow by orders of magnitude similar competitor systems in terms of data volumes, network size and data dissemination throughput.Comment: RR-7812 (2011

    Keyword-based search in peer-to-peer networks

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Processing Rank-Aware Queries in Schema-Based P2P Systems

    Get PDF
    Effiziente Anfragebearbeitung in Datenintegrationssystemen sowie in P2P-Systemen ist bereits seit einigen Jahren ein Aspekt aktueller Forschung. Konventionelle Datenintegrationssysteme bestehen aus mehreren Datenquellen mit ggf. unterschiedlichen Schemata, sind hierarchisch aufgebaut und besitzen eine zentrale Komponente: den Mediator, der ein globales Schema verwaltet. Anfragen an das System werden auf diesem globalen Schema formuliert und vom Mediator bearbeitet, indem relevante Daten von den Datenquellen transparent für den Benutzer angefragt werden. Aufbauend auf diesen Systemen entstanden schließlich Peer-Daten-Management-Systeme (PDMSs) bzw. schemabasierte P2P-Systeme. An einem PDMS teilnehmende Knoten (Peers) können einerseits als Mediatoren agieren andererseits jedoch ebenso als Datenquellen. Darüber hinaus sind diese Peers autonom und können das Netzwerk jederzeit verlassen bzw. betreten. Die potentiell riesige Datenmenge, die in einem derartigen Netzwerk verfügbar ist, führt zudem in der Regel zu sehr großen Anfrageergebnissen, die nur schwer zu bewältigen sind. Daher ist das Bestimmen einer vollständigen Ergebnismenge in vielen Fällen äußerst aufwändig oder sogar unmöglich. In diesen Fällen bietet sich die Anwendung von Top-N- und Skyline-Operatoren, ggf. in Verbindung mit Approximationstechniken, an, da diese Operatoren lediglich diejenigen Datensätze als Ergebnis ausgeben, die aufgrund nutzerdefinierter Ranking-Funktionen am relevantesten für den Benutzer sind. Da durch die Anwendung dieser Operatoren zumeist nur ein kleiner Teil des Ergebnisses tatsächlich dem Benutzer ausgegeben wird, muss nicht zwangsläufig die vollständige Ergebnismenge berechnet werden sondern nur der Teil, der tatsächlich relevant für das Endergebnis ist. Die Frage ist nun, wie man derartige Anfragen durch die Ausnutzung dieser Erkenntnis effizient in PDMSs bearbeiten kann. Die Beantwortung dieser Frage ist das Hauptanliegen dieser Dissertation. Zur Lösung dieser Problemstellung stellen wir effiziente Anfragebearbeitungsstrategien in PDMSs vor, die die charakteristischen Eigenschaften ranking-basierter Operatoren sowie Approximationstechniken ausnutzen. Peers werden dabei sowohl auf Schema- als auch auf Datenebene hinsichtlich der Relevanz ihrer Daten geprüft und dementsprechend in die Anfragebearbeitung einbezogen oder ausgeschlossen. Durch die Heterogenität der Peers werden Techniken zum Umschreiben einer Anfrage von einem Schema in ein anderes nötig. Da existierende Techniken zum Umschreiben von Anfragen zumeist nur konjunktive Anfragen betrachten, stellen wir eine Erweiterung dieser Techniken vor, die Anfragen mit ranking-basierten Anfrageoperatoren berücksichtigt. Da PDMSs dynamische Systeme sind und teilnehmende Peers jederzeit ihre Daten ändern können, betrachten wir in dieser Dissertation nicht nur wie Routing-Indexe verwendet werden, um die Relevanz eines Peers auf Datenebene zu bestimmen, sondern auch wie sie gepflegt werden können. Schließlich stellen wir SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems) vor, ein System, welches im Rahmen dieser Dissertation entwickelt wurde und alle vorgestellten Techniken implementiert.In recent years, there has been considerable research with respect to query processing in data integration and P2P systems. Conventional data integration systems consist of multiple sources with possibly different schemas, adhere to a hierarchical structure, and have a central component (mediator) that manages a global schema. Queries are formulated against this global schema and the mediator processes them by retrieving relevant data from the sources transparently to the user. Arising from these systems, eventually Peer Data Management Systems (PDMSs), or schema-based P2P systems respectively, have attracted attention. Peers participating in a PDMS can act both as a mediator and as a data source, are autonomous, and might leave or join the network at will. Due to these reasons peers often hold incomplete or erroneous data sets and mappings. The possibly huge amount of data available in such a network often results in large query result sets that are hard to manage. Due to these reasons, retrieving the complete result set is in most cases difficult or even impossible. Applying rank-aware query operators such as top-N and skyline, possibly in conjunction with approximation techniques, is a remedy to these problems as these operators select only those result records that are most relevant to the user. Being aware that in most cases only a small fraction of the complete result set is actually output to the user, retrieving the complete set before evaluating such operators is obviously inefficient. Therefore, the questions we want to answer in this dissertation are how to compute such queries in PDMSs and how to do that efficiently. We propose strategies for efficient query processing in PDMSs that exploit the characteristics of rank-aware queries and optionally apply approximation techniques. A peer's relevance is determined on two levels: on schema-level and on data-level. According to its relevance a peer is either considered for query processing or not. Because of heterogeneity queries need to be rewritten, enabling cooperation between peers that use different schemas. As existing query rewriting techniques mostly consider conjunctive queries only, we present an extension that allows for rewriting queries involving rank-aware query operators. As PDMSs are dynamic systems and peers might update their local data, this dissertation addresses not only the problem of considering such structures within a query processing strategy but also the problem of keeping them up-to-date. Finally, we provide a system-level evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems) -- a system created in the context of this dissertation implementing all presented techniques

    Techniques efficaces basées sur des vues matérialisées pour la gestion des données du Web (algorithmes et systèmes)

    Get PDF
    Le langage XML, proposé par le W3C, est aujourd hui utilisé comme un modèle de données pour le stockage et l interrogation de grands volumes de données dans les systèmes de bases de données. En dépit d importants travaux de recherche et le développement de systèmes efficace, le traitement de grands volumes de données XML pose encore des problèmes des performance dus à la complexité et hétérogénéité des données ainsi qu à la complexité des langages courants d interrogation XML. Les vues matérialisées sont employées depuis des décennies dans les bases de données afin de raccourcir les temps de traitement des requêtes. Elles peuvent être considérées les résultats de requêtes pré-calculées, que l on réutilise afin d éviter de recalculer (complètement ou partiellement) une nouvelle requête. Les vues matérialisées ont fait l objet de nombreuses recherches, en particulier dans le contexte des entrepôts des données relationnelles.Cette thèse étudie l applicabilité de techniques de vues matérialisées pour optimiser les performances des systèmes de gestion de données Web, et en particulier XML, dans des environnements distribués. Dans cette thèse, nos apportons trois contributions.D abord, nous considérons le problème de la sélection des meilleures vues à matérialiser dans un espace de stockage donné, afin d améliorer la performance d une charge de travail des requêtes. Nous sommes les premiers à considérer un sous-langage de XQuery enrichi avec la possibilité de sélectionner des noeuds multiples et à de multiples niveaux de granularités. La difficulté dans ce contexte vient de la puissance expressive et des caractéristiques du langage des requêtes et des vues, et de la taille de l espace de recherche de vues que l on pourrait matérialiser.Alors que le problème général a une complexité prohibitive, nous proposons et étudions un algorithme heuristique et démontrer ses performances supérieures par rapport à l état de l art.Deuxièmement, nous considérons la gestion de grands corpus XML dans des réseaux pair à pair, basées sur des tables de hachage distribuées. Nous considérons la plateforme ViP2P dans laquelle des vues XML distribuées sont matérialisées à partir des données publiées dans le réseau, puis exploitées pour répondre efficacement aux requêtes émises par un pair du réseau. Nous y avons apporté d importantes optimisations orientées sur le passage à l échelle, et nous avons caractérisé la performance du système par une série d expériences déployées dans un réseau à grande échelle. Ces expériences dépassent de plusieurs ordres de grandeur les systèmes similaires en termes de volumes de données et de débit de dissémination des données. Cette étude est à ce jour la plus complète concernant une plateforme de gestion de contenus XML déployée entièrement et testée à une échelle réelle.Enfin, nous présentons une nouvelle approche de dissémination de données dans un système d abonnements, en présence de contraintes sur les ressources CPU et réseau disponibles; cette approche est mise en oeuvre dans le cadre de notre plateforme Delta. Le passage à l échelle est obtenu en déchargeant le fournisseur de données de l effort de répondre à une partie des abonnements. Pour cela, nous tirons profit de techniques de réécriture de requêtes à l aide de vues afin de diffuser les données de ces abonnements, à partir d autres abonnements.Notre contribution principale est un nouvel algorithme qui organise les vues dans un réseau de dissémination d information multi-niveaux ; ce réseau est calculé à l aide d outils techniques de programmation linéaire afin de passer à l échelle pour de grands nombres de vues, respecter les contraintes de capacité du système, et minimiser les délais de propagation des information. L efficacité et la performance de notre algorithme est confirmée par notre évaluation expérimentale, qui inclut l étude d un déploiement réel dans un réseau WAN.XML was recommended by W3C in 1998 as a markup language to be used by device- and system-independent methods of representing information. XML is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, many performance problems are raised by processing very large amounts of XML data. Materialized views have long been used in databases to speed up queries. Materialized views can be seen as precomputed query results that can be re-used to evaluate (part of) another query, and have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized views techniques to optimize the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.We first consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery. The challenges we face stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.Second, we consider the management of large XML corpora in peer-to-peer networks, based on distributed hash tables (or DHTs, in short). We consider a platform leveraging distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. This thesis has contributed important scalability oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments outgrow by orders of magnitude similar competitor systems in terms of data volumes and data dissemination throughput. Thus, they are the most advanced in understanding the performance behavior of DHT-based XML content management in real settings.Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub, in short) in the presence of constraints on the available computational resources of data publishers. We achieve scalability by off-loading subscriptions from the publisher, and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm for organizing subscriptions in a multi-level dissemination network in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.PARIS11-SCD-Bib. électronique (914719901) / SudocSudocFranceF

    Survey over Existing Query and Transformation Languages

    Get PDF
    A widely acknowledged obstacle for realizing the vision of the Semantic Web is the inability of many current Semantic Web approaches to cope with data available in such diverging representation formalisms as XML, RDF, or Topic Maps. A common query language is the first step to allow transparent access to data in any of these formats. To further the understanding of the requirements and approaches proposed for query languages in the conventional as well as the Semantic Web, this report surveys a large number of query languages for accessing XML, RDF, or Topic Maps. This is the first systematic survey to consider query languages from all these areas. From the detailed survey of these query languages, a common classification scheme is derived that is useful for understanding and differentiating languages within and among all three areas

    Query-driven indexing in large-scale distributed systems

    Get PDF
    Efficient and effective search in large-scale data repositories requires complex indexing solutions deployed on a large number of servers. Web search engines such as Google and Yahoo! already rely upon complex systems to be able to return relevant query results and keep processing times within the comfortable sub-second limit. Nevertheless, the exponential growth of the amount of content on the Web poses serious challenges with respect to scalability. Coping with these challenges requires novel indexing solutions that not only remain scalable but also preserve the search accuracy. In this thesis we introduce and explore the concept of query-driven indexing – an index construction strategy that uses caching techniques to adapt to the querying patterns expressed by users. We suggest to abandon the strict difference between indexing and caching, and to build a distributed indexing structure, or a distributed cache, such that it is optimized for the current query load. Our experimental and theoretical analysis shows that employing query-driven indexing is especially beneficial when the content is (geographically) distributed in a Peer-to-Peer network. In such a setting extensive bandwidth consumption has been identified as one of the major obstacles for efficient large-scale search. Our indexing mechanisms combat this problem by maintaining the query popularity statistics and by indexing (caching) intermediate query results that are requested frequently. We present several indexing strategies for processing multi-keyword and XPath queries over distributed collections of textual and XML documents respectively. Experimental evaluations show significant overall traffic reduction compared to the state-of-the-art approaches. We also study possible query-driven optimizations for Web search engine architectures. Contrary to the Peer-to-Peer setting, Web search engines use centralized caching of query results to reduce the processing load on the main index. We analyze real search engine query logs and show that the changes in query traffic that such a results cache induces fundamentally affect indexing performance. In particular, we study its impact on index pruning efficiency. We show that combination of both techniques enables efficient reduction of the query processing costs and thus is practical to use in Web search engines

    Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks

    Get PDF
    The state-of-the-art techniques for processing multi-term queries in P2P environments are query flooding and inverted list intersection. However, it has been shown that due to scalability reasons both methods fail to support full-text search in large scale document collections distributed among the nodes in a P2P network. Although a number of optimizations have been suggested recently based on the aforementioned techniques, little evidence is given on their scalability. In this paper we suggest a novel query-driven indexing strategy which generates and maintains only those index entries that are actually used for query processing. In our approach called Distributed Cache Table (DCT), by analogy with Distributed Hash Table (DHT), we suggest to abandon the difference between data indexing and query caching, and to store result sets (caches) for the most profitable queries. DCT employs a distributed index to efficiently locate caches that can answer a given multi-term query and broadcasts the query to all the peers only if no such caches were found. Evaluations on real data and query loads show that DCT converges to a high cache-hit ratio and indeed offers a large-scale distributed solution for storing and efficient querying of vast amounts of documents in the P2P setting. DCT achieves two orders of magnitude improvement in traffic consumption compared to a standard distributed single-term indexing approach

    Intuitionistic fuzzy XML query matching and rewriting

    Get PDF
    With the emergence of XML as a standard for data representation, particularly on the web, the need for intelligent query languages that can operate on XML documents with structural heterogeneity has recently gained a lot of popularity. Traditional Information Retrieval and Database approaches have limitations when dealing with such scenarios. Therefore, fuzzy (flexible) approaches have become the predominant. In this thesis, we propose a new approach for approximate XML query matching and rewriting which aims at achieving soft matching of XML queries with XML data sources following different schemas. Unlike traditional querying approaches, which require exact matching, the proposed approach makes use of Intuitionistic Fuzzy Trees to achieve approximate (soft) query matching. Through this new approach, not only the exact answer of a query, but also approximate answers are retrieved. Furthermore, partial results can be obtained from multiple data sources and merged together to produce a single answer to a query. The proposed approach introduced a new tree similarity measure that considers the minimum and maximum degrees of similarity/inclusion of trees that are based on arc matching. New techniques for soft node and arc matching were presented for matching queries against data sources with highly varied structures. A prototype was developed to test the proposed ideas and it proved the ability to achieve approximate matching for pattern queries with a number of XML schemas and rewrite the original query so that it obtain results from the underlying data sources. This has been achieved through several novel algorithms which were tested and proved efficiency and low CPU/Memory cost even for big number of data sources

    Semantic Cache Reasoners

    Get PDF
    corecore