
    Structural Summaries as a Core Technology for Efficient XML Retrieval

    The Extensible Markup Language (XML) is extremely popular as a generic markup language for text documents with an explicit hierarchical structure. The different types of XML data found in today's document repositories, digital libraries, intranets and on the web range from flat text with little meaningful structure to be queried, over truly semistructured data with a rich and often irregular structure, to rather rigidly structured documents with little text that would also fit a relational database system (RDBS). Not surprisingly, various ways of storing and retrieving XML data have been investigated, including native XML systems, relational engines based on RDBSs, and hybrid combinations thereof.

    Over the years a number of native XML indexing techniques have emerged, the most important ones being structure indices and labelling schemes. Structure indices represent the document schema (i.e., the hierarchy of nested tags that occur in the documents) in a compact central data structure so that structural query constraints (e.g., path or tree patterns) can be efficiently matched without accessing the documents. Labelling schemes specify ways to assign unique identifiers, or labels, to the document nodes so that specific relations (e.g., parent/child) between individual nodes can be inferred from their labels alone in a decentralized manner, again without accessing the documents themselves. Since both structure indices and labelling schemes provide compact approximate views of the document structure, we collectively refer to them as structural summaries.

    This work presents new structural summaries that enable highly efficient and scalable XML retrieval in native, relational and hybrid systems. Our contribution is threefold. (1) We introduce BIRD, a very efficient and expressive labelling scheme for XML, and the CADG, a combined text and structure index, and combine them as two complementary building blocks of the same XML retrieval system. (2) We propose a purely relational variant of BIRD and the CADG, called RCADG, that is extremely fast and scales up to large document collections. (3) We present the RCADG Cache, a hybrid system that enhances the RCADG with incremental query evaluation based on cached results of earlier queries. The RCADG Cache exploits schema information in the RCADG to detect cached query results that can supply some or all matches to a new query with little or no computational and I/O effort. A main-memory cache index ensures that reusable query results are quickly retrieved even in a huge cache.

    Our work shows that structural summaries significantly improve the efficiency and scalability of XML retrieval systems in several ways. Former relational approaches have largely ignored structural summaries; the RCADG shows that these native indexing techniques are equally effective for XML retrieval in RDBSs. BIRD, unlike some other labelling schemes, achieves high retrieval performance with a fairly modest storage overhead. To the best of our knowledge, the RCADG Cache is the only approach that takes advantage of structural summaries for effectively detecting query containment or overlap. Moreover, no other XML cache we know of exploits intermediate results that are produced as a by-product during query evaluation from scratch; these are valuable cache contents that increase the effectiveness of the cache at no extra computational cost. Extensive experiments quantify the practical benefit of all of the proposed techniques, which amounts to a performance gain of several orders of magnitude compared to various other approaches.
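    To make the labelling-scheme idea concrete, here is a minimal Python sketch of a classic Dewey-style node labelling (not BIRD itself): labels are tuples of sibling positions on the root-to-node path, and structural relations such as parent/child or ancestor/descendant are decided from the labels alone, without touching the documents.

```python
# Illustrative sketch of a Dewey-style labelling scheme (a classic scheme,
# not BIRD): structural relations between two XML nodes are inferred from
# their labels alone, with no access to the documents themselves.

def parent(label):
    """Label of the parent node, or None for the root."""
    return label[:-1] if len(label) > 1 else None

def is_ancestor(a, b):
    """a is a proper ancestor of b iff a's label is a proper prefix of b's."""
    return len(a) < len(b) and b[:len(a)] == a

def is_parent(a, b):
    return len(b) == len(a) + 1 and b[:len(a)] == a

# Labels are tuples of sibling positions on the root-to-node path.
root, chapter, section = (1,), (1, 2), (1, 2, 1)
assert parent(section) == chapter
assert is_parent(chapter, section)
assert is_ancestor(root, section)
```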

    Efficient Materialized View-Based Techniques for Web Data Management (Algorithms and Systems)

    XML was recommended by the W3C in 1998 as a markup language for device- and system-independent representation of information, and it is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, processing very large amounts of XML data still raises performance problems, owing to the complexity and heterogeneity of the data and to the complexity of current XML query languages. Materialized views have long been used in databases to speed up queries: they can be seen as precomputed query results that are re-used to avoid (completely or partially) re-evaluating a new query, and they have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized view techniques to optimizing the performance of Web data management tools, in particular for XML data and queries in distributed settings. We make three contributions.

    First, we consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery, enriched with the possibility of selecting multiple nodes at multiple levels of granularity. The challenges stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.

    Second, we consider the management of large XML corpora in peer-to-peer networks based on distributed hash tables (DHTs). We consider the ViP2P platform, in which distributed materialized XML views, defined by arbitrary XML queries, are filled in with data published anywhere in the network and exploited to efficiently answer queries issued by any network peer. This thesis contributes important scalability-oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments outgrow similar competitor systems by orders of magnitude in terms of data volumes and data dissemination throughput; to date, this is the most complete study of a fully deployed, real-scale DHT-based XML content management platform.

    Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub) in the presence of constraints on the CPU and network resources available to data publishers; this approach is implemented in our Delta platform. We achieve scalability by off-loading subscriptions from the publisher and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm that organizes the subscriptions into a multi-level dissemination network, computed with linear programming tools so as to scale to large numbers of views, respect the capacity constraints of the system, and minimize propagation latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments, including a real deployment in a WAN.
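    As an illustration of the dissemination idea in the third contribution, the hedged sketch below greedily attaches each subscription either to the publisher or to an already-placed subscription that can feed it via view-based rewriting, under a fan-out capacity, preferring shallow parents to keep latency low. The `covers` oracle and the greedy strategy are simplifying assumptions standing in for the thesis's linear-programming formulation.

```python
# Simplified greedy stand-in for Delta's LP-based network construction.
# `covers(v, s)` is a hypothetical oracle for "view v can feed subscription s
# by view-based rewriting"; `capacity` bounds each node's fan-out.

def build_dissemination_tree(subs, covers, capacity):
    children = {"publisher": []}   # node -> subscriptions it feeds
    depth = {"publisher": 0}       # depth is a proxy for propagation delay
    for s in subs:
        # Candidate parents: the publisher, or any placed subscription
        # covering s, provided the parent still has spare fan-out.
        candidates = [p for p in children
                      if len(children[p]) < capacity
                      and (p == "publisher" or covers(p, s))]
        if not candidates:
            raise RuntimeError("no feasible parent under capacity constraints")
        parent = min(candidates, key=lambda p: depth[p])  # minimize latency
        children[parent].append(s)
        children[s] = []
        depth[s] = depth[parent] + 1
    return children
```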

    Practical Dynamic Symbolic Execution for JavaScript


    Query Evaluation in the Presence of Fine-grained Access Control

    Access controls are mechanisms to enhance security by protecting data from unauthorized access. In contrast to traditional access controls that grant access rights at the granularity of whole tables or views, fine-grained access controls specify access rights at a finer granularity, e.g., individual nodes in XML databases and individual tuples in relational databases. While there is a voluminous literature on specifying and modeling fine-grained access controls, less work has been done to address the performance issues of database systems with fine-grained access controls. This thesis addresses those performance issues and proposes corresponding solutions. In particular, the following issues are addressed: effective storage of massive access controls, efficient query planning for secure query evaluation, and accurate cardinality estimation for access-controlled data.

    Because fine-grained access controls specify access rights from each user to each piece of data in the system, they effectively form a massive matrix whose size is the product of the number of users and the size of the data. Fine-grained access controls therefore require a very compact encoding to be feasible. The storage system proposed in this thesis achieves an unprecedented level of compactness by leveraging the high correlation of access controls found in real system data. This correlation comes from two sides: the structural similarity of access rights between data, and the similarity of access patterns of different users. The encoding can be embedded into a linearized representation of XML data, so that a query evaluation framework can compute the answer to an access-controlled query with minimal disk I/O to the access controls.

    Query optimization is a crucial component of database systems. This thesis proposes an intelligent query plan caching mechanism with a lower amortized cost of query planning in the presence of fine-grained access controls. The rationale behind this caching mechanism is that queries, customized by different access controls from different users, may share common upper-level join trees in their optimal query plans. Since join plan generation is an expensive step in query optimization, reusing the upper-level join trees reduces optimization effort significantly. The proposed caching mechanism matches cached query plans to access-controlled queries with minimal runtime cost. On a query plan cache miss, the optimizer needs to optimize an access-controlled query from scratch, which depends on accurate cardinality estimation of the sizes of intermediate query results. This thesis proposes a novel sampling scheme with better accuracy than traditional cardinality estimation techniques.
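    As a toy illustration of how the described correlation makes the user-by-data matrix compressible (this is not the thesis's actual encoding), the sketch below dictionary-encodes access control lists so that all nodes granting access to the same set of users share a single stored entry, and the full matrix is never materialized.

```python
# Illustrative sketch: dictionary-encoding of fine-grained access controls.
# Nodes with identical ACLs share one stored user set, exploiting the
# structural similarity of access rights between pieces of data.

class CompactACL:
    def __init__(self):
        self._dict = {}      # frozenset of user ids -> small integer code
        self._codes = []     # code -> frozenset of user ids
        self._node_acl = {}  # node id -> code

    def set_acl(self, node, users):
        key = frozenset(users)
        code = self._dict.setdefault(key, len(self._codes))
        if code == len(self._codes):   # first time we see this ACL
            self._codes.append(key)
        self._node_acl[node] = code

    def can_read(self, user, node):
        return user in self._codes[self._node_acl[node]]

acl = CompactACL()
for node in range(1_000):   # 1,000 nodes, but only two distinct ACLs stored
    acl.set_acl(node, {"alice"} if node % 2 else {"alice", "bob"})
assert acl.can_read("bob", 0) and not acl.can_read("bob", 1)
```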

    Scalable hosting of web applications

    Modern Web sites have evolved from simple monolithic systems into complex multi-tiered systems. In contrast to traditional Web sites, these sites do not simply deliver pre-written content but dynamically generate content using (one or more) multi-tiered Web applications. In this thesis, we address the question: how can multi-tiered Web applications be hosted in a scalable manner? Scaling up a Web application requires scaling its individual tiers. To this end, various research works have proposed techniques that employ replication or caching solutions at different tiers. However, most of these techniques aim to optimize the performance of individual tiers and not the entire application. A key observation made in our research is that there exists no single elixir technique that performs best for all Web applications. Effective hosting of a Web application requires careful selection and deployment of several techniques at different tiers. To this end, we present several caching and replication strategies, such as GlobeCBC, GlobeDB and GlobeTP, to improve the scalability of different tiers of a Web application. While these techniques and systems improve the performance of the individual tiers (and eventually the application), an application's administrator is interested not only in the performance of its individual tiers but also in its end-to-end performance. To this end, we propose a resource provisioning approach that allows us to choose the best resource configuration for hosting a Web application, such that its end-to-end response time is optimized with minimum usage of resources. The proposed approach is based on an analytical model for multi-tier systems, which allows us to derive expressions for estimating the mean end-to-end response time and its variance.
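    The following is a minimal sketch of the kind of analytical reasoning involved, under the simplifying assumption that each tier behaves as an independent M/M/1 queue; the thesis's model is more elaborate, but the mean end-to-end response time is likewise obtained by composing per-tier estimates.

```python
# Hedged sketch: a plain M/M/1-per-tier approximation (not the thesis's
# exact analytical model) of mean end-to-end response time.
# service_times: mean service time (seconds) per request at each tier;
# arrival_rate: request arrival rate (requests/second).

def end_to_end_response_time(service_times, arrival_rate):
    total = 0.0
    for s in service_times:
        utilization = arrival_rate * s
        if utilization >= 1.0:
            raise ValueError("tier saturated: utilization >= 1")
        total += s / (1.0 - utilization)   # M/M/1 mean sojourn time
    return total

# Example: web, application and database tiers at 50 requests/second.
print(end_to_end_response_time([0.002, 0.008, 0.012], arrival_rate=50))
```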

    Physical database design in document stores

    Thesis under joint supervision (cotutelle) between Universitat Politècnica de Catalunya and Université libre de Bruxelles. NoSQL is an umbrella term used to classify storage systems that are alternatives to traditional Relational Database Management Systems (RDBMSs). Among these, document stores have gained popularity mainly due to their semi-structured data storage model and rich query capabilities. They encourage users to take a data-first approach, as opposed to a design-first one. Database design on document stores is mainly carried out in a trial-and-error or ad-hoc rule-based manner, instead of through a formal process such as normalization in an RDBMS. However, such approaches can easily lead to a non-optimal design, resulting in additional costs in the long run. This PhD thesis aims to provide a novel multi-criteria approach to database design in document stores. Most existing approaches optimize query performance only, but other factors matter as well, including the storage requirement and the complexity of the stored documents, which are specific to each use case. Because of the different possible combinations of referencing and nesting of data, there is also a large solution space of alternative designs. We therefore believe that multi-criteria optimization is ideal for solving this problem. To achieve this, we need to address several issues that enable applying multi-criteria optimization to the data design problem.

    First, we evaluate the impact of alternative storage representations of semi-structured data. There are multiple, equivalent ways to physically represent semi-structured data, but there is a lack of evidence about their potential impact on space and query performance. We therefore quantify that impact precisely for document stores, empirically comparing multiple ways of representing semi-structured data; this allows us to derive a set of guidelines for efficient physical database design that considers both JSON and relational options in the same palette.

    Then, we need a formal canonical model that can represent alternative designs. We propose a hypergraph-based approach for representing heterogeneous data store designs: we extend and formalize an existing common programming interface to NoSQL systems as hypergraphs, define design constraints and query transformation rules for representative data store types, propose a simple query rewriting algorithm, and provide a prototype implementation together with a storage statistics estimator.

    Next, we require a formal query cost model to estimate and evaluate query performance on alternative document store designs. Document stores use primitive approaches to query processing, such as relying on the end user to specify the usage of indexes instead of a formal cost model, yet comparing how alternative designs perform on a specific query requires a reliable approach. For this, we define a generic storage and query cost model based on disk access and memory allocation. As document stores carry out all data operations in memory, we first estimate memory usage by considering the characteristics of the stored documents, their access patterns, and the memory management algorithms. Then, using this estimation together with the metadata storage size, we introduce a cost model for random access queries. We validate our work on two well-known document store implementations, MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that the predicted costs are highly correlated with the actual execution times. In the course of this work, we were also able to suggest several improvements to document stores.

    Finally, we implement an automated database design solution using multi-criteria optimization. We introduce an algebra of transformations that can systematically modify a design in our canonical representation and, using these transformations, implement a local search algorithm driven by a loss function that can propose near-optimal designs with high probability. We compare our prototype against an existing document store design solution; our proposed designs have better performance and are more compact, with less redundancy.
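    To illustrate the shape of the final step, here is a hedged sketch of a loss-driven local search over candidate designs; the callables `estimate` and `neighbours` are hypothetical placeholders for the thesis's cost model and transformation algebra.

```python
# Illustrative sketch of multi-criteria local search over database designs.
# A design's loss is a weighted sum of estimated query cost, storage size,
# and document complexity; we greedily apply the neighbouring transformation
# that lowers the loss most, until no transformation improves it.

def loss(design, estimate, weights):
    q, s, c = estimate(design)            # (query cost, storage, complexity)
    wq, ws, wc = weights
    return wq * q + ws * s + wc * c

def local_search(initial, neighbours, estimate, weights, max_steps=1000):
    current = initial
    current_loss = loss(current, estimate, weights)
    for _ in range(max_steps):
        best, best_loss = None, current_loss
        for cand in neighbours(current):  # designs one transformation away
            cand_loss = loss(cand, estimate, weights)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
        if best is None:                  # local optimum reached
            return current
        current, current_loss = best, best_loss
    return current
```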

    Graph pattern matching on social network analysis

    Graph pattern matching is fundamental to social network analysis. Its effectiveness for identifying social communities and social positions, making recommendations, and so on has been repeatedly demonstrated. However, social network analysis raises new challenges for graph pattern matching. As real-life social graphs are typically large, it is often prohibitively expensive to conduct graph pattern matching over them: subgraph isomorphism is NP-complete, bounded simulation takes cubic time, and simulation takes quadratic time. These costs hinder the applicability of graph pattern matching to social network analysis. In response to these challenges, this thesis presents a series of effective techniques for querying large, dynamic, and distributively stored social networks.

    First, we propose a notion of query preserving graph compression, to compress large social graphs relative to a class Q of queries. We then develop both batch and incremental compression strategies for two commonly used pattern queries. Via both theoretical analysis and experimental studies, we show that (1) using compressed graphs Gr benefits graph pattern matching dramatically, and (2) the computation of Gr, as well as its maintenance, can be carried out efficiently.

    Second, we investigate the distributed graph pattern matching problem and explore parallel computation for graph pattern matching. We show that our techniques possess the following performance guarantees: (1) each site is visited only once; (2) the total network traffic is independent of the size of G; and (3) the response time is determined by the size of the largest fragment of G rather than by the size of the entire graph. Furthermore, we show how these distributed algorithms can be implemented in the MapReduce framework.

    Third, we study the problem of answering graph pattern matching using views, since view-based techniques have proven effective for speeding up query evaluation. We propose a notion of pattern containment to characterize graph pattern matching using views, and introduce efficient algorithms to answer graph pattern matching using views. Moreover, we identify three problems related to graph pattern containment and provide efficient algorithms for containment checking (with approximation when the problem is intractable).

    Fourth, we revise graph pattern matching by supporting a designated output node, which we treat as the "query focus". We then introduce algorithms with an early termination property for computing the top-k relevant matches w.r.t. the output node, for both acyclic and cyclic pattern graphs. Furthermore, we investigate the diversified top-k matching problem, and develop an approximation algorithm with a performance guarantee and a heuristic algorithm with an early termination property.

    Finally, we introduce an expert search system, called ExpFinder, for large and dynamic social networks. ExpFinder identifies the top-k experts in a social network by graph pattern matching, and copes with the sheer size of real-life social networks by integrating incremental graph pattern matching, query preserving compression, and top-k matching computation. In particular, we introduce bounded (resp. unbounded) incremental algorithms to maintain the weighted landmark vectors that are used for incremental maintenance of cached results.
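    For concreteness, the sketch below implements plain graph simulation, the cheapest of the matching semantics mentioned above, as a naive fixpoint computation: a data node stays a candidate for a query node only while it can match every outgoing edge of that query node.

```python
# Naive fixpoint formulation of (plain) graph simulation. Q and G map each
# node to its list of successors; q_label and g_label give node labels.
# Returns, for each query node, the set of data nodes that can simulate it.

def simulation(Q, G, q_label, g_label):
    # Initialise with all label-compatible candidates.
    sim = {u: {v for v in G if g_label[v] == q_label[u]} for u in Q}
    changed = True
    while changed:
        changed = False
        for u in Q:
            for u_succ in Q[u]:
                # Keep v only if some successor of v can simulate u_succ.
                keep = {v for v in sim[u]
                        if any(w in sim[u_succ] for w in G[v])}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim

Q = {"a": ["b"], "b": []}
G = {1: [2, 3], 2: [], 3: [1]}
print(simulation(Q, G, {"a": "X", "b": "Y"}, {1: "X", 2: "Y", 3: "X"}))
# -> {'a': {1}, 'b': {2}}: node 3 is dropped, it has no Y-labelled successor
```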

    Cost-Based Optimization of Integration Flows

    Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many different application areas, such as real-time ETL and data synchronization between operational systems. Owing to increasing amounts of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-dateness of query results, many instances of integration flows are executed over time. Due to this high load and the blocking of synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To meet these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization into on-demand re-optimization, in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays during which optimization opportunities are missed. This approach ensures low optimization overhead and fast workload adaptation.
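    A minimal sketch of the on-demand idea, with hypothetical interfaces: rather than re-optimizing every k flow instances, the plan is re-optimized only when the incrementally maintained statistics have drifted sufficiently from those the current plan was built on.

```python
# Hedged sketch of on-demand re-optimization triggering (the thesis's
# triggering conditions are more refined). `Plan` and `optimize` are
# hypothetical placeholders for the flow plan and the cost-based optimizer.

from dataclasses import dataclass, field

@dataclass
class Plan:
    operators: list
    stats_at_opt: dict = field(default_factory=dict)  # stats when optimized

def maybe_reoptimize(plan, stats_now, optimize, threshold=0.5):
    """Re-optimize only if some statistic drifted by more than `threshold`
    (relative) since the current plan was produced; otherwise keep the plan,
    which keeps optimization overhead low."""
    drift = max(
        abs(stats_now[k] - plan.stats_at_opt.get(k, 0.0))
        / max(abs(plan.stats_at_opt.get(k, 0.0)), 1e-9)
        for k in stats_now
    )
    return optimize(stats_now) if drift > threshold else plan
```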

    Effective web crawlers

    Web crawlers are the component of a search engine that traverses the Web, gathering documents into a local repository for indexing so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents that have been retrieved by a crawler are subsequently altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that treat any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the impact that crawling algorithms have on user results. Most existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change.

    Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. We are concerned specifically with cases where there are few or no past change statistics from which predictions can be made. We analyse different measures of change and introduce a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results; other well-known and widely used schemes have to retrieve around twice as much data to achieve the same effectiveness. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents.

    We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, had previously been tested for effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled, and that no single static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present a novel approach to maintaining freshness that uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes, and that combining it with PageRank allows the crawler to improve both the freshness and the quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed.
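    As a rough illustration of the reinforcement idea (the scoring here is hypothetical, not the thesis's exact scheme), anchor-text terms pointing at a document accumulate positive or negative weight depending on whether fetched documents had actually changed, and pending URLs can then be prioritized by the summed weight of their anchor terms.

```python
# Illustrative sketch of anchor-text change scoring with positive and
# negative reinforcement, gathered during the current crawl.

from collections import defaultdict

term_weight = defaultdict(float)

def change_score(anchor_terms):
    """Estimated likelihood-of-change score for a URL, from the terms in
    the anchor text of links pointing at it."""
    return sum(term_weight[t] for t in anchor_terms)

def reinforce(anchor_terms, changed, step=1.0):
    """After fetching a document, reinforce its anchor terms positively if
    it had changed since the last crawl, negatively otherwise."""
    delta = step if changed else -step
    for t in anchor_terms:
        term_weight[t] += delta

# During a crawl: fetch the highest-scoring URLs first, then reinforce.
reinforce(["news", "today"], changed=True)
reinforce(["archive", "2001"], changed=False)
print(change_score(["news", "archive"]))   # mixed evidence -> 0.0
```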