66 research outputs found

    Metadata and provenance management

    Get PDF
    Scientists today collect, analyze, and generate terabytes and petabytes of data. These data are often shared and further processed and analyzed among collaborators. To facilitate sharing and interpretation, the data need to carry with them metadata about how they were collected or generated, and provenance information about how they were processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes.
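    As a rough illustration of the kind of metadata and provenance record the chapter discusses, the following Python sketch attaches collection metadata and a processing history to a dataset; the field names and the derive() helper are illustrative assumptions, not an API from the chapter, and no particular provenance standard (such as W3C PROV) is implied.

        # Minimal sketch: a dataset record that carries collection metadata and
        # a provenance chain describing how it was processed (hypothetical fields).
        from dataclasses import dataclass, field
        from datetime import datetime, timezone
        from typing import List


        @dataclass
        class ProcessingStep:
            """One provenance entry: which tool turned which inputs into this data."""
            tool: str
            inputs: List[str]
            timestamp: str


        @dataclass
        class DatasetRecord:
            """Data bundled with metadata about its collection and a provenance chain."""
            uri: str
            instrument: str          # how the data was collected or generated
            collected_at: str
            provenance: List[ProcessingStep] = field(default_factory=list)

            def derive(self, new_uri: str, tool: str) -> "DatasetRecord":
                """Create a derived dataset whose provenance extends this one's."""
                step = ProcessingStep(tool=tool, inputs=[self.uri],
                                      timestamp=datetime.now(timezone.utc).isoformat())
                return DatasetRecord(uri=new_uri, instrument=self.instrument,
                                     collected_at=self.collected_at,
                                     provenance=self.provenance + [step])


        raw = DatasetRecord(uri="sim/run-042/raw", instrument="climate-model-v3",
                            collected_at="2015-06-01T00:00:00Z")
        cleaned = raw.derive("sim/run-042/cleaned", tool="outlier-filter")
        print(cleaned.provenance)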

    Runtime Adaptive Hybrid Query Engine based on FPGAs

    Get PDF
    This paper presents a fully integrated hardware-accelerated query engine for large-scale datasets in the context of Semantic Web databases. As queries are typically unknown at design time, a static approach is neither feasible nor flexible enough to cover a wide range of queries at system runtime. Therefore, we introduce a runtime-reconfigurable accelerator based on a Field Programmable Gate Array (FPGA), which transparently integrates with the freely available Semantic Web database LUPOSDATE. At system runtime, the proposed approach dynamically generates an optimized hardware accelerator, in the form of an FPGA configuration, for each individual query and transparently retrieves the query result to be displayed to the user. During hardware-accelerated execution the host supplies triple data to the FPGA and retrieves the results from the FPGA via the PCIe interface. The benefits and limitations are evaluated on large-scale synthetic datasets with up to 260 million triples as well as on the widely known Billion Triples Challenge dataset.
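    The host-side flow described above (generate a query-specific FPGA configuration at runtime, stream triples over PCIe, read back the results) could be sketched roughly as follows; every function here is a placeholder stub standing in for the actual LUPOSDATE integration and FPGA toolchain, which the paper does not expose as a Python API.

        # Hypothetical host-side control flow; all driver calls are stubs.
        from typing import Iterable, List, Tuple

        Triple = Tuple[str, str, str]


        def synthesize_bitstream(query: str) -> bytes:
            """Placeholder for generating a query-specific FPGA configuration."""
            return query.encode()


        class FpgaAccelerator:
            """Placeholder for the PCIe interface to the accelerator."""

            def configure(self, bitstream: bytes) -> None:
                pass  # load the generated configuration onto the FPGA

            def stream_triples(self, triples: Iterable[Triple]) -> None:
                pass  # host supplies triple data via PCIe

            def read_results(self) -> List[Triple]:
                return []  # host retrieves the query result via PCIe


        def run_query(query: str, triples: Iterable[Triple]) -> List[Triple]:
            fpga = FpgaAccelerator()
            fpga.configure(synthesize_bitstream(query))  # per-query accelerator
            fpga.stream_triples(triples)
            return fpga.read_results()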

    Constructing Large-Scale Semantic Web Indices for the Six RDF Collation Orders

    Get PDF
    The Semantic Web community collects masses of valuable and publicly available RDF data in order to drive the success story of the Semantic Web. Efficient processing of these datasets requires their indexing. Semantic Web indices exploit the simple data model of RDF: the basic concept of RDF is the triple, which hence has only six different collation orders. On the one hand, having all six collation orders indexed means that fast merge joins (which consume the sorted input of the indices) can be applied as much as possible during query processing. On the other hand, constructing the indices for six different collation orders is very time-consuming for large-scale datasets. Hence, the focus of this paper is efficient Semantic Web index construction for large-scale datasets on today's multi-core computers. We complete our discussion with a comprehensive performance evaluation, in which our approach efficiently constructs the indices for over 1 billion triples of real-world data.
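    The six collation orders are simply the six permutations of the triple components (SPO, SOP, PSO, POS, OSP, OPS). A minimal sketch of constructing all six sorted indices in parallel on a multi-core machine might look as follows; this only illustrates the idea, not the paper's actual construction algorithm.

        # Build one sorted index per RDF collation order, one worker process each.
        from concurrent.futures import ProcessPoolExecutor
        from typing import Dict, List, Tuple

        Triple = Tuple[str, str, str]

        # Each collation order is a permutation of (subject, predicate, object).
        COLLATION_ORDERS: Dict[str, Tuple[int, int, int]] = {
            "SPO": (0, 1, 2), "SOP": (0, 2, 1),
            "PSO": (1, 0, 2), "POS": (1, 2, 0),
            "OSP": (2, 0, 1), "OPS": (2, 1, 0),
        }


        def build_index(args: Tuple[str, List[Triple]]) -> Tuple[str, List[Triple]]:
            order, triples = args
            perm = COLLATION_ORDERS[order]
            # Sort by the permuted component order; these sorted runs are what
            # merge joins later consume.
            return order, sorted(triples, key=lambda t: tuple(t[i] for i in perm))


        def build_all_indices(triples: List[Triple]) -> Dict[str, List[Triple]]:
            with ProcessPoolExecutor() as pool:
                results = pool.map(build_index, [(o, triples) for o in COLLATION_ORDERS])
            return dict(results)


        if __name__ == "__main__":
            data = [("s1", "p1", "o2"), ("s2", "p1", "o1"), ("s1", "p2", "o1")]
            indices = build_all_indices(data)
            print(indices["POS"])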

    The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey

    Full text link
    Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and whitepapers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants' responses to our questions, highlighting common patterns and challenges. Based on our interviews and our survey of the remaining sources, we were able to answer new questions raised by participants' responses to the online survey and to understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large; scalability and visualization are undeniably the most pressing challenges faced by participants; and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.

    Fast Iterative Graph Computation: A Path Centric Approach

    Full text link
    Large-scale graph processing presents an interesting challenge due to the lack of locality. This paper presents PathGraph, a system for improving iterative graph computation on graphs with billions of edges. Our system design has three unique features. First, we model a large graph using a collection of tree-based partitions and use a path-centric computation model rather than vertex-centric or edge-centric computation. Our parallel computation model significantly improves memory and disk locality for iterative computation algorithms. Second, we design a compact storage layout that further maximizes sequential access and minimizes random access on storage media. Third, we implement the path-centric computation model using a scatter/gather programming model, which parallelizes the iterative computation at the level of partition trees and performs sequential updates for vertices in each partition tree. The experimental results show that the path-centric approach outperforms vertex-centric and edge-centric systems on a number of graph algorithms for both in-memory and out-of-core graphs.
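    A rough sketch of the path-centric scatter/gather idea, under the assumption of a PageRank-style update rule: partition trees are processed one after another here (PathGraph would process them in parallel), and the vertices inside each tree are updated sequentially in path (depth-first) order to improve locality. Names and the update rule are illustrative only.

        from typing import Dict, List

        Tree = Dict[int, List[int]]  # vertex -> out-neighbours inside one partition tree


        def dfs_path_order(tree: Tree, root: int) -> List[int]:
            """Visit the vertices of one partition tree along root-to-leaf paths."""
            order, stack, seen = [], [root], set()
            while stack:
                v = stack.pop()
                if v in seen:
                    continue
                seen.add(v)
                order.append(v)
                stack.extend(reversed(tree.get(v, [])))
            return order


        def one_iteration(partitions: List[Tree], roots: List[int],
                          rank: Dict[int, float], damping: float = 0.85) -> Dict[int, float]:
            """One scatter/gather step: scatter each vertex's value along its tree edges."""
            new_rank = {v: 1.0 - damping for v in rank}
            for tree, root in zip(partitions, roots):      # per-partition (parallelizable)
                for v in dfs_path_order(tree, root):       # sequential, cache-friendly order
                    for w in tree.get(v, []):              # scatter to neighbours
                        new_rank[w] = new_rank.get(w, 1.0 - damping) + damping * rank[v]
            return new_rank


        tree_a = {0: [1, 2], 1: [3], 2: [], 3: []}
        rank = {v: 1.0 for v in range(4)}
        print(one_iteration([tree_a], [0], rank))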

    Scalable big data systems: Architectures and optimizations

    Get PDF
    Big data analytics has become not just a popular buzzword but also a strategic direction in information technology for many enterprises and government organizations. Even though many new computing and storage systems have been developed for big data analytics, scalable big data processing has become more and more challenging as a result of the huge and rapidly growing size of real-world data. Dedicated to the development of architectures and optimization techniques for scaling big data processing systems, especially in the era of cloud computing, this dissertation makes three unique contributions. First, it introduces a suite of graph partitioning algorithms that run much faster than existing data distribution methods and inherently scale to the growth of big data. The main idea of these approaches is to partition a big graph by preserving the core computational data structure as much as possible, so as to maximize intra-server computation and minimize inter-server communication. Second, it proposes a distributed iterative graph computation framework that effectively utilizes secondary storage to maximize access locality and speed up distributed iterative graph computations. The framework not only considerably reduces the memory requirements of iterative graph algorithms but also significantly improves the performance of iterative graph computations. Last but not least, it establishes a suite of optimization techniques for scalable spatial data processing along three orthogonal dimensions: (i) scalable processing of spatial alarms for mobile users traveling on road networks, (ii) scalable location tagging for improving the quality of Twitter data analytics and prediction accuracy, and (iii) lightweight spatial indexing for enhancing the performance of big spatial data queries.
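    The partitioning goal stated above (keep as much of a vertex's neighborhood as possible on one server, so that intra-server computation is maximized and inter-server communication is minimized) can be illustrated with a simple greedy heuristic; this stands in for the dissertation's actual algorithms, which the abstract does not spell out.

        # Greedy illustration: assign each vertex to the server that already holds
        # most of its neighbours, subject to a per-server capacity.
        from collections import Counter
        from typing import Dict, List

        Graph = Dict[int, List[int]]


        def greedy_partition(graph: Graph, num_servers: int,
                             capacity: int) -> Dict[int, int]:
            assignment: Dict[int, int] = {}
            load: Counter = Counter()
            for v in graph:                               # one pass over the vertices
                votes = Counter(assignment[u] for u in graph[v] if u in assignment)
                # Prefer the server holding most neighbours, then any server with room.
                candidates = [s for s, _ in votes.most_common()] + list(range(num_servers))
                server = next(s for s in candidates if load[s] < capacity)
                assignment[v] = server
                load[server] += 1
            return assignment


        g = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
        print(greedy_partition(g, num_servers=2, capacity=3))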

    Graph Processing in Main-Memory Column Stores

    Get PDF
    More and more, novel and traditional business applications leverage the advantages of a graph data model, such as its schema flexibility and explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms exhibit a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than through custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize as it requires data transfer across system boundaries. Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built as part of an RDBMS, allowing processing of graph data to be seamlessly combined with relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime.
We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, effectively making it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
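    A condensed sketch of two of the ideas highlighted above, assuming nothing about GRAPHITE's internal interfaces: edges held in two aligned columns, as a column store would keep them, and a traversal expressed purely as set operations (expansion, difference, union) over those columns.

        from typing import List, Set


        class EdgeColumns:
            """Edges kept as two aligned columns (the columnar representation)."""

            def __init__(self, sources: List[int], targets: List[int]) -> None:
                self.sources = sources
                self.targets = targets

            def expand(self, frontier: Set[int]) -> Set[int]:
                """One neighbourhood expansion: scan the source column, collect targets."""
                return {t for s, t in zip(self.sources, self.targets) if s in frontier}


        def traverse(edges: EdgeColumns, start: Set[int], hops: int) -> Set[int]:
            """Breadth-first traversal as repeated set expansion minus visited vertices."""
            visited, frontier = set(start), set(start)
            for _ in range(hops):
                frontier = edges.expand(frontier) - visited   # set difference
                if not frontier:
                    break
                visited |= frontier                           # set union
            return visited


        cols = EdgeColumns(sources=[1, 1, 2, 3], targets=[2, 3, 4, 4])
        print(traverse(cols, start={1}, hops=2))   # {1, 2, 3, 4}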

    Large scale interoperability in the context of Future Internet

    Get PDF
    The growth of the Internet as a large-scale media provisioning platform has been a great success story of the 21st century. However, multimedia applications, with their specific traffic characteristics and novel service requirements, pose an interesting challenge in terms of discovery, mobility, and management. Furthermore, the recent impetus of the Internet of Things has made it very necessary to revitalize research in order to integrate heterogeneous information sources across networks. Towards this objective, the contributions in this thesis try to find a balance between heterogeneity and interoperability, to discover and integrate heterogeneous information sources in the context of the Future Internet. Discovering information sources across networks requires a thorough understanding of how the information is structured and what specific methods they follow to communicate. This process has been regulated with the help of discovery protocols. However, protocols rely on different techniques and are designed with the underlying network infrastructure in mind, thus limiting the capability of some protocols to cross network boundaries. To address this issue, the first contribution in this thesis tries to find a balanced solution to enable discovery protocols to interoperate with each other, as well as to provide the necessary means to cross network boundaries. Towards this objective, we propose ZigZag, a middleware to reuse and extend current discovery protocols, designed for local networks, to discover available services in the large. Our approach is based on protocol translation to enable service discovery irrespective of the underlying discovery protocol. Although our approach provides a step forward towards interoperability in the large, we needed to make sure that discovery messages do not create a bottleneck for the network. In large-scale, consumer-oriented networks, service discovery messages could render the network unusable. To counter this, ZigZag uses the concept of aggregation during the discovery process. Using aggregation, ZigZag is able to integrate several replies from different service sources supporting different discovery protocols. However, customizing the aggregation process to suit one's needs requires a thorough understanding of ZigZag's fundamentals. To this end, we propose our second contribution, a flexible policy language that can help define policies in a clean and effective way. In addition, the policy language has some added advantages in terms of dynamic management: it provides features such as delegation, runtime policy management, and logging. We tested our approach with the help of simulations; the results showed that ZigZag can both reduce the number of messages that flow through the network and provide value-sensitive information to the requesting entity. Although ZigZag is designed to discover media services in the large, it can very well be used in other domains such as home automation and smart spaces, while the flexible, pluggable, modular design of the policy language enables it to be used in other applications, for instance e-mail.
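    The two mechanisms described above, protocol translation into a common service description and aggregation of several replies into one response, could be sketched as follows; the record fields and protocol-specific keys are assumptions for illustration, not ZigZag's actual message formats.

        from dataclasses import dataclass
        from typing import Dict, List


        @dataclass
        class ServiceDescription:
            name: str
            protocol: str       # discovery protocol the reply originated from
            address: str


        def translate(protocol: str, raw_reply: Dict[str, str]) -> ServiceDescription:
            """Protocol translation: map a protocol-specific reply onto the common form."""
            if protocol == "SSDP":
                return ServiceDescription(raw_reply["ST"], "SSDP", raw_reply["LOCATION"])
            if protocol == "mDNS":
                return ServiceDescription(raw_reply["service"], "mDNS", raw_reply["target"])
            raise ValueError(f"unsupported discovery protocol: {protocol}")


        def aggregate(replies: List[ServiceDescription]) -> Dict[str, List[ServiceDescription]]:
            """Aggregation: bundle replies by service name so one combined response is
            forwarded instead of flooding the requester with per-protocol messages."""
            bundled: Dict[str, List[ServiceDescription]] = {}
            for r in replies:
                bundled.setdefault(r.name, []).append(r)
            return bundled


        replies = [
            translate("SSDP", {"ST": "media-renderer", "LOCATION": "http://10.0.0.5/desc.xml"}),
            translate("mDNS", {"service": "media-renderer", "target": "host.local:8080"}),
        ]
        print(aggregate(replies))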