
    Graph Processing in Main-Memory Column Stores

    Increasingly, both novel and traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between entities. As a consequence, companies face the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS as the single source of truth and access. Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. Worse still, graph algorithms exhibit a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than through custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries. A basic ingredient of graph queries and algorithms are traversal operations, which are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures.
Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data to be processed seamlessly together with relational data in the same system. We propose a columnar storage representation for graph data to leverage the existing, mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures that improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators.
Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, making it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
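The execution model described above can be illustrated in miniature. The following sketch (my own illustration, not GRAPHITE's code) stores a graph as two parallel columns of an edge table, builds a secondary adjacency index for neighborhood expansion, and expresses a level-synchronous traversal purely through set operations:

```python
# Sketch of a columnar edge table with a set-based traversal.
# Data layout and helper names are illustrative assumptions.
from collections import defaultdict

# Edge table in columnar form: two parallel arrays (source, target).
src = [0, 0, 1, 2, 2, 3]
dst = [1, 2, 3, 3, 4, 4]

# Secondary adjacency index to speed up neighborhood expansion,
# analogous in spirit to the graph-specific indexes described above.
index = defaultdict(list)
for s, t in zip(src, dst):
    index[s].append(t)

def traverse(start, hops):
    """Expand the neighborhood of `start` for `hops` levels using set ops."""
    visited = {start}
    frontier = {start}
    for _ in range(hops):
        # Union of the frontier's adjacency lists, minus already-visited vertices.
        frontier = {t for v in frontier for t in index[v]} - visited
        if not frontier:
            break
        visited |= frontier
    return visited

print(sorted(traverse(0, 2)))  # vertices reachable from 0 within 2 hops
```

A real engine would run the expansion step against compressed column vectors and pick between physical traversal operators based on topology statistics, but the frontier/visited set algebra is the same.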

    A Survey on Array Storage, Query Languages, and Systems

    Since scientific investigation is one of the most important producers of massive amounts of ordered data, there is renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to partitioning arrays into chunks. The identification of a reduced set of array operators to form the foundation of an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete, though. We greatly appreciate pointers towards any work we might have forgotten to mention.
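The core idea behind the chunked array storage surveyed above can be sketched in a few lines: under regular (aligned) chunking, an N-dimensional array is split into fixed-size tiles, and every cell maps to a (chunk coordinate, in-chunk offset) pair. The shapes and helper names below are illustrative, not taken from any particular system:

```python
# Minimal sketch of regular chunking for a 2-D array.
ARRAY_SHAPE = (8, 6)   # logical array: 8 x 6 cells
CHUNK_SHAPE = (4, 3)   # each chunk covers a 4 x 3 tile

def chunks_per_dim(shape=ARRAY_SHAPE, chunk=CHUNK_SHAPE):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(-(-s // c) for s, c in zip(shape, chunk))

def locate(cell):
    """Map a cell coordinate to (chunk coordinate, offset inside the chunk)."""
    chunk_coord = tuple(i // c for i, c in zip(cell, CHUNK_SHAPE))
    offset = tuple(i % c for i, c in zip(cell, CHUNK_SHAPE))
    return chunk_coord, offset

print(chunks_per_dim())   # (2, 2): four chunks cover the array
print(locate((5, 4)))     # ((1, 1), (1, 1))
```

Most of the storage design space the survey covers (chunk size selection, overlap, irregular chunking) amounts to varying how this cell-to-chunk mapping is defined and laid out on disk.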

    Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

    Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various scientific domains. To this day, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analyses. In the era of data deluge, however, external statistics packages and custom analysis programs that often run on single workstations are incapable of keeping up with the vast increase in data volume. In particular, there is an increasing demand from scientists for large-scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data or be restricted by hard disk latencies. From the various application examples cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include the optimization of expressions, transparent matrix handling, scalability and data parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS enables an easy adoption of efficient sparse matrix data types and algorithms.
Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques inspired by database technology. In a novel way, we implemented several of these optimization strategies in LAPEG's optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the density of intermediate result matrices. Moreover, we present an adaptive matrix data type, AT Matrix, to obviate the need for scientists to select appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching speed-ups of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation: we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletions. We conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes the columnar in-memory DBMS attractive as an efficient, scalable ad-hoc analysis platform for scientists.
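The kind of prediction a density estimator like SpProdest performs can be illustrated with a standard textbook estimate (this formula is my own illustration of the problem, not LAPEG's actual method): under an independence assumption on nonzero positions, the expected density of a matrix product follows from the operand densities alone.

```python
# Sketch: expected nonzero density of C = A @ B for an (m x k)(k x n)
# product, assuming nonzeros are placed independently and uniformly.
def product_density(d_a, d_b, inner_dim):
    """A cell c_ij is zero only if all inner_dim terms a_il * b_lj vanish;
    each term is nonzero with probability d_a * d_b."""
    return 1.0 - (1.0 - d_a * d_b) ** inner_dim

# Two 1%-dense operands with a large inner dimension yield a much denser
# result (~63%) -- exactly the fill-in effect a density-aware optimizer
# must anticipate when choosing sparse vs. dense formats for intermediates.
print(product_density(0.01, 0.01, 10_000))
```

This is why predicting intermediate densities matters: a plan that keeps such a product in a sparse format would pay heavily for the unexpected fill-in, while a dense intermediate would be wasteful for genuinely sparse results.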

    A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives

    Graph-related applications have experienced significant growth in academia and industry, driven by the powerful representation capabilities of graphs. However, efficiently executing these applications faces various challenges, such as load imbalance and random memory access. To address these challenges, researchers have proposed various acceleration systems, including software frameworks and hardware accelerators, all of which incorporate graph pre-processing (GPP). GPP serves as a preparatory step before the formal execution of applications, involving techniques such as sampling and reordering. However, GPP execution often remains overlooked, as the primary focus is directed towards enhancing graph applications themselves. This oversight is concerning, especially considering the explosive growth of real-world graph data, where GPP becomes essential and can even dominate overall system overhead. Furthermore, GPP methods exhibit significant variations across devices and applications due to high customization. Unfortunately, no comprehensive work systematically summarizes GPP. To address this gap and foster a better understanding of GPP, we present a comprehensive survey dedicated to this area. We propose a two-level taxonomy of GPP, considering both algorithmic and hardware perspectives. Through listing relevant works, we illustrate our taxonomy and conduct a thorough analysis and summary of diverse GPP techniques. Lastly, we discuss challenges in GPP and potential future directions.
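One of the GPP techniques named above, vertex reordering, fits in a few lines. The sketch below (an illustration of the general idea, not any surveyed system) applies degree sorting: vertices are relabeled so high-degree hubs receive small IDs, which tends to improve cache locality for the random memory accesses the abstract mentions.

```python
# Sketch of degree-based vertex reordering (a common GPP step).
from collections import Counter

edges = [(0, 3), (1, 3), (2, 3), (3, 4), (4, 5)]

def degree_reorder(edges):
    """Return an old-id -> new-id map, hubs first (descending degree)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    ranked = sorted(deg, key=lambda v: (-deg[v], v))  # stable tie-break by id
    return {old: new for new, old in enumerate(ranked)}

mapping = degree_reorder(edges)
relabeled = [(mapping[u], mapping[v]) for u, v in edges]
print(mapping[3])  # the hub (degree 4) is relabeled to 0
```

Even this trivial pass is O(V log V) over the whole graph, which hints at why GPP cost can come to dominate total running time as real-world graphs grow.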

    Compact and efficient representations of graphs

    In this thesis we study the problem of creating compact and efficient representations of graphs. We propose new data structures to store and query graph data from diverse domains, paying special attention to the design of efficient solutions for attributed and RDF graphs. We have designed a new tool to generate graphs from arbitrary data sources through a rule definition system. It is a general-purpose solution that, to the best of our knowledge, is the first with these characteristics. Another contribution of this work is a very compact representation for attributed graphs, providing efficient access to the properties and links of the graph. We also study the problem of graph distribution in a parallel environment using compact structures, proposing nine different alternatives that are experimentally compared. We further propose a novel RDF indexing technique that supports efficient SPARQL resolution in compressed space. Finally, we present a new compact structure to store ternary relationships whose design is focused on the efficient representation of RDF data. All of these proposals were experimentally evaluated with widely accepted datasets, obtaining competitive results when compared against other alternatives from the state of the art.
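The access pattern behind RDF indexes like those above can be shown with an uncompressed stand-in: one sorted (subject, predicate, object) permutation answers basic SPARQL triple patterns by binary search over its sorted run. Real proposals replace the array with compressed structures; this sketch (my own, with made-up data) shows only the lookup idea.

```python
# Sketch: resolving SPARQL triple patterns over an SPO-sorted triple array.
import bisect

triples = sorted([
    ("alice", "knows", "bob"),
    ("alice", "likes", "carol"),
    ("bob",   "knows", "carol"),
])

def match(s=None, p=None, o=None):
    """Resolve a triple pattern; None plays the role of a SPARQL variable."""
    if s is not None:
        # Bound subject: binary-search the SPO-sorted array for its run.
        lo = bisect.bisect_left(triples, (s,))
        hi = bisect.bisect_right(triples, (s, chr(0x10FFFF)))
        candidates = triples[lo:hi]
    else:
        candidates = triples
    return [t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)]

print(match(s="alice", p="knows"))  # [('alice', 'knows', 'bob')]
```

Patterns with an unbound subject would normally be served by other permutations (e.g. POS, OSP); compact ternary structures aim to answer all patterns from a single compressed representation instead of replicating the data per permutation.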