
    Graph Processing in Main-Memory Column Stores

    Increasingly, novel as well as traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS as the single source of truth and access. Existing solutions that perform graph operations on business-critical data either combine SQL with application logic or employ a dedicated graph data management system. In the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms exhibit a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system without custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible: besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries. Traversal operations are a basic ingredient of graph queries and algorithms and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and a graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data to be processed seamlessly together with relational data in the same system. We propose a columnar storage representation for graph data to leverage the existing, mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE, we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures that improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, making it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
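    The abstract's central design point — an execution engine built solely from set operations and graph traversals over columnar storage — can be illustrated with a small sketch. The following is a minimal illustration of that idea, not GRAPHITE's actual code: edges live in two parallel columns, and one BFS expansion step is a column scan combined with set difference and union.

```python
# Minimal sketch: columnar edge storage plus traversal as set operations.
# Illustrative only; GRAPHITE's real operators are far more optimized.

# Edge table as two parallel columns, as a column store might hold them.
src = [0, 0, 1, 2, 2, 3]
dst = [1, 2, 3, 3, 4, 4]

def expand(frontier, src, dst):
    """One traversal step: all targets of edges whose source is in `frontier`."""
    return {d for s, d in zip(src, dst) if s in frontier}

def traverse(start, src, dst):
    """BFS expressed purely as set operations over the edge columns."""
    visited, frontier = {start}, {start}
    while frontier:
        frontier = expand(frontier, src, dst) - visited  # set difference
        visited |= frontier                              # set union
    return visited

print(traverse(0, src, dst))  # {0, 1, 2, 3, 4}
```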

    Fast Iterative Graph Computation: A Path Centric Approach

    Large-scale graph processing represents an interesting challenge due to the lack of locality. This paper presents PathGraph, a system for improving iterative graph computation on graphs with billions of edges. Our system design has three unique features. First, we model a large graph using a collection of tree-based partitions and use path-centric computation rather than vertex-centric or edge-centric computation; this parallel computation model significantly improves memory and disk locality for iterative computation algorithms. Second, we design a compact storage format that further maximizes sequential access and minimizes random access on storage media. Third, we implement the path-centric computation model using a scatter/gather programming model, which parallelizes the iterative computation at the level of partition trees and performs sequential updates for vertices in each partition tree. The experimental results show that the path-centric approach outperforms vertex-centric and edge-centric systems on a number of graph algorithms for both in-memory and out-of-core graphs.
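    The abstract describes path-centric scatter/gather over tree-based partitions without giving code; the sketch below is a hypothetical, single-partition illustration (the function names and the PageRank-like update rule are my assumptions, not PathGraph's API). It demonstrates the locality argument: vertices of a tree partition are visited in DFS (path) order, so successive updates touch adjacent storage instead of jumping across the graph.

```python
# Hypothetical sketch of path-centric scatter/gather on one tree partition.
# The update rule is a PageRank-like propagation chosen for illustration.

def dfs_path_order(tree, root):
    """Vertices of one tree-based partition in DFS (path) order."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(reversed(tree.get(v, [])))  # keep left-to-right order
    return order

def scatter_along_paths(tree, root, rank, damping=0.85):
    """One iteration: walk paths sequentially, scattering values to children."""
    for v in dfs_path_order(tree, root):
        children = tree.get(v, [])
        for c in children:
            rank[c] += damping * rank[v] / len(children)
    return rank

tree = {0: [1, 2], 1: [3, 4]}                    # one tree-based partition
print(scatter_along_paths(tree, 0, {v: 1.0 for v in range(5)}))
```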

    Graph databases and their application to the Italian Business Register for efficient search of relationships among companies

    We studied and tested three of the major graph databases and compared them with a relational database. We worked on a dataset representing equity participations among companies, and we found that the strong points of graph databases are their purpose-built storage techniques and their query languages. The main performance gains were obtained when heavily connected graph structures are queried; for simpler structures and queries, a relational database performs equally well.
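    The kind of query where the graph model wins here is the transitive one: following equity participations across an unbounded number of hops is a single traversal in a graph database, whereas a relational engine needs a recursive self-join per hop. Below is a small Python sketch of that traversal over an assumed toy shareholding structure (not the Business Register's actual schema).

```python
# Assumed toy data: company -> companies it holds a participation in.
holdings = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
}

def reachable_participations(root, holdings):
    """All companies transitively reachable from `root` via shareholdings."""
    reached, stack = set(), [root]
    while stack:
        company = stack.pop()
        for held in holdings.get(company, []):
            if held not in reached:
                reached.add(held)
                stack.append(held)
    return reached

print(reachable_participations("A", holdings))  # {'B', 'C', 'D', 'E'}
```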

    Distributed Graph Storage And Querying System

    Graph databases offer an efficient way to store and access inter-connected data. However, to query large graphs that no longer fit in memory, it becomes necessary to make multiple trips to the storage device to filter and gather data based on the query. I/O accesses are expensive operations that immensely slow down query response time and prevent us from fully exploiting the graph-specific benefits that graph databases offer. The storage models of most existing graph database systems view graphs as indivisible structures and hence do not allow a hierarchical layering of the graph. This adversely affects query performance for large graphs, as there is no way to filter the graph at a higher level without actually reading the entire structure from disk. Distributing the storage and processing is one way to extract better performance, but current distributed solutions are not entirely effective, again due to the indivisible representation of graphs adopted in the storage format, which causes unnecessary latency from increased inter-processor communication. In this dissertation, we propose an optimized distributed graph storage system for scalable and faster querying of big graph data. We start with our unique physical storage model, in which the graph is decomposed into three different levels of abstraction, each with a different storage hierarchy. We use a hybrid storage model for the most critical component and restrict I/O trips to only when absolutely necessary. This lets us actively apply multi-level filters while querying, without the need for comprehensive indexes. Our results show that our system outperforms established graph databases for several classes of queries. We show that this separation also eases the difficulties of distributing graph data, and we go on to propose a more efficient distributed model for querying general-purpose graph data using the Spark framework.
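    The multi-level filtering idea can be sketched abstractly (all names below are hypothetical, not the dissertation's code): a small in-memory summary per partition answers "could this partition contain a match?", so the expensive detail level is read only for partitions that pass the coarse check.

```python
# Hypothetical sketch of multi-level filtering over partitioned graph storage.

partitions = {          # detail level: partition id -> (vertex, label) rows
    0: [(1, "person"), (2, "person")],
    1: [(3, "company"), (4, "company")],
    2: [(5, "person"), (6, "company")],
}
# Coarse level, kept in memory: which labels occur in each partition.
summary = {pid: {label for _, label in rows} for pid, rows in partitions.items()}

def load_partition(pid):
    print(f"I/O: reading partition {pid}")   # stands in for a disk read
    return partitions[pid]

def find_by_label(label):
    """Read from 'disk' only the partitions whose summary admits the label."""
    hits = []
    for pid, labels in summary.items():      # cheap, fully in-memory pass
        if label in labels:
            hits += [v for v, l in load_partition(pid) if l == label]
    return hits

print(find_by_label("company"))  # loads partitions 1 and 2; skips partition 0
```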

    Graph database management systems: storage, management and query processing

    The proliferation of graph data, generated from diverse sources, has given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS; and (iii) efficient query processing in GDBMS. In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we propose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer an empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system and investigate how these systems can support capturing the evolution of a codebase over time. We study a code comprehension tool that extracts software dependencies and stores them in a graph database. On a versioned graph built from a very large codebase, we demonstrate how existing code comprehension queries can be processed efficiently and also show the benefit of running queries across multiple versions. Another important aspect of this thesis is the study of the storage aspects of graph systems. The throughput of many graph queries can be significantly affected by disk I/O performance; graph database systems therefore need effective graph storage that optimises disk operations. We observe that the locality of edges plays an important role, and we address the edge-labeling problem, which aims to label both incoming and outgoing edges of a graph so as to maximize the ‘edge-consecutiveness’ metric. By achieving a better layout and locality of edges on disk, our proposed algorithms deliver significantly improved disk I/O performance, leading to faster execution of neighbourhood queries. Some applications require the integrated processing of queries from the graph and textual domains within a graph database system; aggregating these dimensions yields key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, this thesis studies the top-k social-textual ranking query, which essentially requires the efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches achieve significantly better results than standard baselines and demonstrate robust behaviour under changing parameters.
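    The abstract does not define ‘edge-consecutiveness’ precisely; under the plausible reading that consecutively numbered edges are cheap to fetch together when they share a vertex, a layout can be scored in a few lines (this scoring function is my assumption, not the thesis' formal metric).

```python
# Assumed scoring: +1 whenever two consecutively stored edges share a vertex,
# since a neighbourhood query can then serve both edges from one disk read.

def edge_consecutiveness(layout):
    """layout: list of edges (u, v) in their on-disk order."""
    score = 0
    for (u1, v1), (u2, v2) in zip(layout, layout[1:]):
        if {u1, v1} & {u2, v2}:       # consecutive edges touch a common vertex
            score += 1
    return score

print(edge_consecutiveness([(0, 1), (1, 2), (3, 4), (0, 2)]))  # 1
print(edge_consecutiveness([(0, 1), (1, 2), (0, 2), (3, 4)]))  # 2: better layout
```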

    Efficient Path Enumeration and Structural Clustering on Massive Graphs

    Graph analysis plays a crucial role in understanding the relationships and structures within complex systems. This thesis focuses on addressing fundamental problems in graph analysis, including hop-constrained s-t simple path (HC-s-t path) enumeration, batch HC-s-t path query processing, and graph structural clustering (SCAN). The objective is to develop efficient and scalable distributed algorithms to tackle these challenges, particularly in the context of billion-scale graphs. We first explore the problem of HC-s-t path enumeration. Existing solutions for this problem often suffer from inefficiency and scalability limitations, especially when dealing with billion-scale graphs. To overcome these drawbacks, we propose a novel hybrid search paradigm specifically tailored for HC-s-t path enumeration. This paradigm combines different search strategies to effectively explore the solution space. Building upon this paradigm, we devise a distributed enumeration algorithm that follows a divide-and-conquer strategy, incorporates fruitless exploration pruning, and optimizes memory consumption. Experimental evaluations on various datasets demonstrate that our algorithm achieves a significant speedup compared to existing solutions, even on datasets where they encounter out-of-memory issues. Second, we address the problem of batch HC-s-t path query processing. In real-world scenarios, it is common to issue multiple HC-s-t path queries simultaneously and process them as a batch. However, existing solutions often focus on optimizing the processing performance of individual queries, disregarding the benefits of processing queries concurrently. To bridge this gap, we propose the concept of HC-s path queries, which captures the common computation among different queries. We design a two-phase HC-s path query detection algorithm to identify the shared computation for a given set of HC-s-t path queries. Based on the detected HC-s path queries, we develop an efficient HC-s-t path enumeration algorithm that effectively shares the common computation. Extensive experiments on diverse datasets validate the efficiency and scalability of our algorithm for processing multiple HC-s-t path queries concurrently. Third, we investigate the problem of graph structural clustering (SCAN) in billion-scale graphs. Existing distributed solutions for SCAN often lack efficiency or suffer from high memory consumption, making them impractical for large-scale graphs. To overcome these challenges, we propose a fine-grained clustering framework specifically tailored for SCAN. This framework enables effective identification of cohesive subgroups within a graph. Building upon this framework, we devise a distributed SCAN algorithm that minimizes communication overhead and reduces memory consumption throughout the execution. We also incorporate an effective workload-balancing mechanism that dynamically adjusts to handle skewed workloads. Experimental evaluations on real-world graphs demonstrate the efficiency and scalability of our proposed algorithm. Overall, this thesis contributes novel distributed algorithms for HC-s-t path enumeration, batch HC-s-t path query processing, and graph structural clustering. The proposed algorithms address the efficiency and scalability challenges in graph analysis, particularly on billion-scale graphs. Extensive experimental evaluations validate the superiority of our algorithms compared to existing solutions, enabling efficient and scalable graph analysis in complex systems.
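    To make the first problem concrete: an HC-s-t path query asks for every simple path from s to t using at most a given number of hops. The sketch below is a plain, single-machine bounded DFS that states the problem; the thesis' contribution lies in the distributed hybrid search and pruning that this naive version lacks.

```python
# Naive hop-constrained s-t simple path enumeration (illustration only).

def hc_st_paths(graph, s, t, hops):
    """All simple paths from s to t with at most `hops` edges."""
    results, path = [], [s]

    def dfs(v, budget):
        if v == t:
            results.append(list(path))
            return
        if budget == 0:
            return
        for w in graph.get(v, []):
            if w not in path:             # simple path: no repeated vertex
                path.append(w)
                dfs(w, budget - 1)
                path.pop()
        # A real enumerator would also prune vertices that provably cannot
        # reach t within the remaining budget (fruitless exploration pruning).

    dfs(s, hops)
    return results

g = {0: [1, 2], 1: [3], 2: [1, 3], 3: []}
print(hc_st_paths(g, 0, 3, 3))  # [[0, 1, 3], [0, 2, 1, 3], [0, 2, 3]]
```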

    Scalable Graph Algorithms in a High-Level Language Using Primitives Inspired by Linear Algebra

    This dissertation advances the state of the art for scalable high-performance graph analytics and data mining using the language of linear algebra. Many graph computations suffer poor scalability due to their irregular nature and low operational intensity. A small but powerful set of linear algebra primitives that specifically target graph and data mining applications can expose sufficient coarse-grained parallelism to scale to thousands of processors. In this dissertation we advance existing distributed-memory approaches in two important ways. First, we observe that data scientists and domain experts know their analysis and mining problems well but have little HPC experience. We describe a system that presents the user with a clean API in a high-level language and scales from a laptop to a supercomputer with thousands of cores. We utilize a Domain-Specific Embedded Language with Selective Just-In-Time Specialization to ensure a negligible performance impact over the original low-level distributed-memory code. The high-level language enables ease of use, rapid prototyping, and additional features such as on-the-fly filtering, runtime-defined objects, and exposure to a large set of third-party visualization packages. The second important advance is a new sparse matrix data structure and set of algorithms. We note that shared-memory machines are dominant both in stand-alone form and as nodes in distributed-memory clusters. This thesis offers the design of a new sparse-matrix data structure and set of parallel algorithms, a reusable implementation in shared memory, and a performance evaluation that shows significant speed and memory usage improvements over competing packages. Our method also offers features such as in-memory compression, a low-cost transpose, and chained primitives that do not materialize the entire intermediate result at any one time. We focus on a scalable, generalized sparse matrix-matrix multiplication algorithm. This primitive is used extensively in many graph algorithms, such as betweenness centrality, graph clustering, graph contraction, and subgraph extraction.
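    The linear-algebra view the dissertation builds on has a well-known kernel: one level of breadth-first search is a sparse matrix-vector product. The sketch below uses SciPy rather than the dissertation's own data structures, purely to show the primitive at work.

```python
# BFS as repeated sparse matrix-vector products (classic linear-algebra view).

import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix of a small directed graph: A[j, i] = 1 for edge i -> j.
rows, cols = [1, 2, 3, 3], [0, 0, 1, 2]          # edges 0->1, 0->2, 1->3, 2->3
A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

frontier = np.zeros(4, dtype=bool)
frontier[0] = True                               # start BFS at vertex 0
visited = frontier.copy()

while frontier.any():
    frontier = (A @ frontier).astype(bool) & ~visited   # expand one level
    visited |= frontier

print(np.flatnonzero(visited))                   # [0 1 2 3]
```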