143 research outputs found

    Indexing query graphs to speedup graph query processing

    Subgraph/supergraph queries, although central to graph analytics, are costly as they entail the NP-Complete problem of subgraph isomorphism. We present a fresh solution, the novel principle of which is to acquire and utilize knowledge from the results of previously executed queries. Our approach, iGQ, encompasses two component subindexes to identify whether a new query is a subgraph/supergraph of previously executed queries and stores related key information. iGQ comes with novel query processing and index space management algorithms, including graph replacement policies. The end result is a system that leads to a significant reduction in the number of required subgraph isomorphism tests and to speedups in query processing time. iGQ can be incorporated into any sub/supergraph query processing method and help improve performance. In fact, it is the only contribution that can significantly speed up both subgraph and supergraph query processing. We establish the principles of iGQ and formally prove its correctness. We have implemented iGQ and have incorporated it within three popular, recent, state-of-the-art index-based graph query processing solutions. We evaluated its performance using real-world and synthetic graph datasets with different characteristics, and a number of query workloads, showcasing its benefits.
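The idea of reusing earlier query results can be illustrated with a small sketch. It is not the iGQ implementation: the function name `answer_with_history`, the `history` list and the toy containment test are all assumptions made for illustration. It relies only on two set-theoretic facts about subgraph queries: if a past query is contained in the new query, the new answers are a subset of the past answers; if the new query is contained in a past query, the past answers are already answers to the new query.

```python
def answer_with_history(query, dataset, history, is_subgraph):
    """Answer a subgraph query: which dataset graphs contain `query`?

    history     : list of (past_query, past_answers) pairs (hypothetical store)
    is_subgraph : is_subgraph(small, big) -> bool, the expensive exact test
    Returns (answers, number of tests run against dataset graphs).
    Tests against past queries are not counted in this sketch.
    """
    candidates = set(dataset)
    known = set()
    for past_query, past_answers in history:
        if is_subgraph(past_query, query):
            # Every graph containing `query` also contains `past_query`.
            candidates &= past_answers
        if is_subgraph(query, past_query):
            # Every graph containing `past_query` also contains `query`.
            known |= past_answers
    to_test = candidates - known
    answers = known | {k for k in to_test if is_subgraph(query, dataset[k])}
    history.append((query, answers))
    return answers, len(to_test)


# Stand-in for real graphs (an assumption to keep the sketch short):
# each "graph" is a set of uniquely labelled edges, so containment is
# plain set inclusion rather than subgraph isomorphism.
contains = lambda small, big: small <= big

dataset = {
    "g1": frozenset("abc"),
    "g2": frozenset("ab"),
    "g3": frozenset("cd"),
    "g4": frozenset("abcd"),
}
history = []
first = answer_with_history(frozenset("ab"), dataset, history, contains)
second = answer_with_history(frozenset("abc"), dataset, history, contains)
third = answer_with_history(frozenset("b"), dataset, history, contains)
```

In this toy run the first query has no history and tests all four graphs, the second only tests the three answers of the first, and the third tests a single graph because the others are already known answers.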

    Scalable supergraph search in large graph databases

    © 2016 IEEE. Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism tests on the remaining graphs in the verification phase. However, they do not scale to large data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features, which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationship to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains features of all sizes, and both the cost sharing and the pruning power of the features are considered. In query processing, we propose a verification-free algorithm, where the order in which features are processed is query-dependent, considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm's efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. Finally, we conduct extensive performance studies on two large real datasets to demonstrate the high scalability of our algorithms.

    GraphCache: A Caching System for Graph Queries

    Graph query processing is essential for graph analytics, but can be very time-consuming as it entails the NP-Complete problem of subgraph isomorphism. Traditionally, caching plays a key role in expediting query processing. We thus put forth GraphCache (GC), the first full-fledged caching system for general subgraph/supergraph queries. We contribute the overall system architecture and implementation of GC. We study a number of novel graph cache replacement policies and show that different policies win over different graph datasets and/or queries; we therefore contribute a novel hybrid graph replacement policy that is always the best or near-best performer. Moreover, we discover the related problem of cache pollution and propose a novel cache admission control mechanism to avoid cache pollution. Furthermore, we show that GC can be used as a front end, complementing any graph query processing method as a pluggable component. Currently, GC comes bundled with 3 top-performing filter-then-verify (FTV) subgraph query methods and 3 well-established direct subgraph-isomorphism (SI) algorithms, representing different categories of graph query processing research. Finally, we contribute a comprehensive performance evaluation of GC. We employ more than 6 million queries, generated using different workload generators, and executed against both real-world and synthetic graph datasets of different characteristics, quantifying the benefits and overheads, emphasizing the non-trivial lessons learned.

    Large Graph Analysis in the GMine System

    Current applications have produced graphs on the order of hundreds of thousands of nodes and millions of edges. To take advantage of such graphs, one must be able to find patterns, outliers and communities. These tasks are better performed in an interactive environment, where human expertise can guide the process. For large graphs, though, there are some challenges: the excessive processing requirements are prohibitive, and drawing hundreds of thousands of nodes results in cluttered images that are hard to comprehend. To cope with these problems, we propose an innovative framework suited for any kind of tree-like graph visual design. GMine integrates (a) a representation for graphs organized as hierarchies of partitions, based on the concepts of SuperGraph and Graph-Tree; and (b) a graph summarization methodology, CEPS. Our graph representation deals with the problem of tracing the connection aspects of a graph hierarchy with sublinear complexity, allowing one to grasp the neighborhood of a single node or of a group of nodes in a single click. As a proof of concept, the visual environment of GMine is instantiated as a system in which large graphs can be investigated globally and locally.

    Performance and scalability of indexed subgraph query processing methods

    Graph data management systems have become very popular as graphs are the natural data model for many applications. One of the main problems addressed by these systems is subgraph query processing; i.e., given a query graph, return all graphs that contain the query. The naive method for processing such queries is to perform a subgraph isomorphism test against each graph in the dataset. This obviously does not scale, as subgraph isomorphism is NP-Complete. Thus, many indexing methods have been proposed to reduce the number of candidate graphs that have to undergo the subgraph isomorphism test. In this paper, we identify a set of key factors (parameters) that influence the performance of related methods: namely, the number of nodes per graph, the graph density, the number of distinct labels, the number of graphs in the dataset, and the query graph size. We then conduct comprehensive and systematic experiments that analyze the sensitivity of the various methods to the values of the key parameters. Our aims are twofold: first, to derive conclusions about the algorithms’ relative performance, and, second, to stress-test all algorithms, deriving insights as to their scalability, and to highlight how both performance and scalability depend on the above factors. We choose six well-established indexing methods, namely Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode, as representative approaches of the overall design space, including the most recent and best performing methods. We report on their index construction time and index size, and on query processing performance in terms of time and false positive ratio. We employ both real and synthetic datasets. Specifically, four real datasets of different characteristics are used: AIDS, PDBS, PCM, and PPI. In addition, we generate a large number of synthetic graph datasets, empowering us to systematically study the algorithms’ performance and scalability versus the aforementioned key parameters.
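The contrast between the naive method and an index-style filter can be sketched as follows. This is a minimal illustration under stated assumptions, not any of the six methods named above: the graph representation, the label-count filter in `passes_filter` and the backtracking test in `is_subgraph` are all hypothetical stand-ins chosen for brevity.

```python
from collections import Counter


def make_graph(labels, edges):
    """labels: dict node -> label; edges: iterable of undirected (u, v) pairs."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}


def passes_filter(query, graph):
    """Cheap necessary condition: enough nodes of each label, enough edges."""
    need = Counter(query["labels"].values())
    have = Counter(graph["labels"].values())
    return (all(have[lab] >= n for lab, n in need.items())
            and len(graph["edges"]) >= len(query["edges"]))


def is_subgraph(query, graph):
    """Exact, exponential-time test: backtracking search for an injective,
    label-preserving mapping that sends every query edge to a graph edge."""
    q_nodes = list(query["labels"])

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            return True
        u = q_nodes[len(mapping)]
        for v, v_label in graph["labels"].items():
            if v in mapping.values() or v_label != query["labels"][u]:
                continue
            # Query edges from u to already-mapped nodes must exist in graph.
            if all(frozenset((v, mapping[w])) in graph["edges"]
                   for w in mapping if frozenset((u, w)) in query["edges"]):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})


def subgraph_query(query, dataset):
    """Filter first, then verify only the surviving candidates."""
    candidates = [name for name, g in dataset.items() if passes_filter(query, g)]
    answers = [name for name in candidates if is_subgraph(query, dataset[name])]
    return answers, len(candidates)


dataset = {
    "g1": make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3), (1, 3)]),
    "g2": make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "g3": make_graph({1: "N", 2: "N"}, [(1, 2)]),
}
# Query: a path C - O - C.
query = make_graph({"a": "C", "b": "O", "c": "C"}, [("a", "b"), ("b", "c")])
answers, n_candidates = subgraph_query(query, dataset)
```

Here the filter discards `g3` without an isomorphism test, while `g2` passes the filter but fails verification, which is the kind of false positive that the false positive ratio measures.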

    Optimizing graph query performance by indexing and caching

    Subgraph/supergraph queries, though central to graph analytics, are costly as they entail the NP-Complete problem of subgraph isomorphism. To expedite graph query processing, the community has contributed a wealth of approaches that fall into two categories, i.e., heuristic subgraph isomorphism (SI) methods and algorithms following the “filter-then-verify” (FTV) paradigm. However, both have performance limitations. Moreover, a significant drawback of current studies is that they discard the results obtained when executing previous graph queries. To this end, the current work presents a fresh solution named iGQ, the principle of which is to acquire and utilize knowledge from the results of previously executed queries. iGQ encompasses two component subindexes to identify whether a new query is a subgraph or supergraph of previously executed queries, so that the stored knowledge can be used to accelerate the execution of the new query graph by reducing the number of subgraph isomorphism tests to be performed. The correctness of iGQ is assured by formal proof. Moreover, iGQ serves both subgraph and supergraph query processing, bridging the two separate research threads in the community. On the other hand, using caches to accelerate query processing has been prevalent in data management systems. In the realm of graph-structured queries, however, little work has been done. Meanwhile, modern big data applications are emerging and demanding high-performance graph query processing. Therefore, this thesis puts forth a full-fledged graph caching system, named GraphCache, for graph queries. From the ground up, GraphCache is designed as a semantic graph cache that can harness both subgraph and supergraph cache hits, going beyond traditional hits, which are confined to exact matches.
GraphCache features well-defined subsystems and interfaces, allowing for the flexibility of plugging in any general subgraph/supergraph query solution, be it an FTV algorithm or an SI method. Furthermore, GraphCache incorporates iGQ as its query processing engine, where previously issued queries are leveraged to expedite graph query processing. With the continuous arrival of queries and finite memory space, GraphCache requires mechanisms to manage the space effectively, which in turn raises the problem of cache replacement. However, none of the existing replacement policies were developed specifically for graph caches. This work hence proposes a number of graph-query-aware strategies with different trade-offs and emphasizes a novel hybrid replacement policy with competitive performance. Following the established research in the literature, GraphCache handles graph queries against a static dataset, i.e., all graphs in the underlying dataset remain untouched during the continual arrival and execution of queries. However, in real-world applications, the graph dataset naturally evolves over time. This poses a significant challenge for the current graph caching technique and hence gives rise to the need for advanced systems that are capable of accelerating subgraph/supergraph queries against dynamic datasets. To address the problem, this work contributes an upgraded graph caching system, namely GraphCache+, stressing the newly plugged-in subsystems and components that deal with the consistency of the graph cache. GraphCache+ is characterized by its two cache models, which represent different designs for ensuring graph cache consistency, as well as by its novel logics of alleviating subgraph and supergraph query processing, with formal proof of correctness.
Additionally, this work is bundled with comprehensive performance evaluations of GraphCache/GraphCache+ with over 6 million queries against both real-world and synthetic datasets with different characteristics, revealing a number of non-trivial lessons. Overall, this work contributes to the community from three perspectives: it provides a fresh idea to expedite graph query processing, applicable to both SI methods and FTV algorithms; it presents GraphCache, to the best of our knowledge the first full-fledged graph caching system for general subgraph/supergraph queries; and it explores the topic of graph cache consistency, putting forth a systematic solution, GraphCache+.

    Parameterized Algorithms for Scalable Interprocedural Data-flow Analysis

    Data-flow analysis is a general technique used to compute information of interest at different points of a program and is considered to be a cornerstone of static analysis. In this thesis, we consider interprocedural data-flow analysis as formalized by the standard IFDS framework, which can express many widely-used static analyses such as reaching definitions, live variables, and null-pointer. We focus on the well-studied on-demand setting in which queries arrive one-by-one in a stream and each query should be answered as fast as possible. While the classical IFDS algorithm provides a polynomial-time solution to this problem, it is not scalable in practice. Specifically, it either requires a quadratic-time preprocessing phase or takes linear time per query, both of which are untenable for modern huge codebases with hundreds of thousands of lines. Previous works have already shown that parameterizing the problem by the treewidth of the program's control-flow graph is promising and can lead to significant gains in efficiency. Unfortunately, these results were only applicable to the limited special case of same-context queries. In this work, we obtain significant speedups for the general case of on-demand IFDS with queries that are not necessarily same-context. This is achieved by exploiting a new graph sparsity parameter, namely the treedepth of the program's call graph. Our approach is the first to exploit the sparsity of control-flow graphs and call graphs at the same time and parameterize by both treewidth and treedepth. We obtain an algorithm with a linear preprocessing phase that can answer each query in constant time with respect to the input size. Finally, we show experimental results demonstrating that our approach significantly outperforms the classical IFDS and its on-demand variant

    A Potpourri of Reason Maintenance Methods

    We present novel methods to compute changes to materialized views in logic databases like those used by rule-based reasoners. Such reasoners have to address the problem of changing axioms in the presence of materializations of derived atoms. Existing approaches have drawbacks: some require generating and evaluating large transformed programs that are in Datalog - while the source program is in Datalog and significantly smaller; some recompute the whole extension of a predicate even if only a small part of this extension is affected by the change. The methods presented in this article overcome these drawbacks and derive additional information that is also useful for explanation, at the price of an adaptation of the semi-naive forward chaining.