124 research outputs found

    Parallelizing Maximal Clique Enumeration on GPUs

    Full text link
    We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which has limited scalability because of the explosion in the number of tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing depth-first traversal of independent subtrees in parallel. Since MCE suffers from high load imbalance and memory capacity requirements, we propose a worker list for dynamic load balancing, as well as partial induced subgraphs and a compact representation of excluded vertex sets to regulate memory consumption. Our evaluation shows that our GPU implementation on a single GPU outperforms the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x (up to 16.7x), and scales efficiently to multiple GPUs. Our code has been open-sourced to enable further research on accelerating MCE

    High Performance Large Graph Analytics by Enhancing Locality

    Get PDF
    Graphs are widely used in a variety of domains for representing entities and their relationship to each other. Graph analytics helps to understand, detect, extract and visualize insightful relationships between different entities. Graph analytics has a wide range of applications in various domains including computational biology, commerce, intelligence, health care and transportation. The breadth of problems that require large graph analytics is growing rapidly resulting in a need for fast and efficient graph processing. One of the major challenges in graph processing is poor locality of reference. Locality of reference refers to the phenomenon of frequently accessing the same memory location or adjacent memory locations. Applications with poor data locality reduce the effectiveness of the cache memory. They result in large number of cache misses, requiring access to high latency main memory. Therefore, it is essential to have good locality for good performance. Most graph processing applications have highly random memory access patterns. Coupled with the current large sizes of the graphs, they result in poor cache utilization. Additionally, the computation to data access ratio in many graph processing applications is very low, making it difficult to cover the memory latency using computation. It is also challenging to efficiently parallelize most graph applications. Many graphs in real world have unbalanced degree distribution. It is difficult to achieve a balanced workload for such graphs. The parallelism in graph applications is generally fine-grained in nature. This calls for efficient synchronization and communication between the processing units. Techniques for enhancing locality have been well studied in the context of regular applications like linear algebra. Those techniques are in most cases not applicable to the graph problems. In this dissertation, we propose two techniques for enhancing locality in graph algorithms: access transformation and task-set reduction. Access transformation can be applied to algorithms to improve the spatial locality by changing the random access pattern to sequential access. It is applicable to iterative algorithms that process random vertices/edges in each iteration. The task-set reduction technique can be applied to enhance the temporal locality. It is applicable to algorithms which repeatedly access the same data to perform certain task. Using the two techniques, we propose novel algorithms for three graph problems: k-core decomposition, maximal clique enumeration and triangle listing. We have implemented the algorithms. The results show that these algorithms provide significant improvement in performance and also scale well

    Shared Memory Parallel Subgraph Enumeration

    Full text link
    The subgraph enumeration problem asks us to find all subgraphs of a target graph that are isomorphic to a given pattern graph. Determining whether even one such isomorphic subgraph exists is NP-complete---and therefore finding all such subgraphs (if they exist) is a time-consuming task. Subgraph enumeration has applications in many fields, including biochemistry and social networks, and interestingly the fastest algorithms for solving the problem for biochemical inputs are sequential. Since they depend on depth-first tree traversal, an efficient parallelization is far from trivial. Nevertheless, since important applications produce data sets with increasing difficulty, parallelism seems beneficial. We thus present here a shared-memory parallelization of the state-of-the-art subgraph enumeration algorithms RI and RI-DS (a variant of RI for dense graphs) by Bonnici et al. [BMC Bioinformatics, 2013]. Our strategy uses work stealing and our implementation demonstrates a significant speedup on real-world biochemical data---despite a highly irregular data access pattern. We also improve RI-DS by pruning the search space better; this further improves the empirical running times compared to the already highly tuned RI-DS.Comment: 18 pages, 12 figures, To appear at the 7th IEEE Workshop on Parallel / Distributed Computing and Optimization (PDCO 2017

    DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

    Full text link
    We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance over hardware architecture. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU).Comment: LDAV 2018, October 201

    Listing k-cliques in Sparse Real-World Graphs

    Get PDF
    International audienceMotivated by recent studies in the data mining community which require to efficiently list all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for such a problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges as well as all 10-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs
    corecore