4 research outputs found

    WolfGraph : the edge-centric graph processing on GPU

    There is significant interest in developing frameworks for parallelizing the processing of large graphs such as social networks and web graphs. Work has been proposed to parallelize graph processing on clusters (distributed memory), multicore machines (shared memory) and GPU devices. Most existing research on GPU-based graph processing employs the vertex-centric processing model and the Compressed Sparse Row (CSR) format to store and process a graph. However, these approaches suffer from irregular memory access and load imbalance on the GPU, which hampers full exploitation of GPU performance. In this paper, we present WolfGraph, a GPU-based graph processing framework that addresses these problems. WolfGraph adopts edge-centric processing, which iterates over edges rather than vertices. The data structure and graph partitioning in WolfGraph are carefully crafted to minimize graph pre-processing and allow coalesced memory access. WolfGraph fully utilizes GPU power by processing all edges in parallel. We also develop a new method, called Concatenated Edge List (CEL), to process a graph that is bigger than the global memory of the GPU. WolfGraph allows users to define their own graph-processing methods and plug them into the WolfGraph framework. Our experiments show that WolfGraph achieves a 7-8x speedup over GraphChi and X-Stream when processing large graphs, and it also offers a 65% performance improvement over existing GPU-based, vertex-centric graph processing frameworks such as Gunrock.
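The edge-centric model described in this abstract can be illustrated with a minimal sketch. This is not WolfGraph's API or its CEL data structure: the function name `edge_centric_sssp` and the `(src, dst, weight)` tuple layout are assumptions made for illustration, and a sequential loop stands in for the step that a GPU would run with one thread per edge. Non-negative edge weights are assumed.

```python
# Edge-centric sketch: the graph is a flat edge list, and each round
# visits every edge once. On a GPU each edge would map to one thread;
# here a plain loop stands in for that parallel step.
INF = float("inf")


def edge_centric_sssp(num_vertices, edges, source):
    """Single-source shortest paths by repeated relaxation of all edges.

    edges: list of (src, dst, weight) tuples (hypothetical layout,
    weights assumed non-negative).
    """
    dist = [INF] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        # No per-vertex adjacency lookup: the loop walks the edge list
        # in order, which is what makes the memory access regular.
        for src, dst, weight in edges:
            candidate = dist[src] + weight
            if candidate < dist[dst]:
                dist[dst] = candidate
                changed = True
    return dist


example_edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]
print(edge_centric_sssp(4, example_edges, 0))  # [0, 3, 1, 8]
```

Because every round touches every edge, the work per round is uniform regardless of vertex degree, which is the load-balance property the abstract attributes to the edge-centric approach.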

    Towards feature-aware graph processing on the GPU

    Unlike traditional graph processing applications, graph-based learning algorithms like Belief Propagation and Multimodal Learning require complex data such as feature vectors and matrices residing on graph vertices and edges, and employ vector/matrix operations on this data. GPU-based high-performance graph processing frameworks utilize clever techniques to mitigate the effect of random global memory accesses arising from irregular graph structure, and also perform efficient load balancing. However, these frameworks are oblivious to algorithm-specific details like the nature of operations involved and the vertex/edge property types used, and hence they end up generating unnecessary random global memory accesses. Moreover, traditional graph processing frameworks often force the user to follow a strict sequence of operations, which does not capture the nuances of different control flows in graph-based learning algorithms. In this thesis, we present Onyx, a feature-aware framework for graph-based learning algorithms on the GPU. Onyx employs a feature-aware processing model where each vertex property is collectively computed by a group of threads. This allows accesses to be coalesced into fewer global memory transactions, improving memory utilization. Onyx also incorporates dynamic vertex activation to perform sparse computations as vertex properties stabilize over time. The user expresses computations in the form of parallel operations on vertex and edge features, providing flexibility for custom control flows that support different kinds of graph-based learning algorithms. To extract high performance, Onyx automatically folds multiple parallel vertex- and edge-feature operations into a single kernel at compile time. This eliminates the overhead of repeated kernel launches, and permits the use of low-latency shared memory as intermediate storage. We utilize GPU instructions to efficiently perform collaborative operations across vertex and edge features such as normalization, reduction and feature-level change detection. Finally, as feature-aware processing reduces the computation done per thread, we organize the critical path in Onyx as pipelined steps to minimize expensive dependency stalls. Our evaluation shows that Onyx's feature-aware processing decreases the number of atomic transactions and simultaneously increases global load efficiency. Together with change-driven computation, this results in up to 20.3x speedup. We also implemented the graph-based learning algorithms on state-of-the-art GPU graph frameworks, and observe that Onyx outperforms them by up to 51.2x.
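The dynamic vertex activation and feature-level change detection mentioned in this abstract can be sketched as follows. This is not Onyx's API: the function name `propagate`, the element-wise mean update rule, and the `tol` threshold are all assumptions chosen to keep the example small, and a sequential loop stands in for the GPU thread groups that would compute each feature vector.

```python
def propagate(features, neighbors, tol=1e-6, max_rounds=100):
    """Change-driven feature propagation (illustrative sketch only).

    features:  one feature vector (list of floats) per vertex.
    neighbors: adjacency list per vertex.
    Returns (final_features, rounds, vertex_updates).
    """
    features = [list(f) for f in features]  # do not mutate the caller's data
    active = set(range(len(features)))      # every vertex starts active
    rounds = 0
    vertex_updates = 0
    while active and rounds < max_rounds:
        rounds += 1
        # Compute new vectors from the previous round's values only.
        new_values = {}
        for v in active:
            group = [features[v]] + [features[u] for u in neighbors[v]]
            # Stand-in feature operation: element-wise mean over the group.
            new_values[v] = [sum(col) / len(group) for col in zip(*group)]
        next_active = set()
        for v, new in new_values.items():
            vertex_updates += 1
            # Feature-level change detection: largest per-element difference.
            changed = max(abs(a - b) for a, b in zip(new, features[v])) > tol
            features[v] = new
            if changed:
                # Only a changed vertex reactivates itself and its neighbors.
                next_active.add(v)
                next_active.update(neighbors[v])
        active = next_active
    return features, rounds, vertex_updates


# Vertices 0 and 1 are connected; vertex 2 is isolated and stabilizes at once.
result, rounds, updates = propagate([[0.0], [2.0], [5.0]], [[1], [0], []])
print(result, rounds, updates)  # [[1.0], [1.0], [5.0]] 2 5
```

In the usage example the isolated vertex is dropped from the active set after the first round, so the second round recomputes two vertices instead of three. That shrinking active set is the sparse computation the abstract refers to.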

    Exploring Multi-Level Parallelism For Graph-Based Applications Via Algorithm And System Co-Design

    Graph processing is at the heart of many modern applications where graphs are used as the basic data structure to represent the entities of interest and the relationships between them. Improving the performance of graph-based applications, especially using parallelism techniques, has drawn significant interest in both academia and industry. On the one hand, modern CPU architectures are able to provide massive computational power by using sophisticated memory hierarchies and multi-level parallelism, including thread-level parallelism, data-level parallelism, etc. On the other hand, graph processing workloads are notoriously challenging for achieving high performance due to their irregular computation patterns and unpredictable control flow. Therefore, how to accelerate graph-based applications using parallelism is still an open question. This dissertation focuses on providing high performance for graph-based applications. To take full advantage of the multi-level parallelism resources provided by CPUs, this dissertation studies the characteristics of graph-based applications and matches their parallel solutions with the underlying hardware via algorithm and system co-design. This dissertation divides graph-based applications into three categories: typical graph algorithms, sequential graph-based applications, and applications with graph-based solutions. The first category comprises typical graph algorithms with available parallel solutions. This dissertation proposes GraphPhi as a new approach to graph processing on emerging Intel Xeon Phi-like architectures. The second category includes specialized graph applications without nontrivial parallel solutions. This dissertation studies a state-of-the-art 2-hop labeling approach named Pruned Landmark Labeling (PLL). This dissertation proposes Batched Vertex-Centric PLL (BVC-PLL), which breaks PLL's inherent dependencies and parallelizes it in a scalable way. The third category includes applications that rely on graph-based solutions. This dissertation studies the sequential search algorithm for the graph-based indexing methods used for the Approximate Nearest Neighbor Search (ANNS) problem. This dissertation proposes Speed-ANN, a parallel similarity search algorithm that reveals hidden intra-query parallelism to accelerate search while fulfilling the high accuracy requirement. Moreover, this dissertation further explores the optimization opportunities for computational graph-based deep neural network inference running on tiny devices, specifically microcontrollers (MCUs). Altogether, this dissertation studies graph-based applications and improves their performance by providing solutions of multi-level parallelism via algorithm and system co-design to match them with the underlying multi-core CPU architectures.