57,588 research outputs found

    Parallel Breadth-First Search on Distributed Memory Systems

    Full text link

    Distributed-Memory Breadth-First Search on Massive Graphs

    Full text link
    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451

    Parallel symbolic state-space exploration is difficult, but what is the alternative?

    Full text link
    State-space exploration is an essential step in many modeling and analysis problems. Its goal is to find the states reachable from the initial state of a discrete-state model described. The state space can used to answer important questions, e.g., "Is there a dead state?" and "Can N become negative?", or as a starting point for sophisticated investigations expressed in temporal logic. Unfortunately, the state space is often so large that ordinary explicit data structures and sequential algorithms cannot cope, prompting the exploration of (1) parallel approaches using multiple processors, from simple workstation networks to shared-memory supercomputers, to satisfy large memory and runtime requirements and (2) symbolic approaches using decision diagrams to encode the large structured sets and relations manipulated during state-space generation. Both approaches have merits and limitations. Parallel explicit state-space generation is challenging, but almost linear speedup can be achieved; however, the analysis is ultimately limited by the memory and processors available. Symbolic methods are a heuristic that can efficiently encode many, but not all, functions over a structured and exponentially large domain; here the pitfalls are subtler: their performance varies widely depending on the class of decision diagram chosen, the state variable order, and obscure algorithmic parameters. As symbolic approaches are often much more efficient than explicit ones for many practical models, we argue for the need to parallelize symbolic state-space generation algorithms, so that we can realize the advantage of both approaches. This is a challenging endeavor, as the most efficient symbolic algorithm, Saturation, is inherently sequential. We conclude by discussing challenges, efforts, and promising directions toward this goal

    Massive Data-Centric Parallelism in the Chiplet Era

    Full text link
    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that the parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory accesses and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads. Data structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by having the entire dataset on-chip, scattered across PUs, and executing the tasks at the PU where the data is local. However, it also raised questions on how to scale to larger datasets when all the memory is on chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables creating chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfigurations avoid network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First-Search with RMAT-26 across a million PUs reaches 3323 GTEPS

    GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

    Full text link
    High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.Comment: 50 pages, 14 figures, 14 table

    Scalable Breadth-First Search on a GPU Cluster

    Full text link
    On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices, and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.Comment: 12 pages, 13 figures. To appear at IPDPS 201
    • …
    corecore