596 research outputs found

    PT-Scotch: A tool for efficient parallel graph ordering

    Get PDF
    The parallel ordering of large graphs is a difficult problem, because on the one hand minimum degree algorithms do not parallelize well, and on the other hand the obtainment of high quality orderings with the nested dissection algorithm requires efficient graph bipartitioning heuristics, the best sequential implementations of which are also hard to parallelize. This paper presents a set of algorithms, implemented in the PT-Scotch software package, which allows one to order large graphs in parallel, yielding orderings the quality of which is only slightly worse than the one of state-of-the-art sequential algorithms. Our implementation uses the classical nested dissection approach but relies on several novel features to solve the parallel graph bipartitioning problem. Thanks to these improvements, PT-Scotch produces consistently better orderings than ParMeTiS on large numbers of processors

    Recent Advances in Graph Partitioning

    Full text link
    We survey recent trends in practical algorithms for balanced graph partitioning together with applications and future research directions

    SPAWN: An Iterative, Potentials-Based, Dynamic Scheduling and Partitioning Tool

    Get PDF
    International audienceMany applications of physics modeling use regular meshes on which computations of highly variable cost can occur. Distributing the underlying cells over manycore architec-tures is a critical load balancing step that should increase the period until another step is required. Graph partitioning tools are known to be very effective for such problems, but they exhibit scalability problems as the number of cores and the number of cells increases. We introduce a dynamic task scheduling approach inspired by physical particles interactions. Our method allows cores to virtually move over a 2D/3D mesh of tasks and uses a Voronoi domain decomposition to balance workload among cores. Displacements of cores are the result of force computations using a carefully chosen pair potential. We evaluate our method against graph partitioning tools and existing task schedulers with a representative physical application, and demonstrate the relevance of our approach

    An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

    Full text link
    We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7 fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices

    High-Quality Hierarchical Process Mapping

    Get PDF
    Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When a topology of a distributed system is known, an important task is then to map the blocks of the partition onto the processors such that the overall communication cost is reduced. We present novel multilevel algorithms that integrate graph partitioning and process mapping. Important ingredients of our algorithm include fast label propagation, more localized local search, initial partitioning, as well as a compressed data structure to compute processor distances without storing a distance matrix. Moreover, our algorithms are able to exploit a given hierarchical structure of the distributed system under consideration. Experiments indicate that our algorithms speed up the overall mapping process and, due to the integrated multilevel approach, also find much better solutions in practice. For example, one configuration of our algorithm yields similar solution quality as the previous state-of-the-art in terms of mapping quality for large numbers of partitions while being a factor 9.3 faster. Compared to the currently fastest iterated multilevel mapping algorithm Scotch, we obtain 16% better solutions while investing slightly more running time

    Partitioning, Ordering, and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver

    Get PDF
    Institut National Polytechnique de Toulouse, RT-APO-12-2PDSLin is a general-purpose algebraic parallel hybrid (direct/iterative) linear solver based on the Schur complement method. The most challenging step of the solver is the computation of a preconditioner based on an approximate global Schur complement. We investigate two combinatorial problems to enhance PDSLin's performance at this step. The first is a multi-constraint partitioning problem to balance the workload while computing the preconditioner in parallel. For this, we describe and evaluate a number of graph and hypergraph partitioning algorithms to satisfy our particular objective and constraints. The second problem is to reorder the sparse right-hand side vectors to improve the data access locality during the parallel solution of a sparse triangular system with multiple right-hand sides. This is to speed up the process of eliminating the unknowns associated with the interface. We study two reordering techniques: one based on a postordering of the elimination tree and the other based on a hypergraph partitioning. To demonstrate the effect of these techniques on the performance of PDSLin, we present the numerical results of solving large-scale linear systems arising from two applications of our interest: numerical simulations of modeling accelerator cavities and of modeling fusion devices

    Local time stepping on high performance computing architectures: mitigating CFL bottlenecks for large-scale wave propagation

    Get PDF
    Modeling problems that require the simulation of hyperbolic PDEs (wave equations) on large heterogeneous domains have potentially many bottlenecks. We attack this problem through two techniques: the massively parallel capabilities of graphics processors (GPUs) and local time stepping (LTS) to mitigate any CFL bottlenecks on a multiscale mesh. Many modern supercomputing centers are installing GPUs due to their high performance, and extending existing seismic wave-propagation software to use GPUs is vitally important to give application scientists the highest possible performance. In addition to this architectural optimization, LTS schemes avoid performance losses in meshes with localized areas of refinement. Coupled with the GPU performance optimizations, the derivation and implementation of an Newmark LTS scheme enables next-generation performance for real-world applications. Included in this implementation is work addressing the load-balancing problem inherent to multi-level LTS schemes, enabling scalability to hundreds and thousands of CPUs and GPUs. These GPU, LTS, and scaling optimizations accelerate the performance of existing applications by a factor of 30 or more, and enable future modeling scenarios previously made unfeasible by the cost of standard explicit time-stepping schemes
    • …
    corecore