156 research outputs found

    A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

    Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm that uses as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and fewer conflicts than Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels. Comment: To appear in the proceedings of Euro Par 201
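The speculative, local color assignment mentioned in the abstract above can be illustrated with a minimal sketch. It is a sequential simulation of the general "color tentatively, detect conflicts, recolor the losers" idea, not the algorithm of either paper. The function name `speculative_coloring`, the adjacency-list input format, and the rule that the higher-numbered vertex of a conflicting edge is recolored are all assumptions made for illustration.

```python
def speculative_coloring(adj):
    """Color a graph given as a list of neighbor lists; returns (colors, rounds).

    Hypothetical sketch: each round simulates all threads coloring at once
    by letting every vertex read the same stale snapshot of the colors.
    """
    n = len(adj)
    color = [-1] * n              # -1 means "not colored yet"
    worklist = list(range(n))
    rounds = 0
    while worklist:
        rounds += 1
        snapshot = color[:]       # state every simulated thread sees
        # Phase 1: tentative first-fit coloring against the stale snapshot.
        for u in worklist:
            forbidden = {snapshot[v] for v in adj[u] if snapshot[v] != -1}
            c = 0
            while c in forbidden:
                c += 1
            color[u] = c
        # Phase 2: conflict detection. On a conflicting edge, the endpoint
        # with the larger index loses and is recolored in the next round.
        losers = [u for u in worklist
                  if any(color[v] == color[u] and v < u for v in adj[u])]
        for u in losers:
            color[u] = -1
        worklist = losers
    return color, rounds


if __name__ == "__main__":
    # Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
    graph = [[1, 2], [0, 2], [0, 1, 3], [2]]
    colors, rounds = speculative_coloring(graph)
    print(colors, rounds)
```

The lowest-numbered vertex in each worklist never loses a conflict, so the worklist shrinks every round and the loop terminates. In a real multithreaded setting the two phases would run in parallel over the worklist, and the amount of synchronization between them is what the abstract describes as being reduced.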

    SciDAC Institute: Combinatorial Scientific Computing and Petascale Simulations (CSCAPES). Final Report


    Fine-grained Locality-aware Parallel Scheme for Anisotropic Mesh Adaptation

    In this paper, we provide a fine-grained parallel scheme for anisotropic mesh adaptation on NUMA (Non-Uniform Memory Access) architectures. Data dependencies are expressed by a graph for each kernel, and concurrency is extracted through fine-grained graph coloring. Tasks are structured into bulk-synchronous steps to avoid data races and to aggregate shared-data accesses. To ensure performance prediction, time cost and load imbalance are theoretically characterized. The devised scheme was evaluated on a 4 NUMA node (2-socket) machine, and a mean efficiency of 70% was reached on 32 cores for 3 kernels out of 4. The impact of irregular degree distribution and data layout on scalability is highlighted.
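As a rough illustration of how a coloring of a dependency graph can yield bulk-synchronous steps, the sketch below groups tasks by color so that no two tasks in the same step share a dependency edge. It uses a plain sequential greedy coloring as a stand-in, and the names `greedy_coloring` and `bulk_synchronous_steps` and the dictionary-based conflict graph are hypothetical. It does not reproduce the paper's scheme.

```python
from collections import defaultdict


def greedy_coloring(conflicts):
    """First-fit coloring of a conflict graph given as {task: [neighbors]}."""
    color = {}
    for task in sorted(conflicts):
        used = {color[nbr] for nbr in conflicts[task] if nbr in color}
        c = 0
        while c in used:
            c += 1
        color[task] = c
    return color


def bulk_synchronous_steps(conflicts):
    """Group tasks into steps; tasks within one step never conflict."""
    color = greedy_coloring(conflicts)
    steps = defaultdict(list)
    for task, c in color.items():
        steps[c].append(task)
    # One step per color; a barrier would separate consecutive steps.
    return [steps[c] for c in sorted(steps)]


if __name__ == "__main__":
    # Hypothetical conflict graph: an edge means two tasks touch shared data.
    conflicts = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2], 4: []}
    for i, step in enumerate(bulk_synchronous_steps(conflicts)):
        print(f"step {i}: tasks {step}")
```

Each step could then be handed to a parallel loop, since tasks of the same color are independent. The number of colors bounds the number of barriers, and the size of each color class bounds the available parallelism per step.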

    Doctor of Philosophy

    Partial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, the solutions of PDEs require the tessellation of computational domains into unstructured meshes and entail computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, SIMD streaming processors such as GPUs have shown faster growth in speed and greater power efficiency, which has given them an increasingly important role in high-performance computing. By combining suitable parallel algorithms with these streaming processors, we can develop very efficient numerical solvers for PDEs. The contributions of this dissertation are twofold: the proposal of two general strategies for designing efficient PDE solvers on GPUs, and the specific application of these strategies to different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies: the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for solving the eikonal equation efficiently on fully unstructured meshes. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the levelset equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrowband scheme with domain decomposition for efficient levelset equation solving.

    A Novel Multithreaded Algorithm for Extracting Maximal Chordal Subgraphs

    Chordal graphs are triangulated graphs in which every cycle of length greater than three has a chord. Many combinatorial optimization problems, such as computing the size of the maximum clique and the chromatic number, are NP-hard on general graphs but have polynomial-time solutions on chordal graphs. In this paper, we present a novel multithreaded algorithm to extract a maximal chordal subgraph from a general graph. We develop an iterative approach in which each thread can asynchronously update a subset of edges that are dynamically assigned to it per iteration, and we implement our algorithm on two different multithreaded architectures: Cray XMT, a massively multithreaded platform, and AMD Magny-Cours, a shared-memory multicore platform. In addition to the proof of correctness, we present the performance of our algorithm on a test set of synthetic graphs with up to half a billion edges and real-world networks from gene correlation studies, and demonstrate that our algorithm achieves high scalability for all inputs on both types of architectures.