4,575 research outputs found

    A general and efficient divide-and-conquer algorithm framework for multi-core clusters

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in Cluster Computing. The final authenticated version is available online at: https://doi.org/10.1007/s10586-017-0766-y[Abstract]Divide-and-conquer is one of the most important patterns of parallelism, being applicable to a large variety of problems. In addition, the most powerful parallel systems available nowadays are computer clusters composed of distributed-memory nodes that contain an increasing number of cores that share a common memory. The optimal exploitation of these systems often requires resorting to a hybrid model that mimics the underlying hardware by combining a distributed and a shared memory parallel programming model. This results in longer development times and increased maintenance costs. In this paper we present a very general skeleton library that allows to parallelize any divide-and-conquer problem in hybrid distributed-shared memory systems with little effort while providing much flexibility and good performance. Our proposal combines a message-passing paradigm at the process level and a threaded model inside each process, hiding the related complexity from the user. The evaluation shows that this skeleton provides performance comparable, and often better than that of manually optimized codes while requiring considerably less effort when parallelizing applications on multi-core clusters.Ministerio de EconomĂ­a y Competitividad; TIN2013-42148-PMinisterio de EconomĂ­a y Competitividad; TIN2016-75845-PXunta de Galicia; GRC2013/05

    Specifying and Executing Optimizations for Parallel Programs

    Full text link
    Compiler optimizations, usually expressed as rewrites on program graphs, are a core part of all modern compilers. However, even production compilers have bugs, and these bugs are difficult to detect and resolve. The problem only becomes more complex when compiling parallel programs; from the choice of graph representation to the possibility of race conditions, optimization designers have a range of factors to consider that do not appear when dealing with single-threaded programs. In this paper we present PTRANS, a domain-specific language for formal specification of compiler transformations, and describe its executable semantics. The fundamental approach of PTRANS is to describe program transformations as rewrites on control flow graphs with temporal logic side conditions. The syntax of PTRANS allows cleaner, more comprehensible specification of program optimizations; its executable semantics allows these specifications to act as prototypes for the optimizations themselves, so that candidate optimizations can be tested and refined before going on to include them in a compiler. We demonstrate the use of PTRANS to state, test, and refine the specification of a redundant store elimination optimization on parallel programs.Comment: In Proceedings GRAPHITE 2014, arXiv:1407.767

    Importance of Explicit Vectorization for CPU and GPU Software Performance

    Full text link
    Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to speedup from multi-threading. This is 2x faster than the fully-optimized GPU version.Comment: 17 pages, 17 figure
    • …
    corecore