
    Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

    New algorithms and optimization techniques are needed to counter the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, which lessens the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
    Comment: 9 pages, 6 figures
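    To make the idea concrete, here is a minimal sketch of overlapped ("ghost zone") temporal blocking for a 1-D 3-point Jacobi stencil in C: two time steps are fused per sweep over each cache-sized block, so block data is loaded from memory once instead of twice. The grid size, block size, and OpenMP parallelization are illustrative assumptions, not details from the paper, whose pipelined multicore scheme is more sophisticated.

    ```c
    /* Overlapped temporal blocking for a 1-D 3-point Jacobi stencil:
     * two time steps fused per sweep over each cache-sized block,
     * halving memory traffic. Dirichlet (fixed) boundaries assumed. */
    #include <stdio.h>
    #include <string.h>

    #define N  4096   /* grid points, including the two boundary points */
    #define BS 512    /* spatial block size; chosen to fit in cache     */

    static void fused_two_steps(const double *a, double *b)
    {
        #pragma omp parallel for            /* blocks are independent */
        for (int lo = 1; lo < N - 1; lo += BS) {
            int hi = lo + BS < N - 1 ? lo + BS : N - 1;
            double tmp[BS + 2];             /* block-private halo buffer */

            /* step t: update the block plus one halo cell on each side */
            int elo = lo > 1 ? lo - 1 : 1;
            int ehi = hi < N - 1 ? hi + 1 : N - 1;
            for (int i = elo; i < ehi; i++)
                tmp[i - (lo - 1)] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            if (lo == 1)     tmp[0] = a[0];                 /* fixed boundary */
            if (hi == N - 1) tmp[hi - (lo - 1)] = a[N - 1]; /* fixed boundary */

            /* step t+1: update the block proper from the halo buffer */
            for (int i = lo; i < hi; i++)
                b[i] = (tmp[i - lo] + tmp[i - lo + 1] + tmp[i - lo + 2]) / 3.0;
        }
    }

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (double)i;
        b[0] = a[0]; b[N - 1] = a[N - 1];   /* boundaries never change */

        for (int t = 0; t < 100; t += 2) {  /* 100 time steps, 50 fused sweeps */
            fused_two_steps(a, b);
            memcpy(a, b, sizeof a);
        }
        printf("a[N/2] = %f\n", a[N / 2]);
        return 0;
    }
    ```

    Each sweep reads the grid once but advances it by two steps; deeper pipelines fuse more steps at the cost of wider halos, which is the trade-off the paper's approach manages across shared caches.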

    Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core Cluster

    Synchronization is likely the most critical performance killer in shared-memory parallel programs. With the rise of multi-core and many-core processors, the relative impact of synchronization on performance and energy overhead is bound to grow. This paper focuses on barrier synchronization for TeraPool, a cluster of 1024 RISC-V processors with non-uniform memory access to a tightly coupled 4MB shared L1 data memory. We compare the synchronization strategies available in other multi-core and many-core clusters to identify the optimal native barrier kernel for TeraPool. We benchmark a set of optimized barrier implementations and evaluate their performance in the framework of the widespread fork-join, OpenMP-style programming model. We test parallel kernels from the signal-processing and telecommunications domains, achieving less than 10% synchronization overhead over the total runtime for problems that fit TeraPool's L1 memory. By fine-tuning our tree barriers, we achieve a 1.6x speed-up with respect to a naive central counter barrier and just 6.2% overhead on a typical 5G application that includes a challenging multistage synchronization kernel. To our knowledge, this is the first work in which shared-memory barriers are used to synchronize a thousand processing elements tightly coupled to a shared data memory.
    Comment: 15 pages, 7 figures
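    For reference, the naive central counter barrier that serves as the paper's baseline can be sketched as a sense-reversing counter barrier. The version below uses C11 atomics and POSIX threads as a stand-in for TeraPool's bare-metal RISC-V cores; the thread count and kernel structure are illustrative assumptions.

    ```c
    /* Sense-reversing central-counter barrier: every thread increments one
     * shared counter, and the last arrival flips a shared "sense" flag that
     * releases the spinners. Contention on the single counter is what the
     * paper's tuned tree barriers avoid. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 8

    static atomic_int count = 0;               /* arrivals so far            */
    static atomic_int sense = 0;               /* flips once per barrier use */
    static _Thread_local int local_sense = 0;

    static void barrier_wait(void)
    {
        local_sense = !local_sense;                 /* sense expected after release */
        if (atomic_fetch_add(&count, 1) == NTHREADS - 1) {
            atomic_store(&count, 0);                /* last arrival resets...   */
            atomic_store(&sense, local_sense);      /* ...and releases everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                                   /* spin on the shared flag */
        }
    }

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        for (int step = 0; step < 3; step++) {
            /* ... parallel kernel work for this step would go here ... */
            barrier_wait();
            if (id == 0)
                printf("all %d threads passed barrier %d\n", NTHREADS, step);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int id[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }
    ```

    A tree barrier replaces the single counter with a tree of small counters so that only a few cores ever contend on the same memory location, which is where the reported 1.6x speed-up over this scheme comes from.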

    MP-LOCKs: Replacing hardware synchronization primitives with message passing

    Shared memory programs guarantee the correctness of concurrent accesses to shared data using interprocessor synchronization operations. The most common synchronization operators are locks, which are traditionally implemented in user-level libraries via a mix of shared memory accesses and hardware synchronization primitives like test-and-set. In this paper, we argue that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware. We propose three message passing lock (MP-LOCK) algorithms (centralized, distributed, and reactive) and provide guidelines for implementing them efficiently. MP-LOCKs reduce the design complexity and runtime occupancy of DSM controllers and can exploit software's inherent flexibility to adapt to differing applications' lock access patterns. We compared the performance of MP-LOCKs with two common shared memory lock algorithms, test-and-set and MCS locks, and found that MP-LOCKs scale better. For machines with 16 to 32 nodes, applications using MP-LOCKs ran up to 186% faster than the same applications with shared memory locks. For small systems (up to 8 nodes), MP-LOCK performance lags shared memory lock performance due to the higher software overhead. However, three of the MP-LOCK applications slowed down by no more than 18%, while the other two slowed by no more than 180%. Given these results, we conclude that locks based on message passing should be considered as a replacement for hardware locks in future scalable multiprocessors that support efficient message passing mechanisms. In addition, efficient software synchronization primitives can be implemented in clusters of workstations by following the guidelines we propose.
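    As a point of reference, below is a minimal sketch of the test-and-set spinlock that is one of the paper's shared-memory baselines, written with C11 atomics and POSIX threads. An MP-LOCK would instead send acquire/release messages to a kernel-embedded lock manager; that protocol is not reproduced here.

    ```c
    /* Test-and-set spinlock: the classic hardware-primitive baseline that
     * MP-LOCKs are compared against. Under contention every spinning core
     * hammers the same cache line, which is the scalability problem that
     * message-passing locks (and MCS queue locks) sidestep. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static long counter = 0;

    static void ts_acquire(void)
    {
        /* spin until test-and-set observes the flag clear */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;
    }

    static void ts_release(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            ts_acquire();
            counter++;                      /* critical section */
            ts_release();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
        return 0;
    }
    ```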

    Improvements in Hardware Transactional Memory for GPU Architectures

    In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and programmers use it to speed up their applications thanks to its low latency. Prior work from the authors proposed a lightweight hardware TM (HTM) support based on the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key to providing an effective synchronization mechanism at the local memory level. After a brief description of the main features of our HTM design for GPU local memory, in this work we gather a number of proposals aimed at improving those mechanisms that have a high impact on performance. First, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Second, the conflict detection mechanism is optimized depending on application characteristics, such as the read/write sets, the probability of conflict between transactions, and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is the task of the compiler and runtime to determine which ones matter most for a given application. This work includes a discussion of the analysis needed to choose the best configuration.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
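    The paper's conflict detection lives in GPU hardware, but the principle can be illustrated in software. The sketch below is a transactional-mutex-lock (TML) style scheme in C: a single global versioned lock serializes writers, and readers abort and retry when they observe the version change mid-transaction. Everything here (the names, the two-account example) is our illustrative analogue, not the authors' mechanism.

    ```c
    /* TML-style software transaction: eager, single-version conflict detection.
     * glock even => no writer active; odd => one writer has exclusive access.
     * Readers validate against the version seen at tx_begin and retry the whole
     * transaction on any conflict -- forward progress by serialization, which is
     * also how conflicting GPU transactions must be handled. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long glock = 0;

    typedef struct { long snap; int writer; } tx;

    static void tx_begin(tx *t)
    {
        long v;
        do { v = atomic_load(&glock); } while (v & 1);  /* wait out writers */
        t->snap = v;
        t->writer = 0;
    }

    /* each returns 0 on conflict: caller must restart the transaction */
    static int tx_read(tx *t, const atomic_long *addr, long *out)
    {
        *out = atomic_load_explicit(addr, memory_order_relaxed);
        return t->writer || atomic_load(&glock) == t->snap;
    }

    static int tx_write(tx *t, atomic_long *addr, long val)
    {
        if (!t->writer) {                    /* first write claims the lock */
            long expect = t->snap;
            if (!atomic_compare_exchange_strong(&glock, &expect, t->snap + 1))
                return 0;                    /* another writer won: conflict */
            t->writer = 1;
        }
        atomic_store_explicit(addr, val, memory_order_relaxed);
        return 1;
    }

    static void tx_commit(tx *t)
    {
        if (t->writer)
            atomic_store(&glock, t->snap + 2);  /* release + new version */
    }

    static atomic_long acct_a = 100, acct_b = 0;

    static void *mover(void *arg)           /* atomically move 1 from a to b */
    {
        (void)arg;
        for (int i = 0; i < 10000; i++) {
            tx t; long a, b;
    retry:
            tx_begin(&t);
            if (!tx_read(&t, &acct_a, &a)) goto retry;
            if (!tx_read(&t, &acct_b, &b)) goto retry;
            if (!tx_write(&t, &acct_a, a - 1)) goto retry;
            tx_write(&t, &acct_b, b + 1);   /* cannot fail once writer */
            tx_commit(&t);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, mover, NULL);
        pthread_create(&t2, NULL, mover, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a + b = %ld (expected 100)\n",
               atomic_load(&acct_a) + atomic_load(&acct_b));
        return 0;
    }
    ```

    Read-only transactions never take the lock at all, which hints at why detecting them, as the paper proposes, pays off.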

    Shared-Semaphored Cache Implementation for Parallel Program Execution in Multi-Core Systems

    The paper brings forward the idea of multi-threaded computation synchronization based on a shared semaphored cache in multi-core CPUs. It is dedicated to the implementation of multi-core PLC control, embedded solutions, or parallel computation of models described using hardware description languages. The shared semaphored cache is implemented as guarded memory cells within a dedicated section of the cache memory that is shared by multiple cores. This enables the cores to speed up data exchange and seamlessly synchronize computation. The idea has been verified by creating a multi-core system model in Verilog HDL. Simulation of task synchronization methods demonstrates the benefits of shared semaphored memory cells over standard synchronization methods. The proposed idea enhances computation in algorithms that consist of relatively short tasks that can be processed in parallel and that require fast synchronization mechanisms to avoid data race conditions.
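    As a rough software analogue, a guarded cell can be modeled as a memory word with a full/empty flag: a write blocks until the cell is empty, a read blocks until it is full, so data exchange and synchronization happen in one step. This single-producer/single-consumer sketch in C is our reading of the idea; the paper itself models the mechanism in Verilog HDL inside the shared cache.

    ```c
    /* Guarded (semaphored) cell modeled in software: one word plus a
     * full/empty flag. Single producer, single consumer assumed; the
     * hardware version would stall the core instead of spinning. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct {
        atomic_int full;    /* 0: empty, 1: holds a value */
        int value;
    } guarded_cell;

    static void cell_put(guarded_cell *c, int v)
    {
        while (atomic_load_explicit(&c->full, memory_order_acquire))
            ;                               /* wait until consumer drained it */
        c->value = v;
        atomic_store_explicit(&c->full, 1, memory_order_release);
    }

    static int cell_get(guarded_cell *c)
    {
        while (!atomic_load_explicit(&c->full, memory_order_acquire))
            ;                               /* wait until producer filled it */
        int v = c->value;
        atomic_store_explicit(&c->full, 0, memory_order_release);
        return v;
    }

    static guarded_cell cell = { 0, 0 };

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 1; i <= 5; i++)
            cell_put(&cell, i * i);         /* results of short tasks */
        return NULL;
    }

    int main(void)
    {
        pthread_t p;
        pthread_create(&p, NULL, producer, NULL);
        for (int i = 0; i < 5; i++)
            printf("got %d\n", cell_get(&cell));  /* main consumes */
        pthread_join(p, NULL);
        return 0;
    }
    ```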

    On the Power of Multi-Objects

    In the standard "single-object" model of shared-memory computing, it is assumed that a process accesses at most one shared object in each of its steps. In this paper, we consider a more powerful variant, the "multi-object" model, in which each process may access *any* finite number of shared objects atomically in each of its steps. We present results that relate the synchronization power of a type in the multi-object model to its synchronization power in the single-object model. Although the types fetch&add and swap have the same synchronization power in the single-object model, Afek, Merritt, and Taubenfeld showed that their synchronization powers differ in the multi-object model. We prove that this divergence phenomenon is exhibited *only* by types at levels 1 and 2; all types at higher levels have the same, unbounded, synchronization power in the multi-object model. This paper identifies all possible relationships between a type's synchronization power in the single-object model and its synchronization power in the multi-object model.
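    To make the model concrete: in a multi-object step, a process applies operations to several shared objects as one atomic action. Commodity hardware has no such primitive, so the sketch below simulates it in C by serializing steps with a global lock; the two-counter fetch&add example and all names are ours, purely to illustrate the semantics.

    ```c
    /* Multi-object step, simulated: one step atomically applies fetch&add
     * to TWO shared objects. Two separate single-object fetch&adds could be
     * interleaved by other processes; the multi-object step cannot, so the
     * invariant obj_a == obj_b holds at every step boundary. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t step_lock = PTHREAD_MUTEX_INITIALIZER;
    static long obj_a = 0, obj_b = 0;    /* two shared fetch&add objects */

    /* one atomic multi-object step: fetch&add on both objects */
    static void multi_faa(long da, long db, long *olda, long *oldb)
    {
        pthread_mutex_lock(&step_lock);
        *olda = obj_a; obj_a += da;
        *oldb = obj_b; obj_b += db;
        pthread_mutex_unlock(&step_lock);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        long a, b;
        for (int i = 0; i < 100000; i++)
            multi_faa(1, 1, &a, &b);     /* a == b on every observation */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("obj_a = %ld, obj_b = %ld\n", obj_a, obj_b);
        return 0;
    }
    ```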