
    Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

    New algorithms and optimization techniques are needed to counter the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, which lessens the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
    Comment: 9 pages, 6 figures
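    To make the idea concrete, here is a minimal sketch of overlapped ("ghost zone") temporal blocking for a 1-D 3-point Jacobi stencil in C: two time steps are fused per sweep over each cache-sized block, so block data is loaded from memory once instead of twice. The grid size, block size, and OpenMP parallelization are illustrative assumptions, not details from the paper, whose pipelined multicore scheme is more sophisticated.

    ```c
    /* Overlapped temporal blocking for a 1-D 3-point Jacobi stencil:
     * two time steps fused per sweep over each cache-sized block,
     * halving memory traffic. Dirichlet (fixed) boundaries assumed. */
    #include <stdio.h>
    #include <string.h>

    #define N  4096   /* grid points, including the two boundary points */
    #define BS 512    /* spatial block size; chosen to fit in cache     */

    static void fused_two_steps(const double *a, double *b)
    {
        #pragma omp parallel for            /* blocks are independent */
        for (int lo = 1; lo < N - 1; lo += BS) {
            int hi = lo + BS < N - 1 ? lo + BS : N - 1;
            double tmp[BS + 2];             /* block-private halo buffer */

            /* step t: update the block plus one halo cell on each side */
            int elo = lo > 1 ? lo - 1 : 1;
            int ehi = hi < N - 1 ? hi + 1 : N - 1;
            for (int i = elo; i < ehi; i++)
                tmp[i - (lo - 1)] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            if (lo == 1)     tmp[0] = a[0];                 /* fixed boundary */
            if (hi == N - 1) tmp[hi - (lo - 1)] = a[N - 1]; /* fixed boundary */

            /* step t+1: update the block proper from the halo buffer */
            for (int i = lo; i < hi; i++)
                b[i] = (tmp[i - lo] + tmp[i - lo + 1] + tmp[i - lo + 2]) / 3.0;
        }
    }

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (double)i;
        b[0] = a[0]; b[N - 1] = a[N - 1];   /* boundaries never change */

        for (int t = 0; t < 100; t += 2) {  /* 100 time steps, 50 fused sweeps */
            fused_two_steps(a, b);
            memcpy(a, b, sizeof a);
        }
        printf("a[N/2] = %f\n", a[N / 2]);
        return 0;
    }
    ```

    Each sweep reads the grid once but advances it by two steps; deeper pipelines fuse more steps at the cost of wider halos, which is the trade-off the paper's approach manages across shared caches.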

    Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core Cluster

    Synchronization is likely the most critical performance killer in shared-memory parallel programs. With the rise of multi-core and many-core processors, the relative impact of synchronization on performance and energy overhead is bound to grow. This paper focuses on barrier synchronization for TeraPool, a cluster of 1024 RISC-V processors with non-uniform memory access to a tightly coupled 4MB shared L1 data memory. We compare the synchronization strategies available in other multi-core and many-core clusters to identify the optimal native barrier kernel for TeraPool. We benchmark a set of optimized barrier implementations and evaluate their performance in the framework of the widespread fork-join, OpenMP-style programming model. We test parallel kernels from the signal-processing and telecommunications domains, achieving less than 10% synchronization overhead over the total runtime for problems that fit TeraPool's L1 memory. By fine-tuning our tree barriers, we achieve a 1.6x speed-up with respect to a naive central counter barrier and just 6.2% overhead on a typical 5G application that includes a challenging multistage synchronization kernel. To our knowledge, this is the first work in which shared-memory barriers are used to synchronize a thousand processing elements tightly coupled to a shared data memory.
    Comment: 15 pages, 7 figures
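    For reference, the naive central counter barrier that serves as the paper's baseline can be sketched as a sense-reversing counter barrier. The version below uses C11 atomics and POSIX threads as a stand-in for TeraPool's bare-metal RISC-V cores; the thread count and kernel structure are illustrative assumptions.

    ```c
    /* Sense-reversing central-counter barrier: every thread increments one
     * shared counter, and the last arrival flips a shared "sense" flag that
     * releases the spinners. Contention on the single counter is what the
     * paper's tuned tree barriers avoid. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 8

    static atomic_int count = 0;               /* arrivals so far            */
    static atomic_int sense = 0;               /* flips once per barrier use */
    static _Thread_local int local_sense = 0;

    static void barrier_wait(void)
    {
        local_sense = !local_sense;                 /* sense expected after release */
        if (atomic_fetch_add(&count, 1) == NTHREADS - 1) {
            atomic_store(&count, 0);                /* last arrival resets...   */
            atomic_store(&sense, local_sense);      /* ...and releases everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                                   /* spin on the shared flag */
        }
    }

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        for (int step = 0; step < 3; step++) {
            /* ... parallel kernel work for this step would go here ... */
            barrier_wait();
            if (id == 0)
                printf("all %d threads passed barrier %d\n", NTHREADS, step);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int id[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }
    ```

    A tree barrier replaces the single counter with a tree of small counters so that only a few cores ever contend on the same memory location, which is where the reported 1.6x speed-up over this scheme comes from.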

    MP-LOCKs: Replacing hardware synchronization primitives with message passing

    Shared memory programs guarantee the correctness of concurrent accesses to shared data using interprocessor synchronization operations. The most common synchronization operators are locks, which are traditionally implemented in user-level libraries via a mix of shared memory accesses and hardware synchronization primitives like test-and-set. In this paper, we argue that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware. We propose three message passing lock (MP-LOCK) algorithms (centralized, distributed, and reactive) and provide guidelines for implementing them efficiently. MP-LOCKs reduce the design complexity and runtime occupancy of DSM controllers and can exploit software's inherent flexibility to adapt to differing applications' lock access patterns. We compared the performance of MP-LOCKs with two common shared memory lock algorithms, test-and-set and MCS locks, and found that MP-LOCKs scale better. For machines with 16 to 32 nodes, applications using MP-LOCKs ran up to 186% faster than the same applications with shared memory locks. For small systems (up to 8 nodes), MP-LOCK performance lags shared memory lock performance due to the higher software overhead. However, three of the MP-LOCK applications slowed down by no more than 18%, while the other two slowed by no more than 180%. Given these results, we conclude that locks based on message passing should be considered as a replacement for hardware locks in future scalable multiprocessors that support efficient message passing mechanisms. In addition, efficient software synchronization primitives can be implemented in clusters of workstations by following the guidelines we propose.
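    As a point of reference, below is a minimal sketch of the test-and-set spinlock that is one of the paper's shared-memory baselines, written with C11 atomics and POSIX threads. An MP-LOCK would instead send acquire/release messages to a kernel-embedded lock manager; that protocol is not reproduced here.

    ```c
    /* Test-and-set spinlock: the classic hardware-primitive baseline that
     * MP-LOCKs are compared against. Under contention every spinning core
     * hammers the same cache line, which is the scalability problem that
     * message-passing locks (and MCS queue locks) sidestep. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static long counter = 0;

    static void ts_acquire(void)
    {
        /* spin until test-and-set observes the flag clear */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;
    }

    static void ts_release(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            ts_acquire();
            counter++;                      /* critical section */
            ts_release();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
        return 0;
    }
    ```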

    Improvements in Hardware Transactional Memory for GPU Architectures

    In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and programmers use it to speed up their applications thanks to its low latency. Prior work from the authors proposed a lightweight hardware TM (HTM) support based on the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key to providing an effective synchronization mechanism at the local memory level. After a brief description of the main features of our HTM design for GPU local memory, in this work we gather a number of proposals aimed at improving those mechanisms that have a high impact on performance. First, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Second, the conflict detection mechanism is optimized depending on application characteristics, such as the read/write sets, the probability of conflict between transactions, and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is the task of the compiler and runtime to determine which ones matter most for a given application. This work includes a discussion of the analysis needed to choose the best configuration.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
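    The paper's conflict detection lives in GPU hardware, but the principle can be illustrated in software. The sketch below is a transactional-mutex-lock (TML) style scheme in C: a single global versioned lock serializes writers, and readers abort and retry when they observe the version change mid-transaction. Everything here (the names, the two-account example) is our illustrative analogue, not the authors' mechanism.

    ```c
    /* TML-style software transaction: eager, single-version conflict detection.
     * glock even => no writer active; odd => one writer has exclusive access.
     * Readers validate against the version seen at tx_begin and retry the whole
     * transaction on any conflict -- forward progress by serialization, which is
     * also how conflicting GPU transactions must be handled. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long glock = 0;

    typedef struct { long snap; int writer; } tx;

    static void tx_begin(tx *t)
    {
        long v;
        do { v = atomic_load(&glock); } while (v & 1);  /* wait out writers */
        t->snap = v;
        t->writer = 0;
    }

    /* each returns 0 on conflict: caller must restart the transaction */
    static int tx_read(tx *t, const atomic_long *addr, long *out)
    {
        *out = atomic_load_explicit(addr, memory_order_relaxed);
        return t->writer || atomic_load(&glock) == t->snap;
    }

    static int tx_write(tx *t, atomic_long *addr, long val)
    {
        if (!t->writer) {                    /* first write claims the lock */
            long expect = t->snap;
            if (!atomic_compare_exchange_strong(&glock, &expect, t->snap + 1))
                return 0;                    /* another writer won: conflict */
            t->writer = 1;
        }
        atomic_store_explicit(addr, val, memory_order_relaxed);
        return 1;
    }

    static void tx_commit(tx *t)
    {
        if (t->writer)
            atomic_store(&glock, t->snap + 2);  /* release + new version */
    }

    static atomic_long acct_a = 100, acct_b = 0;

    static void *mover(void *arg)           /* atomically move 1 from a to b */
    {
        (void)arg;
        for (int i = 0; i < 10000; i++) {
            tx t; long a, b;
    retry:
            tx_begin(&t);
            if (!tx_read(&t, &acct_a, &a)) goto retry;
            if (!tx_read(&t, &acct_b, &b)) goto retry;
            if (!tx_write(&t, &acct_a, a - 1)) goto retry;
            tx_write(&t, &acct_b, b + 1);   /* cannot fail once writer */
            tx_commit(&t);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, mover, NULL);
        pthread_create(&t2, NULL, mover, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a + b = %ld (expected 100)\n",
               atomic_load(&acct_a) + atomic_load(&acct_b));
        return 0;
    }
    ```

    Read-only transactions never take the lock at all, which hints at why detecting them, as the paper proposes, pays off.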

    Shared-Semaphored Cache Implementation for Parallel Program Execution in Multi-Core Systems

    The paper brings forward the idea of multi-threaded computation synchronization based on a shared semaphored cache in multi-core CPUs. It is dedicated to the implementation of multi-core PLC control, embedded solutions, or parallel computation of models described using hardware description languages. The shared semaphored cache is implemented as guarded memory cells within a dedicated section of the cache memory that is shared by multiple cores. This enables the cores to speed up data exchange and seamlessly synchronize computation. The idea has been verified by creating a multi-core system model in Verilog HDL. Simulation of task synchronization methods demonstrates the benefits of shared semaphored memory cells over standard synchronization methods. The proposed idea enhances computation in algorithms that consist of relatively short tasks that can be processed in parallel and that require fast synchronization mechanisms to avoid data race conditions.
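    As a rough software analogue, a guarded cell can be modeled as a memory word with a full/empty flag: a write blocks until the cell is empty, a read blocks until it is full, so data exchange and synchronization happen in one step. This single-producer/single-consumer sketch in C is our reading of the idea; the paper itself models the mechanism in Verilog HDL inside the shared cache.

    ```c
    /* Guarded (semaphored) cell modeled in software: one word plus a
     * full/empty flag. Single producer, single consumer assumed; the
     * hardware version would stall the core instead of spinning. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct {
        atomic_int full;    /* 0: empty, 1: holds a value */
        int value;
    } guarded_cell;

    static void cell_put(guarded_cell *c, int v)
    {
        while (atomic_load_explicit(&c->full, memory_order_acquire))
            ;                               /* wait until consumer drained it */
        c->value = v;
        atomic_store_explicit(&c->full, 1, memory_order_release);
    }

    static int cell_get(guarded_cell *c)
    {
        while (!atomic_load_explicit(&c->full, memory_order_acquire))
            ;                               /* wait until producer filled it */
        int v = c->value;
        atomic_store_explicit(&c->full, 0, memory_order_release);
        return v;
    }

    static guarded_cell cell = { 0, 0 };

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 1; i <= 5; i++)
            cell_put(&cell, i * i);         /* results of short tasks */
        return NULL;
    }

    int main(void)
    {
        pthread_t p;
        pthread_create(&p, NULL, producer, NULL);
        for (int i = 0; i < 5; i++)
            printf("got %d\n", cell_get(&cell));  /* main consumes */
        pthread_join(p, NULL);
        return 0;
    }
    ```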

    On the Power of Multi-Objects

    In the standard "single-object" model of shared-memory computing, it is assumed that a process accesses at most one shared object in each of its steps. In this paper, we consider a more powerful variant, the "multi-object" model, in which each process may access *any* finite number of shared objects atomically in each of its steps. We present results that relate the synchronization power of a type in the multi-object model to its synchronization power in the single-object model. Although the types fetch&add and swap have the same synchronization power in the single-object model, Afek, Merritt, and Taubenfeld showed that their synchronization powers differ in the multi-object model. We prove that this divergence phenomenon is exhibited *only* by types at levels 1 and 2; all types at higher levels have the same, unbounded, synchronization power in the multi-object model. This paper identifies all possible relationships between a type's synchronization power in the single-object model and its synchronization power in the multi-object model.
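    To make the model concrete: in a multi-object step, a process applies operations to several shared objects as one atomic action. Commodity hardware has no such primitive, so the sketch below simulates it in C by serializing steps with a global lock; the two-counter fetch&add example and all names are ours, purely to illustrate the semantics.

    ```c
    /* Multi-object step, simulated: one step atomically applies fetch&add
     * to TWO shared objects. Two separate single-object fetch&adds could be
     * interleaved by other processes; the multi-object step cannot, so the
     * invariant obj_a == obj_b holds at every step boundary. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t step_lock = PTHREAD_MUTEX_INITIALIZER;
    static long obj_a = 0, obj_b = 0;    /* two shared fetch&add objects */

    /* one atomic multi-object step: fetch&add on both objects */
    static void multi_faa(long da, long db, long *olda, long *oldb)
    {
        pthread_mutex_lock(&step_lock);
        *olda = obj_a; obj_a += da;
        *oldb = obj_b; obj_b += db;
        pthread_mutex_unlock(&step_lock);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        long a, b;
        for (int i = 0; i < 100000; i++)
            multi_faa(1, 1, &a, &b);     /* a == b on every observation */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("obj_a = %ld, obj_b = %ld\n", obj_a, obj_b);
        return 0;
    }
    ```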