446 research outputs found

Reducing consistency traffic and cache misses in the avalanche multiprocessor

Author: Carter John B.
Kuramkote Ravindra
Publication venue: University of Utah
Publication date: 01/01/1995

Journal Article. For a parallel architecture to scale effectively, communication latency between processors must be avoided. We have found that the source of a large number of avoidable cache misses is the use of hardwired write-invalidate coherency protocols, which often exhibit high cache miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer.
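
The abstract's claim that write-invalidate protocols cause avoidable misses through repeated invalidation and reloading can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not Avalanche's protocol or simulator: it tracks a single shared cache line, and the function name `count_misses` and the trace format are hypothetical.

```python
def count_misses(trace, protocol):
    """Count cache misses for accesses to ONE shared line (toy model).

    trace: list of (cpu, op) pairs, op being "r" (read) or "w" (write).
    protocol: "invalidate" or "update". Both are simplified assumptions.
    """
    if protocol not in ("invalidate", "update"):
        raise ValueError("unknown protocol: " + protocol)
    valid = set()  # CPUs that currently hold a valid copy of the line
    misses = 0
    for cpu, op in trace:
        if cpu not in valid:
            misses += 1      # the line must be fetched before the access
            valid.add(cpu)
        if op == "w" and protocol == "invalidate":
            valid = {cpu}    # every other copy is invalidated
        # Under "update", other copies receive the new value and stay valid.
    return misses


# Producer (CPU 0) writes, two consumers (CPUs 1 and 2) read, four rounds.
trace = [(0, "w"), (1, "r"), (2, "r")] * 4
print(count_misses(trace, "invalidate"))  # 9
print(count_misses(trace, "update"))      # 3
```

In this producer-consumer trace, every write under the invalidate model forces both consumers to reload the line, while the update model only pays the three cold misses. The model ignores update traffic, capacity misses, and consistency ordering, all of which matter in a real design.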

Smartlocks: Self-Aware Synchronization through Lock Acquisition Scheduling

Author: Agarwal Anant
Eastep Jonathan
Santambrogio Marco D.
Wingate David
Publication date: 09/11/2009

As multicore processors become increasingly prevalent, system complexity is skyrocketing. The advent of the asymmetric multicore compounds this -- it is no longer practical for an average programmer to balance the system constraints associated with today's multicores and worry about new problems like asymmetric partitioning and thread interference. Adaptive, or self-aware, computing has been proposed as one method to help application and system programmers confront this complexity. These systems take some of the burden off programmers by monitoring themselves and optimizing or adapting to meet their goals. This paper introduces an open-source self-aware synchronization library for multicores and asymmetric multicores called Smartlocks. Smartlocks is a spin-lock library that adapts its internal implementation during execution using heuristics and machine learning to optimize toward a user-defined goal, which may relate to performance, power, or other problem-specific criteria. Smartlocks builds upon adaptation techniques from prior work like reactive locks, but introduces a novel form of adaptation designed for asymmetric multicores that we term lock acquisition scheduling. Lock acquisition scheduling optimizes which waiter will get the lock next for the best long-term effect when multiple threads (or processes) are spinning for a lock. Our results demonstrate empirically that lock scheduling is important for asymmetric multicores and that Smartlocks significantly outperform conventional and reactive locks for asymmetries like dynamic variations in processor clock frequencies caused by thermal throttling events.

Archivio istituzionale della ricerca - Politecnico di Milano
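
To make the idea of lock acquisition scheduling in the Smartlocks abstract concrete, here is a minimal single-threaded simulation. It is a sketch under stated assumptions: all waiters arrive at time zero, each waiter's critical-section duration is known in advance, and the "fastest first" policy is a hand-written heuristic standing in for the learned policy the abstract mentions. The names `mean_wait`, `fifo`, and `fastest_first` are hypothetical and are not part of the Smartlocks library.

```python
def mean_wait(cs_times, pick_next):
    """Average time waiters spend before they acquire the lock.

    cs_times: critical-section duration per waiter (slow cores take longer).
    pick_next: policy returning the index of the next waiter to be served.
    """
    waiting = list(cs_times)
    clock = 0.0
    total_wait = 0.0
    while waiting:
        i = pick_next(waiting)
        total_wait += clock      # the chosen waiter has waited until now
        clock += waiting.pop(i)  # it then holds the lock for its section
    return total_wait / len(cs_times)


def fifo(waiting):
    # Conventional queue lock: serve waiters in arrival order.
    return 0


def fastest_first(waiting):
    # Assumed heuristic: hand the lock to the shortest critical section.
    return min(range(len(waiting)), key=waiting.__getitem__)


cs_times = [4.0, 1.0, 2.0]  # e.g. one throttled core and two fast cores
print(mean_wait(cs_times, fifo))           # 3.0
print(mean_wait(cs_times, fastest_first))  # 1.333...
```

The order in which waiters are served changes the mean waiting time even though the total work is identical, which is the effect a scheduling lock tries to exploit. A real policy would also have to account for fairness and for goals other than latency.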

Towards fair, scalable, locking

Author: Beivide Palacio Ramon
Cristal Kestelman Adrián
Harris Tim
Sanyal Sutirtha
Unsal Osman Sabri
Valero Cortés Mateo
Vallejo Enrique
Vallejo Fernando
Publication date: 01/01/2008

Without care, Hardware Transactional Memory presents several performance pathologies that can degrade its performance. Among them, writers of commonly read variables can suffer from starvation. Though different solutions have been proposed for HTM systems, hybrid systems can still suffer from this performance problem, given that software transactions don’t interact with the mechanisms used by hardware to avoid starvation. In this paper we introduce a new per-directory-line hardware contention management mechanism that allows fairer access between both software and hardware threads without the need to abort any transaction. Our mechanism is based on “reserving” directory lines, implementing a limited fair queue for the requests on that line. We adapt the mechanism to the LogTM conflict detection mechanism and show that the resulting proposal is deadlock free. Finally, we sketch how the idea could be applied more generally to reader-writer locks. Postprint (published version)

Avalanche: A communication and memory architecture for scalable parallel computing

Author: Carter John B.
Kuo Chen-Chi
Publication venue: University of Utah
Publication date: 01/01/1995

Technical report. As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end-to-end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%.

Avalanche: A communication and memory architecture for scalable parallel computing

Author: Carter John B.
Davis Al
Publication venue: University of Utah
Publication date: 01/01/1995

Technical report. As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end-to-end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%.

Scalable Synchronization with Mindicators

Author: Liu Yujie
McNamara Logan
Publication venue: Lehigh Preserve

The Mindicator is a shared object that stores one value for each thread in a system, and can return the minimum of all threads' values in constant time. In this paper, we explore applications of the Mindicator in synchronization algorithms. We introduce three new algorithms, designed for scalable Read-Copy-Update (RCU), fair Readers-Writer locking, and Group Mutual Exclusion. Experimental evaluation shows these algorithms to perform well while avoiding contention.

Lehigh University: Lehigh Preserve
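
The interface described in the Mindicator abstract (one value per thread, minimum returned in constant time) can be sketched with a tournament tree whose root caches the current minimum. The version below is a lock-based stand-in written for illustration only; it does not reproduce the scalable algorithm from the paper, and the class name `MinTracker` and its method names are assumptions.

```python
import threading


class MinTracker:
    """Lock-based sketch of a Mindicator-like object (hypothetical API)."""

    def __init__(self, n_threads):
        self.size = 1
        while self.size < n_threads:
            self.size *= 2
        # Heap-style array: tree[1] is the root, leaves start at self.size.
        self.tree = [float("inf")] * (2 * self.size)
        self.lock = threading.Lock()

    def arrive(self, tid, value):
        """Publish value for thread tid and refresh minima up to the root."""
        with self.lock:
            i = self.size + tid
            self.tree[i] = value
            i //= 2
            while i >= 1:
                self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
                i //= 2

    def depart(self, tid):
        """Withdraw thread tid's value (infinity marks an empty slot)."""
        self.arrive(tid, float("inf"))

    def query(self):
        """Return the minimum of all published values by reading the root."""
        return self.tree[1]


m = MinTracker(4)
m.arrive(0, 7)
m.arrive(2, 3)
print(m.query())  # 3
m.depart(2)
print(m.query())  # 7
```

Queries read a single cell, while each update walks one leaf-to-root path, so its cost grows logarithmically with the number of threads. The single global lock used here is exactly the contention point a scalable design would have to avoid.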

Constant RMR Solutions to Reader Writer Synchronization

Author: Bhatt Vibhor
Jayanti Prasad
Publication venue: Dartmouth Digital Commons
Publication date: 26/02/2010

We study Reader-Writer Exclusion, a well-known variant of the Mutual Exclusion problem where processes are divided into two classes--readers and writers--and multiple readers can be in the Critical Section (CS) at the same time, although no process may be in the CS at the same time as a writer. Since readers don't conflict with each other, they should not obstruct each other. Specifically, the concurrent entering property must be satisfied: if all writers are in the remainder section, each reader should be able to enter the CS in a bounded number of its own steps. Three versions of the Reader-Writer Exclusion problem are commonly studied--one where writers have priority over readers, another where readers have priority, and the last where neither class has priority over the other and no process may starve. To ensure high performance on Cache-Coherent (CC) and Distributed Shared Memory (DSM) multiprocessors, algorithms should be designed to generate as few remote memory references (RMRs) as possible. The ideal would be to achieve constant RMR complexity, i.e., the worst case number of RMRs that a process generates in order to enter and exit the CS once is a constant, independent of the number of processes. Constant RMR complexity algorithms have existed for Mutual Exclusion for two decades, but none exists for Reader-Writer Exclusion. Danek and Hadzilacos' lower bound proof implies that it is impossible to achieve sublinear RMR complexity for DSM machines. For CC machines, the best existing bound, also due to Danek and Hadzilacos, is O(log n), where n is the number of processes. In this work, we present the first constant RMR complexity algorithms for all three versions of the Reader-Writer Exclusion problem (for CC machines).
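
The exclusion property stated in this abstract (many readers may share the critical section, a writer must be alone) can be shown with a small blocking lock. This is a minimal sketch of the reader-priority version only, built on Python's `threading.Condition`; it makes no claim about RMR complexity and is not the algorithm from the paper. The class name `ReaderWriterLock` and its methods are hypothetical.

```python
import threading


class ReaderWriterLock:
    """Reader-priority reader-writer lock (illustrative; writers may starve)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0     # number of readers inside the critical section
        self._writer = False  # True while a writer is inside

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may now enter

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


lock = ReaderWriterLock()
counter = 0


def writer():
    global counter
    for _ in range(1000):
        lock.acquire_write()
        counter += 1  # only one writer can be here at a time
        lock.release_write()


threads = [threading.Thread(target=writer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

When no writer is active, a reader enters after one uncontended acquisition of the internal condition lock, which loosely mirrors the concurrent entering property. Every thread still synchronizes on the same shared state, though, so this sketch says nothing about how many remote memory references an entry costs.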