8 research outputs found

Reducing consistency traffic and cache misses in the avalanche multiprocessor

Author: Carter John B.
Kuramkote Ravindra
Publication venue: University of Utah
Publication date: 01/01/1995

Journal Article

For a parallel architecture to scale effectively, communication latency between processors must be avoided. We have found that the source of a large number of avoidable cache misses is the use of hardwired write-invalidate coherency protocols, which often exhibit high cache miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer.
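The abstract's point about invalidations forcing shared data to be reloaded can be illustrated with a toy model. The sketch below is not the Avalanche simulator or its protocol: it tracks a single shared cache line, counts only read misses, and ignores the cost of update messages. The function name `count_read_misses` and the producer/consumer trace are hypothetical, chosen only for illustration.

```python
def count_read_misses(trace, protocol):
    """Count read misses on one shared cache line.

    trace:    list of (node, op) pairs, op is "R" (read) or "W" (write).
    protocol: "invalidate" or "update" (toy models, assumed for illustration).
    """
    if protocol not in ("invalidate", "update"):
        raise ValueError(f"unknown protocol: {protocol}")
    valid = set()  # nodes that currently hold a valid copy of the line
    misses = 0
    for node, op in trace:
        if op == "R":
            if node not in valid:
                misses += 1      # copy absent or invalidated: reload it
                valid.add(node)
        elif op == "W":
            if protocol == "invalidate":
                valid = {node}   # all other copies are invalidated
            else:
                valid.add(node)  # other copies are updated in place and stay valid
        else:
            raise ValueError(f"unknown operation: {op}")
    return misses


# Hypothetical sharing pattern: node 0 writes, nodes 1 and 2 read, four rounds.
trace = [(0, "W"), (1, "R"), (2, "R")] * 4
print(count_read_misses(trace, "invalidate"))  # 8
print(count_read_misses(trace, "update"))      # 2
```

In this pattern every write under the invalidate model forces both readers to miss again, while under the update model they miss only on first touch. A real comparison would also have to charge the update model for the messages it sends on every write.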

The University of Utah: J. Willard Marriott Digital Library

Evaluating the potential of programmable multiprocessor cache controllers

Author: Carter John B.
Hibler Mike
Publication venue: University of Utah
Publication date: 01/01/1994

Technical report

The next generation of scalable parallel systems (e.g., machines by KSR, Convex, and others) will have shared memory supported in hardware, unlike most current generation machines (e.g., offerings by Intel, nCube, and Thinking Machines). However, current shared memory architectures are constrained by the fact that their cache controllers are hardwired and inflexible, which limits the range of programs that can achieve scalable performance. This observation has led a number of researchers to propose building programmable multiprocessor cache controllers that can implement a variety of caching protocols, support multiple communication paradigms, or accept guidance from software. To evaluate the potential performance benefits of these designs, we have simulated five SPLASH benchmark programs on a virtual multiprocessor that supports five directory-based caching protocols. When we compared the off-line optimal performance of this design, wherein each cache line was maintained using the protocol that required the least communication, with the performance achieved when using a single protocol for all lines, we found that use of the "optimal" protocol reduced consistency traffic by 10-80%, with a mean improvement of 25-35%. Cache miss rates also dropped by up to 25%. Thus, the combination of programmable (or tunable) hardware and software able to exploit this added flexibility, e.g., via user pragmas or compiler analysis, could dramatically improve the performance of future shared memory multiprocessors.
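The off-line optimal comparison described in this abstract can be sketched as a small calculation. This is a minimal sketch under stated assumptions, not the report's simulator: the per-line message counts are invented for illustration, the protocol names are placeholders, and the baseline is assumed to be the best single protocol applied to all lines. The name `traffic_reduction` is hypothetical.

```python
def traffic_reduction(cost):
    """Compare per-line optimal protocol choice with the best single protocol.

    cost[line][protocol] = messages needed to keep `line` coherent under
    `protocol`. Every line must list the same protocols, and the best single
    protocol must generate nonzero traffic.
    """
    protocols = sorted({p for per_line in cost.values() for p in per_line})
    # Total traffic if one hardwired protocol is used for every line.
    single = {p: sum(per_line[p] for per_line in cost.values()) for p in protocols}
    best_single = min(single, key=single.get)
    # Off-line optimal: each line uses whichever protocol costs it the least.
    choice = {line: min(per_line, key=per_line.get) for line, per_line in cost.items()}
    optimal = sum(cost[line][p] for line, p in choice.items())
    reduction = 1 - optimal / single[best_single]
    return best_single, choice, reduction


# Invented message counts for three cache lines (not data from the report).
cost = {
    "line_a": {"invalidate": 10, "update": 40},
    "line_b": {"invalidate": 50, "update": 20},
    "line_c": {"invalidate": 25, "update": 30},
}
best_single, choice, reduction = traffic_reduction(cost)
print(best_single, choice, f"{reduction:.1%}")
```

The choice is "off-line" because it needs the full per-line cost table in advance, which is only available after the run; it gives an upper bound on what software guidance or a tunable controller could achieve.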

The University of Utah: J. Willard Marriott Digital Library

Avalanche: A communication and memory architecture for scalable parallel computing

Author: Carter John B.
Kuo Chen-Chi
Publication venue: University of Utah
Publication date: 01/01/1995

Technical report

As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top-level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC, to include a new multi-level context-sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end-to-end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by … and overall execution times by …

The University of Utah: J. Willard Marriott Digital Library

Avalanche: A communication and memory architecture for scalable parallel computing

Author: Carter John B.
Davis Al
Publication venue: University of Utah
Publication date: 01/01/1995

Technical report

As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top-level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context-sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end-to-end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%.

The University of Utah: J. Willard Marriott Digital Library

Memory sharing predictor: the key to a speculative coherent DSM

Author: Falsafi Babak
Lai An-Chow
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 06/04/2009

Recent research advocates using general message predictors to learn and predict the coherence activity in distributed shared memory (DSM). By accurately predicting a message and invoking the necessary coherence actions in a timely manner, a DSM can hide much of the remote access latency. This paper proposes Memory Sharing Predictors (MSPs), pattern-based predictors that significantly improve prediction accuracy and implementation cost over general message predictors. An MSP is based on the key observation that to hide the remote access latency, a predictor must accurately predict only the remote memory accesses (i.e., request messages) and not the subsequent coherence messages invoked by an access. Simulation results indicate that MSPs improve prediction accuracy over general message predictors from 81% to 93% while requiring less storage overhead. This paper also presents the first design and evaluation of a speculative coherent DSM using pattern-based predictors. We identify simple techniques and mechanisms to trigger predictions in a timely manner and perform speculation for remote read accesses. Our speculation hardware readily works with a conventional full-map write-invalidate coherence protocol without any modifications. Simulation results indicate that performing speculative read requests alone reduces execution times by 12% in our shared-memory application
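The idea of a pattern-based predictor that learns only request messages can be sketched in a few lines. This is a generic history-table predictor written for illustration, not the MSP design from the paper: the table layout, the history depth of 2, the `(type, node)` encoding of a request, and the names `PatternPredictor` and `accuracy` are all assumptions.

```python
from collections import defaultdict, deque


class PatternPredictor:
    """Toy pattern-based predictor for remote requests to a memory block."""

    def __init__(self, depth=2):
        self.depth = depth
        # Per block: the last `depth` requests seen (requests only, no acks
        # or invalidations, following the abstract's key observation).
        self.history = defaultdict(lambda: deque(maxlen=depth))
        # (block, history) -> the request that followed that history last time.
        self.pattern = {}

    def predict(self, block):
        """Return the predicted next request, or None if no pattern is known."""
        return self.pattern.get((block, tuple(self.history[block])))

    def observe(self, block, request):
        """Record an actual request and learn what followed the old history."""
        hist = self.history[block]
        if len(hist) == self.depth:
            self.pattern[(block, tuple(hist))] = request
        hist.append(request)


def accuracy(predictor, trace):
    """Return (correct predictions, predictions made) over a request trace."""
    correct = predicted = 0
    for block, request in trace:
        guess = predictor.predict(block)
        if guess is not None:
            predicted += 1
            correct += guess == request
        predictor.observe(block, request)
    return correct, predicted


# Hypothetical producer/consumer sharing: node 0 writes, nodes 1 and 2 read.
trace = [(0x40, ("W", 0)), (0x40, ("R", 1)), (0x40, ("R", 2))] * 5
print(accuracy(PatternPredictor(depth=2), trace))  # (10, 10)
```

On this perfectly repeating trace the predictor needs one full round plus the history depth to warm up, after which every prediction is correct. Real sharing patterns are noisier, which is where accuracy figures such as those quoted in the abstract come from.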

Infoscience - École polytechnique fédérale de Lausanne

A two-level directory architecture for highly scalable cc-NUMA multiprocessors

Author: J. Duato
J. Gonzalez
J.M. Garcia
M.E. Acacio
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration

Author: J. Duato
J. Gonzalez
J.M. Garcia
M.E. Acacio
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Dynamic computation migration in distributed shared memory systems

Author: Hsieh Wilson Cheng-Yi
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1995

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Vita. Includes bibliographical references (p. 123-131). By Wilson Cheng-Yi Hsieh. Ph.D.

CiteSeerX

DSpace@MIT