745 research outputs found

    Exploring the value of supporting multiple DSM protocols in Hardware DSM Controllers

    Get PDF
    Journal ArticleThe performance of a hardware distributed shared memory (DSM) system is largely dependent on its architect's ability to reduce the number of remote memory misses that occur. Previous attempts to solve this problem have included measures such as supporting both the CC-NUMA and S-COMA architectures is the same machine and providing a programmable DSM controller that can emulate any DSM mechanism. In this paper we first present the design of a DSM controller that supports multiple DSM protocols in custom hardware, and allows the programmer or compiler to specify on a per-variable basis what protocol to use to keep that variable coherent. This simulated performance of this DSM controller compares favorably with that of conventional single-protocol custom hardware designs, often outperforming the conventional systems by a factor of two. To achieve these promising results, that multi-protocol DSM controller needed to support only two DSM architectures (CC-NUMA and S-COMA) and three coherency protocols (both release and sequentially consistent write invalidate and release consistent write update). This work demonstrates the value of supporting a degree of flexibility in one's DSM controller design and suggests what operations such a flexible DSM controller should support

    Flexible compiler-managed L0 buffers for clustered VLIW processors

    Get PDF
    Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache increase leading to an important performance impact. In this paper, we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible to decide which memory instructions make us of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.Peer ReviewedPostprint (published version

    Reducing consistency traffic and cache misses in the avalanche multiprocessor

    Get PDF
    Journal ArticleFor a parallel architecture to scale effectively, communication latency between processors must be avoided. We have found that the source of a large number of avoidable cache misses is the use of hardwired write-invalidate coherency protocols, which often exhibit high cache miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer

    Avalanche: A communication and memory architecture for scalable parallel computing

    Get PDF
    technical reportAs the gap between processor and memory speeds widens?? system designers will inevitably incorpo rate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance At the same time?? most communication subsystems are permitted access only to main memory and not a processor s top level cache As memory latencies increase?? this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit e ective scalability In the Avalanche project we are re designing the memory architecture of a commercial RISC multiprocessor?? the HP PA RISC ?? to include a new multi level context sensitive cache that is tightly coupled to the communication fabric The primary goal of Avalanche s integrated cache and communication controller is attack ing end to end communication latency in all of its forms This includes cache misses induced by excessive invalidations and reloading of shared data by write invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache An execution driven simulation study of Avalanche s architecture indicates that it can reduce cache stalls by and overall execution times b

    RWMSI (Read Exclusive Write Exclusive Modified Shared Invalid) Cache Coherence Protocol

    Get PDF
    This paper proposes a novel coherence protocol RWMSI (Read exclusive Write exclusive Modified Shared Invalid) that merges “snooping and directory – based coherence protocols “and enhanced them depending on the state of “MESI snooping protocol “. “ Coherence” is implemented with “snooping or directory based protocols “. Because of the shared bus the “ Snooping protocols “ are not scalable , while directory protocols incur directory storage overhead , frequent indirections , and are more prone to design bugs

    Avalanche: A communication and memory architecture for scalable parallel computing

    Get PDF
    technical reportAs the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end to end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%

    A Shared memory multiprocessor system architecture utilizing a uniform

    Get PDF
    Due to VLSI lithography problems and the limitation of additional architectural enhancements uniprocessor systems are nearing the end of their life cycle. Therefore, it is believed that Symmetric Multiprocessing (SMP) systems will be the next mainstream computer. These systems allow multiple processors, accessing the same memory image, to cooperate on a number of computational tasks as a single entity. While multiprocessor systems can offer a substantial performance increase compared to uniprocessor systems, major design considerations must be addressed to achieve desired system efficiency levels. Managing cache coherence is a significant problem in multiprocessor systems. Current implementations cope with this problem by utilizing a cache coherence protocol. This protocol puts a large amount of overhead on the system bus to ensure proper program execution, effectively decreasing overall system performance. This thesis approaches the cache coherence problem from a new angle. Instead of utilizing a cache coherence protocol, a new memory system is proposed which eliminates the need for a cache coherence protocol, by utilizing a shared level 2 data-only cache. This new architecture allows for better utilization of the system and improved performance and scalability. A data rate analysis is conducted to demonstrate the potential performance increase from the proposed architecture over conventional approaches. The data rate model clearly shows an increase in system performance and utilization when using the architecture proposed in this thesis

    Proximity coherence for chip-multiprocessors

    Get PDF
    Many-core architectures provide an efficient way of harnessing the growing numbers of transistors available in modern fabrication processes; however, the parallel programs run on these platforms are increasingly limited by the energy and latency costs of communication. Existing designs provide a functional communication layer but do not necessarily implement the most efficient solution for chip-multiprocessors, placing limits on the performance of these complex systems. In an era of increasingly power limited silicon design, efficiency is now a primary concern that motivates designers to look again at the challenge of cache coherence. The first step in the design process is to analyse the communication behaviour of parallel benchmark suites such as Parsec and SPLASH-2. This thesis presents work detailing the sharing patterns observed when running the full benchmarks on a simulated 32-core x86 machine. The results reveal considerable locality of shared data accesses between threads with consecutive operating system assigned thread IDs. This pattern, although of little consequence in a multi-node system, corresponds to strong physical locality of shared data between adjacent cores on a chip-multiprocessor platform. Traditional cache coherence protocols, although often used in chip-multiprocessor designs, have been developed in the context of older multi-node systems. By redesigning coherence protocols to exploit new patterns such as the physical locality of shared data, improving the efficiency of communication, specifically in chip-multiprocessors, is possible. This thesis explores such a design – Proximity Coherence – a novel scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure.EPSRC DTA research scholarshi
    corecore