112 research outputs found

    Exploring the value of supporting multiple DSM protocols in Hardware DSM Controllers

    Get PDF
    Journal ArticleThe performance of a hardware distributed shared memory (DSM) system is largely dependent on its architect's ability to reduce the number of remote memory misses that occur. Previous attempts to solve this problem have included measures such as supporting both the CC-NUMA and S-COMA architectures is the same machine and providing a programmable DSM controller that can emulate any DSM mechanism. In this paper we first present the design of a DSM controller that supports multiple DSM protocols in custom hardware, and allows the programmer or compiler to specify on a per-variable basis what protocol to use to keep that variable coherent. This simulated performance of this DSM controller compares favorably with that of conventional single-protocol custom hardware designs, often outperforming the conventional systems by a factor of two. To achieve these promising results, that multi-protocol DSM controller needed to support only two DSM architectures (CC-NUMA and S-COMA) and three coherency protocols (both release and sequentially consistent write invalidate and release consistent write update). This work demonstrates the value of supporting a degree of flexibility in one's DSM controller design and suggests what operations such a flexible DSM controller should support

    MP-LOCKs: Replacing hardware synchronization primitives with message passing

    Get PDF
    Journal ArticleShared memory programs guarantee the correctness of concurrent accesses to shared data using interprocessor synchronization operations. The most common synchronization operators are locks, which are traditionally implemented in user-level libraries via a mix of shared memory accesses and hardware synchronization primitives like test-and-set. In this paper, we argue that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware. We propose three message passing lock (MP-LOCK) algorithms (centralized, distributed, and reactive) and provide guidelines for implementing them efficiently. MP-LOCKs redice tje design complexity and runtime occupancy of DSM controllers and can exploit software's inherent flexibility to adapt to differing applications lock access patterns. We compared the performance of MP-LOCKs with two common shared memory lock algorithms: test-and-set and MCS locks and found that MP-LOCKs scale better. For machines with 16 to 32 nides, applications using MP-LOCKs ran up to 186% faster than the same applications with shared memory locks. For small systems (up to 8 nodes), MP-LOCK performance lags shared memory lock performance due to the higher software overhead. However, three of the MP-LOCK applications slow down by no more than 18%, while the other two slowed by no more than 180%. Given these results, we conclude that locks based on message passing should be considered as a replacement for hardware locks in future scalable multiprocessors that supports efficient message passing mechanisms. In addition, it is possible to implement efficient software synchronization primitives in clusters of workstations by using the guidelines we proposed

    Reactive NUMA: A design for unifying S-COMA and CC-NUMA

    Get PDF
    This paper proposes and evaluates a new approach to directory-based cache coherence protocols called Reactive NUMA (R-NUMA). An R-NUMA system combines a conventional CC-NUMA coherence protocol with a more-recent Simple-COMA (S-COMA) protocol. What makes R-NUMA novel is the way it dynamically reacts to program and system behavior to switch between CC- NUMA and S-COMA and exploit the best aspects of both protocols. This reactive behavior allows each node in an R-NUMA system to independently choose the best protocol for a particular page, thus providing much greater performance stability than either CC-NUMA or S-COMA alone. Our evaluation is both qualitative and quantitative. We first show the theoretical result that R-NUMA's worst-case performance is bounded within a small constant factor (i.e., two to three times) of the best of CC-NUMA and S-COMA. We then use detailed execution-driven simulation to show that, in practice, R-NUMA usually performs better than either a pure CC-NUMA or pure S-COMA protocol, and no more than 57% worse than the best of CC-NUMA and S- COMA, for our benchmarks and base system assumptions

    Extending the reach of microprocessors : column and curious caching

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 162-167).by Derek T. Chiou.Ph.D

    Simulation models of shared-memory multiprocessor systems

    Get PDF

    Optimizing traffic in DSM clusters: fine-grain memory caching versus page migration/replication

    Get PDF
    In this paper we compare and contrast two techniques to improve capacity/conflict miss traffic in CC-NUMA DSM clusters. Page migration/replication optimizes read-write accesses to a page used by a single processor by migrating the page to that processor and replicates all read-shared pages in the sharers' local memories. R-NUMA optimizes read-write accesses to any page by allowing a processor to cache that page in its main memory. Page migration/replication requires less hardware complexity compared with R-NUMA, but has limited applicability and incurs much higher overheads even with tuned hardware/software support. In this paper we compare and contrast page migration/replication and R-NUMA on simulated clusters of symmetric multiprocessors executing shared-memory applications. Our results show that: (1) both page migration/replication and R-NUMA significantly improve the system performance over "first-touch" migration in many applications, (2) page migration/replication has limited opportunity and cannot eliminate all the capacity/conflict misses even with fast hardware support and unlimited amount of memory, (3) R-NUMA always performs best given a page cache large enough to fit an application's primary working set and subsumes page migration/replication, (4) R-NUMA benefits more from hardware support to accelerate page operations than page migration/replication, and (5) integrating page migration/replication into R-NUMA to help reduce the hardware cost requires sophisticated mechanisms and policies to select candidates for page migration/replicatio

    Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters

    Get PDF
    In this paper we compare and contrast two techniques to improve capacity/conflict miss traffic in CC-NUMA DSM clusters. Page migration/replication optimizes read-write accesses to a page used by a single processor by migrating the page to that processor and replicates all read-shared pages in the sharers' local memories. R-NUMA optimizes read- write accesses to any page by allowing a processor to cache that page in its main memory. Page migration/replication requires less hardware complexity as compared to R-NUMA, but has limited applicability and incurs much higher overheads even with timed hardware/software support. In this paper, we compare and contrast page migration/replication and R-NUMA on simulated clusters of symmetric multiprocessors executing shared-memory applications. Our results show that: (1) both page migration/replication and R-NUMA significantly improve the system performance over “first-touch” migration in many applications, (2) page migration/replication has limited opportunity and can not eliminate all the capacity/conflict misses even with fast hardware support and unlimited amount of memory, (3) R-NUMA always performs best given a page cache large enough to fit an application's primary working set and subsumes page migration/replication, (4) R-NUMA benefits more from hardware support to accelerate page operations than page migration/replication, and (5) integrating page migration/replication into R- NUMA to help reduce the hardware cost requires sophisticated mechanisms and policies to select candidates for page migration/replicatio

    Efficient techniques to provide scalability for token-based cache coherence protocols

    Full text link
    Cache coherence protocols based on tokens can provide low latency without relying on non-scalable interconnects thanks to the use of efficient requests that are unordered. However, when these unordered requests contend for the same memory block, they may cause protocols races. To resolve the races and ensure the completion of all the cache misses, token protocols use a starvation prevention mechanism that is inefficient and non-scalable in terms of required storage structures and generated traffic. Besides, token protocols use non-silent invalidations which increase the latency of write misses proportionally to the system size. All these problems make token protocols non-scalable. To overcome the main problems of token protocols and increase their scalability, we propose a new starvation prevention mechanism named Priority Requests. This mechanism resolves contention by an efficient, elegant, and flexible method based on ordered requests. Furthermore, thanks to Priority Requests, efficient techniques can be applied to limit the storage requirements of the starvation prevention mechanism, to reduce the total traffic generated for managing protocol races, and to reduce the latency of write misses. Thus, the main problems of token protocols can be solved, which, in turn, contributes to wide their efficiency and scalability.Cuesta Sáez, BA. (2009). Efficient techniques to provide scalability for token-based cache coherence protocols [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/6024Palanci
    corecore