24 research outputs found

    Memory coherence activity prediction in commercial workloads

    Recent research indicates that prediction-based coherence optimizations offer substantial performance improvements for scientific applications in distributed shared memory multiprocessors. Important commercial applications also show sensitivity to coherence latency, which will become more acute in the future as technology scales. Therefore it is important to investigate prediction of memory coherence activity in the context of commercial workloads. This paper studies a trace-based Downgrade Predictor (DGP) for predicting last stores to shared cache blocks, and a pattern-based Consumer Set Predictor (CSP) for predicting subsequent readers. We evaluate this class of predictors for the first time on commercial applications and demonstrate that our DGP correctly predicts 47%-76% of last stores. Memory sharing patterns in commercial workloads are inherently non-repetitive; hence CSP cannot attain high coverage. We perform an opportunity study of a DGP enhanced through competitive underlying predictors, and in commercial and scientific applications, demonstrate potential to increase coverage up to 14%.

    Store-Ordered Streaming of Shared Memory

    Coherence misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. Memory streaming provides a promising solution to the coherence miss bottleneck because it improves memory level parallelism and lookahead while using on-chip resources efficiently. We observe that the order in which shared data are consumed by one processor is correlated to the order in which they were produced by another. We investigate this phenomenon and demonstrate that it can be exploited to send Store-ORDered Streams (SORDS) of shared data from producers to consumers, thereby eliminating coherent read misses. Using a trace-driven analysis of all user and OS memory references in a cache-coherent distributed shared-memory multiprocessor, we show that SORDS-based memory streaming can eliminate between 36% and 100% of all coherent read misses in scientific workloads and between 23% and 48% in online transaction processing workloads.
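The store-order correlation described in this abstract can be illustrated with a toy trace model. The sketch below is not the paper's mechanism or its evaluation methodology; the function name `count_misses`, the `lookahead` parameter, and the single-producer, single-consumer setup are all assumptions made for illustration. It assumes each block is stored once by the producer and read once by the consumer, so every read would be a coherent read miss without streaming.

```python
def count_misses(store_order, read_order, lookahead):
    """Toy model: return (baseline_misses, misses_with_streaming).

    store_order: blocks in the order the producer wrote them (assumed distinct).
    read_order:  blocks in the order the consumer reads them.
    lookahead:   hypothetical number of blocks streamed after each miss.
    """
    # Position of each block in the producer's store order.
    position = {block: i for i, block in enumerate(store_order)}
    buffered = set()  # blocks already streamed to the consumer
    misses = 0
    for block in read_order:
        if block in buffered:
            # The block arrived ahead of the read, so the miss is avoided.
            buffered.discard(block)
            continue
        misses += 1
        start = position.get(block)
        if start is not None:
            # On a miss, stream the next blocks in the producer's store order.
            buffered.update(store_order[start + 1 : start + 1 + lookahead])
    # Baseline: without streaming, every read in this model is a miss.
    return len(read_order), misses


if __name__ == "__main__":
    produced = list("ABCDEF")
    print(count_misses(produced, list("ABCDEF"), lookahead=2))  # same order
    print(count_misses(produced, list("FEDCBA"), lookahead=2))  # reversed order
```

In this model, a consumer that reads in the producer's store order avoids most misses, while a consumer that reads in reverse order gains nothing, which is the sense in which the benefit depends on the correlation between production and consumption order.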

    Software Shared Memory Support on Clusters of Symmetric MultiProcessors Using Remote-Write Networks

    Low-latency, remote-write-access networks have recently become commodity items. These networks can connect clusters of symmetric multiprocessors (SMPs) to form very cost-effective, large scale parallel systems. Software-based distributed shared memory (SDSM) is a natural choice for the underlying platform. However, to exploit the platform's full potential, sharing across SMPs must be managed without compromising the efficiency of sharing within an SMP. Cashmere-2L is a "two-level" SDSM protocol that delivers the platform's potential through novel software techniques that leverage, without compromising, the efficiency of the hardware coherence. The protocol implements a moderately lazy release consistency model with page directories, home-nodes, and multiple concurrent writers. By avoiding global meta-data locks and TLB shootdown, Cashmere-2L is able to maintain a high level of asynchrony. The prototype Cashmere-2L system currently runs on an 8-node, 32-processor DEC AlphaServer cluster ...

    Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network

    Low-latency remote-write networks, such as DEC’s Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a “two-level” software coherent shared memory system, Cashmere-2L, that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel’s remote-write capabilities to implement “moderately lazy” release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level o
