14 research outputs found

    SWEL: hardware cache coherence protocols to map shared data onto shared caches

    poster
    In chip multiprocessors, replication of cache lines is allowed to reduce the latency each core incurs when accessing a cache line. Because of this replication, one copy of the data can become out of date when another copy is modified. A MESI protocol maintains coherence as follows:
    - Keep a list of sharers for every cache line
    - Invalidate all L1 copies in the sharer list on a modify (write) request
    - Forward read requests to whichever L1 most recently modified the data
    Profiling (shown in the poster's diagrams) indicates that the frequency and degree of data sharing are not very high on average, so MESI may be overprovisioned for this uncommon case.
    Proposal: we propose a new coherence protocol named SWEL (for Shared, Written, Exclusivity Level) that seeks to reduce the frequency and complexity of the operations required to keep caches coherent in protocols like MESI. MESI's drawbacks:
    - Requires indirection to obtain up-to-date data
    - Requires serialized sequences of messages to perform point-to-point invalidations
    The SWEL protocol:
    - Limits replication of data to reduce coherence operations (only private and read-only data are replicated)
    - Speculatively assumes all data is read-only or private
    - Detects the shared-and-written state with a simple hardware mechanism
    - Performs invalidations over a broadcast bus
    The RSWEL optimization of this protocol allows selective reconstitution of cache blocks: after a broadcast invalidation, a block can once again be cached in L1 once a lockout timer N expires.
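A minimal sketch of the Shared/Written classification idea described above, assuming a per-block state record; the class names, demotion action, and timer interface are illustrative assumptions, not the paper's hardware design:

```python
# Illustrative sketch of SWEL's classification: each block tracks a
# "shared" bit and a "written" bit. A block stays L1-cacheable while it
# is private or read-only; once it is both shared and written, all L1
# copies are broadcast-invalidated and the block is pinned to the L2.

class SwelBlock:
    def __init__(self):
        self.shared = False      # touched by more than one core
        self.written = False     # modified at least once
        self.owner = None        # first core seen touching the block
        self.l1_cacheable = True

    def access(self, core, is_write):
        if self.owner is None:
            self.owner = core
        elif core != self.owner:
            self.shared = True
        if is_write:
            self.written = True
        if self.shared and self.written and self.l1_cacheable:
            self.l1_cacheable = False  # broadcast-invalidate all L1 copies
            print("block demoted to L2-only")

# RSWEL-style reconstitution (assumed interface): after a lockout of N
# cycles with no conflicting activity, the block may be promoted back.
def maybe_reconstitute(block, idle_cycles, N):
    if not block.l1_cacheable and idle_cycles >= N:
        block.shared = block.written = False
        block.owner = None
        block.l1_cacheable = True

# Toy usage: core 0 writes, core 1 then reads -> shared+written -> demotion.
blk = SwelBlock()
blk.access(core=0, is_write=True)
blk.access(core=1, is_write=False)
```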

    Sandbox prefetching: safe run-time evaluation of aggressive prefetchers

    pre-print
    Memory latency is a major factor limiting CPU performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources prudently, while still prefetching required data in a timely manner. In this work, we propose Sandbox Prefetching, a new mechanism to determine at run-time the appropriate prefetching mechanism for the currently executing program. Sandbox Prefetching evaluates simple, aggressive offset prefetchers at run-time by adding each prefetch address to a Bloom filter, rather than actually fetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to see whether the aggressive prefetcher under evaluation could have accurately prefetched the data, while simultaneously testing for the existence of prefetchable streams. Real prefetches are performed when the accuracy of an evaluated prefetcher exceeds a threshold. This method combines the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. Sandbox Prefetching improves performance across the tested workloads by 47.6% compared to not using any prefetching, and by 18.7% compared to the Feedback Directed Prefetching technique. Performance is also improved by 1.4% compared to the Access Map Pattern Matching Prefetcher, while incurring considerably less logic and storage overhead.
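A hedged sketch of the sandbox mechanism the abstract describes: candidate offset prefetchers insert would-be prefetch addresses into a Bloom filter instead of the cache, later accesses are scored against the filter, and a candidate issues real prefetches once its measured accuracy crosses a threshold. The filter sizing, threshold, and warm-up count below are illustrative assumptions:

```python
# Sandbox evaluation of an aggressive offset prefetcher via a Bloom filter.
import hashlib

class BloomFilter:
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitmap[p // 8] |= 1 << (p % 8)

    def contains(self, key):
        return all(self.bitmap[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

class SandboxedOffsetPrefetcher:
    def __init__(self, offset, threshold=0.5, warmup=64):
        self.offset, self.threshold, self.warmup = offset, threshold, warmup
        self.filter = BloomFilter()
        self.issued = self.hits = 0

    def on_access(self, line_addr, issue_real_prefetch):
        # Score this access against earlier sandboxed (virtual) prefetches.
        if self.filter.contains(line_addr):
            self.hits += 1
        # Sandbox the prefetch this candidate would have issued: record it
        # in the Bloom filter instead of fetching data into the cache.
        self.filter.add(line_addr + self.offset)
        self.issued += 1
        # Promote to real prefetching once accuracy exceeds the threshold.
        if self.issued > self.warmup and self.hits / self.issued >= self.threshold:
            issue_real_prefetch(line_addr + self.offset)

# Toy usage: a stride-2 access stream quickly qualifies the offset-2 candidate.
pf = SandboxedOffsetPrefetcher(offset=2)
for addr in range(0, 400, 2):
    pf.on_access(addr, lambda a: None)
print(f"sandbox accuracy: {pf.hits / pf.issued:.2f}")
```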

    NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads

    pre-print
    While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, and (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to a 15X reduction in execution time and an 18X reduction in system energy.
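An illustrative sketch of why MapReduce fits NDC: if records are partitioned so that each Map task reads only the shard stored in its own 3D-stacked vault, a low-EPI core bonded to that vault runs the task without crossing the package's SerDes links, and only small partial results travel off-package for Reduce. The vault count, partitioning function, and workload below are assumptions, not the paper's configuration:

```python
# Toy model of near-data Map with localized accesses, host-side Reduce.
from collections import Counter

NUM_VAULTS = 16  # illustrative: one NDC core per memory vault

def vault_of(record_id):
    # Records are partitioned across vaults, so each Map stays local.
    return record_id % NUM_VAULTS

def map_word_count(shard):
    counts = Counter()
    for line in shard:
        counts.update(line.split())
    return counts

shards = {v: [] for v in range(NUM_VAULTS)}
for rid, line in enumerate(["a b a", "b c", "a c c"] * 6):
    shards[vault_of(rid)].append(line)

# Map runs near memory; only the (small) partial Counters would cross
# the off-package links for the Reduce step.
partials = [map_word_count(shards[v]) for v in range(NUM_VAULTS)]
total = sum(partials, Counter())
print(total)
```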

    Scalable, reliable, power-efficient communication for hardware transactional memory

    Journal Article
    In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system, where multiple transactions may attempt to commit simultaneously and coordination is required before allowing commits to proceed in parallel. In this paper, we propose novel commit algorithms that are more scalable (in terms of delay and energy) and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept, and we leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages. These translate into overall performance improvements of up to 66% (for synthetic workloads with an average transaction length of 200 cycles), 35% (for an average transaction length of 1000 cycles), 8% (for an average transaction length of 4000 cycles), and 41% (for a collection of SPLASH-2 programs).
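A hedged sketch of the general commit-coordination problem, not the paper's actual protocol: in a lazy-lazy HTM, a committer must gain exclusive commit permission at every directory bank its write set touches, and acquiring banks in one fixed global order keeps concurrent committers deadlock-free, loosely analogous to how token counting orders ownership in token coherence. Bank count and the lock-based stand-in for hardware arbitration are assumptions:

```python
# Deadlock-free parallel commit across directory banks (illustrative).
import threading

NUM_BANKS = 8
bank_locks = [threading.Lock() for _ in range(NUM_BANKS)]

def commit(write_set_banks, apply_updates):
    # Every transaction acquires banks in the same (ascending) order,
    # so two committers can never hold-and-wait in a cycle.
    order = sorted(set(write_set_banks))
    for b in order:
        bank_locks[b].acquire()
    try:
        apply_updates()  # make the transaction's writes globally visible
    finally:
        for b in reversed(order):
            bank_locks[b].release()

# Example: two transactions with overlapping write sets commit safely.
log = []
t1 = threading.Thread(target=commit, args=([2, 5], lambda: log.append("T1")))
t2 = threading.Thread(target=commit, args=([5, 2], lambda: log.append("T2")))
t1.start(); t2.start(); t1.join(); t2.join()
print(log)  # both committed, in some serial order
```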

    Commit Algorithms for Scalable Hardware Transactional Memory

    In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. For a large-scale distributed memory system, we propose novel commit algorithms that are deadlock- and livelock-free and do not employ any centralized resource. These algorithms improve upon the state-of-the-art by yielding up to a 59% improvement in average delay and up to a 97% reduction in network traffic.