8 research outputs found

    Physical-aware link allocation and route assignment for chip multiprocessing

    The architecture definition, design, and validation of the interconnect network are key steps in the design of modern on-chip systems. This paper proposes a mathematical formulation of the problem of simultaneously defining the network topology and the message routes for the traffic among the processing elements of the system. The solution meets the physical and performance constraints defined by the designer, and the method guarantees that the generated solution is deadlock free. It is also capable of automatically discovering topologies that have previously been used in industrial systems. The applicability of the method has been validated by solving realistic-size interconnect networks that model typical multiprocessor systems.
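
    The abstract does not reproduce the formulation itself, so the sketch below only conveys the flavor of the problem under assumed inputs. It uses a greedy heuristic (not the paper's mathematical formulation): direct links are added for the heaviest flows under a per-node port budget, remaining traffic is routed on shortest paths, and deadlock freedom is checked by verifying that the channel dependency graph is acyclic. All names and constants (PORT_BUDGET, the traffic matrix) are hypothetical.

```python
# Greedy illustration of joint topology definition and route assignment with a
# deadlock-freedom check; not the paper's formulation, just the problem shape.
from collections import defaultdict, deque

PORT_BUDGET = 2  # max links per processing element (hypothetical constraint)

# traffic[(src, dst)] = offered load between processing elements (made-up numbers)
traffic = {(0, 3): 8, (1, 2): 5, (0, 1): 4, (2, 3): 3, (1, 3): 2}

# 1. Topology: add direct links for the heaviest flows while ports remain.
degree = defaultdict(int)
links = set()
for (s, d), load in sorted(traffic.items(), key=lambda kv: -kv[1]):
    if degree[s] < PORT_BUDGET and degree[d] < PORT_BUDGET:
        links.add(frozenset((s, d)))
        degree[s] += 1
        degree[d] += 1

adj = defaultdict(set)
for link in links:
    a, b = tuple(link)
    adj[a].add(b)
    adj[b].add(a)

# 2. Routing: shortest path (BFS) over the chosen topology for every flow.
def shortest_path(src, dst):
    prev, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            path = [u]
            while prev[u] is not None:
                u = prev[u]
                path.append(u)
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                frontier.append(v)
    return None  # unreachable: the topology would have to be repaired

routes = {flow: shortest_path(*flow) for flow in traffic}

# 3. Deadlock freedom: the channel dependency graph (an edge between each pair
#    of consecutive directed channels used by a route) must be acyclic.
cdg = defaultdict(set)
for path in routes.values():
    channels = list(zip(path, path[1:]))
    for c1, c2 in zip(channels, channels[1:]):
        cdg[c1].add(c2)

def has_cycle(graph):
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and visit(u) for u in list(graph))

print("links:", sorted(tuple(sorted(l)) for l in links))
print("routes:", routes)
print("deadlock free:", not has_cycle(cdg))
```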

    Adding Token Counting to Directory-Based Cache Coherence

    The coherence protocol is a first-order design concern in multicore designs. Directory protocols are naturally scalable, as they place no restrictions on the interconnect and have minimal bandwidth requirements; however, this scalability comes at the cost of increased sharing latency due to indirection. In contrast, broadcast-based systems such as snooping protocols and token coherence reduce the latency of sharing misses by sending requests directly to other processors. Unfortunately, their reliance on totally ordered interconnects and/or broadcast limits their scalability. This work introduces PATCH (Predictive/Adaptive Token Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically using available bandwidth to reduce sharing latency. PATCH extends a standard directory protocol to track tokens and use token counting rules for enforcing coherence permissions. Token counting allows PATCH to support direct requests on an unordered interconnect, while a novel mechanism called token tenure uses local processor timeouts and the directory’s per-block point of ordering at the home node to guarantee forward progress without relying on broadcast. PATCH makes three main contributions. First, PATCH uses direct request prioritization to match the performance of broadcast-based protocols without restricting scalability. Second, PATCH introduces token tenure, which provides broadcast-free forward progress for token counting protocols. Finally, PATCH provides greater scalability than directory protocols when using inexact encodings of sharers because only processors holding tokens need to acknowledge requests. Overall, PATCH is a “one-size-fits-all” coherence protocol that dynamically adapts to work well for small systems, large systems, and anywhere in between.
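
    The token-counting rules that PATCH inherits from token coherence are simple to state: a core may read a block while it holds at least one of the block's tokens, and may write only while it holds all of them, with a single owner token travelling with the data. A minimal sketch of those permission rules, using illustrative class and field names rather than PATCH's actual structures:

```python
TOTAL_TOKENS = 16  # one token per core in a hypothetical 16-core system

class CacheLine:
    """Per-block token state at one core (illustrative, not PATCH's structures)."""
    def __init__(self, tokens=0, owner_token=False):
        self.tokens = tokens            # tokens currently held for this block
        self.owner_token = owner_token  # the single owner token travels with data

    def can_read(self):
        return self.tokens >= 1         # at least one token => read permission

    def can_write(self):
        return self.tokens == TOTAL_TOKENS  # all tokens => write permission

    def receive(self, tokens, owner_token=False):
        self.tokens += tokens
        self.owner_token = self.owner_token or owner_token

    def surrender_all(self):
        """Respond to a write request by giving up every token held."""
        granted, owner = self.tokens, self.owner_token
        self.tokens, self.owner_token = 0, False
        return granted, owner

# A sharer holding one token may read but never write.
sharer = CacheLine(tokens=1)
assert sharer.can_read() and not sharer.can_write()

# A writer gains write permission only once it has collected every token,
# e.g. after its direct requests (and the directory) reach all sharers.
writer = CacheLine(tokens=TOTAL_TOKENS - 1, owner_token=True)
assert not writer.can_write()
writer.receive(*sharer.surrender_all())
assert writer.can_write()
print("write permission with", writer.tokens, "of", TOTAL_TOKENS, "tokens")
```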

    An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing


    Scalable, reliable, power-efficient communication for hardware transactional memory

    In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and coordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable (in terms of delay and energy) and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state of the art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages. These translate into overall performance improvements of up to 66% (for synthetic workloads with an average transaction length of 200 cycles), 35% (average transaction length of 1000 cycles), 8% (average transaction length of 4000 cycles), and 41% (for a collection of SPLASH-2 programs).
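
    The abstract does not spell out the commit algorithms themselves, so the sketch below only conveys the underlying coordination idea under stated assumptions: a committing transaction acquires per-bank commit permission for every directory bank in its write set, and acquiring banks in a fixed global order lets disjoint write sets commit in parallel while ruling out deadlock. The bank count, locking discipline, and names are hypothetical, not the paper's protocol.

```python
# Toy model of parallel lazy-versioning commit: permission is taken bank by
# bank in ascending order, so overlapping committers serialize per bank and
# non-overlapping committers proceed concurrently.
import threading

NUM_BANKS = 4
bank_locks = [threading.Lock() for _ in range(NUM_BANKS)]

def commit(tx_id, write_set):
    """Acquire commit permission for every bank touched, in a fixed order."""
    banks = sorted({addr % NUM_BANKS for addr in write_set})
    for b in banks:
        bank_locks[b].acquire()
    try:
        # With all involved banks held, the write set can be made visible.
        print(f"tx {tx_id} commits via banks {banks}")
    finally:
        for b in banks:
            bank_locks[b].release()

# Two transactions with disjoint write sets commit in parallel; a third that
# overlaps the first simply waits for the shared bank.
threads = [
    threading.Thread(target=commit, args=(1, {0, 4})),   # banks {0}
    threading.Thread(target=commit, args=(2, {1, 5})),   # banks {1}
    threading.Thread(target=commit, args=(3, {0, 2})),   # banks {0, 2}
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```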

    Token Tenure and PATCH: A Predictive/Adaptive Token-Counting Hybrid

    Traditional coherence protocols present a set of difficult trade-offs: the reliance of snoopy protocols on broadcast and ordered interconnects limits their scalability, while directory protocols incur a performance penalty on sharing misses due to indirection. This work introduces PATCH (Predictive/Adaptive Token-Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically sending direct requests to reduce sharing latency. PATCH extends a standard directory protocol to track tokens and use token-counting rules for enforcing coherence permissions. Token counting allows PATCH to support direct requests on an unordered interconnect, while a mechanism called token tenure provides broadcast-free forward progress using the directory protocol’s per-block point of ordering at the home node along with either timeouts at requesters or explicit race notification messages. PATCH makes three main contributions. First, PATCH introduces token tenure, which provides broadcast-free forward progress for token-counting protocols. Second, PATCH deprioritizes best-effort direct requests to match or exceed the performance of directory protocols without restricting scalability. Finally, PATCH provides greater scalability than directory protocols when using inexact encodings of sharers because only processors holding tokens need to acknowledge requests. Overall, PATCH is a “one-size-fits-all” coherence protocol that dynamically adapts to work well for small systems, large systems, and anywhere in between.
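
    Token tenure can be pictured with a toy simulation: tokens held by a requester that the home has not activated for the block are kept only for a bounded tenure, after which they drain to the home, which forwards them to the activated requester. The timeout value, class names, and single-block model below are assumptions for illustration only, not the protocol's actual state machine.

```python
# Toy single-block token-tenure simulation: the non-activated holder's tokens
# flow back to the home after its tenure expires, so the activated requester
# eventually collects all tokens without any broadcast.
TENURE_LIMIT = 100  # cycles a non-activated holder may keep tokens (assumed)

class Holder:
    def __init__(self, name, tokens):
        self.name, self.tokens, self.held_for = name, tokens, 0

def tick(holders, activated, home_pool):
    """Advance one cycle; untenured tokens drain toward the home."""
    for h in holders:
        if h.name == activated or h.tokens == 0:
            continue
        h.held_for += 1
        if h.held_for >= TENURE_LIMIT:
            home_pool[0] += h.tokens   # forward tokens to the home node
            h.tokens = 0

# Two requesters race; the home activates only "A". "B" may use its token
# briefly, but once its tenure expires the token returns to the home, which
# forwards it to "A" so A's write eventually completes.
holders = [Holder("A", 15), Holder("B", 1)]
home = [0]
for _ in range(TENURE_LIMIT):
    tick(holders, activated="A", home_pool=home)
holders[0].tokens += home[0]  # home forwards drained tokens to the activated requester
print({h.name: h.tokens for h in holders})  # {'A': 16, 'B': 0}
```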

    Efficient and scalable starvation prevention mechanism for token coherence

    Token Coherence is a cache coherence protocol that simultaneously captures the best attributes of the traditional approaches to coherence: direct communication between processors (like snooping-based protocols) and no reliance on bus-like interconnects (like directory-based protocols). This is possible thanks to a class of unordered requests that usually succeed in resolving cache misses. The problem with unordered requests is that they can cause protocol races, which prevent some misses from being resolved. To eliminate races and ensure the completion of unresolved misses, Token Coherence uses a starvation prevention mechanism named persistent requests. This mechanism is extremely inefficient and, moreover, it endangers the scalability of Token Coherence, since it requires storage structures (at each node) whose size grows proportionally to the system size. As multiprocessors continue to include an increasing number of nodes, both the performance and the scalability of cache coherence protocols will remain key concerns. In this work, we propose an alternative starvation prevention mechanism, named priority requests, that outperforms the persistent request mechanism. It reduces application runtime by more than 20 percent (on average) in a 64-processor system. Furthermore, thanks to the flexibility of priority requests, their storage requirements can be drastically reduced, improving the overall scalability of Token Coherence. Although this comes at the expense of a slight performance degradation, priority requests still outperform persistent requests significantly. This work was partially supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under grants CSD2006-00046 and TIN2009-14475-C04-01. Antonio Robles is taking a sabbatical granted by the Universidad Politecnica de Valencia for updating his teaching and research activities. Cuesta Sáez, B. A.; Robles Martínez, A.; Duato Marín, J. F. (2011). Efficient and scalable starvation prevention mechanism for token coherence. IEEE Transactions on Parallel and Distributed Systems, 22(10), 1610-1623. doi:10.1109/TPDS.2011.30
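
    The abstract does not detail how priority requests operate, so the following is only a hedged sketch of the general idea of prioritized reissue: after a bounded number of failed transient (unordered) requests, a starved requester escalates to a timestamped priority request that token holders serve oldest-first. The retry limit, timestamp ordering, and data structures are assumptions for illustration, not the paper's mechanism.

```python
# Sketch of escalation from transient requests to timestamped priority
# requests; token holders serve the oldest pending priority request first.
import heapq

RETRY_LIMIT = 2  # failed transient attempts before escalating (assumed value)

class Requester:
    def __init__(self, rid):
        self.rid, self.failures = rid, 0

    def next_request(self, clock):
        if self.failures < RETRY_LIMIT:
            return ("transient", None, self.rid)
        return ("priority", clock, self.rid)   # older timestamp = higher priority

priority_queue = []  # token holders drain this oldest-first

def deliver(request):
    kind, stamp, rid = request
    if kind == "priority":
        heapq.heappush(priority_queue, (stamp, rid))

# Requester 7 keeps losing races while issuing ordinary transient requests...
starved = Requester(7)
for clock in range(4):
    req = starved.next_request(clock)
    deliver(req)
    if req[0] == "transient":
        starved.failures += 1      # race lost: no tokens arrived

# ...so from the third attempt onward it issues priority requests, and every
# token holder serves the oldest one first, guaranteeing it completes.
print("served next:", heapq.heappop(priority_queue))  # (2, 7)
```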

    A consistency architecture for hierarchical shared caches

    Hierarchical Cache Consistency (HCC) is a scalable cache-consistency architecture for chip multiprocessors in which caches are shared hierarchically. HCC’s cache-consistency protocol is embedded in the message-routing network that interconnects the caches, providing a distributed and scalable alternative to bus-based and directory-based consistency mechanisms. The HCC consistency protocol is “progressive” in that every message makes monotonic progress without timeouts, retries, negative acknowledgments, or retreating in any way. The latency is at most proportional to the diameter of the network. For HCC with a binary fat-tree network, the protocol requires at most 13 bits of additional state per cache line, no matter how large the system. We prove that the HCC protocol is deadlock free and provides sequential consistency.
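
    One way to picture hierarchical tracking of copies in a binary fat tree is a presence bit per child at each internal switch: a request climbs only as far as the nearest ancestor that covers every copy and then descends only into subtrees whose bit is set. The encoding below is an assumption chosen for illustration; it is not HCC's actual 13-bit per-line state or its routing-embedded protocol.

```python
# Hierarchical presence tracking over a binary tree of caches (illustrative).
class TreeNode:
    def __init__(self, left=None, right=None, leaf_id=None):
        self.left, self.right, self.leaf_id = left, right, leaf_id
        self.present = [False, False]  # copy somewhere in left/right subtree?

def build(lo, hi):
    if lo == hi:
        return TreeNode(leaf_id=lo)
    mid = (lo + hi) // 2
    return TreeNode(build(lo, mid), build(mid + 1, hi))

def insert_copy(node, leaf_id, lo, hi):
    """Set presence bits on the path from the root down to leaf_id."""
    if node.leaf_id is not None:
        return
    mid = (lo + hi) // 2
    if leaf_id <= mid:
        node.present[0] = True
        insert_copy(node.left, leaf_id, lo, mid)
    else:
        node.present[1] = True
        insert_copy(node.right, leaf_id, mid + 1, hi)

def collect_sharers(node):
    """Descend only into subtrees that actually hold a copy."""
    if node.leaf_id is not None:
        return [node.leaf_id]
    sharers = []
    if node.present[0]:
        sharers += collect_sharers(node.left)
    if node.present[1]:
        sharers += collect_sharers(node.right)
    return sharers

root = build(0, 7)                 # 8 caches at the leaves
for cache in (1, 2, 6):
    insert_copy(root, cache, 0, 7)
print("invalidate copies at:", collect_sharers(root))  # [1, 2, 6]
```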

    Speculative Techniques for Memory Hierarchy Management

    The “Memory Wall” [1] is the gap in performance between the processor and main memory. Over the last 30 years, computer architects have added multiple levels of cache to bridge this gap: cache levels closer to the processor are smaller and faster, while levels farther from the processor are larger and slower. However, processors are still exposed to DRAM latency on misses, so speculative memory-management techniques such as prefetching are used in modern microprocessors to hide it. First, we propose Synchronization-aware Hardware Prefetching for Chip Multiprocessors, a novel hardware data prefetching scheme designed for prefetching shared-memory, multi-threaded workloads. This is the first work we are aware of to characterize the causes of poor prefetching performance in shared-memory multi-threaded applications: the inability to prefetch beyond synchronization points and the tendency to prefetch shared data before it has been written. SB-Fetch is a low-complexity, low-overhead prefetcher design that addresses both issues. Second, we propose a new prefetching algorithm, Set-Level Adaptive Prefetching for Compressed Caches (SLAP-CC), which varies prefetching aggressiveness based on how much effective capacity is available in each cache set. The contributions of this work are to characterize the increase and per-set variability of cache efficiency that typical cache compression schemes create, and to propose a new prefetching scheme, SLAP-CC, designed to leverage this variability. Third, we propose a scheduling mechanism that predicts hard-to-prefetch loads at issue time and preemptively schedules them for execution as soon as they are ready, allowing the cache hierarchy to start handling the miss sooner. This mechanism reduces the miss penalty on the instructions that depend on a hard-to-prefetch load.
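
    SLAP-CC's core idea, varying the prefetch degree with per-set effective capacity, can be sketched with an assumed compression granularity and thresholds; none of the numbers below are the paper's tuned values.

```python
# Per-set adaptive prefetch degree for a compressed cache (illustrative only).
def effective_capacity(set_lines, physical_ways=8, segment_bytes=8, line_bytes=64):
    """Lines currently stored plus how many more uncompressed lines would still fit."""
    total_segments = physical_ways * (line_bytes // segment_bytes)
    used_segments = sum(size // segment_bytes for size in set_lines)
    free_segments = total_segments - used_segments
    return len(set_lines) + free_segments * segment_bytes // line_bytes

def prefetch_degree(set_lines):
    capacity = effective_capacity(set_lines)
    if capacity >= 12:     # set compresses well: prefetch aggressively
        return 4
    if capacity >= 10:
        return 2
    return 1               # little spare capacity: stay conservative

well_compressed = [16, 16, 24, 16, 32, 24, 16, 16]     # bytes per stored line
poorly_compressed = [64, 64, 64, 56, 64, 64, 64, 64]
print(prefetch_degree(well_compressed))    # 4
print(prefetch_degree(poorly_compressed))  # 1
```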