179 research outputs found

    Efficient Cache Coherence on Manycore Optical Networks

    Get PDF
    Ever since industry has turned to parallelism instead of frequency scaling to improve processor performance, multicore processors have continued to scale to larger and larger numbers of cores. Some believe that multicores will have 1000 cores or more by the middle of the next decade. However, their promise of increased performance will only be reached if their inherent scaling challenges are overcome. One such major scaling challenge is the viability of efficient cache coherence with large numbers of cores. Meanwhile, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a realityâ interconnect technology which can provide significantly more bandwidth at lower power than conventional electrical analogs. The contributions of this paper are two-fold. (1) It presents ATAC, a new manycore architecture that augments an electrical mesh network with an optical network that performs highly efficient broadcasts. (2) It introduces ACKwise, a novel directory-based cache coherence protocol that provides high performance and scalability on any large-scale manycore interconnection net- work with broadcast capability. Performance evaluation studies using analytical models show that (i) a 1024-core ATAC chip using ACKwise achieves a speedup of 3.9Ã compared to a similarly-sized pure electrical mesh manycore with a conventional limited directory protocol; (ii) the ATAC chip with ACKwise achieves a speedup of 1.35Ã compared to the electrical mesh chip with ACKwise; and (iii) a pure electrical mesh chip with ACKwise achieves a speedup of 2.9Ã over the same chip using a conventional limited directory protocol

    Vote the OS off your Core

    Get PDF
    Recent trends in OS research have shown evidence that there are performance benefits to running OS services on different cores than the user applications that rely on them. We quantitatively evaluate this claim in terms of one of the most significant architectural constraints: memory performance. To this end, we have created CachEMU, an open-source memory trace generator and cache simulator built as an extension to QEMU for working with system traces. Using CachEMU, we determined that for five common Linux test workloads, it was best to run the OS close, but not too close on the same package, but not on the same core

    SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering

    Get PDF
    URL to conference programIn the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip(NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36 and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application run time reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by- 13 mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power Architecture TM cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19 % of tile power.United States. Defense Advanced Research Projects Agency (DARPA UHPC grant at MIT (Angstrom))Center for Future Architectures ResearchMicroelectronics Advanced Research Corporation (MARCO)Semiconductor Research Corporatio
    • …
    corecore