
    Latency reduction techniques in chip multiprocessor cache systems

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 117-122).
    Single-chip multiprocessors (CMPs) solve several bottlenecks facing chip designers today. Compared to traditional superscalars, CMPs deliver higher performance at lower power for thread-parallel workloads. In this thesis, we consider tiled CMPs, a class of CMPs in which each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. Two basic schemes are currently used to manage the L2 slices. First, each slice can be used as a private L2 for its tile. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, because each tile creates a local copy of any block it touches. Second, all slices can be aggregated to form a single large L2 shared by all tiles. A shared L2 cache increases the effective cache capacity for shared data, but incurs longer hit latencies when L2 data resides on a remote tile. In practice, either the private or the shared scheme works better for a given workload. We present two new policies, victim replication and victim migration, both of which combine the advantages of the private and shared designs. They are variants of the shared scheme that attempt to keep copies of local L1 cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefit of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of single-threaded, multi-threaded, and multi-programmed workloads running on an eight-processor tiled CMP. We show that both techniques achieve significant performance improvement over the baseline private and shared schemes for these workloads.
    By Michael Zhang. Ph.D.
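    The victim-replication idea can be sketched in a few lines of code. The following is a rough illustration only, with hypothetical names (L2Slice, l1_miss, l1_eviction) and a simplified replacement policy; it is not the thesis implementation.

```python
# Rough sketch of victim replication: keep L1 victims as replicas in the local
# L2 slice of a shared L2 (hypothetical, simplified policy).

class L2Slice:
    """One tile's slice of the shared L2; may also hold replica copies."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}                      # block address -> {'replica': bool}

    def lookup(self, addr):
        return addr in self.lines

    def insert_replica(self, addr):
        """Keep a local copy of an L1 victim only if it is cheap to do so."""
        if addr in self.lines or len(self.lines) < self.capacity:
            self.lines.setdefault(addr, {'replica': True})
            return True
        # Slice is full: only displace an existing replica, never a home
        # (globally shared) line, so shared-data capacity is preserved.
        replicas = [a for a, meta in self.lines.items() if meta['replica']]
        if replicas:
            del self.lines[replicas[0]]
            self.lines[addr] = {'replica': True}
            return True
        return False                         # nothing cheap to evict: skip replication


def l1_miss(addr, local_slice, home_slice):
    """On an L1 miss, probe the local slice (fast) before the home slice (remote)."""
    if local_slice.lookup(addr):
        return 'local L2 hit'                # replica hit: short latency
    if home_slice.lookup(addr):
        return 'remote L2 hit'               # ordinary shared-L2 hit on the home tile
    return 'off-chip miss'


def l1_eviction(addr, local_slice):
    """Victim replication: try to retain the evicted L1 block in the local slice."""
    local_slice.insert_replica(addr)
```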

    Studying the Impact of Multicore Processor Scaling on Cache Coherence Directories via Reuse Distance Analysis

    Directories are a key part of a processor's cache coherence hardware and constitute one of the main bottlenecks in multicore processor scaling, e.g., core-count and cache-size scaling. Many research efforts have tried to improve the scalability of the directory, but most of them simulate only a few architecture configurations. It is important to study the directory's architecture dependency as processors continue to scale, because, besides the application, directory behavior is also highly sensitive to the architecture: varying the core count directly affects the amount of sharing seen by the directory, and varying the data cache hierarchy changes the directory access stream. Unfortunately, exploring the huge design space of core counts and cache configurations is challenging with traditional architectural simulation because of its slow speed. This thesis studies the directory using multicore reuse distance analysis. It extends existing multicore reuse distance techniques, developing a method to extract directory access information from the parallel LRU stacks used to acquire private-stack reuse distance profiles. The thesis implements this method in a PIN-based profiler to study directory behavior, including the directory access pattern and directory contents, and to analyze current directory techniques. The profiling results show that directory accesses are highly dependent on cache size, exhibiting a 3.5x drop when scaling the data cache from 16KB to 1MB; that sharing causes the ratio of directory entries to cache blocks to drop below 50%; and that the majority of accesses go to a small percentage of the directory entries. Cache simulations are performed to validate the profiling results, showing that they are within 14.5% of simulation on average. The thesis also analyzes different directory techniques using insights from the profiler. Case studies on the Cuckoo, DGD, and SCD techniques and on multi-level directories show, respectively, that the required directory size varies significantly with CPU scaling, that the opportunity to compress private data decreases with cache scaling, that reducing the sharer-list size is an effective technique, and that a small L1 directory is sufficient to capture most of the latency-critical accesses.
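    A rough sketch of the underlying mechanism, assuming a simplified per-core LRU-stack model (the PrivateStackProfiler class and its methods are hypothetical; the actual tool is a far more detailed PIN-based profiler): a reference whose private-stack reuse distance is at least the private cache capacity misses in that cache and therefore appears in the directory access stream.

```python
# Rough sketch of private-stack reuse distance profiling and of deriving
# directory accesses from it (hypothetical structure, not the thesis tool).

from collections import OrderedDict

class PrivateStackProfiler:
    def __init__(self, num_cores):
        # One LRU stack per core, modelled as an OrderedDict with the MRU block
        # at the end; these are the "parallel LRU stacks".
        self.stacks = [OrderedDict() for _ in range(num_cores)]
        self.histograms = [{} for _ in range(num_cores)]

    def access(self, core, block):
        """Record one memory reference and return its private-stack reuse distance."""
        stack = self.stacks[core]
        if block in stack:
            keys = list(stack)
            # Distance = number of distinct blocks touched since the last use.
            distance = len(keys) - 1 - keys.index(block)
            del stack[block]
        else:
            distance = float('inf')          # first touch (cold reference)
        stack[block] = True                  # move/insert block at the MRU position
        hist = self.histograms[core]
        hist[distance] = hist.get(distance, 0) + 1
        return distance

    def reaches_directory(self, core, block, private_cache_blocks):
        # A reuse distance of at least the private cache capacity means the
        # reference misses in that cache and so reaches the directory.
        return self.access(core, block) >= private_cache_blocks
```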

    Cuckoo Directory: A Scalable Directory for Many-Core Systems

    Growing core counts have highlighted the need for scalable on-chip coherence mechanisms. The increase in the number of on-chip cores exposes the energy and area costs of scaling the directories. Duplicate-tag based directories require highly associative structures that grow with core count, precluding scalability due to prohibitive power consumption. Sparse directories overcome the power barrier by reducing directory associativity, but require storage area over-provisioning to avoid high invalidation rates. We propose the Cuckoo directory, a power- and area-efficient scalable distributed directory. The Cuckoo directory scales to high core counts without the energy costs of wide associative lookup and without gross capacity over-provisioning. Simulation of a 16-core CMP with commercial server and scientific workloads shows that the Cuckoo directory eliminates invalidations while being up to four times more power-efficient than the Duplicate-tag directory, and 24% more power-efficient and up to seven times more area-efficient than the Sparse directory organization. Analytical projections indicate that the Cuckoo directory retains its energy and area benefits with increasing core count, efficiently scaling to at least 1024 cores.
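    The key mechanism is cuckoo-hashing insertion: when all candidate slots for a new directory entry are occupied, an existing entry is displaced and re-inserted under a different hash, rather than invalidating sharers as a sparse directory would. The sketch below is a rough illustration with a hypothetical CuckooTable class, stand-in hash functions, and made-up parameters; the real design also stores sharer vectors and uses hardware-friendly hashes.

```python
# Rough sketch of the cuckoo-hashing insertion behind the Cuckoo directory.

import random

class CuckooTable:
    def __init__(self, ways=4, sets=256, max_kicks=16):
        self.ways, self.sets, self.max_kicks = ways, sets, max_kicks
        self.tables = [{} for _ in range(ways)]   # one direct-mapped table per way

    def _index(self, way, tag):
        # Stand-in for per-way hash functions.
        return hash((way, tag)) % self.sets

    def insert(self, tag):
        for _ in range(self.max_kicks):
            # First try to find a free slot in any way.
            for way in range(self.ways):
                idx = self._index(way, tag)
                if idx not in self.tables[way]:
                    self.tables[way][idx] = tag
                    return True
            # All candidate slots are occupied: displace one entry and retry
            # inserting it, instead of invalidating cached copies.
            way = random.randrange(self.ways)
            idx = self._index(way, tag)
            tag, self.tables[way][idx] = self.tables[way][idx], tag
        return False    # gave up after max_kicks; a real directory evicts here
```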

    Active memory controller

    The inability to hide main memory latency increasingly limits the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of the data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of the AMC and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with good accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high-performance processor, based on a standard cell implementation.
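    A conceptual sketch of the difference between a conventional read-modify-write at the core and an AMO shipped to the home memory controller (hypothetical HomeMemoryController class and interfaces; the actual AMC is cache-coherent hardware, and the message accounting here is purely illustrative).

```python
# Rough sketch: ship the operation to the data (AMO) instead of the data to
# the core (conventional read-modify-write).

class HomeMemoryController:
    """Home node for a region of memory; can execute shipped operations in place."""

    def __init__(self):
        self.memory = {}

    def read_line(self, addr):
        return self.memory.get(addr, 0)

    def write_line(self, addr, value):
        self.memory[addr] = value

    def execute_amo(self, op, addr, operand):
        # The operation travels to the home controller, so no remote cache copy
        # is created that would later need coherence traffic to invalidate.
        old = self.memory.get(addr, 0)
        if op == 'fetch_add':
            self.memory[addr] = old + operand
        return old


def conventional_increment(core_cache, home, addr):
    """Fetch the line to the core, modify it there, and write it back later."""
    value = home.read_line(addr)             # coherence request + data transfer
    core_cache[addr] = value + 1             # local modification
    home.write_line(addr, core_cache[addr])  # eventual writeback / ownership change


def amo_increment(home, addr):
    """Ship a fetch-and-add to the home controller (e.g., a barrier counter)."""
    return home.execute_amo('fetch_add', addr, 1)
```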