212 research outputs found

    The locality-aware adaptive cache coherence protocol

    Get PDF
    Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient shared memory cache coherence protocol that enables seamless adaptation between private and logically shared caching of on-chip data at the fine granularity of cache lines. Our data-centric approach relies on in-hardware yet low-overhead runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit the private caches and enable low-latency, low-energy memory access, while retaining the convenience of shared memory. On a set of parallel benchmarks, our low-overhead locality-aware mechanisms reduce the overall energy by 25% and completion time by 15% in an NoC-based multicore with the Reactive-NUCA on-chip cache organization and the ACKwise limited directory-based coherence protocol.United States. Defense Advanced Research Projects Agency. The Ubiquitous High Performance Computing Progra

    BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures

    Full text link
    We introduce BriskStream, an in-memory data stream processing system (DSPSs) specifically designed for modern shared-memory multicore architectures. BriskStream's key contribution is an execution plan optimization paradigm, namely RLAS, which takes relative-location (i.e., NUMA distance) of each pair of producer-consumer operators into consideration. We propose a branch and bound based approach with three heuristics to resolve the resulting nontrivial optimization problem. The experimental evaluations demonstrate that BriskStream yields much higher throughput and better scalability than existing DSPSs on multi-core architectures when processing different types of workloads.Comment: To appear in SIGMOD'1

    Instruction-Level Execution Migration

    Get PDF
    We introduce the Execution Migration Machine (EM²), a novel data-centric multicore memory system architecture based on computation migration. Unlike traditional distributed memory multicores, which rely on complex cache coherence protocols to move the data to the core where the computation is taking place, our scheme always moves the computation to the core where the data resides. By doing away with the cache coherence protocol, we are able to boost the effectiveness of per-core caches while drastically reducing hardware complexity. To evaluate the potential of EM² architectures, we developed a series of PIN/Graphite-based models of an EM² multicore with 64 x86 cores and, under some simplifying assumptions (a timing model restricted to data memory performance, no instruction cache modeling, high-bandwidth fixed-latency interconnect allowing concurrent migrations), compared them against corresponding directory-based cache-coherent architecture models. We justify our assumptions and show that our conclusions are valid even if our assumptions are removed. Experimental results on a range of SPLASH-2 and PARSEC benchmarks indicate that EM2 can significantly improve per-core cache performance, decreasing overall miss rates by as much as 84% and reducing average memory latency by up to 58%

    Scalable directoryless shared memory coherence using execution migration

    Get PDF
    We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using a execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data replication significantly reduces cache miss rates, while a fast network-level thread migration scheme takes advantage of shared data locality to reduce remote cache accesses that limit traditional NUCA performance. EM area and energy consumption are very competitive, and, on the average, it outperforms a directory-based MOESI baseline by 6.8% and a traditional S-NUCA design by 9.2%. We argue that with EM scaling performance has much lower cost and design complexity than in directory-based coherence and traditional NUCA architectures: by merely scaling network bandwidth from 128 to 256 (512) bit flits, the performance of our architecture improves by an additional 8% (12%), while the baselines show negligible improvement

    Locality-aware data replication in the Last-Level Cache

    Get PDF
    Next generation multicores will process massive data with varying degree of locality. Harnessing on-chip data locality to optimize the utilization of cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). Our goal is to lower memory access latency and energy by replicating only high locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. Our approach relies on low overhead yet highly accurate in-hardware run-time classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse. Furthermore, our classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. Moreover, the complexity of our protocol is low since no additional coherence states are created. On a set of parallel benchmarks, our protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes

    EM2: A Scalable Shared-Memory Multicore Architecture

    Get PDF
    We introduce the Execution Migration Machine (EM2), a novel, scalable shared-memory architecture for large-scale multicores constrained by off-chip memory bandwidth. EM2 reduces cache miss rates, and consequently off-chip memory usage, by permitting only one copy of data to be stored anywhere in the system: when a thread wishes to access an address not locally cached on the core it is executing on, it migrates to the appropriate core and continues execution. Using detailed simulations of a range of 256-core configurations on the SPLASH-2 benchmark suite, we show that EM2 improves application completion times by 18% on the average while remaining competitive with traditional architectures in silicon area

    Judicious Thread Migration When Accessing Distributed Shared Caches

    Get PDF
    Chip-multiprocessors (CMPs) have become the mainstream chip design in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Architecture (NUCA) design, where on chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves the performance by 18% on average and at best by 2.3X over the standard NUCA design that only uses remote accesses

    OLTP on Hardware Islands

    Get PDF
    Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to the main memory and to the processor caches, which causes variability in the communication costs. Unfortunately, database systems mostly assume that all processing cores are the same and that microarchitecture differences are not significant enough to appear in critical database execution paths. As we demonstrate in this paper, however, hardware heterogeneity does appear in the critical path and conventional database architectures achieve suboptimal and even worse, unpredictable performance. We perform a detailed performance analysis of OLTP deployments in servers with multiple cores per CPU (multicore) and multiple CPUs per server (multisocket). We compare different database deployment strategies where we vary the number and size of independent database instances running on a single server, from a single shared-everything instance to fine-grained shared-nothing configurations. We quantify the impact of non-uniform hardware on various deployments by (a) examining how efficiently each deployment uses the available hardware resources and (b) measuring the impact of distributed transactions and skewed requests on different workloads. Finally, we argue in favor of shared-nothing deployments that are topology- and workload-aware and take advantage of fast on-chip communication between islands of cores on the same socket.Comment: VLDB201
    corecore