284 research outputs found

    Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy

    Get PDF
    Journal ArticleCache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power

    Jenga: Harnessing Heterogeneous Memories through Reconfigurable Cache Hierarchies

    Get PDF
    Conventional memory systems are organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, because working sets settle at the smallest (and fastest) level they fit in. However, rigid hierarchies also cause significant overheads, because each level adds latency and energy even when it does not capture the working set. In emerging systems with heterogeneous memory technologies such as stacked DRAM, these overheads often limit performance and efficiency. We propose Jenga, a reconfigurable cache hierarchy that avoids these pathologies and approaches the performance of a hierarchy optimized for each application. Jenga monitors application behavior and dynamically builds virtual cache hierarchies out of heterogeneous, distributed cache banks. Jenga uses simple hardware support and a novel software runtime to configure virtual cache hierarchies. On a 36-core CMP with a 1 GB stacked-DRAM cache, Jenga outperforms a combination of state-of-the-art techniques by 10% on average and by up to 36%, and does so while saving energy, improving system-wide energy-delay product by 29% on average and by up to 96%

    Energy challenges for ICT

    Get PDF
    The energy consumption from the expanding use of information and communications technology (ICT) is unsustainable with present drivers, and it will impact heavily on the future climate change. However, ICT devices have the potential to contribute signi - cantly to the reduction of CO2 emission and enhance resource e ciency in other sectors, e.g., transportation (through intelligent transportation and advanced driver assistance systems and self-driving vehicles), heating (through smart building control), and manu- facturing (through digital automation based on smart autonomous sensors). To address the energy sustainability of ICT and capture the full potential of ICT in resource e - ciency, a multidisciplinary ICT-energy community needs to be brought together cover- ing devices, microarchitectures, ultra large-scale integration (ULSI), high-performance computing (HPC), energy harvesting, energy storage, system design, embedded sys- tems, e cient electronics, static analysis, and computation. In this chapter, we introduce challenges and opportunities in this emerging eld and a common framework to strive towards energy-sustainable ICT

    Massive Data-Centric Parallelism in the Chiplet Era

    Full text link
    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that the parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory accesses and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads. Data structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by having the entire dataset on-chip, scattered across PUs, and executing the tasks at the PU where the data is local. However, it also raised questions on how to scale to larger datasets when all the memory is on chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables creating chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfigurations avoid network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First-Search with RMAT-26 across a million PUs reaches 3323 GTEPS

    Adaptive memory-side last-level GPU caching

    Get PDF
    Emerging GPU applications exhibit increasingly high computation demands which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for workloads with high degrees of data sharing across SMs, a private LLC leads to a significant performance advantage because of increased bandwidth to replicated cache lines across different LLC slices. In this paper, we propose adaptive memory-side last-level GPU caching to boost performance for sharing-intensive workloads that need high bandwidth to read-only shared data. Adaptive caching leverages a lightweight performance model that balances increased LLC bandwidth against increased miss rate under private caching. In addition to improving performance for sharing-intensive workloads, adaptive caching also saves energy in a (co-designed) hierarchical two-stage crossbar NoC by power-gating and bypassing the second stage if the LLC is configured as a private cache. Our experimental results using 17 GPU workloads show that adaptive caching improves performance by 28.1% on average (up to 38.1%) compared to a shared LLC for sharing-intensive workloads. In addition, adaptive caching reduces NoC energy by 26.6% on average (up to 29.7%) and total system energy by 6.1% on average (up to 27.2%) when configured as a private cache. Finally, we demonstrate through a GPU NoC design space exploration that a hierarchical two-stage crossbar is both more power- and area-efficient than full and concentrated crossbars with the same bisection bandwidth, thus providing a low-cost cooperative solution to exploit workload sharing behavior in memory-side last-level caches

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Full text link
    Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.Comment: Our open source software is available at https://github.com/CMU-SAFARI/DAMO

    Exploring manycore architectures for next-generation HPC systems through the MANGO approach

    Full text link
    [EN] The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668.Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011S1541706
    • …
    corecore