    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group granularity (a group comprises several cache sets) and periodically recorded at the memory controller to guide block placement. An incoming block is then placed at the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than a baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5%, and by as much as 28.5%, for the benchmark programs we examined. Furthermore, our evaluation shows that CE outperforms related CMP cache designs.
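    The placement mechanism described above can be illustrated with a small sketch: a table of per-group pressure counters that is updated on accesses, decayed periodically, and consulted to pick the least-pressured group for an incoming block. The structure and parameters below are illustrative assumptions, not CE's actual hardware.

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Hypothetical sketch: per-group pressure counters guide block placement.
        // Group count, counter width, and decay interval are illustrative, not CE's.
        class PressureTable {
        public:
            explicit PressureTable(std::size_t groups) : pressure_(groups, 0) {}

            // Record one access (a unit of "temporal pressure") against a group.
            void touch(std::size_t group) {
                if (pressure_[group] < UINT16_MAX) ++pressure_[group];
            }

            // Placement decision: choose the group currently under minimum pressure.
            std::size_t least_pressured_group() const {
                return static_cast<std::size_t>(
                    std::min_element(pressure_.begin(), pressure_.end()) -
                    pressure_.begin());
            }

            // Periodic decay so the table tracks temporal rather than cumulative pressure.
            void decay() {
                for (auto& p : pressure_) p >>= 1;
            }

        private:
            std::vector<std::uint16_t> pressure_;
        };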

    CLAM: Compiler Lease of Cache Memory

    Caching is a common solution to the data-movement performance bottleneck of today's computational systems and networks. Traditional caching examines program behavior and cache optimization separately, limiting performance. Recently, a new cache policy called Compiler Lease of cAche Memory (CLAM) has been suggested for program-based cache management. CLAM manages cache memory by allowing the compiler to assign leases, or lifespans, to cached items over a hardware-software interface known as lease cache. Lease cache affords new performance potential by way of program-driven cache optimization. It is applicable to existing cache architecture optimizations and can be used to emulate other cache policies. This paper presents the first functional hardware implementation of lease cache for CLAM support. The lease cache hardware architecture is presented first, along with the CLAM hardware support systems. The cache is emulated on an FPGA and benchmarked using a collection of scientific kernels from the PolyBench/C suite under three CLAM lease assignment policies: Compiler Assigned Reference Leasing (CARL), Phased Reference Leasing (PRL), and Fixed Uniform Leasing (FUL). CARL and PRL achieve performance superior to Least Recently Used (LRU) replacement, while FUL is shown to serve as a safety mechanism for CLAM. A novel spectrum-based cache tenancy analysis verifies PRL's effectiveness in limiting cache utilization and can identify changes in the working set that cause the policy to perform adversely. This suggests that CLAM is extendable to more complex workloads if working-set transitions can elicit a similar change in lease policy. Being able to do so could yield appreciable performance improvements for large and highly iterative workloads such as tensor computations.
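    To make the lease idea concrete, the following is a minimal sketch of lease-style replacement, assuming one lease per block measured in logical access ticks and a compiler-supplied lease table; the real lease-cache hardware interface and the CARL/PRL/FUL assignment policies are not modeled here.

        #include <cstddef>
        #include <cstdint>
        #include <unordered_map>
        #include <utility>
        #include <vector>

        // Hedged sketch of lease-style replacement: each cached block carries an
        // expiration tick, and expired blocks are the preferred victims. Lease values
        // would come from the compiler; here they are simply a lookup table.
        struct Block {
            std::uint64_t tag;
            std::uint64_t expires_at;  // logical tick at which the lease runs out
        };

        class LeaseCache {
        public:
            LeaseCache(std::size_t ways,
                       std::unordered_map<std::uint64_t, std::uint64_t> leases)
                : ways_(ways), leases_(std::move(leases)) {}

            // Returns true on hit. On a miss the block is inserted with its lease.
            bool access(std::uint64_t tag) {
                ++tick_;
                for (auto& b : set_) {
                    if (b.tag == tag) {            // hit: renew the lease
                        b.expires_at = tick_ + lease_for(tag);
                        return true;
                    }
                }
                insert(tag);
                return false;
            }

        private:
            std::uint64_t lease_for(std::uint64_t tag) const {
                auto it = leases_.find(tag);
                return it != leases_.end() ? it->second : default_lease_;
            }

            void insert(std::uint64_t tag) {
                if (ways_ == 0) return;
                if (set_.size() < ways_) {
                    set_.push_back({tag, tick_ + lease_for(tag)});
                    return;
                }
                // Victim selection: prefer any block whose lease has already expired.
                for (auto& b : set_) {
                    if (b.expires_at <= tick_) {
                        b = {tag, tick_ + lease_for(tag)};
                        return;
                    }
                }
                // No expired block: fall back to evicting the earliest-expiring one.
                Block* victim = &set_.front();
                for (auto& b : set_) if (b.expires_at < victim->expires_at) victim = &b;
                *victim = {tag, tick_ + lease_for(tag)};
            }

            std::size_t ways_;
            std::uint64_t tick_ = 0;
            std::uint64_t default_lease_ = 16;  // FUL-like uniform fallback (illustrative)
            std::vector<Block> set_;
            std::unordered_map<std::uint64_t, std::uint64_t> leases_;
        };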

    Architectural Support for Operating System-Driven CMP Cache Management

    The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [22]. Unfortunately, one key resource, the lower-level shared cache in chip multi-processors, is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under a "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third-party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grained events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed, but they offer no policy flexibility. This paper addresses this problem by designing architectural support for the OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface, and a set of OS-level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies, including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization, and (c) reactive performance differentiation.
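    A compact, hypothetical sketch of the division of labor described above: hardware enforces per-application quotas, while an OS-level policy, shown here as a miss-rate-equalization heuristic run at regular scheduler interventions, adjusts those quotas. The interface and policy below are illustrative, not the paper's actual mechanism.

        #include <cstddef>
        #include <vector>

        // Illustrative quota interface: hardware enforces per-app way quotas; the OS
        // only rewrites quotas at coarse, scheduler-tick granularity.
        struct AppStats {
            std::size_t quota_ways;   // current quota (enforced in hardware)
            double miss_rate;         // observed since the last OS intervention
        };

        // Reactive fairness by miss-rate equalization (sketch): shift one way of quota
        // from the app with the lowest miss rate to the app with the highest one.
        void equalize_miss_rates(std::vector<AppStats>& apps) {
            if (apps.size() < 2) return;
            std::size_t hi = 0, lo = 0;
            for (std::size_t i = 1; i < apps.size(); ++i) {
                if (apps[i].miss_rate > apps[hi].miss_rate) hi = i;
                if (apps[i].miss_rate < apps[lo].miss_rate) lo = i;
            }
            if (hi != lo && apps[lo].quota_ways > 1) {
                --apps[lo].quota_ways;   // donor keeps at least one way
                ++apps[hi].quota_ways;   // hardware picks up the new quotas lazily
            }
        }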

    Cache Power Optimization Using Multiple Voltage Supplies to Exploit Read/Write Asymmetry

    Power consumption is becoming increasingly critical in computer systems. Most previous work has focused on the general-purpose computational core, but optimization techniques for the conventional CPU core have reached a limit. Our experimental results show that read operations in SRAM can be performed at a lower supply voltage, with much lower power consumption, than write operations. Based on this observation, and on the fact that the cache, consisting mostly of SRAM, often occupies a significant fraction of the CPU's on-chip area and consumes a large portion of its power, we propose a new method to reduce cache power consumption. By dynamically switching the cache supply voltage between a lower voltage for reads and a higher voltage for writes, our method can effectively reduce cache power without affecting the performance of the multi-level cache hierarchy. We can realize further power savings by lowering the supply below the read voltage for hold-only operation when the cache is idle. Both the power-switching controller implementation and power consumption statistics from various SPEC benchmarks are presented to demonstrate the efficiency of our proposed methods.
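    The switching policy can be summarized as a small selection function: choose the supply rail according to the operation about to be performed (write above read above hold). The enum values and controller structure below are assumptions for illustration, not the paper's circuit.

        // Illustrative supply-selection logic for a read/write/hold-aware cache rail.
        // The specific voltage levels and switching mechanism are assumptions.
        enum class CacheOp { Read, Write, Idle };
        enum class Supply { VHold, VRead, VWrite };  // VHold < VRead < VWrite

        Supply select_supply(CacheOp next_op) {
            switch (next_op) {
                case CacheOp::Write: return Supply::VWrite;  // writes need full voltage
                case CacheOp::Read:  return Supply::VRead;   // reads tolerate a lower rail
                case CacheOp::Idle:  return Supply::VHold;   // retention-only voltage
            }
            return Supply::VWrite;  // conservative default
        }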

    Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads

    Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications. (Funding: United States Defense Advanced Research Projects Agency, Power Efficiency Revolution For Embedded Computing Technologies, Contract HR0011-13-2-0005; National Science Foundation (U.S.), Grant CCF-1318384.)
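    As a heavily simplified illustration of the kind of reasoning the abstract attributes to Ubik, the sketch below checks whether lending part of a latency-critical workload's cache partition to batch applications is safe, given an estimated re-warm time and the slack to its tail-latency target. The model and names are assumptions, not Ubik's actual algorithm.

        // Illustrative safety check for transiently lending cache ways from a
        // latency-critical (LC) app to batch apps. All quantities are assumptions.
        struct LcState {
            double tail_latency_target_ms;  // QoS target on tail latency
            double tail_latency_now_ms;     // current measured tail latency
            double warmup_ms_per_way;       // estimated time to re-warm one lent way
        };

        // Lend ways only if the LC app could re-warm them and still stay within its
        // tail-latency target; otherwise keep the partition intact.
        int ways_safe_to_lend(const LcState& lc, int spare_ways) {
            double slack = lc.tail_latency_target_ms - lc.tail_latency_now_ms;
            if (slack <= 0.0) return 0;                        // already at/over target
            if (lc.warmup_ms_per_way <= 0.0) return spare_ways;
            int affordable = static_cast<int>(slack / lc.warmup_ms_per_way);
            return affordable < spare_ways ? affordable : spare_ways;
        }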

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and of the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task- and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems that reduces the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems.
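    As a rough illustration of the task- and cache-aware partitioning idea behind TCPS, the sketch below hands out cache partitions greedily to the tasks whose WCET is most sensitive to cache size; this is an assumption-laden simplification, not the scheduler from the dissertation.

        #include <algorithm>
        #include <string>
        #include <vector>

        // Simplified illustration: give cache partitions first to the tasks whose WCET
        // shrinks the most per unit of cache (highest WCET sensitivity).
        struct Task {
            std::string name;
            double wcet_sensitivity;   // estimated WCET reduction per cache unit
            int    partitions = 0;     // assigned cache partitions
        };

        void assign_partitions(std::vector<Task>& tasks, int total_partitions) {
            if (tasks.empty()) return;
            std::sort(tasks.begin(), tasks.end(),
                      [](const Task& a, const Task& b) {
                          return a.wcet_sensitivity > b.wcet_sensitivity;
                      });
            // Round-robin over the sensitivity-ordered list so every task gets some
            // cache, but the most sensitive tasks are served first.
            for (int p = 0; p < total_partitions; ++p)
                tasks[p % tasks.size()].partitions += 1;
        }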