16 research outputs found
Improving Cache Hits On Replacment Blocks Using Weighted LRU-LFU Combinations
Block replacement is the process of selecting a block of data or a cache line to be evicted when a new block needs to be brought into a cache or a memory hierarchy. In computer systems, block replacement policies are used in caching mechanisms, such as CPU caches or disk caches, to determine which blocks are evicted when the cache is full and new data needs to be fetched. The weighted combination of LRU (Least Recently Used) and LFU (Least Frequently Used) is known as the "LFU2" algorithm. LFU2 is an enhanced caching algorithm that aims to leverage the benefits of both LRU and LFU by considering both the recency and the frequency of item access. In LFU2, each item in the cache is associated with two counters: the usage counter, which tracks the frequency of item access, and the recency counter, which tracks the recency of item access. These counters are used to calculate a combined weight for each item in the cache. Based on the experimental results, the LRU-LFU combination method succeeded in increasing cache hits from 94.8% on LFU and 95.5% on LFU to 96.6%.
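The two-counter idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the linear blend below, the `alpha` parameter, the normalization of both counters, and the class name `WeightedLruLfuCache` are all assumptions made for illustration, not the authors' method.

```python
class WeightedLruLfuCache:
    """Toy cache that evicts the item with the lowest combined weight.

    The weighting formula and the `alpha` parameter are assumptions made
    for illustration; the abstract does not specify them.
    """

    def __init__(self, capacity, alpha=0.5):
        self.capacity = capacity
        self.alpha = alpha      # share of the weight given to frequency
        self.clock = 0          # logical time, advanced on every access
        self.data = {}          # key -> value
        self.usage = {}         # key -> usage counter (access count)
        self.recency = {}       # key -> recency counter (time of last access)

    def _touch(self, key):
        # Update both counters for an accessed or newly inserted key.
        self.clock += 1
        self.usage[key] = self.usage.get(key, 0) + 1
        self.recency[key] = self.clock

    def _weight(self, key):
        # Normalize both counters to (0, 1] before blending them.
        freq_score = self.usage[key] / max(self.usage.values())
        recency_score = self.recency[key] / self.clock
        return self.alpha * freq_score + (1 - self.alpha) * recency_score

    def get(self, key):
        if key not in self.data:
            return None
        self._touch(key)
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the resident item with the lowest combined weight.
            victim = min(self.data, key=self._weight)
            for table in (self.data, self.usage, self.recency):
                del table[victim]
        self.data[key] = value
        self._touch(key)
```

Under these assumptions, `alpha=0.0` behaves like plain LRU and `alpha=1.0` like plain LFU; values in between let a frequently used item survive even when it is not the most recently accessed one.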
Principled Approaches to Last-Level Cache Management
Memory is a critical component of all computing systems. It represents a fundamental
performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance,
and the cost of implementing management techniques would scale together with
the size of all different computing systems; unfortunately this is not the case. With the upcoming
trends in applications, new memory technologies, etc., scaling becomes a bigger
problem, aggravating the performance bottleneck that memory represents. A memory
hierarchy was proposed to alleviate the problem. Each level in the hierarchy tends to have
a decreasing cost per bit, an increased capacity, and a higher access time compared to its
previous level. Ideally, all data would be stored in the fastest level of memory; unfortunately,
faster memory technologies tend to be associated with a higher manufacturing
cost, which often limits their capacity. The design challenge is to determine which data is
frequently used and to store it in the faster levels of memory.
A cache is a small, fast, on-chip chunk of memory. Any data stored in main memory
can be stored in the cache. For many programs, a typical behavior is to access data that has
been accessed previously. Taking advantage of this behavior, a copy of frequently accessed
data is kept in the cache in order to provide a faster access time the next time it is requested.
Due to capacity constraints, it is likely that not all of the frequently reused data can fit in
the cache. Because of this, cache management policies decide which data is to be kept in
the cache and which in other levels of the memory hierarchy. Under an efficient cache
management policy, a large share of memory requests will be serviced from a
fast on-chip cache.
The disparity in access latency between the last-level cache and main memory motivates
the search for efficient cache management policies. A great deal of recently proposed work strives to utilize cache capacity in the way most favorable to performance.
Related work focuses on optimizing the performance of caches by pursuing
different goals, e.g. reducing miss rate, consuming less power, reducing
storage overhead, reducing access latency, etc.
Our work focuses on improving the performance of last-level caches by designing policies
based on principles adapted from other areas of interest. In this dissertation, we
focus on several aspects of cache management policies. We first introduce a space-efficient
placement and promotion policy whose goal is to minimize the updates to the replacement
policy state on each cache access. We further introduce a mechanism that predicts whether
a block in the cache will be reused; it feeds different features from a block to the predictor
in order to increase the correlation of a previous access to a future access. We later introduce
a technique that tweaks traditional cache indexing, providing fast accesses to a vast
majority of requests in the presence of a slow access memory technology such as DRAM.
Evaluating the Presence of a Victim Cache on an Arm Processor
A mobile processor is a CPU designed to save power. It is found in mobile computers and cell phones. A CPU chip designed for portable computers is typically housed in a smaller chip package, but more importantly, in order to run cooler, it uses lower voltages than its desktop counterpart and has more sleep mode capability. A mobile processor can be throttled down to different power levels, and sections of the chip can be turned off entirely when not in use. ARM is a 32-bit reduced instruction set computer (RISC) instruction set architecture (ISA). The relative simplicity of ARM processors makes them suitable for low-power applications. Hence, ARM processors account for approximately 90% of all mobile 32-bit RISC processors.
Today, mobile processors are expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, memory latencies have a large impact on the execution time of applications. To reduce this impact and serve such applications, the relative complexity of ARM processors has increased in the last decade through the inclusion of traditional methods like multiple issue of instructions, out-of-order instruction execution, and large, associative caches. Victim caching is another method that can be used to reduce the execution time and is currently not incorporated in ARM processors. This method was proposed by Norman P. Jouppi in his paper “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers”.
A victim cache is an extension to a direct-mapped cache that adds a small, secondary, fully associative cache to store cache blocks that have been ejected from the main cache due to a capacity or conflict miss. These ejected blocks are likely to be needed again, so storing them in the secondary cache should increase performance and reduce execution times.
Therefore, for the Master's project we re-implemented the SimpleScalar simulator for an ARM processor to incorporate the impact of a victim cache. The re-implemented simulator showed a significant performance improvement when various applications of the MiBench benchmark suite were run on it. On average, across various Level 1 cache and victim cache configurations, we observed a 1.93% reduction in the number of clock cycles used and a 2.7% increase in the Level 1 cache hit rate. The benefit of the victim cache increases as the size of the Level 1 cache decreases, and the performance boost obtained in the presence of a victim cache is comparable to that obtained with a large, associative Level 1 cache. Hence, incorporating a victim cache into an ARM processor is highly advantageous for the current generation of mobile processors as an alternative to a large, associative Level 1 cache.
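The victim cache mechanism defined in this abstract can be sketched in a few lines. This is a toy model that tracks block addresses only; the class name `DirectMappedWithVictim`, the cache sizes, and the oldest-first replacement inside the victim cache are assumptions for illustration and are not taken from the SimpleScalar re-implementation.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully associative victim cache.

    Tracks block addresses only (no data). The sizes and the oldest-first
    replacement of the victim cache are illustrative assumptions.
    """

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one resident block per set
        self.victim = OrderedDict()         # block address -> None, oldest first
        self.victim_entries = victim_entries

    def _evict_to_victim(self, block):
        if self.victim_entries == 0:
            return                          # no victim cache configured
        if len(self.victim) >= self.victim_entries:
            self.victim.popitem(last=False) # drop the oldest victim entry
        self.victim[block] = None

    def access(self, block):
        """Return 'hit', 'victim_hit', or 'miss' for a block address."""
        index = block % self.num_sets
        resident = self.lines[index]
        if resident == block:
            return "hit"
        if block in self.victim:
            # Swap: the requested block returns to the main cache and the
            # block it displaces becomes the new victim entry.
            del self.victim[block]
            result = "victim_hit"
        else:
            result = "miss"
        self.lines[index] = block
        if resident is not None:
            self._evict_to_victim(resident)
        return result
```

With four sets, blocks 0 and 4 map to the same line. Alternating between them misses on every access in a plain direct-mapped cache, whereas with a victim cache the repeated accesses are served as victim hits.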
Improving virtual memory performance in virtualized environments
Virtual memory is a major system performance bottleneck in virtualized environments. In addition to expensive address translations, frequent virtual machine context switches are common in virtualized environments, resulting in increased TLB miss rates, subsequent expensive page walks, and data cache contention due to incoming page table entries evicting useful data. Orthogonally, translation coherence, which is currently an expensive operation implemented in software, can consume up to 50% of the runtime of an application executing on the guest. To improve the performance of virtual memory in virtualized environments, this thesis proposes two solutions. (1) Context Switch Aware Large TLB (CSALT) is an architecture that addresses the problem of increased TLB miss rates and their adverse impact on data caches. CSALT copes with the increased demand of context switches by storing a large number of TLB entries. It mitigates data cache contention by employing a novel TLB-aware cache partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline with a large L3 TLB. (2) Translation Coherence using Addressable TLBs (TCAT) is a hardware translation coherence scheme that eliminates almost all of the overheads associated with address translation coherence. TCAT overlays translation coherence atop cache coherence to accurately identify slave cores. It then leverages the addressable Part-Of-Memory TLB (POM-TLB) to eliminate expensive Inter Processor Interrupts (IPI) and achieve precise invalidations on the slave core. On 8-core systems with one virtual machine context executing multi-threaded workloads, TCAT achieves an average performance improvement of 13% over the kvmtlb baseline.
Electrical and Computer Engineering
Exploring Alternate Cache Indexing Techniques
Cache memory is a bridging component which covers the increasing gap between the speed of
a processor and main memory. Excellent cache performance is crucial to improving system
performance. Conflict misses are one of the critical factors limiting cache performance: many
blocks map to the same set, which results in the eviction of many blocks. Meanwhile, many
block frames in other cache sets remain unused, and thus the available space is not efficiently utilized. A
direct way to reduce conflict misses is to increase associativity, but this comes with the cost of an
increase in the hit time. Another way to reduce conflict misses is to change the cache-indexing
scheme and distribute the accesses across all sets.
This thesis focuses on the second way mentioned above and aims to evaluate the impact of the
matrix-based indexing scheme on cache performance against the traditional modulus-based indexing
scheme. A correlation between the proposed indexing scheme and different cache replacement
policies is also observed.
The matrix-based indexing scheme yields a geometric mean speedup of 1.2% for SPEC CPU
2017 benchmarks in single-core simulations when applied to a direct-mapped last-level cache. In
this case, improvements of 1.5% and 4% are observed for at least eighteen and seven SPEC
CPU2017 applications, respectively. It also yields a 2% performance improvement over sixteen
SPEC CPU2006 benchmarks. The new indexing scheme correlates well with multiperspective
reuse prediction. It is observed that LRU benefits the machine learning benchmark by 5.1% in
performance. For multicore simulations, the new indexing scheme does not improve performance
significantly. However, it also does not impact application performance negatively.
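The contrast between modulus-based and matrix-based indexing can be sketched as follows. The abstract does not specify the matrix, so this sketch assumes one common formulation in which each index bit is the XOR (parity) of a selected subset of address bits, i.e. a binary matrix-vector product over GF(2). The function names and the particular 2x4 matrix `ROW_MASKS` are hypothetical.

```python
def modulus_index(block_addr, num_sets):
    """Traditional indexing: the low-order bits of the block address."""
    return block_addr % num_sets


def matrix_index(block_addr, row_masks):
    """Index bit i is the parity (XOR) of the address bits selected by row_masks[i].

    This is a binary matrix-vector product over GF(2), one matrix row per index bit.
    """
    index = 0
    for bit, mask in enumerate(row_masks):
        parity = bin(block_addr & mask).count("1") & 1
        index |= parity << bit
    return index


# Hypothetical 2x4 matrix for a 4-set cache (a0 is the least significant bit):
# index bit 0 = a0 xor a2, index bit 1 = a1 xor a3.
ROW_MASKS = [0b0101, 0b1010]

# A stride-4 access pattern: every block shares the same two low-order bits.
stride_blocks = [0, 4, 8, 12]
modulus_sets = [modulus_index(b, 4) for b in stride_blocks]
matrix_sets = [matrix_index(b, ROW_MASKS) for b in stride_blocks]

print("modulus:", modulus_sets)
print("matrix: ", matrix_sets)
```

Under modulus indexing all four blocks land in set 0 and evict one another in a direct-mapped cache; with the assumed matrix, the higher address bits are folded into the index, so the same blocks spread across all four sets.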
Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors
General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.
Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.
Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.
Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.
In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
Dissertation/Thesis Doctoral Dissertation Computer Science 201
Speculative Techniques for Memory Hierarchy Management
The “Memory Wall” [1] is the gap in performance between the processor and the main memory. Over the last 30 years, computer architects have added multiple levels of cache to fill this gap. Cache levels that are closer to the processor are smaller and faster, whereas the levels that are farther from the processor are bigger and slower. However, processors are still exposed to the latency of DRAM on misses. Therefore, speculative memory management techniques such as prefetching are used in modern microprocessors to bridge this gap in performance.
First, we propose Synchronization-aware Hardware Prefetching for Chip Multiprocessors, a novel hardware data prefetching scheme designed for shared-memory, multi-threaded workloads. This is the first work we are aware of to characterize the causes of poor prefetching performance in shared-memory multi-threaded applications. These are the inability to prefetch beyond synchronization points and the tendency to prefetch shared data before it has been written. SB-Fetch is a low-complexity, low-overhead prefetcher design that addresses both issues.
Second, we propose a new prefetching algorithm, Set-Level Adaptive Prefetching for Compressed Caches (SLAP-CC), which seeks to address this problem by varying the prefetching aggressiveness based on how much effective capacity is available in each set. The contributions of this work are to characterize the increase and per-set variability of cache efficiency that typical cache compression schemes create, and to propose a new prefetching scheme, SLAP-CC, designed to leverage this cache efficiency variability.
Third, we propose a new scheduling mechanism that predicts hard-to-prefetch loads at issue time and preemptively schedules them for execution as soon as they are ready, allowing the cache hierarchy to start the miss handling mechanism sooner. Such a scheduling mechanism reduces the miss penalty on the instructions that depend on a hard-to-prefetch load.