Perceptron Learning in Cache Management and Prediction Techniques
Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. An efficient prefetcher should identify complex memory access patterns during program execution. This ability enables the prefetcher to read a block ahead of its demand access, potentially preventing a cache miss. Accurately identifying the right blocks to prefetch is essential to achieving high performance from the prefetcher.
Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses which the prefetcher brings into the cache; and accuracy, the fraction of prefetches which are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy.
In this thesis, I propose Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by a baseline prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of a given baseline prefetcher, leading to increased coverage by filtering out the growing numbers of inaccurate prefetches such an aggressive tuning implies. I also explore a range of features with which to train PPF's perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the baseline prefetcher alone.
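The filtering idea described above can be illustrated with a minimal sketch: hash several program features into small weight tables, sum the selected weights, and issue a prefetch only if the sum clears a threshold. The feature count, table size, and threshold below are illustrative assumptions, not the published PPF configuration.

```python
# Minimal sketch of a perceptron-based prefetch filter in the spirit of PPF.
# Table sizes, weight limits, and the threshold are illustrative assumptions.

class PerceptronPrefetchFilter:
    TABLE_SIZE = 1024          # entries per feature weight table (assumed)
    THRESHOLD = 0              # issue the prefetch only if the sum exceeds this
    WEIGHT_MAX, WEIGHT_MIN = 31, -32

    def __init__(self, num_features=3):
        # One small weight table per input feature.
        self.tables = [[0] * self.TABLE_SIZE for _ in range(num_features)]

    def _indices(self, features):
        # Each feature value indexes its own table (simple modulo hash).
        return [f % self.TABLE_SIZE for f in features]

    def predict(self, features):
        # Sum the weights selected by each feature; keep the prefetch
        # candidate only if the sum clears the threshold.
        total = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return total > self.THRESHOLD

    def train(self, features, was_useful):
        # Saturating increment on useful prefetches, decrement on useless ones.
        delta = 1 if was_useful else -1
        for t, i in zip(self.tables, self._indices(features)):
            t[i] = max(self.WEIGHT_MIN, min(self.WEIGHT_MAX, t[i] + delta))
```

The filter sits between the baseline prefetcher and the prefetch queue: the baseline can be tuned aggressively, and the perceptron sum rejects the candidates its training history marks as likely useless.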
Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction
Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.
Comment: To appear in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202
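The prediction loop described in the abstract can be sketched as follows. The specific features (a hash of recent program counters and the cache-block index of the address), the history depth, and the activation threshold are illustrative assumptions rather than the published Hermes configuration.

```python
# Toy sketch of a perceptron-based off-chip load predictor in the spirit of
# Hermes. Feature choices, history depth, and threshold are assumptions.
from collections import deque

class OffChipLoadPredictor:
    TABLE_SIZE = 2048
    ACTIVATE = 2               # assumed activation threshold
    WMAX, WMIN = 15, -16

    def __init__(self):
        self.pc_history = deque([0] * 4, maxlen=4)  # assumed history depth
        self.pc_table = [0] * self.TABLE_SIZE       # weights for PC-sequence feature
        self.off_table = [0] * self.TABLE_SIZE      # weights for block-index feature

    def _features(self, vaddr):
        pc_hash = 0
        for pc in self.pc_history:
            pc_hash = (pc_hash * 31 + pc) % self.TABLE_SIZE
        block = (vaddr >> 6) % self.TABLE_SIZE      # cache-block granularity
        return pc_hash, block

    def predict(self, pc, vaddr):
        # If this returns True, Hermes would issue a speculative request to
        # the memory controller in parallel with the cache hierarchy lookup.
        self.pc_history.append(pc)
        h, b = self._features(vaddr)
        return self.pc_table[h] + self.off_table[b] >= self.ACTIVATE

    def train(self, vaddr, went_off_chip):
        # Update weights once the load's true outcome (cache hit vs. off-chip
        # miss) is known, with saturation at the weight limits.
        h, b = self._features(vaddr)
        delta = 1 if went_off_chip else -1
        for table, idx in ((self.pc_table, h), (self.off_table, b)):
            table[idx] = max(self.WMIN, min(self.WMAX, table[idx] + delta))
```

The key property is that prediction happens early, at address generation, so the speculative memory request overlaps the entire on-chip cache lookup rather than waiting for it to miss.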
Evaluation of L1 Residence for Perceptron Filter Enhanced Signature Path Prefetcher
Rapid advancement of integrated circuit technology described by Moore's Law has greatly increased computational power. Processors have taken advantage of this by increasing computation rates, while memory has gained increased capacity. As processor operation speeds have greatly exceeded memory access times, computer architects have added multiple levels of caches to avoid penalties for repeat accesses to memory. While this is an improvement, architects have further improved access efficiency by developing methods of prefetching data from memory to hide the latency penalty usually incurred on a cache miss. Previous work at Texas A&M, including its submission to the Third Data Prefetching Championship (DPC3), primarily consisted of L2 cache prefetching. L1 prefetching has been less explored than L2 due to hardware limitations on implementation. In this paper, I attempt to evaluate the effect of L1 residence for Texas A&M's Perceptron Filtered Signature Path Prefetcher (PPF). While an unoptimized movement of the PPF from the L2 to the L1 showed performance degradation, optimizations such as using the L1 data stream to prefetch to all cache levels and updating table sizes and lengths have matched L2 performance.
Principled Approaches to Last-Level Cache Management
Memory is a critical component of all computing systems. It represents a fundamental performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance, and the cost of implementing management techniques would scale together with the size of all different computing systems; unfortunately, this is not the case. With the upcoming trends in applications, new memory technologies, etc., scaling becomes a bigger problem, aggravating the performance bottleneck that memory represents. A memory hierarchy was proposed to alleviate the problem. Each level in the hierarchy tends to have a lower cost per bit, a larger capacity, and a higher access time than the level before it. Preferably, all data would be stored in the fastest level of memory; unfortunately, faster memory technologies tend to have a higher manufacturing cost, which often limits their capacity. The design challenge is to determine which data is frequently used and store it in the faster levels of memory.
A cache is a small, fast, on-chip chunk of memory. Any data stored in main memory can be stored in the cache. For many programs, a typical behavior is to access data that has been accessed previously. Taking advantage of this behavior, a copy of frequently accessed data is kept in the cache in order to provide a faster access time the next time it is requested. Due to capacity constraints, it is likely that all of the frequently reused data cannot fit in the cache; because of this, cache management policies decide which data is to be kept in the cache and which in other levels of the memory hierarchy. Under an efficient cache management policy, an encouraging fraction of memory requests will be serviced from a fast on-chip cache.
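The management problem sketched above can be made concrete with one set of a set-associative cache under the classic LRU policy: hits promote a block to most-recently-used, and a fill into a full set evicts the least-recently-used block. The associativity and interface are illustrative choices, not drawn from the dissertation.

```python
# Minimal sketch of one set of a set-associative cache under LRU replacement,
# illustrating how a management policy chooses which block to keep.
from collections import OrderedDict

class LRUCacheSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()        # tag -> block, ordered oldest-first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # promote to most-recently-used on a hit
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least-recently-used block
        self.blocks[tag] = True              # fill the missing block
        return "miss"
```

Replacement-policy research amounts to replacing the promotion and eviction decisions in this loop with smarter heuristics or predictors.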
The disparity in access latency between the last-level cache and main memory motivates the search for efficient cache management policies. A great deal of recently proposed work strives to utilize cache capacity in the way most favorable to performance. Related work focuses on optimizing the performance of caches through different possible solutions, e.g., reducing the miss rate, power consumption, storage overhead, or access latency.
Our work focuses on improving the performance of last-level caches by designing policies based on principles adapted from other areas of interest. In this dissertation, we address several aspects of cache management policies. We first introduce a space-efficient placement and promotion policy whose goal is to minimize the updates to the replacement policy state on each cache access. We then introduce a mechanism that predicts whether a block in the cache will be reused; it feeds different features of a block to the predictor in order to increase the correlation between a previous access and a future access. We finally introduce a technique that tweaks traditional cache indexing, providing fast accesses to the vast majority of requests in the presence of a slow-access memory technology such as DRAM.
Evaluation of Cache Inclusion Policies in Cache Management
Processor speed has been increasing at a higher rate than the speed of memories over the last years. Caches were designed to mitigate this gap and, ever since, several cache management techniques have been designed to further improve performance.
Most techniques have been designed and evaluated on non-inclusive caches even though many modern processors implement either inclusive or exclusive policies. Exclusive caches benefit from a larger effective capacity, so they might become more popular when the number of cores per last-level cache increases.
This thesis aims to demonstrate that the best cache management techniques for exclusive caches do not necessarily have to be the same as for non-inclusive or inclusive caches. To assess this statement we evaluated several cache management techniques with different inclusion policies, number of cores and cache sizes.
We found that the configurations for inclusive and non-inclusive policies usually performed similarly, but for exclusive caches the best configurations were indeed different. Prefetchers impacted performance more than replacement policies, and determined which configurations were the best ones. Also, exclusive caches showed a higher speedup on multi-core.
The least recently used (LRU) replacement policy is among the best policies for any prefetcher combination in exclusive caches, yet it is the policy typically used only as a baseline in cache replacement policy research. Therefore, we conclude that the results in this thesis motivate further research on prefetchers and replacement policies targeted to exclusive caches.
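The distinction the thesis evaluates can be shown with a toy fill path: on an L1 miss that hits in L2, an exclusive hierarchy moves the block into L1 and drops the L2 copy, while a non-inclusive one simply copies it upward. The set-based interface is an illustrative simplification.

```python
# Toy illustration of exclusive vs. non-inclusive fill behavior.
# l1 and l2 are sets of resident block tags (illustrative interface).

def fill_on_l2_hit(l1, l2, tag, policy):
    """Serve an L1 miss that hit in L2 under the given inclusion policy."""
    l1.add(tag)                 # the block is installed in L1 either way
    if policy == "exclusive":
        l2.discard(tag)         # exclusive: the block lives in exactly one level
    return l1, l2
```

Because the exclusive L2 never duplicates L1-resident blocks, its effective capacity is larger, which is why the best-performing management techniques can differ from the non-inclusive case.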
Page size aware cache prefetching
The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial cache prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection within 4KB physical page boundaries when modern systems use page sizes larger than 4KB to mitigate the address translation overheads. This paper exploits the high usage of large pages in modern systems to increase the effectiveness of spatial cache prefetching. We design and propose the Page-size Propagation Module (PPM), a microarchitectural scheme that propagates the page size information to the lower-level cache prefetchers, enabling safe prefetching beyond 4KB physical page boundaries when the accessed blocks reside in large pages, at the cost of augmenting the first-level caches' Miss Status Holding Register (MSHR) entries with one additional bit. PPM is compatible with any cache prefetcher without implying design modifications. We capitalize on PPM's benefits by designing a module that consists of two page size aware prefetchers that inherently use different page sizes to drive prefetching. The composite module uses adaptive logic to dynamically enable the most appropriate page size aware prefetcher. Finally, we show that the proposed designs are transparent to which cache prefetcher is used. We apply the proposed page size exploitation techniques to four state-of-the-art spatial cache prefetchers. Our evaluation shows that our proposals improve single-core geomean performance by up to 8.1% (2.1% at minimum) over the original implementation of the considered prefetchers, across 80 memory-intensive workloads.
In multi-core contexts, we report geomean speedups up to 7.7% across different cache prefetchers and core configurations.
This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the European Union Horizon 2020 research and innovation program under grant agreement No 955606 (DEEP-SEA EU project), the National Science Foundation through grants CNS-1938064 and CCF-1912617, and the Semiconductor Research Corporation project GRC 2936.001. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry, and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been partially supported by the Grant RYC2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF 'Investing in your future'.
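The safety check that page-size propagation enables can be sketched directly: a spatial prefetcher normally drops candidates that cross the 4KB page boundary, but once the true page size is known it can keep candidates that stay inside a large page. The page sizes and the delta-based candidate generation below are illustrative assumptions, not the PPM design itself.

```python
# Sketch of the boundary check enabled by propagating the page size.
# Delta-based candidate generation is an illustrative assumption.

PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024
BLOCK = 64                     # cache block size in bytes

def prefetch_candidates(miss_addr, deltas, page_size=PAGE_4K):
    """Return block-aligned prefetch addresses that stay inside the page
    containing miss_addr, for the propagated page_size."""
    page_base = miss_addr - (miss_addr % page_size)
    out = []
    for d in deltas:
        cand = miss_addr + d * BLOCK
        # Safe only if the candidate stays within the same physical page.
        if page_base <= cand < page_base + page_size:
            out.append(cand - cand % BLOCK)
    return out
```

With a miss near the end of a 4KB region, the 4KB-assuming prefetcher must drop every forward candidate, while the same prefetcher told the block lies in a 2MB page can keep them all.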