Perceptron Learning in Cache Management and Prediction Techniques
Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. An efficient prefetcher should identify complex memory access patterns during program execution. This ability enables the prefetcher to read a block ahead of its demand access, potentially preventing a cache miss. Accurately identifying the right blocks to prefetch is essential to achieving high performance from the prefetcher.
Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses which the prefetcher brings into the cache; and accuracy, the fraction of prefetches which are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy.
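To make these two metrics concrete, here is a minimal sketch of how they could be computed from simple event counters; the counter names are illustrative and not taken from the thesis:

```python
def prefetcher_metrics(baseline_misses, misses_covered,
                       prefetches_issued, prefetches_used):
    """Coverage and accuracy as defined above (illustrative counter names).

    coverage = fraction of baseline cache misses eliminated by prefetching
    accuracy = fraction of issued prefetches that end up being used
    """
    coverage = misses_covered / baseline_misses if baseline_misses else 0.0
    accuracy = prefetches_used / prefetches_issued if prefetches_issued else 0.0
    return coverage, accuracy
```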
In this thesis, I propose Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by a baseline prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of a given baseline prefetcher, leading to increased coverage by filtering out the growing number of inaccurate prefetches such an aggressive tuning implies. I also explore a range of features to use to train PPF's perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the baseline prefetcher alone.
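The abstract does not give implementation details, but the general shape of a perceptron-style prefetch filter can be sketched as follows; the feature set, table sizes, and thresholds are assumptions for illustration, not PPF's actual design:

```python
class PerceptronPrefetchFilter:
    """Illustrative perceptron-style filter that votes on candidate prefetches."""

    def __init__(self, table_size=1024, threshold=0, max_w=31, min_w=-32):
        # one small weight table per feature, indexed by a hash of that feature
        self.features = ["pc", "block_offset_in_page", "pc_xor_block"]
        self.tables = {f: [0] * table_size for f in self.features}
        self.table_size = table_size
        self.threshold = threshold
        self.max_w, self.min_w = max_w, min_w

    def _indices(self, pc, addr):
        block = addr >> 6  # 64-byte cache blocks
        return {
            "pc": pc % self.table_size,
            "block_offset_in_page": (block & 0x3F) % self.table_size,
            "pc_xor_block": (pc ^ block) % self.table_size,
        }

    def predict(self, pc, addr):
        """Return True if the candidate prefetch should be issued."""
        idx = self._indices(pc, addr)
        score = sum(self.tables[f][idx[f]] for f in self.features)
        return score >= self.threshold

    def train(self, pc, addr, was_useful):
        """Update the weights once the outcome of the prefetch is known."""
        idx = self._indices(pc, addr)
        step = 1 if was_useful else -1
        for f in self.features:
            w = self.tables[f][idx[f]] + step
            self.tables[f][idx[f]] = max(self.min_w, min(self.max_w, w))
```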
Last-touch correlated data streaming
Recent research advocates address-correlating predictors to identify cache block addresses for prefetch. Unfortunately, address-correlating predictors require correlation data storage proportional in size to a program's active memory footprint. As a result, current proposals for this class of predictor are either limited in coverage due to constrained on-chip storage requirements or limited in prediction lookahead due to long off-chip correlation data lookup. In this paper, we propose Last-Touch Correlated Data Streaming (LT-cords), a practical address-correlating predictor. The key idea of LT-cords is to record correlation data off chip in the order they will be used and stream them into a practically sized on-chip table shortly before they are needed, thereby obviating the need for scalable on-chip tables and enabling low-latency lookup. We use cycle-accurate simulation of an 8-way out-of-order superscalar processor to show that: (1) LT-cords with 214KB of on-chip storage can achieve the same coverage as a last-touch predictor with unlimited storage, without sacrificing predictor lookahead, and (2) LT-cords improves performance by 60% on average and 385% at best in the benchmarks studied.
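As a rough illustration of the streaming idea described above, the following sketch keeps correlation records off chip in the order they will be consumed and stages a small window of them into an on-chip table; the record format and sizes are assumptions, not the paper's design:

```python
from collections import deque


class LTCordsStreamer:
    """Illustrative last-touch correlation streaming: an ordered off-chip log
    feeding a small on-chip lookup table (all sizes/fields are assumptions)."""

    def __init__(self, offchip_log, onchip_capacity=4096):
        # (trigger_addr, predicted_addr) pairs in the order they will be used
        self.offchip_log = deque(offchip_log)
        self.onchip = {}  # small on-chip table for low-latency lookup
        self.capacity = onchip_capacity

    def refill(self):
        # stream the next records on chip shortly before they are needed
        while self.offchip_log and len(self.onchip) < self.capacity:
            trigger, predicted = self.offchip_log.popleft()
            self.onchip[trigger] = predicted

    def lookup(self, trigger_addr):
        # consume the record on use and keep the on-chip window topped up
        predicted = self.onchip.pop(trigger_addr, None)
        self.refill()
        return predicted
```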
SGDP: A Stream-Graph Neural Network Based Data Prefetcher
Data prefetching is important for storage system optimization and access performance improvement. Traditional prefetchers work well for mining access patterns of sequential logical block address (LBA) but cannot handle complex non-sequential patterns that commonly exist in real-world applications. The state-of-the-art (SOTA) learning-based prefetchers cover more LBA accesses. However, they do not adequately consider the spatial interdependencies between LBA deltas, which leads to limited performance and robustness. This paper proposes a novel Stream-Graph neural network-based Data Prefetcher (SGDP). Specifically, SGDP models LBA delta streams using a weighted directed graph structure to represent interactive relations among LBA deltas and further extracts hybrid features by graph neural networks for data prefetching. We conduct extensive experiments on eight real-world datasets. Empirical results verify that SGDP outperforms the SOTA methods in terms of the hit ratio by 6.21%, the effective prefetching ratio by 7.00%, and speeds up inference time by 3.13X on average. Besides, we generalize SGDP to different variants by different stream constructions, further expanding its application scenarios and demonstrating its robustness. SGDP offers a novel data prefetching solution and has been verified in commercial hybrid storage systems in the experimental phase. Our codes and appendix are available at https://github.com/yyysjz1997/SGDP/
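A minimal sketch of the stream-graph construction the abstract describes: consecutive LBA deltas become nodes and transitions between deltas become weighted directed edges. The GNN feature extraction itself is omitted, and the function names are illustrative rather than the paper's implementation:

```python
from collections import defaultdict


def build_delta_graph(lba_stream):
    """Build a weighted directed graph over successive LBA deltas."""
    deltas = [b - a for a, b in zip(lba_stream, lba_stream[1:])]
    edges = defaultdict(int)  # (delta_i, delta_j) -> transition count (edge weight)
    for d1, d2 in zip(deltas, deltas[1:]):
        edges[(d1, d2)] += 1
    return deltas, dict(edges)


# Example: a mostly sequential stream with an occasional stride of 8
stream = [100, 101, 102, 110, 111, 112, 120]
_, graph = build_delta_graph(stream)
print(graph)  # {(1, 1): 2, (1, 8): 2, (8, 1): 1}
```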
A Branch-Directed Data Cache Prefetching Technique for Inorder Processors
The increasing gap between processor and main memory speeds has become a serious bottleneck to further improvement in system performance. Data prefetching techniques have been proposed to hide the performance impact of such long memory latencies. However, most currently proposed data prefetchers predict future memory accesses based on current memory misses, which limits the opportunity that can be exploited to guide prefetching.
In this thesis, we propose a branch-directed data prefetcher that uses the high prediction accuracy of current-generation branch predictors to predict a future basic block trace that the program will execute, and issues prefetches for all the identified memory instructions contained therein. We also propose a novel technique to generate prefetch addresses by exploiting the correlation between the addresses generated by memory instructions and the values of the corresponding source registers at prior branch instances. We evaluate the impact of our prefetcher using a cycle-accurate simulation of an in-order processor on the M5 simulator. The results show that the branch-directed prefetcher improves performance on a set of 18 SPEC CPU2006 benchmarks by an average of 38.789% over a no-prefetching implementation and 2.148% over a system that employs a Spatial Memory Streaming prefetcher.
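A very simplified sketch of the branch-directed idea: follow the branch predictor a few basic blocks ahead and issue prefetches for the memory instructions found along that predicted trace, with addresses looked up in a correlation table keyed by the load PC and a register value observed at the branch. All structures, depths, and names below are illustrative assumptions, not the thesis's mechanism:

```python
class BranchDirectedPrefetcher:
    """Illustrative branch-directed prefetching along a predicted block trace."""

    def __init__(self, cfg, branch_predictor, depth=4):
        self.cfg = cfg                         # basic-block id -> list of memory-instruction PCs
        self.predict_next = branch_predictor   # basic-block id -> predicted successor id (or None)
        self.depth = depth                     # how many blocks ahead to walk
        self.addr_table = {}                   # (load_pc, reg_value) -> last observed address

    def record(self, load_pc, reg_value, address):
        # learn the correlation between register values at branches and load addresses
        self.addr_table[(load_pc, reg_value)] = address

    def prefetch_trace(self, current_block, reg_value):
        prefetches = []
        block = current_block
        for _ in range(self.depth):
            block = self.predict_next(block)
            if block is None:
                break
            for load_pc in self.cfg.get(block, []):
                addr = self.addr_table.get((load_pc, reg_value))
                if addr is not None:
                    prefetches.append(addr)
        return prefetches
```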
Mechanisms to improve the efficiency of hardware data prefetchers
A well known performance bottleneck in computer architecture is the so-called memory wall. This term refers to the huge disparity between on-chip and off-chip access latencies. Historically speaking, the operating frequency of processors has increased at a steady pace, while most past advances in memory technology have been in density, not speed. Nowadays, the trend for ever increasing processor operating frequencies has been replaced by an increasing number of CPU cores per chip. This will continue to exacerbate the memory wall problem, as several cores now have to compete for off-chip data access. As multi-core systems pack more and more cores, it is expected that the access latency as observed by each core will continue to increase. Although the causes of the memory wall have changed, it is, and will continue to be in the near future, a very significant challenge in terms of computer architecture design.
Prefetching has been an important technique to amortize the effect of the memory wall. With prefetching, data or instructions that are expected to be used in the near future are speculatively moved up in the memory hierarchy, where the access latency is smaller. This dissertation focuses on hardware data prefetching at the last cache level before memory (last-level cache, LLC). Prefetching at the LLC usually offers the best performance increase, as this is where the disparity between hit and miss latencies is the largest.
Hardware prefetchers operate by examining the miss address stream generated by the cache and identifying patterns and correlations between the misses. Most prefetchers divide the global miss stream into several sub-streams, according to some pre-specified criteria. This process is known as localization. The benefits of localization are well established: it increases the accuracy of the predictions and helps filter out spurious, non-predictable misses. However, localization has one important drawback: since the misses are classified into different sub-streams, important chronological information is lost. A consequence of this is that most localizing prefetchers issue prefetches in an untimely manner, fetching data too far in advance. This behavior promotes data pollution in the cache.
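A minimal sketch of localization in the spirit of per-PC delta prefetchers such as PC/DC (named later in this abstract): the global miss stream is split into per-PC sub-streams, and prefetch candidates come from delta behavior inside each sub-stream. This is a deliberately simplified repeating-delta check, not the full GHB-based mechanism, and all sizes are assumptions:

```python
from collections import defaultdict


class LocalizedDeltaPrefetcher:
    """Illustrative PC-localized miss streams with a simple delta check."""

    def __init__(self, history=4):
        self.substreams = defaultdict(list)  # load PC -> recent miss addresses
        self.history = history

    def on_miss(self, pc, addr):
        stream = self.substreams[pc]
        stream.append(addr)
        if len(stream) > self.history:
            stream.pop(0)
        if len(stream) < 3:
            return None
        # if the last two deltas in this sub-stream match, predict continuation
        d1 = stream[-1] - stream[-2]
        d2 = stream[-2] - stream[-3]
        return addr + d1 if d1 == d2 else None
```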
The first part of this thesis proposes a new class of prefetchers based on the novel concept of Stream Chaining. With Stream Chaining, the prefetcher tries to reconstruct the chronological information lost in the process of localization, while at the same time keeping its benefits. We describe two novel Stream Chaining prefetching algorithms based on two state-of-the-art localizing prefetchers: PC/DC and C/DC. We show how both prefetchers issue prefetches in a more timely manner than their non-chaining counterparts, increasing performance by as much as 55% (10% on average) on a suite of sequential benchmarks, while consuming roughly the same amount of memory bandwidth.
In order to hide the effects of the memory wall, hardware prefetchers are usually configured to aggressively prefetch as much data as possible. However, a highly aggressive prefetcher can have negative effects on performance. Factors such as prefetching accuracy, cache pollution and memory bandwidth consumption have to be taken into account. This is especially important in the context of multi-core systems, where typically each core has its own prefetching engine and there is high competition for accessing memory. Several prefetch throttling and filtering mechanisms have been proposed to maximize the effect of prefetching in multi-core systems. The general strategy behind these heuristics is to promote prefetches that are more likely to be used and cause less interference. Traditionally these methods operate at the source level, i.e., directly in the prefetch engine they are assigned to control.
In multi-core systems all prefetches are aggregated in a FIFO-like data structure called the Prefetch Request Queue (PRQ), where they wait to be dispatched to memory. The second part of this thesis shows that a traditional FIFO PRQ does not promote timely prefetching behavior and usually hinders part of the performance benefits achieved by throttling heuristics. We propose a novel approach to prefetch aggressiveness control in multi-cores that performs throttling at the PRQ (i.e., global) level, using global knowledge of the metrics of all prefetchers and information about the global state of the PRQ. To do this, we introduce the Resizable Prefetching Heap (RPH), a data structure modeled after a binary heap that promotes timely dispatch of prefetches as well as fairness in the distribution of prefetching bandwidth. The RPH is designed as a drop-in replacement for traditional FIFO PRQs. We compare our proposal against a state-of-the-art source-level throttling algorithm (HPAC) in an 8-core system. Unlike previous research, we evaluate both multiprogrammed and multithreaded (parallel) workloads, using a modern prefetching algorithm (C/DC). Our experimental results show that RPH-based throttling increases the throttling performance benefits obtained by HPAC by as much as 148% (53.8% average) in multiprogrammed workloads and as much as 237% (22.5% average) in parallel benchmarks, while consuming roughly the same amount of memory bandwidth. When comparing the speedup over fixed-degree prefetching, RPH increased the average speedup of HPAC from 7.1% to 10.9% in multiprogrammed workloads, and from 5.1% to 7.9% in parallel benchmarks.
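The abstract describes RPH only at a high level; as a rough sketch of the underlying idea, replacing a FIFO dispatch order with a heap ordered by some priority over the queued prefetches, the priority fields below (estimated time of use, confidence) are assumptions rather than RPH's actual policy:

```python
import heapq


class HeapPRQ:
    """Illustrative heap-ordered prefetch request queue (vs. a FIFO PRQ)."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # monotonic tie-breaker keeps ordering stable

    def enqueue(self, address, confidence, estimated_use_time):
        # smaller key dispatches earlier: soon-to-be-used, high-confidence first
        key = (estimated_use_time, -confidence)
        heapq.heappush(self._heap, (key, self._seq, address))
        self._seq += 1

    def dispatch(self):
        # pop the most urgent prefetch instead of the oldest one
        if not self._heap:
            return None
        _, _, address = heapq.heappop(self._heap)
        return address
```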
Reference Speculation-driven Memory Management
The "Memory Wall", the vast gulf between processor execution speed and memory latency, has led to the development of large and deep cache hierarchies over the last twenty years. Although processor frequency is no longer on the exponential growth curve, the drive towards ever greater main memory capacity and limited off-chip bandwidth have kept this gap from closing significantly. In addition, future memory technologies such as Non-Volatile Memory (NVM) devices do not help to decrease the latency of the first reference to a particular memory address. To reduce the increasing off-chip memory access latency, this dissertation presents three intelligent speculation mechanisms that can predict and manage future memory usage.
First, we propose a novel hardware data prefetcher called Signature Path Prefetcher (SPP), which offers effective solutions for major challenges in prefetcher design. SPP uses a compressed history-based scheme that accurately predicts a series of long complex address patterns. For example, to address a series of long complex memory references, SPP uses a compressed history signature that is able to learn and prefetch complex data access patterns. Moreover, unlike other history-based algorithms, which miss out on many prefetching opportunities when address patterns make a transition between physical pages, SPP tracks the stream of data accesses across physical page boundaries and continues prefetching as soon as they move to new pages. Finally, SPP uses the confidence it has in its predictions to adaptively throttle itself on a per-prefetch stream basis. In our analysis, we find that SPP outperforms the state-of-the-art hardware data prefetchers by 6.4% with higher prefetching accuracy and lower off-chip bandwidth usage.
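A highly simplified sketch of signature-path prefetching in the spirit described above: a signature compresses the recent delta history, a table maps signatures to observed next deltas with counts, and prefetching walks this table speculatively until the compounded confidence drops below a threshold. The hash, widths, and thresholds are illustrative assumptions, not SPP's actual configuration:

```python
class SignaturePathPrefetcher:
    """Illustrative signature -> delta prediction with confidence throttling."""

    def __init__(self, threshold=0.25):
        self.table = {}            # signature -> {next_delta: count}
        self.threshold = threshold

    @staticmethod
    def _update_sig(signature, delta):
        # 12-bit compressed history of recent deltas (illustrative hash)
        return ((signature << 3) ^ (delta & 0x3F)) & 0xFFF

    def train(self, signature, observed_delta):
        counters = self.table.setdefault(signature, {})
        counters[observed_delta] = counters.get(observed_delta, 0) + 1

    def prefetch_path(self, signature, block, max_depth=8):
        prefetches, confidence = [], 1.0
        for _ in range(max_depth):
            counters = self.table.get(signature)
            if not counters:
                break
            delta, count = max(counters.items(), key=lambda kv: kv[1])
            confidence *= count / sum(counters.values())  # path confidence compounds
            if confidence < self.threshold:               # per-stream throttling
                break
            block += delta
            prefetches.append(block)
            signature = self._update_sig(signature, delta)
        return prefetches
```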
Second, we develop a holistic on-chip cache management system that tightly integrates data prefetching and cache replacement algorithms into one unified solution. Also, we eliminate the use of Program Counter (PC) in the cache replacement module by using a simple dead block prediction with global hysteresis. In addition to effectively predicting dead blocks in the Last-Level Cache (LLC) by observing program phase behaviors, the replacement component also gives feedback to the prefetching component to help decide on the optimal fill level for prefetches. Meanwhile, the prefetching component feeds confidence information about each individual prefetch to the LLC replacement component. A low confidence prefetch is less likely to interfere with the contents of the LLC, and as confidence in that prefetch increases, its position within the LLC replacement stack is solidified, and it eventually is brought into the L2 cache, close to where it will be used in the processor core.
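A minimal sketch of the coupling this paragraph describes, under assumed interfaces: a PC-free dead-block heuristic driven by a single global hysteresis counter, and a fill-level decision driven by the prefetcher's confidence. All thresholds and method names are assumptions for illustration:

```python
class UnifiedCachePolicy:
    """Illustrative replacement/prefetch coupling with global hysteresis."""

    def __init__(self, hysteresis_max=7, fill_l2_confidence=0.75):
        self.hysteresis = 0                      # global counter tracking phase behavior
        self.hysteresis_max = hysteresis_max
        self.fill_l2_confidence = fill_l2_confidence

    def on_llc_access(self, block_reused):
        # lean toward predicting dead blocks when reuse is rare in this phase
        if block_reused:
            self.hysteresis = max(0, self.hysteresis - 1)
        else:
            self.hysteresis = min(self.hysteresis_max, self.hysteresis + 1)

    def predict_dead(self):
        return self.hysteresis > self.hysteresis_max // 2

    def fill_level(self, prefetch_confidence):
        # low-confidence prefetches stay in the LLC; high-confidence ones go to L2
        return "L2" if prefetch_confidence >= self.fill_l2_confidence else "LLC"
```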
Third, we observe that the host machine in a virtualized system operates under different memory pressure regimes, as the memory demand from guest Virtual Machines (VMs) changes dynamically at runtime. Adapting to this runtime system state is critical to reduce the performance cost of VM memory management. We propose a novel dynamic memory management policy called Memory Pressure Aware (MPA) ballooning. MPA ballooning dynamically speculates and allocates memory resources to each VM based on the current memory pressure regime. Moreover, MPA ballooning proactively reacts and adapts to sudden changes in memory demand from guest VMs. MPA ballooning neither requires additional hardware support nor incurs extra minor page faults in its memory pressure estimation.
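A minimal sketch of regime-based ballooning in the spirit of this description: the host classifies its current memory pressure and adjusts each guest's memory grant accordingly. The regime boundaries, step sizes, and function name are illustrative assumptions, not the proposal's actual policy:

```python
def adjust_guest_memory(current_grant_mb, host_free_ratio, guest_request_mb):
    """Return the new memory grant for a guest VM (illustrative policy only)."""
    if host_free_ratio > 0.5:       # low-pressure regime: satisfy the request
        return current_grant_mb + guest_request_mb
    if host_free_ratio > 0.2:       # moderate pressure: grant half, keep headroom
        return current_grant_mb + guest_request_mb // 2
    # high pressure: inflate the balloon, i.e. reclaim memory from the guest
    return max(0, current_grant_mb - guest_request_mb)
```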
- …