Multithreading Aware Hardware Prefetching for Chip Multiprocessors
To take advantage of the processing power of Chip Multiprocessor designs, applications must be divided into semi-independent processes that can run concurrently on multiple cores within a system. Programmers must therefore insert thread synchronization semantics (i.e., locks, barriers, and condition variables) to synchronize data access between processes. Indeed, threads spend a long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution to wait for load data accesses to complete. Furthermore, there are often independent instructions, including load instructions, beyond the synchronization semantics that could be executed in parallel while a thread waits on the synchronization semantics.
The convenience of cache memories comes at an extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem, but Cache Coherence adds considerable overhead to memory accesses. Aggressive prefetchers on different cores of a Chip Multiprocessor can lead to significant system performance degradation when running multi-threaded applications. This is a result of prefetch-demand interference: when a prefetcher in one core pulls shared data from a producing core before it has been written, the cache block transitions back and forth between the cores, resulting in useless prefetches that saturate the memory bandwidth and substantially increase the latency to critical shared data.
We present a hardware prefetcher that enables large performance improvements from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand interference. Furthermore, it utilizes the time that a thread spends waiting on synchronization semantics to run ahead of the critical section, speculating on and prefetching the data of independent load instructions beyond the synchronization semantics.
Phases, Modalities, Temporal and Spatial Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics
Memory performance is a bottleneck in graph analytics acceleration. Existing
Machine Learning (ML) prefetchers struggle with phase transitions and irregular
memory accesses in graph processing. We propose MPGraph, an ML-based Prefetcher
for Graph analytics using domain specific models. MPGraph introduces three
novel optimizations: soft detection for phase transitions, phase-specific
multi-modality models for access delta and page predictions, and chain
spatio-temporal prefetching (CSTP) for prefetch control. Our transition
detector achieves 34.17-82.15% higher precision compared with
Kolmogorov-Smirnov Windowing and decision tree. Our predictors achieve
6.80-16.02% higher F1-score for delta and 11.68-15.41% higher accuracy-at-10
for page prediction compared with LSTM and vanilla attention models. Using
CSTP, MPGraph achieves 12.52-21.23% IPC improvement, outperforming
state-of-the-art non-ML prefetcher BO by 7.58-12.03% and ML-based prefetchers
Voyager and TransFetch by 3.27-4.58%. For practical implementation, we
demonstrate that MPGraph, using compressed models with reduced latency, achieves
significantly higher accuracy and coverage than BO, leading to 3.58%
higher IPC improvement.
From Traditional Adaptive Data Caching to Adaptive Context Caching: A Survey
Context data is in demand more than ever with the rapid increase in the
development of many context-aware Internet of Things applications. Research in
context and context-awareness is being conducted to broaden its applicability
in light of many practical and technical challenges. One of the challenges is
improving performance when responding to a large number of context queries.
Context Management Platforms that infer and deliver context to applications
measure this problem using Quality of Service (QoS) parameters. Although
caching is a proven way to improve QoS, the transiency of context and features such
as the variability and heterogeneity of context queries pose an additional real-time
cost management problem. This paper presents a critical survey of
state-of-the-art in adaptive data caching with the objective of developing a
body of knowledge in cost- and performance-efficient adaptive caching
strategies. We comprehensively survey a large number of research publications
and evaluate, compare, and contrast different techniques, policies, approaches,
and schemes in adaptive caching. Our critical analysis is motivated by the
focus on adaptively caching context as a core research problem. A formal
definition for adaptive context caching is then proposed, followed by
identified features and requirements of a well-designed, objective optimal
adaptive context caching strategy.
Comment: This paper is currently under review with ACM Computing Surveys
Journal at this time of publishing in arxiv.or
Perceptron Learning in Cache Management and Prediction Techniques
Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. An efficient prefetcher should identify complex memory access patterns during program execution. This ability enables the prefetcher to read a block ahead of its demand access, potentially preventing a cache miss. Accurately identifying the right blocks to prefetch is essential to achieving high performance from the prefetcher.
Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses which the prefetcher brings into the cache; and accuracy, the fraction of prefetches which are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy.
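The two metrics defined above can be illustrated with a minimal sketch. It assumes a simplified model in which prefetches and demand accesses are plain sets of block addresses and timing is ignored; the function names and the example addresses are hypothetical and not taken from the thesis.

```python
def coverage(baseline_misses, prefetched):
    """Fraction of baseline cache misses that the prefetcher brought in."""
    if not baseline_misses:
        return 0.0
    return len(baseline_misses & prefetched) / len(baseline_misses)


def accuracy(prefetched, demanded):
    """Fraction of issued prefetches that were ultimately used."""
    if not prefetched:
        return 0.0
    return len(prefetched & demanded) / len(prefetched)


# Hypothetical block addresses, for illustration only.
baseline_misses = {1, 2, 3, 4}    # misses with no prefetcher
demanded = {1, 2, 3, 4, 5}        # blocks the program actually touches

conservative = {1, 2}                        # few, well-chosen prefetches
aggressive = {1, 2, 3, 7, 8, 9, 10, 11}      # many prefetches, mostly unused

cons_cov = coverage(baseline_misses, conservative)
cons_acc = accuracy(conservative, demanded)
aggr_cov = coverage(baseline_misses, aggressive)
aggr_acc = accuracy(aggressive, demanded)
```

In this toy setting the aggressive configuration covers more of the baseline misses but most of its prefetches go unused, which is the tension between the two metrics described above.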
In this thesis, I propose Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by a baseline prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of a given baseline prefetcher, leading to increased coverage by filtering out the growing numbers of inaccurate prefetches such an aggressive tuning implies. I also explore a range of features to use to train PPF’s perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the baseline prefetcher alone
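The general idea of filtering prefetch candidates with a perceptron can be sketched as follows. This is a generic hashed-perceptron filter written under stated assumptions, not PPF's actual design: the class name, table size, thresholds, weight range, and the choice of feature values are all hypothetical.

```python
class PerceptronFilter:
    """Sketch of a perceptron-style prefetch filter (all parameters assumed)."""

    def __init__(self, num_features, table_size=1024,
                 threshold=0, train_margin=8, w_max=15):
        # One table of small saturating integer weights per feature.
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold
        self.train_margin = train_margin
        self.w_max = w_max

    def _indices(self, features):
        # Map each integer feature value to a slot in its weight table.
        return [f % self.table_size for f in features]

    def score(self, features):
        # Sum the weights selected by the feature values.
        return sum(t[i] for t, i in zip(self.tables, self._indices(features)))

    def allow(self, features):
        # Issue the candidate prefetch only if the score reaches the threshold.
        return self.score(features) >= self.threshold

    def train(self, features, useful):
        # Update on a misprediction or when the score has low magnitude.
        s = self.score(features)
        predicted = s >= self.threshold
        if predicted != useful or abs(s) < self.train_margin:
            step = 1 if useful else -1
            for t, i in zip(self.tables, self._indices(features)):
                t[i] = max(-self.w_max, min(self.w_max, t[i] + step))


# Hypothetical feature tuples, e.g. (hashed PC, address delta).
flt = PerceptronFilter(num_features=2)
for _ in range(5):
    flt.train((3, 7), useful=False)   # this pattern's prefetches went unused
for _ in range(3):
    flt.train((5, 9), useful=True)    # this pattern's prefetches were used
```

After training, candidates matching the first pattern are rejected and those matching the second are still issued, so a more aggressive baseline prefetcher can generate extra candidates while the filter suppresses the ones it has learned to be inaccurate.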
SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads
Prefetching is a crucial technique employed in traditional databases to
enhance interactivity, particularly in the context of data exploration. Data
exploration is a query processing paradigm in which users search for insights
buried in the data, often not knowing what exactly they are looking for. Data
exploratory tools deal with multiple challenges such as the need for
interactivity with no a priori knowledge being present to help with the system
tuning. The state-of-the-art prefetchers are specifically designed for
navigational workloads only, where the number of possible actions is limited.
The prefetchers that work with SQL-based workloads, on the other hand, mainly
rely on data logical addresses rather than the data semantics. They fail to
predict complex access patterns in cases where the database size is
substantial, resulting in an extensive address space, or when there is frequent
co-accessing of data. In this paper, we propose SeLeP, a semantic prefetcher
that makes prefetching decisions for both types of workloads, based on the
encoding of the data values contained inside the accessed blocks. Following the
popular path of using machine learning approaches to automatically learn the
hidden patterns, we formulate the prefetching task as a time-series forecasting
problem and use an encoder-decoder LSTM architecture to learn the data access
pattern. Our extensive experiments, across real-life exploratory workloads,
demonstrate that SeLeP improves the hit ratio by up to 40% and reduces I/O time by up
to 45% compared to the state-of-the-art, attaining an impressive 95% hit ratio and
80% I/O reduction on average.