    Multithreading Aware Hardware Prefetching for Chip Multiprocessors

    To take advantage of the processing power of a Chip Multiprocessor, applications must be divided into semi-independent processes that can run concurrently on multiple cores within a system. Programmers must therefore insert thread synchronization semantics (i.e., locks, barriers, and condition variables) to synchronize data access between processes. In practice, threads spend a long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution while waiting for load data accesses to complete. Furthermore, there are often independent instructions, including load instructions beyond the synchronization semantics, that could be executed in parallel while a thread waits on the synchronization semantics. The convenience of cache memories comes with extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem; however, Cache Coherence adds considerable overhead to memory accesses. Running an aggressive prefetcher on each core of a Chip Multiprocessor can lead to significant system performance degradation when running multi-threaded applications. This is the result of prefetch-demand interference: when a prefetcher in one core pulls shared data from a producing core before it has been written, the cache block transitions back and forth between the cores, producing useless prefetches, saturating the memory bandwidth, and substantially increasing the latency to critical shared data. We present a hardware prefetcher that enables large performance improvements from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand interference. Furthermore, it utilizes the time a thread spends waiting on synchronization semantics to run ahead of the critical section, speculating on and prefetching the data of independent load instructions beyond the synchronization semantics.
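
    The interference-avoidance idea can be illustrated with a toy model: a prefetcher that consults the coherence state of each candidate line and skips lines another core holds dirty. This is a minimal sketch, not the thesis's actual mechanism; the class names and the next-line prefetch policy are assumptions.

```python
# Hypothetical sketch: a coherence-aware prefetch filter that drops
# prefetch candidates for lines another core holds in Modified state,
# approximating the goal of avoiding prefetch-demand interference.
# All names (CoherenceDirectory, InterferenceAwarePrefetcher) are illustrative.

from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    MODIFIED = 2

class CoherenceDirectory:
    """Tracks which core, if any, holds each line and in what state."""
    def __init__(self):
        self.owner = {}          # line address -> (core_id, State)

    def lookup(self, line):
        return self.owner.get(line, (None, State.INVALID))

class InterferenceAwarePrefetcher:
    def __init__(self, directory, degree=4, line_size=64):
        self.directory = directory
        self.degree = degree     # how many lines ahead to prefetch
        self.line_size = line_size

    def issue(self, core_id, miss_addr):
        """Generate next-line prefetches, skipping lines that would
        steal dirty data from a producing core."""
        issued = []
        for i in range(1, self.degree + 1):
            line = (miss_addr // self.line_size + i) * self.line_size
            owner, state = self.directory.lookup(line)
            if state is State.MODIFIED and owner != core_id:
                continue         # likely being produced elsewhere: don't prefetch
            issued.append(line)
        return issued
```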

    Phases, Modalities, Temporal and Spatial Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics

    Memory performance is a bottleneck in graph analytics acceleration. Existing Machine Learning (ML) prefetchers struggle with the phase transitions and irregular memory accesses of graph processing. We propose MPGraph, an ML-based prefetcher for graph analytics using domain-specific models. MPGraph introduces three novel optimizations: soft detection of phase transitions, phase-specific multi-modality models for access delta and page predictions, and chain spatio-temporal prefetching (CSTP) for prefetch control. Our transition detector achieves 34.17-82.15% higher precision compared with Kolmogorov-Smirnov Windowing and decision trees. Our predictors achieve 6.80-16.02% higher F1-score for delta prediction and 11.68-15.41% higher accuracy-at-10 for page prediction compared with LSTM and vanilla attention models. Using CSTP, MPGraph achieves 12.52-21.23% IPC improvement, outperforming the state-of-the-art non-ML prefetcher BO by 7.58-12.03% and the ML-based prefetchers Voyager and TransFetch by 3.27-4.58%. For practical implementation, we demonstrate that MPGraph, using compressed models with reduced latency, shows significantly superior accuracy and coverage compared with BO, leading to a 3.58% higher IPC improvement.
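
    The idea of phase-transition detection can be approximated in a few lines: compare the distribution of recent access deltas across two adjacent windows and flag a transition when they diverge. This sketch follows the windowed-comparison style of the Kolmogorov-Smirnov baseline the paper measures against, not MPGraph's actual soft detector; the window size, binning, and threshold are assumptions.

```python
# Illustrative sketch (not MPGraph's detector): a windowed distribution-shift
# test over access deltas, in the spirit of the Kolmogorov-Smirnov Windowing
# baseline the paper compares against.
import numpy as np

def delta_histogram(deltas, bins):
    hist, _ = np.histogram(deltas, bins=bins)
    total = hist.sum()
    return hist / total if total else hist.astype(float)

def phase_transition_score(addresses, window=256,
                           bins=np.linspace(-4096, 4096, 65)):
    """Compare the delta distributions of two adjacent windows.
    Returns total-variation distance in [0, 1]; higher = likely transition."""
    deltas = np.diff(np.asarray(addresses, dtype=np.int64))
    if len(deltas) < 2 * window:
        return 0.0
    old = delta_histogram(deltas[-2 * window:-window], bins)
    new = delta_histogram(deltas[-window:], bins)
    return 0.5 * np.abs(old - new).sum()

# Usage: flag a phase change when the score crosses a tuned threshold,
# e.g. switch to a different phase-specific model when score > 0.3.
```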

    From Traditional Adaptive Data Caching to Adaptive Context Caching: A Survey

    Context data is in demand more than ever with the rapid increase in the development of context-aware Internet of Things applications. Research in context and context-awareness is being conducted to broaden its applicability in light of many practical and technical challenges. One of the challenges is improving performance when responding to a large number of context queries. Context Management Platforms that infer and deliver context to applications measure this problem using Quality of Service (QoS) parameters. Although caching is a proven way to improve QoS, the transiency of context and features such as the variability and heterogeneity of context queries pose an additional real-time cost management problem. This paper presents a critical survey of the state of the art in adaptive data caching with the objective of developing a body of knowledge on cost- and performance-efficient adaptive caching strategies. We comprehensively survey a large number of research publications and evaluate, compare, and contrast different techniques, policies, approaches, and schemes in adaptive caching. Our critical analysis is motivated by the focus on adaptively caching context as a core research problem. A formal definition of adaptive context caching is then proposed, followed by the identified features and requirements of a well-designed, objectively optimal adaptive context caching strategy.
    Comment: This paper is under review with the ACM Computing Surveys journal at the time of publishing on arXiv.
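
    The cost-versus-QoS trade-off behind adaptive context caching can be made concrete with a toy admission rule: cache a context item only when the retrievals it avoids are expected to save more than keeping it fresh costs. This is an illustrative sketch, not a scheme from the survey; the rate and cost fields are assumptions.

```python
# Illustrative sketch of one cost-aware caching decision for transient
# context data (not a specific scheme from the survey).
from dataclasses import dataclass

@dataclass
class ContextItem:
    access_rate: float     # queries per second hitting this item
    retrieval_cost: float  # cost to derive/infer the context on a miss
    refresh_rate: float    # refreshes per second needed to stay fresh
    refresh_cost: float    # cost of each refresh while cached

def should_cache(item: ContextItem) -> bool:
    """Cache if avoided retrievals save more than refreshes cost."""
    expected_savings = item.access_rate * item.retrieval_cost
    freshness_cost = item.refresh_rate * item.refresh_cost
    return expected_savings > freshness_cost
```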

    Perceptron Learning in Cache Management and Prediction Techniques

    Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. An efficient prefetcher should identify complex memory access patterns during program execution. This ability enables the prefetcher to read a block ahead of its demand access, potentially preventing a cache miss. Accurately identifying the right blocks to prefetch is essential to achieving high performance from the prefetcher. Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses that the prefetcher brings into the cache; and accuracy, the fraction of prefetches that are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy. In this thesis, I propose Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by a baseline prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of a given baseline prefetcher, leading to increased coverage by filtering out the growing number of inaccurate prefetches such an aggressive tuning implies. I also explore a range of features with which to train PPF’s perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the baseline prefetcher alone.
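
    A minimal sketch of perceptron-based prefetch filtering follows: each prefetch candidate is hashed into per-feature weight tables, the weights are summed against a threshold to accept or reject the prefetch, and the tables are trained with saturating updates once the prefetch's usefulness is known. The feature set, table sizes, and thresholds here are illustrative assumptions, not the exact design in the thesis.

```python
# Simplified sketch of a perceptron prefetch filter in the spirit of PPF.
# Feature choice and table sizes are illustrative.

class PerceptronFilter:
    def __init__(self, table_size=4096, threshold=0, max_weight=31):
        # One weight table per feature (hashed-perceptron organization).
        self.features = ["pc", "page", "offset"]
        self.tables = {f: [0] * table_size for f in self.features}
        self.table_size = table_size
        self.threshold = threshold
        self.max_weight = max_weight

    def _indices(self, pc, addr):
        page = addr >> 12              # 4 KB page address
        offset = (addr >> 6) & 0x3F    # cache-line offset within the page
        raw = {"pc": pc, "page": page, "offset": offset}
        return {f: hash((f, raw[f])) % self.table_size for f in self.features}

    def predict(self, pc, addr):
        """Sum per-feature weights; issue the prefetch only above threshold."""
        idx = self._indices(pc, addr)
        score = sum(self.tables[f][i] for f, i in idx.items())
        return score >= self.threshold

    def train(self, pc, addr, was_useful):
        """Saturating-counter update once the prefetch's usefulness is known."""
        step = 1 if was_useful else -1
        for f, i in self._indices(pc, addr).items():
            w = self.tables[f][i] + step
            self.tables[f][i] = max(-self.max_weight, min(self.max_weight, w))
```

    Filtering rather than generating prefetches is the key design choice: the baseline prefetcher can be tuned aggressively for coverage, while the perceptron layer vetoes the inaccurate candidates that aggressiveness produces.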

    SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads

    Prefetching is a crucial technique employed in traditional databases to enhance interactivity, particularly in the context of data exploration. Data exploration is a query processing paradigm in which users search for insights buried in the data, often not knowing exactly what they are looking for. Data exploration tools deal with multiple challenges, such as the need for interactivity with no a priori knowledge available to help with system tuning. State-of-the-art prefetchers are specifically designed for navigational workloads only, where the number of possible actions is limited. Prefetchers that work with SQL-based workloads, on the other hand, mainly rely on the logical addresses of data rather than its semantics. They fail to predict complex access patterns when the database size is substantial, resulting in an extensive address space, or when data is frequently co-accessed. In this paper, we propose SeLeP, a semantic prefetcher that makes prefetching decisions for both types of workloads based on the encoding of the data values contained inside the accessed blocks. Following the popular path of using machine learning approaches to automatically learn hidden patterns, we formulate the prefetching task as a time-series forecasting problem and use an encoder-decoder LSTM architecture to learn the data access pattern. Our extensive experiments across real-life exploratory workloads demonstrate that SeLeP improves the hit ratio by up to 40% and reduces I/O time by up to 45% compared to the state of the art, attaining an impressive 95% hit ratio and 80% I/O reduction on average.
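
    The forecasting formulation can be sketched with a small encoder-decoder LSTM: encode a window of block encodings, then decode per-partition access probabilities and prefetch the top-scoring partitions. This is a hedged sketch of the general architecture, assuming PyTorch; the dimensions, multi-label head, and top-k policy are assumptions rather than SeLeP's exact design.

```python
# Minimal sketch of an encoder-decoder LSTM for access prediction, taking
# block encodings as input and producing a multi-label score per data
# partition. Dimensions and the top-k prefetch policy are illustrative.
import torch
import torch.nn as nn

class SeqPrefetcher(nn.Module):
    def __init__(self, enc_dim=64, hidden=128, num_partitions=1024):
        super().__init__()
        self.encoder = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_partitions)

    def forward(self, block_encodings, horizon=1):
        # block_encodings: (batch, seq_len, enc_dim)
        _, (h, c) = self.encoder(block_encodings)
        # Feed the encoder summary as the decoder input for each step ahead.
        dec_in = h[-1].unsqueeze(1).repeat(1, horizon, 1)
        out, _ = self.decoder(dec_in, (h, c))
        return torch.sigmoid(self.head(out))  # per-partition access probability

# Usage sketch: prefetch the k partitions with the highest probability.
# probs = model(encodings)              # (batch, horizon, num_partitions)
# prefetch_ids = probs[0, 0].topk(8).indices
```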