25 research outputs found

    PP-Bridge: Establishing a Bridge between the Prefetching and Cache Partitioning

    Modern computer processors are equipped with multiple cores, each boasting its own dedicated cache memory, while collectively sharing a generously sized Last Level Cache (LLC). To ensure equitable utilization of the LLC space and bolster system security, partitioning techniques have been introduced to allocate the shared LLC space among the applications running on different cores. This partition dynamically adapts to the requirements of these applications. Prefetching plays a vital role in enhancing cache performance by proactively loading data into the cache before it is explicitly requested by a core. Each core employs prefetch engines to decide which data blocks to fetch preemptively. However, a haphazard prefetcher may bring in more data blocks than necessary, leading to cache pollution and a subsequent degradation in system performance. To maximize the benefits of prefetching, it is essential to keep cache pollution to a minimum. Intriguingly, our research has uncovered that when existing prefetching techniques are combined with partitioning methods, they tend to exacerbate cache pollution within the LLC, resulting in a noticeable decline in system performance. In this paper, we present a novel approach aimed at mitigating cache pollution when combining prefetching with partitioning techniques.

    Evaluation of L1 Residence for Perceptron Filter Enhanced Signature Path Prefetcher

    Rapid advancement of integrated circuit technology described by Moore’s Law has greatly increased computational power. Processors have taken advantage of this by increasing computation rates, while memory has gained increased capacity. As processor operation speeds have greatly exceeded memory access times, computer architects have added multiple levels of caches to avoid penalties for repeat accesses to memory. While this is an improvement, architects have further improved access efficiency by developing methods of prefetching data from memory to hide the latency penalty usually incurred on a cache miss. Previous work at Texas A&M and their submission to the Third Data Prefetching Championship (DPC3) primarily consisted of L2 cache prefetching. L1 prefetching has been less explored than L2 due to hardware limitations on implementation. In this paper, I attempt to evaluate the effect of L1 residence for Texas A&M’s Perceptron Filtered Signature Path Prefetcher (PPF). While an unoptimized movement of the PPF from the L2 to the L1 showed performance degradation, optimizations such as using the L1 data stream to prefetch to all cache levels and updating table sizes and lengths have matched L2 performance

    Branch-directed Data Prefetching

    Memory prefetching in computer processors is the practice of predicting memory addresses that will need to be accessed and issuing requests to pull data from those addresses ahead of time. These circuits are crucial to combatting the "memory wall", a bottleneck in processor speed caused by the relatively slower progression of memory access speeds compared to progress in instruction execution speed. This project builds upon the Signature Path Prefetcher (SPP), a prefetcher for the L2C cache developed in Professor Gratz’s CAMSIN research group. The SPP decides prefetch addresses based on a delta access history signature. This project explores the possibility of enhancing the SPP by incorporating branch history data (branch decisions & target addresses) into the existing prefetcher structure. The Branch-Directed SPP aims to improve overall performance as measured by IPC speedup. Results show that the design performs similarly to baseline SPP across these metrics, outperforming slightly on some trace sets and underperforming slightly on others

    Reference Speculation-driven Memory Management

    The “Memory Wall”, the vast gulf between processor execution speed and memory latency, has led to the development of large and deep cache hierarchies over the last twenty years. Although processor frequency is no longer on the exponential growth curve, the drive towards ever greater main memory capacity and limited off-chip bandwidth have kept this gap from closing significantly. In addition, future memory technologies such as Non-Volatile Memory (NVM) devices do not help to decrease the latency of the first reference to a particular memory address. To reduce the increasing off-chip memory access latency, this dissertation presents three intelligent speculation mechanisms that can predict and manage future memory usage. First, we propose a novel hardware data prefetcher called Signature Path Prefetcher (SPP), which offers effective solutions for major challenges in prefetcher design. SPP uses a compressed history-based scheme that accurately predicts a series of long complex address patterns. For example, to address a series of long complex memory references, SPP uses a compressed history signature that is able to learn and prefetch complex data access patterns. Moreover, unlike other history-based algorithms, which miss out on many prefetching opportunities when address patterns make a transition between physical pages, SPP tracks the stream of data accesses across physical page boundaries and continues prefetching as soon as they move to new pages. Finally, SPP uses the confidence it has in its predictions to adaptively throttle itself on a per-prefetch stream basis. In our analysis, we find that SPP outperforms the state-of-the-art hardware data prefetchers by 6.4% with higher prefetching accuracy and lower off-chip bandwidth usage. Second, we develop a holistic on-chip cache management system that tightly integrates data prefetching and cache replacement algorithms into one unified solution. 
Also, we eliminate the use of Program Counter (PC) in the cache replacement module by using a simple dead block prediction with global hysteresis. In addition to effectively predicting dead blocks in the Last-Level Cache (LLC) by observing program phase behaviors, the replacement component also gives feedback to the prefetching component to help decide on the optimal fill level for prefetches. Meanwhile, the prefetching component feeds confidence information about each individual prefetch to the LLC replacement component. A low confidence prefetch is less likely to interfere with the contents of the LLC, and as confidence in that prefetch increases, its position within the LLC replacement stack is solidified, and it eventually is brought into the L2 cache, close to where it will be used in the processor core. Third, we observe that the host machine in a virtualized system operates under different memory pressure regimes, as the memory demand from guest Virtual Machines (VMs) changes dynamically at runtime. Adapting to this runtime system state is critical to reduce the performance cost of VM memory management. We propose a novel dynamic memory management policy called Memory Pressure Aware (MPA) ballooning. MPA ballooning dynamically speculates and allocates memory resources to each VM based on the current memory pressure regime. Moreover, MPA ballooning proactively reacts and adapts to sudden changes in memory demand from guest VMs. MPA ballooning neither requires additional hardware support nor incurs extra minor page faults in its memory pressure estimation.
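
The signature-based prediction and confidence throttling that this abstract attributes to SPP can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the signature width, the shift-and-XOR hash, the counter-based confidence estimate, the threshold, and all names (`update_signature`, `SignaturePrefetcher`, `max_depth`) are assumptions chosen for illustration, and page-boundary tracking is left out.

```python
from collections import defaultdict

SIG_BITS = 12   # assumed signature width, not taken from the abstract
SIG_SHIFT = 3   # assumed shift applied per folded delta


def update_signature(sig, delta):
    """Fold one block delta into a compressed history signature (illustrative hash)."""
    return ((sig << SIG_SHIFT) ^ (delta & 0x7F)) & ((1 << SIG_BITS) - 1)


class SignaturePrefetcher:
    """Toy single-stream model: learn signature -> next-delta counts, then look ahead."""

    def __init__(self, threshold=0.5, max_depth=4):
        self.threshold = threshold   # stop once path confidence drops below this
        self.max_depth = max_depth   # upper bound on lookahead steps
        self.pattern = defaultdict(lambda: defaultdict(int))  # sig -> delta -> count
        self.sig = 0
        self.last_block = None

    def access(self, block):
        """Train on a demand access to `block`; return block numbers to prefetch."""
        if self.last_block is not None:
            delta = block - self.last_block
            if delta != 0:
                self.pattern[self.sig][delta] += 1
                self.sig = update_signature(self.sig, delta)
        self.last_block = block
        return self._lookahead(block)

    def _lookahead(self, block):
        prefetches = []
        sig, confidence = self.sig, 1.0
        for _ in range(self.max_depth):
            counts = self.pattern.get(sig)
            if not counts:
                break
            # Follow the most frequently observed delta for this signature.
            delta, hits = max(counts.items(), key=lambda kv: kv[1])
            # Multiply confidences along the path so deep speculation is throttled.
            confidence *= hits / sum(counts.values())
            if confidence < self.threshold:
                break
            block += delta
            prefetches.append(block)
            sig = update_signature(sig, delta)
        return prefetches


if __name__ == "__main__":
    prefetcher = SignaturePrefetcher()
    for b in range(6):  # a simple unit-stride stream of block numbers
        print(b, prefetcher.access(b))
```

In this toy model the prefetcher issues nothing until the current signature has been seen before, and the lookahead depth shrinks automatically when a signature is followed by several different deltas.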

    Prefetching in functional languages

    Functional programming languages contain a number of runtime and language features, such as garbage collection, indirect memory accesses, linked data structures and immutability, that interact with a processor’s memory system. These conspire to cause a variety of unintuitive memory performance effects. For example, it is slower to traverse through linked lists and arrays of data that have been sorted than to traverse the same data accessed in the order it was allocated. We seek to understand these issues and mitigate them in a manner consistent with functional languages, taking advantage of the features themselves where possible. For example, immutability and garbage collection force linked lists to be allocated roughly sequentially in memory, even when the data pointed to within each node is not. We add language primitives for software-prefetching to the OCaml language to exploit this, and observe significant performance improvements across a variety of micro- and macro-benchmarks, resulting in speedups of up to 2× on the out-of-order superscalar Intel Haswell and Xeon Phi Knights Landing systems, and up to 3× on the in-order Arm Cortex-A53.

    Multithreading Aware Hardware Prefetching for Chip Multiprocessors

    To take advantage of the processing power of Chip Multiprocessor designs, applications must be divided into semi-independent processes that can run concurrently on multiple cores within a system. Therefore, programmers must insert thread synchronization semantics (i.e. locks, barriers, and condition variables) to synchronize data access between processes. Indeed, threads spend a long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution to wait for load data accesses to complete. Furthermore, there are often independent instructions, including load instructions, beyond the synchronization semantics that could be executed in parallel while a thread waits on the synchronization semantics. The conveniences of cache memories come with some extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem. However, Cache Coherence adds considerable overhead to memory accesses. Having an aggressive prefetcher on different cores of a Chip Multiprocessor can lead to significant system performance degradation when running multi-threaded applications. This is the result of prefetch-demand interference: when a prefetcher in one core pulls shared data from a producing core before it has been written, the cache block ends up transitioning back and forth between the cores, resulting in useless prefetches, saturating the memory bandwidth, and substantially increasing the latency to critical shared data. We present a hardware prefetcher that enables large performance improvements from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand interference. Furthermore, it utilizes the time that a thread spends waiting on synchronization semantics to run ahead of the critical section to speculate and prefetch independent load instruction data beyond the synchronization semantics.

    DeepP: Deep Learning Multi-Program Prefetch Configuration for the IBM POWER 8

    Current multi-core processors implement sophisticated hardware prefetchers that can be configured per application (PID) to improve system performance. When running multiple applications, each application can present different prefetch requirements, hence different configurations can be used. Setting the optimal prefetch configuration for each application is a complex task, since it depends not only on the application characteristics but also on the interference at the shared memory resources (e.g. memory bandwidth). In this paper, we propose DeepP, a deep learning approach for the IBM POWER8 that identifies at run-time the best prefetch configuration for each application in a workload. To this end, the neural network predicts the performance of each application under the studied prefetch configurations by using a set of performance events. The prediction accuracy of the network is improved thanks to a dynamic training methodology that allows learning the impact of dynamic changes of the prefetch configuration on performance. At run-time, the devised network infers the best prefetch configuration for each application and adjusts it dynamically. Experimental results show that the proposed approach improves performance, on average, by 5.8%, 6.7%, and 15.8% compared to the default prefetch configuration across different 6-, 8-, and 10-application workloads, respectively. This work was supported in part by Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, in part by Generalitat Valenciana under Grant AICO/2021/266, and in part by Ayudas Contratos predoctorales UPV -subprograma 1 (PAID-01-20). The work of Josue Feliu was supported by a Juan de la Cierva Formacion Contract under Grant FJC2018-036021-I. Lurbe-Sempere, M.; Feliu-Pérez, J.; Petit Martí, SV.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2022). DeepP: Deep Learning Multi-Program Prefetch Configuration for the IBM POWER 8. IEEE Transactions on Computers. 71(10):2646-2658. https://doi.org/10.1109/TC.2021.3139997