442 research outputs found

    Execution history guided instruction prefetching

    Get PDF
    The increasing gap in performance between processors and main memory has made effective instructions prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to I-cache. A recent study by [19] shows that this factor alone explains why most modern microprocessors do not use such I-cache hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First we present a method that does not require an extra port to I-cache

    Adaptive runtime-assisted block prefetching on chip-multiprocessors

    Get PDF
    Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft

    Instruction fetch architectures and code layout optimizations

    Get PDF
    The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors, to the more aggressive wide issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms.Peer ReviewedPostprint (published version

    Effective instruction prefetching via fetch prestaging

    Get PDF
    As technological process shrinks and clock rate increases, instruction caches can no longer be accessed in one cycle. Alternatives are implementing smaller caches (with higher miss rate) or large caches with a pipelined access (with higher branch misprediction penalty). In both cases, the performance obtained is far from the obtained by an ideal large cache with one-cycle access. In this paper we present cache line guided prestaging (CLGP), a novel mechanism that overcomes the limitations of current instruction cache implementations. CLGP employs prefetching to charge future cache lines into a set of fast prestage buffers. These buffers are managed efficiently by the CLGP algorithm, trying to fetch from them as much as possible. Therefore, the number of fetches served by the main instruction cache is highly reduced, and so the negative impact of its access latency on the overall performance. With the best CLGP configuration using a 4 KB I-cache, speedups of 3.5% (at 0.09 /spl mu/m) and 12.5% (at 0.045 /spl mu/m) are obtained over an equivalent fetch directed prefetching configuration, and 39% (at 0.09 /spl mu/m) and 48% (at 0.045 /spl mu/m) over using a pipelined instruction cache without prefetching. Moreover, our results show that CLGP with a 2.5 KB of total cache budget can obtain a similar performance than using a 64 KB pipelined I-cache without prefetching, that is equivalent performance at 6.4X our hardware budget.Peer ReviewedPostprint (published version

    PREFENDER: A Prefetching Defender against Cache Side Channel Attacks as A Pretender

    Full text link
    Cache side channel attacks are increasingly alarming in modern processors due to the recent emergence of Spectre and Meltdown attacks. A typical attack performs intentional cache access and manipulates cache states to leak secrets by observing the victim's cache access patterns. Different countermeasures have been proposed to defend against both general and transient execution based attacks. Despite their effectiveness, they mostly trade some level of performance for security, or have restricted security scope. In this paper, we seek an approach to enforcing security while maintaining performance. We leverage the insight that attackers need to access cache in order to manipulate and observe cache state changes for information leakage. Specifically, we propose Prefender, a secure prefetcher that learns and predicts attack-related accesses for prefetching the cachelines to simultaneously help security and performance. Our results show that Prefender is effective against several cache side channel attacks while maintaining or even improving performance for SPEC CPU 2006 and 2017 benchmarks.Comment: Submitting to a journa

    Fast Key-Value Lookups with Node Tracker

    Get PDF
    Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard-to-predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for eliminating the branch mispredictions only a small fraction of the performance benefit is achieved through prefetching alone. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3Ă— speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1Ă— when used as a prefetcher; 11.9Ă— as a prefetcher with BOS; 14.9Ă— as a pre-execution unit and 18.8Ă— as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve a speedup of 203Ă— and 11Ă— over the single-core and 24-core baselines, respectively

    WCET Optimizations and Architectural Support for Hard Real-Time Systems

    Get PDF
    As time predictability is critical to hard real-time systems, it is not only necessary to accurately estimate the worst-case execution time (WCET) of the real-time tasks but also desirable to improve either the WCET of the tasks or time predictability of the system, because the real-time tasks with lower WCETs are easy to schedule and more likely to meat their deadlines. As a real-time system is an integration of software and hardware, the optimization can be achieved through two ways: software optimization and time-predictable architectural support. In terms of software optimization, we fi rst propose a loop-based instruction prefetching approach to further improve the WCET comparing with simple prefetching techniques such as Next-N-Line prefetching which can enhance both the average-case performance and the worst-case performance. Our prefetching approach can exploit the program controlow information to intelligently prefetch instructions that are most likely needed. Second, as inter-thread interferences in shared caches can signi cantly a ect the WCET of real-time tasks running on multicore processors, we study three multicore-aware code positioning methods to reduce the inter-core L2 cache interferences between co-running real-time threads. One strategy focuses on decreasing the longest WCET among the co-running threads, and two other methods aim at achieving fairness in terms of the amount or percentage of WCET reduction among co-running threads. In the aspect of time-predictable architectural support, we introduce the concept of architectural time predictability (ATP) to separate timing uncertainty concerns caused by hardware from software, which greatly facilitates the advancement of time-predictable processor design. We also propose a metric called Architectural Time-predictability Factor (ATF) to measure architectural time predictability quantitatively. Furthermore, while cache memories can generally improve average-case performance, they are harmful to time predictability and thus are not desirable for hard real-time and safety-critical systems. In contrast, Scratch-Pad Memories (SPMs) are time predictable, but they may lead to inferior performance. Guided by ATF, we propose and evaluate a variety of hybrid on-chip memory architectures to combine both caches and SPMs intelligently to achieve good time predictability and high performance. Detailed implementation and experimental results discussion are presented in this dissertation
    • …
    corecore