
    Adaptive runtime-assisted block prefetching on chip-multiprocessors

    Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique used to alleviate this problem. Prefetching can be performed by hardware, or be initiated and controlled by software. Software-controlled prefetching covers a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speedup of 32% and 10% on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction in energy-to-solution of up to 18% and 3% on average.
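    As a rough illustration of the software side of such a scheme (none of this code is from the paper; the 64-byte line size, the helper names, and the use of __builtin_prefetch as a stand-in for the hardware engine are all assumptions), a block prefetcher might walk a large region one cache line at a time, ahead of the compute loop:

```c
#include <stddef.h>

/* Hypothetical sketch: prefetch a large block one cache line (64 B assumed)
 * at a time. Runtime-assisted schemes hand this work to a hardware engine;
 * __builtin_prefetch only models a per-line software fallback. */
static void prefetch_block(const void *addr, size_t bytes)
{
    const char *p = (const char *)addr;
    for (size_t off = 0; off < bytes; off += 64)
        __builtin_prefetch(p + off, /*rw=*/0, /*locality=*/1);
}

/* Example: request the next tile while processing the current one. */
void process_tiles(const double *data, size_t ntiles, size_t tile_elems)
{
    for (size_t t = 0; t < ntiles; t++) {
        if (t + 1 < ntiles)
            prefetch_block(data + (t + 1) * tile_elems,
                           tile_elems * sizeof(double));
        /* ... compute on data[t * tile_elems ..] ... */
    }
}
```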

    PP-Bridge: Establishing a Bridge between the Prefetching and Cache Partitioning

    Modern computer processors are equipped with multiple cores, each with its own dedicated cache memory, while collectively sharing a generously sized Last Level Cache (LLC). To ensure equitable utilization of the LLC space and bolster system security, partitioning techniques have been introduced to allocate the shared LLC space among the applications running on different cores; the partition dynamically adapts to the requirements of these applications. Prefetching plays a vital role in enhancing cache performance by proactively loading data into the cache before it is explicitly requested by a core. Each core employs prefetch engines to decide which data blocks to fetch preemptively. However, a haphazard prefetcher may bring in more data blocks than necessary, leading to cache pollution and a subsequent degradation in system performance. To maximize the benefits of prefetching, it is essential to keep cache pollution to a minimum. Intriguingly, our research has uncovered that when existing prefetching techniques are combined with partitioning methods, they tend to exacerbate cache pollution within the LLC, resulting in a noticeable decline in system performance. In this paper, we present a novel approach aimed at mitigating cache pollution when combining prefetching with partitioning techniques.
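    The pollution problem the abstract targets is commonly attacked by throttling prefetch aggressiveness when accuracy drops. The sketch below models that generic heuristic only, not PP-Bridge's actual mechanism; all counters, thresholds, and names are illustrative:

```c
#include <stdint.h>

/* Hypothetical accuracy-based prefetch throttle: count how many prefetched
 * lines are later hit by demand accesses, and lower the prefetch degree
 * when accuracy drops (the classic cache-pollution signal). */
typedef struct {
    uint32_t issued;   /* prefetches issued in the current epoch */
    uint32_t useful;   /* prefetched lines later hit by demand accesses */
    int      degree;   /* how many blocks ahead to prefetch (1..8) */
} throttle_t;

static void end_epoch(throttle_t *t)
{
    if (t->issued >= 256) {                 /* enough samples to judge */
        uint32_t acc = 100 * t->useful / t->issued;
        if (acc > 75 && t->degree < 8)      /* accurate: be more aggressive */
            t->degree++;
        else if (acc < 40 && t->degree > 1) /* polluting: back off */
            t->degree--;
        t->issued = t->useful = 0;
    }
}
```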

    Data prefetching on in-order processors

    Low-power processors have attracted attention due to their energy efficiency. A large market, such as the mobile one, relies on these processors for this very reason. Even High Performance Computing (HPC) systems are starting to consider low-power processors as a way to achieve exascale performance within 20 MW; however, they must meet the right performance/Watt balance. Current low-power processors contain in-order cores, which cannot re-order instructions to avoid stalls induced by data dependencies. Whilst this is useful to reduce the chip's total power consumption, it brings several challenges. Due to the growing performance gap between memory and processor, memory is a significant bottleneck, and in-order cores, unable to re-order instructions, are memory latency bound; data prefetching can help alleviate this by ensuring data is readily available. In this work, we perform an exhaustive analysis of available data prefetching techniques on state-of-the-art in-order cores. We analyze 5 static prefetchers, and 2 dynamic aggressiveness and destination mechanisms applied to 3 data prefetchers, on a set of HPC mini- and proxy-applications running on in-order processors. We show that next-line prefetching can achieve nearly top performance with reasonable bandwidth consumption when throttled, whilst neighbor prefetchers are found to be best overall. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme (grant agreements No 671697 and No 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.
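    A next-line prefetcher, the scheme the study finds close to top performance when throttled, is simple to model. The sketch below is a generic illustration; the line size, structure, and function names are assumptions, not the paper's code:

```c
#include <stdint.h>

/* Minimal next-line prefetcher model: on each demand access, request the
 * following cache line(s). The `degree` field is the knob that dynamic
 * aggressiveness (throttling) mechanisms adjust at runtime. */
#define LINE_SHIFT 6  /* 64-byte cache lines assumed */

typedef struct { int degree; } nextline_pf;

/* Fills `out` with the line-aligned addresses to prefetch for one demand
 * access at `addr`; returns how many were generated. */
static int nextline_on_access(const nextline_pf *pf, uint64_t addr,
                              uint64_t *out, int max_out)
{
    uint64_t line = addr >> LINE_SHIFT;
    int n = pf->degree < max_out ? pf->degree : max_out;
    for (int i = 0; i < n; i++)
        out[i] = (line + 1 + i) << LINE_SHIFT;
    return n;
}
```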

    Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs

    Lattice Quantum Chromodynamics simulations typically spend most of their runtime in inversions of the Fermion Matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance greater than 300 GFlop/s on both architectures. This more than doubles the performance of the inversions. We also give a short overview of the Knights Corner architecture, discuss some details of the implementation, and describe the effort required to obtain the achieved performance. Comment: 7 pages, proceedings, presented at 'GPU Computing in High Energy Physics', September 10-12, 2014, Pisa, Italy.
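    For reference, the kernel being tuned is the textbook conjugate gradient iteration; a minimal single right-hand-side version in plain C is sketched below (the paper's gains come from batching several vectors through this loop at once, which the sketch does not show, and its matrix is the sparse Fermion operator rather than a dense array):

```c
#include <math.h>
#include <stddef.h>

/* Textbook conjugate gradient for a dense SPD matrix A (n x n). */
static double dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

static void matvec(const double *A, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

/* Solves A x = b; r, p, Ap are caller-provided work arrays of length n. */
static void cg(const double *A, const double *b, double *x,
               double *r, double *p, double *Ap, size_t n,
               int max_iter, double tol)
{
    for (size_t i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rs = dot(r, r, n);                 /* squared residual norm */
    for (int it = 0; it < max_iter && sqrt(rs) > tol; it++) {
        matvec(A, p, Ap, n);
        double alpha = rs / dot(p, Ap, n);    /* step length */
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r, n);
        double beta = rs_new / rs;            /* direction update weight */
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rs = rs_new;
    }
}
```

    Batching multiple right-hand sides turns the matrix-vector product into a matrix-matrix product, which is what lets both accelerators reach the quoted throughput.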