644 research outputs found

    Adaptive runtime-assisted block prefetching on chip-multiprocessors

    Get PDF
    Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    Full text link
    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this direction, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup is required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201

    The Janus triad: Exploiting parallelism through dynamic binary modification

    Get PDF
    We present a unified approach for exploiting thread-level, data-level, and memory-level parallelism through a same-ISA dynamic binary modifier guided by static binary analysis. A static binary analyser first examines an executable and determines the operations required to extract parallelism at runtime, encoding them as a series of rewrite rules that a dynamic binary modifier uses to perform binary transformation. We demonstrate this framework by exploiting three different kinds of parallelism to perform automatic vectorisation, software prefetching, and automatic parallelisation together on legacy application binaries. Software prefetch insertion alone achieves an average speedup of 1.2Ă—, comparing favourably with an automatic compiler pass. Automatic vectorisation brings speedups of 2.7Ă— on the TSVC benchmarks, significantly beating a compiler approach for some workloads. Finally, combining prefetching, vectorisation, and parallelisation realises a speedup of 3.8Ă— on a representative application loop

    Ubiquitous Memory Introspection (Preliminary Manuscript)

    Get PDF
    Modern memory systems play a critical role in the performance ofapplications, but a detailed understanding of the application behaviorin the memory system is not trivial to attain. It requires timeconsuming simulations of the memory hierarchy using long traces, andoften using detailed modeling. It is increasingly possible to accesshardware performance counters to measure events in the memory system,but the measurements remain coarse grained, better suited forperformance summaries than providing instruction level feedback. Theavailability of a low cost, online, and accurate methodology forderiving fine-grained memory behavior profiles can prove extremelyuseful for runtime analysis and optimization of programs.This paper presents a new methodology for Ubiquitous MemoryIntrospection (UMI). It is an online and lightweight mini-simulationmethodology that focuses on simulating short memory access tracesrecorded from frequently executed code regions. The simulations arefast and can provide profiling results at varying granularities, downto that of a single instruction or address. UMI naturally complementsruntime optimizations techniques and enables new opportunities formemory specific optimizations.In this paper, we present a prototype implementation of a runtimesystem implementing UMI. The prototype is readily deployed oncommodity processors, requires no user intervention, and can operatewith stripped binaries and legacy software. The prototype operateswith an average runtime overhead of 20% but this slowdown is only 6%slower than a state of the art binary instrumentation tool. We used32 benchmarks, including the full suite of SPEC2000 benchmarks, forour evaluation. We show that the mini-simulation results accuratelyreflect the cache performance of two existing memory systems, anIntel Pentium~4 and an AMD Athlon MP (K7) processor. We alsodemonstrate that low level profiling information from the onlinesimulation can serve to identify high-miss rate load instructions with a77% rate of accuracy compared to full offline simulations thatrequired days to complete. The online profiling results are used atruntime to implement a simple software prefetching strategy thatachieves a speedup greater than 60% in the best case
    • …
    corecore