Search CORE

421 research outputs found

An Event-Triggered Programmable Prefetcher for Irregular Workloads

Author: Kocberber O.
Lau E.
Malhotra V.
McSherry F.
Siek J.
Publication venue: ACM SIGPLAN Notices
Publication date: 01/01/2018
Field of study

Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture the memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from the original source code with annotations. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.</jats:p

Crossref

Apollo (Cambridge)

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Author: Ahmadi Agreen
Austin Todd
Behroozi Armand
Dreslinski Ronald
Kaszyk Kuba
Li Lu
Mahlke Scott
May Kyle
Morton John Magnus
Mudge Trevor
Nguyen Brandon
O'Boyle Michael F P
Sun Jiawen
Talati Nishil
Vasiladiotis Christos
Verma Tarunesh
Yang Yichen
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/04/2021
Field of study

Edinburgh Research Explorer

Fast Key-Value Lookups with Node Tracker

Author: Cavus Mustafa
Sendag Resit
Shatnawi Mohammed
Uht Augustus K.
Publication venue: DigitalCommons@URI
Publication date: 01/01/2021
Field of study

Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard-to-predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for eliminating the branch mispredictions only a small fraction of the performance benefit is achieved through prefetching alone. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively

DigitalCommons@URI

Recommended from our members

Software prefetching for indirect memory accesses: A microarchitectural perspective

Author: Ainsworth S
Jones TM
Publication venue: ACM Transactions on Computer Systems
Publication date: 01/01/2019
Field of study

Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting proposition to solve this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before being required. However, these are difficult to insert to effectively improve performance, and techniques for automatic insertion are currently limited. This article develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which gain benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent and the class of programs that are particularly amenable. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× for an Intel Haswell processor, 1.1× for both an ARM Cortex-A57 and Qualcomm Kryo, 1.2× for a Cortex-72 and an Intel Kaby Lake, and 1.35× for an Intel Xeon Phi Knight’s Landing, each of which is an out-of-order core, and performance improvements of 2.1× and 2.7× for the in-order ARM Cortex-A53 and first generation Intel Xeon Phi.EPSRC [EP/K026399/1, EP/M506485/1], ARM Ltd

Apollo (Cambridge)