Dynamic Memory Optimization using Pool Allocation and Prefetching

Author: Rabbah Rodric
Wong Weng Fai
Zhao Qin
Publication date: 01/01/2006

Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, the data layout may exhibit poor spatial locality and degrade cache performance. In this paper, we describe a dynamic heap allocation scheme called pool allocation. The strategy aims to improve cache performance by inspecting memory allocation requests and allocating memory from appropriate heap pools as dictated by the requesting context. The advantages are twofold. First, by pooling together data with a common context, we expect to improve spatial locality, as data fetched into the caches will contain fewer items from different contexts. If the allocation patterns are closely matched to the traversal patterns, the end result is faster memory performance. Second, by pooling heap objects, we expect access patterns to exhibit more regularity, thus creating more opportunities for data prefetching. Our dynamic memory optimizer exploits the increased regularity to insert prefetch instructions at runtime. The optimizations are implemented in DynamoRIO, a dynamic optimization framework. We evaluate the work using various benchmarks, and measure a 17% speedup over gcc -O3 on an Athlon MP, and a 13% speedup on a Pentium 4. Singapore-MIT Alliance (SMA)
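The locality argument in the abstract above can be illustrated with a toy address model. The following is a minimal sketch, not the paper's DynamoRIO implementation: the class names, the 64-byte cache line, the 4096-byte pool chunk, and the use of a string as the "requesting context" are all assumptions made for illustration. It compares how many cache lines a traversal of one context's objects touches under a context-oblivious allocator and under per-context pools.

```python
CACHE_LINE = 64   # bytes per cache line (assumed)
POOL_SIZE = 4096  # bytes reserved per pool chunk (assumed)


class BumpAllocator:
    """Context-oblivious allocator: hands out addresses in request order."""

    def __init__(self):
        self.next_addr = 0

    def alloc(self, size, context):
        addr = self.next_addr
        self.next_addr += size
        return addr


class PoolAllocator:
    """Keeps one bump-allocated pool per requesting context (hypothetical)."""

    def __init__(self):
        self.next_chunk = 0
        self.cursor = {}  # context -> next free address in its current chunk
        self.limit = {}   # context -> end address of its current chunk

    def alloc(self, size, context):
        full = (context not in self.cursor
                or self.cursor[context] + size > self.limit[context])
        if full:
            # Start a fresh chunk for this context.
            self.cursor[context] = self.next_chunk
            self.limit[context] = self.next_chunk + POOL_SIZE
            self.next_chunk += POOL_SIZE
        addr = self.cursor[context]
        self.cursor[context] += size
        return addr


def lines_touched(addresses, size):
    """Count distinct cache lines covered by objects of `size` bytes."""
    lines = set()
    for addr in addresses:
        first = addr // CACHE_LINE
        last = (addr + size - 1) // CACHE_LINE
        lines.update(range(first, last + 1))
    return len(lines)


def run(allocator, n=256, size=16):
    """Interleave allocations from two contexts; return one context's addresses."""
    nodes = []
    for _ in range(n):
        nodes.append(allocator.alloc(size, "list_node"))
        allocator.alloc(size, "tree_node")  # unrelated context in between
    return nodes


bump_lines = lines_touched(run(BumpAllocator()), 16)
pool_lines = lines_touched(run(PoolAllocator()), 16)
print(bump_lines, pool_lines)  # 128 64
```

In this toy model the pooled layout halves the number of lines a list traversal touches, and the pooled addresses also advance by a constant 16 bytes, which is the kind of regularity the abstract says creates prefetching opportunities.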

DSpace@MIT

Best-Offset Hardware Prefetching

Author: Michaud Pierre
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/03/2016

Hardware prefetching is an important feature of modern high-performance processors. When the application working set is too large to fit in on-chip caches, disabling hardware prefetchers may result in severe performance reduction. A new prefetcher was recently introduced, the Sandbox prefetcher, that tries to dynamically find the best prefetch offset using the sandbox method. The Sandbox prefetcher uses simple hardware and was shown to be quite effective. However, the sandbox method does not take into account prefetch timeliness. We propose an offset prefetcher with a new method for selecting the prefetch offset that takes into account prefetch timeliness. We show that our Best-Offset prefetcher outperforms the Sandbox prefetcher on the SPEC CPU2006 benchmarks, with equally simple hardware.
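The abstract above does not describe how the offset is chosen, so the following is only a rough sketch of what "taking timeliness into account" can mean. It assumes a trace of cache-line numbers and models latency as a number of intervening accesses. The function name `best_offset` and the scoring rule are illustrative assumptions, not the paper's hardware design. An offset d earns a point for an access to line X only if line X - d was first seen at least `latency` accesses earlier, meaning a prefetch triggered by X - d would have arrived in time.

```python
def best_offset(trace, candidates, latency):
    """Pick the candidate offset whose prefetches would most often be timely.

    trace      -- sequence of accessed cache-line numbers
    candidates -- offsets (in lines) to evaluate
    latency    -- accesses that elapse before a prefetch completes (assumed model)
    """
    first_seen = {}                      # line -> index of its first access
    scores = {d: 0 for d in candidates}
    for i, line in enumerate(trace):
        for d in candidates:
            j = first_seen.get(line - d)
            # Count d only if the trigger line was seen early enough.
            if j is not None and i - j >= latency:
                scores[d] += 1
        first_seen.setdefault(line, i)
    best = max(candidates, key=lambda d: scores[d])
    return best, scores


sequential = list(range(100, 200))       # 100 consecutive lines
best, scores = best_offset(sequential, [1, 2, 4, 8], latency=3)
print(best, scores)  # 4 {1: 0, 2: 0, 4: 96, 8: 92}
```

On a sequential stream, offsets 1 and 2 predict every access but always too late in this model, so they score zero; offset 4 is the smallest offset that hides the assumed latency.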

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

A Survey of Techniques for Architecting TLBs

Author: Mittal Sparsh
Publication venue: Wiley
Publication date: 01/01/2016

The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers.
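As a concrete picture of what a TLB caches, here is a minimal software model. It is a sketch under stated assumptions and does not reproduce any design from the survey: the `TLB` class, the 4096-byte page size, the fully associative organization, and LRU replacement are all choices made for illustration, and the dictionary lookup on a miss stands in for a real page-table walk.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes per page (assumed)


class TLB:
    """Tiny fully associative TLB model with LRU replacement (assumed policy)."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table  # virtual page number -> physical frame number
        self.cache = OrderedDict()    # cached translations, least recent first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            self.cache[vpn] = self.page_table[vpn]  # stands in for a page walk
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)      # evict least recently used
        return self.cache[vpn] * PAGE_SIZE + offset


tlb = TLB(entries=2, page_table={0: 7, 1: 3, 2: 9})
addresses = [0x0010, 0x0020, 4096 + 5, 2 * 4096, 0x0000]
physical = [tlb.translate(a) for a in addresses]
print(physical, tlb.hits, tlb.misses)
```

With two entries, the second access to page 0 hits, while touching pages 1 and 2 evicts page 0, so the final access misses again. That repeated miss cost is what the management techniques surveyed in the paper aim to reduce.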

Research Archive of Indian Institute of Technology Hyderabad

Hardware-only stream prediction + cache prefetching + dynamic access ordering

Author: Mckee Sally A.
Zhang Chengqiang
Publication venue: University of Utah
Publication date: 01/01/1999

The speed gap between processors and the memory system is becoming the performance bottleneck for many applications, and computations with strided access patterns are among those that suffer most. The vectors used in such applications lack temporal and often spatial locality, and are usually too large to cache. In spite of their poor cache behavior, these access patterns have the advantage of being predictable, which can be exploited to improve the efficiency of the memory subsystem. As a promising technique to relieve the memory system bottleneck, prefetching has been studied in its various forms, as has dynamic memory scheduling. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers impressive speedups for scientific applications with regular access patterns (reducing execution time by almost a factor of two) without negatively affecting the performance of non-streaming programs.
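The stride-based reference prediction table mentioned in the abstract above can be sketched in a few lines. This is a simplified illustration, not the hardware evaluated in the study: the class name, the unbounded dictionary keyed by program counter, and the rule "prefetch once the same nonzero stride is seen twice in a row" are assumptions.

```python
class ReferencePredictionTable:
    """Per-instruction stride detector (simplified, hypothetical policy)."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Predict only when the stride repeats and is nonzero.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None


rpt = ReferencePredictionTable()
streaming = [rpt.access(0x400, a) for a in (1000, 1008, 1016, 1024)]
irregular = [rpt.access(0x500, a) for a in (5, 90, 12, 400)]
print(streaming)  # [None, None, 1024, 1032]
print(irregular)  # [None, None, None, None]
```

The streaming load earns prefetches from its third access onward, while the irregular load never triggers one, which matches the abstract's claim that non-streaming programs need not be penalized.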

The University of Utah: J. Willard Marriott Digital Library

Graph prefetching using data structure knowledge

Author: Ainsworth S
Jones TM
Publication venue: Proceedings of the International Conference on Supercomputing
Publication date: 01/01/2016

Searches on large graphs are heavily memory latency bound, as a result of many high latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads. This leads to CPU stalling for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, many falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate applications' traversals to bring in the appropriate data. This paper presents a design of an explicitly configured prefetcher to improve performance for breadth-first searches and sequential iteration on the efficient and commonly-used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x, with little impact on memory bandwidth requirements. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and ARM Ltd. This is the author accepted manuscript. The final version is available from ACM at http://dx.doi.org/10.1145/2925426.2926254
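To make the access pattern in the abstract above concrete, the sketch below builds a compressed sparse row (CSR) graph and runs a breadth-first search over it. It shows only the software traversal whose dependent loads such a prefetcher would have to follow; it does not model the prefetcher itself. The function names and the small example graph are assumptions.

```python
from collections import deque


def to_csr(num_vertices, edges):
    """Build CSR arrays: offsets[v]..offsets[v+1] index v's slice of targets."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]        # prefix sum of out-degrees
    targets = [0] * len(edges)
    cursor = offsets[:-1]                   # next free slot per vertex (a copy)
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def bfs(offsets, targets, root):
    """Breadth-first search; comments mark the chain of dependent loads."""
    visited = [False] * (len(offsets) - 1)
    visited[root] = True
    queue = deque([root])
    order = []
    while queue:
        v = queue.popleft()                             # 1. work list
        order.append(v)
        for e in range(offsets[v], offsets[v + 1]):     # 2. offsets, indexed by v
            w = targets[e]                              # 3. edge array, sequential
            if not visited[w]:                          # 4. visited, indexed by w
                visited[w] = True
                queue.append(w)
    return order


offsets, targets = to_csr(5, [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
order = bfs(offsets, targets, 0)
print(offsets, targets, order)
```

Steps 2 and 4 are indexed by values loaded in earlier steps, so their addresses look irregular to a stride detector even though they are fully determined by the work list and edge array. That is the structural knowledge the abstract proposes to exploit.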

Apollo (Cambridge)

Cost-effective compiler directed memory prefetching and bypassing

Author: Ayguadé Parra Eduard
Baer Jean-Loup
Ortega Fernández Daniel
Valero Cortés Mateo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2002

Ever-increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge these two gaps by fetching data in advance into both the L1 cache and the register file. Our main contribution in this paper is a hybrid approach to the prefetching problem that combines software and hardware prefetching in a cost-effective way, needing very little hardware support and minimally impacting the design of the processor pipeline. The prefetcher is built on top of a static memory instruction bypassing mechanism, which is in charge of bringing prefetched values into the register file. In this paper we also present a thorough analysis of the limits of both prefetching and memory instruction bypassing. We also compare our prefetching technique with a prior speculative proposal that attacked the same problem, and we show that, at much lower cost, our hybrid solution is better than a realistic implementation of speculative prefetching and bypassing. On average, our hybrid implementation achieves a 13% speed-up over a version with software prefetching in a subset of numerical applications and an average of 43% over a version with no software prefetching (up to 102% for specific benchmarks).

UPCommons. Portal del coneixement obert de la UPC

Improving cache locality for thread-level speculation

Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2006

Crossref

Feasibility analysis of correlation based prefetching using digital signal processing

Author: Morse David L.
Publication venue: RIT Scholar Works
Publication date: 01/01/2004

As the gap between processor performance and memory performance continues to broaden with time, techniques to hide memory latency such as correlation-based prefetching become exceedingly important. When a memory reference issued by the processor misses the level one cache, the request propagates down the memory hierarchy until it finally finds the requested datum. With each layer traversed, the latency grows exponentially. Prefetching is a technique used to hide this latency by attempting to predict which memory references will be requested in the near future, and then loading them into the cache before they are needed. This work investigates the use of digital signal processing techniques in designing an effective prefetch algorithm. The algorithm proposed in this work uses the Kalman filter as the basic digital signal processing block. The sequence of memory address references with respect to time is interpreted as a digital signal. By applying Kalman filtering techniques, a robust prediction algorithm is presented to predict future miss references based on the pattern of previous miss references. The algorithm was simulated using 40 benchmark programs from the Olden, MediaBench, and SPEC benchmark suites for the Alpha 21264 and the PISA (a MIPS-like ISA) instruction set architectures. A main difference between these two ISAs is that the Alpha 21264 ISA contains software prefetch instructions, and the PISA instruction set architecture does not. The simulations place a prefetcher unit between the level one data cache and the level two unified cache. SimpleScalar simulation results for a broad set of benchmark programs using 32 Kalman filter blocks show an average of 6.5% speedup for the Alpha 21264 ISA, and an average of 5.6% speedup for the PISA instruction set architecture, for those benchmark programs which have a potential speedup from prefetching greater than 10%.
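The idea of treating the miss-address sequence as a signal can be sketched with a single two-state Kalman filter. The abstract does not give the state model, the noise parameters, or how the 32 filter blocks are combined, so everything below is an assumption: the class name `StrideKalman`, the constant-stride model (state = address and stride), and the values of `q` and `r`.

```python
class StrideKalman:
    """Two-state Kalman filter tracking [address, stride] (assumed model).

    Transition: address += stride, stride unchanged.
    Measurement: the observed miss address.
    """

    def __init__(self, first_addr, q=1e-3, r=1.0):
        self.x = [float(first_addr), 0.0]   # state estimate
        self.P = [[1.0, 0.0], [0.0, 1e6]]   # stride starts out very uncertain
        self.q = q                          # process noise (assumed)
        self.r = r                          # measurement noise (assumed)

    def predict_next(self):
        """Predicted next miss address."""
        return self.x[0] + self.x[1]

    def update(self, addr):
        (p00, p01), (p10, p11) = self.P
        # Predict step: x' = F x, P' = F P F^T + Q with F = [[1, 1], [0, 1]].
        a = self.x[0] + self.x[1]
        s = self.x[1]
        m00 = p00 + p10 + p01 + p11 + self.q
        m01 = p01 + p11
        m10 = p10 + p11
        m11 = p11 + self.q
        # Update step with measurement matrix H = [1, 0].
        innovation = addr - a
        s_cov = m00 + self.r
        k0, k1 = m00 / s_cov, m10 / s_cov   # Kalman gain
        self.x = [a + k0 * innovation, s + k1 * innovation]
        self.P = [[(1 - k0) * m00, (1 - k0) * m01],
                  [m10 - k1 * m00, m11 - k1 * m01]]


addrs = [1000 + 64 * k for k in range(20)]  # misses 64 bytes apart
kf = StrideKalman(addrs[0])
for addr in addrs[1:]:
    kf.update(addr)
prediction = kf.predict_next()
print(round(prediction), round(kf.x[1]))
```

For this perfectly regular stream the filter settles on a stride close to 64 and predicts an address close to 2280, the next miss. Real miss streams are noisier, which is presumably where a filter's smoothing matters more than a plain stride detector, though the abstract does not spell this out.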

RIT Scholar Works