26 research outputs found

    Page size aware cache prefetching

    The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial cache prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection to 4KB physical page boundaries, even though modern systems use page sizes larger than 4KB to mitigate address translation overheads. This paper exploits the high usage of large pages in modern systems to increase the effectiveness of spatial cache prefetching. We design and propose the Page-size Propagation Module (PPM), a microarchitectural scheme that propagates page size information to the lower-level cache prefetchers, enabling safe prefetching beyond 4KB physical page boundaries when the accessed blocks reside in large pages, at the cost of augmenting the first-level caches' Miss Status Holding Register (MSHR) entries with one additional bit. PPM is compatible with any cache prefetcher without requiring design modifications. We capitalize on PPM's benefits by designing a module that consists of two page size aware prefetchers that inherently use different page sizes to drive prefetching. The composite module uses adaptive logic to dynamically enable the most appropriate page size aware prefetcher. Finally, we show that the proposed designs are transparent to which cache prefetcher is used. We apply the proposed page size exploitation techniques to four state-of-the-art spatial cache prefetchers. Our evaluation shows that our proposals improve single-core geomean performance by up to 8.1% (2.1% at minimum) over the original implementation of the considered prefetchers, across 80 memory-intensive workloads. In multi-core contexts, we report geomean speedups of up to 7.7% across different cache prefetchers and core configurations. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the European Union Horizon 2020 research and innovation program under grant agreement No 955606 (DEEP-SEA EU project), the National Science Foundation through grants CNS-1938064 and CCF-1912617, and the Semiconductor Research Corporation project GRC 2936.001. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry, and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been partially supported by the Grant RYC2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF 'Investing in your future'.
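    The abstract above does not include an implementation, so the following C++ sketch is only a rough illustration of the idea: a spatial prefetcher's boundary check that is relaxed when a single "large page" bit (of the kind PPM proposes to carry through the MSHR) indicates the demand access lies in a large page. All names, the 2MB page size, and the interface are assumptions for illustration, not the paper's actual design.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical page-size-aware prefetch filtering. A conventional spatial
// prefetcher drops candidates that leave the 4KB physical page; if the demand
// access is known to lie in a large page, the boundary can safely be widened.
constexpr uint64_t SMALL_PAGE = 4ULL * 1024;        // 4KB
constexpr uint64_t LARGE_PAGE = 2ULL * 1024 * 1024; // 2MB (assumed)
constexpr uint64_t BLOCK = 64;                      // cache block size in bytes

std::vector<uint64_t> filter_candidates(uint64_t demand_addr,
                                        const std::vector<int64_t>& block_offsets,
                                        bool in_large_page /* bit carried via the MSHR */) {
    const uint64_t page_size = in_large_page ? LARGE_PAGE : SMALL_PAGE;
    const uint64_t page_base = demand_addr & ~(page_size - 1);
    std::vector<uint64_t> prefetches;
    for (int64_t off : block_offsets) {
        uint64_t target = demand_addr + off * BLOCK;
        // Only keep candidates that stay within the same physical page.
        if (target >= page_base && target < page_base + page_size)
            prefetches.push_back(target);
    }
    return prefetches;
}
```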

    A Branch-Directed Data Cache Prefetching Technique for Inorder Processors

    The increasing gap between processor and main memory speeds has become a serious bottleneck to further improvement in system performance. Data prefetching techniques have been proposed to hide the performance impact of such long memory latencies. However, most currently proposed data prefetchers predict future memory accesses based on current memory misses, which limits the information available to guide prefetching. In this thesis, we propose a branch-directed data prefetcher that uses the high prediction accuracy of current-generation branch predictors to predict a future basic block trace that the program will execute and issues prefetches for all the identified memory instructions contained therein. We also propose a novel technique to generate prefetch addresses by exploiting the correlation between the addresses generated by memory instructions and the values of the corresponding source registers at prior branch instances. We evaluate the impact of our prefetcher using a cycle-accurate simulation of an in-order processor on the M5 simulator. The results show that the branch-directed prefetcher improves performance on a set of 18 SPEC CPU2006 benchmarks by an average of 38.789% over a no-prefetching implementation and 2.148% over a system that employs a Spatial Memory Streaming prefetcher.
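    As a loose illustration of the mechanism described above (not the thesis's implementation), the C++ sketch below walks the branch predictor's predicted path a few basic blocks ahead and issues prefetches for the loads found along it. The block map, predictor callbacks, and lookahead depth are all assumed for the example.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical branch-directed prefetching: follow the predicted basic-block
// trace and prefetch the addresses predicted for each load on that path.
struct BasicBlock {
    std::vector<uint64_t> load_pcs;               // PCs of memory instructions in the block
    uint64_t branch_pc, taken_target, fallthrough;
};

void branch_directed_prefetch(
        uint64_t start_pc, int lookahead_blocks,
        const std::unordered_map<uint64_t, BasicBlock>& blocks,
        const std::function<bool(uint64_t)>& predict_taken,        // branch direction
        const std::function<uint64_t(uint64_t)>& predict_address,  // load PC -> predicted address
        const std::function<void(uint64_t)>& issue_prefetch) {
    uint64_t pc = start_pc;
    for (int i = 0; i < lookahead_blocks; ++i) {
        auto it = blocks.find(pc);
        if (it == blocks.end()) return;            // path leaves known code
        const BasicBlock& bb = it->second;
        for (uint64_t load_pc : bb.load_pcs)
            issue_prefetch(predict_address(load_pc));  // prefetch along the predicted trace
        pc = predict_taken(bb.branch_pc) ? bb.taken_target : bb.fallthrough;
    }
}
```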

    Last-touch correlated data streaming

    Recent research advocates address-correlating predictors to identify cache block addresses for prefetching. Unfortunately, address-correlating predictors require correlation data storage proportional in size to a program's active memory footprint. As a result, current proposals for this class of predictor are either limited in coverage due to constrained on-chip storage or limited in prediction lookahead due to long off-chip correlation data lookups. In this paper, we propose Last-Touch Correlated Data Streaming (LT-cords), a practical address-correlating predictor. The key idea of LT-cords is to record correlation data off chip in the order it will be used and stream it into a practically sized on-chip table shortly before it is needed, thereby obviating the need for scalable on-chip tables and enabling low-latency lookup. We use cycle-accurate simulation of an 8-way out-of-order superscalar processor to show that: (1) LT-cords with 214KB of on-chip storage can achieve the same coverage as a last-touch predictor with unlimited storage, without sacrificing predictor lookahead, and (2) LT-cords improves performance by 60% on average and 385% at best in the benchmarks studied.
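    The sketch below illustrates only the streaming idea described above, under the assumption of a pre-recorded off-chip log consumed in order: a small on-chip window is refilled sequentially so lookups hit a practically sized structure. It is not the paper's predictor design.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Illustration: correlation data is logged off chip in the order it will be
// consumed, and a small on-chip buffer is refilled sequentially ahead of use.
class StreamBuffer {
public:
    StreamBuffer(const std::vector<uint64_t>& offchip_log, size_t capacity)
        : log_(offchip_log), capacity_(capacity) {}

    // Keep the on-chip window filled with the next entries of the off-chip log.
    void refill() {
        while (window_.size() < capacity_ && next_ < log_.size())
            window_.push_back(log_[next_++]);
    }

    // Consume the head of the on-chip window as the next prefetch address.
    bool next_prefetch(uint64_t& addr) {
        if (window_.empty()) return false;
        addr = window_.front();
        window_.pop_front();
        return true;
    }

private:
    const std::vector<uint64_t>& log_;  // stands in for off-chip correlation data
    std::deque<uint64_t> window_;       // small on-chip table
    size_t capacity_;
    size_t next_ = 0;                   // position of the stream in the log
};
```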

    An Event-Triggered Programmable Prefetcher for Irregular Workloads

    Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture their memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from annotated source code. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.
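    To make the event-based idea concrete, the C++ sketch below shows a prefetch kernel for an indirect access pattern A[B[i]]: when the line holding B[i] arrives, an event fires and its handler prefetches the dependent address without stalling. The engine interface and handler registration are assumptions for this sketch, not the paper's programming model or hardware API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Toy event-driven prefetch engine: handlers run when requested data arrives.
struct PrefetchEngine {
    std::function<void(uint64_t)> issue_prefetch;   // fetch a cache line by address
    std::unordered_map<uint64_t, std::function<void(uint64_t)>> handlers;

    // Register a handler to run when the data at `addr` becomes available.
    void on_arrival(uint64_t addr, std::function<void(uint64_t)> handler) {
        handlers[addr] = std::move(handler);
    }
    // Called by the cache model when a requested line arrives with its payload.
    void deliver(uint64_t addr, uint64_t value) {
        auto it = handlers.find(addr);
        if (it != handlers.end()) it->second(value);
    }
};

// Prefetch kernel for A[B[i]]: fetch B[i]; on its arrival, prefetch &A[B[i]].
void prefetch_indirect(PrefetchEngine& eng, const uint64_t* A, const uint32_t* B, size_t i) {
    uint64_t b_addr = reinterpret_cast<uint64_t>(&B[i]);
    eng.on_arrival(b_addr, [&eng, A](uint64_t b_value) {
        eng.issue_prefetch(reinterpret_cast<uint64_t>(&A[b_value]));
    });
    eng.issue_prefetch(b_addr);
}
```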

    Caching Techniques in Next Generation Cellular Networks

    Content caching will be an essential feature in the next generations of cellular networks. Indeed, a network equipped with caching capabilities allows users to retrieve content with reduced access delays and consequently reduces the traffic passing through the network backhaul. However, the deployment of caching nodes in the network is hindered by two challenges. First, the storage space of a cache is limited as well as expensive, so it is not possible to store in the cache every content that users may request. This calls for efficient techniques to determine which contents must be stored in the cache. Second, efficient ways are needed to implement and control the caching node. In this thesis, we investigate caching techniques that address these challenges so that the overall system performance is increased. To tackle the challenge of limited storage capacity, smart proactive caching strategies are needed. In the context of vehicular users served by edge nodes, we believe a caching strategy should be adapted to the mobility characteristics of the cars. In this regard, we propose a scheme called RICH (RoadsIde CacHe), which optimally caches content at the edge nodes where connected vehicles require it most. In particular, our scheme is designed to ensure in-order delivery of content chunks to end users. Unlike blind popularity decisions, the probabilistic caching used by RICH considers vehicular trajectory predictions as well as content service time by edge nodes. We evaluate our approach on realistic mobility datasets against a popularity-based edge approach called POP and a mobility-aware caching strategy known as netPredict. In terms of content availability, our RICH edge caching scheme provides an enhancement of up to 33% and 190% when compared with netPredict and POP, respectively, while the backhaul bandwidth penalty is reduced by between 57% and 70%. A caching node is also a key component in Named Data Networking (NDN), an innovative paradigm to provide content-based services in future networks. Compared to legacy networks, the naming of network packets and in-network caching of content make NDN more suitable for content dissemination. However, the implementation of NDN requires drastic changes to the existing network infrastructure. One feasible approach is to use Software Defined Networking (SDN), in which control of the network is delegated to a centralized controller that configures the forwarding data plane. This approach leads to large signaling overhead as well as large end-to-end (e2e) delays. To overcome these issues, in this work we provide an efficient way to implement and control the NDN node. We propose to enable NDN using a stateful data plane in the SDN network. In particular, we realize the functionality of an NDN node using a stateful SDN switch attached to a local cache for content storage, and use OpenState to implement such an approach. In our solution, no involvement of the controller is required once the OpenState switch has been configured. We benchmark the performance of our solution against the traditional SDN approach considering several relevant metrics. Experimental results highlight the benefits of a stateful approach and of our implementation, which avoids signaling overhead and significantly reduces e2e delays.
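    As a loose illustration of mobility-aware edge caching of the kind described above, the C++ sketch below scores content chunks by how likely predicted vehicle trajectories are to reach the edge node and whether the chunk can be served within the predicted dwell time. The scoring rule and data fields are invented for this sketch; they are not the RICH algorithm from the thesis.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Per-chunk inputs a mobility-aware edge cache might consider (assumed fields).
struct ChunkStats {
    double p_visit;       // predicted probability a requesting vehicle reaches this node
    double service_time;  // expected time the node needs to serve the chunk (s)
    double dwell_time;    // predicted time the vehicle stays in coverage (s)
};

// Greedily keep the chunks most likely to be both requested here and
// deliverable before the vehicle leaves coverage. Returns chunk indices.
std::vector<size_t> select_chunks(const std::vector<ChunkStats>& chunks, size_t cache_slots) {
    std::vector<size_t> order(chunks.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    auto score = [&](size_t i) {
        const ChunkStats& c = chunks[i];
        double deliverable = (c.service_time <= c.dwell_time) ? 1.0 : 0.0;
        return c.p_visit * deliverable;   // invented scoring rule
    };
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return score(a) > score(b); });
    if (order.size() > cache_slots) order.resize(cache_slots);
    return order;
}
```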

    On-chip mechanisms to reduce effective memory access latency

    This dissertation develops hardware that automatically reduces the effective latency of accessing memory in both single-core and multi-core systems. To accomplish this, the dissertation shows that all last-level cache misses can be separated into two categories: dependent cache misses and independent cache misses. Independent cache misses have all of the source data required to generate the address of the memory access available on chip, while dependent cache misses depend on data that is located off chip. This dissertation proposes accelerating dependent cache misses by migrating the dependence chain that generates the address of the memory access to the memory controller for execution. Independent cache misses are accelerated using a new mode of runahead execution that executes only filtered dependence chains. With these mechanisms, this dissertation demonstrates a 62% increase in performance and a 19% decrease in effective memory access latency for a quad-core processor on a set of high-memory-intensity workloads.
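    The classification above lends itself to a small illustration: a miss is "independent" if every source value feeding its address computation is already on chip, and "dependent" otherwise. The C++ sketch below is only a schematic of that distinction; the chain and register structures are assumptions, not the dissertation's mechanisms.

```cpp
#include <vector>

// Classify a last-level-cache miss by its address-generating dependence chain:
// if all source values are on chip, it is a candidate for runahead-style
// pre-execution; otherwise it would need its chain executed near memory.
enum class MissClass { Independent, Dependent };

struct ChainOp {
    std::vector<int> source_regs;   // physical registers read by this op
};

MissClass classify_miss(const std::vector<ChainOp>& address_chain,
                        const std::vector<bool>& value_available_on_chip) {
    for (const ChainOp& op : address_chain)
        for (int reg : op.source_regs)
            if (!value_available_on_chip[reg])
                return MissClass::Dependent;   // needs data that is still off chip
    return MissClass::Independent;             // all sources resolvable on chip
}
```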

    MxTasks: a novel processing model to support data processing on modern hardware

    The hardware landscape has changed rapidly in recent years. Modern hardware in today's servers is characterized by many CPU cores, multiple sockets, and vast amounts of main memory structured in NUMA hierarchies. In order to benefit from these highly parallel systems, software has to adapt and actively engage with newly available features. However, the processing models forming the foundation of many performance-oriented applications have remained essentially unchanged. Threads, which serve as the central processing abstraction, can be considered a "black box" that hardly allows any transparency between the application and the system underneath. On the one hand, applications possess knowledge that could assist the system in optimizing execution, such as the data objects they access and their access patterns. On the other hand, the limited opportunities for information exchange force operating systems to make assumptions about the applications' intentions in order to optimize their execution, e.g., for local data access. Applications, in turn, implement optimizations tailored to specific situations, such as sophisticated synchronization mechanisms and hardware-conscious data structures. This work presents MxTasking, a task-based runtime environment that assists the design of data structures and applications for contemporary hardware. MxTasking rethinks the interfaces between performance-oriented applications and the execution substrate, streamlining the information exchange between both layers. By breaking with patterns of processing models designed with past generations of hardware in mind, MxTasking creates novel opportunities to manage resources in a hardware- and application-conscious way. Accordingly, we question the granularity of "conventional" threads and show that fine-granular MxTasks are a viable abstraction unit for characterizing and optimizing execution in a general way. Using various demonstrators in the context of database management systems, we illustrate the practical benefits and explore how challenges such as memory access latencies and error-prone synchronization of concurrent accesses can be addressed straightforwardly and effectively.
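    To illustrate the kind of information exchange described above, the C++ sketch below shows a fine-grained task annotated with the data object it will access and its access mode, so a runtime could place the task near that object's NUMA node. The interface is hypothetical and is not MxTasking's actual API.

```cpp
#include <functional>
#include <vector>

// A task that declares what it touches, so the runtime can act on that hint.
enum class Access { Read, Write };

struct DataObject {
    void* ptr;        // the application data this object wraps
    int numa_node;    // node where the object's memory resides
};

struct Task {
    std::function<void()> body;   // the work to run
    DataObject* target;           // annotated object the task accesses
    Access mode;                  // declared access intention
};

// A trivial "runtime" step: group tasks by the NUMA node of their annotated
// object so each group can be dispatched to worker threads on that node.
std::vector<std::vector<Task>> schedule_by_node(std::vector<Task> tasks, int num_nodes) {
    std::vector<std::vector<Task>> per_node(num_nodes);
    for (Task& t : tasks)
        per_node[t.target->numa_node].push_back(std::move(t));
    return per_node;
}
```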

    High performance and energy-efficient instruction cache design and optimisation for ultra-low-power multi-core clusters

    High energy efficiency and high performance are key requirements for Internet of Things (IoT) end-nodes. Exploiting clusters of multiple programmable processors has recently emerged as a suitable solution to address this challenge. However, one of the main bottlenecks for multi-core architectures is the instruction cache. While private caches suffer from data replication and waste area, fully shared caches lack scalability and form a bottleneck for the operating frequency. Hence, we propose a hybrid solution in which a larger shared cache (L1.5) is shared by multiple cores connected through a low-latency interconnect to small private caches (L1). However, this solution is still limited by capacity misses in the small L1. Thus, we propose a sequential prefetch from L1 to L1.5 to improve performance with little area overhead. Moreover, to cut the critical path for better timing, we optimize the core instruction fetch stage with non-blocking transfers by adopting a 4 x 32-bit ring buffer FIFO and adding a pipeline stage for conditional branches. We present a detailed comparison of the performance and energy efficiency of different instruction cache architectures recently proposed for Parallel Ultra-Low-Power clusters. On average, when executing a set of real-life IoT applications, our two-level cache improves performance by up to 20% while losing 7% energy efficiency with respect to the private cache. Compared to a shared cache system, it improves performance by up to 17% and maintains the same energy efficiency. In the end, up to 20% timing (maximum frequency) improvement and software control enable the two-level instruction cache with prefetch to adapt to various battery-powered use cases, balancing high performance and energy efficiency.
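    As a rough illustration of sequential instruction prefetching between the two cache levels described above, the C++ sketch below fills the missing line into the private L1 on a miss and then prefetches the next few lines that the shared L1.5 already holds. The cache model, line size, and prefetch degree are placeholders, not the actual hardware design.

```cpp
#include <cstdint>
#include <unordered_set>

// Minimal cache model: just the set of resident line addresses.
struct ICache {
    std::unordered_set<uint64_t> lines;
    bool contains(uint64_t line) const { return lines.count(line) != 0; }
    void fill(uint64_t line) { lines.insert(line); }
};

constexpr uint64_t LINE = 16;  // line size in bytes (assumed, not from the paper)

// On an L1 miss, fill the missing line and sequentially prefetch the next
// `degree` lines that the shared L1.5 holds, so straight-line code keeps
// hitting in the private L1.
void on_l1_miss(ICache& l1, ICache& l15, uint64_t fetch_addr, int degree) {
    uint64_t line = fetch_addr & ~(LINE - 1);
    l1.fill(line);                               // demand fill
    for (int d = 1; d <= degree; ++d) {
        uint64_t next = line + d * LINE;
        if (!l1.contains(next) && l15.contains(next))
            l1.fill(next);                       // sequential prefetch from L1.5
    }
}
```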