19 research outputs found

Using Runahead Execution to Hide Memory Latency in High Level Synthesis

Author: Shane Fleming
Publication venue: IEEE
Publication date: 01/01/2017

Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open source HLS tool. We create a theoretical model showing that speedup must be between 1x and 2x, and we evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.
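One plausible reading of the 1x to 2x bound can be sketched as follows. This is an assumed overlap model, not the model from the paper: the baseline serialises compute and memory cycles, while an ideal prefetcher overlaps them completely, so run time is limited by the larger of the two. The function name `overlap_speedup` and its parameters are hypothetical.

```python
def overlap_speedup(compute_cycles: float, memory_cycles: float) -> float:
    """Speedup of perfect compute/memory overlap over a serialised baseline.

    Assumed model (not taken from the paper):
      baseline   = compute + memory      (every access blocks the state machine)
      overlapped = max(compute, memory)  (prefetching hides the shorter part)
    """
    if compute_cycles < 0 or memory_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    if compute_cycles + memory_cycles == 0:
        raise ValueError("at least one cycle count must be positive")
    baseline = compute_cycles + memory_cycles
    overlapped = max(compute_cycles, memory_cycles)
    return baseline / overlapped


if __name__ == "__main__":
    # No memory time: nothing to hide, so no speedup.
    print(overlap_speedup(100, 0))
    # Equal compute and memory time: the best case under this model.
    print(overlap_speedup(100, 100))
```

Under this assumption the ratio can never drop below 1 (nothing to hide) or exceed 2 (compute and memory perfectly balanced), which matches the range stated in the abstract.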

Cronfa at Swansea University

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

Author: Fernandez Ivan
Ghose Saugata
Gómez-Luna Juan
Mutlu Onur
Oliveira Geraldo F.
Orosa Lois
Sadrosadati Mohammad
Vijaykumar Nandita
Publication date: 01/01/2021

Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.

arXiv.org e-Print Archive

Repository for Publications and Research Data

Directory of Open Access Journals

DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures

Author: Juan L Aragón
Margaret Martonosi
Tae Jun Ham
Publication date: 03/04/2020

Today's computers employ significant heterogeneity to meet performance targets at manageable power. In adopting increased compute specialization, however, the relative amount of time spent on memory or communication latency has increased. System and software optimizations for memory and communication often come at the costs of increased complexity and reduced portability. We propose Decoupled Supply-Compute (DeSC) as a way to attack memory bottlenecks automatically, while maintaining good portability and low complexity. Drawing from Decoupled Access Execute (DAE) approaches, our work updates and expands on these techniques with increased specialization and automatic compiler support. Across the evaluated workloads, DeSC offers an average of 2.04x speedup over baseline (on homogeneous CMPs) and 1.56x speedup when a DeSC data supplier feeds data to a hardware accelerator. Achieving performance very close to what a perfect cache hierarchy would offer, DeSC offers the performance gains of specialized communication acceleration while maintaining useful generality across platforms.
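The general decoupled access/execute pattern that the abstract builds on can be illustrated in software. The sketch below is only an analogy, assuming a supplier thread that performs the (possibly irregular) loads and a compute routine that consumes values from a bounded queue; it does not represent DeSC's hardware or compiler support, and all names (`supplier`, `compute`, `run_decoupled`, `depth`) are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data stream


def supplier(memory, indices, q):
    """Supply side: resolves addresses and loads values, running ahead of compute."""
    for i in indices:
        q.put(memory[i])  # blocks only when the bounded queue is full
    q.put(_DONE)


def compute(q):
    """Compute side: never issues loads itself, it only consumes supplied values."""
    total = 0
    while True:
        value = q.get()
        if value is _DONE:
            return total
        total += value * value


def run_decoupled(memory, indices, depth=4):
    """Run supplier and compute concurrently, linked by a queue of `depth` entries."""
    q = queue.Queue(maxsize=depth)
    worker = threading.Thread(target=supplier, args=(memory, indices, q))
    worker.start()
    result = compute(q)
    worker.join()
    return result


if __name__ == "__main__":
    # Irregular access order over a small "memory"; sums the squares of 4, 1 and 3.
    print(run_decoupled([1, 2, 3, 4], [3, 0, 2]))
```

The bounded queue is the key design choice in this sketch: it lets the supply side run ahead by up to `depth` values, so load latency can overlap with computation, while still bounding the buffering required.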

CiteSeerX

EASY: efficient arbiter SYnthesis from multi-threaded code

Author: Anderson J
Chen YT
Cheng J
Constantinides G
Fleming S
Publication venue: ACM
Publication date: 15/11/2018

High-Level Synthesis (HLS) tools automatically transform a high-level specification of a circuit into a low-level RTL description. Traditionally, HLS tools have operated on sequential code; however, in recent years there has been a drive to synthesize multi-threaded code. A major challenge facing HLS tools in this context is how to automatically partition memory amongst parallel threads to fully exploit the bandwidth available on an FPGA device and avoid memory contention. Current automatic memory partitioning techniques have inefficient arbitration due to conservative assumptions regarding which threads may access a given memory bank. In this paper, we address this problem through formal verification techniques, permitting a less conservative, yet provably correct circuit to be generated. We perform a static analysis on the code to determine which memory banks are shared by which threads. This analysis enables us to optimize the arbitration efficiency of the generated circuit. We apply our approach to the LegUp HLS tool and show that for a set of typical application benchmarks we can achieve up to 87% area savings, and 39% execution time improvement, with little additional compilation time.

ZENODO

Spiral - Imperial College Digital Repository

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Recommended from our members

Architectural support for message queue task parallelism

Author: Wu Qinzhe
Publication date: 04/01/2024

The scaling of threads is an attractive way to exploit task-level parallelism and boost performance. From the perspective of software programming, many applications (e.g., network packet processing, SQL queries) can be composed of a set of small tasks. Those tasks are arranged in a data flow graph and each task is undertaken by some threads. Message queues are often used to coordinate the tasks among the threads. On the other side, thread scaling is in favor of the hardware advancing trend that there are more Processing Elements (PE) in modern Chip Multiprocessors (CMP) than ever before. This is because a single PE cannot simply run faster due to power and thermal limitations; instead architects have to use more transistors for an increasing number of PEs, in order to improve the overall computing power of a processor. Unfortunately, this paradigm of using message queues to drive parallel tasks sometimes leads to diminishing performance returns due to issues lying in the architecture and system design. Particularly, the conventional coherent shared-memory architectures let task-parallel workloads suffer from unnecessary synchronization overhead and load-to-use latency. For instance, when passing messages through queues, multiple threads could contend for the exclusivity of the cacheline where the shared queue data structure stays. The more threads, the more severe the contention is, because every transition upgrading a cacheline from shared to exclusive state needs to invalidate more copies in the private caches of other cores, and waits for the acknowledgements from more cores. Such an overhead hurts the scalability of threads synchronizing via message queues. Adding to the coherence overhead, the load-to-use latency (from a consumer requesting data until the data is moved to the consumer to use) is often on the critical path, slowing down the computation. 
This is because the cache hierarchy in modern processors creates some layers of local storage to buffer data separately for different cores. Therefore, serving message queue data in an on-demand manner incurs longer load-to-use latency. It is also challenging to schedule message-driven tasks to use cores efficiently when the arrival rate and service rate are mismatched. It wastes CPU cycles if a runtime system leaves tasks blocked on full/empty message queues, while switching tasks has additional scheduling overheads. Diverse system topologies further complicate the problem, as the scheduling also needs to take data locality into consideration. This dissertation explores architectural support for enhancing the scalability of message queue task parallelism, reducing the load-to-use latency, as well as avoiding blocking. Specifically, this dissertation designs and evaluates a message queue architecture that lowers the overhead of synchronization on shared queue states, a speculation technique to hide the load-to-use latency, as well as a locality-aware message queue runtime system with low overhead on scheduling and buffer resizing. The first contribution of the dissertation is the Virtual-Link (VL) scalable message queue architecture. Instead of having threads access the shared queue state variables (i.e., head, tail, or lock) atomically, VL provides configurable hardware support, providing both data transfer and synchronization. Unlike other hardware queue architectures with a dedicated network, VL reuses the existing cache coherence network and delivers a virtualized channel as if there were a direct link (or route) between two arbitrary PEs. 
VL facilitates efficient synchronized data movement between M:N producers and consumers with several benefits: (i) the number of sharers on synchronization primitives is reduced to zero, eliminating a primary bottleneck of traditional lock-free queues, (ii) memory spills, snoops, and invalidations are reduced, (iii) data stays on the fast path (inside the interconnect) a majority of the time. Another contribution of the dissertation is the SPAMeR speculation mechanism. SPAMeR has the capability to speculatively push messages in anticipation of consumer message requests. With the speculation, the latency of moving data from the source to the consumer that needs the data could be partially or fully overlapped with the message processing time. Unlike pre-fetch approaches which predict what addresses to fetch next, with a queue we know exactly what data is needed next but not when it is needed; SPAMeR proposes algorithms to learn from queue operation history in order to predict this. Finally, the dissertation contributes the ARMQ locality-aware runtime. ARMQ collects a set of approaches that avoid message queue blocking, ranging from the most general yielding, to dynamically resizing the buffer, and to spawning helper tasks. On one hand, ARMQ minimizes the overheads (e.g., wasteful polling, context switch, memory allocation and copying, etc.) with a few techniques (e.g., userspace threading, chunk-based ringbuffer, etc.). On the other hand, ARMQ schedules the message-driven tasks precisely and opportunely, in order to maximize the data locality preserved (in favor of cache) and balance the resource allocation.
Electrical and Computer Engineerin

Texas ScholarWorks

The effect of an optical network on-chip on the performance of chip multiprocessors

Author: Van Laer Anouk
Publication venue: UCL (University College London)
Publication date: 28/04/2018

Optical networks on-chip (ONoC) have been proposed to reduce power consumption and increase bandwidth density in high performance chip multiprocessors (CMP), compared to electrical NoCs. However, as buffering in an ONoC is not viable, the end-to-end message path needs to be acquired in advance, during which the message is buffered at the network ingress. This waiting latency is therefore a combination of path setup latency and contention and forms a significant part of the total message latency. Many proposed ONoCs, such as Single Writer, Multiple Reader (SWMR), avoid path setup latency at the expense of increased optical components. In contrast, this thesis investigates a simple circuit-switched ONoC with lower component count where nodes need to request a channel before transmission. To hide the path setup latency, a coherence-based message predictor is proposed, to set up circuits before message arrival. Firstly, the effect of latency and bandwidth on application performance is thoroughly investigated using full-system simulations of shared memory CMPs. It is shown that the latency of an ideal NoC affects the CMP performance more than the NoC bandwidth. Increasing the number of wavelengths per channel decreases the serialisation latency and improves the performance of both ONoC types. With 2 or more wavelengths modulating at 25 Gbit/s, the ONoCs will outperform a conventional electrical mesh (maximal speedup of 20%). The SWMR ONoC outperforms the circuit-switched ONoC. Next, coherence-based prediction techniques are proposed to reduce the waiting latency. The ideal coherence-based predictor reduces the waiting latency by 42%. A more streamlined predictor (smaller than an L1 cache) reduces the waiting latency by 31%. Without prediction, the message latency in the circuit-switched ONoC is 11% larger than in the SWMR ONoC. Applying the realistic predictor reverses this: the message latency in the SWMR ONoC is now 18% larger than in the predictive circuit-switched ONoC.

UCL Discovery

Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

Publication date: 01/01/2018

General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. 
ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
Dissertation/Thesis; Doctoral Dissertation; Computer Science 201

ASU Digital Repository

Datacenter Architectures for the Microservices Era

Author: Mirhosseininiri Seyedamirhossein
Publication date: 01/01/2021

Modern internet services are shifting away from single-binary, monolithic services into numerous loosely-coupled microservices that interact via Remote Procedure Calls (RPCs), to improve programmability, reliability, manageability, and scalability of cloud services. Computer system designers are faced with many new challenges with microservice-based architectures, as individual RPCs/tasks are only a few microseconds in most microservices. In this dissertation, I seek to address the most notable challenges that arise due to the dissimilarities of the modern microservice based and classic monolithic cloud services, and design novel server architectures and runtime systems that enable efficient execution of µs-scale microservices on modern hardware. In the first part of my dissertation, I seek to address the problem of Killer Microseconds, which refers to µs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high throughput µs-scale microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of µs-scale stalls. In chapter II, I propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds, without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity is able to achieve 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency over an SMT-based server design, on average. In chapters III-IV, I comprehensively investigate the problem of tail latency in the context of microservices and address multiple aspects of it. First, in chapter III, I characterize the tail latency behavior of microservices and provide general guidelines for optimizing computer systems from a queuing perspective to minimize tail latency. 
Queuing is a major contributor to end-to-end tail latency, wherein nominal tasks are enqueued behind rare, long ones, due to Head-of-Line (HoL) blocking. Next, in chapter IV, I introduce Q-Zilla, a scheduling framework to tackle tail latency from a queuing perspective, and CoreZilla, a microarchitectural instantiation of the framework. Q-Zilla is composed of the Server-Queue Decoupled Size-Interval Task Assignment (SQD-SITA) scheduling algorithm and the Express-lane Simultaneous Multithreading (ESMT) microarchitecture, which together seek to address HoL blocking by providing an “express-lane” for short tasks, protecting them from queuing behind rare, long ones. By combining the ESMT microarchitecture and the SQD-SITA scheduling algorithm, CoreZilla is able to improve tail latency over a conventional SMT core with 2, 4, and 8 contexts by 2.25×, 3.23×, and 4.38×, on average, respectively, and outperform a theoretical 32-core scale-up organization by 12%, on average, with 8 contexts. Finally, in chapters V-VI, I investigate the tail latency problem of microservices from a cluster, rather than server-level, perspective. Whereas Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction, with microservice-based applications, it is unclear how to scale individual microservices when end-to-end SLOs are violated or underutilized. I introduce Parslo as an analytical framework for partial SLO allocation in virtualized cloud microservices. Parslo takes a microservice graph as an input and employs a Gradient Descent-based approach to allocate “partial SLOs” to different microservice nodes, enabling independent auto-scaling of individual microservices. Parslo achieves the optimal solution, minimizing the total cost for the entire service deployment, and is applicable to general microservice graphs.
PHD; Computer Science & Engineering; University of Michigan, Horace H. Rackham School of Graduate Studies; http://deepblue.lib.umich.edu/bitstream/2027.42/167978/1/miramir_1.pd
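The Head-of-Line blocking problem and the express-lane idea described in this abstract can be illustrated with a toy model. The sketch below assumes all tasks arrive at time 0 and are served in FIFO order within each lane; it shows only plain size-based splitting, not the SQD-SITA algorithm or the ESMT microarchitecture, and the names `fifo_wait_times`, `sita_split`, and `cutoff` are hypothetical.

```python
def fifo_wait_times(sizes):
    """Waiting time of each task when all arrive at t=0 and run in FIFO order."""
    waits, clock = [], 0
    for size in sizes:
        waits.append(clock)  # the task waits for everything queued ahead of it
        clock += size
    return waits


def sita_split(sizes, cutoff):
    """Size-interval assignment: short tasks go to an express lane, the rest to a regular lane."""
    short = [s for s in sizes if s <= cutoff]
    long_ = [s for s in sizes if s > cutoff]
    return short, long_


if __name__ == "__main__":
    tasks = [1, 100, 1, 1]  # one rare, long task among short ones

    # Single queue: the short tasks behind the long one inherit its service time.
    print(fifo_wait_times(tasks))

    # Two lanes: short tasks no longer queue behind the long one.
    short, long_ = sita_split(tasks, cutoff=10)
    print(fifo_wait_times(short), fifo_wait_times(long_))
```

Note that the two-lane case uses two service contexts instead of one, so the comparison illustrates the isolation of short tasks rather than an equal-resource speedup.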

Deep Blue Documents at the University of Michigan