Search CORE

176 research outputs found

Improving GPU performance : reducing memory conflicts and latency

Author: Braak van den, G.J.W.
Publication venue: Technische Universiteit Eindhoven
Publication date: 25/11/2015
Field of study

Pure OAI Repository

Improving GPU performance : reducing memory conflicts and latency

Author: Braak van den, G.J.W.
Publication venue: Technische Universiteit Eindhoven
Publication date: 25/11/2015
Field of study

Pure OAI Repository

GSI: a GPU stall inspector to characterize the sources of memory stalls for tightly coupled GPUs

Author: Alsop Johnathan R.
Publication venue
Publication date: 01/05/2016
Field of study

In recent years the power wall has prevented the continued scaling of single core performance. This has led to the rise of dark silicon and motivated a move toward parallelism and specialization. As a result, energy-efficient high-throughput GPU cores are increasingly favored for accelerating data-parallel applications. However, the best way to efficiently communicate and synchronize across heterogeneous cores remains an important open research question. Many methods have been proposed to improve the efficiency of heterogeneous memory systems, but current methods for evaluating the performance effects of these innovations are limited in their ability to attribute differences in execution time to sources of latency in the memory system. Performance characterization of tightly coupled CPU-GPU systems is complicated by the high levels of parallelism present in GPU codes. Existing simulation tools provide only coarse-grained metrics which can obscure the underlying memory system interactions that cause performance differences. In this thesis we introduce GPU Stall Inspector (GSI), a method for identifying and visualizing the causes of GPU stalls with a focus on a tightly coupled CPU-GPU memory subsystem. We demonstrate the utility of our approach by evaluating the sources of stalls in several recent architectural innovations for tightly coupled, heterogeneous CPU-GPU systems

Illinois Digital Environment for Access to Learning and Scholarship Repository

Scratchpad Sharing in GPUs

Author: Anantpur Jayvant
Jatala Vishwesh
Karkare Amey
Publication venue
Publication date: 12/02/2017
Field of study

GPGPU applications exploit on-chip scratchpad memory available in the Graphics Processing Units (GPUs) to improve performance. The amount of thread level parallelism present in the GPU is limited by the number of resident threads, which in turn depends on the availability of scratchpad memory in its streaming multiprocessor (SM). Since the scratchpad memory is allocated at thread block granularity, part of the memory may remain unutilized. In this paper, we propose architectural and compiler optimizations to improve the scratchpad utilization. Our approach, Scratchpad Sharing, addresses scratchpad under-utilization by launching additional thread blocks in each SM. These thread blocks use unutilized scratchpad and also share scratchpad with other resident blocks. To improve the performance of scratchpad sharing, we propose Owner Warp First (OWF) scheduling that schedules warps from the additional thread blocks effectively. The performance of this approach, however, is limited by the availability of the shared part of scratchpad. We propose compiler optimizations to improve the availability of shared scratchpad. We describe a scratchpad allocation scheme that helps in allocating scratchpad variables such that shared scratchpad is accessed for short duration. We introduce a new instruction, relssp, that when executed, releases the shared scratchpad. Finally, we describe an analysis for optimal placement of relssp instructions such that shared scratchpad is released as early as possible. We implemented the hardware changes using the GPGPU-Sim simulator and implemented the compiler optimizations in Ocelot framework. We evaluated the effectiveness of our approach on 19 kernels from 3 benchmarks suites: CUDA-SDK, GPGPU-Sim, and Rodinia. The kernels that underutilize scratchpad memory show an average improvement of 19% and maximum improvement of 92.17% compared to the baseline approach

arXiv.org e-Print Archive

Open Access Repository of IISc Research Publications

Domain-specific Architectures for Data-intensive Applications

Author: Addisie Abraham Lamesgin
Publication venue
Publication date: 01/01/2020
Field of study

Graphs' versatile ability to represent diverse relationships, make them effective for a wide range of applications. For instance, search engines use graph-based applications to provide high-quality search results. Medical centers use them to aid in patient diagnosis. Most recently, graphs are also being employed to support the management of viral pandemics. Looking forward, they are showing promise of being critical in unlocking several other opportunities, including combating the spread of fake content in social networks, detecting and preventing fraudulent online transactions in a timely fashion, and in ensuring collision avoidance in autonomous vehicle navigation, to name a few. Unfortunately, all these applications require more computational power than what can be provided by conventional computing systems. The key reason is that graph applications present large working sets that fail to fit in the small on-chip storage of existing computing systems, while at the same time they access data in seemingly unpredictable patterns, thus cannot draw benefit from traditional on-chip storage. In this dissertation, we set out to address the performance limitations of existing computing systems so to enable emerging graph applications like those described above. To achieve this, we identified three key strategies: 1) specializing memory architecture, 2) processing data near its storage, and 3) message coalescing in the network. Based on these strategies, this dissertation develops several solutions: OMEGA, which employs specialized on-chip storage units, with co-located specialized compute engines to accelerate the computation; MessageFusion, which coalesces messages in the interconnect; and Centaur, providing an architecture that optimizes the processing of infrequently-accessed data. Overall, these solutions provide 2x in performance improvements, with negligible hardware overheads, across a wide range of applications. Finally, we demonstrate the applicability of our strategies to other data-intensive domains, by exploring an acceleration solution for MapReduce applications, which achieves a 4x performance speedup, also with negligible area and power overheads.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163186/1/abrahad_1.pd

Deep Blue Documents at the University of Michigan

Recommended from our members

A SIMD architecture for hard real-time systems

Author: Spliet Roy
Publication venue: University of Cambridge
Publication date: 31/03/2020
Field of study

Emerging safety-critical systems require high-performance data-parallel architectures and, problematically, ones that can guarantee tight and safe worst-case execution times. Given the complexity of existing architectures like GPUs, it is unlikely that sufficiently accurate models and algorithms for timing analysis will emerge in the foreseeable future. This motivates a clean-slate approach to designing a real-time data-parallel architecture. In this work I present Sim-D: a wide-SIMD architecture for hard real-time systems. Similar to GPUs, Sim-D performs hardware strip-mining to schedule the work for a compute kernel in entities called work-groups. Sim-D schedules the work for each work-group as a sequence of uninterruptible access- and execute program phases, interleaving the phases of two work-groups. By providing performance isolation between the memory- and compute resources, the execution time of each phase can be tightly bound through static analysis. I present a predictable closed-page DRAM controller that processes requests for large 1D- and 2D blocks of data, as well as indirect indexed transfers. These large transfers coalesce the data requests of a whole work-group. For a linear 4KiB transfer over a 64-bit data bus, the utilisation provably exceeds 78% for DDR4-3200AA DRAM. For 2D blocks, a well-chosen tiling configuration can achieve near-similar efficiency. I show that bounds on the execution time of indexed transfers are pessimistic by nature, but propose a novel snoopy indexed transfer mechanism that permits more reasonable bounds when the buffer size is limited. Finally, I present a worst-case execution time calculation algorithm for Sim-D. This algorithm is paired with two hardware work-group scheduling policies that deterministically reduce run-time variance. The worst-case execution time analysis algorithm combines static control flow analysis with a simulation-based cost model for execution and DRAM transfers. Its key novelty is the addition of a stage that considers work-group scheduling effects. I show that the work-group scheduling policies degrade performance on average by 8.9%, but permit the calculation of worst-case execution time bounds that are tight within 14.3% on average for benchmarks that avoid inefficient indexed transfers

Apollo (Cambridge)

Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

Author: Hyun Bongjoon
Kim Taehun
Lee Dongjae
Rhu Minsoo
Publication venue
Publication date: 01/08/2023
Field of study

Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to their high design overheads and lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deepdive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) enables the compilation of UPMEM-PIM source codes into its compiled machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem will be critical for future PIM architectures to suppor

arXiv.org e-Print Archive

Analysis and modeling of the timing behavior of GPU architectures

Author: Voudouris P.
Publication venue
Publication date: 01/01/2014
Field of study

Repository TU/e

Pure OAI Repository

Using hybrid shared and distributed caching for mixed-coherency GPU workloads

Author: Anssari Nasser
Publication venue
Publication date: 01/12/2012
Field of study

Current GPU computing models support a mixture of coherent and incoherent classes of memory operations. Workloads using these models typically have working sets too large to fit in an economical SRAM structure. Still, GPU architectures have last-level caches to primarily fulfill two functions: eliminate redundant DRAM accesses servicing requests from different L1 caches to the same line, and maintain on-chip memory coherence for the coherent class of memory operations. In this thesis, we propose an alternative memory system design for GPU architectures better fit for their workloads. Our architectural design features a directory-like sharing tracker that allows the incoherent private L1 caches to directly satisfy remote requests for shared data. It also retains a shared L2 cache with a customized caching policy to support coherent accesses on-chip and better serve non-coalesced requests that contend aggressively for cache lines. This thesis characterizes the novel and intriguing tradeoffs between the components of our proposed memory system design for area, energy, and performance. We show that the proposed design achieves a 22% average reduction in DRAM data demand over a standard GPU architecture with 1MB L2 cache, leading to an overall 28% reduction in the memory system energy consumption on average. Conversely, our results show that the DRAM data demand of the proposed design with 256KB L2 cache is on par with a standard GPU architecture with 1MB L2 cache, albeit at a smaller area overhead and power leakage. Our results, while drawn on motivations from the GPU realm, are not architecture-specific and can be extended to other throughput-oriented many-core organizations

Illinois Digital Environment for Access to Learning and Scholarship Repository