Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems

Author: ACC.
AMD.
Amini Mehdi
Dominik Grewe
Kayiran Onur
LLVM.
Michael F. P. O’Boyle
NVIDIA Corp.
Ogilvie William
PathScale Inc.
Portland Group
Project Omini Compiler
Sung Jui
University of Illinois at Urbana-Champaign (UIUC).
Zheng Wang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.

Crossref

Lancaster E-Prints

White Rose Research Online

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance

Author: Adwait Jog
K. Mishra
Mahmut T. K
Onur Kayiran

Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunities for improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and as a result inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency hiding capability. The third scheme, bank-level-parallelism-aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement compared to the commonly employed round-robin warp scheduling policy.

CiteSeerX

Managing GPU Concurrency in Heterogeneous Architectures

Author: Adwait Jog (5358326)
Chita R. Das (5358323)
Gabriel H. Loh (5357927)
Mahmut T. Kandemir (5358047)
Nachiappan Chidambaram Nachiappan (5358320)
Onur Kayiran (5358317)
Onur Mutlu (5357288)
Rachata Ausavarungnirun (5357540)
Publication date: 29/06/2018

Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize the shared hardware resources, such as memory and network, because of their high thread-level parallelism (TLP), and discuss the limitations of existing GPU-based concurrency management techniques when employed in heterogeneous systems. To solve this problem, we propose an integrated concurrency management strategy that modulates the TLP in GPUs to control the performance of both CPU and GPU applications. This mechanism considers both GPU core state and system-wide memory and network congestion information to dynamically decide on the level of GPU concurrency to maximize system performance. We propose and evaluate two schemes: one (CM-CPU) for boosting CPU performance in the presence of GPU interference, the other (CM-BAL) for improving both CPU and GPU performance in a balanced manner and thus overall system performance. Our evaluations show that the first scheme improves average CPU performance by 24%, while reducing average GPU performance by 11%. The second scheme provides 7% average performance improvement for both CPU and GPU applications. We also show that our solution allows the user to control performance trade-offs between CPUs and GPUs.

FigShare

The challenges to push computing to exaflop levels are difficult given desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), which is the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.

CiteSeerX

Highly Concurrent Latency-tolerant Register Files for GPUs

Author: Abdel-Majeed Mohammad
Aho Alfred V.
Annavaram Murali
Annunziata A. J.
Bakhoda Ali
Bakhshalipour M.
Brown Jeffery A.
Chappell Robert S.
Che Shuai
Collins Jamison D.
Cooper Keith D.
Ebrahimi Eiman
Esfeden Hodjat Asghari
Fukami S.
Gebhart Mark
Hashemi Milad
Hecht Matthew S.
Jang Hyunjun
Jeon H.
Jeon Hyeran
Jiang Nan
Jing Naifeng
Jog Adwait
Jones Timothy M.
Kadjo David
Kamruzzaman Md
Kayıran Onur
Kayiran Onur
Khorasani Farzad
Kim Dongkeun
Lattner Chris
Lee Benjamin C.
Lee Chang Joo
Lee Sangpil
Leng Jingwen
Li Chao
Liao Steve S. W.
Lindholm John Erik
Lipasti Mikko H.
Liu Yang
Lu Jiwei
Luk Chi-Keung
Mirhosseini Amirhossein
Muralimanohar Naveen
Mutlu Onur
Nachiappan Nachiappan Chidambaram
Narasiman Veynu
Nematollahi N.
NVIDIA.
Oehmke David W.
Parkin Stuart S. P.
Patney Anjul
Pekhimenko Gennady
Qureshi Moinuddin K.
Rixner Scott
Rogers Timothy G.
Roth Amir
Sadrosadati Mohammad
Sethia Ankit
Sharif Ahmad
Srinath Santhosh
Stratton John A.
Tian Yingying
Trishul
Venkatesan Rangharajan
Vijaykumar N.
Wang Zhenlin
Wenisch Thomas F.
Zuravleff William K.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref