    Scheduling Irregular Workloads on GPUs

    This doctoral research aims to understand the nature of the overhead for data-irregular GPU workloads, to propose a solution, and to examine the consequences of the result. We propose a novel, retry-free GPU workload scheduler for irregular workloads. When used in a Breadth-First Search (BFS) algorithm, the proposed simple, monolithic concurrent queue scales to within 10% of ideal scalability on AMD’s Fiji GPU with 14,336 active threads. The dissertation presents an important finding: the retry overhead associated with Compare-and-Swap (CAS) operations is the principal reason why concurrent queues do not scale well as the number of clients increases in a massively multi-threaded environment.
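
    To make the finding concrete, here is a minimal CUDA sketch, not the dissertation's scheduler: the ring buffer, the names, and the omitted wrap-around and capacity checks are all illustrative assumptions. It contrasts a CAS-based tail reservation, which must retry whenever another thread wins the race, with a retry-free fetch-and-add reservation.

    // Illustrative sketch only (not the dissertation's queue); bounds and
    // wrap-around checks are omitted for brevity.
    __device__ int queue[1 << 20];   // hypothetical ring buffer
    __device__ unsigned int tail;    // shared tail index

    // CAS-based enqueue: every lost race forces another attempt, and with
    // thousands of contending threads most attempts lose.
    __device__ void enqueue_cas(int item) {
        unsigned int old = tail;
        unsigned int assumed;
        do {
            assumed = old;
            old = atomicCAS(&tail, assumed, assumed + 1u);
        } while (old != assumed);    // each mismatch is a wasted retry
        queue[assumed] = item;       // this thread now owns slot 'assumed'
    }

    // Retry-free enqueue: atomicAdd cannot fail, so each thread reserves a
    // unique slot with exactly one atomic operation, regardless of contention.
    __device__ void enqueue_faa(int item) {
        unsigned int slot = atomicAdd(&tail, 1u);
        queue[slot] = item;
    }

    With 14,336 contending threads, nearly every CAS attempt fails and retries, while atomicAdd has no failure path; this contrast is one way to obtain the retry-free behavior the abstract describes.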

    Enabling preemptive multiprogramming on GPUs

    GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios, so such systems are unable to meet key multiprogrammed workload requirements such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x. We would like to thank the anonymous reviewers, Alexander Veidenbaum, Carlos Villavieja, Lluis Vilanova, Lluc Alvarez, and Marc Jorda for their comments and help improving our work and this paper. This work is supported by the European Commission through the TERAFLUX (FP7-249013), Mont-Blanc (FP7-288777), and RoMoL (GA-321253) projects, by NVIDIA through the CUDA Center of Excellence program, by the Spanish Government through Programa Severo Ochoa (SEV-2011-0067), and by the Spanish Ministry of Science and Technology through the TIN2007-60625 and TIN2012-34557 projects.
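
    The preemption mechanisms themselves are hardware extensions and cannot be reproduced in code, but the priority-proportional distribution of streaming multiprocessors (SMs) the abstract describes can be modeled on the host. The sketch below is purely illustrative; Kernel and assign_sms are hypothetical names, not the paper's interface.

    #include <vector>

    struct Kernel {
        int priority;   // higher value = higher priority
        int sms;        // SMs granted by the policy
    };

    // Distribute num_sms across running kernels in proportion to priority;
    // the remainder left by integer division is handed out round-robin.
    void assign_sms(std::vector<Kernel>& kernels, int num_sms) {
        int total = 0;
        for (const Kernel& k : kernels) total += k.priority;
        if (kernels.empty() || total <= 0) return;
        int granted = 0;
        for (Kernel& k : kernels) {
            k.sms = num_sms * k.priority / total;
            granted += k.sms;
        }
        for (int i = 0; granted < num_sms; ++i, ++granted)
            kernels[i % kernels.size()].sms += 1;
    }

    A kernel with priority 2 thus receives roughly twice the SMs of a priority-1 kernel, and the allocation can be recomputed whenever a kernel starts or finishes.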

    Matching non-uniformity for program optimizations on heterogeneous many-core systems

    As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: deepening non-uniform relations among the computing elements in both hardware and software. Beyond traditional non-uniform memory accesses, much deeper non-uniformity appears in the processor, the runtime, and the application, exemplified by asymmetric cache sharing, memory coalescing, and thread divergence on multicore and many-core processors. Being oblivious to this non-uniformity, current applications fail to tap into the full potential of modern computing devices. My research presents a systematic exploration of this emerging property. It examines the existence of the property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise for these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
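
    A textbook instance of the data-reorganization technique mentioned above, shown here as a generic CUDA illustration rather than the dissertation's implementation, is the array-of-structs to struct-of-arrays transformation:

    // With an array of structs, the 32 threads of a warp each load one float
    // from a 16-byte struct, spreading the accesses over a 512-byte span.
    struct ParticleAoS { float x, y, z, w; };

    __global__ void scale_aos(ParticleAoS* p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x *= s;   // strided: several transactions per warp
    }

    // With a struct of arrays, consecutive threads read consecutive floats,
    // so a warp's 32 loads fall within a single 128-byte region and coalesce.
    struct ParticleSoA { float *x, *y, *z, *w; };

    __global__ void scale_soa(ParticleSoA p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] *= s;   // coalesced: few wide transactions per warp
    }

    The two kernels do the same arithmetic; only the data layout differs, which is why such reorganizations can be automated without changing program semantics.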

    A Lightweight Island Model for the Genetic Algorithm over GPGPU

    This paper presents a parallel approach to the genetic algorithm (GA) on the Graphics Processing Unit (GPU) to solve the Traveling Salesman Problem (TSP). Since earlier studies did not implement the island model in a persistent way, this paper introduces an approach, named the Lightweight Island Model (LIM), that applies the concept of persistent threads to the island model of the genetic algorithm. We present the implementation details needed to convert the traditional island model, which is separated into multiple kernels, into a computing paradigm based on a persistent kernel. Several synchronization techniques, including cooperative groups and implicit synchronization, are discussed to reduce the CPU-GPU interaction present in the traditional island model. A new parallelization strategy is presented for distributing the work among live threads during the selection and crossover steps. The GPU configurations that lead to the best possible performance are also determined. The introduced approach is compared, in terms of speedup and solution quality, with the traditional island model (TIM) as well as with related works that propose lighter versions of the master-slave model, including the switching-among-kernels (SAK) and scheduled-light-kernel (SLK) approaches. The results show that the new approach achieves speedups of up to 27x over a serial CPU implementation, 4.5x over the traditional island model, and 1.5–2x over the SAK and SLK approaches.
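
    The persistent-kernel pattern at the core of LIM can be sketched in CUDA as follows; the generation loop and GA operators are placeholders, not the paper's code. A single cooperative launch keeps all threads alive across generations, replacing per-generation kernel launches and CPU-GPU round trips with grid-wide synchronization.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void persistent_ga(int generations /*, population buffers... */) {
        cg::grid_group grid = cg::this_grid();
        for (int g = 0; g < generations; ++g) {
            // selection, crossover, and mutation on this thread's individuals
            grid.sync();   // all islands finish the generation before migrating
            // migration of individuals between islands
            grid.sync();   // migration completes before the next generation
        }
    }

    // grid.sync() is only valid for kernels launched cooperatively, e.g. via
    // cudaLaunchCooperativeKernel, which also caps the grid at a size whose
    // blocks can all be resident on the device simultaneously.

    Because the whole grid must be co-resident for grid.sync() to work, persistent kernels trade maximum grid size for the elimination of launch overhead, which is the trade-off LIM exploits.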

    DD-αAMG on QPACE 3

    We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first-generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.
    Comment: 12 pages, 6 figures, Proceedings of Lattice 201