89 research outputs found

Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

Author: Zhang Zheng
Publication venue: W&M ScholarWorks
Publication date: 01/01/2012

Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphics Processing Units (GPUs), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.
Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.
In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform cache sharing on multithreaded applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between software development and the underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, that is, memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations.
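
The idea of swapping jobs among threads so that each SIMD group follows a single control-flow path can be illustrated with a small CPU-side simulation. This is a minimal sketch under stated assumptions, not G-Streamline itself: the group width of 32, the names `divergent_warps` and `relocate_by_branch`, and the even/odd branch condition are all hypothetical choices made for illustration.

```python
WARP_SIZE = 32  # assumed SIMD group width; not taken from the abstract


def divergent_warps(flags, warp_size=WARP_SIZE):
    """Count SIMD groups whose threads do not all take the same branch."""
    count = 0
    for start in range(0, len(flags), warp_size):
        group = flags[start:start + warp_size]
        if len(set(group)) > 1:  # both branch outcomes appear in this group
            count += 1
    return count


def relocate_by_branch(data, predicate):
    """Return a permutation P that places elements with the same
    branch outcome next to each other (stable within each outcome)."""
    return sorted(range(len(data)), key=lambda i: predicate(data[i]))


# Hypothetical workload: thread i branches on whether data[i] is even.
data = list(range(128))
takes_branch = lambda x: x % 2 == 0

before = [takes_branch(x) for x in data]
P = relocate_by_branch(data, takes_branch)
after = [takes_branch(data[P[i]]) for i in range(len(data))]

print(divergent_warps(before), divergent_warps(after))
```

The permutation can be applied either by copying the data into the new order or by reading `data[P[i]]` in thread `i`. The second option is itself an irregular reference of the A[P[i]] kind, which hints at why the two kinds of irregularity interact.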

College of William & Mary: W&M Publish

Accelerating the Discontinuous Galerkin Cell-Vertex Scheme (DG-CVS) Solver on CPU-GPU Heterogeneous Systems

Author: Hu Xiaoqi
Publication venue: eGrove
Publication date: 01/01/2017

DG-CVS (Discontinuous Galerkin Cell-Vertex Scheme) is an efficient, accurate and robust numerical solver for general hyperbolic conservation laws. It can solve a broad range of conservation laws such as the shallow water equations and the magnetohydrodynamics equations. DG-CVS is a Riemann-solver-free, high-order space-time method for arbitrary space conservation laws. It fuses the discontinuous Galerkin (DG) method and the conservation element/solution element (CE/SE) method to take advantage of the best features of both methods. Thanks to the CE/SE method, the time derivative of the solution is treated as an independent unknown, which is amenable to the GPU's parallel execution. In this thesis, we use a CPU-GPU heterogeneous processor to accelerate DG-CVS to demonstrate that complex scientific applications can benefit from a heterogeneous computing system. Such a scientific program poses challenges on the GPU architecture, such as thread divergence and low kernel occupancy. We developed optimizations to address these concerns. Our proposed optimizations include thread remapping to minimize thread divergence and register pressure reduction to increase kernel occupancy. Our experimental results show that DG-CVS on the GPU outperforms the CPU by up to 57% before optimization and 145% afterwards. We also use DG-CVS as a real-world application to explore the possibility of using shared virtual memory (SVM) for tighter collaboration between the CPU and GPU. However, SVM did not help improve the performance due to the overhead of address translation and atomic operations. We developed a microbenchmark to better understand the performance impact of SVM.

eGrove (Univ. of Mississippi)

Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm

Author: Bendjoudi Ahcène
Chakroun Imen
Melab Nouredine
Mezmaz Mohand
Publication venue: Wiley
Publication date: 01/01/2012

In this paper, we address the design and implementation of GPU-accelerated Branch-and-Bound (B&B) algorithms for solving Flow-shop scheduling optimization problems (FSP). Such applications are CPU-time consuming and highly irregular. On the other hand, GPUs are massively multi-threaded accelerators that use the SIMD model at execution. A major issue that arises when executing a B&B applied to FSP on a GPU is thread or branch divergence. Such divergence is caused by the lower bound function of FSP, which contains many irregular loops and conditional instructions. Our challenge is therefore to revisit the design and implementation of B&B applied to FSP so that it deals with thread divergence. Extensive experiments with the proposed approach have been carried out on well-known FSP benchmarks using an Nvidia Tesla C2050 GPU card. Compared to a CPU-based execution, accelerations of up to ×77.46 are achieved for large problem instances.

HAL - Lille 3

INRIA a CCSD electronic archive server

Hal-Diderot

Divergence Reduction and Dependency Management in GPU Programs using Asynchronous Work Scheduling

Author: Cuneo Braxton
Publication venue: Oregon State University
Publication date

With continuing improvements in performance and capability, GPU processing has gained significant and growing interest across science and industry. With this interest, research has increasingly focused upon methods of processing algorithms with stochastic, non-uniform branching while maintaining low divergence. Central among these methods is thread-data remapping (TDR), whereby data are periodically labeled by the processing they require and re-assigned across threads to group data requiring similar processing into the same warp. In prior art, generalized TDR has been implemented as a phased series of exhaustive processing and sorting, with all data processed before the subsequent sorting phase can complete. This thesis discusses the drawbacks of this exhaustive approach and demonstrates a significant alternative for general divergence reduction: by organizing data requiring similar processing into warp-sized arrays, data can be efficiently re-mapped in shared memory on an opportunistic basis. Furthermore, by using a lock-free reservoir to store these warp-sized arrays in global memory, load balancing can occur between work groups without the need for global synchronization. By abstracting this remapping process as scheduling in an asynchronous runtime, a general-purpose framework was developed to remap process across arbitrary processing patterns, allowing for easier programming, arbitrary synchronization through barriers, and other useful features.
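
The contrast between exhaustive sort-based remapping and opportunistic, warp-sized grouping can be sketched on the CPU. The code below is an illustrative assumption, not the thesis's framework: it is single-threaded, it has no lock-free reservoir or shared memory, and the class name `WarpBatcher`, the warp width of 32, and the `i % 3` labeling are hypothetical.

```python
from collections import defaultdict

WARP_SIZE = 32  # assumed warp width


class WarpBatcher:
    """Groups work items by label and releases a batch as soon as it is warp-sized."""

    def __init__(self, warp_size=WARP_SIZE):
        self.warp_size = warp_size
        self.partial = defaultdict(list)  # label -> items waiting for a full warp

    def push(self, label, item):
        """Add one item; return (label, batch) if a full batch became ready, else None."""
        bucket = self.partial[label]
        bucket.append(item)
        if len(bucket) == self.warp_size:
            self.partial[label] = []  # start a fresh bucket for this label
            return label, bucket
        return None

    def flush(self):
        """Release the remaining partial batches (these would run with idle lanes)."""
        leftovers = [(label, items) for label, items in self.partial.items() if items]
        self.partial.clear()
        return leftovers


# Hypothetical workload: item i needs processing kind i % 3.
batcher = WarpBatcher()
ready = []
for i in range(100):
    out = batcher.push(i % 3, i)
    if out is not None:
        ready.append(out)  # a uniform batch is available without waiting for all data
leftovers = batcher.flush()
```

Every batch in `ready` holds items with a single label, so a warp executing it would follow one branch. Batches become available while items are still arriving, which is the point of not waiting for a global sort.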

ScholarsArchive@OSU

WOODSTOCC: Extracting Latent Parallelism from a DNA Sequence Aligner on a GPU

Author: Buhler Jeremy D
Cole Stephen V
Gardner Jacob R
Publication venue: Washington University Open Scholarship
Publication date: 01/09/2015

An exponential increase in the speed of DNA sequencing over the past decade has driven demand for fast, space-efficient algorithms to process the resultant data. The first step in processing is alignment of many short DNA sequences, or reads, against a large reference sequence. This work presents WOODSTOCC, an implementation of short-read alignment designed for Graphics Processing Unit (GPU) architectures. WOODSTOCC translates a novel CPU implementation of gapped short-read alignment, which has guaranteed optimal and complete results, to the GPU. Our implementation combines an irregular trie search with dynamic programming to expose regularly structured parallelism. We first describe this implementation, then discuss its port to the GPU. WOODSTOCC’s GPU port exploits three generally useful techniques for extracting regular parallelism from irregular computations: dynamic thread mapping with a worklist, kernel stage decoupling, and kernel slicing. We discuss the performance impact of these techniques and suggest further opportunities for improvement

Washington University St. Louis: Open Scholarship

Practical GPGPU Application Resilience Estimation and Fortification

Author: Yang Lishan
Publication venue: W&M ScholarWorks
Publication date: 01/01/2022

Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. One of the major challenges in the domain of GPU reliability is to accurately measure general-purpose GPU (GPGPU) application resilience to transient faults. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Alternatively, fault site selection techniques have been proposed to approach high accuracy with fewer fault injection experiments. However, most of the existing methods in the literature focus only on the single-bit fault model and only one input. In this dissertation, we offer solutions to the two problems above. We extend a progressive fault site pruning technique for two multi-bit fault models: (a) multi-bit faults in the same word; (b) multiple single-bit faults in different words accessed by the same thread. We devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of application error resilience. Key to the SUGAR estimation methodology is the identification of repeating thread patterns that develop as a function of the size of the input. These patterns allow for accurate prediction of application error resilience for arbitrarily large inputs. With input-aware estimation strategies in place, we are able to pinpoint the vulnerabilities in a GPGPU application and propose low-overhead protection techniques accordingly.
Based on the variety of thread resilience in GPGPU applications, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. Our technique allows engaging partial protection mechanisms at the warp level. We illustrate that threads can be remapped into reliable or unreliable warps with only minimal introduced overhead, and then selective protection via replication is applied in unreliable warps. We show how this remapping facilitates warp replication for error detection and correction and achieves a significant reduction of execution cycles compared to standard techniques. In addition to input-aware estimation and fortification, we present a detailed characterization comparing microarchitecture-level and software-level fault injection and show the gap in resilience estimation introduced by injecting faults into different layers of the system execution stack. We also implement a software-level redundancy protection mechanism and measure its effectiveness using microarchitecture-level and software-level fault injection.

College of William & Mary: W&M Publish

A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

Author: Gagliardi Mirko
Publication venue
Publication date: 10/12/2018

Accelerator-based (or heterogeneous) computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While some solutions use custom-made components, most of today’s systems rely on commodity high-end CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered a primary metric. The many-core model and architectural customization can play a key role here, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both using a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master better suits many-cores than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system which allows coherence to be deactivated for blocks that do not require it.
Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological results as part of the contribution in this dissertation. In fact, as a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration
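
To make the bank remapping idea concrete, the following sketch compares a plain modulo bank mapping with an XOR-based remap on a strided access pattern. The abstract does not say which remapping functions the dissertation's hardware supports, so the XOR function here is a commonly used stand-in rather than the author's design; the bank count of 16 and all function names are assumptions.

```python
from collections import Counter

NUM_BANKS = 16  # assumed number of scratchpad banks (a power of two)


def modulo_bank(addr):
    """Conventional mapping: the low address bits select the bank."""
    return addr % NUM_BANKS


def xor_bank(addr):
    """Hypothetical remap: XOR the low bits with the next-higher address bits."""
    return (addr ^ (addr // NUM_BANKS)) % NUM_BANKS


def worst_conflict(addresses, bank_of):
    """Largest number of simultaneous accesses that land in a single bank."""
    return max(Counter(bank_of(a) for a in addresses).values())


# 16 lanes walking a column of a 16-word-wide array (stride equals bank count).
strided = [i * NUM_BANKS for i in range(NUM_BANKS)]
# 16 lanes reading consecutive words.
unit = list(range(NUM_BANKS))

for name, pattern in (("strided", strided), ("unit", unit)):
    print(name, worst_conflict(pattern, modulo_bank), worst_conflict(pattern, xor_bank))
```

With the modulo mapping every strided access falls into bank 0, while the XOR remap spreads the same accesses over all banks and leaves the unit-stride pattern conflict-free. Which function works best depends on the access pattern, which is one reason a configurable remap can be attractive.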

Università degli Studi di Napoli Federico II Open Archive

Doctor of Philosophy

Author: Sun Weibin
Publication venue: University of Utah
Publication date: 01/08/2014

As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To meet the rapidly increasing demand for computing power at the system level, this dissertation proposes using parallel graphics processing units (GPUs) in system software. GPUs excel at parallel computing, and their parallel performance is also improving much faster than that of central processing units (CPUs). However, system-level software was originally designed to be latency-oriented. GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. This mismatch makes it difficult to fit system-level software to GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The significant performance improvement found in the evaluation shows the effectiveness and efficiency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, and also their future potential.

The University of Utah: J. Willard Marriott Digital Library