89 research outputs found
Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU
Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations
Accelerating The Discontinuous Galerkin Cell-Vertex Scheme (Dg-Cvs) Solver On Cpu-Gpu Heterogeneous Systems
Dg-Cvs (Discontinuous Galerkin Cell-Vertex Scheme) is an efficient, accurate and robust numerical solver for general hyperbolic conservation laws. It can solve a broad range of conservation laws such as the shallow water equation and Magnetohydrodynamics equations. Dg-Cvs is a Riemann-Solver-free high order space-time method for arbitrary space conservation laws. It fuses the discontinuous Galerkin (dg) method and the conservation element/solution element (ce/se) method to take advantage of the best features of both methods. Thanks to the ce/se method, the time derivative of the solution is treated as an independent unknown, which is amendable to gpu\u27s parallel execution. In this thesis, we use a cpu-gpu heterogeneous processor to accelerate Dg-Cvs to demonstrate that complex scientific applications can benefit from a heterogeneous computing system. There are challenges that such scientific program poses on the gpu architecture such as thread divergence and low kernel occupancy. We developed optimizations to address these concerns. Our proposed optimizations include thread remapping to minimize thread divergence and register pressure reduction to increase kernel occupancy. Our experiment results show that Dg-Cvs on gpu outperforms cpu by up to 57\% before optimization and 145\% afterwards. We also use Dg-Cvs as a real world application to explore the possibility of using shared virtual memory (svm) for tighter collaboration between cpu and gpu. However, svm did not help improve the performance due to the overhead of address translation and atomic operations. We developed a microbenchmark to better understand the performance impact of svm
Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm
International audienceIn this paper, we address the design and implementation of GPU-accelerated Branch-and-Bound algorithms (B&B) for solving Flow-shop scheduling optimization problems (FSP). Such applications are CPU-time consuming and highly irregular. On the other hand, GPUs are massively multi-threaded accelerators using the SIMD model at execution. A major issue which arises when executing on GPU a B&B applied to FSP is thread or branch divergence. Such divergence is caused by the lower bound function of FSP which contains many irregular loops and conditional instructions. Our challenge is therefore to revisit the design and implementation of B&B applied to FSP dealing with thread divergence. Extensive experiments of the proposed approach have been carried out on well-known FSP benchmarks using an Nvidia Tesla C2050 GPU card. Compared to a CPU-based execution, accelerations up to Ă—77.46 are achieved for large problem instances
Recommended from our members
Divergence Reduction and Dependency Management in GPU Programs using Asynchronous Work Scheduling
With continuing improvements in performance and capability, GPU processing has gained significant and growing interest across science and industry. With this interest, research has increasingly focused upon methods of processing algorithms with stochastic, non-uniform branching while maintaining low divergence. Central among these methods is thread-data remapping (TDR), whereby data is periodically labeled by what processing they require and re-assigned across threads to group data requiring similar processing into the same warp. In prior art, generalized TDR has been implemented as a phased series of exhaustive processing and sorting, with all data processed before the subsequent sorting phase can complete. This thesis discusses the drawbacks of this exhaustive approach and demonstrates a significant alternative for general divergence reduction: By organizing data requiring similar processing into warp-sized arrays, data can be efficiently re-mapped in shared memory on an opportunistic basis. Furthermore, by using a lock-free reservoir to store these warp sized arrays in global memory, load balancing can occur between work groups without the need for global synchronization. By abstracting this remapping process as scheduling in an asynchronous runtime, a general-purpose framework was developed to remap process across arbitrary processing patterns, allowing for easier programming, arbitrary synchronization through barriers, and other useful features
WOODSTOCC: Extracting Latent Parallelism from a DNA Sequence Aligner on a GPU
An exponential increase in the speed of DNA sequencing over the past decade has driven demand for fast, space-efficient algorithms to process the resultant data. The first step in processing is alignment of many short DNA sequences, or reads, against a large reference sequence. This work presents WOODSTOCC, an implementation of short-read alignment designed for Graphics Processing Unit (GPU) architectures. WOODSTOCC translates a novel CPU implementation of gapped short-read alignment, which has guaranteed optimal and complete results, to the GPU. Our implementation combines an irregular trie search with dynamic programming to expose regularly structured parallelism. We first describe this implementation, then discuss its port to the GPU. WOODSTOCC’s GPU port exploits three generally useful techniques for extracting regular parallelism from irregular computations: dynamic thread mapping with a worklist, kernel stage decoupling, and kernel slicing. We discuss the performance impact of these techniques and suggest further opportunities for improvement
Practical Gpgpu Application Resilience Estimation And Fortification
Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. One of the major challenges in the domain of GPU reliability is to accurately measure general purpose GPU (GPGPU) application resilience to transient faults. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Alternatively, fault site selection techniques have been proposed to approach high accuracy with less fault injection experiments. However, most of the existing methods in the literature only focus on the single-bit fault model and only one input. In this dissertation, we offer solutions to the two problems above. We extend a progressive fault site pruning technique for two multi-bit fault models: (a) multi-bit faults in the same word; (b) multiple single-bit faults in different words accessed by the same thread. We devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of application error resilience. Key of the SUGAR estimation methodology is the identification of repeating thread patterns that develop as a function of the size of the input. These patterns allow for accurate prediction of application error resilience for arbitrarily large inputs. With the presence of input-aware estimation strategies, we are able to pinpoint the vulnerabilities in a GPGPU application and propose low overhead protection techniques accordingly. Based on the variety of thread resilience in GPGPU applications, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. Our technique allows engaging partial protection mechanisms at the warp level. We illustrate that threads can be remapped into reliable or unreliable warps with only minimal introduced overhead, and then selective protection via replication is applied in unreliable warps. We show how this remapping facilitates warp replication for error detection and correction and achieves a significant reduction of execution cycles, comparing to standard techniques. In addition to input-aware estimation and fortification, we present a detailed characterization comparing microarchitecture-level and software-level fault injection and show the gap of resilience estimation introduced by injecting faults into different layers in the system execution stack. We also implement a software-level redundancy protection mechanism and measure its effectiveness using microarchitecture-level and software-level fault injection
A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS
Accelerator-based -or heterogeneous- computing has become increasingly
important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes
custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while
ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited
power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can
play here a key role, as they enable unprecedented levels of power-efficiency
compared to CPUs/GPUs. However, such paradigms are still immature and
deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions
for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent
scratchpad memory with a configurable bank remapping system to reduce
bank conflicts. The experimental results show the benefits of both using a
customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed
synchronization master better suits many-cores than standard centralized
solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory
transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated
the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based
on the sparse directory approach, with a selective coherence maintenance
system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and
non-coherent architectural mechanism along with an extended coherence
protocol can enhance performance.
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration
of power-efficient high-performance computing architectures. The system is
based on a NoC and a customizable GPU-like accelerator core, as well as
a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological
results as part of the contribution in this dissertation. In fact, as a key
benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms
do not always support a comprehensive heterogeneous architecture exploration
Doctor of Philosophy
dissertationAs the base of the software stack, system-level software is expected to provide ecient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To solve the rapidly increased computing power demand at the system level, this dissertation proposes using parallel graphics pro- cessing units (GPUs) in system software. GPUs excel at parallel computing, and also have a much faster development trend in parallel performance than central processing units (CPUs). However, system-level software has been originally designed to be latency-oriented. GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. Such mismatch makes it dicult to t the system-level software with the GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The signicant performance improvement found in the evaluation shows the eectiveness and eciency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, and also their future potentials
- …