Stack-less SIMT reconvergence at low cost
Parallel architectures following the SIMT model, such as GPUs, benefit from application regularity by issuing concurrent threads that run in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose a technique to handle SIMT control divergence that operates in constant space and handles indirect jumps and recursion. We describe a possible implementation that leverages the existing memory divergence management unit, ensuring a low hardware cost. In terms of performance, this solution is at least as efficient as existing techniques.
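To make the baseline concrete: conventional SIMT hardware tracks divergence with a per-warp stack of (reconvergence PC, active mask) entries, which grows with control-flow nesting depth and cannot bound recursion; that growth is the space cost a constant-space scheme avoids. Below is a minimal Python sketch of that conventional stack; the names and encoding are illustrative, not the paper's design.

```python
# Minimal model of a conventional SIMT reconvergence mask stack (illustrative).
class MaskStackSIMT:
    def __init__(self, warp_size):
        self.warp_size = warp_size
        self.stack = []  # grows with nesting depth: the cost a constant-space scheme avoids

    def diverge(self, active_mask, taken_mask, taken_pc, fallthru_pc, reconv_pc):
        """Split the warp at a divergent branch; run the taken side first."""
        not_taken = active_mask & ~taken_mask
        self.stack.append((reconv_pc, active_mask))   # where the full warp reconverges
        self.stack.append((fallthru_pc, not_taken))   # deferred not-taken side
        return taken_pc, active_mask & taken_mask

    def next_path(self):
        """Pop the next (pc, mask) once the current path reaches its reconvergence PC."""
        return self.stack.pop() if self.stack else None

warp = MaskStackSIMT(warp_size=4)
pc, mask = warp.diverge(active_mask=0b1111, taken_mask=0b0011,
                        taken_pc=0x20, fallthru_pc=0x10, reconv_pc=0x40)
print(hex(pc), bin(mask))   # 0x20 0b11 -- threads 0-1 run the taken path
print(warp.next_path())     # (16, 12)  -- then threads 2-3 on the not-taken path
```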
An enhanced GPU architecture for not-so-regular parallelism with special implications for database search
Graphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control-flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute-to-memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) will not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control-flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU called the Large Warp Microarchitecture and Two-Level Warp Scheduling to address these problems, respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. I propose enhancements to the GPU architecture including new instructions that improve intra-warp thread communication and decision making, and also a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans, and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications.
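As a rough illustration of the Two-Level Warp Scheduling idea mentioned above, the sketch below partitions warps into fetch groups and prefers to issue from the current group, so different groups reach long-latency stalls at staggered times. The group size, data structures, and readiness test are assumptions for illustration, not the dissertation's exact policy.

```python
from collections import deque

def pick_warp(groups, is_ready):
    """Prefer warps in the current fetch group (groups[0]); rotate to another
    group only when every warp in the preferred groups is stalled."""
    for g, group in enumerate(groups):
        for warp in group:
            if is_ready(warp):
                if g:
                    groups.rotate(-g)   # stalled groups fall behind the ready one
                return warp
    return None                          # all warps stalled this cycle

groups = deque([deque([0, 1]), deque([2, 3])])       # two fetch groups of two warps
stalled = {0, 1}                                      # group 0 is waiting on memory
print(pick_warp(groups, lambda w: w not in stalled))  # 2: issue switches groups
```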
The dual-path execution model for efficient GPU control flow
Current graphics processing units (GPUs) utilize the single instruction multiple thread (SIMT) execution model. With SIMT, a group of logical threads executes such that all threads in the group execute a single common instruction on a particular cycle. To enable control flow to diverge within the group of threads, GPUs partially serialize execution and follow a single control flow path at a time. The execution of the threads in the group that are not on the current path is masked. Most current GPUs rely on a hardware reconvergence stack to track the multiple concurrent paths and to choose a single path for execution. Control flow paths are pushed onto the stack when they diverge and are popped off of the stack to enable threads to reconverge and keep lane utilization high. The stack algorithm guarantees optimal reconvergence for applications with structured control flow as it traverses the structured control-flow tree depth first. The downside of using the reconvergence stack is that only a single path is followed, which does not maximize available parallelism, degrading performance in some cases. We propose a change to the stack hardware in which the execution of two different paths can be interleaved. While this is a fundamental change to the stack concept, we show how dual-path execution can be implemented with only modest changes to current hardware and that parallelism is increased without sacrificing optimal (structured) control-flow reconvergence. We perform a detailed evaluation of a set of benchmarks with divergent control flow and demonstrate that the dual-path stack architecture is much more robust compared to previous approaches for increasing path parallelism. Dual-path execution either matches the performance of the baseline single-path stack architecture or outperforms single-path execution by 14.9% on average and by over 30% in some cases.
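A toy model of the dual-path idea: the top-of-stack entry keeps two (PC, mask) paths that the scheduler may interleave, rather than a single path that serializes execution until reconvergence. The field names below are illustrative, not the paper's hardware layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Path = Tuple[int, int]  # (pc, active_mask)

@dataclass
class DualPathEntry:
    reconv_pc: int
    path_a: Optional[Path]   # e.g. the taken side of the branch
    path_b: Optional[Path]   # the not-taken side, no longer serialized behind path_a

def schedulable_paths(entry):
    """With a single-path stack, path_b waits until path_a reaches reconv_pc;
    here both sides are candidates for interleaved issue."""
    return [p for p in (entry.path_a, entry.path_b) if p is not None]

top = DualPathEntry(reconv_pc=0x40, path_a=(0x20, 0b0011), path_b=(0x10, 0b1100))
print(schedulable_paths(top))   # [(32, 3), (16, 12)] -- both paths may issue
```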
Performance-efficient mechanisms for managing irregularity in throughput processors
Recent graphics processing units (GPUs) have emerged as a promising platform for general purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model, which balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread is supported with conditional branches and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively enable the grouping of scalar threads to be executed in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and wasted off-chip memory bandwidth for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures so that they can better manage irregularity. I first identify that previously suggested optimizations to the divergent control flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on these observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead. In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality, without sacrificing the control simplicity or prefetching potential of coarse-grained accesses for data with high spatial locality.
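The adaptive access granularity described in the second half of the abstract can be sketched as a simple policy choice on each miss: fetch the full coarse-grained line when tracked spatial locality is high, otherwise fetch only the requested fine-grained sector. The sizes, threshold, region granularity, and tracking table below are illustrative assumptions, not the dissertation's design.

```python
COARSE, FINE = 128, 32   # bytes: full line vs. one sector (illustrative sizes)

def access_granularity(addr, region_locality, threshold=0.5):
    """Pick the fetch size from a per-region history of how much of each
    fetched line was actually touched before eviction (hypothetical tracker)."""
    used_fraction = region_locality.get(addr // 4096, 1.0)  # default: assume dense
    return COARSE if used_fraction >= threshold else FINE

tracker = {0: 0.9, 1: 0.2}                  # region 0 dense, region 1 sparse
print(access_granularity(0x0100, tracker))  # 128 -- high spatial locality
print(access_granularity(0x1100, tracker))  # 32  -- fetch only the needed sector
```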
Scheduling paths leveraging dynamic information in SIMT architectures
Thread divergence optimization in GPU architectures has long been hindered by restrictive control-flow mechanisms based on stacks of execution masks. However, GPU architectures recently began implementing more flexible hardware mechanisms, presumably based on path tables. We leverage this opportunity by proposing a hardware implementation of iteration shifting, a divergence optimization that enables lockstep execution across arbitrary iterations of a loop. Although software implementations of iteration shifting have been previously proposed, implementing this scheduling technique in hardware lets us leverage dynamic information such as divergence patterns and memory stalls. Evaluation using simulation suggests that the expected performance improvements will remain modest or even nonexistent unless the organization of the memory access path is also revisited.
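As a toy illustration of iteration shifting: where a conventional reconvergence stack only merges threads at the same (PC, iteration) point, iteration shifting regroups threads by PC alone, so threads on different iterations of the same loop re-enter lockstep. The sketch below shows just that regrouping step, not the paper's hardware path table.

```python
from collections import defaultdict

def regroup_by_pc(threads):
    """threads: (tid, pc, iteration) triples. A conventional stack only merges
    threads at the same (pc, iteration); iteration shifting merges on pc alone."""
    groups = defaultdict(list)
    for tid, pc, _iteration in threads:
        groups[pc].append(tid)
    return dict(groups)

state = [(0, 0x80, 3), (1, 0x80, 5), (2, 0x80, 4), (3, 0xC0, 7)]
print(regroup_by_pc(state))   # {128: [0, 1, 2], 192: [3]} -- threads 0-2 in lockstep
```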
GPU Wavefront Splitting for Safety-Critical Systems
Graphics processing units (GPUs) are compute platforms that are ideal for highly parallel workloads due to their high degree of hardware parallelism. The parallelism offered by GPUs lends itself well to machine learning and computer vision applications, including in safety-critical systems. Safety-critical systems require a guarantee of timing predictability, which means being able to statically analyze the worst-case execution time (WCET) of the GPU program. Unfortunately, existing GPUs are designed for average-case performance and are thus not designed for timing predictability. Consequently, there is potential for research effort to provide these guarantees.
Prior research has proposed several new techniques to improve performance. One such technique is wavefront splitting, which reduces the number of idle threads on the GPU and increases utilization. However, no prior work addresses the WCET of this technique. The purpose of this thesis is to develop a GPU implementation for safety-critical systems that leverages wavefront splitting and to enable analysis of the WCET of such an implementation.
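A minimal sketch of the wavefront-splitting mechanism the thesis builds on: at a divergent branch, the wavefront splits into two independently schedulable sub-wavefronts instead of masking one side off. This is illustrative only; a WCET-analyzable variant would additionally need to bound the number of concurrent splits.

```python
def split_wavefront(active_mask, taken_mask, taken_pc, fallthru_pc):
    """Split at a divergent branch into independently schedulable sub-wavefronts."""
    subs = []
    if active_mask & taken_mask:
        subs.append({"pc": taken_pc, "mask": active_mask & taken_mask})
    if active_mask & ~taken_mask:
        subs.append({"pc": fallthru_pc, "mask": active_mask & ~taken_mask})
    return subs   # each sub-wavefront issues on its own; neither side sits idle

print(split_wavefront(0b1111, 0b0101, taken_pc=0x20, fallthru_pc=0x30))
```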
Fine-grained containment domains for throughput processors
Continued scaling of semiconductor technology has made modern processors rely on large design margins to guarantee correct operation under worst-case conditions. Design margins appear in the form of higher supply voltage or lower clock frequency, leading to inefficiency. In practice, it is rare to observe such worst-case conditions, and the processor can run at a reduced voltage or higher frequency while experiencing only a few infrequent errors. Recent proposals have used hardware error detectors and recovery mechanisms to detect and recover from these rare errors, a technique known as timing speculation. While this is effective for out-of-order processors with an inherent capability to recover from misspeculation, implementing similar hardware for throughput processors such as graphics processing units (GPUs) is prohibitively costly due to the massive amount of thread context that needs to be preserved. Furthermore, recovery overhead is much higher, since the SIMD (single instruction, multiple data) execution model of GPUs requires multiple threads to roll back together in case of an error. In this dissertation, I develop a hardware/software co-design approach to enable reduced-margin operation on GPUs that overcomes the limitations of existing techniques. The proposed scheme leverages the hierarchical programming model of GPUs to provide hierarchical and uncoordinated local checkpoint-recovery. By decomposing a program into a hierarchically nested tree of code blocks, which I refer to as containment domains (CDs), the program becomes amenable to automatic analysis and tuning, and an optimal trade-off can be made between preservation and recovery overhead. To aid this optimization process, an analytical model is developed to estimate the performance efficiency of a given application setting at a given error rate. With the analytical model, an exhaustive search can be performed to find the optimal solution. The tunability also allows the proposed scheme to easily adapt to a wide range of error rates, making it future-proof against emerging uncertainties in semiconductor design. The proposed scheme combines software and hardware components to achieve the highest efficiency in preservation, restoration, and recovery. The software components include: 1) an API and runtime that let programmers describe the hierarchy of containment domains within an application and preserve the state required for rollback recovery, and 2) a compiler analysis that automatically inserts preservation routines for register variables. The hardware components include: 1) a stack structure to keep track of recovery program counters (PCs), 2) a set of error containment mechanisms to guarantee that no erroneous data is propagated outside of a containment domain, and 3) an error reporting architecture that keeps track of affected threads and initiates their recovery.
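To give a feel for the software side, here is a hypothetical sketch of what a containment-domain API along the lines described above might look like; all names (ContainmentDomain, preserve, recover) are invented for illustration and are not the actual runtime's interface.

```python
import copy

class ContainmentDomain:
    """Hypothetical programmer-side view of a nested containment domain."""
    def __init__(self, parent=None):
        self.parent = parent
        self.preserved = {}                 # state saved for rollback recovery

    def preserve(self, name, value):
        self.preserved[name] = copy.deepcopy(value)

    def recover(self):
        """Uncoordinated local recovery: restore only this domain's state."""
        return {k: copy.deepcopy(v) for k, v in self.preserved.items()}

root = ContainmentDomain()
root.preserve("partial_sums", [0.0] * 8)
tile_cd = ContainmentDomain(parent=root)    # nested CD: finer-grained recovery
tile_cd.preserve("tile", [[1, 2], [3, 4]])
print(tile_cd.recover())                    # an error in the child rolls back
                                            # the child only, not the root
```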
Spatio-temporal SIMT and scalarization for improving GPU efficiency
Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance enhancement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.
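The scalarization opportunity mentioned above can be illustrated simply: when every lane of a warp holds the same operand values, an instruction can execute once on a scalar datapath instead of once per SIMD lane. The sketch below is illustrative only, not the paper's hardware design.

```python
def scalarize_add(lhs_lanes, rhs_lanes):
    """Issue one scalar add when operands are uniform across the warp,
    otherwise fall back to a per-lane SIMD add."""
    if len(set(lhs_lanes)) == 1 and len(set(rhs_lanes)) == 1:
        return [lhs_lanes[0] + rhs_lanes[0]] * len(lhs_lanes)   # one scalar op
    return [a + b for a, b in zip(lhs_lanes, rhs_lanes)]        # full SIMD

print(scalarize_add([5] * 4, [3] * 4))        # uniform: executes once
print(scalarize_add([1, 2, 3, 4], [3] * 4))   # divergent: per-lane execution
```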