85 research outputs found
The weakening of branch predictor performance as an inevitable side effect of exploiting control independence
Many algorithms are inherently sequential and hard to explicitly parallelize. Cores designed to aggressively handle these problems exhibit deeper pipelines and wider fetch widths to exploit instruction-level parallelism via out-of-order execution. As these parameters increase, so does the amount of instructions fetched along an incorrect path when a branch is mispredicted. Many of the instructions squashed after a branch are control independent, meaning they will be fetched regardless of whether the candidate branch is taken or not. There has been much research in retaining these control independent instructions on misprediction of the candidate branch. This research shows that there is potential for exploiting control independence since under favorable circumstances many benchmarks can exhibit 30% or more speedup. Though these control independent processors are meant to lessen the damage of misprediction, an inherent side-effect of fetching out of order, branch weakening, keeps realized speedup from reaching its potential. This thesis introduces, formally defines, and identifies the types of branch weakening. Useful information is provided to develop techniques that may reduce weakening. A classification is provided that measures each type of weakening to help better determine potential speedup of control independence processors. Experimentation shows that certain applications suffer greatly from weakening. Total branch mispredictions increase by 30% in several cases. Analysis has revealed two broad causes of weakening: changes in branch predictor update times and changes in the outcome history used by branch predictors. Each of these broad causes are classified into more specific causes, one of which is due to the loss of nearby correlation data and cannot be avoided. The classification technique presented in this study measures that 45% of the weakening in the selected SPEC CPU 2000 benchmarks are of this type while 40% involve other changes in outcome history. The remaining 15% is caused by changes in predictor update times. In applying fundamental techniques that reduce weakening, the Control Independence Aware Branch Predictor is developed. This predictor reduces weakening for the majority of chosen benchmarks. In doing so, a control independence processor, snipper, to attain significantly higher speedup for 10 out of 15 studied benchmarks
Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT
National audienceLes architectures parallèles qui suivent le modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchronisée. Cela nécessite de séparer les threads lorsqu'il prennent des chemins différents dans le graphe de contrôle de flot, et d'être à même de les re-synchroniser dès que possible de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article une technique pour traiter la divergence de contrôle en SIMT qui opère en espace constant et gère les sauts indirects et la récursivité. Nous décrivons une réalisation possible qui s'appuie sur le matériel existant de l'unité de gestion de la divergence mémoire, assurant un coût matériel très réduit. En termes de performance, cette solution est au moins aussi efficace que les techniques existantes
Dynamic Task Prediction for an SpMT Architecture Based on Control Independence
Exploiting better performance from computer programs translates to finding more instructions to execute in parallel. Since most general purpose programs are written in an imperatively sequential manner, closely lying instructions are always data dependent, making the designer look far ahead into the program for parallelism. This necessitates wider superscalar processors with larger instruction windows. But superscalars suffer from three key limitations, their inability to scale, sequential fetch bottleneck and high branch misprediction penalty. Recent studies indicate that current superscalars have reached the end of the road and designers will have to look for newer ideas to build computer processors.
Speculative Multithreading (SpMT) is one of the most recent techniques to exploit parallelism from applications. Most SpMT architectures partition a sequential program into multiple threads (or tasks) that can be concurrently executed on multiple processing units. It is desirable that these tasks are sufficiently distant from each other so as to facilitate parallelism. It is also desirable that these tasks are control independent of each other so that execution of a future task is guaranteed in case of local control flow misspeculations. Some task prediction mechanisms rely on the compiler requiring recompilation of programs. Current dynamic mechanisms either rely on program constructs like loop iterations and function and loop boundaries, resulting in unbalanced loads, or predict tasks which are too short to be of use in an SpMT architecture. This thesis is the first proposal of a predictor that dynamically predicts control independent tasks that are consistently wide apart, and executes them on a novel SpMT architecture
Assouplir les contraintes des architectures SIMT à faible coût
Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverage the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state of the art techniques in use in current GPUs.Les architectures parallèles qui obéissent au modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchrone. Lorsque les threads empruntent des chemins divergents dans le graphe de flot de contrôle, leur exécution est séquentialisée jusqu'au prochain point de convergence. La reconvergence doit être effectuée au plus tôt de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article deux techniques permettant de traiter la divergence de contrôle en SIMT et d'identifier dynamiquement les points de reconvergence, dont une qui opère en espace constant et gère les sauts indirects et la récursivité. Nous évaluons une réalisation matérielle consistant à partager le matériel existant de l'unité de gestion de la divergence mémoire. En termes de performances, cette solution est au moins aussi efficace que les techniques de l'état de l'art employés par les GPU actuels
Reconvergence de contrĂ´le implicite pour les architectures SIMT
National audienceParallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverage the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state of the art techniques in use in current GPUs.Les architectures parallèles qui obéissent au modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchrone. Lorsque les threads empruntent des chemins divergents dans le graphe de flot de contrôle, leur exécution est séquentialisée jusqu'au prochain point de convergence. La reconvergence doit être effectuée au plus tôt de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article deux techniques permettant de traiter la divergence de contrôle en SIMT et d'identifier dynamiquement les points de reconvergence, dont une qui opère en espace constant et gère les sauts indirects et la récursivité. Nous évaluons une réalisation matérielle consistant à partager le matériel existant de l'unité de gestion de la divergence mémoire. En termes de performances, cette solution est au moins aussi efficace que les techniques de l'état de l'art employés par les GPU actuels
SYRANT: SYmmetric Resource Allocation on Not-taken and Taken Paths
International audienceIn the multicore era, achieving ultimate single process performance is still an issue e.g. for single process workload or for sequential sections in par- allel applications. Unfortunately, despite tremendous research effort on branch prediction, substantial performance potential is still wasted due to branch mis- predictions. On a branch misprediction resolution, instruction treatment on the wrong path is essentially thrown away. However, in most cases after a conditional branch, the taken and the not-taken paths of execution merge after a few instruc- tions. Instructions that follow the reconvergence point are executed whatever the branch outcome is. We present SYRANT (SYmmetric Resource Allocation on Not-taken and Taken paths), a new technique for exploiting control independence. SYRANT essentially uses the same pipeline structure as a conventional processor. SYRANT tries to enforce the allocation of the exact same resources on the out-of-order execution mechanism (physical register, load/store queue and reorder buffer) for both the taken and not-taken paths. Thus, on a branch misprediction, the result of an in- struction already executed on the wrong path after the reconvergence point can be conserved in the same structure when it is data independent. Adding SYRANT on top of an aggressive superscalar execution core allows to improve performance for applications suffering a significant branch misprediction rate. As a side but important extra contribution, we introduce ABL/SBL a simple and non-intrusive hardware reconvergence detection mechanism. ABL/SBL can be used in a conventional superscalar processor to improve branch prediction ac- curacy through exploiting the execution of branches along the wrong path.De plus en plus d'avancées dans le domaine de l'architecture des processeurs sont basées sur l'exécution spéculative des instructions. Dans le cas particulier de la spéculation liée aux branchements, le chemin d'exécution pour le branchement pris et le branchement non pris reconverge souvent après quelques instructions. Ce phénomène est appelé reconvergence du flot de contrôle. Il en résulte que ces instructions sont exécutées quelque soit la direction du branchement. On parle alors d'indépendance de contrôle. Plusieurs techniques ont été précédement pro- posées pour l'exploiter. Elles sont étudiées dans ce rapport. Une nouvelle technique exploitant l'indépendance de contrôle est présentée dans ce rapport. Nommée SYRANT (SYmmetric Resource Allocation on Not- taken and Taken paths), elle consiste à forcer l'allocation des mêmes ressources sur le chemin pris et non pris. Ainsi, lors d'une mauvaise prédiction, le travail déjà effectué sur le mauvais chemin et situé après le point de reconvergence peut être conservé
Recommended from our members
Performance-efficient mechanisms for managing irregularity in throughput processors
textRecent graphics processing units (GPUs) have emerged as a promising platform for general purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model that balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads and each thread is supported with conditional branch and fine-grained load/store instruction for ease of programming. At the same time, the hardware and software collaboratively enable the grouping of scalar threads to be executed in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in being utilized in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and waste in off-chip memory bandwidth utilization for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures such that they can better manage irregularity. I first identify that previously suggested optimizations to the divergent control flow problem suffers from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on such observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead. In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality.Electrical and Computer Engineerin
Recommended from our members
An enhanced GPU architecture for not-so-regular parallelism with special implications for database search
textGraphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control-flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute to memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) will not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control-flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU called the Large Warp Microarchitecture and Two-Level Warp Scheduling to address these problems respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. I propose enhancements to the GPU architecture including new instructions that improve intra-warp thread communication and decision making, and also a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans, and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications.Electrical and Computer Engineerin
- …