85 research outputs found

    The weakening of branch predictor performance as an inevitable side effect of exploiting control independence

    Get PDF
    Many algorithms are inherently sequential and hard to explicitly parallelize. Cores designed to aggressively handle these problems exhibit deeper pipelines and wider fetch widths to exploit instruction-level parallelism via out-of-order execution. As these parameters increase, so does the amount of instructions fetched along an incorrect path when a branch is mispredicted. Many of the instructions squashed after a branch are control independent, meaning they will be fetched regardless of whether the candidate branch is taken or not. There has been much research in retaining these control independent instructions on misprediction of the candidate branch. This research shows that there is potential for exploiting control independence since under favorable circumstances many benchmarks can exhibit 30% or more speedup. Though these control independent processors are meant to lessen the damage of misprediction, an inherent side-effect of fetching out of order, branch weakening, keeps realized speedup from reaching its potential. This thesis introduces, formally defines, and identifies the types of branch weakening. Useful information is provided to develop techniques that may reduce weakening. A classification is provided that measures each type of weakening to help better determine potential speedup of control independence processors. Experimentation shows that certain applications suffer greatly from weakening. Total branch mispredictions increase by 30% in several cases. Analysis has revealed two broad causes of weakening: changes in branch predictor update times and changes in the outcome history used by branch predictors. Each of these broad causes are classified into more specific causes, one of which is due to the loss of nearby correlation data and cannot be avoided. The classification technique presented in this study measures that 45% of the weakening in the selected SPEC CPU 2000 benchmarks are of this type while 40% involve other changes in outcome history. The remaining 15% is caused by changes in predictor update times. In applying fundamental techniques that reduce weakening, the Control Independence Aware Branch Predictor is developed. This predictor reduces weakening for the majority of chosen benchmarks. In doing so, a control independence processor, snipper, to attain significantly higher speedup for 10 out of 15 studied benchmarks

    Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT

    Get PDF
    National audienceLes architectures parallèles qui suivent le modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchronisée. Cela nécessite de séparer les threads lorsqu'il prennent des chemins différents dans le graphe de contrôle de flot, et d'être à même de les re-synchroniser dès que possible de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article une technique pour traiter la divergence de contrôle en SIMT qui opère en espace constant et gère les sauts indirects et la récursivité. Nous décrivons une réalisation possible qui s'appuie sur le matériel existant de l'unité de gestion de la divergence mémoire, assurant un coût matériel très réduit. En termes de performance, cette solution est au moins aussi efficace que les techniques existantes

    Dynamic Task Prediction for an SpMT Architecture Based on Control Independence

    Get PDF
    Exploiting better performance from computer programs translates to finding more instructions to execute in parallel. Since most general purpose programs are written in an imperatively sequential manner, closely lying instructions are always data dependent, making the designer look far ahead into the program for parallelism. This necessitates wider superscalar processors with larger instruction windows. But superscalars suffer from three key limitations, their inability to scale, sequential fetch bottleneck and high branch misprediction penalty. Recent studies indicate that current superscalars have reached the end of the road and designers will have to look for newer ideas to build computer processors. Speculative Multithreading (SpMT) is one of the most recent techniques to exploit parallelism from applications. Most SpMT architectures partition a sequential program into multiple threads (or tasks) that can be concurrently executed on multiple processing units. It is desirable that these tasks are sufficiently distant from each other so as to facilitate parallelism. It is also desirable that these tasks are control independent of each other so that execution of a future task is guaranteed in case of local control flow misspeculations. Some task prediction mechanisms rely on the compiler requiring recompilation of programs. Current dynamic mechanisms either rely on program constructs like loop iterations and function and loop boundaries, resulting in unbalanced loads, or predict tasks which are too short to be of use in an SpMT architecture. This thesis is the first proposal of a predictor that dynamically predicts control independent tasks that are consistently wide apart, and executes them on a novel SpMT architecture

    Assouplir les contraintes des architectures SIMT à faible coût

    Get PDF
    Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverage the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state of the art techniques in use in current GPUs.Les architectures parallèles qui obéissent au modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchrone. Lorsque les threads empruntent des chemins divergents dans le graphe de flot de contrôle, leur exécution est séquentialisée jusqu'au prochain point de convergence. La reconvergence doit être effectuée au plus tôt de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article deux techniques permettant de traiter la divergence de contrôle en SIMT et d'identifier dynamiquement les points de reconvergence, dont une qui opère en espace constant et gère les sauts indirects et la récursivité. Nous évaluons une réalisation matérielle consistant à partager le matériel existant de l'unité de gestion de la divergence mémoire. En termes de performances, cette solution est au moins aussi efficace que les techniques de l'état de l'art employés par les GPU actuels

    Reconvergence de contrĂ´le implicite pour les architectures SIMT

    Get PDF
    National audienceParallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverage the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state of the art techniques in use in current GPUs.Les architectures parallèles qui obéissent au modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchrone. Lorsque les threads empruntent des chemins divergents dans le graphe de flot de contrôle, leur exécution est séquentialisée jusqu'au prochain point de convergence. La reconvergence doit être effectuée au plus tôt de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article deux techniques permettant de traiter la divergence de contrôle en SIMT et d'identifier dynamiquement les points de reconvergence, dont une qui opère en espace constant et gère les sauts indirects et la récursivité. Nous évaluons une réalisation matérielle consistant à partager le matériel existant de l'unité de gestion de la divergence mémoire. En termes de performances, cette solution est au moins aussi efficace que les techniques de l'état de l'art employés par les GPU actuels

    SYRANT: SYmmetric Resource Allocation on Not-taken and Taken Paths

    Get PDF
    International audienceIn the multicore era, achieving ultimate single process performance is still an issue e.g. for single process workload or for sequential sections in par- allel applications. Unfortunately, despite tremendous research effort on branch prediction, substantial performance potential is still wasted due to branch mis- predictions. On a branch misprediction resolution, instruction treatment on the wrong path is essentially thrown away. However, in most cases after a conditional branch, the taken and the not-taken paths of execution merge after a few instruc- tions. Instructions that follow the reconvergence point are executed whatever the branch outcome is. We present SYRANT (SYmmetric Resource Allocation on Not-taken and Taken paths), a new technique for exploiting control independence. SYRANT essentially uses the same pipeline structure as a conventional processor. SYRANT tries to enforce the allocation of the exact same resources on the out-of-order execution mechanism (physical register, load/store queue and reorder buffer) for both the taken and not-taken paths. Thus, on a branch misprediction, the result of an in- struction already executed on the wrong path after the reconvergence point can be conserved in the same structure when it is data independent. Adding SYRANT on top of an aggressive superscalar execution core allows to improve performance for applications suffering a significant branch misprediction rate. As a side but important extra contribution, we introduce ABL/SBL a simple and non-intrusive hardware reconvergence detection mechanism. ABL/SBL can be used in a conventional superscalar processor to improve branch prediction ac- curacy through exploiting the execution of branches along the wrong path.De plus en plus d'avancées dans le domaine de l'architecture des processeurs sont basées sur l'exécution spéculative des instructions. Dans le cas particulier de la spéculation liée aux branchements, le chemin d'exécution pour le branchement pris et le branchement non pris reconverge souvent après quelques instructions. Ce phénomène est appelé reconvergence du flot de contrôle. Il en résulte que ces instructions sont exécutées quelque soit la direction du branchement. On parle alors d'indépendance de contrôle. Plusieurs techniques ont été précédement pro- posées pour l'exploiter. Elles sont étudiées dans ce rapport. Une nouvelle technique exploitant l'indépendance de contrôle est présentée dans ce rapport. Nommée SYRANT (SYmmetric Resource Allocation on Not- taken and Taken paths), elle consiste à forcer l'allocation des mêmes ressources sur le chemin pris et non pris. Ainsi, lors d'une mauvaise prédiction, le travail déjà effectué sur le mauvais chemin et situé après le point de reconvergence peut être conservé
    • …
    corecore