45 research outputs found

    Field-based branch prediction for packet processing engines

    Network processors have exploited many aspects of architecture design, such as multiple cores, multi-threading and hardware accelerators, to support both ever-increasing line rates and the growing complexity of network applications. Micro-architectural techniques such as superscalar execution, deep pipelining and speculative execution offer an excellent way of improving performance without limiting either scalability or flexibility, provided that the branch penalty is well controlled. However, it is difficult for traditional branch predictors to keep increasing accuracy by using larger tables, because packet processing exhibits fewer variations in branch patterns. To improve prediction efficiency, we propose a flow-based prediction mechanism that caches the branch histories of packets with similar header fields, since such packets normally follow the same execution path. For packets that cannot find a matching entry in the history table, a fallback gshare predictor provides the branch direction. Simulation results show that our scheme achieves an average hit rate in excess of 97.5% on a selected set of network applications and real-life packet traces, with a chip area similar to that of existing branch prediction architectures used in modern microprocessors.
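
    As a rough illustration of the mechanism described above, the sketch below pairs a small flow history table, indexed by a hash of packet header fields, with a conventional gshare fallback. All names (flow_table, gshare_pht, flow_hash) and sizes are illustrative assumptions, not the paper's actual design.

    #include <stdint.h>

    /* Illustrative sizes -- not taken from the paper. */
    #define FLOW_ENTRIES   1024          /* flow history table            */
    #define GSHARE_ENTRIES 4096          /* fallback gshare PHT           */
    #define HIST_BITS        12          /* gshare global history length  */

    typedef struct {
        uint32_t tag;        /* hash of the packet's header fields        */
        uint64_t outcomes;   /* cached outcomes, one bit per branch index */
        uint8_t  valid;
    } flow_entry_t;

    static flow_entry_t flow_table[FLOW_ENTRIES];
    static uint8_t      gshare_pht[GSHARE_ENTRIES];   /* 2-bit counters   */
    static uint32_t     ghr;                          /* global history   */

    /* Hash the header fields that identify a flow (hypothetical 5-tuple). */
    static uint32_t flow_hash(uint32_t src, uint32_t dst, uint16_t sport,
                              uint16_t dport, uint8_t proto)
    {
        return src ^ (dst * 2654435761u) ^ ((uint32_t)sport << 16) ^ dport ^ proto;
    }

    /* Predict the branch_idx-th conditional branch executed for a packet. */
    static int predict(uint32_t fhash, unsigned branch_idx, uint32_t pc)
    {
        flow_entry_t *e = &flow_table[fhash % FLOW_ENTRIES];
        if (e->valid && e->tag == fhash && branch_idx < 64)
            return (int)((e->outcomes >> branch_idx) & 1);   /* history hit */

        /* Miss: fall back to a conventional gshare prediction. */
        return gshare_pht[(pc ^ ghr) & (GSHARE_ENTRIES - 1)] >= 2;
    }

    /* Update both structures once the branch resolves. */
    static void update(uint32_t fhash, unsigned branch_idx, uint32_t pc, int taken)
    {
        flow_entry_t *e = &flow_table[fhash % FLOW_ENTRIES];
        if (!e->valid || e->tag != fhash) {          /* allocate on miss   */
            e->valid = 1;
            e->tag = fhash;
            e->outcomes = 0;
        }
        if (branch_idx < 64) {
            if (taken) e->outcomes |=  (1ull << branch_idx);
            else       e->outcomes &= ~(1ull << branch_idx);
        }

        uint8_t *ctr = &gshare_pht[(pc ^ ghr) & (GSHARE_ENTRIES - 1)];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
        ghr = ((ghr << 1) | (unsigned)(taken & 1)) & ((1u << HIST_BITS) - 1);
    }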

    An evaluation of multiple branch predictor and trace cache advanced fetch unit designs for dynamically scheduled superscalar processors

    Semiconductor feature size continues to decrease, permitting superscalar microprocessors to keep increasing the number of functional units available for execution. As the instruction issue width increases beyond the five-instruction average basic block size of integer programs, more than one basic block must be issued per cycle to continue to increase instructions-per-cycle (IPC) performance. Researchers have created methods of fetching instructions beyond the first taken branch to overcome the bottleneck created by the limitations of conventional single branch predictors. We compare the performance of the multiple branch prediction (MBP) and trace cache (TC) fetch unit optimization methods. Multiple branch predictor fetch unit designs issue multiple basic blocks per cycle using a branch address cache and a multiple branch predictor. A trace cache uses the runtime instruction stream to create fixed-length instruction traces that encapsulate multiple basic blocks. The comparison is performed using a SPARC v8 timing-based simulator. We simulate both advanced fetch methods and execute benchmarks from the SPEC CPU2000 suite. The results of the simulations are compared and a detailed analysis of both microarchitectures is performed. We find that both fetch unit designs provide competitive IPC performance. As the issue width is increased from an eight-way to a sixteen-way superscalar, IPC performance improves, implying that these fetch unit designs are able to take advantage of the wider issue widths. The MBP can use a smaller L1 instruction cache than the TC and yet achieve similar IPC performance. The pre-arranged instructions provided by the TC allow the pipeline to be shortened in comparison to the MBP, and the shorter pipeline significantly improves IPC performance. Prior trace cache research used two or more ports to the instruction cache to improve the chances of fetching a full basic block per cycle, at the expense of instruction cache line realignment complexity. Our results show good performance with a single instruction cache port. We study approximately equal-cost implementations of the MBP and TC. Of the six benchmarks studied, the TC outperforms the MBP on four.
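
    The sketch below illustrates the trace cache side of the comparison: fixed-length traces built from the retiring instruction stream are looked up by starting PC and by the directions supplied by a multiple branch predictor. The structure sizes and helper names (tc_lookup, tc_fill) are assumptions for illustration, not the configuration simulated in the thesis.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    /* Illustrative sizes -- not taken from the thesis. */
    #define TC_SETS        256   /* trace cache entries (direct-mapped)       */
    #define TRACE_LEN       16   /* instructions per trace                    */
    #define MAX_BRANCHES     3   /* embedded conditional branches per trace   */

    typedef struct {
        bool     valid;
        uint32_t start_pc;            /* PC of the first instruction           */
        uint8_t  nbranches;           /* conditional branches inside the trace */
        uint8_t  directions;          /* their recorded outcomes, one bit each */
        uint32_t insts[TRACE_LEN];    /* the pre-arranged instruction words    */
        uint8_t  len;
    } trace_t;

    static trace_t trace_cache[TC_SETS];

    /* Fetch: hit only if the stored trace starts at this PC and its embedded
     * branch outcomes match what the multiple branch predictor predicts.     */
    static const trace_t *tc_lookup(uint32_t pc, uint8_t predicted_dirs)
    {
        const trace_t *t = &trace_cache[(pc >> 2) % TC_SETS];
        if (!t->valid || t->start_pc != pc)
            return NULL;
        uint8_t mask = (uint8_t)((1u << t->nbranches) - 1);
        if ((t->directions & mask) != (predicted_dirs & mask))
            return NULL;              /* path mismatch: fall back to I-cache   */
        return t;
    }

    /* Fill: build a trace from the retiring instruction stream. */
    static void tc_fill(uint32_t start_pc, const uint32_t *insts, uint8_t len,
                        uint8_t nbranches, uint8_t directions)
    {
        trace_t *t = &trace_cache[(start_pc >> 2) % TC_SETS];
        t->valid = true;
        t->start_pc = start_pc;
        t->len = len < TRACE_LEN ? len : TRACE_LEN;
        t->nbranches = nbranches <= MAX_BRANCHES ? nbranches : MAX_BRANCHES;
        t->directions = directions;
        memcpy(t->insts, insts, t->len * sizeof insts[0]);
    }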

    Employing Variable Cross-Reference Prediction and Iterative Dispatch to Raise Dynamic Branch Prediction Accuracy

    To improve branch prediction accuracy for the two-level adaptive branch predictor, two schemes, dealing respectively with the prediction and dispatch parts, are presented in this paper. The proposed VCR prediction scheme achieves desirable prediction accuracy, with reasonably low time complexity and no extra hardware cost, by variably cross-referencing traces in the PHT to make predictions. The iterative dispatch approach utilizes the PHT history to dispatch into an additional layer of pattern history, which provides more information for making better predictions. To attain desirable prediction accuracy at reduced cost, a combined predictor formed from the proposed VCR scheme and the optimal PPM algorithm is also considered. Extensive trace-driven simulations have been conducted to evaluate the performance of our proposed schemes and other predictors. The results indicate that our proposed schemes compare favorably in most situations in terms of prediction accuracy.
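
    For context, the following is a minimal sketch of the baseline two-level adaptive predictor (per-branch history registers indexing a shared PHT of 2-bit counters) on which schemes such as VCR operate; the table sizes and names are illustrative, and the VCR cross-referencing itself is not reproduced here.

    #include <stdint.h>

    #define BHT_ENTRIES 1024          /* first level: per-branch histories   */
    #define HIST_LEN      10          /* bits of local history per branch    */
    #define PHT_ENTRIES (1 << HIST_LEN)

    static uint16_t bht[BHT_ENTRIES];   /* branch history registers          */
    static uint8_t  pht[PHT_ENTRIES];   /* second level: 2-bit counters      */

    static int predict(uint32_t pc)
    {
        uint16_t hist = bht[pc % BHT_ENTRIES] & (PHT_ENTRIES - 1);
        return pht[hist] >= 2;          /* MSB of the saturating counter     */
    }

    static void update(uint32_t pc, int taken)
    {
        uint16_t *hist = &bht[pc % BHT_ENTRIES];
        uint8_t  *ctr  = &pht[*hist & (PHT_ENTRIES - 1)];

        if (taken  && *ctr < 3) (*ctr)++;         /* train the counter       */
        if (!taken && *ctr > 0) (*ctr)--;
        *hist = (uint16_t)(((*hist << 1) | (taken & 1)) & (PHT_ENTRIES - 1));
    }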

    Tool for the simulation and visualization of superscalar processors

    Our project consisted of developing branch predictors for use with the SimpleScalar simulator. SimpleScalar is a powerful tool that allows a superscalar processor to be simulated from several points of view. One of these is branch prediction, which is fundamental to the correct operation and performance of a microprocessor. The simulator is divided into several modules, each of them open source, so the code can be modified, letting researchers conveniently test whatever they are interested in. The code is written in C, and it is structured and modularized in such a way that changes can be made with relative ease.
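
    A minimal, self-contained example of the kind of predictor one might develop for such a project is sketched below: a bimodal table of 2-bit counters in plain C. The struct and function names (bimod_create, bimod_lookup, bimod_update) are illustrative and do not reflect SimpleScalar's actual bpred interface.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        unsigned size;        /* number of 2-bit counters (power of two)    */
        uint8_t *table;
    } bimod_t;

    static bimod_t *bimod_create(unsigned size)
    {
        bimod_t *p = malloc(sizeof *p);
        p->size  = size;
        p->table = calloc(size, 1);
        for (unsigned i = 0; i < size; i++)
            p->table[i] = 2;                      /* start weakly taken      */
        return p;
    }

    static int bimod_lookup(const bimod_t *p, uint32_t pc)
    {
        return p->table[(pc >> 2) & (p->size - 1)] >= 2;
    }

    static void bimod_update(bimod_t *p, uint32_t pc, int taken)
    {
        uint8_t *ctr = &p->table[(pc >> 2) & (p->size - 1)];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
    }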

    The Hybrid Branch Predictor using Bias of a Prediction Counter - a New Prediction Table Selection Scheme “Confidence-Selector” -

    Branch prediction is employed in recent processors so that control dependences between instructions do not stall the pipeline. However, as the number of pipeline stages grows, the branch misprediction penalty also grows, so more accurate branch predictors are required. A hybrid branch predictor, which combines multiple predictors, has a selector that chooses the final prediction from among the component predictors' predictions, but the selector's accuracy is not particularly high. In this work, noting that prediction accuracy differs depending on the state of a predictor's prediction counter, we propose Confidence-Selector, a prediction selection scheme that uses the prediction confidence of the prediction counters. With this scheme the conventional selector becomes unnecessary, and the hardware it required can be allocated to the component predictors. Experiments on the SPECint95 benchmarks (train inputs) show that, compared with a conventional hybrid branch predictor, the proposed scheme reduces the misprediction rate by 0.22% on average at a 12 KB budget and by 0.31% on average at 24 KB. On a 40-stage processor with a fetch width of four instructions, the 0.31% average reduction in misprediction rate yields roughly a 4.0% improvement in performance (IPC). (Master's thesis)
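
    A hedged reading of the Confidence-Selector idea is sketched below: instead of consulting a separate selector table, the hybrid predictor takes the prediction of the component whose 2-bit counter is in a strongly saturated (high-confidence) state. The choice of components, the sizes, and the tie-breaking rule are assumptions for illustration only.

    #include <stdint.h>

    #define ENTRIES 4096

    static uint8_t  bimodal_tab[ENTRIES];   /* component 1: 2-bit counters   */
    static uint8_t  gshare_tab[ENTRIES];    /* component 2: 2-bit counters   */
    static uint32_t ghr;                    /* global history                */

    /* Confidence of a 2-bit counter: strong states (0 or 3) beat weak ones. */
    static int confidence(uint8_t ctr)
    {
        return (ctr == 0 || ctr == 3) ? 1 : 0;
    }

    static int predict(uint32_t pc)
    {
        uint8_t c1 = bimodal_tab[pc % ENTRIES];
        uint8_t c2 = gshare_tab[(pc ^ ghr) % ENTRIES];
        /* Prefer the more confident component; ties go to gshare here. */
        uint8_t chosen = (confidence(c1) > confidence(c2)) ? c1 : c2;
        return chosen >= 2;
    }

    static void update(uint32_t pc, int taken)
    {
        uint8_t *c1 = &bimodal_tab[pc % ENTRIES];
        uint8_t *c2 = &gshare_tab[(pc ^ ghr) % ENTRIES];
        if (taken)  { if (*c1 < 3) (*c1)++; if (*c2 < 3) (*c2)++; }
        else        { if (*c1 > 0) (*c1)--; if (*c2 > 0) (*c2)--; }
        ghr = (ghr << 1) | (uint32_t)(taken & 1);
    }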

    Reducing complexity of processor front ends with static analysis and selective preloading

    General-purpose processors were once designed with the major goal of maximizing performance. As power consumption has grown, with the advent of multi-core processors and the rising importance of embedded and mobile devices, the importance of designing efficient and low-cost architectures has increased. This dissertation focuses on reducing the complexity of the processor front end, mainly the branch predictors. Branch predictors have likewise been designed with a focus on improving prediction accuracy so that performance is maximized. To accomplish this, the predictors proposed in the literature and used in real systems have become increasingly complex and large, a trend that is inconsistent with the anticipated trend of simpler and more numerous cores in future processors. Much of the increased complexity in many recently proposed predictors is used to select the part of history most correlated with a branch, which makes them costly, if not impossible, to implement practically. We suggest that these complex decisions do not have to be made in hardware at prediction or run time and can be moved offline: high accuracy can be achieved by making complex prediction decisions in a one-time profile run instead of using complex hardware. We apply these techniques to Spotlight, our own low-cost, low-complexity branch predictor. A static analysis step determines, for each branch, the history segment yielding the highest accuracy, and this information is placed in unused instruction space. Spotlight achieves higher accuracy than other implementation-simple predictors such as gshare and YAGS, and matches or outperforms the two complex neural predictors that we compare it to. To ensure timely access, we evaluate using a hardware table (called a BIT) to store profile bits after they are extracted from instructions, and the accuracy of using this table. The drawback of a BIT is its size. We introduce a novel technique, Preloading, that places data for an instruction in prior blocks on the path to that instruction; by doing so, it is able to significantly reduce the size of the BIT needed for good performance. We also discuss applications of Preloading to parts of the front end other than branch predictors.
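
    The per-branch history-segment idea can be sketched roughly as follows: a profile-derived table supplies, for each static branch, the offset and length of the global-history slice used to index the prediction table. Here seg_table stands in for the profile bits carried in unused instruction space or the BIT; all sizes and names are illustrative assumptions, not Spotlight's actual organization.

    #include <stdint.h>

    #define PHT_ENTRIES 4096
    #define SEG_ENTRIES 1024

    /* Per-branch profile decision: which history slice to use (length <= 16). */
    typedef struct { uint8_t offset, length; } segment_t;

    static segment_t seg_table[SEG_ENTRIES];   /* filled by the profile run  */
    static uint8_t   pht[PHT_ENTRIES];         /* 2-bit counters             */
    static uint32_t  ghr;                      /* global history register    */

    static uint32_t history_slice(uint32_t pc)
    {
        segment_t s = seg_table[pc % SEG_ENTRIES];
        return (ghr >> s.offset) & ((1u << s.length) - 1);
    }

    static int predict(uint32_t pc)
    {
        return pht[(pc ^ history_slice(pc)) % PHT_ENTRIES] >= 2;
    }

    static void update(uint32_t pc, int taken)
    {
        uint8_t *ctr = &pht[(pc ^ history_slice(pc)) % PHT_ENTRIES];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
        ghr = (ghr << 1) | (uint32_t)(taken & 1);
    }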

    A PPM-like, tag-based branch predictor

    This paper describes cbp1.5, the tag-based, global-history predictor derived from PPM that ranked fifth at the first Championship Branch Prediction competition. This predictor is a particular instance of a family of predictors which we call GPPM. We introduce GPPM-ideal, an ideal GPPM predictor. It is possible to derive cbp1.5 from GPPM-ideal by introducing a series of degradations corresponding to real-life constraints. We characterize cbp1.5 by quantifying the impact of each degradation on the distributed CBP traces.
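
    A skeleton of a PPM-like tagged global-history predictor is sketched below: several banks are indexed with hashes of increasingly long history, each entry holds a tag and a counter, the longest matching bank provides the prediction, and a tagless base table acts as fallback. The bank count, sizes, hash, and the omitted update/allocation logic are assumptions, not cbp1.5's exact configuration.

    #include <stdint.h>

    #define NBANKS   4
    #define ENTRIES  1024

    typedef struct { uint16_t tag; int8_t ctr; } entry_t;

    static entry_t  banks[NBANKS][ENTRIES];   /* tagged, history-indexed     */
    static uint8_t  base[ENTRIES];            /* tagless bimodal fallback    */
    static uint64_t ghr;
    static const unsigned hist_len[NBANKS] = { 5, 10, 20, 40 };

    /* Fold the newest len history bits down to a small hash of `bits` bits. */
    static uint32_t fold(uint64_t h, unsigned len, unsigned bits)
    {
        uint64_t masked = (len >= 64) ? h : (h & ((1ull << len) - 1));
        uint32_t f = 0;
        while (masked) { f ^= (uint32_t)(masked & ((1u << bits) - 1)); masked >>= bits; }
        return f;
    }

    static int predict(uint32_t pc)
    {
        for (int b = NBANKS - 1; b >= 0; b--) {             /* longest first */
            uint32_t idx = (pc ^ fold(ghr, hist_len[b], 10)) % ENTRIES;
            uint16_t tag = (uint16_t)((pc >> 10) ^ fold(ghr, hist_len[b], 8));
            if (banks[b][idx].tag == tag)
                return banks[b][idx].ctr >= 0;              /* tag hit       */
        }
        return base[pc % ENTRIES] >= 2;                     /* PPM fallback  */
    }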

    Exploiting intra-function correlation with the global history stack

    The demand for more computation power in high-end embedded systems has put embedded processors on a parallel evolution track to that of RISC processors. Caches and deeper pipelines are standard features of recent embedded microprocessors, and as a result some of the performance penalties associated with branch instructions in RISC processors are becoming more prevalent in these processors. As in RISC architectures, designers have turned to dynamic branch prediction to alleviate this problem. Global correlating branch predictors take advantage of the influence past branches have on future ones; the conditional branch outcomes are recorded in a global history register (GHR). Based on the hypothesis that most correlation is among intra-function branches, we provide a detailed analysis of the Global History Stack (GHS) in this paper. The GHS saves the global history in the return address stack when a call instruction is executed; following the subsequent return, the history is restored from the stack. In addition, to preserve the correlation between the callee's branches and the caller's branches following the call instruction, we save a few of the history bits coming from the end of the callee's execution. We also investigate saving the GHR of a function in the branch target buffer (BTB) when it returns, so that it can be restored when that function is called again. Our results show that these techniques improve the accuracy of several global-history-based prediction schemes by 4% on average. Consequently, performance improvements as high as 13% are attained.
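
    The core GHS mechanism can be sketched as follows: the global history is pushed alongside the return address on a call and restored on the return, except for a few callee-end bits that are kept to preserve callee/caller correlation. Depth, history width, and the number of preserved bits (KEEP_BITS) are illustrative assumptions.

    #include <stdint.h>

    #define GHS_DEPTH 16
    #define KEEP_BITS  2            /* callee-end bits preserved on return    */

    static uint32_t ghr;            /* global history register                */
    static uint32_t ghs[GHS_DEPTH]; /* history saved alongside the RAS        */
    static int      ghs_top = -1;

    static void on_call(void)
    {
        if (ghs_top < GHS_DEPTH - 1)
            ghs[++ghs_top] = ghr;   /* snapshot caller-side history           */
    }

    static void on_return(void)
    {
        if (ghs_top < 0)
            return;                 /* stack underflow: keep current history  */
        uint32_t saved = ghs[ghs_top--];
        uint32_t keep  = ghr & ((1u << KEEP_BITS) - 1);   /* callee-end bits  */
        ghr = (saved << KEEP_BITS) | keep;   /* restored history + a few bits */
    }

    static void on_branch(int taken)
    {
        ghr = (ghr << 1) | (uint32_t)(taken & 1);   /* normal history update  */
    }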

    The weakening of branch predictor performance as an inevitable side effect of exploiting control independence

    Many algorithms are inherently sequential and hard to parallelize explicitly. Cores designed to aggressively handle these problems exhibit deeper pipelines and wider fetch widths to exploit instruction-level parallelism via out-of-order execution. As these parameters increase, so does the number of instructions fetched along an incorrect path when a branch is mispredicted. Many of the instructions squashed after a branch are control independent, meaning they will be fetched regardless of whether the candidate branch is taken or not. There has been much research into retaining these control-independent instructions on a misprediction of the candidate branch. This research shows that there is potential for exploiting control independence, since under favorable circumstances many benchmarks can exhibit 30% or more speedup. Though these control independence processors are meant to lessen the damage of misprediction, an inherent side effect of fetching out of order, branch weakening, keeps the realized speedup from reaching its potential. This thesis introduces, formally defines, and identifies the types of branch weakening, and provides information useful for developing techniques that may reduce weakening. A classification is provided that measures each type of weakening to help better determine the potential speedup of control independence processors. Experimentation shows that certain applications suffer greatly from weakening: total branch mispredictions increase by 30% in several cases. Analysis reveals two broad causes of weakening: changes in branch predictor update times and changes in the outcome history used by branch predictors. Each of these broad causes is classified into more specific causes, one of which is due to the loss of nearby correlation data and cannot be avoided. The classification technique presented in this study measures that 45% of the weakening in the selected SPEC CPU2000 benchmarks is of this type, while 40% involves other changes in outcome history. The remaining 15% is caused by changes in predictor update times. By applying fundamental techniques that reduce weakening, the Control Independence Aware Branch Predictor is developed. This predictor reduces weakening for the majority of the chosen benchmarks and, in doing so, enables a control independence processor, snipper, to attain significantly higher speedup for 10 of the 15 studied benchmarks.