
    Architectural support for probabilistic branches

    A plethora of research efforts have focused on fine-tuning branch predictors to ever higher levels of accuracy. However, several important optimization, financial, and statistical data analysis algorithms rely on probabilistic computation. These applications draw random values from a distribution and steer control flow based on those values. Such probabilistic branches are challenging to predict because of their inherent randomness. As a result, probabilistic codes suffer significantly from branch mispredictions. This paper proposes Probabilistic Branch Support (PBS), a hardware/software cooperative technique that leverages the observation that the outcome of a probabilistic branch needs to be correct only in a statistical sense. PBS stores the outcome and the probabilistic values that lead to the outcome of the current execution to direct the next execution of the probabilistic branch, thereby completely removing the penalty for mispredicted probabilistic branches. PBS relies on marking probabilistic branches in software for hardware to exploit. Our evaluation shows that PBS reduces MPKI by 45% on average (and up to 99%) and improves IPC by 6.7% (up to 17%) over the TAGE-SC-L predictor. PBS requires 193 bytes of hardware overhead and introduces statistically negligible algorithmic inaccuracy.
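
    To make the setting concrete, the sketch below (in C++, with illustrative names not taken from the paper) shows the kind of probabilistic branch PBS targets: a Metropolis-style accept/reject step whose direction depends on a freshly drawn random value and is therefore essentially unpredictable to a conventional branch predictor.

    // Minimal sketch of a probabilistic branch: a Metropolis accept/reject step
    // in simulated annealing. The branch on `u < accept_prob` depends on a value
    // drawn at runtime, so its outcome cannot be learned by a branch predictor.
    // Names here are illustrative, not from the paper.
    #include <cmath>
    #include <random>

    double anneal_step(double energy, double candidate_energy,
                       double temperature, std::mt19937 &rng) {
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        double accept_prob = std::exp(-(candidate_energy - energy) / temperature);
        double u = uniform(rng);

        if (u < accept_prob) {          // probabilistic branch: outcome is random
            return candidate_energy;    // accept the candidate state
        }
        return energy;                  // reject and keep the current state
    }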

    Rebalancing the core front-end through HPC code analysis

    There is a need to increase performance under the same power and area envelope to reach Exascale in high performance computing (HPC). Today's chip multiprocessor (CMP) designs are tailored to traditional desktop and server workloads, which differ from the parallel applications commonly run in HPC. In this work, we focus on HPC code characteristics and the processor front-end, which accounts for around 30% of core power and area on the emerging lean-core type of processors used in HPC. Separating serial from parallel code sections inside applications, we characterize three HPC benchmark suites and compare them to a traditional set of desktop integer workloads. HPC applications have biased, mostly backward-taken branches, small dynamic instruction footprints, and long basic blocks. Our findings suggest smaller branch predictors (BP) with an additional loop BP, smaller branch target buffers (BTB), and smaller L1 instruction caches (I-cache) with wider lines. Still, the aforementioned downsizing applies only to the cores meant to run parallel code. The difference between serial and parallel code sections in HPC applications points to an asymmetric CMP design, with one baseline core for sequential code and many HPC-tailored cores designed for parallel code. Predictions using the Sniper simulator and McPAT show that an HPC-tailored lean core saves 16% of core area and 7% of power compared to a baseline core, without performance loss. Using the area savings to add an extra core, an asymmetric CMP with one baseline and eight tailored cores fits the same area budget as a symmetric CMP composed of eight baseline cores, while demanding 4% more power and providing 12% shorter execution time on average.
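
    The loop structure behind the reported branch behaviour can be illustrated with a minimal sketch (my own example, not taken from the benchmark suites): the single backward branch that closes a dense HPC kernel loop is taken on almost every iteration, and the loop body forms one long basic block with no further control flow.

    // Illustrative HPC-style kernel (not from the paper): one heavily biased,
    // backward-taken loop branch and a straight-line body (a long basic block).
    #include <cstddef>

    void daxpy(double *y, const double *x, double a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {   // backward branch, taken (n-1)/n of the time
            y[i] = a * x[i] + y[i];             // straight-line body: no intra-loop control flow
        }
    }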

    A distributed processor state management architecture for large-window processors

    Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace the re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to imprecise processor state recovery on mispredicted branches and exceptions, and to re-execution of correct-path instructions after state recovery. It also requires large register files, complicating renaming, allocation, and release of physical registers. This paper proposes a new processor architecture called the Multi-State Processor (MSP). The MSP does not use check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. The MSP uses a novel register management architecture that allows implementation of large register files with simpler and more scalable register allocation, renaming, and release; it is also key to the precise processor state recovery mechanism. The MSP is shown to improve IPC by 14%, on average, for integer SPEC CPU2000 benchmarks compared to a check-pointing based mechanism ([2]) when a fast and simple branch predictor is used. With a very aggressive branch predictor the IPC improvement is 1% on average, and 3% if some of the programs are optimized for the MSP. The MSP also reduces the average number of executed instructions by 16.5% (12% for the aggressive branch predictor), mostly due to precise state recovery. This improves the MSP's energy efficiency even though it uses a larger register file.
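
    The cost that precise recovery avoids can be illustrated with a toy model (my own sketch, not the paper's simulator or mechanism): if a checkpoint is taken every fixed number of instructions, a misprediction forces rollback to the most recent checkpoint and re-execution of the correct-path instructions in between, whereas precise recovery restarts exactly at the mispredicted branch.

    // Toy model of re-executed correct-path work under interval check-pointing.
    // Positions, interval, and function names are hypothetical illustrations.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    std::uint64_t reexecuted_instructions(const std::vector<std::uint64_t> &mispredict_positions,
                                          std::uint64_t checkpoint_interval) {
        std::uint64_t wasted = 0;
        for (std::uint64_t pos : mispredict_positions) {
            // roll back to the most recent checkpoint at or before the branch
            std::uint64_t last_checkpoint = (pos / checkpoint_interval) * checkpoint_interval;
            wasted += pos - last_checkpoint;   // correct-path instructions redone after rollback
        }
        return wasted;                          // precise recovery would redo 0
    }

    int main() {
        std::vector<std::uint64_t> mispredicts = {37, 250, 999};  // hypothetical positions
        std::cout << "re-executed with 64-instruction checkpoints: "
                  << reexecuted_instructions(mispredicts, 64) << "\n";
    }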