6 research outputs found

    POOR MAN’S TRACE CACHE: A VARIABLE DELAY SLOT ARCHITECTURE

    Get PDF
    We introduce a novel fetch architecture called Poor Man’s Trace Cache (PMTC). PMTC constructs taken-path instruction traces via instruction replication in static code and inserts them after unconditional direct and select conditional direct control transfer instructions. These traces extend to the end of the cache line. Since available space for trace insertion may vary by the position of the control transfer instruction within the line, we refer to these fetch slots as variable delay slots. This approach ensures traces are fetched along with the control transfer instruction that initiated the trace. Branch, jump and return instruction semantics as well as the fetch unit are modified to utilize traces in delay slots. PMTC yields the following benefits: 1. Average fetch bandwidth increases as the front end can fetch across taken control transfer instructions in a single cycle. 2. The dynamic number of instruction cache lines fetched by the processor is reduced as multiple non contiguous basic blocks along a given path are encountered in one fetch cycle. 3. Replication of a branch instruction along multiple paths provides path separability for branches, which positively impacts branch prediction accuracy. PMTC mechanism requires minimal modifications to the processor’s fetch unit and the trace insertion algorithm can easily be implemented within the assembler without compiler support

    An evaluation of multiple branch predictor and trace cache advanced fetch unit designs for dynamically scheduled superscalar processors

    Get PDF
    Semiconductor feature size continues to decrease permitting superscalar microprocessors to continue to increase the number of functional units available for execution. As the instruction issue width increases beyond the five instruction average basic block size of integer programs, more than one basic block must be issued per cycle to continue to increase instructions per cycle (IPC) performance. Researchers have created methods of fetching instructions beyond the first taken branch to overcome the bottleneck created by the limitations of conventional single branch predictors. We compare the performance of the multiple branch prediction (MBP) and trace cache (TC) fetch unit optimization methods. Multiple branch predictor fetch unit designs issue multiple basic blocks per cycle using a branch address cache and a multiple branch predictor. A trace cache uses the runtime instruction stream to create fixed length instruction traces that encapsulate multiple basic blocks. The comparison is performed by using a SPARC v8 timing based simulator. We simulate both advanced fetch methods and execute benchmarks from the SPEC CPU2000 suite. The results of the simulations are compared and a detailed analysis of both microarchitectures is performed. We find that both fetch unit designs provide a competitive IPC performance. As issue width is increased from an eight to sixteen way superscalar, the IPC performance improves implying that these fetch unit designs are able to take advantage of the wider issue widths. The MBP can use a smaller L1 instruction cache size than the TC and yet achieve a similar IPC performance. Pre-arranged instructions provided by the TC allow the pipeline stages to be shortened in comparison to the MBP. The shorter pipeline significantly improves the IPC performance. Prior trace cache research used two or more ports to the instruction cache to improve the chances of fetching a full basic block per cycle. This was at the expense of instruction cache line realignment complexity. Our results show good performance with a single instruction cache port. We study an approximately equal cost implementation for the MBP and TC. Of the six benchmarks studied, the TC outperforms the MBP over four of the benchmarks

    Design of trace caches for high bandwidth instruction fetching

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (leaves 60-63).by Michael Sung.M.Eng

    Architectural Verification of Four-instruction Superscalar Processor for MIPS I Instruction Set

    Get PDF
    The study undertaken in this thesis tries to tackle this inefficiency by having extra register locations other than the architectural registers called pseudo-registers, and a pointer scheme is followed to reference both architectural and pseudo registers. This scheme renames each logical destination register of an incoming instruction, to a pseudo register referenced by pointers called pseudo-pointers. Two separate lists of these pointers are maintained, one for all types of instructions and the other for only unspeculated instructions. When a branch instruction preceding the speculated instruction is evaluated and it is established that the prediction was correct, the machine state is altered by updating the pointer lists instead of moving the data. As the pointes are only 6-bits, the inefficiency is considerably reduced. This processor scheme is implemented using the Verilog hardware description language (HDL). The following study provides architectural details of each component used in the processor, stressing issues involved in the implementation and methods used to overcome these issues. This study also discusses verification methodology, documenting steps involved in compiling a 'c' program and loading it onto the simulated instructions cache and data cache for simulation. Finally, simulation results are presented for a sample 'c' program verifying the design

    Microarchitecture Choices and Tradeoffs for Maximizing Processing Efficiency.

    Full text link
    This thesis is concerned with hardware approaches for maximizing the number of independent instructions in the execution core and thereby maximizing the processing efficiency for a given amount of compute bandwidth. Compute bandwidth is the number of parallel execution units multiplied by the pipelining of those units in the processor. Keeping those computing elements busy is key to maximize processing efficiency and therefore power efficiency. While some applications have many independent instructions that can be issued in parallel without inefficiencies due to branch behavior, cache behavior, or instruction dependencies, most applications have limited parallelism and plenty of stalling conditions. This thesis presents two approaches to this problem, which in combination greatly increases the efficiency of the processor utilization of resources. The first approach addresses the problem of small basic blocks that arise when code has frequent branches. We introduce algorithms and mechanisms to predict multiple branches simultaneously and to fetch multiple non-continuous basic blocks every cycle along a predicted branch path. This makes what was previously an inherently serial process into a parallelized instruction fetch approach. For integer applications, the result is an increase in useful instruction fetch capacity of 40% when two basic blocks are fetched per cycle and 63% for three blocks per cycle. For floating point benchmarks, the associated improvement is 27% and 59%. The second approach addresses increasing the number of independent instructions to the execution core through simultaneous multi-threading (SMT). We compare to another multithreading approach, Switch-on-Event multithreading, and show that SMT is far superior. Intel Pentium 4 SMT microarchitecture algorithms are analyzed, and we look at the impact of SMT on power efficiency of the Pentium 4 Processor. A new metric, the SMT Energy Benefit is defined. Not only do we show that the SMT Energy Benefit for a given workload with SMT can be quite significant, we also generalize the results and build a model for what other future processors’ SMT Energy Benefit would be. We conclude that while SMT will continue to be an energy-efficient feature, as processors get more energy-efficient in general the relative SMT Energy Benefit may be reduced.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61740/1/dtmarr_1.pd

    Multiple branch and block prediction

    No full text
    However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact
    corecore