14 research outputs found

    Reducing branch delay to zero in pipelined processors

    A mechanism to reduce the cost of branches in pipelined processors is described and evaluated. It is based on the use of multiple prefetch, early computation of the target address, delayed branch, and parallel execution of branches. The implementation of this mechanism using a branch target instruction memory is described. An analytical model of the performance of this implementation makes it possible to measure the efficiency of the mechanism with a very low computational cost. The model is used to determine the size of cache lines that maximizes the processor performance, to compare the performance of the mechanism with that of other schemes, and to analyze the performance of the mechanism with two alternative cache organizations.
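The idea of caching branch targets so that a taken branch costs nothing on a hit can be illustrated with a small behavioral model. This is a hedged sketch, not the paper's mechanism: the class name, the naive FIFO eviction, and the toy trace are all illustrative assumptions.

```python
# Illustrative sketch: a simplified branch target buffer (BTB).
# On a hit with the correct target, the branch incurs no delay;
# a miss or a wrong target pays the branch penalty (a "stall" here).
# FIFO eviction and the trace below are assumptions for the example.

class BranchTargetBuffer:
    def __init__(self, size=4):
        self.size = size
        self.entries = {}          # branch PC -> cached target PC

    def lookup(self, pc):
        """Return the cached target, or None on a miss."""
        return self.entries.get(pc)

    def update(self, pc, target):
        if pc not in self.entries and len(self.entries) >= self.size:
            self.entries.pop(next(iter(self.entries)))  # naive FIFO eviction
        self.entries[pc] = target

btb = BranchTargetBuffer()
trace = [(0x40, 0x80), (0x40, 0x80), (0x90, 0x20), (0x40, 0x80)]
stalls = 0
for pc, target in trace:
    if btb.lookup(pc) != target:
        stalls += 1                # miss or wrong target: pay the delay
        btb.update(pc, target)
print(stalls)  # -> 2: only the first occurrence of each branch misses
```

Repeated branches hit in the buffer, which is why the scheme pays the branch delay only on first encounters (and on conflict evictions).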

    Alternative implementations of two-level adaptive branch prediction

    As the issue rate and depth of pipelining of high-performance superscalar processors increase, an excellent branch predictor becomes ever more vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. We propose a new dynamic branch predictor (Two-Level Adaptive Branch Prediction) that achieves substantially higher accuracy than any other scheme reported in the literature. The mechanism uses two levels of branch history information to make predictions: the history of the last k branches encountered, and the branch behavior for the last s occurrences of the specific pattern of these k branches. We have identified three variations of Two-Level Adaptive Branch Prediction, depending on how finely we resolve the history information gathered. We compute the hardware costs of implementing each of the three variations, and use these costs in evaluating their relative effectiveness. We measure the branch prediction accuracy of the three variations of Two-Level Adaptive Branch Prediction, along with several other popular proposed dynamic and static prediction schemes, on the SPEC benchmarks. We show that the average prediction accuracy for Two-Level Adaptive Branch Prediction is 97 percent, while the other known schemes achieve at most 94.4 percent average prediction accuracy. We measure the effectiveness of different prediction algorithms and different amounts of history and pattern information. We measure the costs of each variation to obtain the same prediction accuracy.
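The two-level structure can be sketched in a few lines: a k-bit history register (first level) indexes a table of 2-bit saturating counters (second level). This is a minimal global-history variant written for illustration; parameter choices and the repeating outcome pattern are assumptions, not the paper's configuration.

```python
# Illustrative sketch: a minimal two-level adaptive branch predictor.
# First level: a k-bit register holding the outcomes of the last k branches.
# Second level: a pattern history table of 2-bit saturating counters,
# indexed by the history register.

class TwoLevelPredictor:
    def __init__(self, k=4):
        self.k = k
        self.history = 0                     # last k outcomes, newest in LSB
        self.pht = [1] * (1 << k)            # 2-bit counters, start weakly 0

    def predict(self):
        return self.pht[self.history] >= 2   # predict taken if counter is 2/3

    def update(self, taken):
        c = self.pht[self.history]
        self.pht[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)

pred = TwoLevelPredictor(k=4)
outcomes = [True, True, False, True] * 8     # a repeating 4-branch pattern
correct = 0
for taken in outcomes:
    if pred.predict() == taken:
        correct += 1
    pred.update(taken)
print(correct, len(outcomes))
```

Because each distinct history value selects its own counter, a repeating pattern is learned after a short warm-up, which is the intuition behind the high accuracies the abstract reports.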

    Context flow architecture


    VLSI design of a twin register file for reducing the effects of conditional branches in a pipelined architecture

    Pipelining is a major organizational technique which computer engineers have used to enhance the performance of computers. Pipelining improves the performance of computer systems by exploiting the instruction-level parallelism of a program. In a pipelined processor the execution of instructions is overlapped, and each instruction is executed in a different stage of the pipeline. Most pipelined architectures are based on a sequential model of program execution in which a program counter sequences through instructions one by one. A fundamental disadvantage of pipelined processing is the loss incurred due to conditional branches. When a conditional branch instruction is encountered, more than one possible path follows the instruction. The correct path can be known only upon completion of the conditional branch instruction. The execution of the instruction following a conditional branch cannot begin until the conditional branch is resolved, stalling the pipeline. One approach to avoid stalling is to predict the path to be executed and continue executing instructions along the predicted path. In this case, however, an incorrect prediction results in the execution of incorrect instructions, and the results of these incorrect instructions have to be purged. Also, the instructions in the various stages of the pipeline must be removed and the pipeline has to start fetching instructions from the correct path. Thus an incorrect prediction involves flushing the pipeline. This thesis proposes a twin-processor architecture for reducing the effects of conditional branches. In such an architecture, both paths following a conditional branch are executed simultaneously on two processors. When the conditional branch is resolved, the results of the incorrect path are discarded. Such an architecture requires a special-purpose twin register file.
It is the purpose of this thesis to design a twin register file consisting of two register files which can be independently accessed by the two processors. Each of the register files also has the capability of being copied into the other, making the design of the twin register file a complicated issue. The special-purpose twin register file is designed using the layout tools Lager and Magic. The twin register file consists of two three-port register files which are capable of executing the 'read', 'write' and 'transfer' operations. The transfer of data from one register file to another is accomplished in a single phase of the clock. The functionality of a 32-word-by-16-bit twin register file is verified by simulating it on IRSIM. The timing requirements for the read, write and transfer operations are determined by simulating the twin register file on SPICE.
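The three operations of the twin register file can be captured in a small behavioral model. This is a hedged sketch under stated assumptions: the class and method names are invented for illustration, and the single-cycle hardware transfer is modeled here as a plain bank copy; the 32-word-by-16-bit sizing follows the thesis.

```python
# Illustrative sketch: behavioral model of a twin register file with the
# 'read', 'write' and 'transfer' operations. After the branch resolves,
# transfer() copies the correct path's bank over the other, discarding
# the wrong path's state (done in one clock phase in the actual design).

class TwinRegisterFile:
    def __init__(self, nregs=32):
        self.banks = [[0] * nregs, [0] * nregs]   # one bank per processor

    def read(self, bank, reg):
        return self.banks[bank][reg]

    def write(self, bank, reg, value):
        self.banks[bank][reg] = value & 0xFFFF    # 16-bit words

    def transfer(self, src):
        """Copy the source bank into the other bank."""
        self.banks[1 - src] = list(self.banks[src])

trf = TwinRegisterFile()
trf.write(0, 5, 0xBEEF)      # processor 0 executes the correct path
trf.write(1, 5, 0xDEAD)      # processor 1's path turns out to be wrong
trf.transfer(src=0)          # branch resolved: discard the wrong state
print(hex(trf.read(1, 5)))   # -> 0xbeef
```

The model shows why the copy capability complicates the design: every register cell needs a path to its twin in addition to its normal read and write ports.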

    Prefetching techniques for client server object-oriented database systems

    The performance of many object-oriented database applications suffers from page fetch latency, which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least reduce, page fetch latency. In practice no prediction technique is perfect and no prefetching technique can entirely eliminate delay due to page fetch latency. Therefore we are interested in the trade-off between the level of accuracy required for obtaining good results in terms of elapsed-time reduction and the processing overhead needed to achieve this level of accuracy. If prefetching accuracy is high, the total elapsed time of an application can be reduced significantly; if the prefetching accuracy is low, many incorrect pages are prefetched and the extra load on the client, network, server and disks decreases whole-system performance. Access patterns of object-oriented databases are often complex and usually hard to predict accurately. The ..

    A general framework to realize an abstract machine as an ILP processor with application to java


    Software-Oriented Distributed Shared Cache Management for Chip Multiprocessors

    This thesis proposes a software-oriented distributed shared cache management approach for chip multiprocessors (CMPs). Unlike hardware-based schemes, our approach offloads the cache management task to a trace analysis phase, allowing flexible management strategies. For single-threaded programs, a static 2D page coloring scheme is proposed that utilizes oracle trace information to derive an optimal data placement schema for a program. In addition, a dynamic 2D page coloring scheme is proposed as a practical solution which tries to approach the performance of the static scheme. The evaluation results show that the static scheme achieves a 44.7% performance improvement over the conventional shared cache scheme on average, while the dynamic scheme performs 32.3% better than the shared cache scheme. For latency-oriented multithreaded programs, a pattern recognition algorithm based on the K-means clustering method is introduced. The algorithm tries to identify data access patterns that can be utilized to guide the placement of private data and the replication of shared data. The experimental results show that data placement and replication based on these access patterns lead to a 19% performance improvement over the shared cache scheme. The reduced remote cache accesses and aggregated cache miss rate result in much lower bandwidth requirements for the on-chip network and the off-chip main memory bus. Lastly, for throughput-oriented multithreaded programs, we propose a hint-guided data replication scheme to identify memory instructions of a target program that access data with a high reuse property. The derived hints are then used to guide data replication at run time. By balancing the amount of data replication against local cache pressure, the proposed scheme has the potential to achieve performance comparable to the best existing hardware-based schemes. Our proposed software-oriented shared cache management approach is an effective way to manage program performance on CMPs.
This approach provides an alternative direction for research on the distributed cache management problem. Given the known difficulties (e.g., scalability and design complexity) we face with hardware-based schemes, this software-oriented approach may receive serious consideration from researchers in the future. In this respect, the thesis provides valuable contributions to the computer architecture research community.
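The K-means pattern-recognition step can be sketched with a self-contained mini K-means over per-page access vectors (one access count per core). This is an illustration under assumptions, not the thesis' algorithm: the data, the naive initialization, and the private/shared interpretation are all invented for the example.

```python
# Illustrative sketch: cluster per-page access vectors with a minimal
# K-means. Pages dominated by one core suggest private placement; pages
# accessed evenly by all cores suggest replication of shared data.

def kmeans(points, k, iters=20):
    centers = points[:k]                      # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each point to the
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)   # nearest center
        centers = [                           # recompute center as the mean
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Access counts per page from 2 cores: three core-0-private-looking pages,
# three evenly shared pages (illustrative data).
pages = [(90, 2), (88, 5), (95, 1), (40, 45), (50, 52), (47, 49)]
centers, clusters = kmeans(pages, k=2)
print(sorted(len(c) for c in clusters))   # -> [3, 3]
```

The recovered clusters would then drive the placement and replication decisions the abstract describes: keep the private-looking cluster near its owning core, and replicate the shared cluster.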