2,579 research outputs found

    Efficient resources assignment schemes for clustered multithreaded processors

    Get PDF
    New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version

    Leveraging register windows to reduce physical registers to the bare minimum

    Get PDF
    Register window is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverage register windows to drastically reduce the register requirements, and hence, reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Trace-level speculative multithreaded architecture

    Get PDF
    This paper presents a novel microarchitecture to exploit trace-level speculation by means of two threads working cooperatively in a speculative and non-speculative way respectively. The architecture presents two main benefits: (a) no significant penalties are introduced in the presence of a misspeculation and (b) any type of trace predictor can work together with this proposal. In this way, aggressive trace predictors can be incorporated since misspeculations do not introduce significant penalties. We describe in detail TSMA (trace-level speculative multithreaded architecture) and present initial results to show the benefits of this proposal. We show how simple trace predictors achieve significant speed-up in the majority of cases. Results of a simple trace speculation mechanism show an average speed-up of 16%.Peer ReviewedPostprint (published version

    A software-hardware hybrid steering mechanism for clustered microarchitectures

    Get PDF
    Clustered microarchitectures provide a promising paradigm to solve or alleviate the problems of increasing microprocessor complexity and wire delays. High- performance out-of-order processors rely on hardware-only steering mechanisms to achieve balanced workload distribution among clusters. However, the additional steering logic results in a significant increase on complexity, which actually decreases the benefits of the clustered design. In this paper, we address this complexity issue and present a novel software-hardware hybrid steering mechanism for out-of-order processors. The proposed software- hardware cooperative scheme makes use of the concept of virtual clusters. Instructions are distributed to virtual clusters at compile time using static properties of the program such as data dependences. Then, at runtime, virtual clusters are mapped into physical clusters by considering workload information. Experiments using SPEC CPU2000 benchmarks show that our hybrid approach can achieve almost the same performance as a state-of-the-art hardware-only steering scheme, while requiring low hardware complexity. In addition, the proposed mechanism outperforms state-of-the-art software-only steering mechanisms by 5% and 10% on average for 2-cluster and 4-cluster machines, respectively.Peer ReviewedPostprint (published version

    Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery

    Get PDF
    The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft

    Virtual-physical registers

    Get PDF
    A novel dynamic register renaming approach is proposed in this work. The key idea of the novel scheme is to delay the allocation of physical registers until a late stage in the pipeline, instead of doing it in the decode stage as conventional schemes do. In this way, the register pressure is reduced and the processor can exploit more instruction-level parallelism. Delaying the allocation of physical registers require some additional artifact to keep track of dependences. This is achieved by introducing the concept of virtual-physical registers, which do not require any storage location and are used to identify dependences among instructions that have not yet allocated a register to its destination operand. Two alternative allocation strategies have been investigated that differ in the stage where physical registers are allocated: issue or write-back. The experimental evaluation has confirmed the higher performance of the latter alternative. We have performed all evaluation of the novel scheme through a detailed simulation of a dynamically scheduled processor. The results show a significant improvement (e.g., 19% increase in IPC for a machine with 64 physical registers in each file) when compared with the traditional register renaming approach.Peer ReviewedPostprint (published version

    Compiler analysis for trace-level speculative multithreaded architectures

    Get PDF
    Trace-level speculative multithreaded processors exploit trace-level speculation by means of two threads working cooperatively. One thread, called the speculative thread, executes instructions ahead of the other by speculating on the result of several traces. The other thread executes speculated traces and verifies the speculation made by the first thread. In this paper, we propose a static program analysis for identifying candidate traces to be speculated. This approach identifies large regions of code whose live-output values may be successfully predicted. We present several heuristics to determine the best opportunities for dynamic speculation, based on compiler analysis and program profiling information. Simulation results show that the proposed trace recognition techniques achieve on average a speed-up close to 38% for a collection of SPEC2000 benchmarks.Peer ReviewedPostprint (published version

    A novel architecture for large windows processors

    Get PDF
    Several processor architectures with large instruction windows have been proposed. They improve performance by maintaining hundreds of instructions in flight to increase the level of instruction parallelism (ILP). Such architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of the processor resources. Check-pointing, however, leads to an imprecise state recovery on mispredicted branches and exceptions and frequent re-execution of current-path instructions during the state recovery. It also requires large register files complicating renaming, allocation and release of physical registers. This technical report proposes a new processor architecture that does not use either a traditional ROB or check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. Its novel register management architecture allows implementation of large register files with simpler and more scalable, register renaming and commit. It is also key to the precise recovery mechanism.Postprint (published version

    A distributed processor state management architecture for large-window processors

    Get PDF
    Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to an imprecise processor state recovery on mis-predicted branches and exceptions and re-execution of correct-path instructions after state recovery. It also requires large register files complicating renaming, allocation and release of physical registers. This paper proposes a new processor architecture called a Multi-State Processor (MSP). The MSP does not use check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. The MSP uses a novel register management architecture allowing implementation of large register files with simpler and more scalable register allocation, renaming, and release. It is also key to precise processor state recovery mechanism. The MSP is shown to improve IPC by 14%, on average, for integer SPEC CPU2000 benchmarks compared to a check-pointing based mechanism ([2]) when a fast and simple branch predictor is used. With a very aggressive branch predictor the IPC improvement is 1%, on average, and 3% if some of the programs are optimized for the MSP. The MSP also reduces the average number of executed instructions by 16.5% (12% for the aggressive branch predictor), mostly due to precise state recovery. This improves the MSP processor energy efficiency even though it uses a larger register file.Peer ReviewedPostprint (published version
    • …
    corecore