8,526 research outputs found

    Enlarging instruction streams

    Get PDF
    The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We call a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks, a stream. The long length of instruction streams makes it possible for the stream fetch engine to provide a high fetch bandwidth and to hide the branch predictor access latency, leading to performance results close to a trace cache at a lower implementation cost and complexity. Therefore, enlarging instruction streams is an excellent way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that finalize at particular branch types. However, our results point out that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple-stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity caused by additional hardware mechanisms like prediction overriding. Moreover, it provides high-performance results which are comparable to state-of-the-art fetch architectures but with a simpler design that consumes less energy.Peer ReviewedPostprint (published version

    Interval simulation: raising the level of abstraction in architectural simulation

    Get PDF
    Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner This paper proposes interval simulation which rakes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the MS multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs

    DIA: A complexity-effective decoding architecture

    Get PDF
    Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to overcome this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need for decoding them again, improving processor performance. However, introducing such special--purpose storage in the processor design involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from the memory instead of the original nondecoded instructions. Our results show that using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements, while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.Peer ReviewedPostprint (published version
    • …
    corecore