1,274 research outputs found

    Cycle Accurate Energy and Throughput Estimation for Data Cache

    Get PDF
    Resource optimization in energy constrained real-time adaptive embedded systems highly depends on accurate energy and throughput estimates of processor peripherals. Such applications require lightweight, accurate mathematical models to profile energy and timing requirements on the go. This paper presents enhanced mathematical models for data cache energy and throughput estimation. The energy and throughput models were found to be within 95% accuracy of per instruction energy model of a processor, and a full system simulator?s timing model respectively. Furthermore, the possible application of these models in various scenarios is discussed in this paper

    Exploiting Fine-Grain Concurrency Analytical Insights in Superscalar Processor Design

    Get PDF
    This dissertation develops analytical models to provide insight into various design issues associated with superscalar-type processors, i.e., the processors capable of executing multiple instructions per cycle. A survey of the existing machines and literature has been completed with a proposed classification of various approaches for exploiting fine-grain concurrency. Optimization of a single pipeline is discussed based on an analytical model. The model-predicted performance curves are found to be in close proximity to published results using simulation techniques. A model is also developed for comparing different branch strategies for single-pipeline processors in terms of their effectiveness in reducing branch delay. The additional instruction fetch traffic generated by certain branch strategies is also studied and is shown to be a useful criterion for choosing between equally well performing strategies. Next, processors with multiple pipelines are modelled to study the tradeoffs associated with deeper pipelines versus multiple pipelines. The model developed can reveal the cause of performance bottleneck: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative (i.e., beyond basic block) execution is examined via probability distributions that characterize the inherent parallelism in the instruction stream. The throughput prediction of the analytic model is shown, using a variety of benchmarks, to be close to the measured static throughput of the compiler output, under resource and scope constraints. Further experiments provide misprediction delay estimates for these benchmarks under scope constraints, assuming beyond-basic-block, out-of-order execution and run-time scheduling. These results were derived using traces generated by the Multiflow TRACE SCHEDULINGâ„¢(*) compacting C and FORTRAN 77 compilers. A simplified extension to the model to include multiprocessors is also proposed. The extended model is used to analyze combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, both with shared memory. It is shown that the number of pipelines (or processors) at which the maximum throughput is obtained is increasingly sensitive to the ratio of memory access time to network access delay, as memory access time increases. Further, as a function of inter-iteration dependency distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding Optimum number of processors varies linearly. The predictions from the analytical model agree with published results based on simulations. (*)TRACE SCHEDULING is a trademark of Multiflow Computer, Inc

    Interval simulation: raising the level of abstraction in architectural simulation

    Get PDF
    Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner This paper proposes interval simulation which rakes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the MS multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs

    Performance mapping of a class of fully decoupled architecture

    Get PDF

    Fast simulation techniques for microprocessor design space exploration

    Get PDF
    Designing a microprocessor is extremely time-consuming. Computer architects heavily rely on architectural simulators, e.g., to drive high-level design decisions during early stage design space exploration. The benefit of architectural simulators is that they yield relatively accurate performance results, are highly parameterizable and are very flexible to use. The downside, however, is that they are at least three or four orders of magnitude slower than real hardware execution. The current trend towards multicore processors exacerbates the problem; as the number of cores on a multicore processor increases, simulation speed has become a major concern in computer architecture research and development. In this dissertation, we propose and evaluate two simulation techniques that reduce the simulation time significantly: statistical simulation and interval simulation. Statistical simulation speeds up the simulation by reducing the number of dynamically executed instructions. First, we collect a number of program execution characteristics into a statistical profile. From this profile we can generate a synthetic trace that exhibits the same execution behavior but which has a much shorter trace length as compared to the original trace. Simulating this synthetic trace then yields a performance estimate. Interval simulation raises the level of abstraction in architectural simulation; it replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model builds on insights from interval analysis: miss events divide the smooth streaming of instructions into so called intervals. The model drives the timing by analyzing the type of the miss events and their latencies, instead of tracking the individual instructions as they propagate through the pipeline stages

    Modeling Out-of-Order Superscalar Processor Performance Quickly and Accurately with Traces

    Get PDF
    Fast and accurate processor simulation is essential in processor design. Trace-driven simulation is a widely practiced fast simulation method. However, serious accuracy issues arise when an out-of-order superscalar processor is considered. In this thesis, trace-driven simulation methods are suggested to quickly and accurately model out-of-order superscalar processor performance with reduced traces. The approaches abstract the processor core and focus on the processor's uncore events rather than the processor's internal events. As a result, fast simulation speed is achieved while maintaining fairly small error compared with an execution-driven simulator. Traces can be generated either by a cycle-accurate simulator or an abstract timing model on top of a simple functional simulator. Simulation results are more accurate with the method using traces generated from a cycle-accurate simulator. Faster trace generation speed is achieved with the abstract timing model. The methods determine how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. This approach preserves a processor's dynamic uncore access patterns and accurately predicts the relative performance change when the processor's uncore-level parameters are changed. The methods are attractive especially in the early design stages due to its fast simulation speed

    Improving Instruction Fetch Rate with Code Pattern Cache for Superscalar Architecture

    Get PDF
    In the past, instruction fetch speeds have been improved by using cache schemes that capture the actual program flow. In this proposal, we present the architecture of a new instruction cache named code pattern cache (CPC); the cache is used with superscalar processors. CPC?s operation is based on the fundamental principles that: common programs tend to repeat their execution patterns; and efficient storage of a program flow can enhance the performance of an instruction fetch mechanism. CPC saves basic blocks (sets of instructions separated by control instructions) and their boundary addresses while the code is running. Basic blocks and their addresses are stored in two separate structures, called block pointer cache (BPC) and basic block cache (BBC), respectively. Later, if the same basic block sequence is expected to execute, it is fetched from CPC, instead of the instruction cache; this mechanism results in higher likelihood of delivering a larger number of instructions in every clock cycle. We developed single and multi-threaded simulators for TC, BC, and CPC, and used them with 10 SPECint2000 benchmarks. The simulation results demonstrated CPC?s advantage over TC and BC, in terms of trace miss rate and average trace length. Additionally, we used cache models to quantify the timing, area, and power for the three cache schemes. Using an aggregate performance index that combined the simulation and modeling results, CPC was shown to perform better than both TC and BC. During our research, each of the TC-, BC-, or CPC- configurations took 4-6 hours to simulate, so performance comparison of these caches proved to be a very time-consuming process. Neural network models (NNM?s) can be time-efficient alternatives to simulations, so we studied their feasibility to represent the cache behavior. We developed two NNM\u27s, one to predict the trace miss rate and the other to predict the average trace length for the three caches. The NNM\u27s modeled the caches with reasonable accuracy, and produced results in a fraction of a second

    Hierarchical architecture design and simulation environment

    Get PDF
    The Hierarchical Architectural design and Simulation Environment (HASE)is intended as a flexible tool for computer architects who wish to experiment with alternative architectural configurations and design parameters. HASE is both a design environment and a simulator. Architecture components are described by a hierarchical library of objects defined in terms of an object oriented simulation language. HASE instantiates these objects to simulate and animate the execution of a computer architecture. An event trace generated by the simulator therefore describes the interaction between architecture components, for example, fetch stages, address and data buses, sequencers, instruction buffers and register files. The objects can model physical components at different abstraction levels, eg. PMS (processor memory switch), ISP (instruction set processor) and RTL (register transfer level). HASE applies the concepts of inheritance, encapsulation and polymorphism associated with object orientation, to simplify the design and implementation of an architecture simulation that models component operations at different abstraction levels. For example, HASE can probe the performance of a processor's floating point unit, executing a multiplication operation, at a lower level of abstraction, i.e. the RTL, whilst simulating remaining architecture components at a PMS level of abstraction. By adopting this approach, HASE returns a more meaningful and relevant event trace from an architecture simulation. Furthermore, an animator visualises the simulation's event trace to clarify the collaborations and interactions between architecture components. The prototype version of HASE is based on GSS (Graphical Support System), and DEMOS (Discrete Event Modelling On Simula)

    RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

    Get PDF
    Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance
    • …
    corecore