10 research outputs found

    Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs

    Full text link
    As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling per-formance in the early stages of processor design. Analytical models have been employed to rapidly search for higher performance designs, and can provide insights that detailed simulators may not. This paper proposes techniques to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance in the presence of long latency memory systems when employing hybrid analytical models that apply instruction trace analysis. Pending cache hits are secondary references to a cache block for which a request has already been initiated but has not yet completed. We find pending hits resulting from spatial locality and the fine-grained selection of instruction profile window blocks used for analysis both have non-negligible influences on the accuracy of hybrid analytical models and subsequently propose techniques to account for their effects. We then introduce techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. When modeling pending hits for a processor with unlimited outstanding misses we improve the accuracy of our baseline by a factor of 3.9, decreasing average error from 39.7 % to 10.3%. When modeling a processor with data prefetching, a limited number of MSHRs, or both, the techniques result in an average error of 13.8%, 9.5 % and 17.8%, respectively. 1

    Modeling Out-of-Order Superscalar Processor Performance Quickly and Accurately with Traces

    Get PDF
    Fast and accurate processor simulation is essential in processor design. Trace-driven simulation is a widely practiced fast simulation method. However, serious accuracy issues arise when an out-of-order superscalar processor is considered. In this thesis, trace-driven simulation methods are suggested to quickly and accurately model out-of-order superscalar processor performance with reduced traces. The approaches abstract the processor core and focus on the processor's uncore events rather than the processor's internal events. As a result, fast simulation speed is achieved while maintaining fairly small error compared with an execution-driven simulator. Traces can be generated either by a cycle-accurate simulator or an abstract timing model on top of a simple functional simulator. Simulation results are more accurate with the method using traces generated from a cycle-accurate simulator. Faster trace generation speed is achieved with the abstract timing model. The methods determine how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. This approach preserves a processor's dynamic uncore access patterns and accurately predicts the relative performance change when the processor's uncore-level parameters are changed. The methods are attractive especially in the early design stages due to its fast simulation speed

    Empirical and Statistical Application Modeling Using on -Chip Performance Monitors.

    Get PDF
    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive, performance-monitoring hardware only recently available on microprocessors to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained with case study results provided. The given methods further the wealth of tools available to programmer\u27s and architects for generally understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effect of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters for use by these methods across platforms making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorizing not available previously in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications

    Embedded Processor Selection/Performance Estimation using FPGA-based Profiling

    Get PDF
    In embedded systems, modeling the performance of the candidate processor architectures is very important to enable the designer to estimate the capability of each architecture against the target application. Considering the large number of available embedded processors, the need has increased for building an infrastructure by which it is possible to estimate the performance of a given application on a given processor with a minimum of time and resources. This dissertation presents a framework that employs the softcore MicroBlaze processor as a reference architecture where FPGA-based profiling is implemented to extract the functional statistics that characterize the target application. Linear regression analysis is implemented for mapping the functional statistics of the target application to the performance of the candidate processor architecture. Hence, this approach does not require running the target application on each candidate processor; instead, it is run only on the reference processor which allows testing many processor architectures in very short time

    Iterative Compilation and Performance Prediction for Numerical Applications

    Get PDF
    Institute for Computing Systems ArchitectureAs the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of a rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits from different optimisations and there are no simple criteria to stop optimisations i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring using a new reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of profiling to find the best possible program variant, have been developed. These strategies have been evaluated using a range of kernels and programs on different platforms and operating systems. A significant performance improvement has been achieved using new approaches when compared to the state-of-the-art native static and platform-specific feedback directed compilers

    A Framework for Statistical Modeling of Superscalar Processor Performance

    No full text
    This paper presents a statistical approach to modeling superscalar processor performance. Standard trace-driven techniques are very accurate, but require extremely long simulation times, especially as traces reach lengths in the billions of instructions. A framework for statistical models is described which facilitates fast, accurate performance evaluation. A machine model is built up from components: buffers, pipelines, etc. Each program trace is scanned once, generating a set of program parallelism parameters which can be used across an entire family of machine models. The machine model and program parallelism parameters are combined to form a Markov chain. The Markov chain is partitioned in order to reduce the size of the state space, and the resulting linked models are solved using an iterative technique. The use of this framework is demonstrated with two simple processor microarchitectures. The IPC estimates are very close to the IPCs generated by tracedriven simulation of the sam..
    corecore