4 research outputs found

    Exploiting Fine-Grain Concurrency Analytical Insights in Superscalar Processor Design

    Get PDF
    This dissertation develops analytical models to provide insight into various design issues associated with superscalar-type processors, i.e., the processors capable of executing multiple instructions per cycle. A survey of the existing machines and literature has been completed with a proposed classification of various approaches for exploiting fine-grain concurrency. Optimization of a single pipeline is discussed based on an analytical model. The model-predicted performance curves are found to be in close proximity to published results using simulation techniques. A model is also developed for comparing different branch strategies for single-pipeline processors in terms of their effectiveness in reducing branch delay. The additional instruction fetch traffic generated by certain branch strategies is also studied and is shown to be a useful criterion for choosing between equally well performing strategies. Next, processors with multiple pipelines are modelled to study the tradeoffs associated with deeper pipelines versus multiple pipelines. The model developed can reveal the cause of performance bottleneck: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative (i.e., beyond basic block) execution is examined via probability distributions that characterize the inherent parallelism in the instruction stream. The throughput prediction of the analytic model is shown, using a variety of benchmarks, to be close to the measured static throughput of the compiler output, under resource and scope constraints. Further experiments provide misprediction delay estimates for these benchmarks under scope constraints, assuming beyond-basic-block, out-of-order execution and run-time scheduling. These results were derived using traces generated by the Multiflow TRACE SCHEDULINGā„¢(*) compacting C and FORTRAN 77 compilers. A simplified extension to the model to include multiprocessors is also proposed. The extended model is used to analyze combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, both with shared memory. It is shown that the number of pipelines (or processors) at which the maximum throughput is obtained is increasingly sensitive to the ratio of memory access time to network access delay, as memory access time increases. Further, as a function of inter-iteration dependency distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding Optimum number of processors varies linearly. The predictions from the analytical model agree with published results based on simulations. (*)TRACE SCHEDULING is a trademark of Multiflow Computer, Inc

    Empirical and Statistical Application Modeling Using on -Chip Performance Monitors.

    Get PDF
    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive, performance-monitoring hardware only recently available on microprocessors to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained with case study results provided. The given methods further the wealth of tools available to programmer\u27s and architects for generally understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effect of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters for use by these methods across platforms making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorizing not available previously in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications

    Controlling speculative execution through a virtually ordered memory system

    Get PDF
    Processors which extract parallelism through speculative execution must be able to identify when mis-speculation has occurred. The three places where mis-speculation can occur are register accesses, control flow prediction and memory accesses. Controlling register and control flow speculation has been well studied, but no scalable techniques for identifying memory dependence violations have been identified. Since speculative execution occurs out of order this requires tracking the causal order, as well as the addresses of memory accesses. This thesis uses simulations to investigate tracking the causal order of memory accesses using explicit tags known as virtual timestamps, a distributed and scalable method. Realizable virtual timestamps are necessarily restricted in length and it is demonstrated that naive allocation schemes seriously constrain execution by inefficiently allocating virtual timestamps. Efficiently allocating virtual timestamps requires analysis of the number required by each section of code. Basic statically and dynamically evaluated analysis methods are established to avoid virtual timestamp allocation becoming a resource bottleneck. The same analysis is also used to efficiently allocate state saving resources in a fixed hardware order. The hardware order provides an alternative way of maintaining the causal order using a simple hardware organization. The ability to predict the resources required by regions of code is used as a way of selecting instructions to execute speculatively. This enables resources to be allocated efficiently and is shown to allow large amounts of parallelism to be extracted. It also promotes the effectiveness of speculative execution by issuing less instructions that will ultimately be rolled back. Using a hierarchy of hardware ordering modules, themselves ordered by explicit virtual timestamps, a scalable ordering system is proposed. This hierarchy forms the basis of a twisted memory system, a multiple version memory system capable of identifying speculative memory dependence violations. The preliminary investigations presented here show that twisted memory has the potential to support aggressive speculative parallel execution. Particular attention is paid to memory bandwidth requirements

    Automating the construction of a complier heuristics using machine learning

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 153-162).Compiler writers are expected to create effective and inexpensive solutions to NP-hard problems such as instruction scheduling and register allocation. To make matters worse, separate optimization phases have strong interactions and competing resource constraints. Compiler writers deal with system complexity by dividing the problem into multiple phases and devising approximate heuristics for each phase. However, to achieve satisfactory performance, developers are forced to manually tweak their heuristics with trial-and-error experimentation. In this dissertation I present meta optimization, a methodology for automatically constructing high quality compiler heuristics using machine learning techniques. This thesis describes machine-learned heuristics for three important compiler optimizations: hyperblock formation, register allocation, and loop unrolling. The machine-learned heuristics outperform (by as much as 3x in some cases) their state-of-the-art hand-crafted counterparts. By automatically collecting data and systematically analyzing them, my techniques discover subtle interactions that even experienced engineers would likely overlook. In addition to improving performance, my techniques can significantly reduce the human effort involved in compiler design.(cont.) Machine learning algorithms can design critical portions of compiler heuristics, thereby freeing the human designer to focus on compiler correctness. The progression of experiments I conduct in this thesis leads to collaborative compilation, an approach which enables ordinary users to transparently train compiler heuristics by running their applications as they normally would. The collaborative system automatically adapts itself to the applications in which a community of users is interested.by Mark W. Stephenson.Ph.D
    corecore