11 research outputs found

    Generic Pipelined Processor Modeling and High Performance Cycle-Accurate Simulator Generation

    Detailed modeling of processors and high-performance cycle-accurate simulators are essential for today's hardware and software design. These problems are challenging enough by themselves and have seen many previous research efforts. Addressing both simultaneously is even more challenging, and many existing approaches focus on one at the expense of the other. In this paper, we propose the Reduced Colored Petri Net (RCPN) model, which has two advantages: first, it offers a very simple and intuitive way of modeling pipelined processors; second, it can generate high-performance cycle-accurate simulators. RCPN benefits from all the useful features of Colored Petri Nets without suffering from their exponential growth in complexity. RCPN processor models are very intuitive since they are a mirror image of the processor pipeline block diagram. Furthermore, in our experiments on the generated cycle-accurate simulators for XScale and StrongARM processor models, we achieved an order-of-magnitude (~15 times) speedup over the popular SimpleScalar ARM simulator. Comment: Submitted on behalf of EDAA (http://www.edaa.com/).
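
    To make the pipeline-as-Petri-net idea concrete, here is a minimal, hypothetical sketch of a token-based pipeline model in that spirit (stages as places, stage advances as transitions). It is not the RCPN formalism itself; all class and variable names are invented for illustration.

```python
# Toy Petri-net-style pipeline sketch: places hold instruction tokens, transitions
# advance them when the downstream place has room. Illustrative only, not RCPN.

class Place:
    def __init__(self, name, capacity=1):
        self.name, self.capacity, self.tokens = name, capacity, []

class Transition:
    def __init__(self, src, dst):
        self.src, self.dst = src, dst

    def enabled(self):
        return bool(self.src.tokens) and len(self.dst.tokens) < self.dst.capacity

    def fire(self):
        self.dst.tokens.append(self.src.tokens.pop(0))

# Mirror a 3-stage pipeline block diagram: Fetch -> Decode -> Execute.
fetch, decode, execute = Place("Fetch"), Place("Decode"), Place("Execute")
advance = [Transition(decode, execute), Transition(fetch, decode)]  # back to front

program = iter(["i0", "i1", "i2", "i3"])
retired = []

for cycle in range(7):
    retired += execute.tokens              # instructions leave the pipeline
    execute.tokens = []
    for t in advance:                      # one cycle: fire every enabled transition
        if t.enabled():
            t.fire()
    if not fetch.tokens:                   # fetch place free: bring in a new token
        nxt = next(program, None)
        if nxt is not None:
            fetch.tokens.append(nxt)
    print(f"cycle {cycle}: F={fetch.tokens} D={decode.tokens} "
          f"E={execute.tokens} retired={retired}")
```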

    Modelling Heterogeneous DSP–FPGA Based System Partitioning with Extensions to the Spinach Simulation Environment

    In this paper we present system-on-a-chip extensions to the Spinach simulation environment for rapidly prototyping heterogeneous DSP/FPGA-based architectures, specifically in the embedded domain. This infrastructure has been successfully used to model systems ranging from multiprocessor Gigabit Ethernet controllers to Texas Instruments C6x-series DSP-based systems with tightly coupled FPGA-based coprocessors for computational offloading. As an illustrative example of this toolset's functionality, we investigate workload partitioning in heterogeneous DSP/FPGA-based embedded environments. Specifically, we focus on computational offloading of matrix multiplication kernels across DSP/FPGA-based embedded architectures.
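
    As a rough illustration of the offloading trade-off studied here, the sketch below models the decision of whether to run an N x N matrix multiply on a DSP or ship it to an FPGA coprocessor. Every constant (MACs per cycle, PE-array width, bus width, setup overhead) is an invented placeholder, not a Spinach or C6x parameter.

```python
# Hypothetical first-order model of the DSP-vs-FPGA offload decision for an
# N x N matrix multiply. All constants are invented placeholders.

def dsp_cycles(n, macs_per_cycle=2):
    return 2 * n**3 / macs_per_cycle                  # ~2*N^3 multiply-accumulates

def fpga_cycles(n, pe_array=64, bus_bytes_per_cycle=4, elem_bytes=4, setup=50_000):
    compute = 2 * n**3 / pe_array                     # wide parallel MAC array
    transfer = 3 * n * n * elem_bytes / bus_bytes_per_cycle  # A and B in, C out
    return setup + compute + transfer                 # offload setup + I/O + compute

for n in (16, 64, 256):
    d, f = dsp_cycles(n), fpga_cycles(n)
    print(f"N={n:4d}  DSP={d:12.0f}  FPGA={f:12.0f}  -> "
          f"{'offload to FPGA' if f < d else 'stay on DSP'}")
```

    With these made-up constants, small kernels stay on the DSP because the transfer and setup overhead dominates, while larger kernels favor the FPGA, which mirrors the kind of partitioning question the paper investigates.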

    Leveraging register windows to reduce physical registers to the bare minimum

    Register windows are an architectural technique that reduces the memory operations required to save and restore registers across procedure calls. Their effectiveness depends on the size of the register file. Register requirements normally increase with out-of-order execution, which needs registers for in-flight instructions in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverages register windows to drastically reduce the register requirements and, hence, the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (the minimum needed to guarantee forward progress) and still achieve almost the same performance as an unbounded register file. Peer reviewed. Postprint (published version).
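
    The sketch below only illustrates the baseline mechanism the proposal builds on, i.e. why register windows cut save/restore memory traffic: spills and fills happen only when the physical file runs out of windows. The window size, window count, and call trace are illustrative, not taken from the paper, and the early-release microarchitecture itself is not modeled.

```python
# Toy accounting of spill/fill traffic with register windows vs. saving and
# restoring a full window on every call. Sizes and trace are illustrative.

REGS_PER_WINDOW = 16
RESIDENT_WINDOWS = 4                 # windows that fit in the physical file

def memory_ops(trace, resident=RESIDENT_WINDOWS):
    """Count register spill/fill transfers for a nested call/return trace."""
    ops, depth, spilled = 0, 0, 0    # spilled = windows evicted to memory
    for event in trace:
        if event == "call":
            depth += 1
            if depth - spilled > resident:   # file full: evict the oldest window
                spilled += 1
                ops += REGS_PER_WINDOW
        else:                                # "ret"
            depth -= 1
            if 0 < depth <= spilled:         # returning into an evicted window
                spilled -= 1
                ops += REGS_PER_WINDOW
    return ops

trace = ["call"] * 10 + ["ret"] * 10
print("register/memory transfers with windows:", memory_ops(trace))
print("transfers saving+restoring every call :", 10 * 2 * REGS_PER_WINDOW)
```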

    ArchExplorer for automatic design space exploration

    Growing architectural complexity and stringent time-to-market constraints suggest the need to move architecture design beyond parametric exploration to structural exploration. ArchExplorer is a Web-based, permanent, and open design-space exploration framework that lets researchers compare their designs against others. The authors demonstrate their approach by exploring the design space of an on-chip memory subsystem and a multicore processor. Postprint (published version).
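
    For readers unfamiliar with design-space exploration, the sketch below shows the kind of automated sweep such a framework performs: enumerate candidate configurations, discard infeasible ones, and rank the rest. The parameters, area model, and scoring function are invented placeholders, not ArchExplorer's.

```python
# Hypothetical design-space sweep: pick the best feasible (cores, L2 size) point
# under an area budget. Cost model and scores are placeholders.
from itertools import product

cores       = [2, 4, 8]
l2_kb       = [256, 512, 1024]
area_mm2    = {2: 20, 4: 38, 8: 72}        # made-up core-count area figures
AREA_BUDGET = 80

def simulate(n_cores, l2):
    """Stand-in for a real simulation: returns a fake performance score."""
    return n_cores ** 0.8 * (l2 / 256) ** 0.3

best = None
for n, l2 in product(cores, l2_kb):
    area = area_mm2[n] + l2 / 128          # toy area model (mm^2)
    if area > AREA_BUDGET:
        continue                           # infeasible design point
    perf = simulate(n, l2)
    if best is None or perf > best[0]:
        best = (perf, n, l2, area)

print("best feasible point: %d cores, %d KB L2, %.1f mm^2, score %.2f"
      % (best[1], best[2], best[3], best[0]))
```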

    Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs

    As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical models have been employed to rapidly search for higher performance designs, and can provide insights that detailed simulators may not. This paper proposes techniques to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance in the presence of long latency memory systems when employing hybrid analytical models that apply instruction trace analysis. Pending cache hits are secondary references to a cache block for which a request has already been initiated but has not yet completed. We find pending hits resulting from spatial locality and the fine-grained selection of instruction profile window blocks used for analysis both have non-negligible influences on the accuracy of hybrid analytical models and subsequently propose techniques to account for their effects. We then introduce techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. When modeling pending hits for a processor with unlimited outstanding misses, we improve the accuracy of our baseline by a factor of 3.9, decreasing average error from 39.7% to 10.3%. When modeling a processor with data prefetching, a limited number of MSHRs, or both, the techniques result in an average error of 13.8%, 9.5%, and 17.8%, respectively.
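
    The toy trace-analysis model below illustrates the two concepts the abstract hinges on: secondary references to an in-flight block (pending hits) merge into an already-allocated MSHR, and the number of MSHRs caps how many misses can overlap. The batching logic and constants are invented for illustration and are much cruder than the paper's model.

```python
# Toy hybrid-analytical-style estimate: count memory stall cycles from a trace of
# cache-block ids, merging pending hits into existing MSHRs and limiting
# memory-level parallelism to N_MSHRS. Illustrative only.

MISS_LATENCY = 200      # cycles to service a miss
N_MSHRS      = 4

def stall_cycles(block_trace, cached=frozenset()):
    outstanding = set()           # blocks with an MSHR already allocated
    batches = 0                   # groups of misses overlapped by the MSHRs
    for blk in block_trace:
        if blk in cached:
            continue                       # ordinary cache hit
        if blk in outstanding:
            continue                       # pending hit: reuses the existing MSHR
        outstanding.add(blk)               # primary miss: allocate an MSHR
        if len(outstanding) == N_MSHRS:    # MSHRs full: this batch must drain
            batches += 1
            outstanding.clear()
    if outstanding:
        batches += 1
    return batches * MISS_LATENCY          # each batch overlaps under one latency

trace = [0, 1, 0, 2, 3, 1, 4, 5, 6, 7, 8]  # cache-block ids from a profile window
print("estimated memory stall cycles:", stall_cycles(trace))
```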

    Statistical simulation: Adding efficiency to the computer designer's toolbox


    Fast simulation techniques for microprocessor design space exploration

    Designing a microprocessor is extremely time-consuming. Computer architects heavily rely on architectural simulators, e.g., to drive high-level design decisions during early-stage design space exploration. The benefit of architectural simulators is that they yield relatively accurate performance results, are highly parameterizable, and are very flexible to use. The downside, however, is that they are at least three or four orders of magnitude slower than real hardware execution. The current trend towards multicore processors exacerbates the problem; as the number of cores on a multicore processor increases, simulation speed has become a major concern in computer architecture research and development. In this dissertation, we propose and evaluate two simulation techniques that reduce the simulation time significantly: statistical simulation and interval simulation. Statistical simulation speeds up the simulation by reducing the number of dynamically executed instructions. First, we collect a number of program execution characteristics into a statistical profile. From this profile we can generate a synthetic trace that exhibits the same execution behavior but has a much shorter trace length than the original trace. Simulating this synthetic trace then yields a performance estimate. Interval simulation raises the level of abstraction in architectural simulation; it replaces the core-level cycle-accurate simulation model with a mechanistic analytical model. The analytical model builds on insights from interval analysis: miss events divide the smooth streaming of instructions into so-called intervals. The model drives the timing by analyzing the type of the miss events and their latencies, instead of tracking the individual instructions as they propagate through the pipeline stages.
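
    A minimal sketch of the interval-simulation idea described above: total cycles are estimated as the smooth dispatch time plus a penalty per miss event, instead of tracking each instruction through the pipeline. The penalty values and event counts below are illustrative placeholders, not the dissertation's calibrated model.

```python
# Minimal interval-style cycle estimate: base dispatch time + per-event penalties.
# All numbers are illustrative.

DISPATCH_WIDTH = 4
PENALTY = {                  # cycles added per miss event (placeholders)
    "branch_mispredict": 15,
    "l1_i_miss": 10,
    "l2_miss": 200,
}

def interval_model(n_instructions, miss_events):
    base = n_instructions / DISPATCH_WIDTH        # smooth streaming of instructions
    stalls = sum(PENALTY[kind] * count for kind, count in miss_events.items())
    return base + stalls

events = {"branch_mispredict": 5_000, "l1_i_miss": 2_000, "l2_miss": 1_000}
cycles = interval_model(1_000_000, events)
print("estimated cycles:", int(cycles), " CPI: %.2f" % (cycles / 1_000_000))
```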

    Performance Limitations in Wide Superscalar Processors

    Widening instruction fetch in superscalar processors yields only diminishing performance returns. The aim of this research is to find what causes these limitations. In addition, a new cycle-accurate computer architecture simulator, AbaKus, is developed to study and evaluate the performance of architecture designs. Eager-based execution designs are tested to overcome the effects of low-accuracy branch prediction on 38% of the conditional branch instructions. An average IPC improvement of 27% is shown. However, confidence estimators need improvement in their design logic, as they prove critical to the performance of eager-based execution. In addition, the limited ability of compilers to extract ILP from the benchmark programs severely restricts the performance of superscalar architectures due to data dependencies. School of Electrical & Computer Engineering.
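
    The break-even sketch below illustrates why confidence estimation is critical to eager execution: running both paths of a branch pays off only when the expected squash penalty exceeds the duplicated-work cost. The penalty and duplication numbers are invented, not results from this work.

```python
# Toy break-even check for eager execution on a low-confidence branch.
# Numbers are invented for illustration.

MISPREDICT_PENALTY = 14        # cycles lost per squash (placeholder)

def eager_wins(low_conf_accuracy, duplication_cost):
    expected_squash = (1 - low_conf_accuracy) * MISPREDICT_PENALTY
    return duplication_cost < expected_squash

for acc in (0.55, 0.70, 0.85):
    for dup in (2, 6):
        verdict = "eager wins" if eager_wins(acc, dup) else "predict-and-squash wins"
        print(f"low-conf accuracy {acc:.2f}, duplication cost {dup}: {verdict}")
```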

    Energy scalability of OCN

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 191-197). On-chip interconnection networks (OCN) such as point-to-point networks and buses form the communication backbone in multiprocessor systems-on-a-chip, multicore processors, and tiled processors. OCNs consume significant portions of a chip's energy budget, so their energy analysis early in the design cycle becomes important for architectural design decisions. Although innumerable studies have examined OCN implementation and performance, there have been few energy analysis studies. This thesis develops an analytical framework for energy estimation in OCNs, for any given topology and arbitrary communication patterns, and presents OCN energy results based on both analytical communication models and real network traces from applications running on a tiled multicore processor. This thesis is the first work to address communication locality in analyzing multicore interconnect energy and to use real multicore interconnect traces extensively. The thesis compares the energy performance of point-to-point networks with buses for varying degrees of communication locality. The model accounts for wire length, switch energy, and network contention. This work is the first to examine network contention from the energy standpoint. The thesis presents a detailed analysis of the energy costs of a switch and shows that the estimated values for channel energy, switch control logic energy, and switch queue buffer energy are 34.5 pJ, 17 pJ, and 12 pJ, respectively. The results suggest that a one-dimensional point-to-point network results in approximately 66% energy savings over a bus for 16 or more processors, while a two-dimensional network saves over 82%, when the processors communicate with each other with equal likelihood. The savings increase with locality. Analysis of the effect of contention on OCNs for the Raw tiled microprocessor reports a maximum energy overhead of 23% due to resource contention in the interconnection network. By Theodoros K. Konstantakopoulos, Ph.D.
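
    As a back-of-the-envelope companion to the per-switch figures quoted above (channel 34.5 pJ, control logic 17 pJ, queue buffer 12 pJ), the sketch below estimates per-message energy as hop count times per-hop energy for 1-D and 2-D point-to-point networks under uniform traffic. The hop-count formulas are standard uniform-traffic approximations, not taken from the thesis, so the outputs only roughly reflect its comparisons.

```python
# Rough per-message OCN energy estimate from the quoted switch energy figures.
# Hop-count formulas assume uniform random source/destination pairs.

E_CHANNEL, E_CONTROL, E_BUFFER = 34.5e-12, 17e-12, 12e-12    # joules
E_HOP = E_CHANNEL + E_CONTROL + E_BUFFER                      # ~63.5 pJ per hop

def avg_hops_1d(n):
    # mean |i - j| over uniformly random node pairs on an n-node array
    return (n * n - 1) / (3 * n)

def avg_hops_mesh(side):
    # mean Manhattan distance on a side x side mesh, uniform traffic
    return 2 * (side * side - 1) / (3 * side)

for n in (16, 64):
    side = int(n ** 0.5)
    e1d = avg_hops_1d(n) * E_HOP
    e2d = avg_hops_mesh(side) * E_HOP
    print(f"{n:3d} tiles: 1-D {e1d * 1e12:7.1f} pJ/msg, "
          f"2-D mesh {e2d * 1e12:7.1f} pJ/msg")
```

    Communication locality enters through the hop count: traffic concentrated on near neighbors lowers the average distance and therefore the per-message energy, which is the trend the thesis quantifies.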

    A performance driven approach for hardware synthesis of guarded atomic actions

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 137-140). Hardware designers are facing new challenges in the design of complex ASICs and processors as their sizes approach 100 million logic gates. We believe no adequate solution exists that allows designers to specify hardware which takes full advantage of the available resources in these devices. Hardware design specification languages are either too low level to support efficient large-scale design (for example, Verilog), or the language and synthesis methodology is so high-level that the designer's micro-architectural ingenuity is lost in the design process. This results in circuits that oftentimes do not match the designer's expectations (for example, C-based behavioral synthesis). This thesis presents a design methodology and related synthesis algorithms that address several of the key issues of hardware design specification and high-level synthesis while avoiding the pitfalls of past approaches. The areas we focus on are modular compilation and performance specification. The modular flow allows for the separate compilation of modules and ensures the correct usage of module interfaces by attaching annotations with well-defined semantics to them. We also introduce performance specifications as a core part of a design description. This allows a designer to more easily achieve the expected design performance and it allows for rapid micro-architectural exploration. We chose guarded atomic actions as the foundation of this research because of their clean execution semantics. These semantics allow for easy design transformation (either manual or compiler driven) while ensuring that the correctness of the design is maintained. We demonstrate the practicality and power of this methodology using several examples, such as a processor which, from a single design description, can automatically be transformed into an unpipelined processor or a superscalar processor simply by changing a single-line performance specification. By Daniel L. Rosenband, Ph.D.
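
    For readers unfamiliar with guarded atomic actions, the toy interpreter below sketches the execution semantics the thesis builds on: a design is a set of rules, each with a guard and an atomic body, and each step fires one enabled rule. It is only an illustration of the model (with a simple fixed-priority scheduler), not the thesis's synthesis flow, performance specifications, or any real hardware description language.

```python
# Toy guarded-atomic-action interpreter: a bounded producer/consumer FIFO
# described as two rules. Illustrative of the semantics only.

state = {"fifo": [], "produced": 0, "consumed": []}
CAPACITY = 2

rules = [
    ("enqueue",
     lambda s: s["produced"] < 5 and len(s["fifo"]) < CAPACITY,    # guard
     lambda s: (s["fifo"].append(s["produced"]),
                s.update(produced=s["produced"] + 1))),            # atomic body
    ("dequeue",
     lambda s: len(s["fifo"]) > 0,
     lambda s: s["consumed"].append(s["fifo"].pop(0))),
]

def step(s):
    for name, guard, body in rules:          # fixed priority: first enabled rule fires
        if guard(s):
            body(s)
            return name
    return None                              # no rule enabled: the design is quiescent

while (fired := step(state)):
    print(f"fired {fired:8s} -> fifo={state['fifo']} consumed={state['consumed']}")
```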