536 research outputs found
Trace-level speculative multithreaded architecture
This paper presents a novel microarchitecture to exploit trace-level speculation by means of two threads working cooperatively in a speculative and non-speculative way respectively. The architecture presents two main benefits: (a) no significant penalties are introduced in the presence of a misspeculation and (b) any type of trace predictor can work together with this proposal. In this way, aggressive trace predictors can be incorporated since misspeculations do not introduce significant penalties. We describe in detail TSMA (trace-level speculative multithreaded architecture) and present initial results to show the benefits of this proposal. We show how simple trace predictors achieve significant speed-up in the majority of cases. Results of a simple trace speculation mechanism show an average speed-up of 16%.Peer ReviewedPostprint (published version
Empirical and Statistical Application Modeling Using on -Chip Performance Monitors.
To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive, performance-monitoring hardware only recently available on microprocessors to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical and hybrid methods are discussed and explained with case study results provided. The given methods further the wealth of tools available to programmer\u27s and architects for generally understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effect of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters for use by these methods across platforms making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorizing not available previously in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications
Hierarchical architecture design and simulation environment
The Hierarchical Architectural design and Simulation Environment (HASE)is
intended as a flexible tool for computer architects who wish to experiment with
alternative architectural configurations and design parameters. HASE is both
a design environment and a simulator. Architecture components are described
by a hierarchical library of objects defined in terms of an object oriented simulation language. HASE instantiates these objects to simulate and animate the
execution of a computer architecture. An event trace generated by the simulator
therefore describes the interaction between architecture components, for example,
fetch stages, address and data buses, sequencers, instruction buffers and register
files. The objects can model physical components at different abstraction levels,
eg. PMS (processor memory switch), ISP (instruction set processor) and RTL
(register transfer level). HASE applies the concepts of inheritance, encapsulation
and polymorphism associated with object orientation, to simplify the design and
implementation of an architecture simulation that models component operations
at different abstraction levels. For example, HASE can probe the performance
of a processor's floating point unit, executing a multiplication operation, at a
lower level of abstraction, i.e. the RTL, whilst simulating remaining architecture
components at a PMS level of abstraction. By adopting this approach, HASE
returns a more meaningful and relevant event trace from an architecture simulation. Furthermore, an animator visualises the simulation's event trace to clarify
the collaborations and interactions between architecture components. The prototype version of HASE is based on GSS (Graphical Support System), and DEMOS
(Discrete Event Modelling On Simula)
Superscalar RISC-V Processor with SIMD Vector Extension
With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. The demand is convened by the stable and extensible open-sourced RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of applications and research.
This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput.
The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations to specifically improve the performance on machine learning tasks. According to the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor.
Different test programs are evaluated on the fully-tested superscalar processor. Compared to the reference work, the proposed design improves 18.9% on average instruction throughput and 4.92% on average prediction hit rate, with 16.9% higher operating clock frequency synthesized on the Intel Arria 10 FPGA board.
The forward propagation of a convolution neural network model is evaluated by the standalone superscalar processor and the integration of the vector co-processor. The vector program with software-level optimizations achieves 9.53Ă— improvement on instruction throughput and 10.18Ă— improvement on real-time throughput. Moreover, the integration also provides 2.22Ă— energy efficiency compared with the superscalar processor along
- …