Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator
Abstract. Instruction set simulators (ISS) are vital tools for compiler and processor architecture design space exploration and verification. State-of-the-art simulators using just-in-time (JIT) dynamic binary translation (DBT) techniques are able to simulate complex embedded processors at speeds above 500 MIPS. However, these functional ISS do not provide microarchitectural observability. In contrast, low-level cycle-accurate ISS are too slow to simulate full-scale applications, forcing developers to revert to FPGA-based simulations. In this paper we demonstrate that it is possible to run ultra-high speed cycle-accurate instruction set simulations surpassing FPGA-based simulation speeds. We extend the JIT DBT engine of our ISS and augment JIT-generated code with a verified cycle-accurate processor model. Our approach can model any microarchitectural configuration, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded processor implementing the ARCompact™ instruction set architecture (ISA). We achieve simulation speeds up to 88 MIPS on a standard x86 desktop computer for the industry-standard EEMBC, CoreMark and BioPerf benchmark suites.
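As a rough illustration of the approach described above (hypothetical names and a deliberately simplified cache model, not the verified processor model used in the paper), a JIT DBT engine can pre-compute each translated block's static cycle cost at translation time, so that the emitted code only adds a constant and charges dynamic penalties such as cache misses as they occur:

```cpp
#include <array>
#include <cstdint>

// Hypothetical direct-mapped data cache model (illustration only).
struct CacheModel {
    static constexpr int kLines = 256, kLineBits = 5;   // 256 x 32-byte lines
    std::array<uint32_t, kLines> tags{};
    std::array<bool, kLines> valid{};
    uint64_t miss_penalty = 20;                         // assumed miss cost in cycles

    uint64_t access(uint32_t addr) {
        uint32_t line = (addr >> kLineBits) % kLines;
        uint32_t tag  = addr >> kLineBits;
        if (valid[line] && tags[line] == tag) return 0; // hit: no extra cycles
        valid[line] = true; tags[line] = tag;           // allocate on miss
        return miss_penalty;
    }
};

// Per-simulated-CPU state shared by all translated blocks.
struct CpuState {
    uint64_t cycles = 0;
    CacheModel dcache;
};

// What the JIT might emit for one translated block: the functional code plus
// a constant static cost computed once at translation time, with dynamic
// penalties (here, cache misses) added as they are discovered at run time.
void translated_block(CpuState& cpu, const uint32_t* mem) {
    constexpr uint64_t kStaticCost = 7;   // issue/latency cost of this block
    uint32_t addr = 0x1000;               // ...functional simulation of a load...
    volatile uint32_t value = mem[addr / 4];
    (void)value;
    cpu.cycles += kStaticCost + cpu.dcache.access(addr);
}

int main() {
    CpuState cpu;
    static uint32_t mem[1 << 12] = {};
    translated_block(cpu, mem);
    return static_cast<int>(cpu.cycles);   // 7 + 20 on the first (cold) access
}
```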
High speed simulation of microprocessor systems using LTU dynamic binary translation
This thesis presents new simulation techniques designed to speed up the simulation
of microprocessor systems. The advanced simulation techniques may be applied to
the simulator class which employs dynamic binary translation as its underlying technology.
This research supports the hypothesis that faster simulation speeds can be
realized by translating larger sections of the target program at runtime. The primary
motivation for this research was to help facilitate comprehensive design-space exploration
and hardware/software co-design of novel processor architectures by reducing
the time required to run simulations.
Instruction set simulators are used to design and to verify new system architectures,
and to develop software in parallel with hardware. However, compromises must often
be made when performing these tasks due to time constraints. This is particularly true
in the embedded systems domain where there is a short time-to-market. The processing
demands placed on simulation platforms are exacerbated further by the need to simulate
the increasingly complex, multi-core processors of tomorrow. High speed simulators
are therefore essential to reducing the time required to design and test advanced
microprocessors, enabling new systems to be released ahead of the competition.
Dynamic binary translation based simulators typically translate small sections of the
target program at runtime. This research considers the translation of larger units of
code in order to increase simulation speed. The new simulation techniques identify
large sections of program code suitable for translation after analyzing a profile of the
target program's execution path built up during simulation.
The average instruction level simulation speed for the EEMBC benchmark suite is
shown to be at least 63% faster for the new simulation techniques than for basic block
dynamic binary translation based simulation and 14.8 times faster than interpretive
simulation. The average cycle-approximate simulation speed is shown to be at least
32% faster for the new simulation techniques than for basic block dynamic binary
translation based simulation and 8.37 times faster than cycle-accurate interpretive simulation.
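A minimal sketch of the kind of profile-guided region selection this describes; the data structures and hotness threshold below are hypothetical rather than the thesis's actual implementation:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Execution-path profile gathered while interpreting: per-block execution
// counts and control-flow edge counts, keyed by program counter.
struct Profile {
    std::map<uint32_t, uint64_t> block_count;                     // PC -> executions
    std::map<std::pair<uint32_t, uint32_t>, uint64_t> edge_count; // (from, to) -> taken
    void record(uint32_t from_pc, uint32_t to_pc) {
        ++block_count[to_pc];
        ++edge_count[{from_pc, to_pc}];
    }
};

// Once seed_pc crosses the hotness threshold, grow a large translation unit
// by following only those edges that are themselves hot.
std::set<uint32_t> build_ltu(const Profile& p, uint32_t seed_pc,
                             uint64_t hot_threshold) {
    std::set<uint32_t> unit;
    std::vector<uint32_t> work{seed_pc};
    while (!work.empty()) {
        uint32_t pc = work.back();
        work.pop_back();
        if (!unit.insert(pc).second) continue;            // already in the unit
        for (const auto& [edge, taken] : p.edge_count)
            if (edge.first == pc && taken >= hot_threshold)
                work.push_back(edge.second);
    }
    return unit;   // the block set handed to the DBT back end as one region
}
```

Because regions are grown only along edges that are themselves hot, translated code stays confined to paths that actually execute, while still giving the translator whole-region scope for optimisation.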
Speeding up dynamic compilation: concurrent and parallel dynamic compilation
The main challenge faced by a dynamic compilation system is to detect and
translate frequently executed program regions into highly efficient native code
as fast as possible. To efficiently reduce dynamic compilation latency, a dynamic
compilation system must improve its workload throughput, i.e. compile
more application hotspots per unit of time. As time for dynamic compilation
adds to the overall execution time, the dynamic compiler is often decoupled
and operates in a separate thread independent from the main execution loop
to reduce the overhead of dynamic compilation.
This thesis proposes innovative techniques aimed at effectively speeding
up dynamic compilation. The first contribution is a generalised region
recording scheme optimised for program representations that require dynamic
code discovery (e.g. binary program representations). The second contribution
reduces dynamic compilation cost by incrementally compiling several
hot regions in a concurrent and parallel task farm. Altogether the combination
of generalised light-weight code discovery, large translation units,
dynamic work scheduling, and concurrent and parallel dynamic compilation
ensures timely and efficient processing of compilation workloads. Compared
to state-of-the-art dynamic compilation approaches, speedups of up to 2.08x
are demonstrated for industry-standard benchmarks such as BioPerf, SPEC
CPU 2006, and EEMBC.
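As an illustration of this decoupling, a task farm of compiler workers might be organised roughly as follows (hypothetical API; the real system's hot-region detection, scheduling and code installation are considerably more involved):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// A hot region handed to the compiler: entry PC plus (elided) block list/CFG.
struct Region { uint32_t entry_pc = 0; };

class CompileFarm {
public:
    explicit CompileFarm(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker(); });
    }
    ~CompileFarm() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    // Called from the execution loop when a region becomes hot; returns
    // immediately so interpretation continues while the region is compiled.
    void enqueue(Region r) {
        { std::lock_guard<std::mutex> g(m_); queue_.push(std::move(r)); }
        cv_.notify_one();
    }

private:
    void worker() {
        for (;;) {
            Region r;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                r = std::move(queue_.front());
                queue_.pop();
            }
            compile_and_install(r);   // JIT back end; the result would be
        }                             // published to the dispatcher atomically
    }
    void compile_and_install(const Region&) { /* elided */ }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Region> queue_;
    std::vector<std::thread> threads_;
    bool done_ = false;
};

int main() {
    CompileFarm farm(2);            // two compiler workers, for illustration
    farm.enqueue(Region{0x8000});   // e.g. a region rooted at PC 0x8000
}
```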
Next, innovative applications of the proposed dynamic compilation scheme
to speed up architectural and micro-architectural performance modelling are
demonstrated. The main contribution in this context is to exploit runtime
information to dynamically generate optimised code that accurately models
architectural and micro-architectural components. Consequently, compilation
units are larger and more complex resulting in increased compilation
latencies. Large and complex compilation units present an ideal use case for
our concurrent and parallel dynamic compilation infrastructure. We demonstrate
that our novel micro-architectural performance modelling is faster than
state-of-the-art FPGA-based simulation, whilst providing the same level of
accuracy.
High Speed CPU Simulation using LTU Dynamic Binary Translation
In order to increase the speed of dynamic binary translation based simulators we consider the translation of large translation units consisting of multiple blocks. In contrast to other simulators, which translate hot blocks or pages, the techniques presented in this paper profile the target program's execution path at runtime. The identification of hot paths ensures that only executed code is translated whilst at the same time offering greater scope for optimization. Mean performance figures for the functional simulation of EEMBC benchmarks show the new simulation techniques to be at least 63% faster than basic block based dynamic binary translation.
SimBench: A Portable Benchmarking Methodology for Full-System Simulators
We acknowledge funding by the EPSRC grant PAMELA EP/K008730/1. Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on "real-world" workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory-access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks.
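For flavour only, and not actual SimBench code: a user-space analogue of one such targeted kernel (real SimBench benchmarks run bare-metal). A tight indirect-branch loop isolates the simulator's indirect-branch and block-chaining path from everything else, so differences between simulators on this loop can be attributed to that one mechanism:

```cpp
#include <chrono>
#include <cstdio>

// Empty functions reached through an indirect call; 'volatile' keeps the host
// compiler from folding the calls away, so the compiled binary really does
// execute one indirect branch per iteration.
static void f0() {} static void f1() {} static void f2() {} static void f3() {}

int main() {
    void (*volatile ops[4])() = { f0, f1, f2, f3 };
    constexpr unsigned kIters = 100'000'000u;
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < kIters; ++i)
        ops[i & 3u]();            // stresses indirect-branch / chaining handling
    auto t1 = std::chrono::steady_clock::now();
    std::printf("indirect-branch loop: %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```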
Simulation methodologies for mobile GPUs
GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration.
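A deliberately simplified sketch of what such a trace-based model can look like; the event format and per-unit throughputs below are assumptions for illustration, not the model developed in the thesis:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A trace event recorded during functional simulation: how many operations a
// GPU job issued to each (assumed) functional unit.
enum class Unit { Arithmetic, LoadStore, Texture };
struct TraceEvent { Unit unit; uint64_t count; };

// Assumed per-cycle throughputs of one candidate GPU configuration.
struct GpuConfig {
    double arith_per_cycle = 4.0;
    double ldst_per_cycle  = 1.0;
    double tex_per_cycle   = 0.5;
};

double estimate_cycles(const std::vector<TraceEvent>& trace, const GpuConfig& cfg) {
    double arith = 0, ldst = 0, tex = 0;
    for (const auto& e : trace) {
        switch (e.unit) {
            case Unit::Arithmetic: arith += static_cast<double>(e.count); break;
            case Unit::LoadStore:  ldst  += static_cast<double>(e.count); break;
            case Unit::Texture:    tex   += static_cast<double>(e.count); break;
        }
    }
    // Assume perfect overlap between units: runtime is bounded by the busiest
    // unit running at its assumed throughput.
    return std::max({arith / cfg.arith_per_cycle,
                     ldst  / cfg.ldst_per_cycle,
                     tex   / cfg.tex_per_cycle});
}
```

The same trace can then be re-costed against many candidate configurations without re-running the application, which is what makes this style of model attractive for early design-space exploration.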
From High Level Architecture Descriptions to Fast Instruction Set Simulators
As computer systems become increasingly complex and diverse, so too do the architectures
they implement. This leads to an increase in complexity in the tools used to design
new hardware and software. One particularly important tool in hardware and software
design is the Instruction Set Simulator, which is used to prototype new architectures and
hardware features, verify hardware, and test and debug software. Many Architecture
Description Languages exist which facilitate the description of new architectural or
hardware features, and generate tools such as simulators. However, these typically
suffer from poor performance, are difficult to test effectively, and may be limited in
functionality.
This thesis considers three objectives when developing Instruction Set Simulators:
performance, correctness, and completeness, and presents techniques which contribute
to each of these. Performance is obtained by combining Dynamic Binary Translation
techniques with a novel analysis of high level architecture descriptions. This makes use
of partial evaluation techniques in order to both improve the translation system, and to
improve the quality of the translated code, leading to a performance improvement of over
2.5x compared to a naïve implementation.
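An illustrative sketch of the partial-evaluation idea (not the generator's actual output): the generic semantics function decodes its operand fields at run time, whereas the specialised form fixes them at translation time so the emitted code contains no decode work at all.

```cpp
#include <array>
#include <cstdint>

struct Regs { std::array<uint32_t, 32> r{}; };

// Generic semantics, as an architecture description might specify
// "add rd, ra, imm": every field is decoded and read at run time.
inline void add_imm_generic(Regs& regs, uint32_t rd, uint32_t ra, uint32_t imm) {
    regs.r[rd] = regs.r[ra] + imm;
}

// Partially evaluated form: fields known when the instruction is translated
// become compile-time constants, so the generated code contains no decoding.
template <uint32_t RD, uint32_t RA, uint32_t IMM>
inline void add_imm_specialised(Regs& regs) {
    regs.r[RD] = regs.r[RA] + IMM;
}

int main() {
    Regs regs;
    regs.r[2] = 40;
    add_imm_generic(regs, /*rd=*/1, /*ra=*/2, /*imm=*/2);   // run-time decode
    add_imm_specialised<1, 2, 2>(regs);                     // folded at JIT time
    return static_cast<int>(regs.r[1]);                     // 42
}
```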
This thesis also presents techniques which contribute to the correctness objective.
Each possible behaviour of each described instruction is used to guide the generation
of a test case. Constraint satisfaction techniques are used to determine the necessary
instruction encoding and context for each behaviour to be produced. It is shown that
this is a significant improvement over benchmark-driven testing, and this technique
has led to the discovery of several bugs and inconsistencies in multiple state of the art
instruction set simulators.
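A toy illustration of the idea, in which a naive exhaustive search stands in for the constraint-satisfaction machinery and the 16-bit ISA is invented for the example:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// One described behaviour of an instruction, expressed as a constraint over
// the instruction word: the encoding must make this predicate true.
struct Behaviour {
    const char* name;
    std::function<bool(uint16_t)> constraint;
};

// Naive stand-in for a constraint solver: exhaustively search the 16-bit
// encoding space for a word that triggers the behaviour.
std::optional<uint16_t> find_encoding(const Behaviour& b) {
    for (uint32_t word = 0; word <= 0xffff; ++word)
        if (b.constraint(static_cast<uint16_t>(word)))
            return static_cast<uint16_t>(word);
    return std::nullopt;   // behaviour unreachable: worth reporting as a bug
}

int main() {
    // Invented 16-bit ISA: opcode 0x1 is "add"; the "same source registers"
    // behaviour additionally requires the two register fields to match.
    std::vector<Behaviour> behaviours = {
        {"add.normal", [](uint16_t w) { return (w >> 12) == 0x1; }},
        {"add.same_reg", [](uint16_t w) {
             return (w >> 12) == 0x1 && ((w >> 8) & 0xf) == ((w >> 4) & 0xf);
         }},
    };
    int generated = 0;
    for (const auto& b : behaviours)
        if (find_encoding(b)) ++generated;   // each hit becomes one test case
    return generated;
}
```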
Finally, several challenges in "Full System" simulation are addressed, contributing
to both the performance and completeness objectives. Full System simulation generally
carries significant performance costs compared with other simulation strategies. Crucially,
instructions which access memory require virtual to physical address translation
and can now cause exceptions. Both of these processes must be correctly and efficiently
handled by the simulator. This thesis presents novel techniques to address this issue
which provide up to a 1.65x speedup over a state-of-the-art solution.
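One common way to structure that fast path, shown here as a hedged sketch with hypothetical names rather than the thesis's technique: a small software cache of recent virtual-to-physical translations is consulted inline, and only a miss falls back to the full page-table walk, which may raise a guest exception instead of producing a host pointer.

```cpp
#include <cstdint>
#include <optional>

struct TlbEntry {
    uint32_t vpage = ~0u;         // cached virtual page number (invalid initially)
    uint8_t* host_base = nullptr; // host pointer to the start of that guest page
};

struct SoftTlb {
    static constexpr uint32_t kEntries = 1024;   // direct-mapped, 4 KiB pages
    TlbEntry entries[kEntries];

    // Fast path, inlined into translated memory-access code.
    uint8_t* lookup(uint32_t vaddr) {
        TlbEntry& e = entries[(vaddr >> 12) & (kEntries - 1)];
        if (e.vpage == (vaddr >> 12))
            return e.host_base + (vaddr & 0xfffu);   // hit: no page-table walk
        return slow_path(vaddr, e);
    }

    // Slow path: walk the guest page tables; cache the result or raise a fault.
    uint8_t* slow_path(uint32_t vaddr, TlbEntry& e) {
        std::optional<uint8_t*> host = walk_page_tables(vaddr);
        if (!host) {
            raise_guest_page_fault(vaddr);           // simulator re-enters dispatch
            return nullptr;
        }
        e.vpage = vaddr >> 12;
        e.host_base = *host - (vaddr & 0xfffu);
        return *host;
    }

    std::optional<uint8_t*> walk_page_tables(uint32_t) { return std::nullopt; } // elided
    void raise_guest_page_fault(uint32_t) {}                                    // elided
};
```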
Branch Prediction For Network Processors
Originally designed to favour flexibility over packet processing performance, the future of the programmable network processor is challenged by the need both to meet increasing line rates and to provide additional processing capabilities. To meet these requirements, trends within networking research have tended to focus on techniques such as offloading computation-intensive tasks to dedicated hardware logic or increasing parallelism. While parallelism retains flexibility, challenges such as load-balancing limit its scope. On the other hand, hardware offloading allows complex algorithms to be implemented at high speed but sacrifices flexibility. To this end, the work in this thesis is focused on a more fundamental aspect of a network processor, the data-plane processing engine.
By performing both system modelling and analysis of packet processing functions, this thesis aims to identify and extract salient information regarding the performance of multi-processor workloads. Following on from a traditional software-based analysis of program workloads, we develop a method of modelling and analysing hardware accelerators when applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures.
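As a point of reference for the kind of mechanism such a study models (a generic technique, not the architecture proposed in the thesis), a simple bimodal predictor with 2-bit saturating counters looks like this:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Bimodal predictor: one 2-bit saturating counter per table entry.  Counter
// values 2 and 3 predict "taken"; 0 and 1 predict "not taken".
class BimodalPredictor {
public:
    bool predict(uint32_t pc) const {
        return table_[index(pc)] >= 2;
    }
    void update(uint32_t pc, bool taken) {
        uint8_t& counter = table_[index(pc)];
        if (taken)  { if (counter < 3) ++counter; }   // saturate at 3
        else        { if (counter > 0) --counter; }   // saturate at 0
    }

private:
    static constexpr std::size_t kEntries = 4096;
    static std::size_t index(uint32_t pc) { return (pc >> 2) & (kEntries - 1); }
    std::array<uint8_t, kEntries> table_{};           // starts "strongly not taken"
};
```

Deeply pipelined packet-processing engines pay a large penalty for every mispredicted branch, which is why even a small prediction table of this kind can recover a substantial share of the lost throughput.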