










































Cycle-Accurate Performance Modelling in an Ultra-Fast Just-In-Time Dynamic Binary
Translation Instruction Set Simulator
Citation for published version:
Bohm, I, Franke, B & Topham, N 2011, 'Cycle-Accurate Performance Modelling in an Ultra-Fast
Just-In-Time Dynamic Binary Translation Instruction Set Simulator', Transactions on High
Performance and Embedded Architecture and Compilation, vol. 5, no. 4.
Transactions on High Performance and Embedded Architecture and Compilation
Cycle-Accurate Performance Modelling in an Ultra-Fast
Just-In-Time Dynamic Binary Translation Instruction
Set Simulator
Igor Böhm, Björn Franke, and Nigel Topham
Institute for Computing Systems Architecture,
School of Informatics, University of Edinburgh
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom
I.Bohm@sms.ed.ac.uk,{bfranke,npt}@inf.ed.ac.uk
http://groups.inf.ed.ac.uk/pasta/
Abstract. Instruction set simulators (ISS) are vital tools for compiler and processor
architecture design space exploration and verification. State-of-the-art simulators
using just-in-time (JIT) dynamic binary translation (DBT) techniques are able
to simulate complex embedded processors at speeds above 500 MIPS. However,
these functional ISS do not provide microarchitectural observability. In contrast,
low-level cycle-accurate ISS are too slow to simulate full-scale applications, forcing
developers to revert to FPGA-based simulations. In this paper we demonstrate
that it is possible to run ultra-high speed cycle-accurate instruction set simulations
surpassing FPGA-based simulation speeds. We extend the JIT DBT engine of
our ISS and augment JIT generated code with a verified cycle-accurate processor
model. Our approach can model any microarchitectural configuration, does not
rely on prior profiling, instrumentation, or compilation, and works for all binaries
targeting a state-of-the-art embedded processor implementing the ARCompact™
instruction set architecture (ISA). We achieve simulation speeds up to 88 MIPS on
a standard x86 desktop computer for the industry standard EEMBC, COREMARK
and BIOPERF benchmark suites.
1 Introduction
Simulators play an important role in the design of today’s high performance micropro-
cessors. They support design-space exploration, where processor characteristics such as
speed and power consumption are accurately predicted for different architectural mod-
els. The information gathered enables designers to select the most efficient processor
designs for fabrication. On a slightly higher level instruction set simulators provide a
platform on which experimental instruction set architectures can be tested, and new
compilers and applications may be developed and verified. They help to reduce the
overall development time for new microprocessors by allowing concurrent engineering
during the design phase. This is especially important for embedded system-on-chip
(SOC) designs, where processors may be extended to support specific applications.
However, increasing size and complexity of embedded applications challenges current
ISS technology. For example, the JPEG encode and decode EEMBC benchmarks execute
between 10 × 10^9 and 16 × 10^9 instructions. Similarly, AAC (Advanced Audio Coding)
decoding and playback of a six minute excerpt of Mozart’s Requiem using a sample
rate of 44.1 kHz and a bit rate of 128 kbps results in ≈ 38 × 10^9 executed instructions.
These figures clearly demonstrate the need for fast ISS technology to keep up with
performance demands of real-world embedded applications.
The broad introduction of multi-core systems, e.g. in the form of multi-processor
systems-on-chip (MPSOC), has exacerbated the strain on simulation technology and
it is widely acknowledged that improved single-core simulation performance is key to
making the simulation of larger multi-core systems a viable option [1].
This paper is concerned with ultra-fast ISS using recently developed just-in-time
(JIT) dynamic binary translation (DBT) techniques [27,6,15]. DBT combines interpre-
tive and compiled simulation techniques in order to maintain high speed, observability
and flexibility. However, achieving accurate state and even more so microarchitectural
observability remains in tension with high speed simulation. In fact, none of the existing
JIT DBT ISS [27,6,15] maintains a detailed performance model.
In this paper we present a novel methodology for fast and cycle-accurate performance
modelling of the processor pipeline, instruction and data caches, and memory
within a JIT DBT ISS. Our main contribution is a simple, yet powerful software pipeline
model together with an instruction operand dependency and side-effect analysis JIT
DBT pass that allows us to retain an ultra-fast instruction-by-instruction execution model
without compromising microarchitectural observability. The essential idea is to reconstruct
the microarchitectural pipeline state after executing an instruction. This is less
complex in terms of runtime and implementation than a cycle-by-cycle execution model
and reduces the work for pipeline state updates by more than an order of magnitude.
In our ISS we maintain additional data structures relating to the processor pipeline
and the caches and emit lightweight calls to functions updating the processor state in the
JIT generated code. In order to maintain flexibility and to achieve high simulation speed
our approach decouples the performance model in the ISS from the functional simula-
tion, thereby eliminating the need for extensive rewrites of the simulation framework
to accommodate microarchitectural changes. In fact, the strict separation of concerns
(functional simulation vs. performance modelling) enables the automatic generation of
a pipeline performance model from a processor specification written in an architecture
description language (ADL) such as LISA [26]. This is, however, beyond the scope of
this paper.
We have evaluated our performance modelling methodology against the industry
standard EEMBC, COREMARK, and BIOPERF benchmark suites for our ISS of the ENCORE [33]
embedded processor implementing the ARCompact™ [32] ISA. Our ISS
faithfully models the 5-stage interlocked ENCORE processor pipeline (see Figure 3)
with forwarding logic, its mixed-mode 16/32-bit instruction set, zero overhead loops,
static and dynamic branch prediction, branch delay slots, and four-way set associative
data and instruction caches. We also provide results for the 7-stage ENCORE processor
pipeline variant modelled by our ISS. Across all 44 benchmarks from EEMBC, COREMARK,
and BIOPERF the speed of simulation reaches up to 88 MIPS on a standard x86
desktop computer and outperforms that of a speed-optimised FPGA implementation of
the ENCORE processor.

































Fig. 1. Dynamic binary translation flow integrated into main simulation loop.
1.1 Motivating Example
Before we take a more detailed look at our JIT DBT engine and the proposed JIT performance
model code generation approach, we provide a motivating example in order
to highlight the key concepts.
Consider the block of ARCompact™ instructions in Figure 2 taken from the COREMARK
benchmark. Our ISS identifies this block of code as a hotspot and compiles it to
native machine code using the sequence of steps illustrated in Figure 1. Each block maps
onto a function denoted by its address (see label ① in Figure 2), and each instruction is
translated into semantically equivalent native code faithfully modelling the processor’s
architectural state (see labels ②, ③, and ⑥ in Figure 2). In order to correctly track
microarchitectural state, we augment each translated ARCompact™ instruction with calls
to specialised functions (see labels ③ and ⑦ in Figure 2) responsible for updating the
underlying microarchitectural model (see Figure 3).
Figure 3 demonstrates how the hardware pipeline microarchitecture is mapped onto
a software model capturing its behaviour. To improve the performance of microarchi-
tectural state updates we emit several versions of performance model update functions
tailored to each instruction kind (i.e. arithmetic and logical instructions, load/store in-
structions, branch instructions). Section 3.1 describes the microarchitectural software
model in more detail.
After code has been emitted for a batch of blocks, it is translated and linked by a
JIT compiler. Finally, the translated block map is updated with the address of each newly
translated block. When a previously translated block is encountered again during simulation,
it is found in the translated block map and can be executed directly.
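The lookup-and-dispatch step described above can be sketched as follows. This is only an illustration and not ARCSIM's implementation: the direct-mapped table, its size, and the names TranslatedBlockMap, tbm_insert, tbm_lookup, simulate_block, demo_block and demo_interpret are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*BlockFn)(void);          /* native code of one translated block */

#define TBM_SLOTS 1024u                 /* hypothetical table size (power of two) */

typedef struct {
  uint32_t addr[TBM_SLOTS];             /* target address stored in each slot */
  BlockFn  fn[TBM_SLOTS];               /* NULL marks an empty slot */
} TranslatedBlockMap;

/* Instructions are at least 16 bits wide, so bit 0 carries no information. */
static uint32_t tbm_slot(uint32_t addr) { return (addr >> 1) & (TBM_SLOTS - 1u); }

/* Record a newly translated block; a colliding older entry is overwritten. */
void tbm_insert(TranslatedBlockMap *m, uint32_t addr, BlockFn fn) {
  assert(fn != NULL);
  uint32_t s = tbm_slot(addr);
  m->addr[s] = addr;
  m->fn[s]   = fn;
}

/* Return the native code for addr, or NULL if the block is not translated. */
BlockFn tbm_lookup(const TranslatedBlockMap *m, uint32_t addr) {
  uint32_t s = tbm_slot(addr);
  return (m->fn[s] != NULL && m->addr[s] == addr) ? m->fn[s] : NULL;
}

/* Execute one block: natively if present in the map, otherwise through the
   interpreter callback. Returns 1 if native code was used, 0 otherwise. */
int simulate_block(const TranslatedBlockMap *m, uint32_t pc,
                   void (*interpret)(uint32_t)) {
  BlockFn fn = tbm_lookup(m, pc);
  if (fn != NULL) { fn(); return 1; }
  interpret(pc);
  return 0;
}

/* Stand-ins for this example only: they just count how often they run. */
int native_calls = 0, interpreted_calls = 0;
void demo_block(void) { native_calls++; }
void demo_interpret(uint32_t pc) { (void)pc; interpreted_calls++; }
```

A block at 0x848 is interpreted until tbm_insert registers its native function; from then on simulate_block dispatches to the native code directly.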
1.2 Contributions
Among the contributions of this paper are:
1. The development of a cycle-accurate timing model for state-of-the-art embedded
processors that can be adapted to different microarchitectures and is independent
of the implementation of a functional ISS,
2. the integration of this cycle-accurate timing model into a JIT DBT engine of an ISS
to improve the speed of cycle-accurate instruction set simulation to a level that is
higher than a speed-optimised FPGA implementation of the same processor core,
without compromising accuracy,
3. an extensive evaluation against industry standard COREMARK, EEMBC, and BIOP-
ERF benchmark suites and an interpretive cycle-accurate mode of our ISS that has
been verified and calibrated against an actual state-of-the-art hardware implemen-




The remainder of this paper is structured as follows. In section 2 we provide a brief
outline of the ENCORE embedded processor that serves as a simulation target in this
paper. In addition, we outline the main features of our ARCSIM ISS and describe the
basic functionality of its JIT DBT engine. This is followed by a description of our ap-
proach to decoupled, cycle-accurate performance modelling in the JIT generated code
in section 3. We present the results of our extensive, empirical evaluation in section 4
before we discuss the body of related work in section 5. Finally, we summarise and
conclude in section 6.
2 Background
2.1 The ENCORE Embedded Processor
In order to demonstrate the effectiveness of our approach we use a state-of-the-art processor
implementing the ARCompact™ ISA, namely the ENCORE [33].
The ENCORE’s microarchitecture is based on an interlocked pipeline with forward-
ing logic, supporting zero overhead loops (ZOL), freely intermixable 16- and 32-bit
instruction encodings, static and dynamic branch prediction, branch delay slots, and
predicated instructions. There exist two pipeline variants of the ENCORE processor,
namely a 5-stage (see Figure 3) variant and a 7-stage variant which has an additional
ALIGN stage between the FETCH and DECODE stages, and an additional REGISTER
stage between the DECODE and EXECUTE stages.
Cycle-Accurate Performance Modelling in an Ultra-Fast JIT DBT ISS 5
extern CpuState cpu;             // global processor state
void BLK_0x00000848(void) {
  cpu.r[2] = (uint16_t)(cpu.r[9]);
  pipeline(0,cpu.avail[9],&(cpu.avail[2]),0x00000848,1,0);
  cpu.r[3] = cpu.r[12] ^ cpu.r[2];
  pipeline(cpu.avail[12],cpu.avail[2],&(cpu.avail[3]),0x0000084c,1,0);
  cpu.r[3] = cpu.r[3] & (uint32_t)15;
  pipeline(cpu.avail[3],0,&(cpu.avail[3]),0x00000850,1,0);
  cpu.r[3] = cpu.r[3] << ((sint8_t)3 & 0x1f);
  pipeline(cpu.avail[3],0,&(cpu.avail[3]),0x00000854,1,0);
  cpu.r[2] = cpu.r[2] & (uint32_t)7;
  pipeline(cpu.avail[2],0,&(cpu.avail[2]),0x00000858,1,0);
  cpu.r[3] = cpu.r[3] | cpu.r[2];
  pipeline(cpu.avail[3],cpu.avail[2],&(cpu.avail[3]),0x0000085c,1,0);
  cpu.r[4] = cpu.r[3] << ((sint8_t)8 & 0x1f);
  pipeline(cpu.avail[3],0,&(cpu.avail[4]),0x00000860,1,0);
  // compare and branch instruction with delay slot
  pipeline(cpu.avail[10],cpu.avail[13],&(ignore),0x00000864,1,0);
  if (cpu.r[10] >= cpu.r[13]) {
    cpu.pl[FE] = cpu.pl[ME] - 1; // branch penalty
    fetch(0x0000086c);           // speculative fetch due to branch pred.
    cpu.auxr[BTA] = 0x00000890;  // set BTA register
    cpu.D = 1;                   // set delay slot bit
  } else {
    cpu.pc = 0x0000086c;
  }
  cpu.r[4] = cpu.r[4] | cpu.r[3];// delay slot instruction
  pipeline(cpu.avail[4],cpu.avail[3],&(cpu.avail[4]),0x00000868,1,0);
  if (cpu.D) {                   // branch was taken
    cpu.D = 0;                   // clear delay slot bit
    cpu.pc = cpu.auxr[BTA];      // set PC
  }
}





 [0x00000848] ext     r2,r9
 [0x0000084c] xor     r3,r12,r2
 [0x00000850] and     r3,r3,0xf
 [0x00000854] asl     r3,r3,0x3
 [0x00000858] and     r2,r2,0x7
 [0x0000085c] or      r3,r3,r2
 [0x00000860] asl     r4,r3,0x8
 [0x00000864] brcc.d  r10,r13,0x2c











enum {
  FE,    // fetch
  DE,    // decode
  EX,    // execute
  ME,    // memory
  WB,    // write back
  STAGES // number of stages (assumed to be defined here; it sizes pl[] below)
};

typedef struct {
  uint32_t pc;
  uint32_t r[REGS];          // general purpose registers
  uint32_t auxr[AUXREGS];    // auxiliary registers
  char     L,Z,N,C,V,U,D,H;  // status flags (H...halt bit)
  uint64_t pl[STAGES];       // per stage cycle count
  uint64_t avail[REGS];      // per register cycle count
  uint64_t cycles;           // total cycle count
} CpuState;




Fig. 2. JIT dynamic binary translation of ARCompact™ basic block with CpuState structure
representing architectural ⑥ and microarchitectural state ⑦. See Figure 3 for an implementation
of the microarchitectural state update function pipeline().
In our configuration we use 32K 4-way set associative instruction and data caches
with a pseudo-random block replacement policy. Because cache misses are expensive,
a pseudo-random replacement policy requires us to exactly model cache behaviour to
avoid large deviations in cycle count. Although the above configuration was used for
this work, the processor is highly configurable. Pipeline depth, cache sizes, associativ-
ity, and block replacement policies as well as byte order (i.e. big endian, little endian),
bus widths, register-file size, and instruction set specific options such as instruction set
extensions (ISEs) are configurable. The processor is fully synthesisable onto an FPGA
and fully working ASIP silicon implementations have been taped-out recently.
2.2 ARCSIM Instruction Set Simulator
In our work we extended ARCSIM [34], a target adaptable simulator with extensive support
of the ARCompact™ ISA. It is a full-system simulator, implementing the processor,
its memory sub-system (including MMU), and sufficient interrupt-driven peripherals to
simulate the boot-up and interactive operation of a complete Linux-based system. The
simulator provides the following simulation modes:
– Co-simulation mode working in lock-step with standard hardware simulation tools
used for hardware and performance verification.
– Highly-optimised [27] interpretive simulation mode.
– Target microarchitecture adaptable cycle-accurate simulation mode modelling the
processor pipeline, caches, and memories. This mode has been calibrated against a
5-stage and 7-stage pipeline variant of the ENCORE processor.
– High-speed JIT DBT functional simulation mode [27][15] capable of simulating an
embedded system at speeds approaching or even exceeding that of a silicon ASIP
whilst faithfully modelling the processor’s architectural state.
– A profiling simulation mode that is orthogonal to the above modes delivering addi-
tional statistics such as dynamic instruction frequencies, detailed per register access
statistics, per instruction latency distributions, detailed cache statistics, executed
delay slot instructions, as well as various branch predictor statistics.
In common with the ENCORE processor, the ARCSIM simulator is highly config-
urable. Architectural features such as register file size, instruction set extensions, the set
of branch conditions, the auxiliary register set, as well as memory mapped IO extensions
can be specified via a set of well defined APIs and configuration settings. Furthermore,
microarchitectural features such as pipeline depth, per instruction execution latencies,
cache size and associativity, cache block replacement policies, memory subsystem lay-
out, branch prediction strategies, as well as bus and memory access latencies are fully
configurable. The microarchitectural configurations used for our experiments are listed
in Table 2.
2.3 Hotspot Detection and JIT Dynamic Binary Translation
In ARCSIM simulation time is partitioned into epochs, where each epoch is defined
as the interval between two successive JIT translations. Within an epoch frequently



































































ENCORE 5-Stage Pipeline JIT Generated Software Model
void
pipeline(uint64_t  opd1,  uint64_t  opd2,
         uint64_t* dst1,  uint64_t* dst2,
         uint32_t  faddr, uint32_t  xc,   uint32_t mc)
{
  // FETCH     - account for instruction fetch latency
  cpu.pl[FE] += fetch(faddr);
  // INVARIANT - see section 3.1 processor pipeline model
  if (cpu.pl[FE] < cpu.pl[DE]) cpu.pl[FE] = cpu.pl[DE];
  // DECODE    - determine operand availability time 
  cpu.pl[DE] = max3((cpu.pl[FE] + 1), opd1, opd2);
  if (cpu.pl[DE] < cpu.pl[EX]) cpu.pl[DE] = cpu.pl[EX];
  // EXECUTE   - account for execution latency and destination availability time
  cpu.pl[EX] = *dst1 = cpu.pl[DE] + xc;                   
  if (cpu.pl[EX] < cpu.pl[ME]) cpu.pl[EX] = cpu.pl[ME];
  // MEMORY    - account for memory latency and destination availability time
  cpu.pl[ME] = *dst2 = cpu.pl[EX] + mc;
  if (cpu.pl[ME] < cpu.pl[WB]) cpu.pl[ME] = cpu.pl[WB];
  // WRITEBACK
  cpu.pl[WB] = cpu.pl[ME] + 1;
}









Fig. 3. ENCORE 5-Stage hardware pipeline model with a sample JIT generated software microar-
chitectural model.
executed basic blocks (i.e. hotspots) are detected at runtime and recorded as traces (see
Figure 1). After each epoch the hottest recorded traces (i.e. frequently executed traces)
are passed to the JIT DBT engine for native code generation. More recently [15] we have
extended hotspot detection and JIT DBT with the capability to find and translate large
translation units (LTU) consisting of multiple traced control-flow-graphs. By increasing
the size of translation units it is possible to achieve significant speedups in simulation
performance. The simulation speedup can be attributed to improved locality, to more time
being spent simulating within a translation unit, and to greater scope for optimisation by the
JIT compiler, which can optimise across multiple blocks.
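As a rough illustration of epoch-based hotspot detection, the sketch below counts interpreted block executions during an epoch and, at the epoch boundary, selects the blocks that reached a threshold. It works on single blocks only and ignores traces and LTUs. The names EpochProfile, profile_block and end_epoch, the linear search, the capacity, and the threshold parameter are assumptions for this example and are not taken from ARCSIM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 256u        /* hypothetical capacity of the per-epoch profile */

typedef struct {
  uint32_t addr[MAX_BLOCKS];   /* entry addresses of blocks seen in this epoch */
  uint32_t count[MAX_BLOCKS];  /* number of interpreted executions per block */
  size_t   used;               /* number of valid entries */
} EpochProfile;

/* Record one interpreted execution of the block starting at addr. */
void profile_block(EpochProfile *p, uint32_t addr) {
  for (size_t i = 0; i < p->used; i++) {
    if (p->addr[i] == addr) { p->count[i]++; return; }
  }
  assert(p->used < MAX_BLOCKS);
  p->addr[p->used]  = addr;
  p->count[p->used] = 1;
  p->used++;
}

/* End of epoch: copy the addresses of all blocks executed at least `threshold`
   times into hot[] (at most hot_cap entries), reset the profile for the next
   epoch, and return the number of blocks selected for translation. */
size_t end_epoch(EpochProfile *p, uint32_t threshold,
                 uint32_t *hot, size_t hot_cap) {
  size_t n = 0;
  for (size_t i = 0; i < p->used && n < hot_cap; i++) {
    if (p->count[i] >= threshold) hot[n++] = p->addr[i];
  }
  p->used = 0;
  return n;
}
```

The addresses returned by end_epoch would then be handed to the JIT DBT engine as one batch.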
3 Methodology
In this paper we describe our approach to combining cycle-accurate and high-speed JIT
DBT simulation modes in order to provide architectural and microarchitectural observability
at speeds exceeding speed-optimised FPGA implementations. We do this
by extending our JIT DBT engine with a pass responsible for analysing instruction
operand dependencies and side-effects, and an additional code emission pass emitting
specialised code for performance model updates (see labels ① and ② in Figure 1).
In the following sections we outline our generic processor pipeline model and de-
scribe how to account for instruction operand availability and side-effect visibility tim-
ing. We also discuss our cache and memory model and show how to integrate control
flow and branch prediction into our microarchitectural performance model.
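To make the code emission pass more concrete, the following sketch emits C source text for a three-register ALU instruction followed by its performance model update call, in the style of the generated code in Figure 2. The function name emit_alu_rrr and its parameters are hypothetical, and the fixed latencies 1 and 0 are simply copied from the calls shown in Figure 2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Emit C source for `r[d] = r[a] <op> r[b]` at target address pc, followed by
   the call that updates the pipeline model for this instruction.
   Returns the number of characters written (excluding the terminating NUL),
   or -1 if buf is too small. */
int emit_alu_rrr(char *buf, size_t cap, const char *op,
                 unsigned d, unsigned a, unsigned b, uint32_t pc) {
  assert(buf != NULL && op != NULL);
  int n = snprintf(buf, cap,
      "  cpu.r[%u] = cpu.r[%u] %s cpu.r[%u];\n"
      "  pipeline(cpu.avail[%u],cpu.avail[%u],&(cpu.avail[%u]),0x%08x,1,0);\n",
      d, a, op, b,          /* functional part: architectural state update */
      a, b, d,              /* sources and destination found by the analysis */
      (unsigned)pc);        /* fetch address of the instruction */
  return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Calling emit_alu_rrr(buf, sizeof buf, "^", 3, 12, 2, 0x84c) produces the two lines shown for the xor instruction in Figure 2.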
3.1 Processor Pipeline Model
Hardware and RTL simulation execute at cycle granularity, that is, cycle by cycle.
If the designer wants to find out how many cycles it took to execute an instruction
or program, it suffices to count the number of cycles. While this
execution model works well for hardware, it is too detailed and slow for ISS purposes.
Therefore fast functional ISS have an instruction-by-instruction execution model. While
this execution model yields faster simulation speeds, it usually compromises microarchitectural
observability and detail. Our software pipeline model together with an instruction
operand dependency and side-effect analysis JIT DBT pass allows us to retain an
instruction-by-instruction execution model without compromising microarchitectural
observability. The essential idea is to reconstruct the microarchitectural pipeline state
after executing an instruction.
Thus the processor pipeline is modelled as an array with as many elements as there
are pipeline stages (see definition of pl[STAGES] at label ⑦ in Figure 2). For each
pipeline stage we add up the corresponding latencies and store the cycle count at which
the instruction is ready to leave the respective stage. The line with label ① in Figure 3
demonstrates this for the fetch stage cpu.pl[FE] by adding the number of cycles it
takes to fetch the corresponding instruction to the current cycle count at that stage. The
next line in Figure 3 with the label ② is an invariant ensuring that an instruction cannot
leave its pipeline stage before the instruction in the immediately following stage is ready
to proceed. Figure 4 contains a detailed example of the microarchitectural performance
model determining the cycle count for a sample ARCompact™ instruction.
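The per-stage bookkeeping and the invariant can be tried out in isolation with the sketch below. It follows the structure of Figure 3 but leaves out operand availability and takes the fetch latency as a parameter instead of calling fetch(). The names Pipe, max2 and step are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FE, DE, EX, ME, WB, STAGES };   /* the five pipeline stages */

typedef struct {
  uint64_t pl[STAGES];  /* cycle at which the latest instruction left each stage */
} Pipe;

static uint64_t max2(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Reconstruct the pipeline state after one instruction with fetch latency fc,
   execute latency xc and memory latency mc (all in cycles). Each max2() with
   the following stage's old value is the invariant: an instruction cannot
   leave a stage before its predecessor has left the next one.
   Returns the cycle at which the instruction leaves write back. */
uint64_t step(Pipe *p, uint64_t fc, uint64_t xc, uint64_t mc) {
  assert(p != NULL);
  p->pl[FE] = max2(p->pl[FE] + fc, p->pl[DE]);
  p->pl[DE] = max2(p->pl[FE] + 1,  p->pl[EX]);
  p->pl[EX] = max2(p->pl[DE] + xc, p->pl[ME]);
  p->pl[ME] = max2(p->pl[EX] + mc, p->pl[WB]);
  p->pl[WB] = p->pl[ME] + 1;
  return p->pl[WB];
}
```

Starting from an all-zero state, three calls step(&p, 1, 1, 0) return 4, 5 and 6, i.e. one instruction completes per cycle once the pipeline is full. A fourth instruction with a 10-cycle fetch returns 16, and the single-cycle instruction after it returns 17.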
Pipeline Model
// INITIAL STATE AT FETCH
cpu.pl[FE] += fetch(0x00000868);
if (cpu.pl[FE] < cpu.pl[DE])
    cpu.pl[FE] = cpu.pl[DE];
// INITIAL STATE AT DECODE
cpu.pl[DE] = max3((cpu.pl[FE]+1),
                  opd1, opd2);
if (cpu.pl[DE] < cpu.pl[EX])
    cpu.pl[DE] = cpu.pl[EX];
// INITIAL STATE AT EXECUTE
cpu.pl[EX] = cpu.pl[DE] + 1;
*dst1      = cpu.pl[EX];
if (cpu.pl[EX] < cpu.pl[ME])
    cpu.pl[EX] = cpu.pl[ME];
// INITIAL STATE AT MEMORY
cpu.pl[ME] = cpu.pl[EX] + 0;
*dst2      = cpu.pl[ME];
if (cpu.pl[ME] < cpu.pl[WB])
    cpu.pl[ME] = cpu.pl[WB];
// INITIAL STATE AT WRITEBACK
cpu.pl[WB] = cpu.pl[ME] + 1;
// FINAL PIPELINE STATE























Fig. 4. ENCORE 5-Stage Pipeline model example using final instruction from ARCompact™ basic
block depicted in Figure 2. It demonstrates the reconstruction of microarchitectural pipeline
state after the instruction has been executed. Bold red numbers denote changes to cycle-counts
for the respective pipeline stages, bold green numbers denote already committed cycle-counts.
3.2 Instruction Operand Dependencies and Side Effects
In order to determine when an instruction is ready to leave the decode stage it is necessary
to know when operands become available. For instructions that have side-effects
(i.e. modify the contents of a register) we need to remember when the side-effect will
become visible. The avail[REGS] array (see label ⑦ in Figure 2) encodes this information
for each operand.
When emitting calls to microarchitectural update functions our JIT DBT engine
passes source operand availability times and pointers to destination operand availability
locations determined during dependency analysis as parameters (see label ③ in Figure
2). This information is subsequently used to compute when an instruction can leave the
decode stage (see label ③ in Figure 3) and to record when side-effects become visible
in the execute and memory stage (see labels ④ and ⑤ in Figure 3). Because not all
instructions modify general purpose registers or have two source operands, there exist
several highly optimised versions of microarchitectural state update functions, and the
function outlined in Figure 3 demonstrates only one of several possible variants.
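The following sketch illustrates how operand dependencies might be mapped onto the parameters of an update function. It performs at run time what the JIT pass determines at translation time, and it assumes that results of load instructions become visible in the memory stage (dst2) while all other results become visible in the execute stage (dst1). The types Insn and UpdateArgs and the functions analyse and decode_ready are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REGS 32

typedef struct {
  int src1, src2;   /* source register numbers, -1 if unused */
  int dst;          /* destination register number, -1 if none */
  int is_load;      /* non-zero: result is produced in the memory stage */
} Insn;

typedef struct {
  uint64_t  opd1, opd2;   /* source availability times (0 = no dependency) */
  uint64_t *dst1, *dst2;  /* where execute and memory stage record visibility */
} UpdateArgs;

/* Map an instruction's operands onto entries of avail[]. Unused destination
   slots point at *ignore so that the update function needs no special cases. */
UpdateArgs analyse(uint64_t avail[REGS], uint64_t *ignore, const Insn *in) {
  assert(in->src1 < REGS && in->src2 < REGS && in->dst < REGS);
  UpdateArgs a;
  a.opd1 = in->src1 >= 0 ? avail[in->src1] : 0;
  a.opd2 = in->src2 >= 0 ? avail[in->src2] : 0;
  a.dst1 = (in->dst >= 0 && !in->is_load) ? &avail[in->dst] : ignore;
  a.dst2 = (in->dst >= 0 &&  in->is_load) ? &avail[in->dst] : ignore;
  return a;
}

/* Cycle at which an instruction that left fetch at cycle fe can leave decode:
   one cycle later, or when its latest source operand becomes available. */
uint64_t decode_ready(uint64_t fe, const UpdateArgs *a) {
  uint64_t t = fe + 1;
  if (a->opd1 > t) t = a->opd1;
  if (a->opd2 > t) t = a->opd2;
  return t;
}
```

If a load into r2 records availability at cycle 7 through dst2, a following instruction that reads r2 and left fetch at cycle 3 cannot leave decode before cycle 7, whereas an independent instruction leaves at cycle 4.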
3.3 Control Flow and Branch Prediction
When dealing with control flow operations (e.g. jump, branch, branch on compare)
special care must be taken to account for various types of penalties and speculative
execution. The ARCompact™ ISA allows for delay slot instructions and the ENCORE
processor and ARCSIM simulator support various static and dynamic branch prediction
schemes.
The code highlighted by label ④ in Figure 2 demonstrates how a branch penalty is
applied for a mispredicted branch. The pipeline penalty depends on the pipeline stage
when the branch outcome and target address are known (see target address availability
for BCC/JCC and BRCC/BBIT control flow instructions in Figure 3) and the availability
of a delay slot instruction. One also must take care of speculatively fetched and executed
instructions in case of a mispredicted branch.
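A simplified view of the penalty computation is sketched below. It generalises the statement cpu.pl[FE] = cpu.pl[ME] - 1 from Figure 2 to an arbitrary resolution stage and ignores delay slots and speculative fetches. The function names are hypothetical, and the static predictor shown (backward taken, forward not taken) is a common textbook scheme chosen only for illustration, not necessarily one of the schemes implemented by the ENCORE.

```c
#include <assert.h>
#include <stdint.h>

enum { FE, DE, EX, ME, WB, STAGES };

/* Assumed static predictor: backward branches are predicted taken. */
int predict_taken(int32_t offset) { return offset < 0; }

/* Apply the penalty of a branch whose outcome becomes known in stage
   `resolved`. On a misprediction the fetch stage cannot deliver the
   correct-path instruction before the branch has been resolved, so the fetch
   stage's cycle count is pushed forward. Returns the penalty in cycles. */
uint64_t branch_update(uint64_t pl[STAGES], int resolved,
                       int predicted, int taken) {
  assert(resolved > FE && resolved < STAGES);
  if (predicted == taken) return 0;              /* correct prediction */
  uint64_t redirect = pl[resolved] > 0 ? pl[resolved] - 1 : 0;
  if (redirect <= pl[FE]) return 0;              /* fetch is already later */
  uint64_t penalty = redirect - pl[FE];
  pl[FE] = redirect;
  return penalty;
}
```

With pl = {10, 11, 12, 13, 14} and a misprediction resolved in the memory stage, the fetch stage is moved from cycle 10 to cycle 12, a penalty of two cycles.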
3.4 Cache and Memory Model
Because cache misses and off-chip memory access latencies significantly contribute
towards the final cycle count, ARCSIM maintains a 100% accurate cache and memory
model. In its default configuration the ENCORE processor implements a pseudo-random
block replacement policy where the content of a shift register is used in order to deter-
mine a victim block for eviction. The rotation of the shift register must be triggered at




ISA specifies very flexible and powerful load/store
operations, memory access simulation is a critical aspect of high-speed full system sim-
ulations. [27] describes in more detail how memory access simulation is implemented
in ARCSIM so that accurate modelling of target memory semantics is preserved whilst
simulating load and store instructions at the highest possible rate.
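For illustration, a four-way set associative cache with shift-register based victim selection could be modelled roughly as follows. This sketch makes several assumptions: a 32-byte block size, a 16-bit linear feedback shift register, victim selection from its two low bits, and rotation of the register on every miss. None of these details are taken from the ENCORE design, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WAYS       4u
#define LINE_BYTES 32u                                  /* assumed block size */
#define SETS       (32u * 1024u / (WAYS * LINE_BYTES))  /* 32 KB -> 256 sets */

typedef struct {
  uint32_t tag[SETS][WAYS];
  uint8_t  valid[SETS][WAYS];
  uint16_t lfsr;              /* victim selection register, must be non-zero */
} Cache;

void cache_init(Cache *c, uint16_t seed) {
  assert(seed != 0);
  for (uint32_t s = 0; s < SETS; s++)
    for (uint32_t w = 0; w < WAYS; w++) c->valid[s][w] = 0;
  c->lfsr = seed;
}

/* One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13 and 11. */
static uint16_t lfsr_step(uint16_t x) {
  uint16_t bit = (uint16_t)((x ^ (x >> 2) ^ (x >> 3) ^ (x >> 5)) & 1u);
  return (uint16_t)((x >> 1) | (bit << 15));
}

/* Look up addr. On a miss, the way selected by the shift register is
   replaced. Returns the access latency in cycles (hit or miss latency). */
uint32_t cache_access(Cache *c, uint32_t addr,
                      uint32_t hit_cycles, uint32_t miss_cycles) {
  uint32_t set = (addr / LINE_BYTES) % SETS;
  uint32_t tag = addr / (LINE_BYTES * SETS);
  for (uint32_t w = 0; w < WAYS; w++)
    if (c->valid[set][w] && c->tag[set][w] == tag) return hit_cycles;
  uint32_t victim = c->lfsr % WAYS;
  c->lfsr = lfsr_step(c->lfsr);   /* assumption: rotate once per miss */
  c->tag[set][victim]   = tag;
  c->valid[set][victim] = 1;
  return miss_cycles;
}
```

The latency returned by cache_access is what a fetch() or memory-stage update would add to the corresponding pipeline stage. Because the victim depends on the register state, the model only reproduces hardware cycle counts if the register is advanced at exactly the same events as in hardware.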
Vendor & Model     HP™ COMPAQ™ dc7900 SFF
Number CPUs        1 (dual-core)
Processor Type     Intel© Core™2 Duo processor E8400
Clock Frequency    3 GHz
L1-Cache           32K Instruction/Data caches
L2-Cache           6 MB
FSB Frequency      1333 MHz
Table 1. Simulation Host Configuration.
4 Empirical Evaluation
We have extensively evaluated our cycle-accurate JIT DBT performance modelling ap-
proach and in this section we describe our experimental setup and methodology before
we present and discuss our results.
4.1 Experimental Setup and Methodology
We have evaluated our cycle-accurate JIT DBT simulation approach using the BIOPERF
benchmark suite that comprises a comprehensive set of computationally-intensive life
science applications [5]. We also used the industry standard EEMBC 1.1, and CORE-
MARK [36] embedded benchmark suites comprising applications from the automotive,
consumer, networking, office, and telecom domains.
All codes have been built with the ARC port of the GCC 4.2.1 compiler with full
optimisation enabled (i.e. -O3 -mA7). Each benchmark has been simulated in a stand-
alone manner, without an underlying operating system, to isolate benchmark behaviour
from background interrupts and virtual memory exceptions. Such system-related effects
are measured by including a Linux full-system simulation in the benchmarks.
The BIOPERF benchmarks were run with “class-A” input data-sets available from
the BIOPERF web site. The EEMBC 1.1 and COREMARK benchmarks were configured
using large iteration counts to execute at least 10^9 instructions. All benchmarks were
simulated until completion. The Linux benchmark consisted of simulating the boot-up
and shut-down sequence of a Linux kernel configured to run on a typical embedded
ARC700 system with two interrupting timers, a console UART, and a paged virtual
memory system.
Our main interest is simulation speed, therefore we have measured the
maximum possible simulation speed in MIPS using various simulation modes (FPGA
speed vs. cycle-accurate interpretive mode vs. cycle-accurate JIT DBT mode, see Figures
5, 6, 7 and 8). Table 2 lists the configuration details of our simulator and target
processor. All measurements were performed on an x86 desktop computer detailed in
Table 1 under conditions of low system load. When comparing ARCSIM simulation
speeds to FPGA implementations shown in Figures 5, 6, 7 and 8, we used a XILINX
VIRTEX5 XC5VFX70T (speed grade 1) FPGA clocked at 50 MHz.
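The speeds compared in this section are instruction rates. As a small worked example of how such rates relate to cycle counts, the helper functions below compute cycles per instruction (CPI), the instruction rate of hardware running at a given clock frequency, and the instruction rate of a simulator from its wall-clock time. The function names are hypothetical, and deriving a hardware MIPS figure as clock frequency divided by CPI is an assumption about how such figures can be obtained, not a description of the measurement procedure used here.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per instruction of a program run. */
double cpi(uint64_t cycles, uint64_t instructions) {
  assert(instructions > 0);
  return (double)cycles / (double)instructions;
}

/* Instruction rate in MIPS of hardware clocked at clock_mhz for a given CPI. */
double hw_mips(double clock_mhz, double cpi_value) {
  assert(cpi_value > 0.0);
  return clock_mhz / cpi_value;
}

/* Instruction rate in MIPS of a simulator that executed `instructions`
   target instructions in `seconds` of wall-clock time. */
double sim_mips(uint64_t instructions, double seconds) {
  assert(seconds > 0.0);
  return (double)instructions / seconds / 1e6;
}
```

For example, a program with a CPI of 2.0 runs at 25 MIPS on a core clocked at 50 MHz, so a simulator that sustains more than 25 MIPS on the same program is faster than that hardware.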




















































































































































Benchmark ISS JIT DBT 
Mode











































42.76 27.65 1661673708 1182378613 1.405 35.58 249.05 14.32 82.56 1717955864 3.28% 3.28%
44.88 24.44 2035127484 1097009731 1.855 26.95 188.66 14.42 76.07 2011085954 -1.20% 1.20%
36.73 31.82 2364356106 1168832029 2.023 24.72 173.02 15.05 77.66 2347298145 -0.73% 0.73%
44.00 23.91 1935893704 1051923424 1.840 27.17 190.18 14.20 74.06 1912428735 -1.23% 1.23%
55.43 21.09 2797405413 1169362673 2.392 20.90 146.31 16.90 69.18 2797390606 -0.00% 0.00%
35.20 33.19 1685823346 1168152409 1.443 34.65 242.52 13.30 87.82 1703736603 1.05% 1.05%
58.94 19.50 2256182622 1149063994 1.963 25.46 178.25 15.92 72.19 2256176777 -0.00% 0.00%
24.51 46.79 1938134229 1146599000 1.690 29.58 207.06 13.97 82.06 1921162121 -0.88% 0.88%
34.89 38.47 2798103201 1342104542 2.085 23.98 167.88 13.55 99.02 2780017320 -0.65% 0.65%
36.03 36.50 1733677905 1315041407 1.318 37.93 265.48 13.88 94.76 1708436543 -1.48% 1.48%
46.43 331.25 35286971472 15379979678 2.294 21.79 152.55 13.64 1127.48 35134308222 -0.43% 0.43%
58.65 19.75 1517042138 1158328760 1.310 38.18 267.24 14.65 79.05 1517023038 -0.00% 0.00%
55.33 31.40 2905044798 1737243150 1.672 29.90 209.30 14.84 117.09 2783844800 -4.35% 4.35%
[Figure: simulation rate (in MIPS) per benchmark. Series: ISS Interpretive Mode, Speed Optimised
FPGA, ISS JIT DBT Mode. Benchmarks: a2time01, aifftr01, aifirf01, aiifft01, autcor00, basefp01,
bezier01, bitmnp01, cacheb01, canrdr01, cjpeg, conven00, dither01, djpeg, fbital00, fft00,
idctrn01, iirflt01, matrix01, ospf, pktflow, pntrch01, puwmod01, rgbcmy01, rgbhpg01, rgbyiq01,
rotate01, routelookup, rspeed01, tblook01, text01, ttsprk01, viterb00, coremark.]
Fig. 5. 5-Stage Pipeline - Simulation rate (in MIPS) using EEMBC and COREMARK benchmarks
comparing (a) ISS interpretive cycle-accurate simulation mode, (b) speed-optimised FPGA
implementation, and (c) our novel ISS JIT DBT cycle-accurate simulation mode.
[Figure: simulation rate (in MIPS) per BIOPERF benchmark. Series: ISS Interpretive Mode, Speed
Optimised FPGA, ISS JIT DBT Mode.]
Fig. 6. 5-Stage Pipeline - Simulation rate (in MIPS) using the BIOPERF benchmarks comparing
(a) ISS interpretive cycle-accurate simulation mode, (b) speed-optimised FPGA implementation,
and (c) our novel ISS JIT DBT cycle-accurate simulation mode.
Processor Microarchitecture        ENCORE

Register Set                       32 baseline registers

Bus Width/Latency/Clock Divisor    32-bit/16 cycles/2
Instruction Set Simulator          ARCSIM
Simulator                          Full-system, cycle-accurate
JIT Compiler                       LLVM 2.7
I/O & System Calls                 Emulated
Table 2. Configuration and setup of simulated target microarchitectures and the ISS. FPGA and
ASIP implementations of the outlined microarchitectures were used for verification.
4.2 Simulation Speed
We first discuss the simulation speed-up achieved by our novel cycle-accurate JIT
DBT simulation mode over a verified cycle-accurate interpretive simulation mode
for a 5-stage processor pipeline variant, as this has been the primary motivation
of our work. We then outline results for a different pipeline variant, namely the
7-stage pipeline version of the ENCORE. A summary of our results is shown in Figures
5, 6, 7, and 8.
For EEMBC and COREMARK benchmarks (Figure 5) our proposed cycle-accurate
JIT DBT simulation mode for the 5-stage pipeline variant is more than three times as
fast on average (45 MIPS) as the verified cycle-accurate interpretive mode (14 MIPS). It
even outperforms a speed-optimised FPGA implementation of the ENCORE processor
(30 MIPS) clocked at 50 MHz. For some benchmarks (e.g. autcor00, bezier01,
cjpeg, djpeg, rgbcmy01, rgbhpg01, routelookup) our new cycle-accurate
JIT DBT mode is more than twice as fast as the speed-optimised FPGA implementation.
This can be explained by the fact that those benchmarks contain sequences of
instructions that map particularly well onto the simulation host ISA. Furthermore,
frequently executed blocks in these benchmarks contain instructions with fewer
dependencies, resulting in the generation and execution of simpler microarchitectural
state update functions.
Our cycle-accurate JIT DBT simulation achieves an average simulation rate of 26
MIPS for the computationally-intensive life science application programs from the
BIOPERF benchmark suite (Figure 6), again outperforming the previously outlined
speed-optimised FPGA implementation (22 MIPS). For the fasta-ssearch benchmark
our JIT DBT mode is more than 6 times faster than the speed-optimised FPGA, which is
due to a relatively high cycles per instruction (CPI) metric of 7 that lowers the
instruction rate of the FPGA at its fixed clock frequency. For the hmmsearch
benchmark our JIT DBT cycle-accurate simulation is slightly slower than interpretive
cycle-accurate simulation. This is entirely due to the shorter runtime and abundance of
application hotspots keeping the JIT DBT engine very busy, resulting in a slowdown due to
JIT compilation overheads.
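The link between CPI and the instruction rate of the FPGA can be made explicit with a short calculation. The following is only a sketch: it assumes that the FPGA retires instructions at its clock frequency divided by the CPI, it uses the 50 MHz clock and the rounded CPI of 7 quoted above, and the function names are ours rather than part of ARCSIM.

```python
def hardware_mips(clock_mhz: float, cpi: float) -> float:
    """Instruction rate in MIPS of a core clocked at clock_mhz with the given CPI."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_mhz / cpi


def required_sim_mips(clock_mhz: float, cpi: float, factor: float) -> float:
    """Simulation rate needed to be `factor` times as fast as the hardware."""
    return factor * hardware_mips(clock_mhz, cpi)


if __name__ == "__main__":
    fpga = hardware_mips(50.0, 7.0)             # assumed: 50 MHz clock, CPI of 7
    needed = required_sim_mips(50.0, 7.0, 6.0)  # threshold for a 6x advantage
    print(f"FPGA: {fpga:.1f} MIPS, 6x threshold: {needed:.1f} MIPS")
```

Under these assumptions the FPGA executes roughly 7 MIPS for such a benchmark, so any simulation rate above roughly 43 MIPS is already more than six times faster.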
For EEMBC and COREMARK benchmarks our cycle-accurate JIT DBT simulation
mode for the 7-stage pipeline variant (Figure 7) is more than twice as fast on average
(33 MIPS) as the verified cycle-accurate interpretive mode (13 MIPS). Again it
outperforms the speed-optimised FPGA implementation of the 7-stage ENCORE processor
variant (28 MIPS) clocked at 50 MHz. For some benchmarks (e.g. autcor00, djpeg,
rgbcmy01) our new cycle-accurate JIT DBT mode is almost twice as fast as the
speed-optimised FPGA implementation. Average BIOPERF benchmark simulation rate figures
for the 7-stage pipeline (Figure 8) demonstrate that our cycle-accurate JIT DBT (22
MIPS) once more outperforms a speed-optimised FPGA implementation (21 MIPS) and
is about twice as fast as cycle-accurate interpretive simulation (11 MIPS).
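The average rates quoted in this section can be turned into speed-up factors with a few lines of code. This is a minimal sketch: the table holds only the rounded averages stated in the text (the interpretive average for BIOPERF on the 5-stage pipeline is not stated and is therefore left as None), and the names AVERAGE_MIPS, speedup and summarise are ours.

```python
# Average simulation rates in MIPS as quoted in the text of this section.
# None marks a value that the text does not state.
AVERAGE_MIPS = {
    ("EEMBC/CoreMark", "5-stage"): {"jit": 45, "fpga": 30, "interp": 14},
    ("BioPerf", "5-stage"): {"jit": 26, "fpga": 22, "interp": None},
    ("EEMBC/CoreMark", "7-stage"): {"jit": 33, "fpga": 28, "interp": 13},
    ("BioPerf", "7-stage"): {"jit": 22, "fpga": 21, "interp": 11},
}


def speedup(rate, baseline):
    """Ratio rate/baseline, or None if either value is unavailable."""
    if rate is None or baseline is None:
        return None
    return rate / baseline


def summarise(table):
    """Return {config: (speed-up over interpretive, speed-up over FPGA)}."""
    return {
        config: (speedup(r["jit"], r["interp"]), speedup(r["jit"], r["fpga"]))
        for config, r in table.items()
    }


if __name__ == "__main__":
    for (suite, pipeline), (vs_interp, vs_fpga) in summarise(AVERAGE_MIPS).items():
        interp_txt = "n/a" if vs_interp is None else f"{vs_interp:.2f}x"
        print(f"{suite:15s} {pipeline}: vs interpretive {interp_txt}, "
              f"vs FPGA {vs_fpga:.2f}x")
```

Because the inputs are rounded averages, the resulting factors are indicative only; per-benchmark ratios vary considerably around them.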
For the introductory sample application performing AAC decoding and playback of
Mozart’s Requiem outlined in Section 1, our cycle-accurate JIT DBT mode is capable
of simulating at a sustained rate of 31 MIPS (7-stage pipeline) and 36 MIPS (5-stage
pipeline), enabling real-time simulation. For the boot-up and shutdown sequence of a
Linux kernel our fast cycle-accurate JIT DBT simulation mode achieves 12 MIPS for
both pipeline variants, resulting in a highly responsive interactive environment. These
examples clearly demonstrate that ARCSIM is capable of simulating system-related
effects such as interrupts and virtual memory exceptions efficiently while still
providing full microarchitectural observability.
Our profiling simulation mode is orthogonal to all of the above simulation modes.
Note that for all performance results full profiling was enabled (including dynamic
instruction execution profiling, per instruction latency distributions, detailed cache
statistics, executed delay slot instructions, as well as various branch predictor statistics).
5 Related Work
Previous work on high-speed instruction set simulation has tended to focus on compiled
and hybrid mode simulators. Whilst an interpretive simulator spends most of its time
repeatedly fetching and decoding target instructions, a compiled simulator fetches and
decodes each instruction once, spending most of its time performing the operations.
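The difference in decode work can be illustrated with a toy example. This sketch is not taken from any of the cited simulators: the three-instruction textual ISA, the register layout and all function names are hypothetical, and a simple decode cache stands in for the ahead-of-time translation that a real compiled simulator performs.

```python
from typing import Dict, List, Tuple

Decoded = Tuple[str, int, int, int]  # (opcode, dst or branch target, a, b or immediate)


def decode(word: str) -> Decoded:
    """Parse one textual instruction of the hypothetical toy ISA, e.g. 'addi 1 1 1'."""
    op, dst, a, b = word.split()
    return op, int(dst), int(a), int(b)


def execute(ins: Decoded, regs: List[int], pc: int) -> int:
    """Apply one decoded instruction to the register file and return the next pc."""
    op, dst, a, b = ins
    if op == "addi":
        regs[dst] = regs[a] + b
    elif op == "add":
        regs[dst] = regs[a] + regs[b]
    elif op == "bne":  # branch to absolute target dst if regs[a] != regs[b]
        return dst if regs[a] != regs[b] else pc + 1
    else:
        raise ValueError(f"unknown opcode {op}")
    return pc + 1


def run(program: List[str], regs: List[int], predecode: bool) -> int:
    """Run until pc leaves the program; return the number of decode() calls."""
    decodes = 0
    cache: Dict[int, Decoded] = {}
    pc = 0
    while 0 <= pc < len(program):
        if predecode and pc in cache:
            ins = cache[pc]          # reuse earlier decode work
        else:
            ins = decode(program[pc])
            decodes += 1
            if predecode:
                cache[pc] = ins
        pc = execute(ins, regs, pc)
    return decodes


PROGRAM = [
    "addi 1 1 1",  # 0: r1 += 1 (loop counter)
    "add 2 2 1",   # 1: r2 += r1 (accumulator)
    "bne 0 1 3",   # 2: if r1 != r3 goto 0
]

if __name__ == "__main__":
    for mode in (False, True):
        regs = [0, 0, 0, 10]  # r3 holds the loop bound
        print("predecode" if mode else "interpretive", run(PROGRAM, regs, mode), regs[2])
```

Both modes compute the same architectural state; only the number of decode operations differs, growing with the dynamic instruction count in the interpretive case and with the static instruction count otherwise.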
5.1 Fast Instruction Set Simulation
A statically-compiled simulator [18] which employed in-line macro expansion was
shown to run up to three times faster than an interpretive simulator. Target code is
statically translated to host machine code which is then executed directly within a switch
statement.
[Figure: Cycle Accurate Simulation Rate EEMBC and CoreMark - Small and Long Running Embedded
Benchmarks. Series: ISS Interpretive Mode, Speed Optimised FPGA, ISS JIT DBT Mode.]
Fig. 7. 7-Stage Pipeline - Simulation rate (in MIPS) using EEMBC and COREMARK benchmarks
comparing (a) ISS interpretive cycle-accurate simulation mode, (b) speed-optimised FPGA
implementation, and (c) our novel ISS JIT DBT cycle-accurate simulation mode.
[Figure: simulation rate (in MIPS) per BIOPERF benchmark. Series: ISS Interpretive Mode, Speed
Optimised FPGA, ISS JIT DBT Mode.]
Fig. 8. 7-Stage Pipeline - Simulation rate (in MIPS) using the BIOPERF benchmarks comparing
(a) ISS interpretive cycle-accurate simulation mode, (b) speed-optimised FPGA implementation,
and (c) our novel ISS JIT DBT cycle-accurate simulation mode.
Dynamic translation techniques are used to overcome the lack of flexibility inherent
in statically-compiled simulators. The MIMIC simulator [17] simulates IBM SYSTEM/370
instructions on the IBM RT PC and translates groups of target basic blocks
into host instructions. SHADE [9] and EMBRA [28] use DBT with translation caching
techniques in order to increase simulation speeds. The Ultra-fast Instruction Set
Simulator [30] improves the performance of statically-compiled simulation by using low-level
binary translation techniques to take full advantage of the host architecture.
Just-In-Time Cache Compiled Simulation (JIT-CCS) [21] executes and caches pre-compiled
instruction-operation functions for each instruction fetched. The Instruction Set
Compiled Simulation (IC-CS) simulator [25] was designed to be a high performance
and flexible functional simulator. To achieve this, the time-consuming instruction decode
process is performed during the compile stage, whilst interpretation is enabled at
simulation time. The SIMICS [25] full system simulator translates the target machine-code
instructions into an intermediate format before interpretation. During simulation the
intermediate instructions are processed by the interpreter, which calls the corresponding
service routines. QEMU [3] is a fast simulator which uses an original dynamic translator.
Each target instruction is divided into a simple sequence of micro-operations, the set
of micro-operations having been pre-compiled offline into an object file. During
simulation the code generator accesses the object file and concatenates micro-operations to
form a host function that emulates the target instructions within a block. More recent
approaches to JIT DBT ISS are presented in [24,27,6,15,7]. Apart from different target
platforms these approaches differ in the granularity of translation units (basic blocks vs
pages or CFG regions) and their JIT code generation target language (ANSI-C vs LLVM
IR).
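The micro-operation scheme described above, combined with a translation cache, can be sketched as follows. This is an illustration under stated assumptions and not QEMU's actual implementation: Python closures stand in for pre-compiled machine code fragments, the single supported instruction (addi), the temporary register t0 and the class TranslationCache are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
MicroOp = Callable[[State], None]
TargetIns = Tuple[str, str, str, int]  # (opcode, dst, src, immediate)


# "Pre-compiled" micro-operations. Assumption: closures replace the machine
# code fragments that a real translator would copy from an object file.
def mov_t0_reg(reg: str) -> MicroOp:
    def op(s: State) -> None:
        s["t0"] = s[reg]
    return op


def add_t0_imm(imm: int) -> MicroOp:
    def op(s: State) -> None:
        s["t0"] = s["t0"] + imm
    return op


def mov_reg_t0(reg: str) -> MicroOp:
    def op(s: State) -> None:
        s[reg] = s["t0"]
    return op


def micro_ops_for(ins: TargetIns) -> List[MicroOp]:
    """Split one target instruction into a simple sequence of micro-operations."""
    opcode, dst, src, imm = ins
    if opcode != "addi":
        raise ValueError(f"unsupported opcode {opcode}")
    return [mov_t0_reg(src), add_t0_imm(imm), mov_reg_t0(dst)]


def concatenate(ops: List[MicroOp]) -> MicroOp:
    """Form one host function that runs the given micro-operations in order."""
    def host_fn(state: State) -> None:
        for op in ops:
            op(state)
    return host_fn


class TranslationCache:
    """Translate a block on first use and reuse the host function afterwards."""

    def __init__(self, blocks: Dict[int, List[TargetIns]]) -> None:
        self.blocks = blocks  # block start address -> target instructions
        self.cache: Dict[int, MicroOp] = {}
        self.translations = 0

    def lookup(self, pc: int) -> MicroOp:
        if pc not in self.cache:
            ops = [u for ins in self.blocks[pc] for u in micro_ops_for(ins)]
            self.cache[pc] = concatenate(ops)
            self.translations += 1
        return self.cache[pc]


if __name__ == "__main__":
    tc = TranslationCache({0x100: [("addi", "r1", "r1", 2), ("addi", "r2", "r1", 5)]})
    state = {"r1": 0, "r2": 0, "t0": 0}
    for _ in range(3):
        tc.lookup(0x100)(state)
    print(state, tc.translations)
```

The block is translated once and executed three times, which is the effect that translation caching relies on for frequently executed code.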
The commercial XISS simulator [35] employs JIT DBT technology and
targets the same ARCompact™ ISA that has been used in this paper. It achieves
simulation speeds of 200+ MIPS. In contrast, ARCSIM operates at 500+ MIPS [7] in functional
simulation mode.
5.2 Performance Modelling in Fast Instruction Set Simulators
A dynamic binary translation approach to architectural simulation has been introduced
in [8]. The POWERPC ISA is dynamically mapped onto PISA in order to take advantage
of the underlying SIMPLESCALAR [31] timing model. While this approach enables
hardware design space exploration it does not provide a faithful performance model for
any actual POWERPC implementation.
Most relevant to our work is the performance estimation approach in the HYSIM hy-
brid simulation environment [11,16,12,13]. HYSIM merges native host execution with
detailed ISS. For this, an application is partitioned and operation cost annotations are
introduced to a low-level intermediate representation (IR). HYSIM “imitates” the op-
eration of an optimising compiler and applies generic code transformations that are
expected to be applied in the actual compiler targeting the simulation platform.
Furthermore, calls to stub functions are inserted into the code to handle accesses to
data managed in the ISS, which also hosts the cache model. We believe there are a
number of shortcomings in this approach: First, no executable for the target platform
is ever generated and, hence, the simulated code is only an approximation of what the
actual target
compiler would generate. Second, no detailed pipeline model is maintained. Hence,
cost annotations do not reflect actual instruction latencies and dependencies between
instructions, but assume fixed average instruction latencies. Even for relatively simple,
non-superscalar processors this assumption does not hold. Furthermore, HYSIM has
only been evaluated against an ISS that does not implement a detailed pipeline model.
Hence, accuracy figures reported in e.g. [12] only refer to how close performance es-
timates come to those obtained by this ISS, but it is unclear if these figures accurately
reflect the actual target platform. Finally, only very few benchmarks have been
evaluated. A similar hybrid approach targeting software energy estimation has been proposed
earlier in [19,20].
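The point about fixed average instruction latencies can be made concrete with a small worked example. The sketch below is a toy model, not the pipeline model of ARCSIM or the cost annotation scheme of HYSIM: the Instr record, the chosen latencies (3 cycles for a load, 1 for an add) and the average cost of 2 cycles are all assumptions for illustration. It compares a fixed-cost estimate with a single-issue, in-order model in which an instruction waits until its source operands are available.

```python
from typing import Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    dst: str                 # destination register
    srcs: Tuple[str, ...]    # source registers
    latency: int             # cycles until the result is available


def fixed_cost_estimate(block: List[Instr], avg_cost: int) -> int:
    """Annotation-style estimate: every instruction costs the same."""
    return len(block) * avg_cost


def interlocked_estimate(block: List[Instr]) -> int:
    """Single-issue, in-order model: an instruction issues at least one cycle
    after its predecessor and not before all of its operands are ready."""
    ready: Dict[str, int] = {}   # register -> cycle its value becomes available
    issue = 0
    finish = 0
    for n, ins in enumerate(block):
        earliest = issue + 1 if n else 0
        issue = max([earliest] + [ready.get(r, 0) for r in ins.srcs])
        ready[ins.dst] = issue + ins.latency
        finish = max(finish, ready[ins.dst])
    return finish


load_r1 = Instr("r1", ("sp",), 3)
load_r3 = Instr("r3", ("sp",), 3)
add_r2 = Instr("r2", ("r1",), 1)
add_r4 = Instr("r4", ("r3",), 1)

# Same instruction mix, two different orderings.
dependent = [load_r1, add_r2, load_r3, add_r4]   # each add stalls on its load
scheduled = [load_r1, load_r3, add_r2, add_r4]   # loads overlap with each other

fixed_dependent = fixed_cost_estimate(dependent, avg_cost=2)
fixed_scheduled = fixed_cost_estimate(scheduled, avg_cost=2)
model_dependent = interlocked_estimate(dependent)
model_scheduled = interlocked_estimate(scheduled)
```

Under these assumptions the fixed-cost estimate assigns 8 cycles to both orderings, whereas the dependency-aware model yields 8 cycles for the dependent ordering and 5 cycles for the scheduled one. The fixed-cost figure cannot distinguish the two because it ignores operand dependencies.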
Statistical performance estimation methodologies such as SIMPOINT and SMARTS
have been proposed in [14,29]. The approaches are potentially very fast, but require pre-
processing (SIMPOINT) of an application and do not accurately model the microarchi-
tecture (SMARTS, SIMPOINT). Unlike our accurate pipeline modelling this introduces
a statistical error that cannot be entirely avoided.
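The source of the statistical error can be sketched with a toy example of periodic sampling. This is only an illustration in the spirit of sampling-based estimation, not the SMARTS or SIMPOINT methodology from [14,29]: the per-instruction cycle trace, the unit size and the sampling period are hypothetical values chosen to expose the effect.

```python
from typing import List


def exact_cpi(cycles: List[int]) -> float:
    """Cycles per instruction when every instruction is modelled in detail."""
    return sum(cycles) / len(cycles)


def sampled_cpi(cycles: List[int], unit: int, period: int) -> float:
    """Model only one unit of `unit` instructions out of every `period` units
    in detail and extrapolate the result to the whole trace."""
    measured: List[int] = []
    for start in range(0, len(cycles), unit * period):
        measured.extend(cycles[start:start + unit])
    return sum(measured) / len(measured)


# Hypothetical trace: 910 single-cycle instructions followed by a burst of
# 90 instructions that take 5 cycles each (for example, a run of cache misses).
trace = [1] * 910 + [5] * 90

full = exact_cpi(trace)
estimate = sampled_cpi(trace, unit=10, period=10)
```

With this trace the sampled units start at instructions 0, 100, ..., 900 and all miss the burst, so the estimate is 1.0 cycles per instruction while the exact value is 1.36. Real sampling schemes reduce such errors with more careful sample selection, but they cannot remove them entirely.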
Machine learning based performance models have been proposed in [2,4,22] and,
more recently, more mature approaches have been presented in [10,23]. After initial
training these performance estimation methodologies can achieve very high simula-
tion rates that are only limited by the speed of faster, functional simulators. Similar
to SMARTS and SIMPOINT, however, these approaches suffer from inherent statistical
errors and the reliable detection of statistical outliers is still an unsolved problem.
6 Summary and Conclusions
We have demonstrated that our approach to cycle-accurate ISS easily surpasses speed-
optimised FPGA implementations whilst providing detailed architectural and microar-
chitectural profiling feedback and statistics. Our main contribution is a simple yet pow-
erful software pipeline model in conjunction with an instruction operand dependency
and side-effect analysis pass integrated into a JIT DBT ISS enabling ultra-fast simula-
tion speeds without compromising microarchitectural observability. Our cycle-accurate
microarchitectural modelling approach is portable and independent of the implemen-
tation of a functional ISS. More importantly, it is capable of capturing even complex
interlocked processor pipelines. Because our novel pipeline modelling approach is
microarchitecture adaptable and decouples the performance model in the ISS from
functional simulation, it can be automatically generated from ADL specifications.
In future work we plan to improve and optimise JIT generated code that performs
microarchitectural performance model updates and show that fast cycle-accurate multi-
core simulation is feasible with our approach.
References
1. David August, Jonathan Chang, Sylvain Girbal, Daniel Gracia Perez, Gilles Mouchard,
David Penry, Olivier Temam, and Neil Vachharajani. UNISIM: An Open Simulation En-
vironment and Library for Complex Architecture Design and Collaborative Development.
IEEE Computer Architecture Letters, 20 Aug (2007).
2. J. R. Bammi, E. Harcourt, W. Kruijtzer, L. Lavagno, and M. T. Lazarescu. Software per-
formance estimation strategies in a system-level design tool. In Proceedings of CODES’00,
(2000).
3. F. Bellard. QEMU, a fast and portable dynamic translator. Proceedings of the Annual Con-
ference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, p.
41, (2005).
4. G. Bontempi and W. Kruijtzer. A data analysis method for software performance prediction.
DATE’02: Proceedings of the Conference on Design, Automation and Test in Europe, (2002).
5. D. Bader, Y. Li, T. Li, and V. Sachdeva. BioPerf: A Benchmark Suite to Evaluate High-
Performance Computer Architecture on Bioinformatics Applications. In: Proceedings of the
IEEE International Symposium on Workload Characterization (IISWC’05), pp. 163–173,
2005.
6. Florian Brandner, Andreas Fellnhofer, Andreas Krall, and David Riegler. Fast and Accurate
Simulation using the LLVM Compiler Framework. RAPIDO’09: 1st Workshop on Rapid
Simulation and Performance Evaluation: Methods and Tools (2009) pp. 1-6.
7. Igor Böhm, Björn Franke, and Nigel Topham. Cycle-Accurate Performance Modelling in an
Ultra-Fast Just-In-Time Dynamic Binary Translation Instruction Set Simulator. In: Proceed-
ings of the International Symposium on Systems, Architectures, Modeling, and Simulation
(SAMOS’10), Samos, Greece, (2010)
8. H.W. Cain, K.M. Lepak, and M.H. Lipasti. A dynamic binary translation approach to archi-
tectural simulation. SIGARCH Computer Architecture News, Vol. 29, No. 1, March (2001).
9. B. Cmelik, and D. Keppel. Shade: A Fast Instruction-Set Simulator for Execution Profiling.
Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of
Computer Systems, pp. 128–137, ACM Press, New York, (1994).
10. Björn Franke. Fast cycle-approximate instruction set simulation. SCOPES’08: Proceedings
of the 11th international workshop on Software & compilers for embedded systems (2008).
11. Lei Gao, Stefan Kraemer, Rainer Leupers, Gerd Ascheid, Heinrich Meyr. A fast and generic
hybrid simulation approach using C virtual machine. CASES’07: Proceedings of the inter-
national conference on Compilers, architecture, and synthesis for embedded systems (2007).
12. Lei Gao, Stefan Kraemer, Kingshuk Karuri, Rainer Leupers, Gerd Ascheid, and Heinrich
Meyr. An Integrated Performance Estimation Approach in a Hybrid Simulation Framework.
MOBS’08: Annual Workshop on Modelling, Benchmarking and Simulation (2008).
13. Lei Gao, Kingshuk Karuri, Stefan Kraemer, Rainer Leupers, Gerd Ascheid, and Heinrich
Meyr. Multiprocessor performance estimation using hybrid simulation. DAC’08: Proceed-
ings of the 45th annual Design Automation Conference (2008).
14. G. Hamerly, E. Perelman, J. Lau, and B. Calder. SIMPOINT 3.0: Faster and more flexible
program analysis. MOBS’05: Proceedings of Workshop on Modelling, Benchmarking and
Simulation, (2005).
15. Daniel Jones and Nigel Topham. High Speed CPU Simulation Using LTU Dynamic Binary
Translation. Lecture Notes In Computer Science (2009) vol. 5409.
16. Stefan Kraemer, Lei Gao, Jan Weinstock, Rainer Leupers, Gerd Ascheid, and Hein-
rich Meyr. HySim: a fast simulation framework for embedded software development.
CODES+ISSS’07: Proceedings of the 5th IEEE/ACM international conference on Hard-
ware/software codesign and system synthesis (2007).
17. C. May. MIMIC: A Fast System/370 Simulator. SIGPLAN: Papers of the Symposium on
Interpreters and Interpretive Techniques, pp. 1–13, ACM Press, New York, (1987).
18. C. Mills, S.C. Ahalt, J. Fowler. Compiled Instruction Set Simulation. Software: Practice and
Experience, 21(8), pp. 877 – 889, (1991).
19. A. Muttreja, A. Raghunathan, S. Ravi, and N.K. Jha. Hybrid simulation for embedded soft-
ware energy estimation. DAC’05: Proceedings of the 42nd Annual Conference on Design
Automation, pp. 23–26, ACM Press, New York, (2005).
20. A. Muttreja, A. Raghunathan, S. Ravi, and N.K. Jha. Hybrid simulation for energy estimation
of embedded software. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, (2007).
21. A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A Universal
Technique for Fast and Flexible Instruction-Set Architecture Simulation. DAC’02: Proceed-
ings of the 39th Conference on Design Automation, pp. 22–27, ACM Press, New York,
(2002).
22. M. S. Oyamada, F. Zschornack, and F. R. Wagner. Accurate software performance estimation
using domain classification and neural networks. In Proceedings of SBCCI’04, (2004).
23. Daniel Powell and Björn Franke. Using continuous statistical machine learning to enable
high-speed performance prediction in hybrid instruction-/cycle-accurate instruction set sim-
ulators. CODES+ISSS’09: Proceedings of the 7th IEEE/ACM international conference on
Hardware/software codesign and system synthesis, (2009).
24. W. Qin, J. D’Errico, and X. Zhu. A Multiprocessing Approach to Accelerate Retargetable
and Portable Dynamic-Compiled Instruction-Set Simulation. CODES-ISSS’06: Proceedings
of the 4th International Conference on Hardware/Software Codesign and System Synthesis,
pp. 193–198, ACM Press, New York, (2006).
25. M. Reshadi, P. Mishra, and N. Dutt. Instruction Set Compiled Simulation: A Technique for
Fast and Flexible Instruction Set Simulation. Proceedings of the 40th Conference on Design
Automation, pp. 758–763, ACM Press, New York, (2003).
26. O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, and H. Meyr. Architecture Implementation
Using the Machine Description Language LISA. ASP-DAC’02: Proceedings of the Asia and
South Pacific Design Automation Conference, Washington, DC, USA, (2002).
27. Nigel Topham and Daniel Jones. High Speed CPU Simulation using JIT Binary Translation.
MOBS’07: Annual Workshop on Modelling, Benchmarking and Simulation (2007).
28. E. Witchel, and M. Rosenblum. Embra: Fast and Flexible Machine Simulation. In:
Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and
Modeling of Computer Systems, pp. 68–79, ACM Press, New York, (1996).
29. R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: accelerating microar-
chitecture simulation via rigorous statistical sampling. ISCA’03: Proceedings of the 30th
Annual International Symposium on Computer Architecture (ISCA), (2003).
30. J. Zhu, and D.D. Gajski. A Retargetable, Ultra-Fast Instruction Set Simulator. DATE’99:
Proceedings of the Conference on Design, Automation and Test in Europe, p. 62, ACM Press,
New York, (1999).
31. Doug Burger and Todd Austin. The SimpleScalar tool set, version 2.0. SIGARCH Computer
Architecture News (1997) vol. 25 (3).
32. ARCompact™ Instruction Set Architecture. Synopsys Inc.
http://www.synopsys.com/IP/ConfigurableCores/ARCProcessors/, retrieved 05 November 2010.
33. ENCORE Embedded Processor. http://groups.inf.ed.ac.uk/pasta/hw_encore.html,
retrieved 05 November 2010.
34. ARCSIM Instruction Set Simulator. http://groups.inf.ed.ac.uk/pasta/tools_arcsim.html,
retrieved 05 November 2010.
35. XISS Simulator. Synopsys Inc. http://www.synopsys.com/dw/ipdir.php?ds=sim_xiss,
retrieved 10 February 2010.
36. The Embedded Microprocessor Benchmark Consortium: EEMBC Benchmark Suite.
http://www.eembc.org
