Employing Simulation to Facilitate the Design of Dynamic Code Generators by Rosario, Vanderson Martins do et al.
Employing Simulation to Facilitate the Design of
Dynamic Code Generators
Vanderson M. do Rosario1 Raphael Zinsly1,2
Sandro Rigo1 Edson Borin1
1Institute of Computing - UNICAMP - Brazil
2IBM - Campinas - Brazil
September 1, 2020
Abstract
Dynamic Translation (DT) is a sophisticated technique that allows the
implementation of high-performance emulators and high-level-language
virtual machines. In this technique, the guest code is compiled dynami-
cally at runtime. Consequently, achieving good performance depends on
several design decisions, including the shape of the regions of code being
translated. Researchers and engineers explore these decisions to bring the
best performance possible. However, a real DT engine is a very sophis-
ticated piece of software, and modifying one is a hard and demanding
task. Hence, we propose using simulation to evaluate the impact of design
decisions on dynamic translators and present RAIn, an open-source DT
simulator that facilitates testing DT design decisions, such as Region
Formation Techniques (RFTs). RAIn outputs several statistics that support
the analysis of how design decisions may affect the behavior and the
performance of a real DT. We validated RAIn by running a set of experiments
with six well-known RFTs (NET, MRET2, LEI, NETPlus, NET-r, and NETPlus-e-r)
and showed that it can reproduce well-known results from the literature
without the effort of implementing the RFTs in a real and complex dynamic
translation engine.
1 Introduction
Emulators and high-level-language virtual machines compile applications’ code
during their execution. This approach, known as Just-In-Time (JIT) compi-
lation or Dynamic Translation (DT), is a concept that is as old as high-level
programming languages themselves [1]. DT techniques can be used to improve
execution time and space efficiency of programs [2] and to support programming
language versatility with High-Level Language Virtual Machines (HLLVM) [3].
They can also be used to maintain support for legacy code by the industry [4, 5]
or to support new architectures such as RISC-V [6, 7]. In dynamic high-level
languages, DT can be used to emulate their intermediate representation, such as
the Facebook Hip-Hop Virtual Machine [3], the Java HotSpot [8], and Firefox’s
IonMonkey JavaScript JIT [9].
arXiv:2008.13240v1 [cs.PL] 30 Aug 2020
In order to achieve high performance using dynamic translation, the cost
added by invoking a compiler at runtime should be lower than the performance
gains achieved by executing the code produced by the compiler. To pay off
the compilation cost, a piece of code needs to execute for a significant
amount of time, so that the savings achieved by executing the optimized code
are also significant. Fortunately, programs usually spend most of their
execution time in a minority of their code [10], and DT can achieve high
performance by translating and optimizing only frequently executed code [11],
which we call hot code. For the rest of the application, the cold part, an
interpreter or a fast non-optimizing compiler can be used. Therefore, one of
the main strategies to improve the performance of a DT engine (DTE) is to
create heuristics to predict, at runtime, which parts of the code are hot
and which are cold. In this paper, the term Region Formation Techniques
(RFTs) will be used in its broadest sense to refer to all these prediction
schemes. The quality of the code produced by the dynamic translator for hot
regions is also an essential factor that affects performance. In this
context, both the set of optimizations employed and the granularity and
shape of the region of code being compiled play an important role in the
code quality and, hence, its performance. For instance, while small portions
of code, such as basic blocks, can be compiled quickly, larger ones expose
more opportunities for optimization [12]. RFTs that capture whole loops or
even code from more than one method in the same region enable more
aggressive (loop or inter-procedural) optimizations. Moreover, besides
affecting the scope of optimizations, the RFT design decisions may also
affect the hot code frequency, the code duplication rate, and the
optimization costs, among others.
All these variables need to be considered while designing and evaluating a
DTE, and this is not a trivial task. A DTE is a sophisticated piece of
software that includes a compiler, possibly an interpreter, complex data
structures for storing compiled binary code, a linker, and code to
orchestrate all these pieces together. Understanding the code of a DTE or
debugging one is challenging, especially when the bug occurs in the code
generated dynamically by the DTE. Consequently, in this kind of software,
implementing and validating novel research and design ideas is usually a
complex process. In fields like processor architecture design, prototyping
ideas in real hardware is also complex, and simulation is broadly used to
make design exploration approachable [13, 14, 15, 16]. In this work, we
argue that simulation can also be used to facilitate DT design exploration,
and we present RAIn, a novel DT/RFT simulator.
We use RAIn to evaluate several RFTs and reproduce results from the
literature, which validates its simulation capabilities. Our evaluation setup
includes programs from both the SPEC CPU 2006 [17] and SYSmark [18]
benchmarks and covers six different RFTs from the literature. The
contributions of this paper can be summarized as follows:
• A novel DTE simulator, called RAIn, which makes implementing and
testing new RFTs simpler and faster. RAIn is capable of producing
several different statistics that facilitate the evaluation of the behavior and
performance of the RFTs.
• A comprehensive study of the performance and behavior of six RFTs
(NET [19], MRET2 [20], LEI [21], NETPlus [22], Relaxed NET [23], and
Extended NETPlus [24]) using programs from SPEC CPU 2006 [17] and
SYSmark [18], thus covering several different application profiles. The
results corroborate the findings of previous work and provide
a comparison of all the techniques using the same set of applications.
The remainder of the text is organized as follows: Section 2 discusses the
typical organization of a trace/region-based DTE and the characterization of
their overhead (performance issues). Section 3 describes the proposed simulator,
including its functionality and its advantages. Section 4 shows the experimental
setup, and Section 5 presents a comprehensive comparison between several DT
designs. Finally, Section 6 presents the conclusions.
2 Region-based Dynamic Translators
A Dynamic Translator Engine (DTE) is a piece of software that translates guest
code, which may be a binary generated for one computer architecture, into
code compatible with the host architecture, also known as native code. When
emulating hot code, i.e., frequently executed code, it usually pays off to spend
effort translating and optimizing the guest code into optimized native code. In
this case, the performance gains achieved by executing the optimized native
code surpass the costs of translating the code. For cold code, i.e., infrequently
executed code, it is usually better to employ techniques such as interpretation or
quick, basic-block-based, translation, which have no or low translation cost. In
this paper, we use the term interpretation to represent the mechanisms employed
to emulate cold code.
Figure 1 illustrates the execution flow of a common DTE. First, it loads the
guest code to memory (state 1), which can be, for example, an intermediate
representation such as Java Byte Code or an x86 binary. Then, the emulation
process begins by fetching, decoding, and interpreting all the instructions one
by one (state 2), in a process called interpretation.
During the interpretation, an active monitor (profiler), or the interpreter
itself, monitors whether the emulation is repeatedly executing the same code
for longer than a given threshold, called the hotness threshold. If so, this means
that the execution is in a hot part of the code, such as a cycle. Once it detects
that the execution is in hot code, the interpreter starts to record the trace of
instructions to form a region of code for translation (state 3). These recorded
instructions, normally part of a loop or a cycle in the static code, are passed
to a compiler which compiles the region from the guest architecture into a
semantically equivalent optimized code compatible with the host architecture,
the native code (state 4). The native code is then stored in a code cache and
every time this same piece of code needs to be executed in the future, the
execution jumps to the native code in the cache (state 5) instead of interpreting
it. All these steps are repeated until the entire program’s execution is finalized.
[Figure 1 diagram. States: 1) Loading Input, 2) Interpreting, 3) Region
Recording, 4) Region Compiling, 5) Native Code Executing. Transition labels:
"Detect possible hot region", "Finish recording a region to be compiled",
"Execute compiled region", "Next instruction starts a compiled region".]
Figure 1: Execution flow of a common DTE.
DTEs implement specific functionalities to execute each one of the afore-
mentioned steps. For example, to predict which regions of code are hot or not,
there are three typical implementations [25]: a) frequency counting based on
instrumentation, b) sampling based on interrupt-timers, or c) a combination
of both. To store the compiled code and make it fast to access, DTEs employ
hash maps organized as caches, called Translated Code Cache, or TCC. Another
critical process during emulation is mapping guest addresses into their respec-
tive emitted host code addresses. As translation does not always result in the
same memory layout, access to memory locations using the address from the
guest code in indirect jumps and returns needs to be mapped to the address in
the compiled/translated host code. The mechanism that handles this mapping
of addresses during region execution is usually referred to as Indirect Branch
Translation Handler [26]. All these mechanisms and structures carry design
decisions and details that directly affect the performance of a DTE.
Another important mechanism is the Region Formation Technique (RFT),
which is responsible for: (1) deciding which instructions to profile, (2) deciding
when to start recording regions, and (3) deciding when to stop recording them.
On the one hand, to minimize compilation overhead, it is important to translate
only hot code. On the other hand, to accelerate the native code, it is usually
necessary to form large regions to increase the translation scope and expose
more optimization opportunities to the optimizer. For example, RFTs that
capture whole loops or even code from more than one method in the same region
enable more aggressive (loop or inter-procedural) optimizations. Consequently,
the RFT design may have a large impact on the performance of the native code
and, hence, of the DTE.
So far, the main RFTs proposed in the literature are NET (a.k.a. MRET) [19],
NET-r [27], MRET2 [20], LEI [21], NETPlus [22], and NETPlus-e-r [28]. They
use different strategies to select hot code and form dynamic regions with
different sizes, shapes, and characteristics, which directly affect the
performance of a DTE. Below, we include a brief description of each one of
them:
• NET: The authors of Dynamo [29] introduced an RFT called NET (Next-
Executing Tail) [19], which was originally called MRET (Most Recent
Executed Tail). In NET, regions are superblocks. Targets of backward
branches or targets of other superblock exits are considered as potential
superblock entries and are each assigned a counter that keeps track of its
execution frequency. After the counter reaches a defined hotness threshold,
a new region is recorded starting from this instruction and continuing
until another backward branch is reached or a given maximal number of
instructions is included in the region.
• MRET2: The authors of StarDBT [30] introduced the MRET2 RFT,
a variation of NET that aims at reducing the number of side-exits [20].
MRET2 consists of executing the recording phase of the NET technique
twice. If different code sequences are selected during both recordings, only
the intersection between them is selected to compose the MRET2 region.
• LEI: Hiniker, Hazelwood, and Smith [21] introduced the Last Executed
Iteration (LEI) technique. LEI selects cyclic superblocks based on a his-
tory buffer for the current execution. It focuses on avoiding inner loop
duplication on the superblocks.
• NETPlus: Davis and Hazelwood pointed out in a more recent work [22]
that the history buffer used by LEI imposes a considerable overhead. In
this latter paper, the authors propose the NETPlus RFT, which follows
the same steps as NET up to the point where a region is being closed. At
this point, NETPlus will look ahead in the code for a branch whose target
is the beginning of the superblock. When found, all instructions between
the current end of the region and the branch are added to the superblock.
In this manner, NETPlus aims at capturing more loops inside individual
superblocks when compared to the original NET, imposing a low overhead
on the superblock selection process. It is important to notice, however,
that the look-ahead process may touch code or memory positions that
have never been touched before and may never be touched in the future,
which may trigger unexpected page faults. The deepness in which the
search can go is a parameter for the NETPlus RFT.
• NET-r and NETPlus-e-r: Hong et al. [27] presented a modified ver-
sion of QEMU that uses the LLVM backend to emit highly optimized
regions, named HQEMU. The authors observe that to obtain maximum
benefits from the LLVM optimizations, HQEMU needs to create large re-
gions of code. Thus, they present a modification of two known RFTs.
The first, called NET-r, is a relaxation of NET that makes it similar to
the cyclical-path-based repetition detection scheme: it does not stop recording
a region when a backward branch is found, but when a cycle is found (a
repeated instruction address is recorded). The second [28], called NETPlus-e,
is an extension of NETPlus that adds not only paths that exit the NET
region and return to its entrance but also paths that exit the NET region
and return to any part of the region. NETPlus-e can also use the
NET-r instead of NET, thus creating an extended and relaxed version of
NETPlus (NETPlus-e-r).
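The difference between the NET and NET-r stop criteria described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from RAIn or HQEMU: the function names recordNET and recordNETr and the MaxSize limit are hypothetical, and a backward branch is modeled simply as a transition to a lower or equal address in the instruction trace.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Addr = std::uint64_t;

// Hypothetical cap on region length (the text only mentions "a given
// maximal number of instructions").
constexpr std::size_t MaxSize = 64;

// NET-style recording: start at Trace[Start] and stop when the next
// instruction is reached through a backward branch (assumed here to be any
// transition to a lower or equal address) or when MaxSize is reached.
std::vector<Addr> recordNET(const std::vector<Addr> &Trace, std::size_t Start) {
  std::vector<Addr> Region;
  for (std::size_t I = Start; I < Trace.size() && Region.size() < MaxSize; ++I) {
    if (I > Start && Trace[I] <= Trace[I - 1])
      break; // a backward branch closes the region
    Region.push_back(Trace[I]);
  }
  return Region;
}

// NET-r-style recording: backward branches do not stop the recording; it
// stops only when an already recorded address repeats (a cycle was found).
std::vector<Addr> recordNETr(const std::vector<Addr> &Trace, std::size_t Start) {
  std::vector<Addr> Region;
  std::unordered_set<Addr> Seen;
  for (std::size_t I = Start; I < Trace.size() && Region.size() < MaxSize; ++I) {
    if (!Seen.insert(Trace[I]).second)
      break; // repeated address: the cycle is complete
    Region.push_back(Trace[I]);
  }
  return Region;
}
```

For a trace that jumps backward into a callee before returning to the loop head, the relaxed criterion keeps recording across the backward branch, so the NET-r region is larger than the NET region formed from the same starting point.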
2.1 Dynamic Translation Performance
In this section, we discuss the primary sources of overhead in a DTE, mainly
the ones related to the RFT choice.
If we consider the emulation flow depicted in Figure 1, at any given time, a DTE
can be in any of the five states. State 1, Loading DTE and Guest Code, only
needs to be executed once, and for long-running executions, it incurs minimal
overhead; hence, we will not consider it. Instead, we will focus on the
performance and overhead sources of the other four: Interpretation, Region
Recording, Region Compilation, and Native Execution.
The total execution cost of emulating code with a DTE is composed of the
cost of interpreting (State 2) and profiling (State 3) cold code plus the cost of
compiling hot code (State 4) and the cost of executing the native code (State
5). Notice that the sooner a piece of code is compiled, the less time the DTE
spends emulating it as cold code and the more time it spends emulating it as
hot code, i.e., executing optimized native code.
Emulating code with native/optimized code is faster than emulating code
with interpretation, so one greedy approach would be to compile every single part
of the code, but we also need to consider the compilation cost. If the execution
frequency is low, the compilation overhead may exceed the gains achieved by
the hot-code emulation. In the case of cold code, compiling hurts the final
performance instead of improving it. In this case, only interpreting the code is
the best option [11].
This can be summarized by Equations 1, 2, and 3, where InterpCost is the
average cost of interpreting each guest instruction, InterpFreq is the number of
instructions interpreted, HotStaticSize is the total number of guest instructions
dynamically compiled, GenCost is the average cost of compiling a single guest
instruction, CompilerInitializationCost is the initialization overhead of calling
the compiler, NumRegions is the number of regions of code being compiled,
NativeCost is the average cost of emulating a guest instruction by executing
native, compiled code, NativeFreq is the number of guest instructions emulated
by native code, and TotalFreq is the total number of guest instructions emulated.
Notice that compiling code only results in performance gains when the
inequality of Equation 5 is true.
Another important source of overhead in a DTE is the region transition
overhead. Transitioning between the interpreter and a native region of code or
transitioning between native regions of code may require saving and loading
the emulation context. Emulators may maintain a context of the machine being
emulated, such as the values of the registers. In native code, these guest
registers can be mapped to host registers, but when jumping to the interpreter
these values need to be saved to memory so that they can be accessed again by
the interpreter. The same happens when regions are compiled with a register
allocation that chooses a different guest-host register mapping per region. In
this case, the modified guest registers that are held in host registers need to
be saved to memory again. This overhead can be described as the product of
the transition cost and the number of times the transition happens, as
described in Equation 4. Notice that larger regions tend to contain entire
cycles, such as entire loop nests, thus reducing the number of transitions.
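\begin{align}
\mathit{InterpTime} &= \mathit{InterpCost} \times \mathit{InterpFreq} \tag{1}\\
\mathit{NativeTime} &= \mathit{NativeCost} \times \mathit{NativeFreq} \tag{2}\\
\mathit{GenTime} &= \mathit{GenCost} \times \mathit{HotStaticSize} + \mathit{CompilerInitializationCost} \times \mathit{NumRegions} \tag{3}\\
\mathit{TransitionTime} &= \mathit{TransitionCost} \times \mathit{NumTransitions} \tag{4}\\
\mathit{InterpTime} + \mathit{NativeTime} + \mathit{GenTime} + \mathit{TransitionTime} &< \mathit{InterpCost} \times \mathit{TotalFreq} \tag{5}
\end{align}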
Although these equations offer a simplified overhead model for DTEs, they
give us a significant insight into the performance of dynamic translators: the
more frequently the compiled regions of code are executed, the higher the
speedup compared to solely interpreting them. Another interesting point is
that this hotness characteristic is inherent to the program being emulated, not to
the DTE [11]. For example, a program could execute each of its instructions only
once, making it impossible to achieve any speedup with dynamic compilation.
Hence, the performance of the DTE depends also on the characteristics of the
program being emulated.
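The overhead model of Equations 1 to 5 can be written down as a small sketch that checks whether compilation pays off for a given profile. The struct and function names below are hypothetical and are not part of RAIn. The sketch also assumes that TotalFreq equals InterpFreq plus NativeFreq, which the text does not state explicitly.

```cpp
#include <cstdint>

// Per-instruction and per-event costs, in arbitrary time units.
struct CostModel {
  double InterpCost;                 // interpreting one guest instruction
  double NativeCost;                 // emulating one guest instruction natively
  double GenCost;                    // compiling one guest instruction
  double CompilerInitializationCost; // overhead of each compiler invocation
  double TransitionCost;             // one region transition
};

// Counters that a profile of the emulation would provide.
struct Profile {
  std::uint64_t InterpFreq;     // guest instructions interpreted
  std::uint64_t NativeFreq;     // guest instructions emulated by native code
  std::uint64_t HotStaticSize;  // guest instructions compiled
  std::uint64_t NumRegions;     // regions compiled
  std::uint64_t NumTransitions; // region transitions
};

// Left-hand side of Equation 5: the sum of Equations 1 to 4.
double totalDTETime(const CostModel &C, const Profile &P) {
  double InterpTime = C.InterpCost * static_cast<double>(P.InterpFreq);
  double NativeTime = C.NativeCost * static_cast<double>(P.NativeFreq);
  double GenTime = C.GenCost * static_cast<double>(P.HotStaticSize) +
                   C.CompilerInitializationCost * static_cast<double>(P.NumRegions);
  double TransitionTime = C.TransitionCost * static_cast<double>(P.NumTransitions);
  return InterpTime + NativeTime + GenTime + TransitionTime;
}

// Equation 5: compare against interpreting every instruction.
// Assumption: TotalFreq = InterpFreq + NativeFreq.
bool compilationPaysOff(const CostModel &C, const Profile &P) {
  double TotalFreq = static_cast<double>(P.InterpFreq) +
                     static_cast<double>(P.NativeFreq);
  return totalDTETime(C, P) < C.InterpCost * TotalFreq;
}
```

With illustrative (made-up) costs, a profile whose compiled regions are executed many times satisfies the inequality, while a profile in which every compiled instruction runs only once does not, which matches the observation that hotness is a property of the emulated program.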
The cold emulation, hot emulation, compiler, and region transition overheads
define the main factors in a DTE performance.
Hot emulation (NativeCost) performance is directly related to the quality
of the code generated by the DTE compiler. Many decisions, such as the shape
and size of the regions, affect the quality of the code generated. For instance,
regions with a larger number of instructions may expose many more
optimization opportunities to the compiler, but regions with more branches are
more susceptible to early exits due to phase changes [31], leading to region
fragmentation [21], code duplication [32], and infrequently executed region
tails [33]. Another problem with large regions comes from exception handling:
given the difficulty of mapping exceptions during native execution, the DTE may
need to roll the execution back and reinterpret the entire region every time an
exception occurs, and regions with frequent exceptions may become a performance
issue [34]. Larger regions are more likely to include more branches and exceptions.
Given the main performance factors of DTEs and the characteristics of the
compilation units chosen by an RFT to generate high-performance code, we
selected seven metrics that are important when trying to better understand and
analyze the performance behavior of a DTE. These metrics are described as
follows:
• Total Number Of Regions: indicates how many regions the RFT
formed. This metric provides insights about the compilation overhead:
the more often the compiler is called, the higher its overhead.
• Regions Coverage: reflects the percentage of the instructions that are
emulated by translated code instead of interpretation (NativeFreq/TotalFreq).
This metric indicates how well the hot code detection and the region
formation policy are guessing. The more instructions are executed
outside the regions, the higher the chance that there is hot code that was
not included in a region. It is important to compare this metric with
the number of regions because forming fewer regions at the cost of lower
coverage is not desirable.
• Number of transitions: is the number of region entries that
came directly from the exit of other regions (NumTransitions). A high
number of transitions may put more pressure on the processor code
cache, and it is associated with the fragmentation of code cycles (nested
loops, for instance). Furthermore, transitions between regions have an
emulation cost. Thus, a low number of transitions implies good dynamic
region quality.
• Dynamic Region Size: is the total number of dynamic instructions
emulated by the region divided by the number of times that region was
entered. It is important to notice that the average dynamic size of regions
with a low completion ratio can be smaller than their static size. On the other
hand, regions with loops can have a dynamic size much larger
than their static size. This metric indicates the locality of execution: the
larger the dynamic size, the lower the number of transitions
between regions.
• Static Region Size: this is the average number of instructions per re-
gion. Therefore, it is also correlated with the compilation overhead, as
the compilation time is usually related to the number of instructions be-
ing compiled (GenCost ×HotStaticSize).
• Completion Ratio: is the percentage of times a region is executed en-
tirely, which means that all instructions in that region were executed from
the entrance to its exit. This metric makes more sense when dealing with
superblocks, like the ones formed by NET or MRET2, which have a main
exit well defined. Regions such as the ones formed by NETPlus do not
have a clear distinction between side-exits and main-exits. A good comple-
tion ratio on traces means fewer early-exits, which can have a significant
impact on fragmentation.
• 90% Cover Set: indicates the minimum number of regions needed to
cover 90% of the executed code frequency. The lower the cover set, the
fewer regions are needed to cover the hot part of the code, and these are
the regions that should receive further optimizations.
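Some of the metrics above can be computed from simple per-region counters, as in the following sketch. It is an illustration only and not RAIn's implementation: the RegionStats fields and function names are hypothetical, Regions Coverage is taken as the instructions executed inside regions divided by TotalFreq (following the textual definition), and coverSet90 returns 0 when the regions together cover less than 90% of TotalFreq.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Per-region counters (hypothetical field names).
struct RegionStats {
  std::uint64_t ExecutedInsts; // dynamic instructions emulated inside the region
  std::uint64_t Entries;       // number of times the region was entered
};

// Regions Coverage: fraction of all emulated instructions executed in regions.
double regionsCoverage(const std::vector<RegionStats> &Regions,
                       std::uint64_t TotalFreq) {
  std::uint64_t NativeFreq = 0;
  for (const RegionStats &R : Regions)
    NativeFreq += R.ExecutedInsts;
  if (TotalFreq == 0)
    return 0.0;
  return static_cast<double>(NativeFreq) / static_cast<double>(TotalFreq);
}

// Dynamic Region Size: instructions emulated per entry into the region.
double dynamicRegionSize(const RegionStats &R) {
  if (R.Entries == 0)
    return 0.0;
  return static_cast<double>(R.ExecutedInsts) / static_cast<double>(R.Entries);
}

// 90% Cover Set: smallest number of regions whose executed instructions add
// up to at least 90% of TotalFreq, taking the hottest regions first.
// Returns 0 if all regions together stay below 90% (a convention of this sketch).
std::size_t coverSet90(const std::vector<RegionStats> &Regions,
                       std::uint64_t TotalFreq) {
  std::vector<std::uint64_t> Freqs;
  for (const RegionStats &R : Regions)
    Freqs.push_back(R.ExecutedInsts);
  std::sort(Freqs.begin(), Freqs.end(), std::greater<std::uint64_t>());
  std::uint64_t Covered = 0;
  for (std::size_t I = 0; I < Freqs.size(); ++I) {
    Covered += Freqs[I];
    if (Covered * 10 >= TotalFreq * 9) // integer form of Covered >= 0.9 * TotalFreq
      return I + 1;
  }
  return 0;
}
```

Sorting by executed frequency before accumulating is what makes the returned count minimal: any set of k regions covers at most as much as the k hottest ones.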
Collecting and understanding these metrics is important to understand the
advantages and drawbacks of each RFT and DTE design choice. In the
following section, we show that it is not necessary to fully implement a DTE
to collect them: we only need to simulate the state transitions of a DTE
during the emulation of a binary. To demonstrate this, we implemented such a
DTE life-cycle simulator, named RAIn, and simulated the execution of multiple
applications using different RFTs.
3 RAIn: A Dynamic Translator Simulator
The implementation and evaluation of region formation techniques in a
real-world DBT or HLLVM is not an easy task. It involves debugging dynamically
generated machine code, among other complex tasks, which is overall a very
time-consuming job. For that reason, it is rare to see DT systems that implement
more than one RFT on a single DBT, which is why it is very difficult
to make a fair comparison among different region formation techniques.
Our approach to this problem was to develop an open-source tool, called the
Region Appraise Infrastructure (RAIn)¹, to simulate the execution of a dynamic
translator, allowing an easy and flexible prototyping process. RAIn relies on the
Trace Execution Automata (TEA) [35] technique to mimic region formation and
execution and to collect accurate region profile information. Initially, the TEA
technique was used to record execution traces along with profile information
for future executions [35]. A TEA uses a Deterministic Finite Automaton, or
DFA, to map executed instructions or basic blocks to pre-defined traces. RAIn
also applies a DFA, but it maps instructions to regions, formed according to a
pre-defined RFT.
¹ RAIn's source code: https://github.com/vandersonmr/Rain3
RAIn's implementation does not depend on any architecture, and it can be
used to simulate RISC-V, ARM, or x86 input with no changes. The input to
RAIn is a sequence of instructions (address and opcode) that can be collected
from the execution of a program, for example, using a simulator. It consumes
this sequence of instructions, but instead of executing them, like a functional
DTE, an automaton is dynamically traversed and updated, representing the
execution of instructions by regions. This automaton is also expanded under
certain circumstances, representing the creation of regions. Initially, only one
state, called No Trace being Executed (NTE), is present in the automaton. This
state keeps track of instructions that belong to no regions and is used to account
for instructions that are executed by an interpreter on a virtual machine that
couples interpretation with dynamic binary translation, for example. This state
prevents the system from creating a new state for every single instruction
executed, which could bloat the automaton. Figure 2b represents a RAIn DFA that
was created when executing the trace generated by the program in
Figure 2a. The regions R1 and R2 were formed according to the NET RFT. Notice
that each instruction represents a state in the DFA and states are grouped into
regions, representing the regions formed by the RFTs. Edges between states
crossing R1 and R2 boundaries represent transitions between the two regions.
(a) NET superblocks
(b) RAIn DFA
Figure 2: Example of RAIn state blocks.
So, whenever an instruction is consumed, RAIn checks the automaton for
a valid transition leaving the current state. If there is an outgoing edge la-
beled with the address of the consumed instruction, then, RAIn performs the
transition, updating the current state along with the edge and state execution
statistics. If there is no outgoing edge that represents the execution of the con-
sumed instruction, then it means that this path has not been recorded before.
In this case, a new edge is created and added to the automaton, representing
a new valid transition. This may happen due to a side-exit execution, like the
transition from instruction jeq T2.inc, on region R1, to instruction inc eax,
on R2, for example.
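The transition step described above can be sketched with a small automaton class. This is a simplified illustration and not RAIn's actual data structure: the class and member names are hypothetical, each instruction address is mapped to at most one state (so code duplicated across regions is not modeled), and instructions that stay in NTE are assumed to create no edges.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

using Addr = std::uint64_t;
using StateId = std::size_t;

constexpr StateId NTE = 0; // "No Trace being Executed" state

struct Edge {
  StateId Target;
  std::uint64_t Count; // how many times this transition was taken
};

class Automaton {
public:
  // Adds one state per instruction address of a newly formed region.
  void appendRegion(const std::vector<Addr> &Addresses) {
    for (Addr A : Addresses)
      if (StateOf.count(A) == 0)
        StateOf[A] = NextState++;
  }

  // Consumes one instruction address and returns the new current state.
  StateId consume(Addr A) {
    auto S = StateOf.find(A);
    StateId Target = (S == StateOf.end()) ? NTE : S->second;
    if (Current == NTE && Target == NTE)
      return NTE; // interpreted instruction: stay in NTE, no edge needed
    auto Key = std::make_pair(Current, A);
    auto It = Edges.find(Key);
    if (It == Edges.end())
      It = Edges.emplace(Key, Edge{Target, 0}).first; // path not seen before
    ++It->second.Count; // update edge execution statistics
    Current = It->second.Target;
    return Current;
  }

  std::size_t numEdges() const { return Edges.size(); }

private:
  // (source state, instruction address) -> outgoing edge
  std::map<std::pair<StateId, Addr>, Edge> Edges;
  std::unordered_map<Addr, StateId> StateOf; // addresses that belong to regions
  StateId Current = NTE;
  StateId NextState = 1;
};
```

In this sketch, a side exit shows up as a newly created edge whose source state lies in one region and whose target is either a state of another region or NTE.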
The RFT monitors the automaton transitions and, according to its
policy, it may start the formation of a new region. During this phase, instead
of transitioning on existing states, RAIn records the executed instructions and
associated transitions until it reaches the RFT stop criteria. After reaching the
stop criteria, RAIn updates the automaton, creating new states and transitions
that represent the instructions and correct execution flows inside the new region.
The operation of RAIn itself can be seen as a state machine. Starting at
the EXECUTE state, the system consumes instructions performing transitions
on the current DFA and recording statistics. Once the RFT triggers the region
formation, the RECORD state is activated, and RAIn starts recording a new region
based on the flow and the instructions being consumed. After the RFT identifies
the stop condition, RAIn enters the APPEND state, in which it expands the DFA
with the newly formed region. Once the DFA is expanded, the system returns
to the EXECUTE state, continuing with the automaton execution.
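The three phases described above form a small state machine, which can be sketched as a single transition function. The enum and function names are hypothetical, and the two boolean inputs stand for the RFT's start and stop decisions for the instruction just consumed; RAIn's actual control loop may be organized differently.

```cpp
// Phases of the simulator as described in the text.
enum class Phase { Execute, Record, Append };

// Computes the next phase. StartRecording and StopRecording represent the
// RFT's decisions for the current instruction (hypothetical parameters).
Phase nextPhase(Phase Current, bool StartRecording, bool StopRecording) {
  switch (Current) {
  case Phase::Execute:
    // Keep traversing the DFA until the RFT triggers region formation.
    return StartRecording ? Phase::Record : Phase::Execute;
  case Phase::Record:
    // Keep recording until the RFT identifies its stop condition.
    return StopRecording ? Phase::Append : Phase::Record;
  case Phase::Append:
    // The DFA has been expanded with the new region; resume execution.
    return Phase::Execute;
  }
  return Current; // not reached; keeps compilers from warning
}
```

Keeping the phase logic separate from the RFT policy mirrors the split between the Simulator and the RegionManager: only the two boolean decisions depend on the RFT being evaluated.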
RAIn is implemented in two modules: the RegionManager and the Simulator.
The RegionManager is the module responsible for managing the policies
for RFTs. It controls the start and stop criteria for region recording. To add
a new RFT to RAIn, all that is necessary is to provide an implementation of
a RegionManager and a respective call in the main function to register it with
the Simulator module. Every RegionManager implements the method
handleNewInstruction, which is called for every instruction from the trace and
should handle region creation. Code 1 shows the whole implementation of the NET
RFT. With fewer than 20 lines of code, we implement an RFT and are able to
analyze it using RAIn's metrics with instruction traces from different ISAs and
operating systems.
Code 1: NET Implementation using RAIn
Maybe<Region> NET::
handleNewInstruction(trace_item_t &LastInst, trace_item_t &CurrentInst, State LastTransition) {
  if (Recording) {
    if (wasBackwardBranch(LastInst, CurrentInst) || LastTransition == InterToNative) {
      Recording = false;
      return Maybe<Region>(RecordingRegion);
    }
    RecordingRegion->addAddress(CurrentInst.addr);
  } else if ((LastTransition == StayedInter && wasBackwardBranch(LastInst, CurrentInst)) ||
             LastTransition == NativeToInter) {
    HotnessCounter[CurrentInst.addr] += 1;
    if (isHot(HotnessCounter[CurrentInst.addr])) {
      Recording = true;
      RecordingRegion->addAddress(CurrentInst.addr);
      HotnessCounter[CurrentInst.addr] = 0;
    }
  }
  return Maybe<Region>::Nothing();
}
RAIn processes a sequence of instructions that represents the execution of a
program, one instruction at a time, similar to an interpreter.
Hence, its performance is similar to the performance of an interpreter when
evaluating a single RFT. To evaluate several RFTs, the user
may parallelize the simulation by loading the sequence of instructions once and
feeding several RAIn threads, each one simulating a different RFT or
hyperparameter. We employed this approach to collect the results of our
experiments, and it took close to half an hour to collect all statistics for all
tested RFTs from each 10-billion-instruction benchmark trace on an Intel Xeon
E5-2630 (2.60GHz).
4 Experimental Setup
To evaluate RAIn, the study presented in this paper was conducted using appli-
cations from two benchmark suites (SPEC CPU 2006 and SYSmark 2012) and
a Linux-compatible and open-source image editor, GIMP. A total of 14
benchmarks from SPEC CPU [17] were used in this study, ten from SPEC-FP and
four from SPEC-INT, as well as four benchmarks from SYSmark [18].
The usual benchmark suite for RFT related research in the literature is
SPEC CPU. However, SPEC CPU and SYSmark have a noticeably different
profile, and one of our goals is to understand how these DTE configurations
perform across all these types of applications. SYSmark is described as an
application-based benchmark that reflects usage patterns of business users in
the areas of office productivity, data/financial analysis, system management,
media creation, 3D modeling, and web development. In this work, we have
evaluated the effect of RFT techniques on office applications, combining four
applications from the SYSmark Office scenario (FineReader Pro 10.0, Internet
Explorer 8, PowerPoint 2010, and Word 2010) and GIMP (2.8.20). This set
of applications forms what we will call Desktop Apps and aims to represent a
group of applications with a large 90% Cover set, as opposed to the SPEC CPU
benchmarks.
Several applications from both benchmark suites generate sequences with
trillions of executed instructions. To handle this amount of data, we used
RAIn to simulate a 10-billion-instruction sequence of each program. We
executed all the benchmarks on the Bochs emulator [36] and captured the
executed x86 instructions to form each sequence. To avoid initialization
code, we discarded the first 10 billion instructions and recorded the next
10 billion. This number of instructions proved to be enough to expose the
differences in the behavior of applications and RFTs, as can be seen in the
large variation in code execution locality revealed by the 90% Cover Set and
the number of basic blocks presented in Section 5.2, and in the differences in
RFT behavior shown throughout Section 5.
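The skip-then-record capture described above can be sketched with a lazy slice over an instruction stream. This is only an illustration: `capture_window` and the toy stream are hypothetical names, and the real traces come from the emulator's instruction log rather than from a Python iterator.

```python
from itertools import islice


def capture_window(instruction_stream, skip, record):
    """Lazily yield `record` items after discarding the first `skip` items."""
    return islice(instruction_stream, skip, skip + record)


# Toy stream standing in for the emulator's instruction log (assumed data).
stream = iter(range(100))
window = list(capture_window(stream, skip=10, record=5))
print(window)  # [10, 11, 12, 13, 14]
```

Because `islice` is lazy, nothing before the window is kept in memory, which matters when the skipped prefix is billions of instructions long.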
5 Experimental Results
5.1 Parameter Selection
Two parameters are important in the RFTs we tested: the hotness threshold
and the NETPlus deepness (its expansion depth limit). To select good values,
we ran two benchmarks and tested several values of each parameter.
Threshold Value
Figure 3 shows how the number of compiled regions, the 90% cover set, the
percentage of cold regions, and the hot-emulation coverage are affected by the
hotness threshold. As these plots show, a slight variation in the threshold
value can significantly decrease the number of basic blocks selected for
translation, reducing the compilation overhead. The same effect occurred when
varying the threshold for all the RFTs, as we can observe in Figure 3(A). We
can also observe that a threshold near 1024 yields a low number of selected
regions and a low proportion of cold regions without losing too much
hot-emulation coverage (for Finereader, the hot-emulation coverage drops below
90% only when the threshold is near 2000). Thus, we fixed the threshold at
1024 for all subsequent experiments.
Figure 3: Impact of region hotness threshold on the A) 90% Cover Set, B)
number of compiled regions, C) native execution coverage, and D) percentage
of cold regions for six RFTs. The data was generated using 10-billion-instruction
sequences from Finereader (SYSmark) and GCC (SPEC CPU) benchmarks.
As depicted in Figure 3, all the tested RFTs are strongly influenced by the
threshold. Increasing its value not only reduced the proportion of cold
regions, which indicates a strong correlation between the past execution
frequency of a region and its future execution frequency, but also increased
the 90% Cover Set value. The threshold therefore has a large influence on all
four metrics, and its choice should not be neglected in the design and
construction of a DTE.
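A threshold sweep of the kind discussed above can be sketched as follows. The model is a deliberate simplification and an assumption of this sketch, not RAIn's implementation: regions are single basic blocks, a block counts as compiled once it has executed `threshold` times, and coverage is measured in block executions rather than instructions.

```python
from collections import Counter


def sweep(trace, thresholds):
    """Return (threshold, hot_blocks, hot_coverage) for each threshold.

    hot_blocks: blocks executed at least `threshold` times.
    hot_coverage: fraction of executions that happen after a block became hot.
    """
    counts = Counter(trace)
    total = sum(counts.values())
    rows = []
    for t in thresholds:
        hot = [c for c in counts.values() if c >= t]
        covered = sum(c - t for c in hot)  # executions after reaching threshold
        rows.append((t, len(hot), covered / total))
    return rows


# Toy trace of basic-block identifiers (assumed data, for illustration only).
trace = ["A"] * 8 + ["B"] * 4 + ["C"] * 2 + ["D"] * 2
rows = sweep(trace, [2, 4, 8])
for threshold, hot_blocks, coverage in rows:
    print(threshold, hot_blocks, coverage)
```

Even in this toy model, raising the threshold lowers both the number of blocks selected for compilation and the share of execution spent in compiled code, which is the trade-off the threshold choice has to balance.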
NETPlus Deepness
As explained in Section 2, the NETPlus RFT has an expansion depth limit that
controls how far the search for loops in the original NET region can go.
Figure 4 shows how the number of compiled regions, the average dynamic region
size, and the average static region size are affected by the NETPlus expansion
depth limit. The plot shows that the metrics stabilize when the depth limit
gets near ten; thus, choosing a value higher than ten would probably bring no
benefit. Hence, we fix the NETPlus expansion depth limit at ten for all the
following experiments.
One important observation is that the results in Figure 4 are very close to
the ones presented by the authors of NETPlus [22], demonstrating that RAIn can
explore RFT properties and lead its users to the same conclusions they would
reach with a real DTE. Additionally, Figure 4 shows that the increase in the
average static region size caused by the NETPlus expansion can be much more
costly for some programs than for others, as is the case for bwaves and deal,
which were chosen for being in different parts of the 90% cover set spectrum.
A large increase in the static size for bwaves did not lead to any significant
increase in the dynamic region size, showing that NETPlus, for some programs,
can add costs that may never be paid off, an observation that was not reported
by its authors.
Hence, all the following results were generated with the hotness threshold
set to 1024 and the NETPlus expansion depth limit set to ten, and with the
benchmark bars ordered by the 90% Cover Set values presented in the next
subsection (Section 5.2).
Figure 4: Impact of the NETPlus expansion depth limit on the number of
compiled regions, avg. dynamic region size, and avg. static region size for
two SPEC CPU applications (we chose SPEC applications to ease comparison with
the results from the original NETPlus paper). The results are normalized by
the first value (depth = 4).
5.2 Application Impact on DT Overhead
To evaluate the impact that different RFTs have on the metrics discussed in
Section 2.1, we executed RAIn with sequences of x86 instructions extracted
from the selected benchmarks. For each application, we skipped 10 billion
instructions and then recorded the next 10 billion. We also added a sequence
of instructions from GIMP, a Linux image editor, to compose our set of desktop
applications. In our tests, we considered that only basic blocks executed more
than 1024 times are hot enough to be worth compiling. As can be seen in
Figure 5a, some applications have very few basic blocks that reach this
frequency, while others have many. As the total execution count for each
program is constant (10 billion x86 instructions), having more basic blocks
with a high execution frequency means that their average execution frequency
is lower. Notice that the basic blocks of the desktop applications were, on
average, the least hot, which is consistent with the conclusion of Cesar et
al. [11], who argue for the importance of office/GUI benchmarks when
evaluating a DTE and point out that the low average execution frequency of
these applications is a barrier to a DTE’s performance. This is one of the
main reasons we included desktop applications from SYSmark and GIMP in our
experimental setup for evaluating RAIn.
Another straightforward way to verify this is the 90% Cover Set metric, first
introduced by the authors of Dynamo [37]. The 90% Cover Set counts the minimum
number of regions (in this example, basic blocks) necessary to account for 90%
of the execution frequency. The smaller the 90% Cover Set, the smaller the
amount of code to be compiled and the higher the average execution frequency
of these basic blocks. Duesterwald et al. [37] demonstrated that there is a
strong inverse relationship between the 90% Cover Set size and a DTE’s
Figure 5: (a) Number of basic blocks that execute 1024 or more times and
(b) minimum number of basic blocks required to cover 90% of the 10 billion
instructions simulated per application.
performance. Therefore, it would be challenging to obtain the same performance
on benchmarks with very different 90% Cover Set sizes, such as sjeng and IE,
as we can see in Figure 5b.
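The 90% Cover Set, as defined in the text, can be computed by sorting regions by execution count and accumulating from the hottest one. The sketch below is illustrative only: the function name, the `percent` parameter, and the sample counts are assumptions made here, and integer arithmetic is used to avoid rounding issues at the 90% boundary.

```python
def cover_set_size(exec_counts, percent=90):
    """Minimum number of regions needed to cover `percent`% of all executions."""
    total = sum(exec_counts)
    covered = 0
    # Taking the hottest regions first yields the smallest covering set.
    for n, count in enumerate(sorted(exec_counts, reverse=True), start=1):
        covered += count
        if covered * 100 >= percent * total:
            return n
    return 0  # only reached when there are no regions


# Toy execution counts per basic block (assumed data, for illustration only).
skewed = [950, 20, 20, 10]   # one block dominates the execution
flat = [10] * 10             # execution spread evenly over ten blocks
print(cover_set_size(skewed))  # 1
print(cover_set_size(flat))    # 9
```

The two toy profiles show why the metric separates applications: a concentrated profile needs a single hot block, while an evenly spread one needs almost all of its blocks compiled to reach the same coverage.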
5.2.1 Completion Ratio
The authors of MRET2 [20] argue that its main advantage over NET is the
increase in the completion ratio of the selected traces. They measured the
completion ratio of MRET2 and NET over the full execution of applications
from SPEC CPU 98 and showed that, on average, MRET2 improves the completion
ratio by 20%. Figure 6a shows a re-plot of their data, while Figure 6b shows
the data collected with RAIn for SPEC CPU 2006 applications. Despite the
differences in methodology and benchmarks, RAIn and the original work present
very similar results (distribution and average), leading to the same
conclusions and showing again that the simulation performed by RAIn can
produce results similar to the ones obtained with real DTEs.
Figure 6: (a) Data from the MRET2 patent [20] – full execution of SPEC98
applications; and (b) data generated by RAIn – 10 billion instructions from
SPEC CPU applications.
We extended the experiment with RAIn to compare NET, MRET2, and NET-R. The
results in Figure 7 show that NET-R is better than NET only in benchmarks
with a low 90% Cover Set (highly concentrated execution frequency), i.e., the
ones further to the left, while MRET2 is better than NET in almost every
benchmark.
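A completion ratio and its normalization against NET can be sketched as below. The definition used here is an assumption of this sketch (the fraction of region executions that run to the end of the region instead of leaving through a side exit), and the per-RFT outcome lists are toy data, not measurements.

```python
def completion_ratio(completed_flags):
    """Fraction of region executions that reached the end of the region.

    `completed_flags` holds one boolean per region execution: True when the
    execution ran to the region's end, False when it left through a side exit.
    """
    flags = list(completed_flags)
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


# Toy outcomes per region execution (assumed data, for illustration only).
net_runs = [True, False, True, False]
mret2_runs = [True, True, True, False]

net_ratio = completion_ratio(net_runs)
mret2_ratio = completion_ratio(mret2_runs)
normalized_mret2 = mret2_ratio / net_ratio  # value relative to the NET baseline
print(net_ratio, mret2_ratio, normalized_mret2)  # 0.5 0.75 1.5
```

A normalized value above 1.0 means the technique completes a larger share of its region executions than NET does on the same trace.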
5.2.2 Compilation Overhead
Since the compilation cost is correlated with the number of times the dynamic
compiler is invoked and with the number of instructions in the compiled
regions, we use RAIn to evaluate the number of compiled regions (Figure 8a)
and the average static region sizes (Figure 8b). NETPlus-e-r produced a
smaller number of regions to be compiled. However, despite being a much
simpler RFT than NETPlus-e-r, NET-r also had a very significant impact on this
metric. MRET2 and NET produced many more regions to be compiled, a result
that is explored and explained by the authors of the LEI technique [21].
We can also observe that there is a trade-off between the number of regions
compiled and the average static region size: the majority of RFTs that decrease
Figure 7: MRET2 and NET-r normalized completion ratios.
the number of compiled regions also increase the average static region size.
NETPlus has the best trade-off for these metrics: it decreased the number of
regions and only slightly increased the average static size. LEI, on the
other hand, has the worst trade-off. Furthermore, notice that NET-r,
NETPlus-e-r, and LEI create larger regions when emulating applications with
larger 90% Cover Sets, such as the desktop applications. This indicates that
the relaxation of NET (NET-r) and the expansion of NETPlus (NETPlus-e-r) have
a different impact on benchmarks with different 90% Cover Sets.
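The two compilation-overhead metrics discussed above, and their normalization by the NET values, can be sketched as follows. The region representation (a list of per-block instruction counts) and the sample regions are assumptions made for illustration, not RAIn's data structures.

```python
def static_stats(regions):
    """Return (number of regions, average static size in instructions).

    Each region is a list with the instruction count of each basic block in it.
    """
    sizes = [sum(region) for region in regions]
    return len(regions), sum(sizes) / len(sizes)


def normalize(stats, baseline):
    """Express each metric relative to the baseline (NET) value."""
    return tuple(value / base for value, base in zip(stats, baseline))


# Toy regions (assumed data): NET forms four small regions, while a
# hypothetical RFT merges the same blocks into two larger ones.
net_regions = [[4, 4], [4, 4], [4], [4]]
other_regions = [[4, 4, 4], [4, 4, 4]]

net_stats = static_stats(net_regions)      # (4, 6.0)
other_stats = static_stats(other_regions)  # (2, 12.0)
print(normalize(other_stats, net_stats))   # (0.5, 2.0)
```

In this toy case the hypothetical RFT halves the number of compiler invocations but doubles the average static region size, which is the kind of trade-off the normalized plots expose.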
Figure 8: (a) Total number of compiled regions normalized by the NET values;
and (b) the average static region sizes normalized by the NET results for all
RFTs and benchmarks.
5.2.3 Dynamic Characteristics
We also investigated the dynamic characteristics of the regions formed by all
RFTs with RAIn. Figure 9a shows the 90% cover set for all RFTs normalized by
the results of NET. In this metric, only MRET2 performed worse than NET,
indicating that it requires more regions to be compiled to cover 90% of the
execution. NETPlus-e-r achieved the best results, followed by NETPlus, LEI,
and NET-r, with NETPlus being more efficient on benchmarks with a higher 90%
Cover Set. A similar result was obtained for the average dynamic size, shown
in Figure 9b: NETPlus-e-r again achieved the best results, followed by NETPlus
and LEI, while NET-r showed only a slight improvement and MRET2 decreased when
compared to NET. Last, we can notice that the results were much less
pronounced in benchmarks with a high 90% Cover Set. Overall, these cases
support the view that it is more difficult to select better regions than NET
does when the 90% Cover Set is high.
Figure 9: (a) the 90% Cover Sets normalized by the NET results; and (b) the
average dynamic region sizes normalized by the NET results for all RFTs and
benchmarks.
These results are similar to the ones found in previous work, which supports
our claim that using simulation to evaluate RFTs for a dynamic translator is
a sound approach. These results also show that the relaxation and extension
proposed by Hong et al. [27] are simple and powerful techniques that should be
considered when designing and implementing a dynamic translator.
6 Conclusions
In this work, we presented a novel DTE simulator called RAIn, which is
capable of reproducing several results from the literature through simulation.
Simulation enables testing multiple DTE designs and produces several useful
statistics without complex implementations, thereby opening new opportunities
to explore DTE design decisions in a faster and simpler manner.
As far as we know, there is no other simulation framework for testing DTE
designs like the one proposed in this paper, and hence no other work similar
to this one. Moreover, we are not aware of any other comparative study
involving several RFTs. Typically, an article that presents a new RFT compares
it with only one other technique [20, 21, 22], usually NET. Last but not
least, the results that we obtained with RAIn in this work corroborate the
findings reported by other authors in previous works.
For example, we found with RAIn that MRET2 has a better completion ratio
than NET, the same result presented in the original MRET2 work [20]. We also
showed that NET and MRET2 compile far more regions than the other techniques
and that these regions are smaller, a phenomenon that the LEI authors called
region fragmentation [21]. Furthermore, our results with the NETPlus depth
limit variation showed the same pattern as the one in the original NETPlus
paper [22]. Moreover, the point of convergence that we found with RAIn for
the depth limit is the same as the one found by its authors. Finally, we
concluded that NETPlus-e-r is the best RFT for the dynamic metrics tested, and
this is exactly why NETPlus-e-r was selected for use in HQEMU [27].
On top of that, we presented a comprehensive study of several RFTs; we
identified the strongest and weakest points of each tested RFT, showing the
importance of using RAIn as a tool for the design of any future DTE. For
instance, if one needs to reduce the number of transitions and increase the
average time spent in a region, the best RFT would be the expanded and relaxed
version of NETPlus. Alternatively, if one needs to create smaller regions
with a larger completion ratio, MRET2 is by far the best tested RFT.
References
[1] J. Aycock, “A brief history of just-in-time,” ACM Computing Surveys
(CSUR), vol. 35, no. 2, pp. 97–113, 2003.
[2] P. Brown, “Throw-away compiling,” Software: Practice and Experience,
vol. 6, no. 3, pp. 423–434, 1976.
[3] K. Adams, J. Evans, B. Maher, G. Ottoni, A. Paroski, B. Simmers,
E. Smith, and O. Yamauchi, “The hiphop virtual machine,” ACM SIG-
PLAN Notices, vol. 49, no. 10, pp. 777–790, 2014.
[4] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler,
A. Klaiber, and J. Mattson, “The transmeta code morphing™ software:
using speculation, recovery, and adaptive retranslation to address real-life
challenges,” in Proceedings of the international symposium on Code gener-
ation and optimization: feedback-directed and runtime optimization. IEEE
Computer Society, 2003, pp. 15–24.
[5] Apple Inc., “Rosetta.” [Online]. Available:
https://www.apple.com/rosetta/index.html
[6] L. Lupori, V. Rosario, and E. Borin, “Towards a high-performance risc-v
emulator,” in 2018 Symposium on High Performance Computing Systems
(WSCAD). IEEE, 2018, pp. 213–220.
[7] V. M. do Rosario, F. Pisani, A. R. Gomes, and E. Borin, “Fog-assisted
translation: towards efficient software emulation on heterogeneous iot de-
vices,” in 2018 IEEE International Parallel and Distributed Processing
Symposium Workshops (IPDPSW). IEEE, 2018, pp. 1268–1277.
[8] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito,
K. Ishizaki, H. Komatsu, and T. Nakatani, “Overview of the ibm java
just-in-time compiler,” IBM systems Journal, vol. 39, no. 1, pp. 175–193,
2000.
[9] Ionmonkey. [Online]. Available: https://wiki.mozilla.org/IonMonkey
[10] D. E. Knuth, “An empirical study of fortran programs,” Software: Practice
and experience, vol. 1, no. 2, pp. 105–133, 1971.
[11] D. Cesar, R. Auler, R. Dalibera, S. Rigo, E. Borin, and G. Araujo, “Mod-
eling virtual machines misprediction overhead,” in Proceedings of the 2013
IEEE International Symposium on Workload Characterization (IISWC
’13), September 2013, pp. 153–162.
[12] T. Suganuma, T. Yasue, and T. Nakatani, “A region-based compilation
technique for dynamic compilers,” ACM Transactions on Programming
Languages and Systems (TOPLAS), vol. 28, no. 1, pp. 134–174, 2006.
[13] D. Burger, T. M. Austin, and S. Bennett, “Evaluating future microproces-
sors: The simplescalar tool set,” University of Wisconsin-Madison Depart-
ment of Computer Sciences, Tech. Rep., 1996.
[14] R. E. Bryant and M. N. Velev, “Verification of pipelined microprocessors
by comparing memory execution sequences in symbolic simulation,” in
Advances in Computing Science — ASIAN’97, R. K. Shyamasundar and
K. Ueda, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp.
18–31.
[15] D. Ponomarev, G. Kucuk, and K. Ghose, “Accupower: an accurate power
estimation tool for superscalar microprocessors,” in Proceedings 2002 De-
sign, Automation and Test in Europe Conference and Exhibition, 2002, pp.
124–129.
[16] F. A. Endo, D. Couroussé, and H. Charles, “Micro-architectural
simulation of in-order and out-of-order arm microprocessors with gem5,” in 2014
International Conference on Embedded Computer Systems: Architectures,
Modeling, and Simulation (SAMOS XIV), 2014, pp. 266–273.
[17] Standard Performance Evaluation Corporation, “SPEC CPU 2006,”
https://www.spec.org/cpu2006/, 2006, [Online; accessed 14-Mar-2015].
[18] Bapco, “SYSmark 2012,” http://bapco.com/products/sysmark-2012,
2012, [Online; accessed 14-Apr-2014].
[19] E. Duesterwald and V. Bala, “Software profiling for hot path prediction:
Less is more,” ACM SIGOPS Operating Systems Review, vol. 34, no. 5, pp.
202–211, 2000.
[20] C. Wang, B. Zheng, H. Kim, M. B. Jr., and Y. Wu, “Two-pass mret trace
selection for dynamic optimization,” Patent number 20070079293, 2007.
[21] D. Hiniker, K. Hazelwood, and M. D. Smith, “Improving region selection
in dynamic optimization systems,” in MICRO 38: Proceedings of the 38th
annual IEEE/ACM International Symposium on Microarchitecture, 2005,
pp. 141–154.
[22] D. Davis and K. Hazelwood, “Improving region selection through loop com-
pletion,” in ASPLOS Workshop on Runtime Environments/Systems, Lay-
ering, and Virtualized Environments, 2011.
[23] D.-Y. Hong, J.-J. Wu, P.-C. Yew, W.-C. Hsu, C.-C. Hsu, P. Liu, C.-M.
Wang, and Y.-C. Chung, “Efficient and retargetable dynamic binary trans-
lation on multicores,” IEEE Transactions on Parallel and Distributed Sys-
tems, vol. 25, no. 3, pp. 622–632, 2014.
[24] C.-C. Hsu, D.-Y. Hong, W.-C. Hsu, P. Liu, and J.-J. Wu, “A dynamic bi-
nary translation system in a client/server environment,” Journal of Systems
Architecture, vol. 61, no. 7, pp. 307–319, 2015.
[25] M. D. Bond and K. S. McKinley, “Practical path profiling for dynamic opti-
mizers,” in Proceedings of the international symposium on Code generation
and optimization. IEEE Computer Society, 2005, pp. 205–216.
[26] J. D. Hiser, D. Williams, W. Hu, J. W. Davidson, J. Mars, and B. R.
Childers, “Evaluating indirect branch handling mechanisms in software dy-
namic translation systems,” in Proceedings of the International Symposium
on Code Generation and Optimization. IEEE Computer Society, 2007, pp.
61–73.
[27] D.-Y. Hong, C.-C. Hsu, P.-C. Yew, J.-J. Wu, W.-C. Hsu, P. Liu, C.-M.
Wang, and Y.-C. Chung, “Hqemu: a multi-threaded and retargetable dy-
namic binary translator on multicores,” in Proceedings of the Tenth Inter-
national Symposium on Code Generation and Optimization. ACM, 2012,
pp. 104–113.
[28] H. Guan, Y. Yang, K. Chen, Y. Ge, L. Liu, and Y. Chen, “Distribit: a
distributed dynamic binary translator system for thin client computing,”
in Proceedings of the 19th ACM International Symposium on High Perfor-
mance Distributed Computing. ACM, 2010, pp. 684–691.
[29] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: a transparent dy-
namic optimization system,” in Proceedings of the ACM SIGPLAN 2000
Conference on Programming Language Design and Implementation, 2000,
pp. 1–12.
[30] C. Wang, S. Hu, H.-S. Kim, S. R. Nair, M. B. Jr., Z. Ying, and Y. Wu,
“Stardbt: An efficient multi-platform dynamic binary translation system,”
in Asia-Pacific Computer Systems Architecture Conference, 2007, pp. 4–15.
[31] C.-C. Hsu, P. Liu, J.-J. Wu, P.-C. Yew, D.-Y. Hong, W.-C. Hsu, and C.-M.
Wang, “Improving dynamic binary optimization through early-exit guided
code region formation,” in ACM SIGPLAN Notices, vol. 48, no. 7. ACM,
2013, pp. 23–32.
[32] K. Scott, N. Kumar, B. R. Childers, J. W. Davidson, and M. L. Soffa,
“Overhead reduction techniques for software dynamic translation,” in Par-
allel and Distributed Processing Symposium, 2004. Proceedings. 18th Inter-
national. IEEE, 2004, p. 200.
[33] E. Borin and Y. Wu, “Characterization of dbt overhead,” in Workload
Characterization, 2009. IISWC 2009. IEEE International Symposium on.
IEEE, 2009, pp. 178–187.
[34] C. Häubl and H. Mössenböck, “Trace-based compilation for the java hotspot
virtual machine,” in Proceedings of the 9th International Conference on
Principles and Practice of Programming in Java. ACM, 2011, pp. 129–
138.
[35] J. P. Porto, G. Araujo, E. Borin, and Y. Wu, “Trace execution automata
in dynamic binary translation,” in 3rd Workshop on Architectural and Mi-
croarchitectural Support for Binary Translation, 2010.
[36] K. P. Lawton, “Bochs: A portable pc emulator for unix/x,” Linux Journal,
vol. 1996, no. 29es, p. 7, 1996.
[37] V. Bala, E. Duesterwald, and S. Banerjia, “Transparent Dynamic
Optimization: The Design and Implementation of Dynamo [Technical
Report],” HP Labs, Tech. Rep., 1999. [Online]. Available:
http://www.hpl.hp.com/techreports/1999/HPL-1999-78.html
