The Janus triad: Exploiting parallelism through dynamic binary modification by Zhou, R et al.
The Janus Triad: Exploiting Parallelism Through
Dynamic Binary Modification
Ruoyu Zhou, George Wort, Márton Erdős, Timothy M. Jones
University of Cambridge, UK
{ruoyu.zhou,*,me412,timothy.jones}@cl.cam.ac.uk
*georgewort11@gmail.com
Abstract
We present a unied approach for exploiting thread-level,
data-level, andmemory-level parallelism through a same-ISA
dynamic binary modier guided by static binary analysis.
A static binary analyser rst examines an executable and
determines the operations required to extract parallelism at
runtime, encoding them as a series of rewrite rules that a
dynamic binary modier uses to perform binary transfor-
mation. We demonstrate this framework by exploiting three
dierent kinds of parallelism to perform automatic vectori-
sation, software prefetching, and automatic parallelisation
together on legacy application binaries. Software prefetch
insertion alone achieves an average speedup of 1.2×, compar-
ing favourably with an automatic compiler pass. Automatic
vectorisation brings speedups of 2.7× on the TSVC bench-
marks, signicantly beating a compiler approach for some
workloads. Finally, combining prefetching, vectorisation, and
parallelisation realises a speedup of 3.8× on a representative
application loop.
CCS Concepts • Software and its engineering → Runtime
environments; Retargetable compilers; Dynamic compilers.
Keywords binary translation, binary optimisation, vectorization,
software prefetch
ACM Reference Format:
Ruoyu Zhou, George Wort, Márton Erdős, Timothy M. Jones. 2019.
The Janus Triad: Exploiting Parallelism Through Dynamic Binary
Modification. In Proceedings of the 15th ACM SIGPLAN/SIGOPS
International Conference on Virtual Execution Environments (VEE
’19), April 14, 2019, Providence, RI, USA. ACM, New York, NY, USA,
13 pages. https://doi.org/10.1145/3313808.3313812
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
VEE ’19, April 14, 2019, Providence, RI, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6020-3/19/04. . . $15.00
https://doi.org/10.1145/3313808.3313812
1 Introduction
Multicore processors support a variety of different kinds of
parallelism to meet the varying needs of a wide range of workloads.
Among them, thread-level parallelism (TLP) relies on
decomposing workloads into threads at a coarse-grained
level and running them on different cores. SIMD processing
exploits data-level parallelism (DLP) whereby the same op-
eration is executed on multiple data points simultaneously.
Meanwhile, memory-level parallelism (MLP) is sometimes
considered a form of instruction-level parallelism (ILP), but
comes into its own when exploited using coarse-grained
software prefetching techniques to boost cache performance
and memory port utilisation.
However, exploiting these various forms of parallelism is
workload- and processor-specific. SIMD instruction set architectures
(ISAs) change regularly (for example, x86 architectures
support MMX, SSE and AVX vector ISAs developed by
Intel, not to mention those created by AMD), the number
of cores on chip is growing slowly, and the configuration
of the memory hierarchy has a profound effect on whether
software prefetching is useful or not. As such, it is increasingly
difficult for users of proprietary and legacy software to
specialise their codes for different chips, and even for open-
source software, end users often receive their applications in
binary form or link against pre-compiled libraries. In many
cases the applications can only be compiled with generic op-
timisations that do not take full advantage of the capabilities
of the underlying hardware.
Therefore, optimising applications through binary modification
that targets the same ISA becomes a seductive proposition.
There are many static binary modification tools (for
example, Vulcan [22], BOLT [20], Sun BCO [14], SecondWrite [2],
Ispike [17]) that can convert a generic executable to
a specialised binary, some relying on profiling or instrumentation.
However, most focus on control-flow and peephole
optimisations and therefore cannot handle complicated and
stripped binaries. Without additional symbolic information,
they are not able to analyse and perform sophisticated trans-
formations that maintain the original program semantics.
Dynamic binary modication (DBM) normally sets up a
virtualised environment and executes the application just-
in-time in its sandboxed code caches. This is exible and
can specialise to dierent hardware, however it suers run-
time code-cache warm-up overhead and has to use runtime
VEE ’19, April 14, 2019, Providence, RI, USA Ruoyu Zhou, George Wort, Márton Erdős, Timothy M. Jones
Executable
Disassemble to IR
Control Flow Analysis
Dependence Analysis
Custom Analysis
Rewrite Schedule 
Generation
Same Executable
Rewrite Schedule
Dynamic Binary Modification
Static Binary Analysis
Runtime Profiles
Hot Code Profiling
Runtime Dependence
Execution Cost 
Models
Runtime Profiling
Profiling Info Analysis
Rewrite Schedule
Rewrite
Schedule 
Interpreter
D
B
M
Code Cache
(a) Flow of Janus’ binary modication
Basic Block A
jump B
Basic Block B
jump C
Basic Block C
Indirect jump D,E
Basic Block D
Basic Block E
Original 
Executable
DBM
Modified A
Copied B
jump mC
Modified C
Code Cache
Translated 
Indice Dispatch
Indirect 
Lookup
Miss
Hit
Rewrite 
Schedule 
Interpreter
JIT-compiled A
jump mB
jump ind_lk
Rewrite 
Schedule
Modified D
Header
Rewrite Rules
A ...
A ...
C ...
D ...
E ...
RuleIDBB D
1
2
1
3
3
JIT-compiler
Rule 2
Rule 1
Rule 3
Rule 1
Executable
(b) Overview of Janus’ rewrite-rule interpretation
Figure 1. Overview of the Janus binary modication framework.
optimisation to amortise this cost. Therefore some, such as
Dynamo [4] and starDBT [23], focus on a specic type of
application with a high fraction of hot code, whereas others,
such as DynamoRIO [6] and PIN [16], focus on program
analysis and instrumentation, where overheads are tolerated.
The overarching problem is that these tools lack a global
understanding of the applications that they modify.
Combining static analysis with dynamic binary modification
lets each build on the other’s strengths. Janus is an
open source same-ISA dynamic binary modifier augmented
with static binary analysis [25]. Janus performs a static analysis
of binary executables and determines the transformations
required to optimise them. From this, it produces a
set of domain-specific rewrite rules, which are steps that
the dynamic binary modifier should follow to perform the
transformation at runtime. The dynamic component, based
on DynamoRIO, reads these rewrite rules and carries out the
transformation when it encounters the relevant code.
The Janus framework provides a system for extracting different
levels of parallelism within a dynamic binary modifier.
Janus was originally developed for automatic parallelisation of
binaries (i.e., extracting TLP) [25]; this paper augments it with
techniques to extract DLP and MLP as well. Using a combination
of static analysis and dynamic modification, we show the
performance of Janus when extracting each of these types of
parallelism in isolation and combined. Our evaluation shows
that inserting software prefetches alone provides an average
speedup of 1.2× and automatic vectorisation brings 2.7×.
Combined, Janus achieves 3.8× on a workload amenable to
all forms of parallelism.
2 The Janus Framework
Janus is a same-ISA binary modification framework that was
initially developed for automatic parallelisation [25]. It combines
static binary analysis and dynamic binary modification,
controlled through a domain-specific rewrite schedule.
Figure 1(a) shows an overview of the Janus framework. It
starts by analysing an executable statically to identify regions
for optimisation, augmenting this with profiling to refine its
cost models. It then determines how to modify the code to
perform the desired transformations and encodes the steps
into rewrite rules, contained within a rewrite schedule. The
dynamic binary modifier reads the executable and rewrite
schedule and performs specific modifications according to
the rules. Janus can also perform profile-guided optimisation,
where different rewrite schedules are used to direct the
dynamic binary modifier to perform different analysis or
modification at each pass.
2.1 Rewrite-Schedule Interface
The rewrite schedule is based on the insight that any complex
transformation to a binary can be decomposed into a series
of coarse-grained dynamic operations, each of which can
be implemented as a fixed package called a rewrite rule. The
rewrite rules specify isolated modifications to make locally
to each dynamic basic block encountered, to produce a global
transformation, so that the power of the rewrite schedule is
greater than the sum of its parts.
The rewrite rules are persistent, platform independent
and are only associated with the corresponding input binary
and instruction set architecture. The rewrite schedule can
be reused for multiple runs and cumulatively optimised.
2.2 Rewrite-Rule Interpretation
Using a rewrite schedule enables Janus to overcome the limitations
of a purely static or dynamic approach. The rewrite
schedule controls binary modification and conveys static
information to the DBM, removing the need for dynamic
program analysis. Yet Janus also builds on the strengths of
dynamic binary modification, by specialising code for different
hardware, correctly handling signals and faults, and
dealing with code that is not discoverable ahead of time.
Figure 1(b) shows the interpretation process of the rewrite
rules. To execute application instructions, the baseline DBM
first translates them, mangles them if they could cause it to
lose control of the running program, then stores them in a
code cache. Before the DBM copies each newly discovered
basic block to its code cache, it consults the hash table to
determine whether there are any rewrite rules associated
with the block. If there are, then the DBM invokes the modification
handlers dictated by the rewrite rules on the basic
block before encoding it back to the code cache.
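The lookup-and-dispatch step described above can be sketched in a few lines of Python. This is only an illustrative model, not Janus' or DynamoRIO's actual interface: the `RewriteRule` fields, the `handlers` dictionary and the `on_new_basic_block` hook are all hypothetical names, and instructions are modelled as plain strings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteRule:
    block_addr: int  # address of the basic block the rule is attached to
    rule_id: str     # selects the modification handler, e.g. "INSTR_NOP"
    data: int = 0    # rule-specific payload (hypothetical encoding)


def build_rule_table(schedule):
    """Index the rewrite schedule by basic-block address (the hash table)."""
    table = defaultdict(list)
    for rule in schedule:
        table[rule.block_addr].append(rule)
    return table


def on_new_basic_block(addr, instrs, table, handlers):
    """Model of the hook that runs before a block enters the code cache."""
    block = list(instrs)
    # Blocks without rules fall through and are copied unchanged.
    for rule in table.get(addr, []):
        block = handlers[rule.rule_id](block, rule)
    return block


def handle_nop(block, rule):
    """Hypothetical handler: append `data` no-ops to the block."""
    return block + ["nop"] * rule.data


schedule = [RewriteRule(0x4016F0, "INSTR_NOP", 2)]
table = build_rule_table(schedule)
handlers = {"INSTR_NOP": handle_nop}

modified = on_new_basic_block(0x4016F0, ["mov rax, rbx"], table, handlers)
untouched = on_new_basic_block(0x401730, ["inc rcx"], table, handlers)
```

Keeping the table keyed by block address means a block with no rules costs a single failed lookup, which is consistent with the text's point that the DBM does no program analysis of its own.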
2.3 Janus Applications
Using Janus, we can perform a variety of binary transforma-
tions, under control of the static analyser. Janus could also
transform the application for program analysis or other opti-
misation targets, such as security or reliability. The rewrite
rule interface can easily be expanded to support new optimi-
sations that are not easily accomplished in a purely static or
dynamic system, to further exploit parallelism. The initial
Janus framework has already demonstrated its application
for extracting thread-level parallelism [25]. In this paper, we
show how three kinds of parallelism (TLP, DLP, and MLP)
are exploited using a combination of static and dynamic techniques
through custom-defined rewrite rules, demonstrating
the simplicity and scalability of our approach.
3 Memory-Level Parallelism
Many dynamic binary modiers focus on dynamic trace opti-
misations [4, 23] where the frequently executed basic blocks
are predicted and concatenated into a trace or superblock
to avoid branch stalls and improve pipeline eciency. How-
ever, another signicant source of stalls that exists even in
native execution aects applications with irregular memory
accesses in the trace. These accesses are dicult, if not im-
possible, to predict in advance, causing high memory-access
latencies to fetch data into the cache hierarchy from DRAM.
Prefetching is an eective technique to take advantage
of memory-level parallelism by overlapping computation
and memory accessing through either hardware or software.
Regular stride-based prefetchers are already fully supported
by commercial processors. Irregular prefetches can be per-
formed in compilers by inserting extra address-prediction
code within memory-bound hot loops to initiate the memory
accesses in advance.
INSTR_CLONE      Duplicate instructions and insert at a given location
INSTR_UPDATE     Update a specified operand of an instruction
INSTR_NOP        Insert n no-op instructions
MEM_PREFETCH     Insert a prefetch for a memory operand and offset
MEM_SPILL_REG    Spill a set of registers to private storage
MEM_RECOVER_REG  Recover a set of registers from private storage
Figure 2. Major rewrite rules used for prefetching in Janus.
However, software prefetches are not normally inserted at
default compiler optimisation levels (e.g., O3) due to the difficulty
in getting them correct. The compiler must balance the
trade-off that occurs from generating the prefetches: code
must be added to calculate the prefetch addresses but too
much creates overhead that swamps the benefits. Moreover,
the optimal prefetch distance can vary across different microarchitectures,
meaning the compiler must compromise
when the target processor is not known.
3.1 Prefetching Approach
We implement the prefetching approach from Ainsworth
and Jones [1], originally developed in LLVM to automatically
generate software prefetches for indirect memory accesses.
The algorithm retrieves and duplicates program slices to
calculate prefetch addresses for a given prefetch offset.
We designed six major rewrite rules to support software
prefetching in Janus, as shown in figure 2. Each rewrite
rule performs a specific modification to the incoming basic
block that will be copied to the code cache. The prime
rewrite rule MEM_PREFETCH can direct the DBM to insert a
software-prefetch instruction at a specified location based on
a specific memory operand. An example is shown in figure 3,
where the rule MEM_PREFETCH directs the DBM to insert the
prefetcht0 instruction based on the existing memory access.
Figure 3(a) shows the rewrite-rule generation pass in
Janus’ static analyser to enable architecture-dependent rule
generation. Janus first scans all the indirect loads in a loop
and creates the program slice leading to each address calculation.
A prefetch is only generated when the program slice
has an input from the induction variable.
Prefetch Address Calculation The prefetched address is
a prediction of the dynamic address that will be accessed
n iterations in the future, where n is the prefetch distance.
For strided accesses in a hot loop, the prefetch offset can be
easily determined as a fixed immediate value. In the DBM, the
offset can be directly re-encoded into the prefetch memory
operand with a larger displacement value.
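For the strided case the arithmetic is simple enough to show directly. The sketch below assumes a constant stride in bytes and a prefetch distance of 64 iterations; both values and the function name `prefetch_displacement` are illustrative choices, not figures taken from Janus.

```python
def prefetch_displacement(disp, stride_bytes, distance):
    """Displacement addressing the element `distance` iterations ahead.

    disp         -- displacement of the original memory operand, in bytes
    stride_bytes -- bytes between the addresses of consecutive iterations
    distance     -- prefetch distance n, in loop iterations
    """
    return disp + stride_bytes * distance


# Original access: [rax + rcx*8] with rcx incremented once per iteration,
# so consecutive iterations touch addresses 8 bytes apart.
new_disp = prefetch_displacement(disp=0, stride_bytes=8, distance=64)
print(f"prefetcht0 [rax + rcx*8 + {new_disp:#x}]")
```

Because only the displacement field changes, no extra address-calculation instructions or scratch registers are needed for this case.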
However, for indirect memory accesses, variable strides,
or complicated address calculations, the address for the nth
future iteration has to be calculated through an extra code
snippet before the prefetch rule. The code snippet can nor-
mally be duplicated from existing instructions for the given
address in a loop. The INSTR_CLONE rule directs the DBM to
duplicate the instructions from a range within a basic block.
for loop in janus.hotLoops:
  for mem in loop.memReads:
    if isIndirectLoad(mem):
      slice = getPrefetchSlice(mem)
      if slice.inputs.contain(loop.indvars):
        loop.prefetches.insert(slice)
  for slice in loop.prefetches:
    insertRule(INSTR_CLONE, slice)
    insertRule(MEM_PREFETCH, slice.mem, offset)
    for input in slice.inputs:
      insertRule(INSTR_UPDATE, input + offset)
    for mem in slice.memReads:
      insertRule(MEM_PREFETCH, mem, offset*2)
    if slice.needScratchRegister:
      insertRule(MEM_SPILL_REG, slice.scratchSet, loop.init)
      insertRule(MEM_RECOVER_REG, slice.scratchSet, loop.exit)
(a) Pseudo-code for prefetch rewrite-rule generation
  33  mov qword ptr [rsp + 0x30], rcx
  34  mov rax, qword ptr [rsp + 0x18]
  35  mov edi, dword ptr [rax + rcx*8]
  xx  lea r9, [rcx + 0x40]
  xx  prefetcht0 [rax + r9*8]
  xx  lea r9, [rcx + 0x20]
  xx  mov eax, dword ptr [rax + r9*8]
  xx  and eax, dword ptr [rsp + 0x60]
  xx  mov ecx, dword ptr [rsp + 0x58]
  xx  shr eax, cl
  xx  movsxd r9, eax
  xx  shl r9, 5
  xx  prefetcht0 [rsi + r9]
  36  mov eax, edi
  37  and eax, dword ptr [rsp + 0x2c]
  38  mov ecx, dword ptr [rsp + 0x28]
  39  shr eax, cl
  40  movsxd r14, eax
  41  shl r14, 5
  42  lea rbp, [r15 + r14]
(b) Example of MEM_PREFETCH interpretation. Lines marked xx are inserted by the INSTR_CLONE, INSTR_UPDATE and MEM_PREFETCH rules; the remaining instructions of the loop (numbered 43 to 127 in the original figure) are shown unmodified there and are omitted here.
Figure 3. Prefetch generation in Janus.
The INSTR_UPDATE rule alters the instruction operands so
that the slice can take different inputs. A combination of
INSTR_CLONE, INSTR_UPDATE and MEM_PREFETCH can then
achieve the prefetch for complicated memory accesses.
Static Rule Generation We use the Janus static analyser
to detect regions that can be optimised through prefetching
within the input binary. We implement the detection analysis
in Janus’ static analyser based on the algorithm by Ainsworth
and Jones [1], as implemented in their LLVM pass. This
detection method finds patterns of indirect memory accesses
within loops and identifies all instructions that are required
for address calculation for each iteration.
Following detection the Janus static analyser generates
a combination of INSTR_CLONE and MEM_PREFETCH rules for
each opportunity it detects. In some cases, additional scratch
registers have to be used for the address calculation. Two
rewrite rules, MEM_SPILL_REG and MEM_RECOVER_REG, are
generated to direct the dynamic modifier to spill and recover
registers determined by the static analyser to be least harmful
to performance. With a global view of the loop, the static
analyser can detect dead registers and avoid register spilling
in the frequently executed code path.
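The rule-generation procedure of Figure 3(a) can be made executable as a small Python model. The `Slice` and `Loop` classes, their field names and the tuple encoding of rules are hypothetical stand-ins for the static analyser's data structures; the order and arguments of the emitted rules follow the pseudo-code.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    mem: str        # the indirect load to prefetch
    inputs: list    # registers feeding the address calculation
    mem_reads: list  # intermediate loads inside the slice
    scratch_set: list = field(default_factory=list)  # registers to spill


@dataclass
class Loop:
    indvars: set  # induction-variable registers
    slices: list  # one candidate slice per indirect load


def generate_prefetch_rules(loop, offset):
    """Emit rules as tuples, mirroring the pseudo-code of Figure 3(a)."""
    rules = []
    # Only slices driven by an induction variable are worth prefetching.
    selected = [s for s in loop.slices if loop.indvars & set(s.inputs)]
    for s in selected:
        rules.append(("INSTR_CLONE", s.mem))
        rules.append(("MEM_PREFETCH", s.mem, offset))
        for reg in s.inputs:
            rules.append(("INSTR_UPDATE", reg, offset))
        for mem in s.mem_reads:
            # Intermediate loads are prefetched twice as far ahead.
            rules.append(("MEM_PREFETCH", mem, offset * 2))
        if s.scratch_set:
            regs = tuple(s.scratch_set)
            rules.append(("MEM_SPILL_REG", regs, "loop.init"))
            rules.append(("MEM_RECOVER_REG", regs, "loop.exit"))
    return rules


loop = Loop(
    indvars={"rcx"},
    slices=[
        Slice("[rsi + r9]", ["rcx"], ["[rax + rcx*8]"], ["r9"]),
        Slice("[rdx + rbx]", ["rbx"], []),  # not fed by the induction variable
    ],
)
rules = generate_prefetch_rules(loop, offset=32)
```

With an offset of 32, the intermediate load is prefetched 64 iterations ahead, which matches the 0x20 and 0x40 constants visible in Figure 3(b).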
Error Handling The LLVM compiler pass must insert extra
bounds-checking code around intermediate loads so as
to avoid introducing faults caused by out-of-bounds address
calculation. However, Janus reaps the benefits of both static
analysis and dynamic transformation, which we use to avoid
generating this additional code. After each inserted prefetch
instruction we added a specific rewrite rule, INSTR_NOP, to
generate a bit pattern which, for x86 code, is 0x90909090
(four no-ops). If a fault occurs, control is passed to Janus’
signal handler, which searches ahead in the basic block from
the position of the fault, looking for this bit pattern. If it finds
the recovery pattern, then Janus assumes the fault occurred
within code inserted for prefetch optimisation, so it skips
to the end of all inserted prefetch code. If there are register
spills involved, it rolls back to the start of the sequence of
spill-recovery instructions and resumes normal execution.
During correct (i.e., fault-free) execution, these no-ops have
a negligible effect on performance and avoid the overheads
from bounds checking.
Figure 4. Dynamic error handling in software prefetch. The figure labels the inserted software-prefetch address calculation, the prefetch, and the recovery pattern that follows it (drawn as NOP DWORD ptr [EAX + 00H]); a segmentation fault inside the inserted code is passed to the Janus signal handler, which resumes at the recovery pattern.
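The forward scan performed by the signal handler can be modelled on a byte string. This is a minimal sketch assuming the four-no-op pattern named in the text; the function name `find_resume_offset` and the sample bytes are illustrative, and the register-spill rollback case is deliberately left out.

```python
RECOVERY_PATTERN = bytes([0x90] * 4)  # four one-byte x86 no-ops


def find_resume_offset(block, fault_offset):
    """Return the offset just past the recovery pattern, or None.

    None models the case where the fault did not come from inserted
    prefetch code and must be delivered to the application as usual.
    """
    pos = block.find(RECOVERY_PATTERN, fault_offset)
    if pos < 0:
        return None
    return pos + len(RECOVERY_PATTERN)


# Stand-in bytes for a basic block: two inserted loads, the recovery
# pattern, then one original instruction.
block = bytes.fromhex("488b08" "4d8b38" "90909090" "4901c8")

resume_after_fault = find_resume_offset(block, fault_offset=3)
no_pattern_ahead = find_resume_offset(block, fault_offset=10)
```

A production handler would also have to rule out the pattern appearing by coincidence inside an instruction encoding; the sketch ignores that.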
Dynamic Adaptation An optimisation is to interpret Janus’
rewrite rules only in hot loops, removing cache pollution
caused by prefetches in the cold code path. Whenever a trace
is formed in DynamoRIO, we can retrieve the rewrite rules
and perform software prefetching, rather than doing it for
all loops. Our software-prefetch optimisation can also easily
adapt to the underlying hardware. For example, if it detects
that the processor does not support the prefetch instruction then
it avoids generating any prefetch code. This is realised by
adding an extra guard in all prefetch-related rewrite rules.
With compiler-based prefetch insertion, two copies of the
code would have to be generated (one with prefetches, one
without) to avoid incurring the overheads of additional instruction
execution when running on processors without
prefetch support, typically resulting in excessively large executables.
VECT_BROADCAST     Broadcast a scalar to all SIMD lanes
VECT_BOUND_CHECK   Perform a bound check
VECT_REDUCE_INIT   Initialise reduction SIMD lanes
VECT_REDUCE_AFTER  Merge reduction SIMD lanes
VECT_CONVERT       Convert a scalar instruction to SIMD version
VECT_DEP_CHECK     Dynamic dependence check
VECT_LOOP_PEEL     Duplicate the original loop code with specified trip count
VECT_LOOP_UNROLL   Unroll the loop based on the dynamic scale
Figure 5. Major rewrite rules used in automatic vectorisation in Janus.
4 Data-Level Parallelism
Many processors support a range of SIMD instruction sets,
yet when an application is compiled to run on a variety of
different systems, compilers must generate binaries that target
the lowest common denominator. This then limits the
applications to only run on machines that contain that specific
instruction set, making the whole program incompatible
with machines containing different instruction sets.
Previous attempts to perform automatic vectorisation in a
same-ISA dynamic binary modier [9, 24] focused on purely
dynamic vectorisation in a trace. Due to lack of comprehen-
sive static and alias analysis, this technique can only vec-
torise loops containing a single basic block and with regular
memory accesses. In this section, we describe how generic
vectorisation can be achieved using rewrite rules in Janus.
4.1 Binary Vectorisation
In Janus, eight rewrite rules are designed to support vectorisation,
as shown in figure 5.
Scalar to SIMD Translation The VECT_CONVERT rule is
associated with every scalar machine instruction that needs
to be translated to a SIMD counterpart. Within the conver-
sion handler, the size of each original scalar register is ad-
justed to the corresponding widest SIMD register, as deter-
mined by the static analyser. For example, register R8 is
translated to XMM8 or YMM8 based on the specification of
the running hardware, as demonstrated in figure 6(b).
The size of each designated memory operand is also converted
to the corresponding size in the instruction. A check
of the instruction’s opcode is then performed and the appro-
priate instruction handler called, dealing with either single or
double precision, and rewriting to its corresponding opcode
and operands for the vector counterpart.
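The opcode rewriting can be pictured as a table lookup keyed on the scalar opcode, the ISA found at runtime and the alignment. The sketch below covers single precision only; the `movss` and `divss` mappings follow Figure 6(b), while the `addss` and `mulss` entries, the `convert` function and its signature are additions for illustration.

```python
# Packed single-precision counterparts of scalar arithmetic opcodes.
PACKED_ARITH = {"divss": "divps", "addss": "addps", "mulss": "mulps"}

# Width in bytes of the widest vector register for each ISA.
VECTOR_BYTES = {"SSE": 16, "AVX": 32}


def convert(opcode, isa, aligned, word_bytes=4):
    """Return (vector opcode, lane count) for a scalar opcode."""
    lanes = VECTOR_BYTES[isa] // word_bytes
    if opcode == "movss":
        # Moves choose between the aligned and unaligned packed forms.
        base = "movaps" if aligned else "movups"
    else:
        base = PACKED_ARITH[opcode]
    # AVX uses the VEX-encoded mnemonic, written with a "v" prefix.
    vector_opcode = "v" + base if isa == "AVX" else base
    return vector_opcode, lanes


sse_load = convert("movss", "SSE", aligned=True)
avx_store = convert("movss", "AVX", aligned=False)
avx_div = convert("divss", "AVX", aligned=False)
```

Since the ISA is a runtime argument, one statically generated rule serves both SSE and AVX machines.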
Loop Unrolling and Peeling Loop unrolling is directed by
rule VECT_LOOP_UNROLL, which is associated with instruc-
tions that update the induction variable in a loop. The static
rewrite rule describes the original word width, stride length
and number of SIMD lanes for the input binary. Dynamically,
Janus obtains the runtime word width from the underlying
hardware, calculates the peeling distance and unroll scale,
and adjusts memory operands.
However, both the SIMD width and loop trip count might
only be known at runtime. Dealing with this requires Janus
to calculate the remainder of the loop iterations before the
loop’s execution, where the rule VECT_LOOP_PEEL is inserted.
If the dynamic trip count is not divisible by the lane count,
the handler creates a new code cache that duplicates the
original loop and modies the loop bound based on the peel-
ing distance. If the peeling distance is an immediate value,
Janus inlines the code into the basic block before the loop
execution. Otherwise, if the distance is symbolic, it inserts a
runtime check to determine whether to jump to the peeling
code cache or not.
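The peeling decision reduces to integer arithmetic once the vector width and trip count are known. The following sketch assumes the peeled iterations are simply the remainder of the trip count divided by the lane count; `plan_vector_loop` and its result keys are hypothetical names.

```python
def plan_vector_loop(trip_count, word_bytes, vector_bytes):
    """Split a loop into vector iterations and peeled scalar iterations."""
    lanes = vector_bytes // word_bytes  # elements processed per vector op
    peel = trip_count % lanes           # iterations left for the scalar copy
    return {
        "lanes": lanes,
        "vector_iters": trip_count // lanes,
        "peel_iters": peel,
        "needs_peel_loop": peel != 0,
    }


# 1003 iterations over 4-byte floats with 32-byte (AVX) registers.
avx_plan = plan_vector_loop(trip_count=1003, word_bytes=4, vector_bytes=32)
# The same loop with 16-byte (SSE) registers.
sse_plan = plan_vector_loop(trip_count=1003, word_bytes=4, vector_bytes=16)
# A trip count divisible by the lane count needs no peeled copy.
even_plan = plan_vector_loop(trip_count=1024, word_bytes=4, vector_bytes=32)
```

When the trip count is symbolic, the same remainder has to be computed by inserted code at runtime, which is the check the text describes.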
Static and Dynamic Alignment Check Janus provides
information in the rewrite schedule, gained from the static
analyser, on whether the memory accesses are definitely
aligned, definitely unaligned, or whether a check must be
carried out when the vector size is known. The first two
options correspond to when an instruction has no memory
access and when unalignment must be assumed given the
lack of information available statically. The instruction op-
code is then replaced by the appropriate aligned or unaligned,
single or double, original- or VEX-encoded instruction, for
example VMOVDQA or VMOVDQU.
Alignment of memory accesses with the vector size is an
important consideration, as loading unaligned memory into
vector registers is slow, and arithmetic vector instructions
require their memory accesses to be aligned. In order to deal
with this, where possible, the initial memory-access address
is calculated and stored alongside the VECT_CONVERT rule
to allow the calculation of whether or not the instruction
is aligned, which will be performed dynamically once the
vector size is known.
An arithmetic vector instruction cannot perform a mem-
ory access that is not aligned with the vector size. This means
that an additional instruction must be inserted to load un-
aligned memory before performing the operation. The mem-
ory operand is loaded into a free register, which then replaces
the memory operand in the vector instruction.
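The three-way alignment information and the deferred check can be sketched as follows. The `Alignment` enum, `is_aligned` and `select_move` are hypothetical names; the only facts taken from the text are the three static outcomes and the VMOVDQA/VMOVDQU pair.

```python
from enum import Enum


class Alignment(Enum):
    ALIGNED = 0    # statically known to be aligned
    UNALIGNED = 1  # must be assumed unaligned
    CHECK = 2      # decide at runtime, once the vector size is known


def is_aligned(static_info, address, vector_bytes):
    """Resolve the static verdict, checking the address only if required."""
    if static_info is Alignment.CHECK:
        return address % vector_bytes == 0
    return static_info is Alignment.ALIGNED


def select_move(static_info, address, vector_bytes):
    """Pick the aligned or unaligned VEX-encoded integer move."""
    if is_aligned(static_info, address, vector_bytes):
        return "vmovdqa"
    return "vmovdqu"


# The same stored address is aligned for 16-byte vectors but not for 32-byte.
addr = 0x6CD0B0
with_16 = select_move(Alignment.CHECK, addr, 16)
with_32 = select_move(Alignment.CHECK, addr, 32)
```

The example shows why the check cannot be completed statically: the verdict for a fixed address changes with the vector width of the machine that runs the binary.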
Initialisation and Reduction The VECT_BROADCAST rule
tags a scalar register, memory operand, or existing SIMD
operand to be expanded to a full SIMD register. Depending
for loop in janus.hotLoops:
  checks = getRuntimeCheckPairs(loop)
  aligned = alignmentAnalysis(loop)
  for pair in checks:
    insertRule(VECT_BOUND_CHECK, pair, loop.init)
  if needPrePeeling(loop):
    insertRule(VECT_LOOP_PEEL, loop.peel, loop.init)
  for var in loop.inputVariables:
    insertRule(VECT_BROADCAST, var, loop.init)
  for var in loop.reductionVariables:
    insertRule(VECT_REDUCE_INIT, var, loop.init)
    insertRule(VECT_REDUCE_MERGE, var, loop.exit)
  for inst in loop.instructions:
    if needConvert(inst):
      insertRule(VECT_CONVERT, inst, aligned, stride)
  for iter in loop.iterators:
    insertRule(VECT_LOOP_UNROLL, iter, iter.update)
  if needPostPeeling(loop):
    insertRule(VECT_LOOP_PEEL, loop.peel, loop.exit)
(a) Pseudo code for vectorisation static rule generation
Rule: VECT_EXTEND 4 unaligned
divss 0x827cdc(%rax)[4byte] %xmm0 -> %xmm0
SSE: movups 0x827cd0(%rax)[16byte] -> %xmm1
     divps  %xmm1 %xmm0 -> %xmm0
AVX: vmovups 0x827cc0(%rax)[32byte] -> %ymm1
     vdivps  %ymm1 %ymm0 -> %xmm0
Rule: VECT_EXTEND 4 aligned
movss   0x6cd0a0(%rax)[4byte] -> %xmm0
SSE: movaps  0x6cd0a0(%rax)[16byte] -> %xmm0
AVX: vmovaps 0x6cd0a0(%rax)[32byte] -> %ymm0
Rule: VECT_EXTEND 4 unaligned
movss   %xmm0[4byte]->(%rdx)[4byte]
SSE: movups  %xmm0 -> (%rdx)[16byte]
AVX: vmovups %ymm0 -> (%rdx)[32byte]
(Annotation: memory displacement adjustment)
(b) Runtime interpretation of VECT_CONVERT
Figure 6. Automatic vectorisation in Janus.
on the ISA supported, different broadcast instructions are
generated. For zero initialisation, XORing the register with itself provides
a quick solution.
The VECT_REDUCE_INIT and VECT_REDUCE_MERGE rules
are partnered when handling reduction variables. The first,
VECT_REDUCE_INIT, generates the correct initial value in the
reduction register. Based on the dynamic lane count and ISA
supported, VECT_REDUCE_MERGE generates different code to
reduce the multiple reduction variables across the SIMD
lanes to one reduction variable in the first lane. Currently
only add and multiply reductions are supported.
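The pairing of the two reduction rules can be illustrated by simulating SIMD lanes with Python lists. The placement of the scalar's initial value in lane 0, with the operation's identity in the other lanes, is an assumption made for this sketch; the text only states that the correct initial value is generated. Integers are used so that reassociation does not change the result.

```python
from functools import reduce
from operator import add, mul

IDENTITY = {add: 0, mul: 1}  # identity element for each supported reduction


def reduce_init(initial, lanes, op):
    """Lane 0 carries the scalar's initial value; the rest get the identity."""
    return [initial] + [IDENTITY[op]] * (lanes - 1)


def reduce_merge(vector, op):
    """Combine the per-lane partial results into one scalar."""
    return reduce(op, vector)


def vector_reduce(data, initial, lanes, op):
    """Lane-wise reduction; assumes len(data) is a multiple of lanes."""
    acc = reduce_init(initial, lanes, op)
    for i in range(0, len(data), lanes):
        acc = [op(a, x) for a, x in zip(acc, data[i:i + lanes])]
    return reduce_merge(acc, op)


data = list(range(1, 9))  # 1..8
total = vector_reduce(data, initial=10, lanes=4, op=add)
product = vector_reduce(data, initial=2, lanes=4, op=mul)
```

For floating-point data the lane-wise order of operations differs from the scalar loop, so results may differ in the last bits; the sketch does not model that.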
Runtime Bound Check Once again, the unmodified loop
body is required, but this time to provide the original loop in
the case that it is deemed unsafe to vectorise at runtime. A
VECT_BOUND_CHECK check can be performed on a register’s
value using the condition of it being equal to a provided
value, or on it being a positive value. The check to be used is
determined by whether the rule data contains a value to be
compared against. If the condition is met then the vectorised
loop is executed, otherwise, the original loop is executed. The
loop body is bookmarked by the compare instruction and
conditional jump over the loop body before it, as well as an
unconditional jump to the to-be-vectorised loop’s compare
instruction following it. The compare instruction jumped
to will produce a result that will cause the subsequent vec-
torised loop’s jump to the start of the loop to be ignored, as
the vectorised loop ends in the same state as the original.
Runtime Symbolic Resolution The VECT_DEP_CHECK is
inserted whenever two memory accesses are identified as
“may-alias” due to ambiguities caused by inputs or function
arguments. For example, consider this loop:
for (int i = 0; i < LEN; i++) {
    a[i] = a[i+k] + b[i];
}
If the value of k is greater than zero, then Janus can vectorise
the loop in a simple manner. For negative values of k, the
loop cannot be vectorised without reversing the access order.
The rule VECT_DEP_CHECK directs Janus to generate compare
and jump instructions before the loop, once the symbolic
value of k in a register or stack is available at runtime.
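Why the sign of k matters can be checked by simulating the loop both ways. In the sketch below, `vector_loop` models a SIMD execution that performs all loads of a chunk before any of its stores; `PAD`, the array contents and the function names are illustrative, and the arrays are padded so that `a[i + k]` stays in bounds for negative k.

```python
PAD = 4  # elements of padding on each side of the iterated range


def scalar_loop(a, b, k, n):
    """Reference semantics: one element at a time, in order."""
    a = list(a)
    for i in range(PAD, PAD + n):
        a[i] = a[i + k] + b[i]
    return a


def vector_loop(a, b, k, n, lanes):
    """SIMD model: load a whole chunk, then store it (n % lanes == 0)."""
    a = list(a)
    for i in range(PAD, PAD + n, lanes):
        chunk = [a[j + k] + b[j] for j in range(i, i + lanes)]
        a[i:i + lanes] = chunk
    return a


a = list(range(16))
b = [10] * 16

# k > 0: each iteration reads an element that has not been written yet.
same_for_positive_k = vector_loop(a, b, 1, 8, 4) == scalar_loop(a, b, 1, 8)
# k < 0: each iteration reads the value written by an earlier iteration.
same_for_negative_k = vector_loop(a, b, -1, 8, 4) == scalar_loop(a, b, -1, 8)
```

The two runs agree for k = 1 and diverge for k = -1, which is the distinction the compare-and-jump sequence generated for VECT_DEP_CHECK has to make at runtime.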
4.2 Vectorisation Rule Generation
Figure 6(a) shows the basic procedure in static rule generation.
Loops are initially filtered using Janus’ existing
static analysis and statically guided profiling using profiling-specific
rewrite rules. Loops with function calls, undecided
indirect memory accesses, low iteration counts, or irregular
dependencies are excluded from further rule generation.
Janus also has the option to generate profiling rewrite rules
to guide DynamoRIO to obtain the hot-loop information in
prior runs with training inputs. The resulting rewrite schedule
is then only applicable to those hot loops.
For each target hot loop, the static analyser generates
rewrite rules by pattern matching the constraints on the
static program dependence graph. Induction, reduction and
The Janus Triad VEE ’19, April 14, 2019, Providence, RI, USA
iterator variables are identied and their dependency and
consistency are handled by their corresponding set of rewrite
rules. Runtime-check rewrite rules are only generated when
there are static ambiguities and the checks can be addressed
by simple runtime symbolic resolution. If the symbolic ex-
pression is too complicated (e.g., contains phi node or expres-
sion too long) or there are an excessive number of runtime
checks, Janus would disable vectorisation for this loop.
5 Hybrid Parallelism in Janus
Thread-level parallelism has been explored in Janus in prior
work [25] using a set of twelve rewrite rules to perform
automatic parallelisation for DoAll loops. In this section, we
explore the possibilities for mixing the three sets of rewrite
rules to enable the extraction of MLP, DLP, and TLP within
the same rewrite schedule.
Modular Rule Interpretation. The principle of Janus’
modification is to decompose the extraction of each form of
parallelism into different sets of fine-grained dynamic
operations (rewrite rules), where each set is designed to be
self-contained. In Janus, three separate rewrite-rule-generation
analyses are performed on a binary for prefetching (MLP),
vectorisation (DLP), and parallelisation (TLP), while the final
rewrite schedule contains the combination of the three.
Interleaving rewrite rules for different optimisations
corresponds to the combination of all transformations because
all rewrite rules are modular.
Parallelism Applicability. Generally the three kinds of
parallelism correspond to different phases of computation
and therefore different loops. Janus profiles loops to find
those that are hot and then selects the kinds of parallelism
to extract based on their characteristics. Prefetching is
applicable to loops with indirect memory accesses. We avoid
loops with only regular strided accesses, assuming that the
hardware prefetcher will easily pick up the access pattern.
Vectorisation is suited to loops with regular memory accesses
and sufficient repeated computation. Parallelisation is for
loops without cross-iteration data dependencies and can be
applied to loops that also contain MLP or DLP.
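These selection criteria can be written down as a small decision function. The sketch below is in C and is only an illustration: the `hot_loop` fields and the flag names are hypothetical, and the dependence checks that vectorisation additionally requires are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flags for the kinds of parallelism to extract from one loop. */
enum parallelism { PAR_NONE = 0, PAR_MLP = 1, PAR_DLP = 2, PAR_TLP = 4 };

/* Hypothetical per-loop characteristics gathered for a hot loop. */
struct hot_loop {
    bool indirect_access;      /* e.g. b[ip[i]] */
    bool regular_access;       /* regular, strided memory accesses */
    bool enough_computation;   /* sufficient repeated computation */
    bool cross_iteration_dep;  /* data dependence between iterations */
};

unsigned select_parallelism(const struct hot_loop *l)
{
    unsigned kinds = PAR_NONE;
    assert(l != NULL);
    if (l->indirect_access)
        kinds |= PAR_MLP;   /* prefetch; regular strides are left to hardware */
    if (l->regular_access && l->enough_computation)
        kinds |= PAR_DLP;   /* vectorise */
    if (!l->cross_iteration_dep)
        kinds |= PAR_TLP;   /* parallelise; may combine with MLP or DLP */
    return kinds;
}
```

A loop with an indirect read and no cross-iteration dependence, for instance, would be marked for both prefetching and parallelisation.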
Ordered Rule Interpretation. Based on the semantics of
the rewrite rules, we divide Janus’ rewrite rules into three
groups:
Atomic rules perform a self-contained modification that does
not affect the prerequisites of later rewrite rules, for
example, a runtime bounds check or a scalar-to-vector
conversion.
Pairwise rules must be interpreted in pairs, for example,
rewrite rules for spilling registers and recovering registers.
Local rules affect the semantics of later rules, such as
loop peeling and unrolling. These rewrite rules therefore
have to be interpreted in a specific order.
However, a fraction of the rewrite rules need to be handled
specifically whenever multiple rewrite rules annotate the
same instruction in the same basic block. We carefully define
the rewrite-rule interpretation order by setting the priority
of each rule so that the output of the previous rewrite rule
remains consistent with the prerequisites of the later rewrite
rule. For example, the rewrite rule VECT_LOOP_PEEL peels a
fraction of the total loop iterations to perform aligned
vectorisation. Meanwhile, the parallelisation rewrite rule
PARA_LOOP_INIT modifies the loop starting context for each
thread in its thread-private code cache. The vectorisation
rule VECT_LOOP_PEEL must be interpreted after the
parallelisation initialisation rule PARA_LOOP_INIT so that the
peeling is carried out on the per-thread private version of
the loop context. Similarly, the VECT_REDUCE_AFTER rule
merges a reduction variable into one thread-private copy.
This must be executed before the parallelisation rule
PARA_LOOP_FINISH so that all threaded copies are merged
into the main thread.
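The need for this ordering can be seen with a little index arithmetic. The C sketch below is a simplified model, not Janus code: it assumes contiguous per-thread chunks, an array base aligned to the vector size, and eight floats per vector; the two functions merely play the roles of PARA_LOOP_INIT and VECT_LOOP_PEEL.

```c
#include <assert.h>

#define LANES 8  /* floats per 256-bit vector */

/* First iteration owned by thread t when n iterations are split into
   contiguous chunks (the role of PARA_LOOP_INIT in this sketch). */
long thread_start(long n, int threads, int t)
{
    long chunk = (n + threads - 1) / threads;   /* ceiling division */
    long start = (long)t * chunk;
    return start < n ? start : n;
}

/* Scalar iterations to peel so that the next index is a multiple of
   LANES (the role of VECT_LOOP_PEEL); assumes an aligned array base. */
long peel_count(long start)
{
    assert(start >= 0);
    return (LANES - start % LANES) % LANES;
}
```

With 1002 iterations on four threads the starting indices are 0, 251, 502 and 753, giving peel counts of 0, 5, 2 and 7. A peel count derived from the original loop context (index 0) would therefore be wrong for three of the four threads, which is why peeling has to be applied to the per-thread context.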
However, there are cases when two rewrite rules exhibit an
accumulated modification, where the order of modification
does not matter. For example, the vector loop-unrolling rule
VECT_LOOP_UNROLL can be interleaved with the parallel rule
PARA_LOOP_UNROLL in any order.
Modication Bookkeeping There are rewrite rules that
must be executed in pairs, for example, a spill must be paired
with a recover. The eect of the free register lasts between
the two rewrite rules. Therefore Janus maintains a runtime
bookkeeping variable to keep track of changes made by pair-
wise rules. For example, all rewrite-rule interpreters must
check whether the last-spilled register conicts with their
prerequisites. If a conict does occur, the handler can regen-
erate a recover and spill rewrite-rule pair before and after
the interrupted rewrite rule. So if a rewrite rule wants to
use the original value in register rax that was spilled by a
previous rule, it needs to recover the value rst.
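The bookkeeping described above can be modelled with a single state variable. The following C sketch is hypothetical throughout (the register identifiers, the `spill_state` structure and the counters are invented for illustration) and only tracks which operations would be emitted rather than generating code.

```c
#include <assert.h>
#include <stdbool.h>

enum { REG_NONE = -1, REG_RAX = 0, REG_RCX = 1, REG_RDX = 2 };  /* illustrative ids */

/* Hypothetical bookkeeping kept while rules are interpreted. */
struct spill_state {
    int spilled_reg;       /* register currently freed for scratch use */
    int emitted_spills;
    int emitted_recovers;
};

void emit_spill(struct spill_state *s, int reg)
{
    assert(s->spilled_reg == REG_NONE);   /* pairs do not nest in this sketch */
    s->spilled_reg = reg;
    s->emitted_spills++;
}

void emit_recover(struct spill_state *s)
{
    assert(s->spilled_reg != REG_NONE);
    s->spilled_reg = REG_NONE;
    s->emitted_recovers++;
}

/* Interpret a rule that needs the original value of needed_reg.
   Returns true if an extra recover/spill pair had to be generated. */
bool interpret_rule_needing(struct spill_state *s, int needed_reg)
{
    if (s->spilled_reg == REG_NONE || s->spilled_reg != needed_reg)
        return false;                 /* no conflict with the last spill */
    int reg = s->spilled_reg;
    emit_recover(s);                  /* restore the original value first */
    /* ... the rule's own modification would be emitted here ... */
    emit_spill(s, reg);               /* free the register again afterwards */
    return true;
}
```

After a conflicting rule has been handled, the state is the same as before it, so the closing recover of the original pair still matches.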
Refreshable Rule Interpretation. DynamoRIO’s code cache
may occasionally be flushed to maximise the dispatch
efficiency in its hash-table lookup. This requires the Janus
rewrite-rule handlers to generate the same modifications
after code flushing when they encounter the code once more.
It also requires the rule handlers to examine whether the
runtime context they used for just-in-time recompilation has
changed. To ensure safety, a better strategy for the handlers
is to encode the absolute address of the runtime variable
instead of its value.
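The difference between the two encodings can be illustrated with a toy model in C. The structures below are stand-ins for a piece of generated code and are entirely hypothetical; the point is only that a snippet holding an address reads the current value each time it runs, whereas a snippet holding a copied value depends on when it was generated.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for generated code that needs a runtime variable. */
struct snippet_by_value   { long bound; };          /* value baked in */
struct snippet_by_address { const long *bound; };   /* address baked in */

struct snippet_by_value make_by_value(const long *runtime_var)
{
    assert(runtime_var != NULL);
    struct snippet_by_value s = { *runtime_var };
    return s;
}

struct snippet_by_address make_by_address(const long *runtime_var)
{
    assert(runtime_var != NULL);
    struct snippet_by_address s = { runtime_var };
    return s;
}

long run_by_value(const struct snippet_by_value *s)     { return s->bound; }
long run_by_address(const struct snippet_by_address *s) { return *s->bound; }
```

If the variable changes between the first generation and a regeneration after a flush, the by-value snippets differ from each other, while the by-address snippets are identical and both observe the current value.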
Summary. In Janus, we designed six rewrite rules for
software prefetching and a further eight rewrite rules to
enable vectorisation in a dynamic binary modifier. These,
together with the twelve rewrite rules for parallelisation,
constitute the method of using static analysis to enable
runtime exploitation of three types of parallelism.
VEE ’19, April 14, 2019, Providence, RI, USA Ruoyu Zhou, George Wort, Márton Erdős, Timothy M. Jones
[Figure 7 plot: speedup (y-axis) for CG, IS, RA, HJ2, HJ8 and the geometric mean; bars for DynamoRIO, Compiler and Janus.]
Figure 7. Software prefetch performance comparison between
Janus and pre-compiled executables.
6 Evaluation
We evaluated the extraction of three kinds of parallelism in
Janus on an Intel Haswell i5-4570 CPU with four cores
(8 threads) and a 6MB L3 cache, running at 3.2GHz under
Ubuntu 16.04.
6.1 Prefetching
To evaluate the performance of software prefetching in Janus,
we used the same benchmarks from the NAS suite [3]
as Ainsworth and Jones [1], compiling one set of binaries
with their compiler software-prefetching pass and another
without. The baseline binary was compiled by clang 3.8 with
optimisation level -O3 (Ainsworth’s prefetching pass is
written for LLVM). Our evaluation executes Janus on the same
binary with a rewrite schedule for prefetching. We report
the median, minimum and maximum time from ten runs
using the same inputs as the published work [1].
Figure 7 shows the performance improvement of just-in-time
prefetching in Janus, normalised to the baseline of native
execution without prefetching. The bar labelled “DynamoRIO”
refers to running the original executable under DynamoRIO
without performing any modification, reflecting the overhead
of the dynamic modifier, which is negligible for these
benchmarks. The bar labelled “Compiler” shows the
performance of the applications when software-prefetch
instructions are added by the LLVM pass. Finally, the bar
labelled “Janus” shows that using Janus to insert prefetch
instructions achieves a significant performance improvement
that is comparable to what the compiler can extract.
Janus-based prefetching performs similarly to the LLVM
pass for IS, and outperforms the compiler scheme for CG,
RA and HJ8. There are two reasons for this. First, the hot loop
in the binary without prefetching is unrolled, giving better
performance, whereas in the binary with compiler-generated
prefetching the insertion of prefetches limits unrolling. This
shows the benefit of performing certain optimisations, such
as prefetching, dynamically at runtime. Second, the no-ops
used for error checking in Janus are cheaper than the
bounds checking inserted by the compiler.
[Figure 8 plot: rewrite-schedule size as a percentage of executable size (axis from 0% to 10%) for CG, IS, RA, HJ2, HJ8 and the geometric mean.]
Figure 8. Size of the rewrite schedule for each executable
when prefetching, normalised to the size of the executable.
For HJ2, the LLVM pass achieves higher performance.
Janus’ modifications for HJ2 require an additional scratch
register. Although the amount of spilling is low, because
spills are inserted only when required, this leads to worse
performance than the compiler, since the compiler has full
control over register allocation.
Figure 8 shows the size of the rewrite schedule that encodes
the prefetching rewrite rules for each application,
normalised to the size of the corresponding binary. The
rewrite schedules are small: around 4%, and at most 7%, of
the size of the input executable.
6.2 Automatic Vectorisation
To evaluate the performance of automatic vectorisation in
Janus, we use the Test Suite for Vectorising Compilers (TSVC)
benchmarks [7, 18]. The baseline binaries were compiled by
gcc 5.4 with optimisation level -O3 and -fno-tree-vectorize
to disable auto-vectorisation. The reference binaries were
generated with -O3 with auto-vectorisation enabled by de-
fault.
6.2.1 Vectorisation Coverage
TSVC consists of 151 benchmarks, each containing a single
loop, 28 of which are not vectorisable due to their designated
dependence patterns. Each benchmark produces a checksum,
which may lose precision if the manipulation of values re-
quired results in more frequent rounding or prevents the
underlying use of greater precision. Janus proves itself safe
and correct by rejecting all loops that it cannot handle within
the TSVC test suite and producing checksums which never
lose more accuracy than gcc’s vectorisation. The double-
precision benchmarks never lose any precision.
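The rounding effect mentioned here comes from the changed order of additions in a vectorised reduction. The C sketch below contrasts the two orders; it is an illustration only, with `LANES` and the function names chosen for this example rather than taken from Janus or TSVC.

```c
#include <assert.h>

#define LANES 8  /* assumed number of vector lanes */

/* Sequential sum, as in the scalar loop. */
float sum_scalar(const float *a, long n)
{
    float s = 0.0f;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Lane-wise sum: each lane accumulates every LANES-th element and the
   partial sums are merged after the loop. */
float sum_lanes(const float *a, long n)
{
    float part[LANES] = { 0.0f };
    long i = 0;
    assert(n >= 0);
    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; l++)
            part[l] += a[i + l];
    float s = 0.0f;
    for (int l = 0; l < LANES; l++)   /* merge step after the loop */
        s += part[l];
    for (; i < n; i++)                /* scalar remainder */
        s += a[i];
    return s;
}

/* Relative difference between the two orders on a sample input. */
double relative_gap(void)
{
    float a[1000];
    for (int i = 0; i < 1000; i++)
        a[i] = 0.1f * (float)(i % 7);
    double s = sum_scalar(a, 1000), v = sum_lanes(a, 1000);
    double d = s > v ? s - v : v - s;
    return d / s;
}
```

Both functions add the same values, but floating-point addition is not associative, so the two results may differ in their last bits; a checksum comparison therefore needs a tolerance rather than exact equality.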
gcc manages to vectorise and produce a speedup of at
least 5% for 58 and 55 loops for single and double precision
respectively, when using AVX, while Janus is able to speed
up 39 and 33. For single precision, that corresponds to 47.2%
and 31.7% of the 123 vectorisable loops in TSVC for gcc and
Janus respectively, as shown in figure 9.
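The single-precision percentages follow directly from the loop counts stated above, with 151 benchmarks of which 28 are not vectorisable:

```latex
\[
151 - 28 = 123, \qquad
\frac{58}{123} \approx 47.2\%, \qquad
\frac{39}{123} \approx 31.7\%.
\]
```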
[Figure 9 plot: stacked bars (axis from 0 to 100) for Janus and gcc in double and single precision, each with SSE and AVX; categories: Vectorised, Vectorised with Runtime Check, Not Vectorised, Not Vectorisable.]
Figure 9. Number of vectorised loops compared to gcc 5.4
auto-vectorisation.
[Figure 10 plot: bar chart of sizes for gcc-single, janus-single, gcc-double and janus-double.]
Figure 10. Average size of Janus’ rewrite schedule compared
to gcc’s binary size.
6.2.2 Static Analysis Disparity
Janus’ static binary analysis rejects 26 loops in TSVC that
gcc is able to vectorise. The lack of symbolic information in
Janus’ memory-related analysis constitutes the reason for
the disparity compared to the compiler. These are broken
down in table 1.
Undecided memory accesses are those that do not fit an
affine expression, which is required for dependence checking,
hence Janus is not able to verify the dependence relation
with other memory accesses. The discrepancy comes from
Janus’ failure to perform a full alias analysis at the binary
level. Additional barriers in alias analysis are caused by the
compiler code generators that use more indirect accesses
and other architecture-specific optimisations, thwarting our
current analysis.
Highly optimised binaries also contain code from loop
unrolling, interchange, and machine-specific optimisations
such as conditional instructions and pointer arithmetic
in multi-dimensional array accesses. These either create
complicated control flow or cause the induction variable to be
updated with different strides in a loop. The strength-reduction
optimisation also introduces additional cross-iteration
dependencies. Although these can be handled by Janus in most
cases, the analysis is complicated if the reduction is spilled to
heap memory. The remaining reasons for rejection can be
resolved by continued implementation effort in Janus’
static analysis and the corresponding dynamic handlers.
Table 1. Major reasons for rejecting loops during vectorisation in Janus.
Reason for rejection                              Loop count
Loop trip count neither immediate nor symbolic    6
Undecided memory access                           5
Incompatible instructions                         5
Decrementing induction variable                   4
Complex control flow in loop                      2
Induction variables with different strides        2
Memory dependency                                 2
6.2.3 Storage Benefit
The advantage of Janus is that vectorisation can be tailored
to different machines using only a single rewrite schedule,
without modifying the original binary or specifically
targeting either ISA. This is unlike gcc, which would require
multiple separate binaries or multiple runtime versions to be
produced in order to cover all possible cases. This dynamic
adaptation of the binary, along with runtime checks,
displays the advantages of Janus over static compilation or
modification.
Moreover, the vectorised executables generated by gcc
are typically larger than their scalar counterparts for x86
binaries. The storage advantage of only requiring an
additional rewrite schedule is shown in figure 10. The
rewrite schedule required to achieve performance similar to
gcc is at most half the size of the vectorised executable.
6.2.4 Vectorisation Performance
Figure 11 shows the performance of TSVC aligned loops that
are amenable to Janus’ vectorisation. The baseline binary
runs natively and the execution time for Janus refers to the
execution time to run dynamic binary modication on the
same baseline binary. We force Janus to vectorise this binary
using either the SSE instruction set (128-bit SIMD lanes) or
the AVX instruction set (256-bit SIMD lanes).
Two reference binaries were also compiled natively by
gcc 5.4 using the additional flags -msse4.2 and -mavx,
representing state-of-the-art auto-vectorisation performance.
From figure 11, we can conclude that Janus is able to produce
execution times comparable to the vectorised versions of the
binaries produced by gcc. Using SSE, Janus attains 98.7% of
gcc’s performance where both can vectorise the loop. For
[Figure 11 plot: speedup per TSVC loop, grouped into Aligned and Reduction; series: gcc SSE, Janus SSE, gcc AVX, Janus AVX.]
Figure 11. Performance of vectorised aligned loops in single precision TSVC workloads.
[Figure 12 plot: speedup per TSVC loop, grouped into Peeled and Runtime Check; series: gcc SSE, Janus SSE, gcc AVX, Janus AVX.]
Figure 12. Performance of vectorised peeled and runtime-checked loops in single precision TSVC workloads. Loops marked
with * use a dynamic symbolic resolution check.
AVX, Janus gains 93.1% of gcc’s performance. We divide the
benchmarks into four sections and analyse each separately.
Aligned. The first cluster of loops contains aligned accesses,
where the iteration count is divisible by the number of SIMD
lanes and all memory accesses are aligned. Here, Janus
achieves performance almost comparable to gcc. One
exception is vbor, because its loop has a relatively low
iteration count. The short running time does not amortise
the sampling overhead of DynamoRIO trace creation and
code-cache warm-up.
Peeled. This group of loops exhibits unaligned memory
accesses, or a loop trip count that is not divisible by the
number of SIMD lanes. Eight out of the nine loops that have a
significantly worse running time for Janus contain unaligned
accesses and are peeled. The peeling overhead is caused by
the Janus auxiliary control code that maintains peeling
correctness. When duplicating the peeled loop code,
additional registers might be used to track the peeling state
and distance.
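The structure of a peeled loop, and the extra control it needs, can be sketched at source level. The C example below is an assumption-laden illustration rather than Janus' generated code: it uses a 32-byte (eight-float) vector size, models each vector operation with an inner loop, and uses invented function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8                              /* floats per 256-bit vector */
#define VEC_BYTES (LANES * sizeof(float))

/* Original loop. */
void scale_scalar(float *a, float s, long n)
{
    for (long i = 0; i < n; i++)
        a[i] *= s;
}

/* Peeled version: a scalar prologue runs until a + i is 32-byte aligned,
   the main loop then works in whole vectors, and a scalar epilogue
   handles the leftover iterations. Returns the prologue length. */
long scale_peeled(float *a, float s, long n)
{
    long i = 0;
    assert(n >= 0);
    while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
        a[i] *= s;                   /* prologue */
        i++;
    }
    long peeled = i;
    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; l++)
            a[i + l] *= s;           /* stands in for one aligned vector op */
    for (; i < n; i++)
        a[i] *= s;                   /* epilogue */
    return peeled;
}
```

The prologue and epilogue each need their own counter and exit test, which corresponds to the additional control and register pressure described above.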
The performance penalty in AVX is also due to the mixture
of SSE and AVX instructions in the generated code. Janus
only performs AVX extension on the loops that are annotated
by the rewrite rules. The remaining code still contains SSE
instructions, used for scalar floating-point computation. This
causes frequent transitions between 256-bit AVX instructions
and SSE instructions within the same binary, incurring
performance penalties because the hardware must save and
restore the upper 128 bits of the ymm registers internally.
Even with no SSE instructions used, DynamoRIO’s inter-
nal dispatch routine still spills SSE registers instead of AVX
registers during its context switch.
Preloaded Vectorisation. Janus also vectorises three other
loops that gcc cannot: S241, S243 and S244. According to the
gcc report, gcc identifies a memory dependency because a
read-after-write dependency exists in its IR.
[Figure 13 plot: speedup per TSVC loop; series: DynamoRIO, Janus AVX vectorisation, Janus AVX vectorisation + parallelisation.]
Figure 13. Performance of vectorised and parallelised (4 threads) TSVC workloads.
    for (int i = 0; i < LEN - 1; i++) {
        a[i] = b[i] * c[i] * d[i];
        b[i] = a[i] * a[i+1] * d[i];
    }
At the binary level, the gcc optimisers load the values of
a[i] and a[i+1] into registers, essentially pre-loading the read
values. Therefore the dependence is not present at the binary
level, meaning Janus can vectorise straight away. However,
in the compiler analysis, the loop memory access is still
considered a memory dependency pattern.
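The effect of the pre-load can be reproduced at source level. The C sketch below is illustrative only: it assumes the four arrays do not overlap, uses a hypothetical width of four lanes, and models each vector operation with a short inner loop.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed vector width in elements */

/* Original loop over n iterations; a[n] must be readable. */
void s_scalar(float *a, float *b, const float *c, const float *d, long n)
{
    for (long i = 0; i < n; i++) {
        a[i] = b[i] * c[i] * d[i];
        b[i] = a[i] * a[i + 1] * d[i];
    }
}

/* Chunked form with the read of a[i + 1] hoisted above the store to
   a[i], mirroring the register pre-load seen in the binary. */
void s_preloaded(float *a, float *b, const float *c, const float *d, long n)
{
    long i = 0;
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
    for (; i + LANES <= n; i += LANES) {
        float next[LANES], cur[LANES];
        for (int l = 0; l < LANES; l++)
            next[l] = a[i + l + 1];              /* pre-load old a[i+1] */
        for (int l = 0; l < LANES; l++) {
            cur[l] = b[i + l] * c[i + l] * d[i + l];
            a[i + l] = cur[l];
        }
        for (int l = 0; l < LANES; l++)
            b[i + l] = cur[l] * next[l] * d[i + l];
    }
    for (; i < n; i++) {                          /* scalar remainder */
        a[i] = b[i] * c[i] * d[i];
        b[i] = a[i] * a[i + 1] * d[i];
    }
}
```

In the scalar loop, iteration i reads a[i+1] before iteration i+1 overwrites it, so loading the old values for a whole chunk before any store in that chunk gives the same result.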
Runtime Check. Janus can also insert runtime bound checks
to enable vectorisation for the loops from S162 to S422. This
enables Janus to vectorise six more loops than gcc,
overcoming static ambiguity and compensating for
information lost at the binary level.
Reduction. The reduction performance achieved by Janus
is the same as gcc, as can be seen from S311 to vdotr.
6.3 Hybrid Parallelisation
We next turn our attention to combining parallelism
extraction in Janus.
DLP+TLP. Automatic parallelisation and vectorisation can
be combined to achieve significant performance, as shown
in figure 13. The baseline is the native execution of the scalar
binaries compiled by gcc 5.4. The parallelisation analysis
is applied to all loops but only vectorised loops are shown
in figure 13. Twelve loops are not parallelised due to
cross-iteration dependencies found during static analysis. For
vectorisation, forward cross-iteration dependencies are
allowed using the preloaded optimisation. In contrast, DoAll
parallelisation requires no cross-iteration dependencies in
all cases.
For the remaining 22 loops, the hybrid of vectorisation
and parallelisation achieves an average speedup of 8.8×,
up from 2× using vectorisation alone.
DLP+TLP+MLP. There are few opportunities to extract all
three types of parallelism from a single loop. This is because
the prefetch optimisation normally requires indirect memory
accesses that could challenge the dependence analysis to
enable parallelisation and vectorisation.
Loop S4112 from TSVC is one such example. Its access
pattern is commonly found in sparse matrix-vector
multiplication (SpMV) applications.
    for (int i = 0; i < LEN; i++) {
        a[i] += b[ip[i]] * s;
    }
The indirect array access b[ip[i]] causes the loop to be
memory bound, and it can be optimised by software prefetch.
Meanwhile, vectorisation is also possible if the hardware
supports the gather instruction within its SIMD ISA. The
indirect accesses are only reads, so there are no
cross-iteration write dependencies, meaning parallelisation is
also safe.
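A source-level view of the prefetch transformation for this loop is sketched below in C. It is not the code Janus or the LLVM pass emits: `PREFETCH_DISTANCE` is an assumed look-ahead that would need tuning, the function names are invented, and `__builtin_prefetch` is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>

#define PREFETCH_DISTANCE 32   /* assumed look-ahead in iterations */

/* Original loop: a[i] += b[ip[i]] * s. */
void s4112_plain(float *a, const float *b, const int *ip, float s, long n)
{
    for (long i = 0; i < n; i++)
        a[i] += b[ip[i]] * s;
}

/* Same loop with a software prefetch for the indirect access. The
   address calculation for the future element is duplicated, and the
   guard keeps the look-ahead read of ip[] within bounds. */
void s4112_prefetch(float *a, const float *b, const int *ip, float s, long n)
{
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&b[ip[i + PREFETCH_DISTANCE]], 0, 3);
        a[i] += b[ip[i]] * s;
    }
}
```

The prefetch is only a hint, so both functions compute identical results; the sequential read of ip[] itself is left to the hardware prefetcher.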
Since DynamoRIO does not support the AVX512 gather
instruction (and the AVX2 gather does not give good per-
formance), we could not evaluate this loop in Janus auto-
matically. Instead we manually wrote an assembly version
and encoded the loop into raw binary code snippets. The
code snippets were then loaded into Janus’ code cache for
dynamic execution. For parallelisation, they are copied to
each thread’s private code cache.
Figure 14 shows that prefetching, vectorisation, and paral-
lelisation (on 4 threads) by themselves achieve 1.16×, 1.21×
[Figure 14 plot: speedup for MLP, DLP, TLP, MLP+DLP, MLP+TLP, DLP+TLP and MLP+DLP+TLP.]
Figure 14. Performance of s4112 with different combinations
of parallelism.
and 3.78× speedups respectively, indicating the
effectiveness of the transformations made by Janus.
Parallelisation achieves the largest speedup because the loop
is embarrassingly parallel (DoAll). The combination of
prefetching and parallelisation further improves performance
by 12% over the 3.78× speedup, because these transformations
are orthogonal. In effect, prefetching boosts performance by
12% whether used alone or in combination with parallelisation.
Similarly, the combination of vectorisation and parallelisation
improves on parallelisation alone by 9%, as both
transformations exploit parallelism across loop iterations.
After prefetch optimisation, the loop becomes compute
bound, whereas after vectorisation, the loop becomes
memory bound. However, combining prefetching and
vectorisation results in a slowdown. This is due to the
parallelism provided by a vector memory load being swamped
by the extra overhead of inserting duplicated address
calculation for prefetching. This problem could be addressed
by a vector version of a prefetch gather instruction,
something that is only available in the Xeon Phi ISA
(VGATHERPF0). This also limits the overall benefit of
combining the three forms of parallelism, which achieves only
a 3.8× speedup.
6.4 Summary
Thanks to the combination of static analysis, profiling, and
runtime checks seamlessly controlled by 26 rewrite rules,
we can achieve substantial performance through the Janus
triad: MLP, DLP and TLP. Extracting most of these forms
of parallelism achieves similar performance to a compiler
approach with an abundance of source code information.
Combining them allows extraction of three forms of
parallelism simultaneously from within a binary.
7 Related Work
Janus is the first system to propose a unified platform for
extracting three kinds of parallelism using a DBM. We organise
the related work based on the individual types of parallelism
extracted.
Automatic Binary Prefetching. ADORE [15] uses Itanium
hardware counters to identify hotspots and phases and to
apply memory prefetching optimisations. However, ADORE
relies on the compiler reserving a scratch register that is
needed for the inserted address calculation. Beyler and
Clauss [5] propose a pure software approach that uses a
helper thread to monitor load states and insert prefetches
when needed.
In contrast, Janus works on any binary, makes no
assumptions about the way it has been compiled, and needs no
helper thread. Prefetching in Janus is directed by rewrite
rules, obtained through static analysis and preliminary
profiling. Scratch registers are spilled outside the loop, hence
minimising the spilling overhead. Janus only modifies
indirect accesses, assuming that direct array accesses are
handled by hardware prefetchers.
Automatic Binary Vectorisation. Hallou et al. [9]
implemented auto-vectorisation in a same-ISA dynamic binary
translator (DBT) called Padrone [21], which represents the
closest work to our loop vectorisation in Janus. They
integrated a lightweight static analysis within the DBT to
perform vectorisation whenever hot code is identified. However,
their system cannot handle complicated control-flow
modification, such as loop peeling, and they can only work on
loops with a single basic block. Without offloading static
analysis from runtime, they also suffer a significant
overhead from performing time-consuming analysis (e.g., alias
analysis) during program execution. Janus does not suffer
these overheads, thanks to the rewrite schedule that
communicates statically derived transformations to the dynamic
modifier.
Hong et al. [11, 13] proposed using a machine-independent
IR layer to achieve cross-ISA SIMD transformation,
implemented in QEMU. Li et al. [12] implemented vectorisation
from x86 to the Itanium architecture. Vapor SIMD [19] used a
custom-designed IR derived from source code and performed
JIT compilation. Their target is cross-ISA translation, a
different domain to Janus. Because they perform cross-ISA
translation, they achieve lower performance than Janus.
Yardımcı and Franz [24] proposed a binary parallelisation
and vectorisation scheme for PowerPC binaries, which
combines static analysis and dynamic binary parallelisation in
their dynamic software layer. However, they do not fully
capitalise on the strengths of combining static and dynamic
components, such as runtime checks, error handling, and
trace optimisation. Other work has studied the upper limit
on the parallelism extractable through a dynamic binary
translator [8], but failed to consider the parallelism freed
after removing apparent data dependencies.
Dynamic Binary Optimisation. DynamoRIO [6] is a robust
open-source runtime code manipulation system which
originates from the well-known high-performance binary
translator, Dynamo [4]. Other dynamic modification tools,
such as Pin [16], are closed source, and, like DynInst [10],
are more focused on binary instrumentation. The Sun Studio
Binary Code Optimiser [14] and Microsoft Vulcan [22]
are well-known commercial tools for rewriting binaries for
better single-threaded performance, but both rely on
instrumentation to collect profiling information.
8 Conclusion
We have presented the Janus triad, a unified approach for
exploiting three kinds of parallelism seamlessly controlled
by domain-specific rewrite rules in a same-ISA dynamic
binary modifier. Rule generation and interpretation enable
easy automation and portability. Through the Janus triad, we
demonstrate substantial performance from this framework,
comparable to compiler counterparts.
Acknowledgments
This work was supported by the Engineering and Physical
Sciences Research Council (EPSRC) through grant references
EP/K026399/1, EP/P020011/1 and EP/N509620/1. Additional
data related to this publication is available in the data
repository at https://doi.org/10.17863/CAM.37523 and Janus can
be obtained at https://github.com/JanusDBM/Janus.
References
[1] Sam Ainsworth and Timothy M. Jones. 2017. Software Prefetching for
Indirect Memory Accesses. In CGO.
[2] Kapil Anand, Matthew Smithson, Aparna Kotha, Khaled Elwazeer, and
Rajeev Barua. 2010. Decompilation to compiler high IR in a binary
rewriter. Technical Report. University of Maryland.
[3] David H Bailey, Eric Barszcz, John T Barton, David S Browning,
Robert L Carter, Leonardo Dagum, Rod A Fatoohi, Paul O Freder-
ickson, Thomas A Lasinski, Rob S Schreiber, et al. 1991. The NAS
parallel benchmarks summary and preliminary results. In SC.
[4] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. 2000. Dy-
namo: a transparent dynamic optimization system. In PLDI.
[5] Jean Christophe Beyler and Philippe Clauss. 2007. Performance driven
data cache prefetching in a dynamic software optimization system. In
SC.
[6] Derek Bruening, Timothy Garnett, and Saman Amarasinghe. 2003. An
infrastructure for adaptive dynamic optimization. In CGO.
[7] David Callahan, Jack Dongarra, and David Levine. 1988. Vectorizing
compilers: A test suite and results. In SC.
[8] Tobias J. K. Edler von Koch and Björn Franke. 2013. Limits of Region-
based Dynamic Binary Parallelization. In VEE.
[9] Nabil Hallou, Erven Rohou, Philippe Clauss, and Alain Ketterlin. 2015.
Dynamic re-vectorization of binary code. In Embedded Computer Sys-
tems: Architectures, Modeling, and Simulation (SAMOS), 2015 Interna-
tional Conference on.
[10] Jeffrey K Hollingsworth, Barton Paul Miller, and Jon Cargille. 1994.
Dynamic program instrumentation for scalable performance tools. In
Scalable High-Performance Computing Conference.
[11] Ding-Yong Hong, Sheng-Yu Fu, Yu-Ping Liu, Jan-Jan Wu, and Wei-
Chung Hsu. 2016. Exploiting longer SIMD lanes in dynamic binary
translation. In ICPADS.
[12] Jianhui Li, Qi Zhang, Shu Xu, and Bo Huang. 2006. Optimizing dynamic
binary translation for SIMD instructions. In CGO.
[13] Yu-Ping Liu, Ding-Yong Hong, Jan-Jan Wu, Sheng-Yu Fu, and Wei-
Chung Hsu. 2017. Exploiting Asymmetric SIMD Register
Configurations in ARM-to-x86 Dynamic Binary Translation. In PACT.
[14] Sheldon Lobo. 1999. The Sun Studio Binary Code Optimizer.
http://www.oracle.com/technetwork/server-storage/solaris/binopt-136601.html.
[15] Jiwei Lu, Howard Chen, Rao Fu, Wei-Chung Hsu, Bobbie Othmer, Pen-
Chung Yew, and Dong-Yuan Chen. 2003. The performance of runtime
data cache prefetching in a dynamic optimization system. In MICRO.
[16] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser,
Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim
Hazelwood. 2005. Pin: Building Customized Program Analysis Tools with
Dynamic Instrumentation. In PLDI.
[17] Chi-Keung Luk, Robert Muth, Harish Patil, Robert Cohn, and Geoff
Lowney. 2004. Ispike: A post-link optimizer for the Intel® Itanium®
architecture. In CGO.
[18] Saeed Maleki, Yaoqing Gao, María J Garzarán, Tommy Wong, David A
Padua, et al. 2011. An evaluation of vectorizing compilers. In PACT.
[19] Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams,
David Yuste, Albert Cohen, and Ayal Zaks. 2011. Vapor SIMD:
Auto-vectorize once, run everywhere. In CGO.
[20] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019.
BOLT: A Practical Binary Optimizer for Data Centers and Beyond.
(2019).
[21] Emmanuel Riou, Erven Rohou, Philippe Clauss, Nabil Hallou, and
Alain Ketterlin. 2014. Padrone: a platform for online profiling, analysis,
and optimization. In International Workshop on Dynamic Compilation
Everywhere.
[22] Amitabh Srivastava, Andrew Edwards, and Hoi Vo. 2001. Vulcan:
Binary Transformation in a Distributed Environment. Technical Report
MSR-TR-2001-50. Microsoft Research.
[23] Cheng Wang, Shiliang Hu, Ho-seop Kim, Sreekumar R. Nair, Mauricio
Breternitz, Zhiwei Ying, and Youfeng Wu. 2007. StarDBT: An Efficient
Multi-platform Dynamic Binary Translation System. In Asia-Pacific
Conference on Advances in Computer Systems Architecture.
[24] Efe Yardımcı and Michael Franz. 2006. Dynamic Parallelization and
Mapping of Binary Executables on Hierarchical Platforms. In CF.
[25] Ruoyu Zhou and Timothy M. Jones. 2019. Janus: Statically-Driven and
Profile-Guided Automatic Dynamic Binary Parallelization. In CGO.
