Hardware-Accelerated Dynamic Binary Translation by Rokicki, Simon et al.
Simon Rokicki - Irisa / Université de Rennes 1
Steven Derrien - Irisa / Université de Rennes 1
Erven Rohou - Inria







Systems on a Chip
• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs






























































Dynamically translate native binaries into VLIW binaries:
• Performance close to that of an out-of-order processor
• Energy consumption close to that of a VLIW processor









Existing approaches
• Transmeta Code Morphing Software & Crusoe architectures
  • x86 on VLIW architecture
  • User experience degraded by the cold-code execution penalty
• Nvidia Denver architecture
  • ARM on VLIW architecture
• Translation overhead is critical
• Too little information is available on closed platforms
• Hardware-accelerated DBT framework
  • Makes DBT cheaper (time & energy)
  • First approach that tries to accelerate binary translation













Outline
• Hybrid-DBT Platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental Study
  • Impact on translation overhead
  • Impact on translation energy overhead
  • Impact on area utilization
• Conclusion & Future work










How does it work?
• RISC-V binaries cannot be executed directly on a VLIW
• Direct, naive translation from native to VLIW binaries










• Level 0: no ILP
• Build an Intermediate Representation (CFG + dependencies)
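
The CFG half of this step can be illustrated with the classic "leader" rule for finding basic blocks. This is only a minimal sketch: the `(mnemonic, target)` tuple format and the name `split_basic_blocks` are assumptions made for the example and are not taken from Hybrid-DBT.

```python
def split_basic_blocks(instrs):
    """Split a list of (mnemonic, target) tuples into basic blocks.

    `target` is the index of the branch destination, or None for a
    non-branch instruction. Returns (start, end) index pairs, `end`
    exclusive.
    """
    if not instrs:
        return []
    leaders = {0}                        # the first instruction starts a block
    for i, (_, target) in enumerate(instrs):
        if target is not None:
            leaders.add(target)          # a branch destination starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return list(zip(starts, ends))


# Hypothetical five-instruction program with one forward branch.
program = [
    ("addi", None),   # 0
    ("beq", 4),       # 1: branches to index 4
    ("sub", None),    # 2
    ("addi", None),   # 3
    ("sw", None),     # 4
]
blocks = split_basic_blocks(program)
```

Dependencies are then computed inside each block, which keeps the analysis local and regular.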
















• Code profiling to detect hotspots
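
The slides do not say how profiling is implemented. One common scheme, shown here purely as an assumption, is a per-block execution counter with a threshold; `BlockProfiler` and `HOT_THRESHOLD` are hypothetical names and values.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical value chosen for the example


class BlockProfiler:
    """Counts block executions and flags blocks that cross a threshold."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.hot = set()

    def record(self, block_addr):
        """Count one execution; return True the first time the block becomes hot."""
        self.counts[block_addr] += 1
        if block_addr not in self.hot and self.counts[block_addr] >= self.threshold:
            self.hot.add(block_addr)
            return True
        return False


profiler = BlockProfiler()
trace = [0x100, 0x200, 0x100, 0x100, 0x200, 0x100]
# Blocks reported here would be handed to the next optimization level.
newly_hot = [addr for addr in trace if profiler.record(addr)]
```

Only blocks that become hot pay for the more expensive optimization level; cold code keeps its cheap first-pass translation.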




































What does it cost?
[Figure: translation flow with the No ILP and ILP levels; stage costs of 400 cycles/instr and 500 cycles/instr]
• Cycles/instr: number of cycles needed to translate one RISC-V instruction
• Need to accelerate the time-consuming parts of the translation























• Hardware acceleration of the critical steps of DBT
• Can be seen as a hardware-accelerated compiler back-end












Optimization level 0 (no ILP)
• Critical for system reactivity




[Figure: instruction encodings with fields imm12 | rs1 | funct | rd | opcode and imm13 | rs1 | rd | opcode, translated into VLIW binaries]
• Implemented as a Finite State Machine
• Translate each native instruction separately
• Produces 1 VLIW instruction per cycle
• 1 RISC-V instruction => up to 2 VLIW instructions
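
A software sketch of the "translate each native instruction separately" idea follows. The field extraction uses the standard RISC-V I-type layout; the output tuples are a hypothetical stand-in for VLIW operations, and only `addi` is handled. It does not model the actual FSM or the real VLIW encoding.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    mask = 1 << (bits - 1)
    return (value & (mask - 1)) - (value & mask)


def decode_itype(word):
    """Extract the standard RISC-V I-type fields from a 32-bit word."""
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "imm": sign_extend(word >> 20, 12),
    }


def translate_one(word):
    """Translate one native instruction, independently of its neighbours.

    Returns a list of hypothetical VLIW operations. A list is used
    because one native instruction may need more than one operation.
    """
    f = decode_itype(word)
    if f["opcode"] == 0x13 and f["funct3"] == 0:       # ADDI
        return [("addi", f["rd"], f["rs1"], f["imm"])]
    return [("nop",)]                                  # not handled in this sketch


def first_pass(words):
    """Single loop over the native code, with no inter-instruction analysis."""
    vliw = []
    for word in words:
        vliw.extend(translate_one(word))
    return vliw


# addi x4, x1, 1  followed by  addi x5, x5, -1
code = [0x00108213, 0xFFF28293]
vliw = first_pass(code)
```

Because no instruction looks at its neighbours, the loop body maps naturally onto a small state machine, which is what makes this step cheap to accelerate.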























• Critical to start exploiting VLIW capabilities












Exploit available ILP
• Compute dependencies
• Perform instruction scheduling
nop          | stw r5,0(r3)
addi r4,r1,1 | ldw r3,0(r2)
sub r4,r4,r3 | nop
movi r3,0    | stw r4,0(r3)
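
The two steps can be sketched in software as a dependency pass followed by greedy list scheduling. This is a simplified illustration, not the Hybrid-DBT implementation: it assumes unit latencies, treats every hazard (RAW, WAR, WAW) as a strict ordering constraint, and models memory as a single pseudo-resource named `mem`. The four-operation input sequence is made up for the example.

```python
def build_dependencies(ops):
    """Return deps[i] = set of earlier op indices that op i must follow.

    Each op is (text, writes, reads), where writes/reads are sets of
    resource names (registers, plus "mem" for memory).
    """
    deps = []
    for i, (_, writes_i, reads_i) in enumerate(ops):
        d = set()
        for j in range(i):
            _, writes_j, reads_j = ops[j]
            # RAW, WAR and WAW hazards respectively.
            if writes_j & reads_i or reads_j & writes_i or writes_j & writes_i:
                d.add(j)
        deps.append(d)
    return deps


def list_schedule(ops, issue_width=2):
    """Greedy cycle-by-cycle list scheduling with unit latencies."""
    deps = build_dependencies(ops)
    cycle_of = {}
    bundles = []
    while len(cycle_of) < len(ops):
        cycle = len(bundles)
        bundle = []
        for i in range(len(ops)):
            if i in cycle_of or len(bundle) == issue_width:
                continue
            # Ready only if every predecessor was issued in an earlier cycle.
            if all(j in cycle_of and cycle_of[j] < cycle for j in deps[i]):
                bundle.append(i)
        for i in bundle:
            cycle_of[i] = cycle
        bundles.append([ops[i][0] for i in bundle])
    return bundles


ops = [
    ("ldw r3,0(r2)", {"r3"}, {"r2", "mem"}),
    ("addi r4,r1,1", {"r4"}, {"r1"}),
    ("sub r4,r4,r3", {"r4"}, {"r4", "r3"}),
    ("stw r4,0(r5)", {"mem"}, {"r4", "r5"}),
]
bundles = list_schedule(ops)
```

With an issue width of 2, the load and the `addi` share the first bundle, so the four operations fit in three cycles instead of four.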
Cost of optimization level 1
• Generate high-level IR: 400 cycles/instr, acceleration is simple
  • Instruction decoding/encoding
  • Single FOR loop
  • Regular computations
• Instruction scheduling on the IR: 500 cycles/instr, acceleration is challenging
  • Difficult to parallelize
  • Complex control flow structure
• Instruction scheduling is the bottleneck
• IR is designed to speed up scheduling


















































[Figure: IR table for the example block; one recoverable entry is "5: mov r3 = 0"]
IR advantages:
• Direct access to dependencies and successors
• Regular structure (no pointers / variable size)
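
To make the two advantages concrete, here is a minimal sketch of a table-like IR built in one forward pass. The entry layout, `MAX_SUCC`, and `build_ir` are assumptions for illustration, not the actual Hybrid-DBT encoding. Only true (read-after-write) register dependencies are tracked; WAR/WAW and memory dependencies are left out.

```python
MAX_SUCC = 4  # hypothetical fixed number of successor slots per entry


def build_ir(ops, num_regs=32):
    """Build a table-like IR in one forward pass.

    Each op is (text, dest_reg or None, [source_regs]). Every entry has
    the same shape: a dependency count and a fixed-size successor array
    holding entry indices (-1 marks an unused slot). Indices replace
    pointers, so the table can live in a plain memory block.
    """
    last_writer = [-1] * num_regs        # latest entry writing each register
    ir = []
    for idx, (text, dest, srcs) in enumerate(ops):
        entry = {"op": text, "n_deps": 0, "succ": [-1] * MAX_SUCC, "n_succ": 0}
        for producer in sorted({last_writer[r] for r in srcs} - {-1}):
            p = ir[producer]
            if p["n_succ"] == MAX_SUCC:
                raise ValueError("successor slots exhausted")
            p["succ"][p["n_succ"]] = idx  # direct access to successors
            p["n_succ"] += 1
            entry["n_deps"] += 1          # direct access to dependency count
        if dest is not None:
            last_writer[dest] = idx
        ir.append(entry)
    return ir


ops = [
    ("ldw r3,0(r2)", 3, [2]),
    ("addi r4,r1,1", 4, [1]),
    ("sub r4,r4,r3", 4, [4, 3]),
    ("movi r3,0", 3, []),
]
ir = build_ir(ops)
```

A scheduler can then pick entries whose `n_deps` is zero and decrement the counters of their successors, without walking any linked structure.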
Details on hardware accelerators
• Developing such accelerators in VHDL is out of reach
• Accelerators are developed using High-Level Synthesis
  • Loop unrolling/pipelining
  • Memory partitioning




• IR Builder: one-pass dependency analysis
• IR Scheduler: list-scheduling algorithm
























• VLIW baseline is executed with ST200simVLIW
• Fully functional Hybrid-DBT platform on FPGA
  • JIT processor: Nios II
  • Altera DE2-115













Impact on translation overhead
• Cost of optimization level 0 using the hardware accelerator
[Figure: speed-up vs. software DBT for the First-Pass Translator, IR Builder and IR Scheduler]























• Cost of optimization level 1 using the hardware accelerator
[Figure: speed-up vs. software DBT for the First-Pass Translator, IR Builder and IR Scheduler]



















• Hybrid-DBT platform on ASIC:
  • Synthesized with Design Compiler for a 65 nm ASIC technology
  • Design frequency: 250 MHz
  • Gate-level simulation with ModelSim
  • Accurate power estimation










• Energy-efficiency improvement using the hardware accelerator
[Figure: energy efficiency vs. software DBT for the First-Pass Translator, IR Builder and IR Scheduler]





























• Resource usage for all our platform components
Conclusion
• Presentation of the Hybrid-DBT framework
  • Hardware-accelerated DBT
  • Open-source DBT framework from RISC-V to VLIW
  • Tested FPGA prototype












Future Work
• DBT to support hardware adaptability
• Exploring cost/impact of optimizations


















