INTRODUCTION
Binary translation and dynamic optimization technologies are widely used in modern software and hardware computing systems [1]. In particular, dynamic binary translation systems (DBTS) comprising these technologies provide compatibility between widely used legacy microprocessor architectures and promising upcoming ones at the level of executable binary codes. In the context of binary translation these architectures are usually referred to as source and target, respectively.
A DBTS executes binary codes of the source architecture on top of target architecture hardware whose instruction set architecture (ISA) is incompatible with the source. It translates executable codes incrementally (as opposed to static compilation of a whole application), interleaving translation with execution of the generated translated codes. One of the key requirements that every DBTS has to meet is that the performance of source codes executed through binary translation be comparable to, or even exceed, the performance of native execution (i.e., execution on source architecture hardware).
An optimizing translator is usually employed to achieve higher DBTS performance. It makes it possible to generate highly efficient target architecture codes that fully utilize all the architectural features introduced to support binary translation. Besides, dynamic optimization can benefit from actual information about the behavior of executables, which static compilers usually do not possess.

1 The article is published in the original.
At the same time, dynamic optimization can imply significant overhead, because optimization time is included in the execution time of the application under translation. Total optimization time can be significant and will not necessarily be compensated by the speed-up of the translated codes if the application run time is too short.
Also, the operation of an optimizing translator can worsen the latency (i.e., increase the pause time) of an interactive application or operating system under translation. Latency here means the response time of the computer system to external events such as asynchronous hardware interrupts from attached I/O devices and interfaces. For the end user, for the operation of attached hardware and for other computers across the network, this characteristic of a computer system is as important as its overall performance. Full-system dynamic binary translators have to provide low latency of operation as well. Binary translation systems of this class aim to implement all the semantics and the behavior model of the source architecture and to execute the entire hierarchy of system-level and application-level software, including BIOS and operating systems. They exclusively control all the computer system hardware and operation. Throughout this paper we will also refer to binary translation systems of this type as virtual machine level (or VM-level) binary translators (as opposed to application-level binary translators).

Abstract: Binary translation and dynamic optimization are widely used to provide compatibility between legacy and promising upcoming architectures at the level of executable binary codes. Dynamic optimization is one of the key contributors to dynamic binary translation system performance. At the same time it can be a major source of overhead, both in terms of CPU cycles and whole-system latency, because optimization time is included in the execution time of the application under translation. One solution that eliminates dynamic optimization overhead is to perform optimization simultaneously with the execution, in a separate thread. In this paper we present an implementation of this technique in a full-system dynamic binary translator. For this purpose, an infrastructure for multithreaded execution was implemented in the binary translation system. This allowed running dynamic optimization in a separate thread independently of, and concurrently with, the main thread executing the binary codes under translation. Depending on the computational resources available, this is achieved either by interleaving the two threads on a single processor core or by moving the optimization thread to an underutilized processor core. In the first case the latency introduced to the system by computationally intensive dynamic optimization is reduced. In the second case overlapping of the execution and optimization threads also eliminates optimization time from the total execution time of the original binary codes.

DOI: 10.1134/S0361768812030073
One recognized technique to reduce dynamic optimization overhead is to perform optimization simultaneously (concurrently) with the execution of original binary codes by utilizing unemployed computational resources or free cycles. It has been used in a number of dynamic binary translation and optimization systems [2, 8]. We will refer to this method as background optimization (as opposed to consecutive optimization, when optimizing translation interrupts execution and uses processor time exclusively until it completes).
This paper describes the implementation of background optimization in a VM-level dynamic binary translation system. This is achieved by separating optimizing translation from the execution flow into an independent thread, which can then share the available processing resources concurrently with the execution thread. Backgrounding is implemented either by interleaving the two threads in the case of a single-core (single-processor) system or by moving the optimization thread to an unemployed processor core in the case of a dual-core (dual-processor) system. In the first case, the latency introduced to the system by the "heavy" phase of optimizing translation is reduced. In the second case, overlapping of the execution and optimization threads also eliminates the time spent in the dynamic optimization phase from the total run time of the original application under translation.
The specific contributions of this work are as follows:
• implementation of a multithreaded infrastructure in a VM-level dynamic binary translation system;
• an implementation of the background optimization technique targeted at single-processor systems, where processor time sharing is implemented by interleaving optimizing translation with execution of original binary codes;
• an implementation of the background optimization technique targeted at dual-processor systems, where optimizing translation is completely offloaded onto an underutilized processor core.
The solutions described in the paper were implemented in the VM-level dynamic binary translation system LIntel, which provides full system-level binary compatibility with the Intel IA-32 architecture on top of Elbrus architecture [9, 10] hardware.
LINTEL
Elbrus is a VLIW (Very Long Instruction Word) microprocessor architecture. It has several special features, including hardware support for full compatibility with the IA-32 architecture on the basis of transparent dynamic binary translation.
LIntel is a dynamic binary translation system developed for high-performance emulation of an Intel IA-32 architecture system through dynamic translation of source IA-32 instructions into wide instructions of the target Elbrus architecture (the two architectures are ISA incompatible). It provides full system-level compatibility, meaning that it is capable of translating the entire hierarchy of source architecture software (including BIOS, operating systems and applications) transparently for the end user (Fig. 1). As noted above, LIntel is a co-designed system (developed along with the architecture, with hardware assistance in mind) and heavily utilizes all the architectural features introduced to support efficient IA-32 compatibility.
In its general structure LIntel is similar to many other binary translation and optimization systems described before [11, 13] and is very close to Transmeta's Code Morphing Software [14, 15]. Like any other VM-level binary translation system, it has to solve the problem of efficient sharing of computational resources between translation and execution of original binary codes.
Adaptive Binary Translation
LIntel follows an adaptive, profile-directed model of translation and execution of binary codes (Fig. 2). It includes four levels of translation and optimization that differ in the efficiency of the resulting Elbrus code and in the overhead implied, namely: an interpreter, a non-optimizing translator of traces and two optimizing translators of regions. LIntel performs dynamic profiling to identify hot regions of source code and to apply a reasonable level of optimization depending on the behavior of the executable codes. A translation cache is employed to store and reuse generated translations throughout execution. The run-time support system controls the overall binary translation and execution process.

When the system starts, the interpreter is used to carefully decode and execute IA-32 instructions sequentially, with attention to memory access ordering and precise exception handling. Non-optimizing translation is launched if the execution counter of a particular basic block exceeds a specified threshold.
The non-optimizing translator builds a trace, which is a semantically accurate mapping of one or several contiguous basic blocks (following one path of control) into the target code. The building blocks for the trace are templates of the corresponding IA-32 instructions, where a template is a manually scheduled sequence of Elbrus wide instructions. After code generation and additional corrections, such as patching in actual constants and address values, the trace is stored in the translation cache. The trace translator produces native code without complex optimizations and focuses on fast generation of translations rather than on code efficiency. It improves start-up time significantly as compared to interpretation. At the same time, non-optimizing translation is only reasonable for executable codes with a low repetition rate.
Traces are instrumented to profile hot code for O0-level optimizing translation. The unit of optimizing translation is a region. In contrast to traces, regions can combine basic blocks from multiple paths of control, providing better opportunities for optimization and speculative execution (which is an important source of instruction-level parallelism for VLIW processors).
The O0-level translator is a fast region-based optimizer that performs basic optimizations of low computational cost, including peephole optimization, dead code elimination, constant propagation, code motion, redundant load elimination, superblock if-conversion and scheduling.
The strong O1-level region-based optimizer is at the highest level of the system. Its power is comparable with that of optimizing compilers for high-level languages. It applies advanced optimizations such as software pipelining, global scheduling, hyperblock if-conversion and many others, and utilizes all the architectural features introduced to support binary optimization and execution of optimized translations.
Region translations are stored in the translation cache as well. Profiling of regions for O1-level optimization is carried out by O0-level translations.
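The promotion scheme described above can be illustrated with a minimal sketch. It is not LIntel's implementation: the threshold values, the per-block granularity (real traces and regions span several basic blocks) and the counter reset on promotion are assumptions made only for illustration.

```python
from dataclasses import dataclass

# Translation levels in the order described in the text.
TIERS = ["interpreter", "trace", "O0", "O1"]

# Hypothetical promotion thresholds; the paper does not give actual values.
THRESHOLDS = {"interpreter": 50, "trace": 1000, "O0": 10000}


@dataclass
class Block:
    address: int
    tier: str = "interpreter"
    counter: int = 0


def execute_once(block: Block) -> str:
    """Count one execution and promote the block to the next level
    once its counter exceeds the threshold of the current level."""
    block.counter += 1
    threshold = THRESHOLDS.get(block.tier)  # None for the top level (O1)
    if threshold is not None and block.counter > threshold:
        block.tier = TIERS[TIERS.index(block.tier) + 1]
        block.counter = 0  # assumption: profiling restarts at the new level
    return block.tier


block = Block(address=0x1000)
for _ in range(60):
    execute_once(block)
print(block.tier, block.counter)
```

With these assumed thresholds, the block leaves the interpreter on its 51st execution and is then profiled again at the trace level.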
Optimized translations do not always result in performance improvement. Assumptions made at optimization time that later prove wrong can cause an execution penalty. These include incorrect speculative optimizations, memory-mapped I/O access in optimized code (where I/O access is not guaranteed to be consistent due to merging and reordering of memory operations), etc. Correctness of optimizations is controlled by the hardware at run time. Upon detecting a failure, retranslation of the region is launched, applying more conservative assumptions depending on the failure type.

Figure 3 compares the average translation cost of one IA-32 instruction and the performance of translated codes for different levels of optimization. Adaptivity aims at choosing an appropriate level of optimization throughout the translation and execution process to maintain the balance between overhead and performance.

Figure 4 shows the distribution of translation and execution time for SPEC2000 tests running under Linux (the operating system is translated as well). While translated codes are being executed for most of the tests' run time, the overhead of optimizing translation is significant and amounts to 7% on average.
Asynchronous Interrupts Handling
One of the functions of the run-time support system is to handle incoming external (also known as asynchronous) interrupts. The method of delayed interrupt handling makes it possible to improve the performance of binary translated [...]

This arrangement of pending-interrupt checks simplifies planning and scheduling of translated codes, as there is no need to care about correct termination of execution and context recovery at arbitrary moments of time. At the same time it allows LIntel to respond to external events promptly enough.
The bottleneck in this scenario is the optimizing translation phase. If an interrupt occurs while optimization is in progress, it cannot be handled until the optimization phase completes (Fig. 5). Due to the computational complexity of the optimizations employed, optimizing translation can consume a significant amount of processor time, and so the delay in the system's response to an external event can be noticeable (see the evaluation in Section 3.2).
BACKGROUND OPTIMIZATION
To overcome the problems of performance overhead and latency caused by optimizing translation, the method of background optimization was employed in LIntel.
The concept of background optimization implies performing the optimizing translation phase concurrently (or pseudo-concurrently) with the main binary translation flow, which executes the original binary codes. Application-level binary translators usually implement this by using the native operating system's multithreading interface and scheduling service to perform optimization in a separate thread. VM-level binary translation systems require an internal implementation of multithreading to support background optimization.
In this section we describe the implementation of background optimization in the VM-level DBTS LIntel. Two cases are considered: in the first case LIntel operates on top of a single-core target platform; in the second case there are two cores available for utilization. SPEC2000 tests are used to demonstrate the effect of the background optimization implementation.
Execution and Optimization Threads
A multithreaded execution infrastructure was implemented in LIntel, with optimizing translation capable of running independently in a separate thread, which enabled concurrency of the execution and optimization threads. The activity of the execution thread includes the entire process of translation and execution of original binary codes except for optimizing translation (of both the O0 and O1 levels), i.e., interpretation, non-optimizing translation, run-time support and execution itself. The optimizing translator is run in a separate optimization thread when a new region of hot code is identified by the execution thread. When the optimization phase completes, the generated translation of the region is returned to the execution thread, which places it into the translation cache.

During the region optimization phase, the corresponding original codes are executed either by interpretation or by previously translated codes of lower levels of optimization. Selection of new hot regions for optimization is not launched until the current optimization activity completes.
By the end of optimization, memory pages that contain the source code of the region under optimization can become invalidated (due to DMA, self-modification, etc.). Therefore, before placing the optimized translation of the region into the translation cache, the execution thread must check the consistency of the region's source code and reject the region if verification fails. This routine is assisted by the memory protection monitoring subsystem (introduced in the Elbrus hardware to support binary translation [16]), which controls the coherency of source codes and translated codes (including translations in progress).
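The hand-off between the two threads and the consistency check before installation can be sketched as follows. This is only an illustration using Python's standard `threading` and `queue` modules. LIntel implements its own multithreading inside the VM-level translator, and the names `page_version`, `translation_cache` and `install_if_consistent` are hypothetical stand-ins, with a version counter playing the role of the hardware memory protection monitor.

```python
import queue
import threading

# Hypothetical stand-ins: page_version models what the memory protection
# monitor tracks; translation_cache models the translation cache.
page_version = {0x1000: 0}
translation_cache = {}

requests = queue.Queue()  # execution thread -> optimization thread
results = queue.Queue()   # optimization thread -> execution thread


def optimization_thread():
    """Optimize regions handed over by the execution thread."""
    while True:
        item = requests.get()
        if item is None:  # shutdown signal
            break
        region, version_seen = item
        code = f"optimized<{region:#x}>"  # placeholder for real optimization
        results.put((region, version_seen, code))


def install_if_consistent(region, version_seen, code):
    """Execution-thread side: reject the translation if the source page
    changed while the region was being optimized."""
    if page_version[region] != version_seen:
        return False
    translation_cache[region] = code
    return True


worker = threading.Thread(target=optimization_thread)
worker.start()

requests.put((0x1000, page_version[0x1000]))  # hot region identified
region, seen, code = results.get()            # optimization has finished
page_version[0x1000] += 1                     # e.g. self-modifying code
accepted = install_if_consistent(region, seen, code)

requests.put(None)
worker.join()
print(accepted, translation_cache)
```

In this run the source page is modified before installation, so the optimized region is rejected and the cache stays empty. Without the modification, the same call would place the translation into the cache.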
Separation of the execution and optimization threads makes it possible to schedule them across the available processing resources in the same way as multitasking operating systems schedule processes and threads. So far, two simple strategies of processor time sharing have been implemented in LIntel, enabling optimization backgrounding for single-core and dual-core systems.
Background Optimization in a Single Core System
In the case of a single-core system, background optimization is implemented by interleaving the execution and optimization threads. Throughout optimizing translation of a hot region the processor switches between the two threads. Scheduling is triggered by interrupts from an internal timer that is dedicated to binary translation and "invisible" to the executable codes under translation. Each thread is assigned a fixed time slice. When the execution thread is active, an incoming external interrupt has a chance to be handled without having to wait for region optimization to complete (Fig. 6). If there are no hot regions pending for optimization, the execution thread fully utilizes the processor core.
To demonstrate the single-core background optimization approach, a simple strategy of processor time sharing was chosen in which both threads have equal priority and are assigned equal time slices (meaning that the optimization thread's processor utilization is 50%, in contrast to 100% utilization under consecutive optimization). As seen from Fig. 7, interleaving of execution and optimization improves interrupt delivery time significantly.
At the same time, as Fig. 8 demonstrates, this approach tends to degrade binary translation performance.
The degradation can be explained by the fact that the hot region optimization phase now lasts longer. As a result, injection of optimized translations into execution is delayed, and meanwhile the source binary codes are executed non-optimized (or interpreted). Additional overhead comes from context switching routines.
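The trade-off described above can be made concrete with a back-of-the-envelope model. It is a simplified sketch, not a measurement: the function names and the millisecond figures are hypothetical, context switching cost is ignored, and so is the extra time translated code needs to reach its next pending-interrupt check.

```python
def worst_case_interrupt_delay(opt_time_ms, opt_slice_ms=None):
    """Longest time an interrupt may wait before the execution thread runs.

    opt_slice_ms=None models consecutive optimization (no interleaving),
    where the interrupt may have to wait for the whole optimization phase.
    """
    if opt_slice_ms is None:
        return opt_time_ms
    return min(opt_slice_ms, opt_time_ms)


def optimization_wall_time(opt_time_ms, exec_slice_ms, opt_slice_ms):
    """Wall-clock time until the optimized region is ready when the
    optimizer only receives its share of the single core."""
    share = opt_slice_ms / (exec_slice_ms + opt_slice_ms)
    return opt_time_ms / share


# Hypothetical numbers: a 40 ms optimization phase and equal 1 ms slices.
print(worst_case_interrupt_delay(40))        # consecutive optimization
print(worst_case_interrupt_delay(40, 1))     # interleaved
print(optimization_wall_time(40, 1, 1))      # region is ready later
```

Under these assumptions, equal time slices cut the worst-case wait from the full optimization time to one slice, while the optimized translation arrives after twice the wall-clock time. This matches the 50% utilization figure and the delayed injection of optimized code described in the text.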
At present, the single-core implementation of background optimization is not a high priority. At the same time, we believe that its efficiency can be improved by tuning parameters such as the time slices of the execution and optimization threads and the profiling thresholds, so as to achieve earlier injection of optimized translations into the execution process while keeping whole-system latency acceptable. Besides, the IA-32 "halt" instruction can be used as a hint to utilize free cycles and yield the processor to the optimization thread before the end of the execution thread's time slice. An extensive study of processor time utilization by the execution and optimization threads was made in [17].
Background Optimization in a Dual Core System
In a dual-core system LIntel fully utilizes the second (otherwise unemployed) processor core to perform dynamic optimization in a background thread. In this case the execution thread exclusively utilizes its own core and only interrupts execution to acquire the next region for optimization and to allocate the generated translation when optimization completes.
As Fig. 10 demonstrates, overlapping of execution and optimization by moving the optimization thread onto a separate core not only eliminates the latency problem but also increases the overall performance of the binary translation system.
The resulting speed-up (6% on average) agrees well enough with the dynamic optimization overhead estimated for the case of consecutive optimization (see Section 2.1).
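The agreement between the measured speed-up and the optimization overhead can be checked with simple arithmetic. The sketch below assumes an idealized model in which the optimization time is hidden completely and nothing else changes. The function name `overlap_speedup` is hypothetical, and the model gives a rough upper bound rather than a prediction.

```python
def overlap_speedup(opt_fraction):
    """Upper bound on speed-up when optimization time is fully hidden.

    opt_fraction is the share of total run time spent in optimizing
    translation under consecutive optimization (0 <= opt_fraction < 1).
    """
    if not 0 <= opt_fraction < 1:
        raise ValueError("opt_fraction must be in [0, 1)")
    return 1.0 / (1.0 - opt_fraction)


# 7% is the average optimizing translation overhead reported for SPEC2000.
gain = overlap_speedup(0.07) - 1.0
print(f"{gain:.1%}")  # roughly 7.5%
```

The bound of about 7.5% lies slightly above the measured 6%. That is consistent with the text, since the execution thread still pauses to acquire regions and install translations, and it runs lower-level code while a region is being optimized.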
Discussion and Future Works
As noted above, selection of hot regions in the execution thread is blocked until the optimization phase completes. However, profile counters continue to grow, and by the end of optimization there may be several nonoverlapping regions in the profile graph with counters exceeding the threshold. As counters are checked during execution of the corresponding translated codes, the next optimizing translation will be launched for the first such region executed. This region will not necessarily be the hottest one. Thus a problem of suboptimal hot region selection arises, which also needs to be addressed (traversing the profile graph can be quite time consuming and is not an option).

The profile of binary translation for SPEC2000 tests (Fig. 4) suggests that the current optimization workload is not enough to fully utilize the processor core affiliated with the optimization thread, which will run idle for most of the application run time. To improve its utilization ratio, the optimizing translator can be forced to activate more often. This can be achieved by dynamically decreasing the hot region profiling threshold depending on the current load of the core affiliated with the optimizing translator. When execution activity is naturally low, this core should be halted for reasons of energy efficiency.

It is reasonable to ask why the unemployed processor core is not used to execute source binary codes. In other words, if there is more than one target architecture microprocessor core in the system, source architecture system software (e.g., the operating system) could "see" and utilize the same number of cores. The current Elbrus architecture implementation (used in this paper) does not satisfy the IA-32 architecture requirements concerning organization of multiprocessor systems. As a result, IA-32 multiprocessor support is not possible on top of Elbrus hardware, but we hope to implement this scenario in the future. Still, we believe that having processor cores used solely for dynamic optimization is reasonable for the following reasons:
• different classes of software (legacy software, software for embedded systems, etc.) that were not always developed with multiprocessing or multithreading in mind can benefit from multicore or multiprocessor systems when executed through binary translation with the background optimization option;
• given the tendency towards an ever increasing number of cores per chip, it seems reasonable to utilize some cores to improve dynamic binary translation system performance; the optimizing translator is not the only possible consumer of these resources: other jobs that could also be performed asynchronously include identification and selection of code regions for optimization [18], software code prefetching [19], persistent translated code storage access [20] (see footnote 3), etc.
Finally, we think that a promising direction for future research and development is building a binary translation infrastructure that could support an unrestricted number of execution threads (in terms of the source architecture virtual machine, so that the operating system under translation could "see" more than one processor core), optimization threads and other threads, and schedule them efficiently across the available computational resources depending on their quantity, their load and the execution behavior of the binary codes.
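The idea mentioned in this section of lowering the hot region threshold when the optimization core is underutilized could be sketched as below. This is a hypothetical policy, not something implemented in LIntel: the function name, the load bounds, the halving step and the limits are all assumptions. Raising the threshold again under high load is an addition of this sketch, since the text only proposes decreasing it.

```python
def adjust_threshold(current, optimizer_load, low=0.25, high=0.75,
                     step=0.5, floor=100, ceiling=10_000):
    """Return a new hot region threshold given the recent load (0.0..1.0)
    of the core affiliated with the optimizing translator."""
    if optimizer_load < low:
        # Core is mostly idle: select hot regions earlier.
        return max(floor, int(current * step))
    if optimizer_load > high:
        # Core is saturated: back off (assumption, not from the paper).
        return min(ceiling, int(current / step))
    return current


threshold = 1000
for load in (0.1, 0.1, 0.5, 0.9):
    threshold = adjust_threshold(threshold, load)
    print(load, threshold)
```

The floor keeps the optimizer from being triggered for code that is barely warm, which would otherwise waste the core on regions that never repay their optimization cost.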
CONCLUSIONS
The paper addresses the problem of optimization overhead in dynamic binary translation systems and presents the application of the background optimization technique in the full-system dynamic binary translator LIntel. Implementations for single-core and dual-core systems are considered. In the first case backgrounding is implemented by interleaving execution and optimization, while in the second case dynamic optimization is moved completely onto a separate processor core. In both cases background optimization solves the problem of high latency caused by dynamic optimization, which is particularly important for a full-system execution environment. Performing optimization on a separate core also eliminates optimization overhead from the application run time, thus improving binary translation system performance in general.

3 Asynchronous access to a persistent code storage (aka Code Base) has already been implemented in LIntel but is not covered in this paper, as we only consider the effect of the background optimization implementation.
Fig. 10. Binary translation speed-up when optimizing on a separate processor core (as compared to consecutive optimization).
