Single instruction multiple data (SIMD) has been adopted for decades because of its superior performance and power efficiency. The SIMD capability (i.e., width, number of registers, and advanced instructions) has diverged rapidly on different SIMD instruction-set architectures (ISAs). Therefore, migrating existing applications to another host ISA that has fewer but longer SIMD registers and more advanced instructions raises the issues of asymmetric SIMD capability. To date, this issue has been overlooked and the host SIMD capability is underutilized, resulting in suboptimal performance. In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP), which combines short ARMv8 instructions and registers in the guest binaries to exploit the x86 AVX2 host's parallelism, register capacity, and gather instructions. Our experiment results show that saSLP improves the performance by 1.6× (2.3×) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 to x86 AVX2 (AVX-512) translation. Furthermore, with AVX2 (AVX-512) gather instructions, saSLP speeds up several data-irregular applications that cannot be vectorized on ARMv8 NEON by up to 3.9× (4.2×).
• We identify the issues of asymmetric SIMD capability and propose saSLP as an effective solution. It combines multiple guest instructions to form a longer SIMD instruction that can be used to exploit the host's parallelism.
• We propose a novel spill-aware cost model that can determine whether combining guest registers would be profitable by weighing the extra unpacking overheads and the benefits of conserving host registers.
• We present the vectorization of data-irregular loops by exploiting SIMD gather instructions and propose a new algorithm to vectorize loop reductions in DBTs, where the typical methods used in static compilers do not work.
• We evaluated saSLP with ARMv8 to x86 AVX2 (resp., AVX-512) translations. Benchmark results demonstrate that saSLP improves the performance by 1.6× (resp., 2.3×) and reduces register spilling by 97% (resp., 99%), on average. By utilizing AVX2 (resp., AVX-512) gather instructions, saSLP achieves 1.9× (resp., 1.74×) average speedup on a set of data-irregular applications.
The remainder of this article is organized as follows. Section 2 provides an overview of cross-ISA DBTs and explains how asymmetric SIMD capacity causes problems. We describe our approach, saSLP, in Section 3 and then report the evaluation results in Section 4. Section 5 deals with related work and Section 6 contains our concluding remarks.
BACKGROUND AND MOTIVATION
One of the core techniques in cross-ISA DBTs is mapping guest registers and instructions to effective host ones. We start with an overview of such mappings and then explain how the mappings underutilize the host's SIMD capability.
Overview of Cross-ISA Dynamic Binary Translation
Most cross-ISA DBTs translate guest binaries at the granularity of a translation code fragment and maintain a guest architectural state in the host memory for emulation. To reduce memory accesses, typical retargetable and cross-ISA DBTs (Baraz et al. 2003; Bellard 2005; Dolan-Gavitt et al. 2015; Fu et al. 2015) map each guest SIMD register in the architectural state to a virtual register with the same width within a translation code fragment. Thus, they have to synchronize modified virtual registers back to the guest state at the exit points. Similarly, guest SIMD instructions are mapped to equivalent machine-independent intermediate representation (IR) instructions that operate on the same number of vector elements. In the final step, the DBT back end allocates host registers to the mapped virtual registers and converts the mapped IR to the host binary. Figure 1 illustrates the translation of an ARMv8 NEON vector addition of two 8B integers.
An Example of Asymmetric SIMD Capability
As the width and number of SIMD registers have become more divergent between ISAs, compilers need to consider ISA-dependent SIMD registers when performing optimization. For example, many modern compilers, such as GCC, LLVM, and ICC, vectorize and unroll loops according to the SIMD registers of target ISAs to exploit both data and instruction level parallelism (ILP). The example in Figure 2 illustrates the procedure. For ease of presentation, we scale down the number of SIMD registers in ARMv8 NEON and x86 AVX2 from 32 and 16 to 6 and 3, respectively. Next, we explain how compilers optimize the loop in Figure 2 (a) for NEON. First, the compiler vectorizes the loop by maximizing the number of elements processed in a NEON register. For example, in Figure 2 (b), each of the 16B registers (v0-v4) holds four 4B floating-point numbers. Second, to achieve better ILP, the compiler unrolls and interleaves the loop according to the six available NEON registers. The loop is unrolled only once because a larger unrolling factor would require more than six registers and cause spilling. The final compiled binary is shown in Figure 2 (b).
Instructions from the first and second unrolled iterations are marked ➀-➄ and ❶-❺, respectively.
For cross-ISA DBTs, such ISA-dependent optimizations cause problems in the typical register and instruction mapping discussed earlier. This is because a binary loop optimized for the guest could become non-optimal on the host, which has a different width and number of SIMD registers. For instance, to translate the NEON binary in Figure 2(b), typical methods map each NEON register (v0-v4) to one virtual register (%v0-%v4) with the same width, resulting in a total of five live ranges in Figure 2(c). All NEON instructions are mapped to IR instructions that also operate on four floating-point numbers. As a result, the parallelism of the AVX2 host, which can operate on eight floating-point numbers at a time, is underutilized. Furthermore, since the AVX2 host in this case has only three registers, the five overlapping live ranges in Figure 2(c) will cause register spilling in the resulting AVX2 binary, and only half of the 32B AVX2 register capacity can be used to store a 16B NEON register.
To solve these problems, our saSLP algorithm transforms the IR in Figure 2(c) to that in Figure 2(d). First, it combines each of the five instructions ➀-➄ in the first loop iteration with ❶-❺ in the second unrolled iteration in a pairwise manner. Therefore, the combined instructions (e.g., ➂❸) in Figure 2(d) can now operate on eight floating-point numbers and thus fully utilize the AVX2 parallelism. Second, saSLP combines four 16B virtual registers (%v0-%v3) in Figure 2(c) to form two 32B virtual registers (%v02 and %v13) in Figure 2(d). As a result, register spilling will no longer occur because the number of overlapping live ranges is reduced to three by fully utilizing the AVX2 register capacity.
However, saSLP faces several challenges when performing this transformation. First, to combine the instructions in Figure 2(c), saSLP must ensure there is no data dependency between the reordered memory instructions. Yet, with the limited information in binaries, saSLP may not always be able to detect the dependency at translation time. For example, the order of load ❶ and store ➄ in Figure 2(c) is swapped in Figure 2(d), and they access the memory locations %x1 + 16 and %x2, respectively. Since the pointer values %x1 and %x2 are unknown at the time of translation, a conditional translation approach with a runtime check is required to avoid the execution of the transformed code in Figure 2(d) when %x1 and %x2 are aliased (i.e., ❶ and ➄ are dependent).
Second, although combining short registers in Figure 2 (c) can save host registers, it is necessary to unpack the short register from the corresponding long register to access an individual combined short register. For example, the two modified long registers (%v02 and %v13) in Figure 2 (d) need to be unpacked to four short registers (%v0-%v3) and then stored back in the guest state for synchronization. The overheads of extra unpacking might nullify the benefits of saving host registers. Therefore, a cost model that can precisely estimate the gain of saving host registers and the cost of extra data reordering is required to determine if short registers should be combined.
SPILL-AWARE SLP
In this section, we present our approaches for the above-mentioned transformation and then provide details of the solutions to the two challenges in Section 3.3 and Section 3.4, respectively. saSLP is an optimization algorithm designed for cross-ISA DBTs. First, we introduce the algorithm and then describe how the instructions are combined and how the registers are combined. Figure 3 details the steps of the saSLP algorithm. The white boxes are related to combining guest instructions to form longer ones on the host; the highlighted parts are designed to address the two challenges mentioned earlier.
Overview
As shown in Figure 3, saSLP targets innermost loops captured by the DBTs, because they are usually the hottest (step 1). Then, saSLP collects short (i.e., scalar or vector) instructions in the loop body as seeds. Starting with those seeds, saSLP constructs a combining graph by traversing the use-define chain and then groups visited short instructions. The traversal stops at short instructions that cannot be combined. The short instructions in each group are candidates for being combined with each other to form a long SIMD instruction on the host (step 2).
For each constructed combining graph, saSLP estimates the performance cost of the short and long forms (i.e., c_S and c_L in step 3). The long cost also includes the extra overheads of data movement (e.g., packing and unpacking). Then, saSLP credits the additional gain of saving host registers (i.e., c_R in step 3). If combining short instructions is profitable (step 4), saSLP versions the loop into an original short version and an optimized long version with a runtime check when needed (steps 5 and 6). Finally, saSLP combines the grouped instructions to form longer SIMD instructions in the long-version loop, and inserts additional data reordering instructions for packing and unpacking (step 7).
2:6 Y.-P. Liu et al.
We use the simplified case of ARMv8 NEON to x86 AVX2 translation in Figure 2 as a running example. We use the terms ShortVS and LongVS to denote, respectively, the width of registers in the input IR (guest) and the output IR (host). In addition, CF (i.e., combining factor) denotes how many short instructions will be combined to form a long SIMD instruction. CF is computed as LongVS / ShortVS.
Combining Short Instructions to Form Long SIMD
The input of saSLP is the typically mapped IR of a guest binary loop, as described in Section 2.2. To construct the combining graphs, saSLP first collects consecutive short (i.e., scalar or vector) stores in the loop body to form seed chains. Starting from the seeds is a popular heuristic approach that modern compilers adopt (Lattner and Adve 2004). In the motivating example, where ShortVS and LongVS are 16B and 32B, respectively, saSLP collects the two consecutive vector stores (➄ and ❺) in Figure 4(a) to form a seed chain of length two. Next, saSLP uses Algorithm 1 to construct the combining graphs in a bottom-up manner from the seeds with recursive traversal (lines 3 and 10) along the use-define chains (i.e., edges in Figure 4(b)). For a list of short instructions S1 currently visited, the algorithm checks whether the short instructions in S1 can be combined to form a long SIMD instruction (line 6). Short instructions in S1 that can be combined (1) must have the same opcode and data type, (2) must not have a duplicated instruction in S1, and (3) must be schedulable in parallel. The last requirement ensures that all data flow and memory dependencies are preserved after combining the short instructions, because the combined instructions are moved to the position of the resulting long SIMD instruction. Thus, to ensure that the combining is valid, saSLP marks reordered memory instructions that can be checked at runtime with the techniques introduced in Section 3.3 as combinable. Others that cannot be checked at runtime are treated as dependent and non-combinable.
If short instructions in S1 can be combined, they are grouped and inserted into the combining graph G (line 7 and solid-line rectangles in Figure 4(c)). Then, the traversal continues to trace along the use-define chains from S1 to S2 (i.e., following the arrows in Figure 4(b) in reverse order), where S2 is a list of operands of S1 (lines 8 and 9). Finally, edges connecting the newly inserted nodes of S1 and S2 are added to the combining graph (line 11 and arrows in Figure 4(c)).
For short instructions in S1 that cannot be combined, the algorithm inserts a short node for each short instruction in S1 into the combining graph and adds an extra packing operation node to pack the short nodes into a long SIMD node (line 13 and the dashed-line rectangle in Figure 4(c)). Therefore, the resulting long SIMD node can be used by other long SIMD nodes because they now have the same width. The traversal returns to the previous level at non-combinable S1.
In the motivating example, the graph construction starts the traversal from the seeds (➄, ❺). As combining ➄ and ❺ necessitates reordering of memory instructions ➄ and ❶, the algorithm records their pointers that can be checked at runtime and marks (➄, ❺) as combinable. The traversal follows the use-define chains and groups the add (➃, ❹), multiply (➂, ❸), and load instructions (➀, ❶) and (➁, ❷). On the other hand, (%v4, %v4) (i.e., the right-hand side operands of (➂, ❸)) cannot be combined owing to the violation of requirement (2) and thus are packed into a long SIMD node. Figure 4(c) shows the final constructed combining graph.
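The bottom-up traversal described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Algorithm 1 listing: the `Inst` record, the `livein` opcode, and the instruction names (`l1`, `L1`, and so on) are hypothetical stand-ins for the marked instructions of the running example, load addresses and data types are omitted, and the memory-dependency part of requirement (3) is reduced to a plain data-flow check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    opcode: str            # e.g. "load", "mul", "add", "store", "livein"
    operands: tuple = ()   # names of the defining instructions


def uses(ir, a, b):
    """Return True if instruction a transitively uses the result of b."""
    stack, seen = list(ir[a].operands), set()
    while stack:
        n = stack.pop()
        if n == b:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(ir[n].operands)
    return False


def combinable(ir, group):
    insts = [ir[n] for n in group]
    same_shape = len({(i.opcode, len(i.operands)) for i in insts}) == 1
    is_inst = all(i.opcode != "livein" for i in insts)
    no_duplicate = len(set(group)) == len(group)           # requirement (2)
    parallel = not any(uses(ir, a, b)                      # requirement (3)
                       for a in group for b in group if a != b)
    return same_shape and is_inst and no_duplicate and parallel


def build(ir, group, graph):
    """Insert `group` into `graph` and recurse along the use-define chains."""
    if group in graph:
        return graph
    if combinable(ir, group):
        graph[group] = "combined"
        for k in range(len(ir[group[0]].operands)):
            build(ir, tuple(ir[n].operands[k] for n in group), graph)
    else:
        graph[group] = "pack"   # keep the short nodes and add a packing node
    return graph


# Hypothetical IR: lower-case names belong to the first unrolled iteration,
# upper-case names to the second one; v4 is a loop-invariant register.
ir = {
    "v4": Inst("livein"),
    "l1": Inst("load"), "l2": Inst("load"),
    "m3": Inst("mul", ("l1", "v4")), "a4": Inst("add", ("m3", "l2")),
    "s5": Inst("store", ("a4",)),
    "L1": Inst("load"), "L2": Inst("load"),
    "M3": Inst("mul", ("L1", "v4")), "A4": Inst("add", ("M3", "L2")),
    "S5": Inst("store", ("A4",)),
}
graph = build(ir, ("s5", "S5"), {})
```

In this sketch the stores, adds, multiplies, and loads end up as combined groups, while the duplicated operand pair `("v4", "v4")` is marked for packing, mirroring the outcome described for the motivating example.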
Runtime Memory Aliasing Check
To detect memory dependencies that cannot be determined at translation time in DBTs, we propose a conditional translation with a runtime check of pointer aliasing. When saSLP encounters a possible memory dependency, it versions the loop into an original short version (ShortLoop) and an optimized long SIMD version (LongLoop), and inserts checking code into the RTCheck block. This operation predicates the two versioned loops based on whether aliased pointers are detected at runtime (Figure 5(a)). Given a pair of may-alias pointers (p, q), saSLP first checks whether (1) the two pointers start from loop-invariant addresses (StartAddr) and move with constant strides (StepSize) in each loop iteration and (2) the input loop has a loop-invariant trip count (TripCnt). These ensure that saSLP can derive an invariant region from StartAddr to StartAddr + TripCnt × StepSize accessed by a pointer during the entire loop. Then, whether or not p and q are aliased can be verified at runtime by checking whether their regions overlap. Any overlapped region will force the execution to jump to the ShortLoop for legality. If (p, q) does not meet requirements (1) and (2), then the two pointers are treated as aliased for correctness.
We now consider the two memory instructions ➄ and ❶, which store and load 16B vectors with pointers %x2 and %x1 + 16, respectively, as shown in Figure 5(b). To detect the possible dependency (dashed-line arrow in Figure 5(c)), saSLP inserts checking code into RTCheck since both pointers move with a 32B constant stride and the loop has an invariant trip count determined by %x0. The code evaluates the accessed regions of %x2 and %x1 + 16 and determines whether the two regions overlap at runtime. The runtime check is further optimized by grouping pointers with constant distances and the same stride (e.g., the pointers of ➄ and ❺ share a 32B stride and differ by 16B) and performing the overlap tests between pointer groups instead of between individual pointers. Since saSLP targets unrolled guest loops where many pointers can be grouped, this optimization reduces the number of overlap tests significantly.
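The region test performed in RTCheck can be illustrated with a short Python sketch. It assumes positive strides and treats each region as the half-open byte range from StartAddr to StartAddr + TripCnt × StepSize; the function names and the concrete addresses are hypothetical and only serve the example.

```python
def region(start_addr, trip_cnt, step_size):
    """Byte range touched by a pointer over the whole loop (half-open)."""
    return (start_addr, start_addr + trip_cnt * step_size)


def overlaps(p, q):
    """True if the two half-open ranges share at least one byte."""
    (p_lo, p_hi), (q_lo, q_hi) = p, q
    return p_lo < q_hi and q_lo < p_hi


def choose_loop(x1, x2, trip_cnt, stride=32):
    """Pick the loop version for the load at %x1 + 16 and the store at %x2."""
    load_region = region(x1 + 16, trip_cnt, stride)
    store_region = region(x2, trip_cnt, stride)
    # Any overlap is treated as aliasing and falls back to the short loop.
    return "ShortLoop" if overlaps(load_region, store_region) else "LongLoop"


disjoint = choose_loop(x1=0x1000, x2=0x2000, trip_cnt=8)
aliased = choose_loop(x1=0x1000, x2=0x1020, trip_cnt=8)
```

With these inputs, `disjoint` selects the long version because the two arrays are far apart, whereas `aliased` selects the short version because the store range begins inside the load range.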
Spill-Aware Cost Model
To decide whether combining the grouped short instructions in the combining graph to form long SIMD instructions would be profitable, we devised a new spill-aware cost model to evaluate the overall performance cost of the combining graph.
To model the performance cost, the spill-aware cost model first categorizes cost factors into three classes: (1) the short cost c_S, which denotes the cost of executing original instructions with ShortVS width; (2) the long cost c_L, which denotes the cost of executing combined SIMD instructions with LongVS width as well as the extra data reordering overheads; and (3) the register saving gain c_R, which is the benefit of eliminating register spilling by combining multiple short registers. For each node N_i in the combining graph G, the cost model computes c_S(N_i), c_L(N_i), and c_R(N_i), and then determines that combining the grouped short instructions in G is profitable if

\[ \sum_{N_i \in G} c_L(N_i) - \sum_{N_i \in G} c_R(N_i) < \sum_{N_i \in G} c_S(N_i). \tag{1} \]
Our cost model largely reuses the existing instruction cost table in the DBT back-end compiler to compute c_S and c_L. Therefore, we focus on the computation of c_R. The core idea of c_R is assigning a higher c_R to short instructions with live ranges that are likely to lead to register spilling. Thus, such live ranges have a greater chance of being combined and stored in long SIMD registers to avoid spilling. First, we define the following terms for an instruction n_i.
(1) Reg(n_i) denotes the physical register capacity occupied by n_i; (2) Int(n_i) denotes the maximal physical register capacity occupied by the live ranges that interfere with the live range of n_i; (3) Avail(n_i) denotes the total physical register capacity that can be used to store n_i. Hence, if Int(n_i) ≥ Avail(n_i), then n_i will lead to spilling because its live range interferes with too many live ranges and thus overwhelms the available register capacity on the host. Given a list of CF short instructions S_i = (s_1, ..., s_CF) that can be combined to form a long SIMD instruction l_i, the cost model computes the register saving gain c_R(S_i) as follows:

\[ c_R(S_i) = \sum_{j=1}^{CF} \begin{cases} \dfrac{Reg(s_j) - Reg(l_i)/CF}{ShortVS} & \text{if } Int(s_j) \ge Avail(s_j), \\ 0 & \text{otherwise.} \end{cases} \tag{2} \]

In Equation (2), each short instruction s_j respectively occupies Reg(s_j) and Reg(l_i) / CF bytes (amortized over S_i) before and after the combining operation. Therefore, combining s_j can save Reg(s_j) − Reg(l_i) / CF bytes. If s_j will lead to spilling (i.e., Int(s_j) ≥ Avail(s_j)), the cost model rewards s_j with the saving in the unit of ShortVS.
Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation 2:9
Next, we consider the motivating example in which the AVX2 host has three 32B YMM registers.
For instance, the grouped short instructions ➂ and ❸ in Figure 6(a) have two 16B live ranges before being combined. Since the upper half of a 32B YMM register cannot be used when the lower half is used by ➂ or ❸, both ➂ and ❸ occupy a full YMM register (i.e., Reg(➂) = Reg(❸) = 32). After combining, the resulting long SIMD instruction ➂❸ occupies only one YMM register (i.e., Reg(➂❸) = 32). Also, both ➂ and ❸ interfere with a maximum of four live ranges, exceeding the available registers on the AVX2 host. Hence, according to Equation (2), combining them could save 32B of capacity, which is equal to two 16B ShortVS and implies that c_R(➂, ❸) = 2.
To compute c_S and c_L, the cost model queries our DBT back-end compiler and finds that the cost of SIMD multiply instructions is 2 regardless of whether it operates on four or eight 4B floating-point numbers on the AVX2 host. Therefore, as shown in Figure 6(c), the cost model concludes that c_S(➂, ❸) = 2 + 2 and c_L(➂, ❸) = 2. Moreover, recall that all modified registers need to be synchronized back to the guest state. Thus, two extra unpacking instructions are required to extract %v0 and %v2 from the long SIMD instruction ➂❸, as shown in Figure 6(b). The additional cost of each extraction is 2 based on the back-end interface. Therefore, unpacking ➂❸ adds an extra cost of 4 to the unpack row in Figure 6(c). The unpacking of ➃❹ also incurs an extra cost of 4.
The cost model repeats the above computation on each node in the combining graph and produces the final cost table shown in Figure 6(c). As the performance cost of the long SIMD form (c_L − c_R = 13) is lower than the short one (c_S = 18), combining the grouped short instructions is profitable.
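The arithmetic of the register saving gain can be replayed with a few lines of Python. The sketch follows one reading of the prose: a short instruction that is expected to spill (Int ≥ Avail) is credited with Reg(s_j) − Reg(l_i) / CF bytes, expressed in units of ShortVS. The function names and the tuple layout are hypothetical, and capacities are given in bytes (four interfering 32B live ranges against three 32B YMM registers).

```python
def register_saving_gain(shorts, reg_long, short_vs):
    """c_R for one group; `shorts` holds (Reg, Int, Avail) per short instruction."""
    cf = len(shorts)                      # combining factor of this group
    gain = 0.0
    for reg, interference, avail in shorts:
        if interference >= avail:         # this live range is expected to spill
            gain += (reg - reg_long / cf) / short_vs
    return gain


def profitable(c_s, c_l, c_r):
    """Decision rule: the long form wins if c_L - c_R is below c_S."""
    return c_l - c_r < c_s


# The two multiplies of the running example: each occupies a full 32B YMM
# register, interferes with 4 x 32B of live ranges, and only 3 x 32B exist.
c_r_mul = register_saving_gain([(32, 128, 96), (32, 128, 96)],
                               reg_long=32, short_vs=16)
```

Here `c_r_mul` evaluates to 2, matching the value derived in the text for the multiply group. With the totals reported for the whole graph (c_L − c_R = 13 against c_S = 18), the decision rule selects the long form.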
Combining Loop Reductions
In addition to the consecutive stores mentioned in Section 3.2, saSLP also selects a sequence of short (i.e., scalar or vector) reductions as seeds and combines them to form a long SIMD reduction. Combining reductions to exploit data parallelism is a critical issue because reductions are prevalent in applications (Hong et al. 2011; Huo et al. 2011). For example, PageRank, one of the most important graph applications, contains reductions in its hottest loop. As shown in Figure 7(a), for each node i in the given graph, the innermost loop iterates over the neighboring nodes of node i (i.e., nodes k) and accumulates their weighted rank values. The accumulated value is a typical reduction.
As mentioned in Section 2.2, compilers optimize loops according to target ISAs. To optimize the loop for the simplified ARMv8 guest that has six NEON registers, the compiler unrolls the loop once and produces the guest binary in Figure 7(b). The guest binary is then mapped to the IR in Figure 7(c), which is the input of saSLP. We use the PageRank example in Figure 7 to illustrate how saSLP combines short reductions to exploit the host's parallelism and register capacity. First, to combine short reductions, saSLP collects short values that (1) live across loop iterations (i.e., phi nodes in the input IR) and (2) are used by a sequence of associative instructions with the same opcode to compute the values for the next iteration (i.e., reduction chain). The phi nodes and their reduction chains form cycles in their data dependency graphs. For instance, saSLP collects the phi node and its reduction chain (i.e., ➄ and ❺) in Figure 7(c). The phi node passes the reduction value between loop iterations and the reduction chain accumulates the multiply results in the reduction value, resulting in the dashed-line cycle in Figure 7(d).
Second, saSLP uses the same graph construction algorithm introduced in Section 3.2 to construct a combining graph on the reduction chain. This is because instructions in a reduction chain usually have a higher probability of reducing a sequence of similar instructions. Starting from the operands of the reduction chain, the algorithm traverses along the use-define chains and then groups short instructions that can be combined to form long SIMD instructions. Once the combining graph is built, saSLP combines short instructions in the reduction chain to form a long SIMD reduction and widens the phi node as well. In the PageRank example, the graph construction traversal starts from the multiply instructions (➃, ❹) in Figure 7(d) and traces along the use-define chains, resulting in the combining graph in Figure 7(e). Moreover, the 8B floating-point add instructions ➄ and ❺ in the reduction chain are combined and replaced with the long SIMD reduction ➄❺, which reduces a vector of two 8B floating-point numbers. The phi node is also widened to the width of ➄❺, as shown in Figure 7(e).
Finally, saSLP needs to horizontally reduce the long SIMD reduction to compute the final short reduction value that flows out of the loop. Static compilers achieve this by reducing the lanes of the long SIMD reduction at the loop exit. For example, as shown in Figure 8(a), if the loop runs for n iterations and ➃_i denotes the multiply result of ➃ in the ith iteration, the value of the long SIMD reduction ➄❺ is $(\sum_{i=1}^{n} ➃_i,\ \sum_{i=1}^{n} ❹_i)$ after the execution of the loop. Therefore, the horizontal reduction in Figure 8(a) yields the final reduction value $\sum_{i=1}^{n} (➃_i + ❹_i)$.
However, the typical horizontal reduction that static compilers adopt does not work in DBTs. This is because all guest registers updated in the reduction chain must be synchronized back to the guest state, but the typical horizontal reduction loses some of the updated guest registers. As Figures 7(c) and 7(d) show, the reduction chain in the unrolled loop modifies two registers, %d5 and %d0, whose values must be reflected in the guest state in memory. Guest register %d0 contains the final reduction result, and %d5 should accumulate the multiply results of ➃ and ❹ for n iterations except ❹ in the last iteration. That is, the final value of %d5 should be $\sum_{i=1}^{n-1} (➃_i + ❹_i) + ➃_n$, which is less than the final reduction value %d0 by ❹_n (Figure 8(c)). The typical horizontal reduction can resolve only the final result (e.g., %d0 but not %d5 in Figure 8(a)) and thus causes missing guest states.
To solve this problem, saSLP horizontally reduces the widened phi node instead of the long SIMD reduction and then sequentially reduces the reduction chain in the last iteration. Since all instructions in the reduction chain are sequentially executed, all guest registers updated in the chain are preserved. In the PageRank example shown in Figure 8(b), saSLP first horizontally reduces the widened phi node that holds $(\sum_{i=1}^{n-1} ➃_i,\ \sum_{i=1}^{n-1} ❹_i)$, which is the accumulated result for n − 1 iterations (except the last iteration). The resulting short phi node at the loop exit contains $\sum_{i=1}^{n-1} (➃_i + ❹_i)$. Next, saSLP unpacks the root of the combining graph (i.e., ➃❹) into registers %d1 and %d2, whose values are ➃_n and ❹_n, respectively. The reduction chain in Figure 8(b) then sums the short phi node, %d1, and %d2 to compute the final values of %d0 and %d5 for guest state synchronization. The value of each node in Figure 8(b) is shown in Figure 8(c). As a result, no guest state would be lost.
To ensure that combining reductions is profitable, saSLP uses the spill-aware cost model introduced in Section 3.4 to compute the performance costs. The additional costs of the horizontal reduction and the reduction chain at loop exit are added to the long SIMD cost c_L. Whether saSLP will combine the reductions or not is then determined by the same Equation (1).
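The loop-exit handling of combined reductions described in this section can be checked with a small Python model. It is a sketch under stated assumptions: the function names are hypothetical, the two multiply results per unrolled iteration are passed in as plain lists, the loop is assumed to run at least once, and the inputs are integer-valued so that reassociating the floating-point additions does not change the rounding.

```python
def guest_loop(mul_a, mul_b, init=0.0):
    """Unrolled scalar reduction chain as the guest executes it."""
    d0 = d5 = init
    for a, b in zip(mul_a, mul_b):
        d5 = d0 + a        # first add of the chain writes %d5
        d0 = d5 + b        # second add of the chain writes %d0
    return d0, d5


def combined_loop(mul_a, mul_b, init=0.0):
    """Two-lane reduction; the widened phi is reduced at the loop exit."""
    assert len(mul_a) >= 1, "sketch assumes at least one iteration"
    acc = [init, 0.0]                      # widened reduction value
    for a, b in zip(mul_a, mul_b):
        phi = acc                          # value entering this iteration
        last = (a, b)                      # root of the combining graph
        acc = [phi[0] + a, phi[1] + b]     # long SIMD reduction
    short_phi = phi[0] + phi[1]            # horizontal reduction of the phi
    d1, d2 = last                          # unpack the last multiply results
    d5 = short_phi + d1                    # sequential chain, last iteration
    d0 = d5 + d2
    return d0, d5


a_vals, b_vals = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
guest_state = guest_loop(a_vals, b_vals)
host_state = combined_loop(a_vals, b_vals)
```

Both versions produce the same pair of register values for this input, so %d5 is preserved in addition to the final result in %d0. Reducing the accumulator `acc` directly would recover only the value of %d0.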
Combining Indirect Loads to Form Gather Loads
To vectorize data-irregular loops that cannot be vectorized on the guest owing to the indirect array loads, saSLP combines the indirect loads and exploits the host's SIMD gather instructions. This is a significant issue because the innermost loops in many important applications, such as graph algorithms, sparse matrix kernels, and scientific stencils, contain the "indirectly load and then accumulate" pattern and are read-only. Therefore, saSLP combines indirect loads only in read-only loops so that a runtime aliasing check is not required. We use the same PageRank example to elaborate on why and how saSLP combines the indirect array loads.
Recall that the innermost loop of PageRank accumulates the weighted ranks of the neighboring nodes of a specific node. As Figures 7(a) and 9(a) show, the loads for weights and ranks (i.e., ➁ and ➂) are indirect loads based on an index k loaded from the neighbors array (i.e., ➀). Therefore, the ARMv8 NEON compiler cannot vectorize the loop since NEON does not support gather loads, leaving data parallelism in the loop unexploited. However, the x86 AVX2 host supports gather instructions, making vectorizing the loop possible. As designed for vectorizing indirect array loads, AVX2 gather instructions take an array base pointer (Base), a vector of array indices (Idx), and the size of an array element (ESize) to load memory location Base + Idx[i] × ESize to the ith lane of the output register. Thus, static compilers can directly map indirect array loads to the instructions.
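The addressing rule of a gather load, Base + Idx[i] × ESize for lane i, can be emulated in Python on a flat byte buffer. The helper name `gather_f64`, the toy array contents, and the base offset are hypothetical and chosen only to make the lane-to-address mapping visible; real gather instructions additionally involve masks and hardware-specific index scaling that this sketch ignores.

```python
import struct


def gather_f64(memory, base, idx, esize=8):
    """Lane i reads the little-endian 8B float at base + idx[i] * esize."""
    return [struct.unpack_from("<d", memory, base + i * esize)[0] for i in idx]


# A toy array of four 8B floats placed 8 bytes into a flat memory image.
memory = bytes(8) + struct.pack("<4d", 0.5, 1.5, 2.5, 3.5)
lanes = gather_f64(memory, base=8, idx=[3, 0, 2, 2])
```

The four lanes receive elements 3, 0, 2, and 2 of the array, which is the access pattern that consecutive vector loads cannot express.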
Without the explicit semantics of indirect array loads in source code, saSLP must analyze the data-dependency graphs of load pointers in the input IR to find the Base, Idx, and ESize. As Figure 9(b) shows, saSLP first locates the three expressions Base, Idx, and ESize in the data-dependency graphs of the load pointers. Because the Base expression is shared among the indirect array loads that could form a gather load, saSLP locates the Base expression by finding a common operand of the add instructions in Figure 9(b). If the common operand exists, then the other operands of the add instructions should be the distances from Base to the memory locations accessed by the indirect loads (i.e., Idx × ESize). Therefore, saSLP then locates the ESize expression by searching for a constant expression shared by the multiply or left-shift instructions in Figure 9(b), and the other operands of the multiply or left-shift instructions are identified as the Idx expressions. Take PageRank as an example; saSLP analyzes the data-dependency graphs of the indirect loads ➁ and ❷ in Figure 10(a) to locate the three expressions. First, saSLP discovers that %x2 is shared by the two add instructions used by ➁ and ❷ and, thus, deduces that Base = %x2. Second, saSLP analyzes the other (i.e., right-hand side) operands of the add instructions and finds out that the constant 3 is shared by the two left-shift instructions in Figure 10(a). Therefore, saSLP concludes that ESize = 2^3 = 8B, which is the element size of the weight array. Last, ➀ and ❶, the other (i.e., left-hand side) operands of the two left-shift instructions, are identified as Idx, as shown in Figure 10(a).
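The search for Base, Idx, and ESize can be sketched as pattern matching over small expression trees. In the Python sketch below, pointer expressions are nested tuples such as `("add", "%x2", ("shl", "k1", 3))`; this encoding, the function name `locate_gather`, and the index names `k1` and `k2` are hypothetical. As a simplifying assumption, a left shift by c and a multiply by 2^c are treated as the same element size.

```python
def locate_gather(pointers):
    """Return (Base, [Idx...], ESize) or None if the loads cannot be combined."""
    if not all(isinstance(p, tuple) and len(p) == 3 and p[0] == "add"
               for p in pointers):
        return None
    # Base: the single operand shared by every add instruction.
    common = set(pointers[0][1:])
    for p in pointers[1:]:
        common &= set(p[1:])
    if len(common) != 1:
        return None
    base = common.pop()
    esizes, idx = set(), []
    for _, lhs, rhs in pointers:
        offset = rhs if lhs == base else lhs      # Idx * ESize expression
        if not (isinstance(offset, tuple) and len(offset) == 3
                and offset[0] in ("shl", "mul")):
            return None
        op, index, const = offset
        if op == "mul" and isinstance(index, int):
            index, const = const, index           # multiply is commutative
        if not isinstance(const, int):
            return None
        esizes.add(1 << const if op == "shl" else const)
        idx.append(index)
    if len(esizes) != 1:                          # ESize must be shared
        return None
    return base, idx, esizes.pop()


found = locate_gather([("add", "%x2", ("shl", "k1", 3)),
                       ("add", "%x2", ("shl", "k2", 3))])
missing_base = locate_gather([("add", "%x2", ("shl", "k1", 3)),
                              ("add", "%x3", ("shl", "k2", 3))])
```

For the two PageRank-style pointers, `found` reports `%x2` as Base, the two index values as Idx, and an element size of 8 bytes. When the add instructions share no operand, as in `missing_base`, the loads are reported as non-combinable.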
For data dependency graphs where saSLP can locate all of the expressions Base, Idx, and ESize, saSLP combines the indirect loads to form a gather load, as shown in Figure 9(c). The Base expression is cast into a pointer of type T, whose size is ESize. Moreover, a new get element pointer (GEP) instruction is added to compute the memory locations that would be gathered. The GEP instruction takes a pointer Base and a vector Idx and then outputs a vector of pointers pointing to Base + Idx × ESize, where ESize = sizeof(T) and T is the data type pointed to by Base. As a result, the memory locations accessed by the gather load in Figure 9(c) are identical to the ones in Figure 9(b). On the other hand, if saSLP cannot locate expressions Base, Idx, and ESize, then the indirect array loads are non-combinable and packed into a long SIMD vector, as mentioned in Section 3.2.
In the PageRank example, because all three expressions are found, saSLP combines the two indirect loads ➁ and ❷ and forms a SIMD gather ➁❷ in Figure 10(b). In addition, %x2 is cast to a pointer to an 8B floating-point number. The pointer is then used by the GEP instruction to compute the vector of gathered memory locations, whose value is (%x2 + ➀ × 8, %x2 + ❶ × 8). Therefore, the gather instruction ➁❷ would load the same memory locations as the original indirect loads ➁ and ❷, as shown in Figure 10(c). Similarly, the other indirect array loads ➂ and ❸ in Figure 10 are combined to form a gather load.
To estimate the profits of combining indirect array loads, the performance costs of the original indirect loads and the new gather instructions are also counted in the short cost c_S and the long SIMD cost c_L, respectively. The spill-aware cost model mentioned in Section 3.4 then decides whether to combine the short instructions in the whole combining graph (including the indirect array loads) based on the same Equation (1).
Limitations of saSLP
When the input IR contains complicated structures, such as conditional branches, function calls, and SIMD permutation instructions, saSLP vectorizes the loop partially. In this case, the combining graph contains only the use-def chain from the seeds to the complicated instructions and the combining graph is the resulting scope for the saSLP vectorization. With the optimizations to eliminate complicated structures (e.g., if-conversion and function inlining), the whole loop would be vectorized by saSLP if there is no complicated structure left in the loop.
EVALUATION
We implemented saSLP in HQEMU (Hong et al. 2012), a retargetable cross-ISA DBT based on the LLVM 6.0 JIT compiler. We evaluated saSLP with translations from a guest ISA, ARMv8 NEON with 32 16B SIMD registers, to two host ISAs, x86 AVX2 and AVX-512. AVX2 and AVX-512 have 16 32B and 32 64B SIMD registers, respectively. The host systems for AVX2 and AVX-512 translation are Intel Core i7-6700 (Skylake, SKL) at 3.4GHz with 64GB of RAM and Intel Xeon Phi 7210 (Knights Landing, KNL) at 1.3GHz with 109GB of RAM, respectively.
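The register counts above imply a simple capacity argument, sketched below under idealized assumptions. The helper `host_regs_needed` is hypothetical and ignores duplicated registers, temporaries, and liveness, so it gives only a best-case count of host registers needed to hold all guest SIMD registers at a given host width.

```python
import math


def host_regs_needed(guest_regs, guest_width, host_width):
    """Best-case number of host registers holding all guest registers.

    Assumes host_width // guest_width guest registers are packed into
    each host register (an idealization, not the saSLP allocator).
    """
    per_host = host_width // guest_width
    return math.ceil(guest_regs / per_host)


NEON_REGS, NEON_WIDTH = 32, 16   # guest: 32 registers of 16B

one_to_one = host_regs_needed(NEON_REGS, NEON_WIDTH, 16)  # XMM-width mapping
in_ymm = host_regs_needed(NEON_REGS, NEON_WIDTH, 32)      # packed in 32B YMM
in_zmm = host_regs_needed(NEON_REGS, NEON_WIDTH, 64)      # packed in 64B ZMM
```

Under these assumptions, a one-to-one mapping needs more registers than the 16 available on AVX2, whereas pairing two NEON registers per YMM register fits exactly, and packing into ZMM registers leaves spare capacity among the 32 ZMM registers.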
The evaluation of saSLP consists of two parts. In the first part, we demonstrate and discuss the performance of saSLP with short NEON vector to long AVX vector transformation. All benchmarks used in the first part are data regular and thus are vectorizable on ARMv8 NEON. In the second part, we show that saSLP can further transform ARMv8 scalar to long AVX vector across a number of data-irregular benchmarks that are not vectorizable on ARMv8 NEON. All benchmarks used in the second part contain indirect array loads and loop reductions.
NEON Vector to AVX Vector Transformation
Fifteen benchmarks were selected from various benchmark suites for the evaluation: Livermore Loops (LL), NAS Parallel Benchmarks (NPB) (Bailey et al. 1991), SciMark 2 (SM2), Test Suite for Vectorizing Compilers (TSVC) (Maleki et al. 2011), LINPACK (LP), cjpeg (JPG), and OpenCV (CV). All benchmarks were compiled using AArch64 GCC 5.4.0 with flags "-O3 -ftree-vectorize -funroll-loops -ffast-math -static" to let the guest compiler automatically vectorize and unroll loops according to the NEON registers. For comparison, we used the translation without saSLP as our performance baseline, where NEON vectors were translated to AVX-128 vectors that have the same SIMD width.
ARMv8 Scalar to AVX Vector Transformation
In addition to short-vector to long-vector transformation, we also selected eight benchmarks from various applications and benchmark suites to evaluate saSLP with scalar to vector transformation. The selected benchmarks included PageRank (PR), High-Performance Conjugate Gradients (HPCG) (Dongarra et al. 2015), SM2, TSVC, and stochastic gradient descent (SGD). All benchmarks were compiled using the same ARMv8 guest compiler and flags as the short-vector to long-vector transformation. The compiler failed to vectorize the kernels in the eight data-irregular benchmarks since ARMv8 does not support SIMD gather instructions. As a result, the kernel loops were unrolled only according to the number of NEON registers. For comparison, we used the translation without saSLP as our performance baseline, where ARMv8 scalars were translated to x86 scalars.
Performance Results with NEON Vector to AVX2 Translation
Figure 11 shows the speedup of AVX2 translation with saSLP. By combining two NEON instructions, saSLP achieves a significant speedup of 1.6× on average compared to that without saSLP. Most of the benchmarks achieve more than 40% performance improvement with saSLP. In particular, the benchmarks LL-hydro, LL-first_diff, LL-mat_mul, TSVC-vpvts, and CV-hc_sobel achieve significant speedups over 1.8×. This is because their loops have good access patterns and locality, and contain a high ratio of SIMD computation. In contrast, NPB-mg_psinv and NPB-mg_resid achieve an average speedup of around 1.2×, which is relatively low. Profiling results indicate that NPB-mg_psinv and NPB-mg_resid spend 17% of their total cycles on split loads (i.e., fetching two cache lines because the load crosses a cache line boundary), which degrades the performance (Intel Corp. 2018a). Moreover, the working set of these two benchmarks is larger than the L3 cache size, leading to 30% of the cycles being spent on L3 miss stalls. Since these memory overheads cannot be reduced by the improved parallelism, there is little room for saSLP to enhance the performance.
The overheads of JIT compilation and runtime aliasing checks are measured as the ratios to the total execution time in Figure 12 . Since all loops in the benchmark set are translated once but iterate many times, the JIT overhead is amortized to 3%, on average. Moreover, our DBT uses another thread for LLVM JIT so that the overhead can often be hidden. The results also show that runtime aliasing check incurs a negligible overhead of 2.95%, on average, because the checking code is executed only once before entering the loop. As a result, the overhead is amortized. One exception is JPG-8x8fDCT, in which SIMD loops have relatively small trip counts; consequently, the overhead of runtime aliasing checks cannot be fully amortized.
To evaluate the impacts from the automatic unrolling factors made by the guest compiler, we force the guest compiler to compile benchmarks with smaller unrolling factors, which results in no register spilling on the AVX2 host. As shown in Figure 11, the overall speedup decreases from 1.6× to 1.47× because there is no spilling to eliminate. However, even in this case, saSLP can still improve the performance by exploiting the AVX2's parallelism.
Register Spilling with NEON Vector to AVX2 Translation
Figures 13 and 14 show the dynamic instruction ratios of register spilling loads (i.e., filling) and stores (i.e., spilling) against the total executed instruction counts. On average, saSLP eliminates 67% and 97% of spilling loads and stores, respectively, by combining two 16B NEON registers and storing them in a 32B YMM register on the AVX2 host.
For the spilling loads in Figure 13, benchmarks such as LL-hydro, , and CV-hc_sobel show a significant reduction in spilling loads when saSLP is applied. On average, these benchmarks achieve an 87% reduction because the number of interfering live ranges is significantly reduced by combining them. On the other hand, LL-int_pred and JPG-8x8fDCT do not achieve a significant reduction because most of the NEON registers in the loops are duplicated and fill up the YMM registers rather than being combined with each other. Furthermore, for the spilling stores in Figure 14, nearly all of the benchmarks achieve significant reductions, exceeding 70%. Almost all live ranges that flow through the loops for guest state synchronization are combined and stored in the YMM registers. Therefore, the spilling stores in the loops are eliminated as a consequence of the decrease in interfering live ranges.
The reduction in register spilling also impacts the performance depending on the load/store pressure of the benchmarks. Therefore, we profile the dispatched load/store micro-ops and memory stall cycles using the hardware performance counters. In Figures 13 and 14, the dashed lines (right y-axis) indicate the ratios of the execution cycles with dispatched load micro-ops over the total execution cycles and the ratios of the store micro-ops, respectively. The results in Figures 11, 13, and 14 show that the benchmarks with relatively high load and store pressure and significant reduction in register spilling, such as LL-hydro and LL-first_diff, achieve speedups exceeding 2× (i.e., the theoretical bound of speedup from improved parallelism). The benchmarks with high store pressure and significant reductions in spilling stores, such as LL-mat_mul, TSVC-expand, and TSVC-vpvts, also achieve significant speedups, ranging from 1.71× to 1.83×.
Performance Results with NEON Vector to AVX-512 Translation
Figure 15 shows the speedup of AVX-512 translation with saSLP, where four or two NEON instructions are combined. On average, saSLP achieves a speedup of 2.3× compared to the translation without saSLP. All benchmarks from the LL, TSVC, and CV suites achieve a significant speedup of 2.9×, on average. In contrast, NPB-mg_psinv, have an average speedup of 1.5×. NPB-mg_psinv and NPB-mg_resid suffer from large working sets and excessive split loads. Moreover, the performance degradation caused by the split loads is more severe on AVX-512 because the SIMD width is the same as the 64B cache line size. If the start addresses of the AVX-512 loads are not aligned to 64B, then all subsequent loads will cross cache line boundaries and incur a significant penalty (Intel Corp. 2018a). The profile data show that both benchmarks have 37% of split loads, and a similar phenomenon is also observed in SM2-LU_factor.
For JPG-8x8fDCT, the speedup is similar to that of the translation to AVX2 because saSLP has only two 16B NEON instructions to combine in the loops. As shown in Figure 16, the overheads of JIT compilation and runtime aliasing checks are similar to those in the AVX2 translation. Because of the amortization of overheads, the JIT compilation and runtime aliasing checks incur only negligible costs of 1.65% and 4.28% of the total execution time, on average, respectively.
Figures 17 and 18 show the dynamic instruction ratio of register spilling loads and stores with the AVX-512 translation, respectively. On average, saSLP eliminates 95% and 99% of spilling loads and stores, respectively. Although AVX-512 supports 32 64B ZMM registers, accessing the extra 16 XMM16-31/YMM16-31 sub-registers requires the AVX-512VL extension, which is not supported on our host. Therefore, the baseline that translates to XMM width can only use 16 XMM registers, while saSLP can exploit all 32 ZMM registers by combining with the ZMM width. As a consequence, saSLP further reduces spilling with the extra ZMM registers. For example, the spilling loads caused by the duplicated and packed NEON registers in LL-int_pred (Figure 13) can be eliminated by storing these duplicated NEON registers in the extra ZMM registers, resulting in a 98% reduction in the spilling loads (as shown in Figure 17). For the spilling stores in Figure 18, benchmarks such as LL-hydro, LL-diff_pred, LL-frag2D, and NPB-mg_psinv also benefit from the improved SIMD register capacity after the saSLP transformation. On average, there is 82% extra reduction in the spilling stores compared to the AVX2 translation.
Our AVX-512 host machine does not support the hardware counters that count the cycles with dispatched load/store micro-ops to estimate the load/store pressure. Therefore, we estimate the load/store pressure by calculating the ratios of accumulated memory latency (from the event counts and latencies of the L1 and L2 data caches and DRAM) over the total execution cycles (Jeffers et al. 2016). Figures 15, 17, and 18 show that, similar to the AVX2 translation, benchmarks with high load and store pressure and significant reduction in register spilling achieve significant speedup. Specifically, LL-hydro, LL-first_diff, and CV-hc_sobel achieve 2.97×, 2.91×, and 2.95× speedup, respectively. In addition, benchmarks with high store pressure and significant reduction in spilling stores, such as LL-mat_mul, TSVC-expand, and TSVC-vpvts, achieve significant speedups, ranging from 2.6× to 3.54×.
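The split-load effect noted for NPB-mg_psinv and NPB-mg_resid on AVX-512 can be made concrete with a small sketch. The functions below are hypothetical helpers, not part of saSLP or of any profiling tool, and they assume a stream of back-to-back contiguous loads starting at a given byte address with a 64B cache line.

```python
CACHE_LINE = 64  # cache line size in bytes, as stated in the text


def is_split(addr, width, line=CACHE_LINE):
    """True if the load [addr, addr + width) touches two cache lines."""
    return addr // line != (addr + width - 1) // line


def split_fraction(start, width, count, line=CACHE_LINE):
    """Fraction of `count` contiguous loads of `width` bytes that are split."""
    splits = sum(is_split(start + k * width, width, line)
                 for k in range(count))
    return splits / count


# Illustrative stream that starts 16B past a cache line boundary.
zmm_misaligned = split_fraction(16, 64, 100)  # 64B loads (AVX-512 width)
ymm_misaligned = split_fraction(16, 32, 100)  # 32B loads (AVX2 width)
xmm_misaligned = split_fraction(16, 16, 100)  # 16B loads (NEON/XMM width)
zmm_aligned = split_fraction(0, 64, 100)      # 64B loads, 64B-aligned start
```

With this 16B misalignment, every 64B load in the stream is split, half of the 32B loads are split, and none of the 16B loads are, which is consistent with the observation that the penalty is more severe when the SIMD width equals the cache line size.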
To summarize, in the first part of the experiments, we evaluated saSLP with the translations from the short SIMD ISA, ARMv8 NEON, to two long SIMD ISAs, x86 AVX2 and AVX-512. The benchmark results show that saSLP can effectively exploit the improved parallelism and register capacity of the host. Moreover, it can achieve a significant reduction in the register spilling overhead, showing a speedup of 1.6× and 2.3× for the AVX2 and AVX-512 hosts, respectively.
Performance Results with ARMv8 Scalar to AVX2 Translation
In addition to combining NEON short vectors to form AVX2 long vectors, saSLP also uses AVX gather instructions to vectorize loops containing reductions and indirect array loads, which are not vectorizable on ARMv8 NEON. We evaluated saSLP with such scalar to SIMD vector transformation across a number of data-irregular applications and benchmarks. Figure 19 shows the speedup of ARMv8 scalar to AVX2 translation with saSLP. By vectorizing the scalar reduction chains and packing the scalars from indirect array loads into SIMD vectors, saSLP achieves a speedup of 1.75×, on average, compared to that without saSLP. Moreover, when using the AVX2 gather instructions to vectorize the indirect array loads, saSLP further improves the average speedup to 1.92×. In particular, benchmarks SM2-SpMM, TSVC-s4115, and TSVC-s4116 significantly benefit from the AVX2 gather instructions, resulting in an extra speedup of 13%, on average. All of the single-precision benchmarks achieve higher speedups with saSLP than the double-precision benchmarks (i.e., 3.73× vs. 1.29×). This is because AVX2 single-precision instructions process two times more vector elements in parallel than the double-precision instructions. Furthermore, single-precision numbers require less memory bandwidth and capacity.
Although saSLP significantly improves the performance of the benchmarks with SIMD vectorization, the speedup that can be accomplished is still limited by the following factors. (1) Data access pattern and locality of the benchmarks: if the memory access pattern is irregular (e.g., indirect loads) or the working set cannot fit in the cache, it would be difficult to maintain the theoretical SIMD computation throughput due to cache-miss latency. (2) According to Amdahl's law, the speedup is limited by the ratio of non-SIMD code. For example, PR-small achieves 38% higher speedup than PR-large, which processes much larger network graphs. This is because PR-large has a higher ratio of stall cycles waiting for data from DRAM and the last-level cache (LLC) compared to PR-small, as shown in Figure 20. Moreover, PR-small has a much higher ratio of stall cycles on L1 cache hits, resulting in less memory overhead and greater performance gain. Similarly, HPCG-SpMV and HPCG-SymGS also suffer from high ratios of stall cycles on DRAM and, thus, achieve less speedup. Furthermore, since the kernels in the HPCG suite are 27-point stencils and the vectorized kernels have low trip counts, non-SIMD code becomes the bottleneck and impedes the speedup.
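The Amdahl's law bound mentioned in factor (2) can be sketched as follows. The function name and the sample fractions are illustrative assumptions, not measurements from the benchmarks; the formula is the standard one, where a fraction p of the execution time is sped up by a factor s and the remaining 1 − p is unchanged.

```python
def amdahl_speedup(simd_fraction, simd_speedup):
    """Overall speedup when `simd_fraction` of the time runs `simd_speedup`x faster."""
    if not 0.0 <= simd_fraction <= 1.0:
        raise ValueError("simd_fraction must be within [0, 1]")
    if simd_speedup <= 0.0:
        raise ValueError("simd_speedup must be positive")
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / simd_speedup)


# Hypothetical kernel: 80% of the time is vectorizable.
with_4_lanes = amdahl_speedup(0.8, 4.0)   # e.g., four doubles per vector
with_8_lanes = amdahl_speedup(0.8, 8.0)   # e.g., eight doubles per vector
upper_bound = 1.0 / (1.0 - 0.8)           # limit as simd_speedup grows
```

In this hypothetical case, doubling the vector width raises the overall speedup from 2.5× to roughly 3.3×, and no vector width can exceed 5×, which illustrates why a larger share of non-SIMD code caps the gains reported for kernels with low trip counts.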
On the other hand, single-precision benchmarks SM2-SpMM, TSVC-s4115, and TSVC-s4116 have most of their stall cycles waiting for L1 and L2 cache hits instead of LLC and DRAM (Figure 20), resulting in a significant speedup of 3.73×, on average. This is because these benchmarks have a smaller working set and less divergence in the indirect load addresses. Figure 19 shows that the benefit of exploiting AVX2 gather instructions instead of using software packing (i.e., inserting indirectly loaded scalars into a vector) is more significant among single-precision benchmarks. Since the AVX2 gather instructions have high latency and are implemented with complex micro-ops, the overheads of gather instructions need to be amortized among the gathered elements for good performance (Hirsh and Gideon 2017). In fact, the performance of gathering four double-precision numbers with AVX2 gather instructions is on par with software packing on the Skylake micro-architecture, while gathering eight single-precision numbers with hardware is up to 70% faster than software packing (Intel Corp. 2018a). Therefore, by combining more elements to amortize the overheads of AVX2 gather instructions, saSLP achieves a higher extra speedup from exploiting hardware gathering in the single-precision benchmarks.
Performance Results with ARMv8 Scalar to AVX-512 Translation
Figure 21 shows the speedup of ARMv8 scalar to AVX-512 translation with saSLP. On average, saSLP achieves a speedup of 1.74× by vectorizing the kernels with AVX-512 gather instructions. Compared with the AVX2 translation, the benefit of using hardware gather instructions is more significant on AVX-512. By replacing software packing with AVX-512 hardware gathering, saSLP boosts the performance by 60%, on average. In particular, SGD-predict and TSVC achieve an extra speedup of over 2×. This is because AVX-512 can gather more elements to amortize the cost, and software packing has higher latency on Knights Landing (Fog 2018; Intel Corp. 2018a).
Though AVX-512 has wider vectors than AVX2, the overall performance gain on Knights Landing (AVX-512) is less significant than that on Skylake (AVX2). The speedup achieved on Knights Landing (KNL) is limited by the high latency of indirect loads. Since KNL does not have an LLC, indirect loads that miss L2 would incur DRAM latency (≈200 cycles). Moreover, the out-of-order resources on KNL are much fewer than those on Skylake (Fog 2018; Intel Corp. 2018a), making it difficult for KNL to hide the high latency. Even so, saSLP can still achieve speedups ranging from 1.9× to 4.2× in benchmarks with better locality, such as SGD-predict, SM2-SpMM, and TSVC.
Evaluation of Spill-Aware Cost Model
Table 1 lists the total number of loops in the benchmarks and the number of loops that are filtered out by our spill-aware cost model. In this experiment, the kernels from the same benchmark suite are grouped in a whole program. We conduct the measurement with ARMv8 to AVX2 translation on the Intel Skylake platform. The optimizations for loop reductions and indirect loads are enabled. As the results show, many loops in LL and TSVC pass the spill-aware cost model evaluation and are vectorized, while many loops in NPB-mg and SGD are filtered out. The successfully vectorized loops are from the computational kernels and several array initialization routines. The other loops fail in the cost evaluation because their combining graphs are too small (i.e., insufficient data parallelism) or significant overhead could be incurred due to excessive data reorganization.
Since LL and TSVC have more vectorized loops, they have higher speedups than NPB-mg and SGD, as shown in Table 1 . The speedups do not achieve the theoretical value, mainly because, according to Amdahl's law, the speedup is bounded by the ratio of the non-SIMD portions, where time is spent for benchmark data initialization and library routines.
RELATED WORK
Auto-vectorization approaches are usually classified into two types: loop-based vectorization and basic-block vectorization. Among the latter, Superword Level Parallelism (SLP) (Larsen and Amarasinghe 2000) is the most well-known algorithm. It can combine multiple isomorphic statements into SIMD operations inside a straight-line code. A number of techniques have been developed to improve SLP, such as dealing with the presence of control flow (Shin et al. 2005), padding with non-isomorphic statements (Porpodas et al. 2015), and partial SIMD parallelism (Zhou and Xue 2016a). In this article, the proposed saSLP extends SLP to resolve the asymmetric SIMD problems in cross-ISA DBTs. Our extensions include: (1) re-vectorizing scalar/vector to longer vector; (2) runtime aliasing check; (3) the spill-aware cost model; and (4) vectorizing indirect loads with SIMD gather instructions.
Some auto-vectorizers combine loop-based and SLP vectorization to extract both inter- and intra-iteration parallelism at the same time (Rosen et al. 2007; Kim and Han 2012; Zhou and Xue 2016b). Rosen et al. (2007) and Zhou and Xue (2016b) target data-regular loops, while Kim and Han (2012) can vectorize irregular code. Their work focuses on reducing the data reorganization overhead incurred in vectorizing loops. In contrast, saSLP resolves the asymmetric SIMD capability problems in cross-ISA DBTs with a spill-aware cost model to determine whether combining the guest instructions would be profitable.
In contrast to classical SIMD architectures, the architecture of scalable vectors, such as the ARM Scalable Vector Extension (SVE) (ARM Ltd. 2017) and RISC-V vector extension (RISC-V Foundation 2016), aims to provide binaries that can be configured to varying vector widths without re-compilation. In contrast to the scalable vector, which requires special hardware support, our method is a pure software solution that can run on existing commodity processors and is transparent to existing application binaries.
DBT is a core technology used to execute an application binary of one ISA on a host machine with a different ISA. DBT has been used to achieve many different goals. Dynamic optimization systems (Bala et al. 2000; Lu et al. 2004; Wang et al. 2007 ) apply runtime binary re-translation to improve performance; binary instrumentation systems, such as Pin (Luk et al. 2005) and Valgrind (Nethercote and Seward 2007) , provide features of application analysis and debugging. DBT is also widely used for application migration across ISAs (Baraz et al. 2003; Bellard 2005; Chernoff et al. 1998) . For example, the cross-ISA DBT systems, such as Digital FX!32 (Chernoff et al. 1998) , IA-32 EL (Baraz et al. 2003) , and Transmeta CMS (Dehnert et al. 2003) , enabled IA-32 applications to be executed on Alpha, Itanium, and VLIW machines in the past.
With the advances in SIMD architectures, several DBT systems that attempt to exploit SIMD extensions have been proposed in the literature. However, they have very different objectives and approaches in their designs. The most closely related works are Hallou et al. (2016) , Hallou et al. (2015) , and Hong et al. (2016) , which present vectorization schemes from short SIMD to long SIMD for loops. Hallou et al. designed their work in a same-ISA DBT for the x86 architecture, while Hong et al. proposed a cross-ISA solution. Their approaches use loop-based vectorization to improve the parallelism. However, they still map one guest register to one host register, which cannot eliminate register spilling. In contrast, on the basis of the SLP algorithm, saSLP can combine multiple short guest registers and solve the register spilling problem as a result. Moreover, saSLP can exploit the host's advanced SIMD instructions to vectorize data-irregular loops. Pajuelo et al. (2002) proposed a binary auto-vectorization framework from scalars to vectors. Their system requires new hardware to execute speculatively vectorized code and falls back to scalar code when an invalid vectorization is detected by the hardware. Liquid SIMD (Clark et al. 2007 ) proposed a general SIMD representation, which can run on Liquid SIMD-enabled hardware. In contrast, our saSLP is a pure software solution running on the existing hardware. Vapor SIMD (Nuzman et al. 2011) proposed split vectorization, where offline compilers conduct vectorization of machine-independent parts and runtime JITs resolve the final vectorization based on the host SIMD ISA. In contrast to Vapor SIMD, our saSLP works for existing application binaries.
QEMU (Bellard 2005) emulates a guest SIMD instruction with a sequence of scalar instructions instead of leveraging host SIMD capability for portability. Other works (Fu et al. 2015; Li et al. 2006; Michel et al. 2011 ) focus on finding efficient translations of SIMD binaries between different SIMD ISAs. In their DBTs, guest SIMD instructions are translated to equivalent host SIMD instructions that operate on the same number of elements even if the host can provide more parallelism. In contrast, saSLP combines short guest instructions to form longer SIMD instructions so that the host's parallelism is fully used.
CONCLUDING REMARKS
The diversity of modern SIMD architectures in registers and advanced instructions has become a critical issue when applying DBTs for cross-ISA migration of applications. This issue will become more common and severe with the ongoing evolution of SIMD architectures. Since translating from longer to shorter SIMD is an easier problem that splitting vectors can solve, this article addresses the more complex optimization problem of leveraging longer SIMD on the host.
We propose a novel cross-ISA DBT technique, called spill-aware SLP (saSLP), which can combine multiple short ARMv8 instructions and registers to fully use the x86 AVX host's parallelism and register capacity. We also propose the spill-aware cost model and runtime aliasing check to prevent anticipated slowdown and violation of vectorization legality. Furthermore, saSLP exploits the x86 AVX gather instructions to vectorize data-irregular loops, which cannot be vectorized on ARMv8 NEON. We also present a new algorithm to vectorize loop reductions in DBTs, where the typical methods used in static compilers do not work. The evaluation results of saSLP demonstrate that the algorithm can significantly improve the execution efficiency and effectively reduce the register spilling across a range of benchmarks. Moreover, the proposed technique is also applicable to static binary translation (SBT) and can be further extended to support more guest and host SIMD ISAs since saSLP is built on top of the LLVM machine-independent IR layer.
