An automatic parallelizer is a tool that converts serial code to parallel code. This is an important problem because most hardware today is parallel and manually rewriting the vast repository of serial code is tedious and error prone. We study an automatic parallelizer of binary code, i.e. a tool that automatically converts a serial binary to a parallel binary. Such a tool is attractive because (i) much legacy code does not have source code available for it; (ii) it is compatible with all compilers and languages.
Binary automatic parallelization techniques have been developed in the past, and researchers have presented results on small kernels from polybench. These techniques are a good start; however, they are not able to parallelize real-life codes from the SPEC2006 and OMP2001 benchmarks. The main limitation of past techniques is the assumption that loop bounds are statically known (as in the small kernels). In real-life code this is not true, and dependencies in binary code cannot be completely determined when loop bounds are known only at run-time.
In this paper we present a novel algorithm that enhances past techniques by guessing the theoretical upper limits for the loop bounds using only the memory expressions from a loop. It then inserts run-time checks to see if these guesses were indeed correct and, if so, executes the parallel version of the loop; otherwise it executes the serial version. These techniques are applied to the large affine benchmarks from SPEC2006 and OMP2001. We present results on the speedups and the number of loops parallelized directly from binary with and without this algorithm. Across the 8 affine benchmarks in these suites, the best existing binary parallelization method achieves a geo-mean speedup of 1.33X, whereas our method achieves a speedup of 2.96X. This is close to the geo-mean speedup from source of 2.8X.
INTRODUCTION
With the advent of multi-core machines, it is most efficient to run parallel code on them. However, most legacy code ever written is serial. Several methods have been proposed to parallelize serial code, which include (i) explicitly rewriting serial code using MPI, pthreads, TBB, etc.; (ii) using program directives such as OMP to implicitly specify parallelism in serial code; and (iii) using an automatic parallelizer to convert serial code to parallel code without human intervention. The advantages of an automatic parallelizer over the first two methods include (i) it is not prone to human error as no human intervention is involved; (ii) programmers do not need to be trained to think in parallel. Many studies have shown that programmer productivity per line is much lower when writing parallel code than when writing serial code. Hence, we choose automatic parallelization to bridge the gap between serial code and parallel hardware.
In this paper we develop a mechanism to implement an automatic parallelizer within a binary rewriter, i.e. we develop a tool that takes as input serial binary code and produces as output a parallel binary. The advantages of a parallelizing binary rewriter include: (i) it works for all languages and compilers since the binary rewriter can accept any program in its instruction set; (ii) it works on legacy code for which no source code is available; (iii) it works on assembly language programs; (iv) it can be used by the end user to perform platform-specific optimizations and (v) it has high economic feasibility since it needs to be implemented once per instruction set.
In the past a few attempts have been made to parallelize affine codes directly from binaries [7, 12]. Though these papers present good foundational ideas and results on polybench kernels, they are only a start to parallelizing large real-world affine codes such as the SPEC2006 or OMP2001 benchmark suites. The major limitation of these methods is that their algorithms are not powerful enough to handle loops with run-time dependent loop bounds, and such loops are present in abundance in real-life code. This limitation is foundational in those methods, and not easy to remove. In this paper we present a novel algorithm to work with loops with run-time dependent bounds, and our results show that executables corresponding to the affine benchmarks in the SPEC2006 and OMP2001 suites can be parallelized directly with these techniques.
The rest of this paper is arranged as follows. Section 2 presents the closest related work and contrasts our techniques with it. Section 3 presents the limitations of existing binary affine parallelization techniques using an example and motivates our algorithm, followed by a brief outline of the algorithm and more examples in section 4. The core algorithm is presented in section 5, followed by the description of our infrastructure in section 6 and the results in section 7.
RELATED WORK
In this section we present potentially competing related work contrasting it to the algorithm presented in this paper in the following categories (i) static automatic parallelization of binaries; (ii) dynamic automatic parallelization of binaries; (iii) automatic vectorization of binaries and (iv) array delinearization techniques.
Static automatic parallelization of binaries: Kotha et al. [7] and Pradelle et al. [12] are the only two static methods we are aware of that perform automatic parallelization in a binary rewriter. Both these methods present results on small kernels that are a part of the polybench benchmark suite. [12] automatically parallelizes binaries by feeding the binary intermediate form to a polyhedral compiler, and shows results only on the polybench benchmark suite. Our method is able to parallelize affine benchmarks from the SPEC2006 and OMP2001 benchmark suites. Kotha et al. [7] statically parallelize binaries by using dependence information determined from binaries. However, their methods are limited to affine loops where loop bounds are known and hence can parallelize only small kernels from the polybench benchmark suite. Both these methods present a brief section on run-time dependent loop bounds and suggest adding run-time checks to verify that different accesses were indeed to different arrays. This approach is insufficient since it has no mechanism to break dependencies between accesses to the same array; in the absence of such a mechanism it will not produce any result on real code, and no results on real code have been shown. We build on this work and have devised a novel algorithm to guess the theoretical limits for the loop bounds of affine loops, which lets us parallelize affine benchmarks from the SPEC2006 and OMP2001 benchmark suites.
Dynamic automatic parallelization of binaries: Dynamic automatic parallelization techniques present in the literature are Yardimci et al. [15], Wang et al. [14] and Yang et al. [6]. [15] focuses on a dynamic method to detect non-affine parallelism. [14] presents a dynamic method to parallelize binaries using speculative slicing and [6] presents a method to use run-time information to parallelize binary code. All these methods are dynamic. Hence, they suffer from run-time overheads from analysis. Most importantly, they do not optimize affine loops whereas our method does.
Vectorization of binaries: [10] and [4] present techniques to analyze binaries and vectorize them. Their analysis is limited to vectorization of binaries and does not attempt to parallelize them using threads as we do.
Array delinearization techniques: Array delinearization methods [9] [5] take source code with linearized multi-dimensional accesses as input, and convert those accesses to multi-dimensional accesses when possible. Ideally, if we could delinearize array accesses in a binary, we could take parallelizing decisions on them just as we would from source. However, source-level methods to delinearize array accesses cannot be adapted to binaries easily. This is because delinearization methods such as [9], [5] require a high-level, C-like intermediate representation equivalent to the symbolic information in compilers, which is not available when analyzing binary code. Such symbolic information is crucial because it contains information about the number, location, and dimension sizes of arrays, which the delinearization methods use. Finding this information in the general case from stripped binaries (i.e. those without symbolic information) is hard, and there are no existing methods for it. Hence delinearization methods cannot be adapted for binary code.
The method in this paper circumvents the problem of missing array information in binaries by not attempting to recover guaranteed information about array locations and dimension sizes. Instead it guesses the theoretical limits for the loop bounds by guessing what the different arrays and the different dimensions might have been, analyzing only the address expressions recovered from binary code. It then calculates the dependences with these loop bounds and parallelizes the loop. It then adds run-time checks to execute the parallel version of the loop when the actual bounds are within the guessed ranges, and executes the serial version otherwise. No previous method guesses loop bounds from binaries, or uses run-time checks like our method. The result is that our method is the first to parallelize binary code with unknown loop bounds and hence can present results on large affine benchmarks from SPEC2006 and OMP2001.
MOTIVATION
To parallelize affine loops, traditional techniques calculate distance vectors for the loops and then reason about parallelizing them. In this section we first describe the best-known methods for obtaining distance vectors from source code for affine loops with run-time determined loop bounds. We then present the limitations of the existing binary method for the same and briefly describe our method.
The code shows a normalized loop, i.e. a loop with a lower bound of zero and a step of one. Loops can be normalized using existing methods such as the normalization pass in LLVM.
Figure 1: Code Example
Distance vectors for the source code in figure 1 are calculated as follows. The existing source methods make the assumption that row and column accesses are within the bounds of the array's dimensions. They solve for two iterations that refer to the same memory location within the infinite range of iteration values. If no solution exists, as in this example, we can conclude that no two iterations ever access the same memory. This implies that iterations of the i loop can execute in parallel (i.e., the component of the distance vector for this loop is zero). The same analysis, applied individually to the j loop, proves that it is parallel as well.
To obtain distance vectors from binary for this code we cannot use the above source method, since it relies on known affine expressions for array indexes in terms of induction variables, which are not apparent from the binary. Instead we start with the existing method for binaries in [7]. It shows that we can recover linearized expressions for memory accesses from a binary, and that solving these multi-dimensional expressions gives us distance vectors. When loop bounds are statically known, the solutions from binaries are very powerful, and can handle most linear algebraic kernels as presented by [7]. However, when loop bounds are run-time dependent, we need to solve these multi-dimensional expressions in the infinite space (since we need to assume that the loop bounds can take any value at run-time). This greatly reduces the precision of the analysis.
Let us apply the existing binary method to figure 1. From the binary for the code above we will recover a memory expression of the form BaseA + 200i + 4j which corresponds to the A[i][j] access (assuming the element size is 4). The '200' in 200i is because the size of a row is 50 elements, each of 4 bytes. We need to reason about this access in the infinite space for i and j since the loop bounds are unknown. In the infinite space, iterations (2,0), (1,50) and (0,100) refer to the same memory location. Of these, only (2,0) is a legal iteration: the legal range of j is [0,49], and a larger j would access a column out of bounds, thereby wrapping into a later row. Source code methods assume that such iterations are not possible, hence proving the loop is parallel. However, unlike source methods, the binary method in [7] cannot make any assumptions about iterations remaining within array bounds, since the bounds of the individual array dimensions are not known. As a result, without loop bounds, the binary method in [7] fails to prove the loop is parallel because spurious dependences appear. We make the observation that the spurious dependences that appear in binary code are predominantly due to two things: (i) columns wrapping into rows (i.e. different dimensions wrapping into each other) in an access and (ii) different arrays overlapping with each other.
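The spurious dependence can be reproduced with a few lines of code. The following is a minimal sketch and not part of the method in [7]: it assumes BaseA = 0, and the helper names address and has_collision are hypothetical. It enumerates the addresses generated by BaseA + 200i + 4j and reports whether two distinct iterations touch the same location.

```python
BASE_A = 0  # assumed base address; any constant gives the same result


def address(i, j, base=BASE_A, row_stride=200, elem_size=4):
    """Linearized address of the form BaseA + 200i + 4j."""
    return base + row_stride * i + elem_size * j


def has_collision(ub_i, ub_j):
    """True if two distinct iterations with 0 <= i < ub_i, 0 <= j < ub_j alias."""
    seen = set()
    for i in range(ub_i):
        for j in range(ub_j):
            a = address(i, j)
            if a in seen:
                return True
            seen.add(a)
    return False


# The three iterations named in the text map to one address.
aliased = {address(2, 0), address(1, 50), address(0, 100)}
print(len(aliased))           # one distinct address
print(has_collision(20, 50))  # j kept within [0, 49]
print(has_collision(2, 51))   # j allowed to reach 50
```

With j restricted to [0,49] no two iterations collide, whereas letting j reach 50 makes iteration (0,50) alias with iteration (1,0).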
In this paper we present a method to statically guess the theoretical limits for loop bounds in loops with run-time dependent loop bounds such that we prevent these spurious dependences from occurring in binary code. We then insert run-time checks on the bounds, and run the parallel version of the loop if the loop bounds are indeed within the ranges that we guessed. For example, for the loop shown in figure 1, using the theory presented in [7] we discover the memory expression for the A[i][j] access to be BaseA + 200i + 4j. We then look at the coefficients multiplying the induction variables in this memory expression and guess that the j dimension represents columns since it has the smaller co-efficient, and that the i dimension represents rows since it has the larger co-efficient, assuming row-major ordering (a similar argument can be made for column-major code as well). We then guess the limit on j as (Coefficient of i / Coefficient of j), i.e. 200/4 = 50, ensuring that our guessed columns do not wrap into our guessed rows. This way, within the guessed bounds, we prevent the spurious dependence of columns wrapping into rows. With the guess that j is less than 50, no two iterations access the same memory, since j has been prevented from running into i, and hence we can parallelize the loop. At run-time we check that j indeed stays below 50; for this loop the check will always succeed and we will always execute the parallel version.
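The guess and the run-time check for this loop can be sketched as follows. This is an illustration under stated assumptions rather than the rewriter's actual code: the names guess_inner_limit and run_guarded are hypothetical, integer division is assumed, and ub_j denotes the exclusive run-time upper bound of the j loop (j runs from 0 to ub_j - 1).

```python
def guess_inner_limit(coef_outer, coef_inner):
    """Guess how many inner iterations fit before wrapping into the outer dimension."""
    return coef_outer // coef_inner  # assumption: integer division


def run_guarded(ub_j, limit_j, parallel_version, serial_version):
    """Run the parallel loop only when the actual bound is within the guess."""
    if ub_j <= limit_j:
        return parallel_version()
    return serial_version()


limit_j = guess_inner_limit(200, 4)  # coefficients of i and j in BaseA + 200i + 4j

# Stand-ins for the two loop versions emitted by the rewriter.
chosen = run_guarded(50, limit_j, lambda: "parallel", lambda: "serial")
print(limit_j, chosen)
```

The guard costs one comparison per loop entry, which is why the check is lightweight compared to the loop body it protects.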
In section 4 we present more examples and briefly describe how our algorithm would guess loop bounds for them followed by the detailed algorithm in section 5. Within these bounds for the induction variables the loop is parallel in most cases. We insert run-time checks to check that these guesses were indeed correct before executing the parallel version of the loop; else, we execute the serial version of the loop. The run-time check is necessary for binary code since our guesses may be incorrect in some cases. We show one such example in section 4.
Since we are solving for ranges of induction variables by constraining linear expressions, we could use standard linear algebraic projection mechanisms such as the Fourier-Motzkin method. However, these are not needed, because our algorithm is based on a series of guesses and each guess constrains the linear expressions in only a single way. For example, when guessing the different dimensions and constraining one dimension to not run into another, it is sufficient to divide the two co-efficients (as in the above example), and there is no need to use a linear algebraic method of exponential complexity. Hence, we present a simple algorithm with linear complexity to guess the theoretical loop bounds.
EXAMPLES
In this section we first briefly describe the steps of the algorithm described in section 5 and then apply it to four code examples to show how their loops can be parallelized from a binary even though the loop bounds are run-time dependent.
First, we state the algorithm that we use to guess the limits of the loop bounds for a loop directly from a binary using a series of guesses. Only the steps are outlined here; the details are in section 5.
Step 1: Divide memory accesses (both reads and writes) in a loop into Dependence Groups (DGs) by guessing which accesses may overlap with each other and which may not. Place all accesses that may overlap with each other in the same DG.
Step 2: Arrange all DGs in ascending order of their base addresses, from DG1 to DGT.
Step 3: For all the DGs that contain a write, make best guesses to differentiate the dimensions and then guess the possible range for the induction variables, ensuring that the different dimensions of an array do not overlap with each other. These guesses are called intra-group constraints, since they are obtained by working on one DG at a time. These guesses eliminate the spurious dependences from binary code caused by different dimensions wrapping into others.
Step 4: Initiate a worklist with all DGs with constraints remaining after step 3.
Step 5: Work on each DGi in the worklist and solve for the values of induction variables such that the accesses in DGi do not overlap with those in DG(i+1). This is equivalent to enforcing the guess from step 1, which said that accesses from different DGs represent non-overlapping regions. It also eliminates the spurious dependences from binary code that arise due to different non-overlapping regions or different arrays running into each other. This generates further guesses on the induction variables. These guesses are called inter-group constraints because they are obtained by constraining DGi to not overlap with DG(i+1).
Example 1: The memory address expressions that we recover from the binary above will be of the form BaseA + 200i + 4j and BaseB + 200i + 4j (assuming that the size of an integer is 4). BaseA and BaseB will differ by at least 4000, since the size of each array is 4000 bytes. Without loss of generality, let us assume we recover the following expressions from the binary: 100 + 200i + 4j and 4100 + 200i + 4j.
When the code above is compiled to a stripped binary, symbolic information is lost. Hence we no longer know the location or dimension sizes (20, 50) of array A, and we can no longer infer (as we implicitly do from source) that ubi < 20 and ubj < 50. Instead, the maximum values of these loop bounds must be inferred.
We now show briefly how our algorithm is applied to these accesses to guess the bounds on i and j. In Step 1, we check to see if the accesses belong to different DGs. The heuristic we use is that the difference of the bases is greater than a factor (5 for our experiments) of the highest coefficient;
Since this is true both the accesses will belong to different DGs. Hence, we guess that the two accesses in this loop do not overlap with each other which is true for this loop. In Step 2, we arrange the DGs in ascending order of their bases, i.e. 100 + 200i + 4j belongs to DG1 because its base is lower than the second access which belongs to DG2.
In Step 3, we solve for intra-group constraints in DG2 since it contains a write. We guess that the j dimension corresponds to columns since it has the smaller co-efficient and the i dimension corresponds to rows since it has the higher co-efficient, which is true for this loop. Hence, we guess the bound on j as (Co-efficient of i / Co-efficient of j), i.e. 200/4 = 50, so j must belong to [0, 49] so that it does not overlap with the row accesses. In step 4, we create a worklist with all DGs that have constraints remaining. In this example both the DGs have constraints remaining on i; hence both of them will belong to the worklist. In step 5, we guess the bound on i by requiring that DG1, i.e. 100 + 200i + 4j, does not overlap with DG2, i.e. 4100 + 200i + 4j, given that the highest possible value for j is 49; i.e. 100 + 200i + 4 × 49 < 4100. Hence, i must be less than 19.02, i.e. in the range [0, 19]. Since DG2 is the highest DG we do not solve for it overlapping with any other DG.
After we have applied our algorithm to this loop, our guess for i is [0,19] and j is [0,49]. We now solve for dependencies within this range for the loop and discover that the loop can be parallelized. We also add lightweight run-time checks before the parallel version of the loop (which will always succeed for this loop).
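The arithmetic of Example 1 can be checked with a short script. This is a sketch under stated assumptions: the function names are hypothetical, limits are expressed as exclusive upper bounds, and integer division is assumed for the intra-group guess.

```python
def intra_group_limit(coef_hi, coef_lo):
    """Exclusive limit on the lower dimension so it does not wrap into the higher one."""
    return coef_hi // coef_lo


def inter_group_limit(base_lo, base_hi, coef_outer, coef_inner, inner_limit):
    """Exclusive limit on the outer variable so that the lower DG stays below base_hi.

    Solves base_lo + coef_outer*i + coef_inner*(inner_limit - 1) < base_hi for i.
    """
    slack = base_hi - base_lo - coef_inner * (inner_limit - 1)
    # Largest integer i with coef_outer * i < slack, plus one for an exclusive limit.
    return (slack - 1) // coef_outer + 1


j_limit = intra_group_limit(200, 4)                       # step 3 on DG2
i_limit = inter_group_limit(100, 4100, 200, 4, j_limit)   # step 5 on DG1 vs DG2

# Highest address touched by DG1 within the guessed ranges.
max_dg1 = 100 + 200 * (i_limit - 1) + 4 * (j_limit - 1)
print(j_limit, i_limit, max_dg1)
```

The highest DG1 address within the guessed ranges stays below 4100, the base of DG2, so the two groups cannot overlap inside the guessed bounds.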
Example 2: The memory address expression that we recover from the binary will be of the form BaseA + 400j + 4i. Since there is only one access, steps 1 and 2 will result in placing it in DG1. In step 3, we guess that the i access represents columns since it has the lower co-efficient and the j access represents rows since it has the higher co-efficient. Hence, we guess the limit of i as 400/4 = 100, i.e. the range of i as [0, 99]. There are no steps 4 and 5 for this loop since there is only one DG.
Next we calculate dependencies assuming the range of i is [0,99] and that j can take any value, and discover that the loop can be parallelized. In reality, however, the range of i will not exceed [0, 49]. We discovered larger bounds since we had no way to recover the factor 2 for j in the source directly from binary code. Our larger discovered bounds work well: since i in [0,49] always lies within [0,99], our bounds check will always succeed and we will execute the parallel version of the loop. From the binary this means that we see an A[20][50] array as an A[10][100] array. However, this is fine, as we reason about the dependencies in the correct way and parallelize the loop only when our run-time checks succeed.
Example 3: The equations that we recover from the binary will be BaseA + 4i and BaseA + 200 + 4i. After step 1, we will place them in different DGs since the difference between the bases (200) is greater than 5 times the highest co-efficient, 4. The guess in this step is that both these accesses belong to non-overlapping memory regions. Even though the two accesses access the same array, this guess is correct for this loop since they access non-overlapping regions. Next, in step 2, we arrange the DGs in ascending order such that BaseA + 4i belongs to DG1 and BaseA + 200 + 4i belongs to DG2. No intra-group guesses are calculated in step 3 since the recovered equations are single-dimensional; hence, there are no dimensions to differentiate. After step 4, the worklist is populated with both the DGs since both contain i, for which there is no guess as yet. In step 5, we solve for inter-group guesses such that DG1 does not overlap with DG2, i.e. 4i < 200 or i < 50. Hence, the range we guess for i is [0,49], which is also the actual limit on i from source. The run-time check will always succeed in binary code and we will execute the parallel version of this loop. This is correct because, regardless of the value of ubi, the two array references access non-intersecting portions of the array. Our method correctly treats these non-intersecting portions as different DGs.
Example 4: The equations we will recover from the binary will be BaseA + 8i and BaseA + 200 + 4i. After step 1, we will place them in different DGs since the difference between the bases (200) is greater than 5 times the highest coefficient, 8. In this step we have guessed that both these accesses are non-overlapping; however, for this loop this is not true. Hence, we have made a wrong guess. Let us see how this affects us. In step 2, the accesses are arranged in ascending order, i.e. BaseA + 8i will belong to DG1 and BaseA + 200 + 4i will belong to DG2. No intra-group guesses are calculated in step 3 since the recovered equations are single-dimensional.
After step 4, the worklist will contain both the DGs since both contain i for which there is no guess as yet. In step 5, we solve for inter-group guesses such that DG1 does not overlap with DG2, i.e. 8i < 200 or i < 25. The range we guess for i is [0,24]. However, the actual range of i is [0,49], so our run-time check will fail and we will execute the serial version of the loop. This is fine for this loop since the actual loop is serial (it cannot be parallelized).
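Examples 3 and 4 differ only in the coefficient of the lower DG, which makes them convenient to compare side by side. The following sketch assumes BaseA = 0 and an actual exclusive bound of 50 on i (the range [0,49] quoted above); the function names are hypothetical.

```python
BASE_A = 0      # assumed base address
ACTUAL_UB = 50  # actual exclusive bound on i, i.e. i ranges over [0, 49]


def single_dim_limit(base_lo, base_hi, coef_lo):
    """Exclusive limit on i such that base_lo + coef_lo*i stays below base_hi."""
    diff = base_hi - base_lo
    return (diff - 1) // coef_lo + 1


def regions_overlap(base_lo, coef_lo, base_hi, coef_hi, ub):
    """True if the two accesses touch a common address for some 0 <= i < ub."""
    lo = {base_lo + coef_lo * i for i in range(ub)}
    hi = {base_hi + coef_hi * i for i in range(ub)}
    return bool(lo & hi)


limit_ex3 = single_dim_limit(BASE_A, BASE_A + 200, 4)  # BaseA + 4i vs BaseA + 200 + 4i
limit_ex4 = single_dim_limit(BASE_A, BASE_A + 200, 8)  # BaseA + 8i vs BaseA + 200 + 4i

check_ex3 = ACTUAL_UB <= limit_ex3  # run-time check of Example 3
check_ex4 = ACTUAL_UB <= limit_ex4  # run-time check of Example 4

overlap_ex3 = regions_overlap(BASE_A, 4, BASE_A + 200, 4, ACTUAL_UB)
overlap_ex4 = regions_overlap(BASE_A, 8, BASE_A + 200, 4, ACTUAL_UB)
print(limit_ex3, check_ex3, overlap_ex3)
print(limit_ex4, check_ex4, overlap_ex4)
```

In Example 3 the check passes and the regions are disjoint over the actual range; in Example 4 the check fails exactly in the case where the two accesses do overlap, so falling back to the serial loop is the safe outcome.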
ALGORITHM TO GUESS LOOP BOUNDS
In this section we describe in detail the algorithm briefly presented in section 4. First we describe which loops from binary code we work on and then in subsequent subsections we describe the steps of the algorithm in detail.
First, we describe the loops to which our algorithm is applied and the kinds of loops on which it is effective. Since it is impossible to tell from binary code the form of the accesses in the source it was compiled from, our method is applied to every loop that contains only affine accesses. However, the guesses may not be correct for loops containing multiple induction variables in a single array expression or repeated induction variables. Hence, the run-time checks might fail for these loops and the serial version of the code may be executed. However, these kinds of accesses are rare in real code and hence our method is nearly as powerful from binary as from source.
Every affine memory address that we recover from the binary is a linearized multidimensional equation of the form [7]:
$$\text{MemAddr} = \text{Base} + d_1 i_1 + d_2 i_2 + \cdots + d_n i_n \qquad (1)$$
(where Base and the d's are constants or loop-invariant quantities, the i's are induction variables, and d1 >= d2 >= ... >= dn).
We arrange the memory expression with the d's in this order because the algorithm uses the order to differentiate dimensions; also, while guessing the value of a particular induction variable we use the immediately higher co-efficient, i.e. we use d(m−1) when guessing the values of induction variable im. We will refer to memory addresses from binary using MemAddr(Base, d) throughout the paper. Different memory addresses from binary will have different Base and d's. Since we work on loops with only affine accesses in them, if we discover that a loop contains an access that is not affine, i.e. we cannot discover a linearized expression for it, then we do not work on that loop.
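One way to hold a recovered address in this canonical order is sketched below. The representation is an assumption made for illustration: the class name MemAddr mirrors the notation MemAddr(Base, d), while the field ivs and the helpers make_mem_addr and evaluate are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class MemAddr(NamedTuple):
    base: int
    d: Tuple[int, ...]    # coefficients in non-increasing order (d1 >= d2 >= ...)
    ivs: Tuple[str, ...]  # induction variable names, in the same order as d


def make_mem_addr(base: int, coeffs: Dict[str, int]) -> MemAddr:
    """Sort the (variable, coefficient) terms so the largest coefficient comes first."""
    ordered = sorted(coeffs.items(), key=lambda term: term[1], reverse=True)
    return MemAddr(base,
                   tuple(c for _, c in ordered),
                   tuple(v for v, _ in ordered))


def evaluate(addr: MemAddr, values: Dict[str, int]) -> int:
    """Compute Base + d1*i1 + ... + dn*in for concrete induction variable values."""
    return addr.base + sum(c * values[v] for c, v in zip(addr.d, addr.ivs))


a = make_mem_addr(100, {"j": 4, "i": 200})  # the access 100 + 200i + 4j
print(a.d, a.ivs, evaluate(a, {"i": 2, "j": 3}))
```

Keeping the variable names next to the sorted coefficients lets later steps attach a guessed limit to the right loop even when the source order of the loops differs from the coefficient order.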
In the following subsections we first describe our algorithm and then present an intuition for it. No proof can be presented that the guessed loop bounds are always correct, since they are not always correct. However our overall method is always correct, since we include a run-time check for the bound which executes fall-back serial code when the actual bound found at run-time is outside the guessed range of bounds. In the common case, the check succeeds and parallel code is executed.
Step 1: Divide the accesses into DGs
A DG is a subset of memory references in the loop that are sufficiently close to one another and that most likely do not overlap with the references in other DGs. Intuitively, while dividing memory references into DGs we try to guess all the references which access the same array, or a region of an array not overlapping with other regions. This is not immediately apparent since binaries lack symbolic information containing the locations and sizes of arrays.
We create DGs using the following method. We look at the address of each memory reference and place it in an already present DG if it is sufficiently close to the addresses already in that DG; else we create a new DG with this memory address. We define that two accesses are sufficiently close to one another if the difference between the bases is within a factor of the highest coefficient in the memory expression. The formal algorithm is presented in algorithm 1.
We now describe some of the terms used in the algorithm.
DGlist is the list of DGs that is initialized to NULL and then populated as we consider every memory access in the loop. d1 is the highest coefficient in the memory expression; hence, if the difference between the base and any of the bases already in a DG is within a factor of it, we guess that it most likely belongs to the same memory array and place this reference in that DG. CDThres is a number that guesses the maximum difference between references in the same DG. Currently we set CDThres to 5. With CDThres = 5, two accesses to A[i] and A[i+4] will belong to the same DG, whereas two accesses to A[i] and A[i+10] will belong to different DGs.
Having accesses to A[i] and A[i+e] where e > 5 in the same loop is relatively rare in affine codes, as most constants in affine codes are less than 5. If this rare case occurs we will treat A[i] and A[i+e] as accesses to two different arrays. Accesses to different arrays A and B will belong to different DGs unless the highest dimension of A has size less than 5 (which is rare) and B immediately follows A in the binary's data layout. If this rare case appears we will treat both A and B as the same array. In both the above cases, the run-time checks will fail and the serial version of the loop will be executed. Hence, the loop may run slower than from source, but correctness is always maintained.
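The grouping rule described above can be sketched as follows. This is an illustration and not a transcription of algorithm 1: it assumes each access is given as a (base, d1) pair, that "within a factor" means a difference of at most CDThres × d1, and that d1 is taken from the access being placed. The name build_dgs is hypothetical.

```python
CD_THRES = 5  # value of CDThres used in the text


def build_dgs(accesses):
    """Group (base, d1) accesses into Dependence Groups.

    An access joins the first DG holding a base within CD_THRES * d1 of its
    own base; otherwise it starts a new DG.
    """
    dg_list = []
    for base, d1 in accesses:
        for dg in dg_list:
            if any(abs(base - other) <= CD_THRES * d1 for other, _ in dg):
                dg.append((base, d1))
                break
        else:
            # No existing DG is close enough, so open a new one.
            dg_list.append([(base, d1)])
    return dg_list


# A[i] and A[i+4] with 4-byte elements: bases differ by 16.
same = build_dgs([(0, 4), (16, 4)])
# A[i] and A[i+10] with 4-byte elements: bases differ by 40.
different = build_dgs([(0, 4), (40, 4)])
# Example 1: 100 + 200i + 4j and 4100 + 200i + 4j.
example1 = build_dgs([(100, 200), (4100, 200)])
print(len(same), len(different), len(example1))
```

The result depends on the order in which accesses are visited only when an access is close to more than one existing DG; the sketch then keeps the first match, which is one possible design choice.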
Step 2: Arrange DGs in ascending order
In this step we reorganize the DGs in DGlist in ascending order of the bases present in them. After arranging them in ascending order the following will be true:
All bases in DG1 < All bases in DG2 < · · · < All bases in DGT (This will be < since if they are equal they would belong to the same DG). We call this ordering of DGs, the FullList.
Step 3: Induce intra-group dependencies
In this step we make our best guesses for the theoretical limits for all induction variables, except the induction variable for the highest dimension. We make these guesses by guessing what the different dimensions might have been from source looking at the array expressions recovered from binary code. We apply step 3 to every DG that has a write in it. The reason we apply it to DGs with writes in them is that even if a read accesses across bounds it does not create dependencies that prevent parallelization and guessing bounds considering DGs with only reads is not necessary. For example, if there is an affine loop that only reads from an array, there is no need to guess bounds for such a loop as it is parallel in the infinite space as long as there is no scalar dependency in it.
Step 3 is divided into two sub steps 3.1 and 3.2. Step 3.1 is applied to every access in a DG and step 3.2 is applied to a pair of accesses in a DG. We first present the algorithms for both the sub steps before presenting intuitions for them.
Step 3.1: The formal algorithm for step 3.1 is presented in algorithm 2. We are working on loop nests with induction variables, say i1, i2, · · · , in, and a guess for each: g1, g2, · · · , gn. First, we initialize the guesses for each of these induction variables to TOP, representing infinity, which is what we know about each of the induction variables before the start of this step. Then we look at every memory access, which is of the form MemAddr(Base, d) (from eq(1)), and guess that the different induction variables correspond to different dimensions in increasing order of their co-efficients, i.e. the induction variable with the smallest co-efficient corresponds to the lowest dimension, the next induction variable corresponds to the next dimension and so on, until the induction variable with the highest co-efficient corresponds to the highest dimension. Hence, the guesses we make for each of the induction variables, such that they do not overlap with the next dimension, are as follows.
The guess on ik is
$$g_{1k} = \frac{d_{k-1}}{d_k}$$
We then update the guess already in gk for ik using gk = min(gk, g1k)
Note: min(TOP, g1k) = g1k since TOP represents infinity.
We apply this to every memory access in every DG that has a write and guess for every induction variable other than the highest dimension i1. Note that we cannot make a guess for i1 since there is no d0 in the equation. Hence, we do not have a guess for i1 in this step. The guess for i1 is made in step 5 and will be described later.
Algorithm 2 Step 3.1: Guesses for induction variables using one access
Input: All DGs that have a write in them
Output: Initial guesses for the induction variables
Require: Initialize each of g1, g2, · · · , gn to TOP
for all DGi in FullList that has a write in it do
    for all MemAddr(Base, d) in DGi do
        for k = 2 → n do
            g1k = d(k−1) / dk
            gk = min(gk, g1k)
        end for
    end for
end for
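Algorithm 2 translates almost line for line into executable form. The sketch below makes a few assumptions that the text does not state: each access is a (base, d) pair with d already sorted in non-increasing order, every access orders the induction variables the same way, integer division is used for d(k−1)/dk, and a zero coefficient (a variable absent from the access) is skipped. The name step_3_1 is hypothetical.

```python
import math

TOP = math.inf  # "no guess yet"


def step_3_1(dgs_with_writes, n):
    """Return guesses [g1, ..., gn]; g1 stays TOP because there is no d0."""
    g = [TOP] * n
    for dg in dgs_with_writes:
        for _base, d in dg:
            for k in range(1, n):   # 0-based k, i.e. k = 2..n in the paper
                if d[k] == 0:       # assumption: variable not used by this access
                    continue
                g1k = d[k - 1] // d[k]
                g[k] = min(g[k], g1k)
    return g


# DG2 of Example 1 holds the write 4100 + 200i + 4j.
guesses = step_3_1([[(4100, (200, 4))]], n=2)
print(guesses)

# Two accesses in one DG: the tighter guess wins.
tight = step_3_1([[(0, (400, 4)), (0, (200, 4))]], n=2)
print(tight)
```

Taking the minimum across accesses means a single access with a short row stride limits the guess for every other access in the loop, which is the conservative choice.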
Step 3.2: After we have applied step 3.1 to all DGs that have a write in them, we work on the same DGs considering pairs of accesses in them and apply step 3.2 on them. This algorithm is presented in algorithm 3.
Algorithm 3 Step 3.2: Guesses for induction variables using pair of accesses
Input: All DGs that have a write in them and g1, g2, · · · , gn from step 3.1
Output: Refined guesses for induction variables
Require: Initialize x1, x2, · · · , xn to zeroes
We now describe the algorithm briefly. We first initialize x1, x2, · · · , xn to zeroes. These represent the adjustments we need to make to the induction variable bound guesses at the end of this step. Then we consider pairs of accesses in the DG; if their bases differ, we store the absolute difference in Basediff. We then run a loop that determines which part of this difference came from which co-efficient and keep track of it in the xks. Later these are subtracted from the guesses gk for the induction variables from step 3.1.
It is important to make this adjustment to the guesses on loop bounds from step 3.1, since by doing so we make sure that each access does not run into the higher dimension of the other. After this adjustment we will not have spurious dependencies from binary that prevent parallelization. We present further intuition for this step below.
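The pairwise adjustment can be sketched as follows, based only on the prose description of step 3.2. The function name and data layout are hypothetical, co-efficients are again assumed to be listed from the highest dimension to the lowest, and keeping the largest adjustment seen across all pairs is an assumption of this sketch rather than something the text states.

```python
import math

def refine_guesses_pairwise(accesses, g):
    """Subtract per-variable adjustments x_k from the step 3.1 guesses g.

    accesses: [(base, d), ...] for one DG, with d = [d1, ..., dn].
    g:        guesses from step 3.1 (g[0] is typically still infinity).
    """
    n = len(g)
    x = [0] * n  # adjustments, initialized to zero as in the text
    for a in range(len(accesses)):
        for b in range(a + 1, len(accesses)):
            (base1, d), (base2, e) = accesses[a], accesses[b]
            basediff = abs(base1 - base2)
            for k in range(n):
                step = math.gcd(d[k], e[k])
                if step == 0:
                    continue  # neither access uses this induction variable
                # The quotient is the share of the base difference attributed
                # to dimension k; the remainder is carried to the next dimension.
                quotient, basediff = divmod(basediff, step)
                x[k] = max(x[k], quotient)  # assumption: keep the largest adjustment
    return [gk - xk for gk, xk in zip(g, x)]

# Hypothetical pair A[i1][i2] and A[i1][i2 + 2] on a [50][100] array
pair = [(1000, [100, 1]), (1002, [100, 1])]
print(refine_guesses_pairwise(pair, [math.inf, 100]))
```

In this hypothetical case the base difference of 2 is attributed entirely to i2, so its guess drops from 100 to 98, which matches the source-level requirement i2 + 2 < 100.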
Intuition for Step 3.1: Let us assume that the binary code we are analyzing came from source code where the loop nest had induction variables (say i1, i2, · · · , in) and an array access A[C1 × i1 + B1][C2 × i2 + B2] · · · [Cn × in + Bn] in the loop, and the size of array A is [n1][n2] · · · [nn]. Assume that none of the induction variables is repeated; however, any ordering of the induction variables is allowed. This access, when recovered from the binary, will be of the form:
$$\left(\mathit{Base}_A + \sum_{j=1}^{n} B_j \prod_{x=j+1}^{n} n_x\right) + \sum_{j=1}^{n} C_j \times i_j \prod_{x=j+1}^{n} n_x \tag{4}$$
(This assumes an element size of 1; else each one of the terms will be multiplied by the element size.) The algorithm is correct even if the compiler uses the column-major layout; we assume the row-major layout only for presenting the intuition. Our results also include Fortran benchmarks, for which the gcc compiler uses the column-major layout. BaseA and all the terms containing Bs (shown in parentheses above) are rolled into the constant term when recovered from the binary. We know that the memory address that we recover from binary is of the form MemAddr(Base, d) (from eq.(1)).
Equating (4) and (1) we get:
$$\mathit{Base} = \mathit{Base}_A + \sum_{j=1}^{n} B_j \prod_{x=j+1}^{n} n_x \tag{5}$$
$$d_k = C_k \prod_{x=k+1}^{n} n_x \tag{6}$$
First, let us calculate the actual upper bounds of the induction variables from source. From source we know that the array indices do not access arrays out of their bounds. Hence, each dimension index must be less than the actual size of that dimension, i.e. Ck × ik + Bk < nk. Rearranging the terms,
$$i_k < \frac{n_k - B_k}{C_k} \tag{8}$$
Second, let us see what our guess for induction variable ik is by applying step 3.1 to this access. Our guess for induction variable ik is obtained by substituting eq.(6) in eq.(2), i.e.
$$g1_k = \frac{d_{(k-1)}}{d_k} = \frac{C_{(k-1)} \times n_k}{C_k} \tag{9}$$
Next, taking the minimum of g1k and TOP (the initialized value), we get
$$g_k = \min(\mathit{TOP}, g1_k) = \frac{C_{(k-1)} \times n_k}{C_k}$$
We now show that the guesses for induction bounds are greater than or equal to the actual loop bounds. This is important because if the guesses were lower than the actual bounds, our run-time checks would fail and we would always run the serial version of the loop, which would defeat the purpose of parallelization. We have already seen that the guess on the induction variable ik is $\frac{C_{(k-1)} \times n_k}{C_k}$; for positive C(k−1) and non-negative Bk this is greater than or equal to the actual limit of ik, which is $\frac{n_k - B_k}{C_k}$ from eq.(8). We observe that if C(k−1) is 1 and Bk is 0, then the value we would have guessed is the same as the actual upper bound. Further, if Ck is 1 as well, the guess for ik is nk, which is the size of that array dimension.
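To make the comparison concrete, the following worked example plugs hypothetical numbers into the expressions above. The array shape [50][100], the access A[i1][2 × i2 + 4] and the unit element size are assumptions chosen for illustration; they do not come from the benchmarks.

```latex
% Assumed values: n_1 = 50, n_2 = 100, C_1 = 1, C_2 = 2, B_1 = 0, B_2 = 4
\begin{align*}
d_1 &= C_1 \times n_2 = 100, \qquad d_2 = C_2 = 2 \\
g1_2 &= \frac{d_1}{d_2} = \frac{C_1 \times n_2}{C_2} = \frac{100}{2} = 50 \\
i_2 &< \frac{n_2 - B_2}{C_2} = \frac{100 - 4}{2} = 48 \\
g1_2 &= 50 \;\geq\; 48
\end{align*}
```

Under these assumptions the guess is looser than the source bound but never below it, which is the property the run-time check relies on.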
Every guess we make for an induction variable is higher than or equal to its actual bound, as shown above. By taking the minimum at every step we still have a guess that is at least the actual bound. However, we acknowledge that if the accesses had multiple induction variables or repeated ones, our guesses may be incorrect; hence we add run-time checks to make sure we run the serial version in such cases. Further, these kinds of accesses are very rare in actual code and hence our method is as powerful as the source methods on real code.
Intuition for step 3.2: Let us assume that there is a second access to A, A[C1 × i1 + B1 + E1] · · · [Cn × in + Bn + En], in this loop, where the Es are small numbers < 5. The memory address for this access from the binary will be of the form:
$$\left(\mathit{Base}_A + \sum_{j=1}^{n} (B_j + E_j) \prod_{x=j+1}^{n} n_x\right) + \sum_{j=1}^{n} C_j \times i_j \prod_{x=j+1}^{n} n_x \tag{11}$$
Recall from eq.(1) that this access, when recovered from the binary, will be of the form:
$$\mathit{MemAddr}(\mathit{Base2}, e) \tag{12}$$
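Equating eq.(11) and eq.(12), we get:
$$\mathit{Base2} = \mathit{Base}_A + \sum_{j=1}^{n} (B_j + E_j) \prod_{x=j+1}^{n} n_x \tag{13}$$
$$e_k = C_k \prod_{x=k+1}^{n} n_x \tag{14}$$
First, let us prove that both the references belong to the same DG using step 1, since we would apply step 3.2 to them only if both of them belong to the same DG. From step 1 we know that if the difference between the bases is < d1 × CDThres, then they will belong to the same DG. That is, if
$$\mathit{Base2} - \mathit{Base} < d_1 \times \mathit{CDThres} \tag{15}$$
then both the accesses will belong to this DG.
The difference between the bases from eq.(13) and eq.(5) is
$$\mathit{Base2} - \mathit{Base} = \sum_{j=1}^{n} E_j \prod_{x=j+1}^{n} n_x \tag{16}$$
We know from eq.(14) and eq.(6) that:
$$d_1 = e_1 = C_1 \prod_{x=2}^{n} n_x \tag{17}$$
Now, substituting eq.(16) and eq.(17) in eq.(15) we get: if
$$\sum_{j=1}^{n} E_j \prod_{x=j+1}^{n} n_x < C_1 \times \mathit{CDThres} \times \prod_{x=2}^{n} n_x$$
then both the accesses will belong to this DG (which will be true in most cases since C1 and the Es are small positive numbers < 5, n2 · · · nn are relatively large, and CDThres is 5 in our experiments).
Hence, both these accesses, represented by MemAddr(Base, d) and MemAddr(Base2, e) from the binary, will belong to the same DG.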
Next, let us see what the bounds for the induction variable from source would be in the presence of the second access as well. We know that accesses from source do not go out of bounds in correct programs. We have seen that the bound for each induction variable ik considering only the first access is $\frac{n_k - B_k}{C_k}$, as shown in eq.(8). Now, requiring that the second access does not go out of bounds either, we get:
$$C_k \times i_k + B_k + E_k < n_k$$
Rearranging the terms,
$$i_k < \frac{n_k - (B_k + E_k)}{C_k} \tag{21}$$
The difference between the bounds calculated from eq.(8) and eq.(21) is $\frac{E_k}{C_k}$. Now let us apply algorithm 3 to both these accesses.
We know that Basediff = $\sum_{j=1}^{n} E_j \times \prod_{x=j+1}^{n} n_x$. By dividing it by gcd(dk, ek) = dk (since dk = ek in our case) repeatedly in a loop, and keeping the remainder for the next iteration, we recover xks of the form $\frac{E_k}{C_k}$ as long as the Cks are factors of the Eks. By subtracting the xks from the guesses already made for the induction variables we get $g_k = \frac{C_{(k-1)} \times n_k}{C_k} - \frac{E_k}{C_k}$. Many of the Es will be zero, hence many bounds need no adjustment; however, we do adjust the bounds whose terms have small constant Es. Further, note that the term we subtract using algorithm 3 is equal to the difference of the bounds shown above.
By subtracting from the already guessed bounds, we make sure that the second access, which accesses a few extra elements in some dimensions, does not run into the higher dimension of the first access. Without this adjustment we would see extra dependencies from the binary, which would prevent parallelization; subtracting the extra from the bounds removes those spurious dependencies. Note also that the new guess for the bounds is still higher than or equal to the actual bounds of the loop.
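A small worked instance may help to see why the subtracted term equals the difference of the source bounds. The array shape [50][100] and the pair of accesses A[i1][2 × i2] and A[i1][2 × i2 + 4] are hypothetical values picked so that C2 divides E2.

```latex
% Assumed values: n_1 = 50, n_2 = 100, C_1 = 1, C_2 = 2, all B_k = 0, E_1 = 0, E_2 = 4
\begin{align*}
d_1 &= e_1 = 100, \qquad d_2 = e_2 = 2, \qquad \mathit{Basediff} = E_2 = 4 \\
x_1 &= \left\lfloor \tfrac{4}{100} \right\rfloor = 0 \quad (\text{remainder } 4), \qquad
x_2 = \tfrac{4}{2} = 2 = \tfrac{E_2}{C_2} \\
g_2 &= \frac{C_1 \times n_2}{C_2} - \frac{E_2}{C_2} = 50 - 2 = 48 \\
i_2 &< \frac{n_2 - (B_2 + E_2)}{C_2} = \frac{100 - 4}{2} = 48
\end{align*}
```

With these numbers the adjusted guess coincides with the source bound implied by the second access, and it is 2 lower than the bound of 50 implied by the first access alone.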
Step 4: Create the worklist
In this step we create a worklist with DGs that have accesses with remaining constraints so that we can apply step 5 on them to guess the upper bounds for the remaining induction variables. After step 3 we have upper bound constraints for all the induction variables in the memory addresses other than the ones that correspond to the highest dimension in the write accesses. We need a method to guess the upper bound on these induction variables as well. This method is step 5. Hence, we now create a worklist with all DGs in which there is an induction variable for which we do not have an upper bound guess as yet. These would be the highest dimension induction variables since we do not have guesses for those after step 3. This worklist will enable us to work on only those DGs that have remaining constraints.
Step 5: Work on Inter-group constraints
In this step we look at all DGs in the worklist created in step 4 (recall that these DGs have induction variables for which we have no guesses as yet) and solve for this DG not overlapping with the immediately following DG in the FullList. While creating DGs we assumed that each DG corresponds to a non-overlapping array region. Hence, by solving that one DG does not overlap with the immediately following DG we enforce the guess we made in step 1. Solving this generates further guesses on the remaining induction variables. These guesses are called inter-group constraints.
The formal method for ensuring that DGi from the worklist does not overlap with DG(i+1) (the immediately following DG in the FullList of DGs) is presented in algorithm 4; we describe it briefly here. For every DGi that has constraints remaining, we substitute the guesses for all induction variables other than the highest one into each of its memory expressions and require that the result be less than the lowest base in DG(i+1). Solving this constraint gives an upper bound for the highest induction variable. We then keep the minimum of this new guess and the minimum guess already recorded for that induction variable. This way we ensure that all our guesses are respected.
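The constraint described above can be solved directly for i1. The sketch below is an illustration under stated assumptions, not the paper's implementation: the names are hypothetical, each DG is a list of (base, d) pairs with d ordered from the highest dimension to the lowest, FullList is assumed to be sorted by address, and the guesses for i2 · · · in from step 3 are assumed to be finite.

```python
import math

def solve_inter_group(full_list, worklist, g):
    """Refine g[0], the guess for the highest-dimension induction variable.

    full_list: DGs in address order, each a list of (base, d) accesses.
    worklist:  indices into full_list of DGs that still lack a guess for i1.
    g:         guesses after step 3; g[0] is usually still infinity.
    """
    g = list(g)
    for i in worklist:
        if i + 1 >= len(full_list):
            continue  # assumption: the last DG has no following DG to constrain it
        next_base = min(base for base, _ in full_list[i + 1])
        for base, d in full_list[i]:
            if d[0] == 0:
                continue  # i1 does not appear in this access
            # Address reached when every lower induction variable sits at its guess
            inner_reach = sum(dk * gk for dk, gk in zip(d[1:], g[1:]) if dk != 0)
            # Solve base + d1 * i1 + inner_reach < next_base for i1
            g11 = (next_base - base - inner_reach) / d[0]
            g[0] = min(g[0], g11)
    return g

# Hypothetical layout: A is [50][100] at address 1000, B starts right after it
dg_a = [(1000, [100, 1])]
dg_b = [(6000, [100, 1])]
print(solve_inter_group([dg_a, dg_b], [0], [math.inf, 100]))
```

For this hypothetical layout the guess for i1 becomes 49, while the guess of 100 for i2 is left unchanged.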
Algorithm 4 Step 5: Algorithm for Inter-group constraints
Input: Worklist from step 4 and guesses g1, g2, · · · , gn from step 3.2
Output: Final guesses for bounds g1, g2, · · · , gn
for all DGi in worklist after step 4 do
  for all MemAddr(Base, d) in DGi do
    g11 = bound on i1 obtained by requiring MemAddr(Base, d), with ik = gk for k = 2 → n, to be less than the lowest base in DG(i+1)
    g1 = min(g1, g11)
  end for
end for
Intuition for step 5: Now that we have presented an algorithm for calculating the bound on the highest induction variable, let us apply it to an access from source code to show that the value our method guesses for the highest induction variable is greater than or equal to the actual bound on that induction variable.
In step 3 we assumed we were working with loop nests comprising the induction variables i1, i2, · · · , in and array accesses A[C1 × i1 + B1] · · · [Cn × in + Bn] in the loop, where the size of array A is [n1][n2] · · · [nn]. Let this access belong to DGi.
First, let us recall the guesses for all induction variables except the highest induction variable from step 3. One of the guesses we would have made for induction variable ik (where k ∈ [2,n]) is $\frac{C_{(k-1)} \times n_k}{C_k}$ (eq.(9)). Hence, the final guess after step 4 will be equal to or lower than this guess.
Next, let us assume that there is an access to array B in the same loop belonging to DG(i+1), i.e. the immediately following DG in the FullList. If this second array B is laid out immediately after A in the binary, then BaseB will be at least:
$$\mathit{Base}_A + \prod_{j=1}^{n} n_j \quad \text{(the product term is the size of A)} \tag{22}$$
Let us assume that all accesses corresponding to B belong to DG (i+1) . The lowest address of DG (i+1) will be BaseB.
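Next, we apply the method in algorithm 4 for solving DGi not overlapping with DG(i+1) to this access from source, to derive the guess for i1 and then verify that this guess is correct. To do so we substitute our guesses for all the induction variables except the highest dimension induction variable into the expression for the memory address of A, and this must be less than BaseB. The expression for the memory address of A obtained by substituting the intra-group guesses of eq.(9) in eq.(4) is:
$$\left(\mathit{Base}_A + \sum_{j=1}^{n} B_j \prod_{x=j+1}^{n} n_x\right) + C_1 \times i_1 \prod_{x=2}^{n} n_x + \sum_{k=2}^{n} C_{(k-1)} \prod_{x=k}^{n} n_x \tag{23}$$
The only unknown in eq.(23) is i1. This must be less than BaseB (from eq.(22)).
Hence,
$$i_1 < \frac{n_1 - B_1}{C_1} - 1 - \frac{\sum_{j=2}^{n} B_j \prod_{x=j+1}^{n} n_x + \sum_{k=3}^{n} C_{(k-1)} \prod_{x=k}^{n} n_x}{C_1 \prod_{x=2}^{n} n_x}$$
The remaining values are small since the constant Cs and Bs are small and the sizes of arrays in affine code are generally large.
Hence, the guess for i1 will be:
$$g_1 = \frac{n_1 - B_1}{C_1} - 1$$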
As seen before, from source we require that the array expression must not exceed the size of the array dimension. Hence the highest dimension array expression (C1 × i1 + B1) must not exceed the highest dimension (n1):
$$C_1 \times i_1 + B_1 < n_1$$
Rearranging the terms,
$$i_1 < \frac{n_1 - B_1}{C_1}$$
Hence, the maximum value i1 can take is $\frac{n_1 - B_1}{C_1} - 1$, and this is what we get by solving the equations from binary.
We have now seen that algorithm 4, which calculates the bound on the highest dimension induction variable, yields the true limit on that variable when the method is applied to an access from source code.
At the end of step 5, we have made best guesses for all induction variables in the loop that appear in a memory address. If there is an induction variable that does not appear in any memory access, then we simply assume that it can take any value, since we have no way of determining its bounds. This does not hurt our method and is reasonable: even from source, if an induction variable does not appear in any of the memory addresses present in the loop, it could take any value at run-time and this would be legal.
For array accesses that came from dynamically allocated memory we apply the algorithm described above. It is important to note that all ds in the MemAddr(Base, d) expression would be loop invariant symbols rather than constants. In many cases the memory expression we recover from binary code for these accesses will be of the form
(30) where all the xs and Base are loop invariant quantities. By applying the algorithm to such an access we guess that the bound on ik is x(n−1). We then check that the actual bounds are less than this loop invariant quantity (this check would succeed) before executing the parallel version of the loop.
Now that we have constraints on all the induction variables, we calculate the distance vectors and make parallelization decisions for this loop assuming these constraints as loop bounds. We then clone the loop and run the parallel version when the run-time checks for all induction variables succeed; else we run the serial version. Since we check at run-time that the loop bounds we have guessed are actually correct, we will always be conservatively correct. Note that using the distance vector method to parallelize is our implementation choice; one may use any parallelizing decision algorithm, including polyhedral methods.
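The control flow of the cloned loop can be illustrated with a short sketch. It only shows the guarded choice between the two versions; the function names are hypothetical, only the outermost loop is distributed, and a Python thread pool stands in for whatever threading runtime a binary rewriter would actually emit.

```python
from concurrent.futures import ThreadPoolExecutor

def guesses_hold(actual_bounds, guesses):
    """Run-time check: every actual loop bound must not exceed its guess."""
    return all(actual <= guess for actual, guess in zip(actual_bounds, guesses))

def run_versioned(body, actual_bounds, guesses, workers=8):
    """Run body(i) over the outermost loop, in parallel only if the check passes."""
    outer = range(actual_bounds[0])
    if guesses_hold(actual_bounds, guesses):
        # Parallel clone: iterations were judged independent under the guessed bounds
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(body, outer))  # list() waits for all iterations to finish
        return "parallel"
    # Serial clone: a guess was too low, so fall back to the original order
    for i in outer:
        body(i)
    return "serial"

# Hypothetical loop body writing one element per iteration
out = [0] * 4
def square(i):
    out[i] = i * i

print(run_versioned(square, [4], [10]), out)
```

Because the serial clone is always available as a fallback, a wrong guess in this scheme costs only the run-time check and never affects correctness.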
IMPLEMENTATION-SECONDWRITE
In this section we describe the binary rewriting infrastructure, SecondWrite [7] [11] [3], used for this research and how the automatic parallelizer interacts with the rest of the system. SecondWrite translates the input binary into LLVM IR, the intermediate representation of the LLVM compiler infrastructure [8], which was developed at the University of Illinois and is now maintained by Apple Inc. LLVM IR is language and machine independent. Thereafter the LLVM IR produced is optimized using LLVM's pre-existing optimizations, as well as our enhancements, including automatic parallelization. Our new algorithm is implemented within this static affine automatic parallelizer. Finally, the LLVM IR is code generated to output x86 code using LLVM's existing x86 code generator.
Currently SecondWrite rewrites x86 binaries. It successfully rewrites binaries coming from source code totaling over 2 million lines of code, including all of the SPEC2006 benchmarks. The Apache web server, a real-world application (230K+ LOC), is also successfully rewritten. Rewritten benchmark binaries run on average 10% faster than highly optimized input binaries, and 45% faster than unoptimized input binaries, owing to the existing optimizations in LLVM and not including parallelization.
SecondWrite is able to rewrite binaries without relocation information [13]. SecondWrite implements various mechanisms [11] to obtain an intermediate representation which contains features like procedure arguments, return values, types, high-level control flow, symbols and aggregate data structures. SecondWrite also employs extra mechanisms to safely handle indirect calls and indirect branches [11]. It employs alias analysis frameworks present in LLVM to discover all the possible target procedures at indirect call-sites, given by the points-to set of the operand in the indirect call instruction. An edge is added from the indirect call-site to all its possible target procedures. Indirect branches are mostly present due to jump tables in the binary. Procedure boundary determination techniques are devised to limit the possible branch targets to the current procedure, and extra control flow edges are added corresponding to the possible targets determined by alias analysis. If one of the targets is outside the procedure boundary, it is handled as an indirect call.
The algorithm presented in section 5 can be implemented in any static or dynamic binary rewriter as long as symbol recognition and induction variable analysis are implemented in the system.
RESULTS
We use "-O3" optimized binaries from gcc and gfortran as input to SecondWrite, which includes the new algorithm proposed in this paper within a static affine parallelizer. The affine automatic parallelizer from source works on LLVM IR. Hence, we use LLVM IR generated from clang [1] (a C language front-end for LLVM) for the C benchmarks and LLVM IR generated using the dragonegg [2] plugin (a plugin that integrates the LLVM optimizers and code generator with GCC) for the Fortran benchmarks as the input to our source parallelizer. We run all the binaries on the AMD Opteron(TM) processor 6212 and present results.
In this section we present our results on parallelizing binaries from SPEC2006 and OMP2001 using our new algorithm. First, we introduce our benchmarks. Second, we present the speedups we obtain from source and binary. For the binary numbers, we present speedups both with and without the new algorithm. Third, we present the actual number of affine loops that are parallelized from the binary with and without the algorithm. We measure speedups by measuring the clock time to run the programs on 1 thread and 8 threads.
First, table 1 lists the 8 affine benchmarks that we present our results on. Our source and binary parallelizers correctly parallelize every benchmark from both benchmark suites; however, they do not give any speedup on the remaining benchmarks since those benchmarks do not contain affine-rich regions. We have picked only the affine-rich benchmarks from the SPEC2006 and OMP2001 benchmark suites. We manually profiled every benchmark in both suites and, after examining the hot regions, classified benchmarks as affine or not affine. We present our results on all the affine benchmarks discovered in both suites. The benchmarks swim, mgrid and quake belong to the OMP2001 benchmark suite and bwaves, lbm, libquantum, milc and cactus belong to the SPEC2006 benchmark suite. These benchmarks range from 275 to 59,827 lines of code as shown in table 1.
Second, figure 3 presents the speedup for 8 threads from source and binary for each of the benchmarks w.r.t. the gcc "-O3" compiled single thread version of the benchmark. There are three bars for each benchmark: (i) the first bar is the speedup from source for 8 threads, (ii) the second bar is the speedup of the binary for 8 threads without the new algorithm, using only the theory presented in [7], and (iii) the third bar is the speedup of the binary for 8 threads using the new algorithm presented in this paper. We observe that swim, bwaves, mgrid, quake, milc and cactus gain significant speedups when the new algorithm is present in the static affine binary parallelizer. The significant affine loops in these benchmarks have run-time determined loop bounds, and hence with our new algorithm we are able to parallelize loops that were not parallelized using the theory developed before. The benchmarks lbm and libquantum show no difference in speedup with and without the algorithm. The reasons are: (i) in lbm, the loop bounds are statically known and hence the theory in [7] is sufficient to parallelize its affine loops, and (ii) in libquantum, the loops are single dimensional with a write to one single dimensional memory access. These loops can be parallelized without the new algorithm and hence we see a speedup in libquantum even without it. Overall, the geo-mean speedup for 8 threads for the 8 benchmarks from binaries increases from 1.33X to 2.96X with the addition of the new algorithm. Our binaries run slightly faster than source since SecondWrite is able to rewrite "-O3" binaries to run 10% faster than the input binaries.
Third, table 2 presents the number of loops that are parallelized from the binary with and without the new algorithm. We observe that in the benchmarks lbm and libquantum the number of loops parallelized does not change with the algorithm. The reasons for this have been explained earlier. In swim, quake, milc and cactus, a number of loops are parallelized even when the new algorithm is not present in the static affine binary parallelizer; however, these loops are small and do not contribute to the run-time of the benchmark. Hence, these loops do not result in a speedup from 8 threads for these benchmarks. We make this comparison to show that what matters is not the number of loops parallelized but whether the run-time intensive loops, which our new algorithm can parallelize, are among them.
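For readers who want to reproduce the summary statistic, the geo-mean of per-benchmark speedups can be computed as sketched below. The benchmark names and timings in the example are placeholders invented for illustration; they are not measurements from this paper.

```python
import math

def speedup(serial_seconds, parallel_seconds):
    """Speedup of a parallel run relative to the single-thread run."""
    return serial_seconds / parallel_seconds

def geo_mean(values):
    """Geometric mean, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Placeholder (1-thread, 8-thread) clock times in seconds, NOT measured data
timings = {"bench_a": (80.0, 20.0), "bench_b": (90.0, 90.0)}
speedups = [speedup(serial, parallel) for serial, parallel in timings.values()]
print(round(geo_mean(speedups), 2))
```

The geometric mean is the usual choice for averaging ratios such as speedups, since a single benchmark with a very large speedup influences it less than it would an arithmetic mean.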
