Abstract. Sensitivity Analysis (SA) is a novel compiler technique that complements, and integrates with, static automatic parallelization analysis in cases where program behavior is input sensitive. SA can extract all the input dependent, statically unavailable conditions for which loops can be dynamically parallelized. SA generates a sequence of sufficient conditions which, when evaluated dynamically in order of their complexity, can each validate the dynamic parallel execution of the corresponding loop. While SA's principles are fairly simple, implementing it in a real compiler and obtaining good experimental results on benchmark codes is a difficult task. In this paper we present some of the most important implementation issues that we had to overcome in order to achieve a fairly successful automatic parallelizer. We present techniques related to the validation of dependence removing transformations, e.g., privatization or pushback parallelization, and to the static and dynamic evaluation of complex conditions for loop parallelization. We concern ourselves with multi-version and parallel code generation, as well as the use of speculative parallelization when other, less costly options fail. We present a summary table of the contributions of our techniques to the successful parallelization of 22 industry benchmark codes. We also report speedups and parallel coverage for these codes on two multicore based systems and compare them to results obtained with the Intel ifort compiler.
Introduction

Automatic Parallelization - Current State of the Art
The recent introduction of multi-core based architectures to the mass market has brought the parallelization of the existing code base to the forefront. In fact, there seems to be a degree of urgency on the part of the major vendors to enable their users to exploit the coarser level parallelism offered by these new microprocessors with their existing software base. Parallelizing compilers are a key enabling technology in this domain because they offer the advantage of automation and thus high productivity.
Parallelizing compilers must focus, at least as a necessary first step, on discovering which loops can be executed in parallel (ideally as a doall). Data dependence analysis techniques as simple as the GCD test [16] and as sophisticated as the Omega test [8] have been employed to statically prove the independence of memory references within a loop. After some limited success it became clear that sparse, dynamic programs could not be automatically parallelized using these static techniques alone because their memory reference pattern is input dependent. The proposed solution was dynamic (run-time) analysis, with the advantage of high accuracy (most symbolic data is instantiated) but with the drawback of run-time overhead. The dynamic approach has taken two directions: (a) a continuation of the static compilation analysis at run-time, and (b) a memory reference trace based analysis approach. In the first approach, symbolic expressions that could not be evaluated statically are postponed for run-time evaluation, which then decides the (in)dependence of a loop. For example, if the static analysis cannot conclusively perform a standard data dependence test, e.g., a GCD test, because some of its parameters cannot be evaluated statically, we can always perform it at run-time when all information becomes available. In the second approach, more general and better suited for codes using indirection, the memory references are recorded and analyzed at run-time, either before a loop is executed (inspector-executor mode [15]) or after an optimistic (speculative) parallel execution [11]. The complexity of this method is proportional to the number of dynamic references and thus potentially expensive.
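As a hedged illustration of approach (a), consider a test the compiler cannot decide statically because k and n are input values (the loop and all names are hypothetical). The compiler emits the sufficient condition it failed to prove, |k| >= n, as a run-time guard and multi-versions the loop:

   ! Run-time evaluation of a postponed dependence test:
   ! writes A(1+k : n+k) and reads A(1 : n) are disjoint when |k| >= n.
   if (k >= n .or. k <= -n) then
!$omp parallel do
      do i = 1, n
         A(i+k) = A(i) + B(i)
      end do
   else
      do i = 1, n                 ! dependence possible: run sequentially
         A(i+k) = A(i) + B(i)
      end do
   end if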
Overall, the static and run-time approaches to automatic parallelization have progressed independently, without significant integration. Partial static analysis results, insufficient on their own, were not used effectively to simplify run-time analysis. An improvement over this state of the technology was presented in [12]. Instead of performing a reference-based test, the technique, named Hybrid Analysis, uses an aggregated reference representation and performs dynamic analysis using set and interval operations very similar to those performed statically by a compiler. This often results in a significant reduction of run-time overhead.
A step further in automatic parallelization has been the re-formulation of the loop independence analysis into sufficient conditions (predicates) for which a loop can be parallelized. These conditions represent the sensitivity of parallelization to some input (dynamic) conditions. For example, in [9] the authors showed some limited examples of how sufficient predicates could be extracted by simplifying Presburger formulas with uninterpreted function symbols. These predicates are returned to the programmer for evaluation (for interactive compilation). Further research [2, 5, 6, 12] showed how to extract simple scalar conditions from relatively simple array data dependence predicates for a limited number of cases.
We have used a similar approach and recently presented Sensitivity Analysis (SA) [13] as a general framework to analyze memory references and used it to extract parallel loops from sequential programs. SA seamlessly bridges static and dynamic analysis of memory references. When the compiler cannot draw definitive conclusions about interesting properties of a memory reference pattern, SA can generate a set of sufficient conditions which, when evaluated, can (in)validate these interesting properties. Examples of such interesting properties are (in)dependent memory references, privatizable references, reductions, etc.
Automatic Parallelization with Sensitivity Analysis
In [12, 13] we have shown how our compiler, using SA, is able to extract most of the available loop level parallelism from various benchmark codes using a mix of advanced static analysis and aggressive optimizations that are validated dynamically with minimal overhead. This has resulted in fairly good speedups. In [12, 13] we have also explained in some detail how the overall SA framework functions. However, obtaining good results requires applying and refining many general techniques which together contribute to the measured speedups.
For example, we mentioned that SA generates a set of sufficient conditions that can be evaluated dynamically to validate parallelization. However, the work (run-time overhead) involved in the dynamic evaluation of these predicates can vary greatly. Thus, ordering their evaluation from simple to complex (much like short-circuit evaluation of compound predicates) is crucial for obtaining good performance. In fact, based on performance models, we can stop evaluating predicates if the effort outweighs the benefit of parallelization.
Further examples are simple algorithm substitution transformations. Replacing a serial reduction with a parallel one, for instance, can enable the parallelization of large loops. Such transformations have to be proven correct though, and, in the case of complex or input sensitive memory reference patterns, this may not be possible statically. We use the same SA approach to generate dynamic conditions that validate parallelizing code transformations.
Contribution.
In this paper we present some of the most important aspects of how the general framework of Hybrid Analysis (presented elsewhere [12, 13]) has been used and implemented in our parallelizing compiler (a derivative of the UIUC Polaris compiler).
A Brief Introduction to Sensitivity Analysis
The Memory Reference Representation. There are three main concepts in our analysis. First, we introduce a powerful memory reference representation, the USR (uniform set representation), described in detail in [12] under the name RT_LMAD. In essence, it can represent the memory references of a program as an expression whose leaves are sets of LMADs (linear memory access descriptors) or enumerated sets of references, composed (at the internal nodes of the expression) through program operations (conditionals, loops, subroutine calls, etc.). A crucial advantage of this representation is that it is closed under composition: it can represent any memory reference pattern symbolically, at program level. When USRs cannot be evaluated at compile time to the exact sets of addresses they represent, they can be embedded in the generated code and computed at run time, in the presence of actual input values. However, in most cases we do not need to compute the actual memory reference pattern, but rather to prove a relation, which is generally easier.
Memory Reference Aggregation and Classification
The second concept in SA is memory reference aggregation, which ensures scalability of interprocedural analysis at the cost of losing dependence direction information. Memory references are aggregated bottom up on the Control Dependence Graph (CDG) within a subroutine, and on the call graph inter-procedurally.
The process starts at leaf CDG nodes, which are simple statements. The set of memory locations read and written by the statement is computed from the statement type and symbolic expressions. This set is parameterized by symbolic variables referenced by the statement.
The sets corresponding to successive statements are then combined using set union, intersection and difference. All these operations are performed on USRs [12]. Special nodes in the CDG require more elaborate set operations, all of which are well defined and closed on USRs: predication, union across the iteration space and symbolic translation.
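To make these operators concrete, consider the hedged illustration below (hypothetical fragment; notation simplified from [12]). For

      do i = 1, n
         if (c(i)) A(i) = A(i) + 1.0
      end do

the write set of the body is the leaf descriptor {A(i)}, gated by the predicate c(i); aggregation at the loop header then unions the gated set across the iteration space:

      W = ∪_{i=1:n} ( c(i) # {A(i)} )

Here # denotes predication (a gate node) and the union over i is an expansion node; if c(i) cannot be evaluated statically, W remains symbolic and can be computed at run time.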
These simple, node-local transformations on the CDG are applied repeatedly until the memory reference pattern has been completely summarized across the whole program.
Dependence Relations Based on Reference Summary Sets
While summarizing references, we also classify them into three disjoint sets [2]: Read Only (RO), Write First (WF) and Read Write (RW). They represent the specific data flow information needed for dependence analysis. The RO summary set records all memory locations only read (not written) within a section of code, the WF summary set records all memory locations that are written first and then possibly read and written, and the RW summary set records all other memory locations referenced from within a context. Computing the RO, WF and RW sets requires only the USR operations discussed in the previous section. An example is given in Fig. 1.
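For concreteness, here is a small hedged stand-in for such a classification (hypothetical fragment, distinct from Fig. 1):

      do i = 1, n
         t    = B(i)          ! B(i): read and never written        -> RO_i
         A(i) = t + 1.0       ! A(i): written before any other use  -> WF_i
         s    = s + A(i)      ! s:    read first, then written      -> RW_i
      end do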
Every time we reach a loop header in the aggregation process, we compute the cross-iteration data dependence relations. If there are no dependences, then all the loop iterations can be executed in parallel. This is the most effective automatic parallelization method, as it scales with the number of iterations; thus it is likely to remain efficient as the underlying hardware evolves towards a larger number of processing units. To express cross-iteration dependence relations, we compute the set of memory locations that are referenced in two different iterations and written in at least one. At this point in the analysis, we have already computed RO_i, WF_i and RW_i, the per-iteration reference sets.
One such dependence set is

      DS = ∪_{1≤i<j≤n} ( RW_i ∩ RW_j )

which collects the locations read and written in two different iterations.
Similar dependence sets are expressed for combinations of RO, RW and WF sets [12] . If we prove DS = ∅, then no cross-iteration dependences may exist.
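As a hedged sketch (the exact formulation appears in [12]), the full family of dependence sets pairs every reference in one iteration with every write in another:

      DS = ∪_{i≠j} ( (RO_i ∪ WF_i ∪ RW_i) ∩ (WF_j ∪ RW_j) )

Proving DS = ∅ therefore establishes that no memory location is referenced in two different iterations with at least one of the accesses being a write.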
Sensitivity of Dependence Relations to Parameters
Finally, the third concept used in SA is the transformation of the USRs representing the aggregated memory references into a Sensitivity Graph (SG), i.e., a boolean expression representing the parallelization conditions.
In many cases, proving the dependence set empty is trivial. It often results from a set intersection such as [1:10] ∩ [11:20], which evaluates to ∅ through symbolic calculus at compile time. In other cases, proving the dependence set empty is not possible at compile time, either because it depends on input data, e.g., DS = [1:n] ∩ [m:100], or because the relation is just too complicated for the compiler to evaluate.
We build SGs from dependence equations based on USRs by using a divide and conquer approach, which, at each step, breaks the dependence equation DS = ∅ into several simpler equations based on set identities [13] . We then extract a minimal (modulo the symbolic calculus capabilities of the compiler) run time check that guarantees that the loop is parallel. We then generate parallel code predicated by this condition. We use the SG [13] representation for these conditions. When they cannot be evaluated at compile time to a boolean value, they are embedded in the generated code and evaluated at run time, in the presence of actual values.
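As a hedged illustration of the generated multi-version code (hypothetical loop and names): a loop that writes A(1:n) and reads A(i+m-1), i.e., A(m:n+m-1), yields DS = [1:n] ∩ [m:n+m-1], from which divide and conquer extracts the O(1) sufficient condition n < m:

      subroutine update(A, C, n, m)
         real    :: A(*), C(*)
         integer :: n, m, i
         logical :: indep
         indep = n < m                   ! SG reduced to an O(1) predicate
         if (indep) then
!$omp parallel do
            do i = 1, n
               A(i) = C(i) + A(i+m-1)    ! reads disjoint from writes
            end do
         else
            do i = 1, n                  ! possible overlap: sequential version
               A(i) = C(i) + A(i+m-1)
            end do
         end if
      end subroutine update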
The aggregation and equation solving processes can deal with multidimensional strided reference patterns. In some cases, the divide and conquer process cannot extract a precise predicate from a dependence equation. In such cases, we approximate the sets with predicated multidimensional strided intervals and continue the analysis with affine sets, which are easier to compare. The predicates are added to the dependence condition. We can afford to formulate optimistic, speculative predicates, since they are verified at run time through SG evaluation.
Engineering an Automatic Parallelizer
In the previous section we have provided an overview of the general approach to parallelization: we aggregate and, at the same time, classify memory references (WF, RO, RW) at the program level into a set representation (USR) and then formulate the independence condition DS = ∅ (empty dependence set). Then, the compiler verifies the conditions for which this equation holds true by recursively descending on the equation DS = ∅ and, using boolean logic, generating a disjunction (OR) of simpler equations. Some of these equations can be proven true for all inputs, i.e., statically true, and others result in constraints under which the equation is true. From these constraints (conditions) the compiler generates predicates (code) that are evaluated at run-time and can validate the parallelization of a loop. The constraints are expressed as sets of expressions which can be represented as a graph, the sensitivity graph (SG). This method was presented as SA (sensitivity analysis) [13].
Parallelism enhancing transformations. Our overall goal is to uncover as much parallelism (doall type only) as possible and exploit it when beneficial. To this end we apply our SG based technique not only to prove that the original loops in a program are independent but also to validate code transformations that increase the intrinsic amount of parallelism. We will show how we use SA to perform powerful dependence removing transformations, e.g., reduction parallelization, pushback parallelization and array privatization. These are not new techniques, but the use of SA in their implementation makes them more powerful, i.e., more often successful.
Efficient Run-time Evaluation of Parallelization Conditions. After applying the dependence removing transformations, the compiler needs to generate efficient parallel code. The outcome of the static Sensitivity Analysis may be an SG (sensitivity graph) of varying complexity, which needs to be evaluated dynamically. It is important to perform this dynamic evaluation efficiently because it represents pure overhead. The novelty of our implementation lies in the way we generate efficient code for this dynamic validation.
We will present some of the more important aspects of this process, e.g., the generation of predicates that pre-validate parallel loop execution and the use of speculation with post-execution validation. Sometimes no condition that can be evaluated before the loop executes exists, because the condition depends on data computed by the loop (there may be a cycle between address and data computation); in such cases we must resort to speculative execution [11], which raises further efficiency issues such as checkpointing (if used), as discussed in the section on speculative execution. Here too we use our program representation and SA to improve performance.
It is worth mentioning that our entire analysis framework is interprocedural. For the evaluation of USRs at run-time we have developed a library to which we generate calls. Similarly, when we employ LRPD we use a specialized library.
Let us now take a closer look at some powerful techniques.
Transformations to Remove Dependences
Conditional and Selective Array Privatization. Array privatization can be complex and expensive. In general, it means allocating a private array in each thread of execution. This replication can become quite costly if the array is big. In the most general case we must first copy in values from the shared array and then, after processing, copy out the last value written. These two operations (copy-in and last-value copy-out) can be very expensive because they do not scale. Thus we optimize them by performing selective copy-in and last-value copy-out. In the case of relatively sparsely referenced arrays this can save significant time.
We can use USRs to express these in/out sets precisely in a general way and thus improve performance. Briefly, here is our approach:
By the time we reach a loop header, we have already classified all memory locations referenced within each iteration i into disjoint sets (USRs) RO_i, WF_i and RW_i. Using only set operations, we put together the following descriptors as per-iteration USRs.
In practice, we compute a single USR for the iteration space of each thread, or a single USR for the whole iteration space of the loop. (The formulas assume that we have already proved that, once these locations are privatized, the remaining cross-iteration dependence sets are empty.) The first descriptor contains the set of memory references that must be privatized because they are written in at least two iterations, i.e.,

      USR_to_privatize = ∪_{1≤i<j≤n} ( (WF_i ∪ RW_i) ∩ (WF_j ∪ RW_j) )

We choose to generate an OpenMP PRIVATE directive whenever this USR is not provably empty at compile time. This means we may allocate too much private storage, since sometimes not all the elements of the array must be privatized. However, the alternative, an indirection table for just those locations that must be privatized, introduces both complexity and overhead.
Although we privatize entire arrays, we perform selective and conditional copy-in. Only those locations that are read before being written inside the loop are used in a memory copy operation from the shared object to the private copies. They are copied only if it turns out, at run time, that their values are needed inside the loop, based on actual control flow predicates. We wrote a simple memcpy-like routine that uses a USR to control which locations get copied.
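The following hedged sketch shows the code shape we aim for; usr_copy stands in for the USR-driven memcpy-like routine (its actual name and signature are not given here), and the USR arguments are assumed to have been computed by the compiler:

!$omp parallel private(Apriv)
      ! selective, conditional copy-in: only locations read before
      ! being written, and only if run-time control flow needs them
      if (copy_in_needed) call usr_copy(Apriv, A, usr_read_first)
!$omp do
      do i = 1, n
         call body(Apriv, i)          ! loop body uses the private copy
      end do
!$omp end do
      ! selective last-value copy-out: each thread commits only the
      ! locations for which it performed the last write
      call usr_copy(A, Apriv, usr_last_write)
!$omp end parallel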
Conditional and Selective Array Reduction. Although implementations vary greatly, array reduction conceptually starts with the initialization of all the elements participating in the reduction to the null element of the reduction operator. The loop is then executed in parallel. Upon exit from the parallel section, elements updated by more than one thread are merged using the reduction operation. We use USRs to describe the extent of the initialization and merge phases, and wrote simple library routines that use USRs to control the exact locations that are initialized and merged, respectively.
      USR_to_initialize = USR_to_reduce
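A hedged sketch of the generated shape, with usr_init and usr_merge standing in for the simple library routines mentioned above (all other names hypothetical):

!$omp parallel private(Rpriv, i, k)
      call usr_init(Rpriv, usr_to_reduce, 0.0)   ! null element of '+'
!$omp do
      do i = 1, n
         k = idx(i)
         Rpriv(k) = Rpriv(k) + contrib(i)        ! thread-local partial sums
      end do
!$omp end do
!$omp critical
      call usr_merge(R, Rpriv, usr_to_reduce)    ! R(k) = R(k) + Rpriv(k)
!$omp end critical
!$omp end parallel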
Conditional Parallelization of Pushback Sequences. We have shown [14] how to recognize sequences of pushback operations that can be parallelized by using private storage, whose contents simply need to be copied at the end of the loop to a specific location of the shared array. We use the USRs WF_i to describe the extent of the writes to private storage, and a library function to perform the actual copies (the same one used for copy-in and copy-out). Not only do the elements get relocated efficiently, but this makes the transformation more general, since USRs can describe arbitrarily complex patterns. Previously, only pushback sequences made of contiguous locations could be parallelized.
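A hedged sketch of the transformed code (all names hypothetical; it assumes use omp_lib and a shared, zero-initialized array counts(0:nthreads-1)). With a static schedule each thread receives a contiguous chunk of iterations, so committing the private buffers in thread order preserves the sequential pushback order:

!$omp parallel private(buf, cnt, t, base, i)
      cnt = 0
!$omp do schedule(static)
      do i = 1, n
         if (match(i)) then
            cnt = cnt + 1
            buf(cnt) = val(i)          ! pushback into private storage
         end if
      end do
!$omp end do
      t = omp_get_thread_num()
      counts(t) = cnt
!$omp barrier
      base = p + sum(counts(0:t-1))    ! exclusive prefix sum of counts
      A(base+1:base+cnt) = buf(1:cnt)  ! relocate private buffer
!$omp end parallel
      p = p + sum(counts)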
Sensitivity Graph (SG) Evaluation
The outcome of the static Sensitivity Analysis may be either a definitive answer at compile time or the Sensitivity Graph (SG), a boolean expression that needs to be efficiently evaluated at run-time. It represents a disjunction (logical OR) of sufficient conditions, each of which can validate a loop as parallel (including the associated dependence removing transformations). The SG can be of various complexities. It can be:
(a) a boolean expression that can be evaluated in constant time;
(b) a boolean expression that can be evaluated in time proportional to some fraction of the size of the program data; for example, a triply nested loop with iteration spaces N, M, K can be parallelized by performing N (or N*M, or K*N) work. This situation arises often when aggregation works well in only some of the dimensions of the analyzed data structures;
(c) a boolean expression that can be evaluated in time proportional to the data size.
In this latter case (c), some of the transformations of the equations (DS = ∅) involving the globally aggregated USRs into simpler ones have failed. This can happen when the recursive simplification of the DS = ∅ equation is not very successful or, in an extreme example, when the code uses indirection arrays. In effect, we need to generate code to dynamically evaluate the USRs (which can be seen as a program slice). The compiler will generate code for the evaluation of these conditions and sort them in order of their estimated complexity (similar to short-circuit evaluation of a compound branch predicate). For illustration purposes, we have named the resulting code a cascade of sufficient conditions (Fig. 2). There are four types of run time operations involved in the evaluation of the SG: (1) evaluation of elementary conditional expressions (constant time), (2) interval trees (some fraction of data size, simple operations), (3) actual evaluation of USRs (fraction of data size, complex operations) and comparison to the empty set, and (4) reference-by-reference LRPD [11]. The estimated complexity of these tests ranges from O(1) tests such as the one in Fig. 3 to O(n) dynamic reference instrumentation as in Fig. 4. The evaluation of USRs at run time generally consists of fewer, but more complex operations than the reference-by-reference LRPD. In some cases it may either degenerate into inefficient enumerations or take conservative decisions that can lead to false negatives. The LRPD test has overhead proportional to the dynamic reference count, but is optimal for cases where aggregation and equation inversion are not possible (Fig. 4). It is always applicable, precise, and has a more predictable complexity. Perhaps the most important aspect of the "heavy" methods (USR evaluation or the LRPD test) is that they have to be performed in parallel so that the overall speedup scales with the number of processors.
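A hedged stand-in for such a cascade (cf. Fig. 2; the run-time library entry points are hypothetical), with the tests ordered by estimated cost:

      if (n < m) then                              ! (1) O(1) SG leaf
         call loop_parallel(A, n)
      else if (itree_disjoint(wtree, rtree)) then  ! (2) interval trees
         call loop_parallel(A, n)
      else if (usr_empty(ds_usr)) then             ! (3) evaluate DS as a USR
         call loop_parallel(A, n)
      else
         call loop_lrpd(A, n)       ! (4) speculative LRPD: always applicable
      end if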
There are two ways to validate parallel execution: before the loop executes (similar to an inspector) or after its execution. In the latter case we have to use speculative execution [11].
In most cases, we can adopt either method and (hopefully) select the more efficient one. The correct choice involves a more complex cost model which is beyond the scope of this discussion. Presently, we choose speculation over pre-verification only if (1) a parallel inspector cannot be extracted (see next section) or (2) we cannot extract a light inspector (a slice made of only scalar definitions). The actual test code generation consists of a syntax-based translation from the SG grammar to Fortran. In both cases, we reuse test results by means of inspector hoisting, SG and USR common subexpression recognition, and run-time test result memoization. We apply loop invariant hoisting to USRs and SGs by performing aggressive invariance analysis on their sets of input variables. Invariance problems on USRs resulting from subscripted subscripts are formulated as dependence problems on the subscript arrays, which are solved by the same SA algorithm applied to the subscript array. This is achieved by representing the exact referenced memory regions of the subscript array as USRs themselves, and thus identifying the exact subregion of the subscript array that affects the shape or size of the memory pattern on the host array. An interesting problem arises when a more expensive test such as LRPD can be hoisted out of a loop, but a simpler O(1) version is loop variant. At this time we (simplistically) hoist tests as far out as possible and build cascades from tests at the same loop nesting level.
Even when we cannot hoist tests out of their loops, by transforming reference-based tests into simple boolean operations we reduce the run-time overhead by a constant but significant factor. For instance, in MDG/INTERF do1000, switching from a reference-based test to SGs improved the speedup on 4 threads from 1.5 to 3.3 (scalability did not change). There are three main reasons for this improvement. First, the evaluation of simple SGs requires very little extra memory, in most cases a few scalar variables. The code we insert consists of accumulations of conditions such as indep = indep .AND. x > 0. Second, we insert the code at the earliest common postdominator of the SSA definition sites of the operands. Since in most cases there is just one operand other than the dependence decision accumulator, we insert code right after that operand is written, which gives excellent temporal locality. If the definition is an unconditional scalar assignment, the operand is likely to be in a register. Third, simple SGs perform only logic operations. In contrast, the LRPD and USR run-time libraries, even highly optimized, may use large amounts of extra memory and execute bookkeeping operations including loads/stores, branches and sub-word manipulation.
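A hedged sketch of the inserted code (hypothetical names), placed immediately after the operand's definition site as described above:

      indep = .true.
      do j = 2, m
         x = kc(j) - kc(j-1)              ! definition site of the operand
         indep = indep .and. (x > 0)      ! SG leaf: accumulate immediately
         w(j) = w(j) + x
      end do
      ! 'indep' later guards the parallel version of the dependent loop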
Speculative Execution
Sometimes we cannot extract a condition that can be evaluated before a loop is executed because it depends on the computed data (there is a cycle between address and data computation). In this case, we have to resort to speculative execution [11], which raises other efficiency issues such as checkpointing (if used). We previously identified the conditional pushback sequence pattern, which is perhaps the simplest such example. Other cases are more complex and do not follow a preset pattern. It should be noted that even when the dependence relation can be precomputed before the loop, it may be worth executing the loop in parallel speculatively in order to reduce the overhead. A more detailed discussion of these choices can be found in [7].
If speculative parallelization is necessary, we take advantage of our novel representation and SA techniques to reduce overheads. We can compute the exact extent (as a USR) of memory that must be either saved at a checkpoint before the speculative loop, or committed from private speculative storage after the loop. The actual memory operations are implemented as calls to our selective memory copy routine used for copy in, copy out and pushback parallelization.
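A hedged sketch of USR-bounded speculation; usr_copy is the selective copy routine mentioned above, and the other entry points are stand-ins:

      call usr_copy(ckpt, A, usr_may_write)     ! checkpoint only what may change
      call loop_speculative(A, n, success)      ! parallel run + validation [11]
      if (.not. success) then
         call usr_copy(A, ckpt, usr_may_write)  ! restore checkpointed state
         call loop_serial(A, n)                 ! safe re-execution
      end if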
The Value Evolution Graph and Pushback Sequences
The Value Evolution Graph (VEG) [14] can represent the data flow in recurrences used as array indices which have no closed form solutions. The graphs are pruned based on control dependence predicates and produce tighter value ranges than abstract interpretation methods. These value ranges and their relations (overlapping, mutually exclusive) are used throughout our analysis, when building USRs and when extracting SGs.
Additionally, VEGs can be used to detect monotonic reference patterns in the code text. Unlike previous pattern recognition methods, we can analyze partially aggregated and classified memory descriptors (USRs). This generic approach both extends and unifies in a single framework most cases that were previously solved using various, distinct pattern matching techniques. It allows for the parallelization of important classes of memory reference patterns, e.g., sequences of pushback operations with complex footprints.
Experimental Results
Our experiments show that our techniques extract almost all the available parallelism at the highest granularity possible, which results in significant speedups on 22 codes from the PERFECT and various SPEC benchmark suites.

Table 1 presents full application speedups, measured by dividing the sequential execution time of the whole application by its parallel execution time, including the run-time overhead, if any. Two main factors are behind these good speedups: high granularity and high coverage. The VEG, the USR and the SG are all interprocedural and flow sensitive (though they use approximations), which makes our analysis apply to large program slices, resulting in higher granularity. Our hybrid approach pushed coverage over 90%. It also increased granularity significantly, since many outer loops could be proved parallel only at run time. A detailed discussion of the speedup numbers can be found in [13].

Table 2 presents the effect of each technique towards our goal of achieving the highest parallelization coverage possible. It is important to note that our hybrid framework solves the parallelization problems uniformly at both compile time and run time, using SGs. The techniques presented in this paper contribute substantially to the coverage and granularity of parallelization. Comprehensive reports for a large set of loops are available at http://parasol.tamu.edu/compilers/ha.

An interesting case is loop MXMULT do10, which accounts for 73% of the sequential execution time of DYFESM. This loop contains an array MX which shows multiple patterns on different subsections. The first part of the array is only written to, while the last part is a reduction. The write section is fully independent, but this is not known until run time. The reduction section is only proven a proper reduction (not an independent update) at run time. Table 3 presents our run-time tests, their dynamic outcomes and their relative overhead for this loop. Tables 4 and 5 show the occurrence of each static and dynamic dependence test, privatization, reduction, pushback, and speculative parallelization in various benchmark programs.
Conclusions
In this paper we have presented some of the more important issues involved in the implementation of the novel Sensitivity Analysis framework in our Polaris derived automatic parallelizing compiler. We have shown that our powerful USR representation and our sensitivity analysis technique are useful not only for detecting independent loops but also for applying parallelism enhancing transformations (e.g., reduction and pushback parallelization, privatization). We have further shown that SA generates a flexible cascade of sufficient conditions applied in order of their estimated execution time complexity. This allows a flexible cost-benefit trade-off between the benefits of parallelization and the effort needed to obtain it. We have also presented the impact of our methods on 22 benchmark codes and reported speedups that compare quite well with existing commercial compilers. These good results are due to our ability to uncover and efficiently exploit large granularity parallelism.
