Abstract
Introduction
In recent years, the power efficiency of embedded multimedia applications (e.g. medical image processing, video compression) with simultaneous consideration of timing constraints has become a crucial issue. Many of these applications are data-dominated and use large amounts of data memory. Typically, such applications consist of deeply nested for-loops. Using the loops' index variables, addresses are calculated for data manipulation. The main algorithm is usually located in the innermost loop. Often, such an algorithm treats particular parts of its data specifically, e.g. an image border requires other manipulations than its center. This boundary checking is implemented using if-statements in the innermost loop (see e.g. figure 1, an MPEG 4 full search motion estimation kernel [5]).
This code fragment has several properties making it suboptimal w.r.t. runtime and energy consumption. First, the if-statements lead to a very irregular control flow. Any jump instruction in a machine program causes a control hazard for pipelined processors [11]. This means that the pipeline needs to be stalled for some instruction cycles, so as to prevent the execution of incorrectly prefetched instructions.
Figure 1. A typical Loop Nest (from MPEG 4)
Second, the pipeline is also influenced by data references, since it can also be stalled during data memory accesses. In loop nests, the index variables are accessed very frequently, resulting in pipeline stalls if they cannot be kept in processor registers. Since it has been shown that 50% to 75% of the power consumption in embedded multimedia systems is caused by memory accesses [12, 17], frequent transfers of index variables across memory hierarchies contribute negatively to the total energy balance. Finally, many instructions are required to evaluate the if-statements, also leading to higher runtimes and power consumption. For the MPEG 4 code above, all shown operations are in total as complex as the computations performed in the then- and else-blocks of the if-statements.
In this article, a new formalized method for the analysis of if-statements occurring in loop nests is presented. It solves a particular class of the NP-complete problem of the satisfiability of integer linear constraints. Considering the example shown in figure 1, our techniques are able to detect that
14.
Information of the first type is used to detect conditions not having any influence on the control flow of an application. This kind of redundant code (which is not typical dead code, since the results of these conditions are used within the if-statement) can be removed from the code, thus reducing code size and computational complexity of a program.
Using the second information, the entire loop nest can be rewritten so that the total number of executed if-statements is minimized (see figure 2). In order to achieve this, a new if-statement (the splitting-if) is inserted in the y loop testing the condition x >= 10 || y >= 14. The else-part of this new if-statement is an exact copy of the body of the original y loop shown in figure 1. Since all if-statements are fulfilled when the splitting-if is true, the then-part consists of the body of the y loop without any if-statements and associated else-blocks. To minimize executions of the splitting-if for values of y
Figure 2. Loop Nest after Splitting
As shown by this example, our technique is able to generate linear control flow in the hot-spots of an application. Furthermore, accesses to memory are reduced significantly since a large number of branching, arithmetic and logical instructions and index variable accesses are removed.
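The effect can be illustrated with a small, hypothetical loop nest. Python is used here only for brevity. The loop bounds, the inner loop `k` and the arithmetic in the branches are assumptions and are not taken from figure 1; only the splitting condition `x >= 10 || y >= 14` comes from the example above. The sketch also omits the further restructuring that reduces the number of splitting-if executions.

```python
def original_nest(X=20, Y=20, K=4):
    """Loop nest with a loop-variant if-statement in the innermost loop."""
    total, ifs = 0, 0
    for x in range(X):
        for y in range(Y):
            for k in range(K):
                ifs += 1                      # one if-statement per innermost iteration
                if x + k >= 10 or y >= 14:
                    total += x * y + k        # regular case
                else:
                    total -= 1                # border case
    return total, ifs


def split_nest(X=20, Y=20, K=4):
    """Same computation after inserting a splitting-if into the y loop."""
    total, ifs = 0, 0
    for x in range(X):
        for y in range(Y):
            ifs += 1                          # the splitting-if itself
            if x >= 10 or y >= 14:
                for k in range(K):            # then-part: no if-statements left
                    total += x * y + k
            else:
                for k in range(K):            # else-part: copy of the original body
                    ifs += 1
                    if x + k >= 10 or y >= 14:
                        total += x * y + k
                    else:
                        total -= 1
    return total, ifs
```

With these assumed bounds, both versions compute the same value, while the number of executed if-statements drops from 1600 to 960.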
Section 2 of this paper gives a survey of related work. Section 3 presents the analytical models and algorithms for loop nest splitting. Section 4 describes the benchmarking results, and section 5 summarizes and concludes this paper.
Related Work
Loop transformations have been described in the literature on compiler design for many years (see e.g. [2, 11]) and are often integrated into today's optimizing compilers. Classical loop splitting (or loop distribution / fission) creates several loops out of an original one and distributes the statements of the original loop body among all new loops. The main goal of this optimization is to enable the parallelization of a loop due to fewer data dependencies [2] and to possibly improve I-cache performance due to smaller loop bodies. In [7] it is shown that loop splitting leads to increased energy consumption of the processor and the memory system. Since the computational complexity of a loop is not reduced, this technique does not solve the problems that are due to the properties discussed in section 1.
Loop unswitching is applied to loops containing loop-invariant if-statements [11]. The loop is then replicated inside each branch of the if-statement, reducing the branching overhead and decreasing code sizes of the loops [2]. The goals of loop unswitching and the way the optimization is expressed are equivalent to the topics of section 1. However, the fact that the if-statements must not depend on index variables makes loop unswitching unsuitable for multimedia programs. It is the contribution of the techniques presented in this paper that we explicitly focus on loop-variant conditions. Since our analysis techniques go far beyond those required for loop splitting or unswitching and have to deal with entire loop nests and sets of index variables, we call our optimization technique loop nest splitting.
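For contrast, classical loop unswitching can be sketched as follows. The functions, the `scale` flag and the loop body are hypothetical and serve only to show that the condition must not depend on the loop index, which is exactly the restriction that loop nest splitting lifts.

```python
def with_invariant_if(data, scale):
    """Loop containing a loop-invariant if-statement."""
    out = []
    for v in data:
        if scale:              # does not depend on the loop index
            out.append(2 * v)
        else:
            out.append(v)
    return out


def unswitched(data, scale):
    """After unswitching: the test runs once, the loop is replicated per branch."""
    out = []
    if scale:
        for v in data:
            out.append(2 * v)
    else:
        for v in data:
            out.append(v)
    return out
```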
In [9], classical loop splitting is applied in conjunction with function call insertion at the source code level to improve the I-cache performance. After the application of loop splitting, a large reduction of I-cache misses is reported for one benchmark. All other parameters (instruction and data memory accesses, D-cache misses) are worse after the transformation. All results are generated with cache simulation software which is known to be imprecise, and the runtimes of the benchmark are not considered at all.
Source code transformations have been studied in the literature for many years. In [6], array and loop transformations for data transfer optimization are presented by means of a medical image processing algorithm [3]. The authors only focus on the illustration of the optimized data flow and thus neglect that the control flow becomes very irregular since many additional if-statements are inserted. This impaired control flow has not yet been targeted by the authors. As we will show in section 4, loop nest splitting applied as a postprocessing stage is able to remove the control flow overhead introduced by [6] with simultaneous further data transfer optimization.
Analysis and Optimization Algorithm
This section presents the techniques required for loop nest splitting, which consists of four sequential tasks. First, conditions are checked for satisfiability (3.1). Second, an optimized search space for each satisfiable condition is created (3.2). Third, all local search spaces are combined into a global search space (3.3), which finally has to be explored (3.4). Before going into details (cf. also [4] for broader descriptions), some preliminaries are required.
Precondition 2 is only due to the current state of implementation of our tools. By application of de Morgan's rule on an expression $!(C_1 \oplus C_2)$ with $\oplus \in \{\&\&, ||\}$ and inversion of the comparators in $C_1$ and $C_2$, the logical NOT can also be modeled in if-statements. Since all boolean functions can be expressed with &&, || and !, precondition 2 is not a limitation. Without loss of generality, a condition a==b can be rewritten as a>=b && b>=a (a!=b analogous) so that the required operator >= of precondition 3 is not a restriction, either.
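These rewriting rules can be checked exhaustively on a small integer range. The sketch below assumes integer operands, so that `!(a >= b)` can be expressed as `b >= a + 1`. The concrete rewriting chosen for `a != b` and the constants 3 and 5 are illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def ge(a, b):
    """The only comparator assumed to be available: a >= b."""
    return a >= b


def eq_via_ge(a, b):
    # a == b  rewritten as  a >= b && b >= a
    return ge(a, b) and ge(b, a)


def ne_via_ge(a, b):
    # a != b  rewritten as  a >= b + 1 || b >= a + 1  (valid for integers only)
    return ge(a, b + 1) or ge(b, a + 1)


def not_and_via_de_morgan(x, y):
    # !(x >= 3 && y >= 5)  ==  (x < 3 || y < 5)  ==  (2 >= x || 4 >= y)
    return ge(2, x) or ge(4, y)


values = range(-8, 9)
all_rules_hold = all(
    eq_via_ge(a, b) == (a == b)
    and ne_via_ge(a, b) == (a != b)
    and not_and_via_de_morgan(a, b) == (not (a >= 3 and b >= 5))
    for a, b in product(values, values)
)
```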
Condition Satisfiability
In the first phases of the optimization algorithm, all affine conditions $C_x$ are analyzed separately. Every condition defines a subset of the total iteration space of a loop nest $\Lambda$. This total iteration space is an $N$-dimensional space limited by all loop bounds $lb_l$ and $ub_l$. An affine condition $C_x$ can thus be modeled by a polytope, i.e. a polyhedron $P$ with $|P| < \infty$. Every condition $C_x$ can be represented by a polytope $P_x$ by generating inequalities for the affine condition $C_x$ itself and for all loop bounds. For this purpose, an improved variant of the Motzkin algorithm [10] is used and combined with some simplifications removing redundant constraints [15].
After that, we can determine in constant time if the number of equalities $Ax = a$ of $P_x$ is equal to the dimension of $P_x$ plus 1. If this is true, $P_x$ is overconstrained and defines the empty set as proven by Wilde [15]. If instead $P_x$ only contains the constraints for the loop bounds, $C_x$ is satisfied for all values of the index variables $i_l$. Such conditions that are always satisfied or unsatisfied are replaced by their respective truth value in the if-statement and are no longer considered during further analysis.
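The outcome of this classification can be illustrated with a brute-force stand-in. The sketch below does not use the polytope representation or the constant-time test described above; it simply enumerates a non-empty iteration space with constant bounds. The function name, the condition form `sum(c_l * i_l) >= const` and the example bounds are assumptions made for illustration.

```python
from itertools import product


def classify(coeffs, const, bounds):
    """Classify the affine condition sum(c_l * i_l) >= const over a loop nest.

    bounds is a list of inclusive (lb_l, ub_l) pairs describing a non-empty
    iteration space. Returns "always", "never" or "sometimes".
    """
    points = product(*(range(lb, ub + 1) for lb, ub in bounds))
    results = {sum(c * i for c, i in zip(coeffs, p)) >= const for p in points}
    if results == {True}:
        return "always"      # condition can be replaced by true
    if results == {False}:
        return "never"       # condition can be replaced by false
    return "sometimes"       # condition stays in the analysis


# Hypothetical two-dimensional iteration space for index variables (x, y).
bounds = [(0, 35), (0, 48)]
always = classify([1, 0], 0, bounds)      # x >= 0
never = classify([-1, 0], 1, bounds)      # -x >= 1, i.e. x < 0
sometimes = classify([1, 0], 10, bounds)  # x >= 10
```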
Condition Optimization
For conditions $C = \sum_{l=1}^{N} (c_l \cdot i_l) \ldots$
Global Search Space Construction
After the first GA (see section 3.2), a set of if-statements consisting of affine conditions $C_{i,j}$ and their associated optimized polytopes $P_{i,j}$ is given. For determining index variable values where all if-statements in a program are satisfied, a polytope $G$ modeling the global search space has to be created out of all $P_{i,j}$.
In a first step, a polytope $P_i$ is built for every if-statement $IF_i$. For this purpose, the conditions of $IF_i$ are traversed in their natural execution order $\pi$, which is defined by the associativity and precedence rules of the operators && and ||. $P_i$ is initialized with $P_{i,\pi(1)}$. While traversing the conditions of if-statement $i$, $P_i$ and $P_{i,\pi(j)}$ are connected either with the intersection or union operators for polytopes:

$$P_i = \begin{cases} P_i \cap P_{i,\pi(j)} & \text{if } C_{i,\pi(j-1)} \;\&\&\; C_{i,\pi(j)} \\ P_i \cup P_{i,\pi(j)} & \text{if } C_{i,\pi(j-1)} \;||\; C_{i,\pi(j)} \end{cases}$$

$P_i$ models those ranges of the index variables where one if-statement $i$ is satisfied. Since all if-statements need to be satisfied, the global search space $G$ is built by intersecting all $P_i$: $G = \bigcap_i P_i$. Since polyhedra are not closed under the union operator, the $P_i$ defined above are not true polytopes. Instead, we use finite unions of polyhedra for which the union operator is closed [15].
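A minimal sketch of this construction is given below. It uses plain Python sets of integer points in place of the finite unions of polyhedra used in our tools. All function names, the `(coeffs, const)` encoding of a condition, the loop bounds and the two example if-statements are assumptions. The fold is strictly left to right, so it matches the order $\pi$ only for if-statements whose operators do not need reordering by precedence.

```python
from itertools import product


def polytope(coeffs, const, bounds):
    """Integer points of the iteration space satisfying sum(c_l * i_l) >= const."""
    space = product(*(range(lb, ub + 1) for lb, ub in bounds))
    return {p for p in space if sum(c * i for c, i in zip(coeffs, p)) >= const}


def if_statement_space(conditions, operators, bounds):
    """Fold the condition polytopes of one if-statement from left to right.

    operators[j] is "&&" or "||" and connects conditions[j] with conditions[j + 1].
    """
    p_i = polytope(*conditions[0], bounds)
    for op, cond in zip(operators, conditions[1:]):
        p_ij = polytope(*cond, bounds)
        p_i = p_i & p_ij if op == "&&" else p_i | p_ij
    return p_i


def global_space(if_statements, bounds):
    """G is the intersection of the spaces P_i of all if-statements."""
    spaces = [if_statement_space(conds, ops, bounds) for conds, ops in if_statements]
    return set.intersection(*spaces)


# Hypothetical example over index variables (x, y):
#   IF_1: x >= 10 || y >= 14        IF_2: x >= 0 && y >= 5
bounds = [(0, 19), (0, 19)]
if_1 = ([([1, 0], 10), ([0, 1], 14)], ["||"])
if_2 = ([([1, 0], 0), ([0, 1], 5)], ["&&"])
G = global_space([if_1, if_2], bounds)
```

For these assumed inputs, `G` contains exactly the points with `(x >= 10 || y >= 14) && y >= 5`.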
Global Search Space Exploration
Since all P i are finite unions of polytopes, the global search space G also is a finite union of polytopes. Each polytope of G defines a region where all if-statements in a loop nest are satisfied. After the construction of G, appropriate regions of G have to be selected so that once again the total number of executed if-statements is minimized after loop nest splitting.
Since unions of polytopes (i.e. logical OR of constraints) cannot be modeled using ILP, a second GA is used here. For a given global search space $G = R_1 \cup \ldots$ (see figure 3).
Figure 3. Global If-Statement Counter
After the GA has terminated, the innermost loop $\lambda$ of the best individual defines where to insert the splitting-if. The regions $R_r$ selected by this individual serve for the generation of the conditions of the splitting-if and lead to the minimization of if-statement executions.
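The trade-off explored in this step can be sketched with a strongly simplified model. It is an assumption-laden stand-in: exhaustive search over the loop level replaces the GA, the counter is not the one of figure 3, and the splitting-if is taken to be true exactly when all inner iterations lie in $G$ rather than being generated from selected regions. The function names, bounds, the membership test `in_G` and the number of if-statements are hypothetical.

```python
from itertools import product


def if_count(bounds, in_G, n_ifs, level):
    """Executed if-statements when the splitting-if is placed in loop `level`.

    level = 0 places it outside all loops, level = len(bounds) in the innermost
    loop. n_ifs is the number of original if-statements in the innermost body.
    """
    outer = [range(lb, ub + 1) for lb, ub in bounds[:level]]
    inner = [range(lb, ub + 1) for lb, ub in bounds[level:]]
    inner_points = list(product(*inner))
    count = 0
    for prefix in product(*outer):
        count += 1  # the splitting-if itself
        # The then-part may only be taken if every inner iteration lies in G.
        if not all(in_G(prefix + rest) for rest in inner_points):
            count += n_ifs * len(inner_points)  # else-part: original body
    return count


def best_level(bounds, in_G, n_ifs):
    """Loop level minimizing the number of executed if-statements."""
    levels = range(len(bounds) + 1)
    return min(levels, key=lambda lv: if_count(bounds, in_G, n_ifs, lv))


# Hypothetical nest over (x, y, k) with two if-statements in the innermost loop.
bounds = [(0, 19), (0, 19), (0, 3)]
in_G = lambda p: p[0] + p[2] >= 10 or p[1] >= 14
level = best_level(bounds, in_G, n_ifs=2)
count = if_count(bounds, in_G, 2, level)
```

Under these assumptions, placing the splitting-if in the second loop is best (1520 executed if-statements, compared with 3200 without splitting). Placing it further out saves splitting-if executions but makes the then-part reachable less often.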
Benchmarking Results
The techniques presented in section 3 are fully implemented using the SUIF [16], Polylib [15] and PGAPack [8] libraries. Both GAs use the default parameters provided by [8] (population size 100, replacement fraction 50%, 1,000 iterations). Our tool was applied to three multimedia programs. First, a medical tomography image processor (CAVITY [3]) having passed the so-called DTSE transformations [6] is used. We apply loop nest splitting to this transformed application to show that we are able to remove the overhead introduced by DTSE. The second benchmark is an MPEG 4 full search motion estimation (ME [5], see section 1), and the QSDPCM algorithm [14] for scene adaptive coding serves as third test driver.
Since all polyhedral operations used [15] have exponential worst case complexity, loop nest splitting as a whole also has exponential complexity. Nevertheless, the effective runtimes of our tool are very low: between 0.41 CPU seconds (QSDPCM) and 1.58 CPU seconds (CAVITY) are required for optimization on an AMD Athlon running at 1.3 GHz. For obtaining the results presented in the following, the benchmarks are compiled and executed before and after loop nest splitting. Compilers are always invoked with all optimizations enabled so that highly optimized code is generated.
Pipeline and Cache Behavior
Figure 4 shows the effects of loop nest splitting on the caches and pipelines of an Intel Pentium III, Sun UltraSPARC III and a MIPS R10000 processor. To obtain these results, the benchmarks were compiled and executed on the processors while monitoring performance-measuring counters available in the CPU hardware. This way, reliable values can be generated without using erroneous cache simulation software. The figure shows the performance values for the optimized benchmarks as a percentage of the unoptimized versions denoted as 100%.
As can be seen from the columns Branch Taken and Pipe Stall, we are able to generate a more regular control flow for all benchmarks. The number of taken branch instructions is reduced by between 8.1% (CAVITY Pentium) and 88.3% (ME Sun), consequently leading to similar reductions of pipeline stalls (10.4% to 73.1%). For the MIPS, reductions of executed branch instructions between 66.3% (QSDPCM) and 91.8% (CAVITY) were observed. The very high gains for the Sun CPU are due to its complex pipeline consisting of 14 stages, which is very sensitive to stalls.
The hardware counters also clearly show that the behavior of the L1 I-cache is improved significantly. The number of I-fetches is reduced by between 26.7% (QSDPCM Pentium) and 82.7% (ME Sun), and large improvements of I-cache misses are reported for the Pentium and MIPS (14.7% to 68.5%). For the Sun, this parameter remains almost unchanged. Due to the removal of index variable accesses, the L1 D-caches also benefit in several cases. Fetches from the D-cache are reduced by between 1.7% (ME Sun) and 85.4% (ME Pentium); only for the QSDPCM benchmark do data fetches increase by up to 3.9% due to the insertion of spill code. D-cache misses drop by between 2.9% (ME Sun) and 51.4% (CAVITY Sun). The very large register file of the Sun UltraSPARC III (160 integer registers) is the reason why the L1 D-cache behavior improves only slightly for ME and QSDPCM. Since these benchmarks only use very few local variables, they can be stored entirely in registers even before loop nest splitting.
Furthermore, the columns L2 Fetch and L2 Miss show that the unified L2 caches also benefit significantly, since reductions of accesses (0.2% -53.8%) and misses (1.1% -86.9%) are reported in most cases.
Execution Times
All in all, the factors mentioned above lead to speed-ups between 17.5% (CAVITY Pentium) and 75.8% (ME Sun) for the processors considered in section 4.1 (see figure 5a). To demonstrate that these improvements not only occur on these CPUs, additional runtime measurements were performed for an HP-9000, PowerPC G3, DEC Alpha, TriMedia TM-1000, TI C6x and an ARM7TDMI, the latter in both 16-bit Thumb and 32-bit ARM mode. Figure 5a shows that all benchmarks benefit from loop nest splitting. The runtimes of CAVITY are improved by between 7.7% (TI C6x) and 35.7% (HP). On average over all processors, a speed-up of 23.6% was measured. The fact that loop nest splitting is able to generate a very regular control flow in the innermost loop of the ME benchmark leads to very high gains in this case. The benchmark is accelerated by 62.1% on average. The minimum speed-up amounts to 36.5% (TriMedia), whereas the Sun CPU honors the optimization with an acceleration of 75.8%. For QSDPCM, the improvements range from 3% (PowerPC) up to 63.4% (MIPS), leading to an average speed-up of 29.3%.
The variations among different CPUs depend on several factors. As already stated in section 4.1, the complexities of register files and pipelines are important parameters. Additionally, runtimes are influenced by different compiler optimizations and register allocation algorithms. Due to lack of space, a detailed study cannot be given here.
Code Sizes and Energy Consumption
Since code is replicated, loop nest splitting entails an increase in code size (see figure 5b). On average, the CAVITY benchmark's code size increases by 60.9%, with minimum and maximum increases of 34.7% (MIPS) and 82.8% (DEC). Although the ME benchmark is accelerated most, its code enlarges least. Increases between 9.2% (MIPS) and 51.4% (HP) lead to an average growth of only 28%. Finally, the code of QSDPCM enlarges by between 8.7% (MIPS) and 101.6% (C6x), leading to an average increase of 61.6%.
These increases by a few hundred instructions are not a serious drawback, since the added energy required for storing these instructions is compensated by the savings achieved by loop nest splitting. Figure 5c shows the effects of loop nest splitting on memory accesses and energy consumption, obtained with an energy model [13] for the ARM7 core that considers bit-toggles and off-chip memories and has an accuracy of 1.7%.
The column Instr Read shows that the number of instruction memory accesses is reduced by between 23.5% (CAVITY) and 56.9% (ME). Furthermore, our control flow optimization also leads to a significant reduction of data memory accesses. Data reads are reduced by up to 65.3% (ME). For QSDPCM, the removal of spill code reduces data writes by 95.4%. In contrast, the compiler inserts spill code for CAVITY so that an increase of 24.5% was observed. The total amount of all memory accesses (Mem Acc) is reduced by between 20.8% (CAVITY) and 57.2% (ME).
Our optimization leads to large energy savings both of the CPU and its memory. The energy consumed by the ARM core is reduced by between 18.4% (CAVITY) and 57.4% (ME); the memory consumes between 19.6% and 57.7% less energy. Total energy savings of 19.6% to 57.7% are measured.
Nevertheless, if code size increases (up to a rough theoretical bound of 100%) are critical, it is easy to change our algorithms so that the splitting-if is not placed in the outermost possible loop. This way, code duplication is reduced at the expense of lower speed-ups, so that trade-offs between code sizes and savings in runtimes can be realized.
Conclusions
We present a novel source code optimization called loop nest splitting which removes redundancies in the control flow of embedded multimedia applications. Using polytope models, conditions having no effect on the control flow are removed. Genetic algorithms identify ranges of the iteration space where all if-statements are provably satisfied. The source code of an application is rewritten in such a way that the total number of executed if-statements is minimized.
A detailed study of 3 benchmarks shows that the branching and pipeline behavior is improved significantly. Furthermore, caches also benefit from our optimization since I- and D-cache misses are reduced heavily (up to 68.5%). Since accesses to instruction and data memory are reduced to a large extent, loop nest splitting consequently leads to large power savings (19.6% to 57.7%). An extended benchmarking using 10 different CPUs shows that we are able to speed up the benchmarks by 23.6% to 62.1% on average.
The selection of the benchmarks used in this paper demonstrates that our optimization is a very general and powerful technique. It is not only able to improve the code of typical real-life applications, but in addition, it can be used to eliminate the negative effects of other source code transformation frameworks introducing a very large control flow overhead into an application. In the future, we will generalize our analytical models so that more classes of loop nests can be treated. In particular, extensions to loops not having constant bounds will be developed.
