Control divergence poses many problems in parallelizing loops. While predicated execution is commonly used to convert control dependence into data dependence, it often incurs high overhead because it allocates resources equally for both branches of a conditional statement regardless of their execution frequencies. For those loops with unbalanced conditionals, we propose a software transformation that divides a loop into two or three smaller loops so that the condition is evaluated only in the first loop, while the less frequent branch is executed in the second loop in a way that is much more efficient than in the original loop. To reduce the overhead of extra data transfer caused by the loop fission, we also present a hardware extension for a class of Coarse-Grained Reconfigurable Architectures (CGRAs). Our experiments using MiBench and computer vision benchmarks on a CGRA demonstrate that our techniques can improve the performance of loops over predicated execution by up to 65% (37.5%, on average), when the hardware extension is enabled. Without any hardware modification, our software-only version can improve performance by up to 64% (33%, on average), while simultaneously reducing the energy consumption of the entire CGRA including configuration and data memory by 22%, on average.
INTRODUCTION
62:2 Y. Jeong et al.
The need for higher energy efficiency across all domains of computing, from embedded systems to supercomputers, is the main driver behind the increasing popularity of accelerators, which include application-specific hardware as well as more programmable ones such as reconfigurable architectures and GP-GPUs. For those architectures, one important source of energy efficiency is Loop-Level Parallelism (LLP), and pipelining is the preferred way of exploiting LLP on reconfigurable architectures, which have abundant, possibly heterogeneous, processing elements. However, except for a few application domains such as Digital Signal Processing (DSP), even loops can be quite complicated, and thus pipelining a loop optimally for a Coarse-Grained Reconfigurable Architecture (CGRA) that supports dynamic reconfiguration is an area of active research [Hamzeh et al. 2012; Han et al. 2010; Chen and Mitra 2012; Kim et al. 2012b; Park et al. 2008].
One important problem in this context is how to handle conditional statements inside loops. Typically, conditional statements are handled using predicated execution [Mahlke et al. 1995; Paar et al. 2002] . However, predicated execution on a dynamically reconfigurable CGRA is wasteful because it allocates resources for both branches of a conditional, even if, quite certainly, only one of them is to be executed. To address the problem of wasted resources, a technique called Dual-Issue Single-Execution (DISE) [Han et al. 2010 ] was proposed for a CGRA. The key idea is to issue operations from both paths simultaneously, but at runtime, execute only one of them depending on the predicate value. This can save cycles and energy, but unfortunately becomes less effective when the sizes of the branches are unbalanced, which is often the case in applications.
We have examined conditionals¹ inside loops from applications in MiBench [Guthaus et al. 2001] and a computer vision benchmark suite [Venkata et al. 2009], in terms of both the relative size and execution frequency of the then- and else-blocks. Our findings are summarized in Figure 1, where the x-axis represents the size balance and the y-axis the execution frequency balance. The center of the graph thus represents the perfect balance whereas the right (left) side is where then-blocks (else-blocks) are larger, and the top (bottom) side is where then-blocks (else-blocks) are more frequently executed.
This graph² reveals several useful insights. First, the majority (78%, or 25 out of 32) of the conditionals inside loops have then-blocks only (no else-blocks). Even the rest of them are mostly unbalanced, leaving only two loops near the 50% line in terms of size. This suggests that most loops, at least in the applications we have examined, have quite unbalanced conditionals, rendering techniques like DISE unattractive. Second, from the vertical distribution, we see that most conditionals are very polarized, strongly favoring one branch over the other. This has an important implication: while predicated execution is very resource-efficient for the if-then statements in region A, because the then-blocks there are almost always executed, it is very wasteful for those in region B, presenting an opportunity for improvement over predicated execution. Third, interestingly enough, many more loops are found in region B than in region A, pointing to the need for addressing the problem. Thus, our goal in this article is to develop a technique that improves the efficiency of loops belonging to region B and, by the symmetry of the axes, region D. Our technique is complementary to predicated execution, which is already efficient for loops in regions A and C.
In this article, we propose a compiler technique that splits a loop with one or more conditionals into multiple smaller loops to achieve more efficient execution on CGRAs. In the simplest case,³ we divide a loop into two loops, where the first one only evaluates the condition while the second one executes the then-block for those iterations where the condition is true. It is very similar to loop fission, but we compress the second loop for more efficient execution. We call this evaluator-executor transformation, and its performance and energy improvement comes from the fact that it can bypass unnecessary iterations completely, not wasting any energy or resources on them. There are some challenges, however. First, while splitting a loop is rather straightforward for those loops with only one if-then statement, it is not obvious how to handle if-then-else or other more complex conditionals, including nested conditionals and/or a series of conditionals. Second, splitting a loop in this way can incur some overhead, due to the need to pass some variables from one loop to another, and this overhead needs to be minimized. Lastly, loops with inter-iteration data dependence (recurrent loops) may not be easily fissioned. We present two techniques, one a pure software technique and the other based on a hardware extension, to address those challenges.
Our experiments using loops with conditionals from MiBench and computer vision benchmarks demonstrate that our hardware technique can improve the performance of loops with conditionals over predicated execution by up to 65% (37.5%, on average), across applications. Our software-only version, which does not require any hardware modification, can improve the performance over predicated execution by up to 64% (33%, on average). While our software transformation generates loops that use more memory operations than the original, thereby increasing the dynamic energy in the memory, the increase is very modest, and in many cases much less than what we save on the static energy through runtime reduction. Overall, our software technique can reduce the energy consumption of the entire CGRA including configuration and data memory by 22%, on average.
² It is worth mentioning that those 32 loops (i.e., the potential target of our technique) account for a very significant portion of the application runtime. Our detailed simulation using the SimpleScalar simulator [Austin et al. 2002] and DRAMsim [Wang et al. 2005], modeling a typical embedded system with a dual-issue, in-order processor running at 720MHz, two levels of cache hierarchy, and a DDR-333 SDRAM, indicates that the target kernels located in the high opportunity regions account for 41%, on average, of the total application runtime (arithmetic mean), while they are only one-third of all the important loops (see Table I). We believe that a disproportionately large portion of the application runtime is spent in our target loops because larger loops tend to be more complicated and therefore more likely to contain conditional statements.
³ More generally, our technique can handle loops with multiple if-then-else conditionals inside.
TARGET ARCHITECTURE
Our target architecture is a Coarse-Grained Reconfigurable Architecture (CGRA) that is dynamically reconfigurable. A CGRA [Hartenstein 2001 ] is a type of reconfigurable architecture where each Processing Element (PE) operates at the granularity of words rather than bits as in FPGAs. This has both advantages and disadvantages. The advantages include fast runtime reconfiguration, high efficiency for operations directly supported by the PE such as arithmetic operations, and lower programming barrier for software engineers with no hardware design background. The disadvantages include inefficiency in bit-level manipulation or in general any operation that is not directly supported by the "instruction set" of a PE. Also, the lack of flexibility in terms of trading off the cycle time for routability is another difference from FPGAs.
Our target CGRA mainly consists of a PE array, which is a 4 × 4 array of heterogeneous PEs, and a multi-bank local memory (see Figure 2 ). Arithmetic and logic operations can be performed by any PE, but expensive operations such as multiplication can be performed by some PEs only. There are also specific PEs that can perform memory operations, called load-store PEs. The number of load-store PEs is matched with the number of banks in the local memory. The local memory is a scratch pad memory, which provides guaranteed access time with no memory stall. Thus, it is the compiler's responsibility to ensure that the data accessed by load-store PEs are present in the CGRA's local memory. To provide flexibility in accessing different banks from any load-store PE, there is a crossbar switch between the load-store PEs and the memory ports.
The PE array may employ any interconnect topology among the PEs such as simple mesh or mesh plus diagonal. The interconnect architecture connects the output port of one PE to its neighboring PEs' input ports. Each PE's output is registered, and the PEs all operate in lock-step. Hence, pipelined execution is a natural way to exploit the parallelism of a CGRA.
The instruction of a PE is called configuration, which can be quickly changed to another thanks to the distributed configuration cache. While this fast reconfiguration mechanism greatly extends the capability of a limited-sized CGRA for large tasks, general branch operations are still not supported on most CGRA architectures, posing a challenge for application mapping.
RELATED WORK
There are various ways of mapping loops onto CGRAs [Cardoso and Diniz 2009; Lee et al. 2003], but loop pipelining, or software pipelining-based mappings [Mei et al. 2003b; Park et al. 2008], can effectively utilize heterogeneous PEs, yielding high-throughput schedules, and can also handle recurrent loops. On the other hand, for parallel loops on homogeneous PEs, Single-Instruction Multiple-Data (SIMD) mappings [Kim et al. 2012a] can lead to higher utilization because, unlike in software pipelining, no PE needs to be wasted for routing.
Control divergence poses many challenges when mapping loops for CGRAs. Software pipelining-based mappings typically use predicated execution [Rau 1994 ], which converts control flow into dataflow, by issuing instructions from both branches of a conditional (thus, no control dependence) but executing them only if the predicate is true (thus data dependence). Predicated execution requires hardware support such as (i) adding predicated versions to some (partial predication) or all (full predication) instructions; (ii) a predicate register file, which may be shared among all PEs; and (iii) one or more predicate-defining instructions that can write to the predicate register file. Our work assumes that the CGRA supports predicated execution for conditional execution.
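The conversion of control dependence into data dependence can be illustrated with a minimal Python sketch. The functions `branchy` and `predicated` and the arithmetic in the two branches are hypothetical, and the final select mimics partial predication rather than any specific CGRA instruction set.

```python
def branchy(xs, threshold):
    """Loop body with a real branch: only one side runs per iteration."""
    out = []
    for x in xs:
        if x > threshold:
            y = x * 2 + 1
        else:
            y = x - 1
        out.append(y)
    return out


def predicated(xs, threshold):
    """If-converted loop body: both sides are issued every iteration."""
    out = []
    for x in xs:
        p = x > threshold                 # predicate-defining operation
        then_val = x * 2 + 1              # issued regardless of p
        else_val = x - 1                  # issued regardless of p
        y = then_val if p else else_val   # predicate selects the committed value
        out.append(y)
    return out


result = predicated([1, 5, 10], 4)
```

Both versions produce the same values, but the predicated one occupies resources for both branches in every iteration, which is the cost the rest of the article targets.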
Recent work for CGRAs tries to improve the efficiency of control flow in loops. Dual-Issue Single-Execution (DISE) [Han et al. 2010] issues two instructions simultaneously but executes only one of them with the "true" predicate, which can increase performance, especially when the conditional has balanced branches. State-based Full Predication (SFP) [Han et al. 2013], on the other hand, can achieve energy saving by putting a PE into a low power mode if the PE is to receive operations that will be disabled due to predication. Although SFP has been applied to mappings that are not pipelined, SFP for pipelined mappings is not known.
Bundled Execution of REcurring Traces (BERET) [Gupta et al. 2011], which is a hardware accelerator tightly coupled to the CPU pipeline, also exploits the existence of a common scenario in a control flow. However, instead of fissioning loops, BERET restarts an iteration on the main processor whenever control diverges from the common scenario, which is monitored by the BERET hardware. A CGRA, in contrast, incurs a large overhead when switching to the main processor, so applying the BERET approach to CGRAs would be very inefficient. Also, whereas BERET does not software-pipeline loops, which makes it easy to restart an iteration, our proposed technique applies software pipelining to each loop after the evaluator-executor transformation.
GPUs also suffer from control divergence in loops, which can be addressed by dynamically forming SIMD-instructions from large collections of threads [Fung and Aamodt 2011; Rhu and Erez 2012] . Simultaneous Branch and Warp Interweaving [Brunie et al. 2012 ] executes operations from divergent control paths or even from different scheduling groups (warps) simultaneously as in the same SIMD instruction. For unstructured control flows, finding out when divergent threads reconverge can be important and challenging, which is addressed by Diamos et al. [2011] .
In high-level synthesis, previous work handles conditional statements in pipelined loops using techniques like path-based scheduling algorithms [Camposano 1991 ], speculation techniques [Gupta et al. 2001] , or static branch prediction [Holtmann and Ernst 1995] .
Our work is based on loop fission [Aho et al. 2006], a well-known loop transformation technique. Our technique specializes loop fission for handling control divergence and extends it by compressing the second loop, which is the main source of performance and energy improvement. A similar loop transformation called the inspector-executor strategy was proposed [Saltz et al. 1997] to handle data-dependent memory access patterns. It fissions a loop into two, with the first calculating the addresses of the data to be fetched/stored while the second uses that information to do the actual computation.
EVALUATOR-EXECUTOR TRANSFORMATION

Basic Idea
The basic idea of our evaluator-executor transformation is to divide a loop into smaller ones, such as one that includes operations before and up to the condition evaluation, and one that includes operations after the condition evaluation. The evaluator loop also stores the iterator values for the iterations where the condition holds true, and the iterator values are later retrieved in the executor loop. Improvement of performance and/or energy is the main motivation, as the executor loop often has a much lower trip count than the original loop. We also extend this basic idea to loops that have if-then-else conditionals and those with more than one conditional.
4.1.1. Illustrative Example. Figure 3 illustrates a simple loop with an if-then statement. The control flow graph of the loop body has only two basic blocks. The condition expression is evaluated in the first block, and only when the condition is true, the second basic block is executed.
In the conventional mapping based on predicated execution only, the control dependence is converted into data dependence, creating data dependence edges from node 5 to nodes 6 through 9 in Figure 3(b). These additional edges may seem to complicate routing for mapping, but the hardware architecture may provide special support for routing of predicate values, which are only one-bit signals. The main weakness of predicated execution, however, is the waste of resources allocated to the predicated block (the second block in our example), which really needs to be executed only when the condition is true. Since the condition value changes from iteration to iteration and the loop schedule must be generated statically, resources must be allocated for the second block regardless of the condition value.
Instead, we create two loops, namely evaluator loop and executor loop. Essentially, the evaluator loop contains the code up to the condition evaluation, and the executor loop contains the code below the condition expression. To preserve the functionality, a few operations need to be added, such as saving and recalling of the iterator values for true iterations.
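The following Python sketch illustrates the two-loop structure on a hypothetical if-then loop (the arrays `a` and `b`, the condition `a[i] < 0`, and the then-block are made up for illustration and are not the loop of Figure 3). The evaluator saves the iterator values of true iterations and counts them; the executor is a compressed loop whose trip count is that count.

```python
def original_loop(a, b, n):
    """Original loop: condition and then-block in the same iteration."""
    for i in range(n):
        if a[i] < 0:              # condition block
            b[i] = -a[i] * 3      # then-block


def evaluator_executor(a, b, n):
    """Fissioned version: evaluator loop followed by a compressed executor loop."""
    true_iters = []               # saved iterator values of true iterations
    for i in range(n):            # evaluator loop: condition only
        if a[i] < 0:
            true_iters.append(i)
    true_count = len(true_iters)  # total number of true iterations

    for k in range(true_count):   # executor loop: trivial loop control
        i = true_iters[k]         # recall the original iterator value
        b[i] = -a[i] * 3          # then-block, executed unconditionally
    return true_count


a = [1, -2, 3, -4]
b_ref, b_new = [0] * 4, [0] * 4
original_loop(a, b_ref, 4)
count = evaluator_executor(a, b_new, 4)
```

In this example the executor runs for only two of the four original iterations, and no resources are spent on the then-block for the others.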
Challenges and Problems.
While the transformation may seem trivial for the previous example, as mentioned before, the control structure in the loop body in general may be much more complicated. Also, if there is a statement after the conditional, simple two-loop fission will not be enough. Thus, the first challenge is, given an arbitrary control flow graph of a loop body, to determine how to partition the basic blocks into fissioned loops so that the functionality is preserved and the benefit of the loop fission is maximized. Recurrent loops, or loops with inter-iteration data dependence, pose another challenge, as data dependence cycles should not be broken during loop fission. In some cases, we can circumvent data dependence cycles and still perform our loop transformation. Another challenge is intra-iteration data dependence, such as a scalar variable defined before, but used inside, a conditional statement. Such variables need to be passed to the executor loop, which requires adding new operations to the evaluator and/or executor loops. This is a by-product of loop fission, and since it can offset the performance and energy advantage of our technique, this overhead needs to be minimized.
Overall Flow
Figure 4 lists the overall flow of our evaluator-executor transformation. Given a loop, its Program Dependence Graph (PDG) [Ferrante et al. 1987], which is an intermediate representation that makes explicit both the control and data dependences for each operation, is partitioned into three sets, which we call simply A, B, and C, where set C may be empty. These sets correspond to the evaluator, executor, and trail loops, respectively. Next, inter-iteration data dependence analysis is performed to see if there are recurrence cycles that cannot be handled. Some loops with recurrence cycles can be successfully fissioned; for others, our transformation fails. Then the PDG of each set is modified by adding operations necessary for our transformation.
The next step is intra-iteration data dependence handling, which identifies the variables that need to be passed from the evaluator to the executor or the trail loop, but which do not create any dependence cycle. Such "temporary" variables may be passed through the local memory or simply recomputed; we decide between the two by comparing their estimated costs. The next step, PDG optimization, cleans the PDG of any dead code introduced during the previous steps (e.g., recomputing a temporary variable in the executor loop may make the original definition in the evaluator loop unnecessary). Finally, the PDG of each loop is scheduled and mapped onto the target architecture, and the expected runtime is compared with that of mapping the original loop using predicated execution only, with the better being selected.
We now explain the key steps of the flow, starting with the PDG partitioning.
PDG Partitioning
The problem of PDG partitioning is to determine which basic blocks to include in each fissioned loop, which is far from trivial for an arbitrary control flow graph such as the one in Figure 5(c). This problem arises even for a simple if-then-else statement because the executor loop cannot contain both the then- and else-blocks, since that would defeat the purpose of our transformation. Another challenge is that at this stage, we simply cannot foresee all the effects of choosing one block over the other, due to the intricacies caused by later steps such as inter- and intra-iteration data dependence handling and PDG scheduling and mapping. Thus, we develop an intuitive metric for the usefulness of a block as (part of) the executor loop, which is generalized and used in our partitioning heuristic for PDGs.
Before we proceed, let us clarify one underlying assumption here. To efficiently support an arbitrary control flow graph, or even a simple if-then-else statement in a loop, we assume that the architecture supports predicated execution. Otherwise, we have to create a loop for each of the then- and else-blocks, which incurs much more overhead and is likely to cancel out any performance advantage of our transformation. With this architecture, control statements are allowed in any of the fissioned loops.
In the following, we first develop an intuition for the usefulness metric, consider the trail loop case, and present our algorithm.
4.3.1. Usefulness Metric. To explain the intuition behind our usefulness metric for including a block in the executor loop, let us consider a simple loop that contains only one if-then-else statement without any statement afterward, as illustrated in Figure 5(a). Let us assume that the branch probability is $p$, and the then- and else-blocks take $S_T$ and $S_E$ cycles, on average, respectively (which may be approximated by the number of operations in each block). Further, the loop iterates $N_{orig}$ times, and the condition block takes $S_C$ cycles, on average.
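In this example, there are only two choices for the executor loop: the then-block or the else-block. Depending on which block is included in the executor loop, the expected runtime after loop fission may vary. Assuming that the runtime of two blocks executing together is the sum of the runtime of each running separately⁴ and ignoring the control overhead, the total runtime of the fissioned loops in each case can be (approximately) represented as follows, with $p$ taken as the probability that the then-block is executed:

$$T_{\text{with-then-in-executor}} = (S_C + S_E)\,N_{orig} + S_T\,p\,N_{orig}$$
$$T_{\text{with-else-in-executor}} = (S_C + S_T)\,N_{orig} + S_E\,(1-p)\,N_{orig}$$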
Note that the aforementioned formulas include the effect of both then- and else-blocks, and that when the then-block is chosen for the executor, the else-block is assumed to be part of the evaluator (using predicated execution) and vice versa, which explains the term $S_E N_{orig}$ in the first line and $S_T N_{orig}$ in the second. For the then-block to be the better choice, we require $T_{\text{with-then-in-executor}} < T_{\text{with-else-in-executor}}$, or equivalently

$$\frac{S_T}{p} > \frac{S_E}{1-p},$$

which means that the best candidate for the executor loop is a large and rarely executed one. In other words, the usefulness of a block for the executor loop is proportional to the block size divided by the execution frequency. We extend this metric to any set of blocks and define the usefulness metric of a set of blocks as their total size divided by their execution frequency. This metric is used in our partitioning algorithm, which first identifies all the candidate sets of blocks, and chooses the one that maximizes the usefulness metric for the executor loop.
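As a worked check of this reasoning, the sketch below evaluates a simple cost model in Python. It assumes that the evaluator runs for all iterations and contains the condition plus the block not chosen for the executor, that the executor runs only for the iterations in which its block executes, and that `p` is the probability of the then-block. The function names and the numbers are hypothetical.

```python
def runtime_then_in_executor(s_c, s_t, s_e, p, n_orig):
    # Evaluator: condition + predicated else-block, every iteration.
    # Executor: then-block, only for the p * n_orig true iterations.
    return (s_c + s_e) * n_orig + s_t * p * n_orig


def runtime_else_in_executor(s_c, s_t, s_e, p, n_orig):
    # Evaluator: condition + predicated then-block, every iteration.
    # Executor: else-block, only for the (1 - p) * n_orig false iterations.
    return (s_c + s_t) * n_orig + s_e * (1 - p) * n_orig


def usefulness(size, frequency):
    # Block size divided by execution frequency.
    return size / frequency


# Hypothetical loop: large, rarely executed then-block.
s_c, s_t, s_e, p, n_orig = 2, 10, 3, 0.1, 1000
t_then = runtime_then_in_executor(s_c, s_t, s_e, p, n_orig)
t_else = runtime_else_in_executor(s_c, s_t, s_e, p, n_orig)
best = "then" if usefulness(s_t, p) > usefulness(s_e, 1 - p) else "else"
```

With these numbers the then-block has the higher usefulness (10/0.1 versus 3/0.9), and placing it in the executor gives the shorter estimated runtime, so the metric and the runtime comparison agree.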
Trail Loop.
A slightly more complex case is one that has statements after a conditional, as illustrated in Figure 5 (b). Such statements may have data or control dependence on one or more statements in the conditional. In particular, if the postconditional statements have data or control dependence on the blocks included in the executor loop, they must be executed after the executor loop as a separate loop, which is called trail loop.
In a nested control statement, determining which blocks should be included in the trail loop may not be obvious. For instance, in Figure 6 if the executor includes block ❶ only, the trail loop does not include blocks ❷, ❸, or ❹, but block ❺ only. The principle here is that all the blocks that may be executed in an iteration where the executor block is executed must be included in the trail loop if they are executed after the executor block. Based on this observation, we provide an algorithmic solution for the general case in the next section.
4.3.3. General Solution. Our solution for an arbitrary Control Flow Graph (CFG) generated from a structured code (i.e., without gotos) is as follows: (i) identify all the candidate sets of blocks, (ii) choose for the executor the one that maximizes our usefulness metric, that is, size over execution frequency, (iii) select the blocks for the trail loop, and (iv) the remaining blocks become the evaluator. For this algorithm, we assume that the branch probabilities of the conditionals are available through profiling or by other means (e.g., static analysis). In the following discussion, we assume without loss of generality that every conditional statement has up to two cases-if-then or if-then-else statements only.
First, we associate a unique predicate variable with each condition expression as illustrated in Figure 5(c). Then we can find the predicate expression that dictates the condition for executing each block in the CFG, which is a product of predicate variables or their complements (called literals), as shown inside each block in the figure. Let us call this product of literals the execution condition of the block. It is easy to see that when each predicate variable takes the value of the corresponding branch probability, the execution condition of a block becomes the execution frequency of the block. We say that execution condition x implies execution condition y if all the literals of y are found in x. For instance, $p_1 p_2 p_3$ implies $p_1 p_2$ but not $p_1 p_2 \bar{p}_3$. The execution condition of a set of blocks is defined as the Boolean OR of each block's execution condition.
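These definitions can be sketched in Python by representing an execution condition as a set of literals. The representation (pairs of a variable name and a polarity flag), the function names, and the probabilities below are assumptions made for illustration only.

```python
from math import prod


def implies(x, y):
    """x implies y if every literal of y is also found in x."""
    return y <= x


def frequency(cond, prob):
    """Substitute branch probabilities for predicate variables.

    A positive literal contributes prob[var]; a complemented literal
    contributes 1 - prob[var]. The empty condition has frequency 1.
    """
    return prod(prob[var] if positive else 1.0 - prob[var]
                for var, positive in cond)


# Literals are (variable, polarity) pairs; polarity False means complemented.
x = frozenset({("p1", True), ("p2", True), ("p3", True)})
y = frozenset({("p1", True), ("p2", True)})
z = frozenset({("p1", True), ("p2", True), ("p3", False)})

prob = {"p1": 0.5, "p2": 0.2, "p3": 0.1}   # hypothetical branch probabilities
freq_y = frequency(y, prob)
```

Here `x` implies `y` because the literals of `y` are a subset of those of `x`, while `x` does not imply `z` because the complemented literal of `z` is absent from `x`.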
Let B denote the set of all the blocks included in the executor loop. Then we note that if B includes a block whose execution condition is x, adding to B any other block whose execution condition implies x can only increase the benefit of the executor, as per our usefulness metric. This is because adding it does not increase the execution frequency of B but only increases its size. This is in effect saying that in a nested conditional statement, all the inner conditionals should be included in their entirety if the outer one is included. In our example of Figure 5 (c), these are the candidate sets of blocks (each set is represented by its execution condition):
Finding all the candidates can be done efficiently by exploiting the nesting property of the conditionals. Once all the candidate sets of blocks are identified, they are evaluated in terms of the usefulness metric, and the one with the highest metric is chosen.

Fig. 7. Recurrent loops can be fissioned if the recurrence can be isolated into one fissioned loop. This example illustrates that by moving the operation in the gray box from the executor to the evaluator, we can break the dependence of the evaluator on the executor.
Then, the set of blocks included in the trail loop, called C, includes all the blocks that (i) appear after the blocks of B in the original code, and (ii) have an execution condition y that is not exclusive with that of B, that is, $y \cdot x \neq 0$, where x is the Boolean OR of all the execution conditions of the blocks in B. Figure 5(c) shows set C for the B selected as in the figure. Finally, all the remaining blocks are automatically included in set A, which is for the evaluator loop. In a later step (e.g., intra-iteration data dependence handling), if the operations in C are found to have no data dependence at all on B, C can be merged into A.
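A minimal Python sketch of the trail-set selection follows. It assumes that blocks are listed in original program order, that "after the blocks of B" means after the last block of B, and that two products of literals are exclusive exactly when one contains the complement of a literal in the other. The block names and the tiny CFG are hypothetical.

```python
def exclusive(c1, c2):
    """Two products of literals can never hold together if they
    contain a pair of complementary literals."""
    return any((var, not positive) in c2 for var, positive in c1)


def select_trail(blocks, executor):
    """Return the names of blocks for set C.

    blocks:   list of (name, execution_condition) in program order.
    executor: set of block names chosen for set B.
    """
    last_b = max(i for i, (name, _) in enumerate(blocks) if name in executor)
    b_conds = [cond for name, cond in blocks if name in executor]
    trail = []
    for name, cond in blocks[last_b + 1:]:
        # Keep the block if it can execute together with some block of B.
        if any(not exclusive(cond, x) for x in b_conds):
            trail.append(name)
    return trail


# Hypothetical loop body: if (p1) then1 else else1; post
blocks = [
    ("cond1", frozenset()),
    ("then1", frozenset({("p1", True)})),
    ("else1", frozenset({("p1", False)})),
    ("post", frozenset()),
]
trail = select_trail(blocks, {"then1"})
```

With the then-block as the executor, the else-block is exclusive with it and stays out of the trail set, whereas the post-conditional block can run in the same iteration and is therefore selected.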
Inter-Iteration Data Dependence
In a loop, an operation that is dependent on the same or another operation of a previous iteration, either through a scalar or an array variable, can create a cycle in the data dependence graph. Since data dependence represents precedence relationship, such cycles may not be broken into different loops, which can complicate loop fission. One caveat, however, is that a data dependence cycle that remains completely in one of the fissioned loops does not interfere with loop fission. Therefore, we move operations between the sets in order to make a dependence cycle completely contained in one set. Figure 7 illustrates this for a loop with a max operation. As shown in Figure 7 (a), the original loop has only one if-statement. Therefore, the obvious choice for set B is the body of the if-statement, whereas the other operations comprise set A including the comparison operation in the condition expression of the if-statement.
However, this partitioning causes a problem in loop fission due to a dependence cycle. The dependence cycle consists of two operations: the max assignment operation in the gray box, which is in set B, and the comparison operation, which is in set A. Since the two operations belong to different sets, it is impossible to create a new loop from each set. However, if we move the max assignment operation from B to A (with necessary predication), both operations comprising the dependence cycle now belong to set A. In other words, the dependence cycle becomes completely contained in A, from which we can create an evaluator loop that is no longer dependent on the executor loop, as shown in Figure 7(b).
Figure 8 shows an algorithm that checks if a loop with inter-iteration data dependence cycles can be fissioned by moving some operations between the sets. In general, moving operations from B to A or C is possible by predicating the operations. On the other hand, it is impossible to move operations from either A or C to B without duplicating the code. Therefore, we can gather the operations into either A or C. Since they are equivalent in terms of execution frequency, we choose to use A whenever the dependence cycle involves operation(s) from A (line 5). Otherwise, the dependence cycle consists of operations from either B or C only, in which case we gather them to C (lines 9-12). The actual moving of operations in lines 7 and 11 is carried out recursively for all their predecessors or successors depending on the destination set. The predecessor/successor relationship is defined for the entire data dependence graph for all the sets, as opposed to within each set. If all dependence cycles are contained within individual sets without emptying the set B, we can proceed to the next step (line 16). Otherwise, the evaluator-executor transformation is aborted.
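The effect of moving a recurrence into the evaluator can be sketched in Python on a hypothetical running-maximum loop (this is not the loop of Figure 7; the names and the extra work `x * x` are made up). The max update is moved into the evaluator in predicated form, so the evaluator no longer depends on the executor.

```python
def original(a):
    """Original loop: the max update feeds the next comparison."""
    best, out = float("-inf"), []
    for i, x in enumerate(a):
        if x > best:                 # comparison (set A)
            best = x                 # max assignment: part of the cycle
            out.append((i, x * x))   # remaining then-block work (set B)
    return best, out


def fissioned(a):
    """Fissioned loops with the recurrence fully contained in the evaluator."""
    best, true_iters = float("-inf"), []
    for i, x in enumerate(a):        # evaluator loop
        p = x > best
        best = x if p else best      # predicated max update, moved from B to A
        if p:
            true_iters.append(i)     # save iterator of true iteration
    out = []
    for i in true_iters:             # executor loop: no recurrence left
        x = a[i]
        out.append((i, x * x))
    return best, out


data = [3, 1, 4, 1, 5]
ref = original(data)
new = fissioned(data)
```

The executor here only needs the saved iterator values, since the value it uses can be reloaded from the input array.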
PDG Modification
Our transformation requires the evaluator loop to store two pieces of information: the iterator values for the true iterations and the total number of true iterations. This can be done by adding a few operations in the PDG of set A, as illustrated in Figure 3 . While the evaluator and the trail loops inherit the loop control code (i.e., iterator initialization and increment) of the original loop, the executor loop has a trivial loop control code, and inside the loop body, the real values of the original iterator are retrieved from the 
Intra-Iteration Data Dependence
An example of intra-iteration data dependence is a variable defined in set A and used in B or C. Such variables, which we call transfer variables, may not be available in B or C unless they are explicitly copied. One can either save and restore the transfer variable through the local memory after first converting it into an array (similar to variable expansion in software pipelining [Rau 1994]), or simply recompute the variable where it is needed. We try both and choose the better one, or the one that requires fewer additional operations, possibly with some weighting factors.⁵ In the case of save-and-restore, a transfer variable used only in B needs to be saved for true iterations only. In the case of recomputation, the expression tree corresponding to the definition of the variable needs to be copied from A.
A more complicated case is when a variable is defined in A, modified in B, and used in C. Such variables can be passed through the local memory, which is similar to the simpler case. Recomputing them in C requires copying to C both its definition from A and the conditional update from B, which can increase the overhead of the transformation (thus possibly favoring the save-and-restore method).
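The two options for the simpler case (a variable defined in A and used only in B) can be compared with a short Python sketch. The loop, the transfer variable `t`, and its defining expression `a[i] * 2 + 1` are hypothetical.

```python
def save_and_restore(a, c, limit):
    """Pass the transfer variable t through memory, for true iterations only."""
    true_iters, saved_t = [], []
    for i in range(len(a)):           # evaluator loop
        t = a[i] * 2 + 1              # transfer variable defined in A
        if t > limit:
            true_iters.append(i)
            saved_t.append(t)         # extra store in the evaluator
    b = [0] * len(a)
    for k in range(len(true_iters)):  # executor loop
        i = true_iters[k]
        b[i] = saved_t[k] + c[i]      # extra load in the executor
    return b


def recompute(a, c, limit):
    """Recompute t in the executor by copying its expression tree from A."""
    true_iters = []
    for i in range(len(a)):           # evaluator loop
        if a[i] * 2 + 1 > limit:
            true_iters.append(i)
    b = [0] * len(a)
    for i in true_iters:              # executor loop
        t = a[i] * 2 + 1              # copied definition of t
        b[i] = t + c[i]
    return b


a, c = [1, 5, 2, 7], [10, 20, 30, 40]
b_saved = save_and_restore(a, c, 8)
b_recomputed = recompute(a, c, 8)
```

Save-and-restore adds one store and one load per true iteration, while recomputation adds one load and two arithmetic operations per true iteration in this example; which is cheaper depends on the size of the expression tree and on the weights given to memory operations.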
Applicability Study
To see the applicability of our proposed transformation, we study 15 applications from MiBench [Guthaus et al. 2001] and SD-VBS [Venkata et al. 2009], which is a computer vision benchmark. From those applications, we first collect 93 loops that are the innermost loops in functions that each account for at least 10% of the application runtime. As summarized in Table I, about one-third of them (32 loops) contain one or more conditional statements, or continue or switch statements. They are the potential targets of our technique. Half of them (16) are recurrent loops while the other half (16) have no inter-iteration data dependence. In nine of the recurrent loops, the recurrence cycles can be moved into the evaluator loop, enabling our evaluator-executor transformation. Thus the number of applicable kernels is 25 (= 16 + 9) out of 32, which suggests a high applicability of our transformation.
FISSION-SPECIALIZED CGRA
While the previous section presented our evaluator-executor transformation in a largely target architecture-independent manner, we now propose a hardware extension that is specially tailored for the transformation, for a class of Coarse-Grained Reconfigurable Architectures (CGRAs). This architecture extension also accommodates loop fission for CGRAs.
As the extra code in our loop transformation originates from (1) saving and restoring the iterator for true iterations, (2) counting true iterations, and (3) either saving and restoring or recomputing transfer variables, our strategy is (i) to have a hardware counter for counting true iterations and (ii) to provide an efficient means of saving and restoring the iterator and transfer variables. Figure 9 illustrates the architecture extension for a quadrant (2 × 2 PE array) of a baseline 4 × 4 CGRA, as described in Section 2.
Hardware for True Iteration Counting
The hardware for counting true iterations is very simple, consisting of one counter only (True_CNT in the figure), which serves the entire CGRA. The true counter is driven by the OR-ed result of all Executor Enable (XE) outputs of Processing Elements (PEs). The XE output of a PE is identical to its predicate output, except that it is driven high only when the PE is actually computing the condition value in an Evaluator loop; thus, usual predicated execution does not interfere with true iteration counting. The CGRA's controller is aware of the true counter, and its final value, which is the total number of true iterations, is used as the trip count of the Executor loop.
Hardware for Save and Restore
Push and Pop.
To save and restore the iterator or a transfer variable, a number of operations may be needed, including a store/load operation and the accompanying address calculation. What they amount to, however, is simple push and pop operations using the CGRA's local memory as a FIFO. Thus, we simplify this by replacing the whole address calculation with a very simple address generator with no multithreading, which can be implemented with a counter, and by embedding the store operation into other instructions as a suboperation controlled by one bit of the configuration. This suboperation is called push; because it is embedded in other instructions, it can be executed by any PE. The load operation is not embedded, but provided as a separate instruction (called pop), which is simpler than a load because the address need not be specified.
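The behavior of push and pop with a counter-based address generator can be modeled in software as follows. This is a behavioral sketch only: the class name `QuadrantFifo`, the `reset` method, and the assumption that the address counter is reset between the evaluator and the executor are not taken from the text.

```python
class QuadrantFifo:
    """Behavioral model of one quadrant's push/pop path (hypothetical names).

    A region of local memory is used as a FIFO, and the address comes from
    a counter instead of an explicit address computation.
    """

    def __init__(self, size):
        self.mem = [None] * size   # stand-in for a local-memory region
        self.addr = 0              # counter-based address generator

    def reset(self):
        # Assumption: the counter restarts between the evaluator (pushes)
        # and the executor (pops).
        self.addr = 0

    def push(self, value):
        # Models the push suboperation: no address operand is needed.
        self.mem[self.addr] = value
        self.addr += 1

    def pop(self):
        # Models the pop instruction: also takes no address operand.
        value = self.mem[self.addr]
        self.addr += 1
        return value
```

In this model, the evaluator calls `push` for each true iteration, and after a `reset` the executor retrieves the values in the same order with `pop`.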
Optimizing for Performance.
To explain the asymmetrical design decision between push and pop, let us consider the runtime of a software-pipelined loop on a CGRA, which can be modeled as II · (L + N − 1), where II, L, and N are the initiation interval, the number of stages (the schedule length in terms of II), and the number of iterations, respectively. Then the runtime difference between the original loop and the fissioned loops is given as follows:
\[
\begin{aligned}
T_{orig} - (T_1 + T_2) &= II_{orig}\,(L_{orig} + N - 1) - II_1\,(L_1 + N - 1) - II_2\,(L_2 + pN - 1) \\
&\approx (II_{orig} - II_1 - p \cdot II_2) \cdot N,
\end{aligned}
\]
where the subscripts, namely, orig, 1, and 2, denote the original loop, the evaluator, and the executor, respectively, and p is the executor's execution ratio. The approximation in the last equation is based on the assumption that N ≫ L in general. From the last equation, it is clear that to maximize the difference, we must minimize II_1, or the initiation interval of the evaluator, since II_2 is weighted by the ratio p.
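The runtime model II · (L + N − 1) can also be evaluated numerically to see how close the large-N approximation is. The sketch below assumes the executor runs p · N iterations and ignores any trail loop; the function names and the parameter values used with it are hypothetical and are not measurements from the experiments.

```python
def pipelined_runtime(ii, stages, iterations):
    """Runtime model of a software-pipelined loop: II * (L + N - 1)."""
    return ii * (stages + iterations - 1)


def fission_gain(ii_orig, l_orig, ii_1, l_1, ii_2, l_2, n, p):
    """Return (exact, approximate) cycles saved by fissioning the loop.

    Subscript 1 is the evaluator and 2 is the executor, which is assumed
    to run round(p * n) iterations.
    """
    executor_iters = round(p * n)
    exact = (pipelined_runtime(ii_orig, l_orig, n)
             - pipelined_runtime(ii_1, l_1, n)
             - pipelined_runtime(ii_2, l_2, executor_iters))
    approx = (ii_orig - ii_1 - p * ii_2) * n   # valid when N >> L
    return exact, approx
```

For example, with the made-up values II_orig = 3, L_orig = 5, II_1 = 1, L_1 = 3, II_2 = 2, L_2 = 4, N = 1000, and p = 0.1, the exact saving is 1804 cycles and the approximation gives about 1800, which shows that the evaluator's initiation interval dominates when p is small.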
Thus, we design push, which is used in the evaluator, such that it requires no additional instruction (thus no additional PE) and can be performed on any PE (no pressure on placement or routing). On the other hand, pop is designed to require an instruction (one PE needed), and it must be done on a particular PE in each quadrant. Luckily the pressure on routing due to pop is not high, because a pop operation is usually a leaf node of an expression tree and, therefore, has much freedom in terms of placement and routing anyway.
Each quadrant supports one push/pop operation per cycle, and thus a CGRA can access up to four variables including an iterator using push/pop in any loop. If more variables need to be transferred, we do it in software, allocating hardware to the earliest appearing variables.
Queue.
Another peculiarity of the push operation is that it may need to be done only in true iterations, or when the condition expression is true. To complicate matters, there is often a time lapse between the definition of a variable and when the condition value becomes available. Thus, we need to hold the variable temporarily until it can be either committed to the local memory or discarded. While we can hold such a variable temporarily by letting it stay in a PE or keep routing it among PEs until the condition value becomes available, both of them invariably waste resources and add to the routing pressure. Instead, we add a queue, as illustrated in Figure 9, one in each quadrant. A new value is pushed to the queue once in every initiation interval by the CGRA's controller. The decision of whether to commit or discard the data is given by the global version of the XE signal.
The required length of the queue is determined by how much the two moments, variable definition and condition value availability, are separated in time. In modulo scheduling, the condition value is usually determined at the last stage of the schedule, whereas iterator variables are typically defined at the first stage, which is the worst case. From this worst case, we know that the queue size only needs to be as large as the number of stages in a modulo schedule. Our experimental results using embedded and vision applications indicate that the maximum number of stages is 5.5, on average, across applications, and up to 11 for two applications. Thus, we use a 12-entry queue. Now, this fixed queue size does not mean that certain loops cannot be mapped if they happen to have more stages than the queue size, but it only means that the mapping has to be changed, such that either the number of stages is reduced possibly at the expense of the increased initiation interval, or some variables should wait in PEs during the excess stages.
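The commit-or-discard behavior of the queue, and why its length is bounded by the number of stages, can be illustrated with a small simulation. This is a sketch under simplifying assumptions: one value enters the queue per initiation interval, the value is defined in the first stage, and the condition becomes known in the last stage. The function name and its parameters are hypothetical.

```python
from collections import deque


def simulate_quadrant_queue(values, conditions, stages, capacity=12):
    """Return (committed values, maximum queue occupancy).

    values[i] is defined in the first stage of iteration i; conditions[i]
    (the XE signal) becomes known in the last stage, i.e. stages - 1
    initiation intervals later.
    """
    if stages > capacity:
        raise ValueError("schedule has more stages than queue entries")
    delay = stages - 1
    pending = deque()
    committed = []
    max_occupancy = 0
    n = len(values)
    for step in range(n + delay):
        if step < n:
            pending.append(values[step])   # one new value per II
        max_occupancy = max(max_occupancy, len(pending))
        resolved = step - delay            # iteration whose condition is known
        if resolved >= 0:
            head = pending.popleft()
            if conditions[resolved]:
                committed.append(head)     # commit to local memory
            # otherwise the value is simply discarded
    return committed, max_occupancy
```

In this model the occupancy never exceeds the number of stages, which matches the sizing argument above: a 12-entry queue covers schedules of up to 12 stages.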
In summary, our architecture extension is designed to nearly eliminate the code overhead of our evaluator-executor transformation for a modest amount of hardware, as listed in the following:
- (Per PE) adding push operation
- (Per quadrant) adding pop operation to one of the PEs; a FIFO; and a counter for address generation
- (Per CGRA) a counter for counting true iterations.
For a 4 × 4 CGRA, this hardware overhead is mostly four 12-entry FIFOs and five counters.
Hardware Synthesis Results
We have implemented the proposed architecture extension in VHDL and Verilog. The key architectural parameters of the baseline CGRA are as given in Section 6.1. Additionally, the configuration register of each PE is 32 bits wide, and each PE has a private register file with four entries. There are no global or rotating register files, other than the global rotating predicate register file.
Our synthesis results targeting Xilinx FPGA Virtex-5 LX330 indicate that compared to the baseline CGRA, our extended architecture uses 44% more registers and 86% more latches plus four more adders and comparators while the number of slice LUTs increases only marginally by 2.7% from 12,118 to 12,449. The maximum clock speed is almost identical, with 224MHz versus 223MHz (the latter is for the extended architecture).
EXPERIMENTS
Experimental Setup and Applications
We evaluate the effectiveness of our proposed technique on a CGRA that supports predicated execution and consists of a 4 × 4 heterogeneous PE array and a multibanked local memory, as described in Section 2. Our CGRA has four load-store PEs, one in each row, that are connected to the four banks of the local memory, where each is a 256KB scratchpad. A load operation from a PE takes six cycles due to a crossbar switch between the load-store PEs and the memory banks. The interconnection between the PEs includes mesh and diagonal. Every PE can perform ALU operations while more expensive operations (e.g., multiplication) can be done by four specific PEs only. Operations on a PE are fully pipelined and can be predicated. There is no central register file except for a predicate register file, which can be updated by any PE. For mapping, we use our version of modulo scheduling based on the EMS algorithm [Park et al. 2008], without the rotating register file support, and with predicated execution support. For our hardware extension, we use a 12-entry queue for each quadrant.
We used 12 loops, which are a subset of the 25 loops in Table I , to which the proposed transformation can be applied, excluding those that cannot be mapped to the CGRA due to the limitations of the CGRA architecture and/or our compiler.
Table II lists the characteristics of the loops. The number of operations in Column 3 includes only the ones included in the PDG of the original loop, but not the loop-control related ones, which are taken care of by the CGRA controller. The fourth column shows how often the executor loop is executed per iteration, on average (i.e., execution frequency), and the fifth column is calculated as the ratio of the number of operations in the executor loop to that of operations in the entire loop body, before applying evaluator-executor transformation. The last column is the number of variables including the iterator that need to be transferred between loops.
Performance of Our Transformation with Hardware Extension
For performance evaluation, we compare three cases in this section. First, we map each loop using predicated execution only, denoted by P, which is the baseline of our comparison. Then we apply our evaluator-executor transformation with hardware extension enabled, which is denoted by H. We also estimate the number of dynamic operations that would be executed if the block(s) corresponding to the executor loop were executed only when necessary while the other blocks were executed in every iteration, denoted by Op in the graph. We use this formula, (1 − r_ex) · 1 + r_ex · f_ex, where r_ex and f_ex are the size ratio (in the number of operations) of the block(s) selected for the executor loop compared to that of the entire loop body, and their execution frequency, respectively (i.e., the fifth and fourth columns of Table II, respectively). This metric represents the fraction of operations that must be executed (over executing all the loop body operations, as is the case with predicated execution, or P) in any loop fission-based scheme including ours without considering any overhead of loop fission. Since actual loop fission results will include overheads, the Op metric may reveal how effectively our technique can realize the performance potential.
Figure 10 compares the runtime results. First, we note that our scheme with hardware extension follows the number-of-operations prediction very closely. In most cases, our scheme generates performance that is similar to the number-of-operations prediction, which therefore could be used, for example, in determining whether to apply our transformation technique in a compiler. Surprisingly, however, in several cases our scheme can exceed the predictions significantly, including Cjpeg, Dijkstra, and Susan_edges. This is because the mapping result of our scheduler is not exactly proportional to the number of operations of the PDG. Rather, our scheduler, which is based on a variant of modulo scheduling [Park et al. 2008] specialized for CGRAs, tends to perform better for smaller graphs because smaller loops usually have lower routing pressure; to wit, scheduling two smaller loops may generate better results than scheduling one large one. The other reason is that in the original loop the iterator has to be routed from the beginning until almost the end of a loop, whereas our transformation often stores the iterator early in the evaluator loop, and therefore does not consume any more routing PEs.
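The Op estimate used in this comparison is a one-line computation. The sketch below simply evaluates (1 − r_ex) · 1 + r_ex · f_ex; the function name and the sample inputs mentioned afterward are hypothetical and do not correspond to any row of Table II.

```python
def op_fraction(r_ex, f_ex):
    """Fraction of dynamic operations that must execute, ignoring overheads.

    r_ex: size ratio of the executor block(s) to the entire loop body
    f_ex: execution frequency of the executor block(s)
    """
    # Operations outside the executor run in every iteration; operations
    # inside it run only in the fraction f_ex of iterations.
    return (1.0 - r_ex) * 1.0 + r_ex * f_ex
```

For instance, if half of the operations belong to the executor (r_ex = 0.5) and it runs in one of five iterations (f_ex = 0.2), the estimate is 0.6, that is, 60% of the operations executed under predicated execution.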
To illustrate the superlinear increase of the runtime in the number of operations, Figure 11 shows the PDGs of the Dijkstra loop before and after the transformation. While the number of operations is reduced by only 43% (from 21 to 12), the number of necessary PEs including those used for routing only is reduced by 66% (from 38 to 13). Consequently, the minimum initiation interval on a 4 × 4 PE array is reduced to one-third (from 3 to 1).
On the other hand, in the case of Lame_3, the performance of our scheme is much worse than the prediction and is similar to (but still better than) the original. This is a direct consequence of the evaluator, in this particular case, taking almost as many cycles as the original loop: they happen to be scheduled with the same initiation interval, with the evaluator having fewer stages. Thus, although the reduced operation count is very likely to lead to reduced runtime, this may not always be the case, due to the randomness of the scheduling process. Overall, using our hardware scheme can reduce the kernel runtimes by 37.5%, on average, which outperforms the number-of-operations prediction mainly due to the simpler routing.
Performance of Software Transformation Only
In this section, we compare our evaluator-executor transformation with and without hardware extension. Figure 12 shows the runtime results, as normalized to that of the predicated execution (P). The rightmost bar (H) represents the case with hardware extension. The other two bars denote software-only cases. Their difference is that for transfer variables, S1 exclusively uses recomputation, whereas S2 uses the better of recomputation and save-and-restore, based on the number of operations.
Fig. 11. Dijkstra loop example, before and after the transformation.
Fig. 12. Comparing runtimes of four cases: P (predicated execution), S1 and S2 (our evaluator-executor transformation without hardware extension), and H (with hardware extension).
From the graph, we observe that while the software-only schemes can improve the runtime significantly compared with the predicated execution, the difference between software-only and hardware extension is only marginal. In particular, our software-only scheme, S2, can reduce the kernel runtimes by 33.0%, on average, compared to the predicated execution, which is about 4.5 percentage points less than that of the hardware version. The same pattern can be seen even in some loops (Lame_2, Tiffmedian) with the highest numbers of transfer variables. That said, for some other applications, the additional operations added to the software version do affect the performance rather significantly. For instance, Disparity and Stitch, going from H to S2, see 30% and 19% runtime increases, respectively.
Between the S1 and S2 cases, though the overall difference is small (they differ by only 3.5 percentage points in terms of the geometric mean), some applications do see huge runtime reductions. Most notably, the runtime of Susan_edges is reduced by 46% from 1.23 times the baseline (when all transfer variables are recomputed) to 0.67 times (when they are all passed via memory), which is because the recomputed expressions in this loop are very large and complex. For those loops, passing transfer variables via memory can eliminate the overhead of recomputation, and generate performance improvement that is comparable to the hardware version.
Energy Comparison
To evaluate the energy consumption of our proposed scheme, we use the energy model of [Kim et al. 2012b], where the CGRA's dynamic power consists of PE (Processing Element) array power, configuration memory power, and the (CGRA's) local memory power. The PE array power has two components, individual PEs' power dissipation and that of the rest of the PE array. The most interesting parts are the individual PE's power, which is dependent on the operation being performed (ALU, multiply, divide, routing, or idle), and the local memory power, which is proportional to the number of memory accesses. Except for the local memory power, which we obtain from Cacti 6.5 [Thoziyoor et al. 2008] for a 65nm technology, we use the power numbers of [Kim et al. 2012b], scaled to the 65nm technology. The leakage power is assumed to be 50% of the dynamic power with PEs being idle, whereas the leakage power of the local memory is obtained from Cacti. We assume that the CGRA runs at 500MHz.
We compare the total energy consumption of the CGRA and its associated memories in two cases: using predicated execution only (P) vs. using our software-only scheme (S2). Though faster for many loops, our software-only scheme always generates more memory accesses than the predicated execution. This is because in the predicated execution, load/store operations that are disabled due to predication do not generate memory operations, whereas our scheme generates loops that have extra load/store operations due to the iterator and possibly one or more transfer variables. In both cases, the underlying hardware is the same.
Figure 13 compares the energy results, overlaid with the normalized runtime results of the S2 case. The energy consumption is broken down into different categories, among which the leakage accounts for the largest portion, on average. In our energy model, only the local memory energy and the PE array energy are affected by the operations performed on the CGRA. The others are proportional to the runtime, and in those categories we see reductions of about 33%, on average, going from the baseline to our software scheme. The PE array's dynamic energy is only slightly larger than that of the configuration memory, and is overall reduced by our scheme, which is largely a by-product of the reduced runtime. Lastly, the local memory's dynamic energy, which accounts for 22.6% in the baseline, is increased by 16.8% (to 26.4% of the baseline total energy) due to the additional memory operations generated by our software transformation. However, this increase is not excessive in any application. This is because the iterator and optionally transfer variables that are saved in the evaluator loop are only for those iterations that will be executed in the executor loop. In other words, we save only the data that will be used later, which helps keep down the energy price we pay in our transformed loops. The additional operations introduced by our technique, coupled with the reduced runtime, increase the power dissipation of our software technique by 16%, on average. Overall, our software technique can reduce the energy consumption of the entire CGRA including configuration and data memory by 22.0%, on average.
Fig. 13. Comparing energy of predicated execution (P) and our software transformed loops (S2). Shown below is the average power dissipation of S2, normalized to that of P.
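The reported averages can be cross-checked with a short calculation. This is only a rough consistency check: it assumes that average energy is approximately average power times average runtime, which holds only approximately for averages taken across loops, and the function name is hypothetical.

```python
def relative_energy(power_ratio, runtime_ratio):
    """Energy of S2 relative to P, given relative power and runtime."""
    return power_ratio * runtime_ratio


# Reported averages: power increases by 16%, runtime decreases by 33%.
energy_ratio = relative_energy(1.16, 1 - 0.33)
energy_reduction = 1 - energy_ratio   # close to the reported 22.0%

# Local memory dynamic energy: 22.6% of the baseline total, increased by 16.8%.
memory_share_after = 22.6 * 1.168     # close to the reported 26.4%
```

Both values agree with the reported figures to within rounding, which suggests that the energy saving is essentially the runtime reduction partially offset by the higher average power.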
Discussion
While our target architecture shares some features with the ADRES architecture [Mei et al. 2003a] such as the multibanked local memory with a crossbar switch, there are also key differences, which should affect our synthesis results in Section 5.3 as well as performance results. First, our CGRA architecture does not have rotating register files in PEs, but only a private register file in each PE and a global rotating predicate register file. The lack of rotating register files typically results in more PEs being used for operand routing, which can help explain the superlinear increase in runtime with larger PDGs. This also means that the performance improvement of our software technique over predicated execution may partially be offset on CGRAs with rotating register files. Second, our CGRA architecture is loosely coupled to the main processor, whereas the ADRES architecture, which can double as a VLIW processor, has a very low control and communication overhead. Since our technique fissions a loop into two or three loops, it has an overhead that is proportional to the amount of the CGRA control latency. Consequently, on a tightly coupled architecture such as ADRES, the scope of our technique could be much broader.
CONCLUSION
Many conditionals found in loops have unbalanced branches in terms of both size and execution frequency. Interestingly, often larger branches are less frequently executed, presenting an opportunity for optimization unknown to predicated execution, which blindly allocates resources regardless of the execution frequency. For such loops with unbalanced conditionals, we presented in this article a software technique that divides a loop into two or three smaller loops so that the condition part is evaluated only in the first loop while the less frequent but larger branch is executed in the second loop in an efficient way. To reduce the overhead of extra data transfer caused by the loop fission, we also present a hardware extension for a class of Coarse-Grained Reconfigurable Architectures (CGRAs). Our experiments using MiBench and computer vision benchmarks on a CGRA demonstrate that our techniques can improve the performance of loops over predicated execution by up to 65% (37.5%, on average), when the hardware extension is used. Without any hardware modification, our software-only version can improve performance by up to 64% (33%, on average), while also reducing the energy consumption of the entire CGRA by 22%, on average.
