Basic retiming is an algorithm originally developed for hardware optimization. Software pipelining is a technique proposed to increase instruction-level parallelism for parallel processors. In this paper, we show that applying software pipelining alone to minimize timings under resource constraints can lead to suboptimal results compared to unifying basic retiming and software pipelining. We propose an approach to realize this unification. The approach minimizes the code size of the optimized loop as well as the idleness of computational elements. We extend this approach to solve the problem of minimizing peak power consumption for time-constrained and resource-constrained software-pipelined loops. Solving these problems is important for portable embedded systems as well as system-on-chip design. The approaches are tested using known benchmarks. On average, the relative timing improvement is 60.19%, and the relative reduction of peak power consumption is 13.17% without any trade-off in timings.
INTRODUCTION
The processing speed of digital systems continues to increase thanks to the combination of smarter compilers, new architectures, and advances in semiconductor technology that allow putting more and more transistors on the same chip. However, even as new digital systems with improved processing speed continue to emerge, new computation-hungry applications emerge as well, while some older applications still challenge even the fastest computers available. Applications from digital signal processing, image processing, and multimedia are examples of applications that still require high processing speed. Applications that require high processing speed are in general loop-intensive.
To increase the processing speed of loop-intensive applications, many compiler techniques are used to generate code that efficiently exploits the components of the hardware. Software pipelining [4] [8] is one such technique. It overlaps the execution of consecutive iterations of a given loop, thereby increasing the instruction-level parallelism, which is useful for parallel processors such as Very Long Instruction Word (VLIW) and superscalar processors.
Software pipelining is not a recent technique. It was devised many years ago and is used in developing code for many well-established processors on the market. However, realizing this technique under constraints is a challenging problem. Indeed, in software pipelining there are two important parameters: the Latency (L), which is the time required to execute all the instructions that constitute the body of the loop, and the Initiation Interval (II), which is the interval of time that separates the start execution times of any two consecutive iterations of the loop. The first challenging problem is minimizing II under resource constraints. That problem is NP-hard in general, and many heuristic approaches have been proposed to solve it approximately. The second challenging problem is minimizing L for a given II under resource constraints. A third challenging problem will be presented shortly.
Increasing the processing speed of digital systems is no longer the only main design objective, today or in the future. We have been, and will continue to be, constrained by the need to reduce the power consumption of digital systems. This need is mainly motivated by: (1) reducing the cooling cost of high-speed digital systems, and (2) prolonging battery lifetime for battery-powered portable systems. The power consumed in high-speed systems is transformed into heat, which requires special cooling devices in order to avoid malfunction and hardware damage. Designers must then develop sophisticated cooling mechanisms under cooling-cost constraints. For battery-powered portable systems, prolonging battery life is required for some critical portable systems such as wearable medical systems. In addition, battery life has become a product differentiator in the market of portable digital systems.
The peak power is the power consumed at the most power-hungry control step. Peak power must be reduced since high peak power might lead to malfunction of a digital system or to damage to its hardware. Software pipelining increases the instruction-level parallelism. This means that the number of non-idle computational elements (e.g., ALUs, multipliers) increases, which in turn increases the peak power. However, by scheduling those instructions appropriately, peak power can be reduced compared to a peak-power-unaware schedule.
The third challenging problem related to realizing software pipelining under constraints is the following: how can one assign the instructions that constitute the body of the loop to control steps, under resource constraints and for a given L and II, while minimizing the peak power?
In this paper, we establish a formal relationship between L and the code size of loops that are optimized by software pipelining. For a fixed II, the code size increases when L increases. Consequently, to reduce the code size, one needs to reduce L. Furthermore, decreasing L decreases the idleness of computational elements. We show that applying software pipelining alone to optimize a loop, under resource constraints and for a target II, yields a value of L that is minimal only relative to the given loop graph (and possibly large), whereas unifying basic retiming [1] and software pipelining can yield a smaller one. We propose an Integer Linear Program (ILP) to realize that unification. This ILP constitutes a flexible mathematical framework. Indeed, we have easily extended that ILP to solve the problem of minimizing peak power as stated above. To the best of our knowledge, this is the first paper in the literature that addresses the latter problem. Although this has not been done yet, the proposed mathematical framework can be extended to solve the problem of minimizing register requirements for software-pipelined loops.
LOOP REPRESENTATION
In this paper, we are interested in "for"-type loops such as the one in Figure 1 (a). We assume that the body of the loop consists of computational and/or assignment instructions only (i.e., no conditional or branch instruction, such as if-then-else, appears inside the body).
Let $\mathbb{N}$ be the set of non-negative integers. We model a loop by a directed cyclic graph $G = (V, E, d, w)$, where $V$ is the set of instructions (or atomic operations like addition and multiplication) in the loop body, and $E$ is the set of arcs that represent data dependencies. Each vertex $v \in V$ has a non-negative integer execution delay $d(v) \in \mathbb{N}$. Each arc $e_{u,v} \in E$, from vertex $u$ to vertex $v$, has a weight $w(e_{u,v}) \in \mathbb{N}$, which means that the result produced by $u$ at any iteration $i$ of the loop is consumed by $v$ at iteration $i + w(e_{u,v})$. Figure 1 presents a simple loop and its directed cyclic graph model. In Figure 1 (b), the execution delay of each operation is specified as a label (i.e., the number within a rectangle) on the left of each vertex of the graph, and the weight of each arc is in bold. For instance, a delay label of 2 means that the operation takes 2 time units, and a weight of 0 on an arc from $u$ to $v$ means that $v$, at any iteration $i$, uses the result produced by $u$ at the same iteration $i$.
INTRODUCTION TO BASIC RETIMING
A synchronous sequential design can be modeled as a directed cyclic graph, as we did for loops in Section 2. Instructions become the computational elements of the design, arcs become wires, and the weights $w(e_{u,v})$ become the number of registers on the wire between computational element $u$ and computational element $v$.

Let $G = (V, E, d, w)$ be a synchronous sequential digital design. We denote by $\mathbb{Z}$ the set of integers. Basic retiming (or retiming for short in this paper) [1] is defined as a function $r : V \to \mathbb{Z}$, which transforms $G$ to a functionally equivalent synchronous sequential digital design $G_r = (V, E, d, w_r)$ by assigning a label $r(v)$ to each vertex $v$ in $V$. The physical meaning of the assigned labels can be viewed as follows. If $r(v)$ is positive, then we have to move $r(v)$ registers from each output wire of $v$ and put them on each input wire of $v$, assuming that we have at least $r(v)$ registers on each output wire of $v$. If $r(v)$ is negative, the previous process is reversed. When $r(v)$ is equal to zero, no register has to be moved across $v$.

The difference between $G$ and its retimed version $G_r$ is the weight of arcs. The weight of each arc $e_{u,v}$ in $G_r$ is now defined as follows:

$$w_r(e_{u,v}) = w(e_{u,v}) + r(v) - r(u), \quad \forall e_{u,v} \in E. \qquad (1)$$

Since the weight of each arc in $G_r$ represents the number of registers on that arc, we must have:

$$w_r(e_{u,v}) \ge 0, \quad \forall e_{u,v} \in E. \qquad (2)$$

Any retiming that satisfies inequality (2) is called a valid retiming. From expressions (1) and (2) one can deduce the following inequality:

$$r(u) - r(v) \le w(e_{u,v}), \quad \forall e_{u,v} \in E. \qquad (3)$$

Let $p_{u,v}$ denote a path from vertex $u$ to vertex $v$. Equation (1) implies that, for every two vertices $u$ and $v$, the change in the register count along any path depends only on its two endpoints:

$$w_r(p_{u,v}) = w(p_{u,v}) + r(v) - r(u), \qquad (4)$$

where:

$$w(p_{u,v}) = \sum_{e \in p_{u,v}} w(e). \qquad (5)$$

Let $d(p_{u,v})$ denote the delay of a path $p_{u,v}$ from vertex $u$ to vertex $v$; $d(p_{u,v})$ is the sum of the execution delays of all the vertices that belong to $p_{u,v}$.

A 0-weight path is a path $p_{u,v}$ such that $w(p_{u,v}) = 0$. The minimal clock period of a synchronous sequential digital design is the delay of its longest 0-weight path. It is defined by the following equation:

$$\Phi(G) = \max \{\, d(p_{u,v}) \mid w(p_{u,v}) = 0 \,\}. \qquad (6)$$
One application of retiming is to minimize the clock period of synchronous sequential digital designs. For instance, by thinking of Figure 1 (b) as a synchronous sequential design, the clock period of that design is time units, which is equal to the sum of execution delays of vertices (i.e., ). However, we can obtain time units if we apply the following retiming vector to the vector of vertices in , where the value located at the i th position in the retiming vector corresponds to the value assigned by to the vertex located at the i th position in the vector of vertices. The retimed graph is presented by Figure 2 . 
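The retiming equations of this section can be made concrete with a short sketch. The Python below is illustrative only: the three-vertex graph, the function names `retime`, `is_valid_retiming` and `clock_period`, and the dictionary-based graph encoding are assumptions made for this example; it is not the graph of Figure 1 and not the authors' implementation. It applies the arc-weight update of equation (1), checks validity (all retimed weights non-negative), and computes the clock period as the largest delay of a path made only of zero-weight arcs.

```python
def retime(weights, r):
    """Equation (1): w_r(e_uv) = w(e_uv) + r(v) - r(u)."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in weights.items()}


def is_valid_retiming(weights, r):
    """A retiming is valid when every retimed arc weight is non-negative."""
    return all(w >= 0 for w in retime(weights, r).values())


def clock_period(delays, weights):
    """Largest total delay over paths that use only zero-weight arcs."""
    zero_succ = {v: [] for v in delays}
    for (u, v), w in weights.items():
        if w == 0:
            zero_succ[u].append(v)
    memo = {}

    def longest_from(v, stack):
        if v in stack:
            raise ValueError("zero-weight cycle: not a synchronous design")
        if v not in memo:
            tail = max((longest_from(s, stack | {v}) for s in zero_succ[v]), default=0)
            memo[v] = delays[v] + tail
        return memo[v]

    return max(longest_from(v, frozenset()) for v in delays)


# Hypothetical example graph (NOT Figure 1): a -> b -> c -> a.
delays = {"a": 1, "b": 2, "c": 1}
weights = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
r = {"a": 0, "b": 1, "c": 2}

retimed = retime(weights, r)
print(clock_period(delays, weights))  # 4: the path a -> b -> c has zero weight
print(retimed)                        # one register on a->b and on b->c, none on c->a
print(clock_period(delays, retimed))  # 2
```

In this toy instance the retiming moves the two registers of the back arc onto the forward arcs, which breaks the long zero-weight path and halves the clock period.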
NOTATIONS AND DEFINITIONS
The following notations and definitions will be used in the rest of the paper. Without loss of generality, we assume in this paper that computational elements are not pipelined.
Denotes the start execution time of . Known variable which is an upper bound on the latency to be minimized. A trivial value for is the sum of the delays of all the instructions in the loop's body, which is equal to the run-time when the loop is executed sequentially. For instance for the loop in Figure 1 , we have .
VALID PERIODIC SCHEDULE
Let $G = (V, E, d, w)$ be a directed cyclic graph modeling a loop. A schedule is a function $s$ that, for each iteration $k$ of the loop, determines the start execution time $s_k(v)$ of each instruction $v$ of the loop's body. The schedule is said to be periodic with period $P$ iff it satisfies the following equation:

$$s_k(v) = s_0(v) + k \cdot P, \quad \forall k \in \mathbb{N}, \ \forall v \in V, \qquad (7)$$

where $s_0(v)$ is the start execution time of the first instance of the instruction $v$. Without loss of generality, we assume throughout this paper that:
. (8) In this paper, the schedule is said to be valid iff it satisfies both data dependency constraints and resource constraints (in case of limited number of resources).
Data dependency constraints mean that a result computed by instruction $u$ can be used by instruction $v$ only after $u$ has finished computing that result. In terms of the start execution times $s_k(\cdot)$ of the $k$-th instances, this is equivalent to the following inequality:

$$s_{k + w(e_{u,v})}(v) \ge s_k(u) + d(u), \quad \forall k \in \mathbb{N}, \ \forall e_{u,v} \in E. \qquad (9)$$

Using equation (7), inequality (9) transforms to:

$$s_0(v) - s_0(u) \ge d(u) - P \cdot w(e_{u,v}), \quad \forall e_{u,v} \in E. \qquad (10)$$

Resource constraints mean that, at any time, the number of instructions that require execution on a class of computational elements, say $F$, must not exceed the number, $|F|$, of available resources in $F$. When there are no resource constraints (unlimited number of resources), the schedule is valid and periodic with period $P$ iff the system of inequalities described by (10) has a solution for the unknowns $s_0(v)$. Summing the inequalities of this system along any cycle makes the left-hand side equal to 0. Moving the term that contains $P$ to the other side of the resulting inequality, and repeating this for every cycle in the graph, one can prove that the system of inequalities described by (10) has a solution iff $P$ is such that:

$$P \ge \max_{C \in \mathcal{C}} \frac{\sum_{v \in C} d(v)}{\sum_{e_{u,v} \in C} w(e_{u,v})}, \qquad (11)$$

where $\mathcal{C}$ denotes the set of directed cycles in the directed cyclic graph modeling the loop.
The right hand side of inequality (11) is a Minimum Cost-to-Time Ratio Cycle Problem [5] , and can be optimally solved in polynomial run-time using, for instance, one of the algorithms described in [5] .
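As a small illustration of the data-dependency bound, the sketch below evaluates the largest ratio of total delay to total weight over all simple cycles by exhaustive enumeration. This is an assumption-laden toy: the function name `cycle_ratio_bound`, the dictionary encoding, and the four-arc example graph are hypothetical, parallel arcs between the same pair of vertices are not supported, and the enumeration is exponential in the worst case. The polynomial algorithms of [5] are the appropriate choice for real graphs.

```python
from fractions import Fraction


def cycle_ratio_bound(delays, weights):
    """Max over simple cycles of (sum of delays) / (sum of arc weights)."""
    order = {v: i for i, v in enumerate(delays)}
    succ = {v: [] for v in delays}
    for (u, v), w in weights.items():
        succ[u].append((v, w))
    best = Fraction(0)

    def extend(start, v, visited, d_sum, w_sum):
        nonlocal best
        for nxt, w in succ[v]:
            if nxt == start:  # closed a simple cycle through `start`
                if w_sum + w == 0:
                    raise ValueError("cycle with no loop-carried dependency")
                best = max(best, Fraction(d_sum, w_sum + w))
            elif order[nxt] > order[start] and nxt not in visited:
                extend(start, nxt, visited | {nxt}, d_sum + delays[nxt], w_sum + w)

    # Each cycle is enumerated once, starting from its lowest-ordered vertex.
    for s in delays:
        extend(s, s, {s}, delays[s], 0)
    return best


# Hypothetical example graph (NOT Figure 1).
delays = {"a": 1, "b": 2, "c": 1}
weights = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2, ("b", "a"): 1}
print(cycle_ratio_bound(delays, weights))  # 3, set by the cycle a -> b -> a
```

Here the cycle a, b, c has ratio 4/2 = 2 and the cycle a, b has ratio 3/1 = 3, so no periodic schedule of this toy loop can have a period below 3 time units.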
Inequality (11) gives a lower bound on the period $P$ that is due to data dependency constraints only. Another lower bound on $P$, due to resource constraints only, can be derived as follows. For instance, if we have only 3 instructions of type addition, and only 2 identical adders with execution delay equal to 1 ns (the same as the execution delay of any of those instructions), then the time required to execute those 3 instructions cannot be less than 1.5 ns, where $1.5 = (3 \times 1)/2$. Suppose that there are $|F|$ resources in the class of computational elements $F$, and let $V_F \subseteq V$ be the set of instructions that execute on resources of class $F$. The time required to execute all the instructions of the same iteration that execute on resources of class $F$ is at least:

$$\frac{1}{|F|} \sum_{v \in V_F} d(v). \qquad (12)$$

Hence, we have to wait at least the time expressed by (12) before starting to execute the next instance of any one of those instructions. The schedule is periodic. Thus, we have that:

$$P \ge \frac{1}{|F|} \sum_{v \in V_F} d(v). \qquad (13)$$

[Figure caption: The new arcs' weights are computed using equation (1).]

Since inequality (13) must be met for all the available classes of resources, we then have:

$$P \ge \max_{F} \frac{1}{|F|} \sum_{v \in V_F} d(v). \qquad (14)$$

Using (11) and (14), a lower bound on the period of any valid periodic schedule is then:

$$P \ge \max \left( \max_{C \in \mathcal{C}} \frac{\sum_{v \in C} d(v)}{\sum_{e_{u,v} \in C} w(e_{u,v})}, \ \max_{F} \frac{1}{|F|} \sum_{v \in V_F} d(v) \right), \qquad (15)$$

where $\mathcal{C}$ is the set of directed cycles of the graph.
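The resource-driven bound and its combination with the cycle bound can be checked numerically. The sketch below is a minimal illustration under assumed names (`resource_bound`, `period_lower_bound`, and the `unit_class` and `units` dictionaries are hypothetical); it reproduces the three-additions, two-adders example from the text and rounds up to an integer number of time steps, which is an extra assumption for integer schedules.

```python
from fractions import Fraction
from math import ceil


def resource_bound(delays, unit_class, units):
    """Largest, over resource classes, of (total delay on the class) / (units in the class)."""
    total = {}
    for v, d in delays.items():
        total[unit_class[v]] = total.get(unit_class[v], 0) + d
    return max(Fraction(t, units[c]) for c, t in total.items())


def period_lower_bound(cycle_bound, res_bound):
    """Combined bound: the larger of the dependency bound and the resource bound."""
    return max(cycle_bound, res_bound)


# Example from the text: 3 additions of delay 1 on 2 identical adders.
delays = {"add1": 1, "add2": 1, "add3": 1}
unit_class = {"add1": "adder", "add2": "adder", "add3": "adder"}
units = {"adder": 2}

rb = resource_bound(delays, unit_class, units)
print(rb)        # 3/2
print(ceil(rb))  # 2 when the period must be a whole number of time steps
```

For a loop whose dependency bound is, say, 3 time units, `period_lower_bound(Fraction(3), rb)` returns 3, since the dependency cycles are then the binding constraint.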
MINIMIZING LATENCY UNDER RESOURCE CONSTRAINTS AND FOR A TARGET INITIATION INTERVAL
As we have stated in Section 1, there are two parameters related to the technique of software pipelining: the Latency (L), which is the time required to execute all the instructions that constitute the body of the loop, and the Initiation Interval (II), which is the interval of time that separates the start execution times of any two consecutive iterations of the loop. Reducing the code size of a software-pipelined loop is very important for embedded systems as well as for systems-on-chip. Both kinds of systems have constraints on memory size, and hence the code size must be reduced for them. The size of the code also has an implicit impact on both the execution time and the power consumption. For a fixed II, minimizing the code size requires minimizing L. One of our objectives in this paper is then to minimize the code size by minimizing L for a certain target value of II.
Many approaches are proposed to realize the software pipelining technique. As it can be observed from Figure 3 , realizing that technique transforms to finding a valid periodic schedule with period equal to II. Regarding the relationship between that schedule and L, with the help of Figure 3 and the definition of in Section 5, we have that:
The problem of realizing the software pipelining technique while minimizing the size of the code (by minimizing L for a given II as explained above) transforms to the problem of determining a valid periodic schedule with a period II and a latency L (as defined by inequality (17)) such that L is minimal for a given value of II. The value of II is given by the user or computed by automatically trying various values starting from a lower bound, such as the one given by expression (15), until a minimal value for L is found. The latter problem constitutes our target in the rest of this section. We stated it in another manner as follows: 
Constraint #1:
Each vertex in the graph must have a unique start execution time.
Constraint #2:
The schedule must satisfy data dependency constraints.
Constraint #3:
The schedule must satisfy resource constraints.
Constraint #4:
Each vertex in the graph must finish executing no later than $L$.
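The four informal constraints can be checked mechanically for a candidate schedule. The following sketch is only an illustration under stated assumptions: time steps start at 0, functional units are not pipelined, resource usage is evaluated in the steady state by folding start times modulo the period, and all names (`slot_usage`, `violated_constraints`, the two-vertex example) are hypothetical rather than taken from the paper.

```python
def slot_usage(delays, unit_class, s0, P):
    """Count busy units per (class, time slot modulo P) in the steady state."""
    usage = {}
    for v, d in delays.items():
        for j in range(d):
            key = (unit_class[v], (s0[v] + j) % P)
            usage[key] = usage.get(key, 0) + 1
    return usage


def violated_constraints(delays, weights, unit_class, units, s0, P, L):
    """Return the numbers of the constraints (#1..#4) that the schedule violates."""
    # Constraint #1: every vertex has exactly one (non-negative) start time.
    if set(s0) != set(delays) or any(t < 0 for t in s0.values()):
        return [1]
    bad = []
    # Constraint #2: s0(v) - s0(u) >= d(u) - P * w(e_uv) for every arc.
    if any(s0[v] - s0[u] < delays[u] - P * w for (u, v), w in weights.items()):
        bad.append(2)
    # Constraint #3: never more busy units than available in any class.
    usage = slot_usage(delays, unit_class, s0, P)
    if any(n > units[c] for (c, _), n in usage.items()):
        bad.append(3)
    # Constraint #4: every vertex finishes no later than L.
    if any(s0[v] + delays[v] > L for v in delays):
        bad.append(4)
    return bad


# Hypothetical two-instruction loop: u (adder, delay 1) feeds v (multiplier, delay 2).
delays = {"u": 1, "v": 2}
weights = {("u", "v"): 0, ("v", "u"): 2}
unit_class = {"u": "adder", "v": "mult"}
units = {"adder": 1, "mult": 1}

print(violated_constraints(delays, weights, unit_class, units, {"u": 0, "v": 1}, P=2, L=3))  # []
print(violated_constraints(delays, weights, unit_class, units, {"u": 0, "v": 0}, P=2, L=2))  # [2]
```

The second call fails only on the data dependency from u to v, whose weight is zero; this is the kind of sequencing that the retiming discussed later in this section can relax.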
We will start by first transforming the informal definition of the above optimization problem to a formal one. The resulting formal definition is an Integer Linear Program (ILP). Normally that ILP must produce a solution to Problem 1 with an absolute minimal value for L. We will show that this is not the case. In fact, that ILP will produce a relative minimal value for L, which is an absolute minimal value for L relative to the directed cyclic graph that was used. To avoid that situation, we will unify that ILP and basic retiming to produce another ILP that will always produce an optimal solution to Problem 1 (the minimal value for L will not be sensitive to the graph used). The main idea is that instead of solving Problem 1 using the given directed cyclic graph, we will solve it using a retimed version of that graph, where the retiming function to be used is computed during the schedule determination.
We now focus on deriving an ILP, a formal version of Figure 4 . Let us start by translating constraint #1 to a formal constraint. We are looking for a schedule as the one defined by equation (7) . Since the period is given, then what we still need to compute is for each . Since is an upper bound on the latency to be minimized, we have from equation (17) 
From equations (8) and (19), we deduce that:
By definition of the schedule , we have for each . Hence, using binary variables and expression (20), we can then write each as follows:
and .
Constraint #1 in Figure 4 is now formally defined by expressions (21), (22) and (23).
We now focus on transforming Constraint #2 of Figure 4 into a formal one. The schedule must satisfy data dependency constraints. Hence, inequality (10) must be met. By combining expressions (10) and (21), the data dependency constraints are:
Recall that is the period of the schedule.
We now focus on devising a formal version of the resource constraints expressed by Constraint #3 in Figure 4. The schedule must be computed in such a way that, at any time $t$, the number of instructions that are executing on a class of computational elements, $F$, does not exceed $|F|$ (the number of computational elements of that class). We derive a mathematical formula for resource constraints as follows. Any instruction $v$ that is executing at time $t$ must have started to execute somewhere in the discrete interval $[t - d(v) + 1, t]$, which is what expression (26) states.

Software pipelining allows an iteration of the original loop to start executing before the previous iteration has finished its execution. Consequently, instructions that are executing at any time $t$ can be classified into two classes:
and . The class contains instructions belonging to the set of instructions of the first iteration of the original loop (i.e., no instance of anyone of those instructions is executed before). The class contains instructions from iterations of the original loop that are not from its first iteration (i.e., the instance of an instruction is executing, where ). The number of instructions that are executing at any time using the class of computational elements is the sum of some instructions from and some instructions from . Expression (26) holds for the case of instructions belonging to class . Hence, the number of instructions belonging to class that are executing (at any time ) using the class of computational elements is given by the following formula: (27) We focus now on deriving the number of instructions belonging to class that are executing (at any time ) using the class of computational elements . The schedule is periodic with period . Hence, the class is empty in the time interval . is not empty starting at time . As stated above, any instruction that is executing at time implies that has started to execute somewhere in the discrete interval . Since , this means that some instances of are executing and have been executed in the discrete interval , where (i.e., derived using Figure 3 and expression (16)). This implies that:
Let be a 0-1 known variable defined as follows:
, (29) where denotes the floor of x. Note that is 0 when , and 1 otherwise. Hence, equation (28) can be re-written as follows: (30) As we did for the case of class , the number of instructions belonging to class that are executing (at any time ) using the class of computational elements is given by the following formula:
Expressions (27) and (31) give the number of instructions that are executing at any time using the class of computational elements , . That number must not exceed . Hence, using (27) and (31), the resource constraints to be met by the schedule are then formally defined as follows:
. (32) A formal version for Constraint #4 of Figure 4 can be done by using expressions (19) and (21), and replacing in the right hand side of (21) by . We then obtain:
(33)

All the constraints in Figure 4 are now expressed mathematically. The resulting ILP is given in Figure 5.

[Figure 5: objective (34), subject to: Constraint #1: (21), (22) and (23) (expression (21) can be omitted since it is just a definition that is already substituted into the other constraints); Constraint #2; Constraint #3.]

Proof: Assume that we have 2 adders and 2 multipliers. Using the graph of Figure 1 (b) , the ILP in Figure 5 produces the schedule depicted in Figure 6 (a), which has . However, it is possible to get a schedule like the one of Figure 6 (b) with , by first pre-processing the graph before passing it to the ILP. ❏

Figure 6. Schedule for two functionally equivalent graphs.
One might ask why the input graph leads the ILP to miss the absolute minimal value for L. The answer is that the input graph already imposes a partial sequencing of the vertices (instructions of the loop's body). As one can deduce from expression (24), any two vertices of the graph that are connected by an arc having a weight equal to zero can never execute at the same time, even with an unlimited number of resources (because the destination of that arc can start executing only after the source of the arc finishes executing). To avoid that situation, one needs to reduce the number of arcs having a weight equal to zero. More precisely, one needs to reduce the length (in terms of time units) of paths composed of arcs of weight equal to zero (0-weight paths); this task is nothing else than the pre-processing we have mentioned above. The question now is how that pre-processing can be done. We focus in the rest of this section on answering that question.

[Figure 6 (b). Schedule using a pre-processed, functionally equivalent version of Figure 1 (b). The pre-processed graph in this case is the one in Figure 2. Vertices o1 and o4 execute in parallel because, after retiming, their data dependency is no longer within the same iteration.]
As we have introduced in Section 3, there is a close relationship between a directed cyclic graph modeling a loop and a directed cyclic graph modeling a synchronous sequential digital circuit. By thinking of the former graph as a directed cyclic graph modeling a synchronous sequential digital circuit, the weight of each arc can then be viewed as the number of registers on that arc. In this case, basic retiming can be used to move registers, thereby defining one possible pre-processing we are looking for. The pre-processing we did to obtain Figure 6 (b) is in fact a retiming, and the pre-processed graph passed to the ILP in this case is the one of Figure 2 .
In the case of limited resources, the pre-processing must be done during the schedule determination. Indeed, let us assume that we have now 1 adder and 2 multipliers instead of 2 adders and 2 multipliers assumed in proof of Lemma 1. Graphs in Figure 2 and Figure 7 (a) are two possible retimed graphs of Figure 1 (b) . The graph in Figure 2 is used to produce Figure 6 (b). If we again use graph in Figure 2 for the new resource constraints, we obtain a schedule with
. Vertices and will be assigned to time steps 1, 3, 4, and 3, respectively. Vertices and are serialized since we have only 1 adder. However, if we use the graph in Figure 7 (a), we obtain the schedule in Figure 7 (b) with only . Hence, it is clear that retiming cannot be decoupled from the schedule determination step.
We now agree that the pre-processing must be done during the schedule determination. The question is how this can be done. The pre-processing in our case consists of computing a retiming to be applied to the vertices of the graph. The retiming must be valid, which means it must satisfy expression (3). The weight of each arc after any retiming is defined by equation (1). Since the retiming will be computed during the schedule determination, the weight of each arc is now an unknown variable, equal to:

$$w_r(e_{u,v}) = w(e_{u,v}) + r(v) - r(u), \quad \forall e_{u,v} \in E, \qquad (35)$$

where $w(e_{u,v})$ is the original weight of the arc from $u$ to $v$ and $r(\cdot)$ is the retiming function.

Figure 7. The right retiming for the pre-processing can always be obtained only if retiming and scheduling are unified.
In the ILP of Figure 5 , the only constraints that depend on the weight of arcs are the data dependency constraints which are expressed by (24). By using (24) and (35), we obtain (36) where (i.e., (37) expresses the fact that retiming must be valid) ,
By putting together all the development above, an ILP that combines both the scheduling and the pre-processing (i.e., applying basic retiming) is given by Figure 8.

[Figure 8. Unifying scheduling and retiming to optimally solve Problem 1. Objective: (39). Subject to: Constraint #1: (22) and (23); Constraint #2: (36); Constraint #3; valid retiming: (37); retiming takes values on Z: (38).]
The ILP of Figure 8 depends on II. To solve it, we then need to fix II. Again, if the value of II is not provided by the user, then the algorithm Solve_the_ILP can be used to determine such a value and solve this ILP.
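To see the effect of unifying retiming and scheduling on a tiny instance without an ILP solver, the sketch below replaces the ILP by exhaustive search. This substitution is an assumption made purely for illustration, as are the function name `min_latency`, the parameter `r_range`, the convention that time steps start at 0, the definition of latency as the largest finish time of the first iteration, and the two-vertex example graph. For each candidate retiming it recomputes the arc weights, discards invalid retimings, and keeps the valid periodic schedule with the smallest latency.

```python
from itertools import product


def min_latency(delays, weights, unit_class, units, P, L_max, r_range=(0,)):
    """Exhaustively search retimings and start times; return (L, r, s0) or None."""
    verts = list(delays)
    best = None
    for r_vals in product(r_range, repeat=len(verts)):
        r = dict(zip(verts, r_vals))
        # Retimed weights: w_r(e_uv) = w(e_uv) + r(v) - r(u).
        w_r = {(u, v): w + r[v] - r[u] for (u, v), w in weights.items()}
        if min(w_r.values()) < 0:
            continue  # invalid retiming
        starts = [range(L_max - delays[v] + 1) for v in verts]
        for s_vals in product(*starts):
            s0 = dict(zip(verts, s_vals))
            # Data dependencies on the retimed graph.
            if any(s0[v] - s0[u] < delays[u] - P * w for (u, v), w in w_r.items()):
                continue
            # Resource usage per (class, slot modulo P), non-pipelined units.
            usage = {}
            for v in verts:
                for j in range(delays[v]):
                    key = (unit_class[v], (s0[v] + j) % P)
                    usage[key] = usage.get(key, 0) + 1
            if any(n > units[c] for (c, _), n in usage.items()):
                continue
            L = max(s0[v] + delays[v] for v in verts)
            if best is None or L < best[0]:
                best = (L, r, s0)
    return best


# Hypothetical loop: u (adder, delay 1) -> v (multiplier, delay 2), back arc of weight 2.
delays = {"u": 1, "v": 2}
weights = {("u", "v"): 0, ("v", "u"): 2}
unit_class = {"u": "adder", "v": "mult"}
units = {"adder": 1, "mult": 1}

no_retiming = min_latency(delays, weights, unit_class, units, P=2, L_max=3)
unified = min_latency(delays, weights, unit_class, units, P=2, L_max=3, r_range=(-1, 0, 1))
print(no_retiming[0])  # 3
print(unified[0])      # 2
```

With the original graph, the zero-weight arc forces v to start after u finishes, so the latency cannot drop below 3. Allowing a retiming that puts one register on each arc lets both instructions start at time 0 and yields a latency of 2 for the same period.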
MINIMIZING PEAK POWER UNDER RESOURCES, LATENCY, AND INITIATION INTERVAL CONSTRAINTS
Suppose that we want to accelerate the loop in Figure 1 (a) to achieve a latency and initiation interval , using two adders and two multipliers. And assume that each adder (multiplier) has execution delay equal to 1ns (2ns) and power consumption equal to 20 mW (100mW). We previously showed that without retiming, applying software pipelining on that loop will lead to . A possible retiming that allows to obtain is the one that leads to the graph in Figure 2 . Using the graph in Figure 2 , we obtain two possible schedules given by Figure 9 . These schedules satisfy timing constraints (i.e., ), but differ in terms of peak power. The power consumed at each time step is given on the right hand side of each schedule. The peak power for Figure 9 (a) is 100mW while it is only 70mW for Figure 9 (b). Our objective is then to propose an approach that allows to compute periodic schedules (i.e., to realize software pipelining) that meet timing and resource constraints but require the minimum peak power consumption. More precisely, our objective is to solve the following problem:
Problem 2: Given a directed cyclic graph modeling a loop, our objective is to find a valid periodic schedule with a target period II and a target latency L, under resource constraints, but with a minimal peak power consumption.
The left hand side of expression (32) gives the number of instructions that are executing at any time using the class of computational elements , . If we do not take care about which class of computational elements is used at time , then from (32) the number of instructions that are executing at time is: 
[Figure 7 (b). Schedule of (a) using 1 adder and 2 multipliers.]
Let be the power consumed by the operation at any time step . The total power consumed by operations that are executing at any time is the sum of their 's. Hence, using (40), is formally defined as:
The peak power is defined as $\mathit{PeakPower} = \max_{t} \mathit{Power}(t)$, where $\mathit{Power}(t)$ denotes the total power consumed at time $t$ as defined by (41). This implies that:

$$\mathit{Power}(t) \le \mathit{PeakPower}, \quad \forall t. \qquad (42)$$

When the latency is fixed to a target value, expression (39) can be omitted, and the resulting ILP computes a valid periodic schedule with period II (the initiation interval) and latency L. That resulting ILP can then be extended to solve Problem 2. Indeed, what we have to do is to add expressions (41) and (42) to the constraints of that resulting ILP and then replace (39) by the following expression:

$$\text{Minimize } \mathit{PeakPower}. \qquad (43)$$

Figure 9. Schedules may differ in terms of peak power.
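The notion of peak power for a periodic schedule can be illustrated with a few lines of code. The sketch below is a simplified model and not the paper's formulation: it assumes that an operation draws its full power on every time step it occupies, that time steps start at 0, and that the steady state (start times folded modulo the period) determines the peak. The names `power_profile` and `peak_power`, the three-operation example, and its power values are hypothetical; dependencies and resource limits are deliberately ignored here.

```python
def power_profile(delays, power, s0, P):
    """Total power drawn in each time slot of the steady-state period."""
    profile = [0] * P
    for v, d in delays.items():
        for j in range(d):
            profile[(s0[v] + j) % P] += power[v]
    return profile


def peak_power(delays, power, s0, P):
    """Largest per-slot power over one period."""
    return max(power_profile(delays, power, s0, P))


# Hypothetical operations: one multiplication (2 steps, 100 mW), two additions (1 step, 20 mW).
delays = {"m": 2, "a1": 1, "a2": 1}
power = {"m": 100, "a1": 20, "a2": 20}

schedule_a = {"m": 0, "a1": 0, "a2": 1}  # additions overlap the multiplication
schedule_b = {"m": 0, "a1": 2, "a2": 2}  # additions moved to the idle slot

print(power_profile(delays, power, schedule_a, P=3))  # [120, 120, 0]
print(power_profile(delays, power, schedule_b, P=3))  # [100, 100, 40]

# Picking the candidate with the smallest peak mirrors the objective "minimize PeakPower".
best = min([schedule_a, schedule_b], key=lambda s: peak_power(delays, power, s, P=3))
print(best is schedule_b)  # True
```

Both candidate schedules have the same period and the same latency, yet their peaks differ (120 mW versus 100 mW), which is the degree of freedom that the peak-power objective exploits.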
DISCUSSIONS AND RELATED WORK
Software pipelining is not a new technique. It was proposed many years ago to optimize timing for parallel processors such as VLIW and superscalar ones. Many approaches have been proposed for realizing that technique with unlimited as well as limited numbers of resources. With an unlimited number of resources, the problem is optimally solvable in polynomial run-time. With limited resources, the problem is NP-hard in general. Due to space limitations, we refer the reader to [4] [8] for a literature review of many proposed approaches to that problem. We restrict ourselves to approaches that are close to the problem we address in this paper.
Rotation scheduling [7] is a heuristic that realizes software pipelining under resource constraints with a shorter initiation interval II. While the approach in [6] starts with a tight II and iteratively increases it when a schedule cannot be found, rotation scheduling finds an approximate solution to the problem and then iteratively improves it. The value of II is iteratively shortened by rotating some vertices of the graph and then re-scheduling them. Each rotation is in fact a retiming. The heuristic does not control the latency of the schedule, which might lead to a large value of L. Recall that a large value of L implies that the code size of the Prologue and the Epilogue after applying software pipelining will be large. An approach to reduce the code size of the Prologue and Epilogue is proposed in [3].
To the best of our knowledge, this is the first paper that addresses the problem of minimizing peak power consumption for a target latency and initiation interval, and under resource constraints.
EXPERIMENTAL RESULTS
The objective of this experimentation is to test the effectiveness, in terms of relative timing improvement (which has a relationship with code size of the Prologue and the Epilogue of the optimized loop), relative peak power reduction, and execution time of the proposed approach. To this end, we think of cyclic graphs modeling some real-life filters as cyclic graphs modeling loops. The names of these filters are given in the first column of Tables 1 and 2 .
We assume that we have a hypothetical processor with 3 adders and 2 multipliers. Each adder has an execution delay of 1 ns and a power consumption of 20 mW. Each multiplier has an execution delay of 2 ns and a power consumption of 100 mW.
All the experiments were done using an UltraSparc 10 with 1 GB of RAM. For the results in Tables 1 and 2, we developed a C++ tool and implemented the algorithm Solve_ILP to solve the ILPs in Figures 5 and 8 as well as the one described in Section 7. The input of the tool is a graph modeling each filter, as well as the resource constraints and their related features. For step 3 of the algorithm Solve_ILP, we used the lp_solve tool available at [2].
For Table 1, the C++ tool reports a lower and an upper bound on the latency (the second and third columns, respectively), a lower bound on II using the right-hand side of expression (15) (fourth column), and the value of II used to compute the schedule (fifth column; this is the value of II used to solve the ILPs in Figures 5 and 8). Columns 6 and 7 of Table 1 report the value of L and the run-time when the ILP in Figure 5 is solved. Columns 8 and 9 report the value of L and the run-time when the ILP in Figure 8 is solved. Column 10 reports the relative reduction of the latency achieved by the ILP of Figure 8 over that of Figure 5. As can be observed, the relative reduction of the latency is 60.19% on average, and the run-time for solving the two ILPs is less than 30 s on average.
The C++ tool is also used to assess the approach proposed in Section 7 to minimize peak power consumption. We use the same circuits as those used in Table 1. We fixed L and II to the minimal values found in Table 1 (see columns 2 and 3 of Table 2). The results are summarized in Table 2. For column 4, we first solve the ILP in
Figure 8 without (39) and then we compute the peak power. The run-time for this task is reported in column 5. For column 6, we solve the ILP proposed in Section 7 to minimize peak power consumption, and then we compute the peak power of the resulting schedule. The run-time for this task is reported in column 7. Column 8 reports the relative reduction of peak power obtained for column 6 with respect to column 4.
As we can observe from Table 2, the proposed approach is able to reduce peak power consumption by 13.17% on average even though L and II are set to their minimal values. If L and II are set to values greater than the ones used, then more peak power reduction could be obtained. Indeed, for the circuit named Example in Table 2, the table shows that peak power was not reduced. However, in Section 7 we showed that peak power for that circuit can be reduced from 100 mW to 70 mW for the timing targets considered there.
CONCLUSIONS
For loops optimized by software pipelining, we have shown that there is a relationship between the latency and the code size. An increase of latency implies an increase of the code size. Also, decreasing latency reduces the idleness of computational elements. We have shown that optimizing loops by applying software pipelining alone can lead to a sub-optimal value of the latency compared to unifying basic retiming and software pipelining. We have proposed an ILP to realize that unification.
For software-pipelined loops, concurrency between instructions increases, which implies that more computational elements are operating at the same time. Thus, peak power would increase. However, by choosing a good schedule, it is possible to reduce peak power consumption while still meeting the same target timings. Indeed, peak power can be reduced by using the ILP that we have proposed in this paper. To the best of our knowledge, this proposed approach is the first one in the literature that deals with peak power consumption in the context of software pipelining.
The proposed ILPs are flexible and could be extended to address other problems related to software pipelining. An example of such problems is reducing the number of registers. That problem can be solved with the proposed ILPs by adding constraints to these ILPs.
