Many design techniques have been proposed to optimize the performance of a digital system implemented in a given technology. These techniques include retiming, insertion of intentional clock skew, insertion of buffers, transistor sizing, resynthesis, and wave pipelining. Each of these techniques can be advantageous in particular applications, and they are often applied individually to enhance performance. In this paper, a mixed integer linear programming formulation is derived in which the performance of an edge-triggered design is optimized by simultaneously adjusting clock skew, retiming the circuit, and applying wave pipelining methods. This formulation is applicable to a broad range of digital circuits. We have applied our formulation to three design examples: a RISC microprocessor, a multiplier, and a correlator. In doing so, we were able to reduce the system clock period severalfold in each design. A benefit of this approach to timing optimization is that tolerance to unintentional clock skew is a by-product of the computation, and the resulting designs often exhibit improved margins against it relative to non-optimal digital circuits.
Introduction
Clock period optimization is crucial in high performance digital systems. Various techniques for clock period reduction have been proposed, including transistor sizing, logic resynthesis, intentional clock skew, cycle stealing with transparent latches, retiming, and wave pipelining.
Among these optimization techniques, wave pipelining [11, 12, 13, 14] is the newest. Implementations of wave pipelining in different technologies have been reported [10, 13]. In conventional pipelining, input data cannot be strobed into the combinational block until the data being processed is clocked out. In contrast, wave pipelining allows input data to be clocked into the combinational block before the current result is clocked out. In other words, two or more elements of the input sequence may be present simultaneously in a combinational block. This implies that the clock period is not restricted by the longest propagation delay of the combinational block. It is clear that the insertion of intentional clock skew, as proposed by Fishburn [4], is a special type of wave pipelining technique. Most previous work on wave pipelining focused mainly on the design of a single stage combinational block [10, 13]. However, the theory of a wave pipelined system containing feedback loops and reconvergent fanout paths has not been fully explored [11, 12, 13]. Ekroot deals with this problem but with a restriction of equal average multiplexing degree, explained further in Section 3, for each combinational block [12]. Other works only consider a single closed loop system [11, 15]. In this paper, we present a more general model for systems including feedback loops and reconvergent fanout paths. In developing the model, we only consider designs with edge-triggered latches, hereafter called registers. The treatment of transparent latch-based designs will be discussed in a related work [2].
The propagation delays of different stages of combinational logic can be equalized by relocating registers, which leads to a reduced clock period. Leiserson first reported this retiming technique by relocating registers in zero-skew synchronous edge-triggered designs [5]. Attempts have been made to further reduce the clock period of a system by combining retiming with other optimization techniques. Examples include the combination of retiming with resynthesis [1, 3] and retiming for latch-based designs [7, 8, 9]. In this paper, we present a mixed integer linear programming formulation in which optimization of retiming, intentional clock skew, and wave pipelining can be modeled concurrently.
Optimization of buffer padding is also easily incorporated into our formulation. We have applied our formulation to designs of a RISC microprocessor, a multiplier, and a correlator and reduced the clock period by more than threefold in each case. The formulation also admits the analysis of design robustness. Due to process and environmental variations, some amount of unintentional clock skew may be associated with each register. Robustness, defined as the maximum tolerance of unintentional clock skew, depends on the longest and shortest path delays. In a zero-skew design, whether or not it contains a feedback loop, robustness increases as the clock period is increased but eventually saturates. For a non-zero skew design with feedback loops, robustness saturates as well, and its upper bound can be derived theoretically. However, robustness for a non-zero skew design is always larger than that of a zero-skew design.
This paper is organized as follows. Section 2 defines the terminology used for retiming and wave pipelining. In Section 3, we discuss in detail the difficulty and complexity incurred by feedback loops and reconvergent fanout paths during optimization of a single phase design via wave pipelining. Section 4 provides a graph model for a pipelined system with fixed locations of registers. When optimizing wave pipelined systems, LP constraints are formulated to preserve these systems' functionality. In Section 5, these constraints are modified to accommodate the retiming process in the optimization formulation. In Section 6, the tolerance to unintentional clock skew is studied for both zero-skew and non-zero skew design styles. Experimental data for other circuits is provided in Section 7. Section 8 considers the padding problem. Finally, directions for future research are given.
Background
In this section, we first define the proposed timing model and then give a brief literature review of retiming and wave pipelining.
Timing Model
The longest propagation delay, $t_{\max}(v)$, and the shortest propagation delay, $t_{\min}(v)$, are defined over every gate or combinational block, which is referred to as a node $v$ in this paper. For a node $v$, its outputs are generated if all its inputs are stable for at least a minimum amount of time, denoted as $t_{stable}(v)$ [11, 12]. These amounts are assumed to be measured under worst case conditions. Although the timing model assumes constant longest and shortest propagation delays for all input-output pairs of an individual node, it can be easily generalized to the case in which propagation delays for each input-output pair of a node are not equal.

[...] For a path $p = v_0 \xrightarrow{e_0} \cdots \xrightarrow{e_{n-1}} v_n$, the path weight, $w(p)$, is the sum of the weights of its edges. An important observation is that retiming never changes the difference of two path weights, $w(p)$ and $w(q)$, if the paths start and end at the same nodes. This also implies that the weight, $w(l)$, of a loop $l$ is invariant under retiming.
Wave Pipelining
Wave pipelining is a timing methodology used in digital systems to achieve maximal rate operation [4, 11, 12, 13, 14]. Using this technique, new data is applied to the inputs of a combinational block before the previous data is clocked out, thus effectively pipelining the logic and maximizing the utilization of the logic without inserting registers. Experimental designs of an adder [13] and a population counter [10] showed that the clock period can be reduced by a factor of two or three. Reconvergent fanout paths are defined as simple paths which share both start and end nodes. Balanced reconvergent fanout paths exist if there are the same number of registers along each reconvergent fanout path. A pipelined system with unbalanced paths will be discussed in Section 3.
The following description applies only to a pipelined system with balanced paths. As shown in Fig. 1(a), a host machine applies data, $d_i$, through external input registers to a pipelined system at time $i \cdot c$, where $c$ is the system cycle time, and extracts data from it through external output registers. In the following discussion, unless stated explicitly, these external registers will not [...]

In order to pass the signal successfully in a wave pipelined design, the following constraint, called the internal node constraint [13], must be satisfied.
As shown in Fig. 1(b), if data $d_i$ is supposed to be clocked at the output of node $v$, then the value of $s(v)$ plus the setup time should be smaller than or equal to the clocking time of this register, which is known as the zero clocking constraint [4]. The earliest arrival time at the output of node $v$ is $b(v)$ with respect to $d_i$. In order to prevent later data from overriding the previous data, the value of $b(v)$ plus the clock period $c$ should be greater than or equal to the sum of the clocking time and the hold time of this register. This is also known as the double clocking constraint [4].
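The two register conditions described above can be checked numerically for a single register. The following is a minimal sketch, assuming that the clocking time is given on the same time axis as the arrival times; the function name and parameter names are illustrative and are not taken from the formulation itself.

```python
def register_timing_ok(s_v, b_v, clock_time, c, t_setup=0.0, t_hold=0.0):
    """Check the zero- and double-clocking conditions for one register.

    s_v: latest arrival time of data d_i at the output of node v
    b_v: earliest arrival time of data d_i at the output of node v
    clock_time: time at which the register is meant to capture d_i
    c: clock period
    """
    # Zero clocking: latest arrival plus setup time must not exceed the clocking time.
    no_zero_clocking = s_v + t_setup <= clock_time
    # Double clocking: the next data item arrives no earlier than b_v + c,
    # which must not fall before the end of the hold window.
    no_double_clocking = b_v + c >= clock_time + t_hold
    return no_zero_clocking and no_double_clocking
```

Both conditions must hold at every register; shortening the clock period tightens only the second one, which is why the shortest path delay matters in wave pipelining.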
Preliminaries
In this section, we focus on single phase pipelined systems with unbalanced paths. Fig. 2 [...]

For a pipelined system with balanced paths, constraints at any node for wave pipelining are easily derived from its fan-in nodes as described in the previous section. However, it becomes more complicated to derive constraints in the case of a pipelined system with unbalanced paths (e.g. for node $t$). As described previously, there are two alternatives depending on the route taken by $d_i$.
Constraint formulation based on either of these two routes is valid, given the proper manipulation described in the following paragraph. In other words, if a feasible solution exists for one formulation, then a feasible solution also exists for the other without modifying the amount of non-zero skew at each register. Thus the constraints can be formulated by choosing either alternative arbitrarily. In the graph model of a general pipelined system, a spanning tree is first found in order to define a pipelined system with balanced paths. Then each of the remaining edges in the original graph, called chords, defines reconvergent fanout paths based on this spanning tree. The shift of time origin associated with each chord can then be calculated accordingly.
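The split of the graph into a spanning tree and chords can be sketched as follows. This is only an illustration under stated assumptions: edge direction is ignored when testing connectivity (so that a chord may close either a loop or a semi-loop), the tree is whichever one the edge order happens to produce, and the function name is hypothetical. The calculation of the shift of time origin for each chord is not shown.

```python
def spanning_tree_and_chords(nodes, edges):
    """Split directed edges into a spanning forest and the remaining chords."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree, chords = [], []
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            chords.append((u, v))  # this edge closes a loop or semi-loop
        else:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree, chords
```

For a connected graph the number of chords is always |E| - |V| + 1, regardless of which spanning tree is chosen.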
It is important to note that in a feedback loop $d_i$ would pair not only with $d_{i-mn}$ but also with $d_{i-2mn}$, $d_{i-3mn}$, etc. Due to the clock's repetitive nature, it is enough to consider only the constraint due to $d_{i-mn}$.
Clock Optimization via Wave Pipelining
In this section, based on the graph model, we develop constraints required for wave pipelining in a general pipelined system.
Graph Model for Pipelined Systems
A pipelined system is modeled as a directed graph $G = (V, V_{fo1}, V_{fo2}, V_l, V_{pi}, V_{po}, E, w, t_{\max}, t_{\min}, t_{stable}, os)$, where $V$ is the set of functional nodes in the pipelined system [...] Fig. 3(a).

Three kinds of dummy nodes are created in the model. A dummy node $v_{f1} \in V_{fo1}$ is inserted in an edge $e(u, v)$ if node $u$ has more than one fanout edge and $w(e) \ge 1$. This removes the restriction of applying the same clocking time to the registers at every fanout edge of node $u$. In this procedure, edge weight $w(e(u, v_{f1}))$ is set to 0 and $w(e(v_{f1}, v))$ is set to $w(e)$. If more than one register is required to appear at an edge $e(u, v)$, another dummy node $v_{f2} \in V_{fo2}$ can be inserted such that at most one register is on each edge. Edge weight $w(e(u, v_{f2}))$ is set to 1 and $w(e(v_{f2}, v))$ is set to $w(e) - 1$. Both $t_{\max}$ and $t_{\min}$ of nodes $v_{f1}$ and $v_{f2}$ are set to 0. As explained in the previous section, a chord, $e(u, v)$, defines a loop or semi-loop in a spanning tree. A dummy loop node $v_l$ is added along the chord if the resulting loop or semi-loop is unbalanced. The shift of time origin for node $v_l$ with a multiplexing degree of 1 is $os(v_l)$. In this insertion process, edge weight $w(e(u, v_l))$ is set to zero and $w(e(v_l, v))$ is set to $w(e)$.

For the example in Fig. 3(b), $V = \{1, 2, 3, 4, 5, 6, 7\}$ and $V_{pi}$, $V_{po}$, and $V_{fo2}$ are empty. Dummy fanout nodes 11, 12, 13, and 14 $\in V_{fo1}$ are inserted in the fanout edges $e(2, 7)$, $e(2, 5)$, $e(6, 4)$, and $e(5, 1)$ of nodes 2, 6, and 5. The solid lines and nodes form a spanning tree, and the dotted lines are chords. The loops or semi-loops created by chords $e(7, 3)$, $e(13, 4)$, and $e(14, 1)$ are not balanced, so dummy nodes 8, 9, and 10 are inserted along these chords, respectively. The resulting shifts of time origin are 2 for node 8, 1 for node 9, and 3 for node 10.
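The rule for edges that carry more than one register can be applied repeatedly until every edge holds at most one register. The sketch below illustrates this; the function name, the edge representation as `(tail, head, weight)` triples, and the `new_node_ids` iterator of unused labels are assumptions made for the example.

```python
def split_multi_register_edge(u, v, weight, new_node_ids):
    """Split edge (u, v) carrying `weight` registers into unit-weight edges.

    Each inserted dummy node plays the role of a V_fo2 node, whose
    t_max and t_min are both 0. Returns a list of (tail, head, weight).
    """
    result = []
    tail, remaining = u, weight
    while remaining > 1:
        dummy = next(new_node_ids)
        result.append((tail, dummy, 1))           # w(e(u, v_f2)) = 1
        tail, remaining = dummy, remaining - 1    # w(e(v_f2, v)) = w(e) - 1
    result.append((tail, v, remaining))
    return result
```

The total number of registers between `u` and `v` is preserved, so path and loop weights in the graph are unchanged by the insertion.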
Constraints for Wave pipelining
Constraints for valid wave pipelining with a multiplexing degree m are presented in this section.
Constraints are classified into the categories of delay constraints, synchronous constraints, loop/semi-loop constraints, wave pipelining constraints, and initiation constraint.
In our model, $c$ is the clock period. Let $t_s$, $t_h$, $t_{cl}$, and $t_{cs}$ denote the setup time, hold time, longest propagation delay, and shortest propagation delay of a register, respectively. Variables $s(v)$ and $b(v)$ are the latest and earliest arrival times, respectively, at the output of node $v$. Variables $g(v)$ and $h(v)$ are the latest and earliest arrival times, respectively, at the input of node $v$. For a node $u$ connected to a register, $s(u) + t_s$ implicitly represents the clocking time of this register.
Both delay and synchronous constraints define the timing relationship between node $v$ and its fan-in node $u$. This depends on whether there exists a register in the fan-in edge $e(u, v)$. Constraint [...] Constraints (3), (4), (7), and (8) can be interpreted similarly. Loop/semi-loop constraints incorporate the effects of unbalanced reconvergent fanout paths and feedback loops which have been discussed in the previous section. Three wave pipelining constraints are implicitly or explicitly included in the LP formulation. Since $s(u)$ is the latest arrival time at the output of node $u$, the zero clocking constraint is satisfied implicitly by the fact that clocking is at time $s(u) + t_s$. Constraints (11) and (12) explicitly express the double clocking constraint and the internal node constraint.
Correctness of the formulation
In this section, the correctness of the LP formulation is proven by examining the pipelined system shown in Fig. 2(b). Since any reconvergent fanout paths can be regarded as such a structure, the proof does not lose generality. For such a pipelined system, modeling could result in two different graph representations as shown in Fig. 4. Two lemmas are presented. These lemmas imply that a design can be verified using the constraints formulated with an arbitrary spanning tree instead of verifying all possible spanning trees.
Special Considerations
A designer may require that registers in some fan-in/fanout edges of a node be clocked without skew among them. The graph model can be modified for such cases. Consider the example shown in Fig. 5(a), in which the registers at edges $e(a, f)$ and $e(b, f)$ are required to be clocked without skew between them. A dummy node, $d$, can then be inserted as shown in Fig. 5(b). Similarly, in the example shown in Fig. 5(c), the registers at edges $e(f', a')$ and $e(f', b')$ are required to be clocked without skew between them. A dummy node $d'$ is inserted as shown in Fig. 5(d). Other considerations might be required for the preservation of temporal equivalence after applying wave pipelining [12, 18]. This can be ensured by the following constraint, where $k$ is the number of registers along the spanning tree from $s(u)$ to $s(v)$.

$$s(v) + t_s - s(u) = (k + 1)\, m\, c, \qquad u \in V_{pi} \text{ and } v \in V_{po} \tag{14}$$

Clock Optimization via Retiming
In the previous section, an LP is formulated for clock period minimization when the locations of the registers are fixed. Retiming reduces the clock period by re-positioning registers. The constraints in the LP formulation must be modified to allow retiming and wave pipelining to be applied simultaneously. The optimization formulation then becomes a mixed integer linear programming problem.
Mixed Integer LP Formulation
Since the positions of the registers after retiming are not known in advance, the graph model must be modified accordingly. For a node with multiple fanout edges, a dummy fanout node is inserted at each fanout edge. For example, Fig. 3(c) illustrates the modified graph for the system shown in Fig. 3(a). $V_{fo1}$ now becomes the set $\{11, 12, 13, 14, 15, 16, 17\}$, and both $t_{\max}(v)$ and $t_{\min}(v)$ for every node $v \in V_{fo1}$ are set to zero. However, the shift of time origin for each node of $V_l$ is invariant under retiming. If more than one register is allowed at an edge, a dummy node $v \in V_{fo2}$ should be inserted along the edge. In this example, at most one register will appear at each edge.
For an edge $u \xrightarrow{e} v$, retiming sets the number of registers at edge $e$ to $w(e) - r(u) + r(v)$, which must be a nonnegative integer. These retiming constraints are expressed as follows:

$$w(e) = r(u) - r(v) \qquad \text{for } u \xrightarrow{e} v \text{ and } v \in V_{fo1} \cup V_l \tag{15}$$

$$w(e) \ge r(u) - r(v) \qquad \text{for } u \xrightarrow{e} v \text{ and } v \in V \cup V_{fo2} \cup V_{po} \tag{16}$$

$$w(e) \le r(u) - r(v) + 1 \qquad \text{for } u \xrightarrow{e} v \text{ and } v \in V \cup V_{fo2} \cup V_{po} \tag{17}$$

As described in Section 4.1, node $v \in V_{fo1}$ is used to remove the restriction of applying the same clocking time to the registers on every fanout edge of node $u$, and node $v \in V_l$ is used to subtract the amount of the shift of time origin associated with the chord. Therefore, by definition, the number of registers at the fan-in edge of node $v \in V_l \cup V_{fo1}$ is set to 0 after retiming, as shown in constraint (15). Constraints (16) and (17) restrict the number of registers at an edge to either zero or one. However, there are some special cases, such as $w(e(7, 3))$ in Fig. 3(a), which must be greater than or equal to one since the register file is a system functional block.
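The effect of a retiming label on the register counts, and the legality conditions on the result, can be illustrated with a short sketch. The data layout (a dictionary from edges to weights), the function names, and the `zero_weight_heads` argument standing for the nodes of the two dummy sets whose fan-in edge must carry no register are assumptions made for this example.

```python
def retimed_weights(edges, r):
    """Return the register count on each edge after retiming.

    edges: dict mapping (u, v) -> w(e), the original register count
    r: dict mapping node -> integer retiming label
    """
    return {(u, v): w - r[u] + r[v] for (u, v), w in edges.items()}


def retiming_is_legal(edges, r, zero_weight_heads=frozenset()):
    """Check that every retimed edge carries an allowed number of registers.

    Edges entering a node in zero_weight_heads must end up with 0 registers;
    all other edges must end up with 0 or 1 register.
    """
    for (u, v), w_new in retimed_weights(edges, r).items():
        if v in zero_weight_heads:
            if w_new != 0:
                return False
        elif not 0 <= w_new <= 1:
            return False
    return True
```

Because each label is added once and subtracted once around any loop, the total register count of a loop is the same before and after retiming.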
Some constraints formulated in the previous section are modified by including these new integer variables $r(v)$. Thus the constraint set becomes as follows: [...]

Given a clock period $c$, if these constraints are satisfied by variables $s$, $b$, $g$, $h$, and $r$, then the clock period is feasible. However, solving the problem directly is impeded by the qualifying clauses in constraints (18)-(26). By appropriate transformations, the optimization with these constraints becomes a mixed integer LP problem. Such a transformation is only valid if $t_h \ge t_{cs}$. Nevertheless, this condition can always be satisfied by increasing the value of $t_h$ until it is equal to $t_{cs}$.
Theorem 1 A pipelined system is defined by $G = (V, V_{fo1}, V_{fo2}, V_l, V_{pi}, V_{po}, E, w, t_{\max}, t_{\min}, t_{stable}, os)$. If $G$ has a multiplexing degree of $m$, and $t_h \ge t_{cs}$ for all registers in $G$, then $G$ operates correctly at clock period $c$ if and only if there exists a feasible solution for the following mixed integer LP constraints. In the constraints, $R$, $T$, $P$, $Q$ are real variables and $r$ is an integer variable.
Optimal Clock Period
For a given clock period $c$, the mixed integer LP constraints can be tested to determine whether $c$ is feasible. In a single closed loop $l$ with $w(l)$ registers, the longest path delay along the loop is $T_{\max}(l)$.
The feasible clock period is bounded below by $\frac{T_{\max}(l)}{m \cdot w(l)}$ for a multiplexing degree $m$ [13], and the lower bound of the feasible clock period is equal to $\max_{\forall l} \frac{T_{\max}(l)}{m \cdot w(l)}$. This serves as a starting point for the search for an optimal clock period. In fact, this is a maximum-ratio-cycle problem, which can be solved in polynomial time using an algorithm by Hartmann [8] with a running time of $O(T(|E| + |V| \log |V|))$, where $T$ is the sum of all node delays.
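As a small illustration of the loop bound, the sketch below evaluates it for loops that have already been enumerated. It assumes each loop is supplied as a pair of its longest path delay and its register count; it does not implement the maximum-ratio-cycle algorithm mentioned above, and the function name is hypothetical.

```python
def clock_period_lower_bound(loops, m=1):
    """Largest value of T_max(l) / (m * w(l)) over the given loops.

    loops: iterable of (t_max, w) pairs, one per loop, where t_max is the
           longest path delay around the loop and w its register count
    m: multiplexing degree
    """
    return max(t_max / (m * w) for t_max, w in loops)
```

For instance, with two loops of delay 30 and 12 holding 2 and 1 registers respectively, the first loop dominates and the bound is 15 time units at a multiplexing degree of 1.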
Designers, however, usually care about a system with a multiplexing degree of 1. The following theorem facilitates the search for the optimal clock period.
Theorem 2 For a pipelined system with a multiplexing degree of 1, if clock period $c$ is not feasible, then no clock period smaller than $c$ is feasible.
Sketch of proof:
The feasible solution set of any LP problem must be convex. If there exists at least one feasible solution for a given configuration, it can be proven that the solution set must be unbounded above. The feasible solution set of the clock period for the above mixed integer LP can be regarded as the union of solution sets for all configurations of a system. This means that the feasible solution set of $c$ is continuous and unbounded above. $\Box$

Theorem 2 suggests a way to obtain the optimal clock period for a system with a multiplexing degree of one. The optimal clock period can be obtained by a binary search over possible clock periods.
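The binary search can be sketched as follows. The feasibility test is passed in as a callable, `is_feasible`, which in practice would wrap a run of a mixed integer LP solver on the constraint set; that wrapper, the function name, and the tolerance parameter are assumptions made for this example. The search relies on the monotonicity stated in Theorem 2, so it applies only to a multiplexing degree of 1.

```python
def minimal_feasible_period(is_feasible, lower, upper, tol=1e-3):
    """Binary search for the smallest feasible clock period in [lower, upper].

    is_feasible: callable taking a clock period and returning True or False
    lower: a lower bound on the feasible clock period
    upper: a clock period known to be feasible
    tol: width of the final search interval
    """
    if not is_feasible(upper):
        raise ValueError("upper bound is not feasible")
    while upper - lower > tol:
        mid = (lower + upper) / 2
        if is_feasible(mid):
            upper = mid   # mid works, so the optimum is at or below mid
        else:
            lower = mid   # mid fails, so every smaller period fails too
    return upper
```

The returned value is always a feasible period and lies within `tol` of the optimum; the loop lower bound is a natural choice for `lower`.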
Design Example: RISC Microprocessor
In this section, we demonstrate the application of our formulation to a RISC microprocessor design as shown in Fig. 6(a). The numbers inside the circles indicate the longest and shortest propagation delay for each node. For simplicity, the setup time, hold time, and propagation delay of a register are set to 0. This design was previously analyzed by Lockyear [8]. There are two fundamental differences between our model and Lockyear's. First, our model is based on edge-triggered registers rather than latches due to the stringent synchronization requirement of multi-staged wave pipelining. Second, in our model intentional skews at registers are variables and can be optimized. In contrast, in Lockyear's model intentional skew at every latch is simply fixed at a given value.

Figure 6: Optimal Configurations for Zero and Non-zero Skew Designs
Once the set of constraints is captured, an optimal solution is obtained with LINDO, a mixed integer linear programming solver. Without retiming, the cycle time of the given system is 71ns if no skew is allowed at each register (zero-skew design). With retiming, the system's optimum cycle time becomes 27ns. The result is shown in Fig. 6(b). According to our formulation, if non-zero skews are allowed at registers, the optimal clock period is further reduced to 20ns as shown in Fig. 6(c). The intentional clock skew for each register is shown by the number inside the rectangular box. It is worth pointing out that if each combinational block is carefully designed with balanced paths [10], the cycle time can be further reduced.
Design Robustness
Timing constraints are formulated so that at a feasible clock period a system is immune to zero and double clocking [4]. When a system is implemented, an amount of unintentional clock skew exists at each register due to manufacturing variations. In other words, registers are not clocked exactly at the specific moment for which they are designed. Generally, the exact moment is not predictable because of process variations, environmental and temperature fluctuation, different propagation delay variations along routing wires, etc. These uncertainties prevent a system from operating at its theoretically optimal clock period. In order to accommodate these uncertainties, a conservative approach is to operate the system at a larger clock period. However, we are interested in finding the maximal amount of safety margin, denoted by the variable $\delta$, by which the clock occurrence can deviate from the design specification time $t$. In other words, the system still operates correctly as long as the clock is delivered in the interval $[t - \delta, t + \delta]$. In this way, the allowable clock skew becomes a continuous range, instead of a single discrete value. Constraints (22)-(27) [...]

For a node $u$ which is connected to a register, the clocking time of this register now becomes $s(u) + t_s + 2\delta$. The additional amount of $2\delta$ is needed to prevent zero clocking in the presence of non-zero unintentional clock skew. This can happen when its input registers are clocked too late and its output registers are clocked too early. Conversely, when input registers are clocked too early and output registers are clocked too late, or when some input registers are clocked too early and some are too late, constraints (48) and (49) prevent the later data from overriding the previous data by the safety margin of $\delta$. Robustness is thus defined as the maximal amount of safety margin for a given clock scheme. In a non-zero skew design style, for a given clock period, the robustness can be obtained by maximizing the safety margin $\delta$.
Theorem 3 A pipelined system is defined by $G = (V, V_{fo1}, V_{fo2}, V_l, V_{pi}, V_{po}, E, w, t_{\max}, t_{\min}, t_{stable}, os)$. If $G$ has a multiplexing degree of $m$, a safety margin $\delta$, and $t_h + 2\delta \ge t_{cs}$ for all registers in $G$, then $G$ operates correctly at clock period $c$ if and only if there exists a feasible solution for the mixed integer LP constraints described as follows. In the constraints, $R$, $T$, $P$, $Q$ are real variables and $r$ is an integer variable. [...]

Theorem 4 A general loop containing $n$ registers is operated at a clock period $c$ and a multiplexing degree $m$. The [...] Summing all these inequalities leads to constraint (63).
Constraints (45) and (48) [...] If the multiplexing degree is 1, the right side of inequality (64) is $\frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n} T_{\min}(i) + t_{cs} - t_h\right)$, which does not depend on the clock period. It can be seen that for a general loop the robustness no longer increases once the clock period is increased to a threshold value that depends on the longest and shortest path delays.
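The saturation value for a multiplexing degree of 1 is simple to evaluate once the shortest stage delays around the loop are known. The following sketch computes the expression stated above, half of the mean shortest stage delay adjusted by the register's shortest propagation delay and hold time; the function and argument names are illustrative, and the input delays are assumed to be given in a common time unit.

```python
def robustness_upper_bound_m1(t_min_stages, t_cs=0.0, t_h=0.0):
    """Upper bound on the safety margin of a loop at multiplexing degree 1.

    t_min_stages: shortest propagation delay of each of the n stages in the loop
    t_cs: shortest propagation delay of a register
    t_h: hold time of a register
    """
    n = len(t_min_stages)
    return 0.5 * (sum(t_min_stages) / n + t_cs - t_h)
```

Since the clock period does not appear in the expression, lengthening the clock beyond the threshold cannot raise the margin; only lengthening the shortest paths can.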
Robustness for Design Examples
Depending on the skew at each register, a system can be either a zero-skew design or a non-zero skew one. In this section, the robustness of both styles is studied for the design example in Section 5.3. For simplicity, the setup time, hold time, and propagation delay of registers in both designs are zero.
In a zero-skew design, the robustness is calculated by the following formula [...] where $T_{\max}(i)$ and $T_{\min}(i)$ represent the longest and shortest propagation delay of combinational stage $i$. The minimum clock period is 27ns for the zero-skew design style, but the robustness at this period is zero. In contrast, for this clock period the robustness is 1.75ns if non-zero skew is employed. The solution for a non-zero skew design is a by-product of LINDO. Table 2 shows the robustness of both designs. It is important to note that the robustness of the zero-skew design saturates at $c$ = 32ns, while it saturates at $c$ = 34ns if non-zero skew is employed. One way to further increase the safety margin is to equalize the path delays. This is usually achieved through careful padding of the shortest paths.
Other Design Examples
Besides the study of the RISC microprocessor, our formulation has been applied to two additional examples: a correlator [5] and a multiplier [8]. Our method improves the performance of both designs if both retiming and wave pipelining are applied simultaneously.
Correlator
For the zero-skew design shown in Fig. 7(a), the clock period is 24ns before applying any optimization technique. Retiming can reduce the clock period to 13ns as shown in Fig. 7(b). Further reduction is possible if wave pipelining and retiming techniques are applied simultaneously. In this example, the optimal clock period is 6ns. Fig. 7(c) shows the locations of registers and their associated intentional skews. The skew of external input registers is 0, while the skew of external output registers is 12ns. Note that applying non-zero skew to the design shown in Fig. 7(b) yields a minimum clock period of 7ns, which is sub-optimal. The robustness for both the zero-skew and non-zero skew designs is shown in Table 3.

Multiplier

Fig. 8(a) shows a serial-parallel multiplier with a minimal clock period of 6ns. The effects of the shortest paths and register positions on the clock period are studied using the designs shown in Fig. 8(b) and 8(c). In Table 4, the same shortest propagation delay is assigned to nodes $v_1$, $v_2$, $v_3$, and $v_4$, as shown in the top entry of each column, and the same shortest propagation delay is assigned to nodes $v_5$, $v_6$, $v_7$, and $v_8$, as shown in the leftmost column. The optimal clock period, however, is independent of the shortest delays at nodes $v_9$, $v_{10}$, $v_{11}$, and $v_{12}$. The minimum clock period for the design shown in Fig. 8 [...] shown in Fig. 8(c) below the slash. It is worth pointing out that the optimal clock period is reduced when the shortest delays for these nodes are prolonged. However, the clock period eventually saturates as shown in Table 4. Beyond that point, it is no longer beneficial to pad the short paths. Another point is that different configurations need different paddings to achieve the same clock period. For the design shown in Fig. 8(c) [...]. In contrast, the design shown in Fig. 8(b) needs 0.4ns for the shortest path of nodes $v_5$, $v_6$, $v_7$, and $v_8$ to achieve the same clock period.
Optimal Padding
When wave pipelining is applied to a combinational block, the minimum clock period can be reduced by padding short paths [10]. When padding a pipelined system for a specific clock period, two considerations should be taken into account in order to minimize the area. The first is to determine which combinational block should be padded. For the multiplier design, the clock period is not reduced by padding nodes $v_9$, $v_{10}$, $v_{11}$, and $v_{12}$. Also, the same clock period can be achieved by padding different nodes in different configurations. However, the area cost incurred by padding each node and by re-positioning registers may be different. Therefore, padding should be handled as part of a global optimization, rather than as a postprocessing step. The second consideration is to determine by how much the short paths of a combinational block should be padded. As in the multiplier example, the clock period will eventually saturate, so padding a node past a threshold value is of no use. It has been shown that padding the short paths in a combinational block can be formulated as a mixed integer LP problem [16]. Area versus delay of the combinational block's shortest path can be approximated as a piecewise linear function by calculating area in terms of the different shortest path delays [17]. The global optimization of area can be formulated as follows.
Conclusion
We have first developed a graph model for digital systems and then proved that timing optimization is a mixed integer linear programming problem. The model is general in that feedback loops and reconvergent fanouts are allowed. This mathematical formulation is capable of optimizing a given system by simultaneously applying the techniques of adjusting clock skew, retiming the system, and wave pipelining. Our formulation has been applied to three design examples and has significantly improved each one when all three techniques are used concurrently. We have also defined and investigated the issue of design robustness. In a zero-skew design, whether or not it contains a feedback loop, robustness increases as the clock period is increased but eventually saturates. For a non-zero skew design with a feedback loop, robustness saturates as well, and its upper bound can be derived before executing an LP solver. Further research is needed to determine if there exists an upper bound for a non-zero skew design without a feedback loop. As shown in Section 6, a non-zero clock skew design is more robust than a zero clock skew one. In Section 5, we have also presented a procedure for obtaining the optimal solution for the mixed integer linear programming formulation. However, further research is needed to find an efficient algorithm for larger problems. Finally, this paper also addressed the problem of area minimization when both padding and retiming are used.
