Abstract-This paper describes PGR, an architectural technique to reduce dynamic power via GlitchLess or to improve performance via clock skew scheduling (CSS) and delay padding (DP). It is integrated into VPR 5.0, and is invoked after the routing stage. We use programmable delay elements (PDEs) as a novel architecture modification to insert delay on FF clock inputs, enabling all optimization steps to share it, avoiding multiple architecture modifications. The central theme of this paper is considering the trade-off between power and performance, and finding an appropriate compromise considering process variation and timing uncertainties. Overall, an average of 15% speedup can be achieved via CSS alone, or up to 37% for individual circuits. Although delay padding only benefits several circuits, the average improvement of those circuits is an additional 10% of the original period, or up to 23% for individual circuits. In addition, a new model to estimate glitching power is proposed, taking into account the analog behavior of glitch pulse width reduction as it travels along FPGA routing tracks. We show that the original glitch estimation method can underestimate glitching power by up to 48%, and overestimate by up to 15%. GlitchLess is performed on both the original VPR and post-CSS solutions. We are able to eliminate on average 16% of glitching power, and up to 63% for individual circuits.
I. INTRODUCTION
Power and performance are two very important issues in FPGA design. FPGA applications typically consume more power per operation, and run at slower speeds than their ASIC counterparts, due to circuitry needed for programmability.
There is much research effort addressing these two topics. On the performance front, two popular techniques are retiming and clock skew scheduling (CSS). The former method changes the positions of sequential elements (SEs) to shorten the effective critical path while maintaining functionality [1], and has been applied to FPGAs ([2], [3], [4]). CSS achieves the same goal by assigning intentional clock skews to SEs instead of moving them physically ([5], [6]), and has also been applied to FPGAs ([7], [8], [9]).
On the power front, dynamic power consumption is significant due to large capacitive loading on the interconnect. Recent advances in process technology have seen a decreasing trend in the rate of increase of dynamic power versus static power. However, total dynamic power still accounts for about 50% of total power [10]. In this paper, "dynamic power" excludes clock network power. Dynamic power arises from two kinds of logic transitions produced by combinational look-up tables (LUTs): functional and glitch. The former causes the data to be different at the end of a clock period, a result of user logic functions. The latter results from input data signals arriving at different times during the period, causing the output to fluctuate before settling down. Existing techniques to reduce glitching power operate at the architecture level [11], or at the CAD level during technology mapping [12] and routing [13].
This paper makes the following contributions:
1) The programmable delay element (PDE) proposed in [11] is used to provide discrete delays on flip-flop (FF) clock inputs. This unified architecture change, shared by CSS, DP and glitch reduction, avoids the need for multiple architecture modifications. For glitch reduction, we use the concept of GlitchLess (GL) [11], but with a different implementation. Previously, delay elements needed to be very precise to eliminate glitching, which can be difficult with increasing process variations. The new approach is much more resistant to variation.
2) An integrated delay padding scheme with CSS to further optimize performance. Past work ([14], [15], [16], [17]) uses either LP or graph theory to solve CSS. However, these techniques apply only to ASICs and assume padded delays are continuous, whereas a PDE can only provide discrete delays. We adapt the algorithms to use discrete delays as well as a margin for process variation.
3) An integrated tool flow that uses the same physically realizable architectural change to reduce power and increase performance. CSS, delay padding and GlitchLess are combined with VPR 5.0 [18] into a single framework. This is important for getting a final result that considers both delay and power at the same time.
4) An improvement on vector-based activity estimation [19], which used a threshold to determine whether a glitch does not propagate at all or propagates indefinitely. Our work models the analog behavior of the gradual decrease in width of a narrow glitch as it travels along FPGA interconnect, and calculates glitch power accordingly.
The central theme of this paper highlights the major difference of this work: previous research has focused purely on either performance or power. Our work shows that performance optimization adds to power, while 100% glitch reduction is not possible without impacting performance. Therefore it is important to achieve an appropriate compromise between the two. Furthermore, we motivate better PDE designs by putting PDE power overhead in perspective with total dynamic power consumption before and after glitch reduction. We show that while there is potential for good savings, a power-efficient PDE is crucial to the attractiveness of both period reduction and glitch reduction.
The rest of the paper is organized as follows. Section II introduces basic concepts. Section III describes the architecture changes and their adaptation by the optimization steps. Section IV details the modification to glitch estimation. Section V outlines our algorithm. Section VI provides results and discussion, and Section VII concludes our work.
II. BACKGROUND AND PAST WORK
A. Clock Skew Scheduling
Clock networks suffer from clock skew due to variation [20]. CSS uses it as a resource for improving performance, instead of treating it as an unavoidable burden. In the example of Fig. 1, a zero-skew clock network means the circuit has a minimum period of 14ns assuming zero setup/hold times. If a skew of 4ns is applied to $FF_B$, the circuit is able to operate at a minimum period of 10ns. This effect can be viewed as time borrowing: shortening the effective delay of long paths, at the expense of increased delay for short paths. Indeed, the path from $FF_B$ to $FF_C$ now has an effective delay of 10ns. The relative skew assigned to two neighboring FFs is bound by a setup time ($T_s$) constraint (Eq. 1) and a hold time ($T_h$) constraint (Eq. 2) to avoid zero-clocking and double-clocking conditions, respectively. $T_i$ and $T_j$ are the clock arrival times at FFs i and j, and $D_{max}(i, j)$ and $D_{min}(i, j)$ are the maximum and minimum combinational delays (which differ due to variation or reconvergence) between FFs i and j, respectively [5]. Delay $M$ is a user-defined safety margin to compensate for process variation, and allows $T_i$, $T_j$ and the path delays to vary up to $M$ without violating the constraints [6].
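For reference, the setup and hold constraints are commonly written in the following form in the CSS literature. The symbols follow the definitions above; the exact placement of the margin $M$ on each side is an assumption and may differ from the formulation in [6].

```latex
% Standard clock skew scheduling constraints (assumed form).
% i is the launching FF, j the capturing FF, P the clock period.
\begin{align}
T_i + D_{max}(i,j) + T_s + M &\le T_j + P \tag{setup} \\
T_i + D_{min}(i,j) &\ge T_j + T_h + M \tag{hold}
\end{align}
% Worked check against the example in the text, with T_s = T_h = M = 0
% and path delays inferred from the quoted numbers
% (D(A,B) = 14 ns, D(B,C) = 6 ns), using T_A = T_C = 0 and T_B = 4 ns:
%   setup A -> B:  0 + 14 <= 4 + P   =>  P >= 10 ns
%   setup B -> C:  4 +  6 <= 0 + P   =>  P >= 10 ns
```

Under these assumed delays, both paths become equally critical at 10ns, which is consistent with the time-borrowing view described above.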
The above system of equations is an optimization problem for period $P$ subject to $|T_i| < P$, and can be solved by Linear Programming (LP). A more efficient method [6] uses graph theory [21] and binary search to find the optimum $P$ between upper and lower bounds (Eq. 3), where $G(V, E)$ is the graph constructed with a set of constraints, with vertex $v_i$ corresponding to $T_i$.
Architecture changes required to implement intentional clock skew vary. One way is to use multiple global clock lines available in the FPGA to implement different skews [7]. An alternative approach [8] uses a single global H-tree with ribs on the H-tree for local routing. PDEs are inserted into branching points of the clock tree. The clock goes through a trail of PDEs before arriving at each FF node, and there are more choices for skew values because of this levelized structure. In [9], 4 PDEs are inserted at each rib of the H-tree, producing 4 skewed versions of the global clock for each row. A statistical model is used to model process variation. All of the above approaches focus on CSS only. Our delay padding scheme requires extra skews to be available in the clock line in addition to those for CSS. While our method may sometimes use more power than previous work, it allows extra flexibility for delay padding (further performance gains), and also for GlitchLess (power reduction).
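The graph-based search can be sketched as follows. This is a minimal illustration, assuming the usual difference-constraint form of the setup and hold constraints; it is not the implementation from [6]. The function names, the three-FF loop and its delays are hypothetical, and the bound $|T_i| < P$ is not enforced.

```python
def feasible_skews(num_ffs, paths, period, t_setup=0.0, t_hold=0.0, margin=0.0):
    """Return clock arrival times meeting all constraints, or None.

    paths: list of (i, j, d_max, d_min) for a combinational path FF i -> FF j.
    Each constraint x_u - x_v <= w becomes an edge v -> u of weight w;
    the system is feasible iff the graph has no negative cycle.
    """
    edges = []
    for i, j, d_max, d_min in paths:
        # Setup: T_i - T_j <= P - d_max - Ts - M  (edge j -> i)
        edges.append((j, i, period - d_max - t_setup - margin))
        # Hold:  T_j - T_i <= d_min - Th - M      (edge i -> j)
        edges.append((i, j, d_min - t_hold - margin))

    # Bellman-Ford from a virtual source tied to every vertex with weight 0.
    dist = [0.0] * num_ffs
    for _ in range(num_ffs):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist  # converged: dist is a valid skew assignment
    return None  # still relaxing after num_ffs passes: negative cycle


def min_period(num_ffs, paths, p_lo, p_hi, tol=1e-4, **kw):
    """Binary search for the smallest feasible period in [p_lo, p_hi].

    Assumes p_hi is feasible (e.g. the zero-skew period).
    """
    while p_hi - p_lo > tol:
        mid = 0.5 * (p_lo + p_hi)
        if feasible_skews(num_ffs, paths, mid, **kw) is not None:
            p_hi = mid
        else:
            p_lo = mid
    return p_hi


# Hypothetical loop A -> B -> C -> A with delays 14, 6 and 10 ns.
example_paths = [(0, 1, 14.0, 14.0), (1, 2, 6.0, 6.0), (2, 0, 10.0, 10.0)]
print(round(min_period(3, example_paths, 0.0, 14.0), 2))
```

In this loop the smallest period equals the mean delay around the cycle, so skew can reduce the period from the zero-skew 14ns but not below that mean.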
B. Delay Padding
The setup/hold constraints can limit the range of skews that can be assigned to SEs, and therefore the smallest obtainable period. In Eq. 1 and 2, larger $D_{max}$ and smaller $D_{min}$ will decrease the permissible range of assigned skews [22]. Nothing can be done to decrease $D_{max}(i, j)$, but an increase in $D_{min}(i, j)$ will widen the permissible range, allowing skew assignment to be more flexible. This short path optimization effectively reduces hold time violations, allowing a smaller period. We call this step delay padding.
C. Glitch Reduction
GlitchLess reduces glitching by delaying early arriving signals to prevent the output from fluctuating [11] . To realize this, PDEs are added to LUT inputs. In [11] , only combinational circuits are included. In this paper, we extend this work to sequential circuits as well so CSS can be applied. Other work done to reduce glitching includes [13] , which uses routing techniques, and [12] , which proposes a new glitch-driven technology-mapping tool.
D. Power Calculation
Dynamic power is defined by $P = \alpha \times C \times V_{dd}^2 \times f$, where $\alpha$ is switching activity, $C$ is capacitance, $V_{dd}$ is supply voltage and $f$ is operating frequency. For 65nm technology, $V_{dd}$ is 1V. The power figure we will refer to in this work is the power per operation, namely $P_{op} = \alpha \times C$. We define a power unit $P_{op}$ as 1 femtofarad of capacitance switching once per clock cycle ($\alpha = 1$).
III. ARCHITECTURE
A major contribution of this work is the proposal of a unified architecture change that can be shared by CSS, delay padding and GlitchLess. This section will detail this architecture as well as its adaptation by each of the 3 optimization steps. We assume that newer FPGAs, such as the Stratix III and Virtex 6, have 2 flip-flops per LUT.
A. CSS and Delay Padding
Architecture changes are highlighted in Fig. 2 with legends shown to distinguish optimization steps. CSS can be done by adding delay $\delta_A$ to $FF_A$. For delay padding, we use local rerouting within CLBs. The CLB input (solid arrow line) in Fig. 2 goes to $LUT_B$ originally. We reroute it (dash-dotted line) to unused $FF_C$ in another BLE, then back to the original LUT. By properly adjusting the skew assigned to $\delta_C$, any desired delay can be achieved provided there is enough slack for it.
B. Motivation for Glitch Reduction
Glitching can account for a large portion of dynamic power. CSS perturbs glitching. All SEs have the same signal departure time in zero-skew circuits, but skew assigned to SEs effectively delays that time, changing the amount of glitching created downstream. In Table I, the pre-CSS and post-CSS columns show the amount of dynamic power due to glitching before and after CSS and delay padding have been performed, respectively. An architecture with 4-input (k=4) LUTs and 10-LUT clusters with 22 inputs per cluster is used. In general, the amount of glitching increases by a fair margin after CSS. This further motivates the need for glitch reduction.
C. Architecture for Glitch Reduction
To eliminate glitching on a combinational node, we use a circuit-level architecture change different from that analyzed in [11]. Instead of inserting a PDE at LUT inputs, we achieve glitch reduction by intentional clock skew (Fig. 2). The LUT output is directed to $FF_D$, whose clock skew $\delta_D$ will be set to the latest arrival time of all LUT inputs plus setup time and safety margin. The LUT output fluctuates, but the FF will block all glitches until the final functional evaluation is known. Our approach requires only one PDE to eliminate the glitching for each LUT, compared to at least k-1 PDEs for each LUT used in [11]. One disadvantage of this approach is that the clock has an activity of 1. Compared to PDEs inserted into the data lines with relatively low activity, this approach may introduce a significant power overhead. We will show how this affects the results in section VI.
IV. IMPROVED GLITCH ESTIMATION
The ACE tool [19] filters out fluctuations of short pulse widths since the routing resource can damp them out. Originally, simulation determined this maximum pulse width that can be filtered out by a single stage of length-4 routing segment. A glitch longer than this threshold is assumed to go on indefinitely, otherwise it is assumed to consume no power. Neither of these assumptions is true in reality: as long as the pulse width of a glitch is below a different threshold (short glitch), it will be gradually filtered out after propagating down a certain number of wire segments. All glitches longer than the threshold can propagate indefinitely.
We try to address the above issue by first modifying ACE to group glitches of different pulse widths into bins (for example, glitches ranging from 15ps to 20ps form bin #1, etc.), and this histogram is printed for each net into an output file.
Cadence (Spectre) simulations are done for glitches of varying pulse widths, propagating down a routing track of n wires or stages. A short glitch of a particular pulse width, travelling down a routing track of n stages and being gradually filtered out, will consume a certain amount of power. This power can be expressed as a percentage normalized to the power consumed by a long glitch propagating down the same n stages. Simulation results are summarized in Fig. 3 . A converging trend is observed as the lines get closer together for increasing number of stages. Therefore it is assumed that any net longer than 10 stages (wire segments) will behave the same as a 10-stage net.
To calculate total dynamic power consumption, ACE output and Cadence simulation results are read into VPR as separate input files. For a glitch generated at the source node of a net in the circuit, the length and capacitance of the routing track for that net are determined with the VPR routing graph, the glitch activity for each bin is read from ACE, and the amount of glitching power can be calculated by multiplication of capacitance, glitch activity, and the percentage found in Fig. 3. The former is dominant because a feedback wire from LUT output to LUT input MUXes in the same CLB carries much more capacitance than the latter, which we neglect in our calculations.
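The per-net calculation described above can be sketched as a sum over pulse-width bins. This is only an illustration: the function name, the table layout and all numeric values below are hypothetical placeholders, not data from ACE, VPR or Fig. 3.

```python
def net_glitch_power(cap_ff, stages, bin_activity, pct_table, max_stages=10):
    """Estimate glitch power for one net in P_op units (fF x activity).

    cap_ff:       routing capacitance of the net in fF
    stages:       number of wire segments the net spans
    bin_activity: {bin_id: glitch activity of that pulse-width bin}
    pct_table:    {bin_id: {stage_count: fraction of long-glitch power}}
    Nets longer than max_stages are treated as max_stages long, following
    the convergence assumption stated in the text.
    """
    n = min(stages, max_stages)
    return sum(cap_ff * activity * pct_table[b][n]
               for b, activity in bin_activity.items())


# Hypothetical inputs: bin 0 holds short glitches that are mostly filtered,
# bin 1 holds long glitches that propagate at full power.
pct_table = {0: {10: 0.25}, 1: {10: 1.0}}
bin_activity = {0: 0.02, 1: 0.05}
print(net_glitch_power(120.0, 14, bin_activity, pct_table))
```

Keeping the percentage table separate from the activity histogram mirrors the flow in which the two are read as separate input files.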
The results from the original ACE, and those obtained from binning, are compared in Table II for circuits produced by VPR (all pre-CSS). Units are $P_{op}$, described in section II-D. All circuits are simulated using 5000 pseudo-random input vectors. A positive percentage difference means the original ACE underestimates glitching. The original ACE can underestimate glitching power by as much as 48% for k=4, and overestimate by as much as 15% for k=6. Generally, original ACE underestimates glitch power for k=4 because arrival time differences for a small LUT tend to be smaller and get dropped (below threshold).
Although our glitch power modelling is improved, there is still work to be done, such as comparing ACE results against HSPICE power simulations. In addition, glitch filtering creates two issues: glitch generation and propagation. The former is created at the output of a gate generated by the combined effect of its logic function and different input arrival times. The latter addresses the fact that a short glitch becomes narrower as it travels along a routing path, including its possible elimination. Our work better estimates the effect of pulse narrowing on power consumed in the interconnect immediately following glitch generation. However, it does not propagate the narrowed glitch through each LUT sink; instead, it propagates the original pulse width. For a complete analysis, we need to account for the change of glitching activity on downstream LUTs caused by glitch narrowing. This requires tight integration of VPR and ACE so that logic evaluation (ACE) and routing RC 
V. ALGORITHM
The overall approach is illustrated in Fig. 4 . It offers three approaches to glitch reduction. The first (route "1") uses the original VPR placement and routing solution to generate a net delay file for ACE, which produces an activity file for PGR to do glitch reduction only. The resulting net delays are analyzed by ACE again to produce final activities, and the power analysis routine of PGR is used to determine power savings. Alternatively, the P&R solution can be used directly by PGR to do CSS and DP, followed either by activity simulation to determine power (route "2"), or by ACE simulation, a full run of PGR that includes CSS, DP and GL, and final analysis by ACE to get power results (route "3"). CSS and DP are done twice since the delays affect ACE output.
A. CSS and Delay Padding
Our post-CSS delay padding algorithm offers the following novelties. While the traditional approach [15] considers an ASIC environment where any arbitrary delay is realizable, our algorithm targets FPGAs, is aware of discrete delay steps, process variation margins that limit both the minimum and maximum delay that can be assigned to a node, and the possibility that delay padding may fail due to these margins. To our knowledge, this is the first time delay padding has been applied to FPGAs. Our algorithm is outlined in Fig. 5 .
In each iteration of assign_skews(), the optimum period and skews determined with the approach in [6] are stored in a solution array, and all critical hold time edges are appended into a list of currently deleted edges [15]. In each iteration, the combinational LUTs on each critical edge are identified and put into arrays with find_deleted_edge_nodes(). When the lowest possible period is attained, the algorithm attempts to pad delays for all deleted critical edges from the most current iteration. In case delay padding is not successful, additional attempts will be made for earlier iterations until a valid padding solution is achieved.
The detailed delay padding algorithm is shown in Fig. 6. For each deleted edge, the algorithm attempts to pad delays for each combinational node on the critical hold time short path, starting with the LUT immediately following the source node. Timing analysis is done to safeguard the setup time constraint. On lines 6 to 8 of Fig. 6, the skew is first set to the arrival time of the fanin signal plus the setup time and variation margin, rounded up to the nearest discrete time unit specified by the parameter "PRECISION". It is then incremented in quantized steps (lines 9 to 11), until either the node's slack runs out, or the needed delay is satisfied.

 1: iteration = 0;
 2: initialization();
 3: solution[iteration] = assign_skews(P_max, P_min);
 4: num_edges = find_crit_hold_edges(edges[iteration]);
 5: while num_edges > 0 do
 6:   find_deleted_edge_nodes();
 7:   recalc_binary_bounds(P_max, P_min);
 8:   iteration++;
 9:   solution[iteration] = assign_skews(P_max, P_min);
10:   num_edges = find_crit_hold_edges(edges[
Fig. 5.

 3: needed_delay = calculate_needed_delay(iedge);
 4: for all node "n" on deleted edge "iedge" do
 5:   analyze_timing();
 6:   max_padding = get_max_possible_padding(n);
 7:   skew = roundup(fanin→arrival + Ts + MARGIN
 8:          + fanin_delay(n, fanin), PRECISION);
 9:   delay = skew − fanin→arrival
10:          − fanin_delay(n, fanin);
11:   while delay<(needed_delay && max_padding) do
12:     increment skew and delay by PRECISION
13:   end while
14:   needed_delay −= delay;
15:   if needed_delay ≤ 0 then
16:     edge_done = 1; break;
Fig. 6.

 2:   rank_nodes(&list, threshold);
 3:   for all node "n" in list do
 4:     skew = roundup(n→arrival
 5:            + Ts + MARGIN, PRECISION);
 6:     needed_slack = skew − n→arrival + MARGIN;
 7:     if needed_slack < n→slack then
 8:       for all fanin "f" of node "n" do
 9:         needed_delay = n→arrival − f→arrival
10:                − fanin_delay(n, f);
11:         fanin_delay(n, f) +=
12:                needed_delay + needed_slack;
13:       end for
14:       analyze_timing();
15:     end if
16:   end for
17: end for
Fig. 7. Glitch Reduction Algorithm
When delay padding for an edge finishes, check_other_paths() is used to check whether other short delay paths (with the same source and sink) are violated. If found, a recursive call to pad_delay() will be invoked until setup and hold time constraints for all combinational paths with the same source and sink are satisfied.
In this work, "PRECISION" and "MARGIN" in Fig. 6 are chosen to be 0.1ns, or 1 discrete step. Combined with rounding up (line 7), each PDE has at least 0.1ns of uncorrelated variation tolerance between its assigned skew and path delay, for early clock signals. An additional 0.1ns is added to the needed slack (line 6 in Fig. 7) for both delay padding and GlitchLess to account for late clock signals. It is omitted in Fig. 6 to save space. The margin $M$ for CSS is 0.2ns, e.g., allowing $T_i$ and $T_j$ (Eqs. 1, 2) to each shift 0.1ns away from each other in the worst case. For larger delays, this margin may be too small, and future work will investigate the performance impact of different margins.
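The per-node padding step of Fig. 6 (skew rounding followed by quantized increments) can be sketched as below. This is a hedged illustration rather than the PGR source: times are held as integer picoseconds to avoid floating-point rounding, the function and parameter names are hypothetical, and the loop condition on line 11 of Fig. 6 is read here as "more delay is needed and another step still fits within the maximum padding", which is an assumption.

```python
PRECISION_PS = 100  # 0.1 ns discrete PDE step, as chosen in the text
MARGIN_PS = 100     # 0.1 ns variation margin, as chosen in the text


def round_up(value_ps, step_ps):
    """Round an integer time up to the next multiple of step_ps."""
    return -(-value_ps // step_ps) * step_ps


def pad_node(fanin_arrival_ps, fanin_delay_ps, t_setup_ps,
             needed_delay_ps, max_padding_ps):
    """Return (skew, delay) in ps for one node on a short path.

    The skew starts at fanin arrival + setup + margin + fanin delay,
    rounded up to the PDE step, so the padded delay is at least
    setup + margin. It then grows one step at a time until the needed
    delay is met or the next step would exceed max_padding_ps.
    """
    skew = round_up(fanin_arrival_ps + t_setup_ps + MARGIN_PS
                    + fanin_delay_ps, PRECISION_PS)
    delay = skew - fanin_arrival_ps - fanin_delay_ps
    while delay < needed_delay_ps and delay + PRECISION_PS <= max_padding_ps:
        skew += PRECISION_PS
        delay += PRECISION_PS
    return skew, delay


# Hypothetical node: fanin arrives at 730 ps through a 250 ps connection,
# setup time 50 ps, 600 ps of padding needed, at most 1000 ps available.
print(pad_node(730, 250, 50, 600, 1000))
```

When the available padding is too small, the returned delay falls short of the needed delay, and the remainder would have to be padded on a later node of the same edge.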
B. Glitch Reduction
One significant difference of this work compared to [11] is added consideration of process variation. In [11], all added delays are shortened by an amount d so that variation will not increase the critical path. This may result in increased glitching power in practice, since narrow pulses are not "zero power" as assumed previously. Our approach eliminates this issue by stopping all glitching at the FF input, and only clocks through the data when the last signal has arrived. The algorithm is shown in Fig. 7.
The circuit is traversed in a breadth first fashion so that the extra delays assigned upstream do not invalidate those assigned downstream. For each node, GlitchLess delay is assigned based on calculated slack (lines 6 to 13). A threshold parameter is specified to filter out nodes with small glitch activity. In each level (line 1), all combinational nodes are ranked according to their glitching power, so that nodes with high loading get priority during PDE delay assignment. Care is taken to give each PDE extra margin for process variation in addition to setup time.
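The ranking and per-node skew test of Fig. 7 can be sketched as follows. This is an illustrative reading, not the PGR source: the names are hypothetical, times are integer picoseconds, and the threshold is taken as a fraction of the largest glitching power in the list.

```python
PRECISION_PS = 100  # 0.1 ns discrete PDE step
MARGIN_PS = 100     # 0.1 ns variation margin


def round_up(value_ps, step_ps):
    """Round an integer time up to the next multiple of step_ps."""
    return -(-value_ps // step_ps) * step_ps


def rank_nodes(glitch_power, threshold):
    """Keep nodes whose glitch power is at least threshold * max power,
    ordered from highest to lowest glitch power."""
    if not glitch_power:
        return []
    peak = max(glitch_power.values())
    keep = [n for n, p in glitch_power.items() if p >= threshold * peak]
    return sorted(keep, key=lambda n: glitch_power[n], reverse=True)


def glitchless_skew(arrival_ps, slack_ps, t_setup_ps):
    """Return the FF clock skew that blocks glitches on a node, or None.

    The skew is the latest input arrival plus setup and margin, rounded up
    to the PDE step (Fig. 7, lines 4-5). A second margin is added to the
    needed slack for late clocks (line 6); the node is only de-glitched
    if that fits in its available slack (line 7).
    """
    skew = round_up(arrival_ps + t_setup_ps + MARGIN_PS, PRECISION_PS)
    needed_slack = skew - arrival_ps + MARGIN_PS
    return skew if needed_slack < slack_ps else None


# Hypothetical glitch powers and timing values.
print(rank_nodes({"a": 10.0, "b": 4.0, "c": 1.0}, 0.2))
print(glitchless_skew(1230, 500, 50))
```

Because the test compares against available slack, nodes on or near the critical path are skipped, which is why full glitch removal is not reachable without lengthening the period.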
VI. RESULTS AND DISCUSSION
The largest 10 MCNC sequential circuits are used as benchmarks, and they are simulated for k = 4 and 6, N = 10, and I = 22 and 33, respectively. 65nm technology is used. All results are normalized with respect to the original solution found using VPR 5.0. Architecture files are from the iFAR repository [23], and routing resource capacitance and resistance values are calculated from the PTM website [24] assuming CLBs are 125µm squares. All circuits are simulated using timing-driven placement and routing, with a channel width of 100. Activity estimation is produced with the modified ACE described in section IV, simulated with 5000 input vectors. CSS+DP runtime ranges from a few seconds to minutes for each circuit, and GlitchLess requires a few seconds for each circuit. A Xeon X5355 2.66GHz CPU is used. ACE is run on an UltraSPARC-III CPU at 900MHz, and it needs about 45 minutes to run 10 circuits.
A. CSS Only Results and Power Overhead Estimation
In Table III, we demonstrate the period reduction (as a percentage of original critical path) we were able to obtain from CSS and delay padding, the increase in power as a result, and the impact of PDE power overhead. Average speedups of 13% and 16% are obtained for k=4 and 6, respectively, for CSS alone. Delay padding further improves 4 circuits (bold results). For elliptic, frisc and tseng, delay padding is very helpful. Combined CSS and delay padding reduces period by up to 37.7% for individual circuits.
The percentage power normalized to pre-CSS results is shown in the "no PDE" column. The increase averages 3-4% for both LUT sizes. In fact, the total dynamic power increase is much larger due to the PDEs, and this is shown in the total dynamic power, "PDE" column. We used the PDE designed in [11] , extending it to 6 stages. As a result of more FETs as well as progressively larger FETs needed to provide large delays, each PDE adds roughly 45 units of power (45fF, activity of 1).
We compare the power overhead due to CSS for our approach with that used in [7], which used 4 phase-shifted global clocks. The power due to an extra global clock is estimated as follows. The placement size (rows × columns of CLBs) is obtained from VPR, and we assume these CLBs are laid over a spine-and-ribs clock network. For example, if a circuit is 10×10 CLBs, then the power due to an extra clock is $C_{clk} = 10 \cdot C_{rib} + C_{spine} + C_{clb}$, where $C_{rib} = 10 \cdot C_{int}$, $C_{spine} = 10 \cdot C_{global}$, and $C_{clb} = 10 \times 10 \cdot C_{local} \cdot \%_{FF}$. Circuits with lower density of user FFs incur less overhead with our approach, while a higher density of FFs favors the approach in [7].
We can decrease power overhead by decreasing the number of PDEs, e.g., by limiting each CLB to 1 PDE. In addition, there are certain nodes in the CSS solution that require zero skew. However, due to the nature of the PDE, zero skew cannot be achieved, so the entire schedule is shifted by the min-skew to maintain functionality. We can limit the CSS algorithm to use only skews higher than the min-skew, therefore avoiding the need to provide zero-skew nodes with PDEs. We may also decrease the power used by each PDE, using a more power efficient circuit design. In the $PDE_{imp}$ column, we show the power overhead of our approach if we can decrease the number of PDEs by 10%, and the power used by each PDE by 10%. With these minor improvements, our approach usually uses less power than [7].
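The extra-clock estimate generalizes from the 10×10 example to other placement sizes roughly as sketched below. The function names and all capacitance values are hypothetical, and the mapping of rows to ribs and columns to rib length is an assumption that the square example does not settle. The figure of 45 units per PDE is the one quoted in the previous subsection's discussion.

```python
def extra_clock_cap(rows, cols, c_int, c_global, c_local, ff_fraction):
    """Capacitance (fF) switched by one extra global clock on a
    spine-and-ribs network, assuming one rib per CLB row."""
    c_rib = cols * c_int                           # one rib spans all columns
    c_spine = rows * c_global                      # spine crosses every row
    c_clb = rows * cols * c_local * ff_fraction    # local clock loads in use
    return rows * c_rib + c_spine + c_clb


def pde_overhead(num_pdes, units_per_pde=45.0):
    """Power units added by PDEs, each switching with activity 1."""
    return num_pdes * units_per_pde


# Hypothetical values: 10x10 CLBs, 5 fF per rib segment, 8 fF per spine
# segment, 2 fF per local clock load, half of the FFs in use.
print(extra_clock_cap(10, 10, 5.0, 8.0, 2.0, 0.5))
print(pde_overhead(12))
```

Comparing the two quantities for a given circuit shows why FF density matters: the clock estimate grows with placement area, while the PDE overhead grows with the number of skewed FFs.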
A histogram of the total number of PDEs used at each discrete delay for all circuits is shown in Fig. 8 . The data is relatively spread out, indicating it is beneficial to use PDEs that can provide a large range of delays. This provides greater flexibility for CSS and delay padding beyond what is available with just 4 global clock lines. Area calculation is done with the model used in [11] . With 1 PDE assigned to each FF, 20 PDEs are needed per CLB, and the overhead is 11.7% for k=4 and 7.6% for k=6. This is the maximum area overhead with our approach. Future work will investigate PDE reduction techniques to achieve lower overhead.
B. GlitchLess Only Results
Next, we show that 100% glitch reduction is not possible without impacting performance. In Fig. 9, we show glitch power reduction vs. threshold. The y-axis is the normalized glitching power, and the x-axis is the ranking threshold percentage with respect to the node with the maximum glitching power in each circuit. The lines represent power savings as the threshold is gradually decreased and more nets are de-glitched. The trend is rather gradual for threshold values larger than 20%, which makes sense since there are only a few nodes with high glitching power consumption. As the threshold is lowered below 20%, more nodes become eligible for skew assignment. The impact of increasing critical path is shown in Fig. 10, where the y-axis is the normalized product between power and period. It is clear that the extra glitch power savings is not worth the increase in critical path, as the power-period product goes above 1.0, and is never better than the case where critical path is not allowed to increase. Note that since the period stays the same for the square and circle lines, they basically represent the power savings.
Also noteworthy is the number of PDEs necessary to achieve glitch reduction, and power overhead implications. In Fig. 11 the number of PDEs used is plotted on a log scale against threshold. Referring to Fig. 9, it is interesting to see that roughly a third of the savings can be obtained using fewer than 10 PDEs on average, for both k=4 and 6. When increasing the critical path, more PDEs are needed. As the threshold is lowered below 20%, the number of PDEs used increases dramatically, and the savings from the diminishing rate of return of each added PDE is over-compensated by the PDE's own power overhead. As a result, savings at low thresholds The best savings ($P_{final}$) for each circuit, where period is not allowed to increase, is presented in Table IV. The limit ($P_{lim}$) refers to the best possible savings achievable (assuming 100% glitch elimination with no PDE overhead), and the percentage of the limit achieved is also shown (%). While some circuits cannot be improved, their limiting case is close to 100% because glitching is only a small fraction of total dynamic power. Overall, the technique is able to remove on average 13% to 20% of glitching power, or up to 63% for individual circuits.
VII. CONCLUSION AND FUTURE WORK
This paper proposed an architecture change and associated tool flow to consider both power and performance optimization. A programmable delay element (PDE) added to each flip-flop clock input can be used to satisfy CSS, delay padding and GlitchLess simultaneously. The results are summarized in Fig. 12. CSS is able to improve performance by an average of 15%, or up to 37% for individual circuits. Some circuits can further benefit from delay padding, and their period can be reduced by 10% of the original period, or up to 23% of the original period for individual circuits, in addition to CSS improvements. We then discussed an improved method to estimate power due to glitching, and showed that the original method can underestimate glitching by as much as 48% while overestimating by 15%. The power overhead due to PDEs is compared to adding 3 extra global clocks, and our method is estimated to be 45% more power efficient on average. Lastly, we investigated the effect of glitch reduction on dynamic power. We were able to eliminate on average 16% of glitching power, and up to 63% for individual circuits.
Fig. 11. Number of GlitchLess PDEs Used
As future work, it is important to evaluate the performance trade-off of using fewer PDEs to save power. Architecturally, moving PDEs up the clock distribution network saves both power and area ([8] , [9] ), but the restriction this imposes on performance (with delay padding) is not clear. Instead, algorithmically penalizing each PDE instance saves power as well, but retains flexibility for performance when needed. In addition, integration with retiming, with CSS added as fine tuning, can be done to exploit power overhead benefits.
