In this paper we discuss optimizing the interconnect power of designs implemented in FPGA platforms. In particular, we reduce the glitch power on interconnects associated with the output of functional units in a design. The idea is to activate unused flip-flops to block the propagation of glitches, which takes advantage of the abundant flip-flops in modern FPGA structures. Since the activation of additional flip-flops may cause data hazard problems, we develop several effective behavioral synthesis techniques to prevent such data hazards. We also study the optimality of our techniques. The experimental results show that on average, our methods lead to a 28% reduction in dynamic power in the Xilinx Virtex-II platform.
Introduction
Power efficiency is becoming a forefront concern of FPGA designs in nanometer-scale technologies. The research in [14] [20] has shown that interconnect resources dominate the power consumption in modern FPGA designs. In particular, interconnect can dissipate at least 60% of the total power in the Xilinx Virtex-II family [20]. Therefore, reducing interconnect power is important for FPGA designs to achieve power efficiency. A synchronous design can be implemented as a finite state machine with data path (FSMD). Figure 1 shows the generic structure of an FSMD. The data path contains arithmetic functional units as well as registers, which temporarily store computation results between functional units. We use the term boundary output signal to refer to the interconnect at the boundary of the data path (bold line in Figure 1), which lies between the output of functional units and the input of registers.
It is important to note that a boundary output signal may have multiple fanouts; i.e., a functional unit is connected to several registers, as shown in Figure 2 (a). This occurs commonly if during resource binding, multiple operations are bound to the same functional unit, and those operations produce results with overlapped lifetimes.
We observed that, when a design is implemented on an FPGA, inserting a single register at a multi-fanout boundary output signal, as shown in Figure 2 (b), can significantly decrease the power consumption on the boundary output signal. This is because large glitches, which originally propagate through the whole boundary output signal, now occur only in the interconnect between the functional unit and the inserted register, which has an extremely small capacitance C. We call such an additional register a firewall register due to its ability to filter out unwelcome glitches. Interconnect capacitance C in Figure 2 (b) can be much smaller than the capacitance of the whole boundary output signal if we implement it on an FPGA. Figure 3 shows a typical FPGA structure. A basic logic element (e.g., a Logic Element in Altera Stratix FPGAs or a Slice in Xilinx Virtex series FPGAs) contains a LUT as well as a flip-flop, so that the output of a logic gate (implemented in LUTs) can be configured in either an unregistered mode or a registered mode. Precisely, if a logic gate is connected to only one register, it can be implemented in the registered mode. Inserting a firewall register at the output of a functional unit creates exactly this situation, so the registered mode can be used. Therefore, capacitance C in Figure 2 (b) is the interconnect capacitance between the LUT and the local flip-flop inside a basic logic element, which is much smaller than the capacitance of the inter-block programmable interconnect.
The insertion of firewall registers is not trivial because a firewall register delays data propagation from a functional unit to its original registers by one clock cycle, thus possibly causing data hazard problems. We observed that the hazard problem can be solved by scheduling and binding operations in a particular way. Therefore, in this paper we propose novel scheduling and binding methods to generate functionally correct, low-power RTL designs with firewall registers. We intend to insert firewall registers at those functional units that generate large glitches at the output. This implies that our method needs an accurate glitch estimation to guide the insertion. Considering that the occurrence of glitches is sensitive to component delays, we suggest using our behavioral synthesis method for data intensive designs, which are mainly composed of arithmetic functional units serving as IPs. Those IP blocks have pre-determined circuit structures, so the glitch information is predictable in the behavioral synthesis stage. Our problem formulation takes power models [7] [12] [19] [21] for each arithmetic module as the input.
In our experiments, we applied our method to a set of data intensive designs, and obtained, on average, 28% power reduction with 4% area overhead. The small area overhead implies that the additional control circuit for firewall registers is insignificant. Also, the leakage power is not impacted much because leakage is roughly proportional to a design's area. Similar to our idea, pipelining [15] [18] [22] and retiming [13] [17] also use flip-flops to block the propagation of glitches for power minimization. However, all of them assume that an RTL design is given, so they cannot change the computation sequence as scheduling does. This makes it hard for the previous methods to solve the data hazard problem encountered in our problem formulation. They focus on activating unused flip-flops at a local scale, in particular within a functional unit. Still, one can apply both our method and the previous methods at different stages of the design flow for low power.
Our major contributions are 1) to expand the solution space of low-power implementation by proposing an additional dimension of using or not using a firewall register, and 2) to provide methods to guide the insertion of firewall registers in the behavioral synthesis stage. We have incorporated our techniques into a behavioral synthesis tool, xPilot, introduced in [8].
Data Hazard Problems
Inserting firewall registers may impact the correctness of a design's function. In fact, as we will discuss later, there exists a certain scheduling and binding condition under which the use of a firewall register causes functional errors. Therefore, if we want to take advantage of firewall registers, we must avoid scheduling and binding a design in that way. In this section we discuss the scheduling and binding pattern to be avoided.
A design's function can be represented as a data-flow graph (DFG). A DFG is a directed acyclic graph (DAG), where every node represents an operation, such as an addition or a multiplication, and every directed edge (u, v) represents a dataflow indicating that operation u produces values to be consumed by v. After scheduling, we can derive a scheduled DFG, where every operation is scheduled to execute at one or more consecutive control steps (c-steps).
To maintain a design's functionality after inserting firewall registers, we have to guarantee that for every dataflow (u, v), consuming operation v correctly reads results from producing operation u. Figure 4 (a) shows a partial scheduled DFG, where operations u and v form a dataflow (u, v) and are bound to functional units p and q, respectively. Note that the use of the firewall register will delay the data transfer from functional unit p to register r by one c-step. If functional unit q intends to fetch from register r a value that is still stored in the firewall register, a functional error occurs. We need to carefully deal with this condition when using firewall registers.
We can simply "forward" the results from the firewall register to functional unit q as shown in Figure 4 (b), which is traditionally called forwarding. Forwarding fully resolves the functional error problem if the consuming operation v is executed in a single c-step. In this case, the firewall register is required to keep the target results for only one c-step while operation v reads them. However, if the consuming operation v is a multi-cycle operation, the firewall register must keep the target results for several c-steps until operation v finishes reading. If the firewall register cannot keep the target results long enough, a functional error still occurs. We elaborate on this issue using an example. Figure 5 shows a partial scheduled DFG, where operations u and v of dataflow (u, v) are bound to functional units p and q, respectively. In addition, operation w is also bound to functional unit p, sharing the same functional unit with the producing operation u. Note that the consuming operation v is a two-cycle operation and will read results through forwarding during c-steps i and i+1, so the firewall register must keep the producing operation u's results during the two c-steps. However, at the end of c-step i, functional unit p will finish the computation of operation w and store its results in the firewall register, which inadvertently overwrites the result that is still being forwarded to functional unit q. Formally speaking, a write-after-read (WAR) hazard occurs on the firewall register.
Note that we cannot attach two firewall registers to functional unit p to store lifetime-overlapping results from operations u and w, because the two firewall registers would then not be implemented in local flip-flops, yielding no power reduction, as shown in Figure 3. In contrast, because we can attach as many "original" registers, such as register r in Figure 4 (a), as needed to store lifetime-overlapping results, original registers cannot be involved in WAR hazards.
We formally describe the conditions that induce a WAR hazard. Assume that an operation v is a non-pipelined multi-cycle operation and is executed at k consecutive c-steps, which are labeled by consecutive integers {i, i+1, …, i+k-1}. To maintain a design's function, we cannot apply firewall registers to dataflows satisfying Lemma 1. However, this reduces the opportunity of using firewall registers for low power, so we should carefully perform scheduling and binding to avoid such conditions.
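The Figure 5 scenario can be turned into a small check. The sketch below is only one reading of that example and is not the statement of Lemma 1: it assumes that forwarding is used when v starts in the c-step right after u finishes, and that a write at the end of c-step c is visible from c-step c+1 onward. The `Op` record and the function name `has_war_hazard` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Op:
    name: str
    start: int   # first c-step of execution
    cycles: int  # number of consecutive c-steps
    fu: str      # functional unit the operation is bound to

    @property
    def end(self) -> int:
        return self.start + self.cycles - 1


def has_war_hazard(u: Op, v: Op, ops: Iterable[Op]) -> bool:
    """True if the value forwarded from u's firewall register to v
    can be overwritten before v finishes reading (assumed condition)."""
    # Assumption: forwarding matters only if v starts right after u ends
    # and v needs the value for more than one c-step.
    if v.start != u.end + 1 or v.cycles < 2:
        return False
    for w in ops:
        if w.name == u.name or w.fu != u.fu:
            continue
        # w writes the firewall register at the end of c-step w.end;
        # v still reads during c-steps w.end + 1 .. v.end.
        if v.start <= w.end <= v.end - 1:
            return True
    return False


# Figure 5 with i = 2: u and w share functional unit p, v is a two-cycle operation.
u = Op("u", start=1, cycles=1, fu="p")
w = Op("w", start=2, cycles=1, fu="p")
v = Op("v", start=2, cycles=2, fu="q")
print(has_war_hazard(u, v, [u, v, w]))
```

Binding w to a different functional unit, or making v a single-cycle operation, makes the check return False under these assumptions.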
Binding with Firewall Register Insertion Support
In this section we discuss how to perform resource binding to avoid the conditions in Lemma 1. Our idea can be briefly illustrated with the example in Figure 5. Since operations u and w are bound to the same functional unit p, according to Lemma 1, a hazard on dataflow (u, v) appears. If we instead bind operations u and w to two separate functional units, the conditions in Lemma 1 are no longer satisfied.
Traditionally, binding achieves power optimization by minimizing the switching activities of resources [1] [5] [6] [16] . In this research we perform a low-power binding by considering both the switching activity and the insertion of firewall registers simultaneously. The problem formulation is described as follows.
Given:
(1) A scheduled DFG G=(V, E); (2) a set of resources R; (3) switching activity s_uw for each pair of operations u, w ∈ V; (4) power models for each type of resource.
Goal:
In our problem formulation, the number of resources is a constraint. Therefore, although our method may seem to increase resource usage by binding operations to separate resources, it will not use more resources than conventional binding approaches. However, the resource constraint limits how far our binding can avoid the conditions in Lemma 1. The looser the resource constraint, the more dataflows can be protected by firewall registers.
Network Flow Formulation
We adopt network flow formulation to solve the binding problem. We will show that, through proper network construction and an optimal min-cost flow algorithm, we can derive an optimal solution for the goal. The outline of our algorithm is shown as follows.
Algorithm:
(1) Build a graph H representing the compatible information among operations in V.
(2) Assign cost and capacity constraints to the edges in H.
We first introduce several notations in our formulation. In a DFG G=(V, E), different types of operations (e.g., addition and multiplication) are bound separately. We use V_f to denote the set of operations of type f. For two operations u and w of type f, if their corresponding lifetimes do not overlap, we call u and w compatible with each other. Two compatible operations can be bound to a single functional unit. Next we define two operations to be FR-compatible as follows.
Definition 1:
Two operations u and w of type f are FR-compatible if and only if (1) they are compatible, and (2) when they are bound together to a functional unit, the functional unit can be protected by a firewall register; i.e., the conditions in Lemma 1 do not occur.
For example, in Figure 5 operations u and w are not FR-compatible because binding them to the same functional unit forbids inserting a firewall register at that functional unit.
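The plain compatibility test is an interval-overlap check. The following minimal sketch assumes that an operation's lifetime is an inclusive range of c-steps (start, end); the function names and the sample lifetimes are hypothetical. FR-compatibility would filter the resulting pairs further with the hazard conditions, which are not modeled here.

```python
from itertools import combinations


def compatible(a, b):
    """Two inclusive (start, end) lifetimes are compatible if they do not overlap."""
    return a[1] < b[0] or b[1] < a[0]


def compatible_pairs(lifetimes):
    """lifetimes: dict mapping operation name -> (start, end) in c-steps."""
    return [
        (u, w)
        for u, w in combinations(sorted(lifetimes), 2)
        if compatible(lifetimes[u], lifetimes[w])
    ]


# Hypothetical lifetimes for three operations of the same type.
lifetimes = {"1": (1, 1), "2": (2, 3), "3": (3, 3)}
pairs = compatible_pairs(lifetimes)
print(pairs)  # operations 2 and 3 overlap at c-step 3, so they are not paired
```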
We intend to build a graph H = (s, t, V_H, E_H, C, K_l, K_u) based on the compatibility information among operations.
First there are source node s and sink node t in H. Next, V H is the node set of the network. For each operation v ∈ V f there are six corresponding nodes in V H , as shown in Figure 6 . We denote the six nodes as {v FRin E H is the edge set of the network. The edges in E H can be classified into three categories.
(1) The internal edges among the six corresponding nodes of an operation v, as shown in Figure 6.
C is the cost assigned to the edges in E_H, which is set in the following way. In this formulation the dynamic power is affected by the switching activity as well as by whether a functional unit is protected by a firewall register. We therefore need two power models for each type of functional unit. Power model power_FR(s_uw) gives the power when operations u and w are bound together with a firewall register; power_primitive(s_uw) gives the power when they are bound without a firewall register. Both models take switching activity as input. Many papers [7] [12] [19] [21] discuss how to derive power models for behavioral synthesis. In particular, those methods can take glitches into account when characterizing the power of pre-designed IP blocks. We then assign the calculated power values to the corresponding edges in H as the cost.
C(u Pout
Finally, K_l is the lower bound flow capacity, which is set to 0 for every edge in E_H; K_u is the upper bound flow capacity, which is set to 1.
We use an example to illustrate the construction of H. Figure 7 (a) shows a scheduled DFG containing three operations {1, 2, 3} with the same type. The constructed network for those operations is shown in Figure 7( 
Obtaining Binding from Network Flow Solution
If the resource constraint of resource type f is N_f, we solve the min-cost N_f-flow problem on the constructed network H. We then use the solution (network flows) to perform binding and firewall register insertion simultaneously, as discussed in this section.
All the operations visited by a flow will be bound to a single functional unit with or without a firewall register. Precisely, if a flow goes through edge (u_FRout, w_FRin), operations u and w will be bound together with a firewall register. If a flow goes through edge (u_Pout, w_Pin), operations u and w will be bound without a firewall register. However, if a flow goes through edge (u_FRout, x_FRin) and then (x_Pout, w_Pin), i.e., operations u and x are bound with a firewall register but operations x and w are not, we cannot decide whether the functional unit associated with operations u, x, and w is protected by a firewall register. To avoid such a situation, for each operation u we require that edges (u_FRin, u_Xin) and (u_Xout, u_FRout) have the same flow and that edges (u_Pin, u_Xin) and (u_Xout, u_Pout) have the same flow. With these constraints, we guarantee that a flow always stays in either the primitive part or the firewall register part of the edges. We call a unit flow satisfying the above constraints a valid flow. For example, in Figure 7 (b) the highlighted path indicates a valid flow.
After applying a min-cost N_f-flow algorithm to the network, we can derive a set of valid flows, and then construct the corresponding binding result. Since we use the power as the cost in the network, a min-cost algorithm leads to a binding solution with the lowest power.
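The valid-flow rule can be checked directly on a path. The sketch below assumes a unit flow is given as the ordered list of (operation, port) nodes it visits, with s and t omitted and with port names taken from the edge names above (FRin, FRout, Pin, Pout, Xin, Xout). The function `decode_flow` and the path encoding are hypothetical, not part of the original formulation.

```python
def decode_flow(path):
    """Return (operations, uses_firewall_register) for one unit flow.

    Raises ValueError if the flow mixes the primitive (P) part and the
    firewall register (FR) part, i.e. if it is not a valid flow.
    """
    sides = set()
    ops = []
    for op, port in path:
        if op not in ops:
            ops.append(op)  # keep operations in visiting order
        if port in ("FRin", "FRout"):
            sides.add("FR")
        elif port in ("Pin", "Pout"):
            sides.add("P")
        elif port not in ("Xin", "Xout"):
            raise ValueError(f"unknown port {port!r}")
    if len(sides) != 1:
        raise ValueError("flow does not stay in exactly one of the P and FR parts")
    return ops, sides == {"FR"}


# A flow that binds operations 1 and 3 to one functional unit with a firewall register.
valid = [
    ("1", "FRin"), ("1", "Xin"), ("1", "Xout"), ("1", "FRout"),
    ("3", "FRin"), ("3", "Xin"), ("3", "Xout"), ("3", "FRout"),
]
print(decode_flow(valid))
```

A path that enters an operation through FRin and leaves it through Pout is rejected, which mirrors the equal-flow requirement on the internal edges.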
Solving Network Flow Problem with Equal Integral Flow Constraints
As noted in [6], the min-cost flow problem can be solved by a shortest-path-based algorithm [1]. However, unlike general networks, network H in our formulation requires that the flows on certain edges be equal. The min-cost flow problem with equal integral flow constraints is NP-hard [2]. To trade off solution quality against runtime, we can use heuristic algorithms, such as the one presented in [3], where the authors used a Lagrangian relaxation technique to speed up the solution of the min-cost equal-flow problem.
1A-2

Scheduling with Firewall Register Insertion Support
In this section we discuss how to perform scheduling to avoid the conditions in Lemma 1. Our idea can be briefly illustrated with the example in Figure 5, which shows a data hazard on dataflow (u, v). If we can schedule the two operations u and v in a nonconsecutive way, as in Figure 8, Lemma 1 will never be satisfied. We define the slack of an edge (u, v) as the distance between operations u and v in c-steps. If the slack is zero, operations u and v are executed at consecutive c-steps; if the slack is positive, the two operations are separated by at least one c-step. Our goal is to assign positive slacks to as many edges as possible to avoid the situation in Lemma 1.
The problem formulation is as follows.
Given: (1) A DFG G; (2) A latency constraint T in number of c-steps and a set of optional scheduling constraints, including data dependency, throughput, and relative timing [9] .
Goal: Generate a scheduled DFG G' without violating T and all the given scheduling constraints; in the meantime, the number of dataflows (or edges in G') with hazards is minimized.
Since the assignments of slacks are constrained by the overall latency constraint, we have to intelligently budget and distribute time slacks to the non-critical edges of the given DFG. This problem is traditionally called the timing budgeting problem. The previous research [10] has well studied this problem and provided an optimal solution. 
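The slack definition can be made concrete as follows. This sketch only measures slacks and latency on a given schedule; it does not implement the budgeting algorithm of [10]. It assumes each operation is described by its first c-step and its number of cycles, and the function names and sample schedule are hypothetical.

```python
def edge_slacks(schedule, edges):
    """schedule: op -> (start c-step, cycles); edges: iterable of (u, v) dataflows.

    Slack is the number of idle c-steps between the end of u and the start of v,
    so slack 0 means v starts in the c-step right after u finishes.
    """
    slacks = {}
    for u, v in edges:
        u_start, u_cycles = schedule[u]
        v_start, _ = schedule[v]
        u_end = u_start + u_cycles - 1
        slacks[(u, v)] = v_start - u_end - 1
    return slacks


def latency(schedule):
    """Last c-step used by any operation."""
    return max(start + cycles - 1 for start, cycles in schedule.values())


# Hypothetical schedule: v is delayed by one c-step, x follows u immediately.
schedule = {"u": (1, 1), "x": (2, 1), "v": (3, 2)}
edges = [("u", "v"), ("u", "x")]
slacks = edge_slacks(schedule, edges)
zero_slack_edges = [e for e, s in slacks.items() if s == 0]
print(slacks, zero_slack_edges, latency(schedule))
```

Edges with zero slack are the candidates for hazards, so a scheduler would try to reduce their number while keeping the latency within T.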
Experimental Results
We incorporated our scheduling and binding techniques into the behavioral synthesis tool, xPilot, introduced in [8]. In this section we compare the power efficiency of the RTL designs generated by the conventional behavioral synthesis [8] and by our firewall-register-supporting (FR-supporting) behavioral synthesis.
The experimental flow is as follows. We first performed both behavioral synthesis methods under the same resource and timing constraints. Next, we implemented each RTL design on a real FPGA device using Xilinx ISE version 8.1.03i. The target FPGA is device XC2V500 in Xilinx's Virtex-II family, except for benchmark CHEM, for which we use XC2V1500 due to its large size. All multiplications are implemented using the dedicated multiplier blocks of the FPGA device, and the target clock period is 15ns. After deriving the post-place-and-route implementations, we simulated them with random inputs to obtain switching activities and then used xPower [23] to compute each circuit's dynamic power. We emphasize that the power and area reported in the experimental results are extracted from the post-place-and-route implementations in order to reflect real situations.
We used a set of data intensive benchmarks to test our methods. The experimental results are shown in Table 1 . Column 1 presents the name of a benchmark. Columns 2 and 3 show the resource constraints of adders/subtractors (ADD/SUB) and multipliers (MUL), respectively. Here the resource constraints are 20% of the total number of the operations in a DFG. Columns 4 to 6 show the results from the conventional synthesis flow presented in [8] . Columns 7 to 9 present the results from the conventional flow with firewall register insertion; i.e., we still use the conventional scheduling and binding algorithms but additionally insert firewall registers. Finally Columns 10 to 12 show the results from our FR-supporting flow.
Let us consider benchmark DIF as an example. The RTL from the conventional flow physically needs 788 flip-flops and 987 slices. Note that the number of flip-flops includes those in slices and dedicated multiplier blocks. The dynamic power is 174mW. After the insertion of firewall registers at the RTL, the flip-flop usage increases to 852 and the slice usage increases to 988, but the power decreases to 155mW. In other words, with an 8% increase in flip-flops and a 0.1% increase in slices, the power is decreased by 11%. Furthermore, if we apply the FR-supporting flow to generate the RTL, with a 16% increase in flip-flops and a 4% increase in slices, the power is reduced by 28%.
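The percentages for the conventional flow with firewall registers follow directly from the DIF numbers quoted above, as this short check shows. The helper name `pct_change` is hypothetical; the absolute numbers for the FR-supporting flow are not given in the text, so they are not checked here.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


ff = pct_change(788, 852)      # flip-flops
slices = pct_change(987, 988)  # slices
power = pct_change(174, 155)   # dynamic power in mW

print(f"flip-flops {ff:+.1f}%, slices {slices:+.1f}%, power {power:+.1f}%")
```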
On average, the conventional flow with firewall registers achieves a 16% reduction of dynamic power while introducing a 1% increase of slices (area overhead). This shows that for those designs, the insertion of firewall registers can effectively reduce the dynamic power. In addition, on average the FR-supporting flow achieves a 28% reduction of dynamic power while introducing a 4% increase of slices. This shows that our FR-supporting scheduling and binding algorithms can further enhance the insertion of firewall registers, thus leading to larger power reduction.
Our method is based on the assumption that a firewall register is implemented in a local flip-flop; otherwise, no power is saved because glitches still propagate through programmable interconnect with large capacitance. Note that high level synthesis cannot control the placement of firewall registers, because placement is controlled by the FPGA placer. Fortunately, using the Xilinx ISE placer in the experiments, we checked the layouts of some designs and found that all firewall registers are implemented as local flip-flops by this tool. Secondly, according to the experimental results, the use of firewall registers does not increase the usage of slices much, suggesting that the firewall registers are implemented in local flip-flops rather than occupying spare slices. The slice increase is due to the additional control circuit for firewall registers. We believe that other synthesis tools should produce the same results since the use of local flip-flops is good for delay, power, and routing congestion.
Note that the use of firewall registers does not adversely impact the timing of a design. Firstly, after a firewall register is inserted at a functional unit, the firewall register is implemented in local flip-flops with tiny interconnect capacitance, so the critical paths within the functional unit become shorter than those in the original design. Therefore, no setup time violation can occur under the use of firewall registers. Secondly, because there is no combinational logic between the firewall register and the original registers, hold time violations may occur on these paths. This issue can be automatically handled by the synthesis tools, which will route wires or add buffers to increase the propagation delay.
We do not show leakage power here because the FPGA chip we used has no "turn-off" mechanism to shut down the leakage of unused components. Therefore, the leakage power is a constant in every result. However, we think the leakage overhead should be small in our method considering that the leakage is roughly proportional to the area.
Conclusions
In this paper we propose the concept of firewall registers to block the propagation of glitches on boundary output signals. To resolve the WAR hazard problem caused by the insertion of firewall registers, we also propose an FR-supporting behavioral synthesis flow. The experimental results show that the reduction in dynamic power is around 28%.
