Knowing the capacitance of circuit nets in an FPGA design is essential when computing the dynamic power consumed by switching these nets. Before a circuit is placed, however, there is little information available to allow the capacitance of routing wires to be estimated. In this paper we study the feasibility of estimating routing capacitance before RTL synthesis so that high-level power optimization algorithms can target routing power. We propose a novel method for estimating the capacitance of nets before RTL synthesis and show that this method improves both the accuracy and the rank ordering of the net-by-net estimates over existing fan-out based techniques.
INTRODUCTION
While macro-modeling techniques such as [1, 2, 3] can effectively estimate the power consumed in configurable logic blocks given a high-level system description, the power consumed in the routing wires of the system is hard to predict because the capacitance of these wires is largely determined by the physical placement decisions made after synthesis.
The authors would like to acknowledge the support of Synplicity, Xilinx, Celoxica and the EPSRC under grant numbers EP/C512596/1, EP/C549481/1 and EP/E00024X/1.
We focus on estimating the capacitance in wires that connect hard macros or cores (referred to as components in the remainder of this paper), such as are typically used in arithmetic-intensive designs. The remaining wires in a system, which connect the configurable logic blocks within a component, can be shown to have predictable capacitance, as macros normally define a relative placement of the logic blocks within a component [3]. The capacitance of the wires that connect components can be difficult to predict, however, as stochastic optimization techniques such as simulated annealing are typically used to place components. Throughout the remainder of this paper, wires that connect configurable logic blocks within a component are called intra-routing wires, while those that connect components are called inter-routing wires. We focus on the more difficult problem of estimating the capacitance in inter-routing wires. The main contributions of this paper are as follows.
* A quantitative analysis of where power is consumed for a variety of FPGA families that motivates the need for early stage routing power estimation, given in Section 3.
* Evidence that wire-length is becoming a more dominant factor in the capacitance of FPGA nets as feature size decreases and that this improves the accuracy of pre-routing estimates made using bounding box, given in Section 4.
* A novel technique for utilizing topological information on a circuit to enhance inter-routing wire capacitance estimates made prior to RTL synthesis, given in Section 5.
* Results indicating that the proposed method improves the accuracy and rank ordering of capacitance estimates made for the inter-routing wires in a set of benchmarks, given in Section 6.
BACKGROUND
Dynamic power is consumed by the charging and discharging of both gate and parasitic capacitance in a circuit when logic transitions are made. Equation (1) can be used to estimate the power consumed by switching a capacitance C at an average frequency of f where the voltage swing is V.
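For reference, a standard form of the relation that Equation (1) describes is sketched below. It assumes that f counts individual logic transitions per second, so that each transition dissipates half of C V squared; if f instead counted full charge and discharge cycles, the factor of one half would disappear.

```latex
P = \frac{1}{2}\, C\, V^{2} f
```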
To estimate the total dynamic power consumed in a circuit using (1) the capacitance and average switching activity of each signal in the circuit must be known. For FPGAs, tools such as Xilinx XPower [4] use switch-level simulations of a design implemented on a target FPGA to determine the activities of signals, while capacitances are gathered for the tool a priori from silicon and metal capacitance information available from the silicon manufacturers, device-level simulations of the elements within the FPGA (as in [5] ), and, where possible, power measurements from the device.
Unfortunately the computation involved in calculating an estimate using a switch-level simulation of a circuit is considerable. To obtain the information required, a design must be synthesized, placed and routed, and then simulated at a low level. For a typical test system used in this paper obtaining a routed design took 16 minutes, on top of which simulation length must be carefully selected to ensure the entire circuit is exercised as under normal operation, meaning simulation time may become several hours long for more complex systems.
Table 1. Benchmark systems (sizes and wire counts when implemented on the Xilinx Virtex 4 family).
Name           | Description                              | Size (SLICEs) | (unlabelled) | Inter-routing wires | Intra-routing wires
IIR3           | 3rd order low pass IIR filter            | 389           | 19           | 248                 | 697
LMS2           | 2nd order LMS adaptive filter            | 180           | 15           | 88                  | 381
LMS4           | 4th order LMS adaptive filter            | 1641          | 42           | 602                 | 3787
FIR7           | Symmetric 7th order low pass FIR filter  | 263           | 15           | 180                 | 458
ColorConverter | R-G-B to Y-Pr-Pb color space converter   | 276           | 26           | 228                 | 419
5x5Convolution | 5x5 image convolution filter             | 364           | 71           | 391                 | 412
Fibonacci      | Fibonacci sequence generator             | 147           | 13           | 218                 | 121
FSE            | T/2 adaptive Fractional Space Equalizer  | 2872          | 105          | 1982                | 5582
PolyphaseFIR   | 128-tap 1:8 polyphase FIR filter         | 741           | 51           | 588                 | 1341
CostasLoop     | Costas loop for carrier recovery         | 893           | 31           | 343                 | 1563
To enable high-level optimizations for power consumption, estimates must be made as quickly and hence as early as possible, i.e. before RTL synthesis. Power macro-models such as [1, 2, 3] have emerged as a technique suitable for estimating the power consumed by the logic elements which form components within a design. In [3] we have also shown that macro-models can be used to estimate the power consumed in the intra-routing wires that connect the logic elements within arithmetic components implemented in LUTs.
Power macro-models generally work by mapping the estimated activity in the input signals of a component to some pre-measured power consumption where similar input signals were used. Due to the regular construction of components such as adders, multipliers, multiplexers, etc., macro-models can give accurate estimates of the power consumed in their logic and intra-routing wires. Unfortunately macro-models are not suited to estimating the power consumed in the inter-component routing wires in a system, as the capacitance of these wires can vary significantly depending on the placement of the components themselves within a system.
Early-stage estimation of power consumption in routing wires has previously been considered for the Virtex 2 in [6, 7] . In [6] the authors propose using fanout alone after RTL synthesis as an estimate for capacitance, while in [7] the authors consider post-placement estimation of capacitance using several architecture specific parameters known after placement. We study the dependence of capacitance on the parameters used in both [6] and [7] on both the Virtex 2 and more recent devices in Section 4, and we then compare the pre-RTL synthesis model introduced in this paper to both models in Section 6. Our results indicate that our technique is more accurate than [6] , while being far less computationally expensive than [7] .
In order to allow changing trends in power consumption distribution and routing wire capacitance to be identified, the data used in this paper has been gathered for several families of Xilinx FPGAs. As in other work [1, 2, 3, 6, 7, 8] , the capacitance and power consumption values used in this paper have been obtained from a low-level power estimation tool (Xilinx XPower [4] ). High-level estimates cannot be expected to be more accurate than lower-level estimates, and in any case the capacitance of a single wire in a system is impossible to extract from device-level measurements [9] .
THE DISTRIBUTION OF DYNAMIC POWER CONSUMPTION
Here we present results showing the distribution of power consumption for a set of arithmetic benchmarks implemented on FPGAs from different technology nodes. The results obtained indicate that inter-routing power dominates over intra-routing power in FPGAs and forms a significant part of power consumption, motivating the importance of estimating inter-routing power at an early stage so that it may be targeted by high-level power optimization algorithms.
Previous work on the distribution of power consumption in FPGAs has shown that routing power dominates, forming 71% of power in the Virtex 2 [10] , and 45% of power in the Spartan 3 [5] . Although routing power forms a significant part of our results, it is logic power that dominates for the benchmarks used, which are arithmetic intensive and feature almost no random logic. The lower proportion of routing power can be attributed to the fact that the LUTs of large arithmetic components can be placed in a regular fashion that avoids long routes, except at the boundaries of components.
Table 1 lists the names of the benchmarks used in this work, a short description of their behaviour, their size in SLICEs, and their numbers of inter- and intra-component routing wires (when implemented on the Xilinx Virtex 4 family). The first four systems in Table 1 are simple filters designed by the authors using Xilinx System Generator [4], while the remainder are example systems included in the System Generator package. These systems contain a [...] The systems in Table 1 were generated using System Generator 8.2 [4], synthesized using Synplicity Synplify Pro 8.8.0 [11], and then fitted onto each of the devices listed in Table 2 using Xilinx ISE 9.1 [4]. The routed circuit for each benchmark was simulated with appropriate input vectors selected for each system to allow the contribution to the power consumption breakdown to be estimated using Xilinx XPower [4]. The data presented in this paper was obtained using the tool flow summarized in Figure 1 (see footnote 1), which is based upon the data extraction method outlined in [8].
The resulting power reports were analyzed and cross-referenced with the other information gathered in Figure 1 to allow dynamic power consumption to be broken down into logic, intra- and inter-routing, clocks, and IOB power. Figure 2 shows the dynamic power consumed in each part of the chip, averaged across the benchmarks in Table 1.
Although there are many more intra-routing than inter-routing nets (see Table 1), the results indicate that inter-routing power dominates over intra-routing power.
Footnote 1: Our code for extracting the data in Figure 1 from a Xilinx design is available for download at http://cas.ee.ic.ac.uk/people/jac/.
Footnote 2: XPower is not yet fully characterized for the Virtex 5, so these results should be treated with caution. In particular, the Clocks portion is uncharacterized and has been assumed to be 16%, as in the Virtex 4.
Fig. 2. The average distribution of dynamic power consumption using the benchmarks in Table 1.
In the next section we investigate the dependence of capacitance on parameters used in previous work [6, 7], before introducing a new method for early prediction of inter-routing wire capacitance in Section 5.
ESTIMATING NET CAPACITANCE
In this section we examine the dependence of net capacitance on Fan-Out (FO), half-perimeter Bounding Box (BB), Wire-Length (WL), and the architecture-specific parameters used in [7]. We also compare the change in this dependency as devices progress through the different technology nodes represented by the families in Table 2.
Four capacitance prediction models were constructed as follows: i) a pre-placement model using FO alone as in [6], ii) a post-placement model using FO and BB, iii) the linear model (M8) proposed by Anderson and Najm in [7], which uses FO, BB, and some architecture-specific parameters (see footnote 3), and iv) a post-routing model using FO and WL. Each model is a linear function of its parameters, such as:
$$C = \alpha\,\mathrm{FO} + \beta\,\mathrm{BB} + \gamma \qquad (2)$$
which estimates the capacitance $C$ of a wire with fan-out FO and bounding box BB (i.e. model ii), where $\alpha$, $\beta$ and $\gamma$ are coefficients characterized for the model.
Footnote 3: The architecture-specific post-placement parameters used in [7] are counts of the F-LUT load pins, G-LUT load pins, and CLB tiles containing at least one terminal. The Virtex 5 has a different SLICE structure so we have updated [7] ...
Fig. 4. The RMSRE achieved when the methods shown were fitted to the capacitance values from all the inter-routing wires extracted from the benchmarks from Table 1.
Linear functions of the parameters have been used for the following reasons:
* Each extra LUT input that a wire drives due to increasing fan-out will cause a linear increase in capacitance,
* Assuming capacitance per unit distance is constant, then capacitance increases linearly with wire-length, and,
* Bounding-box is commonly used as a substitute for minimum wire-length at the placement stage as it is much more easily calculated.
For each device a separate set of coefficients is characterized using the 4868 inter-routing nets from all of the benchmark systems in Table 1. The coefficients of each prediction model are selected so as to minimize the Root Mean Squared Relative Error (RMSRE) in capacitance over all the inter-routing wires for each device, given by (3), where $N$ is the number of wires and, for wire $i$, $\hat{c}_i$ is the capacitance estimated by the model and $c_i$ is the capacitance measured using XPower.
$$\mathrm{RMSRE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{\hat{c}_i - c_i}{c_i} \right)^{2}} \qquad (3)$$
The RMSRE is minimized by using weighted least squares regression, where the weight used for each squared residual is the reciprocal of the square of the measured capacitance, $1/c_i^2$, so that the weighted sum of squares equals the sum of squared relative errors.
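The fitting procedure described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation used to produce the results: the function names (`rmsre`, `fit_relative`, `predict`, `solve`) and the toy data are hypothetical. Each squared residual is weighted by 1/c_i^2, which makes the weighted sum of squares equal to N times the squared RMSRE.

```python
import math

def rmsre(estimated, measured):
    """Root mean squared relative error between estimates and measurements."""
    n = len(measured)
    return math.sqrt(sum(((e - m) / m) ** 2 for e, m in zip(estimated, measured)) / n)

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting.
    Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_relative(features, measured):
    """Weighted least squares with weights 1/c_i^2 (minimises RMSRE).
    A constant term is appended to each feature row, so for (FO, BB) rows the
    returned coefficients are [alpha, beta, gamma]."""
    rows = [list(f) + [1.0] for f in features]
    p = len(rows[0])
    ata = [[0.0] * p for _ in range(p)]
    atb = [0.0] * p
    for row, c in zip(rows, measured):
        w = 1.0 / (c * c)
        for i in range(p):
            atb[i] += w * row[i] * c
            for j in range(p):
                ata[i][j] += w * row[i] * row[j]
    return solve(ata, atb)

def predict(coeffs, features):
    """Evaluate the linear model for each feature row."""
    return [sum(k * x for k, x in zip(coeffs, list(f) + [1.0])) for f in features]

# Toy data (hypothetical): capacitance generated exactly as 2*FO + 0.5*BB + 1.
features = [(1, 2), (2, 5), (3, 1), (4, 7), (5, 3)]
measured = [2 * fo + 0.5 * bb + 1 for fo, bb in features]
coeffs = fit_relative(features, measured)
print(coeffs, rmsre(predict(coeffs, features), measured))
```

On noise-free toy data such as this the fit recovers the generating coefficients and the RMSRE is close to zero; on real nets the same code would return the coefficients that minimise (3).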
In Figure 4 the RMSRE for each capacitance estimation model, characterized to each device, is shown. It is clear that capacitance has a fairly low dependence on Fan-Out alone, as the error measured when using this parameter to predict capacitance is 65% for the Virtex 2, decreasing to 48% for the Virtex 4. Adding Wire-Length to the model reduces estimation error to 54% for the Virtex 2, but the gap between the models FO and FO + WL increases for newer devices, indicating that capacitance has a greater dependence on Wire-Length as feature size decreases.
For earlier FPGAs the capacitance of a wire is more affected by the transistors within the routing fabric, i.e. which paths through switch boxes are used and which LUT inputs are driven by a wire. Indeed, the improvement in accuracy of the model AN over FO + BB, achieved by using the architecture-specific parameters of a net (see footnote 3), is substantial for the Virtex 2, but decreases to the point where very little is gained in the Virtex 5. Interconnect capacitance is becoming more affected by the lengths of metal wires used, as metal dimensions are not shrinking at the same rate as logic, in order to avoid the consequent impact on routing delay.
This increased dependence on Wire-Length benefits the model using Bounding Box, which, as seen in Figure 4 , only exhibited a slight improvement over using Fan-Out alone in the Virtex 2, but gives large improvements in estimation accuracy for more recent devices.
These results indicate that, for newer devices, estimates of Bounding Box made before placement could be used to improve the accuracy achievable compared to using Fan-Out alone when estimating capacitance. In the following section we describe one such technique we have developed.
ENHANCING EARLY CAPACITANCE ESTIMATES
In this section we introduce a novel technique for estimating the bounding boxes of inter-routing wires in a system before RTL synthesis, given a high-level description of the system. From high-level descriptions of a design, such as the System Generator block diagrams used in this work, the types of blocks used (arithmetic operations, etc.), and the topology of the circuit, i.e. how these blocks are connected together, are known. We assume that the following is also available in a pre-characterized library for each block type: i) the area in SLICEs that the block type occupies, and ii) the number of LUTs that each input of a block fans out to within the block itself.
The block area information is used along with signal word-length and circuit topology information in our method for estimating bounding box, while the input pin fan-out information is used to estimate the fan-out of each inter-routing net without the need for RTL synthesis.
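As a rough illustration of how such a library could yield fan-out without RTL synthesis, the sketch below sums, for one inter-routing wire, the LUT fan-out of every block input pin that the wire drives. The library contents, block-type names and pin names are hypothetical, and summing the per-pin counts over the sinks is an assumed interpretation of the description, not a formula given in the text.

```python
# Hypothetical pre-characterized library: for each block type, its area in
# SLICEs and the number of LUTs each input pin fans out to inside the block.
LIBRARY = {
    "adder": {"area": 8, "pin_fanout": {"a": 1, "b": 1}},
    "multiplier": {"area": 70, "pin_fanout": {"a": 4, "b": 4}},
}

def estimate_net_fanout(sinks, library=LIBRARY):
    """Estimate the fan-out of one inter-routing wire.

    sinks is a list of (block_type, input_pin) pairs, one per block input
    that the wire drives. The estimate is the total number of LUT inputs
    reached, assuming the per-pin counts simply add up.
    """
    return sum(library[block_type]["pin_fanout"][pin] for block_type, pin in sinks)

# A wire driving input "a" of an adder and input "b" of a multiplier.
print(estimate_net_fanout([("adder", "a"), ("multiplier", "b")]))  # 1 + 4 = 5
```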
Given the information available, bounding box estimates could be obtained by floorplanning the components. However, this is a time-consuming operation, and we wish to perform high-speed estimation.
We therefore use a novel approximation of the 2D placement problem in 1D space. We consider the components as points in a single dimension stretched between the input point at position 0, representing all FPGA input pins, and the output point at position C, representing all FPGA output pins. We then formulate the following model for optimal placement over this interval.
Two Linear Programs (LPs) are solved in order to optimize the positions of components and the wire-lengths in the problem formulation as an attempt to model timing driven placement. In the first LP the length of the longest wire is minimized. In the second LP total wire-length is minimized, without exceeding the longest wire-length achieved by the first LP. After the second LP is solved, the length of each wire is extracted and used as an estimate of Bounding Box. The formulations can be summarized as follows.
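Given a set of blocks $V$ where each block $v_j \in V$ drives one net $n_j = \{v_j, v_a, v_b, v_c, \ldots\} \subseteq V$ (i.e. the net $n_j$ connects the blocks $\{v_j, v_a, v_b, v_c, \ldots\}$), we define the following variables in both LPs for each block $v_j$: $x_j$, the position of the block in the interval $[0, C]$; $\mathit{max}_j$ and $\mathit{min}_j$, the largest and smallest positions of the blocks connected by $n_j$; and $L_j$, the length of the net $n_j$. The value of $C$ is determined for a benchmark using the half-perimeter of the square whose area is equal to the sum of the area of all blocks in the benchmark, calculated by:
$$C = 2 \sqrt{\sum_{v_j \in V} \mathrm{Area}_j} \qquad (4)$$
where $\mathrm{Area}_j$ returns the area in CLB tiles of the block $v_j$ in the benchmark.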
In both LPs the following constraints are defined for each block $v_j$ which has output net $n_j$, in order to correctly calculate each net length $L_j$:
$$\begin{aligned} L_j &= \mathit{max}_j - \mathit{min}_j \\ \mathit{max}_j &\ge x_i \quad \forall v_i \in n_j \\ \mathit{min}_j &\le x_i \quad \forall v_i \in n_j \end{aligned} \qquad j = 1, 2, \ldots, |V| \qquad (5)$$
The strength of the 2D to 1D approximation is that we can additionally make use of the component area information to enforce that the output net $n_j$ of each block $v_j$ is at least as long as a minimum length $\mathit{minWL}_j$, equal to the sum of the minimum dimension of each block the net connects:
$$L_j \ge \mathit{minWL}_j \quad \text{where} \quad \mathit{minWL}_j = \sum_{v_i \in n_j} \sqrt{\mathrm{Area}_i} \qquad (6)$$
under the assumption that each block occupies a square area.
Note, however, that this is a lower bound, and each net nj may be stretched to a longer length than minWLj due to the stretching of the circuit between 0 and C.
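The quantities used in these constraints are simple to evaluate for a given 1D placement. The sketch below computes the interval length C, a net length L_j and the lower bound minWL_j for a toy netlist; the block names, areas and positions are hypothetical, and the code only evaluates the constraints for a candidate placement rather than solving either LP.

```python
import math

def interval_length(areas):
    """Half-perimeter of the square whose area equals the total block area."""
    return 2.0 * math.sqrt(sum(areas.values()))

def net_length(net, positions):
    """L_j = max_j - min_j over the 1D positions of the blocks on the net."""
    xs = [positions[block] for block in net]
    return max(xs) - min(xs)

def min_wire_length(net, areas):
    """minWL_j: sum of the side lengths of the (assumed square) blocks on the net."""
    return sum(math.sqrt(areas[block]) for block in net)

# Hypothetical example: three blocks with areas in CLB tiles.
areas = {"mult": 16.0, "add": 4.0, "reg": 16.0}
nets = {"mult": ["mult", "add"], "add": ["add", "reg"]}  # driver -> blocks on its net
positions = {"mult": 0.0, "add": 6.0, "reg": 12.0}       # a candidate 1D placement

C = interval_length(areas)  # 2 * sqrt(36) = 12
for driver, net in nets.items():
    length = net_length(net, positions)
    bound = min_wire_length(net, areas)
    print(driver, length, bound, length >= bound)
```

In this example both nets sit exactly on their lower bound of 6, and the blocks span the whole interval from 0 to C = 12.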
The objective of the first LP is to minimize worst case wire-length:
$$\min L_{max} \quad \text{where} \quad L_{max} \ge L_j, \ \forall v_j \in V \qquad (7)$$
The worst case wire-length $L_{max}$ achieved in the first LP is then used to form the following constraint in the second LP:
$$L_j \le L_{max}, \quad \forall v_j \in V \qquad (8)$$
which constrains the worst case wire-length in the second LP to be the same as that achieved in the first. The second LP then minimizes total wire-length by using the objective:
$$\min \sum_{v_j \in V} L_j \qquad (9)$$
The number of variables in the LP formulations is $\Theta(|V|)$, while the number of constraints in the formulations is $\Theta(w)$, where $w$ is the number of net source-to-sink pairs:
$$w = \sum_{v_j \in V} \left( |n_j| - 1 \right), \quad \text{so} \quad w < |V|^2 \qquad (10)$$
As a result our LP formulations are $O(|V|^2)$ to solve optimally, on average [12], and so scale well with problem size.
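The sketch below counts the source-to-sink pairs w for a toy netlist and mimics the ordering of the two objectives on a finite list of candidate placements: the longest net is minimised first, and total net length is minimised among the candidates that achieve that longest-net value. It is a stand-in for illustration only. It does not solve the LPs, it ignores the minimum-length and interval constraints, and all names and data are hypothetical.

```python
def count_source_sink_pairs(nets):
    """w = sum over nets of (|n_j| - 1): one pair per sink of each net."""
    return sum(len(net) - 1 for net in nets.values())

def net_lengths(nets, positions):
    """Length of every net: max position minus min position of its blocks."""
    return {
        driver: max(positions[b] for b in net) - min(positions[b] for b in net)
        for driver, net in nets.items()
    }

def pick_placement(candidates, nets):
    """Choose the candidate with the smallest longest net, breaking ties by
    the smallest total net length (the order of the two LP objectives)."""
    def key(positions):
        lengths = list(net_lengths(nets, positions).values())
        return (max(lengths), sum(lengths))
    return min(candidates, key=key)

# Hypothetical netlist: block "a" drives "b" and "c"; block "b" drives "c".
nets = {"a": ["a", "b", "c"], "b": ["b", "c"]}
blocks = {"a", "b", "c"}
w = count_source_sink_pairs(nets)
print(w, w < len(blocks) ** 2)  # 3 True

candidates = [
    {"a": 0, "b": 4, "c": 8},  # longest net 8, total 12
    {"a": 0, "b": 2, "c": 6},  # longest net 6, total 10
    {"a": 0, "b": 5, "c": 6},  # longest net 6, total 7
]
print(pick_placement(candidates, nets))  # the third candidate
```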
In the following section the accuracy and computational effort achieved when using this method to predict capacitance is compared to that achieved when using estimates made at various stages in the design flow.
RESULTS
This section compares the accuracy and computation times of several methods of capacitance estimation, including the method introduced in this paper.
As we are using capacitance estimates to drive high-level optimizations, we must consider that each high-level change made to a design will cause the place and route tools to fit a circuit to a chip differently. To model the unpredictable nature of place and route when changes are made at a high level, we generate five alternative placements of each benchmark, using different random seeds for the Xilinx PAR tool for each placement. Each capacitance model must be characterized once for each device family, so this was done using the inter-routing wires from the benchmarks, placed using the first random seed. Estimates made by each model are then compared to the capacitances measured in the other four placements, generated using different random seeds.
We compare the capacitance estimation models below:
* FO: A linear model using Fan-Out alone as in [6].
* FO + est. BB: A linear model using Fan-Out and the Bounding Box estimate for each component's output, given by the proposed method.
* AN: Anderson & Najm's post-placement model (M8) [7].
* PAR: The capacitance values from XPower extracted from the placement using the first random seed.
The proposed technique, FO + est. BB, provides a fairly modest reduction of 2-3% RMSRE over using Fan-Out alone. However, the method also allows much more accurate identification of those inter-routing wires in a benchmark which have the highest capacitance. In Figure 5, we have used Spearman's rank correlation coefficient to measure the similarity of the ordering of net capacitance estimated by each model to the ordering of the measured capacitances. We can see that the rank correlation has been improved for all devices, particularly for the Virtex 2 and Virtex 4 where the correlation coefficient for the proposed method has increased by approximately 0.2 over the FO model. The proposed method can more accurately identify high capacitance nets at an early stage, allowing these to be targeted in power-consumption optimization algorithms.
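Spearman's rank correlation coefficient, as used for this comparison, is the Pearson correlation of the two rank vectors. A minimal standard-library sketch follows; the function names and sample values are hypothetical. Tied values receive the average of their ranks, and the function assumes neither input is constant, since a constant input would give a zero denominator.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(estimated, measured):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(estimated), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Estimated versus measured capacitance for four nets (arbitrary units).
# Two neighbouring nets are swapped in the estimated ordering.
print(spearman([1.0, 3.0, 2.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

A coefficient of 1 means the estimates order the nets exactly as the measurements do, which is the property that matters when selecting the highest-capacitance nets for optimization.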
Further improvements in accuracy can be achieved by using the post-placement model AN, or by routing a circuit as in PAR; however, these methods are associated with much longer computation times. For the medium-sized benchmark PolyphaseFIR, the proposed method took 0.58 s to calculate capacitance estimates, while generation, synthesis, mapping and placement for the Virtex 4, required to use the method AN, took 161 s. Routing, which is needed to use the PAR model, took a further 795 s. For this system, the proposed capacitance estimation method is therefore roughly 300 times faster than a post-placement method (161 s against 0.58 s) and of the order of 2000 times faster than a post-routing method (956 s against 0.58 s). The proposed method successfully trades off a reduction in accuracy for a large reduction in computational effort by calculating capacitance estimates before RTL synthesis.
CONCLUSION
In this paper we investigated the feasibility of providing early estimates of routing wire capacitance in order to allow their power consumption to be estimated and optimized during high-level synthesis. We demonstrated that much more power is consumed in inter-routing wires than in intra-routing wires.
It was shown that capacitance in routing wires is becoming more dependent on wire-length in Xilinx devices since the Virtex 2, and as such bounding box could be used to significantly improve capacitance estimates made before routing is performed. A method for estimating the bounding box of the inter-routing wires of a system was proposed which uses circuit topology and pre-characterized component fan-out and area values in order for it to be calculable before RTL synthesis. We demonstrated that the proposed method helps to improve capacitance estimates over using fan-out alone [6], while being far less computationally expensive than later-stage techniques such as [7].
The proposed method will allow high-capacitance routing wires to be targeted by optimizations before performing RTL-synthesis, and in future work will be combined with other power estimation methods to form a complete dynamic power model for high-level optimization algorithms.
