Fast word-level power models for synthesis of FPGA-based arithmetic by Clarke, JA et al.
Jonathan A. Clarke, Altaf Abdul Gaffar, George A. Constantinides, Peter Y. K. Cheung
Department of EEE, Imperial College London
Email: {jonathan.a.clarke, altaf.gaffar, g.constantinides, p.cheung}@imperial.ac.uk
Abstract- This paper presents power models for multiplication and addition components on FPGAs which can be used at a high-level design description stage to estimate their logic and intra-component routing power consumption. The models presented are parameterized by the word-length of the component and the word-level statistics of its input signals. A key feature of these power models is the ability to handle both zero mean and non-zero mean signals. A method for measuring intra-component routing power consumption is presented, enabling the power models to account for both logic and routing power in components. The resulting models are equations which can be used to estimate the power consumed in an arithmetic component in a fraction of a second at the pre-placement stage of the design flow. The models have a mean relative error of 7.2% compared to bit-level power simulation of the placed-and-routed design.

I. INTRODUCTION

Power consumption of Field-Programmable Gate Arrays (FPGAs) is rapidly increasing due to more tightly packed logic and higher clock speeds. This is causing costs to rise for larger devices as they now require more expensive packaging, heatsinks, or fans, and prevents smaller devices being used in portable applications where short battery life and little opportunity for heat dissipation are dominant factors. Digital circuits consume power when transistors within the circuit switch as computation is performed (dynamic power), and also when no switching is taking place, due to leakage currents in the device (static power). It is FPGA vendors who choose the number and layout of the transistors on a device, and hence they shoulder the task of optimizing the static power of future architectures. Hardware designers, however, can influence the dynamic power consumed on an FPGA through decisions in the design process, hence it is essential to provide tools which allow the comparison of the effects of different design changes on dynamic power consumption at various stages of the design flow.

Previous work by the authors presented a technique for estimating the logic power consumption in arithmetic components on an FPGA where the inputs to a component were modelled as zero mean Gaussian signals [1]. This work extends our previous work by relaxing the zero-mean constraint, and by introducing a technique for estimating the power consumed within the routing of multipliers, whilst using only a small number of parameters in the models developed. The resulting models are very fast compared to low-level power estimation techniques, but still account for glitching within components through characterization of the equations used.

The contributions of the work in this paper can be summarized as follows:

* an improved power model which allows for non-zero mean Gaussian input signals, allowing a greater range of signals to be modelled accurately, and,
* establishing a link between the logic power consumed by a component and the routing power within that component, allowing intra-component routing power to be estimated.

II. BACKGROUND

The average power consumed in a particular capacitive element within a device can be calculated using (1), where: $\alpha$ is the average number of logic transitions that the signal on the element makes during one clock cycle, known as the switching activity of the signal, $C$ is the capacitance of the element, $V_{dd}$ is the power supply voltage, and $f_{clk}$ is the clock frequency.

$$P = \frac{1}{2}\,\alpha\, C\, V_{dd}^{2}\, f_{clk} \qquad (1)$$

By using this equation on every capacitive component in an FPGA the dynamic power consumed on the device can be estimated. However, this requires knowledge of the average activity of each signal in the circuit and the capacitance which that signal switches. These are only available once a design has been synthesized, placed and routed (to obtain the capacitance values), and then simulated at a low level (to obtain the activities of each signal) with test vectors which are expected to be typical inputs to the design. This process is very computationally intensive and so is unsuitable for use in power-conscious high-level synthesis tools where many iterations of power estimates followed by design modifications are made.

In [2] the transition density technique is used to avoid low-level simulation of an FPGA design. Switching density estimates are instead propagated through the circuit, reducing the computational cost of estimating switching activity rates for each node in the circuit, though it is still necessary to place and route the circuit. In [3] the transition density technique is also used, but the authors describe a technique for estimating the capacitance of routing wires between components without needing to perform place and route. Even so, the synthesis and technology mapping stages which must still be performed
0-7803-9390-2/06/$20.00 ©2006 IEEE 1299 ISCAS 2006
are too computationally intensive for use within a high-level synthesis loop.

Work in [4] and also in [5] uses characterized equations of functional components and most closely resembles the work presented here. In [4] a power model is described which is characterized using bit-level activity characteristics of the inputs and outputs to a component. The model is not parameterized by word-length, however, meaning that separate models are created for identical operations with different word-lengths. In [5] a model is presented which uses bit-level characteristics of the inputs to a component, as well as its word-length, to calculate the power consumed in that component. However, the work is unable to distinguish between signals with similar bit-level activities averaged across the signal, but with very different statistical distributions, which may cause the component driven by the signal to consume different amounts of power than predicted.

In [6] the authors demonstrate that it is possible to abstract away from the bit-level activities of a signal by grouping adjacent bits into regions which exhibit similar behaviour. Three different groups of bit-level activity were identified in a zero-mean Gaussian signal which are, in order from LSB to MSB: spatially and temporally uncorrelated LSB bits, a region with increasing spatial correlation towards the MSB, and the sign bit(s) at the MSB end of the signal. The authors demonstrate that it is possible to estimate the breakpoints between these groups of activity using only the lag-1 autocorrelation and standard deviation for zero-mean Gaussian signals.

For a statistically stationary Gaussian signal with non-zero mean, the signal varies with a Gaussian distribution around some constant mean value. This is reflected in the bit-level activity of the signal by a number of constant-valued bits at the MSB end, referred to here as the mean bits in a Gaussian signal, which are followed by the two lower groups of activity from [6]. The number of mean bits $N_\mu$ in a Gaussian signal with mean $\mu$ and standard deviation $\sigma$ can be approximated using (2). Here $\log_2(|\mu| + 3\sigma)$ represents the bit-position relative to the binary point of the largest bit needed to represent the signal, and $\log_2(3\sigma)$, taken from [6], represents the bit-position relative to the binary point of the largest bit needed to represent the Gaussian variation about the mean of the signal (an uncorrelated Gaussian signal lies in the $\pm 3\sigma$ range 99.7% of the time). This simplifies as shown in (2), where it can be seen that the larger the ratio $|\mu|/\sigma$, the more mean bits there are in the signal. Mean bits in a signal do not make any transitions, and so strongly affect the power consumed in arithmetic components driven by such signals. As such the power models described in Section IV use the number of mean bits in the input signals to help determine the power consumed in arithmetic components.

$$N_\mu = \log_2(|\mu| + 3\sigma) - \log_2(3\sigma) = \log_2\left(\frac{|\mu|}{3\sigma} + 1\right) \qquad (2)$$

IV. POWER MODEL CONSTRUCTION

This section describes the power models developed which can be used at the design description stage to estimate the power consumed in an arithmetic component, given information about the component's word-length and some word-level signal statistics regarding the signals which pass through it. Currently the tools developed work with Xilinx System Generator [7], which allows arithmetic intensive FPGA designs such as DSP algorithms to be entered using a block diagram interface. Our tool monitors the signals at the inputs of each arithmetic component in the design during a high-level floating-point simulation, and measures the word-level signal statistics which are used in the power model for each signal. These are then used along with the word-length information in the design to estimate the power consumed in each component. This process can be done entirely without synthesizing or mapping the design to a target FPGA. The models described are intended to estimate the power consumed in adder and multiplier components implemented on Xilinx Virtex2 [7] devices, though the parameters selected for use in our power models are general enough to be used on other devices where ripple carry adders and tree-based multipliers are used. Table I shows the equations for the power models developed, which are explained in detail in the following subsections.

A. Adders

Our previous work [1] indicated that the logic power consumed in an adder increases linearly with the word-length of the component, and can be estimated using $C_0(\sigma_y, \rho_{xx1}, \rho_{yy1}) W_s + C_1(\sigma_y, \rho_{xx1}, \rho_{yy1})$, where $\sigma_y$ is the standard deviation of the smaller of the two inputs, scaled according to the standard deviation of the larger signal, $\rho_{xx1}$, $\rho_{yy1}$ are the lag-1 autocorrelation coefficients of the signals, $C_0$, $C_1$ are functions mapping these parameters into coefficient values, and $W_s$ is the word-length of the component which captures the linear dependence of power on word-length. If both input signals to an adder contain a number of mean bits then a corresponding decrease in the power consumed in the component can be expected. An extension is proposed here to account for mean bits in the signal $x$ with larger standard deviation, by modelling the power consumed in an adder of word-length $W$, and a number of mean bits $N_{\mu x}$ in the input signal $x$, as equivalent to the power consumed in a smaller adder with word-length $W_s = W - N_{\mu x}$, as shown in Table I, where $N_{\mu x}$ is given by (2).

B. Multipliers

Multipliers implemented in Look-Up Tables (LUTs) on the Xilinx Virtex2 FPGA are constructed as a set of partial product generators at the top level followed by a summation tree to sum the partial products and generate the multiplication result. The work in [1] accounts for the effects of the autocorrelations $\rho_{xx1}$, $\rho_{yy1}$ of the input signals using $C_0(\rho_{xx1}, \rho_{yy1}) W^2 + C_1(\rho_{xx1}, \rho_{yy1})$, where $C_0$, $C_1$ are functions mapping the autocorrelations to coefficient values, and $W^2$ accounts for the quadratic dependence of power on word-length in multipliers.
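As a worked illustration of the mean-bit approximation in (2) and of the effective adder word-length $W_s = W - N_{\mu x}$, consider a 16-bit adder whose larger input has mean 21 and standard deviation 1. These values are assumptions chosen only to make the arithmetic easy to follow; they are not taken from the experiments in this paper.

```latex
% Illustrative values only (assumed): \mu = 21, \sigma = 1, W = 16
N_{\mu x} = \log_2\!\left(\frac{|\mu|}{3\sigma} + 1\right)
          = \log_2\!\left(\frac{21}{3} + 1\right)
          = \log_2 8 = 3,
\qquad
W_s = W - N_{\mu x} = 16 - 3 = 13 .
```

Under these assumed statistics the three most significant bits do not toggle, so the model would treat the component as a 13-bit adder when estimating its logic power.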
TABLE I
THE RELATIONSHIP BETWEEN WORD-LENGTH (W) AND DYNAMIC POWER CONSUMPTION (P) FOR ADDERS AND MULTIPLIERS.
Component Dynamic Power
Adder Logic    $P = C_0(\sigma_y, \rho_{xx1}, \rho_{yy1}) W_s + C_1(\sigma_y, \rho_{xx1}, \rho_{yy1})$, where $W_s = W - N_{\mu x}$
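The adder row of Table I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the coefficient values are invented for the example, and the coefficient functions $C_0$, $C_1$ are assumed to have been evaluated already for the signal statistics of interest (their characterized form is not reproduced here).

```python
import math


def mean_bits(mu: float, sigma: float) -> float:
    """Approximate number of mean bits, following (2): log2(|mu| / (3*sigma) + 1).

    The result is kept real-valued; whether it should be rounded to a whole
    number of bits is not stated in the text, so no rounding is applied.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.log2(abs(mu) / (3.0 * sigma) + 1.0)


def adder_logic_power(W: int, mu_x: float, sigma_x: float,
                      c0: float, c1: float) -> float:
    """Adder row of Table I: P = C0 * Ws + C1, with Ws = W - N_mu_x.

    c0 and c1 are the already-evaluated values of C0(sigma_y, rho_xx1, rho_yy1)
    and C1(sigma_y, rho_xx1, rho_yy1); x is the input with the larger
    standard deviation.
    """
    w_s = W - mean_bits(mu_x, sigma_x)
    return c0 * w_s + c1


# Hypothetical coefficient values, for illustration only.
C0_EXAMPLE = 0.5
C1_EXAMPLE = 2.0

p_zero_mean = adder_logic_power(16, 0.0, 1.0, C0_EXAMPLE, C1_EXAMPLE)  # Ws = 16
p_offset = adder_logic_power(16, 21.0, 1.0, C0_EXAMPLE, C1_EXAMPLE)    # Ws = 13
print(p_zero_mean, p_offset)
```

With the same hypothetical coefficients, the input with a large mean yields a lower estimate than the zero-mean input, which is the behaviour the mean-bit extension is intended to capture.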




Fig. 1. (a) The effects of the sign of the mean of the multiplier signal on the logic power consumption in an 18 x 18-bit multiplier component. 1500 tests are shown with independently chosen, uniformly distributed random input statistics. (b) How word-length affects the percentage difference between the power consumed when the mean value in the multiplier signal is positive or negative.

Fig. 2. (a) The error in the logic power estimate made using the autocorrelations of the inputs and sign of the multiplier signal mean, and its relationship to the number of mean bits in the multiplicand, in a 16 x 16-bit multiplier. 1500 tests are shown with independently chosen, uniformly distributed random input statistics. (b) The over-estimation factor (i.e. the gradient of the graph in Figure 2(a)) per mean bit in the multiplicand, across different word-length multipliers.
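The two correction terms that this section introduces for multipliers, $[\mathrm{Sign}_m(\gamma_1 W^{-1} + \gamma_0) + 1]$ and $[N_{\mu y}(Q_1 W^{-1} + Q_0) + 1]$, can be sketched as scaling factors applied to the base model $C_0 W^2 + C_1$ from [1]. The multiplicative combination is an assumption made here, suggested by the "+ 1" form of each term; the function name, argument names and all numeric values are hypothetical.

```python
def multiplier_logic_power(W: int, c0: float, c1: float,
                           sign_m: int, n_mean_bits_y: float,
                           gamma1: float, gamma0: float,
                           q1: float, q0: float) -> float:
    """Sketch of a multiplier logic power estimate with two correction factors.

    base      = C0 * W^2 + C1                       (model from [1])
    sign term = Sign_m * (gamma1 / W + gamma0) + 1   (Sign_m is 0 or 1)
    mean term = N_mu_y * (Q1 / W + Q0) + 1           (mean bits in the multiplicand)

    ASSUMPTION: the three factors are multiplied together.
    """
    if sign_m not in (0, 1):
        raise ValueError("sign_m must be 0 or 1")
    base = c0 * W ** 2 + c1
    sign_term = sign_m * (gamma1 / W + gamma0) + 1.0
    mean_term = n_mean_bits_y * (q1 / W + q0) + 1.0
    return base * sign_term * mean_term


# Hypothetical coefficients, for illustration only.
p_base = multiplier_logic_power(16, 0.01, 1.0, 0, 0, 1.6, 0.0, -0.8, 0.0)
p_corrected = multiplier_logic_power(16, 0.01, 1.0, 1, 2, 1.6, 0.0, -0.8, 0.0)
print(p_base, p_corrected)
```

Because both corrections scale with $W^{-1}$, their relative influence shrinks as the word-length grows, which matches the inverse relationship with word-length described in the text.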
When considering non-zero mean signals, it was noted that the MSB of the multiplier signal has an approximately 50% higher fanout than all other input bits to a multiplier component, due to sign extension of this bit in each partial product pair which is summed at the top level of the component. Approximately two thirds of the logic driven by this signal within the multiplier uses it as a gating signal. Thus, when the multiplier signal is positive and its MSB is 0, transitions in the other signals driving this logic are not propagated through the rest of the multiplier, causing lower power consumption than when the multiplier signal is negative and transitions are propagated through the component. As a result, the sign of the multiplier signal can have a significant effect on power consumption, as shown in Figure 1(a). However, this effect decreases as the word-length of the multiplier component increases, as its power consumption grows quadratically with word-length, whilst the amount of logic driven by the MSB in the multiplier signal (and hence the power consumed in this logic) only increases linearly with word-length. As such, an inversely proportional relationship exists between the word-length of a multiplier and the change in power consumption due to the sign of the multiplier signal, as can be seen in Figure 1(b). This is reflected by the introduction of the term $[\mathrm{Sign}_m(\gamma_1 W^{-1} + \gamma_0) + 1]$ into the multiplier logic power model in Table I, where $\mathrm{Sign}_m$ is a binary variable equal to 0 iff the mean of the multiplier signal $x$ is positive, while $\gamma_1$, $\gamma_0$ are architecture specific coefficients.

The other term introduced to the power model in Table I accounts for the number of mean bits in the multiplicand $y$. Figure 2(a) shows that once the autocorrelation in the input signals and the sign of the multiplier signal have been accounted for, the logic power consumed decreases approximately linearly with the number of mean bits in the multiplicand (hence a linear increase in over-estimation is seen in the data). Figure 2(b) shows that the rate of this decrease across multipliers of word-lengths 8 to 28 bits is inversely related to word-length. Again, this is due to the amount of logic affected by one mean bit in the multiplicand increasing linearly with word-length, while the overall power consumption of the component increases quadratically. The term $[N_{\mu y}(Q_1 W^{-1} + Q_0) + 1]$ in Table I accounts for these effects.

C. Routing power within multipliers

The dynamic power consumed within the routing of a multiplier implemented in slices is directly related to the activity of the input signals of the multiplier and the power consumed in the logic of the multiplier. To establish the nature of this relationship a test system was developed which would allow for the separate measurement of intra-component routing power consumption (i.e. within multipliers) and inter-component routing power consumption (between components). The diagram in Figure 3(a) illustrates the test circuit used, where the input and output pins of the FPGA on which the multiplier is to be implemented have been constrained to lie in bit order at particular positions on the device. This forces the multiplier to be consistently placed in the same location independently of its word-length; as a result, each multiplier will have approximately the same length of inter-component routing between its terminals and the pins of the chip.

The largest multiplier used was 28x28 bits; when smaller multipliers were placed on the chip those signals not used by
the multiplier were routed from the unused input pins to the unused output pins. This allowed the routing power of a base case to be measured where only routing, and no multiplier, was placed on the chip. By subtracting the routing power consumed in the base case, measured using XPower [7], from the routing power consumed by the other circuits it was possible to obtain an estimate of the power consumed within the internal routing of the multipliers tested. Note that the outputs of each multiplier were registered to avoid glitch propagation through the inter-component routing to the chip pins.

Figure 3(b) shows a scatter plot of the logic and routing power consumed by multipliers for varying input word-lengths from 8 to 28 bits and varying input signal statistics. The routing power consumption in the base case was observed to be negligibly small (< 1 mW/MHz).

From these results it can be observed that when the outputs of components in a design are registered, preventing the propagation of signal glitches, the inter-component routing power is small compared to the intra-component routing power. The results can be fitted to a linear curve with a mean relative error of 8.4%, as shown in Figure 3(b). Based on this linear curve a scaling factor of $(1 + a)$, where $a$ is the gradient of the curve, can be applied to the values of $C_0$, $C_1$ from [1], shown in the multiplier equation in Table I, to allow for both the logic and internal routing power consumed in a multiplier to be estimated.

Fig. 3. (a) The test setup used to measure intra-component routing power. The inputs to the chip are constrained to lie in the same position for different word-length multipliers, and are routed across the chip even if not required by smaller multiplications. (b) The relationship between logic and internal routing power for multipliers. Note that the results have been normalized so that the base case has no routing power consumption.

V. RESULTS

The Mean Relative Error (MRE), used in this section to measure the accuracy of the models developed, is calculated using (3), where $P_{e_i}$, $P_{m_i}$ are the $i$th power estimates and measurements, respectively.

$$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_{e_i} - P_{m_i}|}{P_{m_i}} \qquad (3)$$

The tests described in this section were performed using pairs of input signals with independently chosen, uniformly distributed random values for the mean, autocorrelation and standard deviation statistics for each signal. For the adder, 1000 tests were performed on a 16-bit component. The MRE of the estimates made by the model in Table I was 4.98% compared to the estimates made by XPower.

The multiplier logic power model in Table I was characterized for each of the word-lengths 8, 10, 12, ..., 28, with 1500 tests for each word-length. Table II gives a breakdown of the accuracy of the technique in [1] and the modifications described in Section IV-B, when applied to the total of 16500 tests made across 11 different word-lengths. The MRE for the multiplier logic power model developed in this work is 7.2%.

TABLE II
ACCURACY RESULTS FOR MULTIPLIER MODEL W.R.T. XPOWER.
Estimation technique                                    Mean Relative Error
Work in [1]                                             32.7%
[1] + sign in multiplier                                11.5%
[1] + sign in multiplier + mean bits in multiplicand    7.2%

VI. CONCLUSION

This paper has presented models to estimate the logic power consumption in adders and multipliers which are parameterized using word-level statistics of the inputs to the components. The work presented has investigated the use of input signals with non-zero mean values and their effects on power consumption. The resulting power models have a MRE of at worst 7.2% and can be used at the design description level to optimize a system before or during synthesis. A method for investigating the relationship between the logic power and internal routing power consumed within a multiplier was also presented which showed that an approximately constant relationship between the two exists, with a MRE of 8.27%. Future work will integrate these tools into a synthesis package allowing designs to be optimized for power at a high level.

ACKNOWLEDGMENTS

The authors would like to acknowledge the support of Synplicity, Inc., Xilinx, Celoxica and the EPSRC under grant numbers EP/C512596/1 and EP/C549481/1.

REFERENCES

[1] J. A. Clarke, A. A. Gaffar, and G. A. Constantinides, "Parameterized logic power consumption models for FPGA-based arithmetic," in FPL, T. Rissa, S. Wilton, and P. Leong, Eds. IEEE, 2005, pp. 626-629.
[2] K. K. W. Poon, A. Yan, and S. J. E. Wilton, "A flexible power model for FPGAs," in FPL, M. Glesner, P. Zipf, and M. Renovell, Eds. Springer, 2002, pp. 312-321.
[3] J. H. Anderson and F. N. Najm, "Power estimation techniques for FPGAs," IEEE Trans. on VLSI Systems, vol. 12, no. 10, pp. 1015-1027, 2004.
[4] L. Shang and N. K. Jha, "High-level power modeling of CPLDs and FPGAs," in Proc. of the Int. Conf. on Comp. Design. IEEE Computer Society, 2001, pp. 46-53.
[5] T. Jiang, X. Tang, and P. Banerjee, "Macro-models for high level area and power estimation on FPGAs," in GLSVLSI. ACM Press, 2004, pp. 162-165.
[6] P. E. Landman and J. M. Rabaey, "Architectural power analysis: The dual bit type method," IEEE Trans. on VLSI Systems, vol. 3, no. 2, pp. 173-187, 1995.
[7] System Generator, XPower, Virtex2. Xilinx. [Online]. Available: http://www.xilinx.com/
