Power and Aging Characterization of Digital FIR Filters Architectures by Calimera, Andrea et al.
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Power and Aging Characterization of Digital FIR Filters Architectures
Calimera, Andrea; Liu, Wei; Macii, Enrico; Nannarelli, Alberto; Poncino, Massimo
Published in:
First MEDIAN Workshop 2012
Publication date:
2012
Document Version
Publisher's PDF, also known as Version of record
Link back to DTU Orbit
Citation (APA):
Calimera, A., Liu, W., Macii, E., Nannarelli, A., & Poncino, M. (2012). Power and Aging Characterization of
Digital FIR Filters Architectures. In First MEDIAN Workshop 2012
Power and Aging Characterization of Digital FIR Filters
Architectures
Andrea Calimera · Wei Liu · Enrico Macii · Alberto Nannarelli ·
Massimo Poncino
Abstract With technology scaling, newer metrics have been introduced, in addition to delay, area, and power dissipation, to characterize the behavior of digital systems. While dynamic and static power dissipation still remain the most serious concern at nanometer lengths (65 nm and below), process-variation, temperature and aging induced variations pose new challenges in the fabrication of the next generation of ICs.
This work presents a detailed power and aging characterization of digital FIR filters in an industrial 45 nm CMOS technology, and a design space exploration of different filter architectures with respect to throughput, area, power dissipation and aging. The exploration is intended to provide new design guidelines when considering aging of components in power/performance trade-offs.
1 Introduction
With rapid scaling in CMOS technology, negative bias temperature instability (NBTI) is becoming one of the major reliability concerns that can limit a device's lifetime. The NBTI effect primarily affects PMOS transistors and can lead to a shift in the threshold voltage of up to 50 mV over time. The delay increase induced by NBTI aging can severely degrade performance and, in the worst case, result in system failure [1], [2].
Recent works have shown that NBTI-induced aging may benefit from the application of traditional power management implementations, namely voltage scaling and power gating. There exist however special classes of circuits for which very few power management options are available, because of the nature of their computation. Digital filters are one relevant example of this class: they implement a sort of streaming computation, in which generally no structural idleness is present.

A. Calimera, E. Macii, M. Poncino
Politecnico di Torino, Italy
W. Liu, A. Nannarelli
Technical University of Denmark, Denmark

Since the structure of a digital filter is quite fixed (e.g., direct or transposed form), the most relevant degrees of freedom in its implementation are (a) the representation of data, (b) different approaches in implementing arithmetic operations, and (c) the frequency characteristics.
In this work, we characterize aging for a selection of architectures normally used to implement Finite Impulse Response, or FIR, filters and perform a design space exploration by comparing maximum delay (frequency/throughput), area, dynamic and static power dissipation with aging.
2 NBTI Effects on pMOS Transistor
Negative Bias Temperature Instability, or NBTI, is a time-dependent degradation mechanism which affects p-type MOS transistors. NBTI arises when a pMOS, operating at high temperature, is negatively biased (i.e., Vgs = −Vdd). Under this condition, called the stress state, the electric field across the gate dielectric causes the generation of traps at the Si/SiO2 interface. This affects the threshold voltage Vth, whose absolute value increases over time, thus causing a shift of other electrical parameters, such as the drive current Ids and the transconductance Gm. The resulting effect is the progressive slowdown of CMOS standard gates. However, as soon as the stress is removed (i.e., Vgs = 0), a condition called the recovery state, a significant fraction of the traps is annealed and Vth partially relaxes.
There is no consensus on the exact quantum-mechanical mechanisms that govern NBTI, and characterizing its effects is very challenging; several fast measurement techniques have been developed recently [3]. In the meantime, the reaction-diffusion (R-D) model [4] has emerged as the most accredited model for pMOS NBTI. Since a detailed treatment of NBTI models is beyond the scope of this paper, we limit ourselves to listing a few basic aspects that are essential for understanding the NBTI-induced effects on digital circuits.
Operating Condition. For a given set of technological parameters of a device (e.g., oxide thickness, channel strain, and nitrogen concentration), NBTI effects depend mainly on temperature (∆Vth increases with increasing T) and supply voltage (∆Vth increases with increasing Vdd). Each library cell instance therefore has its own specific NBTI-induced Vth degradation curve in a multi-parameter space (Vdd, temperature, size and elapsed time), and a customized characterization of cell libraries is required for the estimation of delay.
Static Signal Probability. The alternation of stress and recovery periods complicates the modeling of NBTI, since each single device should in principle be explicitly simulated by collecting the temporal profile of its stress/recovery cycles. Things are even more complicated for generic gates, in which each pMOS device is connected to a distinct input with its own time-dependent waveform. Fortunately, this behavior can be approximated with negligible error. It has been shown in [5] that a generic waveform can be modeled as a periodic one with the same amount of stress time, and that aging is independent of the frequency of the applied waveform. Together, these properties imply that it is the total stress time that matters (rather than the actual waveform), which allows signal probabilities obtained from simulation to be used for evaluating the effective aging.
From the above considerations it is possible to derive a simplified model that describes the Vth variation induced by NBTI:

$$\Delta V_{th} = K \cdot \alpha \cdot t^{1/4}$$

where K is a parameter which lumps all the technological constants and considers the operating conditions of the device, α is the static zero-probability of the gate signal, and t is the elapsed time.
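
As a rough illustration of the simplified model, the following Python sketch evaluates the threshold-voltage shift for a given K, α and t. The function name and the value of K used in the example are assumptions made for illustration only; the paper does not report a numerical K, and the units of the result depend entirely on the units chosen for K.

```python
def delta_vth(k: float, alpha: float, t: float) -> float:
    """Threshold-voltage shift of the simplified model: K * alpha * t**(1/4).

    k     -- lumped technology/operating-condition parameter (hypothetical value)
    alpha -- static zero-probability of the gate signal, in [0, 1]
    t     -- elapsed time, in whatever unit K was characterized for
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is a probability and must lie in [0, 1]")
    if t < 0.0:
        raise ValueError("elapsed time must be non-negative")
    return k * alpha * t ** 0.25


# Placeholder constant for illustration only; not a value from the paper.
K_EXAMPLE = 1.0

# Because of the t**(1/4) dependence, a 16x longer stress time only doubles the shift.
ratio = delta_vth(K_EXAMPLE, 0.5, 16.0) / delta_vth(K_EXAMPLE, 0.5, 1.0)
```

The sub-linear time dependence is why most of the degradation accumulates early in the circuit's life, while the linear dependence on α is what makes signal probabilities the key input of the analysis.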
3 Digital Filters
Finite Impulse Response (FIR) filters are among the most popular components used in Digital Signal Processing (DSP).

Fig. 1 FIR filters in transposed (top) and direct (bottom) form.

A FIR filter of order N is described by the expression

$$y(n) = \sum_{k=0}^{N-1} a_k \, x(n-k) \qquad (1)$$

which is implemented in hardware with a sequence of
multiply-add operations.
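
Expression (1) can be sketched in software as a plain multiply-accumulate loop. The following Python function is only a behavioral reference, not a model of the hardware architectures discussed later; the function name is hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(coeffs, samples):
    """Evaluate y(n) = sum_{k=0}^{N-1} a_k * x(n - k) for every input sample.

    coeffs  -- the N filter coefficients a_0 .. a_{N-1}
    samples -- the input sequence x(0), x(1), ...
    Samples with a negative index are treated as zero (an assumption).
    """
    n_taps = len(coeffs)
    out = []
    for n in range(len(samples)):
        acc = 0
        for k in range(n_taps):
            if n - k >= 0:
                acc += coeffs[k] * samples[n - k]  # one multiply-add per tap
        out.append(acc)
    return out


# Feeding a unit impulse returns the coefficients themselves (the impulse response).
impulse_response = fir_filter([1, 2, 3], [1, 0, 0, 0])
```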
Normally, DSP systems process data in fixed-point format as the corresponding arithmetic units (circuits) are simpler (smaller) and faster. One of the first steps in designing a digital filter is to determine the dynamic range of the system according to the filter specifications, that is, to determine the bit-width of the datapath.
In the following, we present some alternatives and their tradeoffs in terms of the conventional metrics (delay, area and power dissipation) for implementing the FIR filter of expression (1). We consider a sample FIR filter of order N = 16 with dynamic ranges of 12, 10 and 22 bits for x, a and y, respectively. However, by (1), the results can be easily extended to different dynamic ranges (datapath bit-widths) and filter orders (N).
FIR filters can be realized in either transposed or direct form, as shown in Fig. 1.
A FIR filter can be seen as the connection of N sections (referred to as "taps"), each containing a multiplier, a delay line (implemented by a register) and some adding structure: a regular adder for the transposed form, and an adder tree for the direct one.
We briefly describe the filter's building blocks and some implementation alternatives.
3.1 Multiplication
Multipliers are present in each filter tap. Therefore, it is crucial that they are implemented efficiently with respect to delay, area and power dissipation.
Parallel (combinational) multiplication is a three-step computation [6]. We indicate with

$$p = a \times x$$

the product p (n+m bits) of an n-bit operand x and an m-bit operand a.
1. First, m partial products

$$p_i = 2^i x \cdot a_i \qquad i = 0, \ldots, m-1$$

are generated. Because $a_i \in \{0, 1\}$, this step can be realized with an n×m array of AND-2 gates (footnote 1).
2. Then, the m partial products are reduced to 2 by an adder tree

$$\sum_{i=0}^{m-1} 2^i x \cdot a_i = p_s + p_c .$$

3. Finally, the carry-save product ps, pc is assimilated by a carry-propagate adder (CPA)

$$p = p_s + p_c .$$
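
The three steps can be sketched behaviorally in Python. This is a minimal sketch under stated assumptions: operands are unsigned, the function names are hypothetical, and the reduction uses a simple linear chain of 3:2 carry-save adders rather than the balanced adder tree a real multiplier would use.

```python
def partial_products(a, x, m):
    """Step 1: p_i = 2**i * x * a_i, i.e. x AND-ed with bit a_i and shifted by i."""
    return [((a >> i) & 1) * (x << i) for i in range(m)]


def carry_save_add(u, v, w):
    """3:2 carry-save adder: returns (sum, carry) with sum + carry == u + v + w."""
    s = u ^ v ^ w                                # bitwise sum without carries
    c = ((u & v) | (u & w) | (v & w)) << 1       # carries, moved to the next weight
    return s, c


def multiply(a, x, m):
    """Unsigned a * x for an m-bit a, following the three steps in the text."""
    ps, pc = 0, 0
    for pp in partial_products(a, x, m):         # step 2: reduce m operands to 2
        ps, pc = carry_save_add(ps, pc, pp)
    return ps + pc                               # step 3: final carry-propagate add


product = multiply(0b1011, 0b110, 4)             # 11 * 6
```

Keeping the result as the pair (ps, pc) until the very end is what allows the final carry-propagate addition to be moved or removed, as discussed for the fused multiply-add taps.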
Because in a FIR filter, the product is added to the
value y coming from the previous tap for transposed
form, or to the other products for direct form, the final
CPA of the multiplier is eliminated to save time. This
is illustrated in Fig. 2 for the FIR in transposed form
where the multiplication and the addition are fused and
only one CPA per tap is used. The block marked CSA
(Carry-Save Adder) in Fig. 2 is an array of full-adders
which reduce 3 (n+m)-bit operands to 2 (n+m)-bit
operands without carry propagation [6].
The delay in the adder tree and its area depend on the number of addends to be reduced (m : 2). By radix-4 recoding the multiplier a, often referred to as Booth's recoding, the number of partial products is halved to m/2. As a consequence, the multiplier's adder tree is smaller and faster. However, in terms of delay, the reduction in the adder tree is offset by a slower partial product generation, due to the recoding [6]. On the other hand, the reduction in area is significant, and the power dissipation is reduced as well, due to both the reduced capacitance (area) and the reduced node activity: for two's complement representation, sequences of 1's are recoded into sequences of 0's, resulting in fewer transitions when positive and negative values alternate at the multiplier inputs.
In the evaluation of the different FIR filter architectures, we include filters with both radix-2 and radix-4 multiplication.
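
To make the halving of the partial products concrete, the following Python sketch applies the standard radix-4 (Booth) recoding rule d_i = a_{2i−1} + a_{2i} − 2·a_{2i+1}, with a_{−1} = 0, to an m-bit two's complement multiplier. The function name is hypothetical and an even m is assumed; the sketch illustrates the recoding only and is not taken from the implementation used in the paper.

```python
def booth_radix4(a, m):
    """Recode an m-bit two's complement integer into m/2 digits in {-2,...,2}.

    The digits satisfy sum(d_i * 4**i) == a. m is assumed to be even.
    """
    if m % 2 != 0:
        raise ValueError("this sketch assumes an even bit-width")
    if not -(1 << (m - 1)) <= a < (1 << (m - 1)):
        raise ValueError("a does not fit in m bits (two's complement)")
    # Python's >> is arithmetic, so this yields the two's complement bits of a.
    bits = [(a >> i) & 1 for i in range(m)]
    digits = []
    prev = 0                                   # a_{-1} = 0
    for i in range(0, m, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


digits = booth_radix4(7, 4)                    # 0111 -> digits for 4**0 and 4**1
```

Each digit selects one of 0, ±x or ±2x as a partial product, so a 10-bit coefficient produces 5 partial products instead of 10.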
Footnote 1: Shifting ($2^i$) is done by hard-wiring the AND-2 array's output bits.
Fig. 2 Tap for FIR filters in transposed form.
Fig. 3 Carry-save tap for FIR filters in transposed form.
3.2 Addition
The delay of the critical path of the transposed FIR filter of Fig. 2 is

$$t_{maxCPA} = t_{MULT} + t_{CSA} + t_{CPA} + t_{REG}$$

This delay determines the maximum clock frequency and the throughput of the filter. Computing the carry-propagate addition in each tap can be avoided by storing the output of the CSA. This requires doubling the registers storing y in each tap, as shown in Fig. 3. In this way, the delay of the critical path for the transposed FIR filter is reduced to

$$t_{maxCS} = t_{MULT} + t_{CSA4:2} + t_{REG} .$$

Because we now have a carry-save representation in both p = ps + pc and y = ys + yc, the CSA 3:2 is replaced by a CSA 4:2, which is slightly slower [6].
To give a numerical example, for a transposed FIR filter with radix-4 multipliers the speed-up of carry-save over carry-propagate is about 26%:

$$\text{speed-up} = \frac{t_{maxCPA}}{t_{maxCS}} = \frac{1.27\ \text{ns}}{1.01\ \text{ns}} \approx 1.26$$

The carry-save representation of y requires an additional stage at the filter output where a CPA assimilates ys + yc = y.
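
The numerical example can be reproduced with a few lines of Python. The two delays are the ones quoted in the text for the radix-4 transposed filters; the helper name is hypothetical, and the frequency conversion simply assumes that the maximum clock frequency is the reciprocal of the critical-path delay.

```python
T_MAX_CPA_NS = 1.27   # critical path with one CPA per tap (radix-4, transposed)
T_MAX_CS_NS = 1.01    # critical path with carry-save taps (radix-4, transposed)


def max_freq_mhz(delay_ns):
    """Maximum clock frequency in MHz for a critical-path delay given in ns."""
    return 1e3 / delay_ns


# Ratio of the two critical-path delays, i.e. the throughput gain of carry-save taps.
speed_up = T_MAX_CPA_NS / T_MAX_CS_NS
f_cpa_mhz = max_freq_mhz(T_MAX_CPA_NS)
```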
The above considerations apply to FIR filters in transposed form only. For filters in direct form (Fig. 1 bottom), the carry-save outputs of the N multipliers are reduced by a 2N : 2 adder tree, and the final y is computed by a final stage consisting of a CPA, as in the case of the carry-save transposed FIR filter.

Fig. 4 Flow of the proposed NBTI-aware exploration framework for logic circuits.

3.3 FIR filter architectures
Summarizing, we list the alternative architectures for the FIR filter implementation:
- FIR-T-R2-CPA is the transposed form implementation using radix-2 multipliers and one CPA per tap (Fig. 2).
- FIR-T-R2-CSA is the same as FIR-T-R2-CPA, but without a CPA in the taps (Fig. 3). An additional stage, implementing a single CPA, is inserted at the filter output.
- FIR-T-R4-CPA is the same as FIR-T-R2-CPA except that radix-4 multipliers are used within each tap.
- FIR-T-R4-CSA is the same as FIR-T-R2-CSA except that radix-4 multipliers are used within each tap.
- FIR-D-R2 is the direct form implementation using radix-2 multipliers. An additional stage, implementing a single CPA, is inserted at the bottom of the tree (filter's output).
- FIR-D-R4 is the same as FIR-D-R2, but with radix-4 multipliers.
4 NBTI-Aware Exploration Framework
Fig. 4 shows the implemented NBTI-aware exploration framework for the aging profile of standard-cell-based digital circuits.
After obtaining a synthesized circuit, a post-synthesis simulation (footnote 2) is needed to extract the statistical information of all the internal nodes. This information, which is stored in a dedicated ASCII file in the Value Change Dump (VCD) format, is the actual input of the implemented framework.
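
The statistic of interest for each internal node is its static 0-probability, i.e., the fraction of simulated time the signal spends at logic 0. A minimal sketch of that computation is given below; it assumes the value changes of one signal have already been read into a list of (time, value) pairs, and it is not a VCD parser nor the tool used in the paper. The function name and the data layout are hypothetical.

```python
def static_zero_probability(changes, t_end):
    """Fraction of the interval [t_start, t_end] that a signal spends at 0.

    changes -- list of (time, value) pairs sorted by time, value in {0, 1};
               the first pair gives the initial value and the start time
    t_end   -- end of the simulation window
    """
    if not changes:
        raise ValueError("need at least the initial value of the signal")
    t_start = changes[0][0]
    if t_end <= t_start:
        raise ValueError("t_end must be later than the first time stamp")
    zero_time = 0
    # Pair each change with the next one; the last value holds until t_end.
    for (t0, value), (t1, _) in zip(changes, changes[1:] + [(t_end, None)]):
        if value == 0:
            zero_time += min(t1, t_end) - min(t0, t_end)
    return zero_time / (t_end - t_start)


# Signal is 0 during [0, 10) and [30, 40) of a 100-time-unit simulation.
p0 = static_zero_probability([(0, 0), (10, 1), (30, 0), (40, 1)], 100)
```

Since a pMOS is under stress while its gate is at 0, this value plays the role of α in the simplified NBTI model of Section 2.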
Depending on the operating PVT corner (i.e., Process, supply Voltage, and Temperature) and the static 0-probability of internal signals, the NBTI-induced delay degradation of each standard cell is extracted and annotated. This is done with the support of new NBTI-aware timing libraries that support time-dependent variations. Since today's design kits do not provide designers with this type of library, we filled new look-up tables containing the NBTI-induced delay degradation of each cell. The characterization is made under several operating conditions, i.e., static 0-probabilities of the inputs, stress voltage (i.e., Vgs), temperature, and aging time. It was run using a dedicated SPICE-based aging analysis flow consisting of a two-phase simulation: the pre-stress simulation phase, in which we estimate the aging effects on the p-type transistors contained in the standard cell, and the post-stress simulation phase, in which the stress information is integrated into the pMOS device parameters. At this point the delay degradation of the aged cell is measured and stored in a dedicated look-up table.
The netlist, annotated with the NBTI information, is then loaded into a standard Static Timing Analysis (STA) engine that provides timing information for the aged circuit (Fig. 4). The collected aging curves are then used to calculate the lifetime of the circuit. The lifetime is measured as the time at which the aging curve crosses a user-defined delay guard-band.
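
The lifetime extraction can be sketched as finding the point where a sampled aging curve crosses the guard-band. The Python function below is a minimal sketch under stated assumptions: the curve is given as matching lists of time and delay samples, the guard-band is expressed as a fraction of the age-0 delay, and linear interpolation is used between samples. The function name and the sample values in the example are hypothetical and are not data from the paper.

```python
def lifetime(years, delays, guard_band):
    """Time at which the critical delay exceeds delays[0] * (1 + guard_band).

    years      -- sample times, increasing, with years[0] the age-0 point
    delays     -- critical-path delay at each sample time
    guard_band -- allowed relative delay increase (must be positive)
    Returns None if the curve never crosses the guard-band.
    """
    if len(years) != len(delays) or not years:
        raise ValueError("years and delays must be non-empty and equally long")
    if guard_band <= 0:
        raise ValueError("guard_band must be positive")
    threshold = delays[0] * (1.0 + guard_band)
    for i in range(1, len(years)):
        if delays[i] >= threshold:
            # Linear interpolation inside the segment where the crossing occurs.
            frac = (threshold - delays[i - 1]) / (delays[i] - delays[i - 1])
            return years[i - 1] + frac * (years[i] - years[i - 1])
    return None


# Made-up aging curve: delay grows by 0.1 ns per year from 1.0 ns at age-0.
example_lifetime = lifetime([0, 1, 2, 3], [1.0, 1.1, 1.2, 1.3], 0.15)
```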
5 Experimental Results
The FIR filter architectures previously described can
have diﬀerent characteristics in terms of power, delay
and aging. To explore the design space of these filter
architectures, we implemented the six units listed in
Section 3.3.
The units are synthesized by Synopsys's Design Compiler with a 45 nm standard cell library in topographical mode. The topographical mode gives us better estimations of the parasitics associated with interconnects, which have an increasing contribution to path delay in nanometer technologies. The whole design flow is illustrated in Fig. 5. After we obtain the synthesized netlist, a post-synthesis simulation is run using Mentor Graphics' Modelsim to extract the toggling information of all the internal signals. The test patterns used to extract the switching activity program the FIR units as a low-pass filter and apply white noise (random vectors) at the filter input.

Footnote 2: Post-synthesis simulations are obtained by applying dedicated testbenches that emulate the actual workload.

Unit            Area [µm²]  Delay [ns]  MaxFreq [MHz]  Pleak [µW]  Pdyn [mW]  Lifetime [years]
FIR-D-R2        47137       2.10        476            46.5        31.2       2.92
FIR-D-R4        35852       2.63        380            33.7        17.6       2.49
FIR-T-R2-CPA    51019       1.45        692            49.5        24.6       2.79
FIR-T-R2-CSA    54315       1.13        882            82.6        20.5       3.09
FIR-T-R4-CPA    43799       1.27        787            45.3        13.6       2.32
FIR-T-R4-CSA    43600       1.01        989            69.5        10.1       3.29
Pdyn is dynamic power measured at 100 MHz.
Table 1 Implementation results of different FIR filter architectures.

Fig. 5 Design space exploration flow.

Our NBTI analysis tool then takes the synthesized
design, the statistics about internal signals and library
information as input and produces an aging profile of
the design.
Table 1 shows the implementation results of the units. MaxFreq is the maximum frequency at which the design can be clocked. To have a fair comparison, the dynamic power dissipation Pdyn is normalized for all units at 100 MHz. In addition to the traditional metrics, we define Lifetime as the time elapsed until the delay of the critical path, due to aging effects, exceeds by 15% the delay of the critical path at age-0 (chip production).
From the data in Table 1, the direct form implementations, due to their large reduction tree, have the largest delay and thus the lowest maximum frequency. However, each occupies less area than the transposed units using the same multiplier radix; FIR-D-R4 in particular has the smallest area and, consequently, the smallest leakage power. The transposed form implementations perform much better than FIR-D-R2 and FIR-D-R4, at the price of a larger area. For example, the maximum operating frequency of FIR-T-R4-CSA is about twice that of FIR-D-R2.
In general, units that use radix-4 multipliers are smaller and more power efficient than their radix-2 counterparts, and in the transposed form they are also faster (the exception is the direct form, where FIR-D-R4 is slower than FIR-D-R2), which makes the radix-2 implementations less attractive. Additionally, FIR-T-R2-CSA and FIR-T-R4-CSA, which save the results at each tap in carry-save format, have a shorter delay and consume less dynamic power than FIR-T-R2-CPA and FIR-T-R4-CPA. This makes the CSA implementations more favorable. In fact, among all the units FIR-T-R4-CSA is the fastest and has the lowest dynamic power consumption.
To understand the aging characteristics of diﬀerent
architectures and the lifetime shown in Table 1, we plot
the aging curves of all units in Fig. 6. The x-axis shows
the cumulative operating time in years and the y-axis
shows the critical delay in ns.
For each unit, the initial climb (from year 0 to year 1) has a larger slope than the later segments (from year 1 to year 6), showing a shift of the critical path. This means that a path other than the critical path at age-0 is more stressed under NBTI effects and eventually exhibits a larger aging-induced delay. Across units, the slope of the initial climb is ordered from high to low as FIR-D-R4, FIR-D-R2, FIR-T-R2-CPA, FIR-T-R4-CPA, FIR-T-R2-CSA, FIR-T-R4-CSA, which is consistent with the ordering of the delays without aging effects considered. This is as intuitively expected, because units with large logic depth have longer paths. After 1 year of operation, the relative delay increase in longer paths is also larger than in shorter paths.

Fig. 6 Aging curve for different FIR filter architectures. (x-axis: Years, 0 to 6; y-axis: Delay (ns); one curve per unit.)

However, over time, this is not necessarily true. Transistors along a longer path can age at a different speed than transistors along a shorter path. As explained in Section 2, the speed of aging is determined by the signal probabilities (the duration of the 0's, which stress the pMOS transistors). Therefore, as time elapses a shorter path can end up with a longer delay, as we can see in the case of FIR-T-R2-CPA in Table 1, whose lifetime is shorter than that of FIR-D-R2.
The three metrics of most interest (frequency, power and lifetime) are plotted for all units in Fig. 7. All numbers are normalized to unit FIR-D-R4. The x-axis and y-axis show normalized frequency and power, respectively. The normalized lifetime is shown as a label in the plot. Filter FIR-T-R4-CSA, shown in the lower right corner, performs best in terms of speed, power and lifetime, making it the most attractive architecture overall.
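
The normalization can be reproduced directly from Table 1. The Python sketch below normalizes the maximum frequency and the lifetime of each unit to FIR-D-R4; the data values are copied from Table 1, while the dictionary layout and function name are hypothetical. Power is left out because the text does not state which power component is plotted.

```python
# (MaxFreq [MHz], Lifetime [years]) for each unit, copied from Table 1.
TABLE_1 = {
    "FIR-D-R2": (476, 2.92),
    "FIR-D-R4": (380, 2.49),
    "FIR-T-R2-CPA": (692, 2.79),
    "FIR-T-R2-CSA": (882, 3.09),
    "FIR-T-R4-CPA": (787, 2.32),
    "FIR-T-R4-CSA": (989, 3.29),
}


def normalize(table, reference="FIR-D-R4"):
    """Divide every (frequency, lifetime) pair by that of the reference unit."""
    f_ref, life_ref = table[reference]
    return {unit: (f / f_ref, life / life_ref) for unit, (f, life) in table.items()}


normalized = normalize(TABLE_1)
# Units ordered from longest to shortest normalized lifetime.
ranking = sorted(normalized, key=lambda unit: normalized[unit][1], reverse=True)
```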
The characterization of aging in FIR filters showed that the architectures with shallower logic depth (transposed form with carry-save taps) are the ones with the longest lifetime. Therefore, they provide the highest throughput sustainable over a longer period of time.
On the other hand, our design space exploration provided no evidence of a relationship between aging and power dissipation for the FIR filter architectures implemented. This is probably due to the high activity in filters (little idleness), which produces evenly distributed power dissipation in the different parts of the system independently of the number of paths affected by aging.

Fig. 7 Design space exploration of FIR filter architectures. Frequency, power and lifetime from Table 1 are normalized. (x-axis: Norm Frequency; y-axis: Norm Power; normalized lifetime labels: FIR-T-R4-CSA 1.32, FIR-T-R2-CSA 1.24, FIR-D-R2 1.17, FIR-T-R2-CPA 1.12, FIR-D-R4 1.00, FIR-T-R4-CPA 0.93.)

6 Conclusions
In this paper, we described a design space exploration framework which, besides traditional metrics, also takes NBTI-induced aging as another dimension. The exploration on a set of FIR filter architectures shows that transposed form implementations perform better than direct forms in terms of delay (they can sustain a higher throughput) and dynamic power dissipation, while filters in direct form have smaller area and, consequently, consume less static power.
The results also show that significant differences in lifetime exist among these units and that the FIR filters providing higher throughput (frequency) are the ones that can sustain this throughput for a longer period of time. As for the power dissipated, the results of the characterization do not show any dependency on aging.
References
1. S. Borkar, "Electronics beyond nano-scale CMOS," Proc. of 43rd ACM/IEEE Design Automation Conference, pp. 807-808, 2006.
2. D. K. Schroder and J. A. Babcock, "Negative bias temperature instability: Road to cross in deep submicron silicon semiconductor manufacturing," Journal of Applied Physics, vol. 94, no. 1, pp. 1-18, Jul. 2003.
3. M.-F. Li et al., "Understand NBTI Mechanism by Developing Novel Measurement Techniques," IEEE Transactions on Device and Materials Reliability, vol. 8, no. 1, pp. 62-71, Mar. 2008.
4. M. Alam, "Reliability- and process-variation aware design of integrated circuits," Microelectronics Reliability, vol. 48, no. 8, pp. 1114-1122, Aug. 2008.
5. S. V. Kumar et al., "An analytical model for negative bias temperature instability," Proc. of IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 493-496, Nov. 2006.
6. M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann Publishers, 2004.
