Optimized Architectural Synthesis of Fixed-Point Datapaths by Caffarena Fernández, Gabriel et al.
Optimized Architectural Synthesis of Fixed-Point Datapaths
Gabriel Caffarena1, Juan A. López1, Gerardo Leyva2
Carlos Carreras1 and Octavio Nieto-Taladriz1
1Dep. de Ing. Electrónica, Universidad Politécnica de Madrid, Spain
2Dep. de Sistemas Electrónicos, Universidad Autónoma de Aguascalientes, Mexico
e-mail: gabriel@die.upm.es
Abstract
In this paper we address the time-constrained
architectural synthesis of fixed-point DSP algorithms
using FPGA devices. Optimized fixed-point implemen-
tations are obtained by means of considering: (i) a
multiple wordlength approach; (ii) a complete datapath
formed of wordlength-wise resources (i.e. functional
units, multiplexers and registers); and, (iii) a novel re-
source usage metric that enables the wise distribution
of logic fabric and embedded DSP resources.
The paper shows: (i) the benefits of applying a multi-
ple wordlength approach to the implementation of fixed-
point datapaths; and (ii) the benefits of a wise use of
embedded FPGA resources. The proposed metric en-
ables area improvements up to 54% and the use of a
complete fixed-point datapath leads to improvements up
to 35%.
1. Introduction
In this paper we deal with the architectural synthe-
sis (AS) of common Digital Signal Processing (DSP)
cores implemented on modern FPGAs. Two important
factors leading to optimization are the use of Multiple
Word-Length (MWL) ﬁxed-point descriptions of the al-
gorithms and the use of both LUT-based and embed-
ded FPGA resources. The former reduces implemen-
tation cost notably, and the latter minimizes area and
improves performance in FPGA implementations.
The MWL implementation of ﬁxed-point DSP al-
gorithms [8, 5, 4, 2] has proved to provide signiﬁcant
cost savings when compared to the traditional uniform
word-length (UWL) design approach. Introducing
MWL issues into AS increases optimization complexity,
but it can lead to significant cost reductions [5, 4, 3].
FPGA devices have been extensively used in the
implementation of DSP algorithms, especially due to
the recent introduction of embedded blocks (i.e. mem-
ory blocks, DSP blocks, etc.). Traditional approaches
to estimate FPGA resource usage do not apply to
heterogeneous-architecture FPGAs since they only ac-
count for lookup table (LUT) based resources [10].
This situation calls for new resource usage metrics that
can be integrated as part of automatic optimization
techniques (i.e. architectural synthesis) to fully exploit
the possibilities that embedded resources oﬀer [1, 9].
The main contributions of this paper are: (i) the
presentation of a novel resource usage metric that al-
lows minimum resource usage for heterogeneous-FPGA
implementations; (ii) the presentation of an architec-
tural synthesis procedure tuned to ﬁxed-point imple-
mentations, that handles a complete datapath (FUs,
multiplexers and registers); and, (iii) a novel strategy
for ﬁxed-point data multiplexing.
The paper is divided as follows: In section 2, the
architectural synthesis of DSP cores using multiple
wordlength systems and modern FPGAs is introduced.
Section 3 deals with the implementation results from
synthesizing several well-known DSP benchmarks for
diﬀerent latency constraints and output noise con-
straints. Finally, in section 4, conclusions are drawn.
2. Generation of Fixed-Point Datapaths
2.1. Formal description
This work focuses on the time constrained resource
minimization problem [6]. The notation used is based
on [6], and it is similar to that in [5, 2, 3].
Given a sequencing graph GS(V, S), a maximum la-
tency λ and a set of resources R (e.g. functional units
RFU , registers RREG and steering logic RMUX), it is
the goal of AS to ﬁnd the time step when each oper-
ation is executed (scheduling), the types and number
2008 International Conference on Reconfigurable Computing and FPGAs
978-0-7695-3474-9/08 $25.00 © 2008 IEEE
DOI 10.1109/ReConFig.2008.48
Authorized licensed use limited to: Univ Politecnica de Madrid. Downloaded on July 10, 2009 at 05:16 from IEEE Xplore.  Restrictions apply.
of resources forming R (resource allocation), and the
binding between operations and variables to functional
units and registers (resource binding) that comply with
the constraints, while minimizing cost (i.e. area).
GS(V, S) is a formal representation of a single iter-
ation of an algorithm, where V is the set of operations
and S ⊂ V × V is the set of signals that determine the
data ﬂow. We consider V = VM∪VG∪VA∪VD∪VI∪VO
composed of typical DSP operations: multiplications,
gains, additions, unit delays, and input and output
nodes. Signals are in two’s complement ﬁxed-point
format, deﬁned by the pair (n, p), where n is the
wordlength of the signal – not including the sign bit
– and p is the scaling of the signal that represents the
displacement of the binary point from the sign bit [2].
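As an illustration of the (n, p) format, the following minimal Python sketch quantizes a real value to a two's complement word. It assumes the usual reading of this notation, in which the representable range is [-2^p, 2^p) and the LSB weight is 2^(p-n). The function names, the round-to-nearest rule and the saturation policy are illustrative assumptions and are not taken from the paper.

```python
import math

def lsb_weight(n: int, p: int) -> float:
    """Weight of the least significant bit for format (n, p)."""
    return 2.0 ** (p - n)

def quantize(x: float, n: int, p: int) -> float:
    """Round x to the nearest value representable in format (n, p).

    Assumes n magnitude bits plus one sign bit, range [-2**p, 2**p),
    and saturation on overflow (an illustrative policy).
    """
    q = lsb_weight(n, p)
    code = math.floor(x / q + 0.5)                 # nearest integer code
    code = max(-(2 ** n), min(2 ** n - 1, code))   # saturate to n+1 bits
    return code * q

# An 8-bit signal (n = 7) with p = 0 covers [-1, 1) in steps of 2**-7.
print(lsb_weight(7, 0))      # 0.0078125
print(quantize(0.3, 7, 0))   # 0.296875
print(quantize(1.5, 7, 0))   # 0.9921875 (saturated)
```

Under these assumptions, increasing p by one doubles the range without changing the number of bits, whereas increasing n by one halves the quantization step.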
Functional units (RFU ) are in charge of executing
the set of operations from V . Registers (RREG) store
the data produced by FUs and some intermediate val-
ues. Finally, steering logic (RMUX ) interconnects FUs
and registers by means of multiplexers. The set of func-
tional units RFU = RALUT ∪RMLUT ∪RMEMB is com-
posed of LUT-based adders, LUT-based generic mul-
tipliers, and embedded multipliers. This set of FUs
covers a representative set of modern FPGA devices.
An FU r ∈ RFU is defined by its type, type(r) ∈
{AdderLUT, MultiplierLUT, MultiplierEMB}, and by
its size, which depends on the input wordlengths. An
operation is compatible with an FU if they have com-
patible types and if the size of the operation is smaller
than or equal to the size of the FU [2, 3].
Scheduling is expressed by means of function ϕ :
V → Z+, which assigns a start time to each operation.
Resource binding is divided into FU binding and
register binding. FU binding makes use of the
compatibility graph GC(V ∪ R, C) [5], which indicates the
compatible resources for each v ∈ V by means of the
set of edges C ⊂ V × R. The binding between operations
and resources is expressed by means of function
β : V → R × Z+, where β(v) = (r, i) indicates that
operation v is bound to the i-th instance of resource
r. The compatibility rules impose that (v, r) ∈ C.
In a similar fashion, register binding links variables
d ∈ D to registers r ∈ RREG by means of function
γ : D → RREG × Z+. The set of variables D is ex-
tracted from V considering that there is a variable as-
signed to the output of each operation from the subset
VM ∪ VG ∪ VA and to each delay vD connected to an-
other delay. Registers have an associated size nr that
determines the maximum allowed wordlength of the
variables bound to them.
The steering logic consists of the multiplexers re-
quired in front of FUs and registers to send data to
and from these two types of resources. RMUX is deter-
mined by ϕ, β and γ, since ϕ determines when data is
generated, β when data is used by FUs, and γ where
data is stored.
2.2. Handling resource heterogeneity
The recent appearance of specialized blocks in
FPGAs calls for new design methods to eﬃciently ex-
ploit their advantages. In [1], it is proposed to use
a normalized resource usage vector. Given an FPGA
with M diﬀerent types of resources Ri (i = 0 · · ·M−1),
each type with a maximum number of |Ri| resources,
the resource requirements of a particular design imple-
mentation d can be expressed as the following normal-
ized area vector:
$$ \hat{A} \equiv \left\langle \frac{\#r_0}{|R_0|}, \frac{\#r_1}{|R_1|}, \cdots, \frac{\#r_{M-1}}{|R_{M-1}|} \right\rangle \qquad (1) $$
where #ri is the number of resources of type Ri used.
Two useful norms are the ∞-norm and the 1-norm:
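$$ \|\hat{A}\|_{\infty} = \max\left\{ \frac{\#r_0}{|R_0|}, \frac{\#r_1}{|R_1|}, \cdots, \frac{\#r_{M-1}}{|R_{M-1}|} \right\}, \qquad (2) $$
$$ \|\hat{A}\|_{1} = \sum_{i=0}^{M-1} \frac{\#r_i}{|R_i|}. \qquad (3) $$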
The inverse of ∞-norm represents the number of
times that the same implementation of design d can be
replicated within the FPGA device (see [1]), and the
1-norm gives information about the overall resource usage
of the implementation. Each norm is interesting on
its own, but each also has pitfalls. On the one
hand, if two implementations have the same ∞-norm,
they can be replicated the same number
of times, but there is no way to know which
implementation requires fewer resources. On the other hand,
the 1-norm can tell whether one implementation requires
fewer resources than another, but that does not
guarantee that the implementation with fewer resources can
be replicated more times than the other. We propose
a linear combination of the ∞-norm and the 1-norm, called
the +-norm (plus-norm), that has the benefits of both
norms but none of the drawbacks:
$$ \|\hat{A}\|_{+} = K \cdot \|\hat{A}\|_{\infty} + \|\hat{A}\|_{1}. \qquad (4) $$
The value of constant K can be obtained from the
parameters M and Ri. If
K > (M − 1) · (max(|Ri|))−M , (5)
then, it is guaranteed that for any two implementations
di and dj : (a) if ‖Ai‖+ < ‖Aj‖+ then di can be repli-
cated more times than dj ; and, (b) if ‖Ai‖+ ≤ ‖Aj‖+
then di can be replicated more times than dj , or the
same number of times consuming less resources. There-
fore, minimizing +-norm implies that the design can
be replicated within the FPGA the maximum possi-
ble number of times while using the minimum possible
number of resources.
The metric +-norm has a low computational cost
and it is suitable for integer linear programming ap-
proaches [2] and heuristic approaches [3].
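The norms in Eqs. (1) to (4) can be sketched in a few lines of Python. This is only an illustration: the function names, the two example designs and the value of K are assumptions made here, and K is treated as a caller-supplied constant rather than derived from condition (5).

```python
from math import floor

def normalized_area(used, available):
    """Eq. (1): used[i] / available[i] for each resource type i."""
    return [u / a for u, a in zip(used, available)]

def inf_norm(a_hat):
    """Eq. (2): largest normalized usage over all resource types."""
    return max(a_hat)

def one_norm(a_hat):
    """Eq. (3): sum of the normalized usages."""
    return sum(a_hat)

def plus_norm(a_hat, k):
    """Eq. (4): K * inf-norm + 1-norm, with K supplied by the caller."""
    return k * inf_norm(a_hat) + one_norm(a_hat)

def replications(a_hat):
    """Number of copies of the design that fit (needs inf-norm > 0)."""
    return floor(1.0 / inf_norm(a_hat))

# Hypothetical device: 256 slices and 4 embedded multipliers.
device = [256, 4]
design_a = normalized_area([64, 1], device)   # [0.25, 0.25]
design_b = normalized_area([32, 1], device)   # [0.125, 0.25]
K = 1000.0  # illustrative placeholder, not a value from the paper

for name, d in (("a", design_a), ("b", design_b)):
    print(name, inf_norm(d), one_norm(d), plus_norm(d, K), replications(d))
```

Both hypothetical designs have the same ∞-norm and can be replicated four times, so the ∞-norm alone cannot rank them; the +-norm prefers design_b because its 1-norm is smaller.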
2.3. Resource modeling
Resources are divided into three types: functional
units (RFU ), registers (RREG) and steering logic
(RMUX). The area and latency of FUs and registers
(i.e. A(r) and l(r)) are expressed as functions of the
input and output wordlength information (p and n).
They are obtained by applying curve fitting to hundreds
of synthesis results. The use of accurate delay
cost functions was shown to provide significant
performance improvements (from 12% to 63%) compared
to existing naive approaches (see [3]). Registers
are assumed to have a zero latency. Note that A is
a vector with as many components as types of FPGA
resources. The fact that multiplexers and wiring laten-
cies are neglected could be easily overcome by multi-
plying the latency of FUs by an empirical factor [7].
The area of multiplexers in UWL systems is only
affected by the data wordlengths, which set the
multiplexer sizes, and by the number of different data
sources (e.g. registers or FUs), which determines the
multiplexer width. An estimate of the area of an
N-input multiplexer of wordlength M for Virtex-II
devices is given by
$$ A_{MUX} = M \cdot N / 4 \ \text{slices}. \qquad (6) $$
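Eq. (6) translates directly into a small helper. The function name and the argument checks are illustrative assumptions; the estimate itself is the one stated above for Virtex-II devices.

```python
def mux_area_slices(wordlength: int, num_inputs: int) -> float:
    """Estimated area, in slices, of a UWL multiplexer per Eq. (6)."""
    if wordlength < 1 or num_inputs < 1:
        raise ValueError("wordlength and num_inputs must be positive")
    return wordlength * num_inputs / 4

print(mux_area_slices(16, 4))  # 16.0
print(mux_area_slices(8, 3))   # 6.0
```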
In MWL systems, data must be aligned before be-
ing processed by FUs or stored in registers. In [11] the
problem of data alignment and multiplexing is tack-
led by means of alignment blocks introduced before
multiplexers. In this work, multiplexers are used for
both data multiplexing and data alignment, since the
combination of these two tasks leads to a reduction in
the number of control signals, and therefore, control
logic. In addition, the chances for logic optimization
are greater than if two separate blocks (an alignment
block and a multiplexer) are used.
Fig. 1 presents three different types of alignment for
a 4-input multiplexer with input signals sa, sb, sc and
sd and output o: arbitrary alignment (Fig. 1(a)), LSB
alignment (Fig. 1(b)), and MSB alignment (Fig. 1(c)).
Note that, on the one hand, sign extension (Fig. 1(a) and
Fig. 1(b)) does not offer any opportunity for logic
optimization, since the sign bits must be multiplexed. On
the other hand, zero padding (Fig. 1(a) and Fig. 1(c))
does offer it, due to the reduction in the number of
signals and the introduction of constant bits (zeros) that
can be hard-wired into the multiplexer logic. In fact,
MSB alignment (Fig. 1(c)) is the option that allows the
greatest logic reduction. Therefore, it is proposed to
apply this alignment whenever possible.

Figure 1. Signal alignment: (a) arbitrary; (b) LSB; (c) MSB.
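The difference between LSB and MSB alignment can be illustrated at the bit level. The sketch below is an assumption-laden toy: words are Python strings of '0'/'1' that include the sign bit, and the signal values are invented for the example. It only shows which padding bits are constants (zeros) and which are copies of a sign bit.

```python
def msb_align(bits: str, width: int) -> str:
    """Left-justify a two's complement word; pad the LSBs with zeros."""
    return bits + "0" * (width - len(bits))

def lsb_align(bits: str, width: int) -> str:
    """Right-justify a two's complement word; replicate the sign bit."""
    return bits[0] * (width - len(bits)) + bits

def constant_zero_bits(inputs: dict, width: int) -> int:
    """Padding bits that are constant zeros under MSB alignment."""
    return sum(width - len(b) for b in inputs.values())

# Hypothetical multiplexer inputs with different wordlengths.
inputs = {"sa": "1011", "sb": "010110", "sc": "11", "sd": "01101001"}
width = max(len(b) for b in inputs.values())

for name, bits in inputs.items():
    print(name, msb_align(bits, width), lsb_align(bits, width))
print(constant_zero_bits(inputs, width))  # 12
```

With MSB alignment every padding bit is a known zero that can be hard-wired, whereas with LSB alignment the padding depends on each input's sign bit and therefore still has to be multiplexed.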
A lower bound on the multiplexers’ area if the MSB
alignment is adopted can be computed as
$$ A_{MUX} = \frac{1}{4} \sum_{i=0}^{N-1} (n_i + 1) \ \text{slices}, \qquad (7) $$
where N is the maximum wordlength present and ni is
the wordlength of signal i.
2.4. Datapath automatic synthesis
The optimization procedure is based on the use of
Simulated Annealing (SA). The inputs to the optimizer
are the sequencing graph GS(V, S) and the total la-
tency constraint λ, from which it is possible to extract
the set of resources R and the compatibility graph GC .
The optimization is based on changing the current FU
binding (β) and computing the area of the datapath.
The movements are the following:
• MA: Bind an operation v ∈ V to a non-used re-
source r ∈ RFU .
• MB: Bind an operation v ∈ V to another already
used resource r ∈ RFU .
• MC : Swap the binding of two compatible opera-
tions v1 and v2 mapped to diﬀerent resources r1
and r2.
Every time a movement is produced, the resulting
area, expressed as in (4), is obtained by first applying a
list-based resource-constrained scheduling tuned to MWL
systems [3], and then performing register binding based on
the left-edge algorithm [6] and multiplexer generation
based on MSB alignment.
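The structure of such an annealing loop over FU bindings can be sketched as follows. This is a toy under strong assumptions, not the authors' optimizer: the instance (OPS, FUS), the cost function (sum of squared FU sizes instead of the +-norm obtained after scheduling, register binding and multiplexer generation), the load limit that stands in for the latency constraint, and the cooling parameters are all hypothetical.

```python
import math
import random

# Hypothetical toy instance: operation sizes and candidate FU sizes (bits).
OPS = {"m1": 8, "m2": 12, "m3": 16, "m4": 8}
FUS = {"mul8": 8, "mul12": 12, "mul16a": 16, "mul16b": 16}
MAX_OPS_PER_FU = 2  # crude stand-in for the latency constraint

def compatible(op, fu):
    """An operation fits an FU that is at least as large."""
    return OPS[op] <= FUS[fu]

def area(binding):
    """Toy cost: sum of squared sizes of the FUs actually used."""
    return sum(FUS[fu] ** 2 for fu in set(binding.values()))

def feasible(binding):
    loads = {}
    for op, fu in binding.items():
        if not compatible(op, fu):
            return False
        loads[fu] = loads.get(fu, 0) + 1
    return all(n <= MAX_OPS_PER_FU for n in loads.values())

def propose(binding, rng):
    """Apply one of the moves MA, MB or MC to a copy of the binding."""
    new = dict(binding)
    used = set(binding.values())
    move = rng.choice(["MA", "MB", "MC"])
    op = rng.choice(list(OPS))
    if move == "MC":                      # swap the bindings of two ops
        other = rng.choice(list(OPS))
        new[op], new[other] = new[other], new[op]
    else:                                 # MA: unused FU, MB: used FU
        pool = [f for f in FUS
                if (f in used) == (move == "MB") and f != binding[op]]
        if pool:
            new[op] = rng.choice(pool)
    return new

def anneal(seed=0, steps=2000, t0=200.0, alpha=0.995):
    rng = random.Random(seed)
    current = {"m1": "mul8", "m2": "mul12", "m3": "mul16a", "m4": "mul16b"}
    best, temp = dict(current), t0
    for _ in range(steps):
        cand = propose(current, rng)
        if feasible(cand):
            delta = area(cand) - area(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = cand
                if area(current) < area(best):
                    best = dict(current)
        temp *= alpha                     # geometric cooling schedule
    return best

best = anneal()
print(best, area(best))
```

In the procedure described above, the call to area() would be replaced by the full evaluation chain (scheduling, register binding, multiplexer generation, +-norm), which is what lets multiplexer and register costs influence the binding.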
MWL issues are handled automatically due to the
wordlength-wise cost modeling of resources, which have
wordlength-dependent area and latency costs. FPGA
resource heterogeneity is handled by means of the use
of +-norm, which enables the automatic selection of
LUT-based resources and embedded resources, without
adding any additional step to the SA approach. This
method provides a robust way to obtain optimized im-
plementations of DSP algorithms using FPGAs.
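The register-binding step relies on the left-edge algorithm [6]. A minimal sketch of the classic, wordlength-agnostic version is given below; the variable names and lifetimes are invented for illustration, and lifetimes are assumed to be half-open intervals (start, end). In an MWL setting each register would additionally be sized to the largest variable bound to it.

```python
def left_edge(lifetimes):
    """Pack variable lifetimes into registers with the left-edge rule.

    lifetimes: dict name -> (start, end), with end exclusive.
    Returns a list of registers, each a list of variable names.
    """
    pending = sorted(lifetimes, key=lambda v: lifetimes[v])
    registers = []
    while pending:
        track, last_end, rest = [], float("-inf"), []
        for v in pending:
            start, end = lifetimes[v]
            if start >= last_end:      # no overlap with the last variable
                track.append(v)
                last_end = end
            else:                      # defer to a later register
                rest.append(v)
        registers.append(track)
        pending = rest
    return registers

# Hypothetical lifetimes, in control steps.
lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5), "e": (4, 6)}
print(left_edge(lifetimes))  # [['a', 'c', 'e'], ['b', 'd']]
```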
3. Results
The following benchmarks are used for the analysis:
(i) an ITU RGB to YCrCb converter (RGB); (ii) a
3rd-order lattice filter (LAT3); (iii) a 4th-order IIR
filter (IIR4); and (iv) an 8th-order linear-phase FIR filter
(FIR8). All algorithms are assigned 8-bit inputs and
12-bit constant coefficients. The algorithm
implementations have been tested under different latency and
output noise constraint scenarios assuming a system
clock of 125 MHz. In particular, the noise constraints
were $\sigma^2 = \{10^{-k}, 10^{-(k+1)}, 10^{-(k+2)}\}$, where k is the
minimum number that makes $10^{-k}$ as close as possible
to the variance of the quantization noise that the
benchmark output would exhibit if quantized to 8
bits (n = 7).
The target devices belong to the Xilinx Virtex-II
family. The area results are normalized with respect
to the XC2V40 device (256 slices, 4 embedded 18x18
multipliers) and expressed according to Eqn. 2. For
instance, $\|\hat{A}\|_{\infty} \leq 1$ implies that the device XC2V40
is the smallest-cost device able to hold the design,
whereas $1 < \|\hat{A}\|_{\infty} \leq 2$ implies that XC2V80 is the
smallest-cost device, and so on. The datapaths allow
resource sharing, for both adders and multipliers, and
there are generic LUT-based multipliers (RMLUT) as
well as 18x18 and 36x18 multipliers (forming
RMEMB).
Before AS, each algorithm is translated to a fixed-point
specification by means of two wordlength
optimization procedures that follow a UWL approach
and an MWL approach, respectively. Once the fixed-point
signal formats are available, AS is applied to
each possible combination of latency, quantization
scenario and FPGA architecture (homogeneous or
heterogeneous), for a total of 120 implementations per
benchmark.
3.1. Homogeneous UWL vs. MWL
Fig. 2(a) contains the area vs. latency curves of the
homogeneous implementations of IIR4 (MWL-HOM
and UWL-HOM, grey symbols). The latency ranges
from $\lambda_{min}^{MWL-HOM}$ to $\lambda_{min}^{UWL-HOM} + 10$. It can be seen
that both the UWL and MWL areas decrease as the
latency increases, which is expected since the chances for
FU reuse increase. The area improvements obtained
by means of using an MWL approach are up to 65%.
Also, the minimum latency that each implementation
achieves differs considerably.
Fig. 2(b) displays the detailed resource distribution
for the IIR4 UWL and MWL implementations corresponding
to $\sigma^2 = 10^{-5}$ and λ = 17 (see the UWL-HOM
and MWL-HOM bars). The separate resource
usages of FUs (FU-LUT and FU-EMB), multiplexers
(MUX-FU and MUX-REG) and registers (REG) are
displayed. The overall area saving is 36%, and it is
due to the fact that the wordlengths of the majority of
signals, which impact FUs, multiplexers and registers,
have been highly reduced. It is important to highlight
that the area due to multiplexers and registers,
although smaller than the FUs’ area, makes up a sig-
niﬁcant part of the total area (37% for UWL and 43%
for MWL). Hence the importance of including these
costs within the optimization loop.
The graph on the left of Fig. 3 depicts the overall
results regarding homogeneous implementations. For
each quantization scenario the latency ranges from
$\lambda_{min}^{UWL-HOM}$ to $\lambda_{min}^{UWL-HOM} + 10$, and the mean and
maximum values of the area improvements obtained
by the MWL implementations in comparison to the
UWL implementations are computed. The area im-
provements obtained are remarkable: RGB obtains a
mean improvement of 66%; LAT3 of 41%; IIR4 of
47%; FIR8 of 30%. Note that the mean improve-
ments obtained for all benchmarks are relatively close
to the maximum value. The overall average improve-
ment obtained is 46% and the maximum achieved is
77%. Regarding latency, the minimum latency achievable
by UWL implementations is reduced by 22% on average.
The results clearly show that an MWL AS approach
achieves significant area reductions.
3.2. UWL vs. MWL: Heterogeneous case
The curves in Fig. 2(a) (UWL-HET and
MWL-HET, black symbols) show that, again, there is
a significant gain in using an MWL approach, since the
improvements are up to 65%. Fig. 2(b) (UWL-HET vs.
MWL-HET ) shows that the improvements are mainly
due to an overall wordlength reduction. Also, it can be
seen that the LUT-based resources are almost entirely
devoted to data storing and multiplexing.
The central graph in Fig. 3 holds the results regarding
heterogeneous implementations. For each quantization
scenario, the latency ranges from $\lambda_{min}^{UWL-HET}$
to $\lambda_{min}^{UWL-HET} + 10$, and the mean and maximum values
of the area improvements obtained by the MWL
implementations in comparison to the UWL
implementations are presented. The area improvements
obtained are, again, remarkable: RGB obtains a mean
improvement of 50%; LAT3 of 34%; IIR4 of 37%;
FIR8 of 35%. The average improvement obtained is
39% and the maximum achieved is 81%. The latency
analysis shows that the minimum UWL latency is
reduced by 19% on average. Again, an MWL AS approach
achieves significant area reductions for heterogeneous
implementations.

Figure 2. Implementation results for IIR4: (a) area vs. latency curves (UWL-HOM, UWL-HET, MWL-HOM, MWL-HET) for $\sigma^2 = 10^{-5}$; (b) resource distribution (MUX-REG, REG, MUX-FU, FU-LUT, FU-EMB) for $\sigma^2 = 10^{-5}$, λ = 17.

Figure 3. Overall resource usage improvement for all benchmarks (mean and maximum, in %): HOM: UWL vs. MWL; HET: UWL vs. MWL; MWL: HOM vs. HET.
3.3. Homogeneous vs. Heterogeneous
Fig. 2(a) (MWL-HOM and MWL-HET, triangles)
and Fig. 2(b) (MWL-HOM and MWL-HET) contain
the MWL homogeneous vs. heterogeneous implementation
results for IIR4. The area vs. latency curves show that
the introduction of embedded multipliers dramatically
reduces the overall resource usage with improvements
of up to 65%. Fig. 2(b) indicates that the improve-
ments are mainly due to a migration of LUT-based
multipliers to embedded multipliers.
The graph on the right in Fig. 3 holds the results
regarding heterogeneous vs. homogeneous
implementations. For each quantization scenario, the latency
ranges from $\lambda_{min}^{MWL-HOM}$ to $\lambda_{min}^{MWL-HOM} + 10$, and
the mean and maximum values of the area improvements
obtained by the heterogeneous implementations
in comparison to the homogeneous implementations are
presented. The area improvements obtained are
remarkable: RGB obtains a mean improvement of 52%;
LAT3 of 40%; IIR4 of 35%; FIR8 of 31%. Note that
the mean improvements obtained for all benchmarks
are quite close to the maximum values. The average
improvement obtained is 40% and the maximum
is 55%. The results clearly show that a wise use of
embedded resources leads to highly optimized datapath
implementations. Regarding latencies, the minimum
latency achievable by both kinds of implementations is
the same for the experiments performed. This is because
the latencies of the resources are very similar under
the particular conditions used for the tests. When the same
experiments were repeated with the constant wordlength
increased to 16 bits, the heterogeneous implementations
reduced the minimum latency of the homogeneous
implementations by 7%.
Summarizing, the eﬃcient use of embedded re-
sources highly improves the ﬁnal implementation re-
sults.
3.4. Eﬀect of using a complete datapath
All these experiments were repeated excluding registers
and multiplexers from the available resources during
the optimization process, and adding them afterward
to compute the final datapath area. The homogeneous
implementations were degraded by up to 5%,
and the heterogeneous ones by up to 35%. It can be
concluded that the use of a complete datapath description
is especially significant in heterogeneous
implementations, since they are strongly based on optimizing the
balance between LUT-based and embedded resources.
4. Conclusions
In this paper an architectural synthesis approach
able to produce optimized ﬁxed-point implementations
using modern FPGA devices is presented. The key
to success is provided by the use of highly accurate
models of the datapath resources, a complete datapath
resource set that includes multiplexers and registers, a
novel method to handle fixed-point data alignment and
multiplexing, and the introduction of a novel
resource usage metric that can cope with LUT-based
and embedded FPGA resources.
The AS procedure produces area improvements of
up to 80% when compared to uniform-wordlength im-
plementations, and minimum latency improvements of
up to 22%. The eﬃcient use of embedded resources
achieves area improvements of up to 54% when com-
pared to homogeneous implementations. Also, the in-
eﬃciency of current FPGA architectures to implement
data steering was exposed.
These results are intended to be further improved by
means of including the ﬁxed-point reﬁnement process
as part of the architectural synthesis [2].
5. Acknowledgment
This work was supported by the Spanish Min-
istry of Education and Science under Research Project
TEC2006-13067-C03-03.
References
[1] C.-S. Bouganis, G. Constantinides, and P. Cheung.
A Novel 2D Filter Design Methodology for Heteroge-
neous Devices. In Proc. FCCM, pages 13–22, 2005.
[2] G. Caﬀarena, G. Constantinides, P. Cheung, C. Car-
reras, and O. Nieto-Taladriz. Optimal Combined
Word-Length Allocation and Architectural Synthesis
of Digital Signal Processing Circuits. IEEE Trans.
Circuits Syst. II, 53(5):339–343, May 2006.
[3] G. Caffarena, J. A. López, C. Carreras, and
O. Nieto-Taladriz. High-Level Synthesis of Multiple
Word-Length DSP Algorithms using Heterogeneous-
Resource FPGAs. In Proc. FPL, pages 675–678,
Madrid, Spain, 2006.
[4] J. Cong, Y. Fan, G. Han, Y. Lin, J. Xu, Z. Zhang, and
X. Cheng. Bitwidth-Aware Scheduling and Binding in
High-Level Synthesis. In Proc. ASP-DAC, pages 856–
861, 2005.
[5] G. Constantinides, P. Cheung, and W. Luk. Heuris-
tic Datapath Allocation for Multiple Wordlength Sys-
tems. In Proc. DATE, pages 791–796, 2001.
[6] G. De Micheli. Synthesis and Optimization of Digital
Circuits. Series in Electrical and Computer Engineer-
ing. McGraw-Hill, New York, 1994.
[7] R. Enzler, T. Jeger, D. Cottet, and G. Tröster. High-
Level Area and Performance Estimation of Hardware
Building Blocks on FPGAs. In Proc. FPL, pages 525–
534, 2000.
[8] K. I. Kum and W. Sung. Combined Word-Length
Optimization and High-Level Synthesis of Digital Sig-
nal Processing Systems. IEEE Trans. Circuits Syst.,
20(8):921–930, Aug. 2001.
[9] X. Liang, J. Vetter, M. Smith, and A. Bland. Balanc-
ing FPGA Resource Utilities. In Proc. ERSA, pages
156–162, 2005.
[10] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee.
Accurate Area and Delay Estimators for FPGAs. In
Proc. DAC, pages 862–869, 2002.
[11] K. Schoofs, G. Goossens, and H. De Man. Bit-
Alignment in Hardware Allocation for Multiplexed
DSP Architectures. In Proc. DATE, pages 289–293,
1993.
