Abstract
Introduction
In this paper we deal with the architectural synthesis (AS) of common Digital Signal Processing (DSP) cores implemented on modern FPGAs. Two important factors leading to optimization are the use of Multiple Word-Length (MWL) fixed-point descriptions of the algorithms and the use of both LUT-based and embedded FPGA resources. The former reduces implementation cost notably, and the latter minimizes area and improves performance in FPGA implementations.
The MWL implementation of fixed-point DSP algorithms [8, 5, 4, 2] has proved to provide significant cost savings when compared to the traditional uniform word-length (UWL) design approach. The introduction of MWL issues in AS, although it increases optimization complexity, can lead to significant cost reductions [5, 4, 3].
FPGA devices have been extensively used in the implementation of DSP algorithms, especially due to the recent introduction of embedded blocks (e.g. memory blocks and DSP blocks). Traditional approaches to estimate FPGA resource usage do not apply to heterogeneous-architecture FPGAs, since they only account for lookup table (LUT) based resources [10]. This situation calls for new resource usage metrics that can be integrated into automatic optimization techniques (i.e. architectural synthesis) to fully exploit the possibilities that embedded resources offer [1, 9].
The main contributions of this paper are: (i) the presentation of a novel resource usage metric that allows minimum resource usage for heterogeneous-FPGA implementations; (ii) the presentation of an architectural synthesis procedure tuned to fixed-point implementations, that handles a complete datapath (FUs, multiplexers and registers); and, (iii) a novel strategy for fixed-point data multiplexing.
The paper is organized as follows. Section 2 introduces the architectural synthesis of DSP cores using multiple wordlength systems and modern FPGAs. Section 3 presents the implementation results obtained by synthesizing several well-known DSP benchmarks under different latency and output noise constraints. Finally, Section 4 draws the conclusions.
Generation of Fixed-Point Datapaths

Formal description
This work focuses on the time-constrained resource minimization problem [6]. The notation used is based on [6], and it is similar to that in [5, 2, 3].
Given a sequencing graph G_S(V, S), a maximum latency λ and a set of resources R (e.g. functional units R_FU, registers R_REG and steering logic R_MUX), the goal of AS is to find the time step at which each operation is executed (scheduling), the types and number of resources forming R (resource allocation), and the binding of operations and variables to functional units and registers (resource binding) that comply with the constraints while minimizing cost (i.e. area).
G_S(V, S) is a formal representation of a single iteration of an algorithm, where V is the set of operations and S ⊂ V × V is the set of signals that determine the data flow. We consider V = V_M ∪ V_G ∪ V_A ∪ V_D ∪ V_I ∪ V_O, composed of typical DSP operations: multiplications, gains, additions, unit delays, and input and output nodes. Signals are in two's complement fixed-point format, defined by the pair (n, p), where n is the wordlength of the signal (not including the sign bit) and p is the scaling of the signal, which represents the displacement of the binary point from the sign bit [2].
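The (n, p) format can be illustrated with a short sketch. It assumes that the binary point lies p bits to the right of the sign bit, so that a signal covers the range [-2^p, 2^p) with a resolution of 2^(p-n); the function names and the rounding and saturation policy are illustrative choices, not part of the notation in [2].

```python
import math


def value_range(n, p):
    """Smallest and largest values of an (n, p) signal (assumed convention)."""
    step = 2.0 ** (p - n)              # weight of the least significant bit
    return (-(2.0 ** p), 2.0 ** p - step)


def quantize(x, n, p):
    """Round x to the nearest (n, p) value, saturating on overflow."""
    step = 2.0 ** (p - n)
    lowest_code = -(2 ** n)            # integer code with only the sign bit set
    highest_code = 2 ** n - 1          # integer code with all n bits set
    code = math.floor(x / step + 0.5)  # round half up (illustrative policy)
    code = max(lowest_code, min(highest_code, code))
    return code * step


if __name__ == "__main__":
    print(value_range(4, 0))
    print(quantize(0.3, 4, 0))
    print(quantize(1.5, 4, 0))
```

Under this convention, increasing n refines the resolution without changing the range, whereas increasing p widens the range at a fixed wordlength.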
The steering logic consists of the multiplexers required in front of FUs and registers to send data to and from these two types of resources. R_MUX is determined by the scheduling ϕ, the FU binding β and the register binding γ, since ϕ determines when data is generated, β when data is used by FUs, and γ where data is stored.
Handling resource heterogeneity
The recent appearance of specialized blocks in FPGAs calls for new design methods to efficiently exploit their advantages. In [1], it is proposed to use a normalized resource usage vector. Given an FPGA with M different types of resources R_i (i = 0, ..., M−1), each type with a maximum number of |R_i| resources, the resource requirements of a particular design implementation d can be expressed as the following normalized area vector:
$$\hat{A}(d) = \left( \frac{\#r_0}{|R_0|}, \frac{\#r_1}{|R_1|}, \ldots, \frac{\#r_{M-1}}{|R_{M-1}|} \right)$$
where #r_i is the number of resources of type R_i used. Two useful norms are the ∞-norm and the 1-norm:
$$\|\hat{A}(d)\|_\infty = \max_{0 \le i < M} \frac{\#r_i}{|R_i|}, \qquad \|\hat{A}(d)\|_1 = \sum_{i=0}^{M-1} \frac{\#r_i}{|R_i|}$$
The inverse of the ∞-norm represents the number of times that the same implementation of design d can be replicated within the FPGA device (see [1]), and the 1-norm gives information about the overall resource usage of the implementation. Each norm is interesting on its own, but each also has pitfalls. On the one hand, if two implementations have the same ∞-norm, they can be replicated the same number of times, but there is no way to know which implementation requires fewer resources. On the other hand, the 1-norm can tell whether one design implementation requires fewer resources than another, but that does not guarantee that the implementation with fewer resources can be replicated more times than the other. We propose a linear combination of the ∞-norm and the 1-norm, called the +-norm (plus-norm), that has all the benefits of both norms but none of the drawbacks.
The value of the constant K used in this combination can be obtained from the parameters M and R_i. For a suitable value of K, it is guaranteed that, for any two implementations d_i and d_j such that the +-norm of d_i is smaller than that of d_j, d_i can be replicated more times than d_j, or the same number of times while consuming fewer resources. Therefore, minimizing the +-norm implies that the design can be replicated within the FPGA the maximum possible number of times while using the minimum possible number of resources. The +-norm has a low computational cost and is suitable for integer linear programming approaches [2] and heuristic approaches [3].
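The behavior of the three metrics can be sketched numerically. The sketch below is only an illustration under stated assumptions: the +-norm is taken to be K times the ∞-norm plus the 1-norm, K = 10 is an arbitrary illustrative value rather than the bound derived in this section, and the device capacities and design figures are hypothetical.

```python
import math


def area_vector(used, capacity):
    """Normalized area vector: resources used over resources available."""
    return [u / c for u, c in zip(used, capacity)]


def inf_norm(a):
    return max(a)


def one_norm(a):
    return sum(a)


def plus_norm(a, k):
    # Assumed form of the linear combination; k is an illustrative weight.
    return k * inf_norm(a) + one_norm(a)


def replications(a):
    """Number of copies of the design that fit in the device."""
    return math.floor(1.0 / inf_norm(a))


# Hypothetical device: 1000 LUT-based slices and 10 embedded multipliers.
CAPACITY = [1000, 10]
d1 = area_vector([500, 1], CAPACITY)   # [0.5, 0.1]
d2 = area_vector([300, 3], CAPACITY)   # [0.3, 0.3]
d3 = area_vector([500, 0], CAPACITY)   # [0.5, 0.0]

if __name__ == "__main__":
    for name, d in (("d1", d1), ("d2", d2), ("d3", d3)):
        print(name, replications(d), one_norm(d), plus_norm(d, 10))
```

In this example the 1-norm alone prefers d3, which fits only twice, while the ∞-norm alone cannot separate d1 from d3. The combined metric ranks d2 first (three copies) and then d3 ahead of d1 (same number of copies, fewer resources).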
Resource modeling
Resources are divided into three types: functional units (R_FU), registers (R_REG) and steering logic (R_MUX). The area and latency of FUs and registers (i.e. A(r) and l(r)) are expressed as functions of the input and output wordlength information (p and n). They are obtained by applying curve fitting to hundreds of synthesis results. The use of accurate delay cost functions has been shown to provide significant performance improvements (from 12% to 63%) compared to other existing naive approaches (see [3]). Registers are assumed to have zero latency. Note that A is a vector with as many components as types of FPGA resources. The fact that multiplexer and wiring latencies are neglected could be easily overcome by multiplying the latency of FUs by an empirical factor [7].
The area of multiplexers in UWL systems is only affected by the data wordlengths, which set the multiplexer sizes, and by the number of different data sources (e.g. registers or FUs), which determines the multiplexer width. An estimation of the area of an N-input multiplexer of wordlength M for Virtex-II devices is given by
In MWL systems, data must be aligned before being processed by FUs or stored in registers. In [11] the problem of data alignment and multiplexing is tackled by means of alignment blocks introduced before multiplexers. In this work, multiplexers are used for both data multiplexing and data alignment, since the combination of these two tasks leads to a reduction in the number of control signals and, therefore, in control logic. In addition, the chances for logic optimization are greater than if two separate blocks (an alignment block and a multiplexer) are used. Fig. 1 shows three alignment options: an alignment that requires both sign extension and zero padding (Fig. 1(a)), LSB alignment (Fig. 1(b)), and MSB alignment (Fig. 1(c)). Note that, on the one hand, sign extension (Fig. 1(a) and Fig. 1(b)) does not offer any opportunity for logic optimization, since the sign bits must be multiplexed. On the other hand, zero padding (Fig. 1(a) and Fig. 1(c)) does offer it, due to the reduction in the number of signals and the introduction of constant bits (zeros) that can be hard-wired into the multiplexer logic. In fact, MSB alignment (Fig. 1(c)) is the option that allows the greatest logic reduction. Therefore, it is proposed to apply this alignment whenever possible. A lower bound on the multiplexers' area if MSB alignment is adopted can be computed as
where N is the maximum wordlength present and n_i is the wordlength of signal i.
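The difference between LSB and MSB alignment can be made concrete at the bit level. The following sketch assumes that a signal of wordlength n_i is stored as an integer code of n_i bits plus a sign bit, and that it is routed through a multiplexer of N bits plus sign; the helper names are hypothetical and the sketch does not reproduce the area expressions of this section.

```python
def to_bits(value, width):
    """Two's complement bit string of `value` using `width` bits."""
    return format(value & ((1 << width) - 1), "0{}b".format(width))


def lsb_align(value, n_max):
    """LSBs aligned: the word is sign-extended up to n_max + 1 bits."""
    return to_bits(value, n_max + 1)


def msb_align(value, n_i, n_max):
    """MSBs aligned: the word is zero-padded on the LSB side."""
    return to_bits(value << (n_max - n_i), n_max + 1)


def constant_bits(n_i, n_max):
    """Padding zeros that can be hard-wired when MSB alignment is used."""
    return n_max - n_i


if __name__ == "__main__":
    # A 3-bit (plus sign) code placed on a 7-bit (plus sign) data path.
    print(lsb_align(-3, 7))      # copies of the sign bit fill the MSB side
    print(msb_align(-3, 3, 7))   # constant zeros fill the LSB side
```

With LSB alignment the extra bit positions carry the sign, which is data dependent and must be multiplexed. With MSB alignment they carry constant zeros, which is what makes the logic reduction possible.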
Datapath automatic synthesis
The optimization procedure is based on Simulated Annealing (SA). The inputs to the optimizer are the sequencing graph G_S(V, S) and the total latency constraint λ, from which it is possible to extract the set of resources R and the compatibility graph G_C. The optimization is based on changing the current FU binding (β) and computing the area of the datapath. The movements are the following:
• M_A: Bind an operation v ∈ V to an unused resource r ∈ R_FU.
• M_B: Bind an operation v ∈ V to another, already used resource r ∈ R_FU.
• M_C: Swap the binding of two compatible operations v_1 and v_2 mapped to different resources r_1 and r_2.
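The three movements can be sketched as operations on a binding table. This is a minimal illustration, assuming that the FU binding is stored as a dictionary from operation names to resource names and that the caller has already checked compatibility against G_C; the function names and the standard Metropolis acceptance rule are assumptions, since the cooling schedule and acceptance details of the optimizer are not given here.

```python
import math
import random


def used_resources(binding):
    return set(binding.values())


def move_a(binding, op, resource):
    """M_A: bind op to a resource that no operation currently uses."""
    if resource in used_resources(binding):
        raise ValueError("M_A requires an unused resource")
    new_binding = dict(binding)
    new_binding[op] = resource
    return new_binding


def move_b(binding, op, resource):
    """M_B: bind op to another resource that is already in use."""
    if resource not in used_resources(binding) or binding[op] == resource:
        raise ValueError("M_B requires a different, already used resource")
    new_binding = dict(binding)
    new_binding[op] = resource
    return new_binding


def move_c(binding, op1, op2):
    """M_C: swap the resources of two operations bound to different FUs."""
    if binding[op1] == binding[op2]:
        raise ValueError("M_C requires operations on different resources")
    new_binding = dict(binding)
    new_binding[op1], new_binding[op2] = binding[op2], binding[op1]
    return new_binding


def accept(delta_cost, temperature, rng):
    """Metropolis rule: always accept improvements, sometimes accept losses."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


if __name__ == "__main__":
    beta = {"m1": "mul0", "m2": "mul0", "m3": "mul1"}
    print(move_a(beta, "m2", "mul2"))
    print(move_b(beta, "m3", "mul0"))
    print(move_c(beta, "m1", "m3"))
    print(accept(-1.0, 1.0, random.Random(0)))
```

M_A tends to increase the number of FUs, M_B tends to reduce it, and M_C keeps it constant while changing which wordlengths each FU must accommodate.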
Every time a movement is produced, the resulting area, expressed as in (4), is obtained by first applying a list-based resource-constrained scheduling tuned to MWL systems [3], followed by register binding based on the left-edge algorithm [6] and multiplexer generation based on MSB alignment. MWL issues are handled automatically thanks to the wordlength-wise cost modeling of resources, which have wordlength-dependent area and latency costs. FPGA resource heterogeneity is handled by means of the +-norm, which enables the automatic selection of LUT-based and embedded resources without adding any step to the SA approach. This method provides a robust way to obtain optimized implementations of DSP algorithms using FPGAs.
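The register binding step can be illustrated with a first-fit formulation of the left-edge idea. The sketch assumes that each variable has a half-open lifetime [start, end) measured in time steps and ignores wordlengths; in an MWL datapath each register would additionally have to be as wide as the widest variable bound to it. The data layout and function name are hypothetical.

```python
def left_edge(lifetimes):
    """Bind variables to registers so that lifetimes never overlap.

    lifetimes maps a variable name to a (start, end) pair, read as the
    half-open interval [start, end). Returns a dict variable -> register index.
    """
    free_at = []   # free_at[k] is the time step at which register k is free
    binding = {}
    # Visit variables by increasing start time (their "left edge").
    ordered = sorted(lifetimes.items(),
                     key=lambda item: (item[1][0], item[1][1], item[0]))
    for var, (start, end) in ordered:
        for reg, release in enumerate(free_at):
            if release <= start:          # register is free again: reuse it
                free_at[reg] = end
                binding[var] = reg
                break
        else:                             # no free register: allocate one
            free_at.append(end)
            binding[var] = len(free_at) - 1
    return binding


if __name__ == "__main__":
    example = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
    print(left_edge(example))
```

For this example two registers suffice, since at most two lifetimes overlap at any time step.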
Results
The following benchmarks are used for the analysis: (i) an ITU RGB to YCrCb converter (RGB); (ii) a 3rd-order lattice filter (LAT_3); (iii) a 4th-order IIR filter (IIR_4); and (iv) an 8th-order linear-phase FIR filter (FIR_8). All algorithms are assigned 8-bit inputs and 12-bit constant coefficients. The algorithm implementations have been tested under different latency and output noise constraint scenarios assuming a system clock of 125 MHz. In particular, the noise constraints were is the smallest-cost device able to hold the design, whereas 1 < Â_∞ ≤ 2 implies that XC2V80 is the smallest-cost device, and so on. The datapaths allow resource sharing (for both adders and multipliers), and there are generic LUT-based multipliers (R_MLUT) as well as 18x18 and 36x18 generic multipliers (forming R_MEMB).
Before AS, each algorithm is translated to a fixed-point specification by means of two wordlength optimization procedures that follow a UWL approach and an MWL approach, respectively. Once the fixed-point signal formats are available, AS is applied to each possible combination of latency, quantization scenario and FPGA architecture (homogeneous or heterogeneous), for a total of 120 implementations per benchmark.
Homogeneous UWL vs. MWL
In Fig. 2(a), it can be seen how both the UWL and MWL areas decrease as the latency increases, which is expected since the chances for FU reuse increase. The area improvements obtained by means of using an MWL approach are up to 65%. Also, the minimum latency that each implementation achieves differs considerably. Fig. 2(b) displays the detailed resource distribution for the IIR_4 UWL and MWL implementations corresponding to σ^2 = 10^−5 and λ = 17 (see UWL-HOM and MWL-HOM bars). The separate resource usages of FUs (FU-LUT and FU-EMB), multiplexers (MUX-FU and MUX-REG) and registers (REG) are displayed. The overall area saving is 36%, and it is due to the fact that the wordlengths of the majority of signals, which impact FUs, multiplexers and registers, have been greatly reduced. It is important to highlight that the area due to multiplexers and registers, although smaller than the FUs' area, makes up a significant part of the total area (37% for UWL and 43% for MWL). Hence the importance of including these costs within the optimization loop.
The graph on the left of Fig. 3 holds the results regarding homogeneous implementations. Note that the mean improvements obtained for all benchmarks are relatively close to the maximum value. The overall average improvement obtained is 46% and the maximum achieved is 77%. Regarding latency, the minimum latency achievable by UWL implementations is reduced by 22% on average. The results clearly show that an MWL AS approach achieves significant area reductions.
UWL vs. MWL: Heterogeneous case
The curves in Fig. 2(a) (UWL-HET and MWL-HET, black symbols) show that, again, there is a significant gain in using an MWL approach, with improvements of up to 65%. Fig. 2(b) (UWL-HET vs. MWL-HET) shows that the improvements are mainly due to an overall wordlength reduction. Also, it can be seen that the LUT-based resources are almost entirely devoted to data storage and multiplexing.
The central graph in Fig. 3 holds the results regarding heterogeneous implementations. For each quantization scenario, the mean and maximum values of the area improvements obtained by the MWL implementations in comparison to the UWL implementations are presented. The area improvements obtained are, again, remarkable: RGB obtains a mean improvement of 50%; LAT_3 of 34%; IIR_4 of 37%; FIR_8 of 35%. The average improvement obtained is 39% and the maximum achieved is 81%. The latency analysis shows that the minimum UWL latency is reduced by 19% on average. Again, an MWL AS approach achieves significant area reductions for heterogeneous implementations.
Homogeneous vs. Heterogeneous
The area vs. latency curves show that the introduction of embedded multipliers dramatically reduces the overall resource usage, with improvements of up to 65%. Fig. 2(b) indicates that the improvements are mainly due to a migration of LUT-based multipliers to embedded multipliers.
The graph on the right in Fig. 3 holds the results of this comparison. The results clearly show that a wise use of embedded resources leads to highly optimized datapath implementations. Regarding latencies, the minimum latency achievable by both kinds of implementations is the same for the experiments performed. This is due to the fact that the latencies of the resources are very similar in the particular conditions used for the tests. The same experiments presented in this section were repeated with the constant wordlength increased to 16 bits, in which case heterogeneous implementations reduced the minimum latency of homogeneous implementations by 7%.
Summarizing, the efficient use of embedded resources highly improves the final implementation results.
Effect of using a complete datapath
All these experiments were repeated excluding registers and multiplexers from the available resources during the optimization process, and adding them afterward to compute the final datapath area. The homogeneous implementations were degraded by up to 5%, and the heterogeneous ones by up to 35%. It can be concluded that the use of a complete datapath description is especially significant in heterogeneous implementations, since they rely strongly on optimizing the balance between LUT-based and embedded resources.
Conclusions
In this paper an architectural synthesis approach able to produce optimized fixed-point implementations using modern FPGA devices is presented. The key to success is provided by the use of highly accurate models of the datapath resources, a complete datapath resource set that includes multiplexers and registers, a novel method to handle fixed-point data alignment and multiplexing, and the introduction of a novel resource usage metric that can cope with LUT-based and embedded FPGA resources.
The AS procedure produces area improvements of up to 80% when compared to uniform-wordlength implementations, and minimum latency improvements of up to 22%. The efficient use of embedded resources achieves area improvements of up to 54% when compared to homogeneous implementations. Also, the inefficiency of current FPGA architectures to implement data steering was exposed.
These results are intended to be further improved by including the fixed-point refinement process as part of the architectural synthesis [2].
Acknowledgment
This work was supported by the Spanish Ministry of Education and Science under Research Project TEC2006-13067-C03-03.
