Abstract-In this brief, we address the combined application of word-length allocation and architectural synthesis of linear time-invariant digital signal processing systems. These two design tasks are traditionally performed sequentially, which lessens the overall design complexity but ignores forward and backward dependencies that may lead to cost reductions. Mixed integer linear programming is used to formulate the combined problem, and results are compared to the traditional two-step approach.
I. INTRODUCTION
This brief addresses the problem of hardware synthesis of digital signal processing (DSP) algorithms under both error and latency constraints. Programmable logic devices are chosen as the target architecture.
The multiple word-length implementation of DSP algorithms [1] has lately been an active research field. The traditional uniform word-length design approach, inherited from microprocessor-oriented design, has been revisited over the last few years, and algorithms for both word-length allocation [2]-[6] and architectural synthesis [4], [7], [8] have been tuned to the more efficient multiple word-length design. However, little research has been carried out regarding the combined application of both design tasks. In [4] a 3-step methodology is presented: approximate word-length allocation, architectural synthesis, and accurate word-length allocation of the resulting architecture. That approach is pioneering work in combining word-length allocation and architectural synthesis, but it does not report the area savings obtained relative to the traditional approach.
Mixed integer linear programming (MILP) is used to define the problem. Its main purpose is to assess the suitability of applying these two design tasks simultaneously; the results presented in this brief can also be used to evaluate future heuristic algorithms.
The main contribution of this brief is an optimal analysis of the simultaneous application of word-length allocation and architectural synthesis. This approach is compared to the sequential application of optimal algorithms for word-length allocation and architectural synthesis. Area savings of up to 13% are reported.
The brief is organized as follows. Section II deals with the main concepts involved in the combined approach. Section III presents the MILP formulation of the problem. In Section IV some results are analyzed. Finally, conclusions are drawn in Section V.
II. COMBINED WORD-LENGTH ALLOCATION AND ARCHITECTURAL SYNTHESIS

A. Combined Approach
The combined application of the word-length allocation and architectural synthesis tasks takes as its starting point a computation graph, a maximum latency, and a maximum noise variance at the output.
The computation graph is a formal representation of the algorithm. It comprises a set of graph nodes representing operations and a set of directed edges representing signals, which determine the data flow. We consider graphs composed of gains, additions, unit delays, forks (branching nodes), and input and output nodes. Signals are in two's-complement fixed-point format defined by a (word-length, scaling) pair, where the word-length of the signal does not include the sign bit, and the scaling represents the displacement of the binary point from the sign bit (see Fig. 1).
Operations are to be implemented on a set of resources, and the aim of the combined approach is to find the word-lengths, the time step at which each operation is executed (scheduling), the types and number of resources (resource allocation), and the binding between operations and resources (resource binding) that comply with both the latency and the noise constraints while achieving minimum area. The notation for the range of a function is used in this brief.
represents the cardinality of set denotes logical AND, , logical OR, and , set subtraction. The set of input signals driving node are expressed as and the output signals driven by as . The upper bounds on variable are represented as .
B. Scaling
The scaling of signals can be computed before the optimization process starts. We choose an analytical approach based on the computation of a norm of the transfer function from the input to each signal. Given the input peak value, the scaling of a signal is determined by (1).
(1)
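The peak-value scaling described above can be sketched as follows. This is only an illustration under stated assumptions: the norm is taken to be the l1-norm of a (truncated) impulse response, and the scaling is taken as floor(log2(peak)) + 1, which is the usual worst-case rule for two's-complement signals. The exact form of (1) is not reproduced here, and the function names are hypothetical.

```python
import math


def l1_norm(impulse_response):
    """Sum of absolute values of the impulse response (assumed finite or truncated)."""
    return sum(abs(h) for h in impulse_response)


def scaling(input_peak, impulse_response):
    """Assumed scaling rule: binary-point displacement that covers the worst-case peak.

    The worst-case peak at the signal is input_peak * l1_norm(h). A scaling p
    covers the range [-2**p, 2**p), so p = floor(log2(peak)) + 1.
    """
    peak = input_peak * l1_norm(impulse_response)
    if peak <= 0:
        raise ValueError("peak value must be positive")
    return math.floor(math.log2(peak)) + 1


if __name__ == "__main__":
    # Impulse response from the input to some internal signal (illustrative values).
    h = [0.25, 0.5, 0.25]
    print(scaling(1.0, h))
```

Because the l1-norm bounds the peak for any bounded input, a scaling chosen this way cannot overflow, at the price of being conservative for typical inputs.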
C. Noise Model
We adopt the quantization error model presented in [9]. The quantization of a signal to a shorter word-length is modeled by the injection of uniformly distributed white noise whose variance is given by (2). The variance of the noise contribution at the output follows from this variance and a norm of the transfer function from the signal to the output.

(2)

As stated in [5], the error introduced by forks requires special treatment. Fig. 2 shows a 2-way fork with quantized outputs. First, the outputs must be sorted in descending word-length order to take into account the correlation between them and the input of the fork [Fig. 2(b)]. The quantization noise injected by the first quantizer traverses both fork outputs, and hence requires the transfer functions from both outputs to the system output [Fig. 2(c)]. The noise injected by the second quantizer traverses only the second fork output, so only the transfer function of that output contributes to the output noise. The error that a multi-way fork introduces can be expressed as in (3), where a tuple expresses the order of the outputs [5].

(3)
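As a rough numerical illustration of the noise model, the sketch below assumes the commonly used truncation model in which quantizing from `bits_before` to `bits_after` bits at scaling `p` injects white noise of variance 2^(2p) (2^(-2 bits_after) - 2^(-2 bits_before)) / 12, and that the contribution at the output is this variance times the squared L2-norm of the impulse response from the signal to the output. Both formulas and all names are assumptions made for the example, not a transcription of (2).

```python
def quantization_noise_variance(scaling, bits_before, bits_after):
    """Assumed variance of the noise injected when truncating a signal."""
    if bits_after > bits_before:
        raise ValueError("quantization cannot increase the word-length")
    step_term = 2.0 ** (-2 * bits_after) - 2.0 ** (-2 * bits_before)
    return (2.0 ** (2 * scaling)) * step_term / 12.0


def output_noise_variance(variance, impulse_response_to_output):
    """White noise through an LTI path: variance times the squared L2-norm."""
    l2_squared = sum(g * g for g in impulse_response_to_output)
    return variance * l2_squared


if __name__ == "__main__":
    injected = quantization_noise_variance(scaling=0, bits_before=8, bits_after=4)
    print(output_noise_variance(injected, [1.0, 0.5]))
```

Summing such contributions over all quantized signals gives the total output noise that the error constraint bounds.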
D. Architectural Synthesis
The data flow of a single iteration of the algorithm is expressed by means of a sequencing graph extracted from the computation graph. Its nodes are the operations and its edges specify the precedence relations among operations. This graph is used to make scheduling decisions.
In our approach, we assume 1-cycle latency operations. Each operation can be executed during the time interval defined by (4), where time steps are nonnegative integers. The interval is bounded below by the execution time of the operation under as-soon-as-possible (ASAP) scheduling and above by its execution time under as-late-as-possible (ALAP) scheduling for the given total number of time steps. The set of all possible execution times is given by (5).

(4) (5)

The set of resources is divided into multipliers, adders, and registers, which implement gains, additions, and delays, respectively. Multiplexing logic and memory to store intermediate values are not considered among resources.
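The ASAP and ALAP execution times that bound each operation's scheduling interval can be computed with a forward and a backward pass over the sequencing graph. The following is a minimal sketch assuming unit-latency operations, time steps numbered from 0, and a graph given as a hypothetical `predecessors` dictionary; it is not the authors' implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def asap_alap(predecessors, total_steps):
    """Return (asap, alap) dictionaries for unit-latency operations.

    predecessors maps each operation to the set of operations it depends on.
    Time steps run from 0 to total_steps - 1.
    """
    order = list(TopologicalSorter(predecessors).static_order())

    # Forward pass: an operation starts one step after its latest predecessor.
    asap = {}
    for op in order:
        asap[op] = max((asap[p] + 1 for p in predecessors.get(op, ())), default=0)

    # Build the successor relation for the backward pass.
    successors = {op: set() for op in order}
    for op, preds in predecessors.items():
        for p in preds:
            successors[p].add(op)

    # Backward pass: an operation must finish before its earliest successor.
    alap = {}
    for op in reversed(order):
        alap[op] = min((alap[s] - 1 for s in successors[op]), default=total_steps - 1)

    if any(asap[op] > alap[op] for op in order):
        raise ValueError("latency constraint is too tight for this graph")
    return asap, alap


if __name__ == "__main__":
    graph = {"g1": set(), "g2": set(), "a1": {"g1", "g2"}, "a2": {"a1"}}
    print(asap_alap(graph, total_steps=4))
```

Restricting each operation to the steps between its ASAP and ALAP times reduces the number of binary scheduling variables the MILP model needs.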
We express the compatibility between an operation, or set of operations, and resources by means of a compatibility function. Targeting programmable logic devices, we regard only multipliers as shareable, since the multiplexing logic required is often negligible compared to the area of these resources, a situation that does not apply to adders or registers. For instance, the ratios between the area of a LUT-based 16 × 16-bit multiplier and the areas of a 16-bit 2-input multiplexer and a 16-bit 4-input multiplexer are 17.25 and 8.625, respectively, for Virtex-2 and Virtex-4 devices (using Xilinx ISE v7.1). Thus, there are dedicated resources to implement each addition and delay, so the bindings of additions to adders and of delays to registers are one-to-one, and the numbers of adders and registers equal the numbers of additions and delays, respectively. Multipliers have one input devoted to coefficients, whose word-length is equal to a system-wide constant. The other input is assigned to the input signal of gains and must have a word-length greater than or equal to the maximum word-length of the inputs of the gains bound to the resource.
An upper bound on the number of multipliers necessary can be estimated from the number of multipliers needed to implement the ASAP scheduling. Initially, all gains can be implemented on all multipliers, so every gain is compatible with every multiplier.
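One plausible reading of this bound is the peak number of gains that execute in the same time step under ASAP scheduling. The sketch below assumes that reading; the `asap_time` mapping and the function name are hypothetical.

```python
from collections import Counter


def multiplier_upper_bound(asap_time, gains):
    """Peak number of gains sharing a time step in the ASAP schedule.

    asap_time maps operation names to ASAP time steps; gains is the subset of
    operations that are implemented on multipliers.
    """
    gains_per_step = Counter(asap_time[g] for g in gains)
    return max(gains_per_step.values(), default=0)


if __name__ == "__main__":
    asap_time = {"g1": 0, "g2": 0, "g3": 1, "a1": 1}
    print(multiplier_upper_bound(asap_time, {"g1", "g2", "g3"}))
```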
E. Area Models
The cost of an adder bound to addition with inputs and and output is given by (6) and it is derived from the model in [5] . A ripple-carry adder is supposed. Signals and must comply with the following: signal is shifted bits from the least significant bit of and scaling should be bigger than or equal to the value of (see Fig. 3 for an example). Equation (6) requires the definition of and . Let us define as the number of overlapped bits between and with sign extension (7) . A safe adder would require bits. Let us define as the number of nonrequired bits at the output due to scaling (8) . The area of an optimized adder is equal to the area of the safe adder minus bits (6) . Note that the max operation in (7) can be expressed as a disjunction, and that is a constant number 
The cost of a register bound to a delay is given by the straightforward equation (9). Equation (10) gives the cost of a multiplier bound to a subset of gains.

(9) (10)
III. MILP FORMULATION
This section relies on some knowledge of integer linear programming [10]. The variables used in the MILP model are divided into: binary scheduling and resource binding variables, integer signal word-lengths, integer signal word-lengths before quantization, binary auxiliary signal word-lengths, binary auxiliary signal word-lengths before quantization, binary decision variables, integer adder costs, integer auxiliary variables, and real fork-node error variables. In the following subsections, we present the formulation of the MILP model.
A. Objective Function
The objective function is the sum of the area of all resources (adders, registers, and multipliers) and is given by (11). The cost of adders is linearized in the constraints according to (6).

(11)
B. Architectural Synthesis Constraints
Here, we introduce the constraints related to scheduling, resource allocation, and resource binding. Equation (12) defines the binary variables that steer the constraints in this subsection: a variable equals 1 if a given operation is scheduled at a given time step on a given resource, and 0 otherwise.

(12)

Equation (13) shows the binding constraint, which ensures that an operation is executed on exactly one resource. The next constraint (14) states that a resource does not implement more than one operation at a time. Note that there is no need to apply (13) and (14) to operations with dedicated resources. The precedence constraints are given by (15), ensuring that operations obey the dependencies in the sequencing graph.

(13) (14) (15)

Finally, (16) expresses the resource compatibility constraints, which guarantee that a resource bound to several operations is compatible with all of them. Again, only multipliers are considered: the input devoted to signals must have a word-length at least as large as the maximum of the word-lengths of the gain inputs bound to it. The summation is equal to 1 if the operation is bound to the resource.

(16)

Note that although only multipliers are shared here, the notation can be easily extended to include more shareable resources (dividers, adders, etc.) or to map more than one type of operation to the same resource (e.g., gains and multiplications bound to multipliers).
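The meaning of the binding, resource-usage, and precedence constraints can be illustrated with a small feasibility check on a candidate solution. The sketch represents the binary variables that are equal to 1 as a set of (operation, time step, resource) triples and assumes unit-latency operations; the data layout and function name are hypothetical and this is not the MILP model itself.

```python
from collections import Counter


def is_feasible(assignment, operations, precedences):
    """Check a candidate schedule and binding for shared operations.

    assignment: set of (operation, step, resource) triples whose binary
    variable is 1. operations: operations that need a shared resource.
    precedences: (predecessor, successor) pairs among those operations.
    """
    # Binding: every operation is executed exactly once, on one resource.
    times_bound = Counter(op for op, _, _ in assignment)
    if any(times_bound[op] != 1 for op in operations):
        return False

    # Resource usage: a resource runs at most one operation per time step.
    slot_usage = Counter((step, resource) for _, step, resource in assignment)
    if any(count > 1 for count in slot_usage.values()):
        return False

    # Precedence: a successor starts at least one step after its predecessor.
    start = {op: step for op, step, _ in assignment}
    return all(start[succ] >= start[pred] + 1 for pred, succ in precedences)


if __name__ == "__main__":
    ops = {"g1", "g2", "g3"}
    deps = [("g1", "g3")]
    shared = {("g1", 0, "m1"), ("g2", 1, "m1"), ("g3", 2, "m1")}
    print(is_feasible(shared, ops, deps))
```

In the MILP model these checks become linear equalities and inequalities over the binary variables, so the solver only explores assignments for which such a check would succeed.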
C. Adder Cost
The linearization of the adder cost is based on the model from [5] . Constraints (17)-(20) cast (7) using binary decision variables and , and also trivial bounds on the left side of the equations. Equation (6) 
D. Word-Length Allocation Constraints
Here, we present the constraints related to the estimation of the noise at the output of the system [5]. The error constraint is given by (22) and is divided into two summations, the first dealing with the signals of forks and the second dealing with the remaining signals in the graph (see Section II-C).

(22)
Note that nonconstant powers of two must be linearized. Each such term is replaced by a linear combination of binary auxiliary variables associated with the signal, as defined by (23) and (24). For the sake of simplicity, we leave all nonconstant powers of two as they are throughout the text.
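A common way to linearize a power of two whose exponent is an integer variable is a one-hot encoding: one binary variable per admissible word-length, exactly one of which is set. The sketch below checks that encoding numerically. It assumes word-lengths in the range 1 to `n_max` and a term of the form 2^(-2n); the exact form of (23) and (24) is not reproduced here, and the names are hypothetical.

```python
def one_hot(word_length, n_max):
    """Binary auxiliary variables: beta[k-1] is 1 only when k == word_length."""
    if not 1 <= word_length <= n_max:
        raise ValueError("word-length outside the assumed bounds")
    return [1 if k == word_length else 0 for k in range(1, n_max + 1)]


def encoded_word_length(beta):
    """Linear expression that recovers the integer word-length."""
    return sum(k * b for k, b in enumerate(beta, start=1))


def linearized_power(beta):
    """Linear expression in beta that equals 2**(-2 * word_length)."""
    return sum(2.0 ** (-2 * k) * b for k, b in enumerate(beta, start=1))


if __name__ == "__main__":
    beta = one_hot(3, n_max=8)
    print(sum(beta), encoded_word_length(beta), linearized_power(beta))
```

The coefficients 2^(-2k) are constants, so the expression is linear in the binary variables even though the original term is exponential in the word-length.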
The noise introduced by a fork is expressed by constraints (25)-(28), which come from applying De Morgan's theorem to (3) and linearizing the resulting disjunction. Binary variables are introduced for this purpose. These constraints are repeated for each possible ordering of the outputs of a fork.

(25) (26) (27) (28)
E. Conditioning Constraints
This last set of constraints computes the word-lengths before quantization when considering scaling and word-length propagation information [5] .
Given an addition with inputs and and output , its output's word-length is equal to , expression linearized through the following: Regarding forks, the outputs do not require conditioning but its inputs must comply with the following:
The conditioning of a gain is expressed by constraint (33), which involves the scaling of the coefficient associated with the gain. Finally, (34) indicates that signals must be truncated to a word-length smaller than or equal to their pre-quantization word-length.

(33) (34)
F. Bounds on Word-Length of Variables
Bounds on word-lengths are estimated using an adaptation of the procedure presented in [5]: 1) use a heuristic algorithm to allocate word-lengths and calculate the area due to gains; 2) assign to each gain input the word-length that makes its area as large as that area; 3) set all gain inputs to the maximum word-length of all gain inputs; and 4) condition the graph.
IV. RESULTS
An MILP solver [11] was used to find the optimal solutions for a set of FIR and IIR filters. The filter coefficients were obtained using the tool fdatool from Matlab 6.5 [12]. The FIR filters were implemented using the direct transposed symmetric FIR structure. The FIR test set comprises a second-order filter with 8-bit inputs and 4-bit coefficients; a third-order filter with 8-bit inputs and 8-bit coefficients; and a fourth-order filter with 4-bit inputs and 4-bit coefficients. The IIR filter was a second-order filter with 4-bit inputs, gain and 4-bit coefficients, implemented using the direct form II transposed. The filters were tested under different latencies, and for each latency two solutions were computed: one for the sequential approach, where the error-constrained problem was solved first and its solution was fed to the latency-constrained problem, and another for the combined approach.
The comparison results are given in Table I as the percentage of area reduction of the combined approach relative to the sequential approach. The number of lookup tables (LUTs) required for the different approaches is also provided (sequential/combined). The area savings range from 0% to 13.16%, and are due to an optimal exploration of the dependencies between word-lengths, resources, and error variance. Empty cells indicate that the MILP solver did not find a solution within a practical time (less than 12 hours). For instance, Table II shows the word-lengths, including the sign bit, assigned to gains, multipliers, adders, and registers for several error/latency conditions (see Table I, third row). In one case, the area of the multiplier is reduced while the area of adders is slightly increased. In another, the area of registers and adders is reduced thanks to the increase of the word-lengths of gains. Finally, a third case shows that an increase in the word-length of gains enables reducing the area of adders.
The execution times to solve the MILP problems range from several seconds to several hours (IIR).
V. CONCLUSION
In this brief, we have presented a novel MILP formulation for the combined error- and latency-constrained area-minimization problem, applicable to linear time-invariant DSP algorithms. This optimal model of the problem can be used to assess the quality of future heuristic methods that address the problem. The problem can be easily reformulated to include more complex resource binding [8], [13] that supports multiple-latency and pipelined resources, operation chaining, etc. The approach can also be applied to ASIC implementations.
Results show the advantage of the combined use of these well-known design tasks. Area savings of up to 13% are reported.
