Abstract
Introduction
This paper addresses the problem of hardware synthesis from an initial, infinite precision, specification of a digital signal processing (DSP) algorithm. DSP algorithm development is usually initially performed without regard to finite precision effects, whereas for Field-Programmable Gate Array (FPGA) implementation, finite precision effects are often of critical importance.
It has been argued elsewhere [1] that often the most efficient FPGA implementation of an algorithm is one which supports a wide variety of finite precision representations, so that the best representation can be used for each internal variable. The accuracy observable at the outputs of a DSP system is a function of the wordlengths used to represent all intermediate variables in the algorithm. However accuracy is less sensitive to some variables than to others, as is implementation area.
The contribution of this paper is to present a technique for optimum wordlength allocation, for the case where the DSP algorithm to be synthesized is a linear, time-invariant (LTI) system [2] . Existing methods for wordlength allocation are heuristic by nature and thus it is difficult to measure the quality of the solutions produced by these methods. It is this uncertainty that has motivated the work set forth in this paper: to enable accurate characterization of the effectiveness of wordlength optimization techniques with respect to optimum solutions.
The wordlength optimization techniques of interest are those which allow a user-controlled trade-off between implementation area and signal quality at the DSP system outputs, such as those described by [1, 3, 4, 5] .
The Mixed Integer Linear Programming (MILP) technique described in this paper has been applied to several small benchmark circuits, and the results compared to the heuristic presented in [1] . Modelling as a MILP permits the use of industrial-strength MILP solvers such as BonsaiG [6] . Although MILP solution time can render the synthesis of large circuits intractable, optimal results even on small circuits are valuable as benchmarks with which to compare heuristic optimization procedures. For this purpose the optimal specifications have been made available for public download for anyone interested in comparing new or existing wordlength optimization techniques.
Although the construction of the MILP is described in detail in this paper, no complete example MILP is given for space reasons. Several examples can be found at: http://infoeng.ee.ic.ac.uk/ gac1/OptimumWL This paper has the following structure. Section 2 describes the relevant literature in wordlength optimization, before the computation model and associated high-level area models are described in Sections 4 and 5. A brief review of the proposed noise model for LTI systems is then provided in Section 6, before the construction of the proposed MILP is given in Section 7. Results from application of the model to benchmark circuits are given in 8, before concluding the paper in Section 9.
Background
In [7] it has been demonstrated that a simplified version of the problem addressed in this paper is NP-hard. There are, however, several published approaches to wordlength optimization. Those offering an area / signal quality trade-off are all of an heuristic nature [1, 3, 5] or do not support different fractional precision for different internal variables [4] .
Bendetti and Perona [8] have proposed an analytic method for wordlength optimization based on interval arithmetic. The authors propose a 'multi-interval' approach, and demonstrate that the addition, subtraction, multiplication and division of the proposed intervals result in similar intervals, which may be propagated through a loop-free data-flow graph in order to estimate the wordlength required for a computation without losing any precision.
The Bitwise Project [9] propose a similar compilerbased technique based on propagating the ranges of integer variables backwards and forwards through data-flow graphs. Their focus is on removing unwanted mostsignificant bits (MSBs) -no technique is proposed for removing unwanted least-significant bits (LSBs).
The MATCH Project [4] also use compiler-based propagation through data-flow graphs, except they allow variables with a fractional component. All signals in their model must have equal fractional precision -the authors propose an analytic worst-case error model in order to estimate the required number of fractional bits.
Wadekar and Parker [3] have also proposed a methodology for wordlength optimization. Like [4] , their technique also allows controlled worst-case error at system outputs, however they allow each intermediate variable to take a wordlength appropriate to the sensitivity of the output errors to quantization errors on that particular variable. A Genetic Algorithm is used to perform the optimization, and Taylor series are used to evaluate an estimate of the worst-case error at a system output for any given internal wordlengths.
Cmar et al. [10] have developed a wordlength optimization system which uses a combination of range propagation and simulation with known input vectors to limit the wordlengths of internal variables. An heuristic algorithm is applied whereby the wordlength is decided based on the value of an empirically derived scaling of the error standard deviation for each signal under simulation. The idea behind such an approach is that it sets an upper-bound on the wordlength of each variable, beyond which the least significant bits will be drowned in quantization or external noise. No additional mechanism is proposed to automate the tradeoff of system area against error.
Kum and Sung [5] have proposed several wordlength optimization techniques to trade-off system area against system error, some of which have been incorporated in the Cadence Signal Processing Worksystem [11] . These techniques are heuristics based on bit-true simulation of the design under various internal wordlengths. Some sim- [12] . In a previous paper [1] we present an optimization heuristic based on analytic average-case error analysis of LTI systems. Results of between ± and ± area reduction were achieved by our heuristic compared to the use of the optimum uniform wordlength design. However until this paper it has been impossible to effectively judge the quality of the solutions achieved due to the lack of an optimum comparative benchmark.
Notation
In this paper, the following notation is used. 
Computation Model
A computation graph ´Î Ë µ is the formal representation of an algorithm. Î is a set of graph nodes, each representing an atomic computation or input/output port, and Ë Î ¢ Î is a set of directed edges representing the data flow. An element of Ë is referred to as a signal. The set Ë must satisfy the constraints on indegree and outdegree given in Table 1 . We partition the set Î into subsets Î Î Î Á Î Ç Î Î Î , representing the set of gain nodes, input nodes, output nodes, adder nodes, fork nodes and delay nodes respectively. A graphical representation of a simple computation graph is shown in Fig. 1 . Adders, constant coefficient multipliers and unit sample delays are represented using different shapes. Edges are represented by arrows indicating the direction of data flow. Fork nodes are implicit in the branching of arrows. Inport and outport nodes are also implicitly represented, and may be labelled with the input and output names, Ü Ø and Ý Ø respectively in this example.
The algorithms described by computation graphs will be implemented using a multiple wordlength architecture, as introduced in [1] . This scheme will be briefly reviewed in order to aid the understanding of the remainder of this paper.
In FPGAs, it is well known that a fixed-point implementation is generally more efficient than a floating-point implementation for most DSP algorithms with low dynamic range [13] . Each signal in a multiple wordlength architecture is allowed to take a distinct wordlength and scaling, appropriate to the internal variable represented by the signal. Fig. 2(a) shows the meaning of these two parameters:
Ò is the number of bits in the representation of the signal (excluding the sign bit), and Ô is the displacement of the binary point from the LSB side of the sign bit towards word LSB.
During the design stage, each wordlength is chosen individually to minimize logic usage while satisfying roundoff or truncation error. The contribution of this paper is to perform this design optimally. 
Area Models
In order to formulate the error-constrained area minimization problem, it is necessary to construct high-level models of the area consumption of each type of node. Only adders, gains, and delays are considered to consume area resources on the FPGA; the remaining nodes are simply wiring or input/output constructs.
Model formulation has proceeded by defining a parameterized high-level area model from knowledge of the internal architecture of a component. The model parameters have then been tuned to the Xilinx Virtex series of FPGAs through the synthesis of many sample library elements using coregen and least-squares fitting to the theoretical model. Although the values of the model parameters presented are specific to Xilinx Virtex, the models themselves are general and can easily be re-tuned to alternative FPGA families and manufacturers or for ASIC implementation.
Adders
Usually adders are implemented in FPGAs as ripplecarry designs, since fast carry chains are provided in modern FPGA architectures [14] . Multiple wordlength implementations of adders can be conceptually quite complex due to the alignment required for signals of different wordlength or with different scaling (binary point location). Fig. 3 illustrates the adder types found in practice in multiple wordlength implementations [15] . The inputs of these adders have been arranged so that binary-point align- The core integer adder used to implement such multiple wordlength adders will consist of a total of up to Ñ Ü´Ò × Ò µ · ¾ single-bit adders if all MSBs of the result are required. However not all these adders will have equal cost, because those not driving a portion of the output signal, illustrated in Figs. 3(a) and (b), require carry logic but no sum logic. In Figs. 3(c) and (d) there are no such cases. Also it is important to note that not all possible MSBs of the summation may be required by the output. A total of Ñ bits may not be required due to signal scaling information [1] .
The output is drawn entirely from the overlap between signals and if Ò Ó · Ñ Ñ Ü´Ò × µ · ½ . Thus the overall area of an adder can be modelled as (1) .
For a Xilinx Virtex implementation our experiments suggest ½ ½ ¼ LUTs and ¾ ¼ LUTs.
Gains
Area estimation for constant coefficient multipliers is significantly more problematic. A constant coefficient multiplier can be implemented in FPGAs as a series of additions and subtractions, through a recoding scheme such as the classic Booth technique [16] . This implementation style causes the area consumption to be highly dependent on coefficient value. Although an ideal area model would account for a recoding-based implementation, this currently remains unimplemented. Instead we propose to use a 'coefficient blind' area model, which has been demonstrated in practice to provide good results [1, 15, 17] . The placed-and-routed area results attainable with the present implementation also provides an upper bound for those attainable by a more sophisticated model.
For the remainder of this section, we consider a gain node with input signal , output signal Ó and a coefficient of wordlength CW.
The number of additions required to implement a constant coefficient multiplier is assumed to rise proportionally with the coefficient wordlength. Each of these will be a´Ò Ó · ½ µ -bit addition. However, a total of Ò · Ò Ò Ó additions along the edge of the multiplier array may not require their sum circuitry, as with the adder case. Note that this area model is equally valid with full-adder based array multipliers and standard Wallace or Dadda-tree [18] multiplier implementations.
For a Xilinx Virtex implementation, our experiments over a wide range of coefficient values and wordlengths suggest values of ¿ ¼ ¼ LUTs and ¼ LUTs.
Delays
The area of a unit sample delay with input , implemented as a register, is simply expressed as (3). For Xilinx Virtex implementation, our experiments suggest ½ ¼ LUTs.
´Ò · ½ µ (3)
Noise Model
As shown in [1] , since the systems of interest for this work have the LTI property, an analytic model based on [19] can be used to estimate the error at each system output. The variance of the error injected by each truncation of a signal from Ò ½ bits to Ò ¾ bits is given by (4) . If the transfer function from this point to the system output of interest is given in the Þ-domain as À´Þµ, then the resulting error variance at the output is 
A contribution of this paper is to demonstrate how to linearize these error models and hence incorporate them within a MILP model for the entire optimization problem.
MILP Model
The MILP formulation presented relies on some knowledge of integer linear programming. An excellent tutorial is given by Garfinkel and Nemhauser [20] on this topic.
The proposed MILP model contains several variables, which may be classified as: integer signal wordlengths, and signal wordlengths before quantization, binary auxiliary signal wordlengths, and auxiliary signal wordlengths before quantization, binary decision variables, real adder costs, and real fork node errors.
Note that only adders, gains, and delays cost area resources (forks are considered free). However adders have an inherently complex area model and thus while gains and delays are included directly in the objective function, the cost of each adder Î ¾ Î is represented by a distinct variable Ú .
We are now in a position to formulate an area-based objective function for the MILP model (6), where CW´Úµ represents the coefficient wordlength of gain node Ú. Constraints on quantization error propagation are much harder to cast in linear form due to the exponentiation, shown in Section 6. In order to overcome this nonlinearity, we propose to use an additional binary variables, Ò, one for each possible wordlength that a signal could take. This is expressed in (7), and (8) ensures that each signal can only have a single wordlength value. Here Ò is used to denote set subtraction. Note that in order to apply this technique, it is necessary to know upper-bound wordlengths Ò × for each × ¾ Ë. Techniques to derive these will be discussed in Section 7.1. Note that signals which drive fork nodes are not considered in this way, as fork node error models are considered separately (see Section 7.3).
Using these binary variables it is possible to re-cast expressions of the form ¾ ¾Ò , which appear in error constraints (see Section 6), into linear form as
Similarly it is necessary to linearize the exponentials in wordlengths before quantization (9)-(10).
For each system output, we propose to use an error constraint of the form given in (11) . Note that in this paper we only consider single-output systems, for simplicity of explanation, however the technique is general and our software can optimize multiple-input, multiple-output (MIMO) systems. represents a user-defined bound on the error power at the system output, and hence on the signal quality. variable, as this parameter is defined by the system environment. Place-holders Ú are used for the contribution from fork nodes; these will be defined by separate constraints in Section 7.3.
Wordlength Bounds
Upper bounds on the wordlength of each signal, before and after quantization, are required by the MILP model in order to have a bounded number of binary variables corresponding to the possible wordlengths of a signal.
Our bounding procedure proceeds in three stages: perform an heuristic wordlength optimization on the computation graph [1] ; use the resulting area as an upper-bound on the area of each gain block within the system, and hence on the input wordlength of each gain block; 'condition' the graph, following the procedure of [1] . The intuition is that typically the bulk of the area consumed in a DSP implementation comes from multipliers. Thus reasonable upper-bounds are achievable by ensuring that the cost of each single multiplier cannot be greater than the heuristically achieved cost for the entire implementation.
Of course this only bounds the wordlength of signals which drive gain blocks. In addition, the wordlength of signals driven by primary outputs is bounded by the externally-defined precision of these outputs. Together this information can be propagated through the computation graph, resulting in upper bounds for all signals under the condition that any closed loop must contain a gain block.
In the remainder of this paper, we denote by Ò the soderived upper bound on the wordlength of signal ¾ Ë and by Ò Õ the upper bound on the wordlength of the same signal before LSB truncation.
Adders
It is necessary to express the area model of Section 5.1 as a set of constraints in the MILP. Also a set of constraints describing how the wordlength at an adder output varies with the input wordlengths is required.
Area model
In the objective function, the area for each adder Ú ¾ Î was modelled by a single variable Ú . It will be demonstrated in this section how this area can be expressed in linear form.
Let us define ¬ Ú for an adder Ú ¾ Î with input signals and (12) , where the inputs 'a' and 'b' are chosen to match with Fig. 3 so that it is which needs to be left-shifted for alignment purposes. × Ú is also illustrated in Fig. 3 , and models the number of bits by which input must be shifted.
We may then express the area of an adder as (13) . Signal Ó is the output signal for the adder and Ñ Ú models the number of MSBs of the addition which are known through scaling to contain no information, as described in [1] and illustrated in Fig. 3 . This value is independent of the wordlengths, and for an adder can be expressed as
The non-linearities due to the Ñ Ü operator in (12) and the decision in (13) must be linearized for the MILP model. This is achieved through the introduction of four binary decision variables AE Ú½ , AE Ú¾ , AE Ú¿ and AE Ú for each adder Ú ¾ Î .
For the remainder of this section, we consider a general adder with inputs and and output Ó, to distinguish from the more specific case considered above, where input was used to denote the left-shifted input to an adder. In order to model (12) , if Ô Ô then (14)- (17) are included in the MILP. Otherwise (18)- (21) are included in the MILP.
Note that ¬ Ú and « Ú are only bounded from below by the constraints given. Inequalities are used in order to allow disjunctions and thus implications, for example selecting
as an optimization variable results in Ò Ò Ô · Ô µ ¬ Ú Ò · Ô Ô . Equality of Ú is guaranteed through its positive coefficient in the objective function. In order to model (13) , (22)- (25) are included in the MILP. These terms model the choice in (13) as a pair of implications, in an identical manner to that described above.
Output Wordlength
The pre-quantization output wordlength of an adder with inputs and and output Ó is given by Ò Õ Ó Ñ Ü´Ò Ô Ò Ô µ · Ô Ó . We may express this as (26)- (27), since before-quantization wordlengths only appear with negative coefficient in the error so the error constraints can be relied upon to reduce Ò Õ Ó to achieve equality.
Forks
As demonstrated in [15] , fork nodes can lead to unusual error behaviour due to the different possible orderings of wordlength at their outputs, which are required in order to guarantee freedom from statistical correlation and hence an accurate error model. It is necessary, however, to guarantee that the input signal provides enough wordlength for the largest of its outputs (34).
Ò Ò Ò Ò
(34) Ò Ò
Gains
In contrast to adders and fork nodes, gain nodes are straight-forward. The area of a gain node has already been modelled in the objective function (Section 7). The only remaining constraint required is to model the pre- 
MILP Summary
A MILP model for the wordlength optimization problem has been proposed. It remains to quantify the number of variables (37) and constraints (38) present in the model. Note that the number of constraints given does not include integrality constraints, the unit upper bounds on Boolean variables, or the trivial fork constraints in (34) which do not form part of the optimization problem. Fig. 5 illustrates area-error tradeoff curves for both a second and a third order linear phase FIR filter [2] . For the second order filter, results for both 4-bit and 8-bit inputs are given. For the third order filter, only results for a 4-bit input have been obtained. Three curves are present in each plot: the optimum uniform wordlength implementation, the heuristically derived multiple wordlength implementation from [1] , and the optimum multiple wordlength implementation achieved by solving the MILP presented in this paper.
Results
The results clearly illustrate the high-quality solutions achievable by the heuristic solution, averaging only ¼ ± with a maximum of ¿ ± worse than the optimum result.
An optimum wordlength allocation for an RGB to YCrCb convertor described in [1] with 4-bit inputs has also been performed. This result shows an optimal cost of 78.61 LUTs, equal to the result achieved by the heuristic presented in [1] . an error-free Y, whereas a bounded error power of up to ½¼ ¾ has been allowed for Cr and Cb. We believe such optimum results, even for small circuits, to be highly valuable as a benchmark against which many new and existing heuristics [1, 3, 4, 5, 8, 9 , 10] may be compared. For this reason we are making several optimum wordlength benchmarks publicly available at http://infoeng.ee.ic.ac.uk/ gac1/optimumWL.
The BonsaiG MILP solver [6] was used to solve the MILP models: execution time ranged from 2 seconds to 6 minutes on an AMD Athlon 1.2 GHz with 512 MB RAM. This compares to less than 0.2 second for the heuristic solutions on the same machine. Limits on the scale of the MILP solvable are due to both excessive run-time and numerical instabilities in the MILP solver.
Conclusion
This paper presents an approach to construct a mixed integer linear program (MILP) from an error-constrained area optimization problem, in order to perform wordlength allocation. High-level area models of parameterizable library blocks have been proposed and fitted to a Xilinx Virtex implementation. These form the basis of the objective function for the optimization, which is performed subject to user-specified constraints on output signal quality. Results indicate that our previously proposed heuristic solution [1] produces results reaching the optimum in most cases and, on average, deviating only ¼ ± from the optimum area.
Our current and future work is concentrating on wordlength optimizations of nonlinear DSP algorithms, and on including other models such as power consumption into the optimization procedure.
