This is a tutorial paper that examines the problem of performing fixed-point constant integer multiplications using as few adders as possible. The driving application is the design of digital filters, where it is often required that several products of a single multiplicand are produced. Thus two specific problems are examined in detail, i.e., the one-input/one-output case and the one-input/several-output case. The latter is of interest because it can take advantage of redundancy in the different coefficient multipliers. Graphical methods can be used to design multipliers in both cases.
INTRODUCTION
This paper examines the problem of minimising the number of adders required to perform fixed-point shift-and-add multiplication by one or more constants. Constant integer multiplication can be implemented using a network of binary shifts and adders. In an integrated circuit implementation of parallel arithmetic, binary shifts (scaling by a power of 2) can be hardwired and therefore do not require gates. Gates are thus only required to implement adders and subtractors, which require approximately equal numbers of gates. The hardware cost of the multiplier is thus approximately proportional to the number of adders and subtractors required; for convenience, both will be referred to as "adders", and their number as "adder cost". The paper is divided into three parts, addressing the distinct problems of the single multiplier case, the case where several products are required of a single multiplicand, and the matrix multiplication case where the requirement is for several sums of products of several inputs. Almost all of the material discussed and presented has been published before, but this is the first paper where these findings have been gathered together.
The problem of reducing the number of adders required for a single fixed-point multiplication has been studied for some years. As early as 1951, Booth [1] recognised that if subtractors were also allowed, the total number of adders and subtractors, hereafter collectively called "adders", could be reduced. The 1960s saw the introduction of signed-digit (SD) representation by Avizienis [2]. A coefficient $c$ in "n-bit" signed-digit representation can be written:
$$c = \sum_{i=0}^{n-1} b_i 2^i$$
The SD representation that has the fewest non-zero digits is known as the canonic signed-digit (CSD) representation. Reitwiesner [3] showed that the representation with no string of adjacent non-zero digits is canonic, and an algorithm for finding this representation was presented by Hwang [4]. Garner [5] showed that, on average, and for long wordlengths, CSD requires 33% fewer non-zero digits than binary. Since non-zero digits represent additions (or subtractions), CSD is therefore significantly more efficient in adders than binary. The graph multiplier technique, developed by the author and described in Section 2, has proven to be significantly more efficient than CSD.
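The saving offered by CSD can be checked numerically. The following is a minimal sketch, assuming the usual non-adjacent-form recoding (this is not necessarily the algorithm of [4]); the function names `to_csd`, `csd_adder_cost` and `binary_adder_cost` are hypothetical and introduced only for illustration.

```python
def to_csd(n):
    """Return the CSD digits of a positive integer, least significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            # Choose +1 when n = 1 (mod 4) and -1 when n = 3 (mod 4),
            # so that the next digit is guaranteed to be zero.
            d = 2 - (n % 4)
            digits.append(d)
            n -= d
        n //= 2
    return digits


def csd_adder_cost(n):
    """Adders needed when each non-zero CSD digit is one shifted term."""
    return sum(1 for d in to_csd(n) if d != 0) - 1


def binary_adder_cost(n):
    """Adders needed when each binary '1' is one shifted term."""
    return bin(n).count("1") - 1


if __name__ == "__main__":
    for value in (15, 45):
        print(value, to_csd(value), binary_adder_cost(value), csd_adder_cost(value))
```

For 15, binary needs 3 adders whereas CSD (16 - 1) needs only 1; for 45, both representations need 3 adders.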
In signal processing applications, several products of a single multiplicand are often required, e.g., in a direct form finite impulse response (FIR) digital filter, the input data at some stage is multiplied by each of the coefficients. Transposition of the FIR filter as in Figure allows all of these multiplications to be performed at once, with the individual multipliers replaced by the "multiplier block". Redundancy between the coefficient multipliers can be exploited in order to reduce the number of adders required to produce all of the products. Various techniques have been proposed to maximise these savings. The multiplier block method described in Section 3 was developed by the author from a technique first described by Bull and Horrocks [6] . It uses the same graphical methods used for the graph multipliers of Section 2 and has proven to be more efficient in terms of adders than any other method examined by the author.
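The role of the multiplier block in a transposed FIR filter can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the coefficient set [5, 45, 5] is hypothetical and chosen only for illustration, the filter has at least two taps, and `multiplier_block` and `transposed_fir` are illustrative names rather than anything defined in the references.

```python
def multiplier_block(x):
    """Produce all products of one input sample using shifts and two adders.

    Hypothetical coefficient set [5, 45, 5]: 5x is formed first and then
    reused to form 45x, so the three products cost two adders in total.
    """
    p5 = x + (x << 2)       # 5x  = x + 4x       (adder 1)
    p45 = p5 + (p5 << 3)    # 45x = 5x + 8*(5x)  (adder 2)
    return [p5, p45, p5]


def transposed_fir(samples, block, n_taps):
    """Transposed direct-form FIR: every product of x[n] is formed at once."""
    state = [0] * (n_taps - 1)          # contents of the delay elements
    out = []
    for x in samples:
        p = block(x)                    # p[k] = h[k] * x
        out.append(p[0] + state[0])
        # Structural adders: each delay receives a product plus the
        # previous content of the next delay (updated in ascending order,
        # so state[k + 1] is still the old value when it is read).
        for k in range(n_taps - 2):
            state[k] = p[k + 1] + state[k + 1]
        state[-1] = p[-1]
    return out


if __name__ == "__main__":
    print(transposed_fir([1, 0, 0, 2], multiplier_block, 3))
```

The structural adders between the delays are unaffected by the choice of multiplier implementation; only the adders inside `multiplier_block` are reduced by sharing.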
GRAPH MULTIPLIERS
In an SD representation, each digit $b_i$ is taken from the set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ represents $-1$. It is therefore a ternary representation (binary representation would be identical except that the $b_i$ would be taken from the set $\{0, 1\}$). In general, there are several different SD representations for a given integer.
Graph Representation of Multiplication
Multiplication by a constant integer can be described in terms of a graph as follows. There is an initial vertex of the graph, which can nominally be assigned the value 1. There is a terminal vertex of the graph, which is assigned the value of the multiplier being designed. The multiplicand can be considered as being input to the initial vertex. The product is output from the terminal vertex. The example in Figure 2 is 45, the smallest integer that can be represented using fewer adders than in CSD. Figure 2(a) shows 45x computed as ((x - 4x) - 16x) + 64x, using 3 adders.
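The two ways of forming 45x can be written out with explicit shifts and adds. This is a minimal sketch: the three-adder version follows the CSD expression ((x - 4x) - 16x) + 64x, while the two-adder version assumes the factorisation 45 = 5 × 9, with 5x as the intermediate vertex. The function names are hypothetical.

```python
def times45_csd(x):
    """45x from the CSD digits of 45 (64 - 16 - 4 + 1): three adders."""
    t1 = x - (x << 2)        # x - 4x    = -3x   (subtractor 1)
    t2 = t1 - (x << 4)       # -3x - 16x = -19x  (subtractor 2)
    return t2 + (x << 6)     # -19x + 64x = 45x  (adder 3)


def times45_graph(x):
    """45x from a two-vertex graph, assuming 45 = 5 * 9: two adders."""
    v5 = x + (x << 2)        # x + 4x   = 5x     (adder 1)
    return v5 + (v5 << 3)    # 5x + 40x = 45x    (adder 2)


if __name__ == "__main__":
    for x in (1, 7, -3):
        print(x, times45_csd(x), times45_graph(x))
```

Both functions return the same product; the second reuses an intermediate vertex instead of adding one shifted copy of x per non-zero digit, which is where the adder saving comes from.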
The Bull and Horrocks Algorithm
The algorithm of Bull and Horrocks [6] was not designed to produce single multipliers, although it can be used for this purpose. Originally, it exploited redundancy in designing multiplier blocks, a problem more completely addressed in Section 3. The reader is referred to [6] for a detailed description of the algorithm, hereafter denoted "BH", and to [7] for a version modified by the author, denoted "BHM", which produced significantly improved results. The basis of the algorithm is that it starts with the input ("1") vertex, and takes all pairwise sums of powers-of-two multiples of that vertex. The sum closest to the required multiplier value is added to the graph as a vertex. This process is repeated, using powers-of-two multiples of all vertices in the graph, until the multiplier value is added to the graph. Another algorithm was presented by Bernstein [8], and has been used in various software compilers (e.g. [9, 10]). Recently, the algorithm was extended by Wu [11]. For integer 711, BERN's "product" graph is optimal, whereas for 707, BHM's "entangled" graph is optimal (Figure 3 [12]). Therefore, the "better of BHM and BERN" (BBB) algorithm was defined:
It is difficult to predict which of the BHM or BERN graph topologies will be more appropriate for a given integer, so the BBB algorithm simply designs using both methods and chooses the better result.
where $n/(2^i \pm 1)$ is integral. This algorithm will be referred to as "BERN".
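A recursion of this kind is easy to prototype. The following is a minimal sketch, assuming the commonly quoted form of the recurrence: an even n costs the same as n/2, and an odd n costs one adder more than the cheapest of n - 1, n + 1, or n/(2^i ± 1) whenever that quotient is integral. It should not be read as the exact formulation of [8]; the name `bern_cost` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def bern_cost(n):
    """Adder cost of multiplying by n under the assumed recurrence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 0
    if n % 2 == 0:
        return bern_cost(n // 2)          # a shift costs no adders
    # Odd n: reach it from a neighbour with one adder or subtractor ...
    best = 1 + min(bern_cost(n - 1), bern_cost(n + 1))
    # ... or as a product with a factor of the form 2^i - 1 or 2^i + 1.
    i = 1
    while (1 << i) - 1 <= n:
        for d in ((1 << i) - 1, (1 << i) + 1):
            if d > 1 and n % d == 0:
                best = min(best, 1 + bern_cost(n // d))
        i += 1
    return best


if __name__ == "__main__":
    for value in (45, 711):
        print(value, bern_cost(value))
```

Under this assumed recurrence, 45 is found at a cost of 2 adders (for example as 5 × 9) and 711 at a cost of 3 (as 79 × 9, with 79 = 80 - 1).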
Comparison of Algorithms
Figure 4 [7]. Figure 5 [12] shows that despite the BERN algorithm being inferior on average, individual instances of superiority can be exploited to give a significant average cost gain for the BBB algorithm. For 32-bit words, average BERN cost is 3.5% worse than BHM cost, whereas BBB cost is 3.4% better than for BHM. Similarly, for all 12-bit words, the results are given in Table I.
Figure 6 shows an example of an addition chain, illustrating the Fibonacci sequence. Knuth's "power trees" [13] optimally search these chains and hence could design an optimal graph with the above constraints. If subtractors are allowed, the problem becomes that of finding the shortest addition/subtraction chain [14] (compare the "add-only" and "add-subtract" algorithms of Bull and Horrocks [6]).
FIGURE 6 The graph to produce the Fibonacci sequence. Note that edges are not labelled; each represents a scaling of 1. This is an example of an addition chain.
For exponentiation by the binary method, the algorithmic procedure is simply to replace the binary digits "1" of the exponent (excepting the leading "1") with "square, multiply" and "0" with "square". Almost identically, a multiplication by 11 can be achieved by replacing binary "1" with "shift-add" and binary "0" with "shift". Thus 11x could be evaluated as shift x, shift the result, add x, shift, add x, giving, sequentially, 2x, 4x, 5x, 10x, 11x. Note that the product is built up identically to the power of x in the exponentiation case. This method, when used for multiplication, is almost 4000 years old, and is over 2000 years old for exponentiation [13].
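The parallel between the two binary methods can be made concrete. This is a minimal sketch of the left-to-right binary method for both operations; the function names are hypothetical, and the `trace` list is included only to expose the intermediate values.

```python
def shift_add_multiply(x, n):
    """Compute n*x left to right: '0' -> shift, '1' -> shift then add x."""
    bits = bin(n)[2:]
    acc = x                      # the leading '1' is the multiplicand itself
    trace = []
    for b in bits[1:]:
        acc <<= 1                # shift (doubling)
        trace.append(acc)
        if b == "1":
            acc += x             # add
            trace.append(acc)
    return acc, trace


def square_multiply_power(x, e):
    """Compute x**e left to right: '0' -> square, '1' -> square then multiply."""
    bits = bin(e)[2:]
    acc = x                      # the leading '1' is the base itself
    for b in bits[1:]:
        acc *= acc               # square
        if b == "1":
            acc *= x             # multiply
    return acc


if __name__ == "__main__":
    print(shift_add_multiply(1, 11))     # intermediate multiples of x
    print(square_multiply_power(2, 11))
```

With x = 1 and n = 11 (binary 1011), the trace is 2, 4, 5, 10, 11, matching the sequence of multiples described in the text; the exponents reached by `square_multiply_power` follow the same sequence.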
Efficient exponentiation is a subject of research interest because the popular RSA cryptographic algorithm [15] requires the calculation of $m^e \pmod{N}$ where m and N are very large integers. Any of the algorithms already discussed could be used to perform this task, but when e is large, heuristic techniques have been used. These require certain powers to be pre-computed. The analogy with the graphical method is to force certain fundamentals into the graph prior to the graph design. Two such algorithms are the k-SR algorithm [16] and the SS(1) algorithm [17]. For an algorithm to be optimal, all possible graph topologies [7] must be searched. A recent algorithm presented by Li [19], which claims to be optimal, does not search all of these graphs [20] and is therefore not truly exhaustive. However, because it exhaustively searches the graphs that it does recognise, it can be considered to be an algorithm of the exhaustive type. Vertex value theorems in [7] show that we need only search using positive, odd fundamentals, simplifying the search significantly. Edge value theorems in [7] indicate that there is a limit to the size of integers that can appear as edge values and fundamentals, thus ensuring that the algorithm is exhaustive and can be executed in finite time. These considerations led to the design of the MAG algorithm.
The Minimised Adder Graph (MAG) Algorithm
The operation of the minimised adder graph (MAG) algorithm is described in detail in [7]. Here it need only be said that it exhaustively searches all the graphs of Figure 7. The algorithm produces two lookup tables, one with the cost of the multiplier and the other with a record of the fundamentals of all the graphs used to produce the multiplier at that cost. This second "fundamentals" table grows exceedingly quickly with wordlength, and the capability of the machine used to produce the results limited the extent of those results to 12-bit wordlengths. The results of applying the algorithm to all integers up to $2^{12}$ are shown in Figure 8 [7]. In general, binary and CSD implementations are limited to the graph topology labelled "1" in Figure 7.
The graph of Figure 2b could be used to produce the multiplication by 45, with the multiplication by 5 as a "free" by-product (pun intended).
The n-dimensional Reduced Adder Graph (RAG-n) Algorithm
The n-dimensional reduced adder graph (RAG-n) algorithm is currently the best algorithm for designing short-wordlength multiplier blocks. It is in two parts; the first is optimal, i.e., if the set of coefficients is completely synthesised by this part of the algorithm, minimum adder cost is assured, and the second part is heuristic. It uses the two lookup tables generated by the MAG algorithm, which, at present, cover the range to 4096.
The algorithm is described in detail in [22]. A modified version of the BH algorithm is also given in [22], and the modified algorithm is again denoted "BHM". Nakayama's permuted differences [23], the subexpression elimination techniques of Hartley [24, 25] and Potkonjak et al. [26], and the nested structures of Mahanta et al. [27] have been shown to produce structures that can be represented by particular types of multiplier block graphs. However, these methods have been shown [28] to be far less versatile than the two mentioned above and therefore need not be discussed further here.
The author has also defined an optimal algorithm for the design of 2-coefficient multiplier blocks, known as MAG2 [28]. Average adder costs for the RAG-n algorithm are shown in Figure 9, where it can be seen that for a given wordlength, average adder cost increases roughly linearly with set size. A set of 80 coefficients of 12-bit wordlength requires fewer than one adder per coefficient on average. For the smaller wordlengths in the figure, an asymptote is reached which is the cost of the graph that can fully represent all of the coefficients of that wordlength. The value of this asymptote is the number of odd integers of wordlength w, i.e., $2^{w-1}$. Once this asymptote is reached, any "new" coefficient is simply a repetition of a coefficient already in the set.
For a set size of 5 (the number of coefficients that would be required for the implementation of linear phase FIR filters of order 9 or 10), the average adder cost of a multiplier block for the BH, BHM, hybrid and RAG-n algorithms, were compared over a range of wordlengths, as shown in Figure 10 . Comparisons with individual multipliers using CSD and binary are also shown. The RAG-n algorithm provides a significant improvement in cost over BHM (8.4% for 12 bit words), which in turn provides a significant improvement (10.6%) over the original BH method. All the algorithms that utilise graph synthesis techniques are far superior in terms of adder cost to CSD and binary.
The computation time of these multiplier block design algorithms is an interesting subject and has been discussed at some length in [22, 28]. The optimal part of the RAG-n algorithm is actually very quick while the heuristic part is slow. It was also found in [22] that for a given wordlength, there is an "optimality threshold" set size above
FIGURE 9 Average adder costs for the RAG-n algorithm for various wordlengths against uniformly-distributed coefficient set size (horizontal axis: number of coefficients).
[22] and 80% to 50% for IIR filters [28, 29]. This has a number of important implications. In the past, the emphasis on reducing the complexity of a filter has focused on the multipliers. Multiplier blocks have been so successful in reducing that cost that there is incrementally less to be achieved by further attention to reducing multiplier complexity. In other words, elaborate schemes which select the coefficients in some "optimal" way (e.g. [30]) may not offer significant savings over a technique which selects the coefficients in a straightforward fashion and implements them as a multiplier block.
It is important to note that multiplier blocks are applied directly to a selected set of coefficients, and the complexity savings they offer are limited to that application. Methods that select simple coefficients can be used in conjunction with multiplier blocks, such as statistical wordlength minimisation as described by Crochiere [31] for IIR filters, Grenez [32] for FIR filters, and the author [33, 34] for average wordlengths. There are many techniques that are aimed at reducing filter wordlength (see the reference list for [22] ) which can all be used in conjunction with multiplier blocks. The multiplier block method does not in itself attempt to minimise the number of adders that are required to meet a given filter specification; instead it aims to minimise the number of adders required to produce the products for a given set of coefficients. Some methods have been described that attempt to minimise the number of adders in the filter directly. The earliest technique of this type, described by Jain et al. [35, 36] , aimed to minimise the number of CSD "bits" required by the coefficients (without using redundancy between the coefficient multipliers). Another method, of Wade et al. [37, 38] , tries to reduce the number of adders in a cascade of primitive sections that meets the filter specification. The cost functions associated with this type of optimisation are badly behaved so non-gradient searches such as genetic algorithms have been devised for this task by Roberts [39] and Suckley [40] and for the relationship between this adder cost function and the filter error specification by Wilson and Macleod [41] .
These various optimisation algorithms could be modified to operate with multiplier blocks producing the cost function. However, this cost function has been shown to be relatively flat in a local region [28, 29] , providing further discouragement for optimisation in addition to the reasons discussed earlier in this section.
The Complexity Hierarchy
Multiplier blocks apply where several products of a single multiplicand are required. The larger the block, the more the cost of the multipliers can be reduced. Therefore, structures that allow the use of large blocks, such as direct-form FIR and IIR filters, can be expected to gain more from using multiplier blocks than structures with isolated multipliers, such as the lattice wave structure. In fact, early results [42, 43] showed that using multiplier blocks can make the direct form structure more efficient than the wave structure. In these studies, we found that whereas traditional methods favour the lattice wave structure for filters, multiplier block implementation so drastically reduces the cost of cascaded second-order forms that they become significantly less costly. The cost of the direct structure is reduced to less than that of the wave structure, despite having more coefficients and requiring a much longer coefficient wordlength. Even when the data wordlength noise effects are taken into account [28], the direct structure is still competitive with the wave structure. However, the direct form has always had poor limit-cycle (instability due to nonlinear feedback) performance, and a more recent study [44] shows that although cascaded second-order forms still outperform the lattice wave structure when limit cycles are eliminated, the direct form no longer competes.
The Order-complexity Trade-off
In Section 3.3.1, reference was made to the flattening of the cost function in coefficient space due to the use of multiplier blocks. This means that the total cost of a set of coefficients in a block does not vary very much if the values of the coefficients are varied. This slow variation in cost has also been observed when the number of coefficients in the block is varied, corresponding for example to a variation in filter order for an FIR filter. This slow variation means that there may be an incentive to increase the order of the filter in order to reduce the complexity.
Studies of both FIR [45] and IIR [28] filters indeed show that there is an incentive to increase the order, and that multiplier blocks flatten the cost of FIR filters such that any order up to 10% above the estimate produced by the usual order estimators may produce the most efficient design.
Comparison with Other Efficient Filter Design Methods
We applied multiplier blocks to some of the filters published as examples of advanced techniques of designing low-complexity filters. These methods include Powell and Chau's CSD delta-modulation of the coefficients [46] and the cascade of primitive sections due to Wade et al. [37, 38] . For the multiplier block filters used in the comparison, the output of the Remez exchange algorithm design was simply quantised prior to application of multiplier blocks, i.e., no special technique was used to select the coefficients. For all the examples, it was shown [28, 47] that multiplier blocks produced a filter that required fewer adders. Where a recursive running sum (RRS) prefilter was not used, the cost of the multiplier block filter ranged from 48% to 78% of the cost of the other design. Where an RRS prefilter was used, the advantage of the multiplier block design was less significant.
Jones [48] has proposed a distributed arithmetic method, which extends the idea originally described by Peled and Liu [49] . He shows that this method requires more adders than the Bull and Horrocks method [6] , and also uses RAM, ROM and control circuitry. It is therefore also less efficient than the RAG-n algorithm, but it is not coefficient-dependent.
Multiplier Blocks and Filter Banks
Filter banks (parallel connections of digital filters) are used in many signal processing applications including design of analysis and synthesis filters for multirate signal processing [50], time-frequency analysis [51], wavelets [52] and fractional delay filter design [53] (Figure 11 [54]). We have examined the application of multiplier blocks to filter banks and found [54] that once again, costs of the multiplier elements can be reduced significantly. Hence, the cost of the structural components becomes even more significant than if each filter were built separately. In designing a filter bank, there are a variety of structures that can be used, including, for the interpolation application we examined, the Farrow structure [53]. If multiplier blocks are used for multiplication, this choice of structure then dominates the overall cost of the filter bank.
MATRIX MULTIPLICATION
The problem of performing matrix multiplication using graphical techniques increases the complexity of the problem by one dimension. If the single multiplier case of Section 2 has zero dimensionality, and the single-input, multiple-output case of Section 3 has dimension 1 (a vector multiplied by a scalar), then the multiple-input, multiple-output task of multiplying a matrix by a vector has dimension 2. The algorithms described already can be used to design the multipliers, but there is no guarantee of an optimal result. The multiple-input nature of the problem means that an optimal graph will be even more "entangled". Using the RAG-n or BHM algorithm to design the matrix multiplier would result in the structure of Figure 12a, where the various products of the inputs are produced and then combined at the end. This method uses 7 adders. A more efficient graph is shown in Figure 12b, which requires only 6 adders. The outputs are synthesised using the equations
Y2 3x1 / 5X2 / using an intermediate result, C 3x / 5x2, which uses both inputs and supplies both outputs. It would appear that algorithms of the type used for the 0-and 1-dimensional designs are not appropriate for matrix multiplication. The search for an appropriate algorithm remains the subject of ongoing research.
DISCUSSION AND CONCLUSIONS
Graph multipliers and multiplier blocks have many advantages as we have discussed above.
However, there are also some limitations that affect their application. First, they are only of use where constant multipliers are required. If the filter coefficients need to be programmable or variable, another technique should be used. Second, when synthesised, they do not produce regular structures. It is believed that the gains they make will outweigh the inefficiency due to irregular layout, but this conjecture has yet to be tested. The comparisons made herein and throughout this work have been at an adder level in an attempt to make the comparisons as independent of technology as possible. Technology-dependent comparisons will be explored in the near future. These comparisons will extend to serial arithmetic, where all the comparisons here effectively apply to parallel arithmetic. Third, the products produced from a multiplier block do not necessarily have the same latency, so for pipelined applications, extra pipelining registers will be required.
To summarise the findings of the multiplier work:
1. For single coefficients, the MAG algorithm guarantees minimum adders for a given multiplier. Due to memory use, the MAG algorithm has been limited to a given wordlength. Above this wordlength, the BBB algorithm, the better of BHM and BERN, is the best available. For extremely long wordlengths, the exponentiation algorithms k-SR and SS(1) are worth considering. In addition to the VLSI application of primary interest here, all of these algorithms can also be used for reducing the number of ADD (and SHIFT) instructions a software compiler assigns to a multiply, and may assist in reducing the exponentiation overhead in cryptographic algorithms.
2. When multipliers can be blocked, i.e., where several products of a single multiplicand are required, the RAG-n algorithm is the best. It is often optimal, but its use of the MAG algorithm also limits its maximum wordlength. The BHM algorithm is the best to use above that wordlength. These algorithms design filters that are more efficient in terms of adders than any other method to which they have been compared.
3. For both FIR and IIR filters, the use of multiplier blocks substantially reduces the contribution to overall complexity made by the multipliers, reducing the imperative to optimise the multiplier contribution. The remaining elements (adders and delays) are intrinsic to the structure of the filter and cannot be optimised. Attention must then turn to the selection of structure and order.
4. This choice of structure should not be made without examination of the effects of the use of multiplier blocks. Without the use of multiplier blocks, wave structures are the most efficient choice. Application of multiplier blocks so dramatically reduces the cost of cascade structures that they are then least costly.
