Abstract
Introduction
Important optimizations of the speed, area and power consumption of circuits can be achieved by using dedicated operators instead of general ones whenever possible. Multiplication by constant is a typical example. Indeed, if one operand of the multiplication is constant, one can use some shifts and additions/subtractions to perform the operation instead of using a full multiplier. This usually leads to smaller, faster and less power-consuming circuits.
Applications involving multiplication by constant are common in signal processing, image processing, control and data communication. Finite impulse response (FIR) filters, discrete cosine transform (DCT) and discrete Fourier transform (DFT), for instance, are central operations in high-throughput systems, and they use a huge amount of such operations. Their optimization widely impacts the performance of the global system that uses them. In [11] there is an analysis of this operation frequency.
The problem of the optimization of multiplication by constant has been studied for a long time. For instance, the famous recoding presented by Booth in [3] can simplify the multiplication by constant operation as well as the full multiplication. This recoding and the algorithm proposed by Bernstein in [1] were widely used on processors without multiplication unit.
The main goal in this problem is the minimization of the computation quantity. The multiplication by constant problem seems to be simple, but its resolution is a hard problem due to its combinatorial properties. This problem can occur in more or less complex contexts. In the case of a single multiplication of one variable by one constant, it seems possible to explore the whole parameter space. But in the case of the multiplication of several variables by several constants, the space to explore is so huge that we have to use heuristics.
A first solution proposed to optimize multiplication by constant was the use of the constant recoding, such as Booth's. This solution just avoids long strings of consecutive ones in the binary representation of the constant. Better solutions are based on the factorization of common subexpressions, simulated annealing, tree exploration. . . Our work deals with constant matrix multiplication (CMM), i.e. one useful form of the multiplication of several variables by several constants. A lot of applications involve such linear operations. This method is based on constants recoding followed by some dedicated common subexpression factorization algorithms. The proposed method was implemented in a VHDL generator. The generated results for several applications have been implemented on Xilinx FPGAs and compared to other solutions. Some significant improvements have been obtained: up to 40% area saving in the DCT case for instance. This paper is organized as follows. The problem is presented in Section 2. In Section 3 some related works are presented. Our algorithms are presented in Section 4. The developed generator and the target architectures are discussed in Section 5. Finally, the results of the implementation of some applications and their comparison to other solutions are presented in Section 6.
In this paper, and in the related works, the central problem is the substitution of full multipliers by an optimized sequence of shifts and additions and/or subtractions. We focus on integers but all the results can be extended to fixed-point representations.
All the values are represented using a standard radix-2 notation or two's complement unless it is specified. The notation Ü denotes the -bit left shift of the variable Ü (i.e. Ü ¢ ¾ ). As we look at FPGA implementations, we assume that shift is just routing and addition and subtraction have the same area and speed cost.
As an example, let us compute Ô as the product of the input variable Ü by the constant ½½½ ¿ ½½¼½½¼¼½½¼½½¼¼½½½ ¾ . The simplest algorithm consists in using the distributiveness of multiplication. There is one addition of Ü (after some potential shift) for each one in the binary representation of . In the case ½ ½ ½ ¿ , it leads to 10 additions:
The central point in this problem is the minimization of the total number of operations. It can be significantly reduced by using a recoding of the constant and/or subexpression elimination and sharing. The theoretical complexity of this problem seems to be still unknown.
Depending on the target application, this problem can occur at different levels of complexity. It starts with the multiplication of one constant by one variable. After, the multiple constant multiplication (MCM) problem appears with the multiplication of several constants by one variable [17] . In this present work, we deal with a more general version of this problem with the multiplication of one constant matrix by one variable vector: the constant matrix multiplication (CMM).
Related Works
There are, at least, four types of methods to address the multiplication by constant problem: Direct recoding methods.
Evolutionary methods.
Cost-function based search methods.
Pattern search methods.
Direct Recoding Methods
In the radix-2 signed digit (SD) representation, the digits belong to the set ½ ½ ¼ ½ . A number is said to be in the canonical signed digit (CSD) format if no two nonzero digits are consecutive, then the number of non-zero digits is minimal. Using the CSD format on a Ò-bit value, the number of non-zero digits is bounded by´Ò · ½ µ ¾ and it tends asymptotically to an average value of Ò ¿ · ½ , as shown in [7] . For our example, using Booth 
The KCM algorithm [5] was specifically designed for LUT-based FPGAs. It decomposes the binary representation of the variable into 4-bit chunks (a radix-16 representation). There is a more general version of this decomposition problem with distributed arithmetic. For instance, in [19] it was used on a DCT operator with an area saving of 17% compared to the direct implementation of the whole computation.
There are some recent works on the use of high-radix recoding. For instance, in [2] , they use a radix-8 representation with punctured coefficients with the digits in the set ¼ ¦½ ¦¾ ¦ instead of the set ¼ ¦½ ¦¾ ¦¿ ¦ . This is a lossy representation, so they have to deal with some additional accuracy requirements. In our case, we want to study this problem for a lossless representation at first, but this approach seems to be interesting.
Another recoding solution was proposed with the use of multiple-radix representations and especially with the double-base number system (DBNS) [6] . In this solution, the authors use both radices 2 and 3 simultaneously, i.e. the values are expressed by È ¾ ¿ with ¾ ¼ ½ . This multiple-radix representation, sometimes useful in some analog circuits, does not seem to be efficient in the multiplication by constant problem in digital circuits.
Evolutionary Methods
Some evolutionary methods such as evolutionary graph generation [8] have been proposed to generate arithmetic circuits and especially for constant multipliers. These methods based on genetic algorithms seem to provide very bad results. For instance, in [8] the results are slightly better than the straightforward CSD encoding which is very far from the best known results. Furthermore, it seems that these methods are limited to the problem of multiplication Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03) 1063-6889/03 $17.00 (C) 2003 IEEE by one constant and have never been used to produce real circuits.
Cost-Function based Search Methods
The algorithm presented by Bernstein in [1] allows to reuse some intermediate values that are just used once in recoding methods. A more detailed and corrected version of this algorithm can be found in [4] . The algorithm based on a tree exploration defines three kinds of operations: Ø ·½ Ø , Ø ·½ Ø ¦ Ü and Ø ·½ Ø ¦ Ø . A cost can be specified for each operation according to the target technology. The cost function used to guide the exploration is the sum of the costs of all the implied operations. This algorithm only shares a few common sub-expressions. For our example Ô ¢ Ü, this algorithm gives a 5-addition solution:
There are some other cost-function based search methods such as simulated annealing. In [14] , this technique was used to produce multiplication by a small set of constants. The same multiplier is used for a small set of different coefficients. This problem is different from our's.
In [15] a greedy algorithm is used to determine a solution with a low total operation cost. A 28% average area saving is achieved. This solution seems to be limited due to local attraction of the greedy algorithm.
Pattern Search Methods
Most of the pattern search methods are based on the same general idea. The algorithm recursively builds a set of useful constants to be optimized. This set is initialized with the recoded initial constants. The different methods differ in the way they match the common sub-expressions and the way they reuse and share them.
The multiple constant multiplication (MCM) solution presented in [17] performs a tree exploration with selection of matching parts of the SD representation of the constants. This paper is the most cited one, and it presents a lot of details about the algorithm as well as about the comparisons.
In [9] the matches between constants are represented using a graph. The exploration of this graph is used to produce a specific form of FIR filters with a reduced number of adders.
A solution based on an algebraic formulation of the possible matches between constants is presented in [12] . Unfortunately, the authors use random filters for their tests without specifying the coefficients. So it is difficult to compare their results to the other solutions.
In [13] a factorization method based on the selection of the best pair of matching digits is used. This solution can be easily extended to the selection of common parts of words larger than two digits.
We will base our solution on extensions and improvements of the algorithms presented in [10] and [16] . A detailed description of this idea is presented below.
One can notice that among all the abundant bibliography about the multiplication by constant problem there is no general solution to the CMM problem.
Proposed Algorithms

Lefèvre's algorithm
In 2001, Lefèvre proposed a new algorithm to efficiently multiply a variable integer by a given set of integer constants [10] . As a special case, this algorithm can be used to multiply a variable by a single constant.
Definitions
The principle of the algorithm is to keep a list of constants to be optimized, and to find a "pattern" that appears several times in the set of constants. The constants are recoded using the CSD format in the very beginning. A pattern is a sequence of digits in ½ ¼ ½ . The number of non-zero digits in the pattern is called its weight.
A pattern È is said to occur in a constant with a shift « when for each ½ in position of È , there is a ½ in position · « in , and for each ½ in position of È , there is a ½ in position · « in . And a pattern is said to occur negatively when there is a ½ in for each ½ in È , and a ½ in for each ½ in È . This last point is one of the main difference between the two papers [10] and [16] , Lefèvre's algorithm allows to use more complex patterns and leads to slightly better optimizations.
When two occurrences of two patterns in the same constant match the same non-zero digit of the constant, the two occurrences are said to conflict. 
Description of the algorithm
The principle of the algorithm can be described by the pseudo-code presented in Algorithm 1.
Algorithm 1 : Principle of Lefèvre's algorithm.
While there are some patterns with weight 2 and at least 2 non-conflicting occurrences choose a pattern with maximal weight and 2 non-conflicting occurrences remove the chosen pattern for both occurrences add the pattern as a new constant Then, multiplication by each constant in the final set can be implemented in the usual way: for each ½ (½) in position Ô, add (subtract) Ü shifted by Ô bits to the left. And then, by rolling back the algorithm, each constant can be computed by shifts and additions/subtractions, with roughly one addition/subtraction for each chosen occurrence of a pattern.
Extensions and enhancements to Lefèvre's algorithm
Mathematically speaking, Lefèvre's algorithm deals with the multiplication of a number by a constant vector. The first thing to do is to extend it for the multiplication of a vector by a constant matrix. This extension is rather straightforward: we simply replace each constant with a constant vector. Patterns are then replaced by vectors of patterns and shifts are done component-wise. With this algorithm, it is possible to share all kinds of expression.
For example, let us consider the computation of Ü · Ý · Þ and Ü · Ý · Þ. The algorithm will first share the computation of Ü · Ý between and . After, it will share Ü · Ý in´ Ü · Ýµ Ǘ · Ýµ · Ü · Ýµ, effectively sharing the multiplication by 5 between Ü and Ý.
One point is kept unspecified in Lefèvre's algorithm: which maximal pattern and which occurrences to choose? In his implementation, Lefèvre simply chose the first maximal pattern he found, with the first two occurrences. This solution is probably not the best, so we tried to find something better.
The first idea was to find all the maximal-weight patterns with at least two non-conflicting occurrences, and all their occurrences. And then, we try to choose a set of patterns and a set of, at least two, occurrences such that two chosen occurrences (of the same pattern or of different patterns) do not conflict. The choice is performed in order to maximize the gain in the weight of all the constants; with a constant with occurrences, we gain ½ times its weight. As all the chosen patterns have the same maximal weight, we want to maximize the sum, for each pattern, of the number of occurrences diminished by one. We tried three different solutions for this. The first one, called "random", and which is the closest to the original algorithm, is to recursively choose, on random, a pattern with 2 non-conflicting occurrences, and to remove everything that conflicts with these occurrences. The second one, called "graph-heuristic", is to recursively choose a pattern with a maximal set of non-conflicting occurrences, and a minimal set of conflicts with the other patterns, and then remove everything that conflicts with these occurrences. And the third one, called "graph-optimal", is to build all the maximal sets of patterns and non-conflicting occurrences, and to choose the best one. This last solution can be very computationally intensive.
We tried to compare those three solutions, by running them several times for the same constant matrix (a huge 2D IDCT operator). The results in Figure 1 show that the "graph-optimal" and "graph-heuristic" are roughly equivalent, and better than the "random", with a tiny advantage to "graph-optimal". The time required to generate these results is less than one minute for "graph-heuristic" and "random", while it can grow to hours for "graph-optimal". Hence, we generally choose the "graph-heuristic" solution, so we can do lots of tries (thanks to its speed), and then choose the best solution.
Beyond the mathematical optimization
The improvements described above only deal with the minimization of the total number of additions and subtractions. Translated to hardware, this is not enough. Some additions and subtractions should be reordered without changing their total number, thanks to properties such as associativity and commutativity.
First of all, one may want to have a small circuit. When
Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03) 1063-6889/03 $17.00 (C) 2003 IEEE three numbers , and are added, the order in which they are added influences the size of the adders. For example, if and are narrow numbers, while is wide, computinǵ · µ · leads to a smaller circuit than´ · µ · oŕ · µ · . Hence, for space optimization, it is generally better to add the narrowest numbers first.
On the other hand, one may want to have a fast circuit. The order in which three numbers are added influences also the worst case delay of the circuit. For example, if and are available early while is available late, the result is available earlier if we compute´ · µ · than´ · µ · or´ · µ · . Hence, for speed optimization, it is preferable to add the earliest available values first.
Of course, those two kinds of optimization may often conflict. Hence, the user must specify if he or she prefers to optimize for speed or for area. Of course, if the user prefers to optimize speed, and if two solutions are equivalent for speed, the smallest one is chosen.
Moreover, the algorithmic optimization is not enough. We need to generate some real circuits. Hence, we decided to generate some VHDL code. Our VHDL code generator is currently targeted for Xilinx FPGAs. So additions and subtractions are performed on the dedicated carry-propagate adders and subtracters. The generator is able to produce VHDL code for fully parallel circuits, or for digit-serial circuits with radix ¾ Ò for any Ò.
Implementation
Our implementation is mainly in two parts. The first part performs the mathematical optimization, with our extended and enhanced version of Lefèvre's algorithm. This part was written in C++, and is approximately 1500 line long. This part is not a program by itself, but a collection of simple classes that can be easily interfaced with any C or C++ program. Hence, it would be easy, for example, to interface this with a program that computes coefficients for FIR filters. Then the user could simply choose the type of filter and the pass and cut-off frequencies, and the program would compute the coefficients and generate some efficient VHDL code for it.
After the mathematical optimization, everything is implemented as plug-ins. Hence, there are, for example, plugins that optimize the order of the additions and subtractions, or plug-ins that generate the output VHDL code. This structure with plug-ins make the whole thing very modular. Hence, if someone wants, for example, to generate some Verilog code, or some assembly language code for a DSP, it is sufficient to write a new output plug-in. Then if someone wants to get pipelined circuits, he or she should write a new pipelining plug-in, and it can then be used with any output plug-in. Those plug-ins are also written in C++, but could be written in C or any language that can be interfaced with C. The collection of plug-ins is currently approximately 2500 line long.
The plug-ins communicate between themselves and the main program with simple interfaces that describe the circuit as a graph. In this representation, vertices represent mathematical values. Hence, there are vertices for input values, for output values, and also for intermediate values.
Then there is an arc, from vertex Ü to vertex Ý, tagged with × Ø × Ò Ð Ýµ, if Ü shifted × Ø bits to the left and delayed of Ð Ý clock cycles is a positive or negative part (according to × Ò) of Ý. The Ð Ý part will be useful for filters or for pipelined circuits. This representation has the quality of being independent from the desired output.
Let's give an simple example of how this can be used, and what is generated. As a simple example, we will consider building a constant multiplier by 111463. The corresponding source code and generated VHDL are presented Figure 2 and Figure 3 respectively. 
Results and Comparisons
The synthesis have been performed using Xilinx ISE XST 4.2.03i tools for a Virtex XCV200-5 FPGA. The operators are not pipelined. The area is measured in number of slices (2 LUTs per slice in Virtex devices). The required area compared to the 2352 available slices in a XVC200 device is also reported in parentheses. The delay is expressed in nanoseconds. The number of additions/subtractions and the number of FA cells are computed by our generator (the two last lines of Figure 3) ; the number of FA cells is only an estimation .
Only a few papers give enough elements to compare our solutions. In [17] and [13] : std_logic_vector (22 downto 0); begin --struct --t3 = 20641*x1 t3 <= ( t2 (15)&t2 (15)&t2(15 downto 0) & "00000" ) +( t1(22 downto 0) ); --t2 = 129*x1 t2 <= ( x1 (7)&x1 (7)&x1 (7)&x1 (7)&x1 (7)&x1 (7)&x1 (7) &x1 (7)&x1(7 downto 0) ) +( x1(7)&x1(7 downto 0) & "0000000" ); --t1 = 16513*x1 t1 <= ( t2 (15)&t2 (15)&t2 (15)&t2 (15)&t2 (15)&t2 (15) &t2 (15)&t2(15 downto 0) ) +( x1 (7) DCT application. Table 1 presents the number of additions/-subtractions for the 1D 8-point DCT for several word sizes. Our generator improves the best previous results from 17% up to 44%. We performed some other comparisons on some errorcorrecting codes from [17] and [13] : the ¢ Hadamard matrix transform,´½ ½½µ Reed-Muller,´½ µ BCH and ¾ ½¾ µ Golay codes. The comparison with the previous works in [17] and [13] is presented in Table 3 and the corresponding synthesis results are presented in Table 4 . Synthesis results for some errorcorrection benchmarks. Table 5 presents the synthesis results of the same operator with the three possible optimizations of our generator: none, area or speed. The operator is a 1D 8-point IDCT for 14-bit constants and 8-bit inputs. From the same initial addition/subtraction number the optimizations presented in Section 4.3 lead to significant improvements, 40% for the speed optimization for instance. The generation time for all these operators is around a few seconds on a standard desktop PC.
Some low-pass FIR filters used in [18] and [9] have been synthesized, these results are presented in Table 7 . The specifications of the different filters are presented in Table 6 .
Their coefficients have been generated using the Ö Ñ Þ Matlab function: Ö Ñ Þ´ Ø Ô ¼ Ô × ½ ½ ½ ¼ ¼ µ, and rounded to the target width. As the other authors do not specify the way they generate the coefficients, it is not possible to fairly compare our results. There are several function that use the Remez's algorithm in Matlab, and there are many options. Without more details it is not possible to In Section 4.3, we said that our generator can produce digit-parallel as well as digit-serial circuits using different output plug-ins. 
Conclusion
A new algorithm for the problem of multiplication by constant was presented. We generalized the previous results by dealing with the problem of the optimization of multiplication of one vector by one constant matrix. This algorithm is based on extensions and enhancements of previous algorithms [10] and [16] . Compared to the best previous results, our solution leads to a significant drop in the total number of additions/subtractions, up to 40%.
We implemented this algorithm in a VHDL generator. Based on a simple mathematical description of the computation, the generator produces an optimized VHDL code for Xilinx FPGAs. At the moment, the generated operators are non-pipelined parallel or digit-serial ones. We will extend our generator to produce pipelined circuits to reach higher clock frequencies.
We will also improve our algorithm in the case of operators that use different time samples of the same inputs such as filters. At the moment, the algorithm only factorizes some common sub-expressions without any temporal aspect.
We also want to extend our algorithm and generator to standard-cell based ASICs. The way to implement the adder/subtracter will widely impact the performance of the complete operator. The optimization required for lowpower consumption may also change our solutions.
Another way to explore in the future, is the use of lossy representations such as [2] . In a lot of applications, the models include some approximations and the coefficients quantization. It may be a good idea to allow small perturbations of the coefficients.
