Many DSP applications, such as digital filters and linear transforms, are built from multiple constant multiplication (MCM) circuits. In the hardware design of MCM circuits, it is important to minimize hardware cost, and applying combinatorial optimization algorithms is a feasible way to do so. However, when the optimization is implemented in software, the time it requires increases rapidly as the circuit scale increases. Within the design procedure, circuit synthesis is the most time-consuming module, and it is also the module called most frequently. The purpose of this study is to develop a hardware-oriented circuit synthesis module using FPGAs to shorten the time spent on the design of MCM circuits.
Introduction
MCM circuits multiply a single input by multiple constants, and are often used in digital signal processing, for example in digital filters and linear transforms. When implementing MCM circuits using finite hardware resources, it is important to design the circuits with minimum cost. The circuit scale can be reduced by replacing each multiplier with shifts and additions and then sharing common parts among them. However, as the number of constants or the word length increases, the number of possible sharing patterns becomes huge. For this type of problem, applying combinatorial optimization algorithms such as genetic algorithms (GAs) is likely to be effective and realistic. However, when such algorithms are implemented in software, the time needed for optimization grows as the circuit scale increases. In this study, we propose a hardware-oriented MCM synthesis algorithm suitable for FPGAs to shorten the time spent designing MCM circuits.
Design of MCM circuits using optimization algorithms
An MCM is an operation that multiplies an input by multiple constants. Assuming that N is the number of constants, it is described by the following equation:
$$y_i = a_i x, \quad i = 0, 1, \ldots, N-1,$$
where $x$ is an input, and each $a_i$ is a coefficient. Several techniques have been proposed for designing MCM circuits that reduce the cost of circuitry by replacing multipliers with shifts and adders and sharing them ([1, 2, 3, 4]). Each of these algorithms produces somewhat better results, but none reaches the optimum. Therefore, to design MCM circuits with minimum cost, it is necessary to apply combinatorial optimization algorithms such as genetic algorithms, simulated annealing, and tabu search. For example, in [5], the application of a genetic algorithm to the design of MCM circuits has been proposed.
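The replacement of a multiplier by shifts and additions can be illustrated with a minimal Python sketch. It assumes a plain binary representation of each non-negative coefficient, so that a coefficient with k non-zero bits needs k − 1 adders when nothing is shared. The function names and the three 8-bit constants are hypothetical and chosen only for illustration.

```python
def shift_amounts(constant: int) -> list[int]:
    """Return the positions of the non-zero bits of a non-negative constant."""
    return [k for k in range(constant.bit_length()) if (constant >> k) & 1]


def multiply_by_constant(x: int, constant: int) -> int:
    """Compute constant * x using only shifts and additions."""
    return sum(x << k for k in shift_amounts(constant))


def adder_count(constant: int) -> int:
    """Adders needed without sharing: one fewer than the number of non-zero bits."""
    return max(len(shift_amounts(constant)) - 1, 0)


# Illustrative 8-bit coefficients (an assumption of this sketch).
coefficients = [0b01101101, 0b10100101, 0b11011011]
x = 7
products = [multiply_by_constant(x, a) for a in coefficients]
total_adders = sum(adder_count(a) for a in coefficients)
```

Without sharing, the cost is simply the sum of the per-coefficient adder counts; the synthesis problem is to reduce this sum by reusing bit patterns that several coefficients have in common.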
In an optimization algorithm, solutions are generated and evaluated over a great number of iterations, and the synthesis of MCM circuits itself requires sequential processing. It is therefore imperative to shorten the iterative processing cycle when implementing optimization algorithms. However, if implemented as software, optimization becomes time-consuming, since software operates by sequential processing. For the realistic design of MCMs, a far shorter processing time is needed.
In the design of MCM circuits, circuit synthesis is the most time-consuming module, since it searches for common subexpressions within the entire set of combinations of coefficients. Furthermore, when combinatorial optimization algorithms are applied, the circuit synthesis module is the one called most frequently. Therefore, shortening the time spent in the circuit synthesis module would greatly shorten the time taken for the design of MCM circuits.
Circuit synthesis module in MCM design
This section describes a circuit synthesis module for MCM circuit design. First, a synthesis algorithm based on reference [1] is described. Then the hardware configuration of the algorithm is proposed.
MCM synthesis algorithm
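The case is considered where a coefficient list $A = \{a_0, a_1, \ldots, a_{N-1}\}$ is given. A coefficient list $B = \{b_0, b_1, \ldots\}$ for intermediate processing is initialized to $A$, and the following two steps are repeated.
STEP 1: For coefficients $b_i$ and $b_j$ and a shift amount $s$, the common bit pattern is $c = (b_i \ll s) \,\&\, b_j$. Defining $n_c$ as the non-zero bit count of $c$, combinations of $\{i, j, s\}$ are searched to maximize $n_c$. The algorithm terminates when $n_c < 3$.
STEP 2: The coefficient list is updated by $b_i = b_i - (c \gg s)$ and $b_j = b_j - c$; a new coefficient $c \gg s$ is added to $B$; and the processing returns to STEP 1.
As an example, the case is considered where $A = \{(01101101)_2, (10100101)_2, (11011011)_2\}$. In STEP 1, the combination $\{i, j, s\} = \{0, 2, 1\}$ is found. At this point, $c = (b_0 \ll 1) \,\&\, b_2 = (11011010)_2$, and $n_c = 5$. In STEP 2, the updating gives $b_0 = b_0 - (c \gg 1) = (00000000)_2$ and $b_2 = b_2 - c = (00000001)_2$; a coefficient $b_3 = c \gg 1 = (01101101)_2$ is added; and the processing returns to STEP 1.
Then, when STEP 1 is executed on the updated coefficient list $B$, the combination $\{i, j, s\} = \{3, 1, 2\}$ is found. At this point, $c = (b_3 \ll 2) \,\&\, b_1 = (10100100)_2$, and $n_c = 3$. Similarly, updating the coefficient list $B$ in STEP 2 gives $b_1 = (00000001)_2$, $b_3 = (01000100)_2$, and $b_4 = (00101001)_2$. Executing STEP 1 once more on $B$ gives $n_c < 3$, and therefore the algorithm terminates.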
The result can be expressed as $a_0 = b_3 + b_4$, $a_1 = 2^2 b_4 + 1$, and $a_2 = 2(b_3 + b_4) + 1$. As the coefficient $b_3$ requires one adder and $b_4$ requires two, the total necessary number of adders can be reduced from twelve to six.
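The two steps of the synthesis algorithm can be traced in software with the following Python sketch, which is an illustrative model rather than the implementation evaluated in this paper. The initial coefficient list is the one implied by the final decomposition $a_0 = b_3 + b_4$, $a_1 = 2^2 b_4 + 1$, $a_2 = 2(b_3 + b_4) + 1$ with $b_3 = (01000100)_2$ and $b_4 = (00101001)_2$. Two points are assumptions of the sketch: only pairs with $i \neq j$ are searched, and when several candidates have the same $n_c$ the last one visited is kept, an arbitrary tie-break that happens to reproduce the worked example.

```python
def popcount(value: int) -> int:
    """Number of non-zero bits of a non-negative integer."""
    return bin(value).count("1")


def find_best_pattern(b, word_length):
    """STEP 1: search {i, j, s} maximizing the non-zero bit count n_c of c."""
    best_n_c, best = 0, None
    for i in range(len(b)):
        for j in range(len(b)):
            if i == j:
                continue  # assumption: a coefficient is not paired with itself
            for s in range(word_length):
                c = (b[i] << s) & b[j]
                n_c = popcount(c)
                # ">=" keeps the last of several equally good candidates
                # (assumed tie-break, not taken from the paper).
                if n_c >= best_n_c:
                    best_n_c, best = n_c, (i, j, s, c)
    return best_n_c, best


def synthesize(coefficients, word_length=8, min_common_bits=3):
    """Repeat STEP 1 and STEP 2 until fewer than min_common_bits bits are shared."""
    b = list(coefficients)
    steps = []
    while True:
        n_c, found = find_best_pattern(b, word_length)
        if n_c < min_common_bits:
            return b, steps
        i, j, s, c = found
        # STEP 2: remove the shared pattern and append it as a new coefficient.
        b[i] -= c >> s
        b[j] -= c
        b.append(c >> s)
        steps.append((i, j, s, n_c))


example = [0b01101101, 0b10100101, 0b11011011]
final_b, steps = synthesize(example)
```

For this input, the sketch visits the combinations $\{0, 2, 1\}$ and $\{3, 1, 2\}$ and then stops. Under the counting used above, each extraction of $n_c$ common bits saves $n_c - 1$ adders, which is consistent with the reduction from twelve adders to six.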
Hardware implementation of circuit synthesis module
A hardware configuration for realizing the circuit synthesis algorithm described above is illustrated in Fig. 1. The shared module in the figure controls the following: (i) the input of the coefficient list to be synthesized; (ii) the [...]
First, the coefficient list A is transferred to a coefficient list RAM, and a coefficient list B for intermediate processing is generated. Then, the coefficient list B is transferred to the search module, and STEP 1 of the synthesis algorithm is executed. The processing of STEP 2 is performed by the shared module, and the processing is repeated until the end condition is satisfied.
The search module consists of multiple systolic array modules operating in parallel. The hardware configuration of each systolic array module is illustrated in Fig. 2. Coefficients are transferred from the shared module, passed through selectors, and stored in shift registers. The coefficients held in the shift registers are then transferred to the systolic array. In the systolic array, the coefficient entering the uppermost shift register in the figure is compared with each of the other coefficients to determine the number of bits they have in common. In practice, this is realized by performing an AND operation on the coefficients as they are transferred serially to a processing element (PE), and then using a counter to count the non-zero bits. Another output of the PE is the coefficient delayed by one clock, which is fed back to the selector and input once again to the shift register. In this way, a function equivalent to a bit shift of the coefficient is realized. After the common bits have been counted, a comparison module searches for the combination having the highest number of common bits and returns the result to the shared module. These search operations can be pipelined, owing to the characteristics of the systolic array.
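The behaviour of a single PE can be modelled in software as follows. This is a simplified Python sketch, not a description of the actual circuit: it assumes bits are streamed least-significant bit first, that one pass over a coefficient takes one clock per bit, and that feeding the one-clock-delayed output back into the shift register yields the coefficient shifted left by one bit for the next pass. The function name and the example values are hypothetical.

```python
def common_bit_counts(b_i: int, b_j: int, length: int) -> list[int]:
    """Return n_c for shifts s = 0 .. length-1 using a recirculating bit-serial PE model."""
    register = [(b_i >> k) & 1 for k in range(length)]  # LSB first
    fixed = [(b_j >> k) & 1 for k in range(length)]
    counts = []
    for _ in range(length):
        count = 0        # counter of non-zero AND results in this pass
        delayed = 0      # one-clock delay element inside the PE
        next_register = []
        for bit_a, bit_b in zip(register, fixed):
            count += bit_a & bit_b          # serial AND followed by counting
            next_register.append(delayed)   # delayed bit is fed back to the register
            delayed = bit_a
        counts.append(count)
        register = next_register            # now holds the coefficient shifted by one
    return counts


counts = common_bit_counts(0b01101101, 0b11011011, 8)
best_shift = counts.index(max(counts))
```

Each pass produces the common bit count for one shift amount, so the comparison stage only has to keep the largest count seen and the indices that produced it.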
Evaluation
To evaluate the effectiveness of the proposed method, the circuit synthesis module was implemented on the following FPGA, and execution speeds were compared.
• FPGA: Xilinx Virtex-II Pro, 33,088 slices, 66 MHz
• Development Tool: Xilinx ISE Foundation 8.2i
Fig. 3 shows the execution speed as a function of the number of coefficients, for coefficients having a bit length of 32 bits. Among the methods compared in the figure, "Software" refers to a software implementation built with Microsoft Visual C++ 2005 and run on a PC with a 3.4 GHz Xeon CPU and 1 GB of memory. "Software Based Hardware" (SBH) refers to a hardware implementation that follows the software algorithm and uses the same FPGA ([6]).
As the number of coefficients increases in Fig. 3, the speed advantage of the proposed method over software grows. In the case of 256 coefficients, the proposed method is faster by a factor of 494. Although the SBH was implemented on the same FPGA, the clock frequency of the FPGA is 66 MHz, which is slow in comparison to that of a PC. The SBH therefore gains little speed: it is 1.8 times faster than software in the case of 256 coefficients, but slower than software for 128 or fewer coefficients.
Thus, much of the large speed increase of the proposed method over the SBH, which is implemented on the same hardware, results from its parallel configuration. However, it should be noted that the proposed method is also designed specifically to maximize the effects of that parallel configuration.
For the SBH, the clock count is the product of the number of combinations of coefficients and the number of bit shift operations, that is, $\frac{1}{2}(N^2 - N)L$, where $N$ is the number of coefficients and $L$ is the bit length of the coefficients. On the other hand, the proposed method requires $3(L + N + 1)$ clocks for input/output of the systolic array, and $(L - 3)$ iterations of $(L + 1)$ clocks for PE operation in the systolic array. Therefore, the total clock count is $3(L + N + 1) + (L - 3)(L + 1) = L^2 + L + 3N$. Here, $L$ is normally fixed at about 16 bits or 32 bits, so the clock count of the proposed method grows only linearly with $N$, whereas that of the SBH grows quadratically. This explains why the speed advantage of the proposed method becomes larger as $N$ becomes larger.
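The two clock-count expressions can be evaluated directly. The short Python sketch below only restates the formulas given above; the function names are hypothetical, and the resulting ratio is a ratio of clock counts, not of measured execution times.

```python
def sbh_clocks(n: int, length: int) -> int:
    """Clock count of the SBH: (N^2 - N) / 2 coefficient pairs times L shifts."""
    return (n * n - n) * length // 2


def proposed_clocks(n: int, length: int) -> int:
    """Clock count of the proposed method: array I/O plus (L - 3) PE passes."""
    return 3 * (length + n + 1) + (length - 3) * (length + 1)


# Clock counts for 32-bit coefficients at several list sizes.
table = {n: (sbh_clocks(n, 32), proposed_clocks(n, 32)) for n in (16, 64, 256)}
```

For $L = 32$, each doubling of $N$ roughly quadruples the SBH count while adding only $3N$ clocks to the proposed method.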
Next, the relation between circuit scale and speed is examined. The proposed method has a parallel configuration, and therefore its circuit scale is larger than that of the SBH. Here, assuming that the circuit scale is approximated by the number of FFs needed to compare a set of coefficients, the proposed method requires $N(L + 2)$ FFs, while the SBH requires $2L$ FFs. Therefore, if the SBH were configured in parallel, the degree of parallelism giving about the same circuit scale as the proposed method can be estimated as $N(L + 2)/(2L)$. For example, 136 rows are possible in the case where $N = 256$ and $L = 32$. However, under these conditions, the proposed method can operate at the FPGA clock frequency of 66 MHz, while the parallel SBH can operate at no more than 6 MHz. Accordingly, even when the SBH is made parallel, it gains a speed increase of only about twelve times over the original SBH.
From the examination above, the results illustrated in Fig. 3 do not derive simply from the parallel configuration. Rather, the search module of the proposed method is designed to exploit the parallel configuration efficiently.
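The estimate above can be reproduced with a few lines of Python. The sketch reads the argument as follows, which is an interpretation rather than a statement from the measurements: a parallel SBH with the same FF budget gains a factor equal to its number of rows, reduced by the drop in clock frequency from 66 MHz to 6 MHz. The function name is hypothetical, and the factor of 273 is the measured speed ratio of the proposed method to the SBH for 256 coefficients of 32 bits.

```python
def equivalent_sbh_rows(n: int, length: int) -> int:
    """Rows of a parallel SBH matching the FF count N(L + 2) of the proposed method."""
    return (n * (length + 2)) // (2 * length)


rows = equivalent_sbh_rows(256, 32)
# Speed gain of the parallel SBH over a single SBH running at 66 MHz.
parallel_sbh_gain = rows * 6 / 66
# Remaining advantage of the proposed method over the parallel SBH.
remaining_advantage = 273 / parallel_sbh_gain
```

With these numbers, the parallel SBH gains a factor of a little over twelve, leaving the proposed method about 22 times faster at an equivalent circuit scale.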
Conclusions
In this study, a fast circuit synthesis module was proposed for reducing calculation time in the design of MCM circuits. A synthesis algorithm using a binary representation suitable for realization in hardware was employed, and a configuration suitable for FPGAs, based on a parallel structure with a systolic array, was proposed.
For the synthesis of 256 coefficients having a length of 32 bits, the proposed method achieved a speed 494 times that of the software implementation and 273 times that of the SBH implementation. Moreover, even if the SBH is assumed to have a parallel configuration of equivalent circuit scale, the proposed method uses its parallel configuration more efficiently and achieves an execution speed 22 times that of the parallel SBH.
During the design of MCM circuits, the circuit synthesis module is the most frequently used module, so the benefits of its higher speed can be expected to carry over to design using an optimization algorithm. One plan for future work is the implementation of the total design method, including the optimization algorithm.
