A New Algorithm for Elimination of Common Subexpressions

R. Paško, P. Schaumont, V. Derudder, S. Vernalde, Member, IEEE, and D. Ďuračková
Abstract-The problem of an efficient hardware implementation of multiplications with one or more constants is encountered in many different digital signal-processing areas, such as image processing or digital filter optimization. In a more general form, this is a problem of common subexpression elimination, and as such it also occurs in compiler optimization and many high-level synthesis tasks. An efficient solution of this problem can yield significant improvements in important design parameters like implementation area or power consumption. In this paper, a new solution of the multiple constant multiplication problem based on the common subexpression elimination technique is presented. The performance of our method is demonstrated primarily on a finite-duration impulse response filter design. The idea is to implement a set of constant multiplications as a set of add-shift operations and to optimize these with respect to the common subexpressions afterwards. We show that the number of add/subtract operations can be reduced significantly this way. The applicability of the presented algorithm to different high-level synthesis tasks is also indicated. Benchmarks demonstrating the algorithm's efficiency are included as well.
Index Terms-Common subexpression elimination, DSP synthesis, optimization, resource sharing.
I. INTRODUCTION
THE advent of consumer applications demanding very high data throughputs, like digital television, requires high-speed components such as digital filters. Because of the required speed, programmable solutions such as digital signal-processing (DSP) cores cannot be considered satisfactory for these problems. Rather, an application-specific approach in hardware is necessary; thus, efficient very-large-scale integration (VLSI) synthesis methods are needed.
The core of many VLSI design tasks is the multiplication of a variable by a set of constants (digital filtering, image processing, linear transforms, etc.). The optimization of these multiplications can lead to important improvements in various design parameters like area or power consumption. In this paper, an algorithm for the efficient solution of the multiple constant multiplication (MCM) problem, as defined in [3], is presented. Common subexpression elimination (CSE) as a way to tackle the MCM problem was already proposed by various authors [3]-[5], primarily as a possible method for optimizing finite-duration impulse response (FIR) filter area through the reduction of the multiplier-block logic. In [3], a number of other applications in which the MCM transformation can be successfully applied were also proposed. In this work, we will introduce an algorithm able to solve the CSE problem in an efficient way. The idea of CSE can be demonstrated on the FIR filter design example shown in Fig. 1. The optimization procedure targets the minimization of the multiplier block area [Fig. 1(a)]. The coefficients are first expressed in the canonical signed digit (CSD) format [1], [2] in order to reduce the total number of nonzero bits (and thus also the additions/subtractions necessary); then an add-shift expansion is performed, as shown in Fig. 1(b). The goal of CSE is to identify the bit patterns that are present in the coefficient set more than once. Since it is sufficient to implement the calculation of multiple identical expressions only once, the resources necessary for these operations can be shared. The pattern in the example in Fig. 1 is present twice, so the optimized structure shown in Fig. 1(c) can be implemented instead of the original one: the second occurrence of the pattern is removed, and only the result is reused in the further calculation. In general, the goal of CSE can be defined as follows.
1) Identify multiple patterns in the coefficient set.
2) Remove these patterns and calculate them only once.

The problem to solve is how to identify the "proper" patterns for elimination so that the optimization impact is maximal. Our algorithm is based on a combination of an exhaustive search technique with a steepest descent (or greedy) approach in order to select the "proper" patterns for elimination. Thanks to an efficient implementation of the algorithm, combined with some simplification techniques to speed up the processing of large tasks, very satisfactory runtimes are also achieved.

The rest of this paper is structured as follows. In Section II, we discuss the related work. In Section III, we give a formal description of the problem and state our goal, together with some considerations concerning the problem complexity. In Section IV, an in-depth discussion of our algorithm is given. In Section V, we indicate an implementation strategy of the algorithm for different design tasks (we concentrate on transposed- and direct-form FIR filters and matrix multiplication), followed by experimental results in Section VI. Section VII presents a comparison with the related work, and Section VIII states the conclusions.
II. RELATED WORK
The idea of optimizing constant multiplications (generally) or minimizing FIR filter area (specifically) by sharing common subexpressions has already been considered by several authors [3]-[7]. In this section, we briefly introduce their approaches with the appropriate references. In [3], a bipartite matching algorithm was used to identify the common subexpressions for elimination, and numerous examples other than FIR filter optimization to which it can be successfully applied were also shown. In [4], an algorithm for the identification and elimination of only two-nonzero-bit subexpressions was proposed, but the method was extended to direct-form FIR filters as well (the remaining papers consider only transposed-form FIR filters). In [5], an elimination of 2-bit subexpressions was also proposed, but an estimate of the latch-count improvement was used as an additional criterion in the subexpression identification process, which introduces timing-related issues. Both [4] and [5] specifically targeted the optimization of FIR filter area. These three papers ([3]-[5]) used generally the same idea (common subexpression elimination) as the basic optimization strategy. References [7] and [6] applied a different approach: in these works, the whole multiplier block is synthesized using similar graph synthesis algorithms ([6] can be considered an extension of [7]). Despite this difference, the results obtained by these works are of course of interest for this paper in order to compare the effectiveness of both approaches.
III. PROBLEM ANALYSIS
In this section, we will define the goal of the CSE technique formally as a matrix transformation. Afterwards, a short discussion of the problem complexity will be given, and a simple heuristic to tackle the complexity issue will be proposed.
A. Problem Definition
The problem we are targeting can be formally described as a multiplication-free linear transform. In general, a multiplication-free linear transform is defined by the equation $y = A \cdot x$, where $x$ and $y$ are $M$- and $N$-dimensional vectors, respectively, and $A$ is an $N \times M$ matrix containing only $1$, $-1$, and $0$. In this form, $x$ represents the variable input, while the matrix $A$ encodes the set of constants. As will be shown later, this formalism can be extended to a number of different problems.
Consider the example of a multiplication-free linear transform described in (1):

$$y = A \cdot x = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix} \cdot x. \tag{1}$$

The product $x_1 + x_4$ has to be calculated three times during the evaluation of $y$. However, the splitting of $A$ into two matrices $A_1$ and $A_2$, as shown in (2), groups the partial products in question into one matrix $A_2$, which can be decomposed as shown in (3):

$$A = A_1 + A_2 = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \tag{2}$$

$$A_2 \cdot x = b \cdot (p \cdot x), \qquad b = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \qquad p = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}. \tag{3}$$

The formulation (3) requires the evaluation of the partial product $p \cdot x = x_1 + x_4$ only once. This idea of matrix splitting for multiple identified patterns can be expressed as follows:

$$A = A_0 + \sum_{i=1}^{K} A_i. \tag{4}$$

Concerning the matrices $A_i$, $i = 1, \ldots, K$, any row in every matrix $A_i$ must be either an all-zero vector or equal to the single nonzero row pattern $p_i$ of that matrix, as in (2). $A_0$ is the remainder, in which no more multiple subexpressions can be found. The final product can be computed as shown in (5):

$$y = A_0 \cdot x + \sum_{i=1}^{K} b_i \cdot (p_i \cdot x), \tag{5}$$

where each partial product $p_i \cdot x$ is evaluated only once.
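As a sanity check of this splitting, the following sketch evaluates $y$ of (1) both directly and through the split of (2)-(5). It is an illustration only (the paper's implementation is in C), with NumPy used just for the matrix bookkeeping:

```python
import numpy as np

A = np.array([[1, 0, 0, 1, 0, 0],     # the matrix of (1)
              [1, 0, 0, 1, 0, 1],
              [1, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0]])
p = np.array([1, 0, 0, 1, 0, 0])       # shared pattern: x1 + x4
rows = [0, 1, 2]                       # rows containing the pattern

A1 = A.copy()
A1[rows] -= p                          # remainder matrix of (2)

x = np.array([3, 1, 4, 1, 5, 9])
xp = p @ x                             # the partial product, computed once
y = A1 @ x
y[rows] += xp                          # add the shared term back, as in (5)

assert np.array_equal(y, A @ x)        # same result, fewer additions
```

The direct evaluation needs six additions; with the shared sum it needs four.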
A subset of the previously defined problem deserving attention occurs in the case when the relative position of the pattern within the matrix is of no importance in the pattern identification process. This is a valid assumption in the case of the vector $x$ being defined as in (6):

$$x = \big(x_0,\; c\,x_0,\; c^2 x_0,\; \ldots,\; c^{M-1} x_0\big)^{T}, \qquad c = 2 \;\text{ or }\; c = 2^{-1}, \tag{6}$$

i.e., when the elements of $x$ are shifted versions of a single input $x_0$. Then the matrix splitting shown in (7) can be performed, in which the nonzero rows of each $A_i$ are allowed to contain the pattern $p_i$ shifted by an arbitrary number of positions:

$$A = A_0 + \sum_{i=1}^{K} A_i. \tag{7}$$

In order to be able to decompose a matrix $A_i$ in the same way as shown in (2), an additional scaling of the elements, as shown in (8), must be performed: a pattern occurrence shifted by $s$ positions satisfies

$$(\text{$p_i$ shifted by } s) \cdot x = c^{s}\,(p_i \cdot x). \tag{8}$$

This results in the scaled vector $\tilde b_i$ shown in (9), which replaces the 0/1 vector $b_i$ of (3):

$$A_i \cdot x = \tilde b_i \cdot (p_i \cdot x), \qquad \tilde b_{i,k} \in \{0, \pm c^{s}\}. \tag{9}$$

Since $c = 2$ or $c = 2^{-1}$, the scaling of the vector elements is equivalent to bit shifts, so it can usually be performed very efficiently in either software or hardware.
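In an implementation, this shift equivalence can be handled by mapping every pattern occurrence to a canonical right-aligned form plus a shift count. A minimal sketch (hypothetical helper; digits are stored most significant first):

```python
def normalize(pattern):
    """Strip trailing zero digits, returning (canonical pattern, shift).
    Occurrences that differ only in bit position then map to the same
    canonical key, which is what the position-independent case requires."""
    shift = 0
    while pattern and pattern[-1] == 0:
        pattern = pattern[:-1]
        shift += 1
    return tuple(pattern), shift

print(normalize((1, 0, 0, 1)))      # ((1, 0, 0, 1), 0)
print(normalize((1, 0, 0, 1, 0)))   # ((1, 0, 0, 1), 1) -- same key, shift 1
```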
To demonstrate the matrix-splitting technique, we use the FIR filter subblock optimization example shown in Fig. 1. The CSD-expanded coefficients of Fig. 1(b) form the rows of the matrix in (10), and the input vector contains the shifted versions of the filter input, as in (6). The pattern elimination performed in (11) then corresponds to the matrix splitting shown in (13). Of course, the partial product must be rescaled according to (14), since the occurrences of the pattern appear at different bit positions.
Finally, there are some terms that need to be defined.
Definition 1 (Computational Effort):
The computational effort is equal to the number of additions/subtractions necessary to produce the final product $y$.
To estimate the computational effort, we consider only the necessary number of adders and subtractors. In the applications we are targeting (hardware), shifts can be implemented almost for free (hardwired, with only some additional wiring effort). If this is not the case, the cost of shifts should be included in the computational-effort estimate as well. Concerning the use of other criteria in the computational-effort estimation (like the timing-related latch-count parameter used in [5]), we prefer to keep the method as general as possible; as will be shown later, the computational-effort cost function can be modified easily without significant changes to the algorithm itself. Consequently, the goal of CSE is defined as follows.
Definition 2 (CSE Goal):
The goal of the CSE technique is to find a splitting of the matrix according to (2) and (4) such that the total computational effort to produce the product is minimal.
To measure the success of the CSE optimization, we use the following.
Definition 3 (Optimization Ratio):
An optimization ratio is defined as $R = E_b / E_a$, where $E_b$ and $E_a$ are the computational efforts before and after CSE, respectively.
The optimization ratio can be used in an alternate definition of the CSE goal, equivalent to Definition 2.
Definition 4 (CSE Goal):
The goal of the CSE technique is to find a splitting of the matrix $A$ that maximizes the optimization ratio $R$.
Last, the term frequency or pattern frequency will often be used in the text.
Definition 5 (Pattern Frequency): Pattern frequency (or just frequency) represents the number of occurrences of a pattern in a matrix.
For example, the frequency of the pattern 1001 in (1) is equal to three [or four if the relative position of a pattern is of no importance, as shown in (6)- (8)].
In this work, we propose an algorithm to solve both outlined problems efficiently. In order to be able to clearly distinguish between these two problems later, we will define the problem described in (1) as Problem A and the problem shown in (6) as Problem B.
Of course, multiplication-free linear transform is not the only application for the CSE technique. Similar problems occur in many different areas (e.g., in compiler design). The proposed algorithm might be capable of performing the common subexpression identification and elimination also for tasks that are quite different from multiplication-free linear transforms, but this is outside the scope of this paper.
B. Problem Complexity
In this section, a short discussion of the practical feasibility of the CSE goal is given. The problem in question is as follows: with each pattern elimination, we are likely to lose other patterns as well, due to the sharing of nonzero bits. For example, during an elimination of the pattern 1001 from the row 10010100, the pattern 101 is also lost, so every pattern elimination can change the frequencies of the other ones significantly. The graph synthesis problem is claimed to be NP-complete [7], but we are not aware of the existence of such a proof for CSE. Nevertheless, since there is no known efficient algorithm to solve the problem exactly, we propose a simple heuristic for the CSE problem in this paper. It is based on a steepest descent approach, i.e., in every matrix-splitting iteration we choose a splitting such that the computational-effort reduction is maximal (for that iteration). This of course does not guarantee finding the optimal solution in a global sense, but the results have proven the viability of this approach. Another issue is the complexity of an algorithm creating the statistics of the patterns available for elimination, which is necessary in order to realize the proposed heuristic (see Fig. 2). The number of patterns with $n$ nonzero digits in a single row containing $b$ nonzero digits is $\binom{b}{n}$, so the total effort to create the statistics for a matrix containing $N$ rows is of the order $N \cdot \binom{b}{n}$. It is obvious that the values $b$ and $n$ are the crucial factors in the complexity issue, since the combinatorial number $\binom{b}{n}$ can rise rapidly with $b$. Fortunately, for a number of problems these values are relatively small (e.g., in FIR filters, $b$ is bounded by the number of nonzero bits in the coefficients, itself at most the coefficient word length), so it is possible to create the complete pattern statistics. An additional strategy will be proposed in the next section for the cases in which this does not apply.
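The counting argument can be made concrete in a few lines of Python (an illustration with hypothetical helper names, not the paper's C code): the $n$-nonzero-digit subpatterns of a row are obtained by keeping $n$ of its $b$ nonzero digits and zeroing the rest, so there are exactly $\binom{b}{n}$ of them.

```python
from itertools import combinations
from math import comb

def row_patterns(row, n):
    """Every subpattern of `row` with exactly n nonzero digits, as a
    full-length tuple (position-dependent, i.e., Problem A style)."""
    nz = [i for i, d in enumerate(row) if d != 0]
    for keep in combinations(nz, n):
        keep = set(keep)
        yield tuple(d if i in keep else 0 for i, d in enumerate(row))

row = (1, 0, 0, 1, 0, 1, 0, -1)        # b = 4 nonzero digits
assert len(list(row_patterns(row, 2))) == comb(4, 2)   # C(b, n) patterns
```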
IV. CSE ALGORITHM
In this section, we will give a detailed description of an algorithm able to solve Problem B (i.e., the elimination of patterns with arbitrary shifts within the input matrix). Afterwards, we will discuss the modifications necessary for the algorithm to be able to solve Problem A as well. As indicated in the previous section, the algorithm must accomplish the following tasks.
1) Identify the presence of multiple patterns in the input matrix.
2) Select one pattern for elimination.
3) Eliminate all occurrences of the selected pattern.

This should be repeated iteratively until no more multiple patterns are present. The complete algorithm flowgraph is given in Fig. 3. The input parameter $n$ represents the number of nonzero bits in the examined patterns. In the first step, an exhaustive search for all possible multiple $n$-bit patterns is performed, and complete statistics of the pattern frequencies are created. Since many different patterns will occur more than once, some criterion must be used to select the one for elimination. We use the steepest descent approach, i.e., we always select the pattern with the highest frequency. In the second step, all occurrences of the selected pattern are removed (i.e., the nonzero bits are replaced by zeros), and the pattern is added as a new line at the bottom of the matrix so it can be searched for multiple patterns with smaller $n$ later. Last, since the removal of a pattern influences the total frequency statistics of the remaining ones, the global frequency statistics holding the complete information have to be adjusted to properly reflect the changes. After all multiple patterns with $n$ nonzero bits are processed, the whole cycle is repeated for patterns with $n-1$ nonzero bits. A detailed discussion will further concentrate on the following problems (a compact sketch of the complete iteration is given after the list):
A) pattern identification;
B) pattern selection;
C) frequency statistics management;
D) adaptation of the algorithm for Problem A;
E) viability of the algorithm for large tasks;
F) applicability to similar CSE tasks.
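Before discussing these points individually, the following sketch renders one Fig. 3 iteration for the position-dependent case (Problem A). It is a simplification: the statistics are rebuilt from scratch in every pass, whereas the actual implementation updates them incrementally (Section IV-C), and the authors' code is in C rather than Python.

```python
from itertools import combinations
from collections import Counter

def row_patterns(row, n):              # as in the sketch of Section III-B
    nz = [i for i, d in enumerate(row) if d != 0]
    for keep in combinations(nz, n):
        keep = set(keep)
        yield tuple(d if i in keep else 0 for i, d in enumerate(row))

def cse_pass(rows, n):
    """Steepest-descent CSE over n-nonzero-digit patterns (Problem A)."""
    while True:
        stats = Counter(p for r in rows for p in row_patterns(r, n))
        pattern, freq = max(stats.items(), key=lambda kv: kv[1],
                            default=(None, 0))
        if freq < 2:                   # no multiple pattern left
            return rows
        for r in rows:                 # eliminate every occurrence
            if all(d == 0 or r[i] == d for i, d in enumerate(pattern)):
                for i, d in enumerate(pattern):
                    if d != 0:
                        r[i] = 0
        rows.append(list(pattern))     # the pattern is computed only once
```

Each elimination removes at least $2n$ nonzero digits and adds back $n$, so the loop is guaranteed to terminate.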
A. Pattern Identification
Since an exhaustive search is performed, all possible combinations of $n$-bit patterns must be examined. The algorithm must also be able to detect a "collision" between two equal patterns when these share at least one nonzero bit. Since such a pattern can be eliminated only once, this must be taken into account during the frequency-statistics creation phase (Fig. 4). For example, the row 1010101 in Fig. 4 contains only two valid occurrences of the 2-bit pattern 101, as shown in Table I, because only two occurrences can be identified without conflicts [Fig. 4(a)]. The interleaving of two patterns without common nonzero bits, on the contrary, does not influence the statistics, as shown in Fig. 4(b).
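The conflict rule can be sketched as follows (a hypothetical helper, counting greedily from the left, for the shift-invariant case):

```python
def count_disjoint(row, pattern):
    """Count occurrences of a shift-invariant pattern in a row such that
    no two counted occurrences share a nonzero digit."""
    used = set()
    count = 0
    for off in range(len(row) - len(pattern) + 1):
        hits = {off + i for i, d in enumerate(pattern) if d != 0}
        match = all(row[off + i] == d
                    for i, d in enumerate(pattern) if d != 0)
        if match and not (used & hits):
            used |= hits
            count += 1
    return count

# 1010101 contains 101 at offsets 0, 2, and 4, but the middle occurrence
# shares its nonzero bits with the outer two, so only two are valid.
print(count_disjoint((1, 0, 1, 0, 1, 0, 1), (1, 0, 1)))   # -> 2
```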
B. Pattern Selection
In case several patterns with the same frequency are present in the frequency statistics, a decision criterion must be provided to choose only one. The one we have chosen originated from the assumption that the optimized structure will be integrated on silicon. If two (or more) patterns with the same frequency occur, the shortest one is selected. Implementation of an addition/subtraction on silicon results in an adder/subtractor whose word length depends on the length of the pattern; thus, the selection of the shorter one results in a smaller adder/subtractor structure. In the case of 2-bit patterns, one additional criterion was introduced: the preference of adders over subtractors, i.e., the pattern 101 will be preferred over 10$\bar{1}$ (where $\bar{1}$ denotes the digit $-1$). This can be justified by the same reasoning as before, since a subtractor structure is more expensive than an adder in terms of area. In cases where the algorithm is used for tasks aiming at a different goal, another criterion might be preferred.
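The selection rule can be summarized as a sort key, assuming the pattern length approximates the adder word length and a digit $-1$ marks a subtraction (an illustrative sketch):

```python
def selection_key(pattern, frequency):
    """Highest frequency first; ties broken by pattern length (shorter
    means a narrower adder/subtractor), then adders before subtractors."""
    needs_subtractor = any(d < 0 for d in pattern)
    return (-frequency, len(pattern), needs_subtractor)

candidates = {(1, 0, 1): 3, (1, 0, -1): 3, (1, 0, 0, 1): 3}
best = min(candidates, key=lambda p: selection_key(p, candidates[p]))
print(best)   # (1, 0, 1): shortest pattern and an adder wins the tie
```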
C. Frequency Statistics Management
Since an exhaustive search is performed, attention must be paid to its implementation strategy. For the $n$-bit statistics, a binary tree with the patterns as keys is a perfectly suited structure. The complete statistics can be created simply by processing the input matrix row after row. One problem is caused by the fact that the same pattern can be present in a row multiple times, so the large binary tree would have to be searched for the same pattern multiple times. To avoid this, an alternative approach to the global statistics generation was used, as shown in Fig. 5. First, a local tree holding the frequency statistics of a single row is created [Fig. 5(b)], and this local tree is used to update the global statistics. This way, searching in the global statistics tree is minimized.
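The two-level build can be sketched as follows; the paper keys binary trees by the pattern, while this illustration uses a hash-based Counter, and `patterns_of` stands for the pattern enumeration of the previous sections:

```python
from collections import Counter

def build_global_stats(rows, patterns_of):
    """Two-level statistics creation (Fig. 5): a small local table is
    built per row first, and only its distinct entries touch the large
    global structure, once each, however often a pattern repeats."""
    global_stats = Counter()
    for row in rows:
        local = Counter(patterns_of(row))   # local statistics of one row
        global_stats.update(local)          # few lookups in the big table
    return global_stats
```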
After a pattern elimination, the frequencies of the other patterns can also change, and therefore the global frequency statistics must be reevaluated. Since the creation of new global statistics after each pattern elimination is not a feasible solution, an alternative method of global statistics adjustment must be found (Fig. 6). After each pattern elimination, a local statistics tree is created holding the information about the frequency changes of the remaining patterns in the processed matrix row [Fig. 6(c)]. These difference statistics can be used to update the global statistics tree, which results in a much smaller number of operations on the large global frequency statistics, since it has to be accessed only for the patterns whose frequency actually changed. This way, the global tree has to be created only once, at the beginning of each iteration (see Fig. 3).
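A sketch of the difference update of Fig. 6, under the same assumptions as above: only patterns whose frequency actually changed touch the global table.

```python
from collections import Counter

def update_after_elimination(global_stats, old_row, new_row, patterns_of):
    """Apply difference statistics: subtract the row's old local counts
    from its new ones, then push only the nonzero changes globally."""
    delta = Counter(patterns_of(new_row))
    delta.subtract(Counter(patterns_of(old_row)))
    for pattern, change in delta.items():
        if change:
            global_stats[pattern] += change
            if global_stats[pattern] == 0:
                del global_stats[pattern]   # drop exhausted entries
```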
D. Algorithm Modification for Problem A
It is also possible to use the previously described algorithm to perform a CSE optimization for the case when the position of a pattern within the matrix row is of importance (i.e., Problem A), but some changes must be made to take this into account. First, together with the pattern, its position within the row must also be used as a key during the construction of the binary tree. Second, since in this case all the patterns in a row are unique (at least their positions within the row must differ), it is not necessary to use the local statistics during the creation and maintenance of the global statistics (see Fig. 7). Apart from this, both algorithms can be identical. To distinguish between the two, we will denote them Algorithm I (solving Problem A) and Algorithm II (solving Problem B).
E. Algorithm Modification for Large Tasks
To modify the algorithms so that they can also solve large tasks efficiently, the bottlenecks must first be identified. Let us assume that the processed matrix has dimension $N \times M$. If the number of rows were doubled to $2N$, approximately twice the number of keys would have to be inserted into the binary tree during the $n$-bit statistics creation. On the contrary, if the number of columns were doubled, the number of patterns processed for each row would rise from $\binom{b}{n}$ to approximately $\binom{2b}{n}$, a factor of roughly $2^n$. Thus the total computing time in the second case can rise by orders of magnitude. This limits the number of columns that can be processed by the algorithm. This analysis, however, also shows a possible recipe for tackling the problem of large inputs:

$$A \cdot x = \begin{bmatrix} A_{L} & A_{R} \end{bmatrix} \cdot \begin{bmatrix} x_{L} \\ x_{R} \end{bmatrix} = A_{L} \cdot x_{L} + A_{R} \cdot x_{R}. \tag{15}$$

If we split the matrix as shown in (15) and process both parts separately, we gain execution time at the cost of losing the patterns that cross the split boundary and of the adders necessary to add the two partial results together again. On the other hand, this process of matrix splitting can be applied recursively until acceptable runtimes are achieved, so matrices of order 1000 × 1000 or even higher can be optimized this way.
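A recursive rendering of this splitting strategy (illustrative; `optimize` stands in for the CSE-optimized evaluation of one submatrix):

```python
import numpy as np

def split_optimize(A, x, optimize, max_cols=32):
    """Split wide matrices by columns as in (15), optimize the halves
    independently, and add the partial results together.  Patterns that
    cross the split boundary are lost, and the final additions (one per
    row and split) cannot be optimized away."""
    if A.shape[1] <= max_cols:
        return optimize(A, x)
    mid = A.shape[1] // 2
    left = split_optimize(A[:, :mid], x[:mid], optimize, max_cols)
    right = split_optimize(A[:, mid:], x[mid:], optimize, max_cols)
    return left + right
```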
F. Applicability for Arbitrary CSE Tasks
Both of the previously defined algorithms can be applied (with some modification of the pattern generation function) to a matrix containing arbitrary elements, as long as these are lexically ordered, since the relations "less than" and "greater than" must be evaluated during the binary-tree construction. No other restrictions are put on the matrix elements, so these can be numbers, algebraic elements, etc., which opens a whole new field of possible applications.
V. APPLICATION OF CSE ALGORITHM
In this section, we will indicate several possible applications of the CSE algorithm for the optimization of some commonly faced design tasks. We will discuss the optimization of FIR filters (in both transposed and direct form) and linear transforms in general, as well as matrix multiplication. All those tasks are quite common in areas such as telecommunications and DSP (filters), image processing (matrix multiplication), etc. 
A. FIR Filters-Transposed Form
In this case, (6) is satisfied, the elements of the input vector being the input sample shifted by successive powers of two; thus Algorithm II can be used to optimize the multiplier block, and the scaling of the results can be implemented as hardwired shifts for free. An example of a transposed-form FIR filter optimization is given in Fig. 1 and (10)-(14).
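As an illustration of the data this step operates on, the following sketch (hypothetical tap values; a Python stand-in for the preprocessing, not the authors' implementation) recodes a coefficient set into CSD digits. Each row of the resulting matrix is one coefficient, and Algorithm II searches it for shift-invariant patterns.

```python
def to_csd(value, width):
    """Canonical signed digit recoding of a positive integer: digits in
    {-1, 0, 1}, least significant first, no two adjacent nonzero digits."""
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
            value //= 2
        else:
            d = 2 - (value % 4)        # +1 if value = 1 (mod 4), else -1
            digits.append(d)
            value = (value - d) // 2
    return digits + [0] * (width - len(digits))

# Row k of A holds the CSD digits of coefficient k, so the multiplier
# block computes y_k = sum_j A[k][j] * (x << j).
coeffs = [91, 77, 53]                  # hypothetical integer tap values
A = [to_csd(c, 8) for c in coeffs]
```

For example, 91 = 1011011 in binary (five nonzero bits, four adders) becomes the CSD string $10\bar{1}00\bar{1}0\bar{1}$ ($128 - 32 - 4 - 1$), needing only three adders/subtractors before any sharing.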
B. FIR Filters-Direct Form
A block scheme of a direct-form FIR filter is shown in Fig. 9. After the binary (CSD) expansion of the coefficients (as in the previous section), with $a_{k,j}$ denoting digit $j$ of coefficient $c_k$, the tap outputs can be calculated according to (18) and the final result according to (19):

$$y_k = c_k \cdot x(n-k) = \sum_{j} a_{k,j}\, 2^{-j}\, x(n-k) \tag{18}$$

$$y(n) = \sum_{k} y_k. \tag{19}$$

Equation (18) unfortunately cannot be represented as a multiplication-free linear transform, but it is possible to reorder it into one. Let us calculate the sums of columns instead of rows in (18). We obtain the set of equations in (20) and (21):

$$z_j = \sum_{k} a_{k,j}\, x(n-k) \tag{20}$$

$$y(n) = \sum_{j} 2^{-j}\, z_j. \tag{21}$$

This set of equations can be expressed as a multiplication-free linear transform, as shown in (22) and (23):

$$z = A^{T} \cdot x_d, \qquad x_d = \big(x(n),\, x(n-1),\, \ldots\big)^{T} \tag{22}$$

$$y(n) = \big(1,\, 2^{-1},\, 2^{-2},\, \ldots\big) \cdot z. \tag{23}$$

After this transform, it is possible to apply Algorithm I for the optimization. Unfortunately, the straightforward application of Algorithm I to (22) suffers from an important disadvantage: the number of columns of $A^{T}$ (the filter order) will typically be much larger than its number of rows (the coefficient word length), so the matrix-splitting technique would be necessary. Therefore, we use an improved version of the described method. The idea is to perform the CSE on the original bit matrix $A$ (so that the number of rows is higher than the number of columns). Let us rewrite the filter output as in (24) and (25):

$$y(n) = \sum_{k} x(n-k)\,(a_k \cdot s), \qquad s = \big(1,\, 2^{-1},\, 2^{-2},\, \ldots\big)^{T}, \tag{24, 25}$$

where $a_k$ denotes row $k$ of $A$. In the next step, a matrix split satisfying (2) and (4) is performed. Then the final sum can be rewritten as in (26):

$$y(n) = \sum_{k} x(n-k)\,\Big(a^{(0)}_k \cdot s + \sum_{i} b_{i,k}\,(p_i \cdot s)\Big), \tag{26}$$

where $a^{(0)}_k$ is row $k$ of the remainder matrix $A_0$. By reordering the sums, we obtain (27) for the output of a direct-form FIR filter:

$$y(n) = \sum_{k} x(n-k)\,\big(a^{(0)}_k \cdot s\big) + \sum_{i} (p_i \cdot s) \sum_{k} b_{i,k}\, x(n-k). \tag{27}$$

Furthermore, the intermediate results are again scalable by powers of two, so Algorithm II can be used for the optimization. A small example of direct-form FIR filter optimization is shown in Fig. 10: the pattern 11 is present twice, so an optimization requiring one adder less for implementation can be performed.
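A compact rendering of the column-sum reordering of (18)-(21), assuming fractional coefficients so that digit $j$ carries weight $2^{-j}$ (an illustrative sketch, not the authors' code):

```python
def direct_form_output(A, xs):
    """Direct-form FIR output via column sums: A[k][j] is CSD digit j of
    tap k, xs[k] is the delayed sample x(n-k).  Each column sum z_j is a
    sum of +/- delayed samples and is a candidate for sharing across
    taps; the shifts and the final summation are applied afterwards."""
    n_digits = len(A[0])
    z = [sum(row[j] * xs[k] for k, row in enumerate(A))
         for j in range(n_digits)]
    return sum(zj * 2.0 ** -j for j, zj in enumerate(z))
```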
This can be considered an improvement compared to the optimization proposed in [4] , since in [4] a method equivalent to Algorithm I for only two-nonzero bit patterns was proposed.
C. Linear Transforms and Matrix Multiplication
Many operations in DSP, communications, or image processing can be expressed in the form of a multiplication of a matrix with either a vector or a matrix. A number of applications can be directly considered a multiplication-free linear transform, where the application of Algorithm I is straightforward. These include some signal transforms (Walsh, Hadamard) or some error-correcting codes (Reed-Muller, BCH). In the case that the transform in question cannot be described as multiplication free [e.g., the discrete Fourier transform (DFT)], an algorithm based on the multiple use of MCM was introduced in [3]. This enables one to transform any linear transform into a multiplication-free one. The general linear transform can be described as in (28):

$$y = T \cdot x, \tag{28}$$

where the matrix $T$ contains arbitrary constants. The conversion algorithm can be described by the following pseudocode.
1) Minimize the number of additions necessary to compute all the products $t_{ij} \cdot x_j$ (the MCM step).
2) Rebuild the input matrix using the product instances computed in the previous step (this will create a multiplication-free linear transform).

The result is the decomposition shown in (29):

$$y = T \cdot x = B \cdot z, \qquad z = D \cdot x, \tag{29}$$

where $B$ contains only $1$, $-1$, and $0$, and the rows of $D$ produce each required product only once.
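A minimal sketch of step 2 for integer constants (hypothetical names; the adder sharing inside the MCM step of [3] is omitted, and unused products are not pruned):

```python
def to_multiplication_free(T):
    """Rebuild an integer matrix T as T = B @ D: the rows of D produce
    one product c * x_j per distinct magnitude c and input j (the
    products the MCM step would compute), and B, containing only -1, 0,
    and 1, selects and signs them."""
    n = len(T[0])
    consts = sorted({abs(t) for row in T for t in row if t})
    pairs = [(c, j) for c in consts for j in range(n)]
    index = {pair: i for i, pair in enumerate(pairs)}
    D = [[c if k == j else 0 for k in range(n)] for c, j in pairs]
    B = [[0] * len(D) for _ in T]
    for r, row in enumerate(T):
        for j, t in enumerate(row):
            if t:
                B[r][index[(abs(t), j)]] = 1 if t > 0 else -1
    return B, D
```

By construction, $(B \cdot D)_{rj} = \operatorname{sign}(t_{rj})\,|t_{rj}| = t_{rj}$, so the product is preserved while all remaining multiplications sit in $D$.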
To demonstrate the previously described technique, we will use an 8-point DFT. Since the DFT has a complex transform kernel, it must be split into its real and imaginary parts first, according to (30):

$$e^{-j 2\pi nk/N} = \cos(2\pi nk/N) - j \sin(2\pi nk/N), \qquad T = T_{re} - j\,T_{im}. \tag{30}$$
The resulting real and imaginary kernels $T_{re}$ and $T_{im}$ are shown in (31):

$$[T_{re}]_{nk} = \cos(2\pi nk/8), \qquad [T_{im}]_{nk} = \sin(2\pi nk/8), \qquad n, k = 0, \ldots, 7, \tag{31}$$

so all their entries lie in $\{0, \pm c, \pm 1\}$ with $c = \cos(\pi/4) \approx 0.7071$.
The transformation into a multiplication-free linear transform will be demonstrated only on the real-part kernel $T_{re}$, since its application to the imaginary part is equivalent. The resulting multiplication-free linear transform is shown in (32):

$$T_{re} \cdot x = B_{re} \cdot z, \qquad z = D_{re} \cdot x, \tag{32}$$

where $B_{re}$ and $D_{re}$ are defined as in (29). The matrix $B_{re}$ is a multiplication-free linear transform kernel and as such can be subjected to CSE optimization by Algorithm I.
VI. EXPERIMENTAL RESULTS
Both algorithms were developed in C and run on an HP-RISC workstation. First, the performance was tested on randomly generated data. In test I, Algorithm I was run on a set of matrices containing only zeros and ones, with dimensions ranging from 16 × 8 up to 256 × 32. To evaluate the performance, we give the values of the computational efforts before and after CSE ($E_b$ and $E_a$, respectively) and the optimization ratio $R$ as well; the time values give the algorithm runtimes. The results are shown in Table II. The same type of test was also performed for Algorithm II, with the results shown in Table III. For these tests, the numbers of shifts before and after optimization are given as well, and the corresponding improvement ratio was calculated. The results indicate that the methods work better when more potential subexpressions are present (the ratios rise with the matrix dimensions). The sudden increase of $R$ in the cases of 128 × 8 and 256 × 8 can be explained by the fact that, once the number of rows is raised high enough relative to the number of distinct possible rows, whole rows start to repeat frequently, which leads to this abrupt increase of the optimization ratio. These tests also confirm the statement that an increase in the number of columns is critical with respect to the algorithm runtimes.
To test the splitting technique for large matrices, we ran both algorithms again on random matrices with dimensions 256 × 256, 512 × 512, and 1024 × 1024. The Col variable gives the number of column blocks into which the processed matrix was split, and the number of nonoptimizable adders due to the splitting is reported as well; these are the adders used to add the submatrices together afterwards. The results are shown in Table IV and prove the viability of such an approach. It is also obvious that, in the case of splitting into submatrices with a small enough number of columns, the effect described previously (the repeating of rows) occurs again. This causes very high optimization ratios of the processed submatrices, so the adders that are unoptimizable due to the splitting become a dominant part of the total number of adders after optimization. Last, the results of the optimization of a Hadamard matrix of order 1024 × 1024 with different numbers of split columns are shown in Table V. The reason for the very high optimization ratios is the regular structure of the Hadamard matrix.
The second set of experiments consisted of the optimization of some real-life structures. First, transposed-form FIR filters were optimized by CSE and synthesized by means of the SYNOPSYS Design Compiler. In order to make future comparison of our results easier, we have chosen the three filters published in [2] (examples 1 and 2) and [8] (example 1). The filters in [2] had already been subject to nonzero-bit minimization; we optimized the coefficients of the remaining filter in the same way prior to CSE processing, and the optimized coefficients are given in Table VI. This optimization already results in a relatively small number of adders in the multiplication block before CSE (an average of 0.85, 1.9, and 2.4 adders per tap coefficient in the three filters, respectively). The optimization results are in Table VII. The structure was compiled into a MIETEC 0.5-μm CMOS library and optimized for area only. The area figures are divided into combinatorial area (part of which is subject to CSE), sequential area, and total area; the figures are given in equivalents of inverter gates. An interesting observation is that for a smaller filter, the area of the multiplication block becomes insignificant compared to the remaining registers and adders in the accumulator block (see Fig. 8), so the effect of the CSE optimization is not significant, especially if optimized CSD coefficients are used (a similar conclusion is stated in [6]). The experiments performed on a direct-form FIR filter showed a reduction of the multiplication block equivalent to that of the transposed-form FIR filter.
The third experiment was the optimization of a DFT to test the linear-transform optimization technique. We performed CSE optimization on real and imaginary kernels and , as defined in (30)-(32). The results are shown in Table VIII for DFT8, DFT16, and DFT32 (for DFT32, also matrix splitting into two columns was used). 
VII. RELATED WORK COMPARISON
The optimization of transposed-form FIR filters by means of CSE has already been discussed by several authors. However, an exact comparison is not easy to make, since all authors performed their experimental testing on different inputs. We tried to test our method on the same (or at least equivalent) data to obtain some estimate of the relative algorithm performance. In [7] and [6], similar graph synthesis algorithms were used. Since [6] is an improvement over [7], we made the comparison with [6], where experimental tests were performed on two FIR filters taken from [8] (examples 2 and 3). The results are shown in Table IX and are practically identical (there is a one-adder difference in the first filter). A possible explanation is that the graph synthesis algorithm is capable of finding redundancies unidentifiable by the common subexpression elimination technique (for example, see Fig. 11).
In [4] , a method similar to ours based on the identification of 2-bit common subexpressions was proposed. Actually, this work can be considered an extension of [4] . However, starting an elimination of patterns with a higher number of nonzero bits than just two should give better results, as shown in Table X , where optimization results of the input set from Table II are used with respect to the input parameter . For the FIR filters optimization, however, this is not a very important parameter, since the filter coefficients usually have only a small number of nonzero bits, which significantly reduces the chance that many -nonzero bits expressions with would be identified. The work [5] was tested on 23 random coefficients quantized into 32 bits with an average improvement of the adder count by a factor of two. This seems to be a similar result to the one obtained in [4] or here in the case of , and the additional criterion based on the timing is of interest in the case where hardware (HW) implementation is targeted.
Last, in [3] , the adder count improvement on the set of reallife filters of orders and ranged from 1.36 to 1.46 . The values obtained for real filters by CSE were significantly higher (from 1.78 to 2.5). On the other hand, the number of shifts obtained by the bipartite matching algorithm used in [3] was much lower compared to the values from CSE optimization.
VIII. CONCLUSION
In this paper, a novel algorithm to solve the multiple constant multiplication problem, i.e., the optimization of the multiplication of a variable by a set of constants, was proposed. It is based on the common subexpression elimination technique and combines an exhaustive search for multiple-pattern identification with a steepest descent approach for pattern selection. The results show a significant reduction in the arithmetic operations, and thus in the hardware necessary to implement them, combined with satisfactory runtimes. The method can be considered an extension of the 2-bit pattern optimization technique presented in [4], since the proposed method places no such restriction on the patterns.
Comparison with related work based on the available data shows that our method yields comparable or better results in FIR filter optimization. Its major advantage is a general concept that does not restrict the use of the presented technique to the tasks proposed in this paper (FIR filter design and linear-transform optimization).
