Abstract-We present a computation reduction method for obtaining low-complexity, parallel, multiplierless implementations of digital FIR filters, exploiting shift inclusive differential (SID) coefficients and common subexpression elimination (CSE). We introduce a new directed multigraph to represent the design space, which is greatly expanded by the use of SID coefficients, and employ a graph-theoretic algorithm to explore it efficiently. Further, we propose a novel CSE method, applied to the design space represented by the graph, which recursively eliminates 2-bit subexpressions with a steepest descent approach for subexpression selection. Compared with conventional multiplierless implementation, up to 75% reduction in the number of additions is achieved. In comparison to a recently reported CSE method, based on available data, our approach achieves an improvement of up to 19%.
I. INTRODUCTION
Future battery-powered wireless communication systems are expected to provide higher data rates with improved energy efficiency. This has made low-power, high-performance digital signal processing (DSP) an important research area. Low-complexity design, which aims at reducing the number of certain basic operations (e.g., addition) for a given DSP task, is an attractive approach. The resultant complexity reduction can potentially improve the processing speed of a DSP algorithm while achieving high energy efficiency by removing energy-consuming operations. In addition, low-complexity design generally leads to smaller chip area and thus lower cost.
FIR filtering with a set of fixed coefficients is widely used in many DSP and communication applications. In the transposed direct form of an FIR filter, shown in Fig. 1, the input data x(n) is simultaneously multiplied by the set of M filter coefficients $\mathbf{c} = [c_0, c_1, \ldots, c_{M-1}]^T$, where M is the filter length. Complexity reduction on these multiplications in the multiplication network can lead to significant improvements in design parameters such as speed, area or power dissipation. Since multiplication by a constant can be replaced by shifts and additions/subtractions, and in a dedicated fully parallel implementation a shift can be realized simply by wiring, a considerable amount of research has been devoted to reducing the number of additions, thereby leading to low-complexity designs. Common subexpression elimination (CSE) has been extensively studied and various algorithms have been proposed [1]-[4]. One common feature of CSE methods is to identify common bit-patterns in the set of coefficients and to share the identified common subexpressions to reduce the number of additions.
This research was supported in part by SRC (1122.001).
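As a small illustration of replacing a constant multiplication by shifts and additions/subtractions, the following Python sketch evaluates a product from a signed-digit representation of the constant and counts the additions used. The function name and the digit-list convention (most significant digit first, digits in {-1, 0, 1}) are assumptions made for this example, not notation from the paper.

```python
def shift_add_multiply(x, digits):
    """Multiply integer x by the constant whose signed digits (MSB first,
    each in {-1, 0, 1}) are given, using only shifts and adds/subtracts.
    Returns (product, number_of_additions)."""
    acc = 0
    additions = 0
    first = True
    n = len(digits)
    for pos, d in enumerate(digits):
        if d == 0:
            continue
        shift = n - 1 - pos      # digit weight is 2**shift; a shift is wiring in hardware
        term = x << shift
        if first:
            # the first nonzero digit needs no adder
            acc = term if d > 0 else -term
            first = False
        else:
            acc = acc + term if d > 0 else acc - term
            additions += 1
    return acc, additions


# 7 = 8 - 1, written as the signed-digit string 1 0 0 -1
product, adders = shift_add_multiply(5, [1, 0, 0, -1])
```

With the digits of 7 written as 8 - 1, the product 7x needs one subtraction instead of the two additions required by the plain binary form 111.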
The use of shift inclusive differential (SID) coefficients was proposed in [5]. There, with the aid of a graph representation, the search for low-complexity solutions is mapped into a weighted minimum set cover problem, and a heuristic algorithm based on a greedy approach is given.
In this paper, we reformulate the idea of SID coefficients by introducing a new graph representation and mapping the optimization problem into the equivalent problem of determining a directed minimum spanning tree (DMST) of a directed multigraph, which is then solved by an optimal graph-theoretic algorithm. To achieve further complexity reduction, we propose a CSE method which recursively eliminates 2-bit subexpressions with a steepest descent approach for subexpression selection.
II. A REFORMULATION OF THE IDEA OF SHIFT INCLUSIVE DIFFERENTIAL COEFFICIENT

A. A new graph representation
The key idea behind the use of shift inclusive differential (SID) coefficients is that the result of $c_i x(n)$ can be reused for computing $c_j x(n)$. Note that if $c_i x(n)$ has already been computed, we can obtain $2^L c_i x(n)$ by simple wiring. Thus we can express $c_j x(n) = 2^L c_i x(n) + (c_j - 2^L c_i) x(n)$, where $c_j - 2^L c_i$ is an SID coefficient. The design space is represented by a directed multigraph G whose vertices are the coefficients $c_0, \ldots, c_{M-1}$ together with a virtual vertex vn: the edge directed from vn to $c_i$ corresponds to the original coefficient $c_i$, and each edge directed from $c_i$ to $c_j$ corresponds to an SID coefficient $c_j - 2^L c_i$ for one value of L. In [5], cases with L < 0 are not considered, since they may lead to re-quantization or increase the actual word-length of the intermediate computations. However, that argument is not generally true. For example, for an original coefficient $c_i$ = 000101000, there is no problem with considering right shifts with L = −1, −2, or −3. Hence L < 0 is also considered in our proposed method when the trailing bits of an original coefficient are zeros. There are two clear distinctions between our graph representation and that of [5]: (1) the introduction of the virtual vertex automatically takes the original coefficients into consideration during our later optimizations; (2) L < 0 is considered. This makes our graph representation more comprehensive. A graph representation of a 4-tap filter is illustrated in Fig. 2. In the following subsection, an efficient graph-theoretic algorithm is presented to explore the low-complexity solutions.
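The shift rule described above can be sketched in Python as follows. The sketch enumerates, for one ordered pair of integer coefficients, the shifts L and the resulting SID coefficients, admitting right shifts only while they discard trailing zero bits. The function names and the `max_left_shift` bound are assumptions introduced for illustration; the text does not state how the range of L is limited.

```python
def trailing_zeros(c):
    """Number of trailing zero bits of a nonzero integer coefficient."""
    c = abs(c)
    count = 0
    while c % 2 == 0:
        c //= 2
        count += 1
    return count


def sid_candidates(c_i, c_j, max_left_shift):
    """Yield (L, sid) pairs with sid = c_j - 2**L * c_i.

    Right shifts (L < 0) are allowed only while they drop zero bits of c_i,
    so the shifted value is always exact. max_left_shift is an assumed bound.
    """
    if c_i == 0:
        return
    for L in range(-trailing_zeros(c_i), max_left_shift + 1):
        shifted = c_i << L if L >= 0 else c_i >> -L
        yield L, c_j - shifted


# c_i = 000101000 (decimal 40) has three trailing zeros, so L = -1, -2, -3 are usable
candidates = dict(sid_candidates(0b000101000, 11, 1))
```

For this pair the shift L = −2 gives the SID coefficient 11 − 10 = 1, which is cheaper to realize than any choice with L ≥ 0.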
B. Algorithm SID DMST
For each edge of graph G, we assign a weight that represents the implementation cost incurred when the corresponding original or SID coefficient is used. In this work, we focus on minimizing the number of additions at a high level of abstraction, so the implementation cost is quantified as the number of required additions. Canonic signed digit (CSD) number representation [9] is used, although our proposed approach does not depend on any specific number representation. The weight of each edge of G is defined as follows. For the edge directed from vn to each $c_i$, for $0 \le i \le M-1$, the weight is the number of additions required for computing $c_i x(n)$, which is one less than the number of nonzero bits of the CSD representation of $c_i$ if $c_i$ is nonzero. The weight is zero if $c_i$ is zero. For each edge associated with an SID coefficient $c_j - 2^L c_i$, the weight equals one plus the number of additions required for computing $(c_j - 2^L c_i) x(n)$; if the SID coefficient is zero, i.e., $c_j = 2^L c_i$, then the weight of the corresponding edge is zero. The optimization problem of constructing a possible implementation with the minimum number of additions is to find an acyclic subset of edges in E that connects all of the vertices in V such that the sum of the weights of these edges is minimized. In [5], this optimization problem is formulated as a weighted minimum set cover problem and a heuristic algorithm based on a greedy approach is developed.
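The edge weights can be illustrated with a short Python sketch that converts an integer to CSD form and counts additions. The function names are hypothetical, and the SID weight follows the reading that a zero SID coefficient costs nothing while a nonzero one costs one addition plus the additions for its own product.

```python
def to_csd(n):
    """Canonic signed digit form of integer n, least significant digit first."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            d = 2 - (n % 4)      # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        digits.append(d)
        n //= 2
    return digits


def nonzero_digits(n):
    """Number of nonzero digits in the CSD form of n."""
    return sum(1 for d in to_csd(n) if d != 0)


def original_edge_weight(c):
    """Additions for c*x(n): nonzero CSD digits minus one, or 0 when c = 0."""
    return max(nonzero_digits(c) - 1, 0)


def sid_edge_weight(sid):
    """One addition to combine with the shifted product plus the cost of
    sid*x(n); zero when the SID coefficient itself is zero."""
    return 0 if sid == 0 else 1 + original_edge_weight(sid)
```

For example, 7 has the CSD form 8 − 1, so an edge from the virtual vertex to a coefficient 7 has weight 1, while an SID coefficient of 1 gives an edge of weight 1 and an SID coefficient of 0 gives an edge of weight 0.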
In this work, we note that the optimization problem is equivalent to determining a directed minimum-weight spanning tree (DMST) of the directed multigraph G [6]. We present an optimal graph-theoretic algorithm to solve this problem as follows, and denote the resulting complexity reduction algorithm as Algorithm SID DMST.
We construct a directed graph $G_1 = (V_1, E_1)$, where the vertex set $V_1 = \{0, 1, 2, \ldots, M\}$ and the edges in $E_1$ are defined as follows. We denote an edge of $G_1$ directed from vertex i to j by (i, j) and its weight by $w_{ij}$. For each ordered pair of vertices (i, j) with $0 \le i, j \le M-1$ and $i \ne j$, if there is an edge directed from $c_i$ to $c_j$ in G, there is exactly one edge (i, j) in $G_1$, and its weight is the minimum of the weights of all edges directed from $c_i$ to $c_j$ in G. For each edge directed from the virtual vertex vn to $c_i$ in G, there is one edge (M, i), and its weight is the same as that of the edge directed from vn to $c_i$.
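The reduction from the multigraph to a simple directed graph amounts to keeping the cheapest of each bundle of parallel edges. A minimal Python sketch is given below; the function name, the tuple layout of the edges and the returned `choice` map (which remembers the shift L of the kept edge) are assumptions made for this example.

```python
def collapse_multigraph(M, original_weights, sid_edges):
    """Build the edge weights of the simple graph from the multigraph.

    original_weights[i]: weight of the edge from the virtual vertex to c_i.
    sid_edges: iterable of (i, j, L, weight) for edges c_i -> c_j.
    Returns (w, choice): w[(i, j)] is the minimum parallel-edge weight and
    choice[(i, j)] the shift L attaining it. Vertex M is the virtual vertex.
    """
    w, choice = {}, {}
    for i, j, L, weight in sid_edges:
        if (i, j) not in w or weight < w[(i, j)]:
            w[(i, j)] = weight
            choice[(i, j)] = L
    for i in range(M):
        w[(M, i)] = original_weights[i]
    return w, choice


# two coefficients; two parallel edges 0 -> 1 (L = 0 and L = -1), one edge 1 -> 0
w, choice = collapse_multigraph(2, [3, 2], [(0, 1, 0, 4), (0, 1, -1, 1), (1, 0, 1, 2)])
```

Recording the chosen shift alongside the minimum weight makes it straightforward to map a tree found on the simple graph back to specific edges of the multigraph.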
Assuming we have determined a DMST $T_1$ of $G_1$, we can construct a DMST T of G from it as follows. For each edge (i, j) of $T_1$ with $i \ne M$, we select an edge of G directed from $c_i$ to $c_j$ with weight $w_{ij}$. For each edge (M, j), we select the edge directed from vn to $c_j$ in G. Let these selected edges form the set S. Graph T = (V, S) is a spanning tree of G due to the one-to-one correspondence between the vertices of G and those of $G_1$. We now show that there is no spanning tree of G with a total weight smaller than that of T. For convenience, we denote the total weight of all edges of a graph H by tw(H). By construction, $tw(T_1) = tw(T)$.
Suppose there is a spanning tree $T'$ of G with a smaller total weight than T, i.e., $tw(T) > tw(T')$. We can construct a spanning tree $T_1'$ of $G_1$ by mapping each edge of $T'$ directed from $c_i$ to $c_j$ into the edge (i, j) of $G_1$ and each edge directed from vn to $c_j$ into the edge (M, j), for $0 \le i, j \le M-1$. Due to the way we constructed the graph $G_1$, $tw(T_1') \le tw(T') < tw(T) = tw(T_1)$, which contradicts the fact that $T_1$ is a DMST of $G_1$. Hence, by contradiction, graph T as constructed above is a DMST of G.
There exists an optimal graph-theoretic algorithm for finding a DMST of $G_1$ [7]. For a directed graph with n vertices and m edges, an implementation of the algorithm which runs in O(m log n) time was presented in [8]. In finding a DMST of $G_1$, we always select vertex M as the root, which implies that in the DMST of G constructed from the DMST of $G_1$, the virtual vertex vn is the root.
As will be evident later, this provides us great convenience in constructing the implementation structure of the multiplication network of a FIR filter.
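To make the DMST step concrete, the following Python sketch finds a minimum-weight spanning arborescence rooted at a given vertex. It uses the classical Chu-Liu/Edmonds contraction scheme in a simple, non-optimized form; this choice, the function name and the edge-list format are assumptions of the example, and the sketch is not the O(m log n) implementation referenced in [8]. It assumes a spanning arborescence exists, which holds here because the root has an edge to every coefficient vertex.

```python
def dmst(n, root, edges):
    """Indices of edges forming a minimum spanning arborescence rooted at root.

    Vertices are integers below n; edges is a list of (u, v, weight).
    """
    # cheapest incoming edge for every non-root vertex
    best = {}
    for idx, (u, v, w) in enumerate(edges):
        if u != v and v != root and (v not in best or w < edges[best[v]][2]):
            best[v] = idx
    # follow the chosen incoming edges backwards to detect a cycle
    cycle = None
    for start in best:
        path, seen = [], set()
        v = start
        while v in best and v not in seen:
            seen.add(v)
            path.append(v)
            v = edges[best[v]][0]
        if v in seen:
            cycle = path[path.index(v):]
            break
    if cycle is None:
        return sorted(best.values())
    # contract the cycle into one new vertex with id n and recurse
    cyc = set(cycle)
    new_edges, origin = [], []
    for idx, (u, v, w) in enumerate(edges):
        if u in cyc and v in cyc:
            continue
        if v in cyc:
            new_edges.append((u, n, w - edges[best[v]][2]))
        elif u in cyc:
            new_edges.append((n, v, w))
        else:
            new_edges.append((u, v, w))
        origin.append(idx)
    chosen = [origin[k] for k in dmst(n + 1, root, new_edges)]
    # the edge entering the cycle replaces the cycle edge into the same vertex
    entered = next(edges[k][1] for k in chosen
                   if edges[k][1] in cyc and edges[k][0] not in cyc)
    chosen += [best[v] for v in cycle if v != entered]
    return sorted(chosen)


# vertices 0 and 1 are coefficients, vertex 2 plays the role of the virtual root
example_edges = [(2, 0, 5), (2, 1, 4), (0, 1, 1), (1, 0, 1)]
tree = dmst(3, 2, example_edges)
total = sum(example_edges[k][2] for k in tree)
```

In this example the two cheapest incoming edges form a cycle between vertices 0 and 1, so the cycle is contracted and the cheaper way of entering it from the root is selected.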
III. INCORPORATING COMMON SUBEXPRESSION ELIMINATION
The edges of graph G, corresponding to either original or SID coefficients, may have some common subexpressions. It is then natural to ask whether we can identify certain common subexpressions and reduce computational complexity through subexpression sharing. This leads us to consider using CSE for further complexity reduction. However, existing CSE methods such as those proposed in [1]-[4] are not directly applicable, since they are applied to a set of fixed and known coefficients, i.e., the set of original coefficients. In this work, we apply CSE to a design space represented by a directed multigraph G. Each spanning tree of G corresponds to a possible filter implementation, and its edges form a set of coefficients, either original or SID. Hence we need to find a set of subexpressions and a spanning tree of G that together minimize the hardware complexity.
One way is to enumerate all the spanning trees of G and apply a conventional CSE algorithm to each of them, since the coefficients in each spanning tree of G are fixed and known. The spanning tree with the lowest complexity can then be chosen for implementation. However, the number of spanning trees of G is very large and increases exponentially with the number of filter coefficients. Thus it is computationally prohibitive to enumerate all the spanning trees of G.
In this work we propose a CSE method which recursively eliminates 2-bit subexpressions with a steepest descent approach for subexpression selection.
A 2-bit subexpression is defined as a bit-pattern with exactly two nonzero bits, where a nonzero bit is an element of the set {1, −1, 2, −2, 3, −3, ...}. In eliminating each occurrence of the subexpression, we replace the second of the two nonzero bits making up the subexpression by an integer k or −k, and the first of the two nonzero bits by zero. In the sequel, we refer to this way of elimination as elimination by k. We use k = 2 to represent the first eliminated subexpression, k = 3 for the second eliminated subexpression, and so on. If the occurrence is exactly the same as the subexpression, it is replaced by k. If the occurrence is the complement of the subexpression, it is replaced by −k. This way of elimination is best demonstrated by an example. Consider three coefficients, either original or SID, in the CSD format: $a_1$ = 10101001, $a_2$ = 1̄01̄01̄000 and $a_3$ = 000001̄01̄, where 1̄ represents −1. Suppose at the first iteration the bit-pattern 101 is selected for elimination. Then after elimination by 2, $a_1$ = 00201001, $a_2$ = 002̄01̄000 and $a_3$ = 00000002̄, where 2̄ represents −2. In the second iteration, the bit-pattern 201 is selected, and after elimination by 3, $a_1$ = 00003001, $a_2$ = 00003̄000 and $a_3$ = 00000002̄, where 3̄ represents −3.
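The elimination by k can be sketched in Python on signed-digit lists. The representation of a pattern as a triple (first digit, distance, second digit), the requirement that the digits between the two nonzero bits are zero, and the left-to-right scan are assumptions of this example. The signs in the three digit lists are likewise an assumed reading of the example coefficients, chosen so that the stated results with −2 and −3 follow.

```python
def eliminate(digits, pattern, k):
    """Eliminate a 2-bit subexpression from a signed-digit list (MSB first).

    pattern = (d1, gap, d2): nonzero digits d1 and d2 that lie gap positions
    apart with zeros in between. A match becomes 0 ... k, a complement match
    becomes 0 ... -k. Returns (new_digits, number_of_occurrences).
    """
    out = list(digits)
    d1, gap, d2 = pattern
    count = 0
    for p in range(len(out) - gap):
        if any(out[p + 1:p + gap]):
            continue                     # digits between the pair must be zero
        pair = (out[p], out[p + gap])
        if pair == (d1, d2):
            out[p], out[p + gap] = 0, k
            count += 1
        elif pair == (-d1, -d2):
            out[p], out[p + gap] = 0, -k
            count += 1
    return out, count


a1 = [1, 0, 1, 0, 1, 0, 0, 1]
a2 = [-1, 0, -1, 0, -1, 0, 0, 0]
a3 = [0, 0, 0, 0, 0, -1, 0, -1]

# first iteration: pattern 101 eliminated by 2
step1 = [eliminate(a, (1, 2, 1), 2)[0] for a in (a1, a2, a3)]
# second iteration: pattern 201 eliminated by 3
step2 = [eliminate(a, (2, 2, 1), 3)[0] for a in step1]
```

The occurrence count returned for each coefficient is the amount by which the corresponding edge weight would be decreased.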
Since many different subexpressions occur in each iteration, a criterion must be used to select the subexpression for elimination. In conventional CSE algorithms, the subexpression with the highest frequency is usually chosen, since the complexity reduction achieved by eliminating a subexpression is directly related to its frequency. However, in our case, the frequency of occurrence of a subexpression over all the edges of graph G is usually not a good measure of the complexity reduction that its elimination would achieve. This is mainly because graph G has many edges, of which only a small portion will be chosen to form a spanning tree of G. Adopting a steepest descent strategy, in each iteration, among all the subexpressions that occur at least twice, we choose the subexpression whose elimination leads to the greatest complexity reduction. This is elaborated as follows. In each iteration, we first search for all 2-bit subexpressions which occur at least twice, and put them into a set $SE = \{s_1, s_2, s_3, \ldots\}$. If we cannot find any 2-bit subexpression occurring at least twice, i.e., the set SE is empty, the CSE process is terminated. Otherwise, for each subexpression $s_i$, let graph $H_i = G$ and eliminate $s_i$ in all edges of $H_i$ as described above, update the weights of all edges of $H_i$ by decreasing each edge weight by the number of occurrences of $s_i$ in the edge, determine the DMST of $H_i$ as described in Section II, and calculate the complexity reduction, in terms of the number of additions, that results from eliminating $s_i$. Let $s_n$ be the subexpression that yields the greatest reduction. If this greatest reduction is not positive, meaning that there is no reduction through subexpression elimination, terminate the CSE process. Otherwise, put subexpression $s_n$ into a table denoted as CSE table, update G by letting $G = H_n$, and go into the next iteration with the updated graph G. The resulting algorithm is named Algorithm SID CSE and is summarized in Fig. 3. CSE table contains the subexpressions that have been eliminated, and |CSE table| denotes the number of subexpressions in CSE table.
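The candidate search and the steepest descent choice in one iteration can be sketched as follows. This is a simplified illustration under stated assumptions: patterns are formed from adjacent nonzero digits only, a pattern and its complement are counted together, overlapping occurrences are each counted, and `reduction` is a hypothetical callback standing in for the step that eliminates a candidate in a copy of the graph, recomputes the DMST and returns the saving in additions.

```python
from collections import Counter


def candidate_subexpressions(coefficients):
    """Return {pattern: count} for 2-bit patterns that occur at least twice.

    coefficients: signed-digit lists (MSB first). A pattern is a triple
    (d1, gap, d2), normalized so that its first digit is positive.
    """
    counts = Counter()
    for digits in coefficients:
        nonzero = [p for p, d in enumerate(digits) if d != 0]
        for a, b in zip(nonzero, nonzero[1:]):   # adjacent nonzero digits
            d1, d2 = digits[a], digits[b]
            if d1 < 0:
                d1, d2 = -d1, -d2                # fold complements together
            counts[(d1, b - a, d2)] += 1
    return {pat: c for pat, c in counts.items() if c >= 2}


def pick_steepest(candidates, reduction):
    """Candidate with the largest reduction, or None when the candidate set is
    empty or no candidate gives a positive reduction (the stopping rule)."""
    best = max(candidates, key=reduction, default=None)
    if best is None or reduction(best) <= 0:
        return None
    return best


coeffs = [
    [1, 0, 1, 0, 1, 0, 0, 1],
    [-1, 0, -1, 0, -1, 0, 0, 0],
    [0, 0, 0, 0, 0, -1, 0, -1],
]
cands = candidate_subexpressions(coeffs)
```

Selecting by the callback rather than by the raw count reflects the point made above: a frequent pattern is only useful if it occurs on edges that end up in the spanning tree.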
The final outcome of Algorithm SID CSE is a set of subexpressions contained in CSE table and a DMST of G. Since the DMST of graph G corresponds to a low-complexity implementation of the multiplication network of the filter, we define it as an implementation tree. Note that the root of the implementation tree is the virtual vertex vn. Hence, for a vertex $c_i$ of the implementation tree, if its parent is vn, then $c_i x(n)$ is implemented as it is, i.e., using the original coefficient. Otherwise, if the parent of $c_i$ is $c_j$, then $c_i x(n)$ is implemented using the SID coefficient $c_i - 2^L c_j$ corresponding to the specific edge directed from $c_j$ to $c_i$ in the implementation tree. Thus an implementation structure of the filter can be readily derived from the implementation tree.
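How an implementation tree translates into computations can be illustrated with a small Python sketch that evaluates every product by walking from each vertex towards the root. The function name and the `parent` encoding are assumptions of this example, and the plain multiplications stand in for the shift-and-add networks that would realize the original and SID coefficients in hardware.

```python
def evaluate_tree(x, coeffs, parent):
    """Compute c_i * x for all i from an implementation tree.

    parent[i] is None when c_i is attached to the virtual vertex, otherwise
    (j, L), meaning c_i*x = 2**L * (c_j*x) + (c_i - 2**L * c_j) * x.
    """
    products = {}

    def product(i):
        if i not in products:
            if parent[i] is None:
                products[i] = coeffs[i] * x      # original coefficient
            else:
                j, L = parent[i]
                base = product(j)                # reuse the parent's product
                shifted = base << L if L >= 0 else base >> -L
                sid = coeffs[i] - (coeffs[j] << L if L >= 0 else coeffs[j] >> -L)
                products[i] = shifted + sid * x
        return products[i]

    return [product(i) for i in range(len(coeffs))]


# c_0 = 40 uses its original coefficient; c_1 = 11 and c_2 = 80 reuse c_0 * x
results = evaluate_tree(3, [40, 11, 80], [None, (0, -2), (0, 1)])
```

Here the product for the coefficient 80 costs no addition at all, because its SID coefficient with respect to 40 and L = 1 is zero.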
IV. NUMERICAL RESULTS
We first take 12 example linear-phase filters with filter lengths ranging from 21 to 161 and a wordlength of 16. The filter types include equi-ripple, least-square, low-pass and band-pass. Three techniques are considered: simple CSD implementation, where the filter coefficients are encoded in CSD format, SID DMST and SID CSE. The complexity in terms of the number of additions is shown in Fig. 4. In comparison to the simple CSD implementation (CSD), algorithms SID DMST and SID CSE achieve 44%-69% reduction and 53%-73% reduction, respectively.

Fig. 3. Algorithm SID CSE: Construct G and find a DMST T of G; cost = tw(T). Set CSE ... Output implementation tree as T, the number of additions as cost, and CSE table.

Our proposed method is also applicable to general multiple constant multiplication (MCM) operations as defined in [1], since the multiplication network of an FIR filter performs exactly an MCM operation. To test the effectiveness of our proposed methods on general MCM operations, we applied them to random vectors with lengths ranging from 20 to 160; the results are shown in Fig. 5. Compared with simple CSD implementation (CSD), the SID DMST algorithm leads to 45%-60% reduction, while the SID CSE algorithm leads to 65%-75% reduction. This demonstrates that our proposed methods are effective for general MCM operations.

Table I. Number of additions for the filters S1, S2, L1, L2 and L3 of [4].
Filter    M    W   CSD   [4]   SID DMST   SID CSE
S1       25    9    11     6          6         6
S2       60   14    57    32         29        26
L1      121   17   145    58         61        51
L2       63   13    49    23         24        22
L3       36   11    16     5          5         5
Paško et al. proposed a CSE algorithm in [4], which achieves comparable or better results than the other CSE methods proposed in [1]-[3]. We compare our proposed method with Paško's algorithm based on the data included in [4]. We apply our methods to the filters denoted as S1, S2, L1, L2 and L3 in [4]. The results are shown in Table I, where M denotes the filter length and W denotes the wordlength of the filter coefficients. For all the filters, the proposed SID CSE algorithm yields similar or better results. In particular, for filters L1 and S2, improvements of 12% and 19% are achieved, respectively. The improvement can be attributed to the fact that, in our proposed SID CSE algorithm, common subexpression elimination is performed over a design space that has been greatly expanded through the use of SID coefficients, and that this expanded design space is represented by a directed multigraph and explored by an efficient graph-theoretic algorithm.
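The quoted percentages can be checked with a few lines of Python. The values below are read from Table I under the assumption that, in each row, the fourth numeric entry is the result of [4] and the last entry is the SID CSE result; this reading reproduces the 12% and 19% figures stated above. The variable and function names are introduced only for this check.

```python
# filter: (additions reported for [4], additions with SID CSE), assumed column reading
table = {
    "S1": (6, 6),
    "S2": (32, 26),
    "L1": (58, 51),
    "L2": (23, 22),
    "L3": (5, 5),
}


def improvement(reference, proposed):
    """Relative reduction in additions, in percent."""
    return 100.0 * (reference - proposed) / reference


percent = {name: round(improvement(ref, new)) for name, (ref, new) in table.items()}
```

Under this reading, no filter in the table needs more additions with SID CSE than with the method of [4].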
V. CONCLUSIONS
We present a novel low-complexity design method for digital FIR filters. We reformulate the idea of SID coefficients by introducing a new graph representation and employing an efficient graph-theoretic algorithm. Further, we propose a CSE method which recursively eliminates 2-bit subexpressions with a steepest descent approach. Compared with conventional multiplierless implementation, up to 75% reduction in the number of additions is achieved. In comparison with a recently reported CSE method, based on available data, our approach achieves an improvement of up to 19%.
