This work presents two novel optimization methods based on integer linear programming (ILP) that minimize the number of adders used to implement a direct/transposed finite impulse response (FIR) filter adhering to a given frequency specification. The proposed algorithms work by either fixing the number of adders used to implement the products (multiplier block adders) or by bounding the adder depth (AD) used for these products. The latter can be used to design filters with minimal AD for low power applications. In contrast to previous multiplierless FIR approaches, the methods introduced here ensure adder count optimality. To demonstrate their effectiveness, we perform several experiments using established design problems from the literature, showing superior results.
Design of Optimal Multiplierless FIR Filters Martin Kumm,Anastasia Volkova, and Silviu-Ioan Filip
Abstract-This work presents two novel optimization methods based on integer linear programming (ILP) that minimize the number of adders used to implement a direct/transposed finite impulse response (FIR) filter adhering to a given frequency specification. The proposed algorithms work by either fixing the number of adders used to implement the products (multiplier block adders) or by bounding the adder depth (AD) used for these products. The latter can be used to design filters with minimal AD for low power applications. In contrast to previous multiplierless FIR approaches, the methods introduced here ensure adder count optimality. To demonstrate their effectiveness, we perform several experiments using established design problems from the literature, showing superior results.
Index Terms-FIR filters, multiplierless implementation, ILP optimization, MCM problem, etc.
I. INTRODUCTION
F IR filters are fundamental building blocks in digital signal processing (DSP). They provide strict stability and phase linearity, enabling many applications. However, their flexibility typically comes at the expense of a large number of multiplications, making them compute-intensive. Hence, many attempts have been made in the last four decades to avoid costly multiplications and to implement FIR filters in a multiplierless way [1] - [20] .
One of the most promising ways to do so is to replace constant multiplications by additions, subtractions and bit shifts. Take for example the multiplication by a constant coefficient 23. It can be computed without dedicated multipliers as 23x = 8 · (2x + x) − x = ((x << 1) + x) << 3) − x, (1) where (x << b) denotes the arithmetic left shift of x by b bits. This computation uses one addition and one subtraction. As the add and subtract operations both have similar hardware cost, the total number of add/subtract units is usually referred to as adder cost. Bit shifts can be hard-wired in hardware implementations and do not contribute any cost. For (1) , this is illustrated in Fig. 1(a) . In general, the task of finding a minimal adder circuit for a given constant is known as the single constant multiplication (SCM) problem and is already an NP-complete optimization problem [21] .
Such a problem extends to multiplication with multiple constants, which is necessary when implementing FIR filters. It is called multiple constant multiplication (MCM). Here, some of the intermediate factors like the adder computing 3x in Fig. 1(a) can be shared among different outputs. Take for example the coefficients {7, 23}; Fig. 1(b) shows a solution for M. Kumm is with the Fulda University of Applied Sciences, 36037 Fulda, Germany (e-mail: martin.kumm@cs.hs-fulda.de).
A. Volkova is with University of Nantes, Nantes, France (e-mail: anastasia.volkova@univ-nantes.fr) S.I. Filip is with University of Rennes, Inria, CNRS, IRISA, Rennes, France (e-mail: silviu.filip@inria.fr). multiplying with both coefficients at an adder cost of only two. The corresponding optimization problem is called the MCM problem and has been addressed by numerous heuristic [22] - [24] and optimal [25] - [27] approaches. Fig. 2 shows the two most popular structures used to implement FIR filters: the direct and transposed forms. The result of an MCM solution can be directly placed in the multiplier block of the transposed form from Fig. 2(b) . The total adder cost is the sum of the number of multiplier block adders and the remaining ones, commonly called structural adders. The transposed form can be obtained from the direct form by transposition [28] . As the transposition of a singleinput-single-output system does not change the adder count, it leads to the same adder cost. So, in the end, it does not matter which one of the two considered filter structures is actually optimized.
In the MCM optimization problem, it is assumed that the coefficients are known and already quantized to a fixed-point (or integer) representation. The design of FIR filters with fixed-point coefficients and a minimum frequency response approximation error is itself a well-known optimization problem, going back to at least [29] (with subsequent extensions and improvements [30] - [34] ). However, it is often the case in practice that a bounded frequency response is acceptable.
In fact, there may be a large number (often hundreds or more) of different fixed-point coefficient sets that meet such a specification. Therefore, a lot of effort in fixed-point FIR filter design has gone into optimizing for resource use. In this context, the problem of finding a minimal adder circuit for a given filter specification was addressed by several authors [1] , [2] , [4] , [5] , [7] , [9] , [15] , [19] . However, to the best of our knowledge, no previous work has addressed the design of multiplierless FIR filters in an optimal way. Here, by optimal multiplierless filter we mean a direct/transposed form FIR filter requiring a minimum number of adders to meet a target frequency specification, as well as constraints on the maximum coefficient word size and filter order. The main contributions of our work are as follows:
• We present for the first time a solution for the optimal multiplierless design of FIR filters from a frequency specification using an ILP formulation. • We provide another ILP formulation that is capable of additionally limiting the adder depth inside the FIR filter. • We show that relevant problem sizes can be addressed by current ILP solvers and that the adder complexity of well-known FIR filters can be further reduced compared to the most advanced methods.
In the following, we will give background information about previous work this paper is based on. In Section III and Section IV we describe the two ILP formulations that are at the core of the paper, whereas in Section V we talk about ideas meant to improve the practical runtime of the proposed algorithms. We then present experimental results accompanied by a comparison with the state-of-the-art (Section VI), followed by concluding remarks (Section VII).
II. BACKGROUND
Multiplierless filter design problems usually start with a functional specification of the frequency domain behavior, together with the number of filter coefficients and their word lengths. An optimization procedure is applied to get a set of bounded integer coefficients together with their associated adder circuits for the constant multiplications needed in the final implementation. Summarized in Fig. 3 , this section overviews these parameters and their interactions, together with the state-of-the-art design methods found in the literature. 
A. Linear Phase FIR Filters
An N -th order linear phase FIR filter can be described by its zero-phase frequency response [35] 
which has the property that its magnitude is identical to that of the transfer function, i.e.,
The c m (ω) terms are trigonometric functions and M denotes the number of independent coefficients after removing identical or negated ones due to symmetry. Both depend on the filter symmetry and on the parity of N as given in Table I [35] . Let D(ω) and D(ω) be the desired lower and upper bounds of the output frequency response H R (ω). The associated frequency specification-based FIR filter design problem consists of finding coefficients h m , m = 0, . . . , M − 1 that fulfill the constraints
where Ω ⊆ [0, π] is a set of target frequency bands (usually pass and stopbands). A standard approach in practice is to work with Ω d ⊆ Ω, a uniform discretization of Ω. One number for the size of Ω d found in the literature is 16M [32] .
B. Fixed-point Constraints
Fixed-point (integer coefficient) FIR filter design problems further restrict the search space to integer variables
and B ∈ N is the maximum effective word length of each coefficient (excluding sign bit).
To broaden the feasible set of efficient designs, some applications allow the use of a real-valued scaling factor G > 0 when computing the quantized fractional coefficients h m . Equation (4) thus becomes
When the frequency specification contains a passband, it is called the passband gain [31] . Finding adequate bounds for G is dependent on the set/format of feasible h m coefficient values. If these values are constrained to a power of two space, the ratio between the upper and the lower bound on G does not need to be larger than 2 [31, Lemma 1] . Even when this is not the case, the interval [0.7, 1.4] is frequently used [13] , [19] , [31] . For our tests, unless otherwise stated, we prefer the slightly different interval [2/3, 4/3] since it is centered around one. In case a unity or fixed-gain filter is required we set the gain to G = 1.
C. Multiplierless FIR Filters
Formulas (5) and (6) are easily expressed as constraints in an ILP formulation. However, to ensure an optimal multiplierless design, further constraints are needed.
The way these constraints are constructed and used has varied over the years. Early research in this direction looked at multiplierless designs where each coefficient was represented by a limited number of signed power-of-two terms, optimized using branch-and-bound techniques [1] . Later, minimum signed digit (MSD) representations characterized by a minimum number of non-zero power-of-two terms were quickly adopted for this purpose [2] , [5] , [6] , [10] , [15] .
MSD representations can be used to find sharing opportunities of intermediate computations like the 7x term shown in Fig. 1 (b). One way is by searching and eliminating redundant bit patterns common to several coefficients, a technique called common subexpression elimination (CSE). Savings are obtained by performing the computation specified by the bit pattern and distributing the result to all coefficients depending on it [3] , [7] , [8] , [22] . However, the CSE search cannot deliver all possible sharing opportunities due to its dependency on the number representation [23] and the effect of hidden non-zeros [36] . To avoid them, graph-based approaches are commonly used in state-of-the-art MCM methods [22] - [27] . Some early work on multiplierless FIR filter design already considered this by incorporating the graph-based MCM algorithm of [22] into a genetic algorithm that optimizes the filter coefficients according to the adder cost [4] . A different approach is followed by [9] , where a branch-and-bound-based ILP optimization is used; here, a pre-specified set of integer terms, called the subexpression space, has to be provided that can be shared among the different coefficient expansions. This work was later extended with a dynamic subexpression space expansion algorithm [11] , [13] , which, at least in the case of [13] , claims to usually produce designs with a minimal number of adders. In contrast to these potentially slow branchand-bound approaches, in [16] , a fast polynomial-time heuristic for the design of low complexity multiplierless linear-phase FIR filters was proposed.
Recent work has also focused on integrating filter coefficient sparsity, which can also have a big impact on the complexity of the final design [17] by reducing the number of structural adders. Also, other structures than the direct and transposed forms (see Fig. 2 ) have been shown to possess good properties. The factoring of FIR filters into a cascade of relatively small subsections can lead to a lower bit-level complexity [14] , [18] . Alternative structures have also been proposed [20] ; they provide lower word sizes for the structural adders.
Besides optimizing the adder count, it was shown early that the power consumption of the resulting filter also strongly depends on the adder depth (AD), which is defined as the number of cascaded adders in the multiplier block [37] , [38] . Since then, many works have focused on limiting the AD either in MCM algorithms [36] , [39] , [40] or directly in multiplierless filter design methods [13] . Again, all of those approaches are heuristics that provide minimal AD but not guarantee minimal adder cost. Looking at the average adder depth in the structural adders can also help reduce power consumption [19] .
III. MULTIPLIERLESS FIR FILTERS WITH FIXED NUMBER
OF MULTIPLIER BLOCK ADDERS Our first ILP model targets the design of generic multiplierless FIR filters regardless of their adder depth. It is based on a recently proposed MCM ILP formulation [27] , where the goal is to directly compute the parameters of an MCM adder graph, if feasible, for a given number of adders. This idea is extended here for multiplierless FIR filter design by adding constraints on the frequency specification. As a result, we get an ILP model to design a multiplierless filter for a fixed number of adders in the multiplier block. To optimize the total number of adders, this ILP model is solved several times using an overall algorithm discussed in Section III-B. In the following, we first present the ILP formulation for the fixed number of multiplier block adders.
A. ILP Formulation for Fixed Multiplier Block Adder Count
The proposed ILP formulation is given in ILP Formulation 1 and uses the constants and variables listed in Table II . The objective is, given a fixed number of multiplier block adders A M , to minimize the number of structural adders A S (which depend on the number of zero filter coefficients, encoded by the binary decision variables h m,0 ).
The resulting constraints can be roughly divided into frequency response conditions (C1, C2), equations linking the filter coefficients with the coefficients of the multiplier block (C3) and formulas describing the multiplierless realization of the multiplier block (C4 -C8).
The integer coefficients h m (m = 0, . . . , M − 1) of the FIR filter are directly used as integer variables in the ILP formulation. The resulting frequency response is constrained in C1a by setting (2) and (5) into (6) . Constraints C1b are so-called lifting constraints. These are actually not required to solve the problem, but can significantly reduce the search space and improve runtime performance. Specifically, they limit the range of the coefficients to lower h m and upper h m bounds. The computation of these bounds is considered in Section V-A. Constraint C2 limits the range of the gain as discussed in Section II.
Constraints C3a to C3c provide the connection between the filter coefficient h m and the (potentially shifted and signcorrected) multiples computed in the multiplier block c a or a zero coefficient. For that, the binary decision variables o a,m,s,φ ∈ {0, 1} encode if h m is connected to adder a of the multiplier block, shifted by s, and either added (φ = 0) or subtracted (φ = 1) in the structural adders (C3a). In case the coefficient is zero, a single binary decision variable h m,0 is
C6b:
ϕ a, ,s = 0 ∀s > 0
C7d: ϕ a, ,s = ϕ a,r,s ∀s < 0
C8a:
. This encoding allows the optimization of structural adders by considering the h m,0 variables in the objective function. For every zero coefficient, the corresponding structural adder(s) can be saved depending on the coefficient and filter type. Table III shows the number of structural adders for the different filter types. Overall, constraints C3c ensure that only one of the above cases is valid. The remaining constraints C4 -C8 are identical to the ones used for solving the MCM problem from [27] . We give a brief description here, but refer the reader to [27] for a more detailed presentation. The multiplier block input is viewed as a multiplication by factor one (c 0 = 1) and is defined with constraint C4. Constraints C5 represent the actual add operation of adder a and its corresponding factor c a . It is obtained by adding the shifted and possibly sign corrected factors of its left input c sh,sg a, and its right input c sh,sg a,r . The source of the adder inputs is encoded by C6a/b. Indicator constraints C6a are used to set the value c a,i of the adder 
shifted by s and sign φ hm, hm ∈ Z Lower and upper bound for filter coefficient
Gain of a variable gain filter (G = 1 when the gain is fixed) input i ∈ { , r} to the actual factor when the corresponding decision variable c a,i,k is set. Indicator constraints are special constraints in which a binary variable controls whether or not a specified linear constraint is active. They are in-fact nonlinear but supported by many modern ILP solvers and are also simple to linearize for other solvers (see [27] ). Constraints C6b make sure that only one source is selected. The actual shift is constrained by C7a/b in a similar way: indicator constraints C7a are used to set the shifted factor c sh a,i according to the corresponding decision variable ϕ a,i,s .
Constraints C7c and C7d are both optional lifting constraints used to reduce the search space. As the filter coefficients can be shifted in constraint C3a, we can limit the constants of the multiplier block to odd numbers. This allows us to use the well-known fact that odd coefficients can be computed from odd numbers using one addition with either one operand left shifted while the other operand is not shifted or both operands are right shifted by the same value [41, Theorem 3] .
To support subtractions, indicator constraints C8a/b are used to set the sign according to decision variable φ a,i . Finally, constraints C8c ensure that at most one input of the adder can be nagative, as subtracting both inputs is typically more hardware demanding.
All of the integer variables from Table II are computed from integer constants or booleans and represent integer values. So they can be relaxed to real numbers to speed up the optimization.
B. Minimizing the Total Number of Adders
As the number of adders in the multiplier block A M is fixed in ILP Formulation 1, we need to iterate over various values A M to find the minimum number of total adders 
For that, we first search for a solution with minimal number of multiplier block adders by solving ILP Formulation 1 for A M = 0, 1, 2, . . . until we obtain the first feasible solution. This solution with minimum multiplier block adders A M,min is not necessarily the global optimum as there might be a solution with A M > A M,min and a smaller A S . To account for this, we need a lower bound for the structural adders A S,min . This is obtained once at the beginning of the overall algorithm by solving the problem for a maximally sparse FIR filter, which we do by taking ILP Formulation 1 where only the constraints C1 -C3 are considered.
In case the structural adders A S of solution with A M = A M,min are not identical with A S,min , we continue to further increment A M until we find a solution with A S = A S,min . This is a safe stopping point since, by the optimality of ILP Formulation 1, there is no solution with larger A M and smaller A S . The solution with minimum total adders A found so far is hence also globally optimal. Typically, only a few iterations are necessary to reach this point.
IV. MULTIPLIERLESS FIR FILTERS WITH BOUNDED ADDER DEPTH
As discussed above, limiting the AD is important to reduce the power consumption of a filter. Unfortunately, adapting ILP Formulation 1 to limit the AD is not straightforward, as the topology of the adders and thus the AD is left open. We present in this section a novel ILP model for the design of multiplierless FIR with limited AD which is based on a formulation that was initially designed for optimizing pipelined MCM (PMCM) circuits [40] , [42] .
In contrast to ILP Formulation 1, the possible coefficients are precomputed for each adder stage s and selected using binary decision variables. The computation of the corresponding coefficient sets is given next.
A. Definition of Coefficient Sets
We use some notation and definitions originally introduced in [23] . First, we define the generalized add operation called A-operation, which includes the shifts. An A-operation has two input coefficients u, v ∈ N and computes
where q = (l u , l v , r, s v ) is a configuration vector which determines the left shifts l u , l v ∈ N 0 of the inputs, r ∈ N 0 is the output right-shift and s v ∈ {0, 1} is a sign bit which denotes whether an addition or subtraction is performed. Next, we define the set A * (u, v) containing all possible coefficients which can be obtained from u and v by using exactly one A-operation:
A valid configuration is a combination of l u , l v , r and s v such that the result is a positive odd integer A q (u, v) c max . The reason for limiting the integers to odd values is that we can compute every even multiple by shifting the corresponding odd multiple to the left. The c max limit is used to keep A * (u, v) finite. It is chosen as a power-of-two value which is usually set to the maximum coefficient bit width B plus one [22] , [23] c max := 2 B+1 .
For convenience, the A * set is also defined for an input set X ⊆ N as
We can now define the coefficients that can be computed at adder stage s, denoted as A s , by recursively computing the A * sets
In addition, let T s denote the set of (u, v, w) triplets for which w ∈ A s can be computed using u and v from the previous stage (i.e., u, v ∈ A s−1 ). T s can be computed recursively, starting from the last stage s, which is equal to the maximum allowable AD:
To give an example, the first elements of T 1 are T 1 = {(1, 1, 1), (1, 1, 3), (1, 1, 5), (1, 1, 7), (1, 1, 9) , (1, 1, 15) , . . .}. This set contains all the possible rules for computing multiples from the input within one stage of additions, while set T 2 = T 1 ∪ {(1, 3, 11), (1, 5, 11) , . . . , (3, 5, 11) , . . .} contains all the combinations of how elements in the next stage can be computed.
B. ILP Formulation for Fixed Adder Depth
The bounded AD model is given in ILP Formulation 2, while the corresponding constants and variables are given in Table IV .
In contrast to ILP Formulation 1, the objective is to directly minimize the total number of adders A, which is separated into adders in the multiplier block (A M ) and structural adders A S . Similar to ILP Formulation 1, the constraints are divided into frequency response conditions (C1, C2), the link between the filter coefficients and the coefficients of the multiplier block (C3, C4) and the equations describing the multiplierless realization of the multiplier block (C5 -C8).
ILP Formulation 2 Multiplierless FIR filters with depth limit
C3a:
Constraints C1a/b and C2 are identical to the ones in ILP Formulation 1. Now, the connection between the odd multiplier block coefficients of the pre-computed sets and the filter coefficients is performed using binary decision variables. Let h m,w ∈ {0, 1} be a binary decision variable that is true if the magnitude of h m is identical to w, i.e., 
The value of each integer coefficient h m is selected by the indicator constraints C3a. In addition, constraints C3b make sure that only one value per filter coefficient is selected. Next, we distinguish between coefficients that are computed for the selected stage (by using an addition) and coefficients that are just replicated from a previous stage. This replication can be either implemented by a simple wire (as this was implied in ILP Formulation 1) or in case of a pipelined implementation of the multiplier block, it will be implemented by a register. This allows to also model the register cost in the latter case (not treated here but it is a trivial extension of the objective). Hence, we introduce two new decision variables for each w and stage: a s w and r s w , which are true, if w in stage s is realized using an adder or register/wire, respectively.
The connection to the filter coefficients h m,w is made through C4. As several of the M h m coefficients can have the same w value, the right hand side of C4 is scaled by 1/M to keep it less than one. Whenever the right hand side of C4 is non-zero it forces the realization of coefficient w in the output stage S, either as an adder or as a register/wire. Constraints C5 and C6 consider the realization as register/wire: they require that a value w can only be replicated from a previous stage if it was computed or replicated before.
The realization as an adder computing constant w from the inputs u and v requires the presence of both inputs in the previous stage. For that, the binary variables x s (u,v) are introduced which determine if both are available in stage s: 
Now, constraint C7 specifies that if w is computed by w = A(u, v) in stage s, the pair (u, v) has to be available in the previous stage. If a pair (u, v) is required in stage s, constraints C8 make sure that u and v have been realized in the previous stage either as register or adder. Note that instead of using constraint C5 it is more practical to remove all variables r s w which are zero from the cost function and their related constraints. Also note that the binary variables x s (u,v) and the integer variables h m can be relaxed to real numbers to speed up the optimization.
C. Selecting the Adder Depth
The AD is often selected to be as small as possible, typically at the expense of a higher adder cost. It is well known that the minimal AD needed when multiplying with a given coefficient can be realized by using a binary tree [43] . Therefore, it cannot be lower than the base two logarithm of the nonzero digit count of its MSD representation. Unfortunately, as the coefficients are not known in advance, the minimum AD cannot be derived from the filter specification. However, the upper bound of the AD can be computed from the coefficient word size B as follows. A B bit binary number can have up to B + 1 digits when represented as an MSD number and up to (B + 1)/2 + 1 non-zeros in the worst case [40] . This leads to a maximum adder depth of
Using this bound, a search from AD = 0, . . . , AD max can be performed until the first feasible solution is found. For practical FIR filters, early studies have shown that coefficient word sizes between 15 bit to 20 bit are sufficient to achieve approximation errors between −70 and −100 dB [44] . Using (18) , this translates to ADs of at most three to four. In our experiments, we found very good solutions with AD = 2 for most of the filters from practice.
V. REDUCING THE PROBLEM COMPLEXITY A. Reducing the Coefficient Range
Following [5, Sec. 4] , we bound the search space for the coefficient values by projecting the polytope corresponding to the discretized versions of (4) or (6) 
B. Discretizing the Frequency Domain
Even though Ω is replaced by a finite set Ω d , we perform a rigorous posteriori validation of the result over Ω [45] . Still, the typically large size of Ω d (16M is a common value found in the literature) can have a big impact on the runtime of the filter design routine. This is shown for instance in the context of an optimal branch-and-bound algorithm for FIR filter design with fixed-point coefficients [32, Table 2 ]. A too small number of points can, on the other hand, lead to an invalid solution over Ω and a larger feasible set, potentially incurring a larger runtime as well.
It is thus important to consider a discretization of reasonable size that is unlikely to lead to invalid solutions over Ω (i.e., equations (4) or (6) do not hold) and does not increase the search space by a too large factor. To this effect, we use so-called approximate Fekete points (AFPs), which contain the most critical frequencies for a given filter that needs to fit a target frequency response. They have recently been used to improve the robustness of the classic Parks-McClellan Chebyshev FIR filter design algorithm [46] and for a fast and efficient heuristic for FIR fixed-point coefficient optimization [47] . They are efficient choices when performing polynomial interpolation/approximation on domains such as Ω. This is relevant in our context since H R (ω) in (2) is a polynomial in cos(ω) [48, Ch. 7.7] . For details on how to compute them we refer the reader to [46] , [47] and the references therein.
C. An Adaptive Search Strategy
Even if the current Ω d leads to a solution that does not pass a posteriori validation, it might still be possible to rescale the gain factor G such that (6) holds. By taking a point
is largest (i.e., the point of largest deviation from the specification) we first update G to take a value close to H R (ω max )/D(ω max ) or H R (ω max )/D(ω max ), depending on where the deviation occurs. If this new gain leads to a valid solution over Ω, then it is optimal. If not, we update Ω d by adding the points of largest deviation for each frequency subdomain. We rerun the optimization with this new Ω d , repeating until either (a) there are no more invalid frequency points or (b) the problem becomes infeasible, meaning no solution with the imposed constraints over Ω exists.
We should mention that running the result validation code of [45] at each iteration of the adaptive routine is computationally expensive. This is why at each iteration we perform a fast, non rigurous test consisting of verifying (6) on a much denser discretization of Ω than Ω d . We found this to usually be sufficient in ensuring that the a posteriori validation [45] done at the end of optimization is successful.
VI. EXPERIMENTAL RESULTS
To test the ILP formulations discussed above, we have implemented them in a C++ filter design tool 1 . It features a flexible command-line interface.
A. Experimental Setup and Parameter Choices
All experiments were run on a Linux machine with an Intel Xeon E5-2690 v4 CPU with 56 cores and 252 GB of RAM. The proposed implementation supports several popular open source and commercial (M)ILP solvers, such as SCIP [49] , Gurobi [50] and CPLEX [51] 2 . For convenience, these solvers are accessed through the ScaLP [52] library, which acts as a frontend. Based on our experiments, Gurobi usually proved to be the fastest backend, which is why, apart from a few exceptions, use it on all the examples below. All experiments use the AFP-based frequency grid discretization mentioned in Section V-B. As discussed before, the number of frequency points in Ω d is run-time critical. To determine an appropriate size, we ran an experiment using a typical design scenario with an Ω d size of kM points and k = 1, . . . , 32. Fig. 4 shows the runtimes. Not surprisingly, they start large for very low k, as in these cases the frequency grid usually has to be extended to address violations, which require re-running the optimization routine on a new grid. As soon as k is large enough (around k 4), invalid results become rare, meaning just one optimization pass is sufficient. Further increasing k at this point just leads to more constraints in the model and likely a larger runtime for the optimizer. Based on these results, we selected to start with 4M points. We find this choice usually delivers a good balance between optimizer runtime and number of iterations needed to obtain a valid solution over Ω.
B. Benchmark Set
Several multiplierless filter designs were computed to evaluate our methods. They are introduced next.
1) A Family of Specifications from [4, Example 1]:
We consider a family of low-pass linear-phase filter specifications from Redmill et al. [4] . These specifications are defined by:
where δ is a parameter regulating error. We set δ = 10 − p 20 , where p > 0 is the error in decibels (dB).
Our goal with this benchmark is to explore the tradeoff between the error (p), the filter order (N ) and the word length (B) in terms of the total number of adders. 
2) A Set of State-of-the-art Specifications:
We also test our tool on a set of reference specifications from the literature [1] , [2] , [4] , [9] , [12] , [13] , [15] , referred to as S1, S2, L1, L2 and L3. They are all low-pass filters defined by
where the values of δ p , δ s , Ω p , Ω s for each specification are given in Table V . Over time, these reference filter specifications were slightly modified by different publications. To compare with each one, we indicate variations by suffixes, e.g. S1a and S1b. We note that this restriction to low-pass filters comes only from the existing literature and that our tool can be successfully used for the design of other types of filters, such as multiband filters or decimators (since we generalize constraints on the frequency response as functions of frequencies).
C. Results

1) ILP Formulation 1 vs. ILP Formulation 2:
In the first experiment, we compare ILP Formulation 1 (in its overall form discussed in Section III-B) to ILP Formulation 2. Both models can be used to optimize for the total number of adders (MB and structural) given fixed parameters like filter order N , filter type and the effective word length B (see Fig. 3 ). In case of ILP2, the adder depth is an additional constraint. Therefore, in practice, the two approaches can sometimes lead to different results. This is exemplified in Fig. 5 , where we design a set of filters using the family of specifications from [4, Example 1] as described in Section VI-B1. We consider 2 × 37 filters corresponding to p = 2, . . . , 38 (i.e., error is varied from −2 dB to −38 dB), with a 9-bit effective word length and fixed gain G = 1. In each case, a type I filter with smallest N that leads to a feasible solution under the given constraints was used. For ILP2, the upper bound on the AD is set to 2 as this turned out to be sufficient for all test instances.
For most error targets the resulting total adder count is identical between the two. The exceptions are −27, −31, −32 (adapted to minimize the total number of adders) and ILP2 (with AD limit set to 2) on the Redmill set of filters [4] . and −36 dB, where ILP1 gives a better total adder count, and −37 dB, where ILP2 is better. The four cases where ILP1 gives a better result are not surprising considering that the limited AD in ILP2 restricts the coefficient search space. For −37 dB, the difference comes from the fact that the ILP1 solver is able to find an optimal solution with N = 20, while an AD = 2 solution for ILP2 is only possible starting with N = 22. Taking AD = 3 with ILP2 gives the N = 20 solution found with ILP1.
Due to the different nature of the constraints and objective values of both ILP models, it is hard to do a runtime comparison between the two. We only mention that when there is a feasible filter with small number of MB adders, ILP1 can be quite fast for moderate size problems (N < 50), but turns out to struggle for problems with many MB adders. Similarly, for feasible designs with small AD (up to 3), ILP2 will be fast for N < 50 and overall scales better than ILP1.
In the rest of the paper, for comparison with previous work, we use ILP Formulation 2 with AD = 2 (unless otherwise stated).
2) Design Space Exploration for [4, Example 1]: The experiment setting from Section VI-C1 is expanded upon. We compare our best results (with effective word lengths B ∈ {8, 9, 10, 11}) with those from [4, Example 1]. We start off by considering only type I filters (just like in [4] ), variable gain G ∈ [2/3, 4/3] and minimal order N for each error target. The results are illustrated in Fig. 6 . We note that there are certain cases where, for a given B, taking the minimal filter order leading to a feasible solution does not minimize the adder cost. This is most visible for B = 11 and a −30dB error target, where a minimal order N = 14 filter requires 24 adders. For B = 10, the minimal N is 16, leading to only 17 total adders, a 7 adder improvement. Taking N = 16 for B = 11 also results in a 17 adder solution. A lower implementation cost is sometimes possible when increasing the filter order leads to a sparser filter and/or a more economical MCM design. Such solutions better optimize the objective functions in the proposed ILP models. We nevertheless remark that Total number of adders best results extracted from [4] best results with fixed gain best results with flexible gain Fig. 7 : Comparison between our best design space exploration results and the best results from [4, Example 1]. Our tool improves designs from [4] or proves them optimal. increasing the filter order beyond a certain threshold will not lead to different solutions [53] .
Of course, increasing the word length can also lead to a significant improvement in the results. For instance, the optimal −50dB atteanuation results for B ∈ {9, 10, 11} require 41, 32 and 31 adders, respectively.
This nonlinearity of the word length/cost relation means that the user should favor a comprehensive exploration of the design space, varying the design parameters (especially B, filter type and N ) and examine the various trade-offs. This is possible with our tool. Fig. 7 shows the results of such an experiment where, with respect to the setting of Fig. 6 , we additionally allow N to vary and also consider type II filters. We also added the flexible gain, genetic algorithm results produced in [4] . Compared to [4] , we could improve all of the results, except two of them where we obtained the same adder cost (−9 dB and −25 dB). It is clearly visible that allowing variable gain designs can have a major influence on the quality of the results. 3) The Set of State-of-the-art Specifications: Table VI presents the comparison between our designs and the best results from literature [1] , [2] , [4] , [9] , [12] , [13] , [15] for the filter specifications from Table V. The following information for each implementation is given: filter order (N ), filter type, number of multiplier adders (A M ), number of structural adders (A S ), total number of adders (A), adder depth of the multiplier block (AD), gain (G), effective coefficient word length (B), frequency response error and, finally, coefficients of the filter. The frequency response error represents by how much the resulting filter does not respect the specification (zero in case of no issue) and was necessary to introduce since coefficients from previous work turned out to sometimes slightly violate the specification. For those cases we also adjusted the specification to find solutions with a similar error in order to perform a fair and complete comparison. In the following, we discuss each instance in detail. S1a: we obtain a 23% improvement (27 vs. 35 adders) in comparison to the implementation in [2] . By also considering type II designs, one extra adder can be saved (26 total adders). If furthermore a slight error is allowed, a design with 24 adders is achievable. S1b: for this specification we show that, for type I designs, the AD can be reduced from 3 to 2 stages, while keeping the same effective word length. If we use type II filters, keeping the same word length as [4] allows the reduction of two adders (one structural and one MB). We can also decrease the word length by one, and still get a result with the same cost as [4] . S1c: we improve the result from [9] by 3 adders by prioritizing sparse implementations. For type II specifications we can gain one extra structural adder and decrease the word length by one. A type II implementation of these specifications from [13] requires only 23 adders but has higher error than that allowed by the specification. If we allow some error, we obtain the same 23 adder solution as [13] .
S2a: we offer a significant improvement in the implementation of this specification w.r.t. the best known result [2] . Our solution requires 30% less adders (81 adders vs. 116 adders) and we reduce the effective word length from 13 to 10 bits.
S2b: in [9] , implementations with adder depths 3 and 2 are proposed, at the cost of 78 and 80 adders, respectively. These results are improved in [13] , with the authors claiming that a 3-stage implementation at the cost of 76 adders has a high probability to be optimal. Our check shows that the latter design does not pass an a posteriori validation by a margin of at least 0.01395. If some error margin is indeed acceptable, we demonstrate that a 2-stage design with a cost of only 66 adders is possible. Again, our result has higher sparsity than previous designs. If frequency specifications must be rigorously met, we show that 76 adders is the optimal cost for a 2-stage design.
L1: the tool timed out for this specification before giving a feasible result, hence it is not presented in Table VI . The best known result from literature is a 120-tap filter [9] and the size of an instance of the corresponding ILP formulation goes beyond the current capabilities of the solvers we tried, showing its limitations.
L2: the result of [9] can be improved by one adder with the same coefficient set. Similar to L1, we had to timeout before the solver could ascertain if the feasible filter obtained is indeed optimal.
L3: in [9] , an implementation with 38 adders is provided. As in the case of the S2b specification, [13] reduces both the adder count and adder depth, but at the cost of a substantial violation of the frequency specifications. We improve on the result from [9] , proving that only a 7-bit wordlength is necessary to get the same 38 adder cost. This cost can be reduced to 35 adders by using 8-bit coefficients. If we allow the same error as in [13] , we show that 34 adder solutions are possible.
Overall, the proposed tool achieves significant improvements to the majority of the considered filter design problems. Morover, the user can explore a large design space by setting different implementation parameters, e.g. adder depth, coefficient word length, filter type, etc. The required runtime however, will depend greatly on the problem. For producing the results in Table VI , it varied from several seconds for the smallest filters (S1) up to several days for the largest ones (S2 and L2).
VII. CONCLUSION
In this paper we have introduced two new algorithms for the design of optimal multiplierless FIR filters. Relying on ILP formulations stemming from the MCM literature, our algorithms minimize either (a) the number of structural adders given a fixed budget of multiplier block adders or (b) the total number of adders (multiplier block + structural adders) given a fixed adder depth. We further show how (a) can be applied iteratively to optimally minimize the total number of adders (without any adder count or adder depth constraints). Extensive numerical tests with example design problems from the stateof-the-art show that our approaches can offer in many cases better results. We also make available an open-source C++ implementation of the proposed methods.
