Abstract-Conventional approaches to the fixed-point implementation of digital signal processing algorithms perform scaling and word-length (WL) optimization at the algorithm level and high-level synthesis for functional unit sharing at the architecture level. However, algorithm-level WL optimization is limited because it can neither use the functional unit sharing information for signal grouping nor estimate the hardware cost of each operation accurately. In this study, we develop a combined WL optimization and high-level synthesis algorithm that not only minimizes the hardware implementation cost but also reduces the optimization time significantly. The software first finds the WL sensitivity, or minimum WL, of each signal through fixed-point simulations of a signal flow graph. It then performs WL conscious high-level synthesis, in which signals having similar WL sensitivity are assigned to the same functional unit, and finally conducts the final WL optimization by iteratively modifying the WLs of the synthesized hardware model. A list-scheduling-based algorithm and an integer linear-programming-based algorithm are developed for the WL conscious high-level synthesis. The hardware cost function to be minimized is generated from the synthesized hardware model. Since fixed-point simulation is used to measure the performance, this method can be applied to general digital signal processing systems, including nonlinear and time-varying ones. A fourth-order infinite-impulse response filter, a fifth-order elliptic filter, and a 12th-order adaptive least mean square filter are implemented using this software.
the determination of optimum WLs for general, including nonlinear and time-varying, signal processing systems is a difficult and tedious process. In order to solve this problem, simulation-based WL optimization tools are developed [2] , [5] , [6] . An interpolative approach that uses range and WL propagation is also studied to minimize the simulation time for optimization [7] .
The authors developed a simulation-based method to determine the optimal WLs and scale factors for general digital signal processing algorithms, which was adopted in a commercial computer-aided design product, Fixed Point Optimizer [2], [8]. This method measures the performance of a fixed-point algorithm using simulation results and iteratively modifies the WL of a signal to find a set of optimum WLs that satisfies the fixed-point performance measure while minimizing the hardware cost. In this method, the WL optimization is conducted before the high-level synthesis and is composed of signal grouping, scaling factor determination, minimum WL determination, and the optimum WL search, as illustrated in Fig. 1. The signal grouping, which assigns the same WL to all the signals in a group, is needed to reduce the time for the WL search step by minimizing the number of variables. The minimum WL of a signal group is the lower bound of the WL that the group should have in order to meet the desired fixed-point performance, assuming that all other signals have sufficient precision, such as the double precision floating-point format [2]. The optimum WL search is conducted by employing exhaustive or heuristic search methods. Note that this WL optimization is performed at the algorithm level, while the subsequent high-level synthesis is conducted based on the determined WLs, as shown in Fig. 1. Although this approach supports an easy integration of the WL optimization software with existing high-level synthesis tools, it can lead to a higher implementation cost for the following two reasons. First, WL optimization that does not consider hardware sharing can assign different WLs to operations that are mapped to the same functional unit. In this case, the largest one determines the WL of the functional unit, and more effort is needed for the WL determination because a larger number of variables is involved.
Second, the hardware cost function needed for the WL optimization is not accurate because the hardware sharing information is not used. Especially, this problem becomes worse because most of the high-level synthesis tools do not consider the WL information for binding. Choi [9] developed an improved binding method that assigns similar WL operations to the same functional unit. However, it still has the aforementioned disadvantages of the WL optimization followed by the high-level synthesis approach.
In this paper, an architecture-level WL optimization method that conducts the WL optimization and high-level synthesis simultaneously, as shown in Fig. 2, is proposed. The high-level synthesis step is performed based on the minimum WL information, while the final WL optimization is conducted using the synthesized hardware model. Thus, this approach binds the operations that need similar WLs, assigns the same WL to the bound operations, and uses an accurate hardware cost model. Also, the optimal WL search time is decreased considerably because the hardware sharing information is used for the signal grouping.
The proposed architecture-level WL optimization method is implemented using a very high-speed integrated circuit hardware description language (VHDL) developer's toolkit [10] . This includes a VHDL analyzer, a VHDL to control data flow graph (CDFG) converter, and a CDFG to C converter. A digital signal processing algorithm is represented in the behavioral-level VHDL description, which is then converted to a CDFG format. The CDFG is used in our WL optimization software. Since the simulation for evaluating the fixed-point performance is conducted using converted C programs that include a quantization model, the simulation time is much less than that of VHDL simulation. The final synthesized architecture is represented with the structural-level VHDL description.
In Section II, the signal grouping and the minimum WL determination methods are described. This section also explains the fixed-point data format and the quantization models. In Section III, the high-level synthesis techniques using the WL information are described. In Section IV, the optimum WL search and the shift minimization methods are described. The implementation examples are shown in Section V. Finally, Section VI contains the concluding remarks.
II. SIGNAL GROUPING AND MINIMUM WL DETERMINATION

A. Fixed-Point Data Format
A fixed-point format that assigns an independent binary point location for each signal is employed using the attributes specified as follows [2] , [8] :
$\langle$word length, integer word length, sign$\rangle$ (1)

The number of bits assigned to the integer value representation is called the integer WL (IWL) and the number of bits assigned to the fraction is the fractional WL (FWL). Thus, the WL of a two's complement number corresponds to $\mathrm{IWL} + \mathrm{FWL} + 1$, where the extra bit is the sign bit. The range ($R$) and the quantization step ($Q$) are dependent on the IWL and FWL, respectively: $R = 2^{\mathrm{IWL}}$ and $Q = 2^{-\mathrm{FWL}}$. For example, a format with an IWL of $n$ bits and an FWL of $m$ bits can represent a signal having the range of $\pm 2^{n}$ and the quantization step size of $2^{-m}$. The minimum IWL for a signal can be determined from the range of a signal as follows:

$$\mathrm{IWL}_{\min} = \lceil \log_2 R \rceil \qquad (2)$$

where $\lceil x \rceil$ is the smallest integer that is greater than or equal to $x$. The range is estimated from the peak-to-peak value or from the mean and the standard deviation that are measured during the simulation [11]. A robust estimation method that uses more detailed statistical information can be found in [6]. The determination of the minimum IWL for all signals requires only one simulation for the range estimation. However, since overflows greatly degrade the performance of a system, it is necessary to employ several input sample files for reliable estimation.
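The format and the minimum IWL rule can be illustrated with a minimal sketch. The function names `min_iwl` and `quantize` are hypothetical, and the rounding and saturation behavior is an assumption made here for illustration; the quantization model used by the authors may differ.

```python
import math

def min_iwl(signal_range: float) -> int:
    """Minimum integer word length for a signal whose magnitude stays below signal_range."""
    return math.ceil(math.log2(signal_range))

def quantize(x: float, iwl: int, fwl: int) -> float:
    """Round x to a grid of step 2**-fwl and saturate to [-2**iwl, 2**iwl - step].

    Saturation is an assumption of this sketch; the text instead relies on
    range estimation so that overflow does not occur.
    """
    step = 2.0 ** -fwl
    lo, hi = -(2.0 ** iwl), 2.0 ** iwl - step
    q = math.floor(x / step + 0.5) * step  # round to nearest grid point
    return min(max(q, lo), hi)

if __name__ == "__main__":
    iwl = min_iwl(3.2)             # a signal observed within +/-3.2
    print(iwl, quantize(1.2345, iwl, 5))
```

A signal with an estimated range of 3.2 needs an IWL of 2, so with 5 fractional bits and a sign bit the WL would be 8 under the relation above.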
B. Signal Grouping
The number of fixed-point simulations needed for the optimum WL search increases rapidly with the complexity of a signal flow graph. Thus, to reduce the optimization time, it is essential to minimize the number of variables by grouping signals. The signals can be grouped based on two ideas: one uses the signal flow graph analysis and the other uses the hardware sharing information. In our previous work, which was implemented in the Fixed Point Optimizer, only the signal flow graph analysis was used for the grouping. In this paper, however, the signal grouping based on the graph analysis is used for determining the minimum WLs. In the former grouping method, the signals that are connected by delays or multiplexers are grouped. In addition, the signals connected by adders are also grouped [2]. On the other hand, multipliers or quantizers break the signal grouping. Since the output WL of a multiplication is the sum of the two input WLs, quantization is needed in most systems, especially when the output is used as an input to other multiplications. There are a few approaches to reducing the WL of, or quantizing, multiplied signals.
The first approach is to conduct quantization immediately after multiplication, as illustrated in Fig. 3(a). This model was also employed in the Fixed Point Optimizer. The second approach is to conduct addition or accumulation of multiplied signals in full precision and then quantize the result when the added signal is used as the input of another multiplication or is stored in memory, as shown in Fig. 3(b). Note that this model has been widely adopted in programmable digital signal processor (DSP) architectures. Many programmable DSPs, such as the TMS320C5x and the Motorola 56000, are equipped with double precision accumulators to add multiplied results without quantization. The first approach causes larger quantization noise than the second approach when both use the same WL. Therefore, when the same quantization noise level is allowed, the first method requires a larger WL than the second method, but the second method requires adders of twice the WL. The third approach, developed in this paper, combines the two models. It not only inserts quantizers at the multiplier outputs, but also quantizes the output of the final adder in an adder tree, as shown in Fig. 3(c). This method is intended to optimize both the adder and the multiplier WLs.
The grouping method is implemented as follows. First, the adders whose outputs are used for a multiplier input or a system output are selected. Second, the adders whose outputs have a fan-out greater than one are also selected. These selected adders are called the output adders because they generate the output signals of adder clusters. Third, the graph is searched from the output adders toward the inputs until another output adder or a multiplier is encountered. In the fourth-order infinite-impulse response (IIR) filter example shown in Fig. 4, the adders A2, A9, and A14 are selected as the output adders. The adder A3 is included in the first cluster, and the adders A7, A8, and A10 are in the second cluster, while the adder A15 is in the third cluster. These clusters are represented as the dotted boxes in Fig. 4.
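The three grouping steps can be sketched as a backward graph traversal. The graph encoding, the function name `adder_clusters`, and the small example graph are hypothetical and are not taken from Fig. 4; delays are assumed to be passed through without joining a cluster.

```python
def adder_clusters(graph, system_outputs):
    """graph: {node: (kind, [input nodes])} with kind in {"in", "mul", "add", "delay"}.
    Returns {output_adder: set of adders in its cluster}."""
    fanout = {n: [] for n in graph}
    for node, (_, inputs) in graph.items():
        for src in inputs:
            fanout[src].append(node)

    # Steps 1 and 2: select the output adders.
    output_adders = [
        n for n, (kind, _) in graph.items()
        if kind == "add" and (
            n in system_outputs
            or any(graph[d][0] == "mul" for d in fanout[n])
            or len(fanout[n]) > 1
        )
    ]

    # Step 3: search toward the inputs until a multiplier, a primary input,
    # or another output adder is met.
    clusters = {}
    for root in output_adders:
        members, stack, seen = {root}, list(graph[root][1]), set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            kind = graph[n][0]
            if kind in ("mul", "in") or n in output_adders:
                continue
            if kind == "add":
                members.add(n)
            stack.extend(graph[n][1])
        clusters[root] = members
    return clusters

if __name__ == "__main__":
    g = {
        "x": ("in", []),
        "m1": ("mul", ["x"]), "m2": ("mul", ["x"]), "m3": ("mul", ["x"]),
        "a1": ("add", ["m1", "m2"]),
        "a2": ("add", ["a1", "m3"]),   # feeds a multiplier: output adder
        "m4": ("mul", ["a2"]), "m5": ("mul", ["x"]),
        "a3": ("add", ["m4", "m5"]),   # system output: output adder
    }
    print(adder_clusters(g, {"a3"}))
```

In this toy graph, `a1` joins the cluster of `a2`, while `a3` forms a cluster by itself.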
C. Determination of Minimum FWL
The minimum FWL can be regarded as the lower bound of the optimal FWL for a signal because the quantization effects of the other signals are ignored. It is determined for each group separately by evaluating the performance from simulation results. In this section, the same IWLs are assumed for simplicity of explanation, and the WL is used instead of the FWL.
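The two WLs in Fig. 3(c), that of the adder-cluster output and that of the adders inside the cluster, should be determined by iterative simulations. To reduce the simulation time, the output WL is determined first without considering the quantizers at the multiplier outputs, and the WL of the adders is then determined from the number of adder inputs as follows.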
Assuming that the quantization noise energy due to one -bit adder is , the total quantization noise energy for the third quantization model is as follows:
Since the optimum WL will be determined in the final search stage, a rough estimation of the minimum WL is sufficient in this stage. Empirically, it is enough to keep the first term of (3) below a fixed bound. Therefore, the fractional guard bit can be determined from the number of adder inputs, as given by (4).

In the fourth-order IIR filter example shown in Fig. 4, the desired fixed-point performance is set to 40 dB of the signal-to-noise ratio (SNR) for the output signal. The filter coefficients have a 16-bit WL with a 1-bit IWL. Note that the coefficient WL optimization is not considered here because there are several previous studies on it [12], [13]. However, the proposed signal WL optimization method can be applied to determine a reasonable number of bits for the coefficients, while the previously known coefficient optimization methods aggressively try to find the optimum constant values, such as the canonic signed digit representation. According to the simulation results, the output WLs for the adder clusters, which correspond to the final quantizer in Fig. 3(c), are determined as shown in Table I. Using (4) of the proposed quantization model, the WLs of the adders in the adder clusters, which correspond to the quantizers at the multiplier outputs in Fig. 3(c), are determined as shown in Table I. Note that these WLs are determined assuming that the adders in a cluster have the same IWLs.
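The dependence of the guard bits on the number of adder inputs can be illustrated with a simple noise budget. The following is only a sketch under stated assumptions, not the exact expressions of this paper: the symbols $N$, $g$, $f$, $\sigma^2$, and the bound $\beta$ are introduced here for illustration, and the quantization errors are assumed to be independent and uniformly distributed.

```latex
% Assumed notation (not from the source):
%   N        number of adder inputs, each quantized at a multiplier output
%   f        FWL of the adder-cluster output
%   g        fractional guard bits of the adders, so their FWL is f + g
%   \sigma^2 = 2^{-2f}/12, noise energy of one rounding to f fractional bits
%   \beta    allowed ratio of the guard-bit noise to \sigma^2 (an assumption)
\begin{align}
  \sigma_{\mathrm{total}}^2 &= N \, 2^{-2g} \, \sigma^2 + \sigma^2 \\
  N \, 2^{-2g} \, \sigma^2 \le \beta \, \sigma^2
  \;&\Longrightarrow\;
  g \ge \tfrac{1}{2} \log_2 \frac{N}{\beta}
\end{align}
```

Under these assumptions, a cluster with $N = 4$ inputs and $\beta = 1$ would need at least one guard bit, and every fourfold increase in the number of inputs adds one more.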
III. HIGH-LEVEL SYNTHESIS USING WL INFORMATION
The high-level synthesis, which aims to find the minimum amount of hardware resources while meeting a certain time constraint, is usually conducted by analyzing the dependency relation given by a signal flow graph. In this study, the time constraint, including clock rate and latency, is assumed to be fixed and only the hardware area cost is minimized, while the fixed-point performance is satisfied. In the proposed method, the minimum WL information is also utilized to reduce the hardware cost by assigning similar WL operations to the same functional unit. A functional unit can execute an operation that requires an equal or a smaller WL, but cannot conduct operations that need a larger WL. Note that the scheduling and binding in this step are conducted based on the minimum WL information. Thus, a final WL optimization step is needed, which will be discussed in Section IV-A. Two scheduling algorithms based on the list-scheduling and the integer linear programming (ILP) models are applied, and a register binding algorithm based on clique partitioning is used in order to consider the WL information.
A. WL Conscious List Scheduling
List scheduling solves a resource constraint problem, where the constraints are usually given in terms of the number and the types of functional units [14] . In our problem, the WL of a functional unit is also considered as a constraint.
Conventional list-scheduling algorithms generate the ready list for each type of operations, such as arithmetic logic unit (ALU) and multiplication operation lists. In the proposed method, the ready lists are generated for each functional unit having a different WL. The ready list with the largest WL has the highest priority, i.e., the ready node in the list having the largest WL is assigned first. When the list for a certain control step is empty, the resource is available to another ready node having a smaller or the same WL. When several resources are available for a ready node, it is advantageous for improving the fixed-point performance to use the largest WL resource. However, if the power saving is more urgently required, the smallest WL resource can be considered.
The scheduling algorithm that selects the largest WL functional unit is shown in Fig. 5 . To bind the smallest WL functional unit, the innermost for loop should be changed to "for all resource from the smallest WL to the largest WL."
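The priority rules described above can be sketched in a few lines. This is a simplified illustration and not the algorithm of Fig. 5: the function name `wl_list_schedule` and the data layout are hypothetical, every operation is assumed to take one control step, and the ready operations are kept in one list sorted by WL rather than in one list per functional unit.

```python
def wl_list_schedule(ops, units, prefer_largest=True):
    """ops: {name: (kind, wl, [predecessors])}; units: [(kind, wl), ...].
    Returns {name: (control_step, unit_index)} with 1-based control steps.
    Assumes an acyclic graph and single-cycle operations."""
    for kind, wl, _ in ops.values():
        if not any(k == kind and w >= wl for k, w in units):
            raise ValueError("no functional unit is wide enough for an operation")
    schedule, step = {}, 0
    while len(schedule) < len(ops):
        step += 1
        done = {n for n, (s, _) in schedule.items() if s < step}
        ready = [n for n in ops
                 if n not in schedule and all(p in done for p in ops[n][2])]
        ready.sort(key=lambda n: -ops[n][1])          # largest WL first
        free = sorted(range(len(units)), key=lambda u: units[u][1],
                      reverse=prefer_largest)         # unit preference order
        for n in ready:
            kind, wl, _ = ops[n]
            for u in free:
                if units[u][0] == kind and units[u][1] >= wl:
                    schedule[n] = (step, u)
                    free.remove(u)
                    break
    return schedule

if __name__ == "__main__":
    units = [("mul", 11), ("alu", 13), ("alu", 11)]
    ops = {
        "M1": ("mul", 11, []), "M2": ("mul", 9, []),
        "A1": ("alu", 13, ["M1"]), "A2": ("alu", 11, ["M1"]),
        "A3": ("alu", 11, ["M2"]),
    }
    print(wl_list_schedule(ops, units))
```

Passing `prefer_largest=False` corresponds to the variant that binds the smallest sufficient functional unit.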
In the fourth-order IIR filter example, the data flow graph includes four 9-bit by 16-bit and three 11-bit by 16-bit multiplications, and three 13-bit, one 12-bit, five 11-bit, one 10-bit, and two 9-bit ALU operations. Since the same IWLs are assumed for the adders in the previous section, the WLs of the adders are determined as 13-bit and 12-bit. However, when the adders can have their own IWLs according to the range of their signals, the required WL can be reduced. The critical path length is six control steps. Assuming that an 11-bit by 16-bit multiplier, a 13-bit ALU, and an 11-bit ALU are used, ten control steps are required, as illustrated in Table II. The generated ready list is also shown in Table II. For example, in the fifth control step, the operations M12, M11, and M13 are in the multiplication ready list and the operations A2 and D11 are in the 11-bit ALU ready list, where D is a data move operation. The first ready operations, M12 and A2, are scheduled in this control step. Since the 13-bit ALU is not used in this control step, the operation A2 is assigned to the 13-bit ALU and D11 is assigned to the 11-bit ALU.
Since list scheduling is a resource-constrained algorithm, the WL optimizing scheduling needs iterative operations to find the minimum hardware resources that satisfy the time constraint. In our proposed method, the minimum number of functional units required for satisfying the time constraint is initially determined using a conventional scheduling method. Note that the WLs of the functional units are set to the largest WL of the operation nodes. Then, the WLs of the functional units are reduced step by step as long as the time constraint is not violated. For example, if there are three 16-bit units in the initial functional unit estimation, then the WL conscious list scheduling with one 16-bit unit and two 15-bit units is tried. If this is successful, then the WLs of the two functional units are reduced again until the scheduling fails. If it fails, the WL is restored to the last successful one and then the WL of one functional unit is reduced until the scheduling fails. The WL decreasing step is shown in Fig. 6. At each step, the WLs of the selected functional units are decreased by one bit.
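The WL decreasing procedure can be sketched as follows. This is an illustrative reading of the description above and not a transcription of Fig. 6; the name `shrink_unit_wls` is hypothetical, and `schedulable` stands for a user-supplied predicate that runs the WL conscious list scheduling and reports whether the time constraint is met.

```python
def shrink_unit_wls(n_units, max_wl, schedulable, min_wl=1):
    """Start with n_units units of max_wl bits and shrink them step by step.

    First the last n_units - 1 units are reduced together by one bit at a
    time; after the first failure the last successful WLs are kept and one
    unit fewer is reduced, until only a single unit is being reduced.
    """
    wls = [max_wl] * n_units
    for k in range(n_units - 1, 0, -1):          # shrink the last k units together
        while True:
            trial = wls[:n_units - k] + [w - 1 for w in wls[n_units - k:]]
            if min(trial) < min_wl or not schedulable(trial):
                break                            # keep the last successful WLs
            wls = trial
    return wls

if __name__ == "__main__":
    # Hypothetical stand-in for the scheduler: the constraint is met when the
    # sorted unit WLs are at least 16, 13, and 11 bits.
    def ok(wls):
        return all(a >= b for a, b in zip(sorted(wls, reverse=True), [16, 13, 11]))
    print(shrink_unit_wls(3, 16, ok))
```

With the stand-in predicate, three 16-bit units are reduced to one 16-bit, one 13-bit, and one 11-bit unit.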
Although this method is a heuristic technique, it was able to find results very close to the optimum obtained from the ILP, which is explained in Section III-B.
B. WL Conscious ILP
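ILP is a well-known formal approach to the scheduling problem [15]. We suppose that the data flow graph contains $n$ operations $o_i$ and a set of data dependencies $o_i \to o_{i'}$, and is to be scheduled into $s$ control steps. Let $x_{i,j}$ be a binary variable that is one when operation $o_i$ is scheduled at control step $j$, let $M_k$ be the number of functional units of type $k$, let $c_k$ be the cost of one such unit, and let $\mathrm{FU}_k$ be the set of operations executed on a unit of type $k$. The problem is

$$\text{minimize} \quad \sum_{k} c_k M_k \qquad (5)$$

subject to

$$\sum_{o_i \in \mathrm{FU}_k} x_{i,j} \le M_k \quad \text{for } 1 \le j \le s \qquad (6)$$

$$\sum_{j=S_i}^{L_i} x_{i,j} = 1 \quad \text{for } 1 \le i \le n \qquad (7)$$

$$\sum_{j=S_i}^{L_i} j \, x_{i,j} - \sum_{j=S_{i'}}^{L_{i'}} j \, x_{i',j} \le -1 \quad \text{for } o_i \to o_{i'} \qquad (8)$$

Equation (6) states that no control step requires more than $M_k$ functional units of type $k$, (7) means an operation $o_i$ is scheduled only once between the control steps $S_i$ and $L_i$, and (8) indicates that $o_i$ should be scheduled before $o_{i'}$ is scheduled. Since a functional unit cannot perform an operation that needs a larger WL in the proposed WL conscious scheduling algorithm, the scheduling problem is modified as follows. The type of functional unit $k$ is subdivided into $(k, w)$, where $w$ is the WL of a functional unit. Let the set of the WLs for the operations be $W$. The total functional unit cost (5) is modified as

$$\text{minimize} \quad \sum_{k} \sum_{w \in W} c_{k,w} M_{k,w} \qquad (9)$$

and the constraint on the number of functional units is modified as

$$\sum_{w' \ge w} \; \sum_{o_i \in \mathrm{FU}_{k,w'}} x_{i,j} \le \sum_{w' \ge w} M_{k,w'} \quad \text{for } 1 \le j \le s, \; w \in W \qquad (10)$$

Since $\sum_{o_i \in \mathrm{FU}_{k,w}} x_{i,j}$ is the number of operations that can be performed with a functional unit of WL $w$, the left-hand side of (10) is the number of operations requiring a WL larger than or equal to $w$. It should not exceed $\sum_{w' \ge w} M_{k,w'}$, which is the number of functional units having a WL larger than or equal to $w$. A conventional ILP model has only one constraint for each type $k$ of operation and each control step $j$. However, the WL conscious scheduling problem has as many constraints as the number of elements of $W$. For the example of Fig. 7, the integer programming formulations are as follows: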
Note that the sixth and the seventh inequalities correspond to the constraints for the first control step because the two operations and have different WLs. For minimizing the cost function , where and are set to 12 and 16, respectively, becomes one, i.e., is scheduled at the first control step.
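The effect of the WL conscious resource constraint can be illustrated by checking a given schedule against it, without solving the ILP. The sketch below is an assumption-based illustration for one type of functional unit; the name `satisfies_wl_constraints` and the data layout are hypothetical, and the 12-bit and 16-bit WLs are reused only as convenient example values.

```python
def satisfies_wl_constraints(schedule, op_wl, unit_wls):
    """schedule: {op: control_step}; op_wl: {op: required WL};
    unit_wls: WLs of the available functional units of one type.

    For every WL w that occurs among the operations and every control step,
    the number of scheduled operations needing at least w bits must not
    exceed the number of units having at least w bits.
    """
    steps = set(schedule.values())
    for w in set(op_wl.values()):
        wide_units = sum(1 for u in unit_wls if u >= w)
        for j in steps:
            demand = sum(1 for op, s in schedule.items()
                         if s == j and op_wl[op] >= w)
            if demand > wide_units:
                return False
    return True

if __name__ == "__main__":
    op_wl = {"a": 16, "b": 12, "c": 16}
    units = [16, 12]
    print(satisfies_wl_constraints({"a": 1, "b": 1, "c": 2}, op_wl, units))  # feasible
    print(satisfies_wl_constraints({"a": 1, "c": 1, "b": 2}, op_wl, units))  # two 16-bit ops in step 1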
ILP problems are solved using the ILP solver package CPLEX [16]. For the examples of a fourth-order IIR filter and a fifth-order elliptic filter, the optimal solution can be found in a few minutes using a Pentium class personal computer. The scheduling results for the fourth-order IIR filter example using the list scheduling and the ILP scheduling are compared in Table III. The parenthesized numbers represent the input WL of a multiplier whose coefficient WL is fixed to 16-bit, and the nonparenthesized WLs are for ALUs. The hardware costs for the functional units are estimated for an Altera field programmable gate array (FPGA), where separate cost models are used for a 16-bit by $n$-bit multiplier and for an $n$-bit ALU. The comparison shows that the ILP solution results in less hardware compared with the list-scheduling solution.
C. WL Conscious Register Binding
The WL conscious register binding algorithm is based on the clique partitioning algorithm. Finding a clique partition with a bounded cardinality is known to be an intractable problem. There is a heuristic that iteratively searches for the maximum clique of a graph and then deletes it from the graph until there are no more vertices [17]. This heuristic is modified to minimize the total number of bits for the allocated registers. A node that has the largest WL is initially selected instead of a node with the largest degree. Then, the nodes are selected in order of WL size from the candidate set, as described in Fig. 8. Since the variables that need a similar WL are grouped, the total number of bits for registers can be reduced.
In the fourth-order IIR filter example, the proposed register binding algorithm requires four 13-bit registers, two 12-bit registers, and one 11-bit register (total 87 bits), while a conventional register binding algorithm needs seven 13-bit registers (total 91 bits).
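A greedy variant of this idea can be sketched with variable lifetimes. This is a simplified illustration, not the procedure of Fig. 8: the function name `bind_registers` and the lifetime encoding are hypothetical, and two variables are assumed to be compatible when their lifetimes do not overlap.

```python
def bind_registers(variables):
    """variables: {name: (wl, birth, death)} with half-open lifetime [birth, death).
    Returns a list of (register_wl, [variables sharing the register])."""
    def compatible(a, b):
        _, b1, d1 = variables[a]
        _, b2, d2 = variables[b]
        return d1 <= b2 or d2 <= b1            # lifetimes do not overlap

    # Candidates in descending WL order, so each clique is seeded by the
    # widest remaining variable and grows with the next widest ones.
    remaining = sorted(variables, key=lambda v: -variables[v][0])
    registers = []
    while remaining:
        clique = [remaining[0]]
        for v in remaining[1:]:
            if all(compatible(v, m) for m in clique):
                clique.append(v)
        registers.append((variables[clique[0]][0], clique))
        remaining = [v for v in remaining if v not in clique]
    return registers

if __name__ == "__main__":
    vs = {"a": (13, 0, 2), "b": (9, 0, 2), "c": (13, 2, 4), "d": (9, 2, 4)}
    regs = bind_registers(vs)
    print(regs, sum(w for w, _ in regs))
```

In the toy input, the two 13-bit variables share one register and the two 9-bit variables share another, for 22 bits in total; pairing a 13-bit variable with a 9-bit one in both registers would need 26 bits.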
IV. OVERALL WL OPTIMIZATION AND SHIFT MINIMIZATION
A. WL Optimization of Functional Units
When the simulation result after hardware binding satisfies the desired fixed-point performance, no further optimization is required. Otherwise, the WLs of the functional units should be increased until the desired performance is obtained. Note that the WL of a functional unit cannot be further reduced because this corresponds to the minimum WL of bound operations.
In our previous work, two strategies, the exhaustive and the heuristic search methods, were developed for this purpose [2]. The WLs of all functional units constitute a WL vector in this search step. In the exhaustive search algorithm, the WL vectors are tested in the order of increasing hardware cost starting from the minimum WL vector. This method can find the minimum cost WL vector that satisfies the fixed-point constraint. However, the number of simulations it requires grows exponentially with the number of groups. Thus, this method is not applicable when the number of groups is greater than a threshold, which is usually six. In the proposed method, the exhaustive search algorithm can be used in most cases because the number of groups is significantly reduced due to the use of the hardware sharing information for grouping.
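Testing WL vectors in the order of increasing hardware cost can be sketched with a priority queue. The function name `cheapest_feasible`, the linear cost model, and the performance predicate in the usage example are all hypothetical stand-ins; in particular, the cost model is not the FPGA cost model of this paper, and each call of `meets_spec` stands for one fixed-point simulation. The sketch assumes the cost does not decrease when any WL is increased.

```python
import heapq

def cheapest_feasible(min_wls, cost, meets_spec, max_wl=32):
    """Best-first search over WL vectors starting from the minimum WL vector.

    Returns (wl_vector, cost, number_of_simulations) for the first vector,
    in order of increasing cost, that satisfies meets_spec, or None.
    """
    start = tuple(min_wls)
    heap, seen, n_sim = [(cost(start), start)], {start}, 0
    while heap:
        c, wls = heapq.heappop(heap)
        n_sim += 1                               # one fixed-point simulation
        if meets_spec(wls):
            return wls, c, n_sim
        for i in range(len(wls)):                # try one more bit on each unit
            nxt = wls[:i] + (wls[i] + 1,) + wls[i + 1:]
            if nxt[i] <= max_wl and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (cost(nxt), nxt))
    return None

if __name__ == "__main__":
    # Hypothetical cost: a 16-bit by w0 multiplier plus two ALUs of w1 and w2 bits.
    cost = lambda w: 16 * w[0] + w[1] + w[2]
    # Hypothetical stand-in for the simulation: the first ALU needs 15 bits.
    spec = lambda w: w[1] >= 15
    print(cheapest_feasible((11, 13, 11), cost, spec))
```

Because the number of candidate vectors grows with the number of functional units, the search is practical only when hardware sharing keeps that number small.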
For the fourth-order IIR filter example, the WL optimization after the hardware binding is performed as follows. In this example, when the scheduling result requires three functional units, namely an 11-bit by 16-bit multiplier, a 13-bit ALU, and an 11-bit ALU, the minimum WL vector is (11, 13, 11). Note that one input of the multiplier is fixed at 16-bit and is not included in the WL vector. In the exhaustive search algorithm, the WLs of the ALUs are increased first, as shown in Table IV. Since the fixed-point performance after this WL increase meets the desired fixed-point performance, the WL of the multiplier is not increased at all. The minimum cost WL vector is determined as (11, 15, 11).
For comparison purposes, the WL optimization in the previous approach is demonstrated in Table V for the fourth-order IIR filter example of Fig. 4 [2]. This method groups the signals only by analyzing the signal flow graph. To improve the performance, the manual grouping method is also used. Group 1 includes the signal X, group 2 includes M1, A2, A3, M4, M5, D10, and D11, group 3 includes M6, A7, A8, A9, A10, M11, M12, D20, and D21, and group 4 includes the M13, A14, and A15 signals. The hardware cost to be minimized is calculated over all multiply and add operations. The results show that the exhaustive search method requires a total of 16 iterations for the optimum WL search, while the proposed method needs only four iterations. The high-level synthesis using this WL information is then conducted. When the number of control steps is ten, the synthesized result needs one 13-bit multiplier, one 13-bit ALU, and one 12-bit ALU, which corresponds to a total hardware cost of 846.5 under the multiplier and ALU cost models. Note that the hardware cost for ten control steps that results from WL optimization followed by high-level synthesis is about 15% larger than the optimization result using the proposed algorithm.
B. Shifter Minimization
The synthesized architecture using fixed-point hardware needs shifters between functional units and registers. The shifters are used for aligning the binary points of adder inputs to outputs and for scaling multiplier outputs. For an adder, the input signals from registers are aligned to the output signal using a shifter, while the output is stored to registers without requiring any shift, as shown in Fig. 9. For a multiplier, the input ports are directly connected to registers, but the output is connected to a shifter, as shown in Fig. 9. For a right shift, the most significant bits (MSBs) are sign extended and the MSB of the truncated bits is used as the carry-in signal of the adders for rounding. For a left shift, the least significant bits (LSBs) are filled with zeros and the MSBs are discarded, but overflows do not occur because the IWLs are carefully determined through the range estimation. The shifters are implemented using multiplexers that select one of several signals having different shift amounts. For example, a three-input multiplexer is needed when a 2-bit right shift is required for control step 1, a 1-bit left shift for control step 2, and no shift for the other control steps. The hardware cost of a shifter depends on the number of multiplexer inputs.
The hardware cost of the shifters may be reduced by exchanging the two inputs of functional units. For example, assuming that 1-bit and 2-bit shifts are required for the two inputs of an adder at control step 1 and 2-bit and 1-bit shifts are required at control step 2, the shifters can be removed by swapping the two inputs at the second control step. The shift reduction by input exchange is possible when the operation is commutative and the shift amounts of the two inputs are different.
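The multiplexer count and the effect of input exchange can be sketched as follows. The greedy per-step swapping rule and the function names are assumptions made for illustration and are not claimed to be the shift minimization procedure used in this work; shift amounts are encoded as signed integers, with negative values for right shifts.

```python
def mux_inputs(shifts):
    """Multiplexer inputs needed by a shifter: one per distinct shift amount."""
    return len(set(shifts))

def minimize_by_swapping(port_a, port_b):
    """Greedy input exchange for a commutative operation.

    port_a[i] and port_b[i] are the shift amounts needed at the two inputs
    in the i-th use of the functional unit. At each use, the two inputs are
    swapped when that lowers the total number of distinct shift amounts
    seen so far on the two ports.
    """
    a, b = [], []
    for sa, sb in zip(port_a, port_b):
        keep = len(set(a + [sa])) + len(set(b + [sb]))
        swap = len(set(a + [sb])) + len(set(b + [sa]))
        if swap < keep:
            sa, sb = sb, sa
        a.append(sa)
        b.append(sb)
    return a, b

if __name__ == "__main__":
    # 2-bit right shift, 1-bit left shift, and no shift elsewhere.
    print(mux_inputs([-2, 1, 0, 0]))
    # 1-bit and 2-bit shifts at step 1, 2-bit and 1-bit shifts at step 2.
    print(minimize_by_swapping([1, 2], [2, 1]))
```

After the exchange in the second example, each port sees a single constant shift amount, so the shift can be hardwired and no multiplexer is needed.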
C. Constant Multiplication Conversion
Constant multiplication can be implemented using a few shift and add operations [18]. In our architecture, the scaling shifters between registers and adder inputs can be used for the shifts in multiplications. Since constant multiplications can be converted to a sequence of add operations, the number of inputs for adder clusters is increased and the adders require more bits for their FWL. The adders used for the implementation of multiplications may have smaller WLs than the accumulation adders because the scaled signals have a small IWL, but they have the same FWL as that of the output signal.
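The conversion of a constant multiplication into shifts and additions can be illustrated with a canonic signed digit (CSD) recoding. This is a generic integer sketch with hypothetical function names, not the conversion tool used in this work; a fractional coefficient would be handled by applying it to the integer mantissa and adjusting the binary point.

```python
def csd_digits(c: int):
    """CSD recoding of a non-negative integer, least significant digit first.
    Each digit is -1, 0, or +1 and no two adjacent digits are nonzero."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)      # +1 if c % 4 == 1, -1 if c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def multiply_shift_add(x: int, c: int) -> int:
    """Multiply x by the constant c using only shifts, additions, and subtractions."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d:
            acc += d * (x << shift)   # d is +1 or -1: one add or one subtract
    return acc

if __name__ == "__main__":
    print(csd_digits(7))              # 7 = 8 - 1
    print(multiply_shift_add(5, 7))
```

Multiplying by 7 thus takes one subtraction of the input from its 3-bit left-shifted copy instead of a hardware multiplier.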
The fourth-order IIR filter can be implemented without using hardware multipliers. In this example, seven multiplications are converted to 31 additions and subtractions. The WLs of the ALU operations range from 1 bit to 15 bits, where about half of the operations require WLs exceeding 12 bits. The scheduling results are shown in Table VI. Note that the ILP scheduling result requires more functional units than the list scheduling result in some cases, but the total cost is always smaller.
V. IMPLEMENTATION EXAMPLES
A. Fifth-Order Elliptic Filter
A fifth-order elliptic filter having a ladder structure, shown in Fig. 10, is implemented. This design is one of the well-known high-level synthesis benchmarks. The filter coefficients in reference [19] are used for the implementation. They have a 6-bit WL with a 5-bit fraction. A uniformly distributed white noise is used for the input signal, and an SNR of 40 dB for the output signal is set as the desired fixed-point performance.

Fig. 9. Example of synthesized architecture using fixed-point hardware.
At first, the WL optimization is conducted without the multiplication conversion. The design is grouped into 17 adder clusters. For example, cluster 1 consists of adders 1, 30, and 25, while cluster 2 is formed with adders 2 and 35. According to the range estimation, the IWLs of signals are determined between 15-19 bits. The FWLs of the output signals of the adder clusters and those of the internal adders in the adder clusters are determined as 8 to 11 and 7 to 10, respectively, by the fixed-point simulations. This step requires most of the optimization time because it needs a total of 44 fixed-point simulations. The scheduling results are shown in Table VII. The parenthesized WL is the input WL of a multiplier whose other input WL is 6-bit. The hardware cost is estimated using Altera FPGA synthesis results, with one cost model for a 6-bit by $n$-bit multiplier and another for an $n$-bit ALU. According to the scheduling results using 18 control steps, a 6-bit by 12-bit multiplier, a 12-bit ALU, and an 11-bit ALU are required. The system performance after the scheduling becomes 42.86 dB. Since the desired performance is obtained, no more optimization is required. The numbers of multiplexer inputs for scaling shift are reduced from 3, 4, 3, and 3 to 3, 3, 2, and 3 through the shift minimization.
When the constant multiplication conversion is used, eight multiplications are converted to only ten ALU operations because the original coefficients are canonic signed digit optimized. The WLs of adders are increased because of the enlarged adder cluster size. The scheduling results are shown in Table VIII , which shows that the list-scheduling algorithm can find the solution very close to the optimum result in this example.
B. Adaptive LMS Filter
A 12th-order adaptive LMS filter shown in Fig. 11 is implemented. It is used for a channel identification system, and the channel is modeled by a 12th-order finite-impulse response (FIR) filter. The difference between the output signal of the channel and that of the adaptive filter is used as the error signal of the adaptive filter. A uniformly distributed white noise is used as the input of both the channel and the adaptive filter. When the WL of the adaptive filter coefficients or the input data is not sufficient, the system not only converges slowly, but the error after the convergence becomes large as well. Thus, the average power of the error signal after some time-off period is used as the fixed-point performance measure [2].
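The performance measure can be illustrated with a small simulation. The sketch below is not the simulation setup of this work: the channel coefficients, the input range of $\pm 1$, the step size, the settling length, and the quantization points (filter output and coefficient update only, with a common FWL) are all assumptions chosen for illustration.

```python
import random

def q(x, fwl):
    """Round x to a grid of step 2**-fwl; fwl=None means no quantization."""
    if fwl is None:
        return x
    scale = 2.0 ** fwl
    return round(x * scale) / scale

def lms_error_power(fwl, taps=12, n=3000, settle=2000, mu=0.05, seed=1):
    """Average error power of an LMS channel identifier after a settling period."""
    rng = random.Random(seed)
    channel = [rng.uniform(-0.5, 0.5) for _ in range(taps)]  # hypothetical channel
    w = [0.0] * taps          # adaptive filter coefficients
    x = [0.0] * taps          # delay line shared by channel and adaptive filter
    acc = 0.0
    for t in range(n):
        x = [rng.uniform(-1.0, 1.0)] + x[:-1]
        d = sum(c * xi for c, xi in zip(channel, x))          # channel output
        y = q(sum(wi * xi for wi, xi in zip(w, x)), fwl)      # quantized filter output
        e = d - y
        w = [q(wi + mu * e * xi, fwl) for wi, xi in zip(w, x)]
        if t >= settle:
            acc += e * e
    return acc / (n - settle)

if __name__ == "__main__":
    for fwl in (6, 12, None):
        print(fwl, lms_error_power(fwl))
```

With too few fractional bits the coefficient updates round to zero before the filter has converged, so the residual error power stays high, which is the behavior the measure is meant to capture.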
Each adder used for the filter coefficient update is separated and the adders used for the summation of the filter output are clustered together. The IWLs and the FWLs are determined by the simulation results as shown in Table IX . When the delay operations of the input signal are implemented without ALU, the scheduling result needs 13 control steps with an 11-bit by 13-bit multiplier, an 11-bit by 10-bit multiplier, a 15-bit ALU, and a 13-bit ALU.
Because the simulation result after hardware sharing does not satisfy the desired performance, the exhaustive search algorithm is performed for the final WL optimization. Six WLs should be determined: four for the input WLs of the two multipliers and two for the two adders. This number is smaller than that of the previous research, in which there are 15 signal groups unless the manual grouping is employed [2]. In this step, the 11-bit by 10-bit multiplier is replaced by an 11-bit by 11-bit multiplier and the 15-bit ALU is replaced by a 16-bit ALU to meet the desired performance.
VI. CONCLUDING REMARKS
A combined WL optimization and high-level synthesis approach has been developed that results in a more efficient and cost-effective design when compared with the previous approaches of WL optimization followed by high-level synthesis. The developed method also requires less time for optimization since the use of the hardware sharing information for signal grouping results in fewer signal groups. The reduction of optimization time is very important because this method evaluates the fixed-point performance by simulation [6], [7]. Note that the proposed method needs the same optimization time for scheduling and binding when compared with the previous WL conscious hardware synthesis method. The developed algorithm contains two WL conscious scheduling algorithms: one is based on list scheduling and the other on the ILP. The list-scheduling algorithm can find a solution close to the optimum result within a short time, while the ILP algorithm can find the optimum solution. A new signal flow graph grouping algorithm and shifter minimization are also implemented. The proposed quantization method that quantizes both inputs and outputs of multiplication helps to balance the adder and multiplier hardware costs. When the constant multiplication conversion is used, we can further reduce the hardware cost because the scaled signals in multiplication require very small WLs.
A fourth-order IIR filter, the fifth-order elliptic filter from high-level synthesis benchmarks [19] , and an adaptive LMS filter are implemented using the developed software. The hardware cost is reduced by 15% in the fourth-order IIR filter and 7% in the elliptic filter example using the new algorithm. The number of signal groups is reduced from 15 to 6 in the adaptive filter example by using the hardware sharing information.
