Reducing the number of operations improves important design metrics such as area, cost, throughput, and power consumption for both custom ASIC and programmable processor implementations. We propose a novel technique to minimize the number of operations in DSP computations. The first step of the approach logically partitions a computation into strongly connected components. The second step optimizes each component separately. In the third step, the components are merged for further optimization. Finally, the components are scheduled to minimize memory consumption. The effectiveness of our approach is demonstrated on real-life examples.
INTRODUCTION
Reducing the number of operations needed for a given computation decreases cost, area, and power consumption, and increases the throughput of custom datapath ASIC implementations. For programmable processor implementations, the throughput is mostly determined by the number of operations, and power consumption can be decreased through voltage scaling, which the extra throughput makes possible.
We illustrate the key ideas of our approach for minimizing the number of operations by considering the computation of Figure 1. Each node represents a subpart of the computation. We make the following assumptions only to simplify the presentation of this example; we stress that they are not necessary for our approach. We assume that each subpart is linear and dense, which means that every output and state in a subpart is a linear combination of all inputs and states in the subpart with no 0, 1, or -1 coefficients. The number inside a node is the number of delays or states in the subpart. We assume that when there is an arc from a subpart X to a subpart Y, every output and state of Y depends on all inputs and states of X.
The number of operations per input sample is initially 2081. (Figure 2 illustrates how the number of operations is calculated in the maximally fast procedure [7], using a simple linear computation with 2 states and 1 output.) Using the technique of [10], which unfolds the entire computation, the number can be reduced to 725 with an unfolding factor of 12. Our approach can optimize each subpart separately, which is made possible by isolating the subparts using pipeline delays. Figure 3 shows the resulting computation after the isolation step. The separate optimization step results in 522.27 operations. We then merge subparts to optimize further. If the subparts C and D are merged and optimized together, the number of operations is further reduced to 399.4. Overall, the approach reduces the number of operations by a factor of 1.82 relative to the technique of [10] and by a factor of 5.2 relative to the initial number of operations.
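The improvement factors quoted for this example follow directly from the operation counts given above. The ratio for separate optimization alone is not stated in the text and is included here only for comparison:

```latex
\[
\frac{725}{522.27} \approx 1.39, \qquad
\frac{725}{399.4} \approx 1.82, \qquad
\frac{2081}{399.4} \approx 5.2
\]
```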
The main technical innovation of the research presented in this paper is the first approach for the minimization of the number of operations in general computations. The approach not only handles a significantly wider set of computations than previously published techniques [10], but also outperforms or performs at least as well as other techniques on all examples.
The rest of the paper is organized as follows. In Section 2, we briefly review the related work on the minimization of the number of operations. Section 3 presents the key idea of the new approach and describes its optimization techniques. Section 4 illustrates the effectiveness of the technique using real-life examples. Finally, Section 5 draws conclusions.
RELATED WORK
In this section, we briefly review the related work on the minimization of the number of operations. Potkonjak and Rabaey [7] addressed the minimization of the number of multiplications and additions in linear computations in their maximally fast form so that the throughput is preserved. Potkonjak et al. [8] presented a set of techniques for minimization of the number of shifts and additions in linear computations. Sheliga and Sha [9] presented an approach for minimization of the number of multiplications and additions in linear computations. Srivastava and Potkonjak [10] developed an approach for the minimization of the number of operations in linear computations using unfolding and the application of the maximally fast procedure. Guerra et al. [2] developed a divide-and-conquer approach for minimizing critical paths.
OPTIMIZATION APPROACH
The core of the approach is presented in the pseudo-code of Figure 4. The rest of this section explains the global flow of the approach in more detail.
The first step of the approach is to identify the computation's strongly connected components (SCCs), using the standard depth-first search-based algorithm [11]. For any pair of operations A and B within an SCC, there exist both a path from A to B and one from B to A. The SCCs are isolated from each other using pipeline delays, which enables us to optimize each subpart separately. The inserted pipeline delays are treated as a subpart input or output. As a result, every output and state in a subpart depends only on the subpart's inputs and states. Note that this isolation is not affected by unfolding.
Figure 4. Pseudo-code of the approach: Decompose a computation into strongly connected components (SCCs); Use pipelining to isolate the SCCs; Minimize the number of delays using retiming; [...]
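The decomposition step can be sketched as follows. This is a minimal illustration, assuming the computation is given as a plain adjacency dictionary and using Tarjan's algorithm as one standard depth-first search-based method (the text does not say which variant [11] describes). The function names and the toy graph are hypothetical.

```python
def strongly_connected_components(graph):
    """Return the SCCs of a directed graph given as {node: [successors]}."""
    index = {}      # discovery order of each visited node
    lowlink = {}    # smallest discovery index reachable from the node
    stack, on_stack = [], set()
    sccs = []

    def visit(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of an SCC
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    nodes = set(graph) | {w for succs in graph.values() for w in succs}
    for v in sorted(nodes):
        if v not in index:
            visit(v)
    return sccs


def crossing_arcs(graph, sccs):
    """Arcs between different SCCs, i.e. candidate places for pipeline delays."""
    owner = {v: k for k, comp in enumerate(sccs) for v in comp}
    return sorted((u, w) for u, succs in graph.items()
                  for w in succs if owner[u] != owner[w])


# Hypothetical toy computation: two feedback loops joined by one arc.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
components = strongly_connected_components(toy)
print(sorted(sorted(c) for c in components))
print(crossing_arcs(toy, components))
```

In the toy graph, the only arc between components is the one from `b` to `c`, which is where an isolating pipeline delay would be placed. The recursive formulation is kept for readability; very large graphs would need an iterative version.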
In the next step, the number of delays in the computation is minimized using the retiming algorithm of Leiserson and Saxe. Linear SCCs are optimized with the technique of [10], which uses unfolding and the maximally fast procedure [7]. We note that the ratio analysis of [9] can be used instead of the maximally fast procedure. Srivastava and Potkonjak [10] provide a closed-form formula for the optimal unfolding factor under the assumption of dense linear computations; it is given in Figure 5. For sparse linear computations, they propose a heuristic that continues to unfold until there is no further improvement.
When an SCC is classified as nonlinear, all nonlinear operations are isolated from the SCC so that the remaining linear subparts can be optimized. All arcs from nonlinear operations to the linear subparts are considered as inputs to the linear subparts, and all arcs from linear subparts to the nonlinear operations are considered as outputs from the linear subparts. The linear subparts are logically partitioned into SCCs, and each SCC is optimized by the approach described in the previous paragraph.
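To make the effect of unfolding concrete, the sketch below unfolds a small state-space computation x[n+1] = A x[n] + B u[n], y[n] = C x[n] + D u[n] and counts operations per input sample. It rests on several assumptions that are not taken from the paper: "unfolded i times" is read as processing i + 1 samples per iteration, and the counting model (one multiplication per coefficient other than 0, 1, or -1, and one addition per extra nonzero term in a row) is a simplified stand-in for the maximally fast procedure [7], not that procedure itself. All names and coefficient values are hypothetical.

```python
from fractions import Fraction as F


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def hstack(*blocks):
    """Concatenate matrices with equal row counts side by side."""
    return [sum(rows, []) for rows in zip(*blocks)]


def unfold(A, B, C, D, i):
    """Batch form over U = [u[n], ..., u[n+i]] and Y = [y[n], ..., y[n+i]]:
    x[n+i+1] = As x[n] + Bs U,  Y = Cs x[n] + Ds U."""
    R, P, Q = len(A), len(B[0]), len(C)
    powers = [[[F(int(r == c)) for c in range(R)] for r in range(R)]]  # A^0
    for _ in range(i + 1):
        powers.append(matmul(A, powers[-1]))  # powers[k] == A^k
    zero = [[F(0)] * P for _ in range(Q)]
    As = powers[i + 1]
    Bs = hstack(*[matmul(powers[i - k], B) for k in range(i + 1)])
    Cs, Ds = [], []
    for k in range(i + 1):
        Cs += matmul(C, powers[k])
        past = [matmul(matmul(C, powers[k - 1 - m]), B) for m in range(k)]
        Ds += hstack(*(past + [D] + [zero] * (i - k)))
    return As, Bs, Cs, Ds


def ops_per_sample(As, Bs, Cs, Ds, i):
    """Assumed cost model: multiplications plus additions, per input sample."""
    rows = hstack(As, Bs) + hstack(Cs, Ds)
    mults = sum(1 for row in rows for c in row if c not in (0, 1, -1))
    adds = sum(max(sum(1 for c in row if c != 0) - 1, 0) for row in rows)
    return (mults + adds) / (i + 1)


# Hypothetical dense computation with 2 states, 1 input, and 1 output.
A = [[F(1, 2), F(1, 3)], [F(1, 5), F(1, 7)]]
B = [[F(2)], [F(3)]]
C = [[F(5), F(7)]]
D = [[F(1, 11)]]
for i in range(4):
    print(i, ops_per_sample(*unfold(A, B, C, D, i), i))
```

Under this model, the cost of the state update is amortized over more samples as the unfolding factor grows, while the cost of computing the outputs grows with the number of past inputs each output depends on, which is why an intermediate unfolding factor can be optimal. Exact rational arithmetic is used so that structural zeros are not confused with rounding noise.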
Figure 7. i times unfolded state-space equations
Sometimes it is beneficial to decompose a computation into subparts larger than SCCs. Consider the example given in Figure 6, under the same assumptions made for the motivational example in Section 1. Separately optimiz[...]

Initially, we consider only the merging of SCCs. When two SCCs are merged, however, the merged subpart does not form an SCC. Thus, in general, we must consider merging any adjacent arbitrary subparts. Suppose we consider merging subparts $i$ and $j$. The gain $\mathrm{GAIN}(i,j)$ of merging subparts $i$ and $j$ is computed as

$$\mathrm{GAIN}(i,j) = \mathrm{COST}(i) + \mathrm{COST}(j) - \mathrm{COST}(i,j),$$

where $\mathrm{COST}(i)$ is the number of operations for subpart $i$ and $\mathrm{COST}(i,j)$ is the number of operations for the merged subpart of $i$ and $j$. To compute the gain, $\mathrm{COST}(i,j)$ must be computed, which requires obtaining the constant coefficient matrices [...]

[...] which gives the smaller value of [...]

$S_j$ = number of states in state group $j$
$O_j$ = number of outputs in output group $j$
$IO_j$ = number of inputs that output group $j$ depends on
$IS_j$ = number of inputs that state group $j$ depends on
$SO_j$ = number of states that output group $j$ depends on
$SS_j$ = number of states that state group $j$ depends on

The pseudo-code is provided in Figure 9. The algorithm is simple: until there is no improvement, merge the pair of subparts that produces the highest gain. The other heuristic algorithm is based on a general combinatorial optimization technique known as simulated annealing [4].
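The greedy merging rule described above (repeatedly merge the adjacent pair with the highest gain until no merge improves the result) can be sketched as follows. This is an illustration only, not the pseudo-code of Figure 9: the `cost` and `adjacent` callables, the subpart names, and all cost values are hypothetical placeholders for the real operation counts.

```python
def greedy_merge(subparts, adjacent, cost):
    """Merge adjacent subparts while some merge has positive gain.

    subparts: iterable of node collections, one per initial subpart
    adjacent: callable(p, q) -> True if an arc connects subparts p and q
    cost:     callable(p) -> number of operations for subpart p
    """
    parts = [frozenset(p) for p in subparts]
    while True:
        best_gain, best_pair = 0, None
        for a in range(len(parts)):
            for b in range(a + 1, len(parts)):
                if not adjacent(parts[a], parts[b]):
                    continue
                # GAIN(i, j) = COST(i) + COST(j) - COST(i, j)
                gain = cost(parts[a]) + cost(parts[b]) - cost(parts[a] | parts[b])
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge improves the total cost
            return parts
        a, b = best_pair
        merged = parts[a] | parts[b]
        parts = [p for k, p in enumerate(parts) if k not in (a, b)] + [merged]


# Hypothetical chain A -> B -> C -> D with made-up operation counts.
arcs = {("A", "B"), ("B", "C"), ("C", "D")}
table = {frozenset("A"): 100, frozenset("B"): 80, frozenset("C"): 150,
         frozenset("D"): 120, frozenset("AB"): 190, frozenset("BC"): 240,
         frozenset("CD"): 200}


def toy_cost(part):
    # Unlisted merges are assumed to cost slightly more than their pieces.
    return table.get(part, sum(table[frozenset(n)] for n in part) + 10)


def toy_adjacent(p, q):
    return any((u in p and v in q) or (u in q and v in p) for u, v in arcs)


result = greedy_merge(["A", "B", "C", "D"], toy_adjacent, toy_cost)
print(sorted("".join(sorted(p)) for p in result))
```

With these made-up costs, only merging C and D has a positive gain, so the procedure stops after one merge. Each iteration re-evaluates every adjacent pair, which is affordable when the number of subparts is small but would call for caching the merged costs otherwise.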
Since the subparts of a computation are unfolded separately with different unfolding factors, we need to address the problem of scheduling the subparts. They should be scheduled so that the memory requirements for the code and data of the schedule are minimized. We observe that the unfolded subparts can be represented by a multi-rate synchronous dataflow graph [5], so the work of [1] can be used directly.
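The multi-rate nature of the scheduling problem can be illustrated with a small sketch. It assumes, as a simplification not stated in the text, that a subpart with unfolding factor i consumes and produces i + 1 samples per invocation and that all subparts run at the same underlying sample rate. Under that assumption, the number of invocations of each subpart in one periodic schedule follows from balancing the sample counts. The function name and the example factors are hypothetical, and the memory-minimizing scheduling of [1] is not reproduced here.

```python
from functools import reduce
from math import gcd


def repetitions(unfolding_factors):
    """Return (samples per schedule period, invocations of each subpart).

    unfolding_factors: {subpart name: unfolding factor i}, where a subpart
    is assumed to process i + 1 samples per invocation.
    """
    batch = {name: i + 1 for name, i in unfolding_factors.items()}
    # The period is the least common multiple of all batch sizes.
    period = reduce(lambda a, b: a * b // gcd(a, b), batch.values(), 1)
    return period, {name: period // size for name, size in batch.items()}


# Hypothetical unfolding factors for three subparts.
period, counts = repetitions({"A": 1, "B": 2, "C": 5})
print(period, counts)
```

Here the subparts process 2, 3, and 6 samples per invocation, so one schedule period covers 6 samples, with 3, 2, and 1 invocations respectively. The order in which these invocations are interleaved is what determines the buffer memory between subparts.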
EXPERIMENTAL RESULTS
This section presents the experimental results of our technique on real-life examples, where Filter is an analog-to-digital converter (ADC) followed by an 18th-order parallel filter, and Video Filter is two ADCs followed by a 12th-order two-dimensional (2D) IIR filter. DAC, modem, and GE controller are linear computations; the rest are nonlinear computations. The fifth column of Table 1 provides only the improvement factor of our method over the initial number of operations, since [10] is either ineffective or inapplicable for all examples. Our method reduces the number of operations by an average factor of 1.82 (42.9% on average) for the examples, which indicates the effectiveness of our new method.
CONCLUSION
We proposed a novel technique to minimize the number of operations in DSP computations. The effectiveness of our approach was demonstrated on real-life examples. Our method reduces the number of operations by an average factor of 1.82 (42.9% on average) on examples for which previous techniques are either ineffective or inapplicable.
