We introduce an approach for power optimization using a set of compilation and architectural techniques. The key technical innovation is a novel divide-and-conquer compilation technique to minimize the number of operations for general computations. Our technique not only optimizes a significantly wider set of computations than previously published techniques, but also outperforms, or performs at least as well as, other techniques on all examples. Along the architectural dimension, we investigate the coordinated impact of compilation techniques on the number of processors which provides the optimal trade-off between cost and power. We demonstrate that proper compilation techniques can significantly reduce power with bounded hardware cost. The effectiveness of all techniques and algorithms is documented on numerous real-life designs.
INTRODUCTION

Motivation
The pace of progress in integrated circuits and system design has been dictated by the push from application trends and the pull from technology improvements. The goal and role of designers and design tool developers has been to develop design methodologies, architectures, and synthesis tools which connect changing worlds of applications and technologies.
Recently, a new class of portable applications has been forming a new market at an exceptionally high rate. The applications and products of the portable wireless market are defined by their intrinsic demand for portability, flexibility, cost sensitivity, and by their high digital signal processing (DSP) content [Schneiderman 1994 ]. Portability translates into the crucial importance of low power design, flexibility results in a need for programmable platforms implementation, and cost sensitivity narrows architectural alternatives to a uniprocessor or an architecture with a limited number of off-the-shelf standard processors. The key optimization degree of freedom for relaxing and satisfying this set of requirements comes from properties of typical portable computations. The computations are mainly linear, but rarely 100% linear due to either need for adaptive algorithms or nonlinear quantization elements. Such computations are well suited for static compilation and intensive quantitative optimization.
Two recent technological trends are relevant: the reduced minimum feature size, and therefore reduced supply voltages, of deep submicron technologies, and the introduction of ultra low power technologies. In the dominant digital CMOS technologies, power consumption is proportional to the square of the supply voltage ($V_{dd}$). The most effective techniques reduce $V_{dd}$ while compensating for the resulting speed reduction using a variety of architectural and compilation techniques [Singh et al. 1995].
The main limitation of conventional technologies with respect to power minimization is also related to $V_{dd}$ and the threshold voltage ($V_t$). In traditional bulk silicon technologies both voltages are commonly limited to the range above 0.7V. However, in the last few years ultra low power silicon on insulator (SOI) technologies, such as SIMOX (SOI using separation of oxygen), bond and etchback SOI (BESOI), and silicon-on-insulator-with-active-substrate (SOIAS), have reduced both $V_{dd}$ and $V_t$ to well below 1V [El-Kareh et al. 1995; Ipposhi et al. 1995]. There are a number of reported ICs which have values for $V_{dd}$ and $V_t$ in a range as low as 0.05V-0.1V.
Our goal in this paper is to develop a system of synthesis and compilation methods and tools for the realization of portable applications. Technically restated, the primary goal is to develop techniques which efficiently compile typical DSP wireless applications on single and multiple programmable processors, assuming both traditional bulk silicon and the newer SOI technologies. Furthermore, we study achievable power-cost trade-offs when parallelism is traded for power reduction on programmable processor platforms. Our design methodology has two steps. The first step is to find power-efficient solutions for a single processor implementation by applying the new technique described in Section 4. The second step is to continue to add processors until the reduction in average power consumption is not enough to justify the cost of an additional processor. This step generates cost-effective and power-efficient solutions. This straightforward design methodology produces implementations with low cost and low power consumption for the given design requirements.
The main technical innovation of the research presented in this paper is that it is the first approach for the minimization of the number of operations in arbitrary computations. The approach optimizes not only a significantly wider set of computations than the other previously published techniques [Parhi and Messerschmitt 1991; Srivastava and Potkonjak 1996 ], but also outperforms or performs at least as well as other techniques on all examples. The novel divide-and-conquer compilation procedure combines and coordinates power and enabling effects of several transformations (using a well organized ordering of transformations) to minimize the number of operations in each logical partition. To the best of our knowledge this is the first approach for minimization of the number of operations which in an optimization-intensive way treats general computations.
The second technical highlight is the quantitative analysis of the cost vs. power trade-off on multiple programmable processor implementation platforms. We derive a condition under which the optimization of the cost-power product using parallelization is beneficial.
Paper Organization
The rest of the paper is organized in the following way: First, in the next section we summarize the relevant background material. In Section 3 we review the related work on power estimation and optimization as well as on program optimization using transformations, and in particular, the minimization of the number of operations. Sections 4 and 5 are the technical core of this paper and present a novel approach for minimization of the number of operations for general DSP computations and explore compiler and technology impact on power-cost tradeoffs of multiple processors-based low power application-specific systems. We then present comprehensive experimental results and their analysis in Section 6 followed by conclusions in Section 7.
PRELIMINARIES
Before we delve into technical details of the new approach, we outline the relevant preliminaries in this section. In particular, we describe application and computation abstractions, selected implementation platform at the technology and architectural level, and power estimation related background material.
Computational Model
We selected as our computational model synchronous data flow (SDF) [Lee and Messerschmitt 1987; Lee and Parks 1995] . Synchronous data flow (SDF) is a special case of data flow in which the number of data samples produced or consumed by each node on each invocation is specified a priori. Nodes can be scheduled statically at compile time onto programmable processors. We restrict our attention to homogeneous SDF (HSDF), where each node consumes and produces exactly one sample on every execution. The HSDF model is well suited for specification of single-task computations in numerous application domains such as digital signal processing, video and image processing, broadband and wireless communications, control, information and coding theory, and multimedia.
The syntax of a targeted computation is defined as a hierarchical control data-flow graph (CDFG) [Rabaey et al. 1991] . The CDFG represents the computation as a flow graph, with nodes, data edges, and control edges. The semantics underlying the syntax of the CDFG format, as we already stated, is that of the synchronous data flow computation model.
The only relevant speed metric is throughput, the rate at which the implementation is capable of accepting and processing the input samples from two consecutive iterations. We opted for throughput as the selected speed metric since in essentially all DSP and communication wireless computations latency is not a limiting factor, where latency is defined to be the delay between the arrival of a set of input samples and the production of the corresponding output as defined by the specification.
Hardware Model
The basic building block of the targeted hardware platform is a single programmable processor. We assume that all types of operations take one clock cycle for their execution, as is the case in many modern DSP processors. The adaptation of the software and algorithms to other hardware timing models is straightforward. In the case of a multiprocessor, we make the following additional simplifying assumptions: (i) all processors are homogeneous and (ii) interprocessor communication does not cost any time or hardware. The latter assumption is reasonable because multiple processors can be placed on a single integrated circuit due to increased integration, although it would be more realistic to assume an additional hardware and delay penalty for using multiple processors.
Power and Timing Models in Conventional and Ultra Low-Power Technology
It is well known that there are three principal components of power consumption in CMOS integrated circuits: switching power, short-circuit power, and leakage power. The switching power is given by $P_{switching} = \alpha C_L V_{dd}^2 f_{clock}$, where $\alpha$ is the probability that the power consuming switching activity, i.e., a transition from 0 to 1, occurs, $C_L$ is the loading capacitance, $V_{dd}$ is the supply voltage, and $f_{clock}$ is the system clock frequency.
$\alpha C_L$ is defined to be the effective switched capacitance. In CMOS technology, switching power dominates power consumption. The short-circuit power consumption occurs when both NMOS and PMOS transistors are "ON" at the same time, while the leakage power consumption results from reverse biased diode conduction and subthreshold operation. We assume that the effective switched capacitance increases linearly with the number of processors and that the supply voltage cannot be lowered below the threshold voltage $V_t$, for which we use several different values between 0.06V and 1.1V for both conventional and ultra low-power technology.
It is also known that reduced voltage operation comes at the cost of reduced throughput [Chandrakasan et al. 1992]. The clock period $T$ follows the first-order delay formula
$$T = C \, \frac{V_{dd}}{(V_{dd} - V_t)^2},$$
where $C$ is a constant [Chandrakasan et al. 1992]. The maximum rate at which a circuit can be clocked decreases monotonically as the voltage is reduced. As the supply voltage is reduced close to $V_t$, the rate of clock speed reduction becomes higher.
Architecture-Level Power Models for Single and Multiple Programmable Processors
The power model used in this research is built on three statistically validated and experimentally established facts. The first fact is that the number of operations at the machine code level is proportional to the number of operations at the high-level language level [Hoang and Rabaey 1993]. The second fact is that the power consumption in modern programmable processors such as the Fujitsu SPARClite MB86934, a 32-bit RISC microcontroller, is directly proportional to the number of operations, regardless of the mix of operations being executed [Tiwari and Lee 1995]. Tiwari and Lee [1995] report that all the operations, including integer ALU instructions, floating point instructions, and load/store instructions with locked caches, incur similar power consumption. Since the use of memory operands results in additional power overhead due to the possibility of cache misses, we assume that the cache locking feature is exploited as far as possible. If the cache locking feature cannot be used for the target applications, the power consumed by memory traffic is still likely to be reduced by minimizing the number of operations, since fewer operations usually imply less memory traffic. When the power consumption depends on the mix of operations being executed, as in the case of the Intel 486DX2 [Tiwari et al. 1994], a more detailed hardware power model may be needed. However, in all proposed power models for programmable processors, a significant reduction in the number of operations results in lower power. The final empirical observation is related to the power consumption and timing models in digital CMOS circuits presented in the previous subsection. Based on these three observations, we conclude that if the targeted implementation platform is a single programmable CMOS processor, a reduction in the number of operations is the key to power minimization.
When the initial number of operations is $N_{init}$, the optimized number of operations is $N_{opt}$, the initial voltage is $V_{init}$, and the scaled voltage is $V_{opt}$, the optimized power consumption relative to the initial power consumption is $(V_{opt}/V_{init})^2 (N_{opt}/N_{init})$. For multiprocessors, assuming that there is no communication overhead, the optimized power consumption for $n$ processors relative to that for a single processor is $(V_n/V_1)^2$, where $V_1$ and $V_n$ are the scaled voltages for single and $n$ processors, respectively.
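The relative power expressions above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: it uses the commonly cited first-order CMOS delay model $T = C V_{dd}/(V_{dd} - V_t)^2$, it assumes that throughput is held constant so that a reduction in operations (or the use of $n$ processors) is converted entirely into a slower clock, and the function names and the example values of 3.3V and 0.7V are hypothetical choices made for illustration.

```python
def clock_period(v_dd, v_t, c=1.0):
    """First-order delay model T = c * V_dd / (V_dd - V_t)^2 (an assumption here)."""
    return c * v_dd / (v_dd - v_t) ** 2


def scaled_voltage(v_init, v_t, slowdown, tol=1e-9):
    """Lowest supply voltage whose clock period is `slowdown` (>= 1) times
    the period at v_init. The period decreases monotonically with V_dd
    above V_t, so plain bisection is sufficient."""
    target = slowdown * clock_period(v_init, v_t)
    lo, hi = v_t + 1e-12, v_init  # period(lo) is huge, period(hi) <= target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if clock_period(mid, v_t) > target:
            lo = mid  # too slow at this voltage: must stay higher
        else:
            hi = mid  # fast enough: try a lower voltage
    return hi


def relative_power_single(n_init, n_opt, v_init, v_t):
    """(V_opt / V_init)^2 * (N_opt / N_init) for one processor."""
    v_opt = scaled_voltage(v_init, v_t, n_init / n_opt)
    return (v_opt / v_init) ** 2 * (n_opt / n_init)


def relative_power_multi(n, v_1, v_t):
    """(V_n / V_1)^2 for n processors, with no communication overhead."""
    v_n = scaled_voltage(v_1, v_t, float(n))
    return (v_n / v_1) ** 2


if __name__ == "__main__":
    # Hypothetical operating point: 3.3 V supply, 0.7 V threshold.
    print(relative_power_single(2081, 399.4, 3.3, 0.7))
    print(relative_power_multi(2, 3.3, 0.7))
```

Under this model the achievable voltage reduction shrinks as $V_{dd}$ approaches $V_t$, which is why the threshold voltage of the target technology matters for the trade-off.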
RELATED WORK
The related work can be classified along two lines: low power and implementation optimization, and in particular, minimization of the number of operations using transformations. The relevant low power topics can be further divided in three directions: power minimization techniques, power estimation techniques, and technologies for ultra low power design. The relevant compilation techniques are also grouped in three directions: transformations, ordering of transformations, and minimization of the number of operations.
In the last five years power minimization has been arguably the most popular optimization goal. This is mainly due to the impact of the rapidly growing market for portable computation and communication products. Power minimization efforts across all levels of the design abstraction process are surveyed in Singh et al. [1995].
It is apparent that the greatest potential for power reduction is at the highest levels (behavioral and algorithmic). Chandrakasan et al. [1992] demonstrated the effectiveness of transformations by showing an order of magnitude reduction in several DSP computationally intensive examples using a simulated annealing-based transformational script. Raghunathan and Jha [1994] and Goodby et al. [1994] also proposed methods for power minimization which explore trade-offs between voltage scaling, throughput, and power. Chatterjee and Roy [1994] targeted power reduction in fully hardwired designs by minimizing the switching activity. Chandrakasan et al. [1994] and Tiwari et al. [1994] did work in power minimization when programmable platforms are targeted.
Numerous power modeling techniques have been proposed at all levels of abstraction in the synthesis process. As documented in Singh et al. [1995] while there have been numerous efforts at the gate level, at the higher level of abstraction relatively few efforts have been reported. Chandrakasan et al. [1995] developed a statistical technique for power estimation from the behavioral level that takes into account all components at the layout level including interconnect. Landman and Rabaey [1996] developed an activity-sensitive architectural power analysis approach for execution units in ASIC designs. Finally, in a series of papers it was established that the power consumption of modern programmable processors is directly proportional to the number of operations, regardless of what the mix of operations being executed is [Lee et al. 1996; Tiwari et al. 1994] .
Transformations have been widely used at all levels of abstraction in the synthesis process, e.g., Dey et al. [1992] . However, there is strong experimental evidence that they are most effective at the highest levels of abstraction, such as system and, in particular, behavioral synthesis. Transformations only received widespread attention in high-level synthesis [Ku and Micheli 1992; Potkonjak and Rabaey 1992; Walker and Camposano 1991] .
Comprehensive reviews of use of transformations in parallelizing compilers, state-of-the-art general purpose computing environments, and VLSI DSP design are given in Banerjee et al. [1993] , Bacon et al. [1994] , and Parhi [1995] , respectively. The approaches for transformation ordering can be classified in seven groups: local (peephole) optimization, static scripts, exhaustive search-based "generate and test" methods, algebraic approaches, probabilistic search techniques, bottleneck removal methods, and enabling-effect based techniques.
Probably the most widely used technique for ordering transformations is local (peephole) optimization [Tanenbaum et al. 1982] , where a compiler considers only a small section of code at a time in order to apply, one by one, iteratively and locally, all available transformations. The advantages of the approach are that it is fast and simple to implement. However, performances are rarely high, and usually inferior to those obtained by other approaches.
Another popular technique is a static approach to transformation ordering where the order is given a priori, most often in the form of a script [Ullman 1989]. Script development is based on the experience of the compiler/synthesis software developer. This method has at least three drawbacks: it is a time consuming process which involves a lot of experimentation on random examples in an ad-hoc manner, any knowledge about the relationship among transformations is only implicitly used, and the quality of the solution is often relatively low for programs/designs which have different characteristics than those used for the development of the script.
The most powerful approach to transformation ordering is enumeration-based "generate and test" [Massalin 1987]. All possible combinations of transformations are considered for a particular compilation and the best one is selected using branch-and-bound or dynamic programming algorithms. The drawback is the large run time, often exponential in the number of transformations.
Another interesting approach is to use a mathematical theory behind the ordering of some transformations. However, this method is limited to only several linear loop transformations [Wolf and Lam 1991]. Simulated annealing, genetic programming, and other probabilistic techniques in many situations provide a good trade-off between the run time and the quality of solution when little or no information about the topology of the solution space is available. Recently, several probabilistic search techniques have been proposed for ordering of transformations in both the compiler and behavioral synthesis literature. For example, backward-propagation-based neural network techniques were used for developing a probabilistic approach to the application of transformations in compilers for parallel computers [Fox and Koller 1989], and approaches which combine both simulated annealing-based probabilistic and local heuristic optimization mechanisms were used to demonstrate significant reductions in area and power [Chandrakasan et al. 1995].
In behavioral and logic synthesis, several bottleneck identification and elimination approaches for ordering of transformations have been proposed [Dey et al. 1992; Iqbal et al. 1993]. This line of work mainly addresses throughput and latency optimization problems, where the bottlenecks can be easily identified and well quantified. Finally, the idea of enabling and disabling transformations has recently been explored in a number of compiler [Whitfield and Soffa 1990] and high-level synthesis papers [Potkonjak and Rabaey 1992; Srivastava and Potkonjak 1996]. Using this idea, several very powerful transformation scripts have been developed, such as one for maximally and arbitrarily fast implementation of linear computations [Potkonjak and Rabaey 1992], and one for joint optimization of latency and throughput for linear computations [Srivastava and Potkonjak 1994]. Also, the enabling mechanism has been used as a basis for several approaches for the ordering of transformations for optimization of general computations [Huang and Rabaey 1994]. The key advantage of this class of approaches is related to the intrinsic importance and power of the enabling/disabling relationship between a pair of transformations.
Transformations have been used for optimization of a variety of design and program metrics, such as throughput, latency, area, power, permanent and temporal fault-tolerance, and testability. Interestingly, the power of transformations is most often focused on secondary metrics such as parallelism, instead of on primary metrics such as the number of operations.
In the compiler domain, constant and copy propagation and common subexpression elimination techniques are often used. It can easily be shown that the constant propagation problem is undecidable when the computation has conditionals [Kam and Ullman 1977]. The standard procedure with this problem is to use so-called conservative algorithms. These algorithms do not guarantee that all constants will be detected, but that each data declared constant is indeed constant over all possible executions of the program. A comprehensive survey of the most popular constant propagation algorithms can be found in Wegman and Zadeck [1991]. Parhi and Messerschmitt [1991] presented optimal unfolding of linear computations in DSP systems. Unfolding results in simultaneous processing of consecutive iterations of a computation. Potkonjak and Rabaey [1992] addressed the minimization of the number of multiplications and additions in linear computations in their maximally fast form so that the throughput is preserved. Potkonjak et al. [1996] presented a set of techniques for minimization of the number of shifts and additions in linear computations. Sheliga and Sha [1994] gave an approach for minimization of the number of multiplications and additions in linear computations. Srivastava and Potkonjak [1996] developed an approach for the minimization of the number of operations in linear computations using unfolding and the application of the maximally fast procedure. A variant of their technique is used in the "conquer" phase of our approach. Our approach is different from theirs in two respects. First, their technique can handle only very restricted computations, which are linear, while our approach can optimize arbitrary computations. Second, our approach outperforms or performs at least as well as their technique for linear computations.
SINGLE PROGRAMMABLE PROCESSOR IMPLEMENTATION: MINIMIZING THE NUMBER OF OPERATIONS
The global flow of the approach is presented in Section 4.1. The strategy is based on divide-and-conquer optimization followed by postoptimization step, merging the divided subparts, which is explained in Section 4.2. Finally, Section 4.3 provides a comprehensive example to illustrate the strategy.
Global Flow of the Approach
The core of the approach is presented in the pseudocode of Figure 1. The rest of this subsection explains the global flow of the approach in more detail. The first step of the approach is to identify the computation's strongly connected components (SCCs), using the standard depth-first search-based algorithm [Tarjan 1972], which has a low order polynomial-time complexity. For any pair of operations A and B within an SCC, there exist both a path from A to B and a path from B to A. An example of this step is shown in Figure 2. The graph formed by all the SCCs is acyclic. Thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately. The inserted pipeline delays are treated as inputs or outputs to the SCC. As a result, every output and state in an SCC depend only on the inputs and states of the SCC. In this sense, the SCC is isolated from the rest of the computation and can be optimized separately.
In a number of situations our technique is capable of partitioning a nonlinear computation into partitions consisting of only linear computations. Consider for example a computation which consists of two strongly connected components $SCC_1$ and $SCC_2$. $SCC_1$ has as operations only additions and multiplications with constants. $SCC_2$ has as operations only max operations and additions. Obviously, since the computation has additions, multiplications with constants, and max operations, it is nonlinear. However, after applying our technique of logical separation using pipeline states, we have two parts which are linear. Note that this isolation is not affected by unfolding. We define an SCC with only one node as a trivial SCC. For trivial SCCs unfolding fails to reduce the number of operations. Thus, any adjacent trivial SCCs are merged together before the isolation step, to reduce the number of pipeline delays used.
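The SCC identification step can be sketched in Python as follows. This is a minimal illustration of the depth-first search-based algorithm of Tarjan [1972] applied to a small made-up graph; the dictionary-based graph representation, the function names, and the example node labels are assumptions introduced here and are not taken from Figure 1 or Figure 2.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps each node to an iterable of successors.
    Returns a list of frozensets, one per SCC."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(frozenset(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def is_trivial(scc):
    """A trivial SCC, as defined in the text, has exactly one node."""
    return len(scc) == 1


if __name__ == "__main__":
    # Hypothetical computation: two feedback loops followed by a single node.
    example = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c", "e"], "e": []}
    for scc in strongly_connected_components(example):
        print(sorted(scc), "trivial" if is_trivial(scc) else "nontrivial")
```

The recursive form is adequate for small flow graphs; for very large graphs an iterative variant would avoid Python's recursion limit.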
The number of delays in each subpart is minimized using retiming in polynomial time by the Leiserson-Saxe algorithm [Leiserson and Saxe 1991]. Note that a smaller number of delays will require a smaller number of operations, since both the next states and outputs depend on the previous states. SCCs are further classified as either linear or nonlinear. Linear computations can be represented using the state-space equations in Figure 3. Srivastava and Potkonjak [1996] developed an approach for the optimization of linear subparts that uses unfolding and the maximally fast procedure [Potkonjak and Rabaey 1992]. We note that instead of the maximally fast procedure, the ratio analysis by Sheliga and Sha [1994] can be used. Srivastava and Potkonjak [1996] have provided the closed-form formula for the optimal unfolding factor with the assumption of dense linear computations. We provide the formula in Figure 4. For sparse linear computations, they have proposed a heuristic which continues to unfold until there is no improvement. We have made this simple heuristic more efficient with a binary search, based on the unimodality of the number of operations as a function of the unfolding factor. When a subpart is classified as nonlinear, we apply unfolding after the isolation of nonlinear operations. All nonlinear operations are isolated from the subpart so that the remaining linear subparts can be optimized by the maximally fast procedure. All arcs from nonlinear operations to the linear subparts are considered as inputs to the linear subparts, and all arcs from linear subparts to the nonlinear operations are considered as outputs from the linear subparts. The process is illustrated in Figure 5. All arcs denoted by i are considered to be inputs and all arcs denoted by o are considered to be outputs for the unfolded linear subpart. We observe that if every output and state of the nonlinear subpart depends on nonlinear operations, then unfolding with the separation of nonlinear operations is ineffective in reducing the number of operations.
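The binary search over the unfolding factor can be sketched as follows. The search itself relies only on the unimodality noted above (assumed strict here, with no plateaus). The cost function `dense_ops_per_sample` is a hypothetical operation count for a fully dense state-space system with P inputs, Q outputs, and R states that we introduce purely to exercise the search; it is not the closed-form formula of Figure 4, and all names are illustrative.

```python
def dense_ops_per_sample(i, P, Q, R):
    """Hypothetical count of multiplications plus additions per input sample
    for an i-times unfolded dense linear system (assumption, not Figure 4).
    Output k of the batch depends on R states and (k + 1) * P inputs; each
    next state depends on R states and (i + 1) * P inputs."""
    mults = R * (R + (i + 1) * P)
    mults += sum(Q * (R + (k + 1) * P) for k in range(i + 1))
    # Every computed value with t terms needs t multiplications, t - 1 additions.
    adds = mults - (R + Q * (i + 1))
    return (mults + adds) / (i + 1)  # batch of i + 1 samples


def best_unfolding_factor(cost):
    """Minimize a strictly unimodal cost(i) over integers i >= 0."""
    hi = 1
    while cost(hi + 1) < cost(hi):  # grow the bracket until cost stops falling
        hi *= 2
    lo = 0
    while lo < hi:  # binary search on the sign of the forward difference
        mid = (lo + hi) // 2
        if cost(mid + 1) < cost(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    cost = lambda i: dense_ops_per_sample(i, P=1, Q=1, R=5)
    i_best = best_unfolding_factor(cost)
    print(i_best, cost(0), cost(i_best))
```

Compared with unfolding one step at a time, the search evaluates the cost at a number of unfolding factors that grows only logarithmically with the optimal factor.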
Sometimes it is beneficial to decompose a computation into larger subparts than SCCs. We consider an example given in Figure 6. Each node represents a subpart of the computation. We make the following assumptions specifically for clarifying the presentation of this example and simplifying it. We stress here that the assumptions are not necessary for our approach. Assume that each subpart is linear and can be represented by the state-space equations in Figure 3. Also assume that every subpart is dense, which means that every output and state in a subpart are linear combinations of all inputs and states in the subpart with no 0, 1, or -1 coefficients. The number inside a node is the number of delays or states in the subpart. Assume that when there is an arc from a subpart X to a subpart Y, every output and state of Y depends on all inputs and states of X. Separately optimizing SCCs $P_1$ and $P_2$ in Figure 6 costs 211 operations from the formula in Figure 4. On the other hand, optimizing the entire computation entails only 63.67 operations. Separate optimization does not perform well in this example because there are too many intermediate outputs from SCC $P_1$ to SCC $P_2$. This observation leads us to an approach of merging subparts to further reduce the number of operations. Since it is worthwhile to explain the subpart merging problem in detail, the next section is devoted to the explanation of the problem and our heuristic approaches.
Since the subparts of a computation are unfolded separately by different unfolding factors, we need to address the problem of scheduling the subparts. They should be scheduled so that memory requirements for code and data are minimized. We observe that the unfolded subparts can be represented by a multirate synchronous data-flow graph [Lee and Messerschmitt 1987] and the work of Bhattacharyya et al. [1993] can be directly used.
Note that the approach is particularly useful for architectures that require high locality and regularity in computation, because it improves both the locality and the regularity of computation by decomposing it into subparts and using the maximally fast procedure. Locality in a computation relates to the degree to which a computation has natural clusters of operations, while regularity in a computation refers to the repeated occurrence of computational patterns such as a multiplication followed by an addition [Guerra et al. 1994; Mehra and Rabaey 1996].
Subpart Merging
Initially, we only consider merging of linear SCCs. When two SCCs are merged, the resulting subpart does not form an SCC. Thus, in general, we must consider merging any adjacent arbitrary subparts. Suppose we consider merging of subparts $i$ and $j$. The gain $GAIN(i, j)$ of merging subparts $i$ and $j$ can be computed as follows: $GAIN(i, j) = COST(i) + COST(j) - COST(i, j)$, where $COST(i)$ is the number of operations for subpart $i$ and $COST(i, j)$ is the number of operations for the merged subpart of $i$ and $j$.
To compute the gain, $COST(i, j)$ must be computed, which requires constant coefficient matrices A, B, C, and D for only the merged subpart of $i$ and $j$. It is easy to construct the matrices using the depth-first search [Tarjan 1972]. The $i$ times unfolded system can be represented by the state-space equations in Figure 7. From the equations, the total number of operations can be computed for the $i$ times unfolded subpart, as follows. Let $N(*, i)$ and $N(+, i)$ denote the number of multiplications and the number of additions for the $i$ times unfolded system, respectively. The resulting number of operations is $(N(*, i) + N(+, i)) / (i + 1)$, because the $i$ times unfolded system uses a batch of $i + 1$ input samples to generate a batch of $i + 1$ output samples. We continue to unfold until no improvement is achieved. If there are no coefficients of 1 or -1 in the matrices A, B, C, and D, then the closed-form formulas for the optimal unfolding factor $i_{opt}$ and for the number of operations for the $i$ times unfolded system are provided in Figure 8.
We can now evaluate possible merging candidates. We propose two heuristic algorithms for subpart merging. The first heuristic is based on the greedy optimization approach. The pseudocode is provided in Figure 9 . The algorithm is simple. Until there is no improvement, merge the pair of subparts that produces the highest gain.
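The greedy heuristic described above can be sketched in Python as follows. This is an illustrative reading of the prose description, not a transcription of Figure 9: the representation of subparts as frozensets, the `cost` callback, and the three-node example with its cost table are all hypothetical.

```python
from itertools import combinations


def greedy_merge(nodes, edges, cost):
    """Repeatedly merge the adjacent pair of subparts with the highest
    positive GAIN(i, j) = COST(i) + COST(j) - COST(i, j).

    nodes: iterable of initial subpart names
    edges: iterable of (u, v) pairs between initial subparts
    cost:  function mapping a frozenset of names to a number of operations
    """
    parts = [frozenset([n]) for n in nodes]
    edges = list(edges)

    def adjacent(a, b):
        return any((u in a and v in b) or (u in b and v in a) for u, v in edges)

    while True:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(parts, 2):
            if not adjacent(a, b):
                continue
            gain = cost(a) + cost(b) - cost(a | b)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge improves the total: stop
            return parts
        a, b = best_pair
        parts = [p for p in parts if p not in (a, b)] + [a | b]


if __name__ == "__main__":
    # Hypothetical cost table for a chain A - B - C.
    table = {
        frozenset("A"): 10.0, frozenset("B"): 12.0, frozenset("C"): 8.0,
        frozenset("AB"): 15.0, frozenset("BC"): 19.0, frozenset("ABC"): 30.0,
    }
    result = greedy_merge("ABC", [("A", "B"), ("B", "C")], table.__getitem__)
    print(result, sum(table[p] for p in result))
```

In this toy instance merging A and B has the highest gain, and the remaining candidate merge has a negative gain, so the heuristic stops with two subparts.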
The other heuristic algorithm is based on a general combinatorial optimization technique known as simulated annealing [Kirkpatrick et al. 1983 ]. 
Power Optimization Using Divide-and-Conquer Techniques •
The pseudocode is provided in Figure 10. The actual implementation details are presented for each of the following areas: the cost function, the neighbor solution generation, the temperature update function, the equilibrium criterion, and the frozen criterion. First, the number of operations for the entire given computation has been used as the cost function. Second, the neighbor solution is generated by the merging of two adjacent subparts. Third, the temperature is updated by the function $T_{new} = \alpha(T_{old}) \cdot T_{old}$. For the temperature $T > 200.0$, $\alpha$ is chosen to be 0.1, so that in a high temperature regime where every new state has a very high chance of acceptance, the temperature reduction occurs very rapidly. For $1.0 < T \le 200.0$, $\alpha$ is set to 0.95 so that the optimization process explores this promising region more slowly. For $T \le 1.0$, $\alpha$ is set to 0.8 so that $T$ is quickly reduced to converge to a local minimum. The initial temperature is set to 4,000,000. Fourth, the equilibrium criterion is specified by the number of iterations of the inner loop. The number of iterations of the inner loop is set to 20 times the number of subparts. Lastly, the frozen criterion is given by the temperature. If the temperature falls below 0.1, the simulated annealing algorithm stops.
Fig. 8. Closed-form formula for unfolding; if two outputs depend on the same set of inputs and states, they are in the same group, and the same is true for states.
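The cooling schedule, equilibrium criterion, and frozen criterion stated above can be written down directly. The following Python sketch covers only those three elements with the parameter values given in the text; the function names are hypothetical, and the move generation and acceptance rule of Figure 10 are not reproduced.

```python
def alpha(t):
    """Piecewise cooling factor from the text."""
    if t > 200.0:
        return 0.1   # fast cooling while almost every move is accepted
    if t > 1.0:
        return 0.95  # slow cooling in the promising region
    return 0.8       # quick convergence to a local minimum


def temperatures(t_init=4_000_000.0, t_frozen=0.1):
    """Yield each temperature at which the inner loop runs, stopping once
    the temperature falls below the frozen threshold."""
    t = t_init
    while t >= t_frozen:
        yield t
        t = alpha(t) * t


def inner_iterations(num_subparts):
    """Equilibrium criterion: 20 inner-loop iterations per subpart."""
    return 20 * num_subparts


if __name__ == "__main__":
    schedule = list(temperatures())
    print(len(schedule), schedule[0], schedule[-1])
    print(inner_iterations(5))
```

Because the number of temperature levels is fixed by the schedule, the total number of candidate moves grows only linearly with the number of subparts.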
Both heuristics performed equally well on all the examples and the run times for both are very small because the examples have few subparts. We have used both greedy and simulated annealing-based heuristics for generating experimental results, and they produced exactly the same results.
Explanatory Example: Putting It All Together
We illustrate the key ideas of our approach for minimizing the number of operations by considering the computation of Figure 11. We use the same assumptions as those in the example in Figure 6.
The number of operations per input sample is initially 2081. (We illustrate how the number of operations is calculated in a maximally fast way [Potkonjak and Rabaey 1992] using a simple linear computation with 1 input X, 1 output Y, and 2 states $S_1$, $S_2$, described in Figure 12.) Using the technique of Srivastava and Potkonjak [1996], which unfolds the entire computation, the number can be reduced to 725 with an unfolding factor of 12. Our approach optimizes each subpart separately. This separate optimization is enabled by isolating the subparts using pipeline delays. Figure 13 shows the computation after the isolation step. Since every subpart is linear, unfolding is performed to optimize the number of operations for each subpart. The subparts $P_1$, $P_2$, $P_3$, $P_4$, and $P_5$ cost 120.75, 53.91, 114.86, 129.75, and 103.0 operations per input sample with unfolding factors 3, 10, 6, 7, and 2, respectively. The total number of operations per input sample for the entire computation is 522.27.

We now apply SCC merging to further reduce the number of operations. We first consider the greedy heuristic, which considers merging adjacent subparts. Initially, the possible merging candidate pairs are $P_1 P_2$, $P_1 P_3$, $P_2 P_5$, $P_3 P_4$, and $P_4 P_5$, which produce gains of $-51.48$, $-112.06$, $-52.38$, $122.87$, and $-114.92$, respectively. SCC $P_3$ and SCC $P_4$ are therefore merged, with an unfolding factor of 22. In the next iteration, there are 4 subparts and 4 candidate pairs for merging, all of which yield negative gains, so the heuristic stops at this point. The total number of operations per input sample has further decreased to 399.4. The simulated annealing heuristic produced the same solution for this example.

Our approach reduces the number of operations by a factor of 1.82 relative to the technique of Srivastava and Potkonjak [1996], and by a factor of 5.2 relative to the initial number of operations. For a single-processor implementation, since both the Srivastava and Potkonjak [1996] technique and our new method yield higher throughput than the original, the supply voltage can be lowered until the loss in circuit speed due to the reduced voltage uses up the extra throughput.
If the initial voltage is 3.3V, then our technique reduces power consumption by a factor of 26.0 with a supply voltage of 1.48V, while the Srivastava and Potkonjak [1996] technique reduces it by a factor of 10.0 with a supply voltage of 1.77V. The unfolded subparts are then scheduled so as to minimize the code and data memory requirements. The schedule period is the least common multiple of the values (unfolding factor + 1) over all subparts, which is 3036. Let $P_{3,4}$ denote the merged subpart of SCC $P_3$ and $P_4$. A simple-minded schedule (759 $P_1$, 276 $P_2$, 132 $P_{3,4}$, 1012 $P_5$), which minimizes the code size ignoring loop overheads, requires 9108 units of data memory, whereas the schedule (759 $P_1$, 4(69 $P_2$, 33 $P_{3,4}$, 253 $P_5$)), which minimizes the data memory requirement among the schedules minimizing the code size, requires 4554 units of data memory.
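The figures quoted in this example can be cross-checked with a short script. The power model used here is an assumption: at fixed throughput, power is taken to be proportional to the number of operations per sample times the square of the supply voltage. With the quoted (rounded) voltages it reproduces the reported reduction factors to within rounding. The data memory figures are not modeled, and the variable names are hypothetical.

```python
from functools import reduce
from math import gcd

# Operations per input sample for the separately optimized subparts.
costs = {"P1": 120.75, "P2": 53.91, "P3": 114.86, "P4": 129.75, "P5": 103.0}
total_isolated = sum(costs.values())
total_merged = total_isolated - 122.87  # gain from merging P3 and P4


def power_reduction(ops_before, ops_after, v_before, v_after):
    # Assumed model: power ~ (operations per sample) * V^2 at fixed throughput.
    return (ops_before / ops_after) * (v_before / v_after) ** 2


ours = power_reduction(2081, total_merged, 3.3, 1.48)
previous = power_reduction(2081, 725, 3.3, 1.77)


def lcm(a, b):
    return a * b // gcd(a, b)


# Unfolding factors after merging; each subpart processes (factor + 1) samples per run.
unfolding = {"P1": 3, "P2": 10, "P3,4": 22, "P5": 2}
period = reduce(lcm, (f + 1 for f in unfolding.values()))
repetitions = {name: period // (f + 1) for name, f in unfolding.items()}

print(round(total_isolated, 2), round(total_merged, 2))
print(round(2081 / total_merged, 2), round(725 / total_merged, 2))
print(round(ours, 1), round(previous, 1))
print(period, repetitions)
```

The repetition counts are the number of times each subpart must run within one schedule period so that all subparts consume and produce the same number of samples.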
MULTIPLE PROGRAMMABLE PROCESSORS IMPLEMENTATION
When multiple programmable processors are used, potentially more savings in power consumption can be obtained. We summarize the assumptions made in Section 2: (i) processors are homogeneous; (ii) interprocessor communication costs neither time nor hardware; (iii) effective switched capacitance increases linearly with the number of processors; (iv) both addition and multiplication take one clock cycle; and (v) supply voltage cannot be lowered below the threshold voltage $V_t$, for which we use several different values between 0.06V and 1.1V. Based on these assumptions, using k processors increases the throughput k times when there is enough parallelism in the computation, while the effective switched capacitance increases k times as well. In all the real-life examples considered, sufficient parallelism actually existed for the numbers of processors that we used.
We observe that the next states, i.e., the feedback loops $S[n] = A S[n - 1]$, can be computed in parallel. Note that the maximally fast procedure by Potkonjak and Rabaey [1992] evaluates a linear computation by first doing the constant-variable multiplications in parallel and then organizing the additions as a maximally balanced binary tree. Since all the next states are computed in a maximally fast procedure, more parallelism exists at the bottom of the binary computation tree. All other operations not in the feedback loops can be computed in parallel because they can be separated by pipeline delays. As the number of processors becomes larger, the number of operations outside the feedback loops gets larger and results in more parallelism. For dense linear computations, we provide the closed-form condition for sufficient parallelism when using k processors in Figure 14. We note that although the formulas were derived for the worst-case scenario, the required number of operations outside the feedback loops is small for the range of the number of processors in the experiment. More operations exist outside feedback loops than are required for full parallelism in all the real-life examples we have considered.
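To make the shape of the maximally fast evaluation concrete, the sketch below counts how many operations can run in parallel at each step when one linear combination of m terms is evaluated as described: all constant-variable multiplications first, then a balanced binary tree of additions. It assumes one clock cycle per operation, as in assumption (iv), and nontrivial coefficients for every term; the function name is hypothetical.

```python
def parallel_ops_per_step(num_terms):
    """Operations executable in parallel at each step for one m-term linear combination.

    Step 0 holds the m multiplications; each later step holds one level of
    the balanced addition tree. The list length is the critical path in cycles.
    """
    if num_terms < 1:
        raise ValueError("a linear combination needs at least one term")
    steps = [num_terms]
    remaining = num_terms
    while remaining > 1:
        adds = remaining // 2  # pair up as many partial sums as possible
        steps.append(adds)
        remaining -= adds      # each addition replaces two values by one
    return steps


for m in (1, 2, 5, 8):
    steps = parallel_ops_per_step(m)
    print(m, steps, "critical path:", len(steps))
```

The widest steps come first (the multiplications and the leaf-level additions), which is where the bulk of the available parallelism lies.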
Now we can reduce the voltage so that the clock frequency of all k processors is reduced by a factor of k. The average power consumption of k processors is reduced from that of a single processor by a factor of $(V_1 / V_k)^2$, where $V_k$ is the scaled supply voltage for the k-processor implementation, and [...] [Chandrakasan et al. 1992]. From this observation, it is always beneficial in terms of power consumption to use more processors, subject to the following two limitations: (i) the amount of parallelism available limits the improvement in throughput, and the critical path of the computation determines the maximum achievable throughput; and (ii) when the supply voltage approaches the threshold voltage, the improvement in power consumption becomes so small that the cost of adding a processor is not justified. With this in mind, we want to find the number of processors that minimizes power consumption cost-effectively in both standard CMOS technology and ultra-low power technology.
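The quadratic dependence on the voltage ratio follows from the usual dynamic power relation for CMOS, $P \propto C V^2 f$, combined with assumption (iii). As a short derivation, with $C$ denoting the effective switched capacitance of one processor and $f$ the single-processor clock frequency (both symbols introduced here for illustration):

```latex
\begin{align*}
P_1 &= C \, V_1^2 \, f, \\
P_k &= (kC) \, V_k^2 \, \frac{f}{k} = C \, V_k^2 \, f, \\
\frac{P_1}{P_k} &= \left( \frac{V_1}{V_k} \right)^2 .
\end{align*}
```

The factor k in capacitance cancels against the factor 1/k in clock frequency, so the entire saving comes from the lower supply voltage that the slower clock permits.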
Since the cost of programmable processors is high, and the cost of processors on ultra-low power platforms such as SOI is especially high [El-Kareh et al. 1995; Ipposhi et al. 1995], guidance for cost-effective design is important. We need a measure to differentiate between cost-effective and cost-ineffective solutions. We propose the PN product, where P is the power consumption normalized to that of the optimized single-processor implementation and N is the number of processors. The smaller the PN product, the more cost-effective the solution. If PN is smaller than 1.0, using N processors decreases the power consumption by a factor of more than N. The number of processors the implementation should use depends on the power consumption requirement and the cost of the implementation. Table I provides the values of PN products with respect to the number of processors for various combinations of the initial voltage $V_{init}$, the scaled voltage for a single processor $V_1$, and the threshold voltage $V_t$. $V_{init}$ is the initial voltage for the implementation before optimization. We note that PN products increase monotonically with the number of processors.
From Table I, we observe that cost-effective solutions usually use a few processors in all the cases considered on both the standard CMOS and ultra-low power platforms. We also observe that if the voltage reduction is high for a single processor case, then there is not much room to further reduce power consumption by using more processors.
Based on these observations, we developed our strategy for multiple-processor implementation. The first step is to minimize power consumption for the single-processor implementation using the technique proposed in Section 4. The second step is to increase the number of processors as long as the PN product remains below the given maximum value. The maximum value is determined on the basis of the power consumption requirement and the budget for the implementation. For all the real-life examples, the strategy produces solutions with only a few processors, and in many cases with a single processor, since our method for minimizing the number of operations significantly reduces their number and, in turn, the supply voltage for the single-processor implementation. Adding more processors does not usually reduce power consumption in a cost-effective way. Our method achieves cost-effective solutions with a very low power penalty, compared to solutions that optimize power consumption without considering hardware cost.
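The second step of the strategy can be sketched as a simple scan over the number of processors. The sketch assumes that the normalized power for N processors is $(V_N / V_1)^2$, that the PN product grows monotonically with N, and that the scaled supply voltages are supplied by the caller. The voltages in the usage example are hypothetical values chosen for illustration; they are not taken from Table I.

```python
def pn_product(v1, vn, n):
    """PN = N * P, with P = (vn / v1)^2 the power normalized to one processor."""
    return n * (vn / v1) ** 2


def choose_processors(scaled_voltage, pn_max):
    """Add processors while the PN product stays below pn_max.

    scaled_voltage : dict mapping processor count -> scaled supply voltage
                     (caller-supplied; must contain an entry for 1)
    pn_max         : maximum acceptable PN product
    """
    v1 = scaled_voltage[1]
    chosen = 1
    n = 2
    while n in scaled_voltage:
        if pn_product(v1, scaled_voltage[n], n) >= pn_max:
            break  # PN only grows with n, so stop at the first violation
        chosen = n
        n += 1
    return chosen


# Hypothetical voltages, for illustration only.
voltages = {1: 1.5, 2: 1.2, 3: 1.1, 4: 1.05}
for limit in (1.0, 1.5, 1.7, 2.0):
    print(limit, choose_processors(voltages, limit))
```

With a limit of 1.0, a second processor is accepted only if it more than halves the power, which is why low limits tend to select a single processor.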
EXPERIMENTAL RESULTS
Our set of benchmark designs includes all the benchmark examples used in Srivastava and Potkonjak [1996], as well as the following typical portable DSP, video, communication, and control applications: DAC: 4-stage NEC digital-to-analog converter (DAC) for audio signals; modem: 2-stage NEC modem; GE controller: 5-state GE linear controller; APCM receiver: Motorola's adaptive pulse code modulation receiver; Audio Filter 1: analog-to-digital converter (ADC) followed by a 14th-order cascade IIR filter; Audio Filter 2: ADC followed by an 18th-order parallel filter; Video Filter 1: two ADCs followed by a 10th-order two-dimensional (2D) IIR filter; Video Filter 2: two ADCs followed by a 12th-order 2D IIR filter; and VSTOL: VSTOL robust observer structure aircraft speed controller. DAC, modem, and the GE controller are linear computations and the rest are nonlinear computations. The benchmark examples from Srivastava and Potkonjak [1996] are all linear, including ellip, iir5, wdf5, iir6, iir10, iir12, steam, dist, and chemical.

Table II presents the experimental results of our technique for minimizing the number of operations for real-life examples. The fifth and seventh columns of Table II provide our method's improvement factors over Srivastava and Potkonjak [1996] and the initial number of operations, respectively. Our method has achieved the same number of operations as Srivastava [...]
[...] we should stop increasing the number of processors. When $PN_T = 1$, i.e., when the power reduction from adding a processor must be greater than 2 to be cost effective, a single-processor solution is optimal in almost all cases. When $PN_T$ gets larger, the number of processors increases, but the solutions still use only a few processors, which results in an order of magnitude reduction in power consumption. All the results clearly indicate the effectiveness of our new method.
CONCLUSION
We introduced an approach for power minimization using a set of compilation and architectural techniques. The key technical innovation is a compilation technique for minimizing the number of operations that synergistically uses several transformations within a divide-and-conquer optimization framework. The new approach not only deals with arbitrary computations, but also outperforms previous techniques for limited computation types. Furthermore, we investigated the coordinated impact of compilation techniques and new ultra-low power technologies on the number of processors that provides the optimal tradeoff between cost and power. The experimental results on a number of real-life designs clearly indicate the effectiveness of all proposed techniques and algorithms.
