Abstract - Linear computations form an important type of computation that is widely used in DSP and communications. We introduce two approaches for power minimization in linear computations using transformations. First, we show how unfolding combined with the procedure for maximally fast implementation of linear computations reduces power in single-processor and multiprocessor implementations by factors of 2.2 and 8, respectively. To accomplish this we exploit a newly identified property of unfolding whereby, as a linear system is unfolded, the number of operations per sample at first decreases to reach a minimum and then begins to rise. For custom ASIC implementations even higher improvements are achievable using the second transformational approach, which builds upon the unfolding-based strategy of the first approach. We developed a method that combines the multiple constant multiplication (MCM) technique with the generalized Horner's scheme and unfolding in such a way that power is minimized.
Throughput and Power in Linear Systems
Linear computations form an important type of computation that is widely used in video and image processing, DSP, control, communications, and many other applications. A large fraction of systems in these application domains are either linear, or have subsystems that are linear. This paper explores the relationship of throughput with an increasingly important design metric: power. In particular, we seek to find the extent to which power consumption of linear systems can be reduced, both independently and in conjunction with throughput improvement, and to develop techniques for doing so.
To explore the throughput and power relationship in linear systems we take a more thorough and systematic approach. First, we consider analytically as well as empirically the effect of several algorithm transformations that can be considered as the building blocks for exploring the power-throughput space. Specifically, we consider unfolding, which is the underlying transformation behind arbitrary throughput improvement, both separately and in combination with decomposition of multiplication-by-constants into primitive sequences of shifts and additions, factorization, and common sub-expression elimination. Second, we consider implementations not just in the form of ASICs with application-specific datapaths, as is usually the case, but also implementations based on a single programmable processor and on multiple programmable processors. This is important because programmable processors such as DSP cores are increasingly the preferred medium of implementation, as opposed to custom datapaths.
33rd Design Automation Conference. Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

In CMOS technology there are three sources of power consumption: switching current, short-circuit current, and leakage current. The switching component not only dominates in most designs, but is also the only one that cannot be made negligible even when proper circuit design techniques are used. The average power consumption of a CMOS gate due to the switching component is given by:

$$P = \alpha C_L V_{dd}^2 f$$

where $f$ is the system clock frequency, $V_{dd}$ is the supply voltage, $C_L$ is the load capacitance, and $\alpha$ is the switching activity (the probability of a 0→1 transition during a clock cycle). The term $\alpha C_L$ is often lumped into a single parameter called the effective switched capacitance. Further, it is well known that the delay through a gate has a monotonically inverse relationship to the supply voltage. Therefore, the maximum rate at which a circuit can be clocked will monotonically decrease as the voltage is reduced.

[Fig. 1: gate delay versus supply voltage, normalized to gate delay at 5.0 Volts.]
The above expression suggests several behavior-level approaches to reduce the power consumption of a computation. The first is to shut the system down during periods of inactivity, either by shutting off the clock (f = 0) or by shutting off the power supply. The second is to reduce the effective switched capacitance $\alpha C_L$ by restructuring the computation, communication, and/or the memory hierarchy, and by changing data encoding and data formats. The third is to exploit the quadratic dependence of power consumption upon the supply voltage $V_{dd}$ and operate at a reduced voltage while compensating for the resulting loss in circuit speed by techniques that increase the throughput.
Noting that throughput is the sole metric of speed that is important to us, one can combine the latter two approaches to minimize power at the behavioral level in the following fashion. First, apply a behavioral transformation to reduce the effective switched capacitance (by reducing the number of operations) or to increase the throughput (by reducing the critical path). Next, in the case of increased throughput, lower the supply voltage just the right amount so as to decrease the clock speed to an extent that the throughput reverts back to what it was before. The net power consumption is reduced if either the effective switched capacitance is reduced at a constant voltage, or if the reduction due to reduced voltage and frequency outweighs any increase in effective capacitance paid for the increase in throughput. When voltage reduction is not possible, one can trade off the extra throughput for a lower clock frequency or for shutdown, both of which result in a linear reduction in power.
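The three levers just described can be compared numerically with a small sketch of the switching-power expression. The operating point below (activity 0.5, 1 pF, 3.0 V, 100 MHz) and the assumption that 1.5 V still supports the halved clock are illustrative values chosen here, not figures from the paper.

```python
def switching_power(alpha, c_load, vdd, freq):
    """Average switching power of a CMOS gate: alpha * C_L * Vdd^2 * f."""
    return alpha * c_load * vdd ** 2 * freq


# Illustrative (made-up) operating point, not taken from the paper.
base = switching_power(alpha=0.5, c_load=1e-12, vdd=3.0, freq=100e6)

# Lever: halve the effective switched capacitance (fewer operations).
half_cap = switching_power(0.25, 1e-12, 3.0, 100e6)
# Lever: halve the clock at constant voltage (linear reduction).
half_clk = switching_power(0.5, 1e-12, 3.0, 50e6)
# Lever: halve the clock AND lower Vdd (assumed feasible at 1.5 V here).
low_vdd = switching_power(0.5, 1e-12, 1.5, 50e6)

for name, p in [("half capacitance", half_cap),
                ("half clock", half_clk),
                ("half clock + half Vdd", low_vdd)]:
    print(f"{name}: power reduced by x{base / p:.1f}")
```

Only the last case benefits from the quadratic voltage term, which is why the text trades extra throughput for voltage whenever the technology allows it.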
Linear Systems
In high-level synthesis terminology, a system is linear if it can be realized by a control-dataflow graph (CDFG) described by state-space equations in which $X[n] \in \Re^{P \times 1}$, $Y[n] \in \Re^{Q \times 1}$, and $S[n] \in \Re^{R \times 1}$ are the input, output, and state vectors respectively, and A, B, C, and D are constant coefficient matrices. Note that the throughput of the system, or the maximum rate at which it can process incoming samples, is decided solely by the critical path of the feedback section, that is, the part of the state update in which A multiplies the state vector. The remaining terms are not in the feedback loop and therefore can be pipelined away. This is the base or reference case for our analysis, in which each execution consumes one input sample and produces one output Y[n] and the next state S[n].
In a system that has been unfolded i times, a batch of i + 1 input samples is processed in each execution to produce the outputs up to Y[n + i] and the next state S[n + i]. The batch processing itself can be done in various ways: in a block processing [Rob87] fashion, where the batch processing is begun only after all the input samples have been collected in a buffer, or in an on-arrival processing [Sri94] fashion, where batch processing is begun as soon as the first relevant data is available. Independent of how the processing is organized, the basic computation executed by a linear system that has been unfolded i times can again be represented by state-space equations.
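One way to write the base and unfolded recurrences is sketched below. The index convention (the state S[n] is produced together with the output Y[n] from S[n-1] and X[n]) is an assumption made here for illustration; the unfolded form then follows by substituting the state update into itself.

```latex
% Assumed base (non-unfolded) form; the index convention is an assumption.
\begin{align*}
S[n] &= A\,S[n-1] + B\,X[n] \\
Y[n] &= C\,S[n-1] + D\,X[n]
\end{align*}
% Repeated substitution gives the i-times-unfolded batch of i+1 samples:
\begin{align*}
S[n+i] &= A^{i+1} S[n-1] + \sum_{j=0}^{i} A^{i-j} B\, X[n+j] \\
Y[n+k] &= C A^{k} S[n-1] + \sum_{j=0}^{k-1} C A^{k-1-j} B\, X[n+j] + D\, X[n+k],
\qquad k = 0, \dots, i
\end{align*}
```

With i = 0 the second pair collapses to the first, and the only term inside the feedback loop is the product of a constant matrix (a power of A) with the previous state.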
Note that these equations process i + 1 data samples for each execution. The i times unfolded system is characterized by #(+, i), the number of additions; #(*, i), the number of multiplications; CP(i), the feedback critical path; and MaxThroughput(i), the maximum throughput.
As expected, when i = 0 these quantities reduce to those of the non-unfolded case presented earlier. Further, the maximum achievable throughput can be increased arbitrarily as the amount of unfolding increases, because the feedback critical path remains the same while more samples are processed.
An interesting observation is that the effective number of operations per input sample is lower in the unfolded case when the amount of unfolding i is less than a certain threshold. This holds separately for the change in the number of multiplications per sample and for the change in the number of additions per sample due to i times unfolding, each with its own threshold.
It should also be noted that these per-sample differences in the numbers of * and + achieve a minimum at a certain i below those thresholds; in other words, as one unfolds, the number of operations per sample at first decreases to reach a minimum and then begins to rise.
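The decrease-then-rise behaviour can be reproduced with a simple counting model. The sketch below assumes dense coefficient matrices, precomputed constant matrices for the unfolded system, and that a sum of t products costs t multiplications and t - 1 additions; the function names and this counting model are assumptions made here, not the paper's closed-form expressions. With the illustrative sizes P = Q = 1 and R = 12 the model does, however, reproduce the values 16 and 4.0007 quoted for the single-processor example.

```python
def ops_per_sample(P, Q, R, i):
    """(multiplications, additions) per input sample after i-times unfolding,
    under a dense-matrix counting model (an assumption of this sketch)."""
    batch = i + 1
    # Next state: each of the R rows combines R old-state values
    # and P values from each of the `batch` input samples.
    mults = R * (R + batch * P)
    adds = R * (R + batch * P - 1)
    # Output k (k = 0..i) depends on the old state and inputs 0..k.
    for k in range(batch):
        terms = R + (k + 1) * P
        mults += Q * terms
        adds += Q * (terms - 1)
    return mults / batch, adds / batch


def total_ops(P, Q, R, i):
    """Total operations per sample, counting + and * as one instruction each."""
    m, a = ops_per_sample(P, Q, R, i)
    return m + a


counts = [total_ops(1, 1, 12, i) for i in range(40)]
best_i = min(range(40), key=counts.__getitem__)
print(best_i, round(counts[0] / counts[best_i], 4))  # 16 4.0007
```

The per-sample cost of the state update shrinks as 1/(i + 1), while the cost of the outputs grows linearly with i, which is what produces the interior minimum.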
The above observations lead to the following strategies for low power implementations.
Implementation on a Single Processor
In the case of a single programmable processor the throughput that is achieved is decided solely by the number of operations. It follows that the throughput is maximized by using the value of i that minimizes the total number of instructions. If one assumes that + and * are the basic processor instructions (they need not take the same number of cycles), it can be shown that the optimum value of unfolding $i_{opt}$ is one of two candidate values, whichever leads to fewer instructions per sample. If both lead to the same value, we pick the smaller $i_{opt}$ so as to save on coefficient memory, because larger unfolding leads to more constant coefficients.
From this one can obtain an expression for $S_{max}$, the maximum improvement in throughput for the single processor case.
Finally, the processor voltage can be reduced just the right amount so that the clock frequency f is reduced by a factor of $S_{max}$. This leads to a reduction in power because in the expression for power $P = \alpha C_L V_{dd}^2 f$ the terms $V_{dd}^2$ and $f$ are reduced whereas the other two terms remain constant. It must be mentioned that we are implicitly assuming that the processor power consumption is dominant compared to coefficient and data memory power consumption, an assumption that is true in most CPU-memory systems as found in DSP and control processing systems.
As an example, consider a hypothetical linear computation with P = 1 input, Q = 1 output, and R = 12 states. Then, from the approach above one can show that $i_{opt} = 16$, which leads to $S_{max} = 4.0007$. One can therefore reduce the voltage such that the clock frequency is reduced by a factor of 4.0007 so that the throughput reverts back to the original throughput. If the initial voltage is 3.0V, then from Fig. 1 it follows that a voltage reduction to 1.5V will result in the desired clock slowdown. The processor operating at 1.5V and computing the equations that have been unfolded 16 times will have the same throughput as the processor operating at 3.0V and computing the initial non-unfolded equations. However, in the unfolded case one obtains a power reduction of $(3.0/1.5)^2 \times 4 = 16$, or a factor of 16 over the initial power consumption. If the initial voltage was 5.0V, then our technique will result in a processor operating at 1.9V, with an even larger power reduction of $(5.0/1.9)^2 \times 4 = 27.7$.
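The arithmetic of this example can be checked with a few lines. The helper name `power_reduction` is introduced here for illustration; it assumes the effective switched capacitance is unchanged, and the reduced voltages are the ones quoted in the text (read from Fig. 1), not computed.

```python
def power_reduction(v_init, v_new, slowdown):
    """Power reduction factor when the supply drops from v_init to v_new and
    the clock is slowed by `slowdown`, with unchanged switched capacitance."""
    return (v_init / v_new) ** 2 * slowdown


print(round(power_reduction(3.0, 1.5, 4.0), 1))  # 16.0
print(round(power_reduction(5.0, 1.9, 4.0), 1))  # 27.7
```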
As mentioned earlier, the above result is based on analysis that assumed that the coefficient matrices A, B, C, and D are dense matrices with arbitrary non-trivial coefficients. While this is certainly true of linear systems that are found in process controllers, it is often not the case with filters found in DSP applications, where these matrices often tend to be sparse and have coefficients that are trivial (for example, coefficients of 1 or -1 do not need multiplication). Unfortunately, it is not possible to come up with meaningful analytical expressions for the non-dense case.
However, we have empirically found that unfolding helps in reducing the number of operations and the power even in such cases, although by smaller factors. The optimum level of unfolding and the number of operations can no longer be found by merely evaluating closed-form formulas. We therefore use the following heuristic to find the desired level of unfolding in the non-dense cases: first pick the best performing level of unfolding from amongst all values of i from 0 through $i_{opt}$, the optimum value analytically predicted for the dense case. If the best level turns out to be $i_{opt}$, then we continue to unfold further using linear search as long as the number of operations continues to decline. Since the run times are low, the preceding linear search strategy is quite acceptable. In any case, more sophisticated search techniques such as binary search could be employed if desired.
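The heuristic described above can be sketched as follows. The callable `count_ops(i)` is a hypothetical stand-in for whatever procedure counts the operations per sample of the actual (sparse) design after i-times unfolding; it is not an interface defined in the paper.

```python
def find_unfolding(count_ops, i_opt_dense):
    """Pick an unfolding level for a non-dense system.

    count_ops(i): hypothetical caller-supplied function returning the number
    of operations per sample after i-times unfolding.
    i_opt_dense: optimum predicted analytically for the dense case.
    """
    # Step 1: best level among 0..i_opt_dense (min keeps the smallest i on ties).
    best = min(range(i_opt_dense + 1), key=count_ops)
    # Step 2: if the best sits at the upper end, keep unfolding linearly
    # for as long as the operation count continues to decline.
    if best == i_opt_dense:
        while count_ops(best + 1) < count_ops(best):
            best += 1
    return best


# Toy cost curve with its minimum at i = 9 (illustrative only).
toy = lambda i: 100 / (i + 1) + i
print(find_unfolding(toy, 5), find_unfolding(toy, 20))
```

The second phase assumes the cost curve has a single minimum beyond the dense-case optimum; it stops at the first level where the count no longer declines.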
In case unfolding results in such a large increase in throughput (reduction in number of operations) that, even after reducing the voltage to the minimum feasible (about 1V in the technology that we used), the new system has higher throughput than the original, one can obtain a further reduction in power by operating the processor at an even lower frequency (or, equivalently, by shutting the processor down for part of the time). This, however, did not happen for any of our examples.
The following results summarize the power reduction obtained for several real-life examples listed in Table 1. Table 2 gives the corresponding results. For instance, one of the designs in the table yields a x1.6 reduction in the number of operations. Therefore, the clock frequency can be reduced by x1.6, resulting in a power reduction by x1.6 (or, 37%) while the processor voltage remains unchanged. This strategy gives an average power reduction of x1.4 (29%) over all our examples.
Implementation on Multiple Processors
Potentially more savings can be obtained if one considers implementations that are not restricted to a single processor. By using more than one processor the throughput achieved by the implementation can be increased compared to the single processor case, and by using enough processors the maximum possible throughput (decided by the critical path through the feedback portion of the linear computation) can be achieved. The extra throughput thus obtained can be used for further throughput-voltage trade-off as long as the power reduction from this compensates for the power increase due to more processors.
As an example, consider the same hypothetical linear computation with P = 1 input, Q = 1 output, R = 12 states, and dense coefficient matrices that we considered for the single processor case. Previously we had shown that the number of operations per sample is minimized when the linear computation is unfolded $i_{opt} = 16$ times, and that the maximum throughput achieved by a single processor improves by approximately x4 relative to the original non-unfolded computation. Now, if a second processor is added, the throughput will increase by x2 (ignoring communication costs), and at the same time power consumption will increase by x2 due to the addition of the second processor. One can then reduce the voltage such that the clock frequency of both processors is reduced by $S_{max} = 2 \times 4 = 8$. If the initial voltage was 3.0V, then this reduced voltage (from Fig. 1) is 1.27V. Therefore, the 16-unfolded two-processor 1.27V implementation will have a power reduction of $(3.0/1.27)^2 \times 8 \times \frac{1}{2} = 22.3$ relative to a non-unfolded 3.0V single-processor implementation.
In general the situation is more complex when adding processors. First, addition of processors causes a linear increase in switched capacitance, and hence power, for a given voltage and clock frequency. In fact, the increase in switched capacitance may be super-linear due to inter-processor communication hardware. Second, the speed-up due to an additional processor is not linear, and begins to saturate due to inter-processor communication overhead. Even if the inter-processor communication cost is ignored, the computation cannot be speeded up more than that allowed by the critical path of the feedback section. Finally, the voltage cannot be reduced below a certain point.
The following approach, developed under certain simplifying assumptions, explores the unfolding-driven power-throughput trade-off in implementations using multiple processors. The simplifying assumptions are (i) inter-processor communication does not cost any time, (ii) effective switched capacitance $\alpha C_L$ increases linearly with the number of processors N, (iii) voltage cannot be reduced below 1V, and (iv) both addition and multiplication instructions take one clock cycle (i.e. m = 1). The assumptions are appropriate when one also considers empirical results, reported by researchers such as [Tiw94], that indicate a strong correlation between power and number of operations in programmable general purpose and DSP computation.
The first step is to unfold the linear computation to the optimum level $i = i_{opt}$ where the number of operations (instructions) per sample is minimized. The second step is to increase the number of processors to N. Let $S_{max}(N, i)$ be the maximum improvement in throughput achieved by N processors on an i times unfolded linear computation compared to a single processor on the original non-unfolded computation. The third step is then to slow down each of the N processors by a factor of $S_{max}(N, i)$; this is done by reducing the voltage just the right amount (but limited by the technology-imposed lower bound) so as to decrease the clock frequency (increase the gate delay) by $S_{max}(N, i)$. Let V(d) be the voltage at which the value of gate delay relative to the initial implementation is d, with V(1) typically being 3.0V or 5.0V. Then, the power of the new N processor implementation relative to the original non-unfolded implementation is:

$$N \cdot \frac{1}{S_{max}(N, i)} \cdot \left( \frac{V(S_{max}(N, i))}{V(1)} \right)^2$$

The task is to find the optimum value $N = N_{opt}$ where the above expression is minimized. The crucial missing piece of the puzzle is an estimate of $S_{max}(N, i)$, the maximum improvement in throughput achieved by N processors on an i times unfolded linear computation. It can be shown via some intricate algebraic manipulation that under our simplifying assumptions the speed-up due to multiple processors is linear for $N \le R$, i.e., $S_{max}(N, i) = N \cdot S_{max}(1, i)$ for $N \le R$. This allows a linear decrease in frequency, and therefore power, which offsets the linear increase in power due to the increase in the number of processors. Therefore, one can always add up to R processors and get a reduction in power due to the reduction in the voltage term.
In other words, the optimum number of processors is at least R .
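The relative-power trade-off for N processors can be sketched numerically. The function below assumes capacitance linear in N and a clock slowed by the combined speed-up; the name `relative_power` is introduced here, and the reduced voltages are the values the text reads off Fig. 1 rather than outputs of a delay model.

```python
def relative_power(n_proc, speedup, v_new, v_init):
    """Power of n_proc slowed-down processors relative to the original
    single-processor, non-unfolded implementation.

    Assumes switched capacitance grows linearly with n_proc and that each
    processor's clock is divided by `speedup`; v_new is the supply voltage
    that yields that slowdown (taken from a delay-voltage curve).
    """
    return n_proc * (v_new / v_init) ** 2 / speedup


# Single processor, unfolding only: speed-up x4, 3.0 V -> 1.5 V.
print(round(1 / relative_power(1, 4, 1.5, 3.0), 1))   # 16.0
# Two processors: combined speed-up 2 x 4 = 8, 3.0 V -> 1.27 V.
print(round(1 / relative_power(2, 8, 1.27, 3.0), 1))  # 22.3
```

While the speed-up stays linear in N, the factors N and 1/speedup cancel against each other, so any gain from adding processors comes entirely from the lower voltage.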
The observation that the speed-up is linear for $N \le R$ is valid even for real-life non-dense coefficient matrices in a slightly modified form: the speed-up will be at least linear (under our assumption of zero communication cost). We exploit this fact, and conservatively use N = R processors to get at least a linear increase in throughput (on top of what $i_{opt}$ level unfolding alone gives) and trade this increased throughput for a voltage reduction that slows down the clock by an equivalent amount. Table 2 shows the resulting power reduction for our suite of examples.
Implementation on Custom Datapath ASICs
We start this section by summarizing the key background information about the MCM transformation, which is used as a building block in the new technique for power minimization.
Constant multiplication is a transformation which replaces a multiplication by a constant with shifts and additions. For example, the product y = 185 * x can be computed in the following way: y = (x << 7) + (x << 5) + (x << 4) + (x << 3) + x. Since shifts and additions are significantly more area, time, and power efficient, this transformation has been widely used in computer architecture, compilers [Mag88], and VLSI signal processing [Rab91].
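A minimal sketch of this decomposition, using the plain binary expansion of the constant (the helper names are chosen here for illustration). The shift pattern 7, 5, 4, 3, 0 in the example corresponds to the constant 185 = 128 + 32 + 16 + 8 + 1.

```python
def shift_amounts(constant):
    """Bit positions set in a positive integer constant, highest first."""
    return [b for b in range(constant.bit_length() - 1, -1, -1)
            if (constant >> b) & 1]


def multiply_by_shifts(x, constant):
    """Compute constant * x using only shifts and additions."""
    return sum(x << b for b in shift_amounts(constant))


print(shift_amounts(185))          # [7, 5, 4, 3, 0]
print(multiply_by_shifts(3, 185))  # 555
```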
Recently, it has been realized that a common computational structure in many ASIC application domains is multiple constant multiplication with the same variable [Pot94]. More complex structures give a significantly higher potential for design optimization, which is related to a complex combinatorial problem [Pot94]. The crux of the technique can be illustrated using the following example, which involves only two constant multiplications with the same variable x: y1 = 185 * x and y2 = 235 * x. The second product y2 can be expressed as y2 = (x << 7) + (x << 6) + (x << 5) + (x << 3) + (x << 1) + x. The direct computation of the two products using the constant multiplication transformation requires nine shifts and nine additions. However, using common subexpressions the number of shifts and additions can be reduced. The first observation is that shifts can be shared between the two products, so only six shifts are required. Moreover, if the product y3 = (x << 7) + (x << 5) + (x << 3) + x is computed first, the products y1 and y2 can be computed as y1 = y3 + (x << 4) and y2 = y3 + (x << 6) + (x << 1), so only six additions are required.
We are now ready to develop an approach for power reduction in linear designs which combines the power of unfolding, the MCM transformation, and the generalized Horner scheme. Formally, linear computations can be defined as those which can be described by state-space equations. We now present the procedure which transforms an arbitrary linear computation into a form which can be implemented so that power is at an arbitrarily low level. The procedure combines the novel use of Horner's rule for polynomial evaluation with the MCM for power optimization. Horner's rule rearranges an n-th degree polynomial $a_n x^n + a_{n-1} x^{n-1} + \dots + a_0$ into the nested form $(\dots(a_n x + a_{n-1})x + \dots)x + a_0$. Applied to the unfolded computation, this indicates what is required in order to achieve a low-power implementation: we have to find an efficient implementation for the (n+1) sets of products with the vector X, and an efficient implementation for the products of input vectors with a variety of coefficients.
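Returning to the two-constant example (185 and 235), the shift and addition counts can be checked with a small sketch. It shares only the common bit pattern of the two binary expansions, which is a simplification chosen here for illustration and not the full MCM optimization of [Pot94]; it also assumes the two constants have at least one bit in common.

```python
def bits(c):
    """Set of bit positions (shift amounts) in a positive integer constant."""
    return {b for b in range(c.bit_length()) if (c >> b) & 1}


def direct_cost(constants):
    """(shifts, additions) when every product is built independently."""
    shifts = sum(len(bits(c) - {0}) for c in constants)
    adds = sum(len(bits(c)) - 1 for c in constants)
    return shifts, adds


def shared_cost(c1, c2):
    """(shifts, additions) when shifts and the common bit pattern are shared.
    Assumes c1 and c2 have at least one set bit in common."""
    b1, b2 = bits(c1), bits(c2)
    common = b1 & b2                      # pattern of the shared product y3
    shifts = len((b1 | b2) - {0})         # each distinct shift is done once
    adds = (len(common) - 1) + len(b1 - common) + len(b2 - common)
    return shifts, adds


print(direct_cost([185, 235]))  # (9, 9)
print(shared_cost(185, 235))    # (6, 6)
```

Here the common pattern is 169 = (1 << 7) + (1 << 5) + (1 << 3) + 1, which is the intermediate product y3 of the example.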
The first task can be properly solved using the MCM for power transformation, which indicates that eventually, regardless of the amount of additional unfolding, the number of operations will stay constant for this part of the computation, as indicated by the asymptotic effectiveness theorem. For the remainder of the task we apply the key idea from Horner's scheme to the part of the computation used to compute the influence of primary inputs, shown in Fig. 2, so that this overhead is reduced to a linear increase. For each new unfolding, only three matrix multiplications (by B, A, and C) and one matrix addition are required. Furthermore, the computational structure computes a significant part of $X_{n+1}$; after one level of additional unfolding we need only one more matrix addition.
The resulting computational structure is shown in Fig. 3 using the functional dependency form. Note that we can add to the non-recursive part of the computational structure an arbitrary number of pipeline delays and therefore increase throughput and reduce voltage to an arbitrarily low level. The only part of the computation which is in a cycle is $A^n X_1$ during the computation of $X_{n+1}$.
The length of this path does not increase with unfolding, since the constant $A^n$ can be precomputed during synthesis.
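The Horner-style rearrangement of the input contribution can be illustrated with a small sketch. It assumes the contribution of a batch of inputs U_0..U_n to the next state has the form of a sum of A^(n-j) B U_j terms; the names `U`, `matvec`, and `vecadd` are illustrative, and the product by C for the outputs is omitted.

```python
def matvec(M, v):
    """Matrix-vector product for matrices stored as lists of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vecadd(u, v):
    return [a + b for a, b in zip(u, v)]


def input_part_direct(A, B, inputs):
    """Sum over j of A^(n-j) B U_j, applying A repeatedly to each term."""
    n = len(inputs) - 1
    total = [0] * len(A)
    for j, u in enumerate(inputs):
        term = matvec(B, u)
        for _ in range(n - j):
            term = matvec(A, term)
        total = vecadd(total, term)
    return total


def input_part_horner(A, B, inputs):
    """Same sum in nested (Horner) form: each extra input sample costs one
    product by A, one product by B, and one vector addition."""
    acc = matvec(B, inputs[0])
    for u in inputs[1:]:
        acc = vecadd(matvec(A, acc), matvec(B, u))
    return acc


A = [[1, 2], [0, 1]]
B = [[1], [1]]
U = [[1], [2], [3]]
print(input_part_direct(A, B, U), input_part_horner(A, B, U))
```

In the nested form the work grows linearly with the number of unfolded samples, whereas the direct form repeats products by A for every earlier input.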
So, the approach for achieving arbitrarily low power implementation of linear systems can be described using the following pseudo-code.
Transformation order for low power implementation of linear computations:
(1) Unfold the computation n times;
(2) Rearrange computation using the generalized Horner's scheme; 
Experimental Results
We evaluate the effectiveness of the method by conservatively assuming that voltage cannot be lowered below 1V. Table 4 shows the initial power dissipation and the power consumption after the application of the new ordering of transformations, as well as the power reduction factors. The average and median power reductions were by factors of 30.2 and 31.8, respectively.
Conclusion
We introduced a new approach for power minimization in linear computations using transformations. The generic approach was augmented to produce very high power reductions when either programmable or custom ASIC implementations are targeted, often with no or minimal hardware overhead.
