Scheduling and resource binding for low power by Musoll Cinca, Enric & Cortadella, Jordi
Scheduling and resource binding for low power 
E. Musoll and J. Cortadella 
Department of Computer Architecture 
Universitat Politècnica de Catalunya 
08071-Barcelona, Spain 
Abstract 
Decisions taken at the earliest steps of the design pro- 
cess may have a significant impact on the characteristics of 
the final implementation. This paper illustrates how power 
consumption issues can be tackled during the scheduling 
and resource-binding steps of high-level synthesis. Algorithms
for these steps are presented that target low-power data-paths
and, in some cases, trade off speed and area for low power. 
The algorithms focus on reducing the activity of the functional
units (adders, multipliers) by minimizing the transitions
of their input operands. The power consumption of
the functional units accounts for a large fraction of the 
overall data-path power budget. 
1 Introduction 
Current VLSI technology allows circuits with more and 
more functionality to be integrated on a single chip. Nowadays,
portable applications are not only wristwatches or 
calculators but also multimedia terminals, mobile telephones 
and other real-time systems. These new applications are 
based on intensive data-path tasks such as video compres- 
sion, speech recognition and other digital signal processing 
tasks. The portable feature of these applications imposes 
a limit on power consumption whereas the real-time char- 
acteristic forces the designer to comply with the required 
throughput. 
Power consumption can be taken into account at differ- 
ent levels in the design process [4]: technological, topo- 
logical, architectural and algorithmic levels. High-level 
synthesis (HLS) comprises techniques at the architectural 
and algorithmic level. Design decisions taken in the HLS 
process have a significant impact on the quality of the fi- 
nal implementation. Traditionally, HLS has been applied 
to obtain small and fast designs, but including power con- 
sumption as one of the design parameters or constraints has 
rarely been addressed. 
Preliminary studies on the HLS steps of scheduling and 
resource binding [9] targeting low power, reported in [14], 
have guided the algorithms presented in this paper. 
The main target for reducing power consumption is the 
set of functional units (adders, multipliers) because their 
power consumption accounts for a large fraction of the 
overall data-path power budget. The algorithms attempt to 
reduce the activity of the functional units by minimizing 
the switching activity of their input operands. 
Models derived from switch-level simulations of the 
main data-path components (functional, interconnection 
and storage units) [14] will be used to estimate the power 
reduction achieved with the algorithms. 
©1995 ACM 0-89791-771-5/95/0011/0104 $3.50 
The paper is organized as follows. Section 2 briefly presents
previous work on low-power circuits, with special emphasis on 
high-level techniques. Section 3 discusses
how the functional units consume power in data-path 
intensive systems. It briefly describes the scheduling and 
resource-binding tasks along with the basic ideas behind 
the algorithms presented in the paper. Sections 4 and 5 de- 
scribe how the scheduling and resource-binding algorithms 
for low power are implemented. Results are presented for 
some benchmarks. Power reduction results are obtained 
by comparing traditional scheduling and resource-binding 
methods with ours, which target low power. Section 6
concludes the paper. 
2 Previous work 
Most of the efforts in HLS for low power propose models 
and estimations of power consumption at algorithmic and 
architectural level [12, 13, 2, 6, 15]. 
Few authors have addressed the set of transformations at 
algorithmic and architectural level to obtain lower-power 
designs. In [5], the power consumption of additions and 
constant multiplications is studied as a function of the operand
activity. From this study, a data flow graph transformation
is derived for a typical operation in signal processing
applications. In [21], some memory transformations
for low-power systems are suggested. The aim of these 
transformations is to reduce the number of off-chip ref- 
erences. In [3], the traditional transformations for faster 
and smaller circuits are applied in order to evaluate the 
power-consumption savings. Whenever the resulting cir- 
cuit is faster than the required throughput, power-supply 
reduction can be applied to take advantage of its quadratic 
impact on consumption. 
High-level synthesis for low power has been addressed 
in [17, 7, 14]. In [17], an allocation method that attempts 
to reduce both the capacitance and switching activity of the 
synthesized design is presented. In [7], a scheduling and 
binding technique for reducing the activity in the buses is 
described. 
The algorithms presented in this paper are based on preliminary
results reported in [14], where high-level synthesis 
techniques for reducing the activity of functional units are 
also described and their potential benefits evaluated. 
3 Power consumption of the functional units 
Power consumption in the data-path accounts for a large 
fraction of the overall system power budget. Among the 
different types of units that compose a data-path, power 
consumption is mainly considered in the functional units 
because of their large contribution to the power consumption of 
the data-path. 

[Figure 1(a): 8 x 8-bit radix-4 Booth multiplier. Energy per operation as a function of the value of the unchanged operand (x axis from -128 to 127).]

Figure 1: Energy of a multiplier as a function of the (a) operand repetition and (b) operand activity. 

[Table 1: functional-unit occupation of the benchmarks (AR filter [11], Lee DCT [18], 4 x 4 matrix multiplication, LMS adaptive filter, pixel interpolation). Columns: benchmark; number of additions/multiplications; adders and multipliers used, with their latency in cycles; idle percentage.]

The power consumption of a functional unit depends on 
the operand variability of its inputs. Figure 1 illustrates this 
fact for an 8 x 8 radix-4 Booth multiplier [10]. 
In Figure 1(a), plot (3) represents the energy of the multiplier
in nJ/operation when one operand remains unchanged
(x axis) with respect to the previous operation and 
the other operand varies randomly¹. Line (2) is the average 
of plot (3) and line (1) is the average energy when both 
operands vary randomly with respect to the previous operation.
Comparing lines (1) and (2), the average power 
consumption of the multiplier is approx. 35% less when 
one operand remains unchanged. 
¹ Although data is correlated for some of the HLS applications, we focus
on fairly comparing the relative benefits of different circuit descriptions. 

[Figure 1(b): 8 x 8-bit radix-4 Booth multiplier. Energy per operation as a function of the AHD of the operands.]

Figure 1(b) represents the energy in nJ/operation 
as a function of the average Hamming distance 
(AHD) of the operands, defined as 
$\mathrm{AHD}(x) = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} H(x_i, x_{i-1})}{n}$, 
where $H(p, q)$ is the Hamming 
distance between $p$ and $q$ and $x_i$ is the value of operand $x$ 
in cycle $i$. Obviously, the power consumption tends to zero 
when the AHD of both operands tends to zero. The power 
consumption in the multiplier with an AHD of its operands 
of 4 and 2 is approx. 25% less than with AHD values of 4 
and 6. 
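The AHD definition above can be approximated on a finite trace of operand values. The following is a minimal sketch, assuming the operand values are available as non-negative integers (bit patterns already masked to the operand width); the names `hamming` and `average_hamming_distance` are illustrative and do not come from the paper.

```python
def hamming(p: int, q: int) -> int:
    """Number of bit positions in which the bit patterns p and q differ."""
    return bin(p ^ q).count("1")


def average_hamming_distance(trace):
    """Finite-trace estimate of AHD(x): the mean of H(x_i, x_{i-1})."""
    if len(trace) < 2:
        return 0.0
    total = sum(hamming(cur, prev) for cur, prev in zip(trace[1:], trace[:-1]))
    return total / (len(trace) - 1)


# A repeated operand has AHD 0; alternating 0x00/0xFF flips 8 bits per cycle.
repeated = [0x5A, 0x5A, 0x5A, 0x5A]
alternating = [0x00, 0xFF, 0x00, 0xFF]
print(average_hamming_distance(repeated), average_hamming_distance(alternating))
```

With a long enough trace the estimate approaches the limit in the definition; how long is enough depends on the input data.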
A functional unit in a data-path consumes both useful 
and useless power. It consumes useful power when it is 
executing an operation and consumes useless power when 
there is an input operand transition while the functional unit 
is idle. The control unit is usually synthesized using don't 
care values to minimize area or increase speed. Thus, an 
idle functional unit may have input operand changes due to 
the variation of the selection signals of multiplexers. 
Useless power is especially important in data-paths synthesized
from sparse schedules. A schedule is said to be 
sparse if its unit occupation is relatively low. Table 1 
presents the functional unit occupation for some benchmarks. 
The power consumption of a functional unit (idle or 
not) depends on the operand variability of its inputs. In 
the sequel, we will distinguish between operand activity 
and operand repetition. Both concepts are related to the 
variability of the bit-pattern that represents the operand. 
Operand activity relates to the variability of the bit-pattern 
of one operand from one cycle to the next. Operand repeti- 
tion relates to the coarse-grained variability of the operand, 
i.e. the operand may or may not change between two con- 
secutive cycles. 
Figures 1(a) and 1(b) illustrate how the power consumption
of a multiplier can be studied as a function of 
its operand repetition and operand activity respectively. 
Simple power-consumption models have been derived 
for each of the main data-path components as a function of 
the operand repetition and operand activity [14]. 
Since we focus on data-path circuits, whenever we refer 
to the power consumption of a design we mean the energy 
per operation executed by that design. Data-path circuits 
have a fixed throughput and, therefore, the energy/operation 
is the best metric to quantify the energy efficiency of 
this type of circuit [2]. 
3.1 Scheduling and resource binding for low 
power: basic ideas 
The HLS process is divided into three basic tasks [9]: allocation,
scheduling and resource binding. The latter task is 
itself decomposed into functional, storage and interconnection
unit binding steps, all of them tightly related to each 
other. They are usually ordered and executed sequentially 
due to the high complexity of the resource-binding task. 
Two traditional approaches for the scheduling and 
resource-binding tasks have been modified to target low-power
designs and their algorithms are presented in this 
paper. Both algorithms attempt to reduce power consumption
only in the functional units. They do not address the 
reduction of power in I/O, clocks or data transfers. 
The scheduling algorithm for low power uses a list- 
scheduling approach where the priorities of the operations 
of the ready-operation queue are set in such a way that op- 
erations sharing the same operand are scheduled in control 
steps as close as possible. Thus, the potential for a func- 
tional unit to reuse the same input value (and, therefore, to 
decrease its input activity) is higher. 
The resource-binding algorithm for low power is based 
on a clique partition of a restricted variable-lifetime com- 
patibility graph to obtain a register set that, for each func- 
tional unit, reduces the power consumption during idle cy- 
cles. Power consumption in functional units during non- 
idle cycles is further decreased by taking into account the 
AHD among the variables of the behavioral description and 
the commutative property of some operations. 
Although the scheduling technique will obtain better 
improvements if applied to dense schedules (e.g. sched- 
ules where the functional unit occupation is high) and 
the resource-binding technique is more suitable to sparse 
schedules, both techniques are compatible and complemen- 
tary. 
4 Scheduling for low power 
The goal of the scheduling algorithm for low power is 
to increase the potential for a functional unit (FU) to reuse 
an operand. Henceforth, we will call operand reutilization 
(OPR) the fact that an operand is reused by two operations 
consecutively executed in the same FU. 
Figure 2: (a) One possible schedule and FU binding with 
no OPRs, assuming two adders, and (b) improved schedule 
and FU binding with 2 OPRs. 
Figure 2, which shows two schedules and FU bindings of a simple
data-flow graph (DFG), illustrates the OPR 
concept. There are some operations in the DFG whose result
is the input for more than one operation. For example, 
the result of addition 1 is input for additions 2 and 4. As- 
sume that additions 2 and 4 are assigned to the same adder 
A. Assume also that between the execution of addition 2 
and 4 there is no other use of adder A. Then, one of the 
operands of adder A will not change from addition 2 to 
addition 4. 
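The OPR concept can be made concrete with a small counting routine. This is a sketch under stated assumptions: each FU is described by the ordered list of operations it executes, each operation is a pair of operand names, and the names (`r1`, `x`, `y`, `u`, `v`, and the function `count_oprs`) are hypothetical rather than taken from Figure 2.

```python
def count_oprs(fu_sequences):
    """Count operand reutilizations: pairs of operations executed
    consecutively on the same FU that share at least one input operand."""
    oprs = 0
    for ops in fu_sequences.values():
        for prev, curr in zip(ops, ops[1:]):
            if set(prev) & set(curr):
                oprs += 1
    return oprs


# r1 stands for the result of addition 1, read by additions 2 and 4.
same_adder = {"A": [("r1", "x"), ("r1", "y")], "B": [("u", "v")]}
split_adders = {"A": [("r1", "x"), ("u", "v")], "B": [("r1", "y")]}
print(count_oprs(same_adder), count_oprs(split_adders))  # 1 0
```

When additions 2 and 4 run back to back on adder A, one input of A keeps its value and one OPR is counted; any other use of A between them breaks the reuse.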
Figure 2(a) shows a schedule and an FU binding with 
two adders obtained with a traditional list-scheduling algo- 
rithm (LS) for the scheduling task and a clique-partitioning 
approach with weights to minimize the number of interconnection
units for the FU-binding task. Neither of the two 
OPRs is achieved. 
Figure 2(b) shows the schedule and FU binding obtained 
with the list-scheduling algorithm for low power (LPLS) 
for the scheduling task and a slightly different approach to 
the clique-partitioning for the FU-binding task. Now both 
OPRs are achieved. 
LPLS also trades off latency for OPRs. This idea is also 
illustrated in Figure 2. If addition 5 happens to be in the 
critical path, the schedule and FU binding in Figure 2(c) 
has one more cycle of latency than the one in Figure 2(b). 
4.1 LPLS key features 
Some heuristics have been included in the traditional 
list-scheduling algorithm (Figure 3(a)) to obtain its low-power
version (see Figure 3(b) for a simplified algorithm). 
Algorithms in Figure 3 follow the notation in [9]. 
Those operations that share an operand are grouped 
in operand-sharing sets (henceforth, SS) (CREATE_ALL_SS()). 
All operations of a group (IS_OSS()) can be executed on the 
same FU. An operation of an SS is able to reserve the FU 
where it is going to be assigned for the rest of its SS if 
the SS does not have one reserved yet (RESERVE_FU_IN_SS()). Given 
an SS and its reserved FU, in the best case |SS| - 1 OPRs 
can be obtained. All these consecutive OPRs on the same 
FU are called an operand-sharing chain. LPLS attempts 
to schedule as many operations as possible of the SS on its 
reserved FU. It also attempts not to execute other operations
on it in order to prevent breaking the operand-sharing 
chain (OBTAIN_FREE_AND_NOT_RESERVED_FU()). The scheduling 
of the operations of an SS is guided by giving more priority 
to the operations in the operand-ready queue whose SS 
already has a reserved FU (UPDATE_PRIORITIES()). The priority of 
an operation is decreased (i.e. it will be scheduled later) if 
it is going to be assigned to an FU not reserved by its SS. 
If the operation scheduled in a later cycle happens to be in 
the critical path, the final latency is increased. 
All the information about achieved OPRs gathered dur- 
ing the execution of LPLS is transferred to the FU-binder as 
a set of binding constraints. The FU-binder first complies 
with all these constraints (i.e. achieves all OPRs already 
obtained by LPLS) and after that proceeds as the tradi- 
tional FU-binder with weights to minimize the number of 
interconnection units (multiplexers). 
LS has a complexity of $O(n)$, where $n$ is the number of 
operations. LPLS has a complexity of $O(n^2 m)$, where $m$ 
is the number of unit types. 
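The reservation idea behind LPLS can be sketched as follows. This is not the authors' implementation: it assumes a single FU type with unit latency, places each operation in at most one operand-sharing set, orders ready operations only by whether their set holds a reservation, and adds a fallback (not described in the paper) so that the loop cannot stall. All names are hypothetical.

```python
def lpls_sketch(ops, deps, num_fus):
    """Simplified low-power list scheduling (illustrative only).

    ops:  {op_name: (operand, operand)}, one FU type, unit latency.
    deps: {op_name: set of predecessor op_names}; must describe a DAG.
    Returns {op_name: (control_step, fu_index)}.
    """
    # Operand-sharing sets: operations that read the same variable.
    readers = {}
    for op, operands in ops.items():
        for var in operands:
            readers.setdefault(var, []).append(op)
    members = {var: names for var, names in readers.items() if len(names) > 1}
    ss_of = {}
    for var, names in members.items():
        for op in names:
            ss_of.setdefault(op, var)  # assumption: one set per operation

    reserved = {}  # set id -> FU reserved for that set
    schedule = {}
    step = 0
    while len(schedule) < len(ops):
        step += 1
        ready = [op for op in ops if op not in schedule
                 and all(p in schedule for p in deps.get(op, ()))]
        # Operations whose set already holds a reservation go first.
        ready.sort(key=lambda op: ss_of.get(op) not in reserved)
        free = set(range(num_fus))
        placed = 0
        for op in ready:
            ss = ss_of.get(op)
            unreserved = sorted(free - set(reserved.values()))
            if ss is not None and ss not in reserved and unreserved:
                reserved[ss] = unreserved[0]
            if ss in reserved:
                fu = reserved[ss] if reserved[ss] in free else None
            else:
                fu = unreserved[0] if unreserved else None
            if fu is None:
                continue  # deferred to a later control step
            free.discard(fu)
            schedule[op] = (step, fu)
            placed += 1
        if placed == 0:  # fallback (not in the paper): avoid a stall
            schedule[ready[0]] = (step, 0)
        # Release the FU once every operation of a set has been scheduled.
        for ss in list(reserved):
            if all(op in schedule for op in members[ss]):
                del reserved[ss]
    return schedule


# Hypothetical DFG: a2 and a4 both read r1, the result of a1.
ops = {"a1": ("x", "y"), "a2": ("r1", "z"), "a3": ("p", "q"), "a4": ("r1", "w")}
deps = {"a2": {"a1"}, "a4": {"a1"}}
print(lpls_sketch(ops, deps, num_fus=2))
```

In this example a2 and a4 are serialized on the same adder, so one OPR is obtained at the cost of one extra control step compared with running them in parallel, which is the latency trade-off described above.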
4.2 Results 
LS is compared with its low-power version LPLS over 
some data-path benchmarks. With LS, many of the OPRs 
are achieved because the FU binder already forces some 
OPRs in its attempt to minimize the number of multiplexers. 
Several results are shown in Table 2. The benchmarks 
have been scheduled with the resources reported in Table 1. 
To estimate power consumption, 12-bit-wide FUs are as- 
sumed. 
The effect of an OPR on the power consumption of an 
FU has been evaluated by measuring the energy of the FU 
as a function of the operand repetition (see Section 3). The 
V is the set of operations. 
PList_tk is the priority list for each operation type t_k ∈ T. 
C_step is the current control step. 
m is |T|. 
N_tk is the number of FUs performing operations of type t_k. 
S_current is the current schedule. 

(a) 
INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm); 
C_step = 0; 
while ((PList_t1 ≠ ∅) or ... or (PList_tm ≠ ∅)) do 
    C_step = C_step + 1; 
    for k = 1 to m do 
        for funit = 1 to N_k do 
            if PList_tk ≠ ∅ then 
                SCHEDULE_OP(S_current, FIRST(PList_tk), C_step); 
                PList_tk = DELETE(PList_tk, FIRST(PList_tk)); 
            endif 
        endfor 
    endfor 
    INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm); 
endwhile 

(b) 
ASS = CREATE_ALL_SS(V); 
INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm); 
C_step = 0; 
while ((PList_t1 ≠ ∅) or ... or (PList_tm ≠ ∅)) do 
    C_step = C_step + 1; 
    for k = 1 to m do 
        UPDATE_PRIORITIES(PList_tk); 
        while PList_tk ≠ ∅ do 
            op = FIRST(PList_tk); 
            if IS_OSS(ASS, op) then 
                if not SS_HAS_RESERVED_FU(SS) then 
                    funit = GET_FREE_AND_NOT_RESERVED_FU(SS); 
                    RESERVE_FU_IN_SS(SS, funit); 
                endif 
                schedule_operation = TRUE; 
            else 
                funit = GET_FREE_AND_NOT_RESERVED_FU(SS); 
                if funit = ∅ then 
                    schedule_operation = FALSE; 
                else 
                    schedule_operation = TRUE; 
                endif 
            endif 
        endwhile 
    endfor 
    INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm); 
endwhile 

Figure 3: (a) Traditional list-scheduling algorithm and (b) list-scheduling algorithm for low power. 
Table 2: Latency and number of OPRs (for both types 
of FUs) achieved. (1) benchmark; (2/3) latency obtained 
with LS/LPLS; (4) max. OPRs; (5/6) achieved OPRs with 
LS/LPLS and (7) power reduction in the functional units. 
last column of Table 2 accounts for the savings in power 
consumption in the FUs due to the increase in achieved 
OPRs obtained with LPLS. The power consumption due to 
an operation of the benchmark depends on the type of FU 
where this operation is scheduled and on how many operand 
changes that FU has when it executes the operation. A 17% 
power reduction is achieved in the Daubechies filter and 
7% in the 4 x 4 matrix multiplication. The rest of the 
benchmarks present little or no power-consumption 
reduction for the following reasons: (a) the maximum 
number of OPRs is too small compared to the number 
of operations of the benchmark and (b) LPLS achieves little 
or no increase in OPRs with respect to LS. 
5 Resource binding for low power 
The goal of the resource-binding algorithm for low 
power (LPRB) is to reduce power consumption in the FUs 
once the scheduling and FU-binding tasks have been done. 
LPRB tackles both useful and useless power consumption 
of FUs. 
LPRB assumes that the control unit maintains, for each 
FU, the same registers on its inputs during idle cycles. 
The LMS benchmark (see Figure 4(a) for its DFG) will 
illustrate how LPRB works. 
5.1 Reducing useless power 
LPRB addresses the reduction of useless power con- 
sumption by building up a register set that minimizes the 
number of input changes on the idle units. This whole process 
is represented in the first part of the algorithm in Figure 5. 
A traditional approach for building up a register set (reg- 
ister binding) is the clique-partitioning method. After ap- 
plying this method to a lifetime compatibility graph for the 
variables (CG), each clique of the partition corresponds to 
one register. LPRB uses the same traditional approach but 
applied to a different variable-compatibility graph (LPCG). 
To build up the LPCG, the register-binding for low power 
first constructs the CG (CREATE_CG()). In a second step, a set 
of edges of the CG are removed (REMOVE_EDGE()). Each edge 
removed from the CG connects two compatible variables 
with the following property: should both be assigned to the 
same register, an idle FU would have an input change. 
Figure 4(b) illustrates this concept. It shows the schedule 
and FU binding for the DFG of Figure 4(a). The shadowed 
slots represent the cycles in which the FUs are idle. For 
each FU, the variables in parentheses in the shadowed slots 
force the control unit to maintain the same registers on its 
inputs during idle cycles. Let us consider what happens 
with FU A0 in cycle 10. An input change will occur at the 
inputs of idle unit A0 if, for example, variables v16 and 
v21 are assigned to the same register, because multiplier 
M0 will modify the value of that register in cycle 9. The 
same happens with variable pair (v20 - v21). But not all 
the variables of these two pairs have compatible lifetimes. 
In this example, only the pair (v20 - v21) 
does. Thus, for the FU A0 in cycle 10, this edge is removed 
from the CG. If the same procedure is applied to all the idle 
slots of Figure 4(b), 6 edges will be removed. 
The drawback of removing edges is that a larger register 
set may be obtained, as the results will confirm later. 
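The effect of removing an edge from the compatibility graph can be illustrated with a small sketch. The lifetimes below are hypothetical (they are not read from Figure 4(b)), and the greedy partition is a simple stand-in for the clique-partitioning method, not the algorithm used in the paper.

```python
from itertools import combinations


def compatibility_graph(lifetimes):
    """Edges join variables whose lifetimes [birth, death) do not overlap."""
    edges = set()
    for u, v in combinations(sorted(lifetimes), 2):
        (b1, d1), (b2, d2) = lifetimes[u], lifetimes[v]
        if d1 <= b2 or d2 <= b1:
            edges.add(frozenset((u, v)))
    return edges


def greedy_clique_partition(variables, edges):
    """Place each variable in the first clique it is fully compatible with.
    Each resulting clique corresponds to one register."""
    cliques = []
    for v in sorted(variables):
        for clique in cliques:
            if all(frozenset((v, w)) in edges for w in clique):
                clique.append(v)
                break
        else:
            cliques.append([v])
    return cliques


# Hypothetical lifetimes: only v20 and v21 are lifetime-compatible.
lifetimes = {"v16": (1, 10), "v20": (2, 6), "v21": (9, 12)}
cg = compatibility_graph(lifetimes)
# LPCG: drop the edge whose sharing would change an idle FU's input.
lpcg = cg - {frozenset(("v20", "v21"))}
print(len(greedy_clique_partition(lifetimes, cg)),
      len(greedy_clique_partition(lifetimes, lpcg)))  # 2 3
```

With the full CG two registers suffice; once the edge (v20 - v21) is removed, the two variables can no longer share a register and a third one is needed.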
Not all the useless power consumption in the idle FUs 
Figure 4: (a) DFG of the LMS filter and (b) schedule and FU binding with one adder (one cycle) and two multipliers (two 
cycles). In the figure, operation a1 is scheduled in unit A0, reads variables v1 and v5, and writes variable v18. 
/* reduce power consumption in idle functional units */ 
CG = CREATE_CG(V); 
for c = 1 to MAX_CYCLES do 
    for fu = 1 to MAX_FUs do 
        if IDLE(fu, c) then 
            foreach operation op whose result 
                    is in cycle (c - 1) MOD MAX_CYCLES do 
                op_source = OPERATION_IN_FU(fu); 
                REMOVE_EDGE(CG, <VAR_DEST(op_source), VAR_A(op)>); 
                REMOVE_EDGE(CG, <VAR_DEST(op_source), VAR_B(op)>); 
            endforeach 
        endif 
    endfor 
endfor 
REGISTER_BINDING(CG); 
/* reduce power consumption in non-idle functional units */ 
foreach FU fu do 
    OBTAIN_BEST_VARIABLE_ORDER(fu, AHD); 
endforeach 
INTERCONNECTION_UNIT_BINDER(); 

Figure 5: Resource binding algorithm for low power. 
is eliminated with this technique. As an example, let us 
consider FU A0 in cycle 16 in Figure 4(b). Because the 
previous operation executed in FU A0 has variable v7 both as an 
operand and as its result, FU A0 has an input change 
in cycle 16. 
5.2 Further reduction of useful power 
Once the register set has been derived, the useful power 
consumption in FUs may be reduced if the commutative 
property of some operations and the average Hamming dis- 
tance (AHD) among the variables are taken into account. 
The process to reduce the power consumption in non-idle 
units is shown in the second part of the algorithm in Fig- 
ure 5. 
As an example, consider additions a1 and a2 of Figure 4(b).
With the variable input order shown, the FU A0 
has an AHD on one of its inputs of H(v1, v9) and on the 
other input of H(v5, v13). Recall from Section 3 how the 
power consumption of an FU depends on the AHD of its 
inputs. If the AHD information among the variables is 
available, the reduction in power obtained by changing the 
variable order in addition a2 can be evaluated. Obtaining 
the best variable order for all operations requires 
an exhaustive exploration. Thus, for simplicity, LPRB follows
a greedy approach (OBTAIN_BEST_VARIABLE_ORDER()). 
By defining a variable order, the degrees of freedom 
for the interconnection-unit binder are reduced because the 
correct variable order (which implies the correct register 
order) has to be satisfied. This implies that the number of 
multiplexers will be at least equal to the number obtained 
if no useful power is reduced. 
5.3 Results 
The traditional resource-binding algorithm (TRB) is compared 
with its low-power version LPRB over 
three data-path benchmarks for which we have representative
input data. The AHD among the variables has been 
obtained by profiling the benchmarks². In all 
of them, the scheduling and FU-binding tasks have been 
done with the low-power methods described in Section 4. 
The benchmarks have been scheduled with the resources 
reported in Table 1. 
By means of switch-level simulations [20] of the basic
functional units, multiplexers and registers, power-consumption
models similar to the one in Figure 1(b) have 
been obtained. 12-bit-wide FUs are assumed in the power 
results. 
For both resource-binding algorithms, the useful and useless 
power consumption of FUs and the number of registers 
and multiplexers³, along with estimations of their power 
consumption, are reported in Table 3. 

² It is important to notice that the AHD among the variables highly 
depends on the input data. The AHD of the benchmarks related to image 
processing has been obtained using the well-known Lena benchmark. We 
have observed that the AHD values converge fast (in approx. 500 iterations 
of the algorithm). 

Table 3: Comparison between the traditional resource-binding algorithm (TRB) and its low-power version (LPRB). All 
power estimations are in nJ/iteration. (1) number of registers; (2) power due to registers; (3) number of multiplexers; (4) 
power due to multiplexers; (5) useless/useful power of FUs and (6) total data-path power. 
[Table 3 layout: one row per benchmark, with the columns listed in the caption given for each algorithm, followed by the power reduction.]
In the 1-D 8-input Lee DCT and pixel interpolation 
benchmarks, no improvement has been observed when ap- 
plying the algorithm for reducing the useful power con- 
sumption in FUs. The greedy method used did not change 
the variable order for any FU. 
In the pixel interpolation benchmark, only two adders 
are used. This implies that the power consumption due to 
the registers and multiplexers plays an important role in this 
benchmark. 
It is worth noting the area-power trade-off: in two 
benchmarks, the number of registers and multiplexers has 
increased when applying LPRB. Although the total area has 
increased, the power consumption has been reduced. 
6 Conclusions 
Algorithms that reduce the activity of the functional 
units by minimizing the switching activity of their input 
operands have been presented for the high-level synthesis 
tasks of scheduling and resource binding. 
Significant power-consumption reduction is obtained in 
the scheduling task with little or no increase 
in latency. Further power reduction is achieved in the 
resource-binding task by increasing the number of storage 
and interconnection units and taking into account both the 
commutative property of some operations and the average 
Hamming distance among the variables of the data-flow 
graph to be synthesized. 
In this paper, the impact of the number of functional 
units on the power consumption has not been addressed. 
Our future work is devoted to the evaluation of this impact. 
Acknowledgment 
We would like to thank Rosa Badia for her constructive 
comments, which were instrumental in improving this paper. 
This work has been partially supported by CICYT TIC94-0531-E
and Dept. d'Ensenyament de la Generalitat de Catalunya. 
³ The equivalent number of 2-input multiplexers. 

References 
[1] C. Brown and B. Shepherd. Graphics File Formats: Reference and Guide. Prentice-Hall, 1995. 
[2] T. Burd and R. Brodersen. Energy efficient CMOS microprocessor design. In Proc. 28th Hawaii Int. Conf. on System Sciences, Jan. 1995. 
[3] A. Chandrakasan, M. Potkonjak, J. Rabaey, and R. Brodersen. HYPER-LP: A system for power minimization using architectural transformations. IEEE Trans. on CAD, pages 300-303, Nov. 1992. 
[4] A. Chandrakasan, S. Sheng, and R. Brodersen. Low power CMOS digital design. IEEE Trans. on SSC, 27(4):473-483, Apr. 1992. 
[5] A. Chatterjee and R. Roy. Synthesis of low power linear DSP circuits using activity metrics. In Proc. of the Int. Conf. on VLSI Design, pages 265-270, Jan. 1994. 
[6] R. M. D. Marculescu and M. Pedram. Information theoretic measures of energy consumption at register transfer level. In Int. Symp. on Low Power Design, pages 81-86, Apr. 1995. 
[7] A. Dasgupta and R. Karri. Simultaneous scheduling and binding for power minimization during microarchitectural synthesis. In Int. Symp. on Low Power Design, pages 69-74, Apr. 1995. 
[8] P. Dewilde, E. Deprettere, and R. Nouta. Parallel and pipelined VLSI implementation of signal processing algorithms, chapter 15, pages 257-264. VLSI and Modern Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985. 
[9] D. Gajski, N. Dutt, A. Wu, and S. Lin. High-level synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992. 
[10] I. Koren. Computer Arithmetic Algorithms. Prentice-Hall, 1993. 
[11] S. Kung. On supercomputing with systolic/wavefront array processors. In Proc. of the IEEE, pages 867-884, July 1984. 
[12] P. Landman and J. Rabaey. Black-box capacitance models for architectural power analysis. In Proc. Int. Workshop on Low Power Design, pages 165-170, Apr. 1994. 
[13] P. Landman and J. Rabaey. Activity-sensitive architectural power analysis for the control path. In Int. Symp. on Low Power Design, pages 93-98, Apr. 1995. 
[14] E. Musoll and J. Cortadella. High-level synthesis techniques for reducing the activity of functional units. In Int. Symp. on Low Power Design, pages 99-104, Apr. 1995. 
[15] F. Najm. Towards a high-level power estimation capability. In Int. Symp. on Low Power Design, pages 87-92, Apr. 1995. 
[16] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1992. 
[17] A. Raghunathan and N. Jha. Behavioral synthesis for low power. In Proc. of the Int. Conf. on Computer Design, pages 318-322, Oct. 1994. 
[18] K. Rao and P. Yip. Discrete Cosine Transform. Academic Press, 1990. 
[19] J. Treichler, C. Johnson, Jr., and M. Larimore. Theory and Design of Adaptive Filters. New York: John Wiley & Sons, 1987. 
[20] A. van Genderen. SLS: An efficient switch-level timing simulator using min-max voltage waveforms. In Proc. VLSI 89 Conf., pages 79-88, Aug. 1989. 
[21] S. Wuytack, F. Catthoor, F. Franssen, L. Nachtergaele, and H. De Man. Global communications and memory optimizing transformations for low power. In Proc. Int. Workshop on Low Power Design, pages 203-208, Apr. 1994. 
