A Thread Partitioning Algorithm in Low Power High-Level Synthesis by Uchida Junpei et al.
1 D-2 
A Thread Partitioning Algorithm in Low Power High-Level Synthesis 
Jumpei Uchida†, Nozomu Togawa††,‡, Masao Yanagisawa†, Tatsuo Ohtsuki†
† Dept. of Computer Science, Waseda University
†† Dept. of Information and Media Sciences, The University of Kitakyushu
‡ Advanced Research Institute for Science and Engineering, Waseda University
3-41 Okubo, Shinjuku, Tokyo, 169-8555, Japan 
Tel: +81-3-320!3-3211(5716) Fax: +81-3.32044875 
Email: uchida@yanagi.comm.waseda.ac.jp
Abstract 
This paper proposes a thread partitioning algorithm for low power high-level synthesis. The algorithm targets high-level synthesis systems in which parallel behaving circuit blocks (threads) can be described explicitly. It first focuses on a local register file RF in a thread and partitions the thread into two sub-threads, one of which has RF and the other does not. The partitioned sub-threads need to be synchronized with each other to keep the data dependency of the original thread. Since the partitioned sub-threads have waiting time for synchronization, gated clocks can be applied to each sub-thread. We can then synthesize a low power circuit with a low area overhead compared to the original circuit. Experimental results demonstrate the effectiveness and efficiency of the algorithm.
1 Introduction 
Design complexity has been increasing rapidly in recent years. At the same time, demand for low power system VLSIs is growing, driven by cellular phones, PDAs, and notebook PCs. High speed, small area, and low power system VLSIs must be developed in a short period of time. One solution is to use high-level synthesis systems that are able to synthesize low power system VLSIs.
Several power reduction techniques for high-level synthesis have been proposed. Gated clocks have been exploited for power reduction [1],[6],[10]. In [1], gated clocks were applied efficiently by reducing waiting time of logic circuits. In [6], area/delay/power estimation for low power system VLSIs with gated clocks was proposed. In [10], a clock tree was generated based on the profile of switching activities at high level. A binding technique of functional units for reducing switching activities was proposed in [4]. A register allocation and binding technique was adopted for the purpose of minimizing switching probability [2]. A scheduling technique for low power circuits was proposed in [5].
Several high-level synthesis systems have been developed recently. In the Bach system [3],[8], parallel behaving circuit blocks (threads) can be described explicitly in BachC, the input language of the Bach system. Each thread has waiting time for synchronization, since threads communicate synchronously to keep the data dependency. The Bach system provides a high-level power reduction technique in which gated clocks are applied to threads [7].
In this paper, we propose a thread partitioning algorithm for low power high-level synthesis. The algorithm first focuses on a local register file RF in a thread. It partitions the thread into two sub-threads, one of which has RF and the other does not. The two partitioned sub-threads need to be synchronized with each other to keep the data dependency of the original thread. Since the two partitioned sub-threads have waiting time for synchronization, gated clocks can be applied to each sub-thread. We can then synthesize a low power circuit with a low area overhead compared to the original circuit.
This paper is organized as follows: Section 2 describes the gated clocks of the Bach system, which are adopted in the proposed algorithm, and gives an example of reducing power consumption. Section 3 proposes a thread partitioning algorithm. Section 4 shows several experimental results and evaluates the effectiveness of the proposed algorithm. Section 5 gives concluding remarks.
2 Motivating example
2.1 Clock gating for threads 
We apply gated clocks to threads when they have waiting time. Gated clocks are available in the Bach system, and this section describes their controller.
In the Bach system, threads have two types of waiting time.
(I) A thread waits until all the operations of the other threads finish.
(II) A thread waits until the handshake of synchronous communication is completed.
Figure 1(a) explicitly shows that the processes A and B are executed in parallel. The processes A and B consist of sequential operations and are allocated to two threads (Fig. 1(b)). The processes described by the par construction start and finish at the same time. The synchronous communication (I) is generated at the start and the end of the threads. Suppose that the process A in thread1 finishes before the process B in thread2 finishes in Fig. 1. Then thread1 waits until the process B in thread2 finishes. In Figure 2, the signal sy_req1 is high when all the operations of thread1 finish. The signal sy_ok is high when all the operations of the two threads finish.
0-7803-8175-0/04/$17.00 ©2004 IEEE.
Figure 1. Synchronous communication between threads. (a) The par construction described by BachC (par { A; B; }). (b) The synchronous communication between thread1 and thread2.
Figure 2. The circuit of the synchronous communication between threads.
The synchronous communication (II) is needed to keep the data dependency between the threads. Suppose that a data dependency exists between the processes A and B. Then synchronous communication must be inserted in thread1 and thread2. Thread1 has waiting time when the sender of the synchronous communication in thread1 is ready before the receiver of the synchronous communication in thread2 is ready.
Figure 3 shows the gate-level circuit model for applying gated clocks. Since the controller of the gated clocks consists of small logic gates, low power circuits are synthesized with a low area overhead compared to the original circuits to which gated clocks are not applied. In the case (I), the signal sy_req_t is high when the synchronous communication of the par construction in the thread is ready, and the signal sy_ok is low until all the operations of the other threads finish. The case (II) is synchronous communication by a channel. In the case (II), the signals ch1_S_req_t and ch2_S_req_t are high when the synchronous communication of the channels 1 and 2 is ready, and the signals ch1_S_ack_t and ch2_R_ack_t are high until it is completed. Then the signal gclk_t is obtained by
gclk_t <= ctrl_t or clk (1)
Since the clock is not supplied to a thread which waits for other threads when gated clocks are applied to each thread, the power consumption of the circuit is reduced [7].
Figure 3. Gated clocks controller in the Bach system.
Figure 4. Thread partitioning.
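The effect of Equation (1) can be illustrated with a small simulation. The following is a minimal sketch, assuming that the gated clock is the logical OR of the clock and a control signal that is high while the thread waits; the names `gated_clock`, `rising_edges`, `ctrl` and the waveform values are hypothetical and are not taken from the Bach system.

```python
def gated_clock(clk, ctrl):
    """OR-based gating in the style of Equation (1): output is held high while ctrl is 1."""
    return [c | g for c, g in zip(clk, ctrl)]


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the clock edges that reach the registers."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if (prev, cur) == (0, 1))


# Eight clock cycles sampled twice per cycle (low phase, high phase).
clk = [0, 1] * 8
# Hypothetical wait window: the thread waits during cycles 3 to 6 (0-based).
ctrl = [0] * 6 + [1] * 8 + [0] * 2

gclk = gated_clock(clk, ctrl)
print(rising_edges(clk), rising_edges(gclk))
```

In this sketch the registers of the waiting thread see no rising edge during the four waiting cycles, which is the source of the power saving described above.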
We illustrate the gated clocks discussed in Section 2.1 with an example. A circuit consists of several threads. For example, the circuit in Figure 4 consists of threadA, threadB, and threadC, which are executed in parallel. Figure 5(a) shows the execution time of the threads before applying the thread partitioning. ThreadC has the longest execution time, and threadB has a longer execution time than threadA. Then threadA and threadB have waiting time. In the case of Fig. 5(a), the power consumption of the circuit is reduced, since the clock is not supplied to a thread which waits for other threads when gated clocks are applied to each thread.
We partition each thread into two sub-threads in Fig. 4. For example, threadA is partitioned into sub-threadA1 and sub-threadA2. Figure 5(b) shows the execution time of the threads after applying the thread partitioning algorithm. When a data dependency exists between sub-threads, the sub-threads need to be synchronized with each other to keep the data dependency (Fig. 4). As shown in Fig. 5, they are not executed in parallel at all times. Suppose that the area overhead caused by partitioning the threads in Fig. 5(b) is low. In Fig. 5, the waiting time of threadA is 50ns. ThreadA corresponds to sub-threadA1 and sub-threadA2. The average waiting time of sub-threadA1 and sub-threadA2 is (75+25+50)/2=75ns. In Fig. 5, the waiting time of the case (a) and the case (b) is
$$\text{(a)}: 50 + 25 = 75\,\mathrm{ns} \tag{2}$$
$$\text{(b)}: (75 + 25 + 50 + 65 + 35 + 25 + 30 + 30)/2 = 162.5\,\mathrm{ns} \tag{3}$$
$$75\,\mathrm{ns} < 162.5\,\mathrm{ns} \tag{4}$$
We apply gated clocks to the sub-threads which wait for other sub-threads. When the total waiting time of Fig. 5(b) increases compared to that of Fig. 5(a), we can synthesize a low power circuit.
Figure 5. Execution time. (a) Before applying the thread partitioning. (b) After applying the thread partitioning.
2.2 Power reduction by gated clocks and thread partitioning
We confirm that the power consumption of a circuit can be reduced by partitioning a thread. Power consumption of a circuit can be classified into static and dynamic power consumption. Static power consumption is the power consumed when the gates do not switch. Dynamic power consumption is the power consumed by charging and discharging the load capacitances in the gates. Generally, dynamic power consumption accounts for a greater portion of the power consumption of the circuit than static power consumption. The purpose of this paper is to reduce dynamic power consumption.
Generally, dynamic power consumption $P_{dyn}$ in a CMOS circuit is given by
$$P_{dyn} = V_{dd}^2 \cdot f \cdot C_{eff} = P_{CLK} + P_{gate} \tag{5}$$
where $V_{dd}$ is the power supply voltage, $f$ is the clock frequency, and $C_{eff}$ is the effective capacitance. Let $C_L$ be the load capacitance and $p_{0 \to 1}$ be the probability of the signal transition "0"→"1". Then $p_{0 \to 1}$ is equivalent to switching activity. The effective capacitance $C_{eff}$ is obtained by $C_{eff} = C_L \times p_{0 \to 1}$ [9]. $P_{CLK}$ is the power consumption of the clock tree, and $P_{gate}$ is the power consumption of gates. The technique using gated clocks reduces the switching activity of registers.
Figure 6 shows an example of reducing dynamic power consumption. In Fig. 6(a), we suppose that the register file sizes of Reg1, Reg2 and Reg3 are chosen arbitrarily, and the data paths D1 and D2 consist of registers and functional units. The execution from Reg1 to Reg2 takes m clocks, and the execution from Reg2 to Reg3 takes n clocks. In Fig. 6(a), D2 executes useless operations during the former m clocks, and D1 executes useless operations during the latter n clocks. Then D1 and D2 have useless power consumption. In Fig. 6(b), threadA is partitioned into sub-threadA1 and sub-threadA2. Since a data dependency exists between sub-threadA1 and sub-threadA2, they are not executed in parallel. Then sub-threadA2 has waiting time during the former m clocks, and sub-threadA1 has waiting time during the latter n clocks. The clock signals do not need to be supplied to a sub-thread which has waiting time. We can therefore apply gated clocks so that clocks are not supplied to waiting sub-threads, and obtain a low power circuit.
Figure 6. Example for reducing power consumption. (a) is the original circuit. (b) is the thread partitioned circuit.
The power consumption of a clock tree is directly proportional to the effective capacitance. The technique using gated clocks reduces the effective capacitance of the clock tree. Therefore the power consumption of the clock tree is reduced. By Equation (5), the power consumption $P_{Reg1,orig}$ of Reg1 in threadA and the power consumption $P_{Reg1,par}$ of Reg1 in sub-threadA1 are
$$P_{Reg1,orig} = V_{dd}^2 \cdot f \cdot C_{Reg1} \cdot S_{Reg1,orig} \tag{6}$$
$$P_{Reg1,par} = V_{dd}^2 \cdot f \cdot C_{Reg1} \cdot S_{Reg1,par} \tag{7}$$
$$S_{Reg1,orig} \ge S_{Reg1,par} \tag{8}$$
where $C_{Reg1}$ is the load capacitance of Reg1, and $S_{Reg1,orig}$ and $S_{Reg1,par}$ are the switching activity of Reg1 before and after partitioning the thread respectively. The power consumption of Reg3, D1, and D2 is obtained in the same way as that of Reg1 by Equations (6), (7), (8). In Fig. 6, the gated clocks are not applied to Reg2, since Reg2 is the register file outside the sub-threads. In Fig. 6(a) and (b), the switching activity of Reg2 is not changed before and after the thread is partitioned. Therefore the power consumption $P_{Reg2,orig}$ and $P_{Reg2,par}$ of Reg2 in Fig. 6(a) and (b) are
$$P_{Reg2,orig} = P_{Reg2,par} \tag{9}$$
Therefore the power consumption $P_{orig}$ and $P_{par}$ of the circuits in Fig. 6(a) and (b) are obtained by
$$P_{orig} = P_{Reg1,orig} + P_{Reg2,orig} + P_{Reg3,orig} + P_{D1,orig} + P_{D2,orig} \tag{10}$$
$$P_{par} = P_{Reg1,par} + P_{Reg2,par} + P_{Reg3,par} + P_{D1,par} + P_{D2,par} \tag{11}$$
$$P_{orig} \ge P_{par} \tag{12}$$
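The register power comparison of Equations (6) to (8) can be made concrete with a short calculation. The sketch below assumes that gating a sub-thread scales the switching activity of its registers by the fraction of clocks actually supplied (m out of m+n for Reg1 in Fig. 6); this scaling, the helper names, and all numeric values are illustrative assumptions, not figures from the paper.

```python
def dynamic_power(vdd, freq, c_load, switching_activity):
    """Dynamic power V_dd^2 * f * C_L * p, i.e. Equation (5) with C_eff = C_L * p."""
    return vdd ** 2 * freq * c_load * switching_activity


def gated_activity(activity, active_clocks, total_clocks):
    """Assumption: activity shrinks in proportion to the clocks that are supplied."""
    return activity * active_clocks / total_clocks


# Hypothetical operating point and register parameters.
VDD, FREQ, C_REG1, S_ORIG = 3.3, 50e6, 2e-12, 0.5
m, n = 6, 4  # Reg1 -> Reg2 takes m clocks, Reg2 -> Reg3 takes n clocks

# Reg1 sits in sub-threadA1, which is clocked only during the former m clocks.
s_par = gated_activity(S_ORIG, m, m + n)
p_orig = dynamic_power(VDD, FREQ, C_REG1, S_ORIG)
p_par = dynamic_power(VDD, FREQ, C_REG1, s_par)
print(f"P_Reg1_orig = {p_orig * 1e6:.1f} uW, P_Reg1_par = {p_par * 1e6:.1f} uW")
```

Under these assumptions the ratio of the two powers equals m/(m+n), so the saving grows with the share of the execution time that the sub-thread spends waiting.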
The power consumption $P_{CLK,orig}$ and $P_{CLK,par}$ of the clock tree of the circuits in Fig. 6 are
$$P_{CLK,orig} = V_{dd}^2 \cdot f \cdot C_{orig} \tag{13}$$
$$P_{CLK,par} = V_{dd}^2 \cdot f \cdot C_{par} \tag{14}$$
Since the effective capacitance is reduced by applying gated clocks to the partitioned sub-threads, the effective capacitance $C_{orig}$ and $C_{par}$ of the clock tree in Fig. 6(a) and (b) are
$$C_{orig} \ge C_{par} \tag{15}$$
Then the power consumption of the clock tree is
$$P_{CLK,orig} \ge P_{CLK,par} \tag{16}$$
Figure 7. A thread partitioning based on a local register. (a) The behavioral description before applying thread partitioning. (b) The behavioral description after applying thread partitioning in sub-threads.
3 Thread Partitioning based on Local Register
3.1 Outline of Thread Partitioning 
We propose a thread partitioning algorithm to generate sub-threads which have waiting time. Suppose that a circuit has large register files. The charge and discharge of capacitance in the register files and the power consumption of the clock tree then account for the greater portion of the power consumption of the circuit. We partition a thread based on a local register file to reduce this power consumption.
First, our thread partitioning algorithm focuses on a local register file RF in a thread. It partitions the thread into two sub-threads, one of which has RF and the other does not. The power consumption of registers outside the sub-threads cannot be reduced by applying gated clocks to the sub-threads. Therefore the register file has to be generated inside a sub-thread to reduce the power consumption effectively. We partition the expressions in the thread into two categories: those that include a variable assigned to RF and those that do not. By assigning the former to one sub-thread, RF can be generated in that sub-thread. When another register file is used in only one sub-thread, it is also generated in that sub-thread. Applying gated clocks to sub-threads can reduce the power consumption of the register files generated in the sub-threads.
Figure 8. The hardware architecture after partitioning the threadA.
In the Bach system, an array described in the BachC language is assigned to a register file. Here we define the size of a register file as the product of the bit width and the number of the registers. In Figure 7, array_a[] is assigned to 64 registers whose bit width is 8 bits. We choose the largest register file in the thread as the target register file RF. In Fig. 7, array_a[] is assigned to the largest register file. Therefore RF is array_a[]. We partition threadA into sub-threadA1 and sub-threadA2 based on array_a[]. Then sub-threadA1 has RF (array_a[]). The two partitioned sub-threads need to synchronize when a data dependency exists between them, so the sub-threads may have waiting time. In Fig. 7, a data dependency exists between sub-threadA1 and sub-threadA2. Then sub-threadA1 and sub-threadA2 have waiting time. Figure 8 shows the hardware architecture after partitioning threadA. Array_a[] is only used in sub-threadA1, and array_c[] is only used in sub-threadA2. Therefore array_a[] and array_c[] are generated in sub-threadA1 and sub-threadA2, respectively. Array_b[] is generated outside of the sub-threads, since it is used in the two sub-threads.
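The choice of the target register file RF can be sketched in a few lines. The example below assumes the size definition given above (bit width times number of registers) and the tie-break by frequency of use described in Section 4; the class `RegisterFile`, the field `use_count`, and every size other than the 64 registers of array_a[] are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegisterFile:
    name: str
    num_registers: int
    bit_width: int
    use_count: int = 0  # hypothetical: number of references in the source code

    @property
    def size(self) -> int:
        """Size as defined in the text: bit width times number of registers."""
        return self.num_registers * self.bit_width


def select_target_rf(register_files):
    """Pick the largest register file; ties go to the more frequently used one."""
    return max(register_files, key=lambda rf: (rf.size, rf.use_count))


register_files = [
    RegisterFile("array_a", 64, 8, use_count=3),
    RegisterFile("array_b", 16, 8, use_count=5),  # hypothetical size
    RegisterFile("array_c", 32, 8, use_count=2),  # hypothetical size
]
rf = select_target_rf(register_files)
print(rf.name, rf.size)
```

Sorting on the pair (size, use count) keeps the selection deterministic when two register files share the largest size.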
Generally, power reduction by gated clocks depends on the input data. However, applying gated clocks to threads does not. We partition a thread into two sub-threads. The partitioned sub-threads have waiting time, since they have synchronous communication. The waiting time does not depend on the input data.
3.2 Thread Partitioning Algorithm 
In this section, we propose the thread partitioning algorithm based on the outline in Sect. 3.1. The proposed algorithm consists of two steps. In Step 1, we partition a thread into two sub-threads based on RF. One of the sub-threads has RF. In Step 2, we insert synchronous communication into the partitioned sub-threads to keep the data dependency of the original circuit.
The inputs of the proposed algorithm are the source 
code which is input into the high-level synthesis system 
and the target local register file RF. The designer de- 
cides that RF is the largest register file in the thread. 
The source code is a behavioral description and we can 
describe parallel behaving circuit blocks explicitly in it. 
First, we find the function which is assigned to a thread. We generate F1 and F2, each of which is a copy of the function (Step 1-1). F1 and F2 are executed in parallel. We define X as the set of variables which are assigned to RF. The expressions including x (∈ X) are executed in F1 and the expressions not including x are executed in F2 (Step 1-2). Secondly, it is necessary to generate synchronous communication to keep the data dependency between F1 and F2. Since F1 and F2 are executed in parallel, we obtain an incorrect result when a data dependency exists between F1 and F2. Therefore we insert synchronous communication into the places where we removed the expressions in F1 and F2 (Step 2-1). In the original source code, the expressions which do not have a data dependency can be executed in parallel. In Step 2-1, we insert synchronous communication into the sub-threads even if the sub-threads do not have a data dependency. Therefore F1 and F2 may have excessive synchronous communication, which we remove (Step 2-2).
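Steps 1-1 to 2-1 can be sketched on a toy representation of a thread body. The sketch below assumes each expression is a pair (left-side variable, right-side variables) and that a synchronous communication is placed in both sub-threads wherever consecutive expressions end up in different sub-threads; this placement rule, the names `partition`, `uses_rf` and `SYNC`, and the example thread are illustrative assumptions rather than the BachC implementation.

```python
SYNC = "sync"  # placeholder for one synchronous communication


def uses_rf(expr, rf_vars):
    """True if the expression reads or writes a variable assigned to RF."""
    lhs, rhs = expr
    return lhs in rf_vars or any(v in rf_vars for v in rhs)


def partition(exprs, rf_vars):
    """Steps 1-1, 1-2 and 2-1: split exprs into F1 (uses RF) and F2 (does not)."""
    f1, f2 = [], []
    prev_owner = None
    for expr in exprs:
        owner = 1 if uses_rf(expr, rf_vars) else 2
        if prev_owner is not None and owner != prev_owner:
            # Expressions were removed from one sub-thread at this point,
            # so both sub-threads meet at a synchronous communication.
            f1.append(SYNC)
            f2.append(SYNC)
        (f1 if owner == 1 else f2).append(expr)
        prev_owner = owner
    return f1, f2


# Hypothetical thread body: (left-side variable, right-side variables).
thread_a = [
    ("a", ("in",)),   # writes the RF variable a
    ("b", ("a",)),    # reads the RF variable a
    ("c", ("b",)),    # no RF variable
    ("out", ("c",)),  # no RF variable
]
f1, f2 = partition(thread_a, rf_vars={"a"})
print(f1)
print(f2)
```

Placing the marker in both lists keeps the n-th synchronous communication of F1 paired with the n-th one of F2; removal of unnecessary markers (Step 2-2) is not covered by this sketch.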
We define the data dependency between the two sub-threads. $E1_n$ and $E2_n$ denote the sets of the expressions between the n-th synchronous communication and the (n+1)-th synchronous communication in F1 and F2 respectively. $V1_{n,l}$ and $V2_{n,l}$ denote the sets of the variables on the left side of the expressions, and $V1_{n,r}$ and $V2_{n,r}$ denote the sets of the variables on the right side of the expressions, in $E1_n$ and $E2_n$ respectively. Suppose that an expression depends on another expression. The condition is defined as follows:
$$
\begin{aligned}
&\exists\, v1_{n,l} = v2_{n+1,l} \ \text{or}\ \exists\, v2_{n,l} = v1_{n+1,l} \ \text{or} \\
&\exists\, v1_{n,l} = v2_{n+1,r} \ \text{or}\ \exists\, v2_{n,l} = v1_{n+1,r} \ \text{or} \\
&\exists\, v1_{n,r} = v2_{n+1,l} \ \text{or}\ \exists\, v2_{n,r} = v1_{n+1,l} \\
&(v1_{n,l} \in V1_{n,l},\ v1_{n+1,l} \in V1_{n+1,l},\ v2_{n,l} \in V2_{n,l},\ v2_{n+1,l} \in V2_{n+1,l}, \\
&\ v1_{n,r} \in V1_{n,r},\ v1_{n+1,r} \in V1_{n+1,r},\ v2_{n,r} \in V2_{n,r},\ v2_{n+1,r} \in V2_{n+1,r})
\end{aligned}
\tag{17}
$$
When Equation (17) is satisfied, a data dependency exists between F1 and F2. Then we cannot remove the (n+1)-th synchronous communication.
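The test of Equation (17) amounts to six set intersections. The following is a minimal sketch, assuming each segment between two synchronous communications is summarised as a pair of Python sets (left-side variables, right-side variables); the function name `depends` and the example segments are hypothetical.

```python
def depends(seg1_n, seg2_n, seg1_next, seg2_next):
    """Equation (17) style test between segment n and segment n+1 of F1 and F2.

    Each segment is a pair (left, right) of variable-name sets.
    """
    v1_l, v1_r = seg1_n
    v2_l, v2_r = seg2_n
    v1n_l, v1n_r = seg1_next
    v2n_l, v2n_r = seg2_next
    return bool(
        v1_l & v2n_l or v2_l & v1n_l      # same variable written on both sides
        or v1_l & v2n_r or v2_l & v1n_r   # written in segment n, read in n+1
        or v1_r & v2n_l or v2_r & v1n_r and False or v1_r & v2n_l or v2_r & v1n_l
    )


# F1 writes a and b in segment n; F2 reads b in segment n+1: dependency.
e1_n = ({"a", "b"}, {"in", "a"})
e2_n = (set(), set())
e1_next = (set(), set())
e2_next = ({"c", "out"}, {"b", "c"})
print(depends(e1_n, e2_n, e1_next, e2_next))

# F2 touches only unrelated variables in segment n+1: no dependency.
print(depends(e1_n, e2_n, e1_next, ({"y"}, {"x"})))
```

When `depends` returns False for a pair of neighbouring segments, the synchronous communication between them is a candidate for removal in Step 2-2.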
Finally, we create the function F which executes F1 and F2 in parallel. Then we obtain the source code partitioned into the threads F1, F2 and F (Steps 2-3, 2-4). Figure 9 shows the proposed algorithm.
(Inputs: the input source code of the high-level synthesis system and the local register file RF)
(Output: the input source code of the high-level synthesis system which has partitioned threads)
Step 1-1. Find the function which is assigned to a thread and generate F1 and F2, each of which is a copy of the function.
Step 1-2. For all the expressions in F1, remove the expressions which do not include x (∈ X). For all the expressions in F2, remove the expressions which include x (∈ X).
Step 2-1. For F1 and F2 obtained in Step 1-2, insert synchronous communication into the points where we removed expressions in F1 and F2.
Step 2-2. For all the synchronous communication inserted in Step 2-1, if the expressions in $E1_n$ and $E2_n$ do not depend on the expressions in $E1_{n+1}$ and $E2_{n+1}$, remove the (n+1)-th synchronous communication.
Step 2-3. Create the function F which executes F1 and F2 in parallel.
Step 2-4. Output the obtained F, F1 and F2, and exit.
Figure 9. A thread partitioning algorithm.
4 Experimental Result
In this section, we verify the proposed algorithm shown in Fig. 9. We use a DCT, a Quantizer, a Huffman encoder and an IIR filter designed by the Bach system for verification. We apply gated clocks to the circuits to which the proposed algorithm is applied.
We use Synopsys Design Compiler as a logic-level synthesis tool, VDEC libraries (CMOS, 0.35µm technology)¹, and Synopsys DesignPower to estimate the power consumption in the simulation by Synopsys VSS. We use the logic-level synthesized values as the values of area and delay. Synopsys DesignPower estimates the values of power consumption with switching information obtained by the simulation. We choose the register file of the largest size in the input source code as the input RF of the proposed algorithm. When there is more than one register file with the largest size, we select the register file which is more frequently used in the input source code.
¹ The libraries in this study have been developed in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo with the collaboration by Hitachi Ltd. and Dai Nippon Printing Corporation.
Tables 1 and 2 show the experimental results for power consumption, area, delay, execution time and active rate of threads before and after the proposed algorithm is applied to the circuits. In Table 1, CLK net power denotes the power consumption of the clock tree, and except CLK net power denotes the power consumption of gates. P denotes the rate of power reduction compared to the power consumption of the original circuit. In Table 2, active rate denotes the ratio of the time during which clocks are supplied to a sub-thread to the execution time. Table 3 shows the experimental results for area and delay of the Quantizer before and after applying the proposed algorithm.
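The two derived metrics can be written down explicitly. The sketch below assumes that P is the saved power as a percentage of the original power and that the active rate is the clocked time as a percentage of the execution time; the function names and the sample numbers are hypothetical and do not reproduce values from the tables.

```python
def power_reduction_rate(p_original, p_partitioned):
    """Percentage of power saved relative to the original circuit."""
    return 100.0 * (p_original - p_partitioned) / p_original


def active_rate(clocked_time, execution_time):
    """Percentage of the execution time during which the clock is supplied."""
    return 100.0 * clocked_time / execution_time


# Hypothetical measurements in arbitrary units.
print(power_reduction_rate(200.0, 150.0))
print(active_rate(40.0, 100.0))
```

A low active rate means that the sub-thread is gated for most of the execution time, so its registers and clock tree branch contribute little dynamic power.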
Table 1 shows that we achieve at most 45% and 42% power reduction in total power and CLK net power respectively. The power consumption of the clock tree is reduced more than that of the gates. Since the combinational circuits of the DCT and the Quantizer are larger than those of the other circuits, the power consumption of gates is reduced effectively by applying gated clocks to the sub-threads. Since the power consumption of the clock tree is directly proportional to the effective capacitance by Equation (5), the power consumption of the clock tree is reduced effectively when the sub-threads




Table 1. Experimental result of the power consumption.





Table 2 shows that sub-thread1 of the DCT has a 74% active rate, and sub-thread2 has a 40% active rate. In the DCT, the Huffman encoder and the IIR filter, the sub-thread with the lower active rate has smaller local registers than the other sub-thread. However, in the Quantizer, both sub-threads have registers of the same size. Therefore the power consumption of the Quantizer is reduced effectively by applying gated clocks to the sub-threads which have large registers. Table 3 shows that the sub-threads have registers of the same size in the Quantizer. Each sub-thread has 64 registers whose bit width is 16 bits. The registers outside the sub-threads are smaller than the registers in the sub-threads. The area overhead of the circuits for synchronous communication is small. In the other




The proposed algorithm produces circuits with larger area than the original circuits. The number of expressions in the behavioral description of the applications does not increase with the proposed algorithm. However, the number of functional units increases compared to the original circuits, for example the functional units used in control (loop and branch) and a few other functional units. The DCT and the Huffman encoder have shorter execution time than the original circuits, since they are executed in parallel. In contrast, the Quantizer and the IIR filter have longer execution time than the original circuits, since the synchronous communication is increased. The Huffman encoder obtained by the proposed algorithm has simpler memory controllers than the original Huffman encoder. Therefore it has smaller area and shorter execution time than the original encoder. The experimental results show that, in the circuits obtained by the proposed algorithm, the sub-thread with the low active rate has larger registers than the other. Consequently, the proposed algorithm obtains lower power system VLSIs.
Table 3. Synthesized result of the Quantizer.
5 Conclusions 
In this paper, we proposed a thread partitioning algorithm for low power high-level synthesis. This approach reduces power more efficiently, with a low area overhead, when gated clocks are applied. Consequently, this approach enables a high-level synthesis system to synthesize low power VLSIs.
In the future, we intend to improve the selection of RF when several register files of the same largest size exist in the input source code.
References
[1] L. Benini and G. De Micheli, "Automatic synthesis of gated-clock finite-state machines," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 630-643, June 1996.
[2] J. M. Chang and M. Pedram, "Register allocation and binding for low power," in Proc. 32nd DAC, pp. 29-35, June 1995.
[3] T. Kambe, A. Yamada, K. Nishida, K. Okada, M. Ohnishi, "A C-based synthesis system, Bach, and its application," in Proc. ASP-DAC 2001, pp. 151-155, 2001.
[4] A. Kumar and M. Bayoumi, "Novel formulations for low-power binding of function units in high-level synthesis," in Proc. ICCD '99, pp. 321-324, Oct. 1999.
[5] J. Monteiro, S. Devadas, P. Ashar, and A. Mauskar, "Scheduling techniques to enable power management," in Proc. 33rd DAC, pp. 349-352, June 1996.
[6] S. Noda, N. Togawa, M. Yanagisawa, and T. Ohtsuki, "High-level area/delay/power estimation for low power system VLSIs with gated clocks," IEICE Trans. on Fundamentals, vol. E85-A, pp. 827-834, April 2002.
[7] M. Ohnishi, R. Sakurai, K. Nishida, K. Okada, A. Yamada, T. Kambe, "A method of low power design for Bach system," Proc. IPSJ Design Automation Symposium 2001, pp. 11%lW, July 2001.
[8] K. Okada, A. Yamada and T. Kambe, "Hardware algorithm optimization using Bach C," IEICE Trans. Fundamentals, vol. E85-A, no. 4, pp. 835-841, April 2002.
[9] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 1995.
[10] G. Tellez, A. Farrahi, and M. Sarrafzadeh, "Activity driven clock design for low power circuits," in Proc. ICCAD-95, pp. 62-65, Nov. 1995.
