Experimental Evaluation of High-Level Energy Optimization Based on Thread Partitioning by Uchida Junpei et al.
The 2004 IEEE Asia-Pacific Conference on 
Circuits and Systems, December 6-9, 2004
EXPERIMENTAL EVALUATION OF HIGH-LEVEL ENERGY 
OPTIMIZATION BASED ON THREAD PARTITIONING 
Jumpei UCHIDA†, Yuichiro MIYAOKA†, Nozomu TOGAWA††,*, Masao YANAGISAWA† and Tatsuo OHTSUKI†
† Dept. of Computer Science, Waseda University
†† Dept. of Information and Media Sciences, The University of Kitakyushu
* Advanced Research Institute for Science and Engineering, Waseda University
ABSTRACT 
This paper presents a thread partitioning algorithm for
high-level synthesis systems which generate low energy
circuits. In the algorithm, we partition a thread into two
sub-threads, one of which has a local register file RF and
the other does not have RF. The partitioned sub-threads need
to be synchronized with each other to keep the data dependency
of the original thread. Since the partitioned sub-threads have
waiting time for synchronization, gated clocks can be applied
to each sub-thread. We achieve 33% energy reduction when
we apply our proposed algorithm to a JPEG encoder.
1. INTRODUCTION 
Recently, design complexity has been increasing rapidly. At
the same time, demand for low energy system LSIs is also
increasing, driven by cellular phones, PDAs, and mobile PCs.
We need to develop high speed, small area and low energy
system LSIs in a short period of time. One solution to these
requirements is to use high-level synthesis systems which are
able to synthesize low energy system LSIs.
Several power reduction techniques for high-level synthesis
have been proposed. Gated clocks were exploited for power
reduction in [1],[6],[9]. In [1], gated clocks were applied
efficiently by reducing the waiting time of logic circuits.
In [6], area/delay/power estimation for low power system LSIs
with gated clocks was proposed. In [9], a clock tree was
generated based on the profile of switching activities at
high level. A binding technique of functional units for
reducing switching activities was proposed in [4]. A register
allocation and binding technique was adopted for the purpose
of minimizing switching probability in [2]. A scheduling
technique for low power circuits was proposed in [5].
Recently, several practical high-level synthesis systems have
been developed. In these high-level synthesis systems, we can
define threads as circuit blocks that behave in parallel.
In [3],[8],[10], a high-level synthesis system called the Bach
system has been proposed. The input language of the Bach system
is called BachC, in which we can describe threads explicitly.
Each thread has waiting time for synchronization, since it
uses synchronous communication to keep the data dependency.
The Bach system has a high-level energy reduction mechanism.
In order to reduce energy consumption, the Bach system
automatically applies gated clocks to all of the threads in a
BachC description [7].
Assume that we have a single thread in a BachC description.
If this thread is partitioned into two or more sub-threads,
we can further reduce energy consumption. This is because
each partitioned sub-thread has waiting time for
synchronization and thus we can apply gated clocks to each of
the partitioned sub-threads. We therefore consider thread
partitioning to be one of the most powerful energy reduction
techniques.
Based on the above idea, we propose in this paper a thread
partitioning algorithm for high-level synthesis systems which
generate low energy circuits. First, the algorithm focuses on
a local register file RF in a thread. It partitions the thread
into two sub-threads, one of which includes RF and the other
does not include RF. The two partitioned sub-threads need to
be synchronized with each other to keep the data dependency of
the original thread. Since the two sub-threads have waiting
time for synchronization, gated clocks can be applied to each
of them. We can synthesize a low energy circuit with small
additional area overhead compared to the original circuit.
In this paper, we also present an experimental evaluation of
high-level energy optimization based on thread partitioning,
in which we apply our proposed algorithm to a JPEG encoder.
This paper is organized as follows: Section 2 proposes a new
thread partitioning algorithm. Section 3 shows several
experimental results and evaluates the effectiveness of the
proposed algorithm. Section 4 gives concluding remarks.
2. THREAD PARTITIONING BASED ON LOCAL 
REGISTER FILE 
2.1. Outline of Thread Partitioning 
We propose a new thread partitioning algorithm to generate
sub-threads which have waiting time. It is assumed that a
circuit has large register files. Most of the power
consumption of such a circuit comes from charging and
discharging the capacitance in the register files and from
the clock tree.
We partition a thread based on a local register file to reduce
this power consumption. When a thread is partitioned into two
sub-threads, the execution time of the sub-threads is almost
equal to that of the original thread. If we can reduce the
power consumption without increasing the execution time of the
original circuit, we obtain a circuit with lower energy than
the original circuit.
First, our thread partitioning algorithm focuses on a local
register file RF in a thread. It partitions the thread into
two sub-threads, one of which has RF and the other does not
have RF. The power consumption of registers outside the
sub-threads cannot be reduced by applying gated clocks to the
sub-threads. Therefore the register file has to be generated
inside a sub-thread to reduce the power consumption
effectively. We partition the expressions in the thread into
two categories: those which include a variable assigned to RF
and those which do not. By assigning the former to one
sub-thread, RF can be generated in that sub-thread. When
another register file is used in only one sub-thread, it is
also generated in that sub-thread. Applying gated clocks to
the sub-threads reduces the power consumption of the register
files generated in the sub-threads.
In the Bach system, an array described in the BachC language
is assigned to a register file. Here we define the size of a
register file as the product of the bit width and the number
of registers. In Fig. 1, array_a[] is assigned to 64 registers
whose bit width is 8 bits. We choose the largest register file
in the thread as the target register file RF. In Fig. 1,
array_a[] is assigned to the largest register file. Therefore
the target RF is array_a[]. We partition threadA into
sub-threadA1 and sub-threadA2 based on array_a[], so that
sub-threadA1 has the target RF (array_a[]). The two
partitioned sub-threads need to synchronize when a data
dependency exists between them, and the sub-threads may then
have waiting time. In Fig. 1, a data dependency exists between
sub-threadA1 and sub-threadA2, so both of them have waiting
time. Figure 2 shows the hardware architecture after
partitioning threadA. The array_a[] is used only in
sub-threadA1, and array_c[] is used only in sub-threadA2.
Therefore array_a[] and array_c[] are generated in
sub-threadA1 and sub-threadA2, respectively. Array_b[] is not
included in either sub-thread, since it is used in both
sub-threads.
Fig. 1. A thread partitioning based on a local register.
(a) The behavioral description before applying thread
partitioning. (b) The behavioral description after applying
thread partitioning into sub-threads.
Fig. 2. The hardware architecture after partitioning
threadA.
Generally, power reduction by gated clocks depends on the
input data. However, applying gated clocks to threads does not
depend on it. We partition a thread into two sub-threads. The
partitioned sub-threads have waiting time, since they
communicate synchronously. The waiting time does not depend on
the input data.
2.2. Thread Partitioning Algorithm
In this section, we propose the thread partitioning algorithm
based on the outline in Sect. 2.1. The proposed algorithm
consists of two steps. In Step 1, we partition a thread into
two sub-threads based on RF; one of the sub-threads has RF.
In Step 2, we insert synchronous communication into the
partitioned sub-threads to keep the data dependency of the
original circuit.
The inputs of the proposed algorithm are the source code which
is input into the high-level synthesis system and the target
local register file RF. The designer decides that the target
RF is the largest register file in the thread. The source code
is a behavioral description in which we can describe circuit
blocks that behave in parallel explicitly.
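The choice of the target RF can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names, the dictionary layout, the sizes of array_b and array_c, and the usage counts are hypothetical; only the size definition (bit width times number of registers), the 64 x 8-bit array_a[] of Fig. 1, and the tie-breaking by frequency of use come from this paper.

```python
def register_file_size(num_registers, bit_width):
    """Size of a register file: bit width times number of registers."""
    return num_registers * bit_width


def select_target_rf(register_files, use_counts):
    """Pick the largest register file; break ties by frequency of use.

    register_files: name -> (num_registers, bit_width)
    use_counts:     name -> number of uses in the source code (hypothetical input)
    """
    return max(
        register_files,
        key=lambda name: (
            register_file_size(*register_files[name]),
            use_counts.get(name, 0),
        ),
    )


# array_a matches Fig. 1 (64 registers, 8 bits); the other sizes are made up.
register_files = {"array_a": (64, 8), "array_b": (16, 8), "array_c": (16, 8)}
use_counts = {"array_a": 4, "array_b": 6, "array_c": 2}
target = select_target_rf(register_files, use_counts)
print(target, register_file_size(*register_files[target]))
```

With these values the sketch selects array_a, whose size is 512 bits.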
First, we find the function which is assigned to a thread.
We generate F1 and F2, each of which is a copy of the
function (Step 1-1). F1 and F2 are executed in parallel.
We define X as the set of variables which are assigned to the
target RF. The expressions including x (∈ X) are executed
in F1 and the expressions not including x are executed in
F2 (Step 1-2). Secondly, it is necessary to generate
synchronous communication to keep the data dependency
between F1 and F2. Since F1 and F2 are executed in parallel,
we would obtain an incorrect result when a data dependency
exists between F1 and F2. Therefore we insert synchronous
communication at the places where we removed expressions in
F1 and F2 (Step 2-1). In the original source code, expressions
which have no data dependency can be executed in parallel.
In Step 2-1, however, we insert synchronous communication into
the sub-threads even where the sub-threads have no data
dependency. Therefore F1 and F2 may have excessive
synchronous communication, which we then remove (Step 2-2).
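Steps 1-2 and 2-1 can be sketched in Python as follows. This is one possible reading, not the implementation used in the Bach system: it assumes a straight-line list of expressions, each given as a pair (left-side variable, right-side variables), and it assumes that a synchronization point is placed in both sub-threads at every boundary where consecutive expressions move from one sub-thread to the other. The names uses_target and partition and the sample thread body are hypothetical.

```python
from itertools import groupby


def uses_target(expr, target_vars):
    """True if the expression reads or writes a variable assigned to the target RF."""
    lhs, rhs = expr
    return lhs in target_vars or any(v in target_vars for v in rhs)


def partition(exprs, target_vars):
    """Split expressions into a list of (E1_n, E2_n) segments.

    Expressions that use a variable in X go to F1, the others go to F2.
    A synchronization point is implied between consecutive segments
    (an assumption of this sketch).
    """
    segments = []
    for in_f1, run in groupby(exprs, key=lambda e: uses_target(e, target_vars)):
        run = list(run)
        segments.append((run, []) if in_f1 else ([], run))
    return segments


# Hypothetical thread body: "a" is assigned to the target RF.
thread_a = [
    ("a", ("in",)),   # a = f(in)    -> F1
    ("b", ("a",)),    # b = f(a)     -> F1 (reads a)
    ("c", ("b",)),    # c = f(b)     -> F2
    ("out", ("c",)),  # out = f(c)   -> F2
]
segments = partition(thread_a, target_vars={"a"})
for n, (e1, e2) in enumerate(segments):
    print(n, "F1:", e1, "F2:", e2)
```

Here the first segment holds the two expressions of F1 and the second segment holds the two expressions of F2, with one synchronization point between them.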
We define the data dependency between the two sub-threads
as follows. $E1_n$ and $E2_n$ denote the sets of expressions
between the n-th synchronous communication and the (n+1)-th
synchronous communication in F1 and F2, respectively.
$V1_{n\_l}$ and $V2_{n\_l}$ denote the sets of variables on the
left side of the expressions, and $V1_{n\_r}$ and $V2_{n\_r}$
denote the sets of variables on the right side of the
expressions, in $E1_n$ and $E2_n$, respectively. An expression
is assumed to depend on another expression when the following
condition holds:
\[
\begin{aligned}
&\exists v1_{n\_l} \in V1_{n\_l},\ \exists v2_{n+1\_l} \in V2_{n+1\_l}\ \ \text{s.t.}\ v1_{n\_l} = v2_{n+1\_l}\ \ \text{or} \\
&\exists v2_{n\_l} \in V2_{n\_l},\ \exists v1_{n+1\_l} \in V1_{n+1\_l}\ \ \text{s.t.}\ v2_{n\_l} = v1_{n+1\_l}\ \ \text{or} \\
&\exists v1_{n\_l} \in V1_{n\_l},\ \exists v2_{n+1\_r} \in V2_{n+1\_r}\ \ \text{s.t.}\ v1_{n\_l} = v2_{n+1\_r}\ \ \text{or} \\
&\exists v2_{n\_l} \in V2_{n\_l},\ \exists v1_{n+1\_r} \in V1_{n+1\_r}\ \ \text{s.t.}\ v2_{n\_l} = v1_{n+1\_r}\ \ \text{or} \\
&\exists v1_{n\_r} \in V1_{n\_r},\ \exists v2_{n+1\_l} \in V2_{n+1\_l}\ \ \text{s.t.}\ v1_{n\_r} = v2_{n+1\_l}\ \ \text{or} \\
&\exists v2_{n\_r} \in V2_{n\_r},\ \exists v1_{n+1\_l} \in V1_{n+1\_l}\ \ \text{s.t.}\ v2_{n\_r} = v1_{n+1\_l}.
\end{aligned}
\tag{1}
\]
When Equation (1) is satisfied, a data dependency exists
between F1 and F2, and we cannot remove the (n+1)-th
synchronous communication.
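The six conditions of Equation (1) amount to checking whether a left-side or right-side variable set of one sub-thread in segment n intersects a left-side variable set of the other sub-thread in segment n+1, or whether a left-side set in segment n intersects a right-side set in segment n+1. The Python sketch below is a hedged illustration of that check and of Step 2-2. The segment representation, the function names, and the greedy merging of adjacent segments (which compares the accumulated segment against the next one) are assumptions of this sketch, not details given in the paper.

```python
def writes(exprs):
    """Variables on the left side of the expressions."""
    return {lhs for lhs, _ in exprs}


def reads(exprs):
    """Variables on the right side of the expressions."""
    return {v for _, rhs in exprs for v in rhs}


def depends(seg_n, seg_next):
    """Equation (1): cross-thread dependency between segments n and n+1."""
    e1_n, e2_n = seg_n
    e1_m, e2_m = seg_next
    return bool(
        writes(e1_n) & writes(e2_m)
        or writes(e2_n) & writes(e1_m)
        or writes(e1_n) & reads(e2_m)
        or writes(e2_n) & reads(e1_m)
        or reads(e1_n) & writes(e2_m)
        or reads(e2_n) & writes(e1_m)
    )


def remove_redundant_syncs(segments):
    """Step 2-2 (sketch): drop a synchronization point when Equation (1) fails."""
    if not segments:
        return []
    merged = [segments[0]]
    for seg in segments[1:]:
        if depends(merged[-1], seg):
            merged.append(seg)  # keep the synchronization point
        else:
            prev1, prev2 = merged[-1]
            merged[-1] = (prev1 + seg[0], prev2 + seg[1])  # remove it
    return merged


# Hypothetical segments (E1_n, E2_n); expressions are (lhs, rhs_vars).
segments = [
    ([("a", ("x",))], []),            # F1: a = f(x)
    ([], [("c", ("y",))]),            # F2: c = f(y), independent of F1 so far
    ([("b", ("a",))], []),            # F1: b = f(a)
    ([], [("out", ("b", "c"))]),      # F2: out = f(b, c), reads b written by F1
]
result = remove_redundant_syncs(segments)
print(len(result) - 1, "synchronization point(s) remain")
```

In this example only the synchronization point before the last F2 expression is kept, because F2 reads b, which F1 writes in the preceding segment.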
Finally, we create the function F which executes F1 and F2
in parallel. We then obtain the source code partitioned into
the threads F1, F2 and F (Steps 2-3 and 2-4).
Figure 3 shows the proposed algorithm.
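The role of the function F can be mimicked in software. The following Python sketch is an analogy only: BachC threads and their synchronous communication are modeled here with Python threads and a two-party barrier, and the computation inside f1 and f2 is hypothetical. It shows F launching F1 and F2 in parallel, with F2 idle until F1 reaches the synchronization point; in hardware, such idle time is where a gated clock can stop the sub-thread.

```python
import threading


def run_partitioned(block):
    """F: run the sub-threads F1 and F2 in parallel on one input block."""
    shared = {}                   # stands in for registers visible to both sub-threads
    sync = threading.Barrier(2)   # one synchronization point between F1 and F2

    def f1():
        local_a = [x * 2 for x in block]  # expressions that use the target RF
        shared["b"] = sum(local_a)
        sync.wait()                       # F1 signals that b is ready

    def f2():
        sync.wait()                       # F2 waits here until F1 arrives
        shared["out"] = shared["b"] + 1   # expressions that do not use the target RF

    threads = [threading.Thread(target=f1), threading.Thread(target=f2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["out"]


print(run_partitioned([1, 2, 3]))
```

The barrier guarantees that F2 reads b only after F1 has written it, which is the data dependency the synchronous communication is meant to preserve.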
3. EXPERIMENTAL RESULT 
In this section, we verify the proposed algorithm shown in
Fig. 3. For verification we use a DCT, a Quantizer, a Huffman
encoder and a JPEG encoder designed with the Bach system.
The JPEG encoder consists of the DCT, the Quantizer and the
Huffman encoder. We apply gated clocks to the circuits to
which the proposed algorithm has been applied.
We use Synopsys Design Compiler as the logic-level synthesis
tool, the VDEC libraries (CMOS and 0.35 µm technology)¹, and
Synopsys DesignPower to estimate the power consumption from
simulation with Synopsys VSS. We use the values obtained by
logic-level synthesis as the values of area and delay.
Synopsys DesignPower estimates the power consumption from the
switching information obtained by the simulation. We choose
as the input RF of the proposed algorithm the register file
of the largest size in the input source code. When there is
more than one register file with the largest size, we select
the register file which is used more frequently in the input
source code.
¹ The libraries in this study have been developed in the chip
fabrication program of VLSI Design and Education Center (VDEC),
the University of Tokyo with the collaboration by Hitachi Ltd.
and Dai Nippon Printing Corporation.
Inputs: the input source code of the high-level synthesis
system and the local register file RF
Output: the input source code of the high-level synthesis
system, with partitioned threads
Step 1-1. Find the function which is assigned to a thread
and generate F1 and F2, each of which is a copy
of the function.
Step 1-2. For all the expressions in F1, remove the
expressions which do not include x (∈ X). For all the
expressions in F2, remove the expressions which
include x (∈ X).
Step 2-1. For F1 and F2 obtained in Step 1-2, insert
synchronous communication at the points where
expressions were removed in F1 and F2.
Step 2-2. For each synchronous communication inserted in
Step 2-1, if the expressions in E1_n and E2_n do not
depend on the expressions in E1_{n+1} and E2_{n+1},
remove the (n+1)-th synchronous communication.
Step 2-3. Create the function F which executes F1 and F2
in parallel.
Step 2-4. Output the obtained F, F1 and F2, and exit.
Fig. 3. A thread partitioning algorithm.
Tables 1 and 2 show the experimental results: the energy
consumption, area, delay, execution time and active rate of
threads before and after the proposed algorithm is applied to
the circuits. The energy consumption E is given by
\[ E = Clock \times T_{exe} \tag{2} \]
where Clock is the clock cycle and $T_{exe}$ is the execution
time of the circuit. For all circuits, we evaluate the energy
consumption at a clock frequency of 20 MHz.
Table 1 shows the energy consumption when one block of an
image is processed (the block size is 8 bits × 8 bits and the
image is an 8-bit binary format image). The JPEG encoder has
a pipeline architecture. Figure 4 shows the behavior of the
JPEG encoder. In Tables 1 and 2, par denotes the results when
our proposed algorithm is applied, and non-par denotes the
results of the original circuit. P in Table 1 denotes the
ratio of the energy consumption after partitioning to the
energy consumption of the original circuit. We partition the
thread into two sub-threads. In Table 2, active rate denotes
the fraction of the execution time during which clock signals
are supplied to the sub-threads. The active rate of the
JPEG encoder is the combination of that of each module.
Table 1. Experimental result of the energy consumption.
Circuit          non-par   par     P [%]
Quantizer        ...       ...     71.28
Encoder          ...       ...     ...
JPEG Encoder     8.24      5.54    67.23
Table 2. Synthesized result of the circuits.
Circuit                   Area    Delay    Execution time   Active rate
JPEG Encoder (non-par)    5.17    33.93    272.55           11. ...
JPEG Encoder (par)        5.32    41.01    309.45           ...
Fig. 4. Behavior of JPEG encoder. (Figure label: execution time of JPEG encoder.)
Table 1 shows that we achieve up to 45% energy reduction in
the modules and 33% energy reduction in the JPEG encoder.
In the modules, the energy consumption is reduced more
effectively when the sub-threads have larger registers.
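The 33% figure can be checked against the JPEG encoder row of Table 1 (8.24 before partitioning, 5.54 after, P = 67.23). The short Python sketch below assumes that P is the energy after partitioning expressed as a percentage of the original energy, a reading that is consistent with that row; the function name is hypothetical.

```python
def energy_ratio_percent(e_non_par, e_par):
    """Energy after partitioning as a percentage of the original energy."""
    return 100.0 * e_par / e_non_par


# JPEG encoder row of Table 1 (units as in the table).
ratio = energy_ratio_percent(8.24, 5.54)
reduction = 100.0 - ratio
print(round(ratio, 2), round(reduction))
```

Under this reading the ratio is 67.23% and the energy reduction rounds to 33%.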
Table 2 shows that the circuits obtained by our proposed
algorithm are larger in area than the original circuits. The
number of expressions in the behavioral description of the
applications does not increase when the proposed algorithm is
applied. However, the number of functional units increases
compared to the original circuits, for example the functional
units used in control (loop and branch) and a few other
functional units. The DCT and the Huffman encoder have
shorter execution time than the original circuits, since
their sub-threads are executed in parallel. In contrast, the
Quantizer has longer execution time than the original
circuit, since the synchronous communication is increased.
The Huffman encoder obtained by the proposed algorithm has
simpler memory controllers than the original Huffman encoder.
Therefore it has smaller area and shorter execution time than
the original encoder. We achieve a large energy reduction
with low additional area and execution time overhead.
The experimental results show that, with the proposed
algorithm, the sub-thread which has the lower active rate has
larger registers than the other sub-thread. Consequently, the
proposed algorithm obtains lower energy system LSIs.
4. CONCLUSIONS
In this paper, we proposed a thread partitioning algorithm
for low energy high-level synthesis. This approach reduces
the energy more efficiently with small additional area
overhead when gated clocks are applied. Consequently, we
achieve 33% energy reduction when we apply our proposed
algorithm to a JPEG encoder. This approach enables the
high-level synthesis system to synthesize low energy LSIs.
In the future, we intend to improve the selection of the
target RF when several register files of the same largest
size exist in the input source code.
5. REFERENCES
[1] L. Benini and G. De Micheli, "Automatic synthesis of gated-clock
finite-state machines," IEEE Trans. Computer-Aided Design of
Integrated Circuits and Systems, vol. 15, no. 6, pp. 630-643, June
1996.
[2] J. M. Chang and M. Pedram, "Register allocation and binding for
low power," in Proc. 32nd DAC, pp. 29-35, June 1995.
[3] T. Kambe, A. Yamada, K. Nishida, K. Okada, M. Ohnishi, "A
C-based synthesis system, Bach, and its application," in Proc.
ASP-DAC 2001, pp. 151-155, 2001.
[4] A. Kumar and M. Bayoumi, "Novel formulations for low-power
binding of function units in high-level synthesis," in Proc. ICCD
'99, pp. 321-324, Oct. 1999.
[5] J. Monteiro, S. Devadas, P. Ashar, and A. Mauskar, "Scheduling
techniques to enable power management," in Proc. 33rd DAC, pp.
349-352, June 1996.
[6] S. Noda, N. Togawa, M. Yanagisawa, and T. Ohtsuki, "High-level
area/delay/power estimation for low power system VLSIs with
gated clocks," IEICE Trans. on Fundamentals, vol. E85-A, pp.
827-834, April 2002.
[7] M. Ohishi, R. Sakurai, K. Nishida, K. Okada, A. Yamada, T.
Kambe, "A method of low power design for Bach system," in
Proc. IPSJ Design Automation Symposium 2001, pp. 119-123,
July 2001.
[8] K. Okada, A. Yamada and T. Kambe, "Hardware algorithm
optimization using Bach C," IEICE Trans. on Fundamentals, vol.
E85-A, no. 4, pp. 835-841, April 2002.
[9] G. Tellez, A. Farrahi, and M. Sarrafzadeh, "Activity driven clock
design for low power circuits," in Proc. ICCAD-95, pp. 62-65,
Nov. 1995.
[10] A. Yamada, K. Nishida, A. Kay, A. Yamada, T. Fujimoto, and T.
Kambe, "A scheduling method for synchronous communication in
the Bach hardware compiler," in Proc. of ASP-DAC '99, pp. 193-
196, 1999.
