A hardware/software partitioning algorithm for SIMD processor cores by Tachikake Koichi et al.
A Hardware/Software Partitioning Algorithm for SIMD Processor Cores 
Koichi Tachikake†, Nozomu Togawa††,‡, Yuichiro Miyaoka†, Jinku Choi†,
Masao Yanagisawa†, Tatsuo Ohtsuki†
† Dept. of Electronics, Information and Communication Engineering, Waseda University
†† Dept. of Information and Media Sciences, The University of Kitakyushu
‡ Advanced Research Institute for Science and Engineering, Waseda University
3-4-1 Okubo, Shinjuku, Tokyo 169-8555, Japan
Tel: +81-3-5286-3396 Fax: +81-3-3203-9184
Email: tatikake@ohtsuki.comm.waseda.ac.jp
Abstract 
This paper proposes a new hardware/software partitioning
algorithm for processor cores with SIMD instructions.
Given a compiled assembly code including SIMD instructions,
a timing constraint on execution time, and available hardware
units, the proposed algorithm synthesizes an area-optimized
processor core with a new assembly code. First, we assume an
initial processor core on which the input assembly code can
run with the shortest execution time. Second, we remove the
hardware units added to the processor core one by one while
the timing constraint is satisfied. At the same time, we update
the assembly code so that it can run on the new processor
configuration. By repeating this process, we finally obtain a
processor core architecture with small area under the given
timing constraint. We expect to obtain a processor core which
has appropriate SIMD functional units for running the input
application program. Promising experimental results are also
shown.
1 Introduction 
In image processing applications such as image synthesis
and image correction, each pixel in an image is composed of
a small number of bits. For example, a pixel can be
represented by 8-bit data. However, a general microprocessor
has a basic bit width of 32 bits or more. In image processing
applications, how to handle short-word data with a long-word
functional unit is a main problem. A packed SIMD type
operation [5], [9], [10], [15] (or a SIMD operation for short)
gives one of the most effective solutions to this problem.
A SIMD operation consists of n parallel b/n-bit sub-operations
executed by a modified b-bit functional unit. An instruction
corresponding to a SIMD operation is called a SIMD
instruction. A functional unit executing SIMD operations is
called a SIMD functional unit, and a processor core with SIMD
instructions is called a SIMD processor core. A SIMD processor
core can be effectively applied to image processing
applications since we can process n pixels concurrently
by modifying normal b-bit functional units.
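To make the idea of n parallel b/n-bit sub-operations concrete, the following is a minimal Python sketch of a packed addition on a b-bit word. The function name `packed_add`, its parameters, and the wrap-around behavior on overflow are assumptions chosen for illustration; they are not taken from the instruction set defined in this paper.

```python
def packed_add(a: int, b: int, lanes: int = 4, width: int = 32) -> int:
    """Add `lanes` sub-words of two `width`-bit words independently.

    Each sub-word is width // lanes bits wide (the b/n bits of the text).
    Overflow wraps around inside a lane; this is an illustrative choice.
    """
    sub = width // lanes
    mask = (1 << sub) - 1
    result = 0
    for i in range(lanes):
        x = (a >> (i * sub)) & mask
        y = (b >> (i * sub)) & mask
        # Mask the sum so that a carry never crosses into the next lane.
        result |= ((x + y) & mask) << (i * sub)
    return result


# Four 8-bit pixels packed into one 32-bit word (hypothetical values).
pixels_a = 0x10_20_30_F0
pixels_b = 0x01_02_03_20
print(hex(packed_add(pixels_a, pixels_b)))  # 0x11223310
```

In the lowest lane, 0xF0 + 0x20 overflows and wraps to 0x10 without disturbing the neighboring pixel, which is the property that distinguishes a SIMD functional unit from a plain 32-bit adder.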
Generally, a SIMD operation has many options (see
2.3.2 for details). Thus we can configure a large number of
different SIMD operations. However, a particular image
application program often uses only a very limited set of
SIMD operations. We consider that an appropriate configuration
of an image processor core is required depending on the
application program as well as hardware costs.
Hardware/software cosynthesis is therefore a powerful strategy
for synthesizing a SIMD processor core.
Hardware/software codesign is to design a hardware
part and a software part of a processor and/or a system
simultaneously depending on application programs.
In particular, hardware/software codesign systems
such as those in [1], [2], [4], [7], [12], [14], [16] synthesize
microprocessor cores for given application programs. All the
systems proposed so far, however, focus on conventional
microprocessor cores and do not deal with
SIMD operations/instructions.
We have been developing a hardware/software cosynthesis
system for SIMD processor cores [11], [13], [17], [18].
For image processing applications, the system automatically
synthesizes an optimal image processor architecture through
compiling, hardware/software partitioning, and
hardware/software generation. The basic system, which
automatically synthesizes a digital signal processor
architecture, was proposed in [17], [18]. A parallelizing
compiler with SIMD instructions was proposed in [13]. The
compiler generates an initially scheduled assembly code
including SIMD instructions, which is given to
hardware/software partitioning. The functional unit generator
for SIMD operations was proposed in [11]. The functional unit
generator estimates the area/delay values of each functional
unit used in hardware/software partitioning.
In this paper, we focus on hardware/software partitioning
in our system and propose a new hardware/software
partitioning algorithm for SIMD processor cores. First, we
determine the numbers and types of hardware units added to
a processor core to execute an input assembly code. Then we
reduce the number of hardware units, or remove a sub-function
of a hardware unit, one by one. At the same time, we
reconfigure the processor core and update the assembly
code. Finally, we obtain a processor core architecture
with small area under the given timing constraint.
2 Architecture Model and Instruction Set 
In this section, we define our processor architecture
model and its instruction set [11], [13], [17], [18]. Fig. 1
shows our processor architecture model. Our processor
architecture is based on a digital signal processor in [6]
and is composed of one of two processor kernels and
extra hardware units. A processor core is constructed
by adding several hardware units to a processor kernel.
In the following, processor kernels, hardware units, and
an instruction set are defined.
Operation class               | Instructions
Arithmetic operation          | ADD, SUB, MUL, MAC
Shift operation               | SRA, SLA, SU
Bit extend/extract operation  | EXTR, EXTD
Data move operation           | EXCH, PERM
2.1 Processor Kernels
A processor kernel is (i) a RISC-type kernel or (ii)
a DSP-type kernel. A RISC-type kernel has the five
pipeline stages (IF, ID, EXE, MEM, and WB) as in the
microprocessor of [3]. A DSP-type kernel has the three
pipeline stages (IF, ID, and EXE) as in the DSP
processors of [6], [8]. The number of pipeline stages and
the processes in each pipeline stage are fixed and cannot
be changed. A processor core becomes a general-purpose
RISC core if a RISC-type kernel is selected, and a DSP
core if a DSP-type kernel is selected.
The hardware configuration of each processor kernel is
determined in the same way as in [17].
Arithmetic and logic operation | ADD, SUB, SU, SRL, SLL, AND, OR, XOR, MUL, DIV, SLT, SEU,
2.2 Hardware Units 
Our processor core can have extra hardware units:
(1) a Y-bus for Y data memory, (2) functional units
(shifters, ALUs, multipliers, MAC units, bit
extenders/extractors, and data move units), (3) addressing
units, and (4) hardware loop units (see [13], [17], [18] for
the detailed functions of each hardware unit). A functional
unit has a functional unit type $t_{fu}$ (see Table 1).
All these hardware units can be added to the DSP
kernel. The hardware units except addressing units and
hardware loop units can be added to the RISC kernel.
2.3 Instruction Set 
2.3.1 Basic Instructions and Parallel Instructions
Our synthesized processor core has basic instructions
such as ADD and MUL and parallel instructions such
as (ADD || ADD) and (ADD || MUL). The basic instructions
correspond to the functions of our processor kernels
and hardware units. A parallel instruction executes
more than one basic instruction. Not every combination of
basic instructions can be a parallel instruction. Our
hardware/software partitioner determines which basic
instructions should be included in a processor core and
which combinations of basic instructions should be
parallel instructions.
Arithmetic and logic operation | SE, COM2, MAC, INC, DEC, ADDI, SUBI, SUI, SRLI, SLLI, ANDI, ORI, XORI, MULI, DIVI
Load and store                 | LDX, LDY, STX, STY, ..., STXI, STYI, LDIX, LDIY, STIX, LDRY, ..., LDYI, STIY, ...
Jump                           | BEQ, ..., LOOP, ..., CALL, ..., NOP, HLT
Parallel load and store        | LDPX, STPX
Table 1. Functional unit type and its corresponding operations.
Functional unit         | FU type $t_{fu}$ | Operations
Shifter                 | sft              | Shift operation
ALU                     | alu              | Arithmetic and logic operations
Multiplier              | mul              | Multiply
Divider                 | div              | Divide
MAC unit                | mac              | Multiply and addition
Bit extractor/extender  | ext              | Bit extend/extract
Data move unit          | exh              | Data exchange and permutation
Note: The type of ADD and SUB is defined as ALU, although ADD and
SUB can be executed by either an ALU or a MAC unit. The type
of the SIMD versions of ADD and SUB is also defined as ALU.
Figure 2. SIMD multiplications. (a) Four 8-bit multiplications. (b) Two 16-bit bit-extend multiplications.
3 A HW/SW Partitioning Algorithm for SIMD Processor Cores
3.1 Hardware/Software Cosynthesis System 
We have been developing a hardware/software cosynthesis
system for SIMD processor cores [11], [13], [17], [18].
We named the system SPADES (System for Processor
Architecture Design with Estimation - type SIMD). In
this subsection, we briefly review the basic idea of the
system.
Given an application program in C and a set of its
application data, our system synthesizes a hardware
description of a processor core and generates an object
code and a software environment (compiler, assembler,
and simulator) for the processor core under a constraint
on the execution time of the application program.
The objective is to minimize the hardware cost of
the processor core. The hardware cost of a processor core
is given by the sum of the hardware costs of the processor
kernel and the hardware units used in the processor core.
The hardware cost refers to area in this paper. The
execution time of an application program is given
by multiplying the clock period by the number of clock
cycles needed to run the application program.
3.2 The HW/SW Partitioning Algorithm 
In this subsection, we focus on a hardware/software
partitioning algorithm for SIMD processor cores. We first
define a hardware/software partitioning problem. Then
we propose a hardware/software partitioning algorithm
for SIMD processor cores.
3.2.1 Problem Definition 
An assembly code is defined as a graph (call graph,
control-flow graph, and data-flow graph) [17]. A call
graph $G_c = (V_c, E_c)$ is defined as a graph representing
function calls in an application program. A node
$v \in V_c$ in $G_c$ represents a function. Each node in a call
graph has a control-flow graph. A control-flow graph
$G_{cf} = (V_{cf}, E_{cf})$ is defined as a graph representing
control flow in a function. A node $v \in V_{cf}$ in $G_{cf}$
represents a basic block. Each node in a control-flow graph
has a data-flow graph. A data-flow graph $G_{df} = (V_{df}, E_{df})$
is a graph representing data flow in a basic block. A node
$v \in V_{df}$ in $G_{df}$ represents a basic instruction.
Let $B_{app}$ and $F_{app}$ be a set of basic blocks and a set
of functions, respectively, in an input assembly code.
Consider that a basic block $B \in B_{app}$ is executed $N^{B}_{exe}$
times. $N^{B}_{exe}$ is calculated by our system. Let $N^{B}_{cycle}$
be the number of clock cycles to execute $B$. The number
of the total clock cycles $N_{cycle}$ to execute an input
assembly code can be computed as
$$N_{cycle} = \sum_{B \in B_{app}} N^{B}_{exe} \cdot N^{B}_{cycle}. \qquad (1)$$
The execution time $T_{app}$ of an assembly code is defined
as
$$T_{app} = N_{cycle} \times T_{cycle}, \qquad (2)$$
where $T_{cycle}$ is the clock period of a synthesized processor
core. Let $T^{max}_{app}$ be the maximum execution time of
an application program, which is given by the designer.
Then a timing constraint is given by
$$T_{app} \le T^{max}_{app}. \qquad (3)$$
Then a hardware/software partitioning problem is defined.
Definition 1 Given an initially scheduled assembly
code, $N^{B}_{exe}$ for each basic block $B \in B_{app}$, the timing
constraint, and available hardware units for a processor
core, a hardware/software partitioning problem is to
find a processor core configuration, an assembly code
executed on the processor core, and an instruction set for
the processor core, under the timing constraint and the
hardware configuration conditions, so as to minimize the
hardware cost of the processor core.
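The timing model of Eqs. (1)-(3) can be sketched in a few lines of Python. The class `BasicBlock`, the function names, and the example block counts below are hypothetical and serve only to illustrate how the cycle count, execution time, and timing constraint relate; they are not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    name: str
    n_exe: int    # number of times the block is executed
    n_cycle: int  # clock cycles for one execution of the block


def total_cycles(blocks):
    """Eq. (1): sum of executions times cycles over all basic blocks."""
    return sum(b.n_exe * b.n_cycle for b in blocks)


def execution_time(blocks, t_cycle):
    """Eq. (2): total cycles multiplied by the clock period."""
    return total_cycles(blocks) * t_cycle


def meets_constraint(blocks, t_cycle, t_max):
    """Eq. (3): the execution time must not exceed the designer's limit."""
    return execution_time(blocks, t_cycle) <= t_max


# Hypothetical profile: a setup block and a loop body run once per pixel
# of a 640 x 480 image, with an assumed 10 ns clock period.
blocks = [BasicBlock("init", 1, 12), BasicBlock("pixel_loop", 640 * 480, 5)]
print(total_cycles(blocks))                      # 1536012
print(execution_time(blocks, 10))                # 15360120 (ns)
print(meets_constraint(blocks, 10, 16_000_000))  # True
```

Because the execution counts are fixed by profiling, any change to the processor core affects the execution time only through the per-block cycle counts and the clock period.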
3.2.2 The Algorithm 
The proposed algorithm extends the algorithm in [17]
so that it can deal with SIMD instructions and SIMD
functional units. First, we determine the numbers and
types of hardware units added to a processor core to
execute an input assembly code (Phase 1).
Phase 1 determines an initial processor core. An initial
processor core includes full SIMD functional units, where
a functional unit with type $t_{fu}$ can execute all the SIMD
instructions in the input assembly code whose instruction
type is $t_{inst} = t_{fu}$. Then we reduce the number of
hardware units, or remove a sub-function of a hardware
unit, one by one, while the timing constraint is satisfied.
At the same time, we reconfigure the processor core and
update the assembly code (Phase 2).
Our approach is heuristic, but we expect that it can
find a globally good solution in a practical time since
it optimizes the numbers, types, and sub-functions of
hardware units including SIMD functional units
simultaneously.
Figure 3. Instructions with the type of mul (a) and
a multiplier configuration for them (b).
Phase 1. Allocate an Initial Resource: In Phase 
1, we configure an initial processor core. 
Let us consider processor kernel parameters. The processor
kernel type, RISC or DSP, is not determined in
Phase 1 but in Phase 2. The basic bit width of a processor
core is given as input, and all the other parameters are
determined in the same way as in [17]. The configuration
of the ALU and shifter in a processor kernel will be
discussed later together with other functional units.
Let us consider hardware unit parameters. If an in- 
put assembly code includes an instruction using the Y 
data memory, we add the Y data memory to a proces- 
sor kernel. The number of loop registers, the number 
of address registers, and the type of addressing units 
are all determined by an input assembly code. Finally, 
we must determine the configuration of functional units 
including SIMD functional units. 
Inputs: Assembly code, initial processor core, and timing
constraint.
Outputs: New processor core and its corresponding assembly
code.
Phase 2. For each of a DSP-type kernel and a RISC-type
kernel, execute Steps 1-4.
Step 1. For each $u$ in the hardware units, sub-functions of
hardware units, and registers currently added to a
processor kernel, try to eliminate $u$ or try to replace
$u$ with one which has a smaller hardware cost
than $u$.
Step 2. Evaluate the $T_{rate}(u)$ value. For $u_{min}$ which gives
the minimum $T_{rate}(u_{min})$ value without violating
the given timing constraint, eliminate $u_{min}$ from the
current processor kernel or replace $u_{min}$ with
one which has a smaller hardware cost than $u_{min}$.
Step 3. Update the assembly code according to the new
processor core configuration.
Step 4. While there exists a hardware unit, sub-function,
or register which meets Step 2, repeat Steps 1-3.
Otherwise finish.
Figure 4. The algorithm of Phase 2 (configuration
of a processor core).
Configuration of functional units: The configuration
of each functional unit is determined in the following
way. Let us consider the set $I_t$ of instructions
whose instruction type is $t_{inst} = t$ in an input assembly
code. We construct the functional unit with the type
$t_{fu} = t$ so that it can execute all the instructions in
$I_t$ and no other instructions of type $t$. For
example, assume that an input assembly code includes
the instructions MUL, MUL-4-ur4s, and MUL-2-sr7w
for multiplication. In this case, we construct a SIMD
multiplier as shown in Fig. 3. The SIMD multiplier is
composed of a multiplier for one, two, and four data, a
4-bit and 7-bit right shifter, and a saturation unit.
The number of each functional unit is determined in
the following way. If $n_t$ instructions in the set $I_t$
(those with $t_{inst} = t$) are executed in parallel in an
input assembly code, we add $n_t$ functional units with
the type $t_{fu} = t$ to a processor kernel. For example,
assume that an input assembly code includes the
parallel instruction below:
MUL-4-ur4s R1,R2,R3 || MUL-2-sr7w R4,R5,R6
In this case, we add two multipliers whose configuration
is shown in Fig. 3(b) to a processor kernel.
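The Phase 1 rule for functional units described above (collect the instruction set for each type and take the largest number of same-type instructions issued together) can be sketched as follows. The data layout, in which each parallel instruction is a list of (mnemonic, type) pairs, and the function name `initial_functional_units` are assumptions made for this illustration only.

```python
from collections import Counter, defaultdict


def initial_functional_units(code):
    """Return {fu_type: (number of units, sorted instruction set)}.

    `code` is a list of parallel instructions; each parallel instruction
    is a list of (mnemonic, fu_type) pairs issued in the same cycle.
    """
    instr_sets = defaultdict(set)  # instructions each unit type must execute
    counts = Counter()             # units needed per type
    for parallel in code:
        per_type = Counter()
        for mnemonic, fu_type in parallel:
            instr_sets[fu_type].add(mnemonic)
            per_type[fu_type] += 1
        for fu_type, n in per_type.items():
            # Enough units to issue the widest same-type group at once.
            counts[fu_type] = max(counts[fu_type], n)
    return {t: (counts[t], sorted(instr_sets[t])) for t in instr_sets}


# Hypothetical input: one parallel SIMD multiply pair, then MUL || ADD.
code = [
    [("MUL-4-ur4s", "mul"), ("MUL-2-sr7w", "mul")],
    [("MUL", "mul"), ("ADD", "alu")],
]
print(initial_functional_units(code))
```

For this input the sketch allocates two multipliers, each able to execute MUL, MUL-2-sr7w, and MUL-4-ur4s, and one ALU for ADD, mirroring the two-multiplier example in the text.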
Phase 2: Determine a Processor Core Configuration:
Phase 2 determines (1) a processor kernel type
(RISC or DSP), (2) the number of general-purpose
registers, (3) whether the Y data memory is added to a
processor kernel or not, (4) the number of address
registers and types of addressing units, (5) the number of
loop registers in the hardware loop unit, and (6) functional
unit configuration, depending on an input assembly
code and timing constraint.
First, we assume that a processor core has a RISC-type
kernel or a DSP-type kernel. For each kernel,
we reduce the parameters in (1)-(6) one by one while the
processor core satisfies the timing constraint. Finally,
we can find a processor core architecture with small
area satisfying the timing constraint.
Fig. 4 shows our proposed algorithm. In the algorithm,
Step 1 and Step 3 are discussed later. Step 4
is trivial. In Step 2, $T_{rate}(u)$ for each hardware unit,
each sub-function of hardware units (see the note below),
or each register is defined as:
(4)
where $A_0$ and $T_0$ refer to the hardware cost and execution
time of the processor core before eliminating $u$,
respectively, and $A_1(u)$ and $T_1(u)$ refer to the hardware
cost and execution time of the processor core after
eliminating $u$, respectively. Step 2 finds $u_{min}$ which gives
the minimum $T_{rate}(u_{min})$ and actually eliminates $u_{min}$
from the current processor core. By using the $T_{rate}(u)$
value, we can effectively reduce the hardware cost of a
processor core while satisfying the timing constraint.
Note: An addressing unit and a SIMD functional unit have
sub-functions. Sub-functions of an addressing unit refer to
addressing operations such as post increment, post decrement,
index addition, and modulo operation. For sub-functions of a
SIMD functional unit, see the discussion later.
Figure 5. Original multiplier configuration.
Figure 6. Multiplier configuration (after eliminating
a sub-function in mul2).
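The greedy loop of Phase 2 can be illustrated with a small Python sketch. Since the expression of Eq. (4) is not reproduced in this text, the sketch assumes, purely for illustration, that $T_{rate}(u)$ is the ratio of execution time increase to area decrease, $(T_1(u) - T_0) / (A_0 - A_1(u))$; this ratio, the toy cost model, and all names below are assumptions and not the authors' definition. The sketch only eliminates units and does not model replacement by a cheaper unit or the assembly code update.

```python
def greedy_reduce(units, area, exec_time, t_max):
    """Remove units one at a time while the timing constraint holds.

    `area` and `exec_time` map a set of units to a number. The rate used
    here is an assumed stand-in for Eq. (4): time lost per area saved.
    """
    core = set(units)
    while True:
        a0, t0 = area(core), exec_time(core)
        best, best_rate = None, None
        for u in sorted(core):
            trial = core - {u}
            a1, t1 = area(trial), exec_time(trial)
            if t1 > t_max or a1 >= a0:
                continue  # violates the constraint or saves no area
            rate = (t1 - t0) / (a0 - a1)
            if best is None or rate < best_rate:
                best, best_rate = u, rate
        if best is None:
            return core  # nothing more can be removed
        core.remove(best)


# Toy cost model (hypothetical numbers): kernel area 100, base time 10 ms,
# and each missing unit adds a fixed time penalty.
UNIT_AREA = {"alu2": 20, "loop": 10, "mac": 60, "mul2": 40}
PENALTY_MS = {"alu2": 4, "loop": 1, "mac": 3, "mul2": 6}


def area(core):
    return 100 + sum(UNIT_AREA[u] for u in core)


def exec_time(core):
    return 10 + sum(p for u, p in PENALTY_MS.items() if u not in core)


final = greedy_reduce(UNIT_AREA, area, exec_time, t_max=15)
print(sorted(final), area(final), exec_time(final))  # ['alu2', 'mul2'] 160 14
```

In this toy run the MAC unit is removed first because it gives the most area saved per unit of lost time, then the loop unit; removing either remaining unit would exceed the 15 ms constraint, so the loop stops.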
SIMD functional unit reduction and assembly
code update (Steps 1 and 3): In the following, we discuss
Step 1 and Step 3. In Step 1 and Step 3, we can deal with
hardware units other than SIMD functional units in the
same way as in [17]. We therefore discuss here SIMD
functional unit reduction and its corresponding
assembly code update.
For any SIMD functional unit $u$ added to a processor
core, we consider either (a) replacing $u$ with a SIMD
functional unit $u'$ which has the same functions as $u$ and
a smaller hardware cost than $u$, or (b) eliminating
some sub-function of $u$.
(a) is realized by calling our SIMD functional unit
generator proposed in [11]. If $u$ is replaced with $u'$,
no assembly code update is necessary since the function
of $u'$ is exactly the same as that of $u$.
Now let us focus on case (b). The SIMD functional
unit $u$ can execute several SIMD instructions.
We can then consider a sub-function corresponding to
each SIMD instruction and eliminate the sub-function
from the SIMD functional unit. After eliminating the
sub-function in a SIMD functional unit, we update the
assembly code according to the new SIMD functional unit.
Note that we eliminate a sub-function in a SIMD
functional unit only when the corresponding SIMD
instruction can be executed by another SIMD functional unit.
For example, assume that a processor core has
two SIMD multipliers, mul1 and mul2, each of which
can execute the two SIMD instructions MUL-2-ur4s
and MUL-4-sr7w. The SIMD multiplier configuration is
shown in Fig. 5. Each SIMD multiplier, mul1 or mul2,
is composed of a multiplier for two and four data, a 4-bit
and 7-bit right shifter, and a saturation unit. Using
mul1 and mul2, we can execute the following two
parallel instructions in two clock cycles.
MUL-2-ur4s R1,R2,R3 || MUL-2-ur4s R4,R5,R6
MUL-4-sr7w R7,R8,R9 || MUL-4-sr7w R10,R11,R12
The first instruction is executed by using mul1 and mul2,
and the second instruction is also executed by using
mul1 and mul2.
Consider eliminating the sub-function corresponding
to MUL-2-ur4s in mul2. The configuration of mul2 is
changed so that it can execute only MUL-4-sr7w. We
have the new SIMD functional unit mul2' as shown in
Fig. 6. mul2' is composed of a multiplier for four data
and a 7-bit right shifter. Comparing the configuration
of mul2 with that of mul2', the hardware cost of mul2'
must be smaller than that of mul2. However, mul2'
cannot execute MUL-2-ur4s. Thus, if we eliminate the
sub-function corresponding to MUL-2-ur4s in mul2, we must
update the above assembly code as follows:
MUL-2-ur4s R1,R2,R3
MUL-2-ur4s R4,R5,R6
MUL-4-sr7w R7,R8,R9 || MUL-4-sr7w R10,R11,R12
The first instruction is executed by mul1 and the second
instruction is also executed by mul1. The third
instruction is executed by using mul1 and mul2'.
In this way, we try to eliminate each of the hardware
units, sub-functions of hardware units, and registers in
Step 1. Then in Step 3, we update the assembly code
according to the new processor core configuration.
Based on this algorithm, we can reduce redundant
sub-functions in SIMD functional units and thereby
find a processor core configuration with small area.
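The assembly code update of the mul1/mul2 example can be mimicked with a short Python sketch. It is a simplified model under stated assumptions: operands and data dependencies are ignored, each parallel instruction is just a list of mnemonics, and units are assigned greedily (most specialized free unit first). The function name `update_code` and the data layout are hypothetical and do not describe the authors' implementation.

```python
def update_code(code, units):
    """Re-issue parallel instructions on a given set of functional units.

    `units` maps a unit name to the set of mnemonics it can execute.
    When no free unit can execute an operation, a new cycle is started,
    so one parallel instruction may be split into several.
    """
    new_code = []
    for parallel in code:
        cycle, busy = [], set()
        for op in parallel:
            free = [u for u in units if op in units[u] and u not in busy]
            if not free:
                if not any(op in ops for ops in units.values()):
                    raise ValueError(f"no unit can execute {op}")
                new_code.append(cycle)  # close the current cycle
                cycle, busy = [], set()
                free = [u for u in units if op in units[u]]
            # Prefer the unit with the fewest sub-functions.
            unit = min(free, key=lambda u: len(units[u]))
            busy.add(unit)
            cycle.append((op, unit))
        new_code.append(cycle)
    return new_code


code = [["MUL-2-ur4s", "MUL-2-ur4s"], ["MUL-4-sr7w", "MUL-4-sr7w"]]
before = {"mul1": {"MUL-2-ur4s", "MUL-4-sr7w"},
          "mul2": {"MUL-2-ur4s", "MUL-4-sr7w"}}
after = {"mul1": {"MUL-2-ur4s", "MUL-4-sr7w"},
         "mul2'": {"MUL-4-sr7w"}}
print(len(update_code(code, before)))  # 2 cycles
print(len(update_code(code, after)))   # 3 cycles
```

With the original units the code takes two cycles; after the MUL-2-ur4s sub-function is removed from the second multiplier, the first parallel instruction is split and the code takes three cycles, as in the example above. A real update would also have to check register dependencies before serializing operations.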
4 Experimental Results and Conclusion 
The proposed hardware/software partitioning algorithm
has been implemented in the C language on a Sun Ultra
workstation. The algorithm was applied to Alpha
Blend (image size of 640 x 480 pixels) and the Copying
Machine Application (image size of 640 x 480 pixels).
The basic bit width of a processor core is set to 32 bits
and the number of instructions executed concurrently is
set to four.
Tables 4 and 5 show the experimental results. In
the tables, Const shows timing constraints, Area shows
synthesized processor core area, Time shows execution
time for running an application program, and Hardware
configuration shows the hardware configuration of
synthesized processor cores. In the tables, SIMD functional
unit configuration is shown as follows: Assume that
a synthesized SIMD processor core has one ALU and
two SIMD ALUs, salu1 and salu2, where salu1 and
salu2 have two SIMD ALU instructions and one ALU
instruction, respectively. This ALU configuration is
shown as (1,2[2,1]).
The tables indicate that our hardware/software
partitioning algorithm configures appropriate SIMD
functional units depending on the given application
programs and timing constraints. If a similar timing
constraint is given to a non-SIMD processor core and a
SIMD processor core, the area of the SIMD processor core
can be smaller than that of the non-SIMD processor core
for both application programs. Tables 4 and 5 show
that the area of a SIMD processor core is 22-53 % smaller
than that of non-SIMD processor cores configured under
similar timing constraints, because the numbers of
functional units and registers added to SIMD processor
cores are smaller than those of non-SIMD processor
cores.
Type            | Const [ms] | Area [µm²] | Time [ms] | Kernel | #ALUs     | #SFTs | #MULs       | #MACs     | #Regs      | Y-mem | Addr unit      | HW loop
Non-SIMD ([17]) | 18.0       | 11,591,021 | 17.740    | DSP    | 2         | 1     | 2           | 3         | (7, 3, 1)  | Yes   | X[1,2], Y[1,2] | Yes
Non-SIMD ([17]) | 20.0       | 5,672,754  | 18.923    | DSP    | 2         | 1     | 1           | 1         | (?, 3, 1)  | Yes   | X[1,2], Y[1,2] | Yes
Non-SIMD ([17]) | 22.0       | 3,839,427  | 20.106    | DSP    | 2         | 1     | 1           | 0         | (8, 3, 1)  | Yes   | X[1,2], Y[1,2] | Yes
Non-SIMD ([17]) | 24.0       | 3,672,783  | 22.471    | DSP    | 2         | 1     | 1           | 0         | (8, ?, 0)  | Yes   | X[1,2], Y[1,2] | No
Non-SIMD ([17]) | 32.0       | 3,549,223  | 30.750    | DSP    | 2         | 1     | 1           | 0         | (7, 3, 0)  | Yes   | X[1,2], Y[1,2] | No
Packed SIMD     | ?          | 6,299,770  | 4.421     | DSP    | 0, 2[2,2] | 1, 0  | 0, 3[1,1,1] | 0, 2[1,1] | (8, 3, 1)  | Yes   | X[1,2], Y[1,2] | Yes
Packed SIMD     | ?          | 4,065,268  | 4.886     | DSP    | 0, 1[3]   | 1, 0  | 0, 2[1,1]   | 0, 1[1]   | (8, 3, 1)  | Yes   | X[1,2], Y[1,2] | Yes
Packed SIMD     | ?          | 2,873,058  | 5.584     | DSP    | 0, 1[3]   | 1, 0  | 0, 1[1]     | 0, 0      | (10, 3, 0) | Yes   | X[1,2], Y[1,2] | No
Packed SIMD     | ?          | 2,857,929  | 6.981     | DSP    | 0, 1[3]   | 1, 0  | 0, 1[1]     | 0, 0      | (12, 0, 0) | Yes   | No             | No
Packed SIMD     | ?          | 2,656,377  | 13.962    | DSP    | 0, 1[3]   | 1, 0  | 0, 1[1]     | 0, 0      | (9, 0, 0)  | Yes   | No             | No
Packed SIMD     | ?          | 2,656,377  | 13.962    | DSP    | 0, 1[3]   | 1, 0  | 0, 1[1]     | 0, 0      | (9, 0, 0)  | Yes   | No             | No
#ALUs for SIMD cores: (#ALUs, #SIMD ALUs[#SIMD instructions in SIMD ALU1, ...])
#SFTs for SIMD cores: (#Shifters, #SIMD Shifters[#SIMD instructions in SIMD Shifter1, ...])
#MULs for SIMD cores: (#MULs, #SIMD MULs[#SIMD instructions in SIMD MUL1, ...])
#MACs for SIMD cores: (#MACs, #SIMD MACs[#SIMD instructions in SIMD MAC1, ...])
#Regs: (#General registers, #Address registers, #Loop registers)
Addr unit: Address unit configuration. X[1,2] (or Y[1,2]) means that the X (or Y) data memory has the addressing
unit with post increment operation.
By using our new hardware/software partitioning
algorithm, we can find a processor core architecture with
small area satisfying a given timing constraint. Our
algorithm is currently a greedy heuristic approach, and for
larger applications we may need more efficient heuristics.
In the future, we will improve our algorithm so that it can
optimize the configuration of each SIMD functional unit
by reducing several sub-functions at once, aiming at
globally optimized hardware/software partitioning.
Acknowledgement 
This research is supported in part by STARC (Semicon- 
ductor Technology Academic Research Center). 
References 
[1] H. Akaboshi and H. Yasuura, "COACH: A computer aided
design tool for computer architects," IEICE Transactions
on Fundamentals of Electronics, Communications and Computer
Sciences, vol. E76-A, no. 10, pp. 1760-1769, 1993.
[2] N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi, "A
hardware/software partitioning algorithm for designing pipelined
ASIPs with least gate count," in Proc. 33rd DAC, pp. 527-532,
1998.
[3] J. L. Hennessy and D. A. Patterson, Computer Architecture:
A Quantitative Approach, Morgan Kaufmann, 1990.
[4] I. J. Huang and A. M. Despain, "Synthesis of instruction sets
for pipelined microprocessors," in Proc. 31st DAC, pp. 5-11,
1994.
[5] Intel, MMX Technology Architecture Overview,
http://www.intel.com/technology/itj/q31997/articles/art_2.htm, 1997.
Type            | Const [ms] | Area [µm²] | Time [ms] | Kernel | #ALUs         | #SFTs | #MULs         | #Regs      | Y-mem | Addr unit      | HW loop
Non-SIMD ([17]) | 50.5       | 8,753,937  | 50.295    | DSP    | 4             | 1     | 4             | (69, 6, 1) | Yes   | X[1,2], Y[1,2] | Yes
Non-SIMD ([17]) | 100.0      | 5,086,785  | 99.421    | DSP    | 2             | 1     | 1             | (48, 6, 0) | Yes   | X[1,2], Y[1,2] | No
Non-SIMD ([17]) | 250.0      | 3,944,657  | 249.138   | DSP    | 2             | 1     | 1             | (31, 6, 0) | Yes   | X[1,2], Y[1,2] | No
Non-SIMD ([17]) | 500.0      | 2,668,161  | 499.546   | DSP    | 1             | 1     | 1             | (12, 6, 0) | Yes   | X[1,2], Y[1,2] | No
Packed SIMD     | 5.7        | 10,698,879 | 5.688     | DSP    | 0, 4[2,2,2,2] | 1, 0  | 0, 4[1,1,1,1] | (69, 6, 1) | Yes   | X[1,2], Y[1,2] | Yes
Packed SIMD     | 10.0       | 5,500,743  | 9.783     | DSP    | 2, 2[2,2]     | 1, 0  | 0, 1[1]       | (44, 6, 0) | Yes   | X[1,2], Y[1,2] | No
Packed SIMD     | 50.0       | 3,155,438  | 49.837    | RISC   | 1, 1[2]       | 1, 0  | 0, 1[1]       | (10, 0, 0) | Yes   | No             | No
Packed SIMD     | 100.0      | 2,696,463  | 99.881    | DSP    | 1, 1[2]       | 1, 0  | 0, 1[1]       | (6, 6, 0)  | Yes   | X[1,2], Y[1,2] | No
[6] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor
Fundamentals: Architectures and Features, Berkeley Design
Technology, Inc., 1994-1998.
[7] H. Liu and D. F. Wong, "Integrated partitioning and scheduling
for hardware/software codesign," in Proc. International
Conference on Computer Design, 1998.
[8] V. K. Madisetti, Digital Signal Processors, IEEE Press, 1995.
[9] MIPS Technologies, MIPS Extension for Digital Media with
3D, 1997.
[10] M. Mittal, A. Peleg, and U. Weiser, "MMX technology
architecture overview," Intel Technology Journal, 3rd Quarter,
1997.
[11] Y. Miyaoka, N. Togawa, M. Yanagisawa, and T. Ohtsuki, "A
hardware unit generation algorithm for a hardware/software
cosynthesis system of digital signal processor cores with
packed SIMD type instructions," Transactions of Information
Processing Society of Japan, vol. 43, no. ?, pp. 1191-1201,
2002 (in Japanese).
[12] E. F. Nurprasetyo, A. Inoue, H. Tomiyama, and H. Yasuura,
"Soft-core processor architecture for embedded system design,"
IEICE Trans. on Electron., vol. E81-C, no. 9, pp. 1416-1423,
1998.
[13] N. Nonogaki, N. Togawa, M. Yanagisawa, and T. Ohtsuki,
"A parallelizing compiler in a hardware/software cosynthesis
system for image/video processor with packed SIMD type
instruction sets," IEICE Technical Report, VLD2000-139,
ICD2000-215, 2001 (in Japanese).
[14] J. Sato, A. Y. Alomary, Y. Honma, T. Nakatta, A. Shiomi, N.
Hikichi, and M. Imai, "PEAS-I: A hardware/software codesign
system for ASIP development," IEICE Transactions on
Fundamentals of Electronics, Communications and Computer
Sciences, vol. E77-A, no. 3, pp. 483-491, 1994.
[15] Sun Microsystems, VIS Instruction Set User's Manual, 1997.
[16] Tensilica, Xtensa Microprocessor: Overview Handbook,
http://www.tensilica.com/.
[17] N. Togawa, M. Yanagisawa, and T. Ohtsuki, "A hardware/
software cosynthesis system for digital signal processor cores,"
IEICE Trans. on Fundamentals, vol. E82-A, no. 11, pp. 2325-2337,
1999.
[18] N. Togawa, M. Yanagisawa, and T. Ohtsuki, "A hardware/
software cosynthesis system for digital signal processor cores
with two types of register files," IEICE Trans. on Fundamentals,
vol. E83-A, no. 3, 2000.
