A Cosynthesis Algorithm for Application Specific Processors with Heterogeneous Datapaths by Miyaoka Yuichiro et al.
3A-3 
A Cosynthesis Algorithm for Application Specific Processors 
with Heterogeneous Datapaths 
Yuichiro Miyaoka† Nozomu Togawa†† Masao Yanagisawa‡ Tatsuo Ohtsuki‡
Waseda University The University of Kitakyushu Waseda University Waseda University 
miyaoka@ohtsuki.comm.waseda.ac.jp togawa@env.kitakyu-u.ac.jp yanagi@yanagi.comm.waseda.ac.jp to@ohtsuki.comm.waseda.ac.jp
† Dept. of Electronics, Information and Communication Engineering, Waseda University
†† Dept. of Computer and Media Sciences, The University of Kitakyushu
‡ Dept. of Computer Science, Waseda University
3-4-1, Okubo, Shinjuku, Tokyo, 169-8555, Japan, Tel: +81-3-5286-3396, Fax: +81-3-3203-9184
1-1, Hibikino, Wakamatsu, Kitakyushu, 808-0135, Japan, Tel: +81-93-595-3264, Fax: +81-93-595-3368
3-4-1, Okubo, Shinjuku, Tokyo, 169-8555, Japan, Tel: +81-3-5286-{3392,3387}, Fax: +81-3-{3204-4875,3203-9184}
Abstract- This paper proposes a hardware/software cosynthesis algorithm for processors with heterogeneous registers. Given a CDFG corresponding to an application program and a timing constraint, the algorithm generates a processor configuration that minimizes processor area, together with assembly code for that processor. First, the algorithm configures a datapath that can execute several data-dependent DFG nodes in one cycle; this datapath executes the application program in the fewest cycles. A branch and bound algorithm is then applied, and all candidate numbers of functional units and memory banks are tried. For each assumed number of functional units and memory banks, an appropriate number of heterogeneous registers and the connections between functional units and registers are explored. The experimental results show the effectiveness and efficiency of the algorithm.
I. INTRODUCTION
General DSPs such as TMS320C2x [14], DSP56300 [12], DSP16xx [10], ADSP-21xx [3], and [9] have heterogeneous datapaths. Heterogeneous registers (accumulate registers, for example) can have flexible bit widths, while general purpose registers must share a single bit width. Heterogeneous registers can therefore satisfy application requirements at a lower hardware cost. Sophisticated heterogeneous datapaths including heterogeneous registers can also execute application programs fast. Consequently, processors with heterogeneous datapaths can combine small cost with high performance.
For processors with heterogeneous datapaths, code optimization and generation is a difficult problem, since, for example, the compiler must decide which heterogeneous register each variable is bound to. Several retargetable compilers for processors with heterogeneous datapaths have been reported [5, 6, 7, 11, 17, 18]. These retargetable compilers, however, cannot always generate sufficiently good application code, since the datapath configuration of the target processor is not always suitable for a given application program. We think that an application specific processor should be synthesized, especially for a processor with a heterogeneous datapath. A hardware/software codesign method can be effectively applied to processors with heterogeneous datapaths. Several studies on hardware/software codesign for microprocessors have been reported [1, 2, 4, 13, 15]. However, they do not focus on heterogeneous datapaths. In [8], a hardware/software codesign environment has been proposed. Given an application program and a datapath configuration, the codesign environment generates a processor hardware description and the object code for the processor. However, another datapath configuration must be designed manually when the estimated area, or the execution time of the given application program on a processor with that datapath configuration, is unsatisfactory. When designing a processor with a heterogeneous datapath, it is difficult to find a performance bottleneck and to design another appropriate processor datapath. Exploring the datapath configuration and compiling the application program must therefore be closely coupled.
0-7803-8175-0/04/$17.00 ©2004 IEEE.
Therefore we propose a hardware/software cosynthesis algorithm for processors with heterogeneous datapaths. Given a CDFG corresponding to an application program and a timing constraint, the algorithm generates a processor configuration that minimizes processor area, together with assembly code for that processor. First, the algorithm configures a datapath that can execute several data-dependent DFG nodes in one cycle; this datapath executes the application program in the fewest cycles. A branch and bound algorithm is then applied, and all candidate numbers of functional units and memory banks are tried. For each assumed number of functional units and memory banks, an appropriate number of heterogeneous registers and the connections between functional units and registers are explored.
This paper is organized as follows: Section II defines a processor architecture. Section III proposes a hardware/software cosynthesis algorithm for processors with heterogeneous datapaths. Section IV shows experimental results. Section V gives concluding remarks.
II. ARCHITECTURE MODEL
This section defines a processor architecture model for our synthesis algorithm. Our VLIW type processor has a 3-stage pipeline composed of IF, ID, and EXE stages. Immediate values are decoded at the ID stage and written to the ID/EXE pipeline registers. The processor can have one or two data memory banks. The data bus width of data memory is fixed to the basic bit width $b_{basic}$ of the processor. Besides the number of data memory banks, the configuration of general purpose registers, heterogeneous registers, and functional units can be changed. Figure 1 shows our processor model.
Fig. 1. Our pipeline architecture.
Fig. 2. Instruction format.
The bit width of general purpose registers is $b_{basic}$. When data of a general purpose register is read, the data is read at the ID stage and saved in the ID/EXE pipeline registers. When data is written to a general purpose register, the data is generated at the EXE stage and written back to the register. The number $n_g$ of general purpose registers is given in advance.
Data of heterogeneous registers are read and written at the EXE stage. Data read from a heterogeneous register is (a) used as an input of a functional unit, (b) written back to a general purpose register, (c) written back to another heterogeneous register, or (d) written to data memory. To convert the bit width of data in a heterogeneous register to the bit width required for (a)-(d), a heterogeneous register can have a bit extension/extraction unit. (e) Output data of a functional unit, (f) data in a general purpose register, (g) data in another heterogeneous register, or (h) data from data memory is written to a heterogeneous register. The number of heterogeneous registers and the bit width of each heterogeneous register can be changed. Which of the connections (a)-(h) each register has, and whether each register has a bit extension/extraction unit, can also be changed.
Functional units include ALUs, adders, multipliers, shifters, and bit extension/extraction units. For the same operation type, functional units can have different bit widths of inputs and outputs. A processor has several functional units from a functional unit library $FU = \{f_1, f_2, \ldots\}$. By connecting two or more functional units, a processor can have a datapath which executes one operation and then another in succession. For example, by connecting an output port of a multiplier to an input port of an adder, a multiply and an add can be executed in one cycle.
We define the minimum instructions needed to work as a general processor. A processor must have a datapath which can execute (in one or several cycles) addition, logical and/or/xor operations, shift operations, and comparison for two variables $a$ and $b$ in the data memory $mem_0$, and write the operation results to $mem_0$. A processor must also have an unconditional branch operation and a branch-on-equal operation for $a$ and $b$.
$n_g$, $n_r$ and $n_w$, and $n_h$ denote the number of general purpose registers, the numbers of reading ports and writing ports of the general purpose registers, and the number of heterogeneous registers, respectively. $n_f$ denotes the number of functional units in a processor. $n_m \in \{1, 2\}$ denotes the number of data memory banks and $s_{m_i}$ denotes the number of signals which can be written to data memory $i$. Let $s_g$ be the number of signals which can be written back to general purpose registers and $s_h^i$ be the number of signals which can be written back to a heterogeneous register $R_i$. Assume that a functional unit $f_i$ has $n_{f_i}$ input ports and that the number of signals connected to the $j$-th ($0 \le j < n_{f_i}$) input of the functional unit is $s_{f_i}^j$. Then an instruction set is composed of, at most, $n_{inst}$ instructions. Only the instructions which are required in order to execute a given application program, together with the minimum instructions mentioned above, are synthesized. The bit length of an instruction code is $b_{ctrl} \le \lceil \log n_{inst} \rceil$ and the bit length for specifying general purpose registers is $(n_r + n_w) \times b_{reg}$, where $b_{reg}$ is the bit length to specify a general purpose register and $b_{reg} = \lceil \log n_g \rceil$ (Fig. 2).
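The field widths described above can be illustrated with a short sketch. It assumes the logarithms are base 2 (the text does not state the base), and the function names `register_field_bits` and `opcode_bits` are hypothetical helpers, not part of the proposed system.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n, computed exactly with integers (n >= 1)."""
    return (n - 1).bit_length()


def register_field_bits(n_g: int, n_r: int, n_w: int) -> tuple[int, int]:
    """Return (b_reg, total bits used to specify general purpose registers).

    b_reg = ceil(log2(n_g)); the total is (n_r + n_w) * b_reg.
    """
    b_reg = ceil_log2(n_g)
    return b_reg, (n_r + n_w) * b_reg


def opcode_bits(n_inst: int) -> int:
    """Upper bound on the instruction-code length for n_inst instructions."""
    return ceil_log2(n_inst)


if __name__ == "__main__":
    # 8 general purpose registers, 2 read ports, 1 write port (illustrative values)
    print(register_field_bits(8, 2, 1))
    print(opcode_bits(20))
```

With 8 registers each register specifier needs 3 bits, so two read ports and one write port take 9 bits of the instruction word.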
We estimate processor area by adding up the areas of (i) ID/EXE pipeline registers, (ii) general purpose registers, (iii) heterogeneous registers, (iv) functional units, and (v) multiplexers (for writing to general purpose registers, heterogeneous registers, and data memory banks, and for inputs of functional units). Let $a_{reg}$ and $a_{mux}$ be the area of a single bit register and of a 2-1 multiplexer respectively. Area $A_{pipe}$ of (i) is estimated as
$$A_{pipe} = a_{reg} \times (b_{ctrl} + n_r \times b_{basic} + n_w \times b_{reg}).$$
Area $A_g$ of (ii) is estimated as
$$A_g = a_{reg} \times n_g \times b_{basic} + a_{mux} \times (n_r \times b_{basic} \times (n_g - 1) + n_w \times b_{basic} \times n_g).$$
Area $A_h$ of (iii) is estimated as
$$A_h = a_{reg} \times \sum_{i} b_h^i,$$
where $b_h^i$ is the bit width of $R_i$ and the sum runs over all heterogeneous registers. Area $A_F$ of (iv) is estimated as
$$A_F = \sum_{fu \in F} a_{fu}.$$
Finally, area $A_m$ of (v) is estimated as
$$A_m = a_{mux} \times \sum_{p \in mux} (s_p - 1) \times b_p,$$
where $s_p$ is the number of signals which can be input to a multiplexer $p$ and $b_p$ is the bit width of the multiplexer $p$.
The clock period of a processor is estimated as the critical path delay at the EXE stage, which is composed of the delays of multiplexers, functional units, and writing back to registers. We let $d_{mux}$ be the delay of a 2-1 multiplexer and estimate the delay of an $s_p$-1 multiplexer as $d_{mux} \times \lceil \log s_p \rceil$.
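As a rough illustration of this cost model, the sketch below adds up the five area terms and computes a multiplexer delay. The function and class names are hypothetical, the logarithm is assumed to be base 2, and the parameter values in the usage example are illustrative apart from the per-bit register area, the 2-1 multiplexer area and delay, and the ALU16 and MUL16 areas, which are taken from Section IV.

```python
from dataclasses import dataclass


@dataclass
class Mux:
    inputs: int  # s_p: number of signals selectable by the multiplexer
    width: int   # b_p: bit width of the multiplexer


def ceil_log2(n: int) -> int:
    """Exact integer ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()


def processor_area(a_reg, a_mux, b_basic, b_ctrl, n_g, n_r, n_w,
                   het_widths, fu_areas, muxes):
    """Sum of the five area terms (i)-(v) of the model."""
    b_reg = ceil_log2(n_g)
    a_pipe = a_reg * (b_ctrl + n_r * b_basic + n_w * b_reg)          # (i)
    a_g = (a_reg * n_g * b_basic
           + a_mux * (n_r * b_basic * (n_g - 1) + n_w * b_basic * n_g))  # (ii)
    a_h = a_reg * sum(het_widths)                                     # (iii)
    a_f = sum(fu_areas)                                               # (iv)
    a_m = a_mux * sum((m.inputs - 1) * m.width for m in muxes)        # (v)
    return a_pipe + a_g + a_h + a_f + a_m


def mux_delay(d_mux: float, s_p: int) -> float:
    """Delay of an s_p-to-1 multiplexer built from 2-1 multiplexers."""
    return d_mux * ceil_log2(s_p)


if __name__ == "__main__":
    area = processor_area(
        a_reg=383, a_mux=167, b_basic=16, b_ctrl=5,
        n_g=8, n_r=2, n_w=1,
        het_widths=[16, 32],            # one 16 bit and one 32 bit register
        fu_areas=[78915, 356948],       # ALU16 and MUL16
        muxes=[Mux(inputs=3, width=16)],
    )
    print(area, mux_delay(0.23, 4))
```

Keeping each term as a separate local variable makes it easy to see which resource dominates; in this example the functional units account for most of the area.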
III. A HARDWARE/SOFTWARE COSYNTHESIS ALGORITHM
FOR PROCESSORS WITH HETEROGENEOUS DATAPATHS
This section proposes an algorithm to synthesize a processor 
with a heterogeneous datapath. 
A. Problem definition 
A control flow graph (CFG) $G_c = (V_c, E_c)$ is defined as a graph representing control flow in a function. A CFG has no input and output edges. A basic block is a node of a CFG and has a data flow graph (DFG). A basic block has no branches and joins except at its start and end. A DFG $G_d = (V_d, E_d)$ is defined as a graph representing data flow. An ending node of a DFG may be connected to a starting node in another DFG. A DFG is a set of operations from a branch or join to the next branch or join in an application. A node $v \in V_d$ is associated with an operation in a functional unit $f \in FU$ or a memory access operation. If the execution of a DFG node $v_1$ has a data dependency to another DFG node $v_2$, the DFG has an edge $(v_1, v_2)$.
$T_{app}$, the execution time of an application program, is calculated as the product of the total number of clock cycles needed to execute the CDFG and the clock period of the processor. A processor configuration includes the number of data memory banks, the number of heterogeneous registers, the bit width of each heterogeneous register, the types and numbers of functional units, and the connections among data memory, general purpose registers, heterogeneous registers, and functional units.
Definition III.1 (Processor synthesis problem) A processor synthesis problem is to determine a processor configuration, given a CDFG and a timing constraint, so as to minimize processor area while the execution time of the given application program satisfies the timing constraint $T_{max}$, that is, $T_{app} \le T_{max}$.
B. Proposed algorithm 
The proposed algorithm is composed of two phases. One determines an initial processor configuration and the other explores processor configurations. The initial processor configuration has a sufficient number of functional units and heterogeneous registers and can execute the application program in a small number of cycles, while its processor area is large. A branch and bound algorithm is then applied and all candidate numbers of functional units and memory banks are tried. For each assumed number of functional units and memory banks, an appropriate number of heterogeneous registers and the connections between functional units and registers are explored. In this way, we can obtain an optimized processor configuration in a short time.
B.1 Initial processor configuration
To obtain the initial processor configuration, all the DFGs in a CFG are scheduled by the following method. To schedule a DFG $G_d = (V_d, E_d)$ associated with a CFG node $v_c \in V_c$ is to determine a pair $(v, st)$ for every node $v \in V_d$, $st \in \{1, 2, \ldots\}$, that is, to construct $S_{v_c} = \{(v, st) \mid v \in V_d\}$. A node $v \in V_d$ is either a node executed by a functional unit (called a functional node) or a node accessing data memory (called a memory accessing node). The set of all functional nodes in $V_d$ is denoted as $V_f$ and the set of all memory accessing nodes in $V_d$ is denoted as $V_m$, where $V_d = V_f \cup V_m$ and $V_f \cap V_m = \emptyset$. Our target processor can execute several mutually data-dependent functional nodes in one cycle by connecting an output of a functional unit to an input of another functional unit. However, a processor cannot execute a node that loads data from memory and a node that operates on the loaded data in the same cycle, nor a node that stores data to memory and a node that operates on the stored data in the same cycle. Therefore, if an edge $(v_i, v_j)$ is included in $E_d$ and $v_i, v_j \in V_d$ are assigned to steps $st_i$ and $st_j$ respectively, then $st_i \le st_j$ if $v_i, v_j \in V_f$, and $st_i < st_j$ if $v_i \in V_m$ or $v_j \in V_m$. Figure 3 shows the step assigning rule. DFG nodes are assigned to steps in a topological order. A step $st$ is filled by assigning to it all the functional nodes which can be assigned to $st$ and the memory accessing nodes which satisfy the resource constraints. Figure 4 shows this algorithm.
Fig. 3. A step assigning rule. (a) The case where both the predecessor and the successor node of an edge are in $V_f$. (b) The case where the predecessor or the successor node is in $V_m$.
Inputs A CFG $G_c = (V_c, E_c)$ and DFG $G_d = (V_d, E_d)$.
Outputs A scheduling result $\{S_{v_c} \mid v_c \in V_c\}$.
Step 1. Select a CFG node $v_c \in V_c$. Execute $st \leftarrow 1$ and $S_{v_c} \leftarrow \emptyset$.
Step 2. Select $v \in V_d$ in $G_d = (V_d, E_d)$ corresponding to $v_c$ in a topological order.
  Step 2.1. For $v \in V_f$, assign $v$ to $st$ if there is no $v'$ such that $v' \in V_m$ and $(v', v) \in E_d$. Execute $S_{v_c} \leftarrow S_{v_c} \cup (v, st)$.
  Step 2.2. For $v \in V_m$, assign $v$ to $st$ if the memory resource constraint is satisfied and the step $st'$ assigned to each $v'$ with $(v', v) \in E_d$ is less than $st$. Execute $S_{v_c} \leftarrow S_{v_c} \cup (v, st)$.
Step 3. Go to Step 4 if every node in $V_d$ is assigned to a step. Otherwise, update $st \leftarrow st + 1$ and go to Step 2.
Step 4. $S_{v_c}$ has been obtained. Select another CFG node $v_c'$ and go to Step 2. Finish the algorithm when all the CFG nodes have been scheduled.
Fig. 4. Initial scheduling algorithm.
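The step assigning rule can be sketched for a single DFG as follows. This is not the step-by-step procedure of Fig. 4 but an as-soon-as-possible formulation of the same rule, under the assumptions that functional units are unlimited (as in the initial configuration), that each memory accessing node occupies one memory bank for one step, and that memory conflicts are resolved greedily in topological order. The function name `schedule_dfg` and the node names are hypothetical.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def schedule_dfg(kinds, edges, n_mem_banks=1):
    """Assign a step (1, 2, ...) to every DFG node.

    kinds: {node: 'f' (functional) or 'm' (memory accessing)}
    edges: list of (predecessor, successor) pairs
    Functional nodes may be chained into the step of a functional
    predecessor; any edge touching a memory node needs a later step.
    """
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)

    mem_used = defaultdict(int)  # step -> memory banks already occupied
    step = {}
    order = TopologicalSorter({v: preds[v] for v in kinds}).static_order()
    for v in order:
        earliest = 1
        for u in preds[v]:
            chained = kinds[u] == 'f' and kinds[v] == 'f'
            earliest = max(earliest, step[u] if chained else step[u] + 1)
        if kinds[v] == 'm':
            # delay the access until a memory bank is free
            while mem_used[earliest] >= n_mem_banks:
                earliest += 1
            mem_used[earliest] += 1
        step[v] = earliest
    return step


if __name__ == "__main__":
    # multiply-accumulate on two loaded values, then store the result
    kinds = {'la': 'm', 'lb': 'm', 'mul': 'f', 'add': 'f', 'store': 'm'}
    edges = [('la', 'mul'), ('lb', 'mul'), ('mul', 'add'), ('add', 'store')]
    print(schedule_dfg(kinds, edges, n_mem_banks=1))
    print(schedule_dfg(kinds, edges, n_mem_banks=2))
```

With two memory banks both loads fit in the first step, so the chained multiply and add move one step earlier than with a single bank.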
Based on the initial scheduling result, an initial processor configuration is constructed. If the predecessor node and the successor node of an edge are assigned to different steps, the edge is assigned to a register. The lifetime of each variable is analyzed in advance.
First, we decide the number of functional units. Let $V_{v_c,st}$ be the set of nodes assigned to a step $st$ in a node $v_c \in V_c$. $fu(v)$ denotes the functional unit or data memory corresponding to $v \in V_d$. The number of functional units $f_i$ needed to execute $V_{v_c,st}$ in one cycle is $n_{f_i}(v_c, st) = |\{v \in V_{v_c,st} \mid fu(v) = f_i\}|$. Therefore, the number of functional units $f_i$ which a processor has is
$$n_{f_i} = \max_{\forall v_c, \forall st} \{ n_{f_i}(v_c, st) \}.$$
Similarly, the number of data memory banks is determined to be one or two.
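The maximum over all basic blocks and steps can be computed directly from a scheduling result, as in the following sketch. The data layout (`schedules` mapping each CFG node to a node-to-step dictionary, `unit_of` playing the role of $fu(v)$ for functional nodes) and the function name are assumptions made for illustration.

```python
from collections import Counter


def required_units(schedules, unit_of):
    """Number of units of each type needed by the busiest step.

    schedules: {cfg_node: {dfg_node: step}}
    unit_of:   {dfg_node: unit type}; memory nodes are simply absent
    """
    need = Counter()
    for sched in schedules.values():
        # count nodes per (step, unit type) inside this basic block
        per_step = Counter((st, unit_of[v])
                           for v, st in sched.items() if v in unit_of)
        for (_, fu), n in per_step.items():
            need[fu] = max(need[fu], n)
    return dict(need)


if __name__ == "__main__":
    schedules = {
        'bb0': {'m1': 1, 'm2': 1, 'a1': 1, 'a2': 2},
        'bb1': {'m3': 1},
    }
    unit_of = {'m1': 'MUL16', 'm2': 'MUL16', 'm3': 'MUL16',
               'a1': 'ADD32', 'a2': 'ADD32'}
    print(required_units(schedules, unit_of))
```

Here two multiplications share step 1 of `bb0`, so two MUL16 units are required, while the two additions fall in different steps and one ADD32 suffices.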
Now, we propose an algorithm to configure the connections among functional units, data memory banks, and registers. We define a processor configuration graph in order to represent the connections. A processor configuration graph is denoted as $G_{pr} = (V_{pr}, E_{pr})$. A node in $V_{pr}$ corresponds to a pipeline register, a heterogeneous register, a functional unit, or data memory. A processor configuration graph has an edge $(v_1, v_2)$ if there is a connection between the hardware corresponding to $v_1$ and the hardware corresponding to $v_2$. The number of nodes corresponding to functional units or data memory can be determined as mentioned above. Thus, there are $n_{f_i}$ functional units associated with $f_i$ in $V_{pr}$. There are one or two nodes associated with data memory banks. Corresponding to reading and writing general purpose registers, nodes associated with pipeline registers must be included in $V_{pr}$.
First, to work as a general processor, our algorithm adds a node $v$ associated with an ALU to $V_{pr}$ and edges between $v$ and the nodes associated with pipeline registers.
The number of nodes associated with heterogeneous registers and the set $E_{pr}$ of edges are decided by applying the following algorithm to all the steps of all the basic blocks and updating $G_{pr} = (V_{pr}, E_{pr})$.
Let $E_{sub}$ be the set of edges $(v_1, v_2) \in E_d$ where $v_1 \in V_{v_c,st}$ or $v_2 \in V_{v_c,st}$. A subgraph $G_{sub} = (V_{sub}, E_{sub})$ of $G_d$ can be constructed, where $V_{sub}$ is the set of all the nodes which are $v_1$ or $v_2$ of some $(v_1, v_2) \in E_{sub}$. Configuring conditions for $G_{pr}$ in a step $st$ are defined as follows.
1. $v_d \in V_{v_c,st}$ is associated with $v_{pr} \in V_{pr}$. The mapping function is denoted as $B_f : V_{v_c,st} \mapsto V_{pr}$. If $v_{pr} = B_f(v_d)$, $v_{pr}$ corresponds to the hardware $fu(v_d)$. If $v_1, v_2 \in V_{v_c,st}$, $B_f(v_1) \ne B_f(v_2)$.
2. If $v_1, v_2 \in V_{v_c,st}$ for $e = (v_1, v_2) \in E_{sub}$, $(B_f(v_1), B_f(v_2))$ must be included in $E_{pr}$.
3. Let $E_r = \{(v_s, v_t) \in E_{sub}\}$ be the set of edges where $v_s \notin V_{v_c,st}$ or $v_t \notin V_{v_c,st}$. $e_r \in E_r$ is associated with $v_{pr} \in V_{pr}$. The mapping function is denoted as $B_r : E_r \mapsto V_{pr}$. If $v_{pr} = B_r(e_r)$, $v_{pr}$ corresponds to a heterogeneous register between $B_f(v_s)$ and $B_f(v_t)$. For $e_1 = (v_i, v_j), e_2 = (v_k, v_l) \in E_{sub}$, $B_r(e_1) = B_r(e_2)$ if $v_i = v_k$ and $B_r(e_1) \ne B_r(e_2)$ if $v_i \ne v_k$.
4. If $v_1 \in V_{v_c,st}$ and $v_2 \notin V_{v_c,st}$ for $e = (v_1, v_2) \in E_{sub}$, $(B_f(v_1), B_r(e))$ must be included in $E_{pr}$. Similarly, if $v_1 \notin V_{v_c,st}$ and $v_2 \in V_{v_c,st}$ for $e = (v_1, v_2) \in E_{sub}$, $(B_r(e), B_f(v_2))$ must be included in $E_{pr}$.
Figure 5 shows an example of $G_{sub}$ and a graph satisfying the configuring conditions for $G_{pr}$ in a step $st$.
Fig. 5. An example of $G_{pr}$ satisfying configuring conditions based on $G_{sub}$.
A mapping function $B_f$ which maps $V_{v_c,st}$ to $V_{pr}$ is decided so that configuring condition 1 for $G_{pr}$ is satisfied. Then, $G_{pr}$ is updated as follows so that configuring conditions 2-4 for $G_{pr}$ are satisfied.
1. If $(B_f(v_1), B_f(v_2))$ is in $E_{pr}$ for $(v_1, v_2) \in E_{sub}$ where $v_1$ and $v_2$ are in $V_{v_c,st}$, $E_{pr}$ is not updated. If $(B_f(v_1), B_f(v_2)) \notin E_{pr}$, $(B_f(v_1), B_f(v_2))$ is added to $E_{pr}$.
2. Let us focus on an edge $e = (v_1, v_2) \in E_r$ where $v_1 \in V_{v_c,st}$ and $v_2 \notin V_{v_c,st}$. Let $v_{reg}$ be a node which is associated with a heterogeneous register connected to an outgoing edge of $B_f(v_1)$ and which does not have $st$ in its lifetime. Update the mapping function $B_r$ so that $B_r(e) = v_{reg}$. If no such $v_{reg}$ exists, a new $v_{reg}$ is included in $V_{pr}$ and $B_r(e) = v_{reg}$ is added to $B_r$.
3. For $e = (v_1, v_2) \in E_r$ where $v_1 \notin V_{v_c,st}$ and $v_2 \in V_{v_c,st}$, $B_r(e)$ has already been determined in an iteration up to step $st - 1$. If $(B_r(e), B_f(v_2))$ is not in $E_{pr}$, $(B_r(e), B_f(v_2))$ is included in $E_{pr}$.
The algorithm selects the $B_f$ which has the minimum $G_{pr}$ cost and updates $G_{pr}$, where the $G_{pr}$ cost is defined by estimating processor area as mentioned in Sect. II. The processor configuration algorithm is shown in Fig. 6.
An initial processor configuration is obtained by applying the processor configuration algorithm to an initial scheduling result. The initial processor configuration has the minimum number of cycles for executing the application program. Based on the model in Sect. II, the area and the clock period of the processor are estimated.
Inputs A scheduling result $\{S_{v_c} \mid v_c \in V_c\}$ and a set of subgraphs $G_{sub} = (V_{sub}, E_{sub})$.
Outputs A processor configuration graph $G_{pr} = (V_{pr}, E_{pr})$.
Step 1. Calculate the number $n_{f_i}$ of $f_i$ as $n_{f_i} = \max_{\forall v_c, \forall st} \{ n_{f_i}(v_c, st) \}$. Similarly, the number of data memory banks is determined to be one or two.
Step 2. Include nodes associated with functional units, data memory banks, and reading and writing general purpose registers. The numbers of nodes associated with functional units and data memory banks are decided at Step 1. $E_{pr} \leftarrow \emptyset$.
Step 3. Add a node $v$ associated with an ALU to $V_{pr}$ and edges between $v$ and nodes associated with pipeline registers to $E_{pr}$.
Step 4. Pick up a CFG node $v_c$. $st \leftarrow 0$.
Step 5. Decide a mapping function $B_f$ which associates functional nodes in $G_{sub}$ at $st$ to the nodes in $G_{pr}$. Try to update $G_{pr}$ and calculate the $G_{pr}$ cost.
  Step 5.1. If $(B_f(v_1), B_f(v_2))$ is in $E_{pr}$ for $(v_1, v_2) \in E_{sub}$ where $v_1$ and $v_2$ are in $V_{v_c,st}$, $E_{pr}$ is not updated. If $(B_f(v_1), B_f(v_2)) \notin E_{pr}$, $E_{pr} \leftarrow E_{pr} \cup (B_f(v_1), B_f(v_2))$.
  Step 5.2. Let us focus on an edge $e = (v_1, v_2) \in E_r$ where $v_1 \in V_{v_c,st}$ and $v_2 \notin V_{v_c,st}$. Let $v_{reg}$ be a node which is associated with a heterogeneous register connected to an outgoing edge of $B_f(v_1)$ and which does not have $st$ in its lifetime. Update the mapping function $B_r$ so that $B_r(e) = v_{reg}$. If no such $v_{reg}$ exists, $V_{pr} \leftarrow V_{pr} \cup v_{reg}$ and $B_r(e) = v_{reg}$ is added to $B_r$.
  Step 5.3. For $e = (v_1, v_2) \in E_r$ where $v_1 \notin V_{v_c,st}$ and $v_2 \in V_{v_c,st}$, $B_r(e)$ has already been determined in an iteration up to step $st - 1$. If $(B_r(e), B_f(v_2))$ is not in $E_{pr}$, $E_{pr} \leftarrow E_{pr} \cup (B_r(e), B_f(v_2))$.
Step 6. Select the $B_f$ which has the minimum $G_{pr}$ cost and update $G_{pr}$.
Step 7. $st \leftarrow st + 1$ and go to Step 5. If all the steps $st$ have been tried, go to Step 4. Finish the algorithm if all the CFG nodes have been tried.
Fig. 6. A processor configuration algorithm.
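The edge update rules can be sketched for one basic block and one fixed mapping $B_f$ as follows. This is a simplified illustration rather than the algorithm of Fig. 6: it does not enumerate candidate mappings or evaluate the $G_{pr}$ cost, it ignores register bit widths, and it treats a register as reusable only when the whole lifetime of the new value (written at the end of its producing step, last read in its final consuming step) is free. The function name and the hardware labels such as `MUL#0` are hypothetical.

```python
def build_config_graph(step, edges, unit_of):
    """Derive connection edges E_pr and a register binding for one DFG.

    step:    {node: step} from a valid schedule
    edges:   list of DFG edges (v1, v2)
    unit_of: the mapping B_f as {node: hardware unit instance}
    Returns (E_pr, reg_of) where reg_of maps a producer node to the
    heterogeneous register shared by all its step-crossing edges.
    """
    e_pr = set()
    lifetimes = {}  # register -> list of (written_at, last_read_at)
    reg_of = {}
    last_read = {}
    for v1, v2 in edges:
        if step[v2] > step[v1]:
            last_read[v1] = max(last_read.get(v1, 0), step[v2])

    for v1, v2 in sorted(edges, key=lambda e: step[e[0]]):
        src, dst = unit_of[v1], unit_of[v2]
        if step[v1] == step[v2]:
            e_pr.add((src, dst))          # rule 1: chained functional units
            continue
        if v1 not in reg_of:              # rule 2: bind a register to v1
            born, dies = step[v1], last_read[v1]
            free = [r for r, spans in lifetimes.items()
                    if (src, r) in e_pr
                    and all(dies <= b or d <= born for b, d in spans)]
            reg = free[0] if free else f"R{len(lifetimes)}"
            lifetimes.setdefault(reg, []).append((born, dies))
            reg_of[v1] = reg
            e_pr.add((src, reg))
        e_pr.add((reg_of[v1], dst))       # rule 3: register feeds consumer
    return e_pr, reg_of


if __name__ == "__main__":
    # multiply chained into an add, followed by two accumulating adds
    step = {'mul1': 1, 'add1': 1, 'add2': 2, 'add3': 3}
    edges = [('mul1', 'add1'), ('add1', 'add2'), ('add2', 'add3')]
    unit_of = {'mul1': 'MUL#0', 'add1': 'ADD#0',
               'add2': 'ADD#0', 'add3': 'ADD#0'}
    print(build_config_graph(step, edges, unit_of))
```

In this example both intermediate sums are bound to the same register, which is connected to the output and the input of the adder, so the result resembles an accumulate register.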
Inputs A processor configuration graph $G_{pr} = (V_{pr}, E_{pr})$ obtained by applying the processor configuration algorithm.
Outputs An updated processor configuration graph $G_{pr}' = (V_{pr}', E_{pr}')$.
Step 1. Try to remove each of the registers by the following two reduction methods. Calculate the processor area and the execution time of the application program.
  1. Focus on two heterogeneous registers $v_1$ and $v_2$ which have the same bit width and have no overlap in their lifetimes. When register $v_2$ is removed, update all the edges which have $v_2$ at either end by replacing $v_2$ with $v_1$. $V_{pr} \leftarrow V_{pr} \setminus v_2$ and update $E_{pr}$.
  2. Remove a heterogeneous register $v$ which has the same bit width as the general purpose registers and save the content of the heterogeneous register in a general purpose register. Update all the edges which have the heterogeneous register $v$ at either end by replacing $v$ with a pipeline register. $V_{pr} \leftarrow V_{pr} \setminus v$ and update $E_{pr}$.
Step 2. Among all the candidates, actually apply the update which gives the processor configuration with the minimum area under the timing constraint. A new $G_{pr}$ is obtained.
  … associated with the heterogeneous register is the most …
Step 3. Finish if there are no registers that can be removed in Step 1 or if none of the candidates satisfies the timing constraint.
Fig. 7. An algorithm to reduce heterogeneous registers.
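The first reduction method can be sketched as two small helpers: one lists register pairs that are candidates for merging, the other rewrites the edges of $E_{pr}$ after a merge. Lifetimes are modeled here as half-open step intervals `(start, end)`, which is an assumption of this sketch, as are the function names; the selection of the minimum-area candidate under the timing constraint (Step 2) is not shown.

```python
def mergeable_pairs(regs):
    """List (keep, drop) pairs of registers that may share hardware.

    regs: {name: (bit_width, [(start, end), ...])} with half-open lifetimes.
    Two registers qualify if the widths match and no lifetimes overlap.
    """
    names = sorted(regs)
    pairs = []
    for i, r1 in enumerate(names):
        for r2 in names[i + 1:]:
            w1, life1 = regs[r1]
            w2, life2 = regs[r2]
            disjoint = all(e1 <= s2 or e2 <= s1
                           for s1, e1 in life1 for s2, e2 in life2)
            if w1 == w2 and disjoint:
                pairs.append((r1, r2))
    return pairs


def merge_registers(e_pr, keep, drop):
    """Replace register node `drop` by `keep` on every edge of E_pr."""
    return {(keep if a == drop else a, keep if b == drop else b)
            for a, b in e_pr}


if __name__ == "__main__":
    regs = {
        'R0': (16, [(1, 2)]),
        'R1': (16, [(2, 4)]),
        'R2': (32, [(5, 6)]),
        'R3': (16, [(3, 5)]),
    }
    print(mergeable_pairs(regs))
    e_pr = {('ADD', 'R1'), ('R1', 'MUL'), ('ADD', 'R0')}
    print(merge_registers(e_pr, keep='R0', drop='R1'))
```

Merging can remove a register but may widen the multiplexers in front of the surviving one, which is why each candidate is re-evaluated with the area and timing model before it is accepted.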
B.2 Exploring a processor configuration
Based on the initial processor configuration, the maximum numbers of functional units and data memory banks are determined. We apply a branch and bound method in which sub-problems are constructed by branching on the numbers of functional units and data memory banks. Once the numbers of functional units and data memory banks are fixed, only the registers remain to be configured. Therefore we can optimize the processor configuration in a short time. We solve each sub-problem as follows. First we schedule the CDFG under the constraint on the numbers of functional units and data memory banks. The algorithm in Fig. 4 can easily be adapted to this by restricting the assigned nodes according to the resource constraints. The processor configuration algorithm is then applied to the scheduling result. While the timing constraint is satisfied, the heterogeneous register reduction algorithm shown in Fig. 7 is applied.
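A minimal sketch of such a branch and bound search is given below. The branching order, the bound, and the `evaluate` callback are assumptions: `evaluate` stands in for scheduling, processor configuration, and register reduction of one sub-problem and returns an (area, execution time) pair; the bound uses the fact that the total area is at least the summed area of the functional units; and every unit type is assumed to appear at least once. The bounding rule actually used by the proposed algorithm is not specified in the text.

```python
def explore(max_counts, fu_area, evaluate, t_max):
    """Search unit counts for the minimum-area configuration meeting t_max.

    max_counts: {unit type: maximum number, from the initial configuration}
    fu_area:    {unit type: area of one unit}
    evaluate:   callable(counts) -> (area, exec_time)   (hypothetical)
    Returns (best_area, best_counts), or (inf, None) if nothing is feasible.
    """
    units = sorted(max_counts)
    best = (float('inf'), None)

    def branch(i, counts, fixed_area):
        nonlocal best
        # lower bound: units fixed so far plus one unit of each open type
        bound = fixed_area + sum(fu_area[u] for u in units[i:])
        if bound >= best[0]:
            return  # this sub-tree cannot beat the incumbent
        if i == len(units):
            area, time = evaluate(dict(counts))
            if time <= t_max and area < best[0]:
                best = (area, dict(counts))
            return
        u = units[i]
        for n in range(1, max_counts[u] + 1):
            counts[u] = n
            branch(i + 1, counts, fixed_area + n * fu_area[u])
        del counts[u]

    branch(0, {}, 0)
    return best


if __name__ == "__main__":
    fu_area = {'ADD32': 41963, 'MUL16': 356948}   # unit areas from Table I

    def toy_evaluate(counts):
        # made-up model: fixed overhead plus unit area; time shrinks with units
        area = 100000 + sum(n * fu_area[u] for u, n in counts.items())
        time = 60 / counts['MUL16'] + 10 / counts['ADD32']
        return area, time

    print(explore({'ADD32': 2, 'MUL16': 3}, fu_area, toy_evaluate, t_max=45))
```

In the toy run, every branch with three multipliers is pruned by the bound as soon as a feasible two-multiplier configuration has been found.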
IV. EXPERIMENTAL RESULTS 
We have implemented the proposed algorithm in C++ on a Sun Ultra Sparc 3 750MHz. We use gcc 2.95.3 as the compiler. To estimate the area and delay of hardware, we use VDEC libraries (CMOS, 0.35 μm technology)¹. We use the area and delay in Table I for the estimation of functional units. We assume that the area and delay of a 2-1 multiplexer are $a_{mux} = 167$ [μm²] and $d_{mux} = 0.23$ [ns], respectively, and that the area and writing delay of a single bit register are $a_{reg} = 383$ [μm²] and $d_{reg} = 0.40$ [ns], respectively. We assume that the basic bit width of a processor is 16 bits.
The algorithm has been applied to an FIR filter and a DCT in which the basic bit width of variables is 16 bits and 32 bit multiply and accumulate operations are used. The results are shown in Table II. For different timing constraints, different processor configurations are obtained.
In order to compare our results with existing systems, we have picked up two processor synthesis systems (which are referred to as [15] and [16] in References). [15] synthesizes a simple processor with a homogeneous datapath. [16] synthesizes a processor which has two types of register files and a homogeneous datapath. Comparison results are shown in Table III. It shows that our system can synthesize processors with less area than the existing systems, which synthesize processors with homogeneous datapaths. When the timing constraint of 60 μs is given, System 1 and System 2 cannot output a processor configuration meeting the timing constraint. This is because processors synthesized by System 1 and System 2 cannot have chains among the functional units, and so no processor satisfies the timing constraint. Our proposed algorithm synthesizes a processor with less area than System 1 for the following reason: in a processor synthesized by System 1, all the registers and functional units must have a bit width of 32 bits. In a processor synthesized by the proposed algorithm, however, registers and functional units can have flexible bit widths, so the processor can have both 16 bit and 32 bit resources. Our proposed algorithm synthesizes a processor with less area than System 2 for the following reason: a processor synthesized by System 2 can have 32-bit and 16-bit registers and functional units. Since the synthesized processor has a homogeneous datapath, however, it must have connections and multiplexers between all the registers and all the functional units. A processor synthesized by the proposed algorithm has only the connections and multiplexers required in order to execute the given application program. Therefore we can synthesize a processor with less area.

TABLE I
FUNCTIONAL UNITS.
Unit     Area [μm²]   Delay [ns]
-        25,259       -
ALU16    78,915       -
MUL16    356,948      -
ADD32    41,963       2.49
ALU32    151,194      -

¹ The libraries in this study have been developed in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo with the collaboration by Hitachi Ltd. and Dai Nippon Printing Corporation.
V. CONCLUSION 
In this paper, we proposed a hardware/software cosynthesis algorithm for processors with heterogeneous datapaths. In the future, we will incorporate SIMD functional units into the processor model and establish the algorithm to optimize the processor configuration.
TABLE II
EXPERIMENTAL RESULTS.
App.  Const. [μs]  Area [μm²]  Processor configuration
FIR   60           1,531,347   ALU16 x 1, MUL16 x 3, ADD32 x 2; 16 bit heterogeneous registers x 7, 32 bit heterogeneous register x 1
FIR   70           676,761     ALU16 x 1, MUL16 x 1, ADD32 x 1; 16 bit heterogeneous registers x 5, 32 bit heterogeneous register x 2
DCT   60           656,057     ALU16 x 1, ADD16 x 3, MUL16 x 1, ADD32 x 1; 16 bit heterogeneous registers x 6, 32 bit heterogeneous register x 1
DCT   80           580,290     ALU16 x 1, MUL16 x 1, ADD32 x 1; 16 bit heterogeneous registers x 6, 32 bit heterogeneous register x 1
DCT   110          568,038     ALU16 x 1, MUL16 x 1, ADD32 x 1; 16 bit heterogeneous registers x 4, 32 bit heterogeneous register x 1

TABLE III
RESULTS COMPARED TO EXISTING SYSTEMS.
App.  Const. [μs]  Proposed    System 1 [15]  System 2 [16]
FIR   60           1,531,347   -              -
DCT   60           656,057     776,326        705,471
DCT   80           580,290     720,323        619,468
DCT   110          568,038     693,918        613,340

REFERENCES
[1] H. Akaboshi and H. Yasuura, "COACH: A computer aided design tool for computer architectures," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E76-A, no. 10, pp. 1760-1769, 1993.
[2] A. Alomary, T. Nakata, Y. Honma, M. Imai, and N. Hikichi, "An ASIP instruction set optimization algorithm with functional module sharing constraint," in Proceedings of 1993 IEEE/ACM International Conference on Computer-Aided Design, pp. 526-532, 1993.
[3] Analog Devices, ADSP-2100 Family User's Manual, 1995.
[4] N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi, "A hardware/software partitioning algorithm for designing pipelined ASIPs with least gate count," in Proceedings of 33rd Design Automation Conference, pp. 527-532, 1996.
[5] S. Fröhlich and B. Wess, "Integrated approach to optimized code generation for heterogeneous-register architectures with multiple data-memory banks," in Proceedings of 14th Annual IEEE International ASIC/SOC Conference, pp. 122-126, 2001.
[6] N. Ishiura, M. Yamaguchi, and T. Kambe, "A graph-based algorithm of operation binding for compilers targeting heterogeneous datapath," in Proceedings of The 1998 IEEE Asia Pacific Conference on Circuits and Systems, pp. 395-398, 1998.
[7] N. Ishiura, T. Watanabe, and M. Yamaguchi, "A code generation method for datapath oriented application specific processor design," in Proceedings of SASIMI 2000, pp. 71-78, 2000.
[8] N. Ishiura and T. Watanabe, "Datapath oriented codesign method of application specific DSPs using retargetable compiler," in Proceedings of 2002 Asia-Pacific Conference on Circuits and Systems, vol. 1, pp. 55-58, 2002.
[9] S. Kurohmaru, M. Matsuo, H. Nakajima, Y. Kohashi, T. Yonezawa, T. Moriiwa, M. Ohashi, M. Toujima, T. Nakamura, M. Hamada, T. Hashimoto, H. Fujimoto, Y. Iizuka, J. Michiyama, and H. Komori, "A MPEG4 programmable codec DSP with an embedded pre/post-processing engine," in Proceedings of the IEEE 1999 Custom Integrated Circuits, pp. 69-72, 1999.
[10] Lucent Technologies, DSP1611/17/18/27/28/29 Digital Signal Processor Information Manual, 1998.
[11] P. Marwedel, "Code generation for core processor," in Proceedings of 34th Design Automation Conference, pp. 232-237, 1997.
[12] Motorola, DSP56300 24-bit Digital Signal Processor Family Manual (DSP56300FM/AD), 2000.
[13] Tensilica, Xtensa Microprocessor: Overview Handbook, http://www.tensilica.com.
[14] Texas Instruments, TMS320C2x Datasheet, 1998.
[15] N. Togawa, M. Yanagisawa, and T. Ohtsuki, "A hardware/software cosynthesis system for digital signal processor cores," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E82-A, no. 11, pp. 2325-2337, 1999.
[16] N. Togawa, M. Yanagisawa, and T. Ohtsuki, "A hardware/software cosynthesis system for digital signal processor cores with two types of register files," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E83-A, no. 3, 2000.
[17] J. Van Praet, D. Lanneer, G. Goossens, W. Geurts, and H. De Man, "A graph based processor model for retargetable code generation," in Proceedings of European Design and Test Conference, pp. 102-107, 1996.
[18] M. Yamaguchi, N. Ishiura, and T. Kambe, "A binding algorithm for retargetable compilation to non-orthogonal DSP architecture," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E81-A, no. 12, 1998.
