A Processor Core Synthesis System in IP-based SoC Design by Tomono Naoki et al.
3B-2 
A Processor Core Synthesis System in IP-based SoC Design 
Naoki TOMONO†  Shunitsu KOHARA†  Jumpei UCHIDA†  Yuichiro MIYAOKA†
Nozomu TOGAWA†,‡,*  Masao YANAGISAWA†  Tatsuo OHTSUKI†
† Department of Computer Science, Waseda University
‡ Department of Information and Media Sciences, The University of Kitakyushu
* Advanced Research Institute for Science and Engineering, Waseda University
email: tomono@yanagi.comm.waseda.ac.jp
Abstract- This paper proposes a new design methodology for SoCs that reuses hardware IPs. In our approach, after system-level HW/SW partitioning, we use IPs for the hardware parts, but synthesize a new processor core instead of reusing a processor core IP. The system executes hardware and software efficiently in parallel by taking into account the response time of each hardware IP, which is obtained by the proposed calculation algorithm. We can use optimal hardware IPs chosen by the proposed hardware IP selection algorithm. The experimental results show the effectiveness of our new design methodology.
I. INTRODUCTION 
The increased complexity of System-on-Chip (SoC) designs makes it difficult for designers to meet market demands such as short time-to-market, small gate count and high performance. In practice, hardware/software co-design [1] and IP-based design [2] have been proposed in order to build the required complexity in a short time. Another methodology is that, after the hardware/software partitioning, a designer reuses hardware IPs for the hardware parts and processor core IPs for the software parts.
However, designers do not always find suitable IPs that match the application. Some IPs have excess performance and some do not have enough performance. In our approach, after the hardware/software partitioning, a designer uses hardware IPs for the hardware parts, but synthesizes a new processor core instead of reusing a processor core IP. Synthesizing the processor core can compensate for the excess or deficient performance of the hardware IPs.
In this paper we propose a new processor core synthesis system, which is a hardware/software co-synthesis system based on the response time of hardware IPs. A processor core synthesized by the system can execute other operations while the hardware IPs execute their operations. The system can also select a suitable IP by a selection algorithm if there are several IPs which have the same functions but different performances.
Figure 1 shows a framework based on the proposed processor core synthesis system. A designer describes the specification of the application in SystemC [3]. After evaluating and validating the performance required for the application, the designer decides which parts of the specification are implemented by hardware and which by software (hardware/software partitioning). Then the hardware part is implemented by hardware IPs, and the software part is implemented by a processor core synthesized by the proposed system. The optimized processor core can absorb the excess or deficient performance of the hardware IPs.

Fig. 1. A framework.
The system requires the response time of the hardware IPs at scheduling, but it has been difficult to know the response time in advance. Moreover, designers have selected a hardware IP by experience and intuition when several hardware IPs have the same functions but different performances.
In this paper, we propose a calculation algorithm for the response time of hardware IPs and a hardware IP auto-selection algorithm. The calculation algorithm automatically calculates the response time of hardware IPs. The hardware IP auto-selection algorithm selects the IP that suits the input application from several IPs having the same functions but different performances.
This paper is organized as follows. Section II defines the architecture of an IP-based SoC. Section III proposes a processor core co-synthesis system, which is the key issue in the proposed system. Section IV shows several experimental results compared with existing processors. Section V gives concluding remarks.
II. TARGET ARCHITECTURE
Figure 2 shows an architecture model of an IP-based SoC. The architecture consists of a processor core, a memory and several hardware IPs, which are connected to each other via a shared bus. In our approach, the input application is first partitioned into hardware/software parts; then the hardware parts are implemented by hardware IPs, and the software parts are implemented by a processor core.

Fig. 2. An architecture model of IP-based SoC.

Fig. 3. An architecture model of processor core.

0-7803-8736-8/05/$20.00 ©2005 IEEE. ASP-DAC 2005
In the following section, we define the processor core, 
the hardware IP and the interface of both units. 
A. Processor Core
Figure 3 shows the architecture of the VLIW-type processor core, which consists of a processor kernel and some optional hardware units. The architecture is based on [6].
A processor kernel is (1) a RISC-type kernel or (2) a DSP-type kernel. A RISC-type kernel has five pipeline stages: IF (instruction fetch), ID (instruction decode), EXE (execution), MEM (memory access) and WB (write back). A DSP-type kernel has three pipeline stages: IF, ID and EXE. Optional hardware units, such as functional units (ALU, multiplier and so on) and the addressing units shown in Figure 3, can be added to the processor kernel.
The processor core has basic instructions, parallel instructions and hardware-IP-instructions. The basic instructions are based on a general digital signal processor [4]. A parallel instruction executes two or more basic instructions at once. The processor core synthesis system determines which combinations of basic instructions should become parallel instructions based on the application program. The hardware-IP-instructions are described in the following section.
B. Interface 
The interface between the processor core and a hardware IP is based on the ARM Coprocessor Interface [5]. The ARM Coprocessor Interface defines a signal interface and an instruction interface.
B.1 Signal Interface
Figure 4 shows the connection between the processor core and the hardware IPs. The processor core can connect up to 16 hardware IPs. The processor core communicates with the hardware IPs via three handshake signals as follows:
1. nCPI (Processor → IPs) The processor core wants to execute a hardware-IP-instruction.
2. CPA (IP → Processor) There are no hardware IPs which can execute the hardware-IP-instruction.
3. CPB (IP → Processor) The hardware IP cannot execute the hardware-IP-instruction immediately since it is executing another hardware-IP-instruction.
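The three signals can be read as a small decision procedure. The following Python sketch models only their logical meaning as described above; the function name `issue`, the `Outcome` enum and the busy-flag dictionary are illustrative assumptions, and electrical details such as signal polarity and timing are not modeled.

```python
from enum import Enum

MAX_HW_IPS = 16  # the processor core can connect up to 16 hardware IPs


class Outcome(Enum):
    ACCEPTED = "accepted"  # the addressed IP starts the instruction
    ABSENT = "absent"      # CPA: no hardware IP can execute the instruction
    BUSY = "busy"          # CPB: the IP is still executing another instruction


def issue(hw_number, busy_flags):
    """Model the outcome of asserting nCPI for hardware IP `hw_number`.

    `busy_flags` (hypothetical) maps each connected HW# to True while that
    IP is executing an earlier hardware-IP-instruction.
    """
    if len(busy_flags) > MAX_HW_IPS:
        raise ValueError("at most 16 hardware IPs can be connected")
    if hw_number not in busy_flags:
        return Outcome.ABSENT
    if busy_flags[hw_number]:
        return Outcome.BUSY
    return Outcome.ACCEPTED
```

With `busy_flags = {0: False, 1: True}`, issuing to IP 0 is accepted, issuing to IP 1 reports busy, and issuing to IP 3 reports that no IP can execute the instruction.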
B.2 Instruction Interface 
The processor core sends three kinds of hardware-IP-instructions to the hardware IPs: (a) CDP (coprocessor data operation), (b) LDC/STC (coprocessor load and store operations) and (c) MCR/MRC (register transfer operations). The formats of the hardware-IP-instructions are as follows:
CDP HW#, OP#
LDC HW#, N, Rd, Rn, offset
STC HW#, N, Rd, Rn, offset
MRC HW#, Rd1, Rd2
MCR HW#, Rd1, Rd2
CDP performs a processing operation on the data held in the hardware IP's registers. Each hardware IP is numbered, and the processor core specifies by HW# the hardware IP to which it sends a CDP instruction. OP# specifies which of the hardware IP's operations should be executed.
LDC/STC transfer data between a hardware IP and memory.
MCR/MRC transfer data between a processor core register and a hardware IP register.
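As an illustration of the operand layouts listed above, the following Python sketch splits one assembly line into named fields. The concrete text syntax (for example `CDP 2, 5` or register names such as `r1`), the lower-case field names and the function name are assumptions made for this example; the paper only gives the operand order.

```python
# Operand names per mnemonic, following the formats listed above.
FORMATS = {
    "CDP": ("hw", "op"),
    "LDC": ("hw", "n", "rd", "rn", "offset"),
    "STC": ("hw", "n", "rd", "rn", "offset"),
    "MRC": ("hw", "rd1", "rd2"),
    "MCR": ("hw", "rd1", "rd2"),
}


def parse_hw_ip_instruction(line):
    """Split one assembly line into a mnemonic and named operand strings."""
    mnemonic, _, rest = line.strip().partition(" ")
    mnemonic = mnemonic.upper()
    if mnemonic not in FORMATS:
        raise ValueError(f"not a hardware-IP-instruction: {mnemonic}")
    operands = [field.strip() for field in rest.split(",")]
    names = FORMATS[mnemonic]
    # Reject a wrong operand count or an empty operand.
    if len(operands) != len(names) or "" in operands:
        raise ValueError(f"{mnemonic} expects operands {names}")
    return {"mnemonic": mnemonic, **dict(zip(names, operands))}
```

Operands are kept as strings so that the sketch does not have to assume a register naming scheme or a number base.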
C. Hardware IP
Figure 5 shows the architecture of a hardware IP. It has a datapath, some registers, an instruction pipeline to hold the hardware-IP-instructions from the processor core, and instruction decode logic to decode the hardware-IP-instructions.

Fig. 5. An architecture model of a hardware IP.

III. PROCESSOR CORE SYNTHESIS SYSTEM
Figure 6 shows the proposed processor core synthesis system. Given an application program written in the SystemC language [3] and a time constraint, the application is partitioned into hardware/software parts as shown in Figure 1. The CWL (Component Wrapper Language) [8] descriptions of the hardware IPs that provide the functions of the hardware parts are also inputs of the system. The system synthesizes a hardware description of a processor core and generates object code for the processor core and the data of the selected hardware IPs.
Our approach is as follows:
1. Compile: First, the system assumes a processor core to which all the hardware units are added and runs the application program on the assumed processor core. The compiler generates assembly code with minimum execution time and maximum area, since there are no limitations on hardware units [6].
2. Processor core HW/SW partitioning: Second, the system replaces a part of the hardware with software by eliminating the hardware units added to the processor kernel one by one. The execution time of the assembly code becomes longer, but the area required for the processor core to run it becomes smaller [6].
3. The system repeats step 2 while the execution time of the assembly code satisfies the timing constraint, and obtains a processor core that satisfies the timing constraint with a small area.
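The loop formed by steps 2 and 3 can be sketched as a greedy procedure. The Python code below is a minimal sketch under stated assumptions, not the partitioner of [6]: each optional unit is given a fixed, hypothetical area and a fixed execution-time penalty for removing it, and the rule for choosing which unit to remove next (largest area first) is an assumption, since the text above does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    area: float      # area saved when the unit is removed (hypothetical)
    slowdown: float  # execution time added when it is removed (hypothetical)


def shrink_core(kernel_area, base_time, units, time_limit):
    """Remove optional units one by one while the timing constraint holds."""
    kept = list(units)
    time = base_time  # execution time with every unit present (step 1)
    while True:
        # Units whose removal keeps the execution time within the limit.
        removable = [u for u in kept if time + u.slowdown <= time_limit]
        if not removable:
            break
        # Assumed heuristic: remove the unit that saves the most area.
        victim = max(removable, key=lambda u: u.area)
        kept.remove(victim)
        time += victim.slowdown
    area = kernel_area + sum(u.area for u in kept)
    return kept, area, time
```

In the real system the execution time after each removal comes from recompiling and rescheduling the assembly code rather than from a fixed per-unit penalty, so the additive cost model here is only a stand-in.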
We now explain the constituents of the system.
response time calculator calculates the response time of a hardware IP from the input CWL description.
pre-processor (SW extractor) extracts a SW description of the application from the system-level description.
profiler profiles the software parts of the application from the system-level description.
compiler assumes a processor core to which all the hardware units are added and runs the application program on the assumed processor core. The compiler generates assembly code. This assembly code has minimum execution time and maximum area, since there are no limitations on hardware units.
processor core HW/SW partitioner replaces a part of the hardware with software by eliminating the hardware units added to the processor kernel one by one. The execution time of the assembly code becomes longer, but the area required for the processor core to run it becomes smaller.
HW IP auto-selector automatically selects the hardware IP suited to the application from several hardware IP candidates which have the same functions but different performances.
hardware generator generates a hardware description of the processor core.
assembler generates the object code of the application program run on the processor core.

Fig. 6. A processor core synthesis system.

See [6] for the compiler and the processor core HW/SW partitioner, and see [7] for the hardware generator. The pre-processor and the assembler are simple ones. We focus on the response time calculator and the HW IP auto-selector in the following sections.
A. Response Time Calculator
We propose an algorithm to calculate the response time of a hardware IP from its description in CWL, an interface description language. We define the response time as the time a hardware IP takes to execute a hardware-IP-instruction after it receives that instruction from the processor core.
CWL is a language used to define the interface specifications of the target IP correctly. Such interface specifications include specifications of logical signal changes as well as structural specifications, such as I/O pin information.
A CWL description consists of four major sections: port, alphabet, word and sentence. Our algorithm focuses on the word section.
Figure 7 shows the calculation flow. The word section defines the pattern of each transaction in normal representation. We classify the word section into four expressions.
Basic expression is the most frequently used and consists of the alphabet and regular expressions.
Hierarchical expression represents the word hierarchically.
Parallel expression without synchronization represents parallel processing such as a pipeline.
Parallel expression with synchronization represents parallel processing that includes synchronization.
In the response time calculator, these four expressions are handled by separate algorithms, as shown in Figure 7.

Fig. 7. A response time calculation flow.

Figure 8 shows an example of the response time calculation algorithm. The word section is extracted from the CWL description. The example is a parallel expression without synchronization, shown in the leftmost box in Figure 7, since it contains the keyword "SERIAL".
The calculation algorithm for a parallel expression without synchronization is:
1. Calculate the clock cycles of "SERIAL". "SERIAL" consists of the alphabet and regular expressions. We sum up all the alphabet symbols, taking the regular expressions into account. In Figure 8, since every alphabet symbol (a, b, c) is counted as one cycle, readcommand is calculated as:
readcommand = 1 x 8 + 1 x 3 + 1 = 12    (1)
The basic expression is calculated in the same way as equation (1).
2. Calculate the response time of "PARALLEL". "PARALLEL" consists of SERIAL words, the alphabet and the parallel notations shown in Table I. The readtransaction is calculated as shown in Figure 8. We calculate the PARALLEL word from the last constituent, taking into account the parallel notations shown in Table I. In Figure 8, the last constituent parityerror takes 8 cycles and the next one, readparity, takes 14 cycles. The relation between the two constituents is "&", which means parityerror starts at the same time that readparity starts, and the response time is calculated as:
max{parityerror, readparity} = 14    (2)
Repeat this calculation up to the top constituent and we obtain the response time of the readtransaction.

word:
  SERIAL :                                   (SERIAL is calculated from the top constituent)
    readcommand : a{8} b{3} c ;              12
    readparity  : ... ;                      14
    parityerror : ... ;                      6
  PARALLEL :                                 (PARALLEL is calculated from the last constituent)
    readtransaction : readcommand &/ readparity & parityerror ;
endword
  max(14, 6) = 14
  readtransaction = 12 + 14 = 26
Fig. 8. An example of the response time calculator.

TABLE I
PARALLEL NOTATIONS AND CLOCK CYCLES.
(x and y are the clock cycles of "a" and "b", respectively.)
Notation | Meaning                                       | Cycles
a#b      | "b" starts at the same time that "a" ends.    | x + y
a#/b     | "b" starts after "a" ends.                    |
a&b      | "b" starts at the same time that "a" starts.  | max{x, y}
a&/b     | "b" starts after "a" starts.                  | x + y

The proposed algorithms calculate the maximum number of clock cycles an operation could take. Therefore, the result is more precise when the processing time does not depend on the data being processed. In addition, the more detailed the designer's CWL description is, the more precise the calculation result becomes.
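The two steps of the calculation can be illustrated with a short Python sketch. It is a simplified model rather than a CWL parser: SERIAL words are given as lists of (cycles per symbol, repetitions) pairs, a PARALLEL word is given as a first constituent followed by (notation, cycles) pairs, and the function names and the `COMBINE` table are assumptions made for the example. Only the notations whose cycle formulas appear in the worked example and Table I ("#", "&" and "&/") are included.

```python
from operator import add

# Cycle formula for each parallel notation (x for "a", y for "b").
COMBINE = {
    "#": add,    # x + y
    "&/": add,   # x + y (worst case)
    "&": max,    # max{x, y}
}


def serial_cycles(symbols):
    """Step 1: sum the cycles of a SERIAL word.

    `symbols` is a list of (cycles_per_symbol, repetitions) pairs.
    """
    return sum(cycles * reps for cycles, reps in symbols)


def parallel_cycles(first, rest):
    """Step 2: evaluate a PARALLEL word from the last constituent backwards.

    `first` is the cycle count of the first constituent; `rest` is a list
    of (notation, cycles) pairs for the constituents that follow it.
    """
    values = [first] + [cycles for _, cycles in rest]
    notations = [notation for notation, _ in rest]
    total = values[-1]
    for notation, value in zip(reversed(notations), reversed(values[:-1])):
        total = COMBINE[notation](value, total)
    return total


# readcommand = 1 x 8 + 1 x 3 + 1, as in equation (1)
readcommand = serial_cycles([(1, 8), (1, 3), (1, 1)])
# readtransaction : readcommand &/ readparity & parityerror,
# using the cycle counts quoted in the text (14 and 8)
readtransaction = parallel_cycles(readcommand, [("&/", 14), ("&", 8)])
```

Here `readcommand` evaluates to 12 and `readtransaction` to 12 + max(14, 8) = 26, which matches the result of the worked example.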
B. HW IP Auto-selector 
We also propose a hardware IP auto-selector, which selects a suitable IP from several hardware IPs having the same functions but different performances. The objective is to minimize the total area of the synthesized processor core and the hardware IPs. The inputs of the hardware IP auto-selector are:
1. a timing constraint of the application,
2. the fastest assembly code obtained from the compiler,
3. profile information from the profiler, and
4. the data of reusable hardware IPs (area, and response time obtained from the response time calculator).
The configuration of the processor core obtained from the processor core HW/SW partitioner varies with the hardware IP. The hardware IP auto-selector feeds the configuration of the processor core back to the processor core HW/SW partitioner and finds the optimal configuration. Since each trial requires a run of the partitioner, we propose an algorithm that minimizes the number of trials.
Figure 9 shows the flow of the hardware IP auto-selector. The selector first reduces the hardware IP candidates from the hardware IP database by a candidate reduction algorithm. Then it selects one hardware IP from the remaining candidates by a selection algorithm.

Fig. 9. A flow of the hardware IP auto-selector.
Candidates Reduction Algorithm The hardware IP auto-selector obtains the number of hardware-IP-instructions from the input profile information. Let us consider several hardware IPs which have the same functions. Suppose the i-th hardware IP includes n operations, and let k be one of the n operations. Since the total processing time of a hardware IP is (response time) x (number of instructions), the total processing time S_i of the i-th hardware IP is defined as:
S_i = \sum_{k=1}^{n} T_{ik} \times I_{ik}    (3)
where T_{ik} is the response time of the k-th operation in the i-th hardware IP, and I_{ik} is the total number of times the k-th operation of the i-th hardware IP is executed in the application.
The auto-selector arranges all the hardware IPs in ascending order of area and numbers them. It arranges all combinations of the hardware IPs if the designer decides to use two or more hardware IPs. S_i given by equation (3) is expected to be in the reverse (descending) order, because a larger hardware IP is expected to be faster. If two hardware IPs, j and j - 1, have the relation S_{j-1} < S_j, then the j-th hardware IP is eliminated, since it is both larger and slower than the (j-1)-th.
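The reduction step can be sketched in a few lines of Python. The IP names, areas and cycle counts in the usage example are made-up values, and the function names are hypothetical. One detail is an assumption: each IP is compared with the last IP that was kept rather than with its immediate predecessor in the sorted list, which gives the same result whenever the predecessor was itself kept.

```python
def total_processing_time(response_times, counts):
    """Equation (3): sum over operations of response time x execution count."""
    return sum(response_times[op] * counts[op] for op in counts)


def reduce_candidates(ips):
    """Drop every IP that is larger and slower than a smaller kept IP.

    `ips` is a list of (name, area, total_processing_time) tuples.
    """
    kept = []
    for ip in sorted(ips, key=lambda item: item[1]):  # ascending area
        if kept and kept[-1][2] < ip[2]:
            continue  # S of the smaller IP is less: eliminate this one
        kept.append(ip)
    return kept


# Hypothetical candidates: (name, area, S)
candidates = [("A", 10.0, 100), ("B", 20.0, 80), ("C", 30.0, 90), ("D", 40.0, 50)]
survivors = [name for name, _, _ in reduce_candidates(candidates)]
```

In this example IP C is eliminated because the smaller IP B already has a shorter total processing time, so `survivors` is `["A", "B", "D"]`.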
Selection Algorithm After hardware IPs are eliminated by the reduction algorithm, the auto-selector searches for the combination of processor core and hardware IPs with the minimum total area.
The hardware IP candidates have a trade-off between area and performance. The selection algorithm searches for the minimum-area hardware IP (or combination) that satisfies the timing constraint by binary search. The key of the search is the area of the hardware IP (or the total area of the combination).
Figure 10 shows the selection algorithm. The auto-selector selects a suitable hardware IP (or combination) in log_2 n trials if the number of hardware IP candidates is n.
IV. EXPERIMENTAL RESULT 
In the experiment, we use two application examples: a JPEG encoder and a 3D animation.
They were partitioned into hardware/software parts. The hardware part was the DCT in the JPEG encoder and the hardware T&L in the 3D animation. Then we applied the proposed processor core synthesis system to these applications.
Input: a timing constraint, assembly code obtained from the compiler, profile information and data of the hardware IP candidates.
Output: assembly code, processor core architecture, instruction set and selected hardware IP.
Step1-1. Calculate the total area of all combinations of hardware IP candidates and arrange them in ascending order.
Step1-2. Let x_i be the total area of the i-th combination (C_i), n be the number of combinations and C_k be the target combination. The initial k is n/2.
Step1-3. Let t be the number of trials; the initial t is 0. Let a be the minimum total area of the processor core and the combination of hardware IPs found in the trial process, and let T be the data (assembly code, processor core architecture, instruction set and data of hardware IP) of that trial.
Step2. Add 1 to t. If t > log_2 n, go to Step3. Input the data of C_k, together with the timing constraint and the execution profile, to the processor core HW/SW partitioner.
  1. If no solution which satisfies the timing constraint is obtained, replace k with k = k + n/2^t and repeat Step2.
  2. If a solution which satisfies the timing constraint is obtained, let y_t be the sum of the area of the obtained processor core and x_k, and
    (a) if t = 1 or y_t < a, replace k and a with k = k - n/2^t and a = y_t, respectively, and repeat Step2.
    (b) if y_t > a, replace k with k = k + n/2^t and repeat Step2.
Step3. Output T and finish searching.
Fig. 10. A hardware IP selection algorithm.
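The search in Fig. 10 can be approximated by the following Python sketch. It follows the same decisions (move to a larger candidate when the timing constraint cannot be met or the total area does not improve, move to a smaller one when it improves), but it uses conventional low/high index bounds instead of the step-size update of Fig. 10. The `partition` callback stands in for the processor core HW/SW partitioner and is hypothetical: it returns the synthesized core area, or `None` when no core meets the timing constraint with that IP.

```python
def select_hw_ip(candidates, partition):
    """Binary-search-style selection over candidates sorted by area.

    `candidates` is a list of (name, ip_area) pairs. Returns the best
    (total_area, name) found, or None, together with the trial count.
    """
    ordered = sorted(candidates, key=lambda item: item[1])
    low, high = 0, len(ordered) - 1
    best = None
    trials = 0
    while low <= high:
        mid = (low + high) // 2
        name, ip_area = ordered[mid]
        trials += 1
        core_area = partition(name)  # one run of the partitioner
        if core_area is None:
            low = mid + 1            # too slow: try a larger, faster IP
            continue
        total = core_area + ip_area
        if best is None or total < best[0]:
            best = (total, name)
            high = mid - 1           # improved: try a smaller IP
        else:
            low = mid + 1            # not improved: try a larger IP
    return best, trials


# Hypothetical data: IPs A and B are too slow to meet the constraint.
core_areas = {"A": None, "B": None, "C": 5.0, "D": 3.0, "E": 2.5, "F": 2.0}
ips = [("A", 10.0), ("B", 20.0), ("C", 30.0), ("D", 40.0), ("E", 50.0), ("F", 60.0)]
best, trials = select_hw_ip(ips, core_areas.get)
```

With these six candidates the sketch needs three partitioner runs and returns IP C with a total area of 35.0, consistent with the roughly log_2 n trials stated for the selection algorithm.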
A. JPEG Encoder 
We gave several timing constraints to the application. Table II shows the area, execution time and hardware configurations of the synthesized processor core.
The JPEG encoding system consists of four major parts:
Image Fragmentation 
DCT (Discrete Cosine Transform) 
Quantization 
Huffman Coding 
In the design framework shown in Figure 1, we partitioned these parts into hardware/software parts and decided that the DCT would be implemented by a hardware IP. We used the Xilinx [9] 2-D DCT as the hardware IP. The response time of the hardware IP obtained by the calculation algorithm was 285 cycles.
B. 3D-Animation 
We prepared six types of T&L hardware IP, shown in Table III, in order to validate the effectiveness of the hardware IP auto-selector.
Table IV shows the area, execution time and hardware configurations of the synthesized processor core under several timing constraints.
C. Discussion
In order to compare our results with a general-purpose processor, we prepared a RISC processor having a five-stage pipeline, 32 32-bit registers, a 32-bit instruction set and a multiplier. We applied this processor core to the target architecture shown in Figure 2.
TABLE II
AREA, EXECUTION TIME AND HARDWARE CONFIGURATIONS OF SYNTHESIZED PROCESSOR CORE (JPEG ENCODER).
Timing constraint | Execution time | Hardware configuration
120.0             | 113.740        |
130.0             | 129.799        |
150.0             | 146.212        |
170.0             | 169.273        |
175.0             | 174.058        |
TABLE IV
AREA, EXECUTION TIME AND HARDWARE CONFIGURATIONS OF SYNTHESIZED PROCESSOR CORE (3D-ANIMATION).
Timing constraint | Execution time | System area | Processor area | Hardware configuration  | HW IP
24.0              | 23.751         | 40.188014   | 4.795864       | RISC ALU*2,Mult*3       | 19
28.0              | 27.491         | 21.945339   | 2.820687       | RISC ALU*1,Mult*1       | 12
30.0              | 29.852         | 21.878155   | 2.753503       | RISC ALU*1,Mult*1       | 11
34.0              | 33.788         | 21.217409   | 2.092757       | RISC ALU*1,Mult*1       | 13
40.0              | 39.117         | 21.122514   | 1.997862       | DSP ALU*1,Mult*1        | 14
TABLE III
THE SPECIFICATION OF T&L HARDWARE IPS.
Name | Area       | Response time
     | 67.927146  |
     | 100.462142 | 34
     | 198.067130 | 29
TABLE V
THE AREA AND EXECUTION TIME OF GENERAL PURPOSE RISC PROCESSOR.
Application  | Area [µm2] | Execution time [ms]
JPEG encoder | 2,107,831  | 225.621
3D animation | 2,107,831  | 41.829
Table V shows the area and the execution time of the general-purpose processor core. In the 3D-animation, we use Hardware IP A in Table III.
Comparing Tables II and IV with Table V, the execution time of the JPEG encoder was reduced by 42% and that of the 3D-animation by 20% relative to the general-purpose processor core, although the area was almost the same.
In Table IV, Hardware IP A was selected in most cases. Because the synthesized processor core and the hardware IP can run in parallel, high performance of the hardware IP is not required for all applications. Under severe timing constraints, Hardware IP B was selected. Therefore, the hardware IP selector selects a suitable hardware IP according to the timing constraint of the application.
Overall, the experimental results demonstrate that our processor core synthesis system effectively generates synthesizable processor cores based on application programs.
V. CONCLUSION 
This paper proposed a processor core synthesis system for IP-based SoCs. This paper also proposed two key techniques for the system: a hardware IP response time calculation algorithm and a hardware IP auto-selection algorithm. The experimental results demonstrate that the system synthesizes processor cores effectively according to the features of an application program, and that the synthesized processor cores have higher performance than general-purpose processors.
Our future work concerns the interface between the processor core and hardware IPs. It is difficult to find hardware IPs that conform to the definition in Section II.C. We will build an interface synthesis system for the proposed architecture.
ACKNOWLEDGEMENTS 
We would like to thank Dr. Kei Suzuki and Mr. Hiroshi Ara at Hitachi, Ltd., Central Research Laboratory for many interesting discussions and suggestions which helped shape this paper.
REFERENCES
[1] I. J. Huang and A. M. Despain, "Synthesis of instruction sets for pipelined microprocessors," in Proc. 31st DAC, pp. 5-11, 1994.
[2] J. A. Rowson and A. Sangiovanni-Vincentelli, "Interface-Based Design," in Proceedings of the Design Automation Conference, pp. 178-183, June 1997.
[3] http://www.systemc.org/
[4] NEC, http://www.ic.nec.co.jp/micro/micro.html
[5] http://www.arm.com/
[6] N. Togawa, M. Yanagisawa, and T. Ohtsuki, "A hardware/software cosynthesis system for digital signal processor cores," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E82-A, no. 11, 1999.
[7] M. Hamabe, A. Nose, N. Togawa, M. Yanagisawa and T. Ohtsuki, "A generation system for hardware description of pipelined processors," Technical Report of IEICE, VLD97-117, ICD97-222, 1998 (in Japanese).
[8] Hitachi, Ltd., http://koigakubo.hitachi.co.jp/-sl/cxl/html/index.htm.
[9] Xilinx Inc., "2-D Discrete Cosine Transform v2.0," http://www.xilinx.com/.
