3D-SoftChip: A novel 3D vertically integrated adaptive computing system [thesis] by Kim, Chul
Edith Cowan University 
Research Online 
Theses: Doctorates and Masters Theses 
1-1-2005 
3D-SoftChip: A novel 3D vertically integrated adaptive computing 
system 
Chul Kim 
Edith Cowan University 
Recommended Citation 
Kim, C. (2005). 3D-SoftChip: A novel 3D vertically integrated adaptive computing system. 
https://ro.ecu.edu.au/theses/656 

































USE OF THESIS 
 
 
The Use of Thesis statement is not included in this version of the thesis. 
3D-SoftChip: 
A Novel 3D Vertically Integrated Adaptive Computing System 
A Dissertation 
Presented to the School of Engineering and Mathematics 
Edith Cowan University 
Western Australia 
In partial fulfillment of the requirements for the degree of 
Master of Engineering Science 
Supervisor: Dr. Alexander Rassau 
Submission Date: June 2005 
by 
Chul KIM 
Dedication 
To my fiancée Sang-Mi Hyun, 
my father Nam-Gil Kim, 
my mother Sung-Sun Park, 
my brother-in-law Sun-Shin Lee, 
and my sisters Hee-Joung, Su-Joung, Youn-Joung Kim. 
©2005 
Chul KIM 
All Rights Reserved 
Table of Contents 
USE OF THESIS
DECLARATION
ACKNOWLEDGMENTS
ABSTRACT
PUBLICATIONS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 3D VERTICALLY INTEGRATED SYSTEMS OVERVIEW
1.2 ADAPTIVE COMPUTING SYSTEMS OVERVIEW
1.2.1 Adaptive Computing Systems
1.2.1.1 The Need for Adaptive Computing Systems
1.2.1.2 The Concept of Adaptive Computing Systems
1.2.2 Classification of Adaptive Computing Systems
1.2.2.1 Previous Works
1.2.2.2 MorphoSys Vs 3D-SoftChip
1.3 MOTIVATION OF THESIS
1.4 SCOPE OF THESIS
1.4.1 Scope of Each Chapters
1.5 CONCLUSIONS
2. SYSTEM ARCHITECTURE OF 3D-SOFTCHIP
2.1 CORE TECHNOLOGY FOR 3D-SOFTCHIP
2.2 OVERALL ARCHITECTURE OF 3D-SOFTCHIP
2.3 FEATURES OF 3D-SOFTCHIP
2.4 SYSTEM COMPONENTS
2.4.1 Configurable Array Processor (CAP) Chip
2.4.1.1 Heterogeneous Types of PEs
2.4.2 Intelligent Configurable Switch (ICS) Chip
2.4.2.1 Switch Block
2.4.2.2 ICS_RISC
2.4.2.3 Data Frame Buffer
2.4.2.4 Program Memory
2.4.2.5 Data Memory
2.4.2.6 DMA Controller
2.4.2.7 3D Interconnection Technology
2.5 DESIGN GUIDELINES
2.6 DESIGN METHODOLOGY
2.6.1 Suggested HW/SW Co-design and Verification Methodology
2.7 CONCLUSIONS
3. ARCHITECTURE OF CAP CHIP
3.1 OVERALL ARCHITECTURE OF CAP CHIP
3.2 TWO TYPES OF PROCESSING ELEMENT (PE)s
3.2.1 Standard-PE (S-PE)
3.2.2 Processing Accelerator-PE (PA-PE)
3.3 PE FUNCTIONS
3.3.1 Standard-PE Functions
3.3.2 Processing Accelerator-PE Functions
3.3.3 PE Instruction Formats and Operation Modes
3.4 EMBEDDED LOCAL SRAM
3.5 CONFIGURABLE NATURE OF ARITHMETIC PRIMITIVES
3.5.1 Scalable Parallel Multiplier Cell
3.6 QUAD-PE
3.7 UNIT CAP CHIP ARCHITECTURE
3.8 CONCLUSIONS
4. ARCHITECTURE OF ICS CHIP
4.1 SWITCH BLOCK
4.2 ICS_RISC
4.2.1 Features of ICS_RISC
4.2.2 System Components of ICS_RISC
4.2.3 Types of Instruction Set
4.2.4 ICS_RISC Instruction Set Architecture, Version 1.0
4.3 HIGH BANDWIDTH DATA INTERFACE UNIT
4.4 CONCLUSIONS
5. ARCHITECTURE OF UNITCHIP
5.1 UNITCHIP ARCHITECTURE
5.2 PIPELINED OPERATION MECHANISM OF UNITCHIP
5.3 AREA ESTIMATIONS AND CONSTRAINTS
5.4 CONCLUSIONS
6. INTERCONNECTION NETWORK
6.1 HIERARCHICAL INTERCONNECTION ARCHITECTURE
6.1.1 PE and Switch Block Array Interconnection
6.1.1.1 Programmable Nature of PE Array Interconnection
6.1.2 Indium Bump Interconnection
6.2 CONCLUSIONS
7. HIGH-LEVEL MODELING OF 3D-SOFTCHIP USING SYSTEMC
7.1 SYSTEMC OVERVIEW
7.1.1 CAD Environment for SystemC
7.2 SYSTEM-LEVEL MODELING OF 3D-SOFTCHIP
7.2.1 Standard-PE
7.2.2 Processing Accelerator-PE
7.2.3 ICS_RISC
7.2.4 UnitChip
7.3 CONCLUSIONS
8. APPLICATION MAPPING FOR 3D-SOFTCHIP
8.1 FULL SEARCH BLOCK MATCHING ALGORITHM (FBMA)
8.2 FBMA MAPPING METHOD FOR 3D-SOFTCHIP
8.3 PERFORMANCE ANALYSIS
8.4 CONCLUSIONS
9. CONCLUSIONS
9.1 CONTRIBUTIONS
9.2 FUTURE WORK
BIBLIOGRAPHY
APPENDIX A - ICS_RISC ISA VERSION 1.0
APPENDIX B - HIGH-LEVEL MODELING OF 3D-SOFTCHIP USING SYSTEMC
APPENDIX B - SYSTEMC CODES
Declaration 
I certify that this thesis does not incorporate without acknowledgement any material 
previously submitted for a degree or diploma in any institution of higher education; and 
to the best of my knowledge and belief it does not contain any material previously 
published or written by another person except where due reference is made in the text. 
Acknowledgements 
I would like to express my gratitude to the following people, who helped me to reach 
this position. 
Prof. Kamran Eshraghian, as my principal supervisor, initiated the research 
program and gave me the opportunity to commence my master's course at Edith Cowan 
University, providing financial support and great inspiration for my research. 
Unfortunately, he left the university towards the end of my research; however, he left 
a significant impression of the great leadership that I want to follow. 
Prof. Mike Myung-Ok Lee, who inspired me to study overseas and gave me the 
opportunity and warm supervision during my course. I have learned strong drive 
and passion through his supervision. 
Prof. Byung-Lok Cho, who was my supervisor during my undergraduate study. I 
started out under his great supervision and learned the life and beliefs of an electronic 
engineer. I will not forget his guidance, which has changed my whole life. 
Dr. Alexander Rassau, my principal supervisor. I am really a lucky fellow to have had him 
as a principal supervisor. Sometimes he became my ears, mouth, hands and legs. I cannot 
forget his infinite interest and capacity for supervising me. I could not have finished my 
course without his great supervision, which will never be forgotten and is very much 
appreciated. Thanks, Dr. Alexander Rassau. 
My family is the most precious thing in my life. They motivated and encouraged me 
unfailingly, so my deepest gratitude goes to my family and I dedicate my dissertation to 
my family: my father, my mother, my brother-in-law, my sisters and my future new 
family: my father-in-law, mother-in-law and my new brother-in-law. 
Lastly, my fiancée Sang-Mi: she pushed me to study hard but, ironically, she gave me 
so many interruptions as well. But I love even these interruptions. I promise that I 
will be a good husband and I will love you forever. 
ABSTRACT 
At present, as we enter the nano and giga-scaled integrated-circuit era, there are many 
system design challenges which must be overcome to resolve problems in current systems. 
The incredibly increased nonrecurring engineering (NRE) cost, abruptly shortened 
Time-to-Market (TTM) period and ever widening design productivity gaps are good examples 
illustrating the problems in current systems. To cope with these problems, the concept of 
an Adaptive Computing System is becoming a critical technology for next generation 
computing systems. The other big problem is an explosion in the interconnection wire 
requirements in standard planar technology resulting from the very high data-bandwidth 
requirements demanded for real-time communications and multimedia signal processing. 
The concept of 3D-vertical integration of 2D planar chips becomes an attractive solution 
to combat the ever increasing interconnect wire requirements. As a result, this research 
proposes the concept of a novel 3D integrated adaptive computing system, which we term 
3D-ACSoC. The architecture and advanced system design methodology of the proposed 
3D-SoftChip as a forthcoming giga-scaled integrated circuit computing system have been 
introduced, along with high-level system modeling and functional verification in the early 
design stage using SystemC. 
A major challenge in this research is to explore the proposed 3D-SoftChip platform to 
investigate the effectiveness of the first novel 3D vertically integrated Adaptive 
Computing System-on-Chip (ACSoC) as a next generation computing system. The 
suggested 3D-SoftChip has been modeled at a system level using SystemC and the 
modeled system has been functionally verified. The hand-crafted 
assembler code for implementation of the MPEG4 motion estimation algorithm has been 
applied with more than 3.8 times performance improvement over conventional systems. It 
can be clearly demonstrated that it is a highly suitable architecture for next generation 
computing systems. Finally, further work to realize the full implementation of the novel 
concept of a 3D-ACSoC has been suggested. 
Publications 
The following is a list of papers published during the course of this research. 
International Journals 
Chul Kim, Alex Rassau, Stefan Lachowicz, Mike Myung-ok Lee and Kamran Eshraghian 
(2005) - "3D-SoftChip: A Novel Architecture for Next Generation Adaptive Computing 
Systems", to be published in EURASIP Journal on Applied Signal Processing. 
Chul Kim, Alex Rassau, Stefan Lachowicz, Mike Myung-ok Lee and Kamran Eshraghian 
(2005) - "3D-SoftChip: A System-level Verification and Characterization of a Novel 
3D Vertically Integrated Adaptive Computing System-on-Chip", submitted to IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and Systems. 
International Conferences 
Chul Kim, Mike Myung-ok Lee, Kamran Eshraghian and Byung Lok Cho (2004) - 
"SoC-B Design and Testing Technique of IS-95C CDMA Transmitter for 
Measurement of Electronic Field Intensity using FPGA and ASIC", 2nd IEEE 
International Workshop on Electronic Design, Test and Applications (DELTA2004), Perth, Australia, 
pp.251-254. 
Chul Kim, Mike Myung-ok Lee, Kamran Eshraghian and Byung Lok Cho (2005) - "A 
Highly Accurate Electric Field Intensity Measurement System for IMT2000 and 
CDMA Network", accepted at Advanced Industrial Conference on Telecommunication (AICT2005), 
Lisbon, Portugal. 
Chul Kim, Alex Rassau, Mike Myung-ok Lee and Kamran Eshraghian (2005) - 
"3D-SoftChip: A Novel 3D Vertically Integrated Adaptive Computing System", 13th ACM 
International Symposium on Field-Programmable Gate Arrays (FPGA2005), Monterey, California, 
U.S.A., Poster Section, pp.270. 
Chul Kim, Alex Rassau, Mike Myung-ok Lee and Kamran Eshraghian (2005) - 
"3D-SoftChip: A Novel 3D Vertically Integrated Adaptive Computing System", submitted to 
IFIP International Conference on Very Large Scale Integration (IFIP VLSI-SOC 2005). 
Chul Kim, Alex Rassau, Mike Myung-ok Lee and Kamran Eshraghian (2005) - 
"High-level System Modeling and Functional Verification of a 3D-SoftChip Adaptive 
Computing System using SystemC", submitted to IEEE International Conference on Computer 
Design (ICCD 2005). 
List of Figures 
FIGURE 1.1: 3D-SOFTCHIP PHYSICAL ARCHITECTURE
FIGURE 1.2: 3D-SOFTCHIP: A NOVEL 3D VERTICALLY INTEGRATED ADAPTIVE COMPUTING SYSTEM-ON-CHIP
FIGURE 1.3: COMPUTING SYSTEMS
FIGURE 1.4: AN EXAMPLE OF "DO-IT-ALL" DEVICE
FIGURE 2.1: CORE TECHNOLOGY FOR 3D-SOFTCHIP
FIGURE 2.2: OVERALL ARCHITECTURE OF 3D-SOFTCHIP
FIGURE 2.3: COMPUTATION ALGORITHM: 3 TYPES OF SIMD COMPUTATION MODELS (A) MASSIVELY 
PARALLEL SIMD COMPUTATIONAL MODEL, (B) MULTITHREADED SIMD COMPUTATIONAL MODEL, 
(C) PIPELINED SIMD COMPUTATIONAL MODEL
FIGURE 2.4: WORD-LENGTH CONFIGURATION ALGORITHM (A) 8-BIT CONFIGURATION, (B) 16-BIT 
CONFIGURATION, (C) 32-BIT CONFIGURATION
FIGURE 2.5: SUGGESTED HW/SW CO-DESIGN AND VERIFICATION METHODOLOGY
FIGURE 3.1: TYPES OF PEs (A) HOMOGENEOUS TYPE, (B) HETEROGENEOUS TYPE, (C) HETEROGENEOUS TYPE 
WITH DEDICATED FUNCTIONS FOR SPECIAL PURPOSE
FIGURE 3.2: TWO TYPES OF PE (A) STANDARD-PE, (B) PROCESSING ACCELERATOR-PE
FIGURE 3.3: PE INSTRUCTION FORMATS (A) STANDARD-PE INSTRUCTION FORMAT, (B) PROCESSING-
ACCELERATOR-PE INSTRUCTION FORMAT
FIGURE 3.4: PE ARRAY OPERATION MODES (A) HORIZONTAL MODE, (B) VERTICAL MODE, (C) CIRCULAR 
MODE
FIGURE 3.5: A GENERIC 1 X 1-BIT MULTIPLIER CELL FOR N=1
FIGURE 3.6: 8 X 8 MULTIPLIER USING 4-BIT GENERIC CELLS
FIGURE 3.7: QUAD-PE
FIGURE 3.8: UNIT CAP CHIP ARCHITECTURE
FIGURE 4.1: ARCHITECTURE OF SWITCH BLOCK: A 6-SIDED SWITCH BLOCK, 7-SIDED SWITCH BLOCK AND 8-
SIDED SWITCH BLOCK
FIGURE 4.2: ARCHITECTURE OF ICS_RISC 32-BIT DEDICATED CONTROL PROCESSOR
FIGURE 4.3: A DETAILED ARCHITECTURE OF ICS_RISC
FIGURE 4.4: DMA CONTROLLER ARCHITECTURE AND INSTRUCTIONS FOR DMA CONTROLLER
FIGURE 5.1: OVERALL ARCHITECTURE OF UNITCHIP
FIGURE 6.1: THREE HIERARCHICAL INTERCONNECTION NETWORKS: (A) PE ARRAY INTERCONNECTION 
NETWORK: 2D-MESH INTERCONNECTION FOR LOCAL INTERCONNECTION, (B) SWITCH BLOCK ARRAY 
INTERCONNECTION NETWORK: 2D-MESH INTERCONNECTION FOR LONG INTERCONNECTION, (C) INDIUM 
BUMP INTERCONNECTION: SINGLE INDIUM BUMP AFTER REFLOW
FIGURE 6.2: QUAD-PE AND PROGRAMMABLE INTERCONNECT ARCHITECTURE
FIGURE 6.3: 3D FLIP-CHIP WAFER BONDING TECHNOLOGY USING INDIUM BUMP INTERCONNECTION 
ARRAYS
FIGURE 7.1: SYSTEM DESIGN METHODOLOGY: (A) CONVENTIONAL DESIGN METHODOLOGY, (B) SYSTEMC 
DESIGN METHODOLOGY
FIGURE 7.2: THE CAD ENVIRONMENT FOR SYSTEMC: VISUAL C++ VERSION 6.0, GTKWAVE WAVEFORM 
VIEWER
FIGURE 7.3: HIGH-LEVEL MODELING OF S-PEs: (A) S-PE BLOCK DIAGRAM, (B) FILE STRUCTURE OF S-PE
FIGURE 7.4: THE OUTPUT WAVEFORM OF S-PE
FIGURE 7.5: HIGH-LEVEL MODELING OF PA-PEs: (A) PA-PE BLOCK DIAGRAM, (B) FILE STRUCTURE OF PA-
PE
FIGURE 7.6: THE OUTPUT WAVEFORM OF PA-PE
FIGURE 7.7: HIGH-LEVEL MODELING OF ICS_RISC: (A) ICS_RISC BLOCK DIAGRAM, (B) FILE STRUCTURE 
OF ICS_RISC
FIGURE 7.8: THE PSEUDO CODE FOR ICS_RISC
FIGURE 7.9: THE OUTPUT WAVEFORM OF ICS_RISC
FIGURE 7.10: THE INSTRUCTION INDEX
FIGURE 7.11: HIGH-LEVEL MODELING OF UNITCHIP: (A) UNITCHIP BLOCK DIAGRAM, (B) FILE STRUCTURE 
OF UNITCHIP
FIGURE 7.12: THE OUTPUT WAVEFORM OF UNITCHIP
FIGURE 8.1: BLOCK MATCHING MOTION ESTIMATION
FIGURE 8.2: MAPPING METHOD FOR FULL SEARCH BLOCK MATCHING AND DATA FLOW
FIGURE 8.3: PERFORMANCE COMPARISON FOR MOTION ESTIMATION
FIGURE 9.1: THE PARAMETERIZED MEMORY MODELING EXAMPLE USING SYSTEMC
List of Tables 
TABLE 1.1: 3D FABRICATION TECHNOLOGIES
TABLE 1.2: RECONFIGURABLE COMPUTING VS ADAPTIVE COMPUTING
TABLE 1.3: RECONFIGURABLE AND ADAPTIVE COMPUTING SYSTEMS
TABLE 1.4: COMPARISON OF MORPHOSYS WITH 3D-SOFTCHIP
TABLE 3.1: CHARACTERISTICS OF EACH PE TYPES
TABLE 3.2: CHARACTERISTICS OF THE TWO TYPES OF PE
TABLE 3.3: STANDARD-PE FUNCTIONS
TABLE 3.4: PROCESSING ACCELERATOR-PE FUNCTIONS
TABLE 4.1: TYPES OF INSTRUCTION SET
TABLE 4.2: INSTRUCTION SET SUMMARY (ICS_RISC ISA VERSION 1.0)
TABLE 5.1: PIPELINED UNITCHIP OPERATION MECHANISM
TABLE 5.2: AREA ESTIMATION AND CONSTRAINT OF UNITCHIP (TARGET TECHNOLOGY: 0.13UM PROCESS)




System design is becoming increasingly challenging as the complexity of integrated 
circuits and time-to-market pressures relentlessly increase. Adaptive computing is a 
critical technology to develop for future computing systems because, owing in no small 
part to its potential for wide applicability, it can resolve most of the problems that 
system designers now face. Up until now, however, this concept has not been fully 
realized because of constraints such as chip real-estate limitations and software 
complexity. With advancements in semiconductor processing technology and software 
technology, however, adaptive computing is now facing a turning point. For instance, 
the concept of reconfigurable computing has recently started to receive considerable 
research attention [2, 3, 7], and this concept is now starting to expand into the realm 
of adaptive computing. Software-defined virtual hardware [9] and "Do-it-all" devices 
[12] are good examples that demonstrate this development direction for computing 
systems. 
Another growing problem in advanced computation systems, particularly for real-time 
communication or video processing applications, is the data bandwidth necessary to 
satisfy the processing requirements. A novel 3D integration system such as 3D SoC [24] 
or 3D-SoftChip [14,15], which is able to satisfy the severe demand for more computation 
throughput by effectively manipulating the functionality of hardware primitives through 
vertical integration of two 2D chips, is another concept proposed for next generation 
computing systems. This research explores the proposed 3D-SoftChip platform to 
investigate the effectiveness of the first novel 3D vertically integrated Adaptive 
Computing System-on-Chip (3D-ACSoC) as a next generation computing system. This 
thesis outlines research into the system level design and functional verification of the 
3D-SoftChip in the initial stage of development of the novel 3D vertically integrated 
ACSoC. 
Figure 1.1: 3D-SoftChip Physical Architecture 
Figure 1.1 illustrates the physical architecture of the 3D-SoftChip, comprising the 
vertical integration of two 2D chips. The upper chip is the Intelligent Configurable 
Switch (ICS). The lower chip is the Configurable Array Processor (CAP). 
Interconnection between the two 2D chips is achieved via Indium bump interconnections. 
As the starting point for our 3D mapping, the 2D plane architecture of the 3D-SoftChip 
is also illustrated in Figure 1.2 in order to demonstrate the principle. 
Figure 1.2: 3D-SoftChip: A novel 3D vertically integrated adaptive computing system-on-chip 
1.1 3D Vertically Integrated Systems Overview 
During the past few years, there has been significant research demand for 3D 
vertically integrated systems due to ever-growing wiring requirements, which are fast 
becoming the major bottleneck for future gigascale integrated systems [23,24]. In Very 
Deep Submicron silicon geometries, standard planar technology has many drawbacks in 
areas such as performance and reliability, caused by limitations in the wiring. Moreover, 
the data bandwidth requirements for next generation computing systems are becoming 
ever larger. To overcome these problems, the concepts of the 3D-SoC and the 
3D-SoftChip have been developed, which exploit the vertical integration of two or more 
2D planar chips to effectively manipulate computation throughput. Previous work has 
shown that the 3D integration of systems can significantly reduce interconnection 
requirements [25]. As described by Joyner et al. [25], 3D system integration offers a 
3.9 times increase in wire-limited clock frequency, an 84% decrease in wire-limited area 
or a 25% decrease in the number of metal levels required per stratum. There are three 
feasible 3D integration methods: stacking of packages, stacking of ICs and Vertical 
System Integration, as introduced by IMEC [23]. There are four main enabling 
technologies for the fabrication of 3D integrated circuits: Beam Recrystallization, 
Silicon Epitaxial Growth, Solid Phase Crystallization and Processed Wafer Bonding [26]. 
Table 1.1 shows the main characteristics of each of these 3D fabrication technologies. 
In this research, however, the focus is on the use of processed wafer bonding technology 
using an indium bump interconnection array (IBIA). Wafer bonding technology is adopted 
for this work because the process has particular benefits for applications where each 
chip carries out independent processing. A characteristic of the 3D-SoftChip is that 
each of the two planar chips should be effectively manipulated to maximize computation 
throughput through parallelism. In addition, indium has good adhesion and a low contact 
resistance, and can be readily utilized to achieve an interconnect array with a pitch as 
low as 10 µm. The development of 3D integrated systems should allow improvements in 
packaging cost, performance and reliability, along with a reduction in the size of the 
chips. 
Table 1.1: 3D Fabrication Technologies 
3D Fabrication Technologies     Characteristics 
Beam Recrystallization          Deposit poly-silicon and fabricate Thin-film Transistors (TFTs) 
                                High performance of TFTs 
                                High temperature of melting poly-silicon (not a practical fabrication technology) 
                                Suffers from low carrier mobility 
Silicon Epitaxial (SE) Growth   Epitaxially grow a single crystal Si 
                                High temperature causes degradation in quality of devices 
                                Process not yet manufacturable 
Solid Phase Crystallization     Low temperature alternative to SE 
                                Flexibility of creating multiple layers 
                                Compatible with current processing environments 
                                Useful for stacked SRAM and EEPROM cells 
Processed Wafer Bonding         Bond two fully processed wafers together 
                                Similar electrical properties on all devices 
                                Independent of temperature since all chips are fabricated then bonded 
                                Good for applications where chips do independent processing 
                                Lack of precision (alignment) restricts inter-chip communication to 
                                global metal line 
1.2 Adaptive Computing Systems Overview 
There are three types of computing system currently in existence: the general-purpose 
computing system, the reconfigurable/adaptive computing system and the application 
specific computing system. The general-purpose computing system is based on using a 
general-purpose processor for broad applications. Discrete application specific ICs are 
used in application specific computing systems for dedicated and limited applications. 
These computing systems have certain drawbacks, such as low performance in the case of 
the general-purpose computing system, or extremely limited applicability for the 
application specific computing system. The reconfigurable/adaptive computing system, 
however, allows for an optimum trade-off between flexibility and performance. Because 
of this, reconfigurable/adaptive computing systems are attracting attention as a new 
alternative for the next generation of computing systems. Figure 1.3 illustrates how the 
reconfigurable/adaptive computing system provides an optimum trade-off between 
flexibility and performance. 










Figure 1.3: Computing Systems 
1.2.1 Adaptive Computing Systems 
1.2.1.1 The Need for Adaptive Computing Systems 
The nonrecurring engineering (NRE) costs associated with the design and testing of 
complex chips are one of the most threatening factors in current system design 
approaches. According to the International Technology Roadmap for Semiconductors 
(ITRS), the manufacturing engineering costs of complex chips have reached almost one 
million dollars, and the associated design NRE costs reached almost tens of millions of 
dollars in 2003 [21]. Moreover, product life cycles are getting ever shorter due to 
rapid changes in technology, and as a result the time-to-market (TTM) period is 
shrinking sharply. On the other hand, design and verification cycle times are 
stretching into months or even years. As a consequence of these issues, a 
reconfigurable/adaptive computing system that can be metamorphosed across multiple 
standards and applications becomes very attractive for the next generation of computing 
systems. 
1.2.1.2 The Concept of Adaptive Computing Systems 
A reconfigurable system is one that has reconfigurable hardware resources that can be 
adapted to the application currently under execution, providing the possibility to 
customize across multiple standards and applications. In most of the previous research 
the concepts of reconfigurable and adaptive computing have been described 
interchangeably. In this document, however, these two concepts will be more 
specifically described and differentiated. Adaptive computing will be treated as an 
extended and more advanced form of reconfigurable computing: it includes more advanced 
software technology to effectively manipulate the mapping and scheduling of context 
memory over a wide range of applications, along with more advanced reconfigurable 
hardware resources to support fast and seamless execution across these applications. 
Table 1.2 shows the differences between reconfigurable computing and adaptive 
computing. The benefits of adaptive computing are silicon reuse, post-shipping bug 
fixing, in-market updates that allow for standards evolution, faster TTM and lower 
costs. The reconfiguration capacity allows for significant reuse of silicon. If bugs 
are found post-shipping or standards evolve, the adaptive computing system is easy to 
fix and update simply by changing the contexts in the reconfigurable hardware resource. 
The forthcoming impact of the deployment of adaptive computing is the "Do-it-all" 
device. A small handheld PDA-size device can assume the functionality of about 10 
standard devices, such as a cellular phone, a GPS receiver, an MP3 player, an e-book 
reader, a digital camera, a portable television, a satellite radio or a hand-held 
gaming platform, simply depending on the context programs included. Figure 1.4 shows 
the futuristic concept of "Do-it-all" devices. 
Figure 1.4: An Example of "Do-it-all" Device (*Source: www.chosun.com) 
Table 1.2: Reconfigurable Computing vs Adaptive Computing 
                    Reconfigurable Computing            Adaptive Computing 
Hardware Resources  Linear array of homogeneous         Heterogeneous algorithmic elements 
                    elements (logic gates,              (complete function units such as 
                    look-up tables)                     ALU, multiplier) 
Configuration       Static, dynamic configuration       Dynamic, partial run-time 
                    Slow reconfiguration time           reconfiguration 
Mapping methods     Manual routing, conventional        High-level language (SystemC, C) 
                    ASIC design tools (HDL) 
Characteristics     Large silicon area, low speed       Smaller silicon size, high speed, 
                    (high capacitance), high power      high performance, low power 
                    consumption, high cost              consumption, low cost 
1.2.2. Classification of Adaptive Computing Systems 
Adaptive computing systems are mainly classified in terms of granularity, 
programmability, reconfigurability, computational methods, hardware mapping methods 
and target applications. The granularity is the basic data size of the reconfigurable 
hardware resources. In fine grained systems, the primitive reconfigurable hardware 
resources are typically logic gates, flip-flops and look-up tables, and they operate 
using bit-level computations. The Field Programmable Gate Array (FPGA) and the Complex 
Programmable Logic Device (CPLD) are good examples of fine grained systems. In 
contrast, coarse grained systems have complete function units such as ALUs, 
multipliers and dedicated functional units, and operate using word-level computations. 
The combination of fine grained and coarse grained systems creates a mixed grained 
system. 
The programmability relates to the capacity of the configuration. 
Single-programmability allows only one customization, while multiple-programmability 
allows for customization on-the-fly. Reconfiguration is executed by changing the 
context memory. Static (interrupted execution) and dynamic (in parallel with execution) 
are the two categories of reconfigurability. Common computational methods used in 
adaptive computing systems are Single-Instruction stream Multiple-Data stream (SIMD), 
Multiple-Instruction stream Multiple-Data stream (MIMD) and Very-Long Instruction Word 
(VLIW). The hardware mapping methods vary between systems, from manual routing to 
high-level language compilation. Most of the target applications for adaptive computing 
are in the areas of wired and wireless communications and multimedia digital signal 
processing. 
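The classification axes above can be captured in a small record type. The following Python sketch is only an illustration: the class and field names and the `select` helper are assumptions introduced here, and the two entries are transcribed from the MorphoSys and 3D-SoftChip rows of Table 1.3.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    FINE = "fine"      # bit-level: logic gates, flip-flops, look-up tables
    COARSE = "coarse"  # word-level: ALUs, multipliers, dedicated units
    MIXED = "mixed"    # combination of fine and coarse grained resources


class Reconfigurability(Enum):
    STATIC = "static"    # execution is interrupted while reconfiguring
    DYNAMIC = "dynamic"  # reconfiguration proceeds in parallel with execution


@dataclass(frozen=True)
class AdaptiveSystem:
    # Hypothetical record: one field per classification axis in the text.
    name: str
    granularity: Granularity
    word_bits: int            # basic data size of a reconfigurable element
    multi_programmable: bool  # True: can be customized repeatedly, on-the-fly
    reconfigurability: Reconfigurability
    computation: str
    mapping: str
    target: str


# Entries transcribed from the MorphoSys and 3D-SoftChip rows of Table 1.3.
SYSTEMS = [
    AdaptiveSystem("MorphoSys", Granularity.COARSE, 16, True,
                   Reconfigurability.DYNAMIC, "SIMD",
                   "Assembler, manual P&R",
                   "Data-parallel, image applications"),
    AdaptiveSystem("3D-SoftChip", Granularity.COARSE, 4, True,
                   Reconfigurability.DYNAMIC, "Various computation models",
                   "C-compilation (Assembler, C)",
                   "Comm., multimedia signal processing"),
]


def select(systems, **criteria):
    """Return the systems whose fields equal every given criterion."""
    return [s for s in systems
            if all(getattr(s, key) == value for key, value in criteria.items())]
```

For example, `select(SYSTEMS, word_bits=4)` returns only the 3D-SoftChip entry, which reflects the difference in basic word size between the two systems.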
1.2.2.1 Previous Works 
The research and commercial development of reconfigurable/adaptive computing 
systems has been proceeding vigorously since the early 1990s. Following the 
classification of adaptive computing described above, Table 1.3 [3, 22] classifies the 
best-known existing coarse-grain reconfigurable systems. Fine-grain reconfigurable 
systems have been excluded because they belong to a different category from this 
research. 
MATRIX [1], REMARC [5] and MorphoSys [3] belong to the category of mesh-based 
reconfigurable systems, which combine an array of word-level processing elements with 
a control processor: a multi-granular array of Basic Functional Units (BFUs) in the 
case of MATRIX, an 8 by 8 array of 16-bit nanoprocessors with a MIPS-II RISC processor 
in REMARC, or an 8 by 8 array of reconfigurable cells with MIPS-like processors in 
MorphoSys. These are dynamically reconfigurable architectures with a mesh-based 
hierarchical interconnection fabric. Their application is restricted to DSP-type tasks 
and they have certain disadvantages in terms of power consumption because of frequent 
data movement between the control processor and the processing elements, as well as the 
need to access external memory resources. 
Another category is the linear array-based reconfigurable system, such as RaPiD [6] or 
PipeRench [2]. These are linear arrays of processing elements with row-wide 
interconnection fabrics. Each combination of a processing element array and a row-wide 
interconnection can form a pipeline stage. The target application of these systems is 
the pipelining of regular computation-intensive applications. Other categories such as 
crossbar-based systems [35, 36] and reconfigurable processors [37] have been excluded 
from this table. 
The Triscend A7 [10] is broadly similar to the other mesh-based reconfigurable 
systems; the difference is in the granularity of the processing elements. The A7 has a 
fine-grain reconfigurable fabric, in contrast to the word-level processing elements of 
the mesh-based reconfigurable systems. 
The MRC6011 [11], Adapt2400 [16], DFA1000 [9] and PC102 [17] are up-to-date 
commercially developed adaptive computing systems which, with the exception of the 
DFA1000, have heterogeneous arrays of reconfigurable hardware, and which offer dynamic 
configurability. The main target applications are computation-intensive multimedia DSP 
and communication signal processing. These systems have more advanced adaptive 
computing characteristics than the systems introduced earlier. 
As indicated, the early research and development was into single linear array type 
reconfigurable systems with single and static configuration [8,1,6,5,4,2], but this has 
evolved into large adaptive SoCs with heterogeneous types of reconfigurable hardware 
resources and multiple and dynamic configurability. The MRC6011, Adapt2400, 
DFA1000, PC102 and 3D-SoftChip are good examples of the current research and 
commercial development directions. The ultimate goal for the adaptive computing system 
is currently the "Do-it-all" device, as explained before. 
Table 1.3: Reconfigurable and Adaptive Computing Systems 
System                      Granularity                 Programmability Reconfiguration Computation Method        Mapping                       Target Application 
PADDI [8]                   Coarse (16-bit)             Multiple        Static          VLIW, SIMD                Routing                       DSP applications 
MATRIX [1]                  Coarse (8-bit)              Multiple        Dynamic         MIMD                      Multi-length                  General purpose 
RaPiD [6]                   Coarse (16-bit)             Single          Mostly static   Linear array              Channel routing               Systolic arrays, data-intensive 
REMARC [5]                  Coarse (16-bit)             Multiple        Static          SIMD                      N/A                           Data-parallel applications 
RAW [4]                     Mixed                       Single          Static          MIMD                      Switch box routing            General purpose 
PipeRench [2]               Mixed (128-bit)             Multiple        Dynamic         Pipelined                 Scheduling                    Data-parallel, DSP applications 
MorphoSys [3]               Coarse (16-bit)             Multiple        Dynamic         SIMD                      Assembler, manual P&R         Data-parallel, image applications 
Triscend A7 [10]            Mixed                       Multiple        Dynamic         N/A                       Co-compilation (Assembler,    General purpose embedded system 
                                                                                                                  C, Verilog, VHDL) 
Motorola MRC6011 [11]       Coarse (16-bit)             Multiple        Dynamic         SIMD                      C-compilation                 Computation-intensive applications 
QuickSilver Adapt2400 [16]  Coarse (8, 16, 24, 32-bit)  Multiple        Dynamic         Heterogeneous nodes array SilverC                       Comm., multimedia DSP 
Elixent DFA1000 [9]         Coarse (4-bit)              Multiple        Dynamic         Linear D-Fabric array     Verilog, VHDL, Handel-C,      Multimedia applications 
                                                                                                                  Matlab 
picoChip PC102 [17]         Coarse (16-bit)             Multiple        Dynamic         3-way LIW                 Assembler                     Wireless communications 
3D-SoftChip                 Coarse (4-bit)              Multiple        Dynamic         Various types of          C-compilation (Assembler, C)  Comm., multimedia signal processing 
                                                                                        computation models 
1.2.2.2 MorphoSys Vs 3D-SoftChip 
One of the most successful reconfigurable systems to date is the MorphoSys system, 
so it is meaningful to compare the proposed 3D-SoftChip architecture with it. 
Table 1.4 shows the comparison between MorphoSys and the 3D-SoftChip. It can be seen 
that the 3D-SoftChip is better matched to the requirements of an up-to-date adaptive 
computing system. 
Table 1.4: Comparison of MorphoSys with 3D-SoftChip 
                          MorphoSys                                     3D-SoftChip 
Integrated Model          System-on-Chip except main memory             Vertically integrated complete System-on-Chip 
                                                                        with abundant memory capacity 
Memory Interface          Employs a two-set data buffer that enables    Using Indium bump technology, vertical data 
                          overlap of computation with data transfers    communication. Variable memory word-length 
                                                                        for adaptive computing 
Reconfiguration           Multiple contexts on-chip (32 planes) with    Multiple contexts on-chip with dynamic and 
                          dynamic and single-cycle reconfiguration      single-cycle reconfiguration 
Controller                On-chip general-purpose processor             Every UnitChip has an ICS_RISC, which takes the 
                                                                        role of control processor 
Examples of Application   MPEG-2 video compression encoder              Real-time communication and multimedia signal 
                          Automatic Target Recognition                  processing 
                          Data encryption 
Characteristics           SIMD nature                                   Various types of computational model 
                          Fixed word length                             (SISD, SIMD, MISD, MIMD) and 3 types of SIMD 
                          Comprehensive tool sets (mView, mLoad,        computation models (massively parallel, 
                          mSched, mcc, MuLate, MorphoSim)               multithreaded, pipelined) 
                                                                        Configurable word length and variable memory 
                                                                        word length for adaptive computing 
                                                                        3D vertically integrated system: high speed 
                                                                        data interface 
                                                                        Optimum system architecture for comm. and 
                                                                        multimedia signal processing 
1.3 Motivation of Thesis 
As the microelectronics industry enters the nano- and giga-scale integrated circuit era, 
many problems, as described before, have begun to occur. To cope with these problems, 
especially the system-on-chip complexity and interconnection crisis, innovative new 
computing systems with novel interconnection methods will be required. A very 
promising candidate to overcome these problems is the concept of a 3D vertically 
integrated adaptive computing system-on-chip (3D-ACSoC). This concept may well be a 
critical technology for the next generation of computing systems because of its wide 
applicability/adaptability and because of the significant benefits gained from 3D 
systems, such as reductions in interconnect delays and densities, and reductions in chip 
area due to the possibility of more efficient layouts. 
Conventional SoC design methodologies include many error-prone and tedious 
iteration processes, which can result in a lack of system reliability and extend the 
design time. Moreover, the portion of the total design time taken up by verification 
processes is increasing exponentially. By adopting the suggested SoC design methodology 
using SystemC, the design time can be significantly reduced and more reliable systems 
can be realised. To satisfy these needs, the concept of the 3D-ACSoC and an advanced 
HW/SW co-design and verification methodology have been suggested. 
1.4 Scope of Thesis 
In this thesis, the novel 3D-SoftChip architecture for real-time communication and 
multimedia signal processing is introduced, and its high-level system modelling and 
functional verification using SystemC is described. The 3D-SoftChip has been fully 
modelled at a high level using SystemC, and an implementation of the MPEG4 full search 
block matching motion estimation algorithm has been mapped to the modelled 
3D-SoftChip. Finally, the performance analysis is detailed in the last chapter. The 
thesis is composed of nine chapters including this one. The scope covered by each of 
the following chapters is outlined below. 
1.4.1. Scope of Each Chapter 
Chapter 2 is an introduction to the overall 3D-SoftChip architecture. The novel 
architecture and several salient features for next generation computing systems will be 
introduced along with the suggested HW/SW co-design and verification methodology. 
The detailed architecture of the CAP chip will be described in Chapter 3, where the 
architecture and functions of the heterogeneous types of Processing Elements will be 
presented. Chapter 4 covers the ICS chip, its components and the ICS_RISC Instruction 
Set Architecture with an instruction set summary. Chapter 5 presents the architecture of 
the UnitChip, its pipeline operation mechanism and area constraint. A three-level 
hierarchical interconnection architecture and the configurable nature of the inter-PE 
bus will be introduced in Chapter 6. Chapter 7 presents the high-level modelling of the 
3D-SoftChip using SystemC. The simulation results for each component of the 
3D-SoftChip are also provided to verify the functionality of each component. Chapter 8 
introduces application mapping for the high-level modelled 3D-SoftChip with the MPEG4 
full search block matching algorithm. The performance analysis will be performed in 
comparison with conventional systems. Finally, the last chapter outlines the 
contributions of this thesis and suggested future work. 
1.5 Conclusions 
In this chapter, the motivation for the emergence of the novel 3D vertically integrated 
system-on-chip and the benefits which can be acquired through its use have been 




System Architecture of 3D-SoftChip 
In this chapter, the core technology for the 3D-SoftChip along with its detailed and 
overall architecture will be described. Finally a design guideline and the suggested design 
methodology will be introduced. 
2.1 Core Technology for 3D-SoftChip 
The core technology for the 3D-SoftChip can be mainly classified into 3 fields of 
technology as follows: a Very Deep Submicron (VDSM) silicon process technology, a 3D 
interconnection technology and an advanced software technology. The target silicon 
process technology for the 3D-SoftChip is less than 0.13um, to maximize the effect of 
large scale integration in order to fit as much as possible into the Processing Elements 
(PEs). This large scale integration into the PEs can be leveraged to amplify the 
computation capacity of the 3D-SoftChip because of the SIMD computation nature of the 
3D-SoftChip. The 3D interconnection technology using the Indium Bump 
Interconnection Array (IBIA) is another state-of-the-art technology for the 3D-SoftChip. 
The IBIA can cope with the severe demand for data bandwidth from real time 
communication and multimedia signal processing applications. The last technology is the 
advanced software technology, which is able to effectively execute context mapping and 






Figure 2.1: Core technology for the 3D-SoftChip 
2.2 Overall Architecture of 3D-SoftChip 
Figure 2.2 shows the overall architecture of the 3D-SoftChip. 










Figure 2.2: Overall architecture of the 3D-SoftChip 
As can be seen, it comprises 4 UnitChips. By including four separate UnitChips 
in the architecture, sufficient flexibility is provided to allow multiple optimized task 
threads to be processed simultaneously. Given the primary target applications of 
communication and multimedia processing, four UnitChips should be sufficient for all 
such requirements. Each UnitChip has a PE array, a dedicated control processor and a 
high bandwidth data interface unit. According to a given application program, the PE 
array processes a large amount of data in parallel, while the ICS controls the overall 
system and directs the PE array execution and the data and address transfers within the 
system. 
2.3 Features of 3D-SoftChip 
The 3D-SoftChip has 4 distinctive features: Various types of computation modes, 
adaptive Word-length configuration [14], optimized system architecture for real-time 
communication and multimedia signal processing and dynamic reconfigurability for 
adaptive computing. 
• Computation Algorithm: Various Computation Models 
As described above, one 32-bit RISC controller can supply control, data and 
instruction addresses to 16 sets of PEs through the completely freely controllable 
switch block, so various computation models such as SISD, SIMD, MISD and MIMD can 
be achieved as required. Enough flexibility is thus achieved for an adaptive 
computing (AC) system. In the SIMD computation model in particular, 3 different 
types of SIMD computational model can be realized: massively parallel, 
multithreaded and pipelined [13]. In the massively parallel SIMD computation model, 
each UnitChip operates with the same global program memory. Every computation is 
processed in parallel, maximizing computational throughput. In the multithreaded 
SIMD computation model, the program instructions executed in each UnitChip can be 
different from the others, so multithreaded programs can be executed. The final one 
is the pipelined SIMD computation model. In this case each UnitChip executes a 
different pipeline stage. These three computational models are illustrated in 
Figure 2.3. 
(c) Pipelined SIMD Computation Model 
Figure 2.3: Computation Algorithm: 3 types of SIMD Computation Models 
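The three SIMD computation models can be contrasted with a small behavioural sketch. This Python is illustrative only and is not the SystemC model developed in this thesis; the function names, the use of plain integer lists as data and the latch-based pipeline timing are all assumptions introduced here.

```python
from typing import Callable, List, Optional, Sequence

Program = Callable[[int], int]   # per-element operation run by one UnitChip


def massively_parallel(program: Program,
                       data: Sequence[Sequence[int]]) -> List[List[int]]:
    """Every UnitChip runs the same global program on its own data slice."""
    return [[program(x) for x in chunk] for chunk in data]


def multithreaded(programs: Sequence[Program],
                  data: Sequence[Sequence[int]]) -> List[List[int]]:
    """Each UnitChip runs its own program (thread) on its own data slice."""
    return [[prog(x) for x in chunk] for prog, chunk in zip(programs, data)]


def pipelined(stages: Sequence[Program], stream: Sequence[int]) -> List[int]:
    """Each UnitChip is one pipeline stage; items advance one stage per step."""
    latches: List[Optional[int]] = [None] * len(stages)  # stage output latches
    out: List[int] = []
    # Trailing None "bubbles" flush the items still in flight (assumed timing).
    for item in list(stream) + [None] * len(stages):
        if latches[-1] is not None:
            out.append(latches[-1])          # result leaves the last stage
        for i in range(len(stages) - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = stages[i](prev) if prev is not None else None
        latches[0] = stages[0](item) if item is not None else None
    return out
```

With four stage functions, one per UnitChip, `pipelined` returns the same values as applying the four functions in sequence to each input; the difference from the other two models is that the UnitChips work on different items at the same time rather than on different data slices.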
• Word-length Configuration 
This is a key characteristic in classifying the 3D-SoftChip as an adaptive 
computing system. Each PE's basic processing word-length is 4-bit. This can, 
however, be configured up to 32-bit according to the application in the program 
memory. Figure 2.4 illustrates the proposed word-length configuration algorithm. 
When 2 PEs are configured together, an 8-bit word-length system is created. If 4 PEs 
are configured together, this extends to 16-bit. Finally, when 8 PEs are configured 
together, a full 32-bit word-length is achieved. This flexibility is possible due to 
the configurable nature of the arithmetic primitives in the PEs [18] (see Chapter 3.5) 
and the completely freely controllable switch block architecture in the ICS chip. 
(a) 8-bit Word-length Configuration 
(b) 16-bit Word-length Configuration 
(c) 32-bit Word-length Configuration 
Figure 2.4: Word-length Configuration Algorithm 
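The word-length configuration rule (2, 4 or 8 ganged 4-bit PEs for 8-, 16- or 32-bit words) can be sketched as follows. This is a minimal illustration, assuming that ganged PEs pass a carry from one 4-bit slice to the next in ripple fashion; the actual arithmetic primitives are those referred to in [18], and the names `pes_required`, `pe_add` and `ganged_add` are hypothetical.

```python
PE_BITS = 4  # basic processing word-length of one PE


def pes_required(word_bits: int) -> int:
    """Number of 4-bit PEs configured together for a given word-length."""
    if word_bits not in (4, 8, 16, 32):
        raise ValueError("supported word-lengths are 4, 8, 16 and 32 bits")
    return word_bits // PE_BITS


def pe_add(a: int, b: int, carry_in: int):
    """One PE: add two 4-bit slices and return (4-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xF, total >> PE_BITS


def ganged_add(x: int, y: int, word_bits: int) -> int:
    """Add two words by chaining the carry through the ganged PEs.

    Ripple-carry chaining is an assumption made for illustration only.
    """
    result, carry = 0, 0
    for i in range(pes_required(word_bits)):
        shift = i * PE_BITS
        slice_sum, carry = pe_add((x >> shift) & 0xF, (y >> shift) & 0xF, carry)
        result |= slice_sum << shift
    return result  # the carry out of the top PE is dropped (wrap-around)
```

Under these assumptions, `ganged_add(x, y, 16)` uses four PE slices and agrees with ordinary addition modulo 2**16.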
• Optimized System Architecture for Communication and Multimedia Signal 
Processing 
There are many similarities between communications and multimedia signal 
processing, such as data parallelism, low precision data and high computation rates. 
The distinguishing characteristics of communication signal processing are basically 
more data reorganization, such as matrix transposition, and potentially more 
bit-level computation. To fulfill these signal processing demands, each UnitChip 
contains two types of PE. One is a standard-PE for generic ALU functions, which is 
optimized for bit-level computation. The other is a processing accelerator-PE for 
Digital Signal Processing (DSP). In addition, special addressing modes to leverage 
the localized memory, along with 16 sets of loop buffers to generate iterative 
addresses in the ICS_RISC, add to the specialized characteristics for optimized 
communication and multimedia signal processing. 
• Dynamic Reconfigurability for Adaptive Computing 
Every PE contains a small quantity of local embedded SRAM memory and 
additionally the ICS chip has an abundant memory capacity directly addressable 
from the PEs. With multiple sets of program memory and the abundant memory 
capacity, it is possible to switch programs easily and seamlessly, even at run-time. 
2.4 System Components 
As introduced above, the 3D-SoftChip consists of a linear array of heterogeneous PEs 
with an associated array of Indium bump 3D Interconnects, dedicated Switch Blocks, the 
ICS_RISC and a high bandwidth data interface unit. 
2.4.1. Configurable Array Processor (CAP) Chip 
2.4.1.1 Heterogeneous Types of PEs 
The CAP chip comprises a linear array of two types of PE, a Standard-PE and a 
Processing Accelerator-PE. The advantage of heterogeneous PEs with dedicated 
functions for special purpose DSP is greater suitability for specific applications, with 
only a moderate trade-off in flexibility compared with homogeneous PEs. In this case, 
two Standard-PEs and two Processing Accelerator-PEs form one Quad-PE. These will be 
described in detail in a later section. 
2.4.2. Intelligent Configurable Switch (ICS) Chip 
2.4.2.1 Switch Block 
Each group of 4 PEs (called a Quad-PE) is controlled by one Switch Block through 
the IBIA. The Switch Block transfers data from/to each PE and also provides 
instruction data for the PEs. It can completely freely configure each PE group, which 
makes it possible to achieve efficient variable word-length configuration. 
2.4.2.2 ICS_RISC 
A 32-bit dedicated RISC processor is used to control each set of 4 Quad-PEs (called 
UnitCAP). It controls the execution of the PE array and provides control and address 
signals to the Switch Block and the high bandwidth data interface unit in the UnitChip. 
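The control hierarchy described in Sections 2.4.2.1 and 2.4.2.2 can be sketched as a small address calculation. The following Python is only an illustration: the group sizes come from the text (4 PEs per Quad-PE, 4 Quad-PEs per UnitChip, 4 UnitChips), while the flat PE numbering, the `PEAddress` type and the `locate` helper are hypothetical names introduced here, not part of the 3D-SoftChip design.

```python
from dataclasses import dataclass

PES_PER_QUAD = 4        # one Switch Block controls one Quad-PE (4 PEs)
QUADS_PER_UNITCHIP = 4  # one ICS_RISC controls 4 Quad-PEs (a UnitCAP)
UNITCHIPS = 4           # the 3D-SoftChip comprises 4 UnitChips

PES_PER_UNITCHIP = PES_PER_QUAD * QUADS_PER_UNITCHIP  # PEs per ICS_RISC
TOTAL_PES = PES_PER_UNITCHIP * UNITCHIPS              # PEs in the whole chip


@dataclass(frozen=True)
class PEAddress:
    unitchip: int  # selects the UnitChip, and so the controlling ICS_RISC
    quad: int      # selects the Quad-PE, and so the Switch Block
    pe: int        # position of the PE inside its Quad-PE


def locate(flat_index: int) -> PEAddress:
    """Map a flat PE number (an assumed numbering) onto the hierarchy."""
    if not 0 <= flat_index < TOTAL_PES:
        raise ValueError("PE index out of range")
    unitchip, rest = divmod(flat_index, PES_PER_UNITCHIP)
    quad, pe = divmod(rest, PES_PER_QUAD)
    return PEAddress(unitchip, quad, pe)
```

The value of `PES_PER_UNITCHIP` matches the 16 sets of PEs that one 32-bit RISC controller is described as serving in Section 2.3.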
2.4.2.3 Data Frame Buffer 
Two sets of Data Frame Buffers are included to support the transfer of large volumes 
of data from/to data/program memory and the ICS. 
2.4.2.4 Program Memory 
This is separated into two areas. One is a program memory for the ICS_RISC and the 
other is the program memory for the PE array. This memory supports adaptively 
configured word-lengths to increase the computation efficiency depending on the 
application. Additionally, multiple sets of program memory are included to allow 
dynamic program switching. 
2.4.2.5 Data Memory 
Abundant memory capacity is one of the characteristics of the 3D-SoftChip, with each 
PE containing its own embedded local memory along with a high bandwidth connection 
to the memory store on the ICS. 
2.4.2.6 DMA Controller 
A dedicated controller is included to facilitate the transfer of large volumes of data 
from/to program memory, data memory and the ICS. This provides a high efficiency data 
interface between any of these units. 
2.4.2.7 3D Interconnection Technology
The CAP chip carries out all data manipulation operations in the system. There is 
rarely the need for data transfer within the CAP beyond basic nearest neighbor 
interconnects, except for computation with word-lengths configured to > 4-bit. All the 
manipulated data is, therefore, transferred through the Indium Bump Interconnection 
Array (IBIA) and processed by the ICS allowing for very high speed computation 
because the IBIA provides very high bandwidth and very low inductance/capacitance
[15].
2.5 Design Guidelines 
The design guidelines and constraints to satisfy the design goals are as follows. 
• The 3D-SoftChip is the first novel 3D vertically integrated Adaptive Computing 
System-on-Chip (3D-ACSoC) 
• Using Indium bump technology, data can be manipulated at very high speed with 
wide bandwidth. 
• The variable memory word-length and configurable word-length are unique 
features for an adaptive computing system. 
• Various computation models (SISD, SIMD, MISD, MIMD) are possible for
adaptability/flexibility in accordance with the current application and 3 types of
SIMD computational models (massively parallel, multithreaded, pipelined) allow
for maximized computational throughput.
• The heterogeneous types of PE architectures are optimized for communication
and multimedia signal processing.
• Dynamic run-time reconfigurability for adaptive computing. (Multiple sets of 
program memory and abundant memory capacity) 
• The area constraint of PE should be minimized as much as possible (less than 
60um x 60um in 0.13um technology) for a 4-bit word size. 
2.6 Design Methodology 
2.6.1. Suggested HW/SW Co-design and Verification Methodology 
HW/SW co-design is a development methodology that supports the concurrent and
co-operative development of hardware and software (co-specification, co-development,
co-verification). It helps to evaluate the effect of design decisions and to explore the design
space at an early stage to obtain the optimal architecture. As a result, design cost
and design cycle time can be reduced and a more reliable system can be realised because
verification takes place at the high level of the system. Figure 2.5 shows a suggested HW/SW
co-design methodology for the 3D-SoftChip. Once the system specification is firmly decided,
HW/SW partitioning is executed to determine which functions should be implemented in 
hardware and which in software. The HW can then be modeled using SystemC [19] and 
SW modeled in C. After that, a co-simulation and verification process is implemented to 
verify the 3D-SoftChip operation and performance and to decide on an optimal HW/SW 
architecture. 
More specifically, the SW is modelled using a modified GNU C Compiler and 
Assembler. After the compiler and assembler for ICS_RISC has been finalised, a 
program for the implementation of the MPEG4 motion estimation algorithm will be 
developed and compiled using it. After that, object code can be produced, which can be 
directly used as the input stimulus for an instruction set simulator and system level 
simulation. The HW/SW verification process can be achieved through the comparison
between the results from instruction level simulation and system level simulation. From 
this point on, the rest of the procedure can be processed using any conventional HW 
design methodology, such as full and semi-custom design. N.B. SystemC is a system 
design language which supports concurrent HW/SW co-design methodologies and offers 
a simulation kernel that supports hardware modeling concepts at the system level, 
behavioral level and register transfer level [20].
2.7 Conclusions 
The core technology and the overall and detailed architecture of the 3D-SoftChip have been
presented. The four salient features described in Section 2.3 differentiate the
3D-SoftChip from conventional reconfigurable/adaptive computing systems. The design
time and reliability of the system will be significantly improved by adopting the 












[Figure 2.5 blocks: H/W System-Level Modeling & Architecture Exploration of 3D-SoftChip
using SystemC; System Level modeling for Function/Instruction Verif. & Arch. Exploration;
Optimum H/W Specifications; Circuit Level Simulation; Circuit Optimization; Go to Foundry;
Chip Test]
Figure 2.5: Suggested HW/SW Co-design and Verification Methodology
Chapter 3
Architecture of CAP Chip 
In this chapter, the overall architecture of the Configurable Array Processor (CAP)
chip will be described along with the PE architecture for communication and multimedia
signal processing. The integration of 4 heterogeneous PEs forms one Quad-PE, and four
Quad-PEs make up the UnitCAP chip.
3.1 Overall Architecture of CAP Chip 
The basic architecture of the CAP chip is a linear array of heterogeneous PEs. Figure
3.1 shows three possible architecture choices for the PEs. The architecture in Figure
3.1(b) is suggested as the most feasible architecture for the PE in the 3D-SoftChip
because it has the optimum trade-off between application specific performance and
flexibility. Examples of type A can be seen in [1,2,3], type B in [16] and type C in [17].
The CAP chip has the basic role of the processing engine for the 3D-SoftChip. It
manipulates large amounts of data at a high computational rate using any of the three
different SIMD computation models previously described.
Figure 3.1: Types of PEs (a) homogeneous type, (b) heterogeneous type, (c) heterogeneous type with
dedicated functions for special purpose.
Table 3.1: Characteristics of each PE type
PE | Architecture | Flexibility | Performance
Type A | Homogeneous type PEs with embedded memory, ALU, MAC, address decoder etc. | Suitable for general purpose. High flexibility | Relatively low performance for specific applications
Type B | Each PE is optimized for special functions. Example: PE1: Multiple MAC, ALU array; PE2: Bit-oriented operations; PE3: General purpose RISC or control logic; PE4: Memory | Suitable for specific applications. Medium flexibility | Relatively medium performance for specific applications
Type C | Combination of the Type B architecture with dedicated functions for special purpose. Example: PE1: Multiple MAC, ALU array; PE2: Memory and control; PE3, PE4: A co-processor optimized for dedicated signal processing functions (FEC, preamble detect etc.) | Suitable for dedicated applications. Low flexibility | Relatively high performance for specific applications
3.2 Two Types of Processing Element (PE)s 
Figure 3.2 illustrates the two types of PE architecture chosen to optimize
communication and multimedia signal processing type applications. Table 3.2 shows the
characteristics of the two types of PE.
Figure 3.2: Two Types of PE (a) Standard-PE, (b) Processing Accelerator-PE
Table 3.2: Characteristics of the two types of PE
 | Standard-PE | Processing Accelerator-PE
Components | Standard ALU (Mul, Add, Sub, Comparator), MUX A,B, Registers, Embedded SRAM | Multiplier, modified Adder, Subtractor, 8-bit Barrel Shifter, Registers, Embedded SRAM
Purpose | Bit-wise manipulation, Standard ALU functions | Dedicated for MAC, MAS functions (for DSP application)
Characteristics | Standard ALU functions. Comparison operation. | Single clock cycle MAC, MAS, absolute value computation operations. 8-bit barrel shifter (Logical, Arithmetical Shift)
3.2.1. Standard-PE (S-PE) 
The S-PE is for standard ALU functions and is also optimized for bit-level operation
for communication signal processing. It comprises 4 sets of 19-bit registers for S-PE
instruction decoding, two multiplexers to select input operands from the data bus,
adjacent PEs or internal registers, a standard ALU with bit-serial multiplier, adder,
subtracter and comparator, embedded local SRAM and 4 sets of registers. The arithmetic
primitives are scalable so as to make it possible to reconfigure the word-length for
specific tasks. The scalable arithmetic primitive architecture is presented in [18].
3.2.2. Processing Accelerator-PE (PA-PE) 
The PA-PE is dedicated specifically to Digital Signal Processing (DSP) operations. It
consists of 4 sets of 19-bit registers for PA-PE instruction decoding, two multiplexers to
select input operands from the data bus, adjacent PEs or internal registers, a signed 4-bit
scalable parallel/parallel multiplier and accumulator/subtractor modified to enable
Multiply-and-Accumulate (MAC) and Multiply-and-Subtract (MAS) operations within one
clock cycle, an 8-bit configurable barrel shifter, embedded local SRAM and 4 sets of
registers. Two shifters in the Quad-PE can also be configured to produce a 16-bit barrel
shifter. Its distinctive features are the single clock cycle MAC and MAS operations and the
parallel/parallel multiplier to accelerate DSP applications. Moreover, it can execute
absolute value computation in a single clock cycle.
3.3 PE Functions 
PE functions are mainly divided into S-PE or PA-PE functions. 
3.3.1. Standard-PE Functions 
Table 3.3 shows the functions of S-PE. It is useful for bit-wise manipulation and 
generic ALU functions. 




not A NOT 






3.3.2. Processing Accelerator-PE Functions 
Table 3.4 describes the PA-PE functions. It is specialized for DSP functions such as MAC,
MAS, logical shift, arithmetic shift, rotate and absolute value computation.
Table 3.4: Processing Accelerator-PE Functions
Function | Mnemonics
A x B | PAMUL
A x B + out(t) | MAC
A x B - out(t) | MAS
Logical Shift Left | LSL
Logical Shift Right | LSR
Arithmetic Shift Right | ASR
Rotate | ROR
|A| (Absolute value) | ABS
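The functions in Table 3.4 can be illustrated with a small behavioural sketch. This is not the SystemC model used in this work; the function name pa_pe, the default 8-bit shifter width and the choice of a rotate-right for ROR are assumptions made for illustration only. The MAS result follows the expression given in the table.

```python
def mask(width):
    """All-ones bit mask for a word of the given width."""
    return (1 << width) - 1


def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= mask(width)
    return value - (1 << width) if value >> (width - 1) else value


def pa_pe(op, a, b=0, acc=0, width=8):
    """Behavioural sketch of the PA-PE functions (hypothetical helper).

    For PAMUL/MAC/MAS, a and b are signed operands and acc stands for out(t).
    For the shift and rotate functions, a is a width-bit pattern and b is the
    shift amount.
    """
    if op == "PAMUL":
        return a * b
    if op == "MAC":
        return a * b + acc
    if op == "MAS":
        return a * b - acc  # as written in Table 3.4
    if op == "ABS":
        return abs(a)
    pattern = a & mask(width)
    amount = b % width
    if op == "LSL":
        return (pattern << amount) & mask(width)
    if op == "LSR":
        return pattern >> amount
    if op == "ASR":
        # Python's >> on a negative int is an arithmetic shift.
        return to_signed(pattern, width) >> amount
    if op == "ROR":
        # Assumed to be a rotate right; the table does not state the direction.
        return ((pattern >> amount) | (pattern << (width - amount))) & mask(width)
    raise ValueError("unknown PA-PE function: " + op)
```

For example, pa_pe("MAC", 3, -2, acc=10) accumulates the product -6 onto 10, and pa_pe("ROR", 0x01, 1) moves the least significant bit into the most significant position of the 8-bit word.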
3.3.3. PE Instruction Formats and Operation Modes 
The PE instruction format consists of a 19-bit instruction word. The two most significant
bits, 18 and 17 (WS_en/RS_en, WR_en/RR_en), are used as the read/write enable bits of
the embedded SRAM and registers. Bits 16 to 10 are used for
SRAM and register selection (addressing). Bit 9 is used for the data output register enable
signal and bits 8 to 6 are used to specify the PE operation. Finally, bits 5 to 0 are used to
control the input multiplexers for input operand selection. This format is illustrated in
Figure 3.3 below.
Bit(s) | 18 | 17 | 16 | 15-12 | 11-10 | 9 | 8-6 | 5-3 | 2-0
Field | WS_en/RS_en | WR_en/RR_en | SRAM en | SRAM Selection | Register Selection | Dout RCtl | SPEOP | MUXB | MUXA
(a) Standard-PE Instruction format
Bit(s) | 18 | 17 | 16 | 15-12 | 11-10 | 9 | 8-6 | 5-3 | 2-0
Field | WS_en/RS_en | WR_en/RR_en | SRAM en | SRAM Selection | Register Selection | Dout RCtl | PA-PE OP | MUXB | MUXA
(b) Processing Accelerator-PE Instruction format
Figure 3.3: PE Instruction formats
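The bit allocation of the 19-bit instruction word can be sketched as a pair of pack/unpack helpers. The field boundaries follow Figure 3.3; the lower-case field names and the functions encode and decode are hypothetical and are not part of the 3D-SoftChip tool flow.

```python
# (name, most significant bit, least significant bit); names are illustrative.
FIELDS = [
    ("ws_rs_en", 18, 18),   # SRAM write/read enable
    ("wr_rr_en", 17, 17),   # register write/read enable
    ("sram_en", 16, 16),
    ("sram_sel", 15, 12),
    ("reg_sel", 11, 10),
    ("dout_en", 9, 9),      # data output register enable
    ("op", 8, 6),           # SPEOP or PA-PE OP
    ("mux_b", 5, 3),
    ("mux_a", 2, 0),
]


def encode(**values):
    """Pack the given field values into one 19-bit instruction word."""
    known = {name for name, _, _ in FIELDS}
    unknown = set(values) - known
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(sorted(unknown)))
    word = 0
    for name, msb, lsb in FIELDS:
        width = msb - lsb + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(name + " does not fit in " + str(width) + " bit(s)")
        word |= value << lsb
    return word


def decode(word):
    """Split a 19-bit instruction word back into its fields."""
    return {
        name: (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)
        for name, msb, lsb in FIELDS
    }
```

The field widths add up to 19 bits (1 + 1 + 1 + 4 + 2 + 1 + 3 + 3 + 3), so decode(encode(...)) returns the original field values for any in-range input.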
(a) Horizontal mode (b) Vertical mode (c) Circular mode 
Figure 3.4: PE Array Operation Modes 
Figure 3.4 illustrates the 3 types of PE operation mode that can be realized on the PE
array: Horizontal mode, Vertical mode and Circular mode. In the horizontal and vertical
modes, the PEs in each row or each column, respectively, are connected together. These
operation modes are optimized for the SIMD computational method. Lastly, in the circular
mode, the PEs within one Quad-PE are connected together and each Quad-PE can work
separately. These modes allow for even greater flexibility and help to maximize
computational throughput according to the target application.
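The grouping implied by the three operation modes can be sketched as follows. The sketch assumes that the 16 PEs of a UnitCAP are arranged as a 4 x 4 grid and that each Quad-PE occupies a 2 x 2 block of that grid; this layout and the function name pe_groups are assumptions for illustration.

```python
def pe_groups(mode, rows=4, cols=4):
    """Return the lists of (row, col) PE positions that are chained together."""
    if mode == "horizontal":
        # One chain per row.
        return [[(r, c) for c in range(cols)] for r in range(rows)]
    if mode == "vertical":
        # One chain per column.
        return [[(r, c) for r in range(rows)] for c in range(cols)]
    if mode == "circular":
        # One ring per 2 x 2 block (assumed Quad-PE), visited clockwise.
        groups = []
        for r0 in range(0, rows, 2):
            for c0 in range(0, cols, 2):
                groups.append(
                    [(r0, c0), (r0, c0 + 1), (r0 + 1, c0 + 1), (r0 + 1, c0)]
                )
        return groups
    raise ValueError("unknown operation mode: " + mode)
```

Under these assumptions each mode yields four groups of four PEs, which matches the idea that every PE belongs to exactly one row, one column and one Quad-PE.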
3.4 Embedded Local SRAM 
Each PE has a small quantity of local embedded SRAM. As a result, the effective memory
bandwidth increases dramatically, by as much as a factor equal to the number of PEs, which
results in an increase in effective processing speed in many applications. Bus traffic can
also be reduced because many data transmission operations can be contained within a PE.
Consequently, a lowering of power dissipation will also be achieved. Effectively this can
act as a cache, which can be continuously refreshed.
3.5 Configurable Nature of Arithmetic Primitives 
As described in Section 2.3, one of the distinguishing features of the 3D-SoftChip as an
adaptive computing system is word-length configuration. The basic word-length of each PE is
4-bit. It can be configured to 8, 16 or 32-bit according to the target application. The
configurable nature of the arithmetic primitives in the PE allows this configuration [18].
The most complex component in the PE is the multiplier, so the configurable parallel
multiplier is introduced here as an example of a configurable arithmetic primitive.
3.5.1. Scalable Parallel Multiplier Cell 
Figure 3.5 shows a generic 1x1-bit multiplier cell. It includes a full adder, an AND
gate and three multiplexers to select the input operand through the control signals CTRL_H
and CTRL_L. In this figure, A represents the multiplicand and B is the multiplier. S_IN is the
SUM signal from the adjacent cell above, C_OUT is the propagated carry output, C_IN is the
carry input from the adjacent multiplier cell. M_OUT represents the multiplication result.
The 2x2-bit multiplier can be implemented using the generic 1-bit cell and moreover an
8x8-bit multiplier can be realised by arranging the basic 4x4 primitive in a 2x2 array as
shown in Figure 3.6 [18]. Because of this configurable characteristic, the word-length can










Figure 3.5: A generic 1x1-bit Multiplier Cell for n=1
Figure 3.6: 8x8 multiplier using 4-bit Generic Cells 
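The arithmetic behind building an 8x8-bit multiplier from a 2x2 array of 4x4 primitives can be sketched as below. Splitting each operand into a high and a low nibble gives A x B = (A_H x B_H) x 2^8 + (A_H x B_L + A_L x B_H) x 2^4 + A_L x B_L, so four 4x4 products are shifted and summed. The sketch is purely arithmetic and uses unsigned operands for simplicity; it does not model the cell-level carry and sum routing of [18], and the function names are hypothetical.

```python
def mul4x4(a, b):
    """Stand-in for one basic 4x4-bit multiplier primitive (unsigned)."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4-bit values")
    return a * b


def mul8x8(a, b):
    """8x8-bit unsigned multiply composed from four 4x4 primitives."""
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must be 8-bit values")
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return (
        mul4x4(a_lo, b_lo)
        + (mul4x4(a_lo, b_hi) << 4)
        + (mul4x4(a_hi, b_lo) << 4)
        + (mul4x4(a_hi, b_hi) << 8)
    )
```

The same decomposition can be applied again to reach 16-bit and 32-bit word-lengths, each time treating the previous multiplier as the primitive.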
3.6 Quad-PE 
As previously described, one Quad-PE consists of two pairs of PEs (two S-PEs and two
PA-PEs). The Quad-PE is controlled and configured by the Switch Block according to the
control and address data from the ICS_RISC transmitted through the IBIA. Figure 3.7





Figure 3.7: Quad-PE
3.7 UnitCAP Chip Architecture 
The CAP chip consists of 4 sets of UnitCAP. Each UnitCAP has an array of 16
heterogeneous S-PEs and PA-PEs. Figure 3.8 shows the UnitCAP chip architecture. The
configurable interconnectivity is realised through the input multiplexer in each PE. The
interconnection between the PEs is described in detail in Chapter 6.
3.8 Conclusions 
The heterogeneous types of PE architecture for communication and multimedia signal
processing have been described. The adoption of this PE architecture can accelerate
performance where intensive bit-level computation and digital signal processing are
required and achieve more flexibility compared with homogeneous types of PE array. The
suggested PE architecture has been fully modelled and its functionality verified using
SystemC at a high level. The details regarding the system level modelling of the PE will be






Figure 3.8: UnitCAP Chip Architecture
Chapter 4
Architecture of ICS Chip 
The ICS chip comprises the Switch Blocks, ICS_RISC, program memory, data
memory, data frame buffers and DMA controller. The ICS chip is a control processor
which controls the CAP chip via the IBIA as well as the overall system. The ICS_RISC
provides control and address signals and data to the system as a whole. The switch blocks
configure each PE based on the current program instruction. The high bandwidth data
interface unit enables efficient transmission of data and instructions within the system. In
this chapter, the detailed architecture of the ICS chip is described.
4.1 Switch Block 
The Switch Block transfers data from/to each PE and also provides instruction data to
each PE. Three types of Switch Block, 6-sided, 7-sided and 8-sided, provide optimized
interconnection within the ICS chip. Figure 4.1 shows the Switch Block architecture
which connects the PEs and other Switch Blocks. The architecture of the Switch
Block is similar to conventional Switch Blocks in Field Programmable Gate Arrays
(FPGA) [32]. The lines in the figure represent switches to connect data/instruction data
within the PEs, Switch Blocks and the ICS chip. A pass transistor design is used to




The ICS_RISC is a 32-bit dedicated RISC control processor. The ICS_RISC controls
the execution of the PE array and provides control and address signals to program/data
memory, the data frame buffers and the DMA controller. It has a 3-stage pipelined
architecture consisting of Fetch (F), Decode (D) and Execute (E). To cope with the iterative
nature of DSP arithmetic, it has 16 sets of loop buffers that supply instructions directly
to the instruction decoder instead of fetching them from program memory each time. This
significantly reduces bus utilization, allowing for improved performance and lower power
dissipation. Moreover, 32 general purpose registers and specialized addressing modes are
provided for optimized communication and multimedia signal processing. For detailed




Figure 4.1: Architecture of Switch Block: A 6-sided Switch Block, 7-sided Switch Block and 8-sided
Switch Block
[Figure 4.2 blocks: Instruction Address <31:0>, Program Counter, Register file (32 x 32 bit),
I/O Unit, Instruction Data <31:0>, Instruction Register, ALU & Control Unit, Control Signals]
Figure 4.2: Architecture of ICS_RISC 32-bit dedicated Control Processor
4.2.1 Features of ICS_RISC 
The ICS_RISC has a simple and efficient architecture. It has a Harvard architecture
and a simple 3-stage pipeline. Memory access during the execution stage is
carried out using load/store instructions only and all operations, except load/store, PE and
DMA operations, are register-to-register within the ICS_RISC. This improves both
performance and power dissipation.
4.2.2 System Components of ICS_RISC 
The ICS_RISC consists of a 32 x 32-bit general purpose register file, a program counter
which is the 32nd general purpose register, a 16 x 32-bit loop buffer to generate
instruction addresses for iterative sets of instructions, a status register (N: Negative/Less
than, Z: Zero, C: Carry/Borrow, V: Overflow), an instruction register for instruction
decoding, ALU, shifter, multiplier and 32-bit data input/output registers [30,31]. Figure








Figure 4.3: A detailed architecture of the ICS_RISC
4.2.3. Types of Instruction Set 
Table 4.1 describes the instruction set and instruction processing components of the
3D-SoftChip. All control instructions are executed in the ICS chip, while computation
instructions, such as arithmetic and logical operations for PEs, are executed in the CAP
chip using various computation methods (SISD, SIMD, MISD, MIMD). The detailed
instruction set is described in Appendix A.
Table 4.1: Types of Instruction Set
Function | Processing Component
Move | ICS
Arithmetic (S-PE, PA-PE) | CAP
Addressing Mode/Loop Buffer Addressing | ICS
PE Control | ICS
PE Configuration | ICS
Program/Data Load (ICS, PE Program/Data for PE) | ICS
DMA Control | ICS
4.2.4. ICS_RISC Instruction Set Architecture- Version 1.0 
Table 4.2 shows the instruction set architecture (ISA) for ICS_RISC. This is the first
version of the ISA; more efficient and dedicated instructions can be added if needed. It
has 50 instructions, broadly divided into arithmetic and logic, branch, data transfer, bit and
bit-test, PE control, DMA control and lastly a loop buffer instruction.
Table 4.2: Instruction Set Summary (ICS_RISC ISA Version 1)
Mnemonic | Operation | Operands | Flags
ARITHMETIC AND LOGIC INSTRUCTIONS
ADD | Add Two Registers | Rd, Rs1, Rs2 | N,Z,C,V
ADDI | Add Register and Constant | Rd, Rs1, #I | N,Z,C,V
SUB | Subtract Two Registers | Rd, Rs1, Rs2 | N,Z,C,V
SUBI | Subtract Register and Constant | Rd, Rs1, #I | N,Z,C,V
MUL | Multiply Two Registers | Rd, Rs1, Rs2 | N,Z,C,V
MULI | Multiply Register and Constant | Rd, Rs1, #I | N,Z,C,V
AND | Logical AND Registers | Rd, Rs1, Rs2 | N,Z,C,V
ANDI | Logical AND Register and Constant | Rd, Rs1, #I | N,Z,C,V
OR | Logical OR Registers | Rd, Rs1, Rs2 | N,Z,C,V
ORI | Logical OR Register and Constant | Rd, Rs1, #I | N,Z,C,V
XOR | Logical XOR Registers | Rd, Rs1, Rs2 | N,Z,C,V
XORI | Logical XOR Register and Constant | Rd, Rs1, #I | N,Z,C,V
NOT | Logical NOT Registers | Rd, Rs1, Rs2 | N,Z,C,V
NOTI | Logical NOT Register and Constant | Rd, Rs1, #I | N,Z,C,V
BRANCH INSTRUCTIONS
BREQ | Branch if Equal (Z=1) | PC, Offset | None
BRNE | Branch if NOT Equal (Z=0) | PC, Offset | None
JMP | Unconditional Branch (PC=PC+Offset) | PC, Offset | None
CMP | Compare Registers | Rs1, Rs2 | N,Z,C,V
CMPI | Compare Register and Constant | Rd, #I | N,Z,C,V
DATA TRANSFER INSTRUCTIONS
MOVA | Move between Registers (Rd=Rs1) | Rd, Rs1 | None
MOVAI | Move between Reg. & Const. (Rd=Const) | Rd, #I | None
MOVB | Move between Registers (Rd=Rs2) | Rd, Rs2 | None
MOVBI | Move between Reg. & Const. (Rd=Const) | Rd, #I | None
MSR | Move Register to Status Register (SR=Rs1) | SR, Rs1 | None
MSRI | Move Imm value to Status Register (SR=#I) | SR, #I | None
MRS | Move Status Register to Register (Rs1=SR) | Rs1, SR | None
LD | Load indirect with Register (Rd=Mem[Rb]) | Rd, Rb | None
ST | Store indirect with Register (Mem[Rb]=Rd) | Rd, Rb | None
BIT AND BIT-TEST INSTRUCTIONS
LSL | Logical Shift Left | Rd, Rs1 | N,Z,C,V
LSR | Logical Shift Right | Rd, Rs1 | N,Z,C,V
ASR | Arithmetic Shift Right | Rd, Rs1 | N,Z,C,V
ROT | Rotate | Rd, Rs1 | N,Z,C,V
PE CONTROL INSTRUCTIONS
PECON4 | PE Word-Length Configuration (4-bit) | None | None
PECON8 | PE Word-Length Configuration (8-bit) | None | None
PECON16 | PE Word-Length Configuration (16-bit) | None | None
PECON32 | PE Word-Length Configuration (32-bit) | None | None
PESEL | Select certain PE (PE0-PE15) | None | None
PEMODH | PE Operation mode (Horizontal mode) | None | None
PEMODV | PE Operation mode (Vertical mode) | None | None
PEMODC | PE Operation mode (Circular mode) | None | None
PEEXEH | Execute specific program to each PEs in the same Horizontal line | None | None
PEEXEV | Execute specific program to each PEs in the same Vertical line | None | None
PEEXEC | Execute specific program to each PEs in the same Circular line | None | None
DMA INSTRUCTIONS
LDPEPRG | Load Program Data from Program memory to Instruction decoder in PE | addrMem | None
LDDFB | Load large amount of processing data for PEs from Data Memory to Data Frame Buffer | addrMem, addrDFB | None
LDPEDATA | Load large amount of processing data for PEs from DFB to Embedded SRAM in PE | addrDFB, addrSRAM | None
WBREG | Write back processed data in Embedded SRAM to the registers in the ICS_RISC | addrSRAM | None
WBDFB | Write back processed data in Embedded SRAM to DFB | addrSRAM, addrDFB | None
STDFB | Load large amount of processed data in PEs from Data Frame Buffer to Data Memory | addrDFB, addrMem | None
LOOP BUFFER INSTRUCTION
LBEN | Generate an Iterative Set of Instruction Addresses (16 sets of Loop Buffer) | PC | None
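Table 4.2 lists which instructions update the N, Z, C and V status flags but does not spell out how each flag is derived. The sketch below shows one conventional definition for 32-bit ADD and SUB, with C read as carry out for addition and as borrow for subtraction. These definitions and the function names add32 and sub32 are assumptions for illustration; the precise flag behaviour of the ICS_RISC is given by its own specification.

```python
MASK32 = 0xFFFFFFFF


def _nz(result):
    """N is the sign bit of the result, Z is set when the result is zero."""
    return {"N": (result >> 31) & 1, "Z": int(result == 0)}


def add32(a, b):
    """32-bit ADD returning (result, flags); assumed flag semantics."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = _nz(result)
    flags["C"] = int(full > MASK32)  # carry out of bit 31
    # Signed overflow: both operands share a sign that the result does not.
    flags["V"] = (((a ^ result) & (b ^ result)) >> 31) & 1
    return result, flags


def sub32(a, b):
    """32-bit SUB (a - b) returning (result, flags); C is treated as borrow."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = _nz(result)
    flags["C"] = int(a < b)  # unsigned borrow
    # Signed overflow: operands differ in sign and the result's sign differs from a.
    flags["V"] = (((a ^ b) & (a ^ result)) >> 31) & 1
    return result, flags
```

Under these assumptions a CMP followed by BREQ behaves like sub32 followed by a test of the Z flag, which is how the branch conditions in Table 4.2 (Z=1, Z=0) would be evaluated.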
4.3 High Bandwidth Data Interface Unit 
The high bandwidth data interface unit allows the efficient transfer of data within the
3D-SoftChip. Two sets of data frame buffers and the DMA controller make it easy to
transfer large amounts of data. Multiple sets of program memory support run-time
program switching and, because of this dynamically reconfigurable feature, adaptive
computing is possible. The data memory has a variable word width so it can easily be
combined to build wider/deeper memories and thus increase flexibility for different
application programs. The DMA instructions and data flow for the DMA controller can 
be seen in Figure 4.4. A detailed description of the operations of the DMA instructions 
can be seen in Appendix A. 
Figure 4.4: DMA Controller Architecture and Instructions for DMA Controller
4.4 Conclusions 
The ICS chip architecture has been described in this chapter. The system components
in the ICS chip allow it to efficiently supply data and instructions to the PEs through the
IBIA. The PE array can be freely configured due to the highly controllable characteristic
of the switch block. This allows more than sufficient adaptability/flexibility for adaptive
computing systems. Moreover, the DMA controller enables bulk data to be transferred
quickly and effectively through the 3D-SoftChip.
Chapter 5
Architecture of UnitChip 
The 3D-SoftChip consists of 4 sets of UnitChip. Each UnitChip has one UnitCAP
and one UnitICS. As described in Chapter 3, the UnitCAP comprises an array of 16
heterogeneous S-PEs and PA-PEs and the UnitICS consists of a switch block, a
32-bit dedicated RISC control processor, a high bandwidth data interface unit, 2 sets of
data frame buffers and program/data memory for both the ICS and the PE array. In this
chapter, the UnitChip architecture and its pipelined operation mechanism, which can
maximize computational throughput [3], are described.
5.1 UnitChip Architecture 
As mentioned above, the UnitChip is a combination of the UnitCAP and the UnitICS
chip and four UnitChips form the complete 3D-SoftChip. Figure 5.1 illustrates the overall
architecture of the UnitChip. Control signals, data and instructions are transferred through
the IBIA to the UnitCAP, and the processed data from the UnitCAP can be rapidly transferred
back to the ICS_RISC to be manipulated and stored in the data memory.
Figure 5.1: Overall Architecture of the UnitChip.
5.2 Pipelined Operation Mechanism of UnitChip
Table 5.1: Pipelined UnitChip Operation Mechanism
Columns: Stage (1), Stage (2), Stage (3), Stage (4), Stage (5)
ICS_RISC Instructions: LDPEPRG; LDDFB; LDDFB; LDPEPRG; PECON4,8,16,32; PESEL; PESEL; PECON4,8,16,32; PEEXEH,V,C; PEMODH,V,C; PEMODH,V,C; PEEXEH,V,C; WBREG, WBDFB
PEs Op.: Execute (1-1); Execute (1-2)
PROGRAM for PEs (in Local memory): Load PRGM for PEs (1); PRGM for PEs (1-1); Load PRGM for PEs (2); PRGM for PEs (2-1); PRGM for PEs (2-2)
Data Frame Buffer 0: Load Data for PEs (1); Data for PEs (1-1); Data for PEs (1-2); Write back Execution (1-1) results
Data Frame Buffer 1: Load Data for PE (2); Data for PEs (2-1); Data for PEs (2-2)
Memory: Write back Execution (1-1) results
Table 5.1 illustrates the pipelined operation mechanism of the UnitChip, which improves its
performance. The detailed explanation is as follows.
• STEP 1 - LOAD PROGRAM FOR PEs: The first operation is to load 16
instruction words for the PEs from program memory to the instruction decoder in
each PE. The row and column decoder in the UnitCAP can specify a certain PE to
load the programs, depending on the desired computational mode (e.g. SIMD,
MIMD).
• STEP 2 - LOAD PROCESSING DATA FOR PEs (1): Load a large amount of
processing data for the PEs from data memory to the data frame buffer. The start
address in memory and the amount of data to transfer are indicated by the DMA
instructions.
• STEP 3 - LOAD PROCESSING DATA FOR PEs (2): Load the processing data
for the PEs from the data frame buffer to the embedded SRAM in each PE. The row and
column decoder in the UnitCAP can specify a certain PE to load the processing
data.
• STEP 4 - EXECUTE PEs: Execute the PE array.
• STEP 5 - RELOAD PROGRAM FOR PEs (1): Reload 16 instruction words
from program memory to the instruction decoder in each PE. The row and column
decoder in the UnitCAP can again specify a certain PE to load the programs to.
• STEP 6 - RELOAD PROCESSING DATA FOR PEs (2): Reload the processing
data for the PEs from the data frame buffer to each PE. The row and column decoder in
the UnitCAP can specify a certain PE to load the data into.
• STEP 7 - WRITE BACK PROCESSED DATA TO DFB: Write back processed
data from the embedded SRAM in each PE to the data frame buffer.
• STEP 8 - TRANSFER PROCESSED DATA TO MEMORY: Transfer a large
amount of processed data from the data frame buffer to memory.
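The overlap of these steps across successive data blocks can be illustrated with a simple scheduling sketch. It assumes that every stage takes one equal time slot, that successive blocks alternate between the two data frame buffers, and that the stage names can be borrowed from the mnemonics of Table 4.2; none of these assumptions is taken from Table 5.1 itself, and the functions schedule and speedup are hypothetical.

```python
# Illustrative stage names borrowed from Table 4.2 (PEEXE stands for PEEXEH/V/C).
STAGES = ["LDPEPRG", "LDDFB", "LDPEDATA", "PEEXE", "WBDFB", "STDFB"]


def schedule(num_blocks, stages=STAGES):
    """Map each time slot to the (block, stage, frame buffer) triples active in it."""
    slots = {}
    for block in range(num_blocks):
        for offset, stage in enumerate(stages):
            # Block b reaches stage s in slot b + s; blocks alternate between DFB 0 and 1.
            slots.setdefault(block + offset, []).append((block, stage, block % 2))
    return slots


def speedup(num_blocks, num_stages=len(STAGES)):
    """Ratio of unpipelined to pipelined time slots for num_blocks data blocks."""
    sequential = num_blocks * num_stages
    pipelined = num_blocks + num_stages - 1
    return sequential / pipelined
```

With k equal stages and n blocks the speedup is n x k / (n + k - 1), which approaches k only as n grows large. For six stages this gives a factor close to, but below, 6 for long data streams, and no gain at all for a single block.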
5.3 Area Estimations and Constraints 
Table 5.2 shows the feasible estimated area of the 3D-SoftChip components. The
performance of integrated circuits largely depends on integration density. Tight
area constraints imply a higher integration density, which maximizes the
benefit of large-scale integration. The area constraints should therefore be tight in
order to achieve the best performance.
Table 5.2: Area Estimation and Constraint of UnitChip (Target Technology: 0.13 um Process)
Component | Estimated Area
S-PE | 60 um x 60 um
PA-PE | 60 um x 60 um
IBIA | 15 um x 15 um
One Quad-PE | 130 um x 130 um
UnitCAP | 500 um x 500 um
CAP (4x4 UnitCAP) | 1100 um x 1100 um
CAP (16x16 UnitCAP) | 2200 um x 2200 um
ICS_RISC | 300 um x 300 um
5.4 Conclusions 
As explained above, by using the pipelined operation mechanism, which is a 6-stage
pipelined architecture, the performance of the UnitChip can be improved by up to 6 times.
This pipelined operation is another distinguishing characteristic to accelerate the





In this chapter, the three hierarchical interconnection architectures: Inter-PE bus, 
Switch Block Array interconnection and IBIA, will be introduced along with the 
configurable nature of the Inter-PE bus using the input operand multiplexer in each of the 
PEs. 
6.1 Hierarchical Interconnection Architecture 
The interconnection network of the 3D-SoftChip can be broken into three hierarchical
levels. The Inter-PE bus between PEs in the CAP chip is the first level. This local
interconnection network has a 2D-mesh architecture providing nearest-neighbor
interconnection between the PEs. The second level of the interconnection network is the
switch block array interconnection. This supports longer interconnections on the ICS chip
but also has a basic 2D-mesh architecture. The last hierarchical level of interconnection is
the IBIA. With the progression of technology to ever decreasing semiconductor geometry
scales, the prediction of interconnection delay and the portion of interconnection delay in
the total system delay are crucial factors. Interconnection delay is also a major factor in
the limitation of overall system performance. To overcome these problems, 3D interconnection
technology using Indium bumps becomes very attractive because it supports a very high
bandwidth coupled with a very low inductance/capacitance (and thus low power
dissipation) and can be readily utilized to achieve an interconnect array with a pitch as
low as 10 µm. The development of 3D integrated systems will allow improvements in
packaging costs, performance, reliability and a reduction in the size of the chips [15].
However, any other equivalent 3D interconnection technology could also be applied to
realize this interconnection level within the 3D-SoftChip architecture. Figure 6.1 shows
the three hierarchical interconnection networks.
(a) PE Array Interconnection Network: 2D-mesh interconnection for local interconnection
(b) Switch Block Array Interconnection Network: 2D-mesh interconnection for long interconnection
(c) Indium Bump Interconnection: Single indium bump after reflow
Figure 6.1: Three hierarchical Interconnection Networks
6.1.1. PE and Switch Block Array Interconnection 
6.1.1.1 Programmable Nature of PE Array Interconnection 
Figure 6.2: Quad-PE and Programmable Interconnect Architecture





Table 6.1: Inter-PE Bus (IPB) interconnection connectivity
IPB Signal Name   Source (Output)     Destination (Input)
IPB1              SPE1(dOutadjPE)     PAPE1(dLeft)
IPB2              SPE1(dOutadjPE)     PAPE2(dUp)
IPB3              PAPE1(dOutadjPE)    SPE1(dRight)
IPB4              PAPE1(dOutadjPE)    SPE2(dUp)
IPB5              PAPE1(dOutadjPE)    Next Quad-PE(SPE1(dLeft))
IPB6              PAPE2(dOutadjPE)    SPE1(dDown)
IPB7              PAPE2(dOutadjPE)    SPE2(dLeft)
IPB8              PAPE2(dOutadjPE)    Downside Quad-PE(SPE1(dUp))
IPB9              SPE2(dOutadjPE)     PAPE1(dDown)
IPB10             SPE2(dOutadjPE)     PAPE2(dRight)
IPB11             SPE2(dOutadjPE)     Next Quad-PE(PAPE2(dLeft))
IPB12             SPE2(dOutadjPE)     Downside Quad-PE(PAPE1(dUp))
Figure 6.2 shows the Quad-PE architecture and the Inter-PE interconnection architecture
[3]. Because of the input multiplexer in each PE, the connectivity can be readily
configured. The input multiplexer can choose its input operands from among the 6
different inputs: the data input and the data from the left side, right side, upward side
and down side PEs (dIn, dLeft, dRight, dUp, dDown). Each PE's output (dOutadjPE) becomes
an input operand to the neighbour PEs. Table 6.1 describes the connectivity within one
Quad-PE and indicates that it can be configured by the PE programming according to the
target application.
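The intra-Quad-PE rows of Table 6.1 can be sketched in plain C++ as a routing function plus an input multiplexer. This is only an illustrative model, not the thesis's SystemC code: the type and function names (`InSel`, `PeInputs`, `QuadOutputs`, `QuadInputs`, `route_quad`, `select_operand`) and the selector encoding are assumptions, and the buses that leave the Quad-PE (IPB5, IPB8, IPB11, IPB12) are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical selector for the PE input multiplexer (encoding is assumed).
enum class InSel : std::uint8_t { dIn, dLeft, dRight, dUp, dDown };

// Inputs seen by one PE; values are 4-bit words held in the low nibble.
struct PeInputs {
    std::uint8_t dIn = 0, dLeft = 0, dRight = 0, dUp = 0, dDown = 0;
};

// dOutadjPE of the four PEs in one Quad-PE.
struct QuadOutputs { std::uint8_t spe1, pape1, pape2, spe2; };
struct QuadInputs  { PeInputs spe1, pape1, pape2, spe2; };

// Wire each PE output to its neighbours following Table 6.1.
QuadInputs route_quad(const QuadOutputs& out) {
    QuadInputs in;
    in.pape1.dLeft  = out.spe1;   // IPB1
    in.pape2.dUp    = out.spe1;   // IPB2
    in.spe1.dRight  = out.pape1;  // IPB3
    in.spe2.dUp     = out.pape1;  // IPB4
    in.spe1.dDown   = out.pape2;  // IPB6
    in.spe2.dLeft   = out.pape2;  // IPB7
    in.pape1.dDown  = out.spe2;   // IPB9
    in.pape2.dRight = out.spe2;   // IPB10
    return in;
}

// The input multiplexer: pick one operand according to the PE programming.
std::uint8_t select_operand(const PeInputs& in, InSel sel) {
    std::uint8_t v = 0;
    switch (sel) {
        case InSel::dIn:    v = in.dIn;    break;
        case InSel::dLeft:  v = in.dLeft;  break;
        case InSel::dRight: v = in.dRight; break;
        case InSel::dUp:    v = in.dUp;    break;
        case InSel::dDown:  v = in.dDown;  break;
    }
    assert(v < 16);  // operands are 4-bit words in this sketch
    return v;
}
```

Calling `route_quad` with the four current outputs and then `select_operand` on, say, the SPE1 inputs with `InSel::dRight` returns the PAPE1 output, which is the IPB3 row of the table.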
6.1.2. Indium Bump Interconnection 
Indium is well suited as an interconnect material due to its excellent adhesion to most
metals, including aluminum, which is the metallization for the pads used in most VLSI
technologies. Indium has a low melting point, which implies a low work hardening
coefficient, allowing for direct bonding on processed VLSI wafers. Additionally, it
provides excellent mechanical as well as electrical connectivity (contact resistance
< 1 mΩ per bump). Reflow techniques can be used for flexibility and to increase the bump
height to width ratio as needed. Such techniques can also be used to incorporate
self-alignment features into the bonding process. Figure 6.3 illustrates 3D flip-chip
wafer bonding technology using indium bump interconnection arrays.
[Figure labels: Bonding Pad, Indium Bumps, Substrate / CAP Chip]
Figure 6.3: 3D Flip-Chip wafer bonding technology using Indium Bump Interconnection Arrays 
6.2 Conclusions 
The three hierarchical interconnection network architectures have been described.
With the exception of the 3D interconnection, these are similar to conventional
interconnection architectures in reconfigurable systems. The Inter-PE bus provides
configurable connectivity with a 2D-mesh architecture and the switch block
interconnection offers longer interconnection in the ICS chip. Lastly, the IBIA presents
vertical interconnection between the two separated chips providing a high bandwidth,




High-level modeling of 3D-SoftChip 
using SystemC 
In this chapter, the high-level modelling of the 3D-SoftChip using SystemC is
introduced. Firstly, SystemC and the Computer Aided Design (CAD) environment for
SystemC are briefly described, followed by a presentation and analysis of the
high-level simulation output waveforms for each of the 3D-SoftChip components.
Finally, some conclusions are provided.
7.1 SystemC Overview 
SystemC is a C++ class library and design methodology which can be used to design
software algorithms, hardware architectures, SoC interfaces and system-level designs.
System-level modelling, quick simulation to validate and optimize the design and HW
architecture, and exploration of various software algorithms can all be achieved using
conventional C++ development environments. In the current system design methodology
the system engineer writes high-level language (C, C++, Matlab etc.) programs to
verify the concepts and algorithms at system-level. After the concepts and algorithms are
validated, the high-level modelled designs are manually converted to Hardware
Description Languages (VHDL, Verilog-HDL) in order to implement the hardware. This
approach gives rise to a number of problems, such as errors arising from the manual
conversion from C to HDL, a disconnection between the system-level model and the HDL
model, and conversion limitations as designs get ever bigger and more complex. As
a result, new C language based system design languages are starting to emerge as a
new design methodology. Figure 7.1 shows the conventional system design methodology in
contrast to a SystemC based design methodology.
Figure 7.1: System Design Methodology:
(a) Conventional Design Methodology, (b) SystemC Design Methodology (*Source: www.systemc.org)
The system design methodology using SystemC has many advantages over the
conventional system design methodology, including increased productivity and
reliability from the progressive refinement process and the use of a single language. In
the design methodology using SystemC, the time consuming manual conversion process
is no longer necessary because the high-level modelled code becomes a more reliable and
high performance hardware model as hardware concepts and timing constructs are
added through the progressive refinement process. More productivity can be achieved by
using a single design language: the high-level modelled SystemC code can result in
smaller code that is easier to write as well as relatively faster simulation time. Moreover,
the testbench code for functional verification at high-level can be reused at any level or
design stage [19, 20].
7.1.1. CAD Environment for SystemC
As described above, SystemC is a C++ class library, which means any conventional
C++ compiler can be a CAD development environment for SystemC. Any Unix, Linux or
PC based C++ compiler can be used; in this research, however, the PC based CAD
environment (Microsoft Visual C++ Version 6.0) has been used to compile the high-level
modelled SystemC code because of its easy accessibility. Once the SystemC code is
compiled and run, the results can be stored in various types of file. The most common file
type for the results is the Value Change Dump (VCD) and the GTKWave waveform viewer is
used to inspect VCD results. The figure below shows Visual C++ and the
GTKWave waveform viewer.
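To make the VCD format concrete, the following is a minimal sketch of a hand-written VCD emitter in plain C++. It is not the mechanism SystemC itself uses (SystemC provides its own tracing calls such as `sc_create_vcd_trace_file` and `sc_trace`); the names `Sample` and `to_vcd`, the scope name and the two traced signals are assumptions chosen for illustration only.

```cpp
#include <bitset>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One sampled time point: a 1-bit clock and a 4-bit data word (hypothetical signals).
struct Sample {
    unsigned time_ns;
    bool clock;
    unsigned value;  // only the low 4 bits are used
};

// Build the text of a small VCD file that a viewer such as GTKWave can load.
std::string to_vcd(const std::vector<Sample>& samples) {
    std::ostringstream out;
    // Header: timescale, one scope, and one identifier code per signal.
    out << "$timescale 1ns $end\n"
        << "$scope module SystemC $end\n"
        << "$var wire 1 ! clock $end\n"
        << "$var wire 4 \" dOut [3:0] $end\n"
        << "$upscope $end\n"
        << "$enddefinitions $end\n";
    for (const Sample& s : samples) {
        assert(s.value < 16);  // 4-bit word
        out << '#' << s.time_ns << '\n'                                    // timestamp
            << (s.clock ? '1' : '0') << "!\n"                              // scalar value
            << 'b' << std::bitset<4>(s.value).to_string() << " \"\n";      // vector value
    }
    return out.str();
}
```

For simplicity the sketch writes every signal at every timestamp; a real dump normally records only the signals whose values changed.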
Figure 7.2: The CAD Environment for SystemC: Visual C++ Version 6.0 and the GTKWave
Waveform Viewer.
7.2 System-level Modeling of 3D-SoftChip
In this section, the high-level modelled single Standard-PE, Processing-Accelerator-PE,
ICS_RISC and UnitChip are introduced together with their output simulation waveforms.
The functionality of these components has been fully verified. For a more detailed
description of the system-level modelling of the 3D-SoftChip see Appendix B.
7.2.1. Standard-PE 
The detailed architecture of the S-PE was introduced in Chapter 3. Based on this
architecture, it has been modelled at high level using SystemC. Figure 7.3 shows the block
diagram and file structure of the S-PE.










Figure 7.3: High-level modeling of S-PE:
(a) S-PE block diagram, (b) file structure of S-PE 
Figure 7.4 shows the output waveform of the S-PE after executing ALU instructions on
data from the internal registers and the embedded SRAM. The input signals
(dIn, dLeft, dRight, dUp, dDown) are selected by the input multiplexer. The ALU
output can be seen in the dOut and dOutadjPE signals. The functionality of the S-PE
was confirmed by checking the output result.
[GTKWave screenshot; legible signals: clock, reset, dIn[3:0], dLeft[3:0], dRight[3:0],
dUp[3:0], dDown[3:0], aluOut[3:0], sramData[3:0], doutBus[3:0], dOut[3:0], dOutadjPE[3:0]]
Figure 7.4: The Output Waveform of S-PE 
7.2.2. Processing Accelerator-PE 
The PA-PE architecture has been described in Chapter 3. The high-level modelling
was carried out from this description. Figure 7.5 shows the PA-PE block diagram and file
structure.











Figure 7.5: High-level modeling of PA-PE: 
(a) PA-PE block diagram, (b) file structure of PA-PE 
[GTKWave screenshot; legible signals: clock, reset, dIn[3:0], dLeft[3:0], dRight[3:0],
dUp[3:0], dDown[3:0], muxAOut[3:0], muxBOut[3:0], aluOut[3:0], sramData[3:0],
doutBus[3:0], dOut[3:0], dOutadjPE[3:0]]
Figure 7.6: The Output Waveform of PA-PE 
The figure above shows the output waveform of the high-level modelled PA-PE. The
input operands selected through the input multiplexer are used to execute the ALU
instruction (MAC, MAS, Shift, etc.) and the results are then stored to the embedded SRAM.
The output signals show that the operations executed as required.
7.2.3. ICS_RISC 
The ICS_RISC and its instruction set architecture were introduced in Chapter 4. The
ICS_RISC can largely be divided into control and datapath units. The 32 x 32-bit
general purpose register file, a program counter, a 16 x 32-bit loop buffer, a status
register, an instruction register, ALU, shifter, multiplier and 32-bit data input/output
registers form the datapath. The fetch, decoding and execution units make up the control
unit. Additionally, a bus control unit is used to control the 32-bit operand A, operand B,
data write bus, input bus and output bus to avoid data collision. Figure 7.7 shows the top
block diagram of the ICS_RISC and its SystemC file structure.






Figure 7.7: High-level modeling of ICS_RISC:
(a) ICS_RISC block diagram, (b) file structure of ICS_RISC
The output waveform shows the results after execution of simple loop and ALU
instructions. Figure 7.8 shows the pseudo code for the instructions. In Figure 7.9, the
circle labelled as a loop instruction marks the internal general purpose register
address, which increases as programmed; the other circle marks the output of the
ALU operations.
//Simple Loop & ALU Instruction
MOV R0, #0;          //Simple Loop Inst
MOV R1, #1;
MOV R2, #2;
MOV R3, #3;
MOV R4, #4;
MOV R5, #5;
MOV R6, #6;
MOV R7, #7;
MOV R8, R0;
MOV R9, R1;
MOV R10, R2;
MOV R11, R3;
MOV R12, R4;
MOV R13, R5;
MOV R14, R6;
MOV R15, R7;         // End of Loop Inst.
AND R16, R8, R9;     //ALU Inst
OR  R17, R10, R11;
XOR R18, R12, R13;
ADD R19, R14, R15;
SUB R20, R14, R15;   // End
Figure 7.8: The Pseudo Code for ICS_RISC
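As a cross-check on what the waveform should show, the pseudo code of Figure 7.8 can be run through a tiny register-file interpreter. This is a sketch under stated assumptions, not the ICS_RISC model itself: the first operand is taken to be the destination register, registers are 32 bits wide with wrap-around subtraction, and the names `Op`, `Instr`, `exec`, `figure_7_8_program` and `run` are invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

enum class Op { MovImm, MovReg, And, Or, Xor, Add, Sub };

struct Instr {
    Op op;
    unsigned rd;      // destination register index
    std::uint32_t a;  // immediate (MovImm) or first source register index
    unsigned b;       // second source register index (ALU instructions only)
};

using RegFile = std::array<std::uint32_t, 32>;  // 32 x 32-bit general purpose registers

// Execute one instruction on the register file.
void exec(RegFile& r, const Instr& in) {
    assert(in.rd < r.size());
    if (in.op != Op::MovImm) assert(in.a < r.size() && in.b < r.size());
    switch (in.op) {
        case Op::MovImm: r[in.rd] = in.a;                break;
        case Op::MovReg: r[in.rd] = r[in.a];             break;
        case Op::And:    r[in.rd] = r[in.a] & r[in.b];   break;
        case Op::Or:     r[in.rd] = r[in.a] | r[in.b];   break;
        case Op::Xor:    r[in.rd] = r[in.a] ^ r[in.b];   break;
        case Op::Add:    r[in.rd] = r[in.a] + r[in.b];   break;
        case Op::Sub:    r[in.rd] = r[in.a] - r[in.b];   break;  // wraps modulo 2^32
    }
}

// The instruction sequence of Figure 7.8.
std::vector<Instr> figure_7_8_program() {
    std::vector<Instr> p;
    for (unsigned i = 0; i < 8; ++i) p.push_back({Op::MovImm, i, i, 0});      // MOV Ri, #i
    for (unsigned i = 0; i < 8; ++i) p.push_back({Op::MovReg, 8 + i, i, 0});  // MOV R(8+i), Ri
    p.push_back({Op::And, 16, 8, 9});
    p.push_back({Op::Or,  17, 10, 11});
    p.push_back({Op::Xor, 18, 12, 13});
    p.push_back({Op::Add, 19, 14, 15});
    p.push_back({Op::Sub, 20, 14, 15});
    return p;
}

RegFile run(const std::vector<Instr>& program) {
    RegFile r{};
    for (const Instr& in : program) exec(r, in);
    return r;
}
```

Under these assumptions the ALU results are R16 = 0, R17 = 3, R18 = 1, R19 = 13 and R20 = 0xFFFFFFFF (6 minus 7 in 32-bit two's complement).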
[GTKWave screenshot; legible signals: clock, reset, iData[31:0], opAIdx[4:0], opBIdx[4:0],
rdAOEn, rdBOEn, wbIdx[4:0], aluOut[31:0], iAddr[31:0]; annotation: Loop Instruction]
Figure 7.9: The Output Waveform of ICS_RISC





















Figure 7.10: The Instruction Index 
Figure 7.10 shows the instruction index printed during ICS_RISC instruction execution
for debugging purposes. The instruction index matched the instructions of the pseudo
code exactly.
7.2.4. UnitChip 
The UnitChip is composed of the UnitCAP and the UnitICS. Its model can be largely
divided into 4 kinds of sub-SystemC files, namely ICS_RISC, Memory, DMA and
UnitCAP. The architecture and the pipelined operation mechanism described in
Chapter 5 can be identified in the high-level system simulation results.









Figure 7.11: High-level modeling of UnitChip:
(a) UnitChip block diagram, (b) file structure of UnitChip
Figure 7.11 illustrates the UnitChip block diagram and the SystemC file structure of the
UnitChip. The functionality of each sub-SystemC block has been described before; the
UnitChip is a simple combination (port-mapping) of these sub-blocks at the top module.
A simple ALU instruction sequence has been mapped onto this high-level modelled
UnitChip and the simulation result shows its functionality. In Figure 7.12, the upper
circle indicates the ICS_RISC operation introduced before, and the lower circle shows the
PE operations, which are simple ALU functions executed by the PEs in parallel. The signal
named PE1.dOut is the output signal from PE1. The functionality can be verified by
checking these signals (PE1 to PE16) and is as expected.
[GTKWave screenshot; legible signals: clock, reset, iData[31:0], opAIdx[4:0], opBIdx[4:0],
rdAOEn, rdBOEn, wbIdx[4:0], aluOut[31:0], iAddr[31:0], PE1_dOut[3:0] to PE16_dOut[3:0];
annotations: ICS_RISC operations, PE operations]
Figure 7.12: The Output Waveform of UnitChip 
7.3 Conclusions
In this chapter, an overview of SystemC and its CAD development environment has
been given. The high-level system modelling and functional verification of the
3D-SoftChip using SystemC has been described and some simulation results provided. The
waveforms show the correct functionality for each of the sub-blocks and for the top
module of the UnitChip.
Chapter 8
Application Mapping for 3D-SoftChip
The MPEG4 Full Search Block Matching Motion Estimation Algorithm (FBMA)
has been applied to the high-level system modeled 3D-SoftChip to verify its functionality
and demonstrate its architectural superiority. The hand-crafted assembler code
implementing the algorithm is the input stimulus of the system-level modeled
3D-SoftChip. The performance is analyzed in comparison with a conventional DSP
processor, Application Specific ICs (ASICs) and MorphoSys.
8.1 Full Search Block Matching Algorithm (FBMA) 
Motion estimation (ME) is introduced to exploit the temporal redundancy of video
sequences and is an indispensable part of video compression standards such as
ISO/IEC MPEG-1, MPEG-2 and MPEG-4 and CCITT H.261/ITU-T H.263. ME is
computationally the most demanding portion of the video encoder: it can take up to
80% of the total computation time and can be a major limiting factor for performance.
Among the many different ME algorithms, FBMA is one of the most widely used in
hardware, despite its high computational cost, because it has the optimal performance and
the lowest control overhead. The block matching motion estimation algorithm compares a
specific sized block of pixels in the current frame with a range of equally sized pixel
blocks in the previous frame to find the best match (minimum difference) between two of
the blocks. The position of the best matched block can then be encoded as a motion
• STEP 1 - LOAD REF. BLOCK DATA INTO PE ARRAY SRAM: The first
operation is to load the reference block data (I_k(m, n)) into the embedded SRAM in
each PE in the array.
• STEP 2 - EACH PE MOVES THIS DATA TO INTERNAL REGISTER: 
Each PE moves the reference data from the embedded SRAM into an internal 
register so it is available to be used for calculation of SAD values for the entire 
search window. 
• STEP 3 - LOAD FIRST SEARCH POSITION BLOCK DATA INTO PE
ARRAY SRAM: The block data for the first search position (I_{k-1}(m+dx, n+dy))
is then loaded into the embedded SRAM in each PE in the array ready for
calculation of the SAD value between the reference block and this first search
position.
• STEP 4 - EACH PE EXECUTES SUBTRACTION AND ABSOLUTE
VALUE COMPUTATION: In this step, each PE carries out a subtraction
operation between the reference block data and the current search position data in
SRAM. The absolute value of the resulting difference is stored as the absolute
difference value for that block position.
• STEP 5 - PARTIAL SUMMATION (1): In this step every odd columned PE
performs a partial sum operation of its absolute difference value with the value
from the PE to its immediate right in the array. The result is stored as a
double-word value across both PEs.
• STEP 6 - PARTIAL SUMMATION (2): In this step the two partial sums
computed in the previous step are summed in the same way: every odd columned
PE pair sums its result with the result from the PE pair to its right. This result is
stored as a quad-word value across all four PEs in each row.
• STEP 7 - PARTIAL SUMMATION (3): In this step the column wise operation





in this case, however, the second row of PEs accumulates its result with the result
from the row above, while the third row of PEs accumulates its result with the
result from the row below.
• STEP 8 - PARTIAL SUMMATION (4): In this final partial sum accumulation,
the second row of PEs sums its result with the result from the third row, producing
the total SAD value for that search position.
• STEP 9 - WRITE BACK RESULT DATA TO THE ICS_RISC: Finally the
resultant SAD value calculated in STEP 8 is written back to the internal register in
the ICS_RISC for comparison with the previous minimum and updating of the
motion vector if applicable.
• STEP 10 - REPEAT STEPS 4 TO 9: The next search position data block can be
loaded into the SRAM in the PE array while the SAD calculation is being carried
out for the current search position, so once the result has been written back the
calculation of the SAD for the next search position can begin immediately.
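The reduction order of STEPS 4 to 8 can be sketched in plain C++ for one 4x4 PE array. This is an illustrative model under simplifying assumptions: it uses ordinary `int` arithmetic instead of 4-bit PEs with double-word and quad-word results spread across neighbouring PEs, and the names `Block4x4`, `sad_tree` and `sad_direct` are invented here.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>

using Block4x4 = std::array<std::array<int, 4>, 4>;  // one pixel per PE

// SAD computed in the order of STEPS 4 to 8 (row-wise pairs first, then rows).
int sad_tree(const Block4x4& ref, const Block4x4& cand) {
    Block4x4 d{};
    // STEP 4: every PE computes |reference - candidate|.
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            d[r][c] = std::abs(ref[r][c] - cand[r][c]);
    for (int r = 0; r < 4; ++r) {
        // STEP 5: columns 1 and 3 add the value of their right-hand neighbour.
        d[r][0] += d[r][1];
        d[r][2] += d[r][3];
        // STEP 6: the two pair sums of each row are combined.
        d[r][0] += d[r][2];
    }
    // STEP 7: row 2 accumulates the row above, row 3 accumulates the row below.
    d[1][0] += d[0][0];
    d[2][0] += d[3][0];
    // STEP 8: row 2 adds the result of row 3, giving the SAD of this quarter block.
    return d[1][0] + d[2][0];
}

// Reference implementation: a plain double loop over all 16 positions.
int sad_direct(const Block4x4& ref, const Block4x4& cand) {
    int sum = 0;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            sum += std::abs(ref[r][c] - cand[r][c]);
    assert(sum >= 0);
    return sum;
}
```

Both functions return the same value for any pair of blocks; the tree version simply makes explicit which neighbour each PE reads in each step.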
8.3 Performance Analysis 
Figure 8.3 shows the performance comparison of the 3D-SoftChip with a DSP
processor, several ASICs and MorphoSys for matching an 8x8 reference block against its
search area of 8 pixels displacement. There are 81 candidate blocks (27 iterations) in each
search area [33]. In the 3D-SoftChip, as described above, one candidate block takes just
7 clock cycles (each UnitChip computes one quarter block, so with 4 UnitChips one
complete block is computed every 7 cycles), so the total number of processing cycles for
the 3D-SoftChip is 567 (81 iterations of 7 cycles each).
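The cycle count is simple arithmetic and can be written down as a short C++ sketch. The function names `total_cycles` and `effective_cycles` are hypothetical, and the scaling function assumes ideal linear speed-up relative to the 4-UnitChip baseline, which is an assumption of this example rather than a measured result.

```cpp
#include <cassert>

// Total cycles for one reference block: candidates x cycles per candidate.
unsigned total_cycles(unsigned candidates, unsigned cycles_per_candidate) {
    return candidates * cycles_per_candidate;
}

// Effective cycles with `unit_chips` UnitChips, assuming ideal linear scaling
// from a 4-UnitChip baseline; the result is rounded to the nearest cycle.
unsigned effective_cycles(unsigned baseline_cycles, unsigned unit_chips) {
    assert(unit_chips >= 4);
    return (baseline_cycles * 4 + unit_chips / 2) / unit_chips;
}
```

With 81 candidates at 7 cycles each, `total_cycles(81, 7)` gives 567, and `effective_cycles(567, 16)` gives 142 for a 4x4 UnitChip array under the linear-scaling assumption.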
The number of clock cycles required with just 4 UnitChips is very close to that
reported for MorphoSys. This can readily be improved simply by increasing the
number of UnitChips on a scaled-up 3D-SoftChip. A 4x4 UnitChip array, for example,
would have an effective throughput of one block every 142 cycles. In addition to this,
considering the characteristics of the 3D system, there are other significant advantages.
Data dependency is largely eliminated, so after the initial set-up there is 100% PE
utilisation. The reference and candidate block data can be moved into the embedded
SRAM in the PE concurrently with array execution, so the PEs can operate continuously.
Low power consumption can also be achieved by minimising the number of
data accesses, because most of the data manipulation can be executed within the PE array.
Most importantly, however, because all memory is directly accessible within the
3D-SoftChip via the IBIA there are effectively zero external data reads and thus power












[Bar chart labels: MorphoSys [33]; 3D-SoftChip with 4 UnitChips; 3D-SoftChip with
16 UnitChips (142 cycles)]
Figure 8.3: Performance comparison for Motion Estimation 
Compared with the DSP processor and the dedicated ASICs, the suggested 4x4 UnitChip
3D-SoftChip shows a remarkable advance, with a theoretical capability of more than 3.8
times the performance. Given its wide applicability/adaptability to any number of other
applications, the performance achieved compared to these dedicated processors is
potentially an enormous advancement. This demonstrates the architectural superiority of
the suggested novel 3D-SoftChip.
8.4 Conclusions 
In this chapter, the MPEG4 full search block matching algorithm has been mapped
onto the system-level modelled 3D-SoftChip in order to demonstrate its architectural
superiority. According to the described results, the proposed
3D-SoftChip architecture has the potential for a more than 3.8 times performance
improvement over conventional systems. The suggested 3D-ACSoC is clearly a highly




In this chapter, the contributions of this thesis are summarised and future
research work is suggested.
9.1 Contributions 
In this thesis, a novel 3D vertically integrated adaptive system-on-chip architecture
for next generation computing systems has been presented, along with its functional
verification and the mapping of an MPEG4 motion estimation algorithm. The suggested
architecture has a number of advantages compared with conventional current generation
reconfigurable/adaptive computing systems, such as wide applicability, various and
powerful computation methods, adaptive word-length configuration and benefits from the
architecture including 3D interconnect performance, reliability and a reduction in the size
of the chips (and thus the cost), as described before. As outlined in Section 5.3, the size of
the total chip as described is relatively small at around 1.1 mm² for an array of 2x2
UnitChips or 2.2 mm² for an array of 4x4 UnitChips. This is based on a 4-bit word-length
for the PEs, so there is also ready potential to extend to a wider word-length (8-bit
word-length) and to integrate more PEs to maximise the computational throughput and the
benefits from large-scale integration. Moreover, the ICS_RISC can also be readily
extended on the upper chip layer by adopting advanced computation algorithms and
dedicated instructions for specific applications to allow more efficient controllability and
performance over the current relatively simple ICS_RISC design. As minimum feature
sizes continue to decrease in more advanced chip fabrication processes, the inherent
scalability of the UnitChip design means that the array size can simply be increased
within the constraints of the maximum die size to realise ever more powerful adaptive
computing systems.
The performance of the MPEG4 full search block matching motion estimation
algorithm has been shown to be more than 3.8 times better than that of current
generation processors. Due to these significant performance, power and cost advantages
the suggested 3D-ACSoC can be considered one of the most suitable architectures for
the next generation of computing systems.
Moreover, the suggested advanced HW/SW co-design and verification methodology
can improve reliability and significantly reduce the design time, especially the time
and effort required for verification. This thesis indicates a highly promising research
direction for future adaptive computing systems and an advanced and efficient HW/SW
development methodology for ever more complicated SoCs.
9.2 Future Work 
As introduced in the suggested design methodology, the high-level modelling and
functional verification has been carried out; the next task is architectural exploration
to obtain an optimized HW specification. The method to explore various architecture
options is parameterized memory, data frame buffer and DMA controller
modelling using SystemC, followed by simulation with various HW configurations so as
to find the best HW specification. The parameterized modelling method makes
the architecture exploration considerably easier because the parameter values can simply
be changed in the SystemC code. Figure 9.1 shows the SystemC modelling of the
parameterized memory. Once the optimum HW specification is decided, the rest of the
procedure can be executed with any conventional hardware design method, such as full
and semi-custom design and the SW design should be concurrently performed so that the





#include "systemc.h"
template <class T, int size = 100>
SC_MODULE(ram) {
sc_in< bool> 









    //Read / Write
    // Address
    //Parameterized Word-length
    ram(sc_module_name name_, bool debug_ = false) :
        sc_module(name_), debug(debug_)
    {
        SC_THREAD(ram_proc);
        sensitive << clock.pos();
        buffer = new T[size];
        if (debug) {
            cout << "Running constructor of " << name() << endl;
            cout << "Number of locations is " << size << endl;
        }
    }
private:
    T* buffer;
    const bool debug;
};
template <class T, int size>




    if (nRW) {
        data = buffer[addr];
    } else {





Figure 9.1: The parameterized Memory modeling example using SystemC
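The idea behind the parameterized model, changing the word type and the number of locations without touching the memory logic, can also be shown in a self-contained form that does not need the SystemC library. The sketch below is an assumption-laden stand-in, not the thesis's module: the class name `ParamRam`, the `access` method and the convention that `nRW == true` means read (taken from the read branch visible in Figure 9.1) are choices made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A memory whose word type T and depth Size are compile-time parameters.
template <class T, std::size_t Size = 100>
class ParamRam {
public:
    // nRW == true: read the word at addr; nRW == false: write data to addr.
    T access(bool nRW, std::size_t addr, T data = T{}) {
        assert(addr < Size);  // out-of-range addresses are a modelling error
        if (nRW) {
            return buffer_[addr];
        }
        buffer_[addr] = data;
        return data;
    }

    // Number of locations, useful when sweeping configurations.
    std::size_t size() const { return Size; }

private:
    std::array<T, Size> buffer_{};  // zero-initialised storage
};
```

Exploring a different configuration then only means instantiating another variant, for example `ParamRam<unsigned char, 16>` for a small byte-wide memory or `ParamRam<unsigned>` for the default depth of 100 locations.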
Bibliography 
[1] E. Mirsky and A. DeHon, "MATRIX: A Reconfigurable Computing Architecture
with Configurable Instruction Distribution and Deployable Resources," Proc. IEEE Symp.
FPGAs for Custom Computing Machines, pp. 157-166, April 1996
[2] S.C. Goldstein, H. Schmit, B. Mihai, S. Cadambi, M. Matt, R.R. Taylor, "PipeRench:
A Reconfigurable Architecture and Compiler," IEEE Computer, pp. 70-77, April 2000
[3] S. Hartej, L. Ming-hua, L. Guangming, J.K. Fadi, B. Nadar, M.C.F.
Eliseu, "MorphoSys: An Integrated reconfigurable system for data-parallel and
computation-intensive applications", IEEE Transactions on Computers, Vol. 49, No. 5,
pp. 465-481, May 2000
[4] E. Waingold et al., "Baring it all to software: RAW Machines", Computer, Vol. 30,
Issue 9, pp. 86-93, Sept. 1997
[5] T. Miyamori, K. Olukotun, "REMARC: Reconfigurable Multimedia Array
Coprocessor", Proc. ACM/SIGDA FPGA98, Monterey, Feb. 1998
[6] C. Ebeling et al., "Architecture design of reconfigurable pipelined datapaths",
Advanced Research in VLSI, 1999. Proceedings. 20th Anniversary Conference on,
pp. 23-40, 21-24 Mar. 1999
[7] Goldstein, S.C., et al., "PipeRench: a reconfigurable architecture and compiler",
Computer, Vol. 33, Issue 4, pp. 70-77, April 2000
[8] D. Chen, J. Rabaey, "PADDI: Programmable arithmetic devices for digital signal
processing", VLSI Signal Processing IV, IEEE Press, 1990
[9] Elixent Ltd., "The Reconfigurable Algorithm Processor",
http://www.elixent.com/products/white_papers.htm
[10] Triscend Corp., "Triscend A7S Configurable System-on-Chip Platform",
http://www.triscend.com/
[11] Motorola Inc., "MRC6011: Reconfigurable Compute Fabric (RCF) Device",
http://www.motorola.com/semiconductors/
[12] Nick Tredennick, Brion Shimamoto, "Special Report: Do-it-all devices", IEEE
Spectrum, pp. 37-40, Dec. 2003
[13] L. Guangming, "Modeling, Implementation and Scalability of the MorphoSys
Dynamically Reconfigurable Computing Architecture," PhD thesis, Univ. of California,
Irvine, 2000
[14] S. Eshraghian, S. Lachowicz, K. Eshraghian, "3-D Vertically Integrated
Configurable Soft-Chip with Terabit Computational Bandwidth for Image and Data
Processing", Proc. MIXDES'2003, Lodz, Poland, June 26-28, 2003
[15] A. Rassau, G. Alagoda, A. Ehrdardt, S. Lachowicz, K. Eshraghian, "Design
Methodology for a 3D-SoftChip Video Processing Architecture", 6th World
Multiconference on Systemics, Cybernetics and Informatics (SCI2002), Orlando, Florida,
U.S.A., pp. 324-329, July 14-17, 2002
[16] QuickSilver Technology Inc., "Adapt2400 ACM architecture Overview",
http://www.quicksilvertech.com/pdfs/Adapt2400_Whitepaper_0404.pdf
[17] picoChip Design Limited, "PC102 Product Brief", http://www.picochip.com
[18] S. Eshraghian, "Implementation of Arithmetic Primitives Using Truly Deep
Submicron Technology (TDST)", MS thesis, Edith Cowan University, 2004
[19] Open SystemC Initiative, "The Functional Specification for SystemC 2.0",
http://www.systemc.org/
[20] Open SystemC Initiative, "SystemC 2.0.1 Language Reference Manual Rev 1.0",
http://www.systemc.org/
[21] International Technology Roadmap for Semiconductors (ITRS), "International
Technology Roadmap for Semiconductors 2003 Edition", http://public.itrs.net/
[22] AMDREL Consortium, "Existing Functional Level Reconfigurable Implementation
Platforms", http://vlsi.ee.duth.gr/amdrel/deliverables.html
[23] IZM, "3D System Integration",
http://www.pb.izm.fhg.de/izm/015_Programms/010_R/
[24] Joyner J.W., Zarkesh-Ha P.J., Meindl J.D., "Global Interconnect Design in a
Three-Dimensional System-on-a-Chip", IEEE Transactions on VLSI Systems, Vol. 12,
Issue 4, pp. 367-372, April 2004
[25] J.W. Joyner, et al., "Impact of three-dimensional architectures on interconnects in
gigascale integration", IEEE Trans. VLSI Syst., Vol. 9, pp. 922-928, Dec. 2001
[26] Kaustav Banerjee, et al., "3-D ICs: A Novel Chip Design for Improving
Deep-Submicrometer Interconnection", Proceedings IEEE Special Issues on
Interconnections, Vol. 89, No. 5, pp. 602-633, May 2001
[27] Texas Instruments, "TMS320C6000 Assembly Benchmarks", http://www.ti.com/sc/docs/products/dsp/c6000/benchmarks/67x.htm
[28] K. M. Yang, M.-T. Sun and L. Wu, "A Family of VLSI Design for Motion Compensation Block Matching Algorithm", IEEE Trans. on Circuits and Systems, Vol. 36, No. 10, pp. 1317-25, October 1989
[29] C. Hsieh and T. Lin, "VLSI Architecture for Block-Matching Motion Estimation Algorithm", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, pp. 169-175, June 1992
[30] Yeong-don Bae, "Basic Microprocessor Design", http://www.donny.co.kr
[31] Yap Zi He, "Building A RISC Microcontroller in an FPGA", http://www.opencores.org/projects/riscmcu/
[32] Michael Shyu, Guang-Ming Wu, Yu-Dong Chang and Yao-Wen Chang, "Generic Universal Switch Blocks", IEEE Trans. on Computers, Vol. 49, Issue 4, pp. 348-359, April 2000
[33] Hartej Singh, "Reconfigurable Architectures for Multimedia and Data-Parallel Application Domains", PhD Thesis, University of California Irvine, 2000
[34] R. Gao, D. Xu and J.P. Bentley, "Reconfigurable Hardware Implementation of an Improved Parallel Architecture for MPEG-4 Motion Estimation in Mobile Applications", IEEE Trans. on Consumer Electronics, Vol. 49, No. 4, pp. 1383-1390, November 2003
[35] J. Rabaey et al., "Reconfigurable Computing: The Solution to Low Power Programmable DSP", Proc. ICASSP 97, Munich, Germany, April 1997
[36] Stylianos Perissakis et al., "Embedded DRAM for a Reconfigurable Array", Proc. of VLSI'99, June 1999
[37] Chameleon Systems Inc., "CS2000TM Reconfigurable Communications Processor", CS2000TM Advanced Product Information, http://www.chameleonsystems.com
3D-SoftChip
A Novel 3D Vertically Integrated Adaptive Computing System
Appendix A - ICS_RISC ISA Version 1.0
1 ICS_RISC Instruction Set Architecture Version 1.0 
Instruction word formats (bit 31 down to bit 0). Field order is given as far as it can be recovered from the original figure; the field widths are not recoverable.
Instruction class         Fields
ALU Immediate (1 word)    Opcode, Rd, Immediate (4, 8, 16-bit)
ALU Immediate (2 words)   Opcode, Rd, Unused; second word: Immediate (32-bit)
ALU Register              Opcode, Rd, Rs2, Rs1, Unused
ALU LB Addressing         Opcode, Rd, Rs2, Rs1, Unused
Shift / Rotate            Shift, X, Rd, ShiftAmt, Rs1, Unused
Load                      X, Rd, X, Rb, Unused
Store                     X, Rd, X, Rb, Unused
Branch                    Cond, X, Offset
PE Control                PE Op, Opmode, Config, PESel, Unused
DMA Control               DMA Op, Sel, Start address of SRAM/ICS Reg (S/D), Start address of Program/Data Memory (Sou/Dst.)
Dedicated Instructions    Not yet decided
A DISSERTATION FOR THE DEGREE OF MASTER OF ENGINEERING SCIENCE
















Opcodes  Mnemonics  Description (Immediate)        Description (Register)
0 0 0 0  MOVA       Rd = Immediate                 Rd = Rs1
0 0 0 1  MOVB       Rd = Immediate                 Rd = Rs2
0 0 1 0  AND        Rd = Rd & Immediate            Rd = Rs1 & Rs2
0 0 1 1  OR         Rd = Rd | Immediate            Rd = Rs1 | Rs2
0 1 0 0  XOR        Rd = Rd ^ Immediate            Rd = Rs1 ^ Rs2
0 1 0 1  NOT        Rd = ~Immediate                Rd = ~Rs1
0 1 1 0  ADD        Rd = Rs1 + Immediate           Rd = Rs1 + Rs2
0 1 1 1  SUB        Rd = Rs1 - Immediate           Rd = Rs1 - Rs2
1 0 0 0  CMP        Compare Rs1 and Immediate      Compare Rs1 and Rs2
1 0 0 1  MSR        Status Register = Immediate    Status Register = Rs1
1 0 1 0  MRS        N/A                            Rs1 = Status Register
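The register-form column of the opcode table can be read as a lookup from opcode to operation. The following Python sketch models only the eight data-processing opcodes, using the 4-bit opcode values that also appear in Table 2.1. The names `ALU_REG_OPS` and `alu_reg`, and the wrap-around of results to 32 bits, are assumptions made for illustration; this is not the SystemC model of the thesis.

```python
MASK32 = 0xFFFFFFFF  # ICS_RISC registers are 32 bits wide

# Register-form semantics of the data-processing opcodes (hypothetical model).
ALU_REG_OPS = {
    0b0000: ("MOVA", lambda rs1, rs2: rs1),
    0b0001: ("MOVB", lambda rs1, rs2: rs2),
    0b0010: ("AND",  lambda rs1, rs2: rs1 & rs2),
    0b0011: ("OR",   lambda rs1, rs2: rs1 | rs2),
    0b0100: ("XOR",  lambda rs1, rs2: rs1 ^ rs2),
    0b0101: ("NOT",  lambda rs1, rs2: ~rs1),
    0b0110: ("ADD",  lambda rs1, rs2: rs1 + rs2),
    0b0111: ("SUB",  lambda rs1, rs2: rs1 - rs2),
}


def alu_reg(opcode, rs1, rs2):
    """Return (mnemonic, result) with the result wrapped to 32 bits (assumed)."""
    name, fn = ALU_REG_OPS[opcode]
    return name, fn(rs1, rs2) & MASK32
```

For example, `alu_reg(0b0110, 14, 15)` gives `("ADD", 29)`. CMP, MSR and MRS are left out because they act on the status register rather than on Rd.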
Shift  Mnemonics  Description
0 0 0  LSL        Shift Left
0 0 1  LSR        Shift Right
0 1 0  ASR        Arithmetic Shift Right
0 1 1  ROT        Rotate
Cond   Mnemonics  Description
0 0 0  EQ         Equal
0 0 1  NE         Not Equal
0 1 0  AL         Always (Unconditional)

PE Operations  Mnemonics  Description
0 0 0          PECONF     Configuration of each PE (4, 8, 16, 32 bits)
0 0 1          PESEL      Select a certain PE (PE0 - PE15)
0 1 0          PEMODE     Select the PE operation mode (Horizontal/Vertical/Circular modes)
0 1 1          PEVEXE     Execute a specific program on each PE in the same vertical line
1 0 0          PEHEXE     Execute a specific program on each PE in the same horizontal line
1 0 1          PECEXE     Execute a specific program on each PE in the same circular line

DMA Operations  Mnemonics  Description
0 0 0           LDPEPRG    Load a maximum of 16 program data from Program memory to Embedded SRAM in PEs
0 0 1           LDDFB      Load a large amount of processing data for PEs from Memory to Data Frame Buffer
0 1 0           LDPEDATA   Load a large amount of processing data for PEs from DFB to Embedded SRAM in PE
0 1 1           WBREG      Write back processed data in Embedded SRAM to the registers in the ICS_RISC
1 0 0           WBDFB      Write back processed data in Embedded SRAM to DFB
1 0 1           WBMEM      Write back processed data in DFB to Data Memory
1.1 Instruction descriptions
• Immediate addressing: Short immediate values: 4, 8 or 16 bits (1 instruction word). Long immediate value: 32 bits (2 instruction words)
Rd = Rd op Immediate (4, 8, 16, 32 bit)
1 instruction word:   0 0 0 | 0 0 0 0 | Opcode | Rd | 4, 8, 16 bit Constant
2 instruction words:  0 0 0 | 0 0 0 1 | Opcode | Rd | Unused
                      32 bit Constant
Description: The processed data from PEs and immediate data from regFile or data memory can be manipulated in the ICS_RISC so it can 






• Register addressing:
Rd = Rs1 op Rs2
Description: Rs1 and Rs2 give the addresses of the operands in the internal regFile (32 registers of 32 bits, 32 x 32 bit). The opcode identifies the operation, and the result of applying it to Rs1 and Rs2 is stored in the register indicated by Rd.
• LB Addressing:
Rd = Rs1 op Rs2
Description: When LB addressing is active, the source addresses are taken from the Loop Buffer, which has a looping capacity 16 entries deep.
• Shift / Rotate:
Rd = Rs1 Shift by Amount
Description: The shifter shifts the input operand according to shiftCtl and shiftAmt.
• Load:
Rd = Mem[Rb]




Description: The register Rd in the regFile is loaded with the data at the data memory address indicated by Rb.
• Store:
Mem[Rb] = Rd
Description: The data in register Rd of the regFile is stored to the data memory address indicated by Rb.
• Branch:
If (Cond) PC = PC + Offset
Description: If the condition selected by the Cond signals holds, the Program Counter is increased by the offset value.
• Multiply:
Rd = Rs1 * Rs2
Description: The operands are multiplied and the result is stored in Rd.
• PE Control:
PECONF, PESEL, PEMODE (Horizontal/Vertical/Circular modes), PEEXE
• DMA Control:
LDPEPRG: Load a maximum of 16 program data from Program memory to Instruction Decoder in PEs
LDDFB: Load a large amount of processing data for PEs from Data Memory to Data Frame Buffer
LDPEDATA: Load a large amount of processing data for PEs from DFB to Embedded SRAM in PE
WBREG: Write back processed data in Embedded SRAM in PE to the registers in the ICS_RISC
WBDFB: Write back processed data in Embedded SRAM in PE to DFB
WBMEM: Write back processed data in DFB to Data Memory
• Dedicated Instructions 
Not yet decided 
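The branch rule "If (Cond) PC = PC + Offset" together with the Cond codes EQ, NE and AL can be sketched as follows. This Python fragment is an illustration under stated assumptions: EQ is taken to mean that the Zero flag is set, a branch that is not taken is assumed to advance the PC by one instruction word, and the offset is treated as an unsigned value. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def cond_passes(cond, z_flag):
    """Evaluate a 3-bit Cond code: 000 EQ, 001 NE, 010 AL."""
    if cond == 0b000:   # EQ: assumed to test the Zero flag
        return bool(z_flag)
    if cond == 0b001:   # NE
        return not z_flag
    if cond == 0b010:   # AL: always taken
        return True
    raise ValueError("undefined condition code")


def next_pc(pc, cond, offset, z_flag):
    """Branch: if (Cond) PC = PC + Offset, else fall through (PC + 1 assumed)."""
    if cond_passes(cond, z_flag):
        return (pc + offset) & MASK32
    return (pc + 1) & MASK32
```

With these assumptions, `next_pc(10, 0b010, 5, False)` returns 15, while an EQ branch with a cleared Zero flag falls through to 11.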









Loop Buffer(LB) Addressing 
3D-SoftChip
Appendix B - High-level Modeling of 3D-SoftChip
using SystemC
1 Configurable Array Processor (CAP) Chip 




[Figure: Standard-PE block diagram; recoverable labels: data bus from adjacent PEs, MUX A, ALU (Mul, Add, Sub, Comp), Embedded SRAM]
Figure 1.1: Standard-PE architecture 
MUX A, MUX B: input operand selection
ALU: 4-bit ALU with bit-serial multiplier, adder, subtractor, comparator
Registers: 4 sets of registers
DoutReg: data out register to send data to adjacent PEs (Up/Down/Left/Right)
Embedded SRAM: embedded SRAM (word length: 4-bit, address: 0-15)
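The component list mentions a bit-serial multiplier inside the 4-bit ALU. A bit-serial multiplier consumes one multiplier bit per step and adds a shifted copy of the multiplicand when that bit is set. The Python sketch below shows that principle for unsigned operands; the function name, the unsigned interpretation and the double-width result are assumptions, since the text does not state how the S-PE handles sign or truncation.

```python
def bit_serial_multiply(a, b, width=4):
    """Multiply two unsigned width-bit operands, one multiplier bit per step."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for step in range(width):
        if (b >> step) & 1:       # examine one bit of B per step
            product += a << step  # add the shifted multiplicand
    return product                # up to 2 * width bits wide
```

For two 4-bit operands the loop runs four times, so `bit_serial_multiply(15, 15)` yields 225 after four steps.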
1.3 S-PE function 
Table 1.1 : S-PE functions 
Function Mnemonics 
A and B AND
A or B OR
not A NOT




A comp B COMP 










[Figure: S-PE instruction format, bits 15 to 0; recoverable fields: SRAM Selection, Register Selection]
Figure 1.2: S-PE instruction format 















Figure 1.3: S-PE block diagram {Input/Output Pin Description) 
1.6 Data-path Architecture of S-PE 
[Figure: S-PE data path; recoverable labels: instruction from ICS, input data bus from adjacent PEs (dInBus), output data bus (doutBus), Register; control signals include muxACtl and rwRegEn]
Figure 1.4: Data-path architecture of S-PE 
1.7 S-PE Operation Flow 
Begin 
Instruction Fetch 
(Instruction from ICS) 
Instruction Decoding 
(1) Input operands select
(2) ALU Operation select 
(3) ALU output result store select 
Execute 
End 
Figure 1.5: S-PE operation flow 







Figure 1.6: SystemC file structure of S-PE 
1.9 SystemC Codes for S-PE 
See Appendix C 
1.10 Output Waveform 
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.clock, SystemC.dDown[3:0], SystemC.dOut[3:0], SystemC.dOutadjPE[3:0]]
Figure 1.7: Top level simulation result of S-PE 

























Figure 1.8: Processing Accelerator-PE architecture 
1.12 System Components 
• MUX A, MUX B: input operand selection
• Multiplier: a signed 4-bit scalable parallel/parallel multiplier
• Accumulator/Subtractor: enables MAC and MAS operations within one clock cycle
• 8-bit Barrel shifter
• Registers: 4 sets of registers
• Embedded SRAM: embedded SRAM (word length: 4-bit, address: 0-15)
1.13 PA-PE Functions 
Table 1.2: PA-PE functions
Function                 Mnemonics
A x B                    PAMUL
A x B + out(t)           MAC
A x B - out(t)           MAS
Logical Shift Left       LSL
Logical Shift Right      LSR
Arithmetic Shift Right   ASR
Rotate                   ROR
|A| (Absolute value)     ABS
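The arithmetic rows of Table 1.2 can be illustrated with a small behavioural model in which `acc` plays the role of the previous output out(t). This is a hedged Python sketch: the function name `pa_pe_step` is hypothetical, and the output width and overflow behaviour are not modelled because the text does not specify them.

```python
def pa_pe_step(op, a, b, acc):
    """Return the new PA-PE output for one arithmetic operation.

    a and b are the operands; acc is the previous output out(t).
    """
    if op == "PAMUL":
        return a * b
    if op == "MAC":              # A x B + out(t)
        return a * b + acc
    if op == "MAS":              # A x B - out(t), as written in Table 1.2
        return a * b - acc
    if op == "ABS":
        return abs(a)
    raise ValueError("unsupported operation: " + op)


def dot_product(xs, ys):
    """Accumulate a dot product by issuing one MAC per element pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = pa_pe_step("MAC", x, y, acc)
    return acc
```

A three-element dot product such as `dot_product([1, 2, 3], [4, 5, 6])` therefore takes three MAC steps and returns 32.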
1.14 PA-PE Instruction Format 
[Figure: PA-PE instruction format, bits 18 to 0; recoverable fields: WS_en/RS_en, SRAM Selection, Register Selection, DoutR, PA-PE_OP, MUX_B, MUX_A]
Figure 1.9: PA-PE instruction format















Figure 1.10: PA-PE block diagram (Input/Output Pin Description) 
1.16 Data-path Architecture of PA-PE 
[Figure: PA-PE data path; recoverable labels: instruction from ICS, input data bus from adjacent PEs (dinBus), output data bus (doutBus), Accumulator/Subtractor, Register, Embedded SRAM; control signals include muxACtl, sramSel, rwRegEn, rwSEn and sramEn]
Figure 1.11: Data-path architecture of PA-PE 
1.17 PA-PE Operation Flow 
Begin 
Instruction Fetch 
(Instruction from ICS) 
Instruction Decoding 
(1) Input operands select 
(2) ALU Operation select 
(3) ALU output result store select 
Execute 
End 
Figure 1.12: PA-PE operation flow 









Figure 1.13: SystemC file structure of PA-PE 
1.19 SystemC Codes for PA-PE 
See Appendix C. 
1.20 Output Waveform
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.dRight[3:0], SystemC.dUp[3:0], SystemC.dDown[3:0], SystemC.muxAOut[3:0], SystemC.s_aluOut[3:0], SystemC.sramData[3:0], SystemC.doutBus[3:0], SystemC.dOut[3:0], SystemC.dOutadjPE[3:0]]
Figure 1.14: Top level simulation of PA-PE 
2 ICS (Intelligent Configurable Switch) Chip
2.1 ICS_RISC (32-bit Dedicated RISC Control Processor) 
[Figure: ICS_RISC overview; recoverable labels: Instruction Address, Instruction Data <31:0>, Loop Buffer, (32 x 32 bit), I/O Unit, Instruction Register]
Figure 2.1: Overall architecture of ICS_RISC









[Figure: detailed ICS_RISC block diagram; recoverable labels: Control Unit, Data Memory (ICS/CAP)]
Figure 2.2: Detailed ICS_RISC Architecture






















Special Features (ICS_RISC)
Harvard architecture, 3-stage pipelined architecture (Fetch, Decode, Execute)
Memory access, during the execution stage, is done by load/store instructions only
All operations except load/store, PE and DMA operations are register-to-register within the ICS_RISC
Single-cycle instruction execution
System Components (ICS_RISC)
Program Counter: the 32nd GPR is the program counter
Loop Buffer: 16 x 32-bit buffer that generates instruction addresses for iterative instructions
Register file (General Purpose Register): 32 x 32-bit general purpose registers
Status Register: 4 flags (N: Negative / Less Than, Z: Zero, C: Carry / Borrow, V: Overflow)
Instruction Register: Instruction decoder for ALU and Control Unit
ALU & Control Unit: consists of ALU, Shifter and Multiplier
I/O Unit: 32-bit data input/output registers (dInReg, dOutReg)
ICS_RISC Functions
See Appendix A
ICS_RISC Block Diagram (Input/Output Pin Description)
[Figure: ICS_RISC block with pins iData[31:0], iAddr[31:0], dAddr[31:0], dData[31:0], Reset, Clock]
Figure 2.3: ICS_RISC Block Diagram (Input/Output Pin Description)
2.7 UnitICS Block Diagram (Input/Output Pin Description)
[Figure: UnitICS block containing the ICS_RISC, Data Frame Buffer and DMA Controller; recoverable pins: iAddr[31:0], iData[31:0], dData[31:0], dAddr[31:0], Reset, Clock, DMA Control]
Figure 2.4: UnitICS Block Diagram (Input/Output Pin Description)
2.8 Three-stage Pipeline Architecture (ICS_RISC)
FETCH   DECODE  EXECUTE
        FETCH   DECODE  EXECUTE
                FETCH   DECODE
Figure 2.5: Three-stage Pipeline Architecture (ICS_RISC)
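The overlap shown in Figure 2.5 can be expressed as a schedule: instruction i is fetched in cycle i, decoded in cycle i + 1 and executed in cycle i + 2. The Python sketch below builds that schedule for an ideal pipeline; it assumes no stalls, branches or flushes, and the function names are hypothetical.

```python
STAGES = ("FETCH", "DECODE", "EXECUTE")


def pipeline_schedule(n_instructions):
    """Map each clock cycle to the (instruction index, stage) pairs active in it."""
    schedule = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s, []).append((i, stage))
    return schedule


def total_cycles(n_instructions):
    """Cycles needed to drain n instructions through the ideal 3-stage pipeline."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1
```

In cycle 2 of a three-instruction run all stages are busy: instruction 0 executes while instruction 1 is decoded and instruction 2 is fetched, and the whole run takes 5 cycles instead of 9.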
2.9 Register Architecture (ICS_RISC) 

















31           28 27            0
| N | Z | C | V | (Reserved)  |
N : Negative / Less Than
Z : Zero
C : Carry / Borrow
V : Overflow
Figure 2.6: Register Architecture (ICS_RISC)
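To make the flag layout concrete, the following Python sketch computes N, Z, C and V for a 32-bit addition and packs them into bits 31 to 28 of a status word, leaving bits 27 to 0 at zero. The bit order (N in bit 31, then Z, C, V) follows the left-to-right order of the figure and is an assumption, as is the function name `add_with_flags`.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b):
    """32-bit add returning (result, status_word) with N, Z, C, V in bits 31..28."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31                              # sign bit of the result
    z = int(result == 0)                          # result is zero
    c = full >> 32                                # carry out of bit 31
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1     # signed overflow
    status = (n << 31) | (z << 30) | (c << 29) | (v << 28)
    return result, status
```

Adding 1 to 0x7FFFFFFF sets N and V (status word 0x90000000), while adding 1 to 0xFFFFFFFF sets Z and C (status word 0x60000000).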





2.10 Data-path Architecture of ICS_RISC














[Figure: ICS_RISC data path; recoverable blocks: Incrementor, GPR, Status Reg, dOutReg, dInReg; recoverable control signals: opAIdx, opBIdx, Imm, aluCtl, shiftCtl, dOutEn, dInEn, rdAEn, rdBEn]
Figure 2.7: Data-path architecture of ICS_RISC
2.11 SystemC File Structure 




Figure 2.8: SystemC file structure of ICS_RISC
2.12 SystemC Modeling of Data-path Architecture of ICS_RISC
System Components: (1) Program Counter, (2) Status Register, (3) Loop Buffer, (4) General Purpose Register, (5) ALU, (6) Barrel Shifter, (7) Multiplier, (8) Data Input Register, (9) Data Output Register
• Program Counter (PC) 
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.iAddrTmp[31:0], SystemC.incOut[31:0], SystemC.iAddrOut, SystemC.iAddrIn[31:0], SystemC.iAddr[31:0], SystemC.dAddr[31:0]]
Figure 2.9: Output waveform of PC
• Status Register 
[Waveform screenshot (GTKWave); signal names not recoverable]
Figure 2.10: Output waveform of Status Register
• Loop Buffer (LB) 
[Waveform screenshot (GTKWave); signal names not recoverable]
Figure 2.11: Output waveform of Loop Buffer
• General Purpose Register (GPR) 
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.wbData[31:0], SystemC.rdAIdx[4:0], SystemC.rdAOEn, SystemC.rdAData[31:0], SystemC.rdBOEn, SystemC.rdBIdx[4:0], SystemC.rdBData[31:0], SystemC.pc[31:0]]
Figure 2.12: Output waveform of Register File
• ALU 
Table 2.1: ALU Functions
Opcodes  Mnemonics  Description (Immediate)        Description (Register)
0 0 0 0  MOVA       Rd = Immediate                 Rd = Rs1
0 0 0 1  MOVB       Rd = Immediate                 Rd = Rs2
0 0 1 0  AND        Rd = Rd & Immediate            Rd = Rs1 & Rs2
0 0 1 1  OR         Rd = Rd | Immediate            Rd = Rs1 | Rs2
0 1 0 0  XOR        Rd = Rd ^ Immediate            Rd = Rs1 ^ Rs2
0 1 0 1  NOT        Rd = ~Immediate                Rd = ~Rs1
0 1 1 0  ADD        Rd = Rs1 + Immediate           Rd = Rs1 + Rs2
0 1 1 1  SUB        Rd = Rs1 - Immediate           Rd = Rs1 - Rs2
1 0 0 0  CMP        Compare Rs1 and Immediate      Compare Rs1 and Rs2
1 0 0 1  MSR        Status Register = Immediate    Status Register = Rs1
1 0 1 0  MRS        N/A                            Rs1 = Status Register
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.aluAIn[31:0], SystemC.aluCtl[3:0], SystemC.condFlag[3:0], SystemC.aluOut[31:0]]
Figure 2.13: Output waveform of ALU
• 32-bit Barrel Shifter
Table 2.2: Shifter Functions
Shift  Mnemonics  Description
0 0 0  LSL        Shift Left
0 0 1  LSR        Shift Right
0 1 0  ASR        Arithmetic Shift Right
0 1 1  ROT        Rotate
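The four shifter functions of Table 2.2 can be modelled on 32-bit values as follows. This Python sketch is an illustration only: the function name is hypothetical, the shift amount is assumed to be a 5-bit value (0 to 31), and ROT is assumed to rotate to the right because the table does not state a direction.

```python
MASK32 = 0xFFFFFFFF


def barrel_shift(value, shift_ctl, shift_amt):
    """32-bit shift/rotate selected by the 3-bit shift code of Table 2.2."""
    value &= MASK32
    amt = shift_amt & 0x1F                   # 5-bit shift amount (assumed)
    if shift_ctl == 0b000:                   # LSL
        return (value << amt) & MASK32
    if shift_ctl == 0b001:                   # LSR: zero fill
        return value >> amt
    if shift_ctl == 0b010:                   # ASR: replicate the sign bit
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> amt) & MASK32
    if shift_ctl == 0b011:                   # ROT: rotate right (assumed)
        return ((value >> amt) | (value << (32 - amt))) & MASK32
    raise ValueError("undefined shift code")
```

The difference between LSR and ASR shows on a negative input: shifting 0x80000000 right by 31 gives 1 with LSR and 0xFFFFFFFF with ASR.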
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.CLK, SystemC.shiftIn[31:0], SystemC.shiftCtl[2:0], SystemC.shiftAmt[4:0]]
Figure 2.14: Output waveform of 32-bit Barrel Shifter
• 32 x 32 Signed Multiplier
[Waveform screenshot (GTKWave); signal names not recoverable]
Figure 2.15: Output waveform of Signed 32 x 32 Multiplier
• Data Input Register 
• Data Output Register 








[Waveform screenshot (GTKWave); recoverable signal names: SystemC.clock, SystemC.reset, SystemC.aluCtl[3:0], SystemC.aluOEn, SystemC.shiftCtl[2:0], SystemC.shiftOEn, SystemC.mulOEn, SystemC.opAIdx[4:0], SystemC.opBIdx[4:0], SystemC.rdAOEn, SystemC.rdBOEn, SystemC.wbIdx[4:0], SystemC.dOutCtl, SystemC.lbEn, SystemC.lbRWEn, SystemC.dIn[31:0], SystemC.zFlag, SystemC.iAddr[31:0], SystemC.dAddr[31:0], SystemC.dOut]
Figure 2.16: Output waveform of top module in data-path architecture
2.13 Control Architecture of ICS_RISC
• Fetch Unit : Fetch the instructions 
• Decoder Unit 
[Figure: decoder example; recoverable labels: decode, output, load = 0010101 xxxx]
Figure 2.17: Instruction Decoding (1) 




inst 1 0 Rs Rd 




Figure 2.18: Instruction Decoding (2) 
Table 2.3: Instruction ID for Instruction Decoding
Instruction ID  Instruction[31:25]  Description
INST_ALUIS      000/0000            ALU Immediate (1 Inst. Word)
INST_ALUIL      000/0001            ALU Immediate (2 Inst. Word)
INST_ALUR       000/0010            ALU Register
INST_ALULB      000/0011            ALU Loop Buffer Addressing
INST_SHRO       001/0100            Shift / Rotate
INST_LOAD       001/0101            Load
INST_STORE      001/0110            Store
INST_BRANCH     001/0111            Branch
INST_PECON      010/1000            PE Control
INST_DMA        1xx/xxxx            DMA Control
INST_MUL        011/1111            Multiply
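Table 2.3 amounts to a match on the top seven bits of the instruction word, with DMA Control recognised by bit 31 alone. A minimal Python sketch of that classification is given below; the names `INSTRUCTION_IDS` and `classify`, and the "UNDEFINED" result for patterns not listed in the table, are assumptions for illustration.

```python
# Instruction[31:25] patterns from Table 2.3 (DMA is matched separately).
INSTRUCTION_IDS = {
    0b0000000: "INST_ALUIS",
    0b0000001: "INST_ALUIL",
    0b0000010: "INST_ALUR",
    0b0000011: "INST_ALULB",
    0b0010100: "INST_SHRO",
    0b0010101: "INST_LOAD",
    0b0010110: "INST_STORE",
    0b0010111: "INST_BRANCH",
    0b0101000: "INST_PECON",
    0b0111111: "INST_MUL",
}


def classify(word):
    """Return the instruction ID of a 32-bit instruction word."""
    word &= 0xFFFFFFFF
    if word >> 31:                       # 1xx/xxxx: DMA Control
        return "INST_DMA"
    return INSTRUCTION_IDS.get((word >> 25) & 0x7F, "UNDEFINED")
```

For instance, the word 0x04080000 has the pattern 000/0010 in bits 31 to 25 and is classified as INST_ALUR.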
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.clock, SystemC.reset, SystemC.cond[2:0], SystemC.opcode[3:0], SystemC.shift[2:0], SystemC.rs1Idx[4:0], SystemC.PEOpmode[1:0]]
Figure 2.19: Output waveform of Instruction Decoding
• Execute Unit 
Table 2.4: Control Signal according to the Instruction
Instruction                    ALU Op.     ALU Out  Shifter Out  Mul. Out  Operand A   Operand B
ALU Immediate (1 Inst. Word)   Op Code     Enable   Disable      Disable   Rd          Immediate (4, 8, 16 bit)
ALU Immediate (2 Inst. Word)   Op Code     Enable   Disable      Disable   Rd          Immediate (32 bit)
ALU Register                   Op Code     Enable   Disable      Disable   Rs1         Rs2
ALU LB Addr.                   Op Code     Enable   Disable      Disable   Rs1         Rs2
Shift / Rotate                 Don't Care  Disable  Enable       Disable   Rb (Rs1)    ShiftAmt
Load                           MOV         Disable  Disable      Disable   Rb (Rs1)    Don't Care
Store                          MOV         Disable  Disable      Disable   Rb (Rs1)    Rd
Branch                         ADD         Enable   Disable      Disable   PC          Immediate
PE Control                     Don't Care  Disable  Disable      Disable   Don't Care  Don't Care
DMA Control                    Don't Care  Disable  Disable      Disable   Don't Care  Don't Care
Multiply                       Don't Care  Disable  Disable      Enable    Rs1         Rs2
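Table 2.4 can be held as a lookup from instruction class to control signals, which is one way an execute unit could select its output enables. The Python sketch below transcribes the table into such a structure; the names `Control` and `CONTROL` and the use of booleans for Enable/Disable are assumptions for illustration.

```python
from collections import namedtuple

Control = namedtuple(
    "Control", "alu_op alu_out shifter_out mul_out operand_a operand_b"
)

DC = "Don't Care"

# One entry per row of Table 2.4 (True = Enable, False = Disable).
CONTROL = {
    "ALU Immediate (1 Inst. Word)": Control("Op Code", True, False, False, "Rd", "Immediate (4, 8, 16 bit)"),
    "ALU Immediate (2 Inst. Word)": Control("Op Code", True, False, False, "Rd", "Immediate (32 bit)"),
    "ALU Register": Control("Op Code", True, False, False, "Rs1", "Rs2"),
    "ALU LB Addr.": Control("Op Code", True, False, False, "Rs1", "Rs2"),
    "Shift / Rotate": Control(DC, False, True, False, "Rb (Rs1)", "ShiftAmt"),
    "Load": Control("MOV", False, False, False, "Rb (Rs1)", DC),
    "Store": Control("MOV", False, False, False, "Rb (Rs1)", "Rd"),
    "Branch": Control("ADD", True, False, False, "PC", "Immediate"),
    "PE Control": Control(DC, False, False, False, DC, DC),
    "DMA Control": Control(DC, False, False, False, DC, DC),
    "Multiply": Control(DC, False, False, True, "Rs1", "Rs2"),
}


def enabled_outputs(instruction):
    """List which of the three result outputs are enabled for an instruction."""
    c = CONTROL[instruction]
    flags = (("ALU", c.alu_out), ("Shifter", c.shifter_out), ("Mul", c.mul_out))
    return [name for name, on in flags if on]
```

Read this way, the table enables at most one of the ALU, shifter and multiplier outputs for any instruction, for example `enabled_outputs("Multiply")` returns `["Mul"]`.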
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.opcode[3:0], SystemC.shift[2:0], SystemC.rs1Idx[4:0], SystemC.rs2Idx[4:0], SystemC.rdIdx[4:0], SystemC.opAIdx[4:0], SystemC.opBIdx[4:0], SystemC.rdAOEn, SystemC.dInCtl, SystemC.dOutCtl, SystemC.shiftAmt[4:0]]
Figure 2.20: Output waveform of Instruction Execution















Figure 2.22: Modified Pipeline Register Architecture (High-speed)





• Pipeline Control (reset, flush and refill)
[Figure: pipeline timing diagram for a branch; recoverable labels: address row N+1, N+2, N+3, N+4, N+5, DST, DST+1, DST+2, DST+3; instructions MOV and ADD passing through Fetch, Decode and Execute]
Figure 2.23: Branch Instruction Execution
2.14 Top-level Simulation Result of ICS_RISC 
• Simple Program for Verification 
//Simple Loop Program
0000/0000 //MOV R0, #0
0001/0001 //MOV R1, #1
0002/0002 //MOV R2, #2
0003/0003 //MOV R3, #3
0004/0004 //MOV R4, #4
0005/0005 //MOV R5, #5
0006/0006 //MOV R6, #6
0007/0007 //MOV R7, #7
0408/0000 //MOV R8, R0
0409/0020 //MOV R9, R1
040A/0040 //MOV R10, R2
040B/0060 //MOV R11, R3
040C/0080 //MOV R12, R4
040D/00A0 //MOV R13, R5
040E/00C0 //MOV R14, R6
040F/00E0 //MOV R15, R7
0450/4280 //AND R16, R8 & R9
0471/52C0 //OR R17, R10 | R11
0492/6340 //XOR R18, R12 ^ R13
04D3/6B80 //ADD R19, R14 + R15
04F4/6B80 //SUB R20, R14 - R15 //End
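The upper half of each program word can be checked against its comment by extracting bit fields. In the Python sketch below, the field positions opcode = bits 24 to 21 and Rd = bits 20 to 16 are inferred from this listing and are an assumption, not a layout stated in the text; the lower 16 bits are returned undecoded because their sub-fields cannot be confirmed from the listing. The function names are hypothetical.

```python
def parse_word(text):
    """Convert a listing entry such as '04D3/6B80' into a 32-bit integer."""
    return int(text.replace("/", ""), 16)


def decode_fields(word):
    """Split a word into the fields inferred from the sample program."""
    return {
        "id": (word >> 25) & 0x7F,      # Instruction[31:25], see Table 2.3
        "opcode": (word >> 21) & 0xF,   # assumed position: bits 24..21
        "rd": (word >> 16) & 0x1F,      # assumed position: bits 20..16
        "low16": word & 0xFFFF,         # immediate or source fields, undecoded
    }
```

Under these assumptions `decode_fields(parse_word("04D3/6B80"))` reports ID 2 (ALU Register), opcode 6 (ADD) and Rd 19, which agrees with the comment "ADD R19".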
[Waveform screenshot (GTKWave); recoverable signal names: SystemC.iData[31:0], SystemC.opAIdx[4:0], SystemC.opBIdx[4:0], SystemC.rdAOEn, SystemC.rdBOEn, SystemC.wbIdx[4:0], SystemC.aluOut[31:0], SystemC.iAddr[31:0]]
Figure 2.24: Top level Simulation Result of ICS_RISC





















[Console screenshot; contents not recoverable]
Figure 2.25: Instruction Index




3.1 SystemC File Structure
Figure 3.1: SystemC file structure of UnitChip
3.2 Top-level Simulation Result of UnitChip
Figure 3.2: Top-level Simulation Result of UnitChip (GTKWave waveform of iAddr[31:0] and PE1_dOut[3:0] to PE16_dOut[3:0], with the ICS_RISC operations and the PE operations annotated)
References for the ICS_RISC
[1] Yeong-don Bae, "Basic Microprocessor Design", http://www.donny.co.kr
[2] Yap Zi He, "Building A RISC Microcontroller in an FPGA", http://www.opencores.org/projects/riscmcu




/*
 * iReg: Instruction Reg for Standard-PE(header file for iReg)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: iReg.h
 * Revision history: Version1
 */









































/*
 * iReg: Instruction Reg for Standard-PE(source file for iReg)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: iReg.cpp
 * Revision history: Version1
 * Date: 17/1/2005
 */
#include "iReg.h"
void iReg::do_iReg() {
sc_uint<19> tmp_inst;








//S-PE operation sel
//data-out reg ctl
//internal reg sel
//SRAM sel
//SRAM enable signal
//internal reg read/write signal
//SRAM read/write enable signal













= tmp_inst.range(8,6);
= tmp_inst.range(5,3);
= tmp_inst.range(2,0);
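The range(hi, lo) calls above slice fixed bit fields out of the 19-bit S-PE instruction word. The sketch below shows the same slicing in plain C++, without any SystemC dependency. The helper name bits and the example word are hypothetical, and the left-hand sides of the assignments did not survive in the listing, so no claim is made here about which control signal each slice drives.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns bits [hi:lo] of word, in the manner of
// sc_uint<N>::range(hi, lo) used in the listing above.
inline std::uint32_t bits(std::uint32_t word, unsigned hi, unsigned lo) {
    assert(hi >= lo && hi < 32);
    const unsigned width = hi - lo + 1;
    const std::uint32_t mask =
        (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (word >> lo) & mask;
}
```

For the example word 0x1AC (binary 110 101 100), bits(0x1AC, 8, 6), bits(0x1AC, 5, 3) and bits(0x1AC, 2, 0) pick out the three 3-bit groups in turn.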
/*
 * Mux: Mux for Standard-PE(header file for Mux)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mux.h
 * Revision history: Version1
 */

























//mux ctl input
//input data
//data from internal Reg
//data from adjacent PE(from left PE)
//data from adjacent PE(from right PE)
//data from adjacent PE(from upside PE)
//data from adjacent PE(from downside PE)
//data request for internal register








/*
 * Mux: Mux for Standard-PE(source file for Mux)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mux.cpp
 * Revision history: Version1
 * Date: 17/1/2005
 */
#include "mux.h"
void mux::do_mux() {
switch (muxCtl.read()) {
case 0: muxOut = dIn; break;
case 1: muxOut = dReg; break;
case 2: muxOut = dLeft; break;
case 3: muxOut = dRight; break;
case 4: muxOut = dUp; break;
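The operand multiplexer above routes one of several 4-bit sources to an ALU input. A plain C++ model of that selection might look as follows. Control codes 0 to 4 follow the surviving part of the listing; mapping code 5 to dDown and returning 0 for any other code are assumptions, since the remaining cases are not legible in the source. The names MuxInputs and muxSelect are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The six data sources of one operand multiplexer (low nibble used).
struct MuxInputs {
    std::uint8_t dIn, dReg, dLeft, dRight, dUp, dDown;
};

// Returns the input selected by muxCtl.
// Codes 0..4 follow the listing; code 5 and the default are assumptions.
inline std::uint8_t muxSelect(unsigned muxCtl, const MuxInputs& in) {
    switch (muxCtl) {
        case 0: return in.dIn;     // external data input
        case 1: return in.dReg;    // internal register
        case 2: return in.dLeft;   // left neighbour PE
        case 3: return in.dRight;  // right neighbour PE
        case 4: return in.dUp;     // upper neighbour PE
        case 5: return in.dDown;   // lower neighbour PE (assumed)
        default: return 0;         // assumed idle value
    }
}
```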















/*
 * SPE: Standard-PE for CAP(Configurable Array Processor)(header file for SPE)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: spe.h
 * Revision history: Version1
 */












clock;
reset;
instICS; //Instruction Input from ICS
dIn, dLeft, dRight, dUp, dDown; //data Inputs
dOut; //data output
dOutadjPE; //data output for adjacent PEs
//temp signal for Instruction
sc_signal<sc_uint<19> > s_inst;
//temp signals from iReg(Instruction Decoder)
sc_signal<sc_uint<3> > s_muxACtl;
sc_signal<sc_uint<3> > s_muxBCtl;
sc_signal<sc_uint<3> > s_sopSel;
sc_signal<bool> s_doutRCtl;
sc_signal<sc_uint<2> > s_regSel;




//temp signals for mux In/output and ALU Inputs
sc_signal<sc_uint<4> > s_dIn, s_dLeft, s_dRight, s_dUp, s_dDown;
sc_signal<sc_uint<4> > dRegOutA; //reg out for muxA input
sc_signal<sc_uint<4> > dRegOutB; //reg out for muxB input
sc_signal<sc_uint<4> > muxAOut;
sc_signal<sc_uint<4> > muxBOut;
sc_signal<bool> dReqA, dReqB; //data request for register
//temp signal for ALU output
sc_signal<sc_uint<4> > aluOut;
//temp signal for internal register
sc_signal<sc_uint<4> > regIn;
sc_signal<sc_uint<4> > tmp1, tmp2, tmp3, tmp4;
sc_signal<sc_uint<4> > regOut;
//temp signals for SRAM
sc_signal_rv<4> sramData;
sc_lv<4> ramData[16];
//temp signal for output data bus




















































sensitive_pos << clock;






























sensitive << clock << muxAOut << muxBOut << s_sopSel << s_rwRegEn;
SC_METHOD(do_reg);
sensitive << clock << s_regSel << s_rwRegEn << regIn << dReqA << dReqB;
SC_METHOD(do_sram);
sensitive << clock << s_rwSEn << s_sramEn << s_sramSel << regIn << sramData;
SC_METHOD(do_doutReg);





dOut.initialize(0);
dOutadjPE.initialize(0);
for (int i=0; i<16; i++) ramData[i]="XXXX";
/*
 * SPE: Standard-PE for CAP(Configurable Array Processor)(source file for SPE)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: spe.cpp
 * Revision history: Version1
 */




void spe::do_latch() {















#define comp(a,b) (((a)>(b))?1: (((a)==(b))?0: -1)) //comparator
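The comp macro above is a three-way comparator. As a hedged alternative, the same result can be obtained from a small function template, which evaluates each argument exactly once. The name comp3 is hypothetical and is not part of the original sources.

```cpp
#include <cassert>

// Three-way comparison: 1 if a > b, 0 if a == b, -1 if a < b.
// Unlike the macro form, each argument is evaluated only once.
template <typename T>
constexpr int comp3(const T& a, const T& b) {
    return (a > b) ? 1 : ((a == b) ? 0 : -1);
}
```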




unsigned short result=0;
// unsigned short result;
unsigned short src1=muxAOut.read();
unsigned short src2=muxBOut.read();
switch(s_sopSel.read()) {
case 0: result = src1 & src2; break;
case 1: result = src1 | src2; break;
case 2: result = ~src1; break;
case 3: result = src1 ^ src2; break;
case 4: result = src1 + src2; break;
case 5: result = src1 - src2; break;
case 6: result = src1 * src2; break;
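The operation select of the Standard-PE can be checked outside a simulator with a plain C++ model such as the one below. Operation codes 0 to 6 follow the listing above. Truncating operands and results to 4 bits is an assumption based on the sc_uint<4> data signals, and returning 0 for unlisted codes is likewise an assumption. The function name speAlu is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ model of the S-PE operation select.
// Codes 0..6 as in the listing; results wrap modulo 16 (assumed 4-bit data).
inline std::uint8_t speAlu(unsigned sopSel, std::uint8_t src1, std::uint8_t src2) {
    const unsigned a = src1 & 0xFu;
    const unsigned b = src2 & 0xFu;
    unsigned r = 0;
    switch (sopSel) {
        case 0: r = a & b; break;  // AND
        case 1: r = a | b; break;  // OR
        case 2: r = ~a;    break;  // NOT
        case 3: r = a ^ b; break;  // XOR
        case 4: r = a + b; break;  // ADD
        case 5: r = a - b; break;  // SUB
        case 6: r = a * b; break;  // MUL
        default: r = 0;    break;  // unlisted codes (assumed)
    }
    return static_cast<std::uint8_t>(r & 0xFu);
}
```

Because the result is masked to the low nibble, 9 + 9 yields 2 and 3 - 5 yields 0xE in this model, which is the wrap-around behaviour a 4-bit unsigned datapath would show.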





// Internal Register
void spe::do_reg() {
if(s_rwRegEn) { //read operation
switch (s_regSel.read()) {
case 0: regOut.write(tmp1); break;
case 1: regOut.write(tmp2); break;
case 2: regOut.write(tmp3); break;




} else {
dRegOutA = regOut;
doutBus = sc_lv<4> (regOut);
























} else {
if(dReqB) {
} else {
dRegOutB = regOut;
doutBus = sc_lv<4> (regOut);
dOut = sc_uint<4> (doutBus);
//write operation
switch (s_regSel.read()) {
case 0: tmp1 = regIn; break;
case 1: tmp2 = regIn; break;
case 2: tmp3 = regIn; break;
case 3: tmp4 = regIn; break;
default:
//SRAM
void spe::do_sram() {
if (s_sramEn) {
if (s_rwSEn) { //read operation
sramData.write(ramData[s_sramSel.read()]);
doutBus = sramData;






// dOut = sc_uint<4> (doutBus); // ** dOut has a dummy value(#F)
} else { //write operation
sramData = sc_lv<4> (regIn);
ramData[s_sramSel.read()] = sramData;
} else {
sramData = "ZZZZ";
// Data output register
void spe::do_doutReg() {
if (s_doutRCtl) {
dOutadjPE = sc_uint<4> (doutBus);
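The do_sram() fragment above drives the shared data bus on a read, stores the register value on a write, and releases the bus ("ZZZZ") when the SRAM is disabled. The following is a minimal plain C++ sketch of that control flow, under stated assumptions: the type Bus4 and the class LocalSram are hypothetical, a boolean flag stands in for the high-impedance state, and the memory starts at zero rather than at the "XXXX" value used in the listing.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Bus value; driven == false stands in for the "ZZZZ" (high-impedance) state.
struct Bus4 {
    bool driven;
    std::uint8_t value;
};

// Hypothetical 16 x 4-bit local memory following the do_sram() control flow.
class LocalSram {
public:
    // en: SRAM enable; rd: true selects a read, false a write.
    Bus4 access(bool en, bool rd, unsigned sel, std::uint8_t regIn) {
        assert(sel < 16);
        if (!en) return Bus4{false, 0};         // disabled: bus released
        if (rd) return Bus4{true, ram_[sel]};   // read: drive stored nibble
        ram_[sel] = static_cast<std::uint8_t>(regIn & 0xFu);  // write
        return Bus4{true, ram_[sel]};           // written value is on the bus
    }

private:
    std::array<std::uint8_t, 16> ram_{};  // zero-initialised (assumption)
};
```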
2 Processing Accelerator-PE
/*
 * iReg: Instruction Reg for ProcessingAccelerator-PE(header file for iReg)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: iReg.h
 * Revision history: Version1
 */









































/*
 * iReg: Instruction Reg for ProcessingAccelerator-PE(source file for iReg)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: iReg.cpp
 * Revision history: Version1
 * Date: 29/1/2005
 */
#include "iReg.h"
void iReg::do_iReg() {
sc_uint<19> tmp_inst;























//S-PE operation sel
//data-out reg ctl
//internal reg sel
//SRAM sel
//SRAM enable signal
//internal reg read/write signal
//SRAM read/write enable signal
/*
 * Mux: Mux for ProcessingAccelerator-PE(header file for Mux)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mux.h
 * Revision history: Version1
 */

























//mux ctl input
//input data
//data from internal Reg
//data from adjacent PE(from left PE)
//data from adjacent PE(from right PE)
//data from adjacent PE(from upside PE)
//data from adjacent PE(from downside PE)
//data request for internal register







/*
 * Mux: Mux for ProcessingAccelerator-PE(source file for Mux)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mux.cpp
 * Revision history: Version1
 * Date: 29/1/2005
 */
#include "mux.h"
void mux::do_mux() {
switch (muxCtl.read()) {
case 0: muxOut = dIn; break;
case 1: muxOut = dReg; break;
case 2: muxOut = dLeft; break;
case 3: muxOut = dRight; break;
case 4: muxOut = dUp; break;








/*
 * ALU: ALU for Processing Accelerator-PE(header file for ALU)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: alu.h
 * Revision history: Version1
 */



































/*
 * ALU: ALU for ProcessingAccelerator-PE(source file for ALU)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: alu.cpp
 * Revision history: Version1
 * Date: 29/1/2005
 */
#include "alu.h"
//#define MAC(A,B,P) (((A)*(B))+(P)) //mac
//#define MAS(A,B,P) (((A)*(B))-(P)) //mas
//#define //asr, when the data-type is signed, it should be modified
#define ROR(A) ((((A&0x0f)&0x1)?(((A&0x0f)>>0x1)|0x8):(A&0x0f)>>0x1)&0x0f) //rotate
#define ABS(A) (((A&0x0f)<0x0?(-1*(A&0x0f)):(A&0x0f))&0x0f) //abs
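The ROR and ABS macros above operate on the low nibble of their argument. As an illustration only, the same operations can be written as small constexpr functions. The names ror4 and abs4 are hypothetical. Interpreting the nibble as a two's-complement value in abs4 is an assumption; the comments in the listing state that ABS applies only once the data type is signed.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 4-bit value right by one position (bit 0 moves to bit 3).
constexpr std::uint8_t ror4(std::uint8_t a) {
    return static_cast<std::uint8_t>(
        (((a & 0xFu) >> 1) | ((a & 0x1u) << 3)) & 0xFu);
}

// Absolute value of a 4-bit two's-complement number (-8..7), as 4 bits.
// The signed interpretation is an assumption, not taken from the listing.
constexpr std::uint8_t abs4(std::uint8_t a) {
    return static_cast<std::uint8_t>(
        ((a & 0x8u) ? (0x10u - (a & 0xFu)) : (a & 0xFu)) & 0xFu);
}
```

Note that abs4(0x8) returns 0x8, because +8 is not representable in four signed bits.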
void alu::do_alu() {
sc_uint<4> result,src1,src2,tmp,mulTmp; //temp signals
src1 = aluAIn.read();
src2 = aluBIn.read();
switch (aluCtl.read()) {
case 0: result = src1*src2; break;
case 1: mulTmp = src1*src2;
result = mulTmp + sc_uint<4> (regTmp); break;
case 2: mulTmp = src1*src2;
result = mulTmp - sc_uint<4> (regTmp); break;
case 3: result = src1<<1; break;
case 4: result = src1>>1; break;
//when the data-type is signed, it should be modified(asr)
case 5: result = src1>>1; break;
case 6: result = ROR(src1); break;
//when the data-type is signed, it can be applied(abs)
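Cases 1 and 2 above combine a multiplication with the stored value regTmp (multiply-accumulate and multiply-subtract). A hedged plain C++ sketch of the two operations on 4-bit values is given below. The function names mac4 and mas4 are hypothetical, and truncating to four bits is an assumption that follows the sc_uint<4> declarations.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-accumulate: (a * b + acc), truncated to 4 bits.
constexpr std::uint8_t mac4(std::uint8_t a, std::uint8_t b, std::uint8_t acc) {
    return static_cast<std::uint8_t>(
        ((a & 0xFu) * (b & 0xFu) + (acc & 0xFu)) & 0xFu);
}

// Multiply-subtract: (a * b - acc), truncated to 4 bits (wraps modulo 16).
constexpr std::uint8_t mas4(std::uint8_t a, std::uint8_t b, std::uint8_t acc) {
    return static_cast<std::uint8_t>(
        ((a & 0xFu) * (b & 0xFu) - (acc & 0xFu)) & 0xFu);
}
```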






regTmp.write(tmp); //defined in the header file, signal for Test
//sc_out<sc_uint<4> > regTmp;





















/*
 * PAPE: Processing Accelerator-PE for CAP(header file for PAPE)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: pape.h
 * Revision history: Version1
 */

















instICS; //instruction Input from ICS
dIn, dLeft, dRight, dUp, dDown; //data Inputs
dOut; //data output
dOutadjPE; //data output for adjacent PEs
s_inst;
//temp signals from iReg(Instruction Decoder)
sc_signal<sc_uint<3> > s_muxACtl;
sc_signal<sc_uint<3> > s_muxBCtl;
sc_signal<sc_uint<3> > s_sopSel;
sc_signal<bool> s_doutRCtl;
sc_signal<sc_uint<2> > s_regSel;




//temp signals for mux In/output and ALU Inputs
sc_signal<sc_uint<4> > s_dIn, s_dLeft, s_dRight, s_dUp, s_dDown;
sc_signal<sc_uint<4> > dRegOutA; //reg out for muxA input
sc_signal<sc_uint<4> > dRegOutB; //reg out for muxB input
sc_signal<sc_uint<4> > muxAOut;
sc_signal<sc_uint<4> > muxBOut;
sc_signal<bool> dReqA, dReqB; //data request for register
//temp signal for ALU output
sc_signal<sc_uint<4> > s_aluOut;
sc_signal<sc_uint<4> > s_regTmp;
//temp signal for internal register
sc_signal<sc_uint<4> > s_regIn;
sc_signal<sc_uint<4> > tmp1, tmp2, tmp3, tmp4;
sc_signal<sc_uint<4> > regOut;
//temp signals for SRAM
sc_signal_rv<4> sramData;
sc_lv<4> ramData[16];
//temp signal for output data bus
// sc_signal<sc_uint<4> > doutBus;











































muxA=new mux("muxA");
















sensitive_pos << clock;




iReg1->regSel(s_regSel);
iReg1->sramEn(s_sramEn);



























sensitive << clock << s_regSel << s_rwRegEn << s_regIn << dReqA << dReqB;
SC_METHOD(do_sram);
sensitive << clock << s_rwSEn << s_sramEn << s_sramSel << s_regIn << sramData;
SC_METHOD(do_doutReg);




for (int i=0; i<16; i++) ramData[i]="XXXX";
/*
 * PAPE: ProcessingAccelerator-PE for CAP(source file for PAPE)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: pape.cpp
 * Revision history: Version1
 */




void pape::do_latch() {







} else {
s_inst.write(instICS.read());
// Internal Register
void pape::do_reg() {





if(s_rwRegEn) { //read operation
switch (s_regSel.read()) {
case 0: regOut.write(tmp1); break;
case 1: regOut.write(tmp2); break;
case 2: regOut.write(tmp3); break;








if(dReqA) { //output control
dRegOutA = regOut;
doutBus = sc_lv<4> (regOut);

dRegOutB = regOut;
doutBus = sc_lv<4> (regOut);
dOut = sc_uint<4> (doutBus);
} else if (s_rwRegEn==0) { //write operation
switch (s_regSel.read()) {
case 0: tmp1 = s_regIn; break;
case 1: tmp2 = s_regIn; break;
case 2: tmp3 = s_regIn; break;
case 3: tmp4 = s_regIn; break;
default:
//SRAM
void pape::do_sram() {
if (s_sramEn) {
if(s_rwSEn) { //read operation
sramData.write(ramData[s_sramSel.read()]);
doutBus = sramData;






// dOut = sc_uint<4> (doutBus); // ** dOut has a dummy value(#F)
} else { //write operation
sramData = sc_lv<4> (s_regIn);
ramData[s_sramSel.read()] = sramData;
} else {
sramData = "ZZZZ";
// Data output register
void pape::do_doutReg() {
if (s_doutRCtl) {
dOutadjPE = sc_uint<4> (doutBus);
3 ICS_RISC 
3.1 Datapath Architecture 
/*
 * PC: Program Counter
 * for ICS(Intelligent Configurable Switch)RISC Core(header file for pc)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: pc.h
 * Revision history: Version1
 */






























//Select Signal between aluOut/incrOut
//Instruction Address
//Data Address
sensitive << clock.pos() << reset << iAregCtl << dAregCtl << aluOut;
SC_METHOD(do_autoIncr);






/*
 * PC: Program Counter
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for pc)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: pc.cpp
 * Revision history: Version1
• Date: l.VJ/2.005 
 */
#include "pc.h"
void pc::do_pc() {
bool iAddrOutTmp;
if(reset) {
iAddr = 0;
dAddr = 0;
} else {
iAddrTmp = iAddrIn;

iAddr = aluOut;
iAddrOutTmp = 0;
iAddr = incrOut;
iAddrOutTmp = 1;
iAddrOut = iAddrOutTmp;
if(dAregCtl) {
dAddr = aluOut;
void pc::do_autoIncr() {
if (reset) {
iAddrIn = 0;
} else if (iAddrOut) {
if (clock.posedge()) {
iAddrIn++;
/*
 * SR: Status Register
 * for ICS(Intelligent Configurable Switch)RISC Core(header file for sr)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: sr.h
 * Revision history: Version1
 */

























SC_CTOR(sr) {
SC_METHOD(do_sr);
sensitive << clock << reset << condFlag << wbData
<< wbSel << srOEn << srWbEn;
#lfdcl'S™ 





/*
 * SR: Status Register
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for sr)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: sr.cpp
 * Revision history: Version1
 * Date: 3/2/2005
 */
#include "sr.h"
void sr::do_sr() {
if(reset) {
srData = 0;
} else if (srWbEn) {
if(wbSel) {
} else {




rdData = srData;
// rdData = srOEn ? sc_lv<4> (srData) : "ZZZZ";
}
/*
 * LF: Loop Buffer
 * for ICS(Intelligent Configurable Switch)RISC Core(header file for lf)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: lf.h
 * Revision history: Version1
 */











reset;
lbEn; //Loop Buffer Enable
lbRWEn; //Loop Buffer Read/Write Enable
iAddrIn; //iAddr Input for LF
iAddrOut; //iAddr Output for LF
sc_out<sc_uint<4> > incr;
void do_lf();
















/*
 * LF: Loop Buffer
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for lf)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: lf.cpp
 * Revision history: Version1
 * Date: 17/3/2005
 */
#include "lf.h"
void lf::do_lf() {
  if (clock.posedge()) {
    if(lbEn) {
      if(lbRWEn) { //read operation
        incrTmp++;
        iAddrOut.write(buff[incrTmp]);
      } else { //write operation
        incrTmp--;
        buff[incrTmp] = iAddrIn;
      }
      // incr = incrTmp;
    }
  }
}
/*
 * RegFile: 32 x 32 Register file
 * for ICS(Intelligent Configurable Switch)RISC Core(header file for regFile)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: regFile.h
 * Revision history: Version1
 */






















//read Index A
//read Index B
//read A output enable




sc_out<sc_uint<32> > rdAData; //read data A
sc_out<sc_uint<32> > rdBData; //read data B
sc_signal<sc_uint<32> > gpr0,gpr1,gpr2,gpr3,gpr4,gpr5,gpr6,gpr7,gpr8,gpr9,
gpr10,gpr11,gpr12,gpr13,gpr14,gpr15,gpr16,gpr17,gpr18,gpr19,gpr20,






sensitive << clock.pos() << rdAIdx << rdBIdx << rdAOEn << rdBOEn << wbIdx







/*
 * RegFile: 32 x 32 Register file
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for regFile)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: regFile.cpp
 * Revision history: Version1
 * Date: 3/2/2005
 */
#include "regFile.h"
void regFile::do_regFile() {
if(wbEn) {
switch (wbIdx.read()) {
case 0: gpr0.write(wbData); break;
case 1: gpr1.write(wbData); break;
case 2: gpr2.write(wbData); break;
case 3: gpr3.write(wbData); break;
case 4: gpr4.write(wbData); break;
case 5: gpr5.write(wbData); break;
case 6: gpr6.write(wbData); break;
case 7: gpr7.write(wbData); break;
case 8: gpr8.write(wbData); break;
case 9: gpr9.write(wbData); break;
case 10: gpr10.write(wbData); break;
case 11: gpr11.write(wbData); break;
case 12: gpr12.write(wbData); break;
case 13: gpr13.write(wbData); break;
case 14: gpr14.write(wbData); break;
case 15: gpr15.write(wbData); break;
case 16: gpr16.write(wbData); break;
case 17: gpr17.write(wbData); break;
case 18: gpr18.write(wbData); break;
case 19: gpr19.write(wbData); break;
case 20: gpr20.write(wbData); break;
case 21: gpr21.write(wbData); break;
case 22: gpr22.write(wbData); break;
case 23: gpr23.write(wbData); break;
case 24: gpr24.write(wbData); break;
case 25: gpr25.write(wbData); break;
case 26: gpr26.write(wbData); break;
case 27: gpr27.write(wbData); break;
case 28: gpr28.write(wbData); break;
case 29: gpr29.write(wbData); break;
case 30: gpr30.write(wbData); break;
case 31: gpr31.write(pc); break; //for PC
default: break;
}
}
if(rdAOEn) {
switch (rdAIdx.read()) {
case 0: rdAData = gpr0; break;
case 1: rdAData = gpr1; break;
case 2: rdAData = gpr2; break;
case 3: rdAData = gpr3; break;
case 4: rdAData = gpr4; break;
case 5: rdAData = gpr5; break;
case 6: rdAData = gpr6; break;
case 7: rdAData = gpr7; break;
case 8: rdAData = gpr8; break;
case 9: rdAData = gpr9; break;
case 10: rdAData = gpr10; break;
case 11: rdAData = gpr11; break;
case 12: rdAData = gpr12; break;
case 13: rdAData = gpr13; break;
case 14: rdAData = gpr14; break;
case 15: rdAData = gpr15; break;
case 16: rdAData = gpr16; break;
case 17: rdAData = gpr17; break;
case 18: rdAData = gpr18; break;
case 19: rdAData = gpr19; break;
case 20: rdAData = gpr20; break;
case 21: rdAData = gpr21; break;
case 22: rdAData = gpr22; break;
case 23: rdAData = gpr23; break;
case 24: rdAData = gpr24; break;
case 25: rdAData = gpr25; break;
case 26: rdAData = gpr26; break;
case 27: rdAData = gpr27; break;
case 28: rdAData = gpr28; break;
case 29: rdAData = gpr29; break;
case 30: rdAData = gpr30; break;
case 31: rdAData = pc; break;
default: break;
}
}
if(rdBOEn) {
switch (rdBIdx.read()) {
case 0: rdBData = gpr0; break;
case 1: rdBData = gpr1; break;
case 2: rdBData = gpr2; break;
case 3: rdBData = gpr3; break;
case 4: rdBData = gpr4; break;
case 5: rdBData = gpr5; break;
case 6: rdBData = gpr6; break;
case 7: rdBData = gpr7; break;
case 8: rdBData = gpr8; break;
case 9: rdBData = gpr9; break;
case 10: rdBData = gpr10; break;
case 11: rdBData = gpr11; break;
case 12: rdBData = gpr12; break;
case 13: rdBData = gpr13; break;
case 14: rdBData = gpr14; break;
case 15: rdBData = gpr15; break;
case 16: rdBData = gpr16; break;
case 17: rdBData = gpr17; break;
case 18: rdBData = gpr18; break;
case 19: rdBData = gpr19; break;
case 20: rdBData = gpr20; break;
case 21: rdBData = gpr21; break;
case 22: rdBData = gpr22; break;
case 23: rdBData = gpr23; break;
case 24: rdBData = gpr24; break;
case 25: rdBData = gpr25; break;
case 26: rdBData = gpr26; break;
case 27: rdBData = gpr27; break;

































































case 28: rdBData = gpr28; break;
case 29: rdBData = gpr29; break;
case 30: rdBData = gpr30; break;
case 31: rdBData = pc; break;
default:
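The register file listing selects one of 32 entries through long switch statements, with read index 31 returning the program counter. The same behaviour can be sketched with an array, as below. This is an illustrative model only: the class name RegFile32 and the method setPc are hypothetical, and writes to index 31 are deliberately left out because that case is hard to read in the listing.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical array-based model of the 32 x 32 register file.
// Entries 0..30 are general-purpose; read index 31 returns the PC.
class RegFile32 {
public:
    void setPc(std::uint32_t pc) { pc_ = pc; }

    // Write-back to a general-purpose register (index 31 not modelled).
    void write(unsigned idx, std::uint32_t data) {
        assert(idx < 31);
        gpr_[idx] = data;
    }

    // Read port; the model gives both ports A and B the same behaviour.
    std::uint32_t read(unsigned idx) const {
        assert(idx < 32);
        return (idx == 31) ? pc_ : gpr_[idx];
    }

private:
    std::array<std::uint32_t, 31> gpr_{};  // zero-initialised
    std::uint32_t pc_ = 0;
};
```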
/*
 * aluDef: Definition of the ALU functions
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: aluDef.h
 * Revision history: Version1
 * Date: 2/2/2005
 */
#ifndef _ALU_DEFINE_H_
#define _ALU_DEFINE_H_
// ALU Function Definitions
#define CMD_MOVA 0x0




#define CMD_NOT 0x5
#define CMD_ADD 0x6
#define CMD_SUB 0x7








/*
 * aluICS: ALU for ICS(Intelligent Configurable Switch)RISC Core(header file for aluICS)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: aluICS.h
 * Revision history: Version1
 */

































};
/*
 * aluICS: ALU for ICS(Intelligent Configurable Switch)RISC Core(source file for aluICS)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: aluICS.cpp
 * Revision history: Version1
 * Date: 2/2/2005
 */
#include "aluICS.h"
#define comp(a,b) (((a)>(b))?1: (((a)==(b))?0:-1))
//ALU




signed short result = 0;
// signed short result;
signed short src1 = aluAIn.read();
signed short src2 = aluBIn.read();
// signed short tdn = du.read();
signed short cmd = aluCtl.read();
sc_uint<4> tmpCond;
switch (cmd & 0xF) {
// Conditional Flags
case 0: result = src1; break;
case 1: result = src2; break;
case 2: result = src1 & src2; break;
case 3: result = src1 | src2; break;
case 4: result = src1 ^ src2; break;
case 5: result = ~src1; break;
case 6: result = src1 + src2; break;
case 7: result = src1 - src2; break;
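The ICS_RISC ALU updates carry, overflow, zero and negative flags alongside its result. For comparison, the sketch below computes the four flags for a 32-bit addition using the conventional definitions (carry out of bit 31, signed overflow, zero result, sign bit). It is not a transcription of the flag logic in the listing, whose bit masks are only partly legible; the names Flags, AluResult and addWithFlags are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Flags {
    bool n;  // negative: bit 31 of the result
    bool z;  // zero: result equals 0
    bool c;  // carry: unsigned carry out of bit 31
    bool v;  // overflow: signed overflow
};

struct AluResult {
    std::uint32_t value;
    Flags flags;
};

// 32-bit addition with conventionally defined N, Z, C and V flags.
inline AluResult addWithFlags(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t wide = static_cast<std::uint64_t>(a) + b;
    const std::uint32_t r = static_cast<std::uint32_t>(wide);
    Flags f;
    f.c = (wide >> 32) != 0;
    // Overflow: operands share a sign and the result's sign differs.
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) != 0;
    f.z = (r == 0);
    f.n = (r >> 31) != 0;
    return AluResult{r, f};
}
```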












if (result & 0xFFFF0000) cf.write(1); //carry/Borrow flag
else cf.write(0);
if (result & 0xFFFF0000) vf.write(1); //overflow flag
else vf.write(0);
result &= 0xFFFF;
if (result == 0) zf.write(1); //zero flag
else zf.write(0);
nf.write(1); //negative flag
/*
 * MUL: multiplier
 * for ICS(Intelligent Configurable Switch)RISC Core(header file for multiplier)












 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mul.h
 * Revision history: Version1
 */






















mulOut.initialize(0);
/*
 * MUL: multiplier
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for multiplier)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mul.cpp
 * Revision history: Version1
 * Date: 14/3/2005
 */
#include "mul.h"
void mul::do_mul() {
sc_uint<32> src1, src2, result;
src1 = mulAIn;
src2 = mulBIn;
result = src1 * src2;
mulOut.write(result);
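Since do_mul() stores the product in a 32-bit sc_uint, only the low 32 bits of the result are kept. The sketch below makes that truncation explicit in plain C++ and shows the full 64-bit product next to it. The function names mulLow32 and mulFull64 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of a 32 x 32 multiplication (what a 32-bit result keeps).
constexpr std::uint32_t mulLow32(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(a) * b);
}

// Full 64-bit product, shown for comparison.
constexpr std::uint64_t mulFull64(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint64_t>(a) * b;
}
```

For example, 0x10000 * 0x10000 gives 0 in the 32-bit result although the full product is 0x100000000.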
/*
 * Shifter: Shifter for ICS(Intelligent Configurable Switch)RISC Core(header file for Shifter)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: shifter.h
 * Revision history: Version1
 */


























sensitive << shiftIn << shiftAmt << shiftCtl;
shiftOut.initialize(0);
/*
 * Shifter: Shifter for ICS(Intelligent Configurable Switch)RISC Core(source file for Shifter)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: shifter.cpp
 * Revision history: Version1
 * Date: 1/2/2005
 */
#include "shifter.h"







switch (w_shiftCtl) {
case 0: w_shiftOut = w_shiftIn << w_shiftAmt; break; //logical shift left
case 1: w_shiftOut = w_shiftIn >> w_shiftAmt; break; //logical shift right
// case 2: w_shiftOut = ({32{w_shiftIn[31]}}<<(32-w_shiftAmt))|(w_shiftIn>>w_shiftAmt); break;
// Arithmetic Shift Right should be modified.
// case 2: w_shiftOut = ((w_shiftIn[32]|w_shiftIn[31])<<(32-w_shiftAmt))|(w_shiftIn>>w_shiftAmt); break;
case 2: w_shiftOut = w_shiftIn >> w_shiftAmt; break;
}
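The listing notes that the arithmetic shift right (case 2) still needs to be modified, and for now it performs the same logical shift as case 1. One possible way to fill the vacated upper bits with the sign bit is sketched below in plain C++. It is offered as an illustration, not as the author's intended fix. The function name shift32 is hypothetical, and masking the shift amount to the range 0 to 31 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// ctl: 0 = logical left, 1 = logical right, 2 = arithmetic right.
// The shift amount is masked to 0..31 (assumption); any other ctl
// value returns the input unchanged (assumption).
inline std::uint32_t shift32(unsigned ctl, std::uint32_t in, unsigned amt) {
    amt &= 31u;
    switch (ctl) {
        case 0: return in << amt;
        case 1: return in >> amt;
        case 2: {
            const std::uint32_t logical = in >> amt;
            // Replicate the sign bit into the vacated upper positions.
            const std::uint32_t fill =
                ((in >> 31) && amt != 0) ? (~0u << (32u - amt)) : 0u;
            return logical | fill;
        }
        default: return in;
    }
}
```

With this definition, shifting 0x80000000 right arithmetically by 4 gives 0xF8000000, whereas the logical shift gives 0x08000000.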






/*
 * Datapath: Data-path architecture
 * for ICS(Intelligent Configurable Switch)RISC Core(header file for datapath)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: datapath.h
 * Revision history: Version1
 */




















































































//Status Register Enable
//Status Register Read/Write Enable
//ALU Control Signal
//ALU Output Enable
//Shifter Control Signal
//Shifter Output Enable
//Multiplier Output Enable
//Operand A Index
//Operand B Index
//Read A Output Enable
//Read B Output Enable
//Writeback Index
//Writeback Enable
//Immediate Output Enable
//Instruction Address Register Control
//Data Address Register Control
//Data Input Control
//Data Output Control
//Data Input
//Loop Buffer Enable





busA, busB, busW;
s_aluOut; //ALU Output Signal
s_condFlag; //Conditional Flag
s_shiftOut; //Shifter Output Signal
s_mulOut; //Multiplier Output Signal
tmpbusA, tmpbusW; //Temp Signals for SR
s_shiftAmt; //Temp Signal for Shifter
pIAddr; //Instruction Address from PC
lIAddr; //Instruction Address from LF
void do_outCtl(); //Function for Output Control
void do_InOutReg(); //Function for Data In/Out Register
void do_sigDiv() { //Function for Gen. Signals for Status Register
sc_uint<32> busAI, busWI, busBI;
busAI = busA;
busWI = busW;
busBI = busB;
tmpbusA = busAI.range(31,28);
tmpbusW = busWI.range(31,28);













ipc->clock(clock); ipc->reset(reset); ipc->iAregCtl(iAregCtl);
ipc->dAregCtl(dAregCtl); ipc->aluOut(s_aluOut);





















































sensitive << dInCtl << dOutCtl;
SC_METHOD(do_sigDiv);





/*
 * Datapath: Data-path architecture
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for datapath)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: datapath.cpp
 * Revision history: Version1
 * Date: 30/4/2005
 */
#include "datapath.h"





busW = s_aluOut;
}
if(shiftOEn) {




busW = s_mulOut;
//Loop Buffer Addressing
if (lbEn) {
iAddr = lIAddr; //Instruction Address from LB
} else {
iAddr = pIAddr; //Instruction Address from PC






3.2 Control Architecture 
/*
 * Def: Macros for ICS_RISC
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: def.h
 * Revision history: Version1
 */

















#define OP_MOVA
#define OP_MOVB
#define OP_AND
#define OP_OR
#define OP_XOR
#define OP_NOT
#define OP_ADD
#define OP_SUB
#define OP_CMP
#define OP_MSR






























/*
 * Fetch: Fetch Unit for ICS_RISC(header file for fetch)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: fetch.h
 * Revision history: Version1
 * Date: 1/5/2005
 */
//Data Input Register
//Data Output Register
//ALU Imm Short(1 Inst. word)
//ALU Imm. Long(? Inst. word)
//ALU Register






























/*
 * Fetch: Fetch Unit for ICS_RISC(source file for fetch)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: fetch.cpp
 * Revision history: Version1
 * Date: 1/5/2005
 */
#include "fetch.h"





/*
 * Decode: Instruction Decoder Unit for ICS_RISC(header file for decode)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: decode.h
 * Revision history: Version1
 */
























































//Compare flag(update status register/no writeback)
//Branch Flag
//End of simulation flag
//Status register output enable
//Status register read/write enable
//For Loop Buffer








//Loop Buffer Enable
//Loop Buffer Read/Write Enable
//PE Execute Operation
//PE Operation Mode Selection
//PE Configuration
//PE Sel
DMAOp; //DMA Execute Operations Selection
DFBSel; //Data Frame Buffer Selection(2 Sets)
dataAmt; //Amount of Data to Transfer















SRAMRegSel; //Select between SRAM/ICS_RISC Reg(Source/Dest.)
startAddrSRAMReg; //Start Address of SRAM/ICS_RISC Reg(S/D)
memSel; //Memory Selection(Program/Data)





//fInst[31:25] for extract Instruction ID




//Function for pipeline control
//Function for extract instruction ID
//Function for condition
}; 
void do_fieldExt(); //Function for Instruction field extraction
SC_CTOR(decode) {
    SC_METHOD(do_pipelineCtl);
    sensitive << clock.pos() << flush;
    SC_METHOD(do_instId);
    sensitive << clock.pos() << fInst;
    SC_METHOD(do_cond);
    sensitive << clock.pos();
    SC_METHOD(do_fieldExt);
    sensitive << clock.pos() << fInst;
/*
 * Decode: Instruction Decoder Unit for ICS_RISC(source file for decode)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: decode.cpp
 * Revision history: Version1
 * Date: 1/5/2005
 */
#include "decode.h"
void decode::do_pipelineCtl() { //Function for Pipeline Control
    bool refillTmp;
    if (reset) {
        refillTmp = 1;
    } else {






    refill = refillTmp;
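
The decode listing notes that the instruction ID is derived from the upper bits of the fetched word. A minimal sketch of that field extraction in plain C++ follows, assuming the field occupies bits [31:25]. The helper names bits and instIdField are hypothetical and are not part of the thesis code, and the mapping from this 7-bit field to the 4-bit instruction ID is not shown in the listing.

```cpp
#include <cstdint>

// Hypothetical helper (not from the thesis): extract bits [hi:lo] of a 32-bit word.
// Valid for 0 <= lo <= hi <= 31 with a field narrower than 32 bits.
constexpr std::uint32_t bits(std::uint32_t word, unsigned hi, unsigned lo) {
    return (word >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

// Assumption: the instruction ID field occupies bits [31:25] of the fetched word.
constexpr std::uint32_t instIdField(std::uint32_t fInst) {
    return bits(fInst, 31, 25);
}
```

Calling instIdField on a fetched word yields the raw 7-bit field; in the SystemC model this work would sit inside the decoder process rather than in a free function.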
 * Execute: Execute Unit for ICS_RISC(header file for execute)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: execute.h
 * Revision history: Version1




//Operand A Control
#define ARD
#define ARS1
#define APC








































































//Operand A : Rd
//Operand A : Rs1
//Operand A : PC
//Operand B : Immediate
//Operand B : Rs2/ShiftAmt









//Immediate Operand Flag
//Compare Flag (update SR, No writeback)
//Status Register Output Enable
//Status Register Writeback Enable
//ALU Control
//ALU Output Enable
//Shifter Control
//Shifter Output Enable
//Multiplier Output Enable
//Operand A Index
//Operand B Index/Shift Amt
//Read A Output Enable
//Read B Output Enable
//Writeback Index
//Writeback Enable
//Immediate Output Enable
//Instruction Address Register Control
//Data Address Register Control
//Data Input Control






//Function for Control Signal Generate
//Function for Select Input Operand A,B
//Function for Arrange ALU Ctl Signals
//Function for Arrange ShiftCtl Signals
#ifdef SIM
sc_uint<4> opcodeTmp;
sc_uint<2> opA, opB;
SC_CTOR(execute) {
    SC_METHOD(do_ctlSigGen);
    sensitive << clock.pos() << instId;
    SC_METHOD(do_opSel);
    sensitive << clock.pos() << rdIdx << rs1Idx << rs2Idx;
    SC_METHOD(do_aluCtl);
    sensitive << clock.pos() << opcode;
    SC_METHOD(do_shiftCtl);
    sensitive << clock.pos() << shift;





















/*
 * Execute: Execute Unit for ICS_RISC(source file for execute)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: execute.cpp
 * Revision history: Version1
• Date: .vsnoos 
 */
#include "execute.h"
void execute::do_ctlSigGen() {
    sc_uint<4> topcodeTmp;
    sc_uint<2> topA, topB;
    bool taluOEn, tshiftOEn, tmulOEn, twbEn, tiAregCtl, tdAregCtl, tdInCtl, tdOutCtl;
    bool wbEnTmp;
    sc_uint<4> instIdTmp;
    instIdTmp = instId.read();
    // sc_uint<5> tshiftAmt;
    if (instIdTmp == 0) { //INST_ALUIS
        topcodeTmp = opcode.read();
        taluOEn = 1;
        tshiftOEn = 0;
        tmulOEn = 0;
        topA = ARD;
        topB = BIM;
        twbEn = 1;
        tiAregCtl = 0;
        tdAregCtl = 0;
        tdInCtl = 0;
        tdOutCtl = 0;
    } else if (instIdTmp == 1) { //INST_ALUIL
        topcodeTmp = opcode.read();
        taluOEn = 1;
        tshiftOEn = 0;
        tmulOEn = 0;
        topA = ARD;
        topB = BIM;
        twbEn = 1;
        tiAregCtl = 0;
        tdAregCtl = 0;
        tdInCtl = 0;
        tdOutCtl = 0;
    } else if (instIdTmp == 2) { //INST_ALUR
        topcodeTmp = opcode.read();
        taluOEn = 1;
        tshiftOEn = 0;
        tmulOEn = 0;
        topA = ARS1;
        topB = BRS2;






    } else if (instIdTmp == 3) { //INST_ALULB
        topcodeTmp = opcode.read();










    } else if (instIdTmp == 4) { //INST_SHRO
        topcodeTmp = 0;
        taluOEn = 0;
        tshiftOEn = 1;
        tmulOEn = 0;
        topA = ARS1;
        topB = BRS2;
        tshiftAmt = BRS2; //BRS2 = ShiftAmt
        twbEn = 1;











        twbEn = 1;
        tiAregCtl = 0;
        tdAregCtl = 1;
        tdInCtl = 1;
        tdOutCtl = 0;









        tdAregCtl = 1;
        tdInCtl = 0;
        tdOutCtl = 1;










































































































    if (opB != 0) {
        rdBOEn = 1;
    }





void execute::do_opSel() {
    if (opA == 0) { //ARD
        opAIdx = rdIdx;
    } else if (opA == 1) { //ARS1
        opAIdx = rs1Idx;
    } else if (opA == 2) { //APC
        opAIdx = 15;
    }
    if (opB == 0) { //BIM
opBldx = .'; 
    } else if (opB == 1) { //BRS2
        opBIdx = rs2Idx;
    } else if (opB == 2) { //BIW
        opBIdx = rdIdx;
    wbIdx = rdIdx;
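
The operand A selection in do_opSel can be sketched in plain C++ without SystemC. The control codes ARD = 0, ARS1 = 1 and APC = 2 follow the comments in the listing. The function name selectOpAIdx and the fallback for out-of-range codes are assumptions made for illustration only.

```cpp
#include <cstdint>

// Operand A control codes, following the comments in execute.cpp.
enum OpACtl : std::uint8_t { ARD = 0, ARS1 = 1, APC = 2 };

// Returns the register-file index used for operand A.
// Index 15 for APC follows the listing, which appears to map the PC to index 15.
std::uint8_t selectOpAIdx(OpACtl opA, std::uint8_t rdIdx, std::uint8_t rs1Idx) {
    switch (opA) {
        case ARD:  return rdIdx;
        case ARS1: return rs1Idx;
        case APC:  return 15;
    }
    return rdIdx; // Assumption: fall back to Rd for codes outside the enum.
}
```

For example, selectOpAIdx(APC, 3, 7) returns 15 regardless of the Rd and Rs1 indices.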
void execute::do_aluCtl() {
opcodeTmp = opcode.read();
switch(opcodeTmp) {
case (OP_MOVA) :
case (OP_MOVB) :
case (OP_AND) :
case (OP_OR) :
case (OP_XOR) :
case (OP_NOT) :
case (OP_ADD) :
case (OP_SUB) :
case (OP_CMP) :
case (OP_MSR) :
case (OP_MRS) :
default :
void execute::do_shiftCtl() {
sc_uint<1> shiftTmp;








aluCtl = 0;
aluCtl = 1;
aluCtl = 2;
aluCtl = 3;
aluCtl = 4;
aluCtl = 5;
aluCtl = 6;
aluCtl = 7;
aluCtl = 8;
aluCtl = 0;
aluCtl = 0;
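
Read together, the case labels and the aluCtl values of do_aluCtl suggest a simple opcode-to-control mapping. The sketch below is one plausible reading, assuming the labels pair with the values in the order listed and that OP_MSR, OP_MRS and undefined opcodes all select control code 0. The numeric values of the opcode enumerators are hypothetical, since the values in def.h are not legible, and aluCtlFor is an illustrative name.

```cpp
#include <cstdint>

// Hypothetical opcode numbering: the real values in def.h are not legible.
enum Opcode : std::uint8_t {
    OP_MOVA, OP_MOVB, OP_AND, OP_OR, OP_XOR, OP_NOT,
    OP_ADD, OP_SUB, OP_CMP, OP_MSR, OP_MRS
};

// Maps an opcode to the ALU control code, pairing labels and values in order.
std::uint8_t aluCtlFor(Opcode op) {
    switch (op) {
        case OP_MOVA: return 0;
        case OP_MOVB: return 1;
        case OP_AND:  return 2;
        case OP_OR:   return 3;
        case OP_XOR:  return 4;
        case OP_NOT:  return 5;
        case OP_ADD:  return 6;
        case OP_SUB:  return 7;
        case OP_CMP:  return 8;
        default:      return 0; // Assumption: OP_MSR, OP_MRS and undefined opcodes
    }
}
```

Returning from each case avoids the accidental fall-through that a switch without break statements would produce.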





















 * Control: Control Arch. for ICS_RISC(header file for control)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: control.h
 * Revision history: Version1
































//SR Output Enable 
//SR Writeback Enable
//ALU Control











































//ALU Output Enable
//Shifter Control
//Shifter Output Enable
//Multiplier Output Enable
//Operand A Index
//Operand B Index
//Read A Output Enable
//Read B Output Enable
//Writeback Index
//Writeback Enable
//Immediate Output Enable
//Instruction Address Register Control
//Data Address Register Control
//Data Input Control
//Data Output Control
shiftAmt; //Shift Amount
lbEn; //Loop Buffer Enable
lbRWEn; //Loop Buffer Read/Write Enable
PEOp; //PE Execution Operation
PEOpmode; //PE Operation Mode Selection
PEConfig; //PE Configuration
PESel; //PE Selection
DMAOp; //DMA Operation Selection
DFBSel; //Data Frame Buffer Selection
dataAmt; //Amount of Data to Transfer
startAddrDFB; //Start Address of DFB(Source/Dest.)
SRAMRegSel; //Select between SRAM/ICS_RISC Reg(S/D)
startAddrSRAMReg; //Start Address of SRAM/ICS_RISC Reg(S/D)
memSel; //Memory Selection(Program/Data)
sc_out<sc_uint<5> >
sc_out<bool>






































fInst; //Fetched Instruction Data
flush; //Pipeline Flush
refill; //Pipeline Refill
dInstId; //Instruction ID
dCond; //Condition
opcode; //Opcode
shift; //Shift Control
rs1Idx; //Rs1 Index
rs2Idx; //Rs2 Index
rdIdx; //Rd Index
dImm; //Immediate Data
immFlag; //Immediate Operand Flag
dCmpFlag; //Compare Flag
dBranchFlag; //Branch Flag
dExitFlag; //End of Simulation Flag
dSrOEn, dSrWbEn; //SR Read/Write Enable
dLbEn, dLbRWEn; //LB Enable Read/Write Enable
dPEOp; //PE Execution Operation
dPEOpmode; //PE Operation Mode Selection
dPEConfig; //PE Configuration
dPESel; //PE Selection
dDMAOp; //DMA Operation Selection
dDFBSel; //DFB Selection






dAluCtl; //ALU Control
dAluOEn; //ALU Output Enable
dShiftCtl; //Shift Control
dShiftOEn; //Shift Output Enable
dMulOEn; //Multiplier Output Enable
dOpAIdx; //Operand A Index
dOpBIdx; //Operand B Index
dRdAOEn; //Read A Output Enable



































































dRdBOEn; //Read B Output Enable
dWbIdx; //Writeback Index
dWbEn; //Writeback Enable
dImmOEn; //Immediate Output Enable
dIAregCtl; //Instruction Address Register Control
dDAregCtl; //Data Address Register Control
dDInCtl; //Data Input Control
dDOutCtl; //Data Output Control
dShiftAmt; //Shift Amount
instIdText; //Instruction ID Debug Information






























//SR Output Enable 
//SR Writeback Enable
//ALU Control
//ALU Output Enable
//Shifter Output Control
//Shifter Output Enable
//Multiplier Output Enable
//Operand A Index
//Operand B Index
//Read A Output Enable 




//Immediate Output Enable
//Instruction Address Register Control
//Data Address Register Control
//Data Input Control 
//Data Output Control 















idecode->branchFlag(dBranchFlag); idecode->exitFlag(dExitFlag); idecode->srOEn(dSrOEn);















iexecute->shift(shift); iexecute->rs1Idx(rs1Idx);


























sensitive << clock.pos() << reset;
SC_METHOD(do_condExe);

























/*
 * Control: Control Arch. for ICS_RISC(source file for control)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: control.cpp
 * Revision history: Version1
 * Date: 5/5/2005
 */
#include "control.h"
void control::do_pipeReg() {
    if (reset) {




        eIAregCtl = 0;
        branchFlag = 0;
        eExitFlag = 0;
    } else {
        instId = dInstId;
        cond = dCond;
        cmpFlag = dCmpFlag;
        branchFlag = dBranchFlag;
        eExitFlag = dExitFlag;







        srOEn = dSrOEn;
        eSrWbEn = dSrWbEn;
        aluCtl = dAluCtl;
        aluOEn = dAluOEn;
        shiftCtl = dShiftCtl;
        shiftOEn = dShiftOEn;
        mulOEn = dMulOEn;
        opAIdx = dOpAIdx;
        opBIdx = dOpBIdx;
        rdAOEn = dRdAOEn;
        rdBOEn = dRdBOEn;
        wbIdx = dWbIdx;
        eWbEn = dWbEn;
        immOEn = dImmOEn;
        imm = dImm;
        eIAregCtl = dIAregCtl;
        dAregCtl = dDAregCtl;
        dInCtl = dDInCtl;
        eDOutCtl = dDOutCtl;
        lbEn = dLbEn;
        lbRWEn = dLbRWEn;
        PEOp = dPEOp;
        PEOpmode = dPEOpmode;
        PEConfig = dPEConfig;
        PESel = dPESel;
        DMAOp = dDMAOp;
        DFBSel = dDFBSel;
        dataAmt = dDataAmt;
        startAddrDFB = dStartAddrDFB;
        SRAMRegSel = dSRAMRegSel;
        startAddrSRAMReg = dStartAddrSRAMReg;
        memSel = dMemSel;
        startAddrProgDaMem = dStartAddrProgDaMem;
void control::do_condExe() {
    bool execFlag; //Execute Flag
    bool exitFlag;
    if ((cond == COND_AL) || ((cond == COND_EQ) && (zFlag == 1)) || ((cond == COND_NE) && (zFlag == 0))) {
        execFlag = 1;
    flush = branchFlag & execFlag;
    wbEn = (execFlag && !refill) ? eWbEn : 0;
    srWbEn = (execFlag && !refill) ? eSrWbEn : 0;
    iAregCtl = (execFlag && !refill) ? eIAregCtl : 0;
    dOutCtl = (execFlag && !refill) ? eDOutCtl : 0;
    exitFlag = (execFlag && !refill) ? eExitFlag : 0;
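
The conditional-execution logic of do_condExe can be exercised outside SystemC with a small plain C++ model. The sketch below assumes three condition codes (always, equal, not equal) whose numeric encodings are hypothetical. The names execFlagFor, gate and Gated are illustrative and do not appear in the thesis code.

```cpp
#include <cstdint>

// Hypothetical encodings: the real COND_* values are not legible in the listing.
enum Cond : std::uint8_t { COND_AL, COND_EQ, COND_NE };

// Subset of the gated outputs, enough to show the pattern.
struct Gated {
    bool flush; // pipeline flush on a taken branch
    bool wbEn;  // register writeback enable
};

// An instruction executes if it is unconditional or its condition matches zFlag.
bool execFlagFor(Cond cond, bool zFlag) {
    return cond == COND_AL
        || (cond == COND_EQ && zFlag)
        || (cond == COND_NE && !zFlag);
}

// Side effects are suppressed while the pipeline is refilling.
Gated gate(Cond cond, bool zFlag, bool branchFlag, bool refill, bool eWbEn) {
    const bool execFlag = execFlagFor(cond, zFlag);
    return Gated{ branchFlag && execFlag, execFlag && !refill && eWbEn };
}
```

The model uses logical negation for the refill test: applying bitwise complement to a bool promotes it to int and yields a nonzero value for both true and false.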
/*
 * Debug: Debug Information for ICS_RISC(header file for debug)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: debug.h
 * Revision history: Version1
 * Date: 5/5/2005
 */
#include "systemc.h"
#include "def.h"
SC_MODULE(debug) {
















sensitive << instId << aluCtl;
instIdText.initialize(0);
aluText.initialize(0);
/*
 * Debug: Debug Information for ICS_RISC(source file for debug)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: debug.cpp
 * Revision history: Version1
 * Date: 5/5/2005
 */
#include "debug.h"
#define ALUIS 0
#define ALUIL 1
#define ALUR 2
#define ALULB 3
#define SHRO 4
#define LOAD 5
#define STORE 6
#define BRANCH 7
#define MUL 8
#define PECON 9
#define DMA 10




instIdTmp = instId.read();
aluCtlTmp = aluCtl.read();
switch (instIdTmp) {
case INST_ALUIS : instIdText = ALUIS; printf("ALUIS \n"); break;
case INST_ALUIL : instIdText = ALUIL; printf("ALUIL \n"); break;
case INST_ALUR : instIdText = ALUR; printf("ALUR \n"); break;
case INST_ALULB : instIdText = ALULB; printf("ALULB \n"); break;
case INST_SHRO : instIdText = SHRO; printf("SHRO \n"); break;
case INST_LOAD : instIdText = LOAD; printf("LOAD \n"); break;
case INST_STORE : instIdText = STORE; printf("STORE \n"); break;
case INST_BRANCH : instIdText = BRANCH; printf("BRANCH \n"); break;
case INST_MUL : instIdText = MUL; printf("MUL \n"); break;
case INST_PECON : instIdText = PECON; printf("PECON \n"); break;
case INST_DMA : instIdText = DMA; printf("DMA \n"); break;
default: printf("Not Defined Instruction \n");
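
The debug switch above amounts to a lookup from instruction ID to mnemonic. A table-driven sketch in plain C++ is given below, assuming the INST_* constants take the same values 0 to 10 as the debug codes defined in debug.cpp. The function name instName is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns the mnemonic for an instruction ID, using the ordering of the
// debug codes (ALUIS = 0 ... DMA = 10).
std::string instName(unsigned instId) {
    static const char* const names[] = {
        "ALUIS", "ALUIL", "ALUR", "ALULB", "SHRO", "LOAD",
        "STORE", "BRANCH", "MUL", "PECON", "DMA"
    };
    const std::size_t count = sizeof(names) / sizeof(names[0]);
    // IDs outside the table are reported the same way as the default case.
    return instId < count ? names[instId] : "Not Defined Instruction";
}
```

A table keeps the ID-to-name pairing in one place and cannot fall through from one case to the next.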
 * BusCtl: I/O Bus Control for ICS_RISC(header file for busCtl)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: busCtl.h















 * Revision history: Version1















sensitive << nRW << dataOut;
,, 
/*
 * BusCtl: I/O Bus Control for ICS_RISC(source file for busCtl)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: busCtl.cpp
 * Revision history: Version1
 * Date: 5/5/2005
 */
#include "busCtl.h"
void busCtl::do_busCtl() {
    dataIn = sc_uint<32> (data);
    if (nRW) {
        data = sc_lv<32> (dataOut);
    } else {
        data = "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
    }
}
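
The bus controller drives the shared data bus only when nRW is set and otherwise releases it to high impedance. The plain C++ sketch below models that behaviour with a driven flag in place of the SystemC sc_lv "Z" value. BusState and driveBus are hypothetical names, and the else branch reflects the assumed structure of the original if statement.

```cpp
#include <cstdint>

// Software stand-in for a tri-state bus: driven == false models high impedance.
struct BusState {
    bool driven;
    std::uint32_t value; // meaningful only when driven is true
};

// When nRW is set the core drives dataOut onto the bus; otherwise it releases it.
BusState driveBus(bool nRW, std::uint32_t dataOut) {
    if (nRW) {
        return BusState{true, dataOut};
    }
    return BusState{false, 0u};
}
```

A reader of the bus would check the driven flag before using value, much as a simulator resolves "Z" against other drivers on the same net.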
/*
 * ICS_RISC: Top module for ICS_RISC(header file for ICS_RISC)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: ICS_RISC.h
 * Revision history: Version1




























//PE Execution Operation 












PEOpmode; //PE Operation Mode
PEConfig; //PE Configuration Mode
PESel; //PE Selection
DMAOp; //DMA Operation Selection
DFBSel; //DFB Selection (2 Sets)
dataAmt; //Amount of Data to Transfer
startAddrDFB; //Start Address of DFB
SRAMRegSel; //SRAM/ICS_RISC Reg Sel
startAddrSRAMReg; //Start Address of SRAM/ICS_RISC Reg
memSel; //Memory Selection(Program/Data)







































































































































/*
 * ICS_RISC: Top module for ICS_RISC(source file for ICS_RISC)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: ICS_RISC.cpp
 * Revision history: Version1
 * Date: 5/5/2005
 */
#include "ICS_RISC.h"
void ICS_RISC::do_OutCtl() {
    tnRW = dOutCtl;
    nRW = tnRW;
idatapath->shiftCtl(shiftCtl);
idatapath->
idatapath->wbEn(wbEn);
idatapath->
idatapath->dIn(dIn);
idatapath->zFlag(zFlag);
idatapath->dOut(dOut);
