Framework for simulation of fault tolerant heterogeneous multiprocessor system-on-chip by Tafesse, Bisrat
UNLV Retrospective Theses & Dissertations 
1-1-2008 
Repository Citation 
Tafesse, Bisrat, "Framework for simulation of fault tolerant heterogeneous multiprocessor system-on-chip"
(2008). UNLV Retrospective Theses & Dissertations. 2434.
https://digitalscholarship.unlv.edu/rtds/2434 
FRAMEWORK FOR SIMULATION OF FAULT TOLERANT HETEROGENEOUS 
MULTIPROCESSOR SYSTEM-ON-CHIP
by
Bisrat Tafesse
Bachelor of Science
Alemaya University
2006
A thesis submitted in partial fulfillment
of the requirements for the
Master of Science Degree in Engineering 
Department of Electrical and Computer Engineering 
Howard R. Hughes College of Engineering
Graduate College 
University of Nevada, Las Vegas 
December 2008
UMI Number: 1463537
Unauthorized publication, modification or distribution of this material,
in whole or in part, in any form or by any means without the
written consent of the author is strictly prohibited.
All rights reserved © 2009
Thesis Approval
The Graduate College
University of Nevada, Las Vegas
November 26, 2008
The Thesis prepared by
Bisrat Tafesse
Entitled
Framework for Simulation of Fault Tolerant Heterogeneous
Multiprocessor System-On-Chip
is approved in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering
Examination Committee Member
Examination Committee Member
Graduate College Faculty Representative
Examination Committee Chair
Dean of the Graduate College
ABSTRACT
Framework for Simulation of Fault Tolerant Heterogeneous 
Multiprocessor System-On-Chip
By
Bisrat Tafesse
Dr. Venkatesan Muthukumar, Examination Committee Chair
Associate Professor of Electrical and Computer Engineering
University of Nevada, Las Vegas
Due to the ever-growing demand for high performance data computation,
current Uniprocessor systems fall short of meeting critical real-time performance
demands in i) high throughput, ii) faster processing time, iii) low power consumption, iv)
design cost and time-to-market factors and, more importantly, v) fault tolerant
processing. Shifting the design trend to MPSOCs is a way to meet these
challenges. However, developing efficient fault tolerant task scheduling and mapping
techniques requires optimized algorithms that consider the various scenarios in
Multiprocessor environments. Several works in the past few years have
proposed simulation based frameworks for scheduling and mapping strategies
that considered homogeneous systems and error avoidance techniques. However, most
of these works inadequately describe today's MPSOC trend because they focused
on the network domain and did not consider heterogeneous systems with fault tolerant
capabilities.
In order to address these issues, this work proposes i) a performance driven
scheduling algorithm (PDSA) based on the simulated annealing technique, ii) an optimized
Homogenous-Workload-Distribution (HWD) Multiprocessor task mapping algorithm
which considers the dynamic workload on processors and iii) a dynamic Fault Tolerant
(FT) scheduling/mapping algorithm to provide a robust application processing system. The
implementation was accompanied by a heterogeneous Multiprocessor system
simulation framework developed in SystemC/C++. The proposed framework reads user
data, sets the architecture, executes the input task graph and finally generates performance
variables. This framework alleviates the issues of previous work with respect to i) architectural
flexibility in number of processors, processor types and topology, ii) optimized
scheduling and mapping strategies and iii) fault tolerant processing capability, focusing
more on the computational domain.
A set of random as well as application specific STG benchmark suites were run on
the simulator to evaluate and verify the performance of the proposed algorithms. The
simulations were carried out for i) scheduling policy evaluation, ii) fault tolerant
evaluation, iii) topology evaluation, iv) number of processors evaluation, v) mapping policy
evaluation and vi) processor type evaluation. The results showed that the PD scheduling
algorithm gave marginally better performance than EDF with respect to utilization,
execution time and power factors. The dynamic Fault Tolerant implementation proved
to be a viable and efficient strategy to meet real-time constraints without causing
significant system performance degradation. Torus topology gave better performance
than Tile with respect to task completion time and power factors. Executing highly
heterogeneous tasks showed higher power consumption and execution time. Finally,
increasing the number of processors showed a decrease in average utilization but
improved task completion time and power consumption.
Based on the simulation results, the system designer can compare tradeoffs
between various design choices with respect to the performance requirement
specifications. In general, designing an optimized Multiprocessor scheduling and
mapping strategy with added fault tolerant capability will enable the development of efficient
Multiprocessor systems which meet future performance requirements. This is the
substance of this work.
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ACKNOWLEDGMENTS
CHAPTER 1 INTRODUCTION
    The Trend towards MPSOC
    MPSOC at Present
    Project Accomplishment
    Conclusion
CHAPTER 2 LITERATURE REVIEW
    MPSOC Overview
    MPSOC Classification
    Multiprocessor Scheduling and Mapping Problem Space
        Multiprocessor Task Scheduling
        Multiprocessor Task Mapping
    Fault-Tolerance in Multiprocessors
    Related Work
CHAPTER 3 DEFINITIONS AND THEORY
    Definitions
    System Definitions
    Performance Definitions
CHAPTER 4 METHODOLOGY
    Application Modeling
    Application Partitioning Overview
    Task Scheduling and Mapping Strategies
        Task Scheduling Approach
        Task Mapping Approach
    MPSOC Simulation Modeling Approach
    Framework Modeling
    Framework Control Flow
    Framework Flexibility
    Architectural Framework Modeling
        The XY Switching Algorithm
    Behavioral Framework Modeling
CHAPTER 5 IMPLEMENTATION
    Performance Driven Scheduling Algorithm
        Simulated-Annealing Overview
    Homogenous-Workload-Distribution (HWD) Mapping Strategy
        Background
        HWD Mapping Algorithm
    Fault Tolerant Implementation
        Background
        Fault Tolerant Model
        Fault Tolerant Algorithm
CHAPTER 6 SIMULATION RESULTS
CHAPTER 7 CONCLUSIONS AND FUTURE WORK
    Summary
    Contributions
    Future Work
APPENDIX I DETAILED SIMULATION PROFILES
APPENDIX III BENCHMARK SUITES
APPENDIX IV SYSTEMC
REFERENCES
VITA
LIST OF TABLES
Table 1. Comparison of Processor Utilization, Throughput and Buffer Utilization
Table 2A. Comparison of Execution Time, Port Traffic and Power
Table 2B. Comparison of Execution Time, Port Traffic and Power
Table 3. Average Execution Time, Port Throughput, Power and POP
Table 4. Scheduling Evaluations
Table 5. Comparison of NFT and FT
Table 6. Comparison of EDF and Simulated-Annealing
Table 7. Performance Evaluation for RAND 0000-3000 Benchmark
Table 8. Comparison of HWD and Next Available and Random Mapping Algorithms
Table 9. Comparison of Processor Type
Table 10. Comparison of Execution Time for Different Number of Tasks
Table 11. Comparison of Power for Different Number of Tasks
Table 12. Comparison of Execution Time for Different Scenarios of Variables
Table 13. Comparison of Utilization for Different Scenarios of Performance Variables
Table 14. Comparison of Throughput for Different Scenarios of Performance Variables
Table 15. Comparison of Performance Variables for EDF, NFT, Tile and 16 Processors
Table 16. Comparison of Performance Variables for SA, NFT, Tile and 16 Processors
Table 17. Comparison of Performance Variables for EDF, FT, Tile and 16 Processors
Table 18. Comparison of Performance Variables for SA, FT, Tile and 16 Processors
Table 19. Comparison of Performance Variables for EDF, NFT, Tile and 25 Processors
Table 20. Comparison of Performance Variables for SA, NFT, Tile and 25 Processors
Table 21. Comparison of Performance Variables for EDF, FT, Tile and 25 Processors
Table 22. Comparison of Performance Variables for SA, FT,
Table 23. Comparison of Performance Variables for EDF, NFT, Torus and 16 PEs
Table 24. Comparison of Performance Variables for SA, NFT, Torus and 16 Processors
Table 25. Comparison of Performance Variables for EDF, FT, Torus and 16 Processors
Table 26. Comparison of Performance Variables for SA, FT, Torus and 16 Processors
Table 27. Comparison of Performance Variables for EDF, NFT, Torus and 9 Processors
Table 28. Comparison of Performance Variables for SA, NFT, Torus and 9 Processors
Table 29. Comparison of Performance Variables for EDF, FT, Torus and 9 Processors
Table 30. Comparison of Performance Variables for SA, FT, Torus and 9 Processors
Table 31. Task Size for STG Benchmark Files
LIST OF FIGURES
Figure 1. Task Execution Parallelism
Figure 2. Trends in Performance for Desktop Processors
Figure 3. Application Hierarchy
Figure 4. Application Modeling Using Task-Graph
Figure 5. Task-Status Enumeration and State Modeling
Figure 6. Heterogeneous System with Torus Topology
Figure 7. Application Modeling on MPSOC Architecture
Figure 8. Framework Modeling
Figure 9. Processor and Switching Technique
Figure 10. Simulation Framework Prototype and Infrastructure
Figure 11. Software/Hardware Libraries and Architecture Setup Module
Figure 12. Simulated-Annealing Algorithm
Figure 13. Simulated-Annealing State Transition
Figure 14. Hardware Interconnection in 5X5 Torus
Figure 15. Fault Tolerant Pseudo Code
Figure 16. Comparison of Processor Utilization
Figure 17. Comparison of Processor Throughput
Figure 18. Comparison of Buffer Utilization
Figure 19. Comparison of Execution Time
Figure 20. Comparison of Power
Figure 21. Comparison of Port Traffic
Figure 22. Comparison of Exec. Time/Task
Figure 23. Comparison of Exec. Time/Task
Figure 24. Comparison of Exec. Time/Task
Figure 25. Comparison of Power/Task
Figure 26. Comparison of Power/Task
Figure 27. Comparison of Power/Task
Figure 28. Comparison of Execution Time/Task for Different Num. of Processors
Figure 29. Comparison of Power/Task for Different Num. of Processors
Figure 30. Processor Utilization Evaluation for EDF and SA
Figure 31. Throughput Evaluation for EDF and SA
Figure 32. Buffer Usage Evaluation for EDF and SA
Figure 33. Comparison of Utilization for NFT and FT
Figure 34. Comparison of Throughput for NFT and FT
Figure 35. Comparison of Buffer Usage for NFT and FT
Figure 36. Comparison of Utilization for Tile and Torus
Figure 37. Comparison of Throughput for Tile and Torus
Figure 38. Comparison of Buffer Utilization for Tile and Torus
Figure 39. Comparison of Exec. Time/Task for Tile and Torus
Figure 40. Comparison of Utilization for Tile and Torus
Figure 41. Comparison of Power/Task for Tile and Torus
Figure 42. Comparison of Port Traffic for Tile and Torus
Figure 43. Comparison of Execution Time
Figure 44. Comparison of Power
Figure 45. Comparison of Utilization
Figure 46. Comparison of Throughput
Figure 47. Comparison of Execution Time
Figure 48. Comparison of Power
Figure 49. Comparison of Utilization
Figure 50. Comparison of Throughput
Figure 51. Comparison of Execution Time for Different Number of Tasks
Figure 52. Comparison of Power for Different Number of Tasks
ABBREVIATIONS
ASIC Application Specific Integrated Circuit
DSP Digital Signal Processor
DTD Document Type Definition
E3S Embedded System Synthesis Benchmarks Suite
EDF Earliest Deadline First
EEPROM Electrically Erasable Programmable Read Only Memory
FPGA Field Programmable Gate Array
FTS Fault Tolerant Scheduling
GXL Graph Exchange Language
GUI Graphical User Interface
HDL Hardware Description Language
HMPSOC Heterogeneous Multiprocessor System On Chip
HWD Homogenous Workload Distribution
MIMD Multiple Instruction Multiple Data
MPSOC Multiprocessor System On Chip
NOC Network On Chip
NP Nondeterministic Polynomial (time)
PE Processing Element
QOS Quality of Service
RAM Random Access Memory
ROM Read Only Memory
RTOS Real-time Operating System
SIMD Single Instruction Multiple Data
SA Simulated Annealing
SOC System-On-Chip
STG Standard Task Graph
TGFF Task Graph For Free
ACKNOWLEDGMENTS 
...In loving memory of my mother, Haregewoin Zewdle.
To my father, Tafesse Muluneh, in recognition of his relentless support and sacrifice,
which I can never repay.
To my pathfinder, Dr. Venkatesan Muthukumar, for his inspiration and committed
guidance throughout my entire course work.
To Dr. Emma Regentova for her kind and unrestricted help.
CHAPTER 1 
INTRODUCTION
1.1 The Trend towards MPSOC
Extensive work has been done in the field of Uniprocessor systems aimed at
designing efficient task scheduling strategies. Uniprocessor scheduling defines task
timing, placement and processing in a Uniprocessor environment. This scheme provides
apparent execution concurrency whereby the percentage of processor usage varies
according to the resource sharing policy. These systems have architectural constraints
which determine the scheduling methodology employed in the design. In the case of
preemptive systems, the processor is shared among competing tasks. Tasks are fetched,
processed, preempted and resumed several times in an interleaved fashion during the
task execution cycle. Task preemption and resumption incur considerable overhead in
context switching.
Pipelined and superscalar architectural designs improved throughput and utilization
of single processor systems. These architectures employ instruction level parallelism
where different phases of multiple instructions are overlapped in execution.
However, present day applications are becoming highly sophisticated with respect to
the vast computational data involved and application constraints. The ceaseless effort
by computer architects to meet these challenges is facing difficulty because chip
manufacturing technology can no longer meet these requirements adequately [6].
Applications that run on embedded systems such as multimedia, digital signal
processing, image processing and wireless communication involve an ever-growing volume
of data processing and complexity which can no longer be satisfied by Uniprocessor
systems [12]. The emerging performance demand is driving the technology towards
concurrent task processing because Uniprocessor computing isn't a viable
alternative for present performance demands. Task scheduling strategies need to be
adapted for Multiprocessor systems to take the architectural change into account.
Uniprocessor performance issues like i) execution concurrency, ii) data synchronization
and coherence and iii) fault tolerance require a different methodology when considered
for Multiprocessor systems.
To make this concrete, consider task execution parallelism. Uniprocessor
systems are known to offer parallelism through apparent execution concurrency, where
the processor executes several tasks in an interleaved fashion. The processor is shared
among tasks which are executed in predetermined timeslots. The resource sharing
policy may adopt round-robin time division multiplexing or other scheduling schemes.
However, no two tasks are executed simultaneously. Uniprocessor parallelism in the
form of pipelined and superscalar processing, as a means to speed up computation, has
also reached the limit where it can no longer keep up with the growing performance
requirements.
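The interleaving described above can be sketched in a few lines of Python. This is only an illustrative model, not the scheduler used in this work: the function name `round_robin`, the fixed `quantum` and the per-switch `switch_cost` are assumptions introduced for the example.

```python
from collections import deque

def round_robin(exec_times, quantum, switch_cost=0):
    """Simulate round-robin time slicing on ONE processor.

    exec_times  -- execution time needed by each task (hypothetical inputs)
    quantum     -- length of one timeslot
    switch_cost -- assumed overhead paid whenever the processor changes task
    Returns (finish_times, total_time).
    """
    queue = deque(enumerate(exec_times))   # (task index, remaining time)
    finish = [0] * len(exec_times)
    clock = 0
    previous = None
    while queue:
        task, remaining = queue.popleft()
        if previous is not None and previous != task:
            clock += switch_cost           # context-switch overhead
        run = min(quantum, remaining)      # only one task runs at any instant
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((task, remaining))  # preempted, resumed later
        else:
            finish[task] = clock
        previous = task
    return finish, clock

finish_times, total = round_robin([2, 3, 4, 5], quantum=1)
_, total_with_overhead = round_robin([2, 3, 4, 5], quantum=1, switch_cost=1)
```

With a zero switch cost the total time equals the sum of the execution times, which shows that interleaving gives only apparent concurrency; any non-zero switch cost makes the total strictly larger.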
To circumvent the limited parallelism exhibited by Uniprocessor
systems, the straightforward solution would be to have N processors execute multiple
tasks independently. Theoretically, this reduces the total processing time to
1/N of the Uniprocessor time because the workload is shared among the N processors. Figure 1 illustrates
the parallelism of four tasks T1, T2, T3, T4 with execution times of 2, 3, 4, 5 respectively. For
part A, tasks are submitted sequentially in non-preemptive mode so the minimum
processing time would be (2+3+4+5) = 14 time units. For part B, due to the parallelism,
total processing time would be the maximum of (2, 3, 4, 5) = 5 time units, provided all tasks are
submitted at the same time and tasks are independent.
Figure 1. Task Execution Parallelism (part A: sequential execution; part B: parallel execution on PE-1 to PE-4)
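The arithmetic behind the four-task example can be checked with a short Python sketch. The greedy assignment below (each task goes to the processing element that becomes free first) is an assumption made for illustration and is not the mapping algorithm proposed in this work; the name `makespan` is likewise hypothetical.

```python
import heapq

def makespan(exec_times, num_pes):
    """Completion time of independent tasks, all submitted at time 0,
    on num_pes identical processing elements (greedy assignment)."""
    loads = [0] * num_pes                  # time at which each PE becomes free
    heapq.heapify(loads)
    for t in exec_times:
        earliest = heapq.heappop(loads)    # PE that is free first
        heapq.heappush(loads, earliest + t)
    return max(loads)

tasks = [2, 3, 4, 5]                       # execution times of T1..T4
sequential = makespan(tasks, 1)            # one PE: sum of execution times
parallel = makespan(tasks, 4)              # four PEs: longest single task
```

Note that the parallel time is bounded below by the longest task, so the ideal 1/N reduction is reached only when the workload divides evenly among the processors.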
The parallelism exhibited by Multiprocessors is real because different processors
execute several tasks independently through multithreaded processing. This parallelism
comes at the expense of increased resources, which in turn increase the freedom of
choice in task mapping. Often, parallelism in MPSOCs follows SIMD or MIMD architecture
[6][25]. SIMD architectures execute the same operation on different data streams;
MIMD architectures execute different operations on different data streams.
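The difference between the two styles can be illustrated with a purely conceptual Python sketch. The code runs sequentially and only models which operation is applied to which data stream; the function names `simd` and `mimd` are assumptions made for this example.

```python
def simd(operation, data_streams):
    """SIMD style: the SAME operation is applied to every data stream."""
    return [operation(x) for x in data_streams]

def mimd(operations, data_streams):
    """MIMD style: each data stream gets its OWN operation."""
    return [op(x) for op, x in zip(operations, data_streams)]

# One instruction (doubling) over three data items.
simd_result = simd(lambda x: x * 2, [1, 2, 3])

# Three different instructions, one per data item.
mimd_result = mimd([abs, lambda x: x + 1, str], [-1, 1, 5])
```

In a real MPSOC each list element would correspond to a separate processing element working at the same time.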
Even if Multiprocessor systems yield better overall performance than Uniprocessors,
there are cost tradeoffs in design time and effort during hardware/software
development [12]. Additionally, Multiprocessor systems require powerful and reliable
means of communication to meet high bandwidth requirements. The works in [2][4][18][20][21][24]
proposed viable means of networking for Multiprocessors. The bus matrix based component
interconnect for MPSOCs proposed by Sudeep et al. [24] showed better performance
than the traditional bus based implementation with respect to bus congestion in
medium sized Multiprocessors; for larger Multiprocessor variants, however, they
proposed packet-switched NOCs. The SESAME project [9] proposed a Multiprocessor model which
maps applications together with the communication channels onto the architecture. The
model tries to minimize an objective function which considers power, execution time
and total cost of the architectural design.
As more sophisticated applications emerge in the future, the benefits that MPSOCs
deliver are expected to increase as well. The principal advantages of MPSOCs include:
1) Design modularity which enables system flexibility. Thus, depending on user
requirements and market demand, the system can be easily reconfigured.
2) Improved performance exhibited by MPSOCs with respect to task throughput
and power consumption.
3) Fault tolerance due to the availability of multiple processors.
4) Optimized architecture with no bulky and costly general purpose processors. This
also reduces heating because optimized cores are smaller in size and
consume less power. Additionally, faster execution time is an intrinsic advantage
of Multiprocessor systems.
5) Meeting real-time constraints due to increased processing power.
6) Software/hardware design cost and time-to-market are greatly
reduced. Furthermore, developing a Multiprocessor system is by far cheaper than
manufacturing a single high performance processor chip due to the difficulty of the
complex chip manufacturing process.
1.2 MPSOC at Present
Current technology allows complex designs to be placed in modern digital systems.
Entertainment and high-tech military applications rely on Multiprocessor
systems because a single processor is not a viable alternative for delivering the
required performance. Uniprocessor systems guarantee high utilization of the single processor
but they fall short in meeting critical real-time demands in i) high throughput, ii) faster
processing time, iii) low energy and power rating and iv) fault tolerant computing.
Moore's law predicted that the number of transistors on a chip would double roughly
every 18 months, enabling faster execution. Unfortunately, earlier improvements through
pipelining and instruction-level parallelism for Uniprocessor designs can no longer keep
up with Moore's prediction [26]. Figure 2 shows the performance gap between the current
trend and Moore's prediction by evaluating different processor generations against the
SPECint2000 benchmark.
Figure 2. Trends in Performance for Desktop Processors (source: Wayne [26])
1.3 Project Accomplishments
Most of the methodologies followed in Uniprocessor system designs can be adopted
for Multiprocessor systems as well. The two main features which need to be adjusted are
the task scheduling and mapping strategies, because Multiprocessor systems have a different
environment and offer freedom in task scheduling and allocation.
In this work, different existing as well as proposed scheduling and mapping
heuristics are implemented. The classical Earliest-Deadline-First (EDF) scheduling
algorithm is implemented to meet real-time application constraints. In order to arrive at
an efficient task scheduling strategy, a Performance-Driven scheduling algorithm based
on the Simulated-Annealing heuristic is proposed, which tries to find an optimized
scheduling strategy within an affordable cost of time. A new Homogenous-Workload-Distribution
mapping policy which considers the processors' runtime status is also proposed.
TILE and its variant TORUS topologies, which are common layouts in Multiprocessing
environments, are also supported in the framework. Heterogeneity in processor
architecture, which is a key design feature for embedded systems, is another essential
aspect of this framework. In order to assist the designer with a comprehensive environment
that realizes a variety of design parameters, it is necessary to have a tool that is
re-configurable system-wide with respect to the number of processors, processor types,
scheduling and mapping policies, topology, and buffer sizes; this re-configurability
constitutes the core design benefit of the framework.
Finally, in an effort to implement a robust system, a new fault tolerant algorithm is
proposed which re-configures the system in the event of processor failure. Fault
tolerant implementation is a principal design requirement for real-time operating systems
(RTOS) because processor failure introduces performance issues which may entail real-time
constraint violation. MPSOCs incorporate more than one processor, which makes
them less vulnerable to performance issues during processor failure.
Integrating all these scenarios during the system design phase will help to narrow the
wide range of complex design choices, thereby reducing design time, cost and effort
from a hardware/software design perspective.
1.4 Conclusion
Representing every minute facet of Multiprocessors is a task that is next to
impossible. Emphasizing the key performance factors significantly simplifies the
development process. Utilization, throughput, execution time and power are major
Multiprocessor performance indices. In order to optimize the performance of
Multiprocessors, the various scenarios need to be carefully analyzed and understood.
Performance optimization is achieved through several mechanisms including strategic
task scheduling and mapping algorithms. Faulty scenarios should also be carefully
addressed to tolerate processor failure and resume the application processing with the
minimum possible performance penalty. Simulation based design approaches simplify
the effort of refining design outcomes quickly prior to their implementation, thereby
reducing design effort and cost.
The upcoming chapters elaborate further on the methodologies
and implementation of the work. Chapter 2 reviews previous work in the field of
Multiprocessor system environments. Chapter 3 provides definitions for key
terminologies used throughout this thesis. Chapter 4 presents the proposed
methodology in task scheduling, mapping and fault tolerance techniques. Chapter 5
demonstrates the implementation in detail. Chapter 6 presents the detailed simulation
results and analysis. Finally, Chapter 7 concludes the thesis and recommends future work.
CHAPTER 2 
LITERATURE REVIEW
2.1 MPSOC Overview
Multiprocessor systems incorporate a number of processing elements built into a
system with the aim of improving application processing performance. MPSOCs are
design alternatives to Uniprocessor systems to meet embedded and real-time system
demand. MPSOCs are expected to form the backbone of future embedded systems and are
likely to displace Uniprocessor designs in embedded computing.
The MPSOC paradigm comprises three major components:
1) Processing elements, which perform the task execution,
2) Memory elements, which are used for data storage, and
3) A communication network that interconnects all components in a manner defined
by the topology. The network part includes the routers and/or switches that route
the data to the required processing element.
Embedded systems, as mentioned earlier, employ resources that are optimized for a
specific purpose. Complex and expensive general purpose processors are superseded by
customized processor designs, thereby enabling shorter time-to-market and cheaper
mass production. These optimized components yield higher performance at a lower
cost of manufacturing [11].
Customizing a multiprocessor system involves removing, adding or modifying
components to meet the required functionality [26]. Multiprocessor customization
mainly refers to:
1) Modifying processing elements
2) Modifying the memory blocks
3) Modifying the interconnection between components
4) Adding specialized resources or removing unnecessary components that do not
contribute much to the performance improvement of the system [26].
Heterogeneity in MPSOCs is described with respect to processor functionality and
flexibility. Multiprocessors can be made from programmable or dedicated (non-programmable)
processors. Programmable processors can be reconfigured to offer
execution flexibility with respect to user demand. FPGA based implementations of CPUs
and SoCs are common design choices to support flexibility in multiprocessing systems.
Dedicated processors like DSPs and ASICs are fast and cheap to design but cannot be
reconfigured based on changing application requirements. Memory elements can also
be heterogeneous with respect to block size, access mode, and clock cycles.
This work focuses particularly on designing efficient fault tolerant task scheduling
and mapping strategies for Multiprocessor systems.
2.2 MPSOC Classification
MPSOC and Network-On-Chip (NOC) are major design paradigms in multiprocessing
systems. An MPSOC may incorporate all components into a single chip. However,
depending on the application demand, processors and memory modules can be
implemented as physically distributed components interconnected by some means of
communication network. This design choice defines another paradigm called NOC.
MPSOCs can be classified as follows, based on the applications they are tailored to
process and on their architecture:
1) Embedded vs. General Purpose Systems
2) Real-time vs. Non-real-time Systems
3) Preemptive vs. Non-preemptive Systems
4) Homogeneous vs. Heterogeneous Systems
Present day Multiprocessing architectures fall under most of these design
classifications. In fact, modern embedded Multiprocessor systems are real-time,
preemptive and heterogeneous [23].
Memory elements in MPSOCs follow either i) a distributed memory (message passing)
architecture or ii) a global shared memory architecture [1][4][6][17]. Global memory
architectures must deal with data synchronization and access control issues to
guarantee that memory read/write operations among the competing processors are handled
in a consistent manner. This procedure introduces data access delay between memory
and processing elements. Processing complex applications introduces high data traffic
and thus a higher frequency of memory access, which makes the overall delay significantly
high. Practically, reducing the frequency of memory access shortens task processing
time because memory cycle time is still a design bottleneck in chip technology.
In distributed memory architectures, each processor has local memory. This
eliminates global memory synchronization and access control issues. While global
memory architectures are less costly in design and easier to implement, the current
Multiprocessor design trend mostly favors distributed memory architectures as they are
more efficient and allow system-wide scalability [26].
2.3 Multiprocessor Scheduling and Mapping Problem Space
2.3.1 Multiprocessor Task Scheduling
Real-time systems [23] are concerned not only with processing performance but also
with the timing constraints of the application. Real-time applications are often unable to
afford delayed task execution. They instead require careful timing of the tasks so that
the coherence and validity of data among tasks are preserved.
Real-time systems can be i) soft real-time or ii) hard real-time depending on the
behavior of the application. Soft and hard deadlines specify the strictness of the penalty
when a task deadline is missed. Soft real-time systems can tolerate deadline violation
because the application they are used in is not critical. Hard real-time systems, however,
are strict on task timing. A task should be processed neither sooner nor later, but within a
specified time frame; otherwise the effect could be catastrophic. Hard real-time
systems are commonly integrated in the control units of medical, military and
similar mission-critical applications.
To meet deadline requirements, real-time systems often employ preemptive
execution methods. This enables task processing to be completed within its deadline.
Tasks that have missed their deadline are of no further value, so processing them
further would only waste valuable system resources. If an executing
task is anticipated not to complete by its deadline, it is discarded so that
other waiting tasks can use the resources and complete within their deadlines. Real-time
systems like video decoders employ this strategy to deliver the required quality of
service (QOS) to the viewer. During complex video decompression, the graphics processor
discards frames that are anticipated not to be ready for display by the deadline and
moves on to execute the next frames. The displayed video has reduced quality due to the
missing video frames, but this can be tolerated by the viewer.
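The discard rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the framework: the `Frame` record, its field names and the `run_with_discard` helper are hypothetical, and frames are assumed to be processed one at a time without preemption.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    exec_time: int  # time units needed to process the frame (assumed known)
    deadline: int   # absolute time by which the frame must be ready


def run_with_discard(frames, start_time=0):
    """Process frames in order, discarding any frame that cannot finish by its deadline."""
    now = start_time
    completed, discarded = [], []
    for frame in frames:
        if now + frame.exec_time > frame.deadline:
            # The frame is anticipated to miss its deadline: drop it and move on.
            discarded.append(frame.name)
            continue
        now += frame.exec_time
        completed.append(frame.name)
    return completed, discarded
```

Discarding the late frame frees the processor early enough for the following frame to meet its own deadline, which is the quality-for-timeliness trade described for video decoders.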
Tasks are given priority for execution by the Scheduler according to the scheduling
policy. The efficiency of the scheduling algorithm is evaluated with respect to the
performance goals of the application. If the goal is to reduce power consumption, the
scheduling algorithm which executes an application with minimum power is chosen.
If instead faster application processing is required, the scheduling algorithm which
executes the application faster is chosen. A feasible schedule, after all, is one
which satisfies all the constraints of the application [6][15].
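As one concrete example of a priority policy, the classical Earliest-Deadline-First rule can be sketched as follows. The sketch assumes a single processor, non-preemptive execution and tasks that are all available at time zero; the function name `edf_schedule` and the tuple layout are illustrative choices, not taken from the framework.

```python
import heapq


def edf_schedule(tasks):
    """Order tasks by earliest deadline first.

    tasks: list of (name, exec_time, deadline) tuples (hypothetical layout).
    Returns (execution order, names of tasks that finish after their deadline).
    """
    # A min-heap keyed on the deadline always yields the most urgent task next.
    heap = [(deadline, name, exec_time) for name, exec_time, deadline in tasks]
    heapq.heapify(heap)
    now, order, missed = 0, [], []
    while heap:
        deadline, name, exec_time = heapq.heappop(heap)
        now += exec_time
        order.append(name)
        if now > deadline:
            missed.append(name)
    return order, missed
```

A schedule produced this way is feasible in the sense used above only when the list of missed tasks is empty.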
Extensive work has been done for many years in the field of Multiprocessor task
scheduling. It has been shown that earlier scheduling algorithms like FCFS and least-laxity
[15] fail to qualify as acceptable choices of scheduling because they are unable to
respect deadline requirements for real-time applications. Nguyen [15] proposed
earliest-effective-deadline-first scheduling. The work demonstrated that increasing the
number of processors to improve scheduling success increases the size of the
architecture, which makes the scheduling process more difficult. [2] proposed
Traffic-Aware-Scheduling for hard NOCs, aimed at minimizing the earliest-completion-time
of the application by considering the dynamic network traffic.
Finding an optimal solution in the scheduling/mapping problem domain for a set of
real-time tasks has been shown to be an NP-hard (Non-Deterministic Polynomial) problem [15].
Thus, certain heuristics have to be employed to make the scheduling/mapping search
feasible. Computational complexity analysis is given in detail in [16]. Carvalho et al. [7]
proposed a "congestion-aware" mapping heuristic to dynamically map tasks on
NOC-based heterogeneous MPSOCs. The heuristic initially assigns master-tasks to
selected processors, and the slave-tasks are then allocated in a congestion-minimizing
manner. Their work focused on mapping the tasks based on the runtime status of the NOC
links and channels to minimize network congestion.
2.3.2 Multiprocessor Task Mapping
Scheduling algorithms for single processor systems only specify the order of task
execution. However, multiprocessor environments grant freedom in dynamic task
mapping. The mapping algorithm also needs to specify where tasks should be
processed. The greater the freedom of choice, the harder it is to decide the allocation of
tasks. Multiprocessor systems in essence yield higher performance but they also incur
considerable workload to determine task allocation on the available processors. The
mapping freedom comes at the expense of the difficulty of carefully distributing the tasks
while considering data synchronization, consistency and deadline constraints. However,
the performance gain that MPSOCs yield more than offsets the tradeoffs imposed during the
design process.
Task mapping in heterogeneous systems is done in two steps: i) binding and ii)
placement. First, the task binding procedure attempts to find which particular
assignment of tasks would lead to lower power and faster execution time. This
is type matching between task type and processor type. Homogeneous systems have no
type constraints, which makes the mapping process one step shorter. The second step is
task placement. During this step, the Mapper tries to allocate any two tasks that
exchange data frequently as close to each other as possible. Placement of
communicating tasks on adjacent or nearby processors reduces the communication
latency, network congestion, switch traffic and power dissipation.
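The two mapping steps can be illustrated with a short Python sketch. It is a simplified illustration rather than the Mapper of the framework: processors are assumed to be dictionaries with hypothetical `id`, `type` and `xy` keys, and closeness is assumed to be the Manhattan distance on a 2-dimensional mesh without wrap-around links.

```python
def bind(task_type, processors):
    """Binding: keep only the processors whose type matches the task type."""
    return [p for p in processors if p["type"] == task_type]


def manhattan(a, b):
    """Hop distance between two XY coordinates on a mesh (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def place(candidates, partner_xy):
    """Placement: pick the candidate closest to the communicating partner task."""
    return min(candidates, key=lambda p: manhattan(p["xy"], partner_xy))


def map_task(task_type, partner_xy, processors):
    """Bind first, then place; returns a processor id or None if no type matches."""
    candidates = bind(task_type, processors)
    if not candidates:
        return None
    return place(candidates, partner_xy)["id"]
```

In a homogeneous system `bind` would return every processor, which corresponds to the mapping process being one step shorter.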
Task mapping can be static or dynamic. Static mapping is the simplest task allocation
policy, which assigns a particular task to a pre-determined processing element regardless
of the dynamic state of the system components. Task assignment is determined offline
before the task processing begins. As a result, this scheme is comparatively cheap, easy
and fast to implement because the costly process of monitoring the system at runtime
is avoided. This, however, conflicts with MPSOC requirements because processing
tasks on a heterogeneous architecture introduces workload dynamics on the system
resources. These dynamics introduce irregularity in the workload distribution among the
processing elements. Irregular workload distribution is an obstacle to efficiently
optimizing system performance. To overcome this problem, tasks should be mapped on
the fly by considering the dynamic workload on the processors. Processor workload is
measured by i) how long the currently running task would take to finish and ii) how many
tasks the processor has in its local memory.
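A workload-aware choice of processor based on these two measures can be sketched as below. How the two measures are combined is not specified here, so the weighted sum and the `queue_weight` parameter are assumptions made purely for illustration, as are the dictionary keys.

```python
def workload(remaining_time, queued_tasks, queue_weight=1.0):
    """Combine the two workload measures into one score (assumed weighted sum)."""
    return remaining_time + queue_weight * queued_tasks


def least_loaded(processors, queue_weight=1.0):
    """Return the id of the processor with the smallest workload score."""
    best = min(
        processors,
        key=lambda p: workload(p["remaining_time"], p["queued_tasks"], queue_weight),
    )
    return best["id"]
```

Raising `queue_weight` makes the number of buffered tasks dominate the decision, while a small value favours processors whose running task is about to finish.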
Tasks about which the relevant information is known prior to their execution are
called deterministic tasks. These tasks have static behavior throughout the execution
process regardless of external factors. Real-time tasks, for example, have
predetermined execution time and deadline and thus are deterministic, because these
values are independent of the number of tasks in the application. Determinism of tasks
makes the scheduling process easier because task processing can proceed without regard
to runtime factors. Loops and conditional branches inside a process introduce
non-determinism because their results are known only at runtime. Therefore, a static
mapping strategy would be ideal only for deterministic tasks. [6] has shown static and
dynamic task allocation for deterministic and non-deterministic tasks respectively. The
work by Myoung et al. [8] suggests that real-time applications should behave
deterministically even in unpredictable environments so that the tasks can be processed
on a real-time basis.
One advantage of dynamic task allocation is that it enables the scheduling of
non-deterministic tasks at runtime. Statically mapping non-deterministic tasks may result
in a costly runtime performance penalty. On the other hand, dynamic allocation introduces
extra computation overhead as a result of the scheduling and mapping processes, which
have to be performed on top of the main application processing [6].
Task migration as opposed to dynamic mapping has been proposed in [26] to 
overcome performance bottlenecks wherever they are identified or anticipated to
occur. This can result in performance improvement in shared memory systems but could
cancel out that benefit when implemented in message passing (non-shared
memory) systems due to the high volume of inter-processor data communication.
The efficiency of the Mapper is measured by how optimally the tasks are mapped on
the available resources. Practically, increasing the number of processors increases the
freedom to execute tasks at a higher degree of parallelization; however, it also introduces
higher power consumption, intercommunication latency and channel congestion, which
ultimately degrade the system performance.
2.4 Fault-Tolerance in Multiprocessors
Fault-Tolerance is one major design issue that real-time processor systems need to
address. Practically, system designs often have flaws that may not be identified at
design time. It is virtually impossible to anticipate all faulty scenarios beforehand
because processor failure may occur due to unexpected runtime errors. Real-time
systems are critically affected if no fault tolerance scheme is implemented.
Various algorithms have been proposed to address the different kinds of faulty
scenarios. [3] proposed an NoC design flow which automates the insertion of system
monitors at design time whenever the communication requirements are known. Heiko
et al. [10] demonstrated methods of error tolerance through time, space and data
redundancy. This is, however, a classic and costly approach in many respects, though the
work demonstrated that redundancy is a viable method in certain cases.
2.5 Related Work
Simulation based approaches for designing and verifying embedded MPSOCs have
been proposed in [12][13]. POLIS [12] is a framework based on both microcontrollers for
software and ASICs for hardware implementation. PTOLEMY [12] is a flexible framework
for simulating and prototyping heterogeneous systems. METROPOLIS [12] is an
extension of POLIS that provides unified modeling and structure for simulating
computation models. Works that specifically address the long run times of simulations
have also been proposed. Thiele et al. [12] proposed a simulation based distributed
operation layer (DOL) which considers concurrency, scalability, mapping optimization
requirements and performance analysis.
These works generally focused on task scheduling that minimizes congestion
on the underlying network and communication links; but application workload
distribution is also a significant factor that determines performance. Similar works
concentrated on homogeneous systems, which inadequately address today's
heterogeneity requirements, and others emphasized early error avoidance approaches.
However, such systems do not guarantee a fail-safe environment because all faulty
scenarios in a dynamic system may not be anticipated at design time. This work
proposes a new heuristic methodology to implement fault tolerant
scheduling and mapping algorithms.
CHAPTER 3 
DEFINITIONS AND THEORY
3.1 Definitions
Instruction: An Instruction comprises two entities: i) the operation to be performed
and ii) the operands on which the operation executes.
For example, (ADD X, Y), (SQUARE X) and (SHIFT-RIGHT X) are
arithmetic/logic instructions. ADD, SQUARE and SHIFT-RIGHT are operations and X, Y are
operands.
Process: A process, Pr, contains a multi-set of instructions, Pr = {Instr_1, Instr_2, Instr_3, ...,
Instr_N}, where Instr is an instruction and N is the number of instructions in the process.
For example, a process multi-set is represented as Pr = {(ADD X, Y), (MAX X, Y), (SHIFT-RIGHT X)}.
Task (Job): A Task is a set of processes, T = {Pr_1, Pr_2, Pr_3, ..., Pr_M}, where Pr_1, Pr_2, ..., Pr_M are
processes and M is the number of processes in the task.
Master-Task: A Master Task, denoted T_m, is a task T_m ∈ {T} such that there exists a
directed edge e ∈ {E} from T_m to one or more successor nodes in the directed acyclic
graph (DAG) of tasks. The successor nodes are called slave-tasks.
The acyclic property of the DAG prevents any task from being a Master or a Slave of itself.
Application: An Application is a set of tasks denoted as A = {T_1, T_2, T_3, ..., T_N}, where T_1,
T_2, ..., T_N are tasks and N is the number of tasks in the application.
The application hierarchy is depicted in Figure 3.
APPLICATION → TASK → PROCESS → INSTRUCTION
Figure 3. Application Hierarchy
Task-Graph: A task graph is a DAG which represents an application. The task graph is
denoted as TG such that TG = (V, E), where V is a set of nodes and E is a set of edges. A
node in V represents a task such that V = {T_1, T_2, T_3, ..., T_N}, and an edge in E represents
the communication dependency between the tasks such that E = {C_1, C_2, C_3, ..., C_K}. A
weighted edge, if specified, denotes the communication cost incurred to transfer data from a
source task to a destination task. Figure 4 depicts an application which has 9 tasks.
Figure 4. Application modeling using Task-Graph
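A task graph of this kind can be held in a plain dictionary and checked for the acyclic property with the Python standard library. The sketch below is illustrative only: the four-task graph in the usage note is hypothetical and does not reproduce the edges of Figure 4, and the helper names are assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(successors):
    """Return one valid processing order for a task graph.

    successors maps each master task to the list of its slave tasks.
    Raises graphlib.CycleError if the graph is not a DAG.
    """
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {}
    for master, slaves in successors.items():
        predecessors.setdefault(master, set())
        for slave in slaves:
            predecessors.setdefault(slave, set()).add(master)
    return list(TopologicalSorter(predecessors).static_order())


def master_tasks(successors):
    """Tasks that have at least one outgoing edge, i.e. at least one slave task."""
    return sorted(task for task, slaves in successors.items() if slaves)
```

For a hypothetical graph `{"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"], "T4": []}`, every returned order starts with T1 and ends with T4, and T1, T2 and T3 are reported as master tasks.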
Task graphs are typically represented as generic graphs. Standard-Task-Graph-Set (STG)
is a Task-Graph format that is useful for verifying mapping/scheduling algorithms
during the prototyping process. STG defines a task by three parameters: i) execution time, ii)
number of predecessors for that task and iii) predecessor list, which specifies the list of
its predecessor tasks. Appendix III gives a detailed description of STG Benchmark files.
Task Execution Time: The task execution time, denoted T_exec_time, defines the worst case
execution time of the task on a processor.
Release-Time: The release-time of a task, denoted T_release_time, specifies the time by
which a task should be dispatched to a processor.
The release-time is mathematically represented as:
Release-Time < Deadline - [Execution Time + Total Communication Delay] (3.1)
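The bound in (3.1) is a simple subtraction, as the following sketch shows. The function and parameter names are chosen here for illustration; the strict inequality mirrors the relation as written above.

```python
def latest_release_time(deadline, exec_time, comm_delay):
    """Upper bound on the release-time according to relation (3.1)."""
    return deadline - (exec_time + comm_delay)


def release_is_valid(release_time, deadline, exec_time, comm_delay):
    """True if the release-time satisfies the strict inequality of (3.1)."""
    return release_time < latest_release_time(deadline, exec_time, comm_delay)
```

For instance, a task with a deadline of 100, an execution time of 30 and a total communication delay of 15 time units must be released before time 55.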
Task Deadline: The task deadline, denoted T_deadline, specifies the latest time by which a
task must be executed and completed.
Task-address: The task-address, denoted as (X_task, Y_task), specifies the task destination in the
2-dimensional XY coordinate system on which the task is mapped.
The task-address is specified w.r.t. the location (X_processor, Y_processor) of the processor
in the 2-dimensional network topology.
Destination-list: The destination-list of a task, denoted T_dlist, contains the
successor tasks T_j of the current task T_i, where T_i is a master task which has slave task T_j.
For example, a destination list T_dlist = {T_5, T_7, T_9} for a task T_1 signifies that T_1 has to be
processed and its result forwarded to the locations where T_5, T_7 and T_9 are mapped.
Task Object: The characteristics of a task T are represented in a tuple/object <T_id, T_type,
T_release_time, T_exec_time, T_deadline, T_address, T_dlist>, where T_id = Task ID, T_type = Task-Type,
T_release_time = Release-time, T_exec_time = Execution time, T_deadline = Deadline, T_address =
(X_task, Y_task) = Task Address, and T_dlist = Destination-list. Task ID is a unique task identifier.
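One possible in-memory form of the task tuple is a small record type. The sketch below uses a Python dataclass; the class name, field names and field types are assumptions made for illustration and are not the task interface class of the framework.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TaskObject:
    task_id: int                      # T_id, unique task identifier
    task_type: str                    # T_type, matched against the processor type
    release_time: float               # T_release_time
    exec_time: float                  # T_exec_time, worst case execution time
    deadline: float                   # T_deadline
    address: Tuple[int, int] = (0, 0)                    # (X_task, Y_task)
    dest_list: List[int] = field(default_factory=list)  # ids of the slave tasks
```

A task T_1 of type DSP mapped at (2, 1) with slave tasks 5 and 7 would then be written as `TaskObject(1, "DSP", 0.0, 3.0, 10.0, (2, 1), [5, 7])`.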
Task Status: Task status is defined as the state of the task during its lifecycle. The lifecycle of a
task is the time period of the task from task fetch to completion.
The task status can be represented as an enumerated entity as given in Figure 5:
TASK-STATUS
{
IDLE
PENDING
READY
ACTIVE
RUNNING
COMPLETE
}
Figure 5. Task-Status enumeration and state modeling
In the proposed MPSOC framework, each task is considered to be in one of the
above enumerated states. Explanation of the task status is given below:
1) A newly fetched task is given the "Idle" status by default. This task has just been
introduced into the system.
2) After the task goes through the scheduling process that involves the task-deadline,
task-release-time and task-type assignments, the task notifies the Mapper
that it has been scheduled by raising "Pending" status.
3) A task gets the "Ready" status after the Mapper assigns it a destination address
based on the mapping algorithm.
4) A task is released only at its release-time. The task gets the "Active"
status only after the dispatcher delivers the task onto the architecture. "Active"
status indicates that the task is in the communication layer. This approach
emphasizes the computational focus and hides most of the communication
perspectives. The entire duration from the time the task is dispatched until it
starts executing is condensed into the single "Active" status.
5) After a task arrives at the destination processor, it changes its status to
"Running". This covers the entire duration in which the task is being executed
and transferred among the processing elements.
6) After a task is processed to completion and the results are disseminated among all
the slave tasks, a master task is given the "Completed" status and it terminates.
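The six statuses and their order can be captured as an enumeration with a single forward transition. The sketch assumes a strictly linear lifecycle in which every task passes through each status once; transitions caused by fault handling are not modeled, and the names `TaskStatus` and `advance` are illustrative.

```python
from enum import Enum


class TaskStatus(Enum):
    IDLE = "Idle"          # just fetched into the system
    PENDING = "Pending"    # scheduled, waiting for the Mapper
    READY = "Ready"        # destination address assigned
    ACTIVE = "Active"      # dispatched, in the communication layer
    RUNNING = "Running"    # at the destination processor
    COMPLETE = "Complete"  # results delivered to the slave tasks


# Enum members iterate in definition order, which matches the lifecycle order.
LIFECYCLE = list(TaskStatus)


def advance(status):
    """Return the next status in the lifecycle; COMPLETE has no successor."""
    index = LIFECYCLE.index(status)
    if index == len(LIFECYCLE) - 1:
        raise ValueError("a completed task cannot advance further")
    return LIFECYCLE[index + 1]
```

Keeping the transition in one function gives the simulator a single place to reject an illegal status change, such as advancing a task that has already completed.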
Arguments defined in the above definitions can be extended, reduced or modified
depending on the application in which they are used. In the proposed framework, tasks are
exchanged among participating modules dynamically. A task interface is defined through
which all the HW/SW components exchange tasks. The task interface class and
additional references can be found on the attached CD.
Based on these definitions, the JPEG algorithm can be described as follows. JPEG
is an image compression application. It involves the DCT and quantization tasks. The
DCT, for example, has a number of processes which compute frequency domain and
transform functions. Again, the transform function has a number of operations to 
compute the arithmetic, logic and trigonometric functions such as addition, cosine and 
normalization.
Processor: A Processor P is a system resource which performs Task execution. A 
processor can be represented as a tuple <Architecture, Type, Power, Clock-Rate, 
Interrupt, Memory>
-Architecture: Architecture describes the internal organization of a processor and how
it processes data. Common architectures include RISC (Reduced-Instruction-Set-Computer),
CISC (Complex-Instruction-Set-Computer), von Neumann and Harvard.
-Type: Type depicts the architectural characteristics of a processor in a heterogeneous
system.
For example, DSP (Digital-Signal-Processor) or ASIC (Application-Specific-Integrated-Circuit)
describes the processor type. Similarly, processors which can only execute
{ADD, SUBTRACT, SHIFT} can be set as type T_1 and processors which can only execute
{ADD, XOR, NOR, NAND} can be set as a different type T_2. Tasks are assigned a task-type
based on the operations they involve.
-Power: Describes the energy dissipation per unit of time of a processor in both
idle-state/standby and active-state modes.
Active power can be modeled as a function of the task size processed and the task type (for
heterogeneous systems). Task size is modeled as a function of task execution time.
Power rating is directly related to energy consumption. Energy-efficient designs
mainly focus on reducing the energy/power rating of processors both at standby and
active modes of operation. Low power systems are optimized to reduce heating
of the components, which impacts device lifetime and component failure rate.
Low-energy designs are optimized to extend battery life, which is a key embedded
design issue in cell phones and other portable devices.
-Clock-Rate: The clock-rate of a processor specifies the frequency of the signal that
synchronizes the operations of the processor.
The clocking frequency determines the speed of the processor. Computation
intensive applications require higher clock frequencies in the GHz range. The
PlayStation™-3 processor has a clock rate of 3.2 GHz whereas common portable laptop
computers have cores that are clocked at 1.66 GHz. Processor speed can also be measured in
instructions per second, which defines how many instructions the processor can
execute within a second. Thus a 2-MIPS processor can execute 2 million instructions
per second.
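The arithmetic behind these ratings can be written out directly. The helper names below are illustrative, and the calculation assumes a constant instruction rate, which real processors only approximate.

```python
def execution_time_seconds(instruction_count, mips):
    """Time to execute a number of instructions at a given MIPS rating."""
    return instruction_count / (mips * 1_000_000)


def clock_period_ns(clock_ghz):
    """Duration of one clock cycle in nanoseconds for a clock rate in GHz."""
    return 1.0 / clock_ghz
```

A 2-MIPS processor therefore needs 5 seconds for 10 million instructions, and a 3.2 GHz clock has a period of roughly 0.31 ns.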
-Interrupt: An interrupt is a process interruption request in preemptive systems.
A processor may issue an interrupt for a resource (such as an I/O device or another
processor) if the resource is needed for a higher priority task. Preemptive execution
requires context-switching, whereby the task-state (such as program counter,
registers and stack pointers) of the preempted task is saved. This makes
context-switching a computationally intensive process.
3.2 System Definitions
System-On-Chip (SoC): SoC defines software/hardware components built into a system.
SoCs employ specialized or general purpose processors, memory units and interfaces.
SoCs are often designed for embedded systems. The SoC attributes can be represented
as a tuple <processor, memory, interface>, where processor describes the processor
architecture, memory describes the type and size of memory and interface describes
how the SoC interfaces with other input/output peripheral devices.
Multiprocessor-System-On-Chip: Multiprocessor-System-On-Chip, denoted as MPSOC,
defines a paradigm where a set of processors and other HW components are
interconnected.
MPSOC = {P_1, P_2, P_3, ..., P_N, M_1, M_2, M_3, ..., M_K}, where P_1, P_2, ..., P_N are a set of N
processors and M_1, M_2, ..., M_K are a set of K memory units.
Multiprocessor systems can be represented as a tuple <processor-set, memory-set,
interconnection>, where processor-set describes the number and architecture of
processors in the SoC, memory-set describes the size and type of memory the SoC
employs and interconnection describes the topology, routing techniques and other
network parameters.
Multiprocessor system components can be i) built into a single chip or ii) distributed and
connected via an interconnection network depending on the application requirement.
Heterogeneous Multiprocessor System-On-Chip: Heterogeneous Multiprocessor
System-On-Chip, denoted as HMPSOC, is a design optimization of the MPSOC paradigm which
considers the behavior of the applications it executes.
$HMPSOC = \{P_1, P_2, P_3, \ldots, P_N, T_a, T_b, T_c, \ldots, T_n, M_1, M_2, M_3, \ldots, M_K\}$
where $P_1, P_2, \ldots, P_N$ are a set of processors where each processor belongs to one of
the types in the typeset $\{T_a, T_b, T_c, \ldots, T_n\}$ and $M_1, M_2, \ldots, M_K$ define a set of
K memory units.
Topology: Topology describes the physical layout of hardware components in a system.
Topology defines the placement of the processor and memory blocks and how the
communication between these resources is established. Topology for a Multiprocessor
system can be described by a Topology-Graph TG. TG is represented in tuple as
<processor-set, type, topology> where processor-set describes the processors'
parameters, type describes the ratio of the types of processors in the system and
topology describes the layout of resources. Figure 6 shows the Torus interconnection of
a heterogeneous Multiprocessor system. This topology can be represented as <25,
4:11:6, TORUS> which describes 25 hardware components having 4 type-1 processors, 11
type-2 processors and 6 type-3 processors where these components are interconnected in
TORUS topology. TILE and TORUS are the most common topologies in MPSOC design.
Network Topology: The Network Topology represents the arrangement of processors in
the system. The network topology is represented as an array of processors $P_{ij}$. The set of
processors is represented as $P = \{P_1, P_2, P_3, \ldots, P_N\}$ where N is the total number of
processors in the system. For Tile topology, each processor on the boundary of the topology
is connected to two other processors, and each processor in the center of the Tile
topology and all processors in Torus topology are connected to four other processors.
Therefore, a processor $P_{ij}$ is connected to $P_{i-1,j}$, $P_{i+1,j}$, $P_{i,j-1}$, $P_{i,j+1}$. The processors
$P_{i-1,j}$, $P_{i+1,j}$, $P_{i,j-1}$, $P_{i,j+1}$ are called Neighbor processors.
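The neighbor relation above can be sketched in C++ as follows. This is a minimal illustration rather than code from the framework: the function name `neighbors`, the 0-based (row, column) indexing and the `torus` flag are all assumptions. In this sketch a Tile processor simply loses the links that would leave the grid, while Torus links wrap around.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not taken from the framework source): returns the grid
// coordinates of the neighbors of processor P(i,j) in a rows x cols layout.
// torus == true wraps links around the edges; torus == false models the Tile layout.
std::vector<std::pair<int, int>> neighbors(int i, int j, int rows, int cols, bool torus) {
    const int di[4] = {-1, 1, 0, 0};  // P(i-1,j), P(i+1,j)
    const int dj[4] = {0, 0, -1, 1};  // P(i,j-1), P(i,j+1)
    std::vector<std::pair<int, int>> out;
    for (int k = 0; k < 4; ++k) {
        int ni = i + di[k];
        int nj = j + dj[k];
        if (torus) {
            ni = (ni + rows) % rows;  // wrap-around link of the Torus
            nj = (nj + cols) % cols;
        } else if (ni < 0 || ni >= rows || nj < 0 || nj >= cols) {
            continue;  // a Tile layout has no link beyond its boundary
        }
        out.emplace_back(ni, nj);
    }
    return out;
}
```

For a 5 x 5 layout, `neighbors(2, 2, 5, 5, false)` yields the four neighbors of a center processor, and `neighbors(0, 0, 5, 5, true)` includes the wrap-around neighbor (4, 0).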
Figure 6. Heterogeneous system with Torus Topology (legend: PE blocks are processors of type-1, type-2 and type-3; M blocks are memory units)
Scheduling: Scheduling is a process of assigning a start-of-execution time instance to the
tasks of an application such that the tasks are executed before their respective deadlines.
Let us consider an application $A = \{T_1, T_2, T_3, \ldots, T_N\} = \{T_i\}$, where $T_i$ is a task instance
of the application with $e_i$ the worst case execution time of the task on a specific
processor and $d_i$ the deadline of the task. Scheduling is defined as a mapping of the
execution of the task $T_i$ in the time domain. Scheduling is represented as:
$\sigma(T_i) = t_i$ and $t_i + e_i < d_i$ (3.2)
where $t_i$ is the execution start time of task $T_i$. Scheduling of an application is denoted as
$\sigma(A) = \{\sigma(T_1), \sigma(T_2), \sigma(T_3), \ldots, \sigma(T_N)\}$. If there exists a precedence relation such that
execution of task $T_j$ should occur after execution of task $T_i$, represented as $T_i < T_j$, then
$t_i + e_i < t_j < d_i$ and $t_j + e_j < d_j$ (3.3)
where $t_j$ is the start time of task $T_j$, and $e_j$ and $d_j$ are its respective execution time and deadline.
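As an illustration of the constraints in (3.2) and (3.3), the sketch below checks whether a set of start times is a valid schedule. The `Task` struct and the `isFeasible` function are hypothetical names introduced here, and only the ordering part of (3.3), $t_i + e_i < t_j$, is checked for each precedence pair.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical task record: start time t_i, worst case execution time e_i, deadline d_i.
struct Task {
    double start;
    double wcet;
    double deadline;
};

// Returns true if every task meets t_i + e_i < d_i (3.2) and every precedence
// pair (i, j), meaning T_i must finish before T_j starts, meets t_i + e_i < t_j.
bool isFeasible(const std::vector<Task>& tasks,
                const std::vector<std::pair<std::size_t, std::size_t>>& precedence) {
    for (const Task& t : tasks) {
        if (!(t.start + t.wcet < t.deadline)) return false;
    }
    for (const auto& p : precedence) {
        const Task& before = tasks[p.first];
        const Task& after = tasks[p.second];
        if (!(before.start + before.wcet < after.start)) return false;
    }
    return true;
}
```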
Optimization of Scheduling Algorithm:
The problem of determining the optimal schedule is to determine the minimum execution
time of the tasks in the entire application. Other factors for optimizing the schedule
include: maximizing processor utilization, minimizing port traffic, minimizing power, etc.
Minimize $\left(\sum_{i=1}^{N} e_i\right)$ and $t_n + e_n < DL$ (3.4)
where $e_i$ is the worst case execution time of each task $T_i$, DL is the deadline of the
application, and $t_n$ and $e_n$ are the start time and worst case execution time of the last task in
the application.
Scheduling algorithms can be preemptive or non-preemptive depending on the
application. In real-time systems, preemptive scheduling is employed, which means that
a currently running task is suspended and another higher priority task is executed on the
processor to meet real-time requirements.
Mapping: mapping is the process of assigning each task in an application to a
processor such that the processor executes the task satisfying the task deadline.
Mapping a task to a processor is denoted as $M(T_i) = \{P_j\}$ where $T_i$ is the task and $P_j$ is the
processor. Similarly, an inverse mapping $W(P_j)$ lists the set of tasks that are assigned to
the processor, represented as: $W(P_j) = \{T_1, T_2, T_3, \ldots, T_k\}$.
In some cases, the task mapping is denoted as a multi-set $MM(T_i) = \{P_1, P_2, P_3, \ldots, P_k\}$
which represents a list of processors on which task $T_i$ can be executed.
Application mapping is defined as mapping all tasks in the application to the processors,
denoted as $M(A) = \{M(T_1), M(T_2), M(T_3), \ldots, M(T_N)\}$.
Optimization of Mapping Algorithm:
The problem of determining the optimum mapping is to determine the optimum
allocation of all the tasks in the application on the processors such that the total
execution time of all tasks on the processors is minimized. Other factors for
optimum mapping include: maximizing processor throughput, minimizing inter-processor
communication delay and balancing the workload homogeneously on the
processors to increase utilization. Homogenous workload distribution is the proposed
mapping algorithm in the implementation of the framework.
Fault Tolerant system: A fault tolerant system describes a robust system which is capable
of sustaining application processing even in the event of processor failure. A fault
tolerant algorithm is a function that performs Scheduling and Mapping of the set of
unprocessed tasks on the available processing elements.
Let A be an application with a set of N tasks $A = \{T_1, T_2, T_3, \ldots, T_N\}$, scheduled as
denoted by $\sigma(A)$ and mapped on the set of processors $P = \{P_1, P_2, P_3, \ldots, P_M\}$ as denoted
by $M(A)$. A fault is defined as a non-recurring, permanent failure of a processor at time
instance $T_f$ during the execution of a task.
The application after the occurrence of the fault is expressed as $A_f$ such that $A_f \subset A$,
which has a list of tasks $A_f = \{T_g, T_h, \ldots, T_n\}$ that are not dispatched by the dynamic
mapper when the processor failure occurs. The processor list after the failure is denoted
by $P_f = \{P_1, P_2, P_3, \ldots, P_k\}$ and $P_n \notin P_f$ where $P_n$ is the failed processor. The proposed fault
tolerant MPSOC framework determines the updated schedule $\sigma(A_f)$ and the updated mapping
$M(A_f)$ for the set of unprocessed tasks in the application subset $A_f$, such that:
$t_n + e_n < DL$ (3.5)
where $t_n$ and $e_n$ are the start time and execution time of the last task in $A_f$.
Optimization of Fault-Tolerant Strategy:
Since the proposed framework employs a dynamic mapping methodology, only the
tasks dispatched to the failed processor, which include the task that was being
executed by the failed processor and the tasks dispatched by the mapper to the failed
processor buffer, will need to be rescheduled along with the tasks that were not
dispatched by the mapper. This concept is the most significant contribution of this
framework. Rescheduling of the tasks in $A_f$ is considered to determine the optimal
solution for performance of the system.
Let $A_f$ be the application with the list of tasks $A_f = \{T_g, T_h, \ldots, T_n\}$ that have not been
dispatched to processors at the occurrence of the fault. Also let $T_b = \{T_r, \ldots, T_t\}$
be the list of tasks in the buffer of the failed processor and let $T_p$ be the task
that was being executed by the failed processor. The task list $T_{FT}$ of the fault tolerant
algorithm at time $T_f$ is represented as:
$T_{FT} = A_f + T_b + T_p$ (3.6)
It is also assumed that all the tasks that are being executed by the non-failed processors
complete their execution even after the time of failure $T_f$. Thus, the FT algorithm
performs updated scheduling $\sigma'(T_{FT})$ and updated mapping $M'(T_{FT})$ on the task set $T_{FT}$
which satisfies the deadline constraint DL of the application A.
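The construction of the task list in (3.6) can be sketched as follows. This is an illustrative assumption rather than the framework's code: the `FaultState` struct, the use of integer task ids, and the ordering of the resulting list (interrupted task first, then buffered tasks, then undispatched tasks) are all hypothetical choices.

```cpp
#include <vector>

// Hypothetical snapshot taken at failure time T_f (names are illustrative).
struct FaultState {
    std::vector<int> undispatched;  // A_f: task ids not yet dispatched by the mapper
    std::vector<int> failedBuffer;  // T_b: task ids waiting in the failed processor's buffer
    int runningOnFailed;            // T_p: task id being executed, or -1 if the processor was idle
};

// Builds T_FT = A_f + T_b + T_p, the set of tasks that must be rescheduled and remapped.
std::vector<int> buildRescheduleList(const FaultState& s) {
    std::vector<int> tft;
    tft.reserve(s.undispatched.size() + s.failedBuffer.size() + 1);
    if (s.runningOnFailed >= 0) tft.push_back(s.runningOnFailed);
    tft.insert(tft.end(), s.failedBuffer.begin(), s.failedBuffer.end());
    tft.insert(tft.end(), s.undispatched.begin(), s.undispatched.end());
    return tft;
}
```

Tasks already running on non-failed processors are deliberately absent from the list, in line with the assumption that they complete normally.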
3.3 Performance Definitions
Utilization ($\rho$): utilization is defined as the ratio of the sum of processor execution time to
the total execution time of the application.
$\rho = \dfrac{\text{total execution time of processor}}{\text{total application processing time}}$ (3.7)
Throughput ($\eta$): Throughput is defined as the ratio of the number of tasks processed by
the processor to the sum of the execution times of the tasks on the processor.
$\eta = \dfrac{\text{number of tasks executed on the processor}}{\text{total execution time of the processor}}$ (3.8)
Buffer Utilization: is the maximum buffer occupancy in the processor during the
execution of the application.
BU = MAX (utilized buffer size) (3.9)
Port Traffic: defines the number of tasks transacted by the processor.
Port traffic can be used to estimate the channel traffic of the communication network.
Port traffic represents the network activity in data transactions among components.
Performance Index: Performance-Index, denoted as PI, is the cumulative cost function
used to determine the optimal schedule using Simulated-Annealing. PI quantifies the
performance of the system with respect to averaged values of execution time, utilization,
throughput, buffer usage, port traffic and power. PI is evaluated by a Cost Function
which is expressed as:
COST = C1*(avg-execution-time) + C2*(avg-proc-util) + C3*(avg-proc-throu) +
C5*(avg-buff-usage) + C4*(1/avg-port-traffic) + C6*(1/avg-proc-power) (3.10)
where C1, C2, C3, C4, C5, C6 are normalizing constants.
The Simulated-Annealing scheduling procedure compares successive simulations based
on the PI. Execution time, Processor Utilization, Processor Throughput and Buffer
Utilization are to be maximized, so their cost terms are computed linearly, whereas Power
and Port Traffic are to be minimized, so their reciprocal values are taken. The
coefficient values are set according to the desired performance goal of the system.
Higher coefficients are given to those performance variables which are crucial to the
performance goals.
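The cost function in (3.10) can be written as a small C++ function. This is a sketch only: the struct and function names are hypothetical, the pairing of coefficients with terms follows the order printed in (3.10), and the port-traffic and power averages are assumed to be non-zero because their reciprocals are taken.

```cpp
// Hypothetical containers for the averaged performance variables and the
// normalizing constants C1..C6 of equation (3.10).
struct Averages {
    double execTime;
    double procUtil;
    double procThroughput;
    double buffUsage;
    double portTraffic;  // assumed > 0
    double procPower;    // assumed > 0
};

struct Coefficients {
    double c1, c2, c3, c4, c5, c6;
};

// Linear terms are to be maximized; reciprocal terms reward low port traffic and low power.
double performanceIndex(const Averages& a, const Coefficients& c) {
    return c.c1 * a.execTime
         + c.c2 * a.procUtil
         + c.c3 * a.procThroughput
         + c.c5 * a.buffUsage
         + c.c4 * (1.0 / a.portTraffic)
         + c.c6 * (1.0 / a.procPower);
}
```

A Simulated-Annealing loop would call such a function once per candidate schedule and keep the candidate with the higher index.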
CHAPTER 4
METHODOLOGY
In Chapter 3, key terminologies were defined. This chapter will move on to
describe the methodologies adopted in the implementation of the proposed design.
4.1 Application Modeling
Application modeling defines scheduling and mapping methodologies which
consider the application characteristics and efficiently model it on the architecture.
Analysis of application characteristics is done at the scheduling phase, which determines
task heterogeneity and timing. Figure 7 depicts how an application task-graph is
modeled onto a Multiprocessor architecture.
The framework adopts a bottom-up modular design methodology which gives the
benefit of component reusability. Behavioral component models were designed and
stored as hardware and software libraries.
Figure 7. Application Modeling on MPSOC Architecture (the application task graph passes through task scheduling and task mapping onto the PEs).
4.2 Application Partitioning Overview
Partitioning a task into subtasks yields an increase in processing speed because the
subtasks can be executed in parallel on different processors [14]. These subtasks
are also called grains. Grains are sequences of instructions that need to be executed by a
single processor. A task can be partitioned at different levels of granularity. The size of a
grain is measured by the number of instructions it contains. A higher number of
instructions constitutes a bigger grain size.
Often, the partitioned grains do not have equal sizes. This is because certain tasks
have non-parallelizable (serial) processes which cannot be partitioned by the
partitioning algorithm. The serial processes are called incompatible because they
cannot be processed in parallel but have to be executed serially on a single processor.
The parallelizable processes, however, can be distributed to multiple processors and
executed in parallel fashion, so they are called compatible processes [16].
Contrary to task parallelism, Amdahl's Law [6] states that there is a limit to the
extent that a particular task can be parallelized. It explains that speedup is not only
dependent on the number of processors in the system but also on the portion of the
application which cannot be parallelized. This means that if the task is not
parallelizable, then there would be no speedup gain regardless of the number of
processors used.
From the processor's point of view, executing coarse-grained tasks has a different
overhead than executing fine-grained tasks. Determining the size of the grain therefore
determines the degree of task concurrency. Coarse-grained subtasks hide parallelizable
instructions which could have been distributed among several processors. On the other
hand, fine-grained subtasks maximize parallelism but they also incur computation
overhead in data transfer and communication latency [6].
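The limit described by Amdahl's Law can be made concrete with its standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the task and n is the number of processors. The function name below is a hypothetical choice for this sketch.

```cpp
// Amdahl's Law: parallelFraction is the share of the task that can run in
// parallel (0.0 to 1.0) and processors is the number of processors (>= 1).
double amdahlSpeedup(double parallelFraction, int processors) {
    const double serialFraction = 1.0 - parallelFraction;
    return 1.0 / (serialFraction + parallelFraction / processors);
}
```

With `parallelFraction = 0.0` the function returns 1.0 for any processor count, which is the "no speedup gain" case stated above. With half of the task parallelizable, the speedup can never reach 2, however many processors are added.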
4.3 Task Scheduling and Mapping Strategies
4.3.1 Task Scheduling approach
The proposed framework employs offline task scheduling. The scheduler is
responsible for assigning every task its "release-time", "task-type" and "deadline" using
the EDF and Simulated-Annealing scheduling policies. The soft deadline of a
deterministic task is known prior to the scheduling stage. The soft deadline merely
reflects the time frame in which a task should be released and finished but doesn't take
the runtime communication latencies into consideration. Thus the hard deadline, which
considers the inevitable communication delay, has to be calculated at runtime based on
the latencies that would result from the apparent master-slave task allocation. Mapping
master and slave tasks on processors which are far apart in the topology increases the
communication traffic and latency. Tasks are dynamically mapped at runtime, so hard
deadline constraints cannot be determined during offline task scheduling.
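As a minimal sketch of the EDF (earliest-deadline-first) policy named above, the task pool can be ordered by deadline before release times are assigned. The `ScheduledTask` struct and `edfOrder` function are illustrative names, not the framework's scheduler class, and only the ordering step is shown.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical task record holding only what the ordering step needs.
struct ScheduledTask {
    int id;
    double deadline;
};

// Orders the pool so that the task with the earliest deadline comes first.
// stable_sort keeps the original order of tasks that share a deadline.
void edfOrder(std::vector<ScheduledTask>& pool) {
    std::stable_sort(pool.begin(), pool.end(),
                     [](const ScheduledTask& a, const ScheduledTask& b) {
                         return a.deadline < b.deadline;
                     });
}
```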
4.3.2 Task Mapping approach
Optimized mapping algorithms contribute a crucial improvement to the overall
performance of the system. Consequently, Multiprocessor systems emphasize not only
scheduling but also mapping strategies. Homogeneously distributing the application
workload among the processing elements has a significant impact on utilization, power
dissipation, throughput, and real-time factors.
The dynamic mapper in the proposed framework performs the task allocation as per
the configured policy. In order to take the heterogeneity of components into consideration,
power and execution latency penalties have been introduced. This emphasizes the fact
that processing a particular task on different types of processing elements incurs
different costs. This models the behavior of heterogeneous embedded systems where
application specific processors execute heterogeneous tasks. Efficient and optimized
heterogeneous mapping algorithms are tailored to minimize this penalty cost.
Various task allocation heuristics were implemented in the proposed framework.
Given a scheduled task, the task mapping follows one of the policies:
1) Next-Available Mapping: In the Next-Available mapping policy, a scheduled task Ti is
mapped on processor Pi followed by the mapping of the next scheduled task Ti+1
on processor Pj, provided Pi and Pj are neighbors to each other as defined by the
topology. This policy assigns a task to each processor sequentially, but doesn't
consider the heterogeneity and processor workload.
2) Homogenous-Workload-Distribution Mapping (HWD): HWD is the proposed
mapping algorithm which considers the workload on each processor and maps a
scheduled task on the processor having the lowest workload. HWD is described in
the next section.
3) Random mapping: Random task allocation is used by the Simulated-Annealing
algorithm to distribute the workload randomly.
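The core idea behind the lowest-workload policy can be sketched as a greedy selection. This is not the HWD algorithm of the thesis, which is presented in chapter 5; the function name, the representation of workload as one number per processor, and the tie-breaking rule (lowest index wins) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Picks the processor with the lowest accumulated workload, charges the task's
// cost to it and returns its index. Precondition: workload is not empty.
std::size_t mapLeastLoaded(std::vector<double>& workload, double taskCost) {
    auto least = std::min_element(workload.begin(), workload.end());
    *least += taskCost;
    return static_cast<std::size_t>(std::distance(workload.begin(), least));
}
```

Calling the function once per scheduled task, in schedule order, keeps the per-processor workloads close to each other, whereas a Next-Available policy would ignore the current workload entirely.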
4.4 MPSOC Simulation Modeling Approach
Simulation based design approaches facilitate the system development effort because
they help to refine the design outcomes quickly based on different problem domains.
Choosing a convenient modeling tool is as crucial to the design process as to its
implementation. Modeling event based designs in languages like C++, which lack time,
event and concurrency constructs, cannot abstract the exact behavior of digital systems.
VHDL and Verilog are among the powerful design description languages which are used
for system specification and prototyping. However, these languages have their own set
of coding rules that need to be mastered by the developer prior to the designing phase.
SystemC (Appendix IV) is an event driven transaction level modeling tool with
software/hardware co-simulation capability that allows designers to simulate proposed
specifications [22]. The framework was implemented in SystemC 2.2/C++ and compiled
with Linux GCC 4.3. The hardware modules like processor, switch, dispatcher-unit and
multiplexer are implemented in SystemC and the software modules like input parser,
scheduler and task interface are implemented in C++.
4.5 Framework Modeling
The proposed framework was modeled as follows.
1) INPUT: Task-Graph File, Topology, Scheduling-Policy, Mapping-Policy, Number of
Processors, processor types, buffer size, switching technology, fault tolerant mode.
2) OUTPUT: System Performance Variables: Utilization, Throughput, Task completion
time, Power-Rating, Port-Traffic and Buffer Utilization.
3) CONSTRAINTS/ASSUMPTIONS:
•  The processors implemented are simple RISC processors with ALUs which are
capable of computing basic logic and arithmetic operations.
•  The framework assumes no conflict occurs on channels and ports.
•  Communication cost for a task is dependent on the size (and thus the execution
duration) of the task. Thus communication latency is modeled as a function of task
size.
•  Heterogeneity is modeled by varying the task processing cost w.r.t. processing
time and power. All processors are capable of executing all possible operations but
with different cost.
•  Processor failure is permanent once it occurs and only single processor failure is
considered. However, the same methodology can also be adopted for multiple
processor failure.
4.6 Framework Control Flow
1) Input processing: User data such as number of processors, topology, task-graph
and scheduling/mapping specifications are retrieved from the GUI
(Graphical-User-Interface) and stored in a data store. The main module then instantiates the
task-Graph-Parser class which reads the task-graph file, analyzes the format of the
task-graph and parses individual tasks along with their attributes as per the specified
task-graph format. These parsed tasks are stored in the scheduler-task-pool
structure. Next, the task-interface class is called which defines the common
interface for all the hardware/software components of the framework.
2) Hardware setup: The set-Up module retrieves hardware specifications from the
settings class and instantiates the processor, switch, Mapper and dispatcher
modules. It then configures the i) processor architecture (port-type, buffer-size,
operation-list and address) ii) switch parameters (switching-policy, buffer-size,
port-size) iii) Mapper/dispatcher (port-size, memory-size, topology) iv) the
topology (placement and wiring of the components, setting clock).
3) Application Processing: The application processing simulation begins at the
set-Schedule module of the scheduler class which schedules tasks statically according to
the scheduling-policy. Following this, the snoop-Processor, map-Task and dispatch
modules in the Mapper class perform binding and placement of each task on the
architecture.
4) Output Processing: The simulation terminates by calling the calculate-Performance
module which retrieves the different performance variables from the individual
processors such as the total-processing-time, power-rating, number of tasks
processed and buffer-usage. These values are displayed on the GUI and stored in
the result log file.
The framework consists of software and hardware libraries which are set according to
the user input specifications:
SOFTWARE LIBRARIES: These libraries contain software components of the framework
like the input parser, scheduler, data store, and task interface.
HARDWARE LIBRARIES: The hardware libraries, implemented in SystemC, specify
hardware components of the framework such as Processor, Mapper, Dispatcher and
multiplexer. Figure 8 shows the software/hardware modules
of the framework. Figures 10 and 11 depict the framework libraries in more detail.
4.7 Framework Flexibility
Flexibility of the framework reduces the design effort, time and cost because the
comprehensive design space offers various exploration capabilities. This reduces the
need to resort to additional tools during the system development phase. The framework
provides system-wide re-configurability with respect to the following scenarios:
•  Architectural flexibility: in order to support wider design functionality, the
framework accepts the number of processors, processor types, size of memory and
topology. The heterogeneous processor architecture is modeled hardware-wise as a
two-dimensional matrix, where the X and Y coordinates designate the processor
address. The two de facto topologies for MPSOCs, namely TILE and TORUS, are
supported.
•  Scheduling flexibility: EDF and a Performance Driven scheduling strategy based on
the Simulated-Annealing technique are supported.
•  Mapping flexibility: The Mapper unit employs the proposed
Homogenous-Workload-Distribution, Next-available and Random mapping policies. The
Homogenous-Workload-Distribution algorithm is presented in detail in chapter 5.
4.8 Architectural Framework Modeling
Architectural modeling defines the specification of switches, processors and the
other hardware modules. The switches interface processors with the network. Switches
employ the XY switching algorithm which is described in the next section. A switch has 8
ports for sending and receiving tasks. Every switch has 2 input/output ports located on
each side of the TILE. The switch schematic is shown in Figure 9. Whenever a task arrives
at any one of the switch ports, its address is evaluated by comparing the [X, Y] coordinate
addresses of the switch and the task. If the two addresses match, then the task has
reached its destination and the switch directly loads the task on the processor input
buffer. If the addresses do not match, the direction is computed according to the XY
switching algorithm and the task is forwarded to that port. The XY-Switching algorithm is
given in the next section.
Figure 8. Framework Modeling (blocks: main GUI, settings, TG parser, scheduler, task interface, mapper, dispatcher, processor, switch, multiplexer and other IPs from the software and hardware libraries; architecture setup with an array of PEs; performance evaluation)
Figure 9. Processor and Switching technique (blocks: switch input and output ports, switch input and output buffers, MUX/DEMUX, processor input and output buffers, PE)
The XY Switching Algorithm
The XY switching technique is a simple data routing scheme where a task is first routed
along the X dimension, and then in the Y dimension. This algorithm gives the shortest
path between communicating elements for a Tiled layout. XY switching also gives a robust
way of data routing because, during node or path failure, data can be transferred
through other alternative routes. Heterogeneous Multiprocessor systems often contain
components that have different dimensions and sizes. One processor may have a
different size than a neighboring processor because they employ different internal
architectures. In this scenario, the XY algorithm may not guarantee the shortest path
because of the irregularities along the path. APSRA [18] by Rickard et al. proposed a fault
tolerant deadlock free algorithm for heterogeneous mesh NOCs. APSRA stores routing
information in memory to monitor faulty routes and allow re-configurability dynamically.
The XY switching technique uses the 2-dimensional coordinate address of a Tiled
architecture. Every processor is assigned a processor-ID and coordinate address during
the architecture setup phase of the framework. The Mapper assigns the task destination address
at runtime. The task routing address during switching is computed from the relative difference
between the task and processor addresses using the following expression.
$(X_{task} - X_{processor}) = X_{diff}$ and $(Y_{task} - Y_{processor}) = Y_{diff}$ (3.11)
If $X_{diff} = Y_{diff} = 0$, then the task has reached its destination processor. The switch
directly transfers the task to the processor input buffer. If the addresses do not match,
the direction is computed following the XY switching algorithm as follows:
1) If $(X_{task} - X_{processor}) < 0$, then the switch passes the task through "west-port".
2) If $(X_{task} - X_{processor}) > 0$, then the switch passes the task through "east-port".
3) If $(Y_{task} - Y_{processor}) < 0$, then the switch passes the task through "north-port".
4) If $(Y_{task} - Y_{processor}) > 0$, then the switch passes the task through "south-port".
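The port selection rules above can be sketched as a single C++ function. The enum and function names are hypothetical, and the sketch resolves the X difference before the Y difference, following the statement that a task is first routed along the X dimension.

```cpp
// Output port chosen by a switch for a task; Local means the task has reached
// its destination and goes to the processor input buffer.
enum class Port { Local, West, East, North, South };

// XY switching: compare the task's destination address with the address of the
// current switch/processor and resolve the X difference before the Y difference.
Port xyRoute(int xTask, int yTask, int xProcessor, int yProcessor) {
    const int xDiff = xTask - xProcessor;
    const int yDiff = yTask - yProcessor;
    if (xDiff < 0) return Port::West;
    if (xDiff > 0) return Port::East;
    if (yDiff < 0) return Port::North;
    if (yDiff > 0) return Port::South;
    return Port::Local;
}
```

For example, a task addressed to (3, 1) that arrives at the switch at (1, 1) is forwarded through the east port, and a task addressed to (1, 1) arriving at (1, 1) is delivered locally.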
4.9 Behavioral Framework Modeling
Switch power is consumed whenever a task is routed from an input port to an output port.
Processor power is modeled with respect to idle-mode and execution-mode power.
Figure 10. Simulation framework prototype and infrastructure. (Panels: TG PARSER, SCHEDULER, MAPPER, DISPATCHER and PROCESSOR, each showing a C++/SystemC source excerpt: the inputParser class reading the task-graph file, scheduler::setSchedule ordering tasks by deadline, the mapperDispatcher SC_MODULE, and the processor SC_MODULE with its north, east, south and west switch ports.)
1) Processor power is modeled as a function of the task execution time, $T_{\text{exec-time}}$:
$P_{\text{power}}(T_{\text{exec-time}}) = \Delta\text{POWER} * T_{\text{exec-time}}$ (3.12)
where $\Delta\text{POWER}$ is the unit power consumed per unit time of task processing.
The total power rating for processor P would be expressed as:
$P_{\text{total-power}} = \Delta\text{POWER} * \sum_{n=1}^{N} (T_{\text{exec-time}})_n$ (3.13)
where N is the total number of tasks executed by the processor.
Figure 11. Software/Hardware libraries and architecture setup module (Panels: TASK INTERFACE, SETTINGS, MULTIPLEXER, PROCESSOR, SWITCH and other IPs from the software and hardware libraries, and ARCH SETUP; each shows a C++/SystemC source excerpt, such as the taskInterface class, the setting struct with fields like faultTolerantMode, numberOfProcessors, bufferSize and topology, the multiplexer SC_MODULE, and the setup code that instantiates the processor array and starts the simulation.)
2) Whenever a task $T_i$ is executed on processor P, the task execution time $T_{\text{exec-time}}$ and
the power consumed by the processor, $P_{\text{power}}$, to execute the task are added to the
Total-Execution-Time and Total-Power parameters of the processing element
respectively as:
$P_{\text{tot-exec-time}} = P_{\text{tot-exec-time}} + T_{\text{exec-time}}$ (3.14)
$P_{\text{total-power}} = P_{\text{total-power}} + P_{\text{power}}(T_{\text{exec-time}})$ (3.15)
3) For modeling processor heterogeneity, additional execution latency and power costs
have been introduced. Additional LATENCY and additional POWER costs are incurred if
and only if there is a difference between task-type and processor-type. These
additional costs are functions of task type and task execution time.
$\text{LATENCY}(T_{\text{type}}, T_{\text{exec-time}}) = \Delta\text{TYPE} * T_{\text{exec-time}}$ (3.16)
$\text{POWER}(T_{\text{type}}, T_{\text{exec-time}}) = \Delta\text{TYPE} * P_{\text{power}}(T_{\text{exec-time}})$ (3.17)
where LATENCY and POWER are additional cost variables and $\Delta\text{TYPE}$ is the type difference
between processor-type and task-type.
Let T be a task with execution time $T_{\text{exec-time}}$ and let $P_1$ and $P_2$ be two processors. Let task T
and $P_1$ be of Type 1 and $P_2$ be of Type 2. Executing T on $P_1$ (same type) will have a static cost
described by:
$P_1(\text{execution-time}) = T_{\text{exec-time}}$ (3.18)
$P_1(\text{power}) = P_{\text{power}}(T_{\text{exec-time}})$ (3.19)
Executing T on $P_2$ will incur additional cost with respect to execution time and power.
These additional costs are added on top of the static execution time and power
parameters of the processor, which is described by:
$P_2(\text{execution-time}) = T_{\text{exec-time}} + \text{LATENCY}$ (3.20)
$P_2(\text{power}) = P_{\text{power}}(T_{\text{exec-time}}) + \text{POWER}$ (3.21)
Whenever a task is exchanged from one node to another (switch to switch), the
communication delay is updated on the Task Communication-Delay parameter. The
total communication latency for that task would be expressed as:
Total-task-communication-delay = (N) * (task-size) * (COMM-DELAY) (3.22)
where N is the number of switch hops, task-size is the size of the task defined as a
function of its execution time in time-units and COMM-DELAY is a time constant that
is required to transfer a single time-unit of task size for one hop.
4) For each task T processed, the Task-Slack-Time $T_{\text{slack-time}}$ and the Average-Slack-Time 
$T_{\text{average-slack}}$ are calculated as
$T_{\text{slack-time}} = T_{\text{deadline}} - [T_{\text{release-time}} + T_{\text{exec-time}}]$ (3.23)
$T_{\text{average-slack}} = \sum (T_{\text{slack-time}}) / N$ (3.24)
where $T_{\text{deadline}}$ is the deadline, $T_{\text{release-time}}$ is the release time and N is the total number 
of tasks processed.
The slack time of a task describes by how much the task execution can be delayed 
without violating the deadline of other tasks. The average-slack-time helps to 
determine the efficiency of the scheduler. The smaller the slack-time of a task, the 
closer the finish-time of the task is to its deadline and thus the higher the efficiency 
of the scheduler.
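The cost model of equations (3.16) to (3.24) can be sketched in C++ as below. This is a minimal illustration, not the framework's code: the `Task` record and all function names are hypothetical, the type difference is assumed to be the absolute difference of integer type identifiers, and the task size in (3.22) is assumed to equal the execution time in time-units.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Hypothetical task record; field names are illustrative only.
struct Task {
    int type;            // task type identifier
    double execTime;     // T_exec-time, in time-units
    double releaseTime;  // T_release-time
    double deadline;     // T_deadline
};

// Eq. (3.16): additional latency; zero when task and processor types match.
// ASSUMPTION: the type difference is |processor type - task type|.
double additionalLatency(const Task& t, int procType) {
    const double typeDiff = std::abs(procType - t.type);
    return typeDiff * t.execTime;
}

// Eqs. (3.18)/(3.20): execution time on a processor of the given type.
double executionTime(const Task& t, int procType) {
    return t.execTime + additionalLatency(t, procType);
}

// Eq. (3.22): hops * task-size * COMM-DELAY.
// ASSUMPTION: task-size is taken equal to the execution time.
double communicationDelay(const Task& t, int hops, double commDelay) {
    return hops * t.execTime * commDelay;
}

// Eq. (3.23): slack of a single task.
double slackTime(const Task& t) {
    return t.deadline - (t.releaseTime + t.execTime);
}

// Eq. (3.24): average slack over all processed tasks.
double averageSlack(const std::vector<Task>& tasks) {
    if (tasks.empty()) return 0.0;
    double sum = 0.0;
    for (const Task& t : tasks) sum += slackTime(t);
    return sum / tasks.size();
}
```

For a task of type 1 with an execution time of 10 time-units, running on a type-3 processor would cost 10 + 2 * 10 = 30 time-units under these assumptions, while a matching processor costs the static 10.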
CHAPTER 5 
IMPLEMENTATION
The previous chapter explained the proposed MPSOC framework methodology. In 
this chapter, the implementation of the proposed scheduling, mapping 
and fault tolerance algorithms is discussed in detail. This work employs efficient scheduling, mapping 
and fault tolerant algorithms to optimize the performance of MPSOCs. The algorithms 
proposed in this work are:
1) Performance-Driven scheduling algorithm
2) Homogenous-Workload-Distribution mapping algorithm
3) Fault tolerant algorithm.
5.1 Performance-Driven Scheduling Algorithm
This work proposes an efficient performance-driven scheduling algorithm based on 
Simulated-Annealing techniques. A performance index determined by a cost function 
(defined in chapter 3) is used to determine the optimal schedule. The performance 
index is a cumulative factor of i) processor execution time, ii) processor utilization, iii) 
processor throughput, iv) processor power, v) port traffic and vi) processor buffer utilization.
The problem of determining the optimal schedule is defined as determining the 
schedule with the maximal performance index.
Simulated-Annealing Overview
Annealing is a metal manufacturing process in which molten metal is cooled 
at a predetermined slow rate so that the metal particles can settle into the most stable 
crystalline structure. Applying this method to mathematical optimization problems is called 
Simulated-Annealing. Simulated-Annealing is an efficient search heuristic for finding a 
good solution in a short time where finding the best solution would otherwise require an 
exhaustive search and a long execution time.
Simulated-Annealing is a randomized process. The process starts at a given 
temperature, computes the cost of the solution at that temperature and evaluates a 
probabilistic state transition based on the PI (cost function). The procedure repeats 
these steps, aiming to maximize the PI while escaping local optima. The iteration stops 
at a temperature threshold. While this method quickly provides a solution of acceptable cost (which 
might also be the best), it does not guarantee the best solution. Simulated-Annealing 
is used in many design areas. The pseudo code of the Simulated-Annealing 
algorithm is given in Figure 12.
The Simulated-Annealing algorithm performs i) random scheduling and mapping of 
tasks, ii) running the simulation and iii) capturing, averaging and normalizing the performance 
variables and calculating the performance index. These procedures are repeated several 
times. At any iteration, if a better PI is found, the state is saved, and the 
simulation iterates until the temperature reaches a threshold value.
STATE := INITIAL_STATE; INITIAL_COST := COST(STATE)
BEST_STATE := STATE; BEST_COST := INITIAL_COST
TEMP_THRESHOLD := 1; SIM_TEMP := 100
SIM_LEN := 1; TEMP_MAX := 50
COOLING_FACTOR := 0.75
WHILE SIM_TEMP > TEMP_THRESHOLD
    WHILE SIM_LEN < TEMP_MAX
    {
        NEW_STATE := STATE
        NEW_COST := COST(NEW_STATE)
        IF NEW_COST < BEST_COST
            BEST_STATE := NEW_STATE; BEST_COST := NEW_COST
        SIM_LEN := SIM_LEN + 1
    }
    SIM_TEMP := SIM_TEMP * COOLING_FACTOR
Figure 12. Simulated-Annealing algorithm
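The loop of Figure 12 can be sketched as a small generic C++ routine. This is a textbook simulated-annealing skeleton under stated assumptions rather than the framework's implementation: the function name `anneal`, the callable parameters and the fixed random seed are hypothetical, the routine minimizes the supplied cost, and the default parameters follow the constants read from Figure 12 (start temperature 100, threshold 1, cooling factor 0.75, 50 moves per temperature).

```cpp
#include <cmath>
#include <random>

// Generic simulated-annealing skeleton that minimizes cost(state).
// State, CostFn and NeighborFn are placeholders supplied by the caller:
//   cost(state)          -> double
//   neighbor(state, rng) -> State (a randomly perturbed copy)
template <typename State, typename CostFn, typename NeighborFn>
State anneal(State state, CostFn cost, NeighborFn neighbor,
             double temp = 100.0, double tempThreshold = 1.0,
             double coolingFactor = 0.75, int movesPerTemp = 50,
             unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    State best = state;
    double currentCost = cost(state);
    double bestCost = currentCost;

    while (temp > tempThreshold) {
        for (int move = 0; move < movesPerTemp; ++move) {
            State candidate = neighbor(state, rng);
            const double candidateCost = cost(candidate);
            const double delta = candidateCost - currentCost;
            // Improvements are always accepted; worse moves are accepted
            // with probability exp(-delta / temp).
            if (delta < 0.0 || uniform(rng) < std::exp(-delta / temp)) {
                state = candidate;
                currentCost = candidateCost;
            }
            // Remember the best state seen so far.
            if (currentCost < bestCost) {
                best = state;
                bestCost = currentCost;
            }
        }
        temp *= coolingFactor;  // cooling schedule
    }
    return best;
}
```

To maximize a performance index instead, the caller can pass the negated index as the cost. In the framework, each cost evaluation corresponds to a full simulation run, so the number of moves per temperature directly bounds the simulation effort.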
The cost is computed from the averaged values of all the performance parameters using the PI 
equation given below.
A "STATE" in the Simulated-Annealing algorithm refers to a solution instance where 
simulation data along with every performance parameter is captured. Whenever a 
"STATE" transition occurs, these values are replaced by the new "STATE" variables. A 
"STATE" transition between any two iterations of the Simulated-Annealing process is 
decided by a state-transition function, which is shown in Figure 13.
The performance index is calculated using eq. (3.10) for each "STATE" as:
$PI = C_1 \cdot (\text{avg-execution-time}) + C_2 \cdot (\text{avg-proc-util}) + C_3 \cdot (\text{avg-proc-throu}) + C_5 \cdot (\text{avg-buff-usage}) + C_4 \cdot (1/\text{avg-port-traffic}) + C_6 \cdot (1/\text{avg-proc-power})$
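#include <cmath>    // exp
#include <cstdlib>  // rand, RAND_MAX

bool stateTransition(float currentCost, float bestCost, float temperature)
{
    bool takeMove;
    if (currentCost < bestCost)
    {
        takeMove = true;  // MOVE TO NEW STATE
    }
    else  // SOME PROBABILITY TO DECIDE
    {
        // probability lies in (0, 1] because currentCost >= bestCost here
        float probability = exp((bestCost - currentCost) / temperature);
        // uniform random value in [0, 1]
        float randomValue = (float)rand() / (float)RAND_MAX;
        if (randomValue < probability)  // TRANSITION PROBABILITY
            takeMove = true;
        else
            takeMove = false;
    }
    return takeMove;
}
Figure 13. Simulated-Annealing State Transition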
The coefficient values are set according to the goal of the PI. For the problem considered, the 
normalizing coefficients were set to: C1 = 300, C2 = 500, C3 = 250, C4 = 125, C5 = 120 and 
C6 = 50000. Execution time, Utilization, Throughput and Power factors are key 
performance indices and consequently they are given heavy weight during 
normalization. Port traffic and power indices are to be minimized, so their 
respective cost terms are computed reciprocally using higher coefficients.
5.2 Homogenous-Workload-Distribution (HWD) Mapping Strategy 
5.2.1 Background
Utilization is a key factor in determining the performance of multiprocessor systems. 
From a multiprocessing perspective, overall utilization is determined by the utilization of the 
individual processors. By definition, processor utilization is directly 
proportional to the time that the processor was busy executing tasks. Consequently, 
minimizing the idle-time of each processor (maximizing busy-time) improves processor 
utilization and therefore overall performance. To reduce idle-time, tasks should be 
distributed evenly across the available processors. Homogenous workload distribution 
ensures that no processor remains idle while other processors have long task 
queues in their buffers.
The proposed Homogenous-Workload-Distribution (HWD) task mapping is aimed at 
optimizing the mapping strategy in an effort to balance the dynamic workload across 
the processors. The algorithm involves two steps: i) probing individual processor buffers 
through dedicated snooping channels which connect each processor to the Mapper 
and ii) mapping each scheduled task on the processor which has the lowest workload. 
Workload is a measure of the execution overhead for processors. The overhead for a 
particular processor is determined by the number of tasks queued in its buffer. Thus the 
processor having the least number of tasks in its buffer has the lowest workload, and 
therefore the Mapper will assign the next scheduled task to this processor.
5.2.2 HWD Mapping Algorithm
Let $\{P_1, P_2, P_3, \ldots, P_N\}$ be the set of processors and $\{L_1, L_2, L_3, \ldots, L_N\}$ be the dynamic task 
workloads on the respective processor buffers, where N is the number of processors. Let 
$\{T_1, T_2, T_3, \ldots, T_K\}$ be the set of tasks and let the symbol "$\alpha$" denote "is mapped to". Then,
i) A task $T: \langle T_i, T_{\text{exec-time}}, T_{\text{deadline}} \rangle \; \alpha \; P_i$ if and only if $L_i < L_j \; \forall j \in \{1, 2, 3, \ldots, N\}$, $j \ne i$, where 
$[L_i, L_j] \in \{L_1, L_2, L_3, \ldots, L_N\}$, $P_i \in \{P_1, P_2, \ldots, P_N\}$ and $1 \le i \le N$, such that $(T_{\text{release-time}} + T_{\text{exec-time}} \le T_{\text{deadline}})$.
ii) If $L_i = L_j$ for $1 \le i \le N$, $1 \le j \le N$ and $i \ne j$, then type matching and master-slave 
mapping policies will be considered to choose the task placement as described 
below.
Let $\{Type_1, Type_2, Type_3, \ldots, Type_M\}$ be the set containing the processor types, 
such that M = N (for simplicity, assume each processor has a different
type, but the general case is $M \le N$). Again, let task T have type $Type_R$ where $Type_R \in
\{Type_1, Type_2, Type_3, \ldots, Type_K\}$ for R = 1, 2, 3, ..., K. Then,
a) $T \; \alpha \; P_i$ if and only if $Type_i = Type_R$ for $1 \le i \le M$ whenever possible. Else, the master-
slave mapping policy applies as follows.
Let $T_{\text{mas}}$ be the master-task of $T_{\text{slave}}$. Let $[P_i, P_j] \in \{P_1, P_2, P_3, \ldots, P_N\}$ for $1 \le i \le N$,
$1 \le j \le N$ and $T_{\text{mas}} \; \alpha \; P_i$, then
b) $T_{\text{slave}} \; \alpha \; P_j$ if $P_i$ and $P_j$ are neighbors defined by the topology.
This approach has three fundamental advantages: i) it minimizes the chance of 
buffer overflow, ii) task waiting time in the processor buffer is reduced and iii) processor 
utilization is increased because idle or less loaded processors will be assigned more 
tasks.
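The selection rule of the HWD mapper can be sketched as follows. This is a simplified illustration under stated assumptions: the `Processor` record and `selectProcessorHWD` are hypothetical names, workload is the number of queued tasks as defined above, ties are broken by type matching only, and the master-slave neighbor policy is left out because it depends on the topology.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical view of what the Mapper probes over the snooping channels.
struct Processor {
    int type;                 // processor type identifier
    std::size_t queuedTasks;  // workload: number of tasks in the buffer
};

// Returns the index of the processor the next task should be mapped to,
// or -1 if there are no processors.
// Rule i): the lowest workload wins.
// Rule ii a): on equal workload, a processor whose type matches the task wins.
int selectProcessorHWD(const std::vector<Processor>& procs, int taskType) {
    int chosen = -1;
    for (std::size_t i = 0; i < procs.size(); ++i) {
        if (chosen < 0) {
            chosen = static_cast<int>(i);
            continue;
        }
        const Processor& best = procs[chosen];
        const bool lighter = procs[i].queuedTasks < best.queuedTasks;
        const bool tieWithBetterType =
            procs[i].queuedTasks == best.queuedTasks &&
            procs[i].type == taskType && best.type != taskType;
        if (lighter || tieWithBetterType) chosen = static_cast<int>(i);
    }
    return chosen;
}
```

For workloads {3, 1, 1} on processors of types {1, 2, 3}, a type-3 task would be placed on the third processor, while a task whose type matches none of the tied processors falls back to the first least-loaded one.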
The hardware layout implementation is depicted in Figure 14, which shows the 
dedicated channels for monitoring each processor workload.
[Figure 14 diagram: software layer (scheduler interface); snooping channel; Mapper + Dispatcher unit; MUX array; processor array (hardware layer)]
Figure 14. Hardware Interconnection in 5X5 Torus.
5.3 Fault Tolerant Implementation
5.3.1 Background
Failure to employ fault detection and recovery strategies is often unaffordable in 
mission-critical systems. The growing design sophistication of current multiprocessor 
chip technology makes fault tolerant procedures significantly more complex to implement. 
Additionally, the inherent execution parallelism in MPSOCs entails 
tremendous overhead to dynamically monitor application execution status.
Processor failure introduces inevitable overall performance degradation due to i) the 
reduced computing power of the system and ii) the overhead involved in applying the 
recovery schemes. Recovery procedures involve tracking task execution history and 
reconfiguring the system based on the available processors. This makes it possible to resume 
application processing where the execution left off. During processor failure, task 
processing is temporarily suspended and error recovery procedures are carried out before 
normal application processing is resumed.
Component overheating or similar runtime issues may cause processor failure. The failed 
processor may have tasks executing on it or stored in its buffer. 
These tasks cannot simply be discarded because doing so would disrupt the application 
processing cycle. In real-time systems, when a processor failure occurs, the system has 
to be reconfigured and data consistency issues have to be resolved; otherwise the failure may 
lead to data discrepancy and deadline violation.
Previous works in [10][3][27][21] have demonstrated reliability issues regarding task 
processing, data routing and communication link failure. [27] proposed a system-level 
design framework for a tiled architecture to construct a reliable computing platform 
from potentially unreliable chip resources. [21] proposed dynamic scheduling strategies 
to increase resource utilization by exploiting all the scheduling freedom in NOCs.
5.3.2 Fault Tolerant Model
Fault modeling can have several dimensions, which express the time at which the fault occurs, 
the fault duration, the location of the failure and so on. [27] proposed a model in which faults 
have transient and permanent behavior.
The proposed framework adopts similar fault modeling parameters.
•  Duration: Processor failure is assumed to be permanent. Thus, the fault duration time 
$T_{fd}$ is infinite ($T_{fd} = \infty$). On failure, the address of the failed processor is 
completely removed from the Mapper address table.
•  Location: Location specifies where the failure occurred. For Tile and Torus layouts, 
location is represented as a two-dimensional coordinate address.
•  Time of failure: During task processing, any processor may fail at any time 
instance. Consequently, time of failure is modeled as a random time instance.
If the framework is operating under fail-safe mode, the Mapper is configured to 
probe each processor so that problems can be detected dynamically. In the event of 
failure, the following procedures are carried out.
1) Processor failure is monitored dynamically by the Mapper.
2) If a failure is detected, the Mapper removes the failed processor address from the 
address lookup table. This prevents any further tasks from being mapped to this processor.
3) Tasks that are scheduled or mapped but not dispatched have to be re-scheduled 
and re-mapped. Their new release times and task addresses have to be 
computed again, because the execution instance of a task depends on the finish time of its 
master-task, which may have been suspended in the failed processor.
4) Tasks that are already dispatched to the failed processor can be either i) in the 
processor's buffer or ii) in the middle of execution when the failure occurred. 
These tasks have to be migrated back to the Mapper and then re-scheduled 
and re-mapped. The scheduler can perform task 
scheduling offline after the execution is suspended by disabling the system clock.
5) Tasks that are dispatched to other non-failed processors are not affected.
6) Tasks that have already completed their execution are not affected.
5.3.3 Fault Tolerant Algorithm
The proposed fault tolerant algorithm is given as follows.
Procedure
    Monitor and Detect Processor Failure;
    Remove Processor Id From Address Lookup Table;
    Migrate Tasks from Failed Processor to Mapper;
    Disable System Clock;
    Reschedule All Tasks in Mapper TaskPool;
    Enable System Clock;
    Map All Tasks;
    Dispatch All Tasks;
    Resume Application Processing;
End Procedure
The application constraint determines how task execution has to proceed. In 
non-preemptive soft-real-time systems, the suspended tasks can simply be forwarded 
and added to the task queue of the receiving processor buffer.
In hard real-time systems, however, the deadline is the top-priority constraint, so preemptive 
execution needs to be employed. A task migrated from the failed processor may 
preempt an already executing task if its deadline is closer. Fault recovery procedures, 
however, entail tremendous processing overhead in context switching, which eventually 
degrades the overall system performance. Task migration also imposes additional 
communication cost to transfer the task among processors. The pseudo code for the fault 
tolerant implementation is given in Figure 15.
if (processorFailureDetected)  // COMPONENT FAILURE DETECTED
{
    mapper.RemoveID(failedProcessorID, addressLookUpTable);  // REMOVE ID
    for (all_tasks_in_Failed_Processor)  // COLLECT TASKS FROM FAILED PE
        mapper.collectTasks(failedProcessorID, dispatcherTaskPool);
}
systemClock.disable(true);  // TEMPORARILY SUSPEND SYSTEM
for (all_tasks_in_dispatcher_task_pool)  // GENERATE NEW TASK-GRAPH
{
    scheduler.calculateDeadline();
    scheduler.setReleaseTime();
    scheduler.scheduleTask(schedulingPolicy, preemptiveMode);
}
saveReportInLogFile(faultDetectionTime, faultyProcessorID);
systemClock.disable(false);  // RESUME NORMAL TASK MAPPING
mapper.mapTask(dispatcherTaskPool, addressLookUpTable, mappingPolicy);
dispatcher.dispatchTask(dispatcherTaskPool, addressLookUpTable);
Figure 15. Fault Tolerant Pseudo Code.
Due to the dynamic property of the Mapper and the fault tolerant implementation, the cost 
incurred in the event of processor failure is minimal. The only penalty incurred when a 
processor failure occurs is the cost of task migration and the respective 
rescheduling and re-mapping procedures.
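The first two recovery steps, removing the failed processor from the address lookup table and migrating its tasks back to the task pool, can be sketched with standard containers. The `MapperState` layout and the function name are hypothetical and tasks are reduced to integer identifiers; rescheduling, mapping and dispatching are outside this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <map>
#include <vector>

// Hypothetical bookkeeping kept by the Mapper.
struct MapperState {
    std::vector<int> addressTable;           // IDs of mappable processors
    std::map<int, std::deque<int>> buffers;  // processor ID -> queued task IDs
    std::deque<int> taskPool;                // tasks awaiting (re)scheduling
};

// Removes the failed processor from the address table and moves every task
// held by it to the end of the task pool, preserving their order.
// Returns the number of migrated tasks.
std::size_t recoverFromFailure(MapperState& m, int failedId) {
    m.addressTable.erase(
        std::remove(m.addressTable.begin(), m.addressTable.end(), failedId),
        m.addressTable.end());

    std::size_t migrated = 0;
    std::map<int, std::deque<int>>::iterator it = m.buffers.find(failedId);
    if (it != m.buffers.end()) {
        migrated = it->second.size();
        m.taskPool.insert(m.taskPool.end(),
                          it->second.begin(), it->second.end());
        m.buffers.erase(it);
    }
    return migrated;
}
```

Tasks on the other processors are left untouched, which matches the statement that only migration and the subsequent rescheduling and re-mapping contribute to the recovery cost.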
CHAPTER 6
SIMULATION RESULTS
In this chapter, simulation results are presented. The tables and charts illustrate the 
performance evaluations under different scheduling and mapping algorithms, 
fault tolerance modes, topologies, numbers of processors and processor type scenarios. The 
charts are supplemented by a brief analysis. The simulations were carried out for the 
various performance indices based on the following cases:
1) Scheduling policy evaluation: EDF vs. Simulated-Annealing
2) Fault Tolerant evaluation: Non-Fault-tolerant vs. Fault-tolerant
3) Topological evaluation: TORUS vs. TILE
4) Number of Processors evaluation: PE sizes 9, 16, 25, 36 and 49 are presented 
(only square dimensions are considered)
5) Mapping evaluation: HWD vs. Next Available vs. Random
6) Processor Type evaluation: Heterogeneity 1, 2, 3, 4, 5, 6, 7, 8, 9.
Different sets of STG benchmarks were run on the framework for the above 
simulation evaluations. The STG graphs contain from 50 to 3000 tasks. These 
benchmarks include randomly generated "RAND-XXXX" graphs as well as application-
specific graphs, namely SPARSE, ROBOT and FPPP. The test-bench suites are discussed in 
detail in Appendix III.
The evaluation was done on all benchmarks by taking two task graphs of each 
task graph size. All the simulation charts and tables presented here are averaged values 
over all the STG benchmarks considered. Table 1 compares Processor 
Utilization, Processor Throughput and Buffer Utilization. Tables 2A and 2B list the Execution 
time, Port Traffic and Power performance factors. Tables 4, 5 and 6 present the 
Scheduling, Fault tolerant and Topology comparisons: EDF and Simulated-Annealing are compared 
in Table 4, FT with NFT in Table 5 and Tile with Torus in Table 6. Detailed 
simulation results are included in Appendix I.
LEGEND: Table entries have the following format:
[Topology][Scheduling Policy][Fault Tolerant Mode][Number of Processors]
- Topology is represented as T or R: T = Tile, R = Torus.
- Scheduling Policy is represented as E or S: E = EDF, S = PD Simulated-Annealing.
- Fault Tolerant mode is denoted by + or -: + = Fault Tolerant, - = Non Fault Tolerant.
- Number of Processors is represented by a constant number.
For example, RE+16 signifies Torus topology, EDF scheduling with Fault Tolerant mode enabled
and 16 processors used. The number of Processor Types for all simulations is 4. No heterogeneity
evaluation is done.
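The legend format is regular enough to decode mechanically. The following sketch parses a label such as RE+16 into its four fields; the `Scenario` struct and `parseScenario` are hypothetical helper names introduced only for illustration.

```cpp
#include <cctype>
#include <string>

// Decoded form of a table entry label such as "RE+16".
struct Scenario {
    bool torus;               // 'R' = Torus, 'T' = Tile
    bool simulatedAnnealing;  // 'S' = PD Simulated-Annealing, 'E' = EDF
    bool faultTolerant;       // '+' = Fault Tolerant, '-' = Non Fault Tolerant
    int processors;           // number of processors
};

// Parses [Topology][Scheduling Policy][Fault Tolerant Mode][Number of Processors].
// Returns false and leaves 'out' untouched if the label is malformed.
bool parseScenario(const std::string& label, Scenario& out) {
    if (label.size() < 4) return false;
    const char topo = label[0], sched = label[1], mode = label[2];
    if (topo != 'T' && topo != 'R') return false;
    if (sched != 'E' && sched != 'S') return false;
    if (mode != '+' && mode != '-') return false;

    int count = 0;
    for (std::string::size_type i = 3; i < label.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(label[i]))) return false;
        count = count * 10 + (label[i] - '0');
    }

    out.torus = (topo == 'R');
    out.simulatedAnnealing = (sched == 'S');
    out.faultTolerant = (mode == '+');
    out.processors = count;
    return true;
}
```

For instance, TS-25 decodes to Tile topology, Simulated-Annealing scheduling, non fault tolerant mode and 25 processors.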
Table and Chart Headers:
BFFR = Buffer Utilization
THROU = Throughput
EXEC TIME = Execution time
UTIL = Utilization
PDP = Power-Delay-Product
1. Comparison of Processor Utilization, Throughput and Buffer Utilization for 
different scenarios in Topology, Scheduling policy, Fault tolerance and Number of 
Processors.
Table 1. Comparison of Processor Utilization, Throughput and Buffer Utilization
UTIL THROU BUFFER
RE-9 0.469 0.025 5.273
RS-9 0.548 0.024 5.364
RE+9 0.433 0.025 5.273
RS+9 0.471 0.023 5.636
TE-16 0.430 0.027 5.400
TS-16 0.433 0.025 5.400
TE+16 0.418 0.029 5.357
TS+16 0.398 0.030 5.429
RE-16 0.365 0.025 5.083
RS-16 0.406 0.024 5.417
RE+16 0.370 0.027 4.917
RS+16 0.419 0.024 5.917
TE-25 0.338 0.024 5.154
TS-25 0.365 0.027 5.154
TE+25 0.324 0.024 5.222
TS+25 0.363 0.027 5.000
Figure 16. Comparison of Processor Utilization
Figure 17. Comparison of Processor Throughput
Figure 18. Comparison of Buffer Utilization
Analysis: Processor Utilization decreased when the number of processors increased. 
However, Processor Throughput remained almost the same and Buffer Utilization did 
not change regardless of the number of processors used. This result is in accord with 
the expected results because fewer processors often deliver higher utilization.
2. Comparison of Execution time per task, Port Traffic per task and Power per task for 
different scenarios in Topology, Scheduling policy, Fault Tolerance and Number 
of Processors.
Table 2A. Comparison of Execution Time, Port Traffic and Power
EXEC TIME PORT POWER PDP
RE-9 4.38 0.28 158.31 694.04
RS-9 5.15 0.29 186.53 960.35
RE+9 5.03 0.29 181.27 912.10
RS+9 5.35 0.29 198.90 1063.43
TE-16 2.72 0.47 95.61 260.10
TS-16 2.79 0.47 98.23 274.29
TE+16 2.72 0.40 94.24 256.67
TS+16 2.78 0.40 98.41 273.29
RE-16 2.80 0.19 96.48 269.82
RS-16 2.97 0.19 106.36 316.03
RE+16 2.80 0.19 96.06 268.56
RS+16 2.93 0.19 101.31 297.08
TE-25 1.84 0.17 63.40 116.91
TS-25 1.83 0.17 62.66 114.64
TE+25 1.91 0.28 62.80 120.24
TS+25 1.96 0.28 65.49 128.45
Figure 19. Comparison of Execution time
Figure 20. Comparison of Power
Figure 21. Comparison of Port Traffic
Analysis: Increasing the number of processors resulted in improved Task Processing Time 
and power factors. The port traffic showed variation but peaked when the number of 
processors is 16. The Utilization curve showed that multiprocessors are effective toward 
faster and more power-efficient application processing.
3. Comparison of Execution time/processor/task, Port Traffic/processor/task and 
Power/processor/task for different scenarios in Topology, Scheduling policy, Fault 
Tolerance and Number of Processors.
Table 2B. Comparison of Execution time, Port traffic and Power
PE/TASK EXEC TIME PORT POWER
RE-9 39.46 2.56 1424.77
RS-9 46.34 2.61 1678.81
RE+9 45.29 2.63 1631.43
RS+9 48.12 2.64 1790.14
TE-16 43.52 7.53 1529.82
TS-16 44.68 7.53 1571.61
TE+16 43.58 6.36 1507.91
TS+16 44.43 6.34 1574.56
RE-16 44.75 3.09 1543.65
RS-16 47.54 3.10 1701.68
RE+16 44.73 3.10 1536.89
RS+16 46.92 3.11 1620.89
TE-25 46.10 4.13 1584.92
TS-25 45.74 4.13 1566.50
TE+25 47.87 6.88 1570.01
TS+25 49.03 6.88 1637.29
Figure 22. Comparison of Exec. Time/Task (EDF vs. SA)
Figure 23. Comparison of Exec. Time/Task (NFT vs. FT)
Figure 24. Comparison of Exec. Time/Task (Tile vs. Torus topology)
Figure 25. Comparison of Power/Task (EDF vs. SA)
Figure 26. Comparison of Power/Task (NFT vs. FT)
Figure 27. Comparison of Power/Task (Tile vs. Torus topology)
Analysis: Simulated annealing showed better performance for a smaller number of 
processors w.r.t. Execution time/Task and Power/Task evaluations. Fault tolerant 
evaluations showed almost no change under most scenarios. Torus topology gave 
significantly better performance for both Execution time/Task and Power/Task factors.
4. Comparison of average values obtained from Table 2A for Execution time/Task, 
Power/Task and Port Traffic/Task for 9, 16 and 25 Processors.
Table 3. Average Execution time, Port Throughput, Power and PDP
AVERAGE EXEC TIME PORT POWER PDP
PE#9 4.98 0.29 181.25 907.48
PE#16 2.81 0.31 98.34 276.98
PE#25 1.89 0.22 63.59 120.06
Figure 28. Comparison of Execution time/Task for different Num. of Processors
Figure 29. Comparison of Power/Task for different Num. of Processors.
Analysis: The Average Execution Time/Task and Average Power/Task variables showed a 
decrease when the number of processors is increased. This is because tasks are 
executed in parallel. A larger number of processors theoretically yields higher processing 
power with respect to execution time and power rating.
5. Comparison of Execution Time, Processor Utilization, Throughput, Power, Port 
Traffic and Buffer Utilization with respect to Scheduling policy under different 
scenarios of Topology, Fault Tolerant mode and Number of processors.
Table 4. Scheduling Evaluations
EDF EXEC TIME UTIL THROU POWER PORT BUFFER PDP
RE-9 4.38 0.469 0.025 158.31 0.28 5.273 694.04
RE+9 5.03 0.433 0.025 181.27 0.29 5.273 912.10
TE-16 2.72 0.430 0.027 95.61 0.47 5.400 260.10
TE+16 2.72 0.418 0.029 94.24 0.40 5.357 256.67
RE-16 2.80 0.365 0.025 96.48 0.19 5.083 269.82
RE+16 2.80 0.370 0.027 96.06 0.19 4.917 268.56
TE-25 1.84 0.338 0.024 63.40 0.17 5.154 116.91
TE+25 1.91 0.324 0.024 62.80 0.28 5.222 120.24
SA EXEC TIME UTIL THROU POWER PORT BUFFER PDP
RS-9 5.15 0.548 0.024 186.53 0.29 5.364 960.35
RS+9 5.35 0.471 0.023 198.90 0.29 5.636 1063.43
TS-16 2.79 0.433 0.025 98.23 0.47 5.400 274.29
TS+16 2.78 0.398 0.030 98.41 0.40 5.429 273.29
RS-16 2.97 0.406 0.024 106.36 0.19 5.417 316.03
RS+16 2.93 0.419 0.024 101.31 0.19 5.917 297.08
TS-25 1.83 0.365 0.027 62.66 0.17 5.154 114.64
TS+25 1.96 0.363 0.027 65.49 0.28 5.000 128.45
Figure 30. Processor Utilization Evaluation for EDF and SA
Figure 31. Throughput Evaluation for EDF and SA
Figure 32. Buffer Usage Evaluation for EDF and SA
Analysis: Simulated-Annealing gave marginally better performance than EDF with 
respect to Processor Utilization and Buffer Utilization. However, EDF showed a marginal 
Throughput gain over Simulated-Annealing. This conforms with the expected results 
because Simulated-Annealing procedure
6. Comparison of Execution Time, Processor Utilization, Throughput, Power, Port Traffic 
and Buffer Utilization with respect to Fault Tolerant strategy under different 
scenarios of Topology, Scheduling policy and Number of processors
Table 5. Comparison of NFT and FT
NFT EXEC TIME UTIL THROU POWER PORT BUFFER PDP
RE-9 4.38 0.469 0.025 158.31 0.28 5.273 694.04
RS-9 5.15 0.548 0.024 186.53 0.29 5.364 960.35
TE-16 2.72 0.430 0.027 95.61 0.47 5.400 260.10
TS-16 2.79 0.433 0.025 98.23 0.47 5.400 274.29
RE-16 2.80 0.365 0.025 96.48 0.19 5.083 269.82
RS-16 2.97 0.406 0.024 106.36 0.19 5.417 316.03
TE-25 1.84 0.338 0.024 63.40 0.17 5.154 116.91
TS-25 1.83 0.365 0.027 62.66 0.17 5.154 114.64
FT EXEC TIME UTIL THROU POWER PORT BUFFER PDP
RE+9 5.03 0.433 0.025 181.27 0.29 5.273 912.10
RS+9 5.35 0.471 0.023 198.90 0.29 5.636 1063.43
TE+16 2.72 0.418 0.029 94.24 0.40 5.357 256.67
TS+16 2.78 0.398 0.030 98.41 0.40 5.429 273.29
RE+16 2.80 0.370 0.027 96.06 0.19 4.917 268.56
RS+16 2.93 0.419 0.024 101.31 0.19 5.917 297.08
TE+25 1.91 0.324 0.024 62.80 0.28 5.222 120.24
TS+25 1.96 0.363 0.027 65.49 0.28 5.000 128.45
Figure 33: Comparison of Utilization for NFT and FT
Figure 34: Comparison of Throughput for NFT and FT
Figure 35: Comparison of Buffer Usage for NFT and FT
Analysis: The Non-Fault Tolerant implementation gave a marginal gain in Utilization, but 
Throughput and Buffer Utilization remained almost the same. This showed that 
the Fault tolerant scheme was effectively implemented because the 
procedure does not incur considerable overall system performance degradation.
7. Comparison of Execution Time, Processor Utilization, Throughput, Power, Port Traffic 
and Buffer Utilization with respect to Topology under different 
scenarios of Scheduling policy, Fault Tolerant scheme and Number of processors.
Table 6. Comparison of Tile and Torus
TILE EXEC TIME UTIL THROU POWER PORT BUFFER PDP
TE-16 2.72 0.430 0.027 95.61 0.47 5.400 260.10
TS-16 2.79 0.433 0.025 98.23 0.47 5.400 274.29
TE+16 2.72 0.418 0.029 94.24 0.40 5.357 256.67
TS+16 2.78 0.398 0.030 98.41 0.40 5.429 273.29
TE-25 1.84 0.338 0.024 63.40 0.17 5.154 116.91
TS-25 1.83 0.365 0.027 62.66 0.17 5.154 114.64
TE+25 1.91 0.324 0.024 62.80 0.28 5.222 120.24
TS+25 1.96 0.363 0.027 65.49 0.28 5.000 128.45
TORUS EXEC TIME UTIL THROU POWER PORT BUFFER PDP
RE-9 4.38 0.469 0.025 158.31 0.28 5.273 694.04
RS-9 5.15 0.548 0.024 186.53 0.29 5.364 960.35
RE+9 5.03 0.433 0.025 181.27 0.29 5.273 912.10
RS+9 5.35 0.471 0.023 198.90 0.29 5.636 1063.43
RE-16 2.80 0.365 0.025 96.48 0.19 5.083 269.82
RS-16 2.97 0.406 0.024 106.36 0.19 5.417 316.03
RE+16 2.80 0.370 0.027 96.06 0.19 4.917 268.56
RS+16 2.93 0.419 0.024 101.31 0.19 5.917 297.08
Figure 36. Comparison of Utilization for Tile and Torus
Figure 37. Comparison of Throughput for Tile and Torus
Figure 38. Comparison of Buffer Utilization for Tile and Torus
Analysis: Torus gives marginally higher Throughput than Tile, but Tile 
gives better Utilization. Buffer Utilization remains the same for the different processor 
sizes considered.
8. Comparison of Average Execution time, Average Utilization, Average Throughput, 
Average Port Traffic, Average Power and Average Buffer Utilization for 16, 25, 36 and 49 
Processors with Tile and Torus Topologies. The evaluations are made for the 
Rand-0000 3000 STG benchmark.
Table 7. Performance Evaluation for Rand 0000-3000 Benchmark.
BENCHMARK # TASKS PE TYPES FAULT MODE MAPPING SCHEDULING
RAND 0000 3000 4 OFF HWD EDF
PE# TOPOLOGY AV EXEC-TIME AV UTIL AV THROU AV PORT AV POWER AV BFR
16 TILE 7302.9 0.472957 0.03 6530 270977 5
16 TORUS 7209.59 0.469788 0.03 5141 267719 5
25 TILE 4724.62 0.407126 0.03 3978 173925 4
25 TORUS 4656.38 0.400652 0.03 3309 170672 5
36 TILE 3278.44 0.36522 0.03 2683 117420 5
36 TORUS 3330.82 0.346969 0.03 2311 121339 5
49 TILE 2454.71 0.301596 0.03 1934 86972.7 5
49 TORUS 2488.42 0.290048 0.03 1707 89036.6 5
Figure 39. Comparison of Exec. Time/Task (Tile and Torus; 16, 25, 36 and 49 processors)
Figure 40. Comparison of Utilization (Tile and Torus; 16, 25, 36 and 49 processors)
Figure 41. Comparison of Power/Task (Tile and Torus; 16, 25, 36 and 49 processors)
Figure 42. Comparison of Port Traffic (Tile and Torus)
Analysis: Increasing the number of processors showed a decrease in average Utilization,
average Execution-Time, average Power and average Port Throughput. Similarly,
average Execution-Time/task and average Power/Task factors showed a decrease in 
performance when the Number of processors grows.
9. Comparison of average Execution time, Utilization, Throughput, Power, Port traffic
and Buffer Utilization for different Mapping algorithm evaluations.
Table 8: Comparison of HWD, Next Available and Random Mapping Algorithms
MAPPING AV EXCT AV UTL AV THR AV PRT AV PWR AV BFR
RANDOM 1417.46 0.482853 0.02 108 59950.3 5
HWD 1394.67 0.590144 0.02 107 57996.9 4
NXT AVLBL 1382.5 0.592676 0.02 108 57005 6
Figure 43. Comparison of Execution Time
Figure 44. Comparison of Power
Figure 45. Comparison of Utilization
Figure 46. Comparison of Throughput
Analysis: HWD showed better performance than Random task allocation in all cases.
HWD performed about the same as Next Available with respect to Throughput and
Utilization, but showed higher Power and Execution time factors.
Placing tasks next to each other reduces communication latency, port traffic and
power indices, so Next Available mapping performed even better than HWD
and Random.
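The gaps between the three mapping policies can be put on a common scale by expressing them as relative reductions against the Random baseline. The following is a minimal sketch of that arithmetic, using only the average execution time and power values from Table 8; the helper name `percent_reduction` is illustrative and not part of the framework.

```python
# Values copied from Table 8 (average execution time and average power).
table8 = {
    "RANDOM":    {"exec_time": 1417.46, "power": 59950.3},
    "HWD":       {"exec_time": 1394.67, "power": 57996.9},
    "NXT AVLBL": {"exec_time": 1382.5,  "power": 57005.0},
}


def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


base = table8["RANDOM"]
for name in ("HWD", "NXT AVLBL"):
    row = table8[name]
    dt = percent_reduction(base["exec_time"], row["exec_time"])
    dp = percent_reduction(base["power"], row["power"])
    print(f"{name}: execution time reduced by {dt:.2f}%, power by {dp:.2f}%")
```

Under these numbers the differences are on the order of a few percent, which is consistent with the marginal improvements described in the analysis.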
10. Comparison of average Execution time, Utilization, Throughput, Power, Port traffic
and Buffer Utilization for different Processor types.
Table 9. Comparison of Processor Type
PE TYPES  AVEXC TIME  AVUTL     AVTHROU  AVPRT  AVPWR    AVBFR
2         747.013     0.280789  0.04     115    18297.7  5
3         1002.81     0.322797  0.06     114    31016.2  6
4         1259.75     0.233069  0.03     114    46473.4  6
5         1397.6      0.234401  0.02     114    55646.4  6
6         1424.29     0.274387  0.02     114    57673.6  5
7         1587.2      0.260453  0.02     114    68477.3  6
8         1568.86     0.284396  0.02     114    68915.8  5
9         1612.95     0.282541  0.02     114    73244.8  6
[Figure 47. Comparison of Execution time (chart over 2 to 9 processor types)]
[Figure 48. Comparison of Power (chart over 2 to 9 processor types)]
[Figure 49. Comparison of Utilization (chart over 2 to 9 processor types)]
[Figure 50. Comparison of Throughput (chart of AVTHR over 2 to 9 processor types)]
Analysis: Increasing the number of processor types increased the Execution time and
Power factors and reduced task throughput. The more heterogeneous the application is,
the higher the task processing cost. Utilization remained almost the same across the
different heterogeneity variations.
11. Comparison of Execution time and Power for different Number of Tasks.
Table 10. Comparison of Execution time for different number of tasks ("-" = no value reported)
EXEC TIME
TEST BENCH  #TASK  AVG25    AVG9     AVG16
RAND 0000   50     71.50    219.54   117.63925
RAND 0001   50     81.35    217.86   126.068875
ROBOT       88     263.26   732.93   403.881375
SPARSE      96     244.28   631.54   356.809375
RAND 0000   100    157.32   436.22   251.721875
RAND 0001   100    159.76   419.03   221.74675
RAND 0000   300    467.36   1310.14  724.190625
RAND 0001   300    469.94   1313.26  744.77525
FPPP        334    892.68   2304.09  1338.35125
RAND 0000   500    791.82   2174.20  1238.93125
RAND 0001   500    761.36   727.20   1221.59875
RAND 0000   1000   1581.55  4273.37  2441.44375
RAND 0001   1000   1575.71  -        2412.87
RAND 0000   3000   4682.35  -        7324.06
RAND 0001   3000   -        -        7181.17
Table 11. Comparison of Power for different number of tasks ("-" = no value reported)
POWER
TEST BENCH  #TASK  AVG25      AVG9       AVG16
RAND 0000   50     2558.68    8773.32    4270.4925
RAND 0001   50     2912.94    8117.71    4692.24375
ROBOT       88     7318.49    24923.00   12164.3875
SPARSE      96     7512.14    20574.10   10635.045
RAND 0000   100    5663.36    16560.73   9412.215
RAND 0001   100    6025.08    15974.28   8268.68875
RAND 0000   300    16795.68   49821.58   26738.825
RAND 0001   300    16651.40   49691.35   28011.15
FPPP        334    28253.55   78016.30   42141.7625
RAND 0000   500    29226.75   82797.83   46450.175
RAND 0001   500    27576.15   23818.70   45568.425
RAND 0000   1000   57618.95   161103.25  90564.4125
RAND 0001   1000   58407.40   -          89625.25
RAND 0000   3000   172086.00  -          273342.75
RAND 0001   3000   -          -          265970.50
[Figure 51. Comparison of Execution time for different number of tasks (chart "Execution Time Vs No. of Tasks"; series PE9, PE16, PE25)]
[Figure 52. Comparison of Power for different number of tasks (chart "Power Vs No. of Tasks"; series PE9, PE16, PE25)]
Analysis: Processing a larger application (a greater number of tasks) takes longer to
execute and consumes more power than executing a small number of tasks, regardless of
the number of processors used.
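One way to read Table 10 is to normalize the average execution time by the number of tasks. The sketch below does this for the RAND 0000 benchmark only, with the values copied from Table 10; the dictionary layout and the function name `time_per_task` are illustrative choices, not part of the framework.

```python
# (number of tasks, average execution time) pairs for RAND 0000, from Table 10.
# Keys are the number of processors (PE9, PE16, PE25 in Figure 51).
rand0000 = {
    25: [(50, 71.50), (100, 157.32), (300, 467.36),
         (500, 791.82), (1000, 1581.55), (3000, 4682.35)],
    16: [(50, 117.63925), (100, 251.721875), (300, 724.190625),
         (500, 1238.93125), (1000, 2441.44375), (3000, 7324.06)],
    9:  [(50, 219.54), (100, 436.22), (300, 1310.14),
         (500, 2174.20), (1000, 4273.37)],
}


def time_per_task(rows):
    """Return (number of tasks, execution time per task) for each row."""
    return [(n, t / n) for n, t in rows]


for pes in sorted(rand0000):
    ratios = ", ".join(f"{n}: {r:.2f}" for n, r in time_per_task(rand0000[pes]))
    print(f"PE{pes} -> {ratios}")
```

For this benchmark the per-task time stays within a narrow band for a given processor count, so total execution time grows roughly in proportion to the number of tasks, while fewer processors give a higher per-task time.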
CHAPTER 7
CONCLUSIONS AND FUTURE WORK
7.1 Summary
The problem space in Multiprocessor system design requires close scrutiny of
the various design choices in order to achieve the required performance characteristics.
Multiprocessor systems emerged as a design alternative to Uniprocessors in order
to efficiently handle the ever-growing requirement for large-scale data processing.
Multiprocessor systems allow truly concurrent execution of tasks. Concurrent task
scheduling gives an overall increase in performance with respect to execution time,
throughput, power and, above all, processing time. Fault tolerance is another advantage
that Multiprocessors deliver. The freedom in task scheduling and mapping makes it
complex to allocate tasks optimally on the Multiprocessors. However, the
performance gain that MPSOCs yield more than offsets the tradeoffs imposed during the
hardware/software development phase.
Towards optimizing the scheduling and mapping strategies, a heterogeneous
MPSOC simulator was developed in SystemC in order to provide a comprehensive
simulation tool for implementing and verifying the proposed methodologies. The
simulator takes user data, sets the architecture based on the inputs, runs benchmark
files and outputs performance results. A strategic Homogenous-Workload-Distribution
Mapping policy is proposed which considers dynamic processor workload in task
mapping. A new fault tolerant algorithm to deliver a robust system is also proposed. The
fault tolerant procedure dynamically reschedules and remaps tasks in the event of
processor failure in order to meet real-time requirements.
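To make the idea of workload-aware mapping and remapping after a failure concrete, here is a minimal sketch. It is not the HWD or fault tolerant algorithm of this work: it assumes a simple "least-loaded processor first" rule, a single cost number per task, and hypothetical function names (`map_tasks`, `remap_after_failure`), and it ignores deadlines, topology and communication cost.

```python
def map_tasks(task_costs, num_pes):
    """Assign each task to the currently least-loaded processor (assumed rule)."""
    load = {pe: 0.0 for pe in range(num_pes)}
    mapping = {}
    for task, cost in task_costs.items():
        pe = min(load, key=lambda p: (load[p], p))  # ties go to the lowest id
        mapping[task] = pe
        load[pe] += cost
    return mapping, load


def remap_after_failure(task_costs, mapping, load, failed_pe):
    """Move only the failed processor's tasks onto the surviving processors."""
    new_load = {pe: l for pe, l in load.items() if pe != failed_pe}
    if not new_load:
        raise ValueError("no working processors left")
    new_mapping = dict(mapping)
    orphaned = [t for t, pe in mapping.items() if pe == failed_pe]
    for task in orphaned:
        target = min(new_load, key=lambda p: (new_load[p], p))
        new_mapping[task] = target
        new_load[target] += task_costs[task]
    return new_mapping, new_load


# Hypothetical task costs, for illustration only.
costs = {"t0": 4.0, "t1": 3.0, "t2": 2.0, "t3": 1.0}
mapping, load = map_tasks(costs, num_pes=2)
mapping2, load2 = remap_after_failure(costs, mapping, load, failed_pe=1)
print(mapping, load)
print(mapping2, load2)
```

The design choice illustrated is that a failure only displaces the tasks of the failed processor, so the rest of the mapping is left untouched.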
7.2 Contributions
For the scheduling space, the PD heuristic showed better overall performance
than EDF with respect to Execution-time/Task and Power/Task, particularly for a small
number of processors. Processor and Buffer Utilization also showed a marginal increase
for PD scheduling over EDF. However, EDF showed better Throughput.
Fault-tolerant evaluations showed that the Throughput, Buffer Utilization,
Execution-time/Task and Power/Task factors are not significantly affected even after a
processor failure occurs. The fault-tolerant scheme showed a small decrease in Processor
Utilization. Topology-wise, Tile showed better Utilization and Throughput; however, Torus
showed significantly better performance with respect to the Execution-time/Task and
Power/Task factors. Number-of-processor comparisons showed a proportional decrease in
Utilization, Execution time and Power factors when the number of processors is
increased. However, Throughput and Buffer Utilization remained the same. Executing
highly heterogeneous tasks resulted in higher power and latency cost. Finally, the
proposed HWD algorithm distributed the execution workload evenly among the processors,
which improves overall processor performance.
7.3 Future work
Currently, only STG benchmarks are supported. More descriptive benchmarks that
specify additional task attributes, application nature, power variables and data gathered
from real applications are available. Among them, E3S and TGFF are commonly used
suites. Additionally, STG is only a textual task specification format that requires a
specific parser. Enabling a graph-based task file entry method would further enhance the
framework by allowing it to interoperate with standardized task-graph specification
tools like GXL.
Graphical data input and visualization capabilities are also missing from the framework.
Powerful graph visualization tools like GRAPHVIZ are readily available and could easily
be integrated into the framework. Incorporating a means to interface graphically with
the data would enable easier performance evaluation follow-ups and simplify decision
making during the development process.
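As an illustration of how little glue such an integration might need, the following sketch writes a task graph in Graphviz's DOT text format, which the Graphviz `dot` tool can render. The edge list and the function name `to_dot` are hypothetical; nothing here is taken from the framework's source code.

```python
def to_dot(edges, name="taskgraph"):
    """Render (source, destination, cost) edges as a DOT digraph string."""
    lines = [f"digraph {name} {{"]
    for src, dst, cost in edges:
        # Each edge is labeled with its (assumed) communication cost.
        lines.append(f'    t{src} -> t{dst} [label="{cost}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-task graph: t0 feeds t1 and t2, t1 feeds t2.
print(to_dot([(0, 1, 2), (0, 2, 1), (1, 2, 3)]))
```

The resulting text can be saved to a file and passed to Graphviz, for example with `dot -Tpng graph.dot -o graph.png`.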
Topology-wise, the classical layout for MPSOCs is TILE and its variant TORUS.
Previous studies in MPSOCs, including this work, have focused on these topologies.
Little has been done on Hypercube, tree and other topologies. It is also worth trying to
analyze new topologies combined with the newly emerging communication
technologies.
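A simple way to compare candidate topologies before simulating them is to count hops between processors. The sketch below does this for a tile layout and a torus, under assumptions that are not stated in this work: the tile is modeled as a 2D mesh, processors are numbered row by row, and routing takes a shortest path. The function names are illustrative.

```python
def mesh_hops(a, b, width):
    """Hop count between nodes a and b on a 2D mesh (row-major numbering)."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def torus_hops(a, b, width, height):
    """Hop count on a torus: each dimension may also wrap around its edge."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    dx, dy = abs(ax - bx), abs(ay - by)
    return min(dx, width - dx) + min(dy, height - dy)


def average_hops(hops, num_nodes):
    """Mean hop count over all ordered pairs of distinct nodes."""
    pairs = [(a, b) for a in range(num_nodes) for b in range(num_nodes) if a != b]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)


w = h = 4  # a 16-processor layout
print("corner to corner, mesh :", mesh_hops(0, 15, w))
print("corner to corner, torus:", torus_hops(0, 15, w, h))
print("average hops, mesh :", average_hops(lambda a, b: mesh_hops(a, b, w), w * h))
print("average hops, torus:", average_hops(lambda a, b: torus_hops(a, b, w, h), w * h))
```

The same `average_hops` helper could be reused with a hop function for a hypercube or a tree to get a first, purely structural comparison.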
APPENDIX I
DETAILED SIMULATION PROFILES 
A summary of all simulation outputs is given below. Tables 12, 13 and 14
tabulate the Execution time, Processor Utilization and Processor Throughput performance
variables for all benchmark files under the different scenarios of Scheduling, Topology
and Fault Tolerance. Tables 15 - 30 tabulate the detailed
simulation results for each STG benchmark file. Additional datasheets, which specify very
detailed performance measurements obtained for individual processors, may be found on
the accompanying CD.
Table 12. Comparison of Execution time for different scenarios of performance variables
(Column key: T/R = Tile/Torus, E/S = EDF/SA scheduling, -/+ = fault mode off/on, followed by the number of processors. "-" = no result reported.)
EXECUTION TIME
T-BENCH    Task    TE-16    TS-16    TE+16    TS+16    TE-25    TS-25    TE+25    TS+25    RE-16    RS-16    RE+16    RS+16     RE-9     RS-9     RE+9     RS+9
RND 0000     50   118.36   132.74   105.90   101.58    73.20    69.90    70.55    72.34   113.24   122.93   113.78   132.60   210.62   218.80   201.20   247.56
RND 0001     50   118.58   119.28   112.21   136.19    87.38    75.32        -        -   118.24   141.51   119.93   142.63   197.89   221.27   217.80   234.49
ROBOT        88   405.04   401.71   399.33   434.25   262.74   264.07   259.78   266.44   380.33   422.39   388.95   399.06        -   732.44   721.93   744.42
SPARSE       96   356.68   357.19   351.41   374.80   247.19   249.46   235.60   244.87   351.58   363.53   344.65   354.65   616.51   637.89   625.16   646.62
RND 0000    100   235.50   251.25   232.20   278.66   164.26   159.92   151.91   153.19   250.06   267.09   241.50   257.51   430.93   444.89   405.58   463.49
RND 0001    100   213.79   249.80   211.19   106.79   157.48   162.03        -        -   235.43   260.86   244.08   252.05   406.40   419.44   399.89   450.38
RND 0000    300   731.44   698.00   715.95   745.58   459.81   469.58   459.38   480.68   715.53   726.09   729.49   731.46  1267.82  1292.20  1355.13  1325.40
RND 0001    300   748.04   752.01   743.14   748.80   463.54   476.34        -        -   730.14   752.43   728.38   755.28  1331.07  1301.69  1272.87  1347.40
FPPP        334  1320.14  1275.44  1343.14  1389.66   868.71   877.90   904.10   920.00  1352.01  1364.12  1319.35  1342.95  2238.36  2337.36  2285.60  2355.04
RND 0000    500  1172.44  1272.35  1241.74  1264.64   801.32   806.39   777.12   782.44  1248.57  1260.47  1196.09  1255.15  2138.38  2125.33  2168.56  2264.51
RND 0001    500  1203.74  1250.45  1236.41  1240.50   755.50   767.23        -        -  1181.78  1248.80  1197.50  1213.61   727.20        -        -        -
RND 0000   1000  2426.54  2514.35  2445.80  2449.84  1614.60  1557.93  1556.42  1597.26  2363.85  2471.24  2413.39  2446.54  4163.73  4324.58  4258.13  4347.04
RND 0001   1000  2370.30  2483.20  2417.09  2380.88  1577.63  1573.78        -        -        -        -        -        -        -        -        -        -
RND 0000   3000  7325.22  7339.65  7291.39  7339.96        -        -  4639.01  4725.69        -        -        -        -        -        -        -        -
RND 0001   3000  7138.12  7224.21        -        -        -        -        -        -        -        -        -        -        -        -        -        -
Table 13. Comparison of Utilization for different scenarios of performance variables
("-" = no result reported)
UTILIZATION
TEST BENCH Task  TE-16  TS-16  TE+16  TS+16  TE-25  TS-25  TE+25  TS+25  RE-16  RS-16  RE+16  RS+16   RE-9   RS-9   RE+9   RS+9
RAND 0000    50  0.261  0.301  0.395  0.339  0.361  0.288  0.291  0.303  0.361  0.457  0.311  0.358  0.516  0.538  0.402  0.447
RAND 0001    50  0.336  0.362  0.395  0.323  0.336  0.243      -      -  0.297  0.324  0.333  0.359  0.401  0.568  0.388  0.464
ROBOT        88  0.317  0.256  0.237  0.286  0.182  0.214  0.177  0.239  0.288  0.302  0.261  0.317      -  0.398  0.304  0.332
SPARSE       96  0.509  0.354  0.334  0.394  0.310  0.366  0.355  0.368  0.392  0.421  0.381  0.412  0.439  0.490  0.385  0.474
RAND 0000   100  0.354  0.452  0.476  0.281  0.304  0.381  0.310  0.395  0.294  0.337  0.430  0.479  0.484  0.580  0.399  0.492
RAND 0001   100  0.397  0.471  0.435  0.328  0.365  0.390      -      -  0.297  0.361  0.323  0.528  0.456  0.613  0.429  0.472
RAND 0000   300  0.463  0.446  0.464  0.465  0.287  0.366  0.353  0.356  0.353  0.400  0.378  0.415  0.509  0.570  0.503  0.469
RAND 0001   300  0.446  0.459  0.430  0.463  0.371  0.398      -      -  0.454  0.460  0.385  0.420  0.461  0.558  0.484  0.532
FPPP        334  0.371  0.389  0.378  0.391  0.311  0.379  0.240  0.320  0.334  0.396  0.329  0.363  0.422  0.467  0.426  0.451
RAND 0000   500  0.544  0.536  0.524  0.515  0.440  0.503  0.411  0.481  0.508  0.552  0.473  0.504  0.591  0.662  0.549  0.544
RAND 0001   500  0.466  0.447  0.425  0.436  0.360  0.382      -      -  0.379  0.443  0.408  0.429  0.358      -      -      -
RAND 0000  1000  0.475  0.469  0.413  0.437  0.372  0.412  0.382  0.399  0.417  0.415  0.428  0.445  0.523  0.580  0.492  0.502
RAND 0001  1000  0.490  0.499  0.479  0.465  0.401  0.429      -      -      -      -      -      -      -      -      -      -
RAND 0000  3000  0.493  0.511  0.471  0.455      -      -  0.394  0.408      -      -      -      -      -      -      -      -
RAND 0001  3000  0.535  0.539      -      -      -      -      -      -      -      -      -      -      -      -      -      -
Table 14. Comparison of Throughput for different scenarios of performance variables
("-" = no result reported)
THROUGHPUT
TEST BENCH Task  TE-16  TS-16  TE+16  TS+16  TE-25  TS-25  TE+25  TS+25  RE-16  RS-16  RE+16  RS+16   RE-9   RS-9   RE+9   RS+9
RAND 0000    50  0.050  0.025  0.068  0.078  0.032  0.060  0.029  0.056  0.032  0.028  0.051  0.024  0.028  0.026  0.031  0.024
RAND 0001    50  0.028  0.030  0.028  0.023  0.026  0.028      -      -  0.029  0.022  0.030  0.023  0.031  0.026  0.028  0.024
ROBOT        88  0.015  0.015  0.015  0.014  0.014  0.018  0.014  0.015  0.016  0.014  0.015  0.015      -  0.014  0.014  0.014
SPARSE       96  0.017  0.018  0.018  0.015  0.016  0.016  0.017  0.017  0.017  0.018  0.018  0.017  0.018  0.017  0.018  0.017
RAND 0000   100  0.028  0.031  0.033  0.029  0.024  0.028  0.029  0.027  0.026  0.028  0.030  0.025  0.028  0.026  0.034  0.026
RAND 0001   100  0.033  0.025  0.034  0.053  0.027  0.028      -      -  0.028  0.026  0.030  0.030  0.028  0.028  0.028  0.025
RAND 0000   300  0.026  0.028  0.026  0.026  0.027  0.027  0.027  0.029  0.028  0.027  0.026  0.027  0.027  0.026  0.026  0.026
RAND 0001   300  0.027  0.026  0.027  0.026  0.027  0.025      -      -  0.026  0.027  0.027  0.026  0.026  0.026  0.027  0.025
FPPP        334  0.017  0.018  0.018  0.021  0.018  0.017  0.019  0.019  0.017  0.018  0.018  0.019  0.019  0.017  0.018  0.017
RAND 0000   500  0.027  0.025  0.027  0.025  0.026  0.025  0.027  0.027  0.026  0.025  0.029  0.026  0.026  0.027  0.026  0.025
RAND 0001   500  0.027  0.026  0.026  0.026  0.027  0.027      -      -  0.027  0.026  0.027  0.026  0.014      -      -      -
RAND 0000  1000  0.027  0.026  0.027  0.026  0.025  0.027  0.026  0.025  0.028  0.026  0.026  0.027  0.027  0.026  0.027  0.026
RAND 0001  1000  0.027  0.026  0.026  0.027  0.026  0.026      -      -      -      -      -      -      -      -      -      -
RAND 0000  3000  0.026  0.026  0.027  0.026      -      -  0.026  0.026      -      -      -      -      -      -      -      -
RAND 0001  3000  0.027  0.026      -      -      -      -      -      -      -      -      -      -      -      -      -      -
Table 15. Comparison of Performance variables for EDF, NFT, Tile and 16 processors
Proc # Topology Fault Mode Scheduling
16 Tile Off EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 118.36 0.261 0.050 4 4341.3 6
Rand 0001 50 118.58 0.336 0.028 6 4311.2 5
Robot 88 405.04 0.317 0.015 6 12461.4 5
Sparse 96 356.68 0.509 0.017 6 10898.1 5
Rand 0000 100 235.50 0.354 0.028 12 8665.4 6
Rand 0001 100 213.79 0.397 0.033 10 7886.9 6
Rand 0000 300 731.44 0.463 0.026 93 26968.4 5
Rand 0001 300 748.04 0.446 0.027 74 28432.2 5
FPPP 334 1320.14 0.371 0.017 30 40277.5 4
Rand 0000 500 1172.44 0.544 0.027 108 43606.6 5
Rand 0001 500 1203.74 0.466 0.027 242 44956.6 6
Rand 0000 1000 2426.54 0.475 0.027 846 91086.8 5
Rand 0001 1000 2370.30 0.490 0.027 629 87612.0 6
Rand 0000 3000 7325.22 0.493 0.026 6526 274100.0 6
Rand 0001 3000 7138.12 0.535 0.027 4546 264167.0 6
Table 16. Comparison of Performance variables for SA, NFT, Tile and 16 processors
Proc # Topology Fault Mode Scheduling
16 Tile Off SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 132.74 0.301 0.025 4 5062.8 5
Rand 0001 50 119.28 0.362 0.030 6 4306.8 6
Robot 88 401.71 0.256 0.015 6 11825.7 6
Sparse 96 357.19 0.354 0.018 6 10486.1 5
Rand 0000 100 251.25 0.452 0.031 12 9424.2 6
Rand 0001 100 249.80 0.471 0.025 10 9544.5 5
Rand 0000 300 698.00 0.446 0.028 93 25692.2 6
Rand 0001 300 752.01 0.459 0.026 74 28647.5 5
FPPP 334 1275.44 0.389 0.018 30 37891.4 6
Rand 0000 500 1272.35 0.536 0.025 108 48022.9 5
Rand 0001 500 1250.45 0.447 0.026 242 46461.8 5
Rand 0000 1000 2514.35 0.469 0.026 846 94768.5 5
Rand 0001 1000 2483.20 0.499 0.026 629 93767.7 5
Rand 0000 3000 7339.65 0.511 0.026 6527 274044.0 5
Rand 0001 3000 7224.21 0.539 0.026 4546 267774.0 6
Table 17. Comparison of Performance variables for EDF, FT, Tile and 16 processors
Proc # Topology Fault Mode Scheduling
16 Tile On EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 105.90 0.395 0.068 4 3548.7 5
Rand 0001 50 112.21 0.395 0.028 6 3987.3 4
Robot 88 399.33 0.237 0.015 6 11706.5 6
Sparse 96 351.41 0.334 0.018 7 10427.8 5
Rand 0000 100 232.20 0.476 0.033 12 8466.5 5
Rand 0001 100 211.19 0.435 0.034 10 7838.3 5
Rand 0000 300 715.95 0.464 0.026 93 25949.6 7
Rand 0001 300 743.14 0.430 0.027 73 28368.6 7
FPPP 334 1343.14 0.378 0.018 31 42088.8 5
Rand 0000 500 1241.74 0.524 0.027 109 46631.7 6
Rand 0001 500 1236.41 0.425 0.026 242 46541.2 5
Rand 0000 1000 2445.80 0.413 0.027 848 90806.6 5
Rand 0001 1000 2417.09 0.479 0.026 630 88668.4 4
Rand 0000 3000 7291.39 0.471 0.027 6524 273312.0 6
Table 18. Comparison of Performance variables for SA, FT, Tile and 16 processors
Proc # Topology Fault Mode Scheduling
16 Tile On SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 101.58 0.339 0.078 4 3441.8 5
Rand 0001 50 136.19 0.323 0.023 6 5148.5 5
Robot 88 434.25 0.286 0.014 7 13908.2 6
Sparse 96 374.80 0.394 0.015 7 12075.5 5
Rand 0000 100 278.66 0.281 0.029 13 10798.7 5
Rand 0001 100 106.79 0.328 0.053 6 3602.9 6
Rand 0000 300 745.58 0.465 0.026 93 27495.5 5
Rand 0001 300 748.80 0.463 0.026 74 28184.2 6
FPPP 334 1389.66 0.391 0.021 32 45284.6 6
Rand 0000 500 1264.64 0.515 0.025 109 47316.6 5
Rand 0001 500 1240.50 0.436 0.026 242 46822.3 5
Rand 0000 1000 2449.84 0.437 0.026 849 89530.4 6
Rand 0001 1000 2380.88 0.465 0.027 628 88452.9 5
Rand 0000 3000 7339.96 0.455 0.026 6532 271915.0 6
Table 19. Comparison of Performance variables for EDF, NFT, Tile and 25 processors
Proc # Topology Fault Mode Scheduling
25 Tile Off EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 73.20 0.361 0.032 3 2622.1 5
Rand 0001 50 87.38 0.336 0.026 4 3213.0 6
Robot 88 262.74 0.182 0.014 4 7153.8 5
Sparse 96 247.19 0.310 0.016 4 7840.0 5
Rand 0000 100 164.26 0.304 0.024 9 6005.7 5
Rand 0001 100 157.48 0.365 0.027 7 5811.0 5
Rand 0000 300 459.81 0.287 0.027 58 16463.8 6
Rand 0001 300 463.54 0.371 0.027 47 16627.1 5
FPPP 334 868.71 0.311 0.018 21 27770.2 6
Rand 0000 500 801.32 0.440 0.026 69 29845.9 4
Rand 0001 500 755.50 0.360 0.027 150 27554.0 6
Rand 0000 1000 1614.60 0.372 0.025 521 59483.2 4
Rand 0001 1000 1577.63 0.401 0.026 388 58597.5 5
Table 20. Comparison of Performance variables for SA, NFT, Tile and 25 processors
Proc # Topology Fault Mode Scheduling
25 Tile Off SA
Test Bench # Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 69.90 0.288 0.060 3 2492.8 6
Rand 0001 50 75.32 0.243 0.028 4 2612.9 5
Robot 88 264.07 0.214 0.018 4 7426.6 6
Sparse 96 249.46 0.366 0.016 5 7863.2 5
Rand 0000 100 159.92 0.381 0.028 8 5759.4 6
Rand 0001 100 162.03 0.390 0.028 7 6239.2 5
Rand 0000 300 469.58 0.366 0.027 59 17118.2 4
Rand 0001 300 476.34 0.398 0.025 47 16675.7 4
FPPP 334 877.90 0.379 0.017 21 27794.4 5
Rand 0000 500 806.39 0.503 0.025 68 30155.8 5
Rand 0001 500 767.23 0.382 0.027 150 27598.3 5
Rand 0000 1000 1557.93 0.412 0.027 520 56590.5 5
Rand 0001 1000 1573.78 0.429 0.026 388 58217.3 6
Table 21. Comparison of Performance variables for EDF, FT, Tile and 25 processors
Proc# Topology Fault Mode Scheduling
25 Tile On EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 70.55 0.291 0.029 3 2474.4 6
Robot 88 259.78 0.177 0.014 4 7358.7 5
Sparse 96 235.60 0.355 0.017 5 6649.8 5
Rand 0000 100 151.91 0.310 0.029 8 5407.9 4
Rand 0000 300 459.38 0.353 0.027 58 16257.2 5
FPPP 334 904.10 0.240 0.019 21 28451.1 6
Rand 0000 500 777.12 0.411 0.027 69 28142.4 5
Rand 0000 1000 1556.42 0.382 0.026 521 56375.5 6
Rand 0000 3000 4639.01 0.394 0.026 3976 170136.0 5
Table 22. Comparison of Performance variables for SA, FT, Tile and 25 processors
Proc # Topology Fault Mode Scheduling
25 Tile On SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 72.34 0.303 0.056 3 2645.4 4
Robot 88 266.44 0.239 0.015 4 7334.8 5
Sparse 96 244.87 0.368 0.017 4 7695.5 5
Rand 0000 100 153.19 0.395 0.027 9 5480.5 5
Rand 0000 300 480.68 0.356 0.029 58 17343.5 5
FPPP 334 920.00 0.320 0.019 21 28998.5 6
Rand 0000 500 782.44 0.481 0.027 68 28762.9 5
Rand 0000 1000 1597.26 0.399 0.025 522 58026.6 5
Rand 0000 3000 4725.69 0.408 0.026 3978 174036.0 5
Table 23. Comparison of Performance variables for EDF, NFT, Torus and 16 PEs
Proc # Topology Fault Mode Scheduling
16 Torus Off EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 113.24 0.361 0.032 4 4140.3 5
Rand 0001 50 118.24 0.297 0.029 5 4274.8 5
Robot 88 380.33 0.288 0.016 6 11005.0 5
Sparse 96 351.58 0.392 0.017 6 10331.1 5
Rand 0000 100 250.06 0.294 0.026 11 9527.6 6
Rand 0001 100 235.43 0.297 0.028 9 8680.7 5
Rand 0000 300 715.53 0.353 0.028 77 26717.3 5
Rand 0001 300 730.14 0.454 0.026 62 26967.5 4
FPPP 334 1352.01 0.334 0.017 29 42031.3 6
Rand 0000 500 1248.57 0.508 0.026 92 47421.1 5
Rand 0001 500 1181.78 0.379 0.027 196 43991.3 5
Rand 0000 1000 2363.85 0.417 0.028 677 87067.0 5
Table 24. Comparison of Performance variables for SA, NFT, Torus and 16 processors
Proc # Topology Fault Mode Scheduling
16 Torus Off SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 122.93 0.457 0.028 4 4619.6 5
Rand 0001 50 141.51 0.324 0.022 5 5540.1 5
Robot 88 422.39 0.302 0.014 6 13556.3 6
Sparse 96 363.53 0.421 0.018 7 11360.8 6
Rand 0000 100 267.09 0.337 0.028 11 10271.6 4
Rand 0001 100 260.86 0.361 0.026 9 9940.0 5
Rand 0000 300 726.09 0.400 0.027 77 26843.2 6
Rand 0001 300 752.43 0.460 0.027 62 28406.6 5
FPPP 334 1364.12 0.396 0.018 29 44082.3 5
Rand 0000 500 1260.47 0.552 0.025 91 47917.3 6
Rand 0001 500 1248.80 0.443 0.026 196 47436.2 6
Rand 0000 1000 2471.24 0.415 0.026 676 91705.2 6
Table 25. Comparison of Performance variables for EDF, FT, Torus and 16 processors
Proc# Topology Fault Mode Scheduling
16 Torus On EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 113.78 0.311 0.051 4 4083.9 4
Rand 0001 50 119.93 0.333 0.030 5 4377.0 4
Robot 88 388.95 0.261 0.015 6 11254.3 5
Sparse 96 344.65 0.381 0.018 7 9881.2 6
Rand 0000 100 241.50 0.430 0.030 11 8739.8 6
Rand 0001 100 244.08 0.323 0.030 9 9231.9 4
Rand 0000 300 729.49 0.378 0.026 77 27118.0 5
Rand 0001 300 728.38 0.385 0.027 62 27096.3 5
FPPP 334 1319.35 0.329 0.018 29 42155.7 5
Rand 0000 500 1196.09 0.473 0.029 93 44222.2 4
Rand 0001 500 1197.50 0.408 0.027 197 43906.7 5
Rand 0000 1000 2413.39 0.428 0.026 675 89731.1 6
Table 26. Comparison of Performance variables for SA, FT, Torus and 16 processors
Proc # Topology Fault Mode Scheduling
16 Torus On SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 132.60 0.358 0.024 4 4925.6 5
Rand 0001 50 142.63 0.359 0.023 5 5592.2 6
Robot 88 399.06 0.317 0.015 6 11597.7 6
Sparse 96 354.65 0.412 0.017 7 9619.8 6
Rand 0000 100 257.51 0.479 0.025 11 9403.9 6
Rand 0001 100 252.05 0.528 0.030 10 9424.4 6
Rand 0000 300 731.46 0.415 0.027 77 27126.4 7
Rand 0001 300 755.28 0.420 0.026 62 27986.3 6
FPPP 334 1342.95 0.363 0.019 29 43322.5 5
Rand 0000 500 1255.15 0.504 0.026 91 46463.0 5
Rand 0001 500 1213.61 0.429 0.026 197 44431.3 7
Rand 0000 1000 2446.54 0.445 0.027 677 89819.7 6
Table 27. Comparison of Performance variables for EDF, NFT, Torus and 9 processors
Proc# Topology Fault Mode Scheduling
9 Torus Off EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 210.62 0.516 0.028 7 8232.2 5
Rand 0001 50 197.89 0.401 0.031 8 7118.7 6
Sparse 96 616.51 0.439 0.018 9 19408.1 5
Rand 0000 100 430.93 0.484 0.028 18 16649.4 5
Rand 0001 100 406.40 0.456 0.028 14 15499.3 5
Rand 0000 300 1267.82 0.509 0.027 130 46522.7 6
Rand 0001 300 1331.07 0.461 0.026 104 51340.7 6
FPPP 334 2238.36 0.422 0.019 44 72956.2 5
Rand 0000 500 2138.38 0.591 0.026 153 81605.0 5
Rand 0001 500 727.20 0.358 0.014 9 23818.7 5
Rand 0000 1000 4163.73 0.523 0.027 1179 155229.0 5
Table 28. Comparison of Performance variables for SA, NFT, Torus and 9 processors
Proc# Topology Fault Mode Scheduling
9 Torus Off SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 218.80 0.538 0.026 6 8321.7 5
Rand 0001 50 221.27 0.568 0.026 8 8360.9 3
Robot 88 732.44 0.398 0.014 9 25190.4 6
Sparse 96 637.89 0.490 0.017 9 21413.5 6
Rand 0000 100 444.89 0.580 0.026 17 16923.1 5
Rand 0001 100 419.44 0.613 0.028 14 15621.7 6
Rand 0000 300 1292.20 0.570 0.026 131 48311.9 6
Rand 0001 300 1301.69 0.558 0.026 104 48929.7 5
FPPP 334 2337.36 0.467 0.017 44 78595.4 6
Rand 0000 500 2125.33 0.662 0.027 153 80068.8 5
Rand 0000 1000 4324.58 0.580 0.026 1181 163877.0 6
Table 29. Comparison of Performance variables for EDF, FT, Torus and 9 processors
Proc# Topology Fault Mode Scheduling
9 Torus On EDF
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 201.20 0.402 0.031 6 8149.5 5
Rand 0001 50 217.80 0.388 0.028 8 7994.8 5
Robot 88 721.93 0.304 0.014 9 24527.3 5
Sparse 96 625.16 0.385 0.018 9 19866.0 4
Rand 0000 100 405.58 0.399 0.034 18 14354.4 6
Rand 0001 100 399.89 0.429 0.028 15 15049.2 6
Rand 0000 300 1355.13 0.503 0.026 131 52828.1 5
Rand 0001 300 1272.87 0.484 0.027 105 48509.6 5
FPPP 334 2285.60 0.426 0.018 44 75977.1 6
Rand 0000 500 2168.56 0.549 0.026 153 82887.5 5
Rand 0000 1000 4258.13 0.492 0.027 1184 160347.0 6
Table 30. Comparison of Performance variables for SA, FT, Torus and 9 processors
Proc# Topology Fault Mode Scheduling
9 Torus On SA
Test Bench #Task Exec Time Utilization Throughput Port Power Buffer
Rand 0000 50 247.56 0.447 0.024 6 10389.9 6
Rand 0001 50 234.49 0.464 0.024 8 8996.5 6
Robot 88 744.42 0.332 0.014 10 25051.3 4
Sparse 96 646.62 0.474 0.017 9 21608.8 6
Rand 0000 100 463.49 0.492 0.026 18 18316.0 6
Rand 0001 100 450.38 0.472 0.025 15 17726.9 6
Rand 0000 300 1325.40 0.469 0.026 131 51623.6 7
Rand 0001 300 1347.40 0.532 0.025 105 49985.4 7
FPPP 334 2355.04 0.451 0.017 45 84536.5 5
Rand 0000 500 2264.51 0.544 0.025 154 86630.0 5
Rand 0000 1000 4347.04 0.502 0.026 1179 164960.0 4
APPENDIX III
BENCHMARK SUITES
A benchmark suite contains a group of task-graph files that are generated randomly
or modeled from real applications. Benchmarks are used to verify mapping and
scheduling algorithms during the prototype development process. These suites define
tasks in an abstract way, so the contents of the tasks are only loosely represented. The
suites differ in their sets of task attributes, file format and target application.
Benchmark file attributes describe the task behavior with respect to task processing
time, task deadline, power consumption and communication latency. Among the
existing benchmarks, STG, E3S, GXL and TGFF are commonly used in the Multiprocessor
scheduling problem space.
Standard Task Graph Set (STG)
STG is a benchmark specifically designed to evaluate Multiprocessor scheduling 
algorithms. The STG suite has random and program generated task-graphs. An STG 
specifies tasks using a DAG where the nodes represent application tasks and a weighted 
edge, if specified, represents the communication between nodes along with the cost.
The STG benchmark has two parts. The first part defines attributes for each task in a
matrix format as: <ID> <COM-DELAY> <EXEC-TIME> <NUM-OF-PRED> <PRED-LIST>.
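A reader for the first part of such a file can be quite small. The sketch below follows the field order given above and makes several assumptions that are not taken from the STG specification or from the framework: fields are whitespace-separated integers, lines starting with '#' belong to the second part and are skipped, and any line with fewer than four fields (for example a task-count header) is ignored. The `Task` class, the `parse_stg` function and the sample text are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: int
    com_delay: int
    exec_time: int
    preds: list = field(default_factory=list)


def parse_stg(text):
    """Parse task lines of the form ID COM-DELAY EXEC-TIME NUM-OF-PRED PRED-LIST."""
    tasks = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or metadata from the second part of the file
        fields = [int(tok) for tok in line.split()]
        if len(fields) < 4:
            continue  # assumed header line, e.g. one holding only a task count
        task_id, com_delay, exec_time, num_pred = fields[:4]
        preds = fields[4:4 + num_pred]
        if len(preds) != num_pred:
            raise ValueError(f"task {task_id}: expected {num_pred} predecessors")
        tasks[task_id] = Task(task_id, com_delay, exec_time, preds)
    return tasks


# Hypothetical three-task file, for illustration only.
sample = """
0 0 0 0
1 2 5 1 0
2 1 3 2 0 1
# metadata lines of the second part would follow here
"""
for task in parse_stg(sample).values():
    print(task)
```

Keeping predecessors as a list per task gives the DAG directly: an edge runs from each entry in `preds` to the task that lists it.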
The second part of STG starts with a '#' symbol and specifies the following
information about the file: i) the source of the application, ii) the method used in
generating the task-graph file, iii) parallelization of tasks, iv) critical path length,
v) average processing time, vi) precedence limits and vii) edge connection probability.
The number of tasks for a few STG files is given in the following table.
Table 31: Task size for STG Benchmark files
TEST BENCH      # TASKS
RAND 0000.stg   50
RAND 0001.stg   50
ROBOT.stg       88
SPARSE.stg      96
RAND 0000.stg   100
RAND 0001.stg   100
RAND 0000.stg   300
RAND 0001.stg   300
FPPP.stg        334
RAND 0000.stg   500
RAND 0001.stg   500
RAND 0000.stg   1000
RAND 0001.stg   1000
RAND 0000.stg   3000
RAND 0001.stg   3000
RAND 0000.stg   5000
RAND 0001.stg   5000
Detailed information on STG benchmark suite and STG file downloads can be found at 
STG website: <http://www.kasahara.elec.waseda.ac.jp/schedule/>
Task Graph For Free (TGFF)
TGFF [19] is a widely used tool written in C++ that generates random task-graphs
which can be used in evaluating mapping and scheduling algorithms for Multiprocessor
environments. TGFF is useful in areas such as software/hardware co-design, embedded
system applications and DAG task-graph generation. Detailed information on TGFF and
the complete documentation can be found at:
<http://ziyang.eecs.northwestern.edu/~dickrp/TGFF/>
<http://ziyang.eecs.northwestern.edu/~dickrp/TGFF/manual.pdf>
Graph Exchange Language (GXL)
GXL is a powerful XML-based standard graph exchange format that is useful for
software developers to transport design specifications among software tools. It
facilitates interoperability among graph-based tools in exchanging graphs, and also
non-graphical raw data represented as graphs. Graphs can be typed, attributed, directed
or undirected, and ordered; they are written in XML following the GXL document
type definition (DTD). Graph attributes such as weighted edges and nodes are specified
in named XML tags. Further information can be found at: <http://www.gupro.de/GXL/>
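As a rough illustration of what a task graph might look like in GXL, the sketch below builds a small document with Python's standard `xml.etree.ElementTree` module. It uses the GXL elements `gxl`, `graph`, `node`, `edge`, `attr` and `int`; the attribute names `exec_time` and `com_cost`, the node naming scheme and the function name are assumptions made for this example, and the output is not validated against the GXL DTD.

```python
import xml.etree.ElementTree as ET


def task_graph_to_gxl(graph_id, exec_times, edges):
    """Build a GXL string for a directed task graph with weighted nodes and edges."""
    gxl = ET.Element("gxl")
    graph = ET.SubElement(gxl, "graph", {"id": graph_id, "edgemode": "directed"})
    for task, exec_time in exec_times.items():
        node = ET.SubElement(graph, "node", {"id": f"t{task}"})
        attr = ET.SubElement(node, "attr", {"name": "exec_time"})  # assumed name
        ET.SubElement(attr, "int").text = str(exec_time)
    for src, dst, cost in edges:
        edge = ET.SubElement(graph, "edge", {"from": f"t{src}", "to": f"t{dst}"})
        attr = ET.SubElement(edge, "attr", {"name": "com_cost"})  # assumed name
        ET.SubElement(attr, "int").text = str(cost)
    return ET.tostring(gxl, encoding="unicode")


# Hypothetical two-task graph: t0 (5 time units) sends data to t1 (3 time units).
print(task_graph_to_gxl("example", {0: 5, 1: 3}, [(0, 1, 2)]))
```

A graph-based file entry method, as suggested under future work, could read such a document back with the same module instead of a format-specific parser.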
Embedded Systems Synthesis Benchmark Suite (E3S)
E3S is another recently introduced benchmark suite, mainly designed for embedded
system evaluation, based on 17 processors manufactured by popular companies
including Texas Instruments, Motorola and AMD. It specifies various task
parameters and describes a processor database, which makes it more suitable for
embedded application evaluation. It specifies task attributes such as execution times
and power ratings obtained from manufacturer datasheets, and additional information
that can be used as benchmarks during the system design process. E3S considers a variety
of real-life applications in areas like auto-industry, networking, telecom, consumer and
office-automation. Detailed information can be found at the developer's website:
<http://ziyang.eecs.northwestern.edu/~dickrp/E3S/>.
APPENDIX IV 
SYSTEMC
SystemC [5] is a C++ class library aimed at extending the language to
support hardware specification, modeling and simulation at the system level of abstraction.
SystemC is a new standard for a C++ based hardware description/modeling language
(HDL) developed by the Open SystemC Initiative (OSCI). SystemC can be
compiled with most C++ compilers. There is a growing popularity for SystemC because:
It is open source, so designers can use it for free with no licensing issues.
Most designers are familiar with C/C++ design specifications. This alleviates the need for
system designers to learn other design tools.
Design specifications can be exchanged among system designers without exporting the
design to other tools. Extensive SOC designs and intellectual property components have
been developed in C++, which makes them easy to integrate in new designs.
SystemC limitations: Currently, SystemC lacks extended program constructs to control
the simulation process. Program process controls such as suspending the clock, stopping
and restarting simulation, enabling and disabling processes/threads, unbound port
declarations, and runtime module creation and modification are missing from the
language. The lack of these constructs may pose limitations during the design
prototyping phase. Additionally, SystemC has no GUI. Work focused on SystemC is
under way in an effort to alleviate these language limitations. More information on
SystemC can be found on the founder's website: <http://www.systemC.org/>.
Framework Source Code:
The source code for the framework is included on the accompanying DVD at the back
cover of this book. Please refer to the general documentation for copyright notes and
software documentation. More simulation results are also provided on the CD.
REFERENCES
[1] Ahmed Amine Jerraya, Wayne Wolf. Multiprocessor Systems-On-Chips. Morgan
Kaufmann Series in Systems on Silicon. 2005.
[2] Ashwini Raina. FUSE-N: Framework for Unified Simulation Environment for
Network-On-Chip. Thesis paper. 2004.
[3] Calin Ciordas, Andreas Hansson, Kees Goossens, Twan Basten. A Monitoring-Aware
Network-On-Chip Design Flow. Manuscript Draft. December 2006.
[4] Christian Neeb. Designing Efficient Irregular Networks for Heterogeneous Systems-
on-Chip. Manuscript draft. Elsevier Editorial System for Journal of Systems
Architecture. 2006.
[5] David Berner, Jean-Pierre Talpin. SystemCXML: An Extensible SystemC Front End
Using XML. <http://www.davidberner.de/publications/05fdl berner.pdf>
[6] El-Rewini And Mostafa Abd-El-Barr. Advanced Computer Architecture And Parallel
Processing. A John Wiley & Sons, Inc Publication. 2005.
[7] Ewerson Carvalho, Ney Calazans, Fernando Moraes. Heuristics for Dynamic Task
Mapping In NoC-Based Heterogeneous MPSOCs. 18th IEEE/IFIP International
Workshop on Rapid System Prototyping. 2007.
[8] Hee-Joong Ahn, Moon-Haeng Cho, Myoung-Jo Jung, Yong-Hee Kim, Joo-Man Kim,
And Cheol-Hoon Lee. Ubifos: A Small Real-Time Operating System for Embedded
Systems. ETRI Journal, Volume 29, Number 3, June 2007.
[9] Hristo Nikolov, Todor Stefanov, And Ed Deprettere. Systematic And Automated
Multiprocessor System Design, Programming, And Implementation. IEEE
Transactions on Computer-Aided Design of Integrated Circuits And Systems, Vol. 27,
No. 3, March 2008.
[10] Jari Nurmi, Hannu Tenhunen, Jouni Isoaho and Axel Jantsch. Interconnect-Centric 
Design for Advanced SoC and NoC. Kluwer Academic Publishers. 2004.
[11] Joseph A. Fisher, Paolo Faraboschi and Cliff Young. Embedded Computing: A VLIW 
Approach to Architecture, Compilers, and Tools. 2005.
[12] Lothar Thiele, Iuliana Bacivarov, Wolfgang Haid, Kai Huang. Mapping Applications to 
Tiled Multiprocessor Embedded Systems. IEEE Seventh International Conference on 
Application of Concurrency to System Design. 2007.
[13] Martino Ruggiero, Alessio Guerri, Davide Bertozzi, Francesco Poletti and Michela 
Milano. Communication-Aware Allocation and Scheduling Framework for Stream-
Oriented Multi-Processor Systems-On-Chip. Proceedings of the Conference on 
Design, Automation and Test in Europe. European Design and Automation 
Association. 2006.
[14] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. 2008.
[15] Nguyen Duc Thai. Real-Time Scheduling in Distributed Systems. IEEE Proceedings of 
the International Conference on Parallel Computing in Electrical Engineering. 2002.
[16] Peter Brucker. Scheduling Algorithms. Fifth Edition. October 2006
[17] Peter Grun, Nikil Dutt and Alexandru Nicolau. Memory Architecture Exploration for 
Programmable Embedded Systems. Kluwer Academic Publishers. 2002.
[18] Rickard Holsmark, Maurizio Palesi, and Shashi Kumar. Deadlock free routing 
algorithms for irregular mesh topology NoC systems with rectangular regions. 
Journal of Systems Architecture. 2007.
[19] Robert P. Dick, David L. Rhodes, and Wayne Wolf. TGFF: Task Graphs For Free. 
IEEE Proc. of the Sixth Intl. Workshop on HW/SW Co-Design. March 1998.
[20] Ronald Hecht, Stephan Kubisch, Andreas Herrholtz, Dirk Timmermann. Dynamic 
Reconfiguration with Hardwired Networks-On-Chip on Future FPGAs. International 
Conference on Field Programmable Logic and Applications, 2005. Aug. 2005.
[21] Sander Stuijk, Twan Basten, Marc Geilen, Amir Hossein Ghamarian and Bart 
Theelen. Resource-Efficient Routing And Scheduling of Time-Constrained Streaming 
Communication on Networks-On-Chip. Manuscript Draft, 2006.
[22] Sang-Young Cho, Yoojin Chung, and Jung-Bae Lee. Virtual Development 
Environment Based on SystemC for Embedded Systems. ICCS 2007, Part IV.
[23] Sorin Manolache, Petru Eles, and Zebo Peng. Real-Time Applications With 
Stochastic Task Execution Times, Analysis and Optimization. 2007.
[24] Sudeep Pasricha, Nikil D. Dutt, and Mohamed Ben-Romdhane. BMSYN: Bus Matrix 
Communication Architecture Synthesis for MPSoC. IEEE Transactions on 
Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 8, August 2007.
[25] Sundararajan Sriram and Shuvra S. Bhattacharyya. Embedded Multiprocessor 
Scheduling and Synchronization. Signal Processing and Communication Series. 2000.
[26] Wayne Wolf. High-Performance Embedded Computing. 2007.
[27] Xinping Zhu and Wei Qin. Prototyping a Fault-Tolerant Multiprocessor SoC With 
Run-Time Fault Recovery. 2006.
VITA
Graduate College 
University of Nevada, Las Vegas
Bisrat Tafesse
Home:
6222 Meadow Vista Lane 
Las Vegas, Nevada
Degree:
Bachelor of Science in Computer Science and IT, June 2006 
Alemaya University, Alemaya, ETHIOPIA.
Special Honors and Awards:
Golden Key International Honor Society 
Alliance of Professionals of African Heritage
Thesis Title: Framework for Simulation of Fault Tolerant Heterogeneous 
Multiprocessor System-On-Chip
Thesis Examination Committee:
Chairperson, Dr. Venkatesan Muthukumar, Ph.D. 
Committee Member, Dr. Emma Regentova, Ph.D. 
Committee Member, Dr. Mei Yang, Ph.D. 
Graduate Faculty Representative, Dr. Laxmi P. Gewali, Ph.D.
