Software profiling for an FPGA-based CPU core.
Jason G. Tong, University of Windsor
Electronic Theses and Dissertations, 1-1-2007
Recommended Citation
Tong, Jason G., "Software profiling for an FPGA-based CPU core." (2007). Electronic Theses and Dissertations. 6963.
https://scholar.uwindsor.ca/etd/6963
This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor 
students from 1954 forward. These documents are made available for personal study and research purposes only, 
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution, 
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder 
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would 
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or 
thesis from this database. For additional inquiries, please contact the repository administrator via email 
(scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.
Software Profiling For An FPGA-Based CPU Core
by
Jason G. Tong
A Thesis
Submitted to the Faculty of Graduate Studies and Research
through Electrical and Computer Engineering
in Partial Fulfillment of the Requirements for the
Degree of Master of Applied Science at the
University of Windsor
Windsor, Ontario, Canada
2007
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada
ISBN: 978-0-494-34988-5
NOTICE:
The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.
The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
© 2007 Jason G. Tong
All Rights Reserved. No Part of this document may be reproduced, stored or otherwise retained in a retrieval system or transmitted in any form, on any medium by any means without prior written permission of the author.
Abstract
Profiling tools are computer-aided design (CAD) tools that help in determining the computationally intensive portions of a software program. They are used by embedded system designers to choose computationally intensive functions of the software program for hardware implementation and acceleration. This thesis presents a detailed discussion of the various profiling tools available for embedded system design. In addition, an FPGA-based profiling (FPGA-BP) tool, the Airwolf Profiler, was developed and used to profile a set of software benchmarks. The accuracy of the profiled results was compared against a well-known software-based profiling tool, GNU's gprof. It is shown that Airwolf provides up to 66.2% improvement in accuracy of profiled results and reduces the run time performance overhead, caused by software-based profiling tools, by up to 41.3%. This helps embedded designers in choosing the computationally intensive functions for hardware acceleration.
To my family for their unending love and support.
Acknowledgments
The day is finally here! I have successfully completed one of my life-time achievements, a Master's Degree in Electrical and Computer Engineering. There are several people who I would like to acknowledge in this dissertation.
First and foremost, I would like to give my sincerest thanks to my supervisor, Professor Mohammed A. S. Khalid. I am indebted for his invaluable advice, encouragement, moral support and guidance throughout my Master's research. His professionalism, knowledge and expertise will never be forgotten. I will always value our research discussions that we had over the last few years. Next, I would like to thank my thesis committee members: Professors Narayan Kar and Nader Zamani, for their invaluable suggestions, and support throughout this project. Special thanks to Professor Huapeng Wu for his valuable time chairing the M.A.Sc. Defence. Also, I would like to give a very special thank you to Lesley Shannon and Blair Fort from University of Toronto for their invaluable advice, time and assistance in this project.
My friends and colleagues from Professor Khalid's Research group (in order of appearance): Kevin Banovic, Amir Yazdanshenas, Ian Anderson, Raymond Lee, and Marwan Kanaan. I thank you all for being the greatest "cell"-mates and making my experience in EH107D and EH268 an enjoyable one. My sincere thanks go out to Kevin Biswas, Harb Abdul-Hamid, Matthew Meloche, Ashkan Hosseinzadeh Namin, Mitra Mirhassani, Mahzad Azarmehr, Ali Bidabadi, Josh Daniel and Natalia Salgo for their friendship and support during my stay.
My heartfelt thanks go out to Lisa Price, for her editing skills and great patience in revising a majority of my papers over the years, including this thesis. Also for her continuing friendship and support she has given to me.
To Ralene Marcoccia, the Altera University Program, and the Altera Corporation, I thank you for providing the Nios II Development FPGA boards and the full licenses for the development software.
Finally and most importantly, I am indebted to my parents Yim and May Tong for their everlasting love, understanding and moral support throughout my Master's journey. This voyage would not have been easy to embark on without them.
Financial and equipment support of this research was provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada, Canadian Microelectronics Corporation (CMC) and the University of Windsor.
Contents
Abstract
Dedication
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Profiling Tools for FPGA-Based Embedded Systems
  1.2 Thesis Objectives
  1.3 Thesis Organization
2 Design Methodologies for Embedded Systems
  2.1 Traditional Design Methodology
  2.2 Hardware-Software Co-Design Methodology
  2.3 Function-Architecture Co-Design
  2.4 Platform-Based Design
  2.5 Summary
3 Profiling Tools
  3.1 Profiling Tools and the Software Profiling Methodology
  3.2 Software Based Profiling (SBP) Tools
    3.2.1 Instruction Set Simulator
    3.2.2 GNU's gprof
    3.2.3 Intel's VTune
    3.2.4 Summary of SBP Tools
  3.3 Software Based Memory Profilers (SBMP)
    3.3.1 Valgrind
    3.3.2 Rational Software's Purify
    3.3.3 Summary of SBMP Tools
  3.4 Hardware-Counter Based Profiling (HCBP) Tools
    3.4.1 Hardware Counters Approach
    3.4.2 Page Migration Approach
    3.4.3 Desktop Processor Profiling Counters
    3.4.4 Summary of HCBP Tools
  3.5 FPGA-Based Profiling (FPGA-BP) Tools
    3.5.1 SnoopP
    3.5.2 Frequent Loop Analysis Tool (FLAT)
    3.5.3 WOoDSTOCK
  3.6 Qualitative Comparison of Profiling Tools
4 The Airwolf Profiler
  4.1 The Airwolf Architecture
  4.2 Airwolf Profiling Counter
  4.3 Airwolf's Software Drivers
  4.4 Summary
5 Experimental Results
  5.1 The Nios II Profiling Environment
  5.2 FPGA Development Board and Design CAD Tools
  5.3 Profiling Tools Setting
  5.4 Profiling Software Benchmarks
  5.5 Comparison of Profiled Results
    5.5.1 Dijkstra
    5.5.2 Fibo_Matrix_Mult
    5.5.3 Game of Life
    5.5.4 BitCount
    5.5.5 Dhrystone
    5.5.6 Summary
  5.6 Performance Overhead Analysis
    5.6.1 Dijkstra
    5.6.2 Fibo_Matrix_Mult
    5.6.3 Game of Life
    5.6.4 BitCount
    5.6.5 Dhrystone
    5.6.6 Summary
6 Conclusions and Future Work
  6.1 Research Contributions
  6.2 Future Work
References
Vita Auctoris
List of Figures
2.1 The Traditional Design Methodology
2.2 The Hardware-Software Co-Design Methodology
2.3 The Function-Architecture Co-Design Methodology
2.4 Design Space Exploration
2.5 Platform Based Design
3.1 Software Profiling Methodology
3.2 Profiling Tool Classification
3.3 Rational Purify's Memory Profiling Colour Code
3.4 Page Migration Approach
3.5 SnoopP's Profiling Architecture
3.6 SnoopP's Profiling Counter
3.7 Frequent Loop Analysis Tool
3.8 Watching Over Data Streaming on Computing Element Links
4.1 The Airwolf Profiler
4.2 The Airwolf Profiling Counter
4.3 An Example of Airwolf's Software Drivers
5.1 The Nios II Profiling Environment
List of Tables
3.1 Comparison of Profiling Tools
5.1 Nios Development Board Components
5.2 Benchmark Descriptions
5.3 Profiled Results for Dijkstra
5.4 Profiled Results for Fibo_Matrix_Mult
5.5 Profiled Results for Game of Life using Nios2-gprof
5.6 Profiled Results for Game of Life using Airwolf
5.7 Profiled Results for BitCount using Nios2-gprof
5.8 Profiled Results for BitCount using Airwolf
5.9 Profiled Results for Dhrystone
5.10 Performance Overhead Analysis for Dijkstra
5.11 Performance Overhead Analysis for Fibo_Matrix_Mult
5.12 Performance Overhead Analysis for Game of Life
5.13 Performance Overhead Analysis for BitCount
5.14 Performance Overhead Analysis for Dhrystone
List of Abbreviations
Abbreviation   Definition
AIB            Avalon Interface Bus
AMD            Advanced Micro Devices
API            Application Programming Interface
ASIC           Application Specific Integrated Circuit
CAD            Computer Aided Design
CE             Counter Enable
CPE            Computing Processor Element
CPU            Central Processing Unit
D$             Data Cache
DSP            Digital Signal Processing
DTLB           Data Translation Lookaside Buffer
FCN            Function
FLAT           Frequent Loop Analysis Tool
FLC            Frequent Loop Cache
FPGA           Field Programmable Gate Array
FPGA-BP        Field Programmable Gate Array-Based Profiling
FSL            Fast Simplex Link
HCBP           Hardware-Counter Based Profiling
HCEL           Hits Counter Enable Line
HDL            Hardware Description Language
I$             Instruction Cache
IC             Integrated Circuit
IDE            Integrated Development Environment
IP             Intellectual Property
ISR            Interrupt Service Request
ISS            Instruction Set Simulator
LSW            Least Significant Word
MSW            Most Significant Word
Nios-II-PE     Nios II Profiling Environment
PAPI           Performance Application Programming Interface
PBD            Platform Based Design
PC             Program Counter
PMA            Page Migration Approach
RAM            Random Access Memory
SBB            Short Backwards Branch
SBMP           Software-Based Memory Profiling
SBP            Software-Based Profiling
SOF            Static-RAM Object File
SOPC           System On Programmable Chip
SOT            Sampling Over Time
SPM            Software Profiling Methodology
TCE            Time Counter Enable
TCEL           Time Counter Enable Line
UART           Universal Asynchronous Receiver Transmitter
WOoDSTOCK      Watches Over Data STreaming On Computing element linKs
Chapter 1
Introduction
1.1 Profiling Tools for FPGA-Based Embedded Systems
In recent years, embedded systems have grown in popularity due to their increased processing power. They are prevalent in modern society, where they are used in a wide variety of applications ranging from simple everyday tasks to product manufacturing. Commonly used embedded systems include cell phones, electronic pagers, television remote controls, digital cameras, personal digital assistants, DVD players, HDTV and much more. In large industrial companies, embedded systems are used as programmable controllers for manufacturing, nuclear power generation, transportation and medical instrumentation.
These embedded systems consist of a hardware platform and software code working together to execute specific computation, control and communication tasks. A typical embedded system contains a processor core, memory storage and general input/output interfaces. Of the microprocessors currently produced, 99% are used for embedded systems applications [67]. The purpose of these systems is to execute software application code that is stored in memory. Due to the limitations in the hardware resources of these systems, they cannot be as flexible and reprogrammable as a desktop computer. Desktop computers are general-purpose computers containing various hardware components which can be programmed to implement any application or function. Embedded systems have dedicated and limited hardware resources that are designed specifically for performing the tasks of a particular application.
The continuing advancement and innovation of embedded systems, resulting in increased complexity, has led designers to significantly intensify their development efforts during the design process. In addition to the added difficulty, consumer demand for these devices continues to rise, which has shortened design cycles and tightened time-to-market deadlines. The design of embedded systems is becoming significantly more difficult without the use of computer-aided design (CAD) tools that can effectively partition the components into the hardware or software domains. There are other added constraints that designers must consider, such as the reduction of Integrated Circuit (IC) chip area and system power consumption while sustaining maximum performance [70].
The overall objective in the development of embedded systems is to create an efficient, optimized and balanced hardware-software partition. It involves placing certain components in the hardware domain and others in the software domain. These hardware and software components execute concurrently to implement a function. The hardware-software partition determines the quality of the embedded system based on its performance. There are automated partitioning algorithms; however, they require information on the system's performance prior to partitioning the embedded system's components [63]. This is where profiling tools become vital, since they determine which components are the performance bottlenecks and which components meet the timing requirements.
Profiling tools are CAD tools that measure the performance of a software or hardware system based on the time needed to perform certain functions. They also help in detecting problems such as communication bottlenecks in a system, cache misses and other important measurable performance metrics. They allow early detection of performance bottlenecks and help the embedded system designers to optimize their designs in order to meet system performance constraints [60, 51].
There are several profiling tools available today that can be used to profile software code running on a target processor. These tools provide different profiling information that can help embedded designers optimize the software code. Despite the variety of profiling tools that are available, many of them use different measuring techniques that can potentially provide inaccurate feedback. The majority of the profiling tools used are software-based, which require the designer to compile their software programs to include instrumentation code at the binary level. This is not desirable since it is very intrusive to the original program and can cause unpredictable execution behaviour of the software. Sampling techniques are also used in a variety of profiling tools and can provide varying results depending on the sampling frequency of the profiler. This consequently affects the accuracy of the profiled results, which can potentially lead embedded designers to implement the wrong software functions in hardware. It is imperative that profiling tools minimally disturb the original program binary file and have the ability to provide accurate results in order to create an effective hardware-software partition of the embedded system.
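The dependence of sampled results on the sampling frequency can be illustrated with a small simulation. The sketch below is not taken from the thesis: the function names `short_fn` and `long_fn`, the trace, and the helper names are hypothetical, and the "profiler" simply records which function is active at every `period`-th clock tick.

```python
from collections import Counter

def ticks_of(trace):
    """Expand (function, duration) pairs into one function name per clock tick."""
    return [name for name, duration in trace for _ in range(duration)]

def true_share(trace):
    """Exact fraction of clock ticks spent in each function."""
    ticks = ticks_of(trace)
    counts = Counter(ticks)
    return {name: n / len(ticks) for name, n in counts.items()}

def sampled_share(trace, period):
    """Fraction of samples attributed to each function when the active
    function is recorded once every `period` ticks (hypothetical sampler)."""
    ticks = ticks_of(trace)
    samples = ticks[::period]
    counts = Counter(samples)
    return {name: counts.get(name, 0) / len(samples) for name in set(ticks)}

# Hypothetical workload: a 3-tick function followed by a 7-tick function,
# repeated 100 times (1000 ticks in total).
trace = [("short_fn", 3), ("long_fn", 7)] * 100

exact = true_share(trace)            # short_fn takes 30% of the ticks
aliased = sampled_share(trace, 10)   # period matches the loop length
spread = sampled_share(trace, 7)     # period does not divide the loop length
```

In this toy case a sampling period equal to the loop length always lands on `short_fn` and attributes all of the run time to it, while a period of 7 ticks recovers a share close to the true 30%. Real sampling profilers are more elaborate, but the example shows why the choice of sampling frequency can change which function appears to be the bottleneck.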
1.2 Thesis Objectives
The work presented in this thesis conforms to the following objectives:
1. To create a minimally intrusive profiler that does not require instrumentation code to be inserted into a software program's binary file. This profiler should be able to accurately measure the amount of time a software function has taken to execute on a target processor.
2. To use the developed profiler to profile several common software benchmarks running on an FPGA-based soft-core processor system.
To satisfy the first objective, a Field Programmable Gate Array (FPGA)-based on-chip profiler, called the Airwolf profiler, was developed. This profiler contains twenty profiling counters that can measure the performance of up to twenty different software functions. It is minimally intrusive and collects profiling information by measuring the number of system clock ticks that each software function takes to execute on a soft-core processor. For the second objective, a profiling environment was developed that is based on the Altera Nios II soft-core processor [32]. This environment was used to execute several software benchmarks and to profile them using the Airwolf profiler. The results obtained using the Airwolf profiler were compared against those obtained from GNU's gprof [36] software-based profiler. The results collected using the Airwolf profiler show a significant increase in profiling accuracy over those of the gprof profiler.
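The idea of a bank of per-function tick counters can be modelled in a few lines of software. The sketch below is only an illustration of counting clock ticks per function; it does not describe Airwolf's actual hardware or drivers. The class `CounterBank`, its `start`, `stop`, `clock` and `seconds` methods, and the 50 MHz clock value are all assumptions introduced here.

```python
class CounterBank:
    """Toy model of a bank of profiling counters (default: twenty).
    Each counter accumulates clock ticks while it is enabled."""

    def __init__(self, size=20):
        self.ticks = [0] * size
        self.enabled = [False] * size

    def start(self, index):
        """Enable a counter, e.g. when its function begins (assumed trigger)."""
        self.enabled[index] = True

    def stop(self, index):
        """Disable a counter, e.g. when its function returns (assumed trigger)."""
        self.enabled[index] = False

    def clock(self, cycles=1):
        """Advance the system clock; every enabled counter counts the ticks."""
        for i, on in enumerate(self.enabled):
            if on:
                self.ticks[i] += cycles

    def seconds(self, index, clock_hz):
        """Convert a tick count into execution time for a given clock rate."""
        return self.ticks[index] / clock_hz

# Hypothetical run: function 0 executes for 500 ticks, function 1 for 1500,
# followed by 100 ticks of code that is not being profiled.
bank = CounterBank()
bank.start(0); bank.clock(500); bank.stop(0)
bank.start(1); bank.clock(1500); bank.stop(1)
bank.clock(100)

CLOCK_HZ = 50_000_000  # assumed 50 MHz system clock, for illustration only
time_fn1 = bank.seconds(1, CLOCK_HZ)
```

Because the counts are exact tick totals rather than periodic samples, the time attributed to each function does not depend on a sampling frequency, which is the property the accuracy comparison against gprof relies on.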
This entire project emphasizes the use of FPGAs in the design of embedded systems. FPGAs have grown in logic capacity and on-chip memory resources. This enables them to implement and rapidly prototype large digital circuits, such as those commonly encountered in embedded systems design, without the need to fabricate the system as an Application Specific Integrated Circuit (ASIC). The supporting CAD tools enable designers to quickly create embedded systems by instantiating a set of Intellectual Property (IP) components, automatically connecting them to the peripheral components and programming the FPGA board.
1.3 Thesis Organization
This thesis contains six chapters. Chapter 2 covers the various design methodologies for embedded system design. Chapter 3 presents a survey of the profiling tools that are available. Chapter 4 introduces the Airwolf Profiler and discusses its architecture and components. Chapter 5 presents the experimental framework used to obtain profiling results and presents a discussion of these results. Chapter 6 provides concluding remarks and a discussion of future work.
Chapter 2
Design Methodologies for Embedded Systems
The development of embedded systems involves combining hardware and software components to meet the requirements of a specific application. There are several design methodologies that can help embedded designers to coordinate different design tasks in order to meet tight time-to-market deadlines and to fulfill all the specified performance requirements. These are:
• Traditional Design Methodology
• Hardware-Software Co-Design
• Function-Architecture Co-Design
• Platform-Based Design
In this chapter a brief introduction to these methodologies is provided so that the reader is able to understand the different approaches that are used in the design of embedded systems.
2.1 Traditional Design Methodology
The Traditional Design Methodology [39] is a set of design approaches that are commonly used in the automotive industry [54]. This approach usually follows a waterfall model of system development [69].
Figure 2.1 shows a flowchart for the traditional methodology for the design of embedded systems. Initially a set of specifications is defined which describes the system's operations and the performance requirements that the system must satisfy. After this initial step, the hardware and software components are designed independently. Usually a group of hardware engineers and a group of software engineers develop these components separately from each other and at different times during the design process. There is very little interaction between these groups as the hardware architecture is being built and the software code is written. It is usually presumed that these components can be combined without any incompatibility issues. Once the components are fully synthesized and functional, the system's components are integrated together, during what is known as the system integration stage. Following this stage is the verification and prototyping stage, during which designers verify and test the prototype. Lastly, the design is sent for fabrication.
Figure 2.1: The Traditional Design Methodology. (Flowchart boxes: System Specification; Hardware Model, Hardware Synthesis, Hardware Components; Software Model, Code Generation, Software Components; System Integration; System Verification; Fabrication.)
This design methodology is suitable for smaller and simpler designs, but is not feasible for complex embedded systems. It introduces many problems and causes compatibility conflicts between the software and hardware domains. When designing the hardware (or software) components first, it may be difficult to determine if the software components are able to run on the hardware architecture and vice versa. In many cases, certain hardware components may need to be changed if the software components, which were built in a different design time-frame, rely on an unsupported hardware function (or architecture) in order to execute properly. Using the traditional design methodology, designers spend most of their time on interface debugging tasks and have less time for other important tasks such as overall system verification, testing and optimization. In some cases, many design iterations may be required to meet design goals and constraints. This may lead to missed time-to-market deadlines and design obsolescence.
2.2 Hardware-Software Co-Design Methodology
The Co-Design methodology for embedded systems enables the hardware and software components to be designed concurrently. It allows designers to find an efficient and balanced hardware-software partition of the components of the embedded system, while maintaining compatibility. This methodology ensures the hardware platform is able to execute the software components (or supporting application software) and has the necessary computing resources for proper execution.
One of the main advantages of the co-design methodology is the ability to detect compatibility issues early in the design. When problems are detected earlier in the design stage, they are easier and less expensive to fix [55].
There are many proposed co-design methodologies, and the majority of them have focused on the implementation of digital signal processing algorithms or embedded systems design [25]. Most of these methodologies share common design stages that eventually lead to a system that performs a specific function or application. A flowchart for the hardware-software co-design methodology is shown in Figure 2.2 [30].
Figure 2.2: The Hardware-Software Co-Design Methodology. (Flowchart boxes: System Specification; Partitioning; Hardware Synthesis, Interface Synthesis, Software Generation; Verification; Acceptable? NO/YES; END.)
The co-design process starts with the specification of the system, usually expressed using a high level system modeling language or a software program. This defines the requirements, design constraints and the functionality of the system. Next, the hardware-software partitioning stage determines which functions or components are to be placed in the hardware domain and which are handled by software. The third and most important stage is synthesis, in which the hardware, software and interface components are synthesized concurrently. Hardware and software engineers continuously interact with each other by exchanging performance information and functional requirements of all the components. This ensures that the hardware architecture and the software program can execute together without difficulty. Finally, the verification stage determines if the designed system meets the design requirements and performance constraints. If the design fails to meet the requirements, iteration is needed, which leads back to the review of the specifications. The number of iterations depends on the design size and complexity. The hardware-software co-design process helps minimize the number of iterations and the design time required to implement a complete system.
2.3 Function-Architecture Co-Design
Another methodology used in the design of embedded systems is the Function-Architecture Co-Design [54]. In this approach the embedded system is built at a higher abstraction level, which allows designers to focus on the design of the system's functionality without having to be concerned with how that functionality is implemented. Hardware-software co-design, by contrast, puts emphasis on interfacing the hardware and software components together. It does not focus on the design tasks at the system level, which often extends the time needed to reach the target design.
Figure 2.3 illustrates the Function-Architecture Co-Design [27]. The methodology starts at the specification stage, where the architectural and functional descriptions of the system are defined. During the specification stage, the system is described using two different definitions [57]:
• Functional Definition: the specific function or application that the system will provide.
• Architecture Definition: a candidate architecture that contains all the IP cores, hardware and software components that implement the specified function.
Figure 2.3: The Function-Architecture Co-Design Methodology. (Flowchart boxes: Function Description; Architectural Description; Mapping; Performance Simulation; Communication Refinement; HW/SW Co-Design; Prototype Verification; Acceptable? NO/YES; Fabrication.)
Following the specification stage is the mapping stage, in which the system's functions are partitioned and directly mapped to the chosen system architecture. In addition, the hardware and software interfaces are also mapped onto the architecture's resources. The performance simulation stage is next, which involves carrying out all of the simulations for each component and performing various verification techniques on the mapped hardware and software components. This is done to verify that the mapped system is functional and is capable of meeting the design constraints. The next stage is the communication refinement stage, in which the inter-communication between the various system functions is defined [57]. Once these modelling stages are completed, the system design goes into a hardware-software co-design synthesis where the components of the system are synthesized together. At this stage, the prototype of the embedded system has been constructed, and it then goes into the verification stage. Further design iterations are performed if the system does not meet the specified design requirements. Fabrication is the last stage, in which the verified system is sent off for production.
2.4 Platform-Based Design
The Platform-Based Design (PBD) methodology emphasizes the use of reusable IP cores as a platform upon which designs are constructed [54]. This involves a design-space exploration that attempts to find a balance between a hardware platform consisting of a set of instantiated programmable IP cores and the ability of the architecture to support a set of applications. Platform-Based Design uses a “meet in the middle” approach [26], as shown in Figure 2.4 [56].
Figure 2.4: Design Space Exploration. (Diagram labels: Application Space; Platform Specification; Platform Design-Space Exploration; Architectural Space.)
There are two different approaches used in PBD: the top-down approach and the bottom-up approach. Using the top-down approach, the system's platform architecture, including the processor's speed, memory capacity and other instantiated peripherals, is defined at the beginning of the design cycle. The bottom-up approach defines a family of different software applications that can be programmed on the given hardware platform. The intersection of the architecture space and the application space defines the hardware platforms available for a set of applications. In some cases, the hardware platform that was derived may be over-designed for the particular application, although this is deemed beneficial to designers since they can create new software products and extend the useful life of the hardware platform [49]. This implies that using platform-based design for embedded systems emphasizes the reuse of existing components. Not only does this reduce the amount of hardware resources used, but it can also help to minimize the cost of manufacturing the embedded system.
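As a rough illustration of the intersection between the architecture space and the application space, the following Python sketch filters a set of candidate platforms down to those that can support every application in a family. The platform names, the two parameters (clock speed and memory capacity) and all numbers are assumptions made for this example; they are not taken from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """A hypothetical platform instance from the architecture space."""
    name: str
    clock_mhz: int
    memory_kb: int


@dataclass(frozen=True)
class Application:
    """A hypothetical application with minimum resource requirements."""
    name: str
    min_clock_mhz: int
    min_memory_kb: int


def supports(platform, app):
    """True if the platform meets both requirements of the application."""
    return (platform.clock_mhz >= app.min_clock_mhz
            and platform.memory_kb >= app.min_memory_kb)


def feasible_platforms(platforms, apps):
    """Platforms that can run every application in the family."""
    return [p for p in platforms if all(supports(p, a) for a in apps)]


# Assumed example data (illustrative only).
platforms = [
    Platform("small", 50, 64),
    Platform("medium", 100, 256),
    Platform("large", 200, 1024),
]
apps = [
    Application("filter", 80, 128),
    Application("codec", 100, 256),
]

candidates = feasible_platforms(platforms, apps)
for p in candidates:
    print(p.name)
```

In this assumed data set both the medium and the large platform support the whole application family. Choosing the large one would be an over-designed platform in the sense described above: more resources than strictly needed, but with headroom for future software products.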
Figure 2.5: Platform-Based Design. (Flowchart boxes: Platform Derivation; Platform Instance; Application; Mapping/Compiling; Simulation; Performance Numbers.)
Figure 2.5 describes the Platform-Based Design methodology for embedded systems [54]. The designer starts by specifying the platform architecture, which outlines the performance constraints and the functionality of the entire system based on the intended application. This includes the specification of the required speed of the microprocessor, memory capacity, cache memories, etc. From the defined requirements, a platform instance is made which contains all of the instantiated hardware components and software programs required to execute a specific application. Following this stage is the mapping and compiling of the system, which includes hardware platform synthesis and program code generation. Next, the compiled system goes into the simulation stage, in which designers test all of the components to ensure that they are functioning correctly and meeting the design constraints. Based on the performance numbers retrieved from the simulation stage, the designer can determine if the system has satisfied the specified requirements. If not, the system goes into another design iteration cycle until it has fully met all of the constraints.
2.5 Summary
This chapter presented an introduction to different embedded system design methodologies. In the next chapter, a discussion of profiling tools is presented.
Chapter 3
Profiling Tools
There is a wide variety of profiling tools available that measure different performance metrics and retrieve diverse sets of profiling information. Section 3.1 discusses profiling tools and a proposed software profiling methodology for the design of embedded systems. The subsequent sections classify the different types of profilers available as follows: Software-Based Profiling (SBP) Tools, Software-Based Memory Profiling (SBMP) Tools, Hardware-Counter Based Profiling (HCBP) Tools and FPGA-Based Profiling (FPGA-BP) Tools. For each of these categories, a brief survey of existing tools is presented.
3.1 Profiling Tools and the Software Profiling Methodology
There are several methodologies and approaches used in the design of embedded systems. As explained in Chapter 2, the majority of the methodologies begin at the specification stage, in which all the functionalities of the system and the supporting architecture to implement those functions are defined. Usually, embedded designers have two options for the initial implementation of their design based on the specifications. For the first option, the embedded system can be entirely implemented in hardware while moving certain components to the software domain, depending on the execution performance of those functions [42]. The second option is to have the entire embedded system implemented in software [35] and invoke a profiler that measures the performance of the software program. The information provided by the profiler is used by designers to help them choose which software functions are more desirable for hardware implementation.
Profiling tools are used to measure the performance of a program that is running on the target processor of an embedded hardware platform. These tools provide useful information for designers so that they can identify software hot-spots that are causing a performance bottleneck. Designers can choose either to optimize the software code to alleviate the performance issue or to implement the computationally intensive function in the hardware domain in order to achieve a speed-up in performance of the entire system. It is imperative that profilers provide accurate results and properly detect these hot-spots. This can lead to the creation of a balanced partition between the hardware and software components. The quality of the embedded system is entirely dependent on the efficiency and the effectiveness of the hardware-software partition of the system's components. The application of profiling tools has led to a proposed Software Profiling Methodology (SPM), as shown in Figure 3.1 [60].
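One way to see why accurate hot-spot detection matters is Amdahl's law, which is not part of the source text but is a standard estimate of overall speed-up. Assume that profiling attributes a fraction f of the total execution time to one function, and that a hardware implementation runs that function s times faster. The values of f and s below are assumed for illustration only.

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{s}}
\qquad\text{e.g. } f = 0.6,\ s = 10:\quad
S = \frac{1}{0.4 + 0.06} = \frac{1}{0.46} \approx 2.17
```

If an inaccurate profile reported f = 0.3 for the same function, the same formula would predict only S = 1/(0.7 + 0.03) ≈ 1.37, so the function might wrongly be left in software.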
Figure 3.1: Software Profiling Methodology. (Flowchart boxes: Software Implementation of Embedded System; Functional Verification; Profiling; Meet Requirements? NO/YES; Software Modification; Hardware Implementation; END.)
The design flow is similar to the hardware-software co-design methodology of embedded systems [30], as explained in Section 2.2. The SPM begins at the software specification stage. The complete embedded system is written in a high level language such as C or C++ and then the software is functionally verified. Next, a profiler is invoked in order to measure the runtime performance of the program and eventually return feedback and performance statistics to the designer. The designer analyzes the results and determines if the software code meets the specified performance constraints. That same profiling information can be used by an automated hardware-software partitioning CAD tool [63]. If the system fails to meet the requirements, the designer will try to optimize the code or move certain computationally intensive functions into the hardware domain as a hardware accelerator. If necessary, the entire methodology starts again until the designer is satisfied with the performance.
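The iterative loop of the SPM can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the procedure of any particular tool: the profile is a plain dictionary of per-function times, the only remedy modelled is moving the hottest function to hardware, and the function names, speed-up factors and deadline are all hypothetical.

```python
def profile(times_ms):
    """Stand-in for a profiler run: (function, time) pairs, hottest first."""
    return sorted(times_ms.items(), key=lambda item: item[1], reverse=True)


def software_profiling_methodology(sw_times_ms, hw_speedup, deadline_ms):
    """Move hot functions to hardware until the deadline is met.

    sw_times_ms: assumed execution time of each function in software.
    hw_speedup:  assumed speed-up factor for functions that have a
                 hardware implementation available.
    Returns the list of functions moved to hardware and the final total time.
    """
    times = dict(sw_times_ms)
    in_hardware = []
    while sum(times.values()) > deadline_ms:
        # Profiling step: find the hottest function still in software
        # for which a hardware accelerator is possible.
        candidates = [name for name, _ in profile(times)
                      if name in hw_speedup and name not in in_hardware]
        if not candidates:
            break  # nothing left to accelerate; requirements are not met
        hottest = candidates[0]
        times[hottest] = times[hottest] / hw_speedup[hottest]
        in_hardware.append(hottest)
    return in_hardware, sum(times.values())


# Assumed example data (illustrative only).
sw_times = {"fft": 60.0, "filter": 25.0, "io": 15.0}
speedups = {"fft": 10.0, "filter": 5.0}
moved, total = software_profiling_methodology(sw_times, speedups, deadline_ms=50.0)
print(moved, total)
```

With these assumed numbers the loop stops after one iteration, because accelerating the single hottest function is enough to meet the deadline. A real flow would also consider optimizing the software itself and would re-run the profiler on the modified system rather than scaling the old measurements.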
Existing profiling tools offer different types of profiling capabilities and support different programming languages. C/C++ profiling tools are common, but there are also tools available that can profile programs written in Java [38, 37]. The Mentor Seamless Co-verification environment provides a profiler that takes a design written in SystemC [13] and measures its performance based on processor utilization, cache efficiency, memory hotspots, bus utilization and bus master contention [12].
Currently, there are many different kinds of profiling tools that are used to retrieve a variety of profiled information about a program. The most common is function-level profiling, which measures the amount of time needed for a function to execute on the processor. Another type is memory-level profiling, which determines which function, data variable type or instruction is causing memory related problems: excessive memory references, cache misses, heavy pointer dereferencing, branching and looping instructions. Figure 3.2 depicts the proposed classification of profiling tools. There are three main categories: software-based, hardware-based and FPGA-based. We describe each of these in detail in the following sections.
3.2 Software Based Profiling (SBP) Tools
Figure 3.2: Profiling Tool Classification. (Software-Based: GNU's gprof, Valgrind, VTune Performance Analyzer. Hardware-Based: Hardware Counters, Page Migration Approach. FPGA-Based: SnoopP, Frequent Loop Analysis Tool, WOoDSTOCK, Airwolf.)
Software-Based Profiling (SBP) is the most common technique for measuring the performance of application code written in a programming language. There are two approaches to profiling the software code when using these tools: simulation and the insertion of instrumentation code. Simulations take place in virtual environments that simulate the behaviour of a microprocessor as the software code runs. The insertion of instrumentation code allows an SBP tool to attach itself to the binary file and collect performance information during the execution of a program on the processor. In this section, we describe an ISS, GNU's gprof [36] and Intel's [11] VTune [45].
3.2.1 Instruction Set Simulator
Instruction Set Simulators (ISS) are one of the SBP tools used for profiling software code running in a simulated environment. One popular ISS is the SimpleScalar Toolset, which simulates application code running on the SimpleScalar computer architecture [29]. The advantage of using an ISS for profiling is that the designer is able to view the entire data flow inside the microprocessor's registers during the simulation. It keeps track of all of the execution processes, the current instruction in execution, data manipulations, cache accesses and other reportable events. This does not require the software code to be modified; therefore, intrusiveness to the binary file is non-existent.
The use of an ISS may not be feasible for larger software programs or with system-on-a-chip designs, since they can be very slow to simulate [51]. This could lead to very inaccurate profiles of the execution times of each function. Simulations can have varying times to complete depending on the complexity of the software code. It may take several hours to run an entire simulation which may only cover a few seconds of real-time, thus misrepresenting the entire execution time. Due to the increasing complexity of embedded systems designs, constructing complex models of the system's components and other external environments may not be possible.
3.2.2 GNU's gprof
gprof [36] is an open-source profiling tool that is used on Linux [5] and Unix [6] workstations to profile C and C++ application code. It provides two types of profiled outputs: the flat profile and the call graph. The flat profile is a report of how much time the program spends in each function and the number of times that function was called. The call graph displays each function, its calling function and other functions called within that function. To utilize this profiler, the designer is required to compile the code with the profiling instrumentation option enabled (-pg in GCC). This option inserts additional instrumentation code into the binary executable file, as required by gprof.
During program execution, gprof utilizes the inserted instrumentation code to monitor the performance of the program running on the Central Processing Unit (CPU). The instrumentation code allows gprof to count the precise number of function calls and generate the appropriate number of interrupts to sample the program counter (PC) of the CPU. It is capable of generating a profile that accurately counts the number of times each function has been called; however, the reported execution time of each function may be somewhat inaccurate.
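The effect of inserted instrumentation code can be imitated in Python with a decorator that runs extra bookkeeping at every function entry. This is only an analogue of compiler-inserted instrumentation, not gprof's implementation; the names instrumented, call_counts and call_arcs are hypothetical and chosen for this sketch.

```python
import functools
from collections import Counter

call_counts = Counter()  # function name -> exact number of calls
call_arcs = Counter()    # (caller, callee) -> number of calls on that arc
_call_stack = ["(top level)"]  # names of the functions currently executing


def instrumented(func):
    """Wrap func so that every call is recorded before it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        caller = _call_stack[-1]
        call_counts[func.__name__] += 1
        call_arcs[(caller, func.__name__)] += 1
        _call_stack.append(func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            _call_stack.pop()
    return wrapper


@instrumented
def square(x):
    return x * x


@instrumented
def sum_of_squares(n):
    return sum(square(i) for i in range(n))


result = sum_of_squares(4)
print(call_counts)
print(call_arcs)
```

The call counts gathered this way are exact, which matches the behaviour described above, and the caller/callee arcs are the raw material for a call graph. The sketch also makes the cost visible: every call now executes additional code, which is the source of the code-size and run-time overhead attributed to instrumentation.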
gprof collects information on the execution time of a program by reading the value of the PC at specified intervals. The PC value determines which function is being executed on the processor. Based on this value, gprof increments the execution time counter of the function that is currently executing by its sampling period. This can create inaccurate timing results for each function called and for the execution time of the entire program [68]. The accuracy of the profiled execution time is entirely dependent on the sampling frequency of the PC.
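The dependence of accuracy on the sampling period can be shown with a small simulation. The sketch below is an assumption-laden model rather than gprof's code: the program is represented as a hypothetical timeline of (function, start, end) intervals in arbitrary time units, and each sample charges one full sampling period to whichever function is executing at that instant.

```python
from collections import defaultdict


def true_times(timeline):
    """Exact time spent in each function."""
    totals = defaultdict(int)
    for name, start, end in timeline:
        totals[name] += end - start
    return dict(totals)


def function_at(timeline, t):
    """Name of the function executing at time t, or None if idle."""
    for name, start, end in timeline:
        if start <= t < end:
            return name
    return None


def sampled_times(timeline, period):
    """Estimate per-function time by sampling the 'PC' every period units."""
    totals = defaultdict(int)
    end_of_run = max(end for _, _, end in timeline)
    for t in range(0, end_of_run, period):
        name = function_at(timeline, t)
        if name is not None:
            totals[name] += period  # charge a whole period to this function
    return dict(totals)


# Assumed execution timeline (illustrative only).
timeline = [("main", 0, 3), ("short", 3, 5), ("main", 5, 10), ("long", 10, 40)]

print(true_times(timeline))
print(sampled_times(timeline, period=10))
print(sampled_times(timeline, period=1))
```

With a coarse period of 10 units the function named short is never observed and its time is charged to main, while a period of 1 unit reproduces the exact times for this timeline. Finer sampling improves accuracy at the cost of more interrupts.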
3.2.3 Intel's VTune
Intel's VTune Performance Analyzer is an SBP tool that profiles C/C++ code that is executed on Intel processors [45, 47, 11]. The VTune analyzer features three profiling modes: Sampling Over Time (SOT), Call Graph and Counter Monitor. Each of these modes is discussed briefly in the following paragraphs.
There are two sampling methods that are used by VTune: Sampling Over Time (SOT) and the Pause/Resume Application Programming Interface (API) [24]. SOT profiles the software code and shows the performance results “over time” for each thread, function and instruction until the program has completed execution. In addition, it can detect when the processor is in an idle state. This allows designers to optimize the application code to execute other threads when the processor is not executing any threads.
Sampling using the Pause/Resume API [24] requires the user to insert certain functions into various parts of the software code. Such functions are VTPause(), VTResume(), VTPauseSampling(), VTResumeSampling(), CMPause() and CMResume(). These functions are used to select certain code regions for profiling.
VTune's Call Graph profiler [58] displays the calling sequences of functions during execution of the software code. The VTune profiler adds instrumentation code into the binary executable file so that it can monitor and identify the number of specific functions called during run-time. Additionally, it identifies the critical path in the call graph, which displays the potential bottlenecks that limit system performance.
3.2.4 Summary of SBP Tools
The use of the sampling technique in common software-based profilers helps to reduce the run-time overhead during profiling. Nevertheless, this can produce inaccurate profiled results which can potentially create a sub-optimal partition of the embedded system. The use of an ISS can also produce inaccurate results since simulators are only as good as the system model that is being simulated. Also, the simulation time may not accurately match the actual run-time execution of the program. Certain SBP tools require the designer to link their program with instrumentation code, which is inserted at the binary level. This can lead to an excessive number of interrupt calls, which may cause unpredictable behaviour of the software code running on the embedded hardware platform. Additionally, the instrumentation code can lead to an increase in code size and may potentially change the behaviour and the performance of the software system.
3.3 Software Based Memory Profilers (SBMP)
Embedded software designers must take great care in ensuring that the memory system is used properly [33, 34]. One of the main problems is memory leakage. A memory leak is caused when the application code consumes unnecessary memory resources and fails to release the memory that is no longer in use. Prolonged memory leaks in an application can cause the system to behave unpredictably and eventually run out of memory, leading to the failure of the embedded system. An increase in the number of unnecessary memory accesses and paging is another problem, since it introduces latency in retrieving data and operands for instructions to execute on the processor. Excessive numbers of read and write accesses to memory are the most common overhead operations in CPUs [41]. These operations generally cause performance degradation. Cache misses are also an issue; they occur when the processor is unable to retrieve instructions from its own cache memory. This is due to mispredicted branching instructions, heavily nested dereferencing of memory pointers and looping instructions.
Memory profilers are needed to detect the problems listed above, so that they can be resolved by the designer. They provide detailed information about which function call in the application code is producing memory leaks, cache misses and high memory referencing. Reducing the number of memory accesses can improve performance and minimize performance overhead [50]. In this section, the following memory profiling tools are described: Valgrind [14] and Purify [44].
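The basic bookkeeping behind leak detection can be sketched as follows. This is a deliberately simplified model, not how Valgrind or Purify are implemented: it assumes a recorded trace of allocation and free events, and the event format, block identifiers and function names are hypothetical.

```python
def find_leaks(events):
    """Report bytes that were allocated but never freed, per allocation site.

    events is an assumed trace format:
      ("alloc", block_id, size_in_bytes, allocating_function)
      ("free", block_id)
    """
    live = {}  # block_id -> (size, site) for blocks not yet freed
    for event in events:
        if event[0] == "alloc":
            _, block_id, size, site = event
            live[block_id] = (size, site)
        elif event[0] == "free":
            live.pop(event[1], None)  # ignore frees of unknown blocks here

    leaks = {}
    for size, site in live.values():
        leaks[site] = leaks.get(site, 0) + size
    return leaks


# Assumed trace (illustrative only): build_packet never frees its buffers.
trace = [
    ("alloc", 1, 64, "read_sensor"),
    ("alloc", 2, 128, "build_packet"),
    ("free", 1),
    ("alloc", 3, 128, "build_packet"),
]

print(find_leaks(trace))
```

Grouping the leaked bytes by allocation site mirrors the kind of report described above, in which the profiler points the designer to the function call responsible for the leak.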
3.3.1 Valgrind
Valgrind is an open-source profiling tool for Linux systems, distributed under the GNU General Public License [14]. This profiler can check the calls for reads and writes to memory, as well as for allocating and freeing memory using operations such as the C++ operators new and delete. The major advantage of Valgrind is its capability for cache memory profiling. It simulates the CPU's Level 1 data and instruction caches as well as the Level 2 cache. Valgrind determines a cache hit count for every line of the program that is being traced and analyzed. It can profile applications of various sizes, from small functions to complex application systems.
The technique Valgrind uses to measure the performance of software code is to run the application in a simulated virtual processor environment. Other components and libraries of the software code are linked to the simulator as well. During the simulation process, the profiling data is collected and stored in a log file. The usefulness of this method depends on how well the functions and data structures are modelled in the simulator. Valgrind is capable of profiling memory activity on larger programs, although the performance of the software program can degrade.
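To give a feel for what simulating a cache involves, the sketch below models a direct-mapped cache and counts hits and misses for a list of memory addresses. It is a minimal illustration under assumed parameters (4 lines of 16 bytes); the cache model used by Valgrind itself is more elaborate, and nothing here reflects its actual code.

```python
def simulate_direct_mapped(addresses, num_lines=4, line_size=16):
    """Count hits and misses for a direct-mapped cache.

    Each address maps to exactly one cache line (its index); the tag
    stored in that line records which memory block currently occupies it.
    """
    tags = [None] * num_lines
    hits = 0
    misses = 0
    for addr in addresses:
        block = addr // line_size   # which memory block holds this address
        index = block % num_lines   # the only line the block may occupy
        tag = block // num_lines    # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag       # fetch the block, evicting the old one
    return hits, misses


# Sequential word accesses: one miss per 16-byte line, then three hits.
sequential = list(range(0, 64, 4))
print(simulate_direct_mapped(sequential))

# Two addresses that map to the same line keep evicting each other.
conflicting = [0, 64, 0, 64]
print(simulate_direct_mapped(conflicting))
```

The two traces show why access patterns matter: the sequential trace mostly hits, whereas the conflicting trace misses on every access even though it touches only two addresses. Running every memory reference through such a model is also why simulation-based profiling slows the program down.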
3.3.2 Rational Software's Purify
Rational Software's Purify [44] is a software-based memory profiler that can be used on Microsoft Windows [7], Unix [6] and Linux [5] operating environments. The tool helps in solving memory problems and determines the exact code location that is causing the error. The kinds of problems the program detects are memory leaks, reading and writing beyond the bounds of an array in memory, attempts to free unallocated memory and use of uninitialized memory. Purify uses a four-colour scheme to represent memory states, as shown in Figure 3.3 [44]: red, yellow, green and blue.
Figure 3.3: Rational Purify's Memory Profiling Colour Code. (State diagram labels: Red Memory and Blue Memory, illegal to read, write or free; Yellow Memory, allocated and uninitialized, legal to write or free but illegal to read; Green Memory, legal to read and write, or free if allocated by malloc; transitions Malloc, Write and Free.)
The red zone indicates that the program has no access to the memory unless it is explicitly allocated using malloc or new. Purify initializes all heap and stack memory as a red zone until it is allocated. The yellow zone is memory that is allocated by the program. It is not legal to read from it because it is not initialized and does not contain any valid data. The green zone is memory that has been written into and is available for reading and writing data. The blue zone is memory that has been freed by the program and is no longer accessible.
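The four-colour scheme behaves like a small state machine per memory location, which the following Python sketch models. It follows the rules stated above and in Figure 3.3, but the class name ColourTracker, its methods and the error strings are hypothetical; this is not Purify's implementation.

```python
RED, YELLOW, GREEN, BLUE = "red", "yellow", "green", "blue"


class ColourTracker:
    """Track the colour of each address and record illegal operations."""

    def __init__(self):
        self.colour = {}   # address -> colour; unknown addresses are red
        self.errors = []   # list of (message, address)

    def state(self, addr):
        return self.colour.get(addr, RED)

    def malloc(self, addr):
        # Allocated but uninitialized memory is yellow.
        self.colour[addr] = YELLOW

    def write(self, addr):
        if self.state(addr) in (RED, BLUE):
            self.errors.append(("illegal write", addr))
        else:
            self.colour[addr] = GREEN  # written memory becomes readable

    def read(self, addr):
        if self.state(addr) != GREEN:
            self.errors.append(("illegal read", addr))

    def free(self, addr):
        if self.state(addr) in (RED, BLUE):
            self.errors.append(("illegal free", addr))
        else:
            self.colour[addr] = BLUE   # freed memory is no longer accessible


tracker = ColourTracker()
tracker.malloc(0x10)
tracker.read(0x10)    # yellow: read of uninitialized memory
tracker.write(0x10)   # yellow -> green
tracker.read(0x10)    # green: legal
tracker.free(0x10)    # green -> blue
tracker.write(0x10)   # blue: write after free
tracker.free(0x10)    # blue: double free
print(tracker.errors)
```

The usage sequence exercises three of the error classes mentioned in this section: use of uninitialized memory, access to freed memory, and an attempt to free memory that is not currently allocated.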
3.3.3 Summary of SBMP Tools
Memory profiling tools are essential for detecting memory leaks, allocation and deallocation errors, as well as instructions that cause cache read/write misses. They give the designer more options to analyze and optimize the software code prior to porting it to the target architecture. In addition, they provide more detailed performance information than function-level profilers. The problem with the current memory profiling tools is that they use the same measuring techniques as SBP tools. Some memory profilers require that the designer include instrumentation code in their application at the binary level. This introduces the issue of large code sizes and runtime overhead. Some memory profilers use sampling techniques to sample the hardware counters and retrieve their values. As discussed in the case of software-based profiling, sampling techniques can produce inaccurate results and may potentially mislead the designer into improperly implementing certain functions in the hardware or software domains.
3.4 Hardware-Counter Based Profiling (HCBP) Tools
Hardware-Counter Based Profiling (HCBP) tools utilize on-chip hardware counters
that are available on advanced processors such as Sun UltraSPARC [64], Intel Pentium
processors [46] and Advanced Micro Devices (AMD) processors [9]. These hardware
counters are dedicated to monitoring specific events that occur during runtime
execution of an application. The types of events which can be monitored are: memory
accesses, cache misses, pipeline stalls, types of instructions executed, and so on. HCBP
tools do not require the use of instrumentation code since these counters are designed
to collect performance information of the software program. In addition, very little
performance overhead is introduced during runtime execution.
Accessing these counters requires special, processor-specific instructions. The Performance
Application Programming Interface (PAPI) [28] provides users with a high-level interface
to access these counters and supports many different processors [62]. Intel's
VTune counter monitor provides an interface for accessing and utilizing the hardware
counters to profile application code executing on Pentium-based processors [46].
3.4.1 Hardware Counters Approach
Itzkowitz et al. from Sun Microsystems have described a software profiling tool that
utilizes the hardware counters in an UltraSPARC-III microprocessor [48]. Originally
this profiling tool was built as an extension of the Sun ONE Studio [4] compilers and
performance tools, which are used for measuring the performance of software code.
These hardware counters are included in the architecture and cover different types of
events, such as Instructions Completed, Instruction-cache (I$) Misses, Data-cache
(D$) Read Misses, Data-translation-lookaside-buffer (DTLB) Misses, External-cache
(E$) References, E$ Read Misses, E$ Stall Cycles, and many others.
There are some limitations to using this tool. One such limitation is counter
skid. The tool uses hardware-counter overflows to obtain profiled information.
When a counter overflow occurs, the tool does not have a precisely timed trapping
mechanism with which to obtain the correct value of the counter. The second problem is
the instruction backtracking mechanism, which was implemented to work around
the imprecise trap. The backtracking technique is used to find the
address of the instruction that caused the overflow event; however, the instruction
immediately preceding the current one in the processor's PC may not be the
correct one, because the previous instruction may have been a branch or
call. Instead of relying on the value of the PC, the profiling tool tries to find the
proper values in other registers to calculate the effective address of the instruction
that caused the overflow event. Success in finding the address is not guaranteed,
since the values of the registers may have changed once other overflow signals have
been delivered to other hardware counters. Despite these drawbacks, the tool
has managed to find the proper instruction 99% of the time. The MCF benchmark
was profiled and the feedback provided enabled a 20% performance improvement.
3.4.2 Page Migration Approach
The Page Migration Approach (PMA), developed by Tikir et al., utilizes hardware
counters for profiling memory with memory page-migrating capabilities [65]. The
profiler was used on a multi-processor system based on Sun's Sun Fire server, as
shown in Figure 3.4. Each system board contained several processors and memory.
The Sun Fire Link hardware counters are used to sample the frequency with which
each processor "touches" a page of memory that is remote from the on-board local
memory hardware. At a certain number of counts specified by the user for remote
touching of memory pages, the profiler halts the execution. It then migrates that
particular memory page to the processor that accesses it most frequently for read
and write operations. PMA has demonstrated a 90% speed improvement when certain
memory pages are placed closest to the processor that requires data from that page.
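The migration decision described above can be sketched in a few lines of C. This is an illustrative model only, not the code of Tikir et al.: the structure, the function name pma_touch, the four-processor configuration and the choice to compare the total remote-touch count against the user threshold are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CPUS 4   /* assumed: four processors, as drawn in Figure 3.4 */

/* Hypothetical per-page record of remote touches by each processor. */
typedef struct {
    uint32_t remote_touches[NUM_CPUS];
    int      home_cpu;   /* processor whose local memory holds the page */
} page_stats_t;

/* Records one access to the page by `cpu`; only remote accesses count.
   Returns the processor the page migrates to once the total number of
   remote touches reaches `threshold`, or -1 if no migration is due. */
int pma_touch(page_stats_t *p, int cpu, uint32_t threshold)
{
    assert(p != NULL && cpu >= 0 && cpu < NUM_CPUS);
    if (cpu == p->home_cpu)
        return -1;                       /* local access, not counted */
    p->remote_touches[cpu]++;

    uint32_t total = 0;
    int best = 0;
    for (int i = 0; i < NUM_CPUS; i++) {
        total += p->remote_touches[i];
        if (p->remote_touches[i] > p->remote_touches[best])
            best = i;                    /* most frequent remote toucher */
    }
    if (total < threshold)
        return -1;

    /* Migrate the page and restart counting for its new location. */
    p->home_cpu = best;
    for (int i = 0; i < NUM_CPUS; i++)
        p->remote_touches[i] = 0;
    return best;
}
```

In this sketch ties are resolved in favour of the lowest-numbered processor; the original work may use a different policy.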
3.4.3 Desktop Processor Profiling Counters
There are consumer desktop processors today that contain hardware counters which
monitor the performance of application code in the CPU. AMD Athlon microprocessors
[9] contain four 48-bit performance hardware counters that can be used as event-driven
or timing-driven counters. These counters can monitor the number of times a
certain event occurs or they can measure the duration of an event that is currently
taking place on the processor. Intel Pentium microprocessors also contain a set of
performance hardware counters [46]. They are also event or timing driven and are
accessible through Intel's VTune [45] profiling tool.
Figure 3.4: Page Migration Approach (block diagram: a software application with page
migration, the Sun Fire Link hardware counters, and processors #1 to #4, each with
local memory holding physical pages)
3.4.4 Summary of HCBP Tools
Using hardware counters for profiling software code is beneficial since it does not
introduce any instrumentation code, leaving the compiled application code
untouched. Additionally, they add very little performance overhead since the counters
collect their data in hardware while the software executes. However,
there are drawbacks when using HCBP tools. First, some HCBP tools may require
the user to reconfigure and reprogram the counters to detect different events, which
can lead to the addition of certain functions at the source code level. Secondly,
they use the sampling method to sample the hardware counters, which leads back to
the problems that were introduced with SBP tools. Thirdly, the handling of interrupts
affects the gathered data, since the interrupt service routines (ISRs) used add to the
number of events. Lastly, there is a limited number of hardware counters available.
The programmer must run the application many times to obtain data for different
monitoring events [62].
3.5 FPGA-Based Profiling (FPGA-BP) Tools
FPGAs are user-programmable integrated circuits that offer a reasonably high level of
integration, negligible prototyping cost and instantaneous manufacturing capability.
Riding on Moore's law [52], FPGAs have grown in logic capacity while maintaining
an affordable cost for many applications [31]. Embedded development kits that utilize
FPGAs contain an abundance of on-board resources such as clock multipliers, fast
memory chips, math co-processors, etc. Together with the reconfigurability and
flexibility they offer to the designer, this makes them an attractive alternative
for rapid prototyping of large embedded system designs.
Researchers today are developing profiling tools that can help designers working
on embedded system designs using FPGAs. The two major FPGA vendors, Altera
Corporation [17] and Xilinx Incorporated [72], provide embedded system development
kits which use the Nios II [32] and MicroBlaze [73] soft-core processors, respectively.
These soft-core processors are instantiated on the FPGA and used as basic building
blocks for designing embedded systems [66].
FPGA-based profiling (FPGA-BP) tools also utilize these soft-core processors for
profiling. In FPGA-BP tools, the designer executes the software on the soft-core
processor and collects the performance data provided by the on-chip profiling
hardware. These tools have provided improved results compared to the profiling
tools described earlier. They keep latency and performance overhead at a minimum,
because they are non-intrusive and require negligible instrumentation. They do not
use the sampling technique and require minimal processor computation. These
features are highly desirable for profiling tools used in embedded systems. In this
section, a detailed discussion of the existing FPGA-based profiling tools is provided.
Figure 3.5: SnoopP's Profiling Architecture (block diagram: the PC of the MicroBlaze
CPU and the system clock feed segment counters #1 to #N)
3.5.1 SnoopP
SnoopP [60] is an on-chip function-level profiler that was implemented on the Xilinx
Virtex-II 2000 FPGA board. This board is used to implement designs based on the
Xilinx MicroBlaze [73] soft processor. The on-chip profiler utilizes the MicroBlaze as
a target processor. SnoopP uses a hardware profiling architecture that is non-intrusive
to the code, so that no additional instructions, commands or other flags are
necessary. Figure 3.5 depicts the hardware architecture for the SnoopP profiler.
SnoopP consists of a variable number of user-specified segment counters, each of
which defines the address range of the instructions to be analyzed. The number of
segment counters depends on the number of functions the user wishes to profile and
the area available on the FPGA.
The segment counters, shown in Figure 3.6, determine if the value of the PC
address is in the range of memory addresses in which the binary code corresponding
to the function resides. This is determined by the comparators inside each segment
counter. If this condition is true, the comparator sends an enable signal to the
hardware counter which utilizes the processor's system clock to count the number
of clock cycles the function has used. This gives the designer the precise number of
clock cycles that the particular function needs to execute on the processor. SnoopP's
and gprof's results were compared, and it was shown that SnoopP was significantly
more accurate. Additionally, SnoopP does not slow down the performance of either
the software or the profiling process.
Figure 3.6: SnoopP's Profiling Counter (block diagram: the PC output bus feeds two
comparators, PC >= low address and PC <= high address, whose combined output drives
the enable of a 64-bit time counter clocked by the system clock and read over the
read bus)
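The comparator-and-counter behaviour of a segment counter can be modelled in software as follows. This is a behavioural sketch for illustration, not SnoopP's hardware description; the structure and function names are hypothetical, and the address bounds are treated as inclusive, following the ">=" and "<=" comparators in Figure 3.6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one segment counter. */
typedef struct {
    uint32_t low_addr;    /* first instruction address of the function */
    uint32_t high_addr;   /* last instruction address of the function  */
    uint64_t cycles;      /* 64-bit time counter                       */
} segment_counter_t;

/* Models one rising edge of the system clock with `pc` on the PC bus:
   the two comparators produce the enable, and the counter increments
   only while the PC lies inside the function's address range. */
void segment_counter_clock(segment_counter_t *sc, uint32_t pc)
{
    assert(sc != NULL && sc->low_addr <= sc->high_addr);
    int enable = (pc >= sc->low_addr) && (pc <= sc->high_addr);
    if (enable)
        sc->cycles++;
}
```

Because the counter only watches the PC bus, the profiled program needs no extra instructions, which is why the approach is non-intrusive.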
3.5.2 Frequent Loop Analysis Tool (FLAT)
Frequent Loop Analysis Tool (FLAT) is a tool that detects functions in software that
heavily use loops [40]. In most cases, loops use 90% of the execution time while
constituting only 10% of the entire software code. FLAT searches for these critical
regions and records the execution frequency of each loop-intensive function into a
cache-like hardware architecture that is implemented in an FPGA. A block diagram
of the FLAT architecture is shown in Figure 3.7.
A loop is typically closed by a Short Backwards Branch (SBB), which makes
the program jump back to the first instruction of that loop. The
value of the SBB is a negative address offset. The Frequent Loop Cache (FLC) stores
the execution frequency of each loop function at the index memory location that is
based on the SBB value. A cache controller, called the Frequent Loop Cache
Controller, keeps the data updated with the latest values. FLAT does not require the
use of instrumentation code or any sampling techniques. Nonetheless, the accuracy
of the loop detection relies on the size of the on-chip cache in the FPGA.
Figure 3.7: Frequent Loop Analysis Tool (block diagram: the MicroBlaze CPU sends an
SBB signal to the Frequent Loop Cache Controller, which increments entries in the
Frequent Loop Cache over read/write, address and data lines)
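The idea of indexing a frequency table by the SBB offset can be sketched as below. This is only an illustration of the mechanism under stated assumptions, not FLAT's actual design: the cache size, the bound on what counts as a "short" branch, the use of the low bits of the offset magnitude as the index, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLC_ENTRIES      64    /* assumed cache size; must be a power of two */
#define SBB_MAX_DISTANCE 256   /* assumed bound, in bytes, on a short branch */

/* Hypothetical model of the Frequent Loop Cache: one frequency per index. */
typedef struct { uint32_t freq[FLC_ENTRIES]; } flc_t;

/* A taken branch is treated as an SBB if its offset is negative and small. */
int is_sbb(int32_t offset)
{
    return offset < 0 && offset >= -SBB_MAX_DISTANCE;
}

/* Called for each taken branch. Increments the entry indexed by the
   offset and returns 1 if the branch was counted as a loop iteration,
   or returns 0 if the branch is not an SBB. */
int flc_on_branch(flc_t *flc, int32_t offset)
{
    assert(flc != NULL);
    if (!is_sbb(offset))
        return 0;
    /* Assumption: the index is the low bits of the offset magnitude. */
    uint32_t index = (uint32_t)(-offset) & (FLC_ENTRIES - 1);
    flc->freq[index]++;
    return 1;
}
```

In this model two loops whose offsets map to the same index share an entry, which gives one concrete way in which a small cache would limit the accuracy of loop detection.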
3.5.3 WOoDSToCK
WOoDSToCK [59] (Watches Over Data STreaming On Computing element linKs)
is a profiling tool that monitors the communication dataflow between Computing
Processor Elements (CPEs), as shown in Figure 3.8.
WOoDSToCK monitors the data flow between each CPE by adding monitors to
the circuit which run in real time. The data link between each element of the system is
created by Fast Simplex Links (FSLs) [71], available in Xilinx's MicroBlaze [73]
soft-core processor. FSLs allow streaming and buffering of data between the hardware
components of the system. The profiler utilizes the links to measure the stream of
data between each CPE. It measures the number of run-time execution clock cycles
to see which CPE is stalled or starved for data.
A CPE is stalled when a stream of data is waiting at its input but little or no output
data is coming out. A CPE is starved when little data is coming in
or going out of it, but it is still running. The results obtained showed that the
tool was able to detect bottlenecks using a pipelined system and a branching system
benchmark.
Figure 3.8: Watching Over Data Streaming on Computing Element Links (block diagram:
computing elements CE 1 to CE 3 connected through FSL 1 to FSL 3, each link observed
by a WOoDSToCK monitor)
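The distinction between a stalled and a starved element can be made concrete with a small classifier over per-link cycle counts. This is a sketch under assumptions rather than WOoDSToCK's actual logic: the counters, the percentage threshold for "little" data, and the reading of "starved" as little input together with little output are all hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CPE_OK, CPE_STALLED, CPE_STARVED } cpe_state_t;

/* Hypothetical counts gathered by a link monitor over one window. */
typedef struct {
    uint64_t window_cycles;    /* clock cycles observed                       */
    uint64_t input_ready;      /* cycles with data waiting on the input link  */
    uint64_t output_written;   /* cycles in which the CPE wrote output data   */
} link_monitor_t;

/* Classifies a CPE. Activity at or below low_pct percent of the window
   counts as "little" (an assumed threshold, chosen by the user). */
cpe_state_t classify_cpe(const link_monitor_t *m, unsigned low_pct)
{
    assert(m != NULL && m->window_cycles > 0 && low_pct <= 100);
    uint64_t low = m->window_cycles * low_pct / 100;
    int little_in  = m->input_ready    <= low;
    int little_out = m->output_written <= low;

    if (!little_in && little_out)
        return CPE_STALLED;    /* input is waiting, output is not produced */
    if (little_in && little_out)
        return CPE_STARVED;    /* running, but almost no data to work on   */
    return CPE_OK;
}
```

For example, with a 1000-cycle window and a 5% threshold, an element with 900 input-ready cycles and 10 output cycles would be reported as stalled.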
3.6 Qualitative Comparison of Profiling Tools
There are a variety of profiling tools available today that can measure the performance
of software code by collecting information about different performance metrics. The
majority of these tools have one or more drawbacks related to accuracy, runtime
overhead and extended execution time. Table 3.1 shows a comparison of the profiling
tools discussed in this thesis.
Notice that SBP tools have functional and memory profiling capabilities. However, they
require the insertion of instrumentation code that interrupts the processor
at specific intervals to sample the data stored in the hardware registers in the system.
This can cause inaccurate profiled results to be reported, along with the introduction
of performance overhead during execution and an increase in file size. This is not
desirable in the design of embedded systems. One of the advantages of using
simulators is that the original program does not require any instrumentation code. This
is beneficial since the behaviour of the software program is not modified.
However, simulating large programs is very slow and is therefore an impractical option
for profiling large embedded system designs.
HCBP tools are mostly used for profiling memory systems; however, they do use
techniques that are similar to those used by software-based profiling tools, such as
sampling, which can affect the accuracy of the performance information retrieved.
The accuracy of the profiled results is dependent on the frequency of the sampling
rate.
FPGA-BP tools are clock-cycle accurate and do not introduce overhead during
software execution. The software program may require minimal code disturbance or
can be left alone, thus reducing the effect of unpredictable execution behaviour. As
shown in the table, FPGA-BP tools are not restricted to functional or memory
profiling. They also have the ability to detect communication bottlenecks between CPEs.
Feature gprof ISS VTune Valgrind Purify HWC PMA SnoopP Woodstock FLAT
Instrumentation Code X X X X
Sampling X X X X X X
Clock Cycle Accurate X X X
Performance Overhead X X X X X X
Simulation X X X
Software-Based X X X X X
Hardware-Based X X
FPGA-Based X X X
Functional Profiler X X X X X X
Memory Profiler X X X X
Other Profiler CPEs
Table 3.1: Comparison of Profiling Tools
Chapter 4
The Airwolf Profiler
This chapter introduces the FPGA-Based Profiling tool, the Airwolf Profiler. The
Airwolf Profiler contains a set of dedicated hardware counters that are used to profile
software code running on the Nios II Processor. It is a System On Programmable
Chip (SOPC) Builder ready component [18] that can be instantiated in any
Nios II Processor [32] based design. With a modified interface, the Airwolf
Profiler can also be instantiated on other embedded soft-core processors such as the
Tensilica Xtensa Soft-Core Processor [8] and the Xilinx MicroBlaze Soft-Core Processor
[73]. This chapter begins by describing the Airwolf Profiler's architecture. The later
sections explain how each of Airwolf's segment counters measures time and the
number of hits that occurred. Finally, a discussion of the Airwolf Profiler's software
drivers used to profile software code is provided.
4.1 The Airwolf Architecture
The Airwolf Profiler is an on-chip FPGA-BP tool used to profile software programs
running on the Nios II Processor in real time. This is done by determining the
runtime of each software function by accurately counting the number of system clock
cycles. Airwolf does not require any instrumentation code to be added to the binary file.
Instead, a pair of software drivers is placed around a software function block
in the source code in order to activate and deactivate a particular profiling counter
contained in Airwolf. This approach minimally disturbs the program and the software
behaviour during execution. The goal of the Airwolf Profiler is to provide accurate
results while minimally modifying the software code. Figure 4.1 depicts the Airwolf
Profiler's architecture.
As shown in the figure, the Airwolf Profiler contains the Time Counter Enable
(TCE) module and 20 profiling counters. This is sufficient for profiling large programs
that consist of a large number of software functions. Instantiating the Airwolf Profiler
onto the Stratix EP1S40F780C5 FPGA [16] consumes 3,345 logic elements. The
maximum operating frequency that the profiler can support is 120 MHz. Usually this
frequency is used for high-speed Nios II Processor systems [32].
The TCE module contains 20 Counter Enable (CE) registers which are used to
activate the appropriate profiling counter. The logic circuit in the TCE module is
dependent on the Address and Data_In bus inputs that are being fed from the Avalon
Interface Bus (AIB) [19]. The AIB contains all of the necessary control logic signals
that are used to manipulate the CE registers in the TCE module. The accompanying
software drivers of the Airwolf Profiler are programmed to access the appropriate CE
register by sending a unique address onto the interfacing bus. The output of each CE
register is fed into the input enable of the assigned profiling counter (shown as the
Time Counter Enabling Lines (TCELs) in the figure).
The Hits Counter Enable Lines (HCELs) are the output control lines coming from
the TCE module. Their purpose is to indicate when a function has been called as the
program is executing on the processor.
Figure 4.1: The Airwolf Profiler (block diagram: the Time Counter Enable module drives
TCEL #1 to #20 and HCEL #1 to #20 into the TCE-EN and HITS-EN inputs of segment
counters #1 to #20, which share the system clock and a 32-bit output bus)
Figure 4.2: The Airwolf Profiling Counter (block diagram: a 32-bit hits counter fed
from the HCEL and a 64-bit time counter enabled from the TCEL and clocked by the
system clock; address bits A1 and A0 select the hits count, MSB[63:32] or LSB[31:0]
onto the 32-bit output bus)
The Data_In and Address input buses are also used to extract the profiling data
stored in the profiling counters. These data are sent out to a host computer through
the Data_Out bus. A set of control signals provided by the AIB, namely the chipselect,
write_enable and read_enable signals, are used to prevent any illegal input or output
accesses of the CE registers and the profiling counters.
4.2 Airwolf Profiling Counter
The Airwolf Profiler contains 20 profiling counters which allow for up to 20 functions
to be profiled at a time. Figure 4.2 depicts the contents of each profiling counter.
Each profiling counter actually consists of two counters, a 32-bit hits counter and
a 64-bit time counter. The hits counter counts the number of positive edges of the
input HCEL control signal. When the appropriate profiling software driver activates
the profiling counter, the HCEL control signal becomes high for one clock cycle.
This signifies that the assigned function has been activated and the hits counter is
incremented by 1.
The 64-bit time counter is used to count the number of clock cycles of the
currently executing function. Each 64-bit time counter is capable of measuring over
100 million hours of profiling time when using a 50 MHz system clock (2^64 cycles
at 50 MHz is roughly 102 million hours). This ensures
that the overflow of the register will effectively never occur. There are two distinct
inputs to each of the 64-bit time counters, which are the time counter enable and
the clock inputs. The time counter enable input is fed by the appropriate TCEL
control line, which controls the counting sequence of the counter. While the TCEL signal
is high, the counter counts the number of
positive edges of the system clock. When the TCEL signal becomes low, counting of the
clock ticks is disabled. This concept is of great importance since it lets Airwolf accurately
count the number of clock ticks a function has taken. This helps to provide accurate
performance feedback which is beneficial for embedded system designers.
A multiplexer component that is controlled by the address bits from the Address
input bus exists in every profiling counter. It selects which data word is placed on
the AIB. In the end, the profiled data stored in these counters will be extracted by
calling the appropriate software driver and displayed back to the designer.
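The behaviour of one profiling counter, as described in this section, can be summarized in a short software model. This is a behavioural sketch for illustration and not the profiler's hardware description: the names are hypothetical, and the encoding of the multiplexer select value is an assumption (the text only states that address bits choose the word placed on the bus).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one profiling counter. */
typedef struct {
    uint32_t hits;        /* 32-bit hits counter                      */
    uint64_t time;        /* 64-bit time counter                      */
    int      prev_hcel;   /* previous HCEL level, for edge detection  */
} profiling_counter_t;

/* Models one positive edge of the system clock. */
void counter_clock(profiling_counter_t *c, int tcel, int hcel)
{
    assert(c != NULL);
    if (hcel && !c->prev_hcel)
        c->hits++;        /* positive edge on HCEL: one more call     */
    c->prev_hcel = hcel;
    if (tcel)
        c->time++;        /* count clock ticks only while TCEL is high */
}

/* Multiplexer: selects which 32-bit word reaches the output bus.
   Assumed encoding: 0 = LSB[31:0], 1 = MSB[63:32], 2 = hits. */
uint32_t counter_read(const profiling_counter_t *c, unsigned sel)
{
    assert(c != NULL && sel <= 2);
    switch (sel) {
    case 0:  return (uint32_t)(c->time & 0xFFFFFFFFu);
    case 1:  return (uint32_t)(c->time >> 32);
    default: return c->hits;
    }
}

/* Hours until a 64-bit counter wraps around at the given clock rate. */
double hours_to_overflow(double clock_hz)
{
    assert(clock_hz > 0.0);
    return 18446744073709551616.0 /* 2^64 */ / clock_hz / 3600.0;
}
```

Calling hours_to_overflow(50e6) gives a little over 102 million hours, which is consistent with the "over 100 million hours" figure quoted above.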
4.3 Airwolf's Software Drivers
To use the Airwolf Profiler, the source code must include the appropriate software
drivers to control the counting of the profiling counters. There are 40 software drivers
in total, and each profiling counter is assigned a pair of drivers. One driver is used
to activate the appropriate profiling counter and is usually placed at the beginning
of a function. Another driver is used to deactivate the appropriate profiling counter
and is placed at the end of a function block. The sample code below illustrates this
process.
The AIRWOLF_SECTION_ONE_START() driver calls on profiling counter #1 to start
measuring by counting the number of clock ticks and the number of calls made to
that function. Near the end of the function block, AIRWOLF_SECTION_ONE_STOP()
deactivates the CE #1 register, which disables profiling for profiling counter #1.

    void somefunction (int n)
    {
        AIRWOLF_SECTION_ONE_START();   /* activate profiling counter #1 */
        int addnumbers = 0;
        addnumbers += n;
        AIRWOLF_SECTION_ONE_STOP();    /* deactivate profiling counter #1 */
    }

    int main()
    {
        AIRWOLF_RESET();               /* reset all counters to 0 */
        somefunction (1000);
        AIRWOLF_OUTPUT();              /* extract the profiling data */
        return (0);
    }

Figure 4.3: An Example of Airwolf's Software Drivers
AIRWOLF_RESET() is a driver that resets and initializes all of the counters in the
profiler to 0. This software driver is usually placed at the beginning of the main
program.
AIRWOLF_OUTPUT() is a software function that extracts all of the data from the
profiling counters in the Airwolf Profiler. The data stored in the 64-bit time counter
needs to be split into two 32-bit words in order to be transported onto the 32-bit
data bus. Initially, the Most Significant Word (MSW) is retrieved, which corresponds
to bits 32-63 of the 64-bit time counter. These bits are stored in a 32-bit variable.
The Least Significant Word (LSW) is retrieved next and corresponds to bits 0-31 of
the 64-bit time counter. Those bits are stored in a separate 32-bit variable. To
merge these data together, the 32-bit variable containing the MSW data is cast into
a 64-bit variable and shifted 32 positions to the left. The 32-bit variable containing
the LSW data bits is then merged into the lower 32 bits of the 64-bit variable. This
process is done for all
of the 20 profiling counters.
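The merge step for one counter can be written as a small C function. This is a minimal sketch of the procedure described above and not the driver's actual source; the function name airwolf_merge_time is hypothetical. The widening cast has to happen before the shift, because shifting a 32-bit value by 32 positions is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuilds the 64-bit time count from the two 32-bit words read over
   the 32-bit data bus: msw holds bits 63-32 and lsw holds bits 31-0. */
uint64_t airwolf_merge_time(uint32_t msw, uint32_t lsw)
{
    uint64_t value = (uint64_t)msw;   /* widen to 64 bits first        */
    value <<= 32;                     /* move the MSW into bits 63-32  */
    value |= lsw;                     /* place the LSW in bits 31-0    */

    /* Sanity check: both halves must be recoverable from the result. */
    assert((uint32_t)(value >> 32) == msw && (uint32_t)value == lsw);
    return value;
}
```

For example, an MSW of 0x00000001 and an LSW of 0x00000002 merge to 0x0000000100000002, that is, 2^32 + 2 clock cycles.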
4.4 Summary
This chapter introduced the Airwolf Profiler and described its profiling architecture.
Each profiling counter was presented, along with the type of profiling information it
collects and stores. In Chapter 5, a profiling environment is introduced, which is used to
execute a set of software benchmarks. Each benchmark is profiled using the Airwolf
Profiler. To determine the accuracy of the performance information
provided by the Airwolf Profiler, the results are compared against a well-known
software-based profiler, GNU's gprof. In addition, a performance overhead analysis is
conducted, which compares the run-time of each function compiled with and without
instrumentation code.
Chapter 5
Experimental Results
This chapter presents an analysis and comparison of the profiled results obtained by using
the Nios2-gprof (SBP tool) and Airwolf (FPGA-BP tool) profilers. We first describe
the experimental environment and the profiling software benchmarks used. This
discussion includes details of the instantiated components of the Nios II Processor
System and a brief description of the operations involved in each of the software
benchmarks. A thorough analysis and critical comparison of the profiling results is then
presented. Finally, a performance overhead analysis is presented, which compares the
execution times of each software function with and without instrumentation code.
5.1 The Nios II Profiling Environment
For this experiment, a Nios II Processor system was created to serve as the profiling
environment for the software benchmarks. This environment consists of a processor
core, system timers, memory and a bus interconnect which connects all the
instantiated components. Figure 5.1 depicts the Nios II Processor System, which will be
referred to as the Nios II Profiling Environment (Nios-II-PE).
Figure 5.1: The Nios II Profiling Environment (block diagram: a Nios II/Fast processor
core with 64KB I-cache, 64KB D-cache and hardware MUL/DIV, an SRAM/Flash memory
controller, a UART controller, a system clock timer, a resolution timer and the
Airwolf Profiler, connected by a common bus)
Table 5.1: Nios Development Board Components
Nios Development Board, Stratix Professional Edition    41,250 Logic Elements
Static RAM (Off-Chip) Module                            1MB
Flash RAM (Off-Chip) Module                             8MB
Table 5.1 lists the instantiated components used in the Nios-II-PE. The Nios-II-PE
consists of the fast version of the Nios II Processor core, which is a soft-core processor
that is optimized for high performance in computationally-intensive applications at
the expense of consuming more logic elements on an FPGA. This processor is suitable
for executing the benchmarks used in this experiment. The core contains multiply
and divide hardware accelerators which allow multiplication and division operations
to be executed in hardware [32]. In addition, it contains separate instruction and
data cache memories of 64KB each. For the program, stack and data memories,
the Nios-II-PE utilizes the 1MB static Random Access Memory (RAM) module
which is located off-chip. Software benchmarks are downloaded onto this memory
module. There are two timers in this system, namely the system clock and
high-performance timers. They are required by Nios2-gprof in order to measure the
runtime of the software functions, and by some of the software benchmarks as well. An
instance of the Airwolf Profiler is used in the Nios-II-PE, consisting of all 20 profiling
counters. Each of these counters is assigned a specific software function to profile.
Software function assignments are based on the placement of the software drivers in
the program, as was explained in Section 4.3. The Universal Asynchronous Receiver
and Transmitter (UART) controller is used to communicate with the Nios-II-PE and
to transfer streaming messages back to the host computer. All of the instantiated
components in the Nios-II-PE are connected using the Avalon Interface Bus [19],
which provides all of the necessary control logic and data signals that are used to
communicate between the instantiated components.
5.2 FPGA Development Board and Design CAD Tools
The Nios Development Kit [3] was used to implement the Nios-II-PE. This kit contains
a Nios Development Board, Stratix Professional Edition, featuring a Stratix
EP1S40F780C5 FPGA chip. The chip features 41,250 logic elements, 3,423,744 memory
bits and 14 Digital Signal Processing (DSP) blocks [22]. There are available off-chip
memory modules that can be used, which include the 8MB flash memory, the
1MB SRAM and the 16MB SDRAM modules. In this experiment, the 1MB SRAM
was used for the program, stack and data memories for each benchmark. All of the
components on the development board utilized the 50MHz clock oscillator as the
system clock of the Nios-II-PE.
The supporting CAD tools that were used in this experiment are Quartus II
Version 5.0 SP2 [20], System On Programmable Chip (SOPC) Builder Version 5.0
[23] and Nios II Integrated Development Environment (IDE) Version 5.0 [21].
Quartus II [20] is a design environment that is used to synthesize hardware
description language (HDL) files and to generate the Static-RAM Object File (SOF)
that is used to program the FPGA. SOPC Builder [23] is a system builder tool that
builds embedded systems using different instantiated IP cores. It automatically
generates HDL files based on the instantiated components that are used in the system.
In addition, user-specified IP cores can be imported into SOPC Builder and can also
be utilized in a system.
Nios II IDE [21] is an environment that is used to write and compile C and
C++ code and download its binary image to run on a Nios II Processor System. It
contains a number of debugging tools that the designer may use to debug software
code, enabling them to view the data contents inside the Nios II Processor core's
registers. It also comes with an interface that is used to communicate with the Nios
II Processor System over a serial cable that is connected to the FPGA development
board. The console window that is displayed on the host computer shows the output
generated by the Nios II Processor System and other status messages.
5.3 Profiling Tools Setting
The profiling tools used in this experiment are Nios2-gprof and the Airwolf Profiler.
Each benchmark was imported into the Nios II IDE. Some additional
settings were applied to the software benchmarks in order to utilize these profiling
tools:
• Nios2-gprof: To utilize this profiler, the original program must be compiled
with instrumentation code (-pg), which causes the GCC compiler to insert extra
software interrupts and variable counters into the program's binary file. This is
required by Nios2-gprof so that it can collect performance information during
the execution of the software program.
• Airwolf Profiler: The Airwolf Profiler requires a pair of software drivers added
to the source code of the program. One driver is used to activate and the other
to deactivate the assigned function's profiling counter. This ensures that the
reported execution time is dedicated to the assigned function.
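The activate/deactivate driver pair can be pictured with a small software model. The sketch below is only an illustration in Python, not the Airwolf drivers themselves (which are added to the C source of the program); the class name, method names and the cycle timestamps are hypothetical, and the only figure taken from this chapter is the 50MHz system clock.

```python
CLOCK_HZ = 50_000_000  # 50MHz system clock of the development board (Section 5.2)


class ProfilingCounter:
    """Toy model of one profiling counter; names are illustrative assumptions."""

    def __init__(self):
        self.ticks = 0           # clock cycles accumulated while active
        self.calls = 0           # number of activations
        self._started_at = None  # cycle timestamp of the last activation

    def start(self, now):
        """Stands in for the 'activate' driver placed at function entry."""
        self._started_at = now
        self.calls += 1

    def stop(self, now):
        """Stands in for the 'deactivate' driver placed at function exit."""
        self.ticks += now - self._started_at
        self._started_at = None

    def seconds(self):
        """Convert the accumulated cycle count to seconds."""
        return self.ticks / CLOCK_HZ


counter = ProfilingCounter()
# Two hypothetical invocations, given as (entry, exit) cycle timestamps.
for entry, exit_ in [(0, 25_000_000), (40_000_000, 65_000_000)]:
    counter.start(entry)
    counter.stop(exit_)
```

Only the cycles between an activation and the matching deactivation are accumulated, which is why the reported time belongs to the assigned function alone.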
Each benchmark was compiled using the Nios II GCC compiler with the
highest optimization (-O3) setting. The compiler generates the executable
binary by optimizing the code for fast performance at the expense of a slightly larger
file size [1].
5.4 Profiling Software Benchmarks
The software benchmarks used in this experiment are listed in Table 5.2. These
benchmarks were based on the embedded software benchmark suite MiBench [43, 2]
and the UTNiosbenchmarks [53]. The following paragraphs describe each benchmark
briefly.
• BitCount: This benchmark tests the bit manipulation capabilities of a
microprocessor. Inputs to this benchmark are arrays of 1s and 0s. BitCount uses five
bit-counting and manipulation algorithms, which are the following: optimized
1-bit per loop, recursive bit count by nibbles, non-recursive bit count by nibbles
using a table look-up, non-recursive bit count by bytes using table look-up and
shift and count bits [43]. This benchmark was executed for 10,000,000 iterations.
Profiling Software Benchmarks
BitCount            Performs several bit manipulations for 10,000,000 iterations
Dijkstra            Computes the shortest path between 160 nodes
Game of Life        Cellular automaton program run for 100,000 passes
Fibo_Matrix_Mult    Computes the 40th Fibonacci term and then multiplies two 250x250 matrices
Dhrystone           Tests the integer performance of a processor for 100,000,000 iterations
Table 5.2: Benchmark Descriptions
• Dijkstra: This algorithm, developed by Edsger W. Dijkstra, finds the shortest
path between any pair of nodes. Dijkstra uses an adjacency matrix to compute
the shortest distance that is represented by a 100x100 matrix [43]. This
benchmark was modified to find the distance between 160 distinct nodes.
• Game of Life: Based on John Conway's Game of Life [15], this benchmark
is a cellular automaton program which models cells that are initially alive or
dead depending on the seed configuration [61]. A set of rules is followed which
determines each cell's birth or death in the next generation cycle. This benchmark
was executed for 100,000 passes.
• Fibo_Matrix_Mult: There are two functions in this benchmark that are called
sequentially. The first function is Fibonacci, which computes the 40th term of
a Fibonacci sequence recursively. The second function is Matrix_Mult, which
multiplies two 250x250 matrices.
Nios2-gprof                                  Airwolf Profiler
FCN Name   Time (Secs)  % Time  # of Calls   FCN Name   Time (Secs)  % Time  # of Calls
Dijkstra   41.56        71.43   160          Dijkstra   42.27        70.98   160
Enqueue    16.27        27.96   192739       Enqueue    16.61        27.89   192739
Dequeue    0.25         0.43    192739       Dequeue    0.52         0.88    192739
Read_int   0.05         0.09    25600        Read_int   0.12         0.20    25600
Qcount     0.05         0.09    192899       Qcount     0.03         0.05    192899
Table 5.3: Profiled Results for Dijkstra
• Dhrystone: A synthetic benchmark which assesses a system's integer performance.
The Nios II IDE provided this program to measure the performance of
the Nios II Processor Core [10].
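To make the BitCount description above more concrete, three of the listed counting strategies are sketched below in Python. These are generic textbook versions written for illustration; the function names are assumptions and the code is not taken from the MiBench sources, which are written in C.

```python
def count_1bit_per_loop(x):
    """One set bit per iteration: x & (x - 1) clears the lowest set bit."""
    n = 0
    while x:
        x &= x - 1
        n += 1
    return n


# Look-up table holding the number of set bits in every 4-bit value.
NIBBLE_TABLE = [bin(i).count("1") for i in range(16)]


def count_by_nibble_table(x):
    """Non-recursive count by nibbles using a table look-up."""
    n = 0
    while x:
        n += NIBBLE_TABLE[x & 0xF]
        x >>= 4
    return n


def count_shift_and_count(x):
    """Shift right one position at a time and add the lowest bit."""
    n = 0
    while x:
        n += x & 1
        x >>= 1
    return n
```

All three return the same result for a non-negative integer; they differ only in how many loop iterations and memory accesses are needed, which is what makes them useful as a bit-manipulation workload.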
5.5 Comparison of Profiled Results
Each benchmark was executed with Nios2-gprof and the Airwolf Profiler with their
respective software compilation settings. In the subsequent paragraphs, an analysis
of the profiled results is presented for each of the benchmarks listed in Table 5.2.
5.5.1 Dijkstra
Table 5.3 shows the profiled results for the Dijkstra benchmark. The first four columns
show the results obtained by Nios2-gprof and the latter four columns show the
results obtained with the Airwolf Profiler. The first column gives the function name. The
second column shows the execution time of each function. The third column shows
the function's execution time as a percentage of the total execution time of the benchmark.
The number of function calls is displayed in the fourth column. The same explanation
Nios2-gprof                                     Airwolf Profiler
FCN Name     Time (Secs)  % Time  # of Calls    FCN Name     Time (Secs)  % Time  # of Calls
Fibonacci    172.14       82.69   204668308     Fibonacci    195.17       84.46   204668309
Matrix_Mult  36.03        17.31   1             Matrix_Mult  35.90        15.54   1
Table 5.4: Profiled Results for Fibo_Matrix_Mult
applies for the remaining columns in the table and for all subsequent tables.
The two profilers' results are alike, having similar execution times and rankings of
computationally intensive functions. The Dijkstra function is reported to run for
41.56 seconds by Nios2-gprof, whereas the Airwolf Profiler reported 42.27 seconds.
There are very minor differences in the reported execution times of the remaining
software functions. This implies that Nios2-gprof reports results with comparable
accuracy to those of the Airwolf Profiler for smaller, less computationally intensive
benchmarks. Airwolf attained an improvement in accuracy of 1.67%.
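The text does not state how the improvement percentage is computed. As a hedged reading, the quoted figure is reproduced to within rounding by taking the difference between the two reported times relative to the larger of the two; the Python sketch below makes that assumption explicit, and the function name is illustrative.

```python
def relative_difference(gprof_secs, airwolf_secs):
    """Percentage difference between two reported times.

    Assumption: the difference is normalized by the larger of the two
    values. This is inferred from the reported figures, not stated in
    the text.
    """
    larger = max(gprof_secs, airwolf_secs)
    return abs(gprof_secs - airwolf_secs) / larger * 100.0


# Values taken from Table 5.3.
dijkstra = relative_difference(41.56, 42.27)
enqueue = relative_difference(16.27, 16.61)
print(f"Dijkstra: {dijkstra:.2f}%  Enqueue: {enqueue:.2f}%")
```

For the Dijkstra function this gives a value within a few hundredths of a percentage point of the quoted 1.67%.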
5.5.2 Fibo_Matrix_Mult
Table 5.4 depicts the profiled results for the Fibo_Matrix_Mult benchmark. Nios2-gprof
reported that the Fibonacci function was called 204,668,309 times. Similarly, Airwolf
reported that the number of calls to Fibonacci was 204,668,309. In terms of
the run-time, Nios2-gprof and Airwolf reported that the function was running for
172.14 and 195.17 seconds respectively. This implies that the sampling technique
used in Nios2-gprof has produced an inaccurate report of the execution time when
profiling recursive function calls. In contrast, the clock-cycle counting method that
Airwolf utilizes shows an 11.79% accuracy improvement in the reported time for that
function.
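The reported call count can be cross-checked arithmetically. The Python sketch below assumes a naive recursive Fibonacci whose base cases are the first and second terms; this structure is an assumption, since the benchmark source is not reproduced here. Under it, the number of calls C(n) satisfies C(n) = C(n-1) + C(n-2) + 1 with C(1) = C(2) = 1.

```python
def fib_calls_measured(n):
    """Count the calls made by a naive recursive Fibonacci (terms 1 and 2 are base cases)."""
    calls = 0

    def fib(k):
        nonlocal calls
        calls += 1
        if k <= 2:
            return 1
        return fib(k - 1) + fib(k - 2)

    fib(n)
    return calls


def fib_calls_predicted(n):
    """Same count from the recurrence C(n) = C(n-1) + C(n-2) + 1, C(1) = C(2) = 1."""
    if n <= 2:
        return 1
    a, b = 1, 1
    for _ in range(3, n + 1):
        a, b = b, a + b + 1
    return b


# The recurrence is evaluated iteratively, so n = 40 is cheap to compute.
calls_for_40th_term = fib_calls_predicted(40)
```

Under this assumption the 40th term requires 204,668,309 calls, which agrees with the call count quoted above.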
The Matrix_Mult function had a very minor difference in the reported time between
Nios2-gprof
FCN Name                  Time (Sec)  % Time  # of calls
set_new_grid_pres_state   24.02       30.61   100000
set_cell_next_state       21.92       27.93   20000000
adjust_neigh_cnt          19.70       25.11   20000200
set_grid_next_state       12.57       16.02   100000
init_grid                 0.26        0.33    1
Table 5.5: Profiled Results for Game of Life using Nios2-gprof
the two profilers. The percentage difference is 0.36%.
5.5.3 Game of Life
Tables 5.5 and 5.6 show the results for the Game of Life benchmark using Nios2-gprof
and Airwolf respectively. Nios2-gprof reported the function
set_new_grid_pres_state as being the longest running function. This is reported
similarly by Airwolf as well. Looking further into the tables, the computationally
intensive functions are ranked differently between the two profilers. Nios2-gprof ranked
set_cell_next_state, adjust_neigh_cnt, and set_grid_next_state in the order
of the longest running functions.
Airwolf had a different ranking which listed set_grid_next_state,
set_cell_next_state and adjust_neigh_cnt as the order of computationally
intensive functions. Results like those reported by Nios2-gprof can potentially mislead
embedded designers into assigning the wrong function for hardware implementation.
The execution times of each function as reported by Nios2-gprof are slightly
inaccurate. Most noticeable is the set_grid_next_state function, which was reported
to have run for 12.57 and 18.57 seconds by Nios2-gprof and Airwolf respectively.
Using Airwolf can provide an increase in accuracy of 32.3%.
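The ranking disagreement can be checked mechanically. The following Python sketch sorts the function times from Tables 5.5 and 5.6 and lists the functions whose position differs between the two profilers; the helper names are illustrative and not part of either tool.

```python
# Execution times in seconds, copied from Tables 5.5 and 5.6.
gprof = {
    "set_new_grid_pres_state": 24.02,
    "set_cell_next_state": 21.92,
    "adjust_neigh_cnt": 19.70,
    "set_grid_next_state": 12.57,
    "init_grid": 0.26,
}
airwolf = {
    "set_new_grid_pres_state": 28.99,
    "set_grid_next_state": 18.57,
    "set_cell_next_state": 17.62,
    "adjust_neigh_cnt": 14.58,
    "init_grid": 0.00021,
}


def ranking(times):
    """Function names ordered from longest to shortest running."""
    return sorted(times, key=times.get, reverse=True)


def rank_changes(a, b):
    """Map each function whose rank differs to its (rank in a, rank in b), 1-based."""
    ra, rb = ranking(a), ranking(b)
    return {
        name: (ra.index(name) + 1, rb.index(name) + 1)
        for name in ra
        if ra.index(name) != rb.index(name)
    }


changes = rank_changes(gprof, airwolf)
```

With these values, set_grid_next_state moves from fourth place under Nios2-gprof to second place under Airwolf, which is the kind of shift that would change a hardware-software partitioning decision.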
Airwolf Profiler
FCN Name                  Time (Sec)  % Time   # of calls
set_new_grid_pres_state   28.99       36.32    100000
set_grid_next_state       18.57       23.28    100000
set_cell_next_state       17.62       22.08    20000000
adjust_neigh_cnt          14.58       18.28    20000200
init_grid                 0.00021     0.00036  1
Table 5.6: Profiled Results for Game of Life using Airwolf
Nios2-gprof
FCN Name           Time (Secs)  % Time  # of Calls
ntbl_bitcnt        71.88        22.35   80000000
bit_shifter        66.27        20.60   10000000
bit_count          63.55        19.76   10000000
main               47.10        14.64   1
ntbl_bitcount      24.51        7.62    10000000
ar_btbl_bitcount   19.76        6.14    10000000
bitcount           17.41        5.41    10000000
btbl_bitcnt        6.47         2.01
bw_btbl_bitcount   4.40         1.37    10000000
Flipbit            0.28         0.09
Table 5.7: Profiled Results for BitCount using Nios2-gprof
Airwolf
FCN Name           Time (Secs)  % Time  # of Calls
bit_shifter        196.64       54.40   10000000
bit_count          61.98        17.15   10000000
ntbl_bitcnt        51.34        14.20   80000000
ntbl_bitcount      22.26        6.16    10000000
ar_btbl_bitcount   15.04        4.16    10000000
bitcount           12.63        3.49    10000000
bw_btbl_bitcount   4.40         0.44    10000000
main               0.01         0.00    1
btbl_bitcnt        0.00         0.000
Flipbit            0.00         0.00
Table 5.8: Profiled Results for BitCount using Airwolf
5.5.4 BitCount
Tables 5.7 and 5.8 show the profiled results for the BitCount benchmark using the
Nios2-gprof and Airwolf profilers respectively. There is a significant difference in
the reported execution time of each function when the results from each profiler are
compared. Not only are the execution times different, but Nios2-gprof also ranked
the most time consuming functions differently than Airwolf. Nios2-gprof listed
ntbl_bitcnt, bit_shifter and bit_count as the most time consuming functions,
whereas the Airwolf Profiler reported that the bit_shifter, bit_count
and ntbl_bitcnt functions contributed the most toward the total execution time of
the benchmark.
Nios2-gprof reported that bit_shifter ran for 66.27 seconds, whereas the Airwolf
Profiler measured that function to take 196.64 seconds on the processor. Once
again, due to the sampling technique used by Nios2-gprof, the profiler provided an
inaccurate report of the execution time. The Airwolf Profiler provided up to a 66.2%
improvement in accuracy in some of the functions.
As for the ntbl_bitcnt function, which was called recursively, Nios2-gprof and
Airwolf reported that the function was running for 71.88 and 51.34 seconds
respectively. This shows that Nios2-gprof reports inaccurate execution times when profiling
recursive functions.
Nios2-gprof reported that the btbl_bitcnt and Flipbit functions were called
during the execution of the benchmark. However, the Airwolf Profiler did not detect
calls to those functions. The insertion of instrumentation code not only generates
additional function calls and interrupts, but it can also cause unpredictable behaviour
of the executing program.
5.5.5 Dhrystone
Table 5.9 shows the profiled results for the Dhrystone benchmark. Both profilers
have similarly ranked the most time consuming functions. However, the reported
execution times of each function were quite different. Proc_8 was reported to run for
100.52 seconds by Nios2-gprof, whereas Airwolf reported 78.26 seconds. This shows
a 22.1% improvement with the Airwolf Profiler. Additionally, Proc_6 was reported
by Nios2-gprof to run for 80.52 seconds, and Airwolf reported that function to run
for 62.28 seconds. The improved accuracy using Airwolf in that function is 22.6%.
Proc_4 also had a significant difference in the reported execution time. Nios2-gprof
reported that Proc_4 ran for 30.01 seconds. In contrast, Airwolf reported that the function
ran for 18.00 seconds, which provides a 40% accuracy improvement.
Other noticeably inaccurate execution times were reported for the functions
Proc_1, Proc_3 and Func_1. Proc_1 was reported to take 131.84 and 106.53 seconds
by Nios2-gprof and the Airwolf Profiler respectively. This amounts to a 19.19%
Nios2-gprof                                  Airwolf Profiler
FCN Name  Time (Secs)  % Time  # of Calls    FCN Name  Time (Secs)  % Time  # of Calls
Func_2    240.02       30.02   100000000     Func_2    253.64       38.32   100000000
Proc_1    131.84       16.49   100000000     Proc_1    106.53       16.01   100000000
Proc_8    100.52       12.57   100000000     Proc_8    78.26        11.83   100000000
Proc_6    80.52        10.07   100000000     Proc_6    62.28        9.41    100000000
Func_1    67.69        8.47    300000000     Proc_2    36.00        5.44    100000000
Proc_3    49.19        6.15    100000000     Func_1    34.69        5.24    300000000
Proc_7    38.13        4.77    300000000     Proc_7    30.14        4.55    300000000
Proc_2    36.91        4.62    100000000     Proc_3    22.13        3.34    100000000
Proc_4    30.01        3.75    100000000     Proc_4    18.00        2.72    100000000
Proc_5    15.11        1.89    100000000     Func_3    10.14        1.53    100000000
Func_3    9.69         1.21    100000000     Proc_5    10.00        1.51    100000000
Table 5.9: Profiled Results for Dhrystone
improvement in accuracy when using the Airwolf Profiler. Another observation is
with regards to the reported times of Func_1 and Proc_3. Nios2-gprof reported that
Func_1 took 67.69 seconds to execute and Proc_3 ran for 49.19 seconds. However,
the results obtained with the Airwolf Profiler showed that Func_1 had an execution
time of 34.69 seconds and that Proc_3 executed for 22.13 seconds. This amounts to an
improvement in accuracy of up to 55% with the Airwolf Profiler.
5.5.6 Summary
The Airwolf Profiler has experimentally demonstrated a significant improvement in
achieving accurate profiled results. Airwolf's measuring technique is to precisely
count the number of system clock ticks that a function has taken, without any sampling
methods or inserted instrumentation code. In some of the profiling software
benchmarks, Airwolf has attained a 66.2% improvement. In addition, Airwolf has
ranked computationally intensive functions differently than the software-based profiler,
Nios2-gprof. These improvements can greatly benefit designers and guide them
in making a proper hardware-software partition of the embedded system. In the next
section, a performance overhead analysis is conducted which compares the actual
run-time of a program with and without the insertion of instrumentation code into the
software program.
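Why periodic sampling can misattribute time while cycle counting cannot is easiest to see on a deliberately contrived workload. The Python sketch below is a worst-case illustration under stated assumptions, not a model of how Nios2-gprof actually samples: the two function names, their tick counts and the sampling period are all hypothetical, chosen so that the period matches the length of the repeating pattern.

```python
def exact_share(timeline):
    """Fraction of ticks spent in each function when every tick is counted."""
    totals = {}
    for name in timeline:
        totals[name] = totals.get(name, 0) + 1
    return {name: count / len(timeline) for name, count in totals.items()}


def sampled_share(timeline, period):
    """Fraction attributed to each function when only every period-th tick is observed."""
    return exact_share(timeline[::period])


# Hypothetical workload: short_fn runs for 3 ticks, then long_fn for 7 ticks,
# and the pattern repeats 1000 times. Each list entry is one clock tick.
timeline = (["short_fn"] * 3 + ["long_fn"] * 7) * 1000

exact = exact_share(timeline)          # counts every tick
sampled = sampled_share(timeline, 10)  # observes ticks 0, 10, 20, ...
```

Counting every tick attributes 30% of the time to short_fn and 70% to long_fn, whereas the sampler, whose period happens to coincide with the pattern, lands in short_fn every time and attributes all of the time to it. Real workloads are rarely this regular, so the error in practice is smaller, but the mechanism is the same.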
5.6 Performance Overhead Analysis
Nios2-gprof requires the C/C++ file to be compiled with instrumentation code, which
generates additional software interrupts and counter variables in the original program.
This can lead to a large increase in the execution time of the benchmark and can cause
an inconvenience to the embedded system designer, who has to wait (potentially, for
many hours) to retrieve the profiled results. This especially applies as the software
code size grows larger.
In this section, an analysis of the performance overhead will be conducted for
the software benchmarks discussed above. Each software program was compiled with
the default debug (-g) setting while the same assigned functions were profiled with
the Airwolf Profiler. The performance overhead was determined by summing the
function execution times of each profile run, with and without instrumentation code,
and comparing the two totals.
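The exact normalization behind the overhead percentages is not spelled out in the text. The Python sketch below assumes that the added time is expressed as a fraction of the instrumented total, a choice inferred from the fact that it reproduces the percentages reported in Tables 5.10 and 5.11; the function name is illustrative.

```python
def performance_overhead(rows):
    """Overhead in percent from per-function (without, with) times in seconds.

    Assumption: overhead = (T_with - T_without) / T_with, where each T is the
    sum over all profiled functions. Inferred from the reported figures.
    """
    total_without = sum(without for without, _ in rows.values())
    total_with = sum(with_ for _, with_ in rows.values())
    return (total_with - total_without) / total_with * 100.0


# (time without instrumentation, time with instrumentation), from Table 5.10.
dijkstra = {
    "Dijkstra": (42.20, 43.24),
    "Enqueue": (16.60, 17.22),
    "Dequeue": (0.52, 0.92),
    "Read_int": (0.12, 0.12),
    "Qcount": (0.031, 0.031),
}
# From Table 5.11.
fibo_matrix_mult = {
    "Fibonacci": (195.17, 357.90),
    "Matrix_Mult": (35.90, 36.00),
}

print(f"Dijkstra: {performance_overhead(dijkstra):.2f}%")
print(f"Fibo_Matrix_Mult: {performance_overhead(fibo_matrix_mult):.2f}%")
```

Under this assumption the two benchmarks evaluate to 3.35% and 41.34%, matching the values given in the tables.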
5.6.1 Dijkstra
Table 5.10 shows the performance overhead analysis for Dijkstra. Column 1 lists
the function names. Columns 2 and 3 show the execution times when the program
was executing without and with instrumentation code respectively. The last column
FCN Name   Time (Sec)          Time (Sec)             Difference (Sec)
           No instrumentation  With instrumentation
Dijkstra   42.20               43.24                  1.04
Enqueue    16.60               17.22                  0.62
Dequeue    0.52                0.92                   0.4
Read_int   0.12                0.12                   0
Qcount     0.031               0.031                  0
Performance Overhead: 3.35%
Table 5.10: Performance Overhead Analysis for Dijkstra
FCN Name     Time (Sec)          Time (Sec)             Difference (Sec)
             No instrumentation  With instrumentation
Fibonacci    195.17              357.90                 162.74
Matrix_Mult  35.90               36.00                  0.10
Performance Overhead: 41.34%
Table 5.11: Performance Overhead Analysis for Fibo_Matrix_Mult
shows the time difference between the two compilation runs.
As evident from the table, there is very little time difference when instrumentation
code is added, at most 1.04 seconds. This implies that profiling with Nios2-gprof on
smaller benchmarks, such as Dijkstra, contributes minimal performance overhead. In
this case, only 3.35% of additional execution time was contributed by the
instrumentation code.
5.6.2 Fibo_Matrix_Mult
Table 5.11 depicts the performance overhead analysis for the Fibo_Matrix_Mult
benchmark. Notice that the Fibonacci function has taken 162.74 seconds of additional
FCN Name                  Time (Sec)          Time (Sec)             Difference (Sec)
                          No instrumentation  With instrumentation
set_new_grid_pres_state   28.98               42.71                  13.73
set_grid_next_state       18.57               32.28                  13.70
set_cell_next_state       17.62               17.60                  0.02
adjust_neigh_cnt          14.58               14.61                  0.03
init_grid                 0.00021             0.00021                0.00
Performance Overhead: 25.60%
Table 5.12: Performance Overhead Analysis for Game of Life
execution time. The added instrumentation code changed the behaviour of the software
benchmark. Since the Fibonacci function was called recursively, this implies
that instrumentation code adds significant performance overhead when profiling
recursive functions with Nios2-gprof. This has caused the entire benchmark to have a
performance overhead of 41.34%.
5.6.3 Game of Life
Table 5.12 shows the performance overhead analysis for the Game of Life benchmark.
The set_new_grid_pres_state and set_grid_next_state functions show noticeable
increases in execution time with the added instrumentation code, by 13.73 and
13.70 seconds respectively. The remainder of the functions had very minor differences,
at most 0.033 seconds. Once again, the inserted code caused an increase in run-time
of those two functions, contributing nearly 25.6% of performance overhead.
5.6.4 BitCount
Table 5.13 shows the performance overhead analysis for the BitCount benchmark.
The instrumentation code added 48.43 seconds in execution time
FCN Name           Time (Sec)          Time (Sec)             Difference (Sec)
                   No instrumentation  With instrumentation
bit_shifter        196.64              197.31                 0.67
bit_count          61.98               62.18                  0.20
ntbl_bitcnt        51.34               99.77                  48.43
ntbl_bitcount      22.26               22.31                  0.05
ar_btbl_bitcount   15.04               15.09                  0.05
bitcount           12.63               12.66                  0.03
bw_btbl_bitcount   1.60                1.60                   0.00
btbl_bitcnt        0.00                0.00                   0
Performance Overhead: 12.10%
Table 5.13: Performance Overhead Analysis for BitCount
to the recursively called function ntbl_bitcnt. This strongly supports the idea
that profiling recursive functions with Nios2-gprof can cause a significant increase in
run-time execution. The execution times of the other functions listed in this table
were affected very little. This has resulted in an overall performance overhead of 12.10%.
5.6.5 Dhrystone
Table 5.14 depicts the execution time differences of each software function in
Dhrystone. Some of the functions showed a slight decrease in execution time, resulting
in the negative time differences shown in the table. The instrumentation code may
have caused a change in behaviour in those functions. Since those negative values
are small, however, they minimally affect the entire benchmark's execution time.
Notice that the software functions Func_2 and Proc_6 show a significant increase
in run-time, adding 107.59 and 67.73 seconds respectively. The overall performance
overhead for Dhrystone is 21.59% when using Nios2-gprof.
FCN Name  Time (Sec)          Time (Sec)             Difference (Sec)
          No instrumentation  With instrumentation
Func_2    253.64              361.23                 107.591
Proc_1    106.53              106.00                 -0.531
Proc_8    78.26               83.00                  4.736
Proc_6    62.28               130.00                 67.725
Proc_2    36.00               37.59                  1.590
Func_1    34.69               33.00                  -1.687
Proc_7    30.14               30.00                  -0.138
Proc_3    22.13               25.00                  2.868
Proc_4    18.00               18.25                  0.247
Func_3    10.14               10.00                  -0.140
Proc_5    10.00               10.00                  0.000
Performance Overhead: 21.59%
Table 5.14: Performance Overhead Analysis for Dhrystone
5.6.6 Summary
The results presented in this analysis have shown that the insertion of the
instrumentation code in the program's binary file contributed additional and unnecessary
run-time for certain software functions. In particular, the computationally intensive
functions executed longer than normal, contributing up to 41.34% of performance
overhead. The instrumentation code not only adds additional interrupt calls but also
changes the behaviour of the entire program execution. This is undesirable since
designers must rely on the actual program behaviour in order to retrieve accurate
profiled results. FPGA-BP tools require minimal or no instrumentation code added
to the program, which makes them more desirable compared to SBP tools.
Chapter 6
Conclusions and Future Work
This dissertation has discussed and qualitatively compared the existing profiling tools
used for profiling software code. The different measuring techniques that each profiler
uses can retrieve different performance metrics, although with varying accuracy in the
profiled results. A proposed FPGA-based profiler, the Airwolf Profiler, was used to
profile a set of profiling software benchmarks. These results were compared with
those generated by a well-known software-based profiler, Nios2-gprof. The results
show that FPGA-based profilers provide a significant improvement in accuracy of the
profiled results based on the measured execution time of each software function. This
benefits embedded designers and guides them to a proper hardware-software partition
of an embedded system. This chapter gives a brief summary of the work that has
been presented.
Chapter 2 described the four different approaches for the design of embedded
systems: the Traditional Design Methodology, Hardware-Software Co-Design,
Function-Architecture Co-Design and Platform-Based Design.
In Chapter 3, a comprehensive survey and comparison of existing profiling tools
was presented. A classification of these tools was proposed, namely Software-Based
Profilers (SBP), Software-Based Memory Profilers (SBMP), Hardware Counter-Based
Profilers (HCBP) and FPGA-based Profilers (FPGA-BP).
In Chapter 4, an FPGA-BP tool, the Airwolf Profiler, was introduced. Airwolf's
profiling architecture was discussed and a description of how the profiler accurately
measures the execution time of a software function was given. Airwolf's profiling
counters along with its supporting software drivers were also presented.
In Chapter 5, the profiling environment and the supporting CAD tools were
explained. The profiling software benchmarks were described, which were used to obtain
profiled results using the Nios2-gprof and Airwolf profilers. An analysis of the
retrieved profiled results from both profilers was presented. This analysis was based
on the execution time that each profiler measured. It was experimentally demonstrated
that the Airwolf Profiler provided up to a 66.2% improvement in accuracy
over Nios2-gprof in some of the software functions. In addition, performance overhead
analysis was used to compare the execution times between two programs: one
that contained instrumentation code and one that did not. It was shown that the
insertion of instrumentation code caused a significant increase in execution time in
some of the functions, contributing up to 41.34% in run-time performance overhead.
This added time and overhead is unnecessary since it causes Nios2-gprof to report
inaccurate execution times of each function and delays the designer in
retrieving the profiled results.
6.1 Research Contributions
The research contributions made in this dissertation are as follows:
• An FPGA-BP tool, the Airwolf Profiler, was proposed and developed to profile
a set of profiling software benchmarks. It has provided highly accurate profiled
results which are very useful for embedded system designers.
• The Nios II Profiling Environment was developed and was used to implement
the two profilers in order to execute and profile different software benchmarks.
• Performance overhead analysis was conducted in order to observe the effects
of adding instrumentation code to a program's binary file. It was shown that
certain software functions executed abnormally, causing an increase in run-time
execution.
6.2 Future Work
The Airwolf Profiler was designed for research purposes to profile software applications
running on an Altera Nios II Processor [32] implemented on an Altera Stratix
FPGA [22]. The tool can easily be modified to become an instruction address-based
profiler that has the capability of monitoring the current instruction in execution
on the processor. This concept can provide an improvement in the profiled results
compared to the current software driver strategy.
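The idea of an instruction address-based profiler can be sketched in software: every observed instruction address is attributed to the function whose address range contains it, so no drivers need to be added to the program. The Python model below is purely illustrative; the symbol table, the addresses and the one-address-per-cycle trace format are hypothetical assumptions and do not describe an existing Airwolf feature.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [(0x1000, "main"), (0x1200, "Fibonacci"), (0x1400, "Matrix_Mult")]
END_OF_TEXT = 0x1800  # first address past the last function (assumed)
STARTS = [start for start, _ in SYMBOLS]


def function_at(pc):
    """Return the function whose address range contains pc, or None."""
    if pc < STARTS[0] or pc >= END_OF_TEXT:
        return None
    return SYMBOLS[bisect.bisect_right(STARTS, pc) - 1][1]


def cycles_per_function(pc_trace):
    """Count cycles per function, given one instruction address per clock cycle."""
    totals = {}
    for pc in pc_trace:
        name = function_at(pc)
        if name is not None:
            totals[name] = totals.get(name, 0) + 1
    return totals


# A short hypothetical trace of instruction addresses, one entry per cycle.
trace = [0x1000, 0x1004, 0x1200, 0x1204, 0x1208, 0x1400, 0x1004]
totals = cycles_per_function(trace)
```

In hardware, the range comparison would be performed by address comparators watching the instruction bus, with one counter per monitored range; the dictionary in this sketch merely stands in for those counters.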
In future work, the Airwolf Profiler can be enhanced to cover memory profiling
as well, so that it can monitor memory-related events such as the number of off-chip
memory accesses, cache misses and memory leaks. This can further benefit
embedded system designers and help in improving certain portions of the software
code that cause memory-related performance issues. The Airwolf Profiler can also be
easily modified to work with other FPGA-based soft-core processors such as the Xilinx
MicroBlaze [73].
References
[1] GNU GCC Document. http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc, Accessed
February 2006.
[2] MiBench Version 1.0. http://www.eecs.umich.edu/mibench/, Accessed February
2006.
[3] Nios II Development Kit. http://www.altera.com/products/devkits/altera/kit-nios_1S40.html,
Accessed September 2005.
[4] Sun One Studio 5, Standard Edition.
http://www.sun.com/download/products.xml?id=3edd36bd, Accessed September
2005.
[5] The Linux Homepage. http://www.linux.org, Accessed September 2006.
[6] The Unix System. http://www.unix.org, Accessed September 2006.
[7] The Windows Homepage. http://www.microsoft.com/windows/default.asp,
Accessed September 2006.
[8] The Xtensa 7 Processor for SOC Design.
http://www.tensilica.com/products/xtensa_7.htm, Accessed September 2006.
[9] AMD Athlon Processor, x86 Code Optimization Guide, 2002.
[10] Dhrystone Code. http://www.netlib.org/benchmark/dhry-c, Accessed May
2005.
[11] Intel Corporation. http://www.intel.com, Accessed May 2005.
[12] Mentor Graphics Performance Profiler User's and Reference Manual (Software
Version 2.2), May 2005.
[13] SystemC. http://www.systemc.org, Accessed July 2005.
[14] Valgrind. http://www.valgrind.org, Accessed October 2005.
[15] John Conway's Game of Life. http://www.bitstorm.org/gameoflife/, Accessed
May 2006.
[16] Altera Corporation. Nios Development Board Reference Manual, Stratix
Professional Edition, September 2004.
[17] Altera Corporation. http://www.altera.com/, Accessed January 2005.
[18] Altera Corporation. Altera Embedded Peripherals, October 2005.
[19] Altera Corporation. Avalon Interface Specification, April 2005.
[20] Altera Corporation. Introduction to Quartus II Version 5.0, April 2005.
[21] Altera Corporation. Nios II IDE Help System Version 5.0, October 2005.
[22] Altera Corporation. Stratix Device Handbook - Volume 1, July 2005.
[23] Altera Corporation. System On Programmable Chip Builder Version 5.0,
October 2005.
[24] D. L. Anderson and D. Brucks. An introduction to sampling and time. Technical
report, Intel Corporation, 2005.
[25] K. Banovic, M. A. S. Khalid, and E. Abdel-Raheem. Fpga-based rapid
prototyping of digital signal processing systems. In Proc. of the 48th Mid-West Symposium
on Circuits and Systems, pages 647-650, August 2005.
[26] A. Bonivento and A. Sangiovanni-Vincentelli. Platform based design for wireless
sensor networks. In Proc. of the 2nd Annual Workshop on Networking with
Ultra Wide Band and Workshop on Ultra Wide Band for Sensor, pages 9-19,
July 2005.
[27] S. Brini, D. Benjelloun, and F. Castanier. A flexible virtual platform for
computational and communication architecture exploration of dmt vdsl modems.
In Proc. of the 2003 Design, Automation and Test in Europe Conference and
Exhibition, pages 164-169, December 2003.
[28] S. Brown, S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A
scalable cross-platform-infrastructure for application performance tuning using
hardware-counters. In Proc. of the 2000 ACM/IEEE conference on
Supercomputing, 2000.
[29] D. Burger and T. M. Austin. The simplescalar tool set version 2.0. ACM
SIGARCH Computer Architecture News, 25(3):13-25, June 1997.
[30] C. J. N. Coelho Jr., D. C. Da Silva Jr., and A. O. Fernandes. Hardware-software
codesign of embedded systems. In Proc. of the 11th Brazilian Symposium on
Integrated Circuit Design, pages 2-8, January 1998.
[31] K. Compton and S. Hauck. Reconfigurable computing: A survey of systems and
software. ACM Computing Surveys, 34(2):171-210, June 2002.
[32] Altera Corporation. Nios II Processor Handbook, October 2005.
[33] C. Erickson. Memory leak detection in C++. Linux Journal, 2002, October 2002.
[34] C. Erickson. Memory leak detection in embedded systems. Linux Journal, 2003,
June 2003.
[35] R. Ernst, J. Henkel, and T. Benner. Hardware-software cosynthesis for
microcontrollers. IEEE Transactions on Design and Test of Computers, 10(4):64-75,
December 1993.
[36] J. Fenlason and R. Stallman. Gnu gprof.
http://www.gnu.org/software/binutils/manual/gprof-2.9.1/, January 1997.
[37] J. Fleishmann and K. Buchenrieder. A hardware-software prototyping
environment for dynamically reconfigurable embedded systems. In Proc. of the 6th
International Workshop on Hardware-Software Co-Design, pages 105-109, March
1998.
[38] J. Fleishmann and K. Buchenrieder. Codesign of embedded systems based on java
and reconfigurable hardware components. In Proc. of the Design, Automation
and Test in Europe Conference and Exhibition, pages 768-769, March 1999.
[39] D. W. Franke and M. K. Purvis. Hardware/software codesign: a perspective. In
Proc. of the 13th international conference on Software engineering, pages
344-352, May 1991.
[40] A. Gordon-Ross and F. Vahid. Frequent loop detection using efficient
non-intrusive on-chip hardware. In Proc. of the 2003 International Conference on
Compilers, Architecture and Synthesis for Embedded Systems, pages 117-124,
November 2003.
[41] Z. Guo, W. Najjar, F. Vahid, and K. Vissers. A quantitative analysis of the
speedup factors of fpgas over processors. In Proc. of the 12th International
Symposium on Field Programmable Gate Arrays, pages 162-170, February 2004.
[42] R. K. Gupta and G. DeMicheli. Hardware-software cosynthesis for digital
systems. IEEE Transactions on Design and Test of Computers, 10:29-41, September
1993.
[43] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. D.
Brown. Mibench: A free, commercial representative embedded benchmark suite.
In Proc. of the 4th Annual Workshop on Workload Characterization, pages 3-14,
December 2001.
[44] IBM Corporation. Rational PurifyPlus, Rational Purify, Rational Pure
Coverage, Rational Quantify. Installing and Getting Started. Version 2003.06.00,
Technical White Paper.
[45] Intel Corporation. Using Intel VTune's Counter Monitor, January 2005.
[46] Intel Corporation. IA-32 Intel Architecture Software Developer's
Manual. http://developers.sun.com/prodtech/cc/articles/pcounters.html, Accessed
February 2006.
[47] Intel's VTune. http://www.intel.com/vtune, Accessed January 2006.
[48] M. Itzkowitz, J. N. Wylie Brian, C. Aoki, and N. Kosche. Memory profiling
using hardware counters. In Proc. of the 2003 ACM/IEEE conference on
Supercomputing, pages 17-30, July 2003.
[49] K. Keutzer, S. Malik, A. R. Newton, M. Rabaey, and A. Sangiovanni-Vincentelli.
System-level design: Orthogonalization of concerns and platform-based design.
IEEE Transactions on Computer Aided Design for Integrated Circuits and
Systems, 19(12):1523-1542, December 2000.
[50] R. Lysecky, S. Cotterell, and F. Vahid. A fast on-chip profiler memory. In Proc.
of the 39th Conference on Design Automation, pages 28-33, June 2002.
[51] R. Lysecky and F. Vahid. A study of the speedups and competitiveness of fpga
soft processor cores using dynamic hardware/software partitioning. In Proc. of
the Conference on Design, Automation and Test in Europe (DATE), pages 18-23,
March 2005.
[52] G. E. Moore. Cramming more components onto integrated circuits. Proceedings
of the IEEE, 86(1):82-85, January 1998.
[53] F. Plavec, B. Fort, and Z. Vranesic. Experiences with soft-core processor design.
In Proc. of the 19th IEEE Conference on International Parallel and Distributed
Processing Symposium, pages 167b-167b, April 2005.
[54] P. Pop, P. Elese, and Z. Peng. Analysis and Synthesis of Distributed Real-Time
Embedded Systems. Kluwer Academic Publishers, The Netherlands, 2004.
[55] M. Saini. Co-verification enhances time to market advantage of platform fpgas.
Technical report, Mentor Graphics Corporation, July 2004.
[56] A. Sangiovanni-Vincentelli and G. Martin. Platform-based design and software
design methodology for embedded systems. Proc. of the IEEE on Design & Test
of Computers, 18:23-33, December 2001.
[57] F. Schirrmeister and A. Sangiovanni-Vincentelli. Virtual component
co-design-applying function architecture co-design to automotive applications. In Proc. of
the 2001 Vehicle Electronics Conference, pages 221-226, September 2001.
[58] B. Shah. Advanced call graph profiling techniques. Technical report, Intel
Corporation, 2005.
[59] L. Shannon and P. Chow. Maximizing system performance: Using
reconfigurability to monitor system communications. In Proc. of the 2004 International
Conference on Field Programmable Technology (ICFPT), pages 231-238,
December 2004.
[60] L. Shannon and P. Chow. Using reconfigurability to achieve real-time profiling
for hardware/software codesign. In Proc. of the 12th International Symposium
on Field Programmable Gate Arrays, pages 190-199, February 2004.
[61] L. Shannon and P. Chow. Standardizing the Performance Assessment of
Reconfigurable Processor Architectures.
http://www.eecg.toronto.edu/~lesley/research/benchmarks/rates/shannomrates.ps, Accessed May 2006.
[62] B. Sprunt. The basics of performance-monitoring hardware. IEEE Micro,
22(4):64-71, July-August 2002.
[63] G. Stitt, R. Lysecky, and F. Vahid. Dynamic hardware/software partitioning:
A first approach. In Proc. of the 40th Conference on Design Automation, pages
250-255, June 2003.
[64] Sun Microsystems. Using UltraSPARC-IIICu Performance
Counters to Improve Application Performance.
http://developers.sun.com/prodtech/cc/articles/pcounters.html, Accessed
February 2006.
[65] M. M. Tikir and J. K. Hollingsworth. Using hardware counters to automatically
improve memory performance. In Proc. of the 2004 ACM/IEEE conference on
Supercomputing, pages 46-58, July 2003.
[66] J. G. Tong, I. D. L. Anderson, and M. A. S. Khalid. Soft-core processors for
embedded systems. To appear in Proc. of the 19th International Conference
on Microelectronics, December 2006.
[67] J. Turley. Embedded processors by the numbers. Embedded Systems
Programming, May 1999.
[68] D. A. Varley. Practical experience of the limitations of gprof. In Software Practice
and Experience, pages 461-463, 1993.
[69] W. Wolf. Principles of Embedded Computing System Design. San Francisco,
California, 2001.
[70] W. Wolf. A decade of hardware/software codesign. In Proc. of the 5th
International Symposium on Multimedia Software Engineering, pages 38-43, December
2003.
[71] Xilinx Corporation. Connecting Customized IP to the MicroBlaze Soft Processor
Using the Fast Simplex Channel (FSL) Link, May 2004.
[72] Xilinx Corporation. http://www.xilinx.com/, Accessed January 2005.
[73] Xilinx Corporation. MicroBlaze Processor Reference Guide, October 2005.
VITA AUCTORIS
Jason Gim Tong was born in Windsor, Ontario, Canada, on July 25, 1981. In 2000,
he graduated from Vincent Massey Secondary School. Thereafter he attended
the University of Windsor, where he obtained his Bachelor of Applied Science (B.A.Sc.)
degree in Electrical and Computer Engineering (Computer Engineering Option) in
2004. He maintained a position on the Dean's List throughout his undergraduate
studies. He is currently a Master of Applied Science (M.A.Sc.) candidate in Electrical
and Computer Engineering. His research interests include reconfigurable computing,
hardware-software co-design for FPGA-based embedded systems, and digital system
design. He was awarded a Tuition Waiver Scholarship (Fall 2005 to Summer
2006) from the University of Windsor. He is currently an IEEE student member.
