A Model-based Design Framework for Application-specific Heterogeneous Systems by Skalicky, Samuel
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
5-2015
This Dissertation is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for
inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Recommended Citation
Skalicky, Samuel, "A Model-based Design Framework for Application-specific Heterogeneous Systems" (2015). Thesis. Rochester
Institute of Technology. Accessed from
A Model-based Design Framework for
Application-specific Heterogeneous Systems
Samuel Skalicky
A dissertation submitted in partial fulfillment
of the requirements for the Degree of
Doctor of Philosophy
in
Computing and Information Sciences
from the B. Thomas Golisano College of
Computing and Information Sciences
Rochester Institute of Technology
Rochester, NY
May 2015
B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology
Rochester, NY
Certificate of Approval
Doctor of Philosophy
The requirements for conferral of the degree of
PhD in Computing and Information Sciences to Samuel Skalicky
have been met, the submission examined and approved
by the dissertation committee and PhD Program Director.
Approved by:
Dr. Pengcheng Shi, PhD Program Director
Professor, RIT College of Computing and Information Sciences
This dissertation has been examined and approved by the following
Examination Committee members:
Dr. Sonia Lopez Alarcon, Advisor
Assistant Professor, RIT Department of Computer Engineering
Dr. Marcin Lukowiak, Committee Member
Associate Professor, RIT Department of Computer Engineering
Dr. Matthew Fluet, Committee Member
Associate Professor, RIT Department of Computer Science
Dr. Stanisław Radziszowski, Committee Member
Professor, RIT Department of Computer Science
Dr. Andrew Schmidt, External Committee Member
Computer Scientist, USC Information Sciences Institute
© 2015
Samuel Skalicky
All Rights Reserved
Acknowledgments
I like to say that I decided to work towards a PhD because I was finishing my undergraduate degree
on Co-op and at the time I thought — sure, I can go back to school — not realizing at the time what I
was getting myself into. So many people have contributed to my success, and I am incredibly grateful to
all of them. Since everyone I mention here in this section contributed to my success in very different and
immeasurable ways, I’ll mention them all chronologically rather than in some ‘order of importance’.
Looking back, as an undergraduate student with tunnel-vision I would first like to thank Dr. Andreas
Savakis for opening my eyes to the possibility of graduate studies, which was the impetus of my even
considering going for a PhD. It seems like somebody was looking out for me from the beginning: after being
introduced to FPGAs on Co-op, I found out that a class on Electronic Design Automation was offered by Dr.
Marcin Lukowiak. And at every turn, when I needed to start looking in a different direction Dr. Lukowiak
was always offering some class that fit the bill. Instead of just getting the exact material I needed when I
needed it, I gained a long time friend and mentor who always keeps me in-czech. Thanks for everything!
As if that was not enough, I happened to take a class with Dr. Sonia Lopez Alarcon who, among all
the others in that class, took a special interest in me and later on became my advisor in this PhD journey.
I am grateful that, as I grew out of my old-self and into the new person I am now, Sonia constantly stuck
with me through it all. I was one of those students who thought the liberal arts requirements in engineering
were unnecessary, and then I started writing research papers and wished I had done more, but the Comma
Queen brought me up to speed. Our research focus has constantly shifted from FPGAs, to GPUs, to
Matlab, compilers, scheduling, and ended up in high-level synthesis. Thanks Sonia for always being ready
for anything and supportive of wherever the results take us throughout this whirlwind adventure!
Taking classes as a PhD student I met professors from various disciplines, two of which were gracious
enough to become part of my dissertation committee. Thanks to Dr. Stanisław Radziszowski, I learned
more of the theoretical and mathematical background and gained the appreciation for understanding that
point of view. Later I took a class with Dr. Matthew Fluet who graciously met with me and took the time
to bring me up to speed given my different background. Thanks to his support I gained the understanding
that I really was not building a framework, but a compiler on a larger scale and treating it as such I was
able to simplify many of the problems I was facing in my research. He also was gracious enough to follow
me down the rabbit-hole of graph scheduling, guide me through concrete proofs, and all the while keep
the big picture in mind. Thanks to both of you for all that you have done!
In the uncertain world of graduate studies (and PhD in particular) Dr. Pengcheng Shi always knows how
to keep students interested and focused, and to teach the skills we didn’t know we would need until we actually
needed them. Thanks Dr. Shi for opening my eyes and helping me become much more independent. I am
thankful to many of my fellow PhD students Azar Dehaghani, Lei Hu, Harish Rao, Shannon Pattison, Biru
Cui, Ricardo Figueroa, Hongda Mao, Haitao Du, and Ruslan Dautov for the conversations and support they
gave over the years. My lab mates also provided much needed comradery and include Cory Merkel, Ganesh
Khedkar, James Letendre, Mark Hogan, Alex Karantza, Matthew Ryan, and Tyler Kwolek among others.
Although he was not a lab mate or a fellow PhD student, Christopher Wood was a much needed ‘partner in
crime’ throughout my PhD studies. Although his focus vastly differed from my goals, he was always happy
just to listen and comment, and occasionally jump in and help with research and paper writing. Thanks
guys!
In the final stages of my PhD studies I got the opportunity to expand my research into the embedded
systems domain at the University of Southern California Information Sciences Institute under the direction of
Dr. Andrew Schmidt. In this short time, Andy spent much time both inside and outside the normal confines
of the workplace and provided a beneficial perspective for my research, finally joining as a dissertation
committee member. Working with Andy was a real pleasure, although in my final days working at USC/ISI
I was told that they had never seen anyone work their mentor so hard. I am incredibly grateful to him and
all of the USC/ISI colleagues for their meaningful discussion and support. Thank you Andy for all that you
have done, and for just being who you are!
When it came time to attend conferences and present research, many provided support financially. Thanks
Sonia for making this reimbursement process easier, and for going to bat for me. Thanks Dr. Shanchieh
Yang, Dr. Hector Flores, and Dr. Pengcheng Shi for your support over the years. And lastly, thanks to Sun
for letting me spend our vacation money to go to conferences and accompanying me to some of them.
I would also like to thank my family and friends for supporting me (or putting up with me depending on
the day) throughout. Thanks to my parents for supporting me and being the best landlord; it’s going to be
hard to find a replacement. Thanks for reading all of my papers, agreeing with me even when you knew I
was wrong (and helping me understand later), and just being who you are. Thanks Atit for putting up with
my ridiculous schedule, sticking around even when I ignored you (and your calls) to work on research, and
thanks for always being a happy face to hang out with.
I must be the luckiest guy in the world, to have such a supportive and encouraging wife. Thanks Sun for
encouraging me to start this PhD, for sticking with me even when you were bored and lonely, and for being
a great role model. I certainly would not have been able to do this without you; there were many days that
were difficult but you helped me make it through. Now that we have both graduated, I am excited to step
into this new world with you.
I hope that I can give back to you all that you have given me.
I love you with all of my heart.
A Model-based Design Framework for
Application-specific Heterogeneous Systems
Samuel Skalicky
Supervised by Dr. Sonia Lopez Alarcon
Abstract
The increasing heterogeneity of computing systems enables higher performance and power efficiency.
However, these improvements come at the cost of increasing the overall complexity of designing such systems.
These complexities include constructing implementations for various types of processors, setting up and
configuring communication protocols, and efficiently scheduling the computational work. The process for
developing such systems is iterative and time consuming, with no well-defined performance goal. Current
performance estimation approaches use source code implementations that require experienced developers
and time to produce.
We present a framework to aid in the design of heterogeneous systems and the performance tuning of
applications. Our framework supports system construction: integrating custom hardware accelerators with
existing cores into processors, integrating processors into cohesive systems, and mapping computations to
processors to achieve overall application performance and efficient hardware usage. It also facilitates effective
design space exploration using processor models (for both existing and future processors) that do not require
source code implementations to estimate performance.
We evaluate our framework using a variety of applications and implement them in systems ranging from
low power embedded systems-on-chip (SoC) to high performance systems consisting of
commercial-off-the-shelf (COTS) components. We show how the design process is improved by reducing the number of design
iterations and unnecessary source code development, ultimately leading to higher performing, more efficient systems.
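A Model-based Design Framework for
Application-specific Heterogeneous Systems
Samuel Skalicky
Supervised by Dr. Sonia Lopez Alarcon
Abstract (translated from the Korean version)
The growth of heterogeneous computing systems enables higher performance and power efficiency. However, these
improvements increase the overall complexity of system design. This complexity includes creating different
instructions for each processor, setting up and configuring communication protocols, and efficiently scheduling
computational work. The process of developing such systems is iterative and time consuming, with no well-defined
performance goal. Currently, estimating performance by implementing source code requires experienced developers and time.
We present a framework that helps with the design of heterogeneous systems and with improving the performance of
applications. Our framework supports system construction: integrating custom-built hardware with
existing cores, and mapping computations to processors for improved performance and efficient hardware usage. The framework
facilitates effective design space exploration using processor models that do not require source code
implementations to estimate performance.
To evaluate our framework we used a variety of applications, on systems ranging from low power embedded
systems-on-chip to high performance commercial products. We showed that reducing design iterations and unnecessary
source code development ultimately leads to higher performance.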
Table of Contents
List of Figures
List of Tables
List of Listings
Chapter 1: Introduction
1.1 Motivation
1.2 Thesis Statement and Contributions
1.3 Methodology
1.4 Dissertation Structure
Chapter 2: Literature Review
2.1 Foundational Works
2.1.1 Performance Evaluations
2.1.2 Processor Models
2.1.3 Processor Simulators
2.1.4 Scheduling in Heterogeneous Systems
2.1.5 Compilers in Heterogeneous Systems
2.2 Related Work
2.2.1 Heterogeneous System Simulators
2.2.2 Design Frameworks & Implementation Strategies
2.2.3 Distributed Heterogeneous System Implementations
2.2.4 Single Chip Heterogeneous Multicore Processors
Chapter 3: Background Information
3.1 Applications and Systems
3.2 Application Representations and Graphs
3.3 Compiler Basics
3.4 Redsharc Foundations
3.4.1 Network Infrastructure
3.4.2 Kernel Development
3.4.3 Build Infrastructure
3.4.4 Runtime Operation
Chapter 4: Model-based Framework
4.1 Phase 1: Analyze
4.2 Phase 2: Estimate and Schedule
4.3 Phase 3: Simulate
4.4 Phase 4: Generate
Chapter 5: Front-End Compiler
5.1 Identifying Kernels
5.1.1 Operation Clustering
5.1.2 Application Decomposition
5.2 Matlab compiler
5.2.1 Lexer
5.2.2 Parser
5.2.3 Dynamic Analysis
5.3 Producing Application DFGs
Chapter 6: Scheduling
6.1 Graph-based Processor Modeling Methodology
6.1.1 Reduced Graph Representation
6.1.2 Schedule Length Estimation on Identical Cores
6.1.3 Schedule Length Estimation on Different Cores
6.2 System Modeling Methodology
6.2.1 Background
6.2.2 Methodology
6.2.3 Medical Imaging Application
6.2.4 Scheduling Experiments and Results
6.2.5 System Modeling Summary
Chapter 7: Code Generation
7.1 Fixed Configuration Systems
7.1.1 Application Support
7.1.2 FPGA Interfacing and Hardware Support
7.1.3 Application Performance Experiments
7.1.4 Fixed Configuration Systems Summary
7.2 Configurable Systems
7.2.1 Developer Roles
7.2.2 Kernel Development with Redsharc
7.2.3 System Development with Redsharc
7.2.4 Build Infrastructure
7.2.5 System Runtime Operation
7.2.6 Example Applications in Redsharc
7.2.7 Configurable Systems Summary
Chapter 8: Conclusions and Future Work
References
List of Figures
Figure 1.1 CPU, GPU, FPGA Design Space Chart
Figure 3.1 Application and System deconstruction hierarchy
Figure 3.2 Breakdown of graph representations used
Figure 3.3 Example kernel graph characteristics
Figure 3.4 Various storage representations of a kernel graph
Figure 3.5 Example stages of the compilation process
Figure 3.6 Example Redsharc MPSoC system.
Figure 3.7 Redsharc hardware kernel interface
Figure 3.8 Redsharc simulation environment
Figure 4.1 Framework flow high level view
Figure 4.2 Detailed Analyze phase diagram
Figure 4.3 Detailed Simulation phase diagram
Figure 4.4 Detailed Generate phase diagram
Figure 5.1 Application implementation and parallelization flow.
Figure 5.2 Example operation graph showing clusters of operations forming three kernels.
Figure 5.3 Example application graph showing the kernels that are present.
Figure 5.4 High level overview of the compiler flow
Figure 5.5 Compilation process example
Figure 6.1 Reduced graph examples
Figure 6.2 Memory footprints of adjacency representations
Figure 6.3 Memory footprints of incidence versus reduced
Figure 6.4 Example bipartite graphs
Figure 6.5 Example graphs with different distributions
Figure 6.6 Sample case for exposed calculation error
Figure 6.7 Results for dot product operation DFGs.
Figure 6.8 Results for matrix-vector multiplication DFGs.
Figure 6.9 Results for matrix-matrix multiplication DFGs.
Figure 6.10 Results for random graphs.
Figure 6.11 Example pipelined architecture design
Figure 6.12 Example graph and architecture representations
Figure 6.13 Comparison of graph representations
Figure 6.14 Memory footprint comparison to reduced representation
Figure 6.15 Initial pipelined architecture design.
Figure 6.16 Improved pipelined architecture design.
Figure 6.17 Original dot product design from [124]
Figure 6.18 Matrix-vector multiply results
Figure 6.19 Matrix-matrix multiply results
Figure 6.20 Cholesky decomposition results
Figure 6.21 Matrix inversion results
Figure 6.22 Heterogeneous system modeling diagram
Figure 6.23 Hardware system connection diagram
Figure 6.24 NTEPI algorithm flow diagram
Figure 6.25 Performance of kernels used in NTEPI
Figure 6.26 Design space diagram subset for NTEPI
Figure 6.27 Performance of various scheduling policies
Figure 6.28 Normalized performance of three best policies
Figure 6.29  delay of the various scheduling policies
Figure 7.1 Code generation flow for heterogeneous Matlab scripts.
Figure 7.2 Application implementation and parallelization flow.
Figure 7.3 NTEPI processing flow
Figure 7.4 NTEPI kernel DFG
Figure 7.5 Shallow Water kernel DFG
Figure 7.6 AMD Hardware system configuration diagram
Figure 7.7 Transfer bandwidths for CPU/GPU and CPU/FPGA as a function of payload size.
Figure 7.8 Matlab MPI bandwidth experiments
Figure 7.9 NTEPI performance results
Figure 7.10 Shallow water performance results
Figure 7.11 Example Redsharc MPSoC system.
Figure 7.12 Shifting development focus with Redsharc
Figure 7.13 Redsharc hardware processor interface
Figure 7.14 Redsharc API connection diagram
Figure 7.15 Redsharc networks configurations and connectivity
Figure 7.16 Redsharc build framework flow diagram
Figure 7.17 Redsharc build tool flow
Figure 7.18 Face Recognition DFG partitioned into software and hardware kernels.
List of Tables
Table 3.1 Examples of Redsharc software kernel API calls.
Table 6.1 Calculating memory footprint of various graph representations
Table 6.2 Scheduling variable summary
Table 6.3 Common unary and binary operation types for most programming models
Table 6.4 Summary of estimation results
Table 6.5 Processor specifications
Table 7.1 Matlab FPGA Interface Functions
Table 7.2 Processor Specifications
Table 7.3 Redsharc System API
Table 7.4 System performance with Redsharc build framework
List of Listings
Listing 5.1 Small selection of the Matlab language tokens implemented
Listing 5.2 Small selection of Matlab grammar rules
Listing 7.1 Software Kernel 4 Implementation
Listing 7.2 DFG API for Configuring Kernel 4
Chapter 1
Introduction
At the beginning of the VLSI era, processor performance improvements came from advancements in the
semiconductor fabrication process that enabled higher clock frequencies. With higher frequencies, processors
were able to execute more instructions per second. In 1974 Robert Dennard predicted that the power used
by a processor is proportional to the area or size of the chip, since both voltage and current will be reduced
for smaller transistor sizes [33]. This held true for many years until the ubiquitous power wall was identified
in 2004 [83], where the voltage and current could not be reduced, leading to larger power needs as transistors
got smaller. This power wall signified the end of Dennard’s scaling law.
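The scaling argument above can be sketched with the standard dynamic power relation. This is a textbook-style illustration under stated assumptions, not a derivation taken from [33]: the scaling factor $\kappa > 1$ and the symbols $C$ (switched capacitance), $V$ (supply voltage), $f$ (clock frequency), and $A$ (area) are introduced here only for the example, and static (leakage) power is ignored.

```latex
% Dynamic power of a switching circuit
P \propto C V^{2} f

% Ideal (Dennard) scaling by a factor \kappa > 1:
%   C \to C/\kappa, \quad V \to V/\kappa, \quad f \to \kappa f, \quad A \to A/\kappa^{2}
P' \propto \frac{C}{\kappa}\left(\frac{V}{\kappa}\right)^{2}(\kappa f) = \frac{P}{\kappa^{2}},
\qquad
\frac{P'}{A'} = \frac{P/\kappa^{2}}{A/\kappa^{2}} = \frac{P}{A}

% If the voltage can no longer be reduced (V stays fixed):
P' \propto \frac{C}{\kappa}\,V^{2}\,(\kappa f) = P,
\qquad
\frac{P'}{A'} = \frac{P}{A/\kappa^{2}} = \kappa^{2}\,\frac{P}{A}
```

Under the ideal assumptions, power per unit area stays constant as transistors shrink, so total power tracks chip area. Once the voltage stops scaling, power density in this simplified model grows with $\kappa^{2}$, which is one way to read the power wall described above.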
Since increasing the frequency of the processor was not possible, architecture designers opted for packing
more processors onto the same chip running at a lower frequency. This began the multicore era where the core
count steadily increased as smaller semiconductor fabrication processes were achieved. As time progressed,
different types of cores began to be designed into a single chip. When specific cores are not in use they
can be powered off for better efficiency, giving rise to the term dark silicon to refer to the parts of the chip
that are not being used. This paradigm allows more compute resources to be designed into a chip than can
possibly be powered all at the same time.
With the end of Dennard scaling, Multicore Scaling was supposed to be the way forward but it too ended
[38]. We are currently in the era of Dark Silicon where area is free and power is not, and where unconventional
cores offer a new way forward [28]. Computing systems are becoming increasingly heterogeneous to optimize
for performance, power, and cost [93]. Similarly, the programming models have evolved from sequential
streams of instructions to a threading model, allowing developers to take advantage of an increasing number
of parallel cores using multithreading. Hardware support that allows multiple threads to interleave
instructions on a single processor core is called simultaneous multithreading
(SMT), and it enables an even larger number of threads to execute in parallel.
1.1 Motivation
Thanks to the availability of area on chip, additional specialized cores have been designed to take advantage
of data level parallelism. In single instruction multiple data (SIMD) execution, a single thread
issues an instruction that operates on multiple data values simultaneously. In the case of GPUs, single
instruction multiple thread (SIMT) execution has multiple threads issue the same instruction, one to each core. Whereas
SIMD adds functional units for parallelism, SIMT enables many more cores to be implemented
by reducing the amount of control logic, which in turn requires that every core execute the same instruction.
Custom general purpose architectures have become more prevalent thanks to the widespread use of FPGAs,
and custom special purpose cores have also been added to general purpose processors for tasks such as
encryption, image processing, signal processing and many others. The heterogeneous hardware/software
codesign for systems of such processors has been a longstanding problem [32, 60] with increasing complexity
and difficulty.
Some non-trivial steps in the implementation flow include (1) exploiting task parallelism, (2) customizing
the data parallelism within tasks, (3) interfacing cores in a processor and processors in a system, and
(4) scheduling tasks effectively over the heterogeneous configuration [60]. Of these, steps (1), (2), and (4) rely on the
ability to map computations to processors. Given categories of computations, such as those defined as the
Berkeley Dwarfs [7], there may be a mapping between categories and processor types. We investigated
this possibility by choosing computations from one category, Dense Linear Algebra, and analyzing their
performance on various processor types: CPU, GPU and FPGA. We selected dot product, matrix-vector
and matrix-matrix multiplication to analyze the effect of increasing data parallelism, and matrix inversion
and Cholesky decomposition to analyze the impact of control flow complexity. We profiled the performance of
each computation executed on every processor type for a range of data sizes. We compared the performance
results to determine the best implementation for every combination of: kernel type, processor type, data size,
precision, and library implementation. Figure 1.1 shows the results of this comparison on data sizes from 5 (5
element vectors, 5x5 or 25 element matrices) to 8000 (8000 element vectors, 8000x8000 or 64,000,000 element
matrices). By reading this chart we can understand that for the matrix-vector multiply kernel operating on
double precision data the best processor type for data sizes smaller than 20 is the FPGA, between data sizes
of 20 and 200 the fastest is the CPU using the AMD Core Math Library (ACML), and for data sizes larger
than 200 using the GPU with Matlab’s implementation is the fastest.
Figure 1.1: Design space for the highest performing architectures. The different regions represent the architecture
with the best performance at any computation and data size for both single (SP) and double precision (DP) floating
point.
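Reading the chart amounts to a lookup from data size to processor type. The following minimal Python sketch encodes only the double precision matrix-vector multiply regions quoted in the text. The names `DP_MATVEC_BOUNDS`, `DP_MATVEC_TARGETS`, and `best_processor` are hypothetical, and the treatment of the exact boundary sizes 20 and 200 is an assumption, since the text does not say which region they belong to.

```python
from bisect import bisect_right

# Region boundaries for the double precision matrix-vector multiply kernel,
# taken from the description of Figure 1.1 in the text.
# Assumption: a size of exactly 20 or 200 belongs to the larger-size region.
DP_MATVEC_BOUNDS = [20, 200]
DP_MATVEC_TARGETS = ["FPGA", "CPU (ACML)", "GPU (Matlab)"]


def best_processor(data_size: int) -> str:
    """Return the fastest processor type for the given data size."""
    if data_size < 1:
        raise ValueError("data size must be positive")
    # bisect_right counts how many boundaries are <= data_size,
    # which is the index of the region the size falls into.
    return DP_MATVEC_TARGETS[bisect_right(DP_MATVEC_BOUNDS, data_size)]


if __name__ == "__main__":
    for n in (5, 100, 8000):
        print(n, best_processor(n))
```

A full design space table would need one such boundary list per combination of kernel, precision, and library. The sketch covers a single row only to show the shape of the lookup.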
These results show that even within a single category, there may be some computations that are best
executed on one type of processor and others that are best for a different type of processor. In the category
of Dense Linear Algebra for example, computations such as dot product and Cholesky decomposition have
very different computational needs and require different architectural features within a processor to achieve
high performance. Moreover, in order to accurately map computations to processors, we will need to consider
a variety of factors to determine the best processor. In addition, empirically exploring the design space and
actually trying out every possible corner case is not a feasible strategy.
1.2 Thesis Statement and Contributions
Although many of the problems of implementing heterogeneous systems have been identified early on [60],
many still persist and plague the development of current systems [32]. In our work we seek to improve the
hardware/software design process of heterogeneous systems and enable improved performance and efficiency
of compute intensive applications. We explore the organization of computations in an application, and their
execution on processors in a computing system.
Thesis Statement:
An application can be parallelized and implemented in a heterogeneous system and achieve high
performance by identifying the computations and estimating their performance for different pro-
cessor types. We facilitate this difficult process by specifying the interfaces of each implementation
step (analyze, estimate, schedule, simulate, and generate) and construct a cohesive framework
that can automate and streamline the implementation of applications in heterogeneous systems.
The research objectives and contributions are:
• A coarse kernel granularity at the level of linear algebra computations
• A method to estimate performance without source code or detailed processor designs
• An evaluation of scheduling policies for improving overall application performance
• An automated and streamlined design space exploration for types and quantity of processors
• An implementation generator for both: coarse grain systems of interconnected processors
(workstations or clusters) and fine grain systems of integrated on-chip processors (commonly
referred to as multiprocessor system-on-chip, or MPSoC)
• An end-to-end framework for the implementation of applications in heterogeneous systems.
1.3 Methodology
As stated earlier, although various approaches have been proposed to the problems addressed in this
dissertation, there has been little work attempting to combine these solutions into a cohesive development
environment. Scheduling policies have been developed but not evaluated for heterogeneous systems.
Directly evaluating performance on hardware is very expensive and time consuming. Moreover, to use any
of the previous solutions, it would need to be integrated with a model of the system to estimate the
performance of an application on a heterogeneous system.
We developed a model-based framework that aids the design of distributed hardware/software solutions
on a heterogeneous system. Our framework takes into account the relationships between computations
and input data sizes, as well as processors and implementations to optimize application performance. The
differentiated capabilities of general purpose CPU, GPU, and FPGA processors satisfy each of the different
types of computations normally found in compute-intensive applications. We propose that performance can
be estimated without any hardware specific implementation and present a high-level graph-based processor
modeling methodology. By correlating a graph representation of a computation to the architectural features
of a processor, we can quickly approximate its execution time without any source code or architecture
design information. Using this approach, we can easily construct a schedule of computation-to-hardware
assignments and estimate application performance.
We propose that the degree of heterogeneity dictates the preferred scheduling strategy in heterogeneous
systems. We evaluate the execution of the application as a combination of computations that are scheduled
on a heterogeneous set of processors. Previous work included scheduling policies [53][69] such as static
or dynamic strategies for specific applications. We present a system model to analyze various scheduling
policies and determine the best for constraints such as performance, efficiency, or power. We investigate
systems composed of processors from both commercial off-the-shelf (COTS) components and
custom multiprocessor systems-on-chip (MPSoCs). By solving the problems posed above, future on-chip
heterogeneous systems will also be able to perform faster and more efficiently. These on-chip heterogeneous
systems are already being developed [30][49][103]. Having analysis tools will be critical to aid in the software
design to take advantage of such future hardware.
Finally, we propose a tool flow that integrates existing tools and provides an end-to-end application
development environment from initial algorithm to final implementation in a parallel heterogeneous system.
Tool suites for developing such heterogeneous systems either from device vendors (such as Nvidia, Xilinx,
etc.) or from academic initiatives (such as Redsharc [102]) enable easy integration or construction. But
in order to use them the developer must have a pre-existing design already chosen. Yet, coming up with
a design is a difficult proposition given the size of the design space, including: types of processors (CPU-
like as hard or soft, accelerators - GPU or custom on FPGA), quantity of processors, and assignments of
computations-to-processors. Our model-based framework seeks to fill the gaps in existing tools to provide a
modeling environment to aid in determining the composition of the system, customized for each application.
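To give a sense of how quickly this design space grows, the following minimal sketch counts system compositions and computation-to-processor assignments. The function names, the limit of two processors per type, and the count of ten kernels are hypothetical values chosen for illustration. The sketch also assumes that processor instances are distinguishable, which overcounts assignments that differ only by swapping identical processors.

```python
from itertools import product

def system_compositions(processor_types, max_per_type):
    """Enumerate every non-empty mix of processors, with between 0 and
    max_per_type instances of each processor type."""
    for counts in product(range(max_per_type + 1), repeat=len(processor_types)):
        if any(counts):  # skip the empty system
            yield dict(zip(processor_types, counts))

def assignment_count(num_kernels, composition):
    """Number of ways to assign each kernel to exactly one processor instance."""
    return sum(composition.values()) ** num_kernels

types = ("CPU", "GPU", "FPGA")
compositions = list(system_compositions(types, max_per_type=2))
print(len(compositions), "compositions")
print(assignment_count(10, {"CPU": 1, "GPU": 1, "FPGA": 1}),
      "assignments of 10 kernels to 1 CPU + 1 GPU + 1 FPGA")
```

Even this toy case yields 26 compositions, and a single three-processor composition admits 59,049 assignments of ten kernels, before execution order is considered.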
1.4 Dissertation Structure
The rest of this dissertation is organized as follows. Chapter 2 discusses the foundational work that our
research is based on, and the other similar and ongoing efforts related to our objectives. We present the
relevant background information to support our technical descriptions of the research in Chapter 3. The
model-based framework is described in detail next in Chapter 4. Our front-end compilation approach is
presented in Chapter 5. Then, in Chapter 6 we constrain the general scheduling problem for the two
levels encountered in this work. At a low level, scheduling operations within a processor to estimate the
performance of a kernel is presented in Section 6.1. We describe our high level approach to simulate the
operation of the system by scheduling kernels to processors in Section 6.2. Then a study on parallel code
generation is presented in Chapter 7 specifically for fixed configurations of processors in Section 7.1, and
configurable on-chip systems in Section 7.2. Finally, our conclusions and possible future work are described
in Chapter 8.
Chapter 2
Literature Review
In this chapter we describe both the previous work that is the foundation of our work and the other similar
efforts that are related to our objectives and contributions.
2.1 Foundational Works
Our proposed work is based on existing knowledge and advancement in the many areas of computing. Sec-
tion 2.1.1 presents the first step in the process of performance improvement: benchmarking and performance
evaluations of applications on different processor architectures. A survey of processor models for estimating
performance of a computational workload is presented in Section 2.1.2. Following this, some existing proces-
sor simulators are presented in Section 2.1.3. Our work also benefits from the research in scheduling policies
for heterogeneous systems which is briefly described in Section 2.1.4. Lastly, the compilers and compilation
strategies previously presented for heterogeneous systems are described in Section 2.1.5.
2.1.1 Performance Evaluations
The idea that one hardware platform provides better performance than another is not new. The performance
of various processing architectures has been evaluated for many computations. CPU, GPU, and FPGA
implementations of a Low-Density Parity-Check decoder were compared by Falcao et al. [39] for data sizes
8000x4000 and 1024x512. They concluded that for the smaller data size the FPGA was faster and the GPU
was faster at the larger data size.
Sotiropoulos et al. designed an FPGA matrix-matrix multiplication architecture [105] and compared its
performance to a standard CPU implementation. This comparison was only for specifically sized matrices
and did not discuss the implementation in the CPU. Their results showed that the FPGA can outperform the
CPU with a speedup of up to 557x. A comparison of matrix decomposition by Yang et al. [120] evaluated the
performance on CPUs, GPUs, and FPGAs. The results for data sizes ranging from 256 to 1024 demonstrated
that the FPGA was faster than the GPU, followed by the CPU, for both single and double precision floating point.
Higher level functions such as 2D filtering by Llamocca et al. [72] and an implementation of Bayesian
networks by Fletcher et al. [40] were evaluated using GPU and FPGA architectures. But when making
the comparisons, the authors implemented the algorithms solely in one architecture and therefore, chose
one particular processor over another. They did not discuss the best implementations for the computations
in these higher level functions, only the best implementation for the whole algorithm. Grozea et al. [47]
evaluated a sorting algorithm on CPUs, GPUs, and FPGAs to speed up the performance of network intrusion
detection systems. Their results showed that the highest performing architecture was the CPU, followed by the FPGA
and then the GPU.
None of the work mentioned above provides a clear way to use the results to improve the
performance of a CPU, GPU, FPGA heterogeneous system. Moreover, empirically evaluating performance
by actually executing the application can be very slow and presents portability issues for different types of
processors. Another approach to evaluating performance is to use processor-specific models.
2.1.2 Processor Models
Processor models are analytical tools used to better understand a particular architecture and how to ef-
ficiently utilize it. Many models operate using architecture specific programming languages or compiled
instructions but others use a higher level representation of the workload. We present previous approaches
to modeling each specific CPU, GPU, and FPGA architecture.
CPU Models have been designed by many researchers for different purposes. One such model is from
the Performance Modeling and Characterization (PMaC) framework designed for larger supercomputing
performance modeling by Snavely et al. [104]. Although the framework is not designed for CPU simulation,
it contains an accurate CPU processor performance model. This model contains three distinct parts: a
Machine Access Pattern Signature (MAPS), PMaC’s Efficient Binary Instrumentation Toolkit for Linux
(PEBIL) [66], and the PMaC Convolver. The MAPS is used to obtain a model for the machine’s memory
access patterns including: cache hit rates for each level, size of working sets, and access patterns. This is
used to determine how much of the advertised FLOPS performance can be sustained. PEBIL is used to
capture information about the behavior of the computation such as memory operations and floating point
computations. Finally, the PMaC Convolver combines these two pieces of information to determine the
execution time required to complete the computation. Other work by Nakamura et al. [79] presents a
mathematical model for an integer-only processor using a small subset of instructions. Padhariya et al.
[82] presented a model based on calculations correlating the number of floating point operations per second
(FLOPS) the processor can compute with the number of floating point operations in the program. The
verification of this model only used simple applications with polynomial behavior.
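The general idea behind a FLOPS-based estimate can be sketched as a simple ratio of work to throughput. This is not the model of Padhariya et al., whose details are not reproduced here. The function names and the 10 GFLOPS sustained rate are hypothetical, and the operation count assumes a dense n x n matrix-vector multiply.

```python
def matvec_flops(n):
    """Floating point operations in a dense n x n matrix-vector multiply:
    each of the n rows needs n multiplies and n - 1 adds."""
    return n * (2 * n - 1)

def flops_time_estimate(flop_count, sustained_flops):
    """Estimated execution time in seconds: operations divided by
    operations per second."""
    if sustained_flops <= 0:
        raise ValueError("sustained_flops must be positive")
    return flop_count / sustained_flops

n = 1000
# Hypothetical sustained rate of 10 GFLOPS, chosen only for illustration.
print(flops_time_estimate(matvec_flops(n), sustained_flops=10e9))
```

An estimate of this form ignores memory behavior and control flow, which is one reason it may only hold for simple applications.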
GPU Models have evolved over the years as the architectures have been improved. One early model
was designed by Hong et al. [54]. They attempted to estimate the cost of memory operations and assumed
that the compute instructions could be overlapped by memory accesses. Although this was a good model for
early GPU architectures, it does not accurately represent the performance of the newer platforms due to the
inclusion of caching schemes. A follow up work by Sim et al. [97] incorporated memory and instruction level
parallelism (MLP & ILP) into the model. This new model more accurately calculates the overlap of memory
and compute instructions including caching (texture, constant, and global’s L1 & L2) and special function
units. This MLP & ILP model uses the Nvidia Compute Visual Profiler to gather hardware performance
information such as register, cache, and DRAM usage. The Ocelot instruction analyzer collects instruction
mixture information (special function unit, sync, and floating point) about the computation that’s being
executed. Finally, the ILP and MLP are calculated using static analysis tools that analyze the generated
binary files by evaluating the instruction scheduling and register allocation.
FPGA Models have been designed by Holland et al. [52] to estimate the performance of FPGA
architectures for design space exploration prior to architecture design. They presented a method to analyze
the amenability of an algorithm for implementation in an FPGA. They further expanded the model for
multi-FPGA systems in [51]. Their goal was to predict the performance without any architectural design
details of the algorithm. In their verification, they presented modeling error rates up to 15% compared to
their baseline performance. To improve upon the existing models, we designed an FPGA model for pipelined
architectures [101]. This work incorporated the number of pipelines, memory bandwidth, and other design
factors to accurately predict the performance of a computation. Rather than using an architecture specific
language such as VHDL or Verilog, our model uses only operational workload information such as the types and
quantities of scalar operations.
In summary, processor models use some analytical relationship between the workload and the architecture
to estimate performance. Another approach is to accurately simulate the specific architectural functionality
at a much lower level. Using this approach is referred to as processor simulation and is discussed next.
2.1.3 Processor Simulators
Measuring the performance of a particular computation in real hardware is an expensive and time consuming
operation. Simulators are designed to provide accurate low level detailed operation of a particular hardware
platform to aid in the design or tuning of architectural features. Compared to processor models, simulators
always require either architecture specific code or compiled instructions and generally operate at a much
lower and more detailed level. Another option for evaluating the performance of a processor architecture
is to use a simulation tool. Many such tools exist for CPU [8][16][115][121][122], GPU [9], and FPGA [48]
hardware platforms. However, these simulators are meant to provide much more detail about the operation
of the architecture. Although simulators could be used to estimate performance of kernels, they provide
much more detail than is needed resulting in long simulation times. As such, we use processor performance
models in our work.
Once the performance of a kernel is estimated, that information is used to determine the mapping between
kernels and processors to achieve overall high performance and efficiency. This is described in the following
section.
2.1.4 Scheduling in Heterogeneous Systems
Scheduling policies for mapping tasks from a directed acyclic graph (DAG) to heterogeneous processors have
been studied and found to be NP-complete for finding the optimal schedule [63]. Many works have presented
heuristic scheduling approaches in lieu of an optimal policy for the sake of usability [94][12][71][29][18][23].
However, in these works the execution times of the tasks on the different processors vary by only up to 3x. This small
performance range is not enough to be applicable to very heterogeneous systems such as those evaluated in
our work. Beaumont et al. [12] analyzed HEFT and other policies on specific dataflow graph (DFG) types
and found that HEFT performed best, yet required more data transfers than their policy. Liu et al. [71]
presented an iterative list scheduling policy that can provide shorter schedules than the HEFT policy. Other
policies by Cirou et al. [29] and Boeres et al. [18] were also created with heterogeneous systems in mind, yet
were only evaluated using systems of processors with the same hardware platforms rather than very different
platforms as we do in this work.
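The flavor of heuristic list scheduling on a DAG can be sketched as follows. This is a simplified earliest-finish-time scheduler, not HEFT itself: HEFT additionally prioritizes tasks by rank and accounts for communication costs, both of which are omitted here. The task names and execution times are hypothetical, and the standard library graphlib module (Python 3.9 or later) is assumed to be available.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def earliest_finish_schedule(deps, exec_time):
    """deps maps each task to the set of tasks it depends on.
    exec_time maps each task to {processor: execution time}.
    Tasks are visited in topological order and each is placed on the
    processor where it would finish earliest."""
    proc_free = {}   # processor -> time at which it becomes idle
    finish = {}      # task -> finish time
    placement = {}   # task -> chosen processor
    for task in TopologicalSorter(deps).static_order():
        # A task is ready once all of its predecessors have finished.
        ready = max((finish[p] for p in deps.get(task, ())), default=0.0)
        best_proc, best_finish = None, None
        for proc, cost in exec_time[task].items():
            end = max(ready, proc_free.get(proc, 0.0)) + cost
            if best_finish is None or end < best_finish:
                best_proc, best_finish = proc, end
        placement[task] = best_proc
        finish[task] = best_finish
        proc_free[best_proc] = best_finish
    return placement, max(finish.values())

# Hypothetical diamond-shaped DAG: a -> (b, c) -> d
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
exec_time = {
    "a": {"CPU": 2.0, "GPU": 4.0},
    "b": {"CPU": 3.0, "GPU": 1.0},
    "c": {"CPU": 3.0, "GPU": 5.0},
    "d": {"CPU": 2.0, "GPU": 1.0},
}
placement, makespan = earliest_finish_schedule(deps, exec_time)
print(placement, makespan)
```

A greedy choice per task is fast but not guaranteed to be optimal, which is the trade-off these heuristic policies accept.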
Scheduling of tasks in heterogeneous systems has been heavily researched, but usually only with systems
containing abstract processors, or processors that were very similar rather than distinct CPU, GPU, or
FPGA processors. Topcuoglu et al. [112] presented the highly regarded heterogeneous earliest finish time
(HEFT) policy but do not mention the heterogeneity of their system. Arabnejad et al. [6] presented the
predict earliest finish time (PEFT) policy that used a novel optimistic cost table and produced makespans
20% shorter than those of HEFT, using an abstract system where each platform had a heterogeneity value between 0
(similar) and 2 (very different). The mapping between an abstract system and real world processors is not
mentioned. Liu et al. [70] presented the priority rule based serial scheduling (SS) policy and evaluated it in a
system with uniformly distributed random task compute times. Wu et al. [117] presented the adaptive greedy
(AG) policy and evaluated it in a heterogeneous system of CPU+GPU workstations but used exponentially
distributed random task compute times. Braun et al. [20] presented eleven scheduling policies including
opportunistic load balancing (OLB) and minimum execution time (MET) and evaluated them in a system
with uniformly distributed random task compute times. However OLB does not consider the execution
time of each task on the given hardware platform before making assignments, making it not applicable to
heterogeneous systems. The shortest process next (SPN) policy was suggested by Khokhar et al. [60] for
use in heterogeneous systems and improves upon OLB by choosing the next task to assign based upon the
shortest execution time of a task on any of the available hardware platforms.
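The difference between these policies can be illustrated on a set of independent tasks. The sketch below follows the descriptions given above rather than the original formulations: the OLB-style function ignores execution time when choosing a processor, the MET-style function always picks the fastest processor for each task, and the SPN-style function only orders tasks by their shortest execution time on any processor. All names and execution times are hypothetical.

```python
# Hypothetical execution times (ms) of three kernels on three processors.
exec_time = {
    "matvec":   {"CPU": 4.0, "GPU": 3.0,  "FPGA": 6.0},
    "cholesky": {"CPU": 9.0, "GPU": 12.0, "FPGA": 20.0},
    "dot":      {"CPU": 2.0, "GPU": 5.0,  "FPGA": 1.0},
}
processors = ("CPU", "GPU", "FPGA")

def olb_assign(tasks, exec_time, processors):
    """OLB-style: each task goes to the processor that becomes free first.
    Execution time is used only afterwards, to advance that processor's clock."""
    free_at = {p: 0.0 for p in processors}
    placement = {}
    for task in tasks:
        proc = min(processors, key=lambda p: free_at[p])
        placement[task] = proc
        free_at[proc] += exec_time[task][proc]
    return placement, max(free_at.values())

def met_assign(tasks, exec_time):
    """MET-style: each task goes to its fastest processor, ignoring load."""
    free_at = {}
    placement = {}
    for task in tasks:
        proc = min(exec_time[task], key=exec_time[task].get)
        placement[task] = proc
        free_at[proc] = free_at.get(proc, 0.0) + exec_time[task][proc]
    return placement, max(free_at.values())

def spn_order(exec_time):
    """SPN-style ordering: tasks sorted by their shortest execution time
    on any available processor."""
    return sorted(exec_time, key=lambda t: min(exec_time[t].values()))

tasks = list(exec_time)
print("OLB:", olb_assign(tasks, exec_time, processors))
print("MET:", met_assign(tasks, exec_time))
print("SPN order:", spn_order(exec_time))
```

With these numbers the OLB-style assignment places the Cholesky kernel on the GPU and finishes at 12 ms, while the MET-style assignment finishes at 9 ms. The gap widens as the execution times across processors become more dissimilar.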
Scheduling is normally a part of the compilation process of taking the implementation of an application
and producing the executable binary. The next section describes some relevant compilers for heterogeneous
systems.
2.1.5 Compilers in Heterogeneous Systems
Banerjee et al. [10] presented their MATCH compiler for heterogeneous reconfigurable systems consisting
of CPUs, DSPs, and FPGAs. Their focus was on code generation (i.e., cross compiling from Matlab to the
native language used) for each of the specific processors. They considered the entire workload when mapping
the computational work to processors. Other similar works investigate different compiler design and analyses
for other heterogeneous systems [84]. Ratnalikar et al. [87] presented an approach using a macro dataflow-
style where the application is broken into coarser grain macro dataflow operations. We follow a similar
approach of decomposing the application into coarse grain kernels. But they only considered generating C++
implementations using other libraries for systems of identical CPUs rather than heterogeneous processors
and using Matlab’s built-in computational libraries [88].
Shei et al. [94] presented an approach for automatically optimizing the computational workload among
a system with CPU and GPU processors. They empirically collected runtime performance for each kernel
across an extensive range of data sizes for use in mapping kernels to processors and then curve fitting the
data to produce an equation representative of the design space. In terms of determining the configuration
of processors in the system, in many cases it is impractical to actually execute a kernel on every type of
processor (at any data size) to facilitate scheduling and mapping. Therefore, we use processor models to
estimate performance for each kernel and data size as needed quickly and efficiently.
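The curve fitting step in an empirical approach of this kind can be sketched as a least-squares fit. The functional form that Shei et al. fit is not stated here, so the power law t = a * n^b used below is an assumption, and the measurements are synthetic values generated for illustration rather than data from their work.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of t = a * n**b, done as a linear regression
    of log(t) against log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log-log space = log(a)
    return a, b

# Synthetic "measurements" in seconds, following t = 2e-9 * n**2 exactly.
sizes = [100, 200, 400, 800]
times = [2e-9 * n ** 2 for n in sizes]
a, b = fit_power_law(sizes, times)
print("fitted exponent:", b)
print("predicted time at n = 1600:", a * 1600 ** b)
```

The fitted equation is only as good as the measurements behind it, and collecting those measurements on every processor type is the cost that a model-based estimate avoids.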
In this work, we consider applications composed of kernels that can be assigned to a single processor
for execution. In many cases this is ideal, as the kernels are data-parallel and can be mapped to processor
architectures that match their control flow. However, in some cases the data sizes that these kernels operate
on can be very large, to the point that the data must be partitioned among multiple processors. Teo et al.
[110] and Travinin et al. [113] present approaches for distributed array processing and Majeti et al. [76] present
an approach for automatic data layout and distribution among heterogeneous architectures. Although we
do not consider integrating these approaches in this work, it is a possible future extension.
2.2 Related Work
General purpose computing research has focused on symmetrical multi-processor architectures to improve
performance. Conversely, embedded systems research has been pushing towards heterogeneous systems
and applying new techniques to match specific tasks to specialized architectures. However, using the right
architecture for the right task requires using different toolchains or programming models for each architecture.
Moreover, Cao et al. [24] showed that the resulting software is not portable to different toolchains or other
architectures. Carbon et al. [25] propose a virtualization technique to abstract out the specific architectural
details required for programming. Using their approach, they achieved an average performance improvement
of 15% using a separate low power, low performance core for management tasks such as garbage collection,
just-in-time compilation, and interpretation and up to 5x speedup for memory management.
Our work combines a framework with a system simulator to improve knowledge and understanding of
heterogeneous systems and enable the future of computing for heterogeneous single chip multicore proces-
sors. Next, we describe related efforts for heterogeneous system simulators, design frameworks, and some
heterogeneous systems that have been implemented in the past.
2.2.1 Heterogeneous System Simulators
Simulators have been developed for heterogeneous CPU+GPU systems by Sinha et al. [99], and CPU+FPGA
systems by Fummi et al. [42]. Both simulate existing implemented code to evaluate the performance without
actually running the software. Although they give an estimated execution time, they do not help with
the implementation flow or making design decisions. A simulator for adaptive applications computed in
heterogeneous systems was designed by Hong et al. [53] that included a scheduler to handle the various
computations in each of the processors. But their framework relies on an inaccurate evaluation of execution
time using FLOPS instead of processor models as discussed in their work and does not assist in moving from
simulation to implementation. Instead all of the work setting up the simulation would have to be redone for
the implementation. Hardware simulators are designed to provide accurate low level detailed operation of a
particular hardware platform to aid in the design or tuning of architectural features. Yet this level of detail
is more than is required to estimate the performance and efficiency of heterogeneous systems.
A simulator for evaluating the various scheduling policies, architectures, and parameters of an abstract
heterogeneous system to assess the performance of an application was designed by Branco et al. [19]. Contrary to
the one-to-one computation-to-hardware scheduling we are interested in, their work focused on scheduling
tasks that require one or more resources in order to be processed. They concluded through the simulation of
various applications that all resources need to be accounted for when scheduling in order to make the best
possible decisions, especially when diverse types of resources are used.
2.2.2 Design Frameworks & Implementation Strategies
Design flows for implementing multiprocessor systems have been created to suit a variety of different system
configurations. One such design flow by Castrillon et al. [68] focuses on systems composed of similar
processors like RISC CPUs and DSPs that are both able to use similar source code. They take the initial
source code and optimize it for the specific type of processor that will be used. They also have an end-to-end
flow [26] from initial application implementation, to kernel extraction, including mapping and scheduling to
produce a multiprocessor system implementation. However, they choose an operation clustering approach
for kernel extraction and rely on the ability of the processor toolchains to be able to use a similar base
language (C/C++) which is not portable to CPU, GPU, FPGA systems [96].
Another heterogeneous system framework by Maassen et al. [75] distributes tasks to different environ-
ments including: CPUs, CPU+GPU systems, clusters, and others. By manually applying labels throughout
the code, the software can assign different tasks to different executors (basically the environments). Rather
than taking advantage of the computational abilities of each processor as we do in our work, their focus is
more on distributing work and handling tasks that are able to execute only on specific systems (like only on
a GPU, for example).
Cross-compiling using a language such as OpenCL may simplify programming for heterogeneous archi-
tectures, but lacks the performance of code written in the preferred language for each processor. Shagrithaya
et al. [92] attempted to solve this by identifying specific constructs (types of computations or functions) in
OpenCL and mapping them to optimized routines in the preferred language for each processor.
Grigoras et al. [45] created an aspect driven compilation framework that can assign different parts of
an application to different processors. These aspects must be custom designed to recognize and modify
source code constructs (such as loops) through a process called weaving. Kirchgessner et al. [61] designed a
framework to allow HDL designs to be portable between various FPGA platforms from any FPGA vendor
(Xilinx, Altera, etc.). Their application-centric methodology focuses on mapping specific constructs such as
FIFOs and other interfaces into existing designs for any platform.
2.2.3 Distributed Heterogeneous System Implementations
Li et al. [69] designed a CPU+GPU distributed system to speed up the pairwise alignment of biological
sequences. Their system contained a range of low to high end GPUs and used a dynamic scheduling policy
based on a producer-consumer model. The central dispatcher contains a pool of independent tasks and
integrates the completed solutions from each node into a final result. This solution achieves a speedup
greater than 4x and can continue to scale with more nodes. Another system designed by Shen et al. [95]
combines CPUs with FPGA accelerators into a single system. They show that the performance of the system
for a distributed matrix multiplication application scales linearly with increases in number of CPU+FPGA
nodes. This system also has the capability to monitor and adjust the clock frequency of the CPU and
FPGA dynamically for power and thermal management constraints. Inta et al. [57] presented an abstract
discussion of CPU+GPU+FPGA systems using commercial off-the-shelf (COTS) processors. They analyzed
the performance of applications by breaking them up into low level operations (scalar add, multiply, square,
etc.) and assigning them to GPU or FPGA accelerators. However they do not take into account the
capabilities of the CPU to also assist in the computational ability of the system. Instead they use the CPU
just for control, input/output, and display purposes.
2.2.4 Single Chip Heterogeneous Multicore Processors
Currently, our research targets commodity platforms (CPU, GPU, and FPGA) combined into a single system
that can be easily added to desktop workstations. However the future combination of such architectures will
be incorporated into single chip heterogeneous multicore processors. Current work by Cong et al. [30]
proposed a loosely coupled set of accelerators combined into an accelerator-rich chip multiprocessor (CMP).
Although they mention accelerators in general, their work focuses on coarse-grain reconfigurable arrays and
does not mention large quantities of GPU-like parallel compute cores. Another similar work by Hariyama
et al. [49] presented an overall architecture similar to the GPU using customized cores. They utilize the
FPGA as a means to build their design containing customized CPU cores and SIMD architecture accelerator
cores. However, they do not consider reconfigurable accelerator cores to fully take advantage of the FPGA’s
capabilities. As such, their CMP design can be characterized as a CPU+GPU system.
A reconfigurable coprocessor design by Brunelli et al. [21] addresses the issue of adding an FPGA-like
architecture to a processor on a single chip. Their design only mentions a single reconfigurable core, but the
main communication bus could be used to incorporate additional cores. A reconfigurable DSP was designed
and fabricated by Zhang et al. [123] to enable flexible implementation of baseband wireless functions. Not
only did they incorporate FPGA compute units, they also designed a reconfigurable interconnect network
to maximize bandwidth between the various cores in order to improve overall performance.
Commercially, heterogeneous processors are becoming readily available, such as Stellarton from Intel, which
combines an Atom processor with an Altera FPGA in the same package, or the Zynq from Xilinx, which combines
multiple ARM Cortex processors and a Mali GPU with reconfigurable fabric on a single chip. Intel and AMD
both now have very capable GPUs integrated into the silicon with their CPUs. Now that these systems are
available, simplified implementation strategies that achieve high performance such as the one proposed in
this work are in high demand.
Chapter 3
Background Information
In this chapter we present background information that supports the technical descriptions of the research
presented in later chapters. The work presented in this dissertation is very multidisciplinary, crossing domains
from computer science graph theory to scheduling in industrial engineering and even low level architecture
design from computer engineering. Some of the following information may be viewed as obvious common
knowledge to some, while to others it may provide a non-obvious or sometimes-neglected perspective. In
Section 3.1 we describe the state of the application and system design space, specifically how we break
problems down into smaller more manageable ones. In Section 3.2 we describe an application’s workload
representation using graphs and the associated terminology, some of which is unique to our work.
Then in Section 3.3 we briefly present the various parts and functionality of a compiler. Lastly, we describe
the foundational elements of Redsharc that were completed prior to our work in Section 3.4.
3.1 Applications and Systems
In our work, we investigate heterogeneous systems of interconnected processors (i.e., workstations or clusters)
and systems of integrated on-chip processors (commonly referred to as multiprocessor system-on-chip, or
MPSoC). With the decreasing feature size in semiconductor fabrication, additional specialized cores such as
custom general purpose architectures or custom special purpose cores are now commonplace. When these
are combined at the silicon level in the fabrication of a processor they are referred to as cores (i.e., multi-core
processor). When each is fabricated separately and combined at the board level, they are referred to as
processors (i.e., CPU workstation + GPU card). In this dissertation, the terms cores and processors may be
used interchangeably. Regardless of the level of integration (cores on-chip in a processor, or processors in a
system), each can be further broken down into functional units (that perform operations such as addition,
multiplication, etc.) as shown in Figure 3.1a.
(a) System breakdown: a system contains processors, where each processor contains many individual
functional units.
(b) Application breakdown: an application contains coarse grain kernels, where each kernel contains many
individual operations.
Figure 3.1: Domain decomposition. The application and system are broken down into their core components.
Programming specialized cores (for encryption, video encoding, etc.) has raised the level of abstraction
from instruction-based to kernel-based. A kernel is a cohesive set of computational work that is executed by a
processor, also known as a task or unit of computation. Ideally, the amount of computational work in a kernel
is chosen such that it takes advantage of the compute capabilities of a processor and reduces the amount of
inter-processor communication. To disambiguate, in this work the term kernel does not refer to the core
of an operating system (OS), commonly called the OS kernel. Given special purpose cores that
execute a single type of kernel, or readily available high performance libraries of kernel implementations, in
many cases the developer only needs to provide the core with the set of data inputs and a single instruction that
denotes the type of kernel to execute. This single instruction is all that is needed by the core to execute
the entire kernel. We propose that this kernel-based programming model apply to heterogeneous systems to
raise the level of abstraction. Compared to the granularities chosen in [45, 69, 75], low level operations require
too many data transfers and high level algorithms require too much computation and may not fully utilize
the capabilities of the core. By breaking an application down into coarse grain kernels, which contain the
individual operations as shown in Figure 3.1b, the design problems are split into smaller, more manageable
ones. For example, the problem of assigning parts of an application to different processors is simplified at
the coarse grain kernel level. The performance of the application can then be estimated by analyzing
the operations that are part of each kernel separately.
Figure 3.2: Graph breakdown: an application is represented as a graph of its kernels. Each kernel is represented as
a graph of its operations.
3.2 Application Representations and Graphs
An application like medical imaging or face recognition can be represented as a set of kernels. Each kernel
produces results that may be used as inputs for other kernels. These data dependencies between kernels
form a dataflow graph (DFG) representing the workload of an application, called the kernel DFG. Each
kernel is composed of a set of scalar operations that represent the computational work of the kernel. The
data dependencies between the operations of a kernel form a graph representing the workload of the kernel,
called the operation DFG. This hierarchy is shown in Figure 3.2.
Kernels and operations are the granularities of work used in this dissertation. Directly, one unit of work
(i.e., a kernel or operation) may have a relationship to another that implies the production of a result that is
then consumed. During execution, when one unit of work must precede another this relationship is called a
precedence constraint or dependency. In general, the term precedence indicates an ordering that one comes
before another. Thus in many cases the types of graphs used in scheduling are referred to as directed graphs,
meaning the dependencies indicate direction from one to another. Without direction, it would be unclear
which is producing the result and which is consuming it. Another important concept is a chain of
dependencies that loops back on itself, called a cycle. A dependency cycle would indicate that
every unit of work is dependent on the previous and the first is dependent on the last. Graphs with
a cycle do not represent real workloads; therefore we use graphs without cycles, called acyclic graphs. The
graphs we use in this work are referred to as directed acyclic graphs for both operation graphs and kernel
graphs.
Now let’s clarify what the term graph implies. A graph is a representation (or visualization) of tasks
and their dependencies. We refer to each task in graph terminology as a node, and the dependencies as
edges. A graph G is represented as G = (V, E), where V is the set of vertices (nodes) and E is the set of
edges (dependencies). At the kernel level, each vertex u ∈ V is called an operation while each directed edge
(a) An example graph before scheduling with levels L1
to L4, width of 5, and ∞-span of 4.
(b) Epochs E1 to E5 after scheduling with 3 cores P1
to P3, and 3-span of 5.
Figure 3.3: Characteristics of an example graph in the minimized configuration before scheduling (a) and after
scheduling (b).
(u, v) ∈ E represents a data dependency between two operations. Within the graph, a level is defined as a
set of independent operations that do not have any dependencies to any other operations in the same set.
Each operation must have a dependency to at least one operation in the previous level (unless the operation
is in the first level in which case there will be no incoming dependencies). Moreover, given that operations
on variables can either be unary or binary there must be only one or two incoming dependencies. But there
are no restrictions on the number of outgoing dependencies of an operation. The result of an operation can
be the input of every operation in the subsequent level, or it can be a final result that is not used in any
other operation. The width of the graph is the maximum number of operations in any level, which represents
the maximum amount of parallelism present at any point throughout the algorithm. Figure 3.3a shows an
example graph with 4 levels denoted L1 to L4.
We define the minimized configuration of a graph to be such that all edges must go from an operation in
level Li to another operation in a later level Li+j, and all operations are arranged such that an operation in
level Li cannot be placed in level Li-1 due to dependencies. For example, the graph in Figure 3.3a is shown
in the minimized configuration. The conversion to this reduced representation has the same computational
complexity as a depth first search or downward ranking, O(|V| + |E|).
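The level assignment described above can be sketched with a single topological pass. This is a minimal illustration, assuming the graph is supplied as a dictionary that maps every operation to the operations consuming its result; the function name `minimized_levels` and the small example graph are hypothetical and are not taken from Figure 3.3a.

```python
from collections import Counter, deque


def minimized_levels(successors):
    """Place every operation in the earliest level its dependencies allow.

    `successors` maps each node to the list of nodes that consume its
    result; every node, including sinks, must appear as a key.
    Each node and edge is visited once, so the cost is O(|V| + |E|).
    """
    indegree = {u: 0 for u in successors}
    for u in successors:
        for v in successors[u]:
            indegree[v] += 1

    # Operations with no incoming dependencies form the first level.
    level = {u: 1 for u in successors if indegree[u] == 0}
    ready = deque(level)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            # v must be placed after its latest predecessor.
            level[v] = max(level.get(v, 1), level[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if visited != len(successors):
        raise ValueError("graph contains a cycle")
    return level


# Hypothetical operation DFG: each operation has at most two inputs.
example = {"a": ["c", "d"], "b": ["d"], "c": ["e"], "d": ["e"], "e": []}
levels = minimized_levels(example)
width = max(Counter(levels.values()).values())  # most operations in any level
span = max(levels.values())                     # number of levels
```

For this example the operations fall into three levels with a width of two. The same pass also rejects cyclic inputs, since a node on a cycle never reaches an in-degree of zero.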
Given a graph, there are a variety of ways to store the node and edge information including: adjacency
matrix, adjacency list, and incidence list. An adjacency matrix stores the complete connectivity of nodes in
the graph, specifically whether an edge exists between any two nodes or not. Figure 3.4b shows the adjacency
matrix representation of the graph in Figure 3.4a. Notice that node 10 does not have any outgoing edges to
any other nodes, thus there are zeros in each column of row10. However, node 10 does have three incoming
edges from nodes 7, 8, and 9. This is represented by the value of 1 in col10 in row7, row8 and row9. Outgoing
edges can be found in the respective row for a node, and incoming edges in the respective column.
(a) Example Graph (b) Adjacency Matrix (c) Adjacency List (d) Incidence List
Figure 3.4: Example graph (a) and various representations or storage approaches (b-d).
The adjacency matrix takes up a significant amount of storage space since it stores whether an edge
exists for any possible position (between any two nodes). However, the adjacency list representation only
stores information about edges that actually exist. Figure 3.4c shows the adjacency list representation of
the graph in Figure 3.4a. For each node, the adjacency list representation stores a list of the other nodes
that are adjacent to it. You can see that row10 is an empty list since node 10 does not have any outgoing
edges. But row1 contains 2, 3, 4, 5 and 6 to indicate the edges from node 1 to those other nodes. Notice
that in this representation we cannot easily determine the incoming edges of a particular node. Similarly,
the incidence list only stores the specific edge information but does not keep a list for every node. Figure
3.4d shows the incidence list representation of the graph in Figure 3.4a. Notice that the edge from node 1
to node 2 is represented as (1,2) in row1. This representation explicitly stores every edge in the graph.
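The three storage approaches can be compared side by side on a small graph. The sketch below uses a hypothetical four-node graph (not the graph of Figure 3.4a) and plain Python containers; the helper names `outgoing` and `incoming` are illustrative only.

```python
# Hypothetical directed graph with nodes 1..4.
n = 4

# Incidence list: one (source, destination) entry per edge.
incidence = [(1, 2), (1, 3), (2, 4), (3, 4)]

# Adjacency matrix: entry [u-1][v-1] is 1 when an edge u -> v exists.
matrix = [[0] * n for _ in range(n)]
for u, v in incidence:
    matrix[u - 1][v - 1] = 1

# Adjacency list: for each node, the nodes reachable over one outgoing edge.
adjacency = {u: [] for u in range(1, n + 1)}
for u, v in incidence:
    adjacency[u].append(v)


def outgoing(node):
    """Outgoing edges are read from the node's row of the matrix."""
    return [v for v in range(1, n + 1) if matrix[node - 1][v - 1]]


def incoming(node):
    """Incoming edges are read from the node's column of the matrix."""
    return [u for u in range(1, n + 1) if matrix[u - 1][node - 1]]


matrix_entries = n * n        # storage grows with the number of node pairs
list_entries = len(incidence)  # storage grows with the number of edges
```

Here the matrix holds 16 entries for 4 edges, which illustrates the storage trade-off. Finding the incoming edges of a node from the adjacency list alone would require scanning every node's list, whereas the matrix answers it with one column read.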
The general scheduling problem can be represented in standard scheduling notation as (P | Cmax) [109],
where P represents the number of identical parallel processors and Cmax identifies that the objective is to
find a schedule which minimizes the maximum completion time (i.e., how fast the entire workload can be
executed). This problem is known to be strongly NP-hard [43]. More difficult variations of this problem
include adding precedence constraints (i.e., data dependencies) (P | prec | Cmax), and using unrelated parallel
processors (R) with precedence constraints (R | prec | Cmax).
The span, or schedule length, of a graph is the number of levels in the graph and, given an infinite
number of processors, is also called the critical path or infinite-span (∞-span). After scheduling the graph
onto a given number of processors the span is known as the specific span, p-span where p is the number of
processors, and is equal to the number of epochs. Figure 3.3b shows a detailed schedule for the example
graph onto a processor containing 3 processing elements P1 to P3. This schedule has 5 epochs denoted E1
to E5. An epoch is defined as a set of operations that are executed concurrently during the same period
in time (i.e., the same clock cycle). The operations in each epoch may be from different levels, but none of the
operations have dependencies on other operations in the same epoch.
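The relationship between epochs and the p-span can be illustrated with a simple greedy list scheduler. This is only a sketch under stated assumptions: every operation takes one time unit, the graph is acyclic, and ready operations are taken in arrival order. The function `schedule_epochs` and the example graph are hypothetical; this is not the scheduling approach evaluated later in the dissertation, and the graph is not the one in Figure 3.3.

```python
def schedule_epochs(successors, p):
    """Group unit-time operations into epochs of at most p operations.

    An operation becomes ready only after all of its predecessors ran in
    an earlier epoch, so no two operations in an epoch depend on each other.
    """
    indegree = {u: 0 for u in successors}
    for u in successors:
        for v in successors[u]:
            indegree[v] += 1

    ready = sorted(u for u in successors if indegree[u] == 0)
    epochs = []
    while ready:
        # Run up to p ready operations; the rest wait for the next epoch.
        epoch, ready = ready[:p], ready[p:]
        epochs.append(epoch)
        for u in epoch:
            for v in successors[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return epochs


# Hypothetical DAG with 4 levels and a width of 5.
graph = {
    "a": ["f"], "b": ["f"], "c": ["g"], "d": ["g"], "e": ["h"],
    "f": ["h"], "g": ["i"], "h": ["i"], "i": [],
}
two_span = len(schedule_epochs(graph, 2))   # span on 2 processing elements
wide_span = len(schedule_epochs(graph, 5))  # enough elements for full width
```

With five processing elements the schedule length equals the number of levels (4), while restricting the schedule to two elements stretches it to 5 epochs.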
3.3 Compiler Basics
In general, a compiler has a front-end and a back-end. The front-end includes the processes from initial
source code analysis to generating an intermediate representation. The back-end includes the optimization
and generation processes. Initially, a program is treated as a stream of characters read from the file (Figure
3.5a). The stream of characters (Figure 3.5b) is analyzed and converted into a stream of tokens (Figure 3.5c).
This process is called lexical analysis, and the tool that executes this process is called a lexer (other terms
include tokenizer and scanner). These tokens can be symbols such as +, -, or * that indicate operations,
or they can be keywords such as “for” or “while” or even identifiers such as “var1” that refer to specific
variables. Errors produced from this process are due to unknown symbols, or groups of characters that do
not form a valid token.
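A lexer of this kind can be sketched with Python's standard `re` module. The token categories and the name `tokenize` below are assumptions chosen for illustration; they do not correspond to any particular language or to the compiler developed in this work.

```python
import re

# Ordered token rules: earlier rules win, so keywords are tried before
# identifiers. The final rule catches any character no other rule accepts.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:for|while)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rule})" for name, rule in TOKEN_SPEC))


def tokenize(source):
    """Convert a stream of characters into a list of (kind, text) tokens."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token
        if kind == "ERROR":
            raise SyntaxError(f"unknown symbol {text!r} at position {match.start()}")
        tokens.append((kind, text))
    return tokens


tokens = tokenize("while var1 * 3")
```

The `ERROR` rule mirrors the failure mode described above: a character that belongs to no valid token is reported instead of being silently skipped.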
Then the stream of tokens is processed (called parsing) by a set of rules for a particular language, called
the grammar (Figure 3.5d). Errors produced during this process include groups of tokens that are not “legal”
according to the grammar rules. Other than performing this rule check, the parser also converts the stream of
tokens to a syntax tree that is provided to the next process. The semantic analysis performs type checking
to validate that variable types are used appropriately (according to the language specification). The last
process in the front-end is generation of an intermediate (or abstract) representation of the original program
(a) Source code from File (b) Characters scanned (c) Tokens parsed
(d) Grammar rules applied (e) Resulting abstract syntax tree
Figure 3.5: Example stages of the compilation process. First the source code from the file (a) is provided as input
to the compiler. Then each character is scanned (b) and tokens identified (c). These tokens are matched to the
grammar rules (d) to construct the abstract syntax tree (e).
as shown in Figure 3.5e. This representation is merely used as an intermediary step prior to generation of
machine-specific code/instructions.
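The step from tokens to a syntax tree can be illustrated with a small recursive descent parser. The grammar here is an assumption made for the example (expressions with +, -, *, / over numbers and identifiers), and the tree is returned as nested tuples of the form (operator, left, right). The token extraction is deliberately simplified and, unlike a real lexer, ignores characters it does not recognize.

```python
import re

OPERATORS = {"+", "-", "*", "/"}


def parse(source):
    """Parse an arithmetic expression into a nested-tuple syntax tree.

    Assumed grammar:
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := NUMBER | IDENTIFIER
    """
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[+\-*/]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def factor():
        token = advance()
        if token is None or token in OPERATORS:
            raise SyntaxError(f"expected an operand, got {token!r}")
        return token

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


tree = parse("a + b * 2")
```

Because `term` is nested inside `expr`, multiplication binds tighter than addition, so the tree for `a + b * 2` has `+` at the root and the product as its right child. A sequence of tokens that no rule accepts, such as `a + * b`, raises a `SyntaxError`.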
In the back-end, optimizations are made to the intermediate representation that are tailored (or tuned)
for a specific type of machine that the program will eventually run on. This is why compilers are unique to
the type of hardware that the program will run on. In particular, using a compiler with optimizations for a
different machine architecture (even if they are both x86 CPUs, for example) will result in under-performing
programs. The last step is to take this optimized representation and translate it into the final machine
code/instructions. Also, in this last part the specific register assignment is performed. This is also tailored
to the specific type of machine; using a machine with a larger number of registers may not increase performance
unless the compiler is tailored for that architecture.
3.4 Redsharc Foundations
Prior to the work in this dissertation, Redsharc was created to aid in the development of hardware/software
systems on reconfigurable devices. In this section we present the previous work that was done by others.
Later, in Section 7.2 we present our developments on top of the existing work described below.
The reconfigurable data-stream hardware/software architecture (Redsharc) is a programming model and
network-on-chip infrastructure to simplify development of MPSoCs. Redsharc provides an abstract application
programming interface (API) that allows programmers to develop systems of simultaneously executing
kernels, in software and hardware, that communicate over a seamless interface [62]. Redsharc incorporates
two on-chip networks that directly implement the API to support high-performance systems with numerous
hardware kernels.
Redsharc is based on the Stream Virtual Machine API (SVM) [78], an intermediate language between
high level stream languages and low level instruction sets of various architectures developed under DARPA’s
Polymorphous Computing Architectures (PCA) program. SVM has no preference to the computational
model and only specifies how kernels communicate with each other. SVM is primarily based on a streaming
model, but additionally includes support for blocks or random access chunks of data.
Redsharc addresses the challenges of achieving inter-core communication with support for different communication
models. The goal is to support any configuration of heterogeneous hardware and software kernels
to fit the needs of the application. In a Redsharc system, kernels can be implemented as either software
threads running on a processor core, or hardware cores in the FPGA fabric. Regardless of whether a kernel
runs on a processor or hardware core, or on which core it runs in the system, all kernels communicate using
Figure 3.6: Example Redsharc MPSoC system.
Redsharc’s abstract API supporting both streaming and transmission of blocks of data. These transmissions
occur over the proven, validated, and configurable Redsharc on-chip networks [90].
3.4.1 Network Infrastructure
The form of MPSoC systems that can be created using Redsharc is shown in Figure 3.6. Redsharc supports
multiple software kernels assigned to the same processor core. This means that the two kernels are executing
simultaneously on the same physical processor core — sharing compute time by context switching. The
stream switch network (SSN) and block switch network (BSN) allow data to be transmitted through different
modes as needed by the application. The SSN is a runtime reconfigurable crossbar on-chip network designed
to carry streams of data between cores in a circuit switching fashion. The BSN is a routable crossbar on-chip
network that permits access to any blocks from any kernel in a packet switching fashion. The BSN memories
include a set of on-chip block-RAM (BRAM) and connections to off-chip devices such as SRAM or DDR.
The type of memory allocated to a kernel (either BRAM or an allocation in an off-chip memory) enables the
system to choose between memory speed and density to meet the needs of the application. More information
regarding the SSN and BSN, including performance analyses, can be found in the previous work [62][90].
The data ports on the BSN and SSN connect directly to the hardware or processor cores. Thanks to the
full crossbar structure present within the BSN and SSN, any core can be connected to any port. The SSN
uses on-chip resources to store data in FIFOs. The BSN uses on-chip BRAM to store data in addition
to off-chip resources such as DDR or SRAM. The interfaces for these off-chip memories are available in the
form of IP-cores (from vendors and other 3rd party developers) and can be connected directly to the BSN.
Table 3.1: Examples of Redsharc software kernel API calls.
Function Name   Arguments                        Description
streamPush      element *e, stream *s            Pushes element e onto stream s
streamPop       element *e, stream *s            Pops the top element from stream s and stores the value in e
streamPeek      element *e, stream *s            Reads the top element from stream s and stores the value in e
blockWrite      element *e, int idx, block *b    Writes element e into block b at index idx
blockRead       element *e, int idx, block *b    Reads an element from block b at index idx and stores the value in e
3.4.2 Kernel Development
Cores in Redsharc can either be processors or custom hardware accelerators. Application implementation
begins by decomposing the workload into kernels. These kernels can either be software threads or hardware
logic. Then, leveraging the Redsharc API a developer can quickly assemble, generate, and test the system
on the device. This approach allows for rapid development and testing along with providing vendor-agnostic
implementations for ease of platform migration.
In Redsharc, the software kernel interface (SWKI) is implemented as a traditional software library. The
SWKI provides an API for communication and data transfer, as shown in Table 3.1, to other kernels via
provided drivers to access the DMA controllers. A full description of the API calls is presented in [62].
Each type of processor may implement the SWKI in different ways. Software kernels are supported by a
microkernel or small scale real-time operating system (RTOS) that interfaces between the on-chip networks,
supports the management functions of the control kernel (starting, stopping, launching kernels), and enables
context switching to support multiple simultaneously executing software kernels on the same processor.
However, the RTOS is very thin, providing direct access to driver routines and enabling each kernel to run at
full speed on the processor, only interrupting for context switching or as directed by the control kernel for
management functions. The RTOS sets up and configures DMA, providing pointers for the software kernels
to interact with directly.
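The behavior of the calls in Table 3.1 can be mimicked with a small software model. The sketch below is not the Redsharc library: the classes `Stream` and `Block` are hypothetical stand-ins, the function names are borrowed from the table for readability, values are returned rather than stored through pointers, and the "top" of a stream is assumed to be the oldest element (FIFO order). Blocking, DMA, and the on-chip networks are not modeled.

```python
from collections import deque


class Stream:
    """Hypothetical stand-in for a stream: an unbounded FIFO of elements."""

    def __init__(self):
        self.fifo = deque()


class Block:
    """Hypothetical stand-in for a block: a fixed-size random access array."""

    def __init__(self, size):
        self.data = [0] * size


def streamPush(e, s):
    s.fifo.append(e)          # add e at the tail of the stream


def streamPop(s):
    return s.fifo.popleft()   # remove and return the oldest element


def streamPeek(s):
    return s.fifo[0]          # return the oldest element without removing it


def blockWrite(e, idx, b):
    b.data[idx] = e           # store e at index idx


def blockRead(idx, b):
    return b.data[idx]        # fetch the element at index idx


# A producer kernel pushes three elements; a consumer inspects and pops.
s = Stream()
for value in (1, 2, 3):
    streamPush(value, s)
first = streamPeek(s)
popped = streamPop(s)

b = Block(4)
blockWrite(7, 2, b)
```

The model highlights the difference between the two communication modes: a stream is consumed in order and each element is read once, whereas a block can be read or written at any index any number of times.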
The hardware kernel interface (HWKI) is a thin wrapper that connects hardware kernels to the SSN
and BSN, implemented as a VHDL entity. The HWKI is composed of 3 sets of interfaces: control registers,
blocks, and streams as shown in Figure 3.7. Control registers allow the control kernel to start, stop, and
reset each core and enable the kernel to share status or debug information. The block interface connects
directly to the BSN and provides a simple set of block RAM-like interfaces for the kernel to interact with.
The stream interface connects directly to the SSN and provides standard FIFO interfaces. Specifically which
block or stream each kernel is interacting with is handled separately by the control kernel and implemented
by the BSN and SSN.
Figure 3.7: Redsharc’s hardware abstractions and interfaces for kernels
3.4.3 Build Infrastructure
The construction of MPSoCs can incur long development time when dealing with memory interfaces, PCIe
or other high-speed transceiver IP blocks, and low-level signaling for buses or on-chip interconnect protocols.
Redsharc aims to provide both software and hardware designers a simplified development environment,
shifting the focus from system design and integration to application and kernel development.
Part of Redsharc includes a build infrastructure to support rapid assembly, configuration, and testing of
developed hardware kernels and full systems. The goal of the build infrastructure is to allow a developer to
spend more time developing kernels, rather than creating test benches and simulation/synthesis project files.
(a) HW Kernel Testbench (b) Redsharc MPSoC Testbench
Figure 3.8: Template simulation testbenches for individual hardware kernels (a) and full/partial MPSoC system
simulation (b).
Utilizing Redsharc’s kernel interfaces and on-chip SSN and BSN networks, the build framework allows large
systems to be easily constructed. The developer can leverage provided makefiles, simulation and synthesis
scripts to rapidly simulate and synthesize a kernel for debugging and testing as shown in Figure 3.8a.
With the Redsharc API and template simulation testbench, the stream, block, and control transactions are
managed for the developer. Input streams and blocks are provided as files to the simulation environment and
output streams and blocks are checked against expected results for validation. The simulation environment
currently supports Synopsys VCS. Multiple test vectors can be loaded into the simulation environment and
can be used as regression tests while a kernel is under development.
The build infrastructure also supports the testing of multiple kernels assembled together as a subsystem
or full system as shown in Figure 3.8b. This includes the use of pre-configured soft-core processors to emulate
software kernels and pre-designed stream and block switch networks (SSN and BSN) for connectivity of the
system. System simulation can be performed at various stages during the build process: pre-synthesis,
post-synthesis, and post-placement & routing (PAR), including timing information.
3.4.4 Runtime Operation
After specifying kernel implementations and generating the system binaries and executables, the user also
provides control software that will run on one of the processor cores to manage the runtime operation. Since
Redsharc is based on SVM, it supports the standard SVM API for setting up and configuring kernels, data
transfer between kernels, and other control functions. The scheduling and ordering of kernel execution is
chosen by the user to fit their application needs. When this control software is executed it prompts the
compute cores in the system to begin executing kernels to complete the workload of the application.
Chapter 4
Model-based Framework
In this chapter we present a framework to aid the design, implementation, and performance estimation of
an application in a heterogeneous system. Although the type of work needed to construct a system for an
application has been specified before [60], we present our contributions and describe the improvements we
make to the standard flow. An overview of the framework is shown in Figure 4.1. First, the kernels are
extracted from the initial implementation of an application in the analyze phase. Second, the performance of
each kernel is estimated for each processor and used to construct a kernel-to-processor assignment schedule.
At the center of the framework is a simulator that evaluates different scheduling policies and estimates
overall application performance. By changing the configuration of processors we estimate the performance
for different heterogeneous systems in simulation. Finally, once the configuration of the system has been
chosen, the actual implementation is produced in the generate phase.
Figure 4.1: Overview of the framework’s flow from initial source code to implemented hardware system. The schedule
and simulate steps may be iterated multiple times for optimization.
Figure 4.2: Phase 1: Analyze. Static and dynamic analyses produce a program trace and dataflow graph (DFG)
which are used in Phase 2 to estimate performance and compose a schedule.
4.1 Phase 1: Analyze
The target application is presented to the framework as an implementation in a single threaded high level
language for testing purposes. Then static and dynamic analyses are performed. Existing techniques [35, 80,
98] are used to statically identify kernels and their dependencies in the application. Next, dynamic analyses
construct the dataflow graph (DFG) of kernels (kernel DFG) in the application. These dynamic analyses are
performed using data inputs representative of normal use to generate a program trace. Formally, a program
trace is a sequence of states traversed by the program from start to end. For the purposes of this work, the
program trace contains information about the number of times a particular kernel is executed and in which
order to produce a kernel DFG. Using this graph, the amount of exploitable task parallelism is extracted and
used for scheduling. Figure 4.2 shows the analyses and results that are produced. We combine the analyses
into a unified front end compiler, described in detail in Chapter 5.
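One way to picture how a program trace yields a kernel DFG is sketched below. The trace format is an assumption made for this example: each entry records a kernel name together with the variables it read and wrote, in execution order. The function `build_kernel_dfg` is hypothetical and is not the analysis implemented in Chapter 5.

```python
from collections import Counter


def build_kernel_dfg(trace):
    """Derive nodes, dependency edges, and execution counts from a trace.

    `trace` is a list of (kernel, inputs, outputs) tuples in execution
    order. An edge is added from the most recent writer of a variable to
    each later kernel instance that reads it.
    """
    last_writer = {}
    nodes, edges = [], set()
    for index, (kernel, inputs, outputs) in enumerate(trace):
        node = f"{kernel}#{index}"  # one node per executed kernel instance
        nodes.append(node)
        for name in inputs:
            if name in last_writer:
                edges.add((last_writer[name], node))
        for name in outputs:
            last_writer[name] = node
    counts = Counter(kernel for kernel, _, _ in trace)
    return nodes, sorted(edges), counts


# Hypothetical trace of a frequency-domain filtering application.
trace = [
    ("fft", ["x"], ["X"]),
    ("fft", ["h"], ["H"]),
    ("multiply", ["X", "H"], ["Y"]),
    ("ifft", ["Y"], ["y"]),
]
nodes, edges, counts = build_kernel_dfg(trace)
```

In this example the two `fft` instances share no edge, so they expose task parallelism that a scheduler could exploit, while `multiply` must wait for both of them.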
4.2 Phase 2: Estimate and Schedule
Using the DFG and program trace produced in the analyze phase, the execution time of each kernel is
estimated using processor models. Typically, one model is used for each hardware platform and requires
an implementation of the kernel in a specific design language (e.g., C, CUDA, or VHDL). To simplify this
process, we present a new method to estimate kernel performance using operation DFGs, in which the
nodes are operations and the edges are data dependencies that form the work of a single kernel; this method is
described in Section 6.1. Then the kernel DFG, in which nodes are kernels and edges are kernel dependencies,
is analyzed and kernel-to-processor assignments are made to achieve the best overall application performance.
An evaluation of various scheduling policies is presented in Section 6.2. The inputs and outputs for Phase 2
are shown in Figure 4.3.
Figure 4.3: Detailed framework flow. The simulator’s results are used to improve and optimize the kernel assignments
and system configuration (quantity of processors, schedule).
4.3 Phase 3: Simulate
The goal of simulation is to aid in deciding the right number of each type of processor, to select the best
scheduling policy, and to estimate the performance of the whole application. Figure 4.3 shows a detailed view
of this part of the framework. The simulator plays a central role in validating the results of the previous
phases. The application is simulated by applying a particular scheduling policy to assign the kernels to
processors. Data transfer costs are also included in the estimation of the application’s total execution time.
Various metrics are used to evaluate the performance of the chosen implementations and scheduling policy
including processor utilization and overall execution time. Given particular time, cost, or power constraints
the best scheduling policy, quantity of each type of processor, and kernel implementations can be chosen.
Parameter optimization techniques are used to minimize the number of simulations while still finding a
solution that achieves the best performance. This simulation process is detailed in Section 6.2 in terms of
scheduling kernels to processors to analyze the performance of heterogeneous systems.
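The kind of calculation such a simulation performs can be sketched as follows. This is a toy model, not the simulator of this framework: the greedy earliest-finish-time policy, the single fixed transfer cost, and all names and numbers are assumptions chosen to show how assignments, data transfers, execution time, and utilization interact.

```python
def simulate(kernels, deps, exec_time, processors, transfer_cost):
    """Assign each kernel to the processor on which it finishes earliest.

    kernels:       kernel names in a valid topological order
    deps:          kernel -> list of kernels whose results it consumes
    exec_time:     (kernel, processor type) -> estimated execution time
    processors:    processor name -> processor type
    transfer_cost: delay added when producer and consumer are on
                   different processors
    """
    free_at = {p: 0.0 for p in processors}
    busy = {p: 0.0 for p in processors}
    finish, placed = {}, {}
    for kernel in kernels:
        best = None
        for proc, proc_type in processors.items():
            if (kernel, proc_type) not in exec_time:
                continue  # no implementation for this processor type
            data_ready = max(
                (finish[d] + (0.0 if placed[d] == proc else transfer_cost)
                 for d in deps.get(kernel, [])),
                default=0.0,
            )
            start = max(free_at[proc], data_ready)
            end = start + exec_time[(kernel, proc_type)]
            if best is None or end < best[0]:
                best = (end, proc, proc_type)
        if best is None:
            raise ValueError(f"no processor can run kernel {kernel!r}")
        end, proc, proc_type = best
        finish[kernel], placed[kernel] = end, proc
        free_at[proc] = end
        busy[proc] += exec_time[(kernel, proc_type)]
    makespan = max(finish.values())
    utilization = {p: busy[p] / makespan for p in processors}
    return makespan, placed, utilization


# Hypothetical system with one CPU and one GPU, and three kernels.
processors = {"cpu0": "cpu", "gpu0": "gpu"}
exec_time = {
    ("A", "cpu"): 4, ("A", "gpu"): 2,
    ("B", "cpu"): 3, ("B", "gpu"): 2,
    ("C", "cpu"): 5, ("C", "gpu"): 1,
}
makespan, placed, utilization = simulate(
    ["A", "B", "C"], {"C": ["A", "B"]}, exec_time, processors, transfer_cost=1
)
```

In this example kernel B is placed on the CPU even though the GPU is faster for it, because the GPU is still busy with kernel A. Re-running the function with a different `processors` dictionary is the toy equivalent of exploring the quantity of each processor type.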
4.4 Phase 4: Generate
The last phase in the framework accepts the kernel DFG, kernel assignments (scheduling policy), and system
configuration, producing the parallel system implementation as shown in Figure 4.4. We assume that each
processor has a control thread for communication and management. The control threads are constructed to
initiate kernels and are synchronized for correct operation by enforcing data dependencies using data transfers.
In addition, the control threads manage allocated memory for each kernel and avoid deadlock between
processors. The implementation, whether for an MPSoC or system of interconnected processors, is fully
functional and requires no further input from the user. Just as with any other compiled executable, the
implementation produced only needs to be started, then it runs through the application workload producing
Figure 4.4: Detailed code generation phase. The system implementation is produced using the kernel DFG, kernel
assignments, and system configuration (quantity and type of processors).
the final results. A sample implementation of this code generation phase for Matlab-based heterogeneous
systems is presented in Chapter 7.
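The idea of control threads that are synchronized through data transfers can be illustrated with Python's standard `threading` and `queue` modules. This is a conceptual sketch only: the two "processors" are ordinary threads, the kernels are placeholder computations, and the function `run_system` is hypothetical. It does not represent the code produced by the generate phase.

```python
import queue
import threading


def run_system():
    """Run two control threads whose only synchronization is a data transfer."""
    link = queue.Queue()  # models the transfer channel from processor 0 to 1
    results = {}

    def control_thread_0():
        a = sum(range(10))  # placeholder for kernel A
        link.put(a)         # sending the result releases the consumer

    def control_thread_1():
        a = link.get()        # blocks until kernel A's result has arrived
        results["B"] = a * 2  # placeholder for kernel B, which depends on A

    # The consumer is started first to show that start order does not matter.
    threads = [
        threading.Thread(target=control_thread_1),
        threading.Thread(target=control_thread_0),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_system()
```

The blocking receive is what enforces the precedence constraint: kernel B cannot begin until the transfer from kernel A completes, regardless of which control thread happens to start first.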
Chapter 5
Front-End Compiler
The goal of the front end compiler is to analyze the source code, identify and partition work into kernels
to construct the kernel DFG, and map each kernel to a specific implementation, either from libraries or
provided by the user. In many low level programming languages like C/C++, CUDA, or OpenCL, these
tasks are relegated to the user either by partitioning the work into functions or using directives to identify
the start and end of each kernel. High level languages provide a level of abstraction that allows developers to
describe kernels and data dependencies in their application with minimal effort. Behind the scenes, high
performance parallelized implementations of kernels are called to perform the actual computation. Given an
application, developing a sequential implementation is intuitive. Parallel implementations entail difficulties
such as partitioning the work, scheduling, synchronization and communication. Of these, the front end
compiler is responsible for partitioning the work, and identifying where synchronization and communication
are needed.
Parallelizing an application across a heterogeneous system is a complex and difficult task to perform
manually. Figure 5.1 shows our interpretation of the standard two level compiler approach introduced with
the Stream Virtual Machine (SVM) [64] from the DARPA Polymorphous Computing Architectures (PCA)
program. The high level compiler breaks the application down from the initial sequential language specific
implementation into an abstract representation. First, at a high level the kernels are identified from the
application. Next, the performance of each kernel is estimated, dictating which type of processor each should
be mapped to. Then, the order in which each kernel should be executed is scheduled on the processors. The
high level compiler produces an abstract representation of the application consisting of a kernel DFG, where
Figure 5.1: Application implementation and parallelization flow.
nodes in the graph are kernels and edges are data dependencies, and a schedule containing assignment and
ordering information.
The low level compiler/linker operates on the abstract representation, mapping kernels to implementations
and producing the final parallel implementation. Kernel implementations are mapped either to existing
libraries or those provided by the user. From the schedule, the control threads are constructed to initiate
computation and data transfers. Control threads are synchronized for correct operation by enforcing data
dependencies using data transfers.
The latter parts (red items with dashed outlines in Figure 5.1) form the code generation phase and are
discussed in Chapter 7. The former parts (high level compiler items in Figure 5.1) represent the work done
in Phases 1 (Analyze) and 2 (Estimate & Schedule) of our framework. Phase 1: Analyze is essentially a
front end compiler that takes in the source code and delivers a DFG of the kernels in the application. The
focus of this chapter is on the work in Phase 1; Phase 2 is discussed in Chapter 6. First, we describe the
process of identifying kernels in Section 5.1. Then we present our approach for application decomposition
in Section 5.1.2. We present an implementation of this decomposition in the form of a Matlab compiler in
Section 5.2. Lastly, we present our application graph production infrastructure in Section 5.3.
5.1 Identifying Kernels
In general, there are two approaches to forming coarse grain kernels from the source code: clustering and
decomposition. A clustering, or bottom-up, approach is used on lower level languages and starts with a low
level view of the individual operations (such as scalar addition, multiplication, etc.) and seeks to cluster
them into kernels. A decomposition, or top-down, approach is used for high level languages and starts at a
high level by recognizing kernels in the source code (such as matrix multiplication, FFT, convolution, etc.).
The benefits and pitfalls of each approach are discussed next.
Figure 5.2: Example operation graph showing clusters of operations forming three kernels.
5.1.1 Operation Clustering
Operation clustering, as shown in Figure 5.2, is a bottom-up approach that creates groups of operations
with similar data dependencies or groups of operations that match with the capabilities of a processor. In
general, this type of approach is very difficult and a continuous source of problems for researchers to address.
Both minimizing data transfer (reducing dependencies between groups) and matching the capabilities of a
processor with the types and number of operations are part of the cost functions used in this approach. Addi-
tionally, using this approach may form clusters of operations that may not match existing high performance
implementations. Thus additional work will be required to produce efficient, parallel implementations of
any possible clustering of operations for any type of processor. Due to these difficulties we opt for another
approach: decomposition.
5.1.2 Application Decomposition
Application decomposition relies on some underlying knowledge of the structure of the source code. Either
functions are called in the source code that can be easily identified and mapped to known types of kernels,
or the high level language supports constructs that can be identified as kernels. An example of the latter
is the Matlab language that supports vector or matrix variables. An operation on such variables like
multiplication can easily be identified as a matrix-matrix or matrix-vector multiplication or as a matrix
scaling kernel if the matrix variable is multiplied by a scalar variable. If such underlying knowledge of the
structure is not known, the common approach is to use manual labeling. In this approach the developer
inserts labels (also called directives or pragmas) into the source code to signal the start or end of a kernel.
Analyzing the source code using a decomposition approach simplifies the production of the kernel DFG. An
Figure 5.3: Example application graph showing the kernels that are present.
example kernel DFG is shown in Figure 5.3. In this graph, each node represents a kernel such as matrix
multiplication (MM), matrix inversion (Inv), or Cholesky decomposition (Chol).
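The type-driven identification described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the compiler's actual logic; the function name classify_multiply, the (rows, cols) shape tuples, and the returned labels are all hypothetical.

```python
def classify_multiply(lhs_shape, rhs_shape):
    """Name the kernel implied by `lhs * rhs`, given (rows, cols) operand shapes."""
    def kind(shape):
        rows, cols = shape
        if rows == 1 and cols == 1:
            return "scalar"
        if rows == 1 or cols == 1:
            return "vector"
        return "matrix"

    kinds = (kind(lhs_shape), kind(rhs_shape))
    if kinds == ("scalar", "scalar"):
        return "scalar multiply"  # a single operation, not a coarse grain kernel
    if "scalar" in kinds:
        # One operand is a scalar: the other operand is being scaled.
        other = kinds[0] if kinds[1] == "scalar" else kinds[1]
        return f"{other} scaling"
    if kinds == ("matrix", "matrix"):
        return "matrix-matrix multiplication"
    if "matrix" in kinds:
        return "matrix-vector multiplication"
    return "vector-vector multiplication"
```

Under these assumptions, the same multiplication operator maps to a different kernel depending only on the operand shapes, which is the property that makes decomposition straightforward for a language with matrix variables.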
Using this approach, not only can the kernels be easily identified but also easily mapped to existing high
performance implementations for a variety of processors. A common solution to mapping kernels that do not
have existing high performance implementations is to implement them sequentially in a CPU-like processor.
Although the sequential kernel implementation may not achieve high performance, it will still enable the
application as a whole to be implemented in heterogeneous systems. Tools such as high level synthesis (HLS)
are also a potential source of acceleration for sequential kernels implemented in FPGAs.
5.2 Matlab compiler
Given the difficulties of kernel identification, we chose an application decomposition approach using the high
level Matlab language. The Matlab language and infrastructure provide many benefits, three of which
are that it simplifies kernel identification, provides access to high performance kernel implementations, and
is already the implementation language of many of the compute intensive scientific applications that
motivated this work. As a proof of concept, we implemented a front-end compiler to produce a kernel DFG from
Matlab source code. This implementation is not a full-fledged compiler, but provides the capabilities needed
to analyze the variety of the applications studied in this work. Our compiler is composed of three parts:
lexer, parser, and dynamic analyzer. This is equivalent to the front-end of a general compiler. However, we
assume that the developer has already validated the numerical correctness (and therefore also the syntactic
correctness) of their program by running it sequentially in Matlab. Thus, we do not include the capability
for our compiler to track errors or produce insightful error messages to the programmer regarding the
correctness of their program. The goal of our compiler is to produce a dataflow graph of the coarse grain
Figure 5.4: High level overview of the compiler flow
kernels in the program (kernel DFG). It does not produce executable code/instructions for any specific type
of processor architecture, and is not meant to be an intermediate representation from which machine code
can be generated. A high level overview of the compiler flow and how it interacts with the other phases of
the framework is shown in Figure 5.4.
Given that the Matlab language specification has not been formally released by MathWorks (the com-
pany behind the Matlab language) and that the functionality of some program syntax varies between
versions (which suggests that the formal specification is evolving), our compiler was designed using informa-
tion from existing research [59]. We tested the correctness of the compiler by first running sample programs
in the Matlab software (thus using MathWorks’ Matlab compiler) and then analyzing these programs
with our compiler. Additionally, an open source tool called GNU Octave was developed as a free alternative
to the Matlab software. Even though the language used by the open source GNU Octave tool has many
similarities to the Matlab language, there are documented cases where Octave programs will not run in
Matlab and vice-versa [81]. Therefore we chose to create our own custom compiler rather than attempt
to use the existing Octave compiler. The next sections describe our proof of concept implementation and
reasoning for choosing the subset of the Matlab language to implement.
5.2.1 Lexer
Implementing the lexer for our front-end compiler began with the smallest testable subset of tokens, and
increased until we were able to test larger real-world programs. Listing 5.1 shows a small subset of the tokens
and regular expressions that were implemented. The complete lexer specification that was implemented
is available as Supplemental Item #1: matlab.lex. All tokens needed to support general arithmetic on
scalars were implemented first. The same scalar arithmetic was then extended to vectors and matrices.
However, the extensions needed to cover all vector and matrix arithmetic were not
implemented. Examples of unimplemented features include transpose, indexing, element-wise operations (other than
scalar add, sub, mult, div), and concatenation, among others. Support for complex numbers is also not
present in the implementation.
The lexer was implemented using the Fast Lexical Analyzer (Flex), which is a lexer generator. Flex
generates code that executes character scanning and tokenizing using a deterministic finite automaton (DFA).
The code generated by Flex is combined with the parser to form the entire compiler program. The next
section describes the parser functionality and how it interacts with the lexer.
Listing 5.1: Small selection of the Matlab language tokens implemented
NEWLINE \n|\r|\f
HSPACE [ \t]
INTEGER [0-9]+
INTEGER (([0-9]+)[dDeE]([+]?)([0-9]+))|(([0-9]+)[dDeE]([-])([0]+))
DOUBLE ([0-9]*)\.([0-9]+)
DOUBLE (([0-9]*)\.([0-9]+)[dDeE]([+-]?)([0-9]+))
STAR \*
PLUS \+
MINUS \-
BACKSLASH \/
EQUALS =
LEFTSQBRACKET \[
RIGHTSQBRACKET \]
LEFTPAREN \(
RIGHTPAREN \)
COMMA ,
SEMICOLON ;
COLON :
COMMENT %[^\n\r\f]*[\n\r\f]*
FORLOOP for
ENDBLOCK end
IDENTIFIER [a-zA-Z][_a-zA-Z0-9]*
TEXT [^\n\r\f]
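To make the tokenizing step concrete, the following is a minimal sketch of a tokenizer built on Python's re module using patterns modeled on a subset of Listing 5.1. It is not the Flex-generated lexer: Flex compiles the rules into a DFA and prefers the longest match, whereas this sketch simply tries the patterns in order and takes the first that matches. The names TOKEN_SPEC and tokenize are hypothetical, and exponent-form integers and the TEXT fallback are omitted for brevity.

```python
import re

# Ordered (token name, pattern) pairs. Order matters because the first
# matching pattern wins: DOUBLE precedes INTEGER, keywords precede IDENTIFIER.
TOKEN_SPEC = [(name, re.compile(pattern)) for name, pattern in [
    ("COMMENT",        r"%[^\n\r\f]*"),
    ("DOUBLE",         r"[0-9]*\.[0-9]+(?:[dDeE][+-]?[0-9]+)?"),
    ("INTEGER",        r"[0-9]+"),
    ("FORLOOP",        r"for\b"),
    ("ENDBLOCK",       r"end\b"),
    ("IDENTIFIER",     r"[a-zA-Z][_a-zA-Z0-9]*"),
    ("STAR",           r"\*"),
    ("PLUS",           r"\+"),
    ("MINUS",          r"-"),
    ("BACKSLASH",      r"/"),
    ("EQUALS",         r"="),
    ("LEFTSQBRACKET",  r"\["),
    ("RIGHTSQBRACKET", r"\]"),
    ("LEFTPAREN",      r"\("),
    ("RIGHTPAREN",     r"\)"),
    ("COMMA",          r","),
    ("SEMICOLON",      r";"),
    ("COLON",          r":"),
    ("NEWLINE",        r"[\n\r\f]"),
    ("HSPACE",         r"[ \t]+"),
]]


def tokenize(source):
    """Return a list of (token name, lexeme) pairs, dropping horizontal space."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, regex in TOKEN_SPEC:
            match = regex.match(source, pos)
            if match and match.end() > pos:
                break
        else:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if name != "HSPACE":
            tokens.append((name, match.group()))
        pos = match.end()
    return tokens
```

For example, tokenizing the statement a = 2+3; with this sketch yields an IDENTIFIER, EQUALS, INTEGER, PLUS, INTEGER, and SEMICOLON token in that order.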
5.2.2 Parser
Compared to the work performed by a traditional parser, the parser in our compiler does not implement
extensive error checking. Instead, it effectively acts as an intermediate representation generator. The
main work of the parser is to match the tokens against the grammar rules and generate the intermediate
Figure 5.5: Progression of compilation process from initial source code (a) to the generated kernel DFG (d). Panels: (a) Initial Matlab source code from user; (b) Compiler generated intermediate code; (c) Graph construction during execution; (d) Final kernel DFG with kernels identified.
representation. This intermediate representation is then used to perform a dynamic analysis that produces
the kernel DFG for the application.
Similar to the development approach used with the lexer, the parser grammar rules were developed
incrementally, using real Matlab programs. A subset of these grammar rules is shown in Listing 5.2. The
complete grammar specification that was implemented is available as Supplemental Item #2: matlab.yacc.
When the tokens match the appropriate grammar rule the intermediate code is generated. This code is regular
C++ language syntax using a custom library called the GraphCodeLibrary that contains the functionality
needed to generate the kernel DFG of the application. More on this library is discussed in Section 5.3.
Figure 5.5 shows the compilation process from initial source code to generated kernel DFG. From the
initial source in Figure 5.5a, our compiler statically produces the intermediate code in Figure 5.5b. This
intermediate code uses the Data class from the GraphCodeLibrary for each variable. When this intermediate
code is executed it begins building a dataflow graph of the variables and operations on them as shown
in Figure 5.5c. Once all of the variable types have been identified we construct the final kernel DFG as
shown in Figure 5.5d. Before parsing begins, the output file that will hold the intermediate code is created
and initialized. Some of the initialization process includes adding a comment header at the top of the file,
#include statements for the GraphCodeLibrary headers, and setting up the function that will contain the
entire functionality of the application (the app function shown in Figure 5.5b). Then parsing of the tokens
begins. In general, every line of Matlab code must include an assignment to a variable identifier. Any line
of code that does not assign to some variable has no impact on any final computed
results. We assume that every line of code is necessary to produce the final correct results. For example,
the Matlab code: a = 2+3; computes the addition of two integers and assigns the result to variable a. If
the line of code was just 2+3; then the result of this addition would be meaningless and not impact any
Listing 5.2: Small selection of Matlab grammar rules
commands : /* empty */
 | commands command
 ;
command : assignment
 | for_init commands ENDBLOCK
 | COMMENT
 ;
assignees : IDENTIFIER
 | assignees IDENTIFIER
 | assignees COMMA IDENTIFIER
 ;
assignee : IDENTIFIER
 | LEFTSQBRACKET assignees RIGHTSQBRACKET
 ;
assignment : assignee EQUALS expr
 ;
for_init : FORLOOP IDENTIFIER EQUALS value COLON value
 ;
expr : value
 | IDENTIFIER
 | LEFTSQBRACKET rows RIGHTSQBRACKET
 | expr PLUS expr
 | expr MINUS expr
 | expr STAR expr
 | expr BACKSLASH expr
 | func LEFTPAREN func_args RIGHTPAREN
 ;
func : IDENTIFIER
 ;
func_args : IDENTIFIER
 | func_args COMMA IDENTIFIER
 ;
row : /* empty */
 | value
 | row_delim
 | row_delim value
 ;
rows : | row
 | rows NEWLINE row
 | rows SEMICOLON row
 ;
value : INTEGER
 | DOUBLE
 ;
other computation in the application. Thus, the grammar rules that result in code generated are only the
assignments and control statements (if/else, loops, etc.).
In addition to assignments of expressions to variables, initial data values can also be assigned. An example
of this is a = 2; which could then be used in a later line of code like c = a + b;. In this case, the initial
data assignment does not constitute generating a node in the operation graph (since no operation has been
performed). The dynamic analysis operates using representative data to exercise the control statements.
Many of the Matlab functions that deal with external data/functions/files are not implemented. For
example, the load function can be used to load variables from a file, but is not implemented in our compiler.
Instead all initial variable data must be assigned in the Matlab code. Another example is declaring
new functions in Matlab using the syntax: function [y1,...,yN] = myfun(x1,...,xM). Although our
compiler can recognize the calling of any function (whether a built-in Matlab function like fft or a custom
user defined function) it does not parse the definition of functions. Lastly, the proof of concept compiler only
analyzes a single Matlab script file. The functionality to parse multiple files is not implemented. Given
that each Matlab file can only be either a sequential script or single function, normally multiple Matlab
files are needed for large applications. Even though in our proof of concept compiler only a single Matlab
file is parsed, any functions found are treated as a single kernel. This means that even though the function
definition is unknown the complete kernel DFG can still be produced, accurately representing the code in a
single Matlab file.
After parsing all of the tokens, the parser performs final cleanup operations. This includes closing the
application function (adding a return statement and closing curly braces) and writing out the main function
as shown in Figure 5.5b. The main function calls this application function (containing all of the generated
code) which builds the application DFG. Then it calls the appropriate GraphCodeLibrary functions to output
the final kernel DFG. Figure 5.5c shows the graph construction during execution, where both initial variables
and operations on them are created. Afterwards, variable types are applied to the operations to determine
the type of kernel. In Figure 5.5d these kernels are identified as matrix scaling (Ms) and matrix addition
(M+). In addition to the kernel DFG (the main output of the compiler), the compiler also produces a file
containing the initialization values for any variables found in the application, and which kernels take them
as input. This provides a location for initial values, and separates the actual data from the computational
workload. The user can easily modify this data as needed without having to recompile the application.
5.2.3 Dynamic Analysis
The last step in the compilation sequence is to execute the intermediate code generated during parsing. In
Figure 5.4, the generated code is represented by the Dummy Code and the GraphCodeLibrary is shown as
Graph Data Lib.. The dynamic analysis is performed by executing this generated code. During execution,
statistics such as data types and data dependencies are monitored. These are then used to identify the
individual kernels and form the kernel DFG. As expressions are evaluated, an overloaded operator library
tracks the data accessed. The first time a data element (scalar, vector, or matrix) is accessed, this is counted
as a memory access and the appropriate statistics updated. When a data element is accessed again it is not
counted as a memory access. Memory access statistics are included in the kernel DFG and stored for future
analysis use.
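The first-touch counting rule can be sketched as follows. This is an illustrative Python stand-in rather than the overloaded operator library itself; the class name AccessTracker and its touch method are hypothetical, and data elements are identified here simply by name.

```python
class AccessTracker:
    """Counts a memory access only the first time each data element is used."""

    def __init__(self):
        self.seen = set()          # data elements that have already been accessed
        self.memory_accesses = 0   # number of first-time accesses recorded so far

    def touch(self, name):
        """Record a use of `name`; return True only if it counts as a memory access."""
        if name in self.seen:
            return False           # repeated use: not counted again
        self.seen.add(name)
        self.memory_accesses += 1
        return True
```

With this rule, a sequence of uses a, b, a, c, b is recorded as three memory accesses, since only the first use of each element is counted.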
5.3 Producing Application DFGs
Producing a kernel DFG for an application has been a constant source of difficulty in the analysis of applica-
tions. Every language is different, every study concentrates on different aspects of the code/program/graph,
and each approaches the graph production in a different and unique way. This has created a large number
of efforts with incompatible graph production tools. Thus, every study has been forced to start with the
task of determining how to produce the graph of the code that they need. In this work, we initially created
a library (i.e., GraphCodeLibrary) of overloaded operators for a templated C++ class called Data. Whenever
an instance of this class was created, data structures were initialized and created so that when the instance
was used or operated on information could be extracted.
In this work, we are interested in the computational workload in general rather than the data-specific
operations. When building the DFG, if a data element (scalar, vector, matrix) has not yet been accessed,
then the operation on it will have no incoming dependencies. After the result of an operation is assigned to a
variable, a node ID is assigned to that variable. This node ID corresponds to a node in the DFG. Whenever
the variable is used later, an edge is connected from the older node ID to a new one. Thus, as computation
happens during execution the DFG is constructed behind the scenes using the infrastructure created in the
GraphCodeLibrary. The full GraphCodeLibrary is available as Supplemental Item #3: GraphCodeLibrary, a
folder containing all the dependent source files needed.
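The node ID mechanism described above can be sketched with operator overloading. The sketch below uses Python in place of the templated C++ class, and it is not the GraphCodeLibrary API: the Graph helper, the attribute names, and the "add" and "mul" labels are assumptions made for illustration.

```python
class Graph:
    """Minimal DFG container: labeled nodes plus producer-to-consumer edges."""

    def __init__(self):
        self.nodes = {}   # node ID -> operation label
        self.edges = []   # (producer node ID, consumer node ID)

    def add_node(self, label, operands):
        node_id = len(self.nodes)
        self.nodes[node_id] = label
        for operand in operands:
            # Initial data has no producing node, so it adds no incoming edge.
            if operand.node_id is not None:
                self.edges.append((operand.node_id, node_id))
        return node_id


class Data:
    """Stand-in for a variable; using it in an expression records a DFG node."""

    graph = Graph()   # one shared graph, built as a side effect of execution

    def __init__(self, value=None, node_id=None):
        self.value = value
        self.node_id = node_id   # ID of the node that produced this variable, if any

    def _binary(self, other, label):
        node_id = Data.graph.add_node(label, [self, other])
        return Data(node_id=node_id)   # the result carries the new node ID

    def __add__(self, other):
        return self._binary(other, "add")

    def __mul__(self, other):
        return self._binary(other, "mul")
```

Under these assumptions, evaluating c = a + b followed by d = c * a on two initial Data objects a and b creates an "add" node with no incoming edges and a "mul" node with one incoming edge from the "add" node, because only c was produced by an earlier operation.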
Since we have not defined the data type (integer, floating point, array, etc.) for Data objects, we can
use this graph construction approach for both operation DFGs and kernel DFGs. By default, the library
supports the overloaded operators needed to form operation graphs. But for kernel graphs, some kernels
cannot be represented by simple operators. Some examples of this are operations on vectors and matrices:
transpose, inversion, decomposition, etc. For some of these operations there is no single operator that can
be used; instead, they are represented as function calls. We have added a func function to the Data class that
takes an array of the inputs, an array of outputs, and a function name. The inputs and outputs are used
to create the dependencies between the kernels. The function name is encoded into the graph to mark the
type of kernel (or work) represented by that particular node. This capability generalizes the type of work
that can be represented in the graph.
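A function-call kernel of this kind might be recorded as in the sketch below. It is a Python illustration only: the KernelGraph class, the use of variable names as strings, and the kernel labels are hypothetical, and the real func signature in the GraphCodeLibrary may differ.

```python
class KernelGraph:
    """Records function-call kernels and the dependencies between them."""

    def __init__(self):
        self.nodes = {}      # node ID -> kernel (function) name
        self.edges = []      # (producer node ID, consumer node ID)
        self.producer = {}   # variable name -> node ID that last wrote it

    def func(self, inputs, outputs, name):
        """Add a kernel node named `name` that reads `inputs` and writes `outputs`."""
        node_id = len(self.nodes)
        self.nodes[node_id] = name
        for var in inputs:
            if var in self.producer:   # initial data adds no incoming edge
                self.edges.append((self.producer[var], node_id))
        for var in outputs:
            self.producer[var] = node_id
        return node_id
```

For instance, recording a Cholesky decomposition of A into L, an inversion of L into Linv, and a multiplication of Linv with B produces three nodes chained by two edges, with the function name stored as each node's kernel type.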
Chapter 6
Scheduling
In this chapter we introduce two specific subsets of the general scheduling problem as they apply to problems
encountered in our framework. In this work, applications are broken down into kernels which are further
broken down into individual operations. At each level, kernels in an application and operations in a kernel,
scheduling is used to assign the work to the hardware that will execute it. At the kernel level, individual
operations (scalar add, subtract, multiply, etc.) are scheduled onto the functional units of a processor to
estimate the performance of a kernel on any processor. At the system level, an application’s kernels are
scheduled onto the set of processors in the system to improve the performance or efficiency of the system by
improving the kernel-to-processor assignments.
First, we present the scheduling problem as it relates to modeling processors in Section 6.1. The sim-
ulator at the core of our framework is fundamentally an application of scheduling, rather than a complex
architectural simulator. We present the scheduling problem as it relates to simulation and modeling for
systems of processors in Section 6.2.
6.1 Graph-based Processor Modeling Methodology
Modeling the performance of different hardware architectures has been an important and complex problem
in the design process of hardware/software solutions for various applications. The ability to estimate the
execution time of a kernel without having to run it on the actual hardware provides the capability to compare
the performance of different types of hardware architectures and different design optimizations. Accurate
processor models are complex in order to account for every architectural feature [101]. Each processor has
diverse architectural features and modeling each set of features generally has resulted in different modeling
approaches. In addition, each processor may have a different programming model for implementing kernels.
To model a system of heterogeneous processors, multiple models would be required making system simulation
very expensive and time consuming. Kernel implementations are not portable between different processor
models and require recoding for each processor. Although we are interested in accurate performance
estimation, we seek to improve the standard modeling approach and investigate models that provide only as
much accuracy as needed to make the correct scheduling decisions to achieve overall application performance.
Fundamentally the goal of programming models and languages is to enable the developer to represent
the computational workload of their kernel such that it takes advantage of the architectural features of
the processor to achieve high performance and efficiency. The computational workload can be represented
generically as the scalar operations and their data dependencies. In many cases a kernel implementation
cannot achieve high performance due to programmer error, such as introducing artificial dependencies that
limit potential parallelism. But when represented as a graph, only the essential operations and their true
dependencies persist. Whereas implementations in different languages are needed for different processors, this
graph representation is portable between processors. Then, by scheduling the operations onto the functional
units of a processor we can estimate the performance of the computational workload efficiently and without
low level register-transfer level (RTL) architectural details.
Our modeling approach is to schedule the individual operations from the operation DFG of the kernel
onto the available functional units in a particular processor. As such, our goal in scheduling is to estimate the
number of clock cycles required to complete all of the operations. Although this approach may seem obvious,
it is generally neglected because scheduling itself is an NP-hard problem [13]. We analyze the contributions
an optimal schedule provides and prioritize each piece of information. Then we reduce the specific scheduling
requirements and investigate the ease with which schedules can be created and their accuracy compared to
the optimal solution.
We further constrain the scheduling problem for processor modeling by assuming that every operation
takes unit execution time (UET) to execute. For a graph of UET operations and precedence constraints the
scheduling problem is represented as $(P \mid prec; p_j = 1 \mid C_{max})$ using standard scheduling notation [109]. In
this notation, P represents the number of identical parallel cores and $C_{max}$ indicates that the objective is
to find a schedule that minimizes the maximum completion time (i.e., total execution time), where all tasks
have processing time $p_j = 1$. Since each operation takes a single unit (or atomic) amount of time to complete,
preemption is not possible.
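To illustrate the problem being defined, the sketch below computes the length of one feasible schedule for unit execution time operations on a fixed number of identical cores using greedy list scheduling. This is a generic heuristic chosen for illustration, not the algorithm developed in this work, and it does not guarantee an optimal schedule. The function name and the (source, destination) edge format are assumptions.

```python
from collections import defaultdict


def uet_list_schedule_length(num_ops, edges, cores):
    """Length of a greedy schedule for unit-time operations on `cores` identical cores.

    Operations are numbered 0..num_ops-1; `edges` holds (src, dst) precedence pairs.
    """
    succs = defaultdict(list)
    indegree = [0] * num_ops
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1

    ready = [op for op in range(num_ops) if indegree[op] == 0]
    steps = 0
    done = 0
    while ready:
        # Run up to `cores` ready operations in this time step.
        running, ready = ready[:cores], ready[cores:]
        steps += 1
        done += len(running)
        for op in running:
            for nxt in succs[op]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)   # becomes available in the next time step
    if done != num_ops:
        raise ValueError("graph contains a cycle")
    return steps
```

Because every operation takes exactly one time step, the returned value is a schedule length in cycles, and it is an upper bound on the optimal $C_{max}$ for the given number of cores.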
Figure 6.1: Reduced version of the example graph (a), graphical version of reduced graph (b), and result of the estimated scheduling (c). Panels: (a) An example graph before scheduling with levels L1 to L4; (b) Reduced representation showing the number of operations in each level; (c) Estimated schedule of example graph with 3 cores achieving a schedule of 5; (d) Epochs E1 to E5 after scheduling with 3 cores P1 to P3, and 3-span of 5.
First, we analyzed the specific information needed in scheduling and present a reduced graph represen-
tation based on only the subset of information needed for processor modeling in Section 6.1.1. Then we use
this representation for modeling processors with identical cores and present our approach in Section 6.1.2.
We extend this for processors with different cores connected in a pipelined fashion in Section 6.1.3.
6.1.1 Reduced Graph Representation
In general, a schedule is produced in order to know exactly when each operation should be executed and
on exactly which functional unit. But for the sake of estimating performance, we do not need the exact
schedule. Instead, we are only interested in the length of the schedule to estimate performance. In this
section we analyze the information typically contained in a schedule and eliminate unnecessary information
for the purposes of processor modeling.
Figure 6.1a shows an example graph. The optimal schedule for three cores is shown in Figure 6.1d. When
scheduling operations onto the functional units of a processor, an optimal schedule provides the following
information:
(1) uniquely identifies each operation,
(2) specifies operation-to-functional unit assignments, and
(3) dictates specifically when to execute an operation.
Therefore, optimally scheduling these operations will provide the detailed recipe for exactly how the
design should operate to get the best performance. However, optimal scheduling is NP-hard [13] and thus
this approach is infeasible for non-trivial real world architectures and kernels. To improve this approach,
we reevaluate what benefit each piece of information provides. Since the underlying objective of modeling a
processor is to estimate the performance, we need to know when an operation should be executed. Hence,
item (3) is the most important piece of information. We relax our requirements on items (1) and (2) to
more easily calculate the schedule. We define this variant as the reduced schedule, where only the number
of operations executed at each time step is specified, as shown in Figure 6.1c. We exploit this relaxation
to construct a polynomial-time algorithm that approximates the optimal schedule. We also exploit this
relaxation to reduce the memory footprint of the graph representation that allows large real world graphs
to be analyzed.
Scheduling the operations from the graph onto the functional units in a processor results in a detailed
schedule providing all three items of information. The length of this schedule is proportional to the number of
clock cycles required for execution. We define three types of schedules: the optimal schedule is the minimal
length schedule of a graph for a given set of cores, the execution schedule is the actual order in which operations
were executed when the implementation is run in hardware, and the reduced schedule (also: estimated
schedule) is produced from the graph and only provides the number of operations that are executed at each
time step. Both the optimal schedule and execution schedule provide all three pieces of information: (1)
uniquely identifying each operation, (2) specifying the exact functional unit, and (3) dictating the specific
time in which an operation is executed.
Two problems that prevent real world applicability for this approach are the amount of memory required
to store large graphs, and that the complexity of computing the optimal schedule is NP-hard [13]. In
addition to the three pieces of information a detailed schedule provides, the length of the schedule or span
represents the number of cycles required to complete all of the operations in a given processor. This span
is the metric used to gauge the performance of the design. To find this span, we do not need to know
exactly which operation will be executed on which functional unit, but instead we just need to determine
how many operations of each type are being executed at every time step. Given this we simplify the graph
representation and reduce the amount of work required to construct a schedule.
Our reduced representation only stores the number of operations within a level of a graph in the minimized
configuration. Given the example graph from Figure 6.1a, the corresponding reduced representation is shown
in Figure 6.1b. Given that only the number of operations is stored, data dependency information is not
explicitly stored. However the data dependencies can be inferred from the level ordering. For example, if an
operation is in the first level, that implies that it has no incoming dependencies. And if an operation is in
the second level, that implies that there is at least one incoming dependency from an operation in the first
level. No guarantees are given regarding the number of outgoing dependencies from an operation in level Li
Table 6.1: Calculating memory footprint of various graph representations
Graph Representation            Memory footprint calculation [bytes]
Adjacency Matrix                $D_N + D_{adj} \times N^2$
Adjacency List (STL List)       $N \times 16 + E \times 16 + D_N \times E$
Adjacency List (STL Vector)     $N \times 24 + E \times D_N$
Incidence List (STL List)       $16 + E \times 16 + E \times 2 D_N$
Incidence List (STL Vector)     $24 + E \times 2 D_N$
to any operation in level $L_{i+j}$. Using this reduced representation results in a loss of
edge information between levels L2 and L3, and between L3 and L4. The effect that using this reduced representation
has on calculating the optimal length of the span will be discussed later.
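As a minimal sketch of how such a per-level count could be derived from a full graph, the Python function below places each operation in the earliest level allowed by its dependencies (one more than the deepest predecessor) and returns only the counts. Treating "level" as this earliest possible level is an assumption made for illustration, as are the function name and the (source, destination) edge format.

```python
from collections import defaultdict


def level_counts(num_ops, edges):
    """Return [ops in level 1, ops in level 2, ...] for a DAG of (src, dst) edges."""
    succs = defaultdict(list)
    indegree = [0] * num_ops
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1

    # Level 1 holds every operation with no incoming dependencies.
    frontier = [op for op in range(num_ops) if indegree[op] == 0]
    counts = []
    while frontier:
        counts.append(len(frontier))
        next_frontier = []
        for op in frontier:
            for nxt in succs[op]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:   # last predecessor was in this level
                    next_frontier.append(nxt)
        frontier = next_frontier
    if sum(counts) != num_ops:
        raise ValueError("graph contains a cycle")
    return counts
```

The returned list is the entire reduced representation: the edges are discarded, and only the ordering of the levels retains the dependency information.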
The data sizes that modern applications are required to process are increasing and stretching the current
capabilities to process them. For these large matrices, the performance of operations on them, such as
scheduling, no longer becomes limited to whether they are stored in main memory or on disk [77]. Each
node in a dataflow graph represents a single operation that must be performed. When the number of
operations in the graph exceeds $10^{10}$, traditional storage techniques with minimal storage sizes $O(|V| + |E|)$
such as adjacency list, incidence list, or incidence matrix are still too large. This limit is easily reached by
many kernels including matrix-matrix multiplication graphs on data sizes larger than 2048x2048.
We analyzed the memory footprint of each of these graph representations generically based on a variety
of different factors. We present the equations to calculate the number of bytes required in Table 6.1. In
these equations N and E represent the quantity of nodes and edges in the graph respectively, DN represents
the number of bytes needed to hold the value of N , and DE represents the number of bytes needed to hold
the value of E. When using the adjacency matrix representation, each element may store as little as a single
bit (one or zero) to indicate if an edge exists or not. However if additional information is encoded such as
the type of operation or number of memory accesses each value may need to be larger. Thus Dadj represents
the number of bytes needed for each element in the adjacency matrix.
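The following sketch evaluates two of the footprint expressions from Table 6.1, reading the adjacency matrix entry as D_N + D_adj x N^2 and the adjacency list (STL Vector) entry as N x 24 + E x D_N. The function and parameter names are hypothetical, and the formulas should be checked against the table before relying on the numbers.

```python
def adjacency_matrix_bytes(n, d_n, d_adj):
    """Footprint of an adjacency matrix: the node count plus N^2 elements."""
    return d_n + d_adj * n * n


def adjacency_list_vector_bytes(n, e, d_n):
    """Footprint of an adjacency list of STL vectors: 24 bytes per node plus one ID per edge."""
    return n * 24 + e * d_n


# One million nodes at one byte per matrix element already needs about 1 TB.
MILLION_NODE_MATRIX = adjacency_matrix_bytes(10**6, 8, 1)

# One billion nodes with the minimum N - 1 edges needs about 32 GB as a list.
BILLION_NODE_LIST = adjacency_list_vector_bytes(10**9, 10**9 - 1, 8)
```

These two values illustrate the gap between the representations: the matrix grows with the square of the node count, while the list grows with the number of nodes and edges.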
Figure 6.2 shows the equations in Table 6.1 plotted over a range for each factor. For an adjacency
matrix, the two factors driving the size of the memory footprint are the number of nodes in the graph and
the number of bytes needed for each element in the matrix. Figure 6.2a shows the memory footprint for the
adjacency matrix scaled from 1 to 1 trillion nodes using from 1 byte to 8 bytes per element. Notice that it
is only practical to store graphs with thousands of nodes using this representation up until the light blue
region around 1 TB. Figure 6.2b shows the memory footprint for the adjacency list scaled over the number
of nodes and edges in the graph. The number of edges ranges from the minimum needed for a single
connected component (number of nodes minus one) to the maximum number of edges (for full connectivity,
$N \times (N - 1)$, where N is the number of nodes). Notice that using this representation we can store graphs
with billions of nodes up until the light blue region around 1 TB. But such large graphs must have minimal
edge connectivity in order to be feasible. The incidence list footprint is shown in Figure 6.3b. The memory
footprints for the remaining graph representations (adjacency list with STL List and incidence list with STL
List) show similar trends to Figure 6.2b and are not shown.
Figure 6.2: Memory footprints for different graph representations plotted over a range for each factor. Panels: (a) Adjacency Matrix; (b) Adjacency List (STL Vector).
Compared to current minimal storage techniques such as adjacency list or incidence list, which require
$O(|V| + |E|)$ memory for storage, our reduced representation requires storage on the order of the number of
levels in the graph, $O(|L|)$. For example, a DFG for matrix-matrix multiplication that operates on matrices
of size 8192x8192 contains more than one trillion operations and more than one trillion edges. With such
a large number of nodes, this graph cannot feasibly be stored using conventional storage techniques. The
reduced representation of this graph, however, can be stored using only a 13 element array since the DFG
has 13 levels. We calculate the memory footprint as $L \times D_N$ where $L$ is the number of levels in the graph
and $D_N$ is the number of bytes needed to store the number of operations in each level. Figure 6.3 shows the
memory footprint of the reduced representation over a range of graph sizes and number of levels compared
to the current smallest graph representation: incidence list. For the reduced representation, the number of
levels is at minimum one, if all of the nodes are parallel, and at maximum equal to the number of nodes, if
the graph is completely sequential.
(a) Reduced representation (b) Incidence List (STL Vector)
Figure 6.3: Memory footprints for reduced representation (a) compared to incidence list (b) plotted over a range for
each factor
(a) One-to-many. (b) Many-to-one. (c) Many-to-many.
Figure 6.4: The three possible types of bipartite graphs to represent the operations and edges between two levels.
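As a rough illustration of the footprint comparison above, the sketch below evaluates the adjacency matrix
footprint (number of nodes squared times bytes per element) and the reduced representation footprint
$L \times D_N$. The function names and the choice of 8 bytes per level counter are assumptions made for this
example only.

```python
def adjacency_matrix_bytes(num_nodes: int, bytes_per_element: int) -> int:
    """Footprint of a dense N x N adjacency matrix."""
    return num_nodes * num_nodes * bytes_per_element


def reduced_representation_bytes(num_levels: int, bytes_per_count: int) -> int:
    """Footprint L * D_N: one operation count stored per level."""
    return num_levels * bytes_per_count


# One million nodes at 1 byte per element already needs about 1 TB.
print(adjacency_matrix_bytes(10**6, 1))     # 1000000000000
# The 13-level DFG from the text, assuming 8-byte counters.
print(reduced_representation_bytes(13, 8))  # 104
```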
As mentioned previously, some graphs are too large to store in memory. Naturally, the next problem is
how to produce the reduced representation if the graph itself cannot be stored in memory. Fortunately, the
reduced representation can be constructed using minimal memory by accessing the nodes in the current
level first and temporarily storing each node and its level until all of its successor nodes have been
labeled, or at least until it is determined that the level of a successor depends not on this node but on
another. At that point, the memory used to store the information for the node is no longer needed and can
be freed.
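To make the level grouping concrete, the sketch below computes the per-level operation counts for a small
DAG held entirely in memory. It is not the memory-frugal construction described above; it only illustrates
what the resulting vector of counts looks like. The `level_counts` name, the successor-dictionary input, and
the convention that a node sits one level below its deepest predecessor are assumptions for this example.

```python
from collections import deque


def level_counts(successors: dict) -> list:
    """Return [Ops_1, Ops_2, ...] for a DAG given as {node: [successor, ...]}."""
    nodes = set(successors)
    for targets in successors.values():
        nodes.update(targets)
    indegree = {node: 0 for node in nodes}
    for targets in successors.values():
        for target in targets:
            indegree[target] += 1

    # Kahn-style traversal: a node is final once all predecessors are processed.
    level = {node: 1 for node in nodes if indegree[node] == 0}
    ready = deque(level)
    while ready:
        node = ready.popleft()
        for target in successors.get(node, []):
            level[target] = max(level.get(target, 1), level[node] + 1)
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    counts = [0] * max(level.values())
    for assigned in level.values():
        counts[assigned - 1] += 1
    return counts


# Dot product of two 4-element vectors: 4 multiplies, then 2 adds, then 1 add.
dot4 = {"m0": ["a0"], "m1": ["a0"], "m2": ["a1"], "m3": ["a1"],
        "a0": ["a2"], "a1": ["a2"]}
print(level_counts(dot4))  # [4, 2, 1]
```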
By grouping operations into levels, we can analyze two levels separate from the rest of the graph. The
operations in any two neighboring levels and the edges between them form a bipartite graph. A bipartite
graph is defined as having two disjoint sets of operations, where there are no edges between any operations
in the same set. A complete bipartite graph is represented as $K_{n,m}$ where $n$ and $m$ are the numbers of
operations in each disjoint set. Since we know that each node must have at least one edge from a node in the
previous level, we can infer that for a one-to-many $K_{1,m}$ graph as shown in Figure 6.4a, the edge information
is completely recoverable since each node must have an edge to the single node in the previous level.
However, for a many-to-one $K_{n,1}$ subgraph as shown in Figure 6.4b (a subgraph of the complete graph
has the same number of nodes but not all the edges) the only guarantee is that there is at least one edge to
the operation in level $L_{i+1}$ from an operation in level $L_i$. But exactly which operation in level $L_i$ this
one edge comes from cannot be identified and, further, more edges may have existed in the original
representation of the graph. Similarly, a many-to-many $K_{n,m}$ subgraph as shown in Figure 6.4c only has the
guarantee that there are at least $m$ edges to operations in the subsequent level.
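The recoverability argument can be restated with simple counting: between a level with $n$ operations and a
next level with $m$ operations there are at least $m$ edges (one per operation in the next level) and at most
$n \times m$ (the complete bipartite graph). The following minimal sketch, with hypothetical function names,
shows that the two bounds coincide only in the one-to-many case.

```python
def edge_count_bounds(n_prev: int, m_next: int) -> tuple:
    """(min, max) number of edges between adjacent levels of sizes n_prev and m_next."""
    minimum = m_next           # every node in the next level has at least one predecessor
    maximum = n_prev * m_next  # complete bipartite graph K_{n,m}
    return minimum, maximum


def edges_recoverable(n_prev: int, m_next: int) -> bool:
    """The edge set is fully determined only when both bounds coincide."""
    low, high = edge_count_bounds(n_prev, m_next)
    return low == high


print(edges_recoverable(1, 5))  # one-to-many: True
print(edges_recoverable(5, 1))  # many-to-one: False
print(edges_recoverable(3, 4))  # many-to-many: False
```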
This reduced graph representation is used in schedule length estimation in the following sections. After
presenting the scheduling algorithm that operates on our reduced representation, we bound the amount of
error introduced due to the lack of edge information. Then, we extend the reduced representation for use in
modeling processors with different cores.
6.1.2 Schedule Length Estimation on Identical Cores
In processors with multiple cores, each core can perform a single operation at a time in concert with the
other cores, while sharing memory and data results. We assume each core has a single functional unit that
can perform any type of scalar operation (add, multiply, etc.). We schedule operations from the graph onto
the cores of the processor to estimate the number of cycles needed to complete all of the operations in a
kernel. This number of cycles is proportional to the actual execution time of the kernel on the processor.
First we present our scheduling algorithm to operate on the reduced graph representation presented in
the previous section.
Scheduling Algorithm
One of the advantages of dataflow graph representations is that they expose parallelism by expressing only the
actual data dependencies that exist in an algorithm. Scheduling these graphs and exploiting data parallelism
can be viewed as a transformation on the given dataflow graph that moves operations between levels without
violating dependencies as shown in Figures 6.1a and 6.1d. The scheduling of a graph onto a given number of
processors is equivalent to the hardware control that exists in the processor architecture. In contrast, many
other analytical approaches for processor modeling ignore the hardware control in the processor architecture
entirely.
For example, let’s assume that $T$ is the execution time for a given graph, $T_1$ is the execution time on
a single processor, and $T_p$ is the execution time on a processor with $p$ functional units. In general,
$T_p \geq T_1/p$, and the ideal speedup is achieved when $T_p = T_1/p$. This ideal case gives the
lower bound on the execution time of a kernel and is not often achieved in practice. Additionally, this
formulation is unrealistic since not all of a kernel is necessarily parallelizable and, according to Amdahl’s
law [4], $T_p = T_{seq} + T_{para}/p$, where only the parallel portion of the execution time $T_{para}$ can be divided
among the parallel functional units. Yet this calculation ignores the fact that not all parts of the
parallel portion $T_{para}$ can necessarily be divided equally among the functional units. Moreover, to accurately
account for $T_{para}$ we need to perform scheduling and evaluate the dependency structure of the kernel and
how it matches the available functional units and the hardware control in the processor architecture. A
better, more realistic approximation can be found by scheduling the graph onto $p$ functional units. This
scheduled estimation of the execution time approximates the actual ideal performance for a given processor.
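For a quick numerical feel of the two analytical formulas discussed above, the following sketch evaluates
the ideal bound $T_1/p$ and Amdahl’s law for a hypothetical kernel. The kernel timings (100 time units in
total, 20 of them sequential) and the function names are invented for illustration only.

```python
def ideal_time(t1: float, p: int) -> float:
    """Ideal lower bound T_1 / p."""
    return t1 / p


def amdahl_time(t_seq: float, t_para: float, p: int) -> float:
    """Amdahl's law: only the parallel portion is divided among p units."""
    return t_seq + t_para / p


print(ideal_time(100, 4))      # 25.0
print(amdahl_time(20, 80, 4))  # 40.0
```

Neither number reflects the dependency structure of the kernel, which is the gap that scheduling the graph
onto $p$ functional units is meant to close.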
When we calculate the span of a graph for a particular number of cores, what we are really interested
in is the version of the graph with the minimum span where the maximum width at any epoch is at most
the number of functional units. This span can be used as an indicator of performance directly proportional
to execution time. To find this span, we do not need to know exactly which operation will be executed on
which processing element, but instead we just need to determine how many operations are being executed
during each epoch. This reduced requirement allows us to represent the graph using the reduced graph
representation described previously. We present a polynomial time algorithm to estimate the length of the
schedule using the reduced graph representation. This algorithm has a computational complexity on the
order of the number of levels in the graph, $O(|L|)$.
For each level in a given graph, we calculate the number of epochs required to execute the operations in
level $L_i$, $Ops_i$, by
\[ epochs_i = \lceil Ops_i / p \rceil \tag{6.1} \]
where $p$ is the number of functional units to execute operations in the given processor. Ideally there should
be enough operations in any level of the graph to keep all of the functional units busy during every epoch.
However, this is rarely the case and the number of additional operations needed from other subsequent levels
to keep all functional units busy is
\[ need = \begin{cases} 0, & \text{if } (Ops_i \bmod p) = 0 \\ p - (Ops_i \bmod p), & \text{if } (Ops_i \bmod p) \neq 0 \end{cases} \tag{6.2} \]
We can fulfill this need by stealing operations from the next level (or any other subsequent levels), assuming
that we have already completed the predecessor operations. But this cannot actually be verified due to the
potential edge information loss from using the reduced representation. We will explore the range of potential
error introduced later. The number of exposed operations comes from either completing operations in the
current level, or from already stolen operations completed previously. A function $f$ can be used to estimate
the number of operations exposed using any heuristic information available for a particular problem or a
generic statistical calculation. If this function always returns zero, then this is the naïve case where only the
operations in any one level are executed at the same time and no stealing occurs.
\[ exposed = f \tag{6.3} \]
Then, the number of operations exposed is assigned to the need to determine the number of used operations
that were stolen from the subsequent level.
\[ used = \min\{need, exposed\} \tag{6.4} \]
Once the number of operations that are used from subsequent levels is calculated, we can subtract out
these operations from the subsequent level and add them to the current level:
\[ Ops_i^{+} = Ops_i^{-} + used \tag{6.5} \]
\[ Ops_{i+1}^{-} = Ops_{i+1} - used \tag{6.6} \]
Finally, we sum the number of epochs required for the current level with those from previous levels to
calculate the resulting span by modifying the $epochs_i$ calculation to use the latest level value of $Ops_i^{+}$:
\[ span_i = span_{i-1} + \lceil Ops_i^{+} / p \rceil \tag{6.7} \]
After completing the last level, this calculation will result in the span of the graph when executed on the
given processor architecture. Table 6.2 summarizes the variables described above.
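The steps in Equations (6.1) to (6.7) can be sketched as a single pass over the per-level operation counts.
This is a minimal sketch rather than the implementation used in this work: the `estimate_span` name, the
signature of the exposure callback, and the extra cap that prevents stealing more operations than the next
level still holds are all assumptions. The six-level example vector is hypothetical; it merely has 26
operations in total.

```python
from typing import Callable, List, Optional

ExposedFn = Callable[[int, List[int], int], int]


def estimate_span(ops: List[int], p: int, exposed_fn: Optional[ExposedFn] = None) -> int:
    """Estimate the schedule length (in epochs) from per-level operation counts."""
    remaining = list(ops)  # Ops^-: counts after earlier levels have stolen from them
    span = 0
    for i in range(len(remaining)):
        ops_i = remaining[i]
        need = (p - ops_i % p) % p                        # Equation (6.2)
        used = 0
        if exposed_fn is not None and i + 1 < len(remaining):
            exposed = exposed_fn(i, remaining, p)         # Equation (6.3)
            # Assumption: never steal more than the next level still holds.
            used = max(0, min(need, exposed, remaining[i + 1]))  # Equation (6.4)
            remaining[i + 1] -= used                      # Equation (6.6)
        ops_plus = ops_i + used                           # Equation (6.5)
        span += -(-ops_plus // p)                         # Equation (6.7), ceiling division
    return span


levels = [5, 5, 5, 4, 4, 3]  # hypothetical graph with 26 operations in 6 levels
print(estimate_span(levels, 4))                         # naive, no stealing: 9
print(estimate_span(levels, 4, lambda i, ops, p: p))    # always-exposed heuristic: 7
```

With no exposure function the sketch reduces to the naïve level-by-level case. The always-exposed callback is
deliberately optimistic and is only there to show how stolen operations shorten the estimate.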
Scheduling Algorithm Validation
The next task is to verify how close to the optimal schedule length, $OPT$, this approach can achieve. The
factor that makes this task rather difficult is stealing. The determination of how close to optimal the result
is depends on whether the exposed calculation produces a valid result, meaning that those operations could
actually be stolen without breaking any dependencies. To approach this problem we will instead assume
a naïve approach, where no stealing occurs. In this variation the operations in each level are divided into
epochs of size $p$, and no operations from differing levels are executed together. Using a greedy approach,
the epochs are filled with the operations from one level until there are no operations remaining. Then the
operations from the next level are processed in a similar way. This continues iteratively for each level in the
graph.
Table 6.2: Scheduling variable summary
Symbol      Description
level       A set of independent operations in the graph.
epoch       A set of independent operations from the graph executed concurrently.
width       The maximum number of operations in any level of the graph.
$epochs_i$  The number of epochs required to execute all operations in level $L_i$ of the graph.
need        The number of operations required to keep all functional units busy for a level in the graph.
exposed     The number of operations in subsequent levels of the graph with dependencies already met.
used        The number of exposed operations that were actually stolen to satisfy the need.
span        The total number of epochs in the specific span.
$p$         The number of functional units in the multiprocessor.
$Ops$       A vector where the rows represent levels in the graph and the column is the quantity of operations.
We assume that the values of all variables are positive whole numbers since the number of operations in
a level cannot be fractional nor can we have a partial number of functional units. We also assume that the
number of functional units is at least 1 and that each level in the graph contains at least 1 operation. Let $N$
be the number of operations in the graph. Let $L$ be the set of levels in the graph, where $|L|$ is the number
of levels in the graph. Let $Ops$ be a vector such that $Ops_i$ is the number of operations in level $L_i$ of the graph.
By definition:
\[ N = \sum_{i=1}^{|L|} Ops_i. \tag{6.8} \]
The maximum amount of parallelism within the graph is equivalent to the width of the graph which is
defined as:
\[ width = \max_{1 \leq i \leq |L|} \{Ops_i\}. \tag{6.9} \]
Let $p$ be the number of functional units in the processor architecture. If $p \geq width$ then a lower bound on
$OPT$ is:
\[ OPT = |L|, \tag{CriticalPath} \]
where the number of levels, $|L|$, is the critical path of the graph. In this bound, the assumption is that the
work in each level can be completed in a single epoch. If $p < width$ then the span will be greater than $|L|$
since at least one level will require more than one epoch to complete all of the operations in the level. In the
same way that we quantified the number of epochs in a level, we can also formulate a bound based on the
number of epochs required to complete all of the operations in the graph as:
\[ OPT \geq \left\lceil \frac{N}{p} \right\rceil. \tag{TotalWork1} \]
For this bound, if there are $N$ operations that need to be executed in the graph then they cannot possibly
be done in less time than the formulation above.
According to the above naïve estimation algorithm the span is calculated as:
\[ span = \sum_{i=1}^{|L|} epochs_i, \tag{6.10} \]
where $epochs_i = \lceil Ops_i/p \rceil$ represents the number of epochs required to complete all of the operations in
level $L_i$.
The number of operations in the graph, $N$, can be represented as a combination of the number of groups
of $p$ operations, denoted as $q_n$, plus the number of operations left over, denoted by $r_n$, as:
\[ N = p q_n + r_n, \tag{6.11} \]
with $0 \leq r_n < p$. Similarly, the number of operations in level $L_i$ can be represented as a combination of $q_i$
groups of $p$ operations plus $r_i$ leftover operations such that:
\[ Ops_i = p q_i + r_i. \tag{6.12} \]
Using this new representation, we can now calculate $epochs_i$ using $q_i$ and $r_i$ by:
\[ epochs_i = \begin{cases} q_i + 1 & \text{if } r_i \neq 0 \\ q_i & \text{if } r_i = 0 \end{cases} \tag{6.13} \]
Similarly, the $OPT$ TotalWork bound can be calculated by:
\[ OPT \geq \begin{cases} q_n + 1 & \text{if } r_n \neq 0 \\ q_n & \text{if } r_n = 0 \end{cases} \tag{TotalWork2} \]
Given this expression for $epochs_i$ we can define the range of the span. If $r_i = 0$ for every level then the
span could be as little as $\sum q_i$, but if $r_i \neq 0$ for every level then the span could be as high as $\sum (q_i + 1)$.
Thus, span is bounded as follows:
\[ \sum_{i=1}^{|L|} q_i \leq span \leq |L| + \sum_{i=1}^{|L|} q_i. \tag{6.14} \]
Since $N = \sum Ops_i$ we can relate the global $q_n$ and $r_n$ to the quantities for each level by summing
all of the leftover operations in each level and determining the number of groups of $p$ that can be formed
from them, denoted as $q_r$, by:
\[ \sum_{i=1}^{|L|} r_i = p q_r + r_r. \tag{6.15} \]
Then, the total number of groups of $p$ from $N$ is:
\[ q_n = q_r + \sum_{i=1}^{|L|} q_i, \tag{6.16} \]
and the number of leftover operations from $N$ is:
\[ r_n = r_r. \tag{6.17} \]
Since we know that $r_i < p$, the maximum value for $r_i$ is $p - 1$, meaning that
\[ q_r < |L|. \tag{6.18} \]
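The quotient and remainder bookkeeping in Equations (6.11) to (6.18) can be checked numerically. The sketch
below uses a hypothetical six-level vector with 26 operations and $p = 4$; the `decompose` helper is an
illustrative name, not part of the framework.

```python
def decompose(ops, p):
    """Return (q_i list, r_i list, q_n, r_n, q_r, r_r) for a vector of level sizes."""
    q = [count // p for count in ops]       # Equation (6.12), quotients
    r = [count % p for count in ops]        # Equation (6.12), remainders
    q_n, r_n = divmod(sum(ops), p)          # Equation (6.11)
    q_r, r_r = divmod(sum(r), p)            # Equation (6.15)
    return q, r, q_n, r_n, q_r, r_r


ops, p = [5, 5, 5, 4, 4, 3], 4
q, r, q_n, r_n, q_r, r_r = decompose(ops, p)
span = sum(qi + (1 if ri else 0) for qi, ri in zip(q, r))  # Equations (6.10), (6.13)

assert q_n == q_r + sum(q)                   # Equation (6.16)
assert r_n == r_r                            # Equation (6.17)
assert q_r < len(ops)                        # Equation (6.18)
assert sum(q) <= span <= len(ops) + sum(q)   # Equation (6.14)
print(q_n, r_n, q_r, r_r, span)              # 6 2 1 2 9
```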
Optimality Proof
In this section we prove the correctness of the scheduling algorithm presented above. Then we present an
extended version of the proof for all possible cases. We now present the main optimality theorem:
Theorem 1. Given the naïve scheduling algorithm and any input graph in the reduced representation, the
ratio of the resulting estimated span of the schedule to the optimal length of the schedule is $span/OPT < 2$.
(a) Graph with 1 epoch per level. (b) Graph with 3 epochs per level. (c) Graph with 1 or 2 epochs per level.
Figure 6.5: Example graphs for a multiprocessor architecture with 4 processing elements where the number of
operations in every level is: less than or equal to the number of processing elements (a), more than the number of
processing elements (b), or less than or equal to or greater than the number of processing elements (c).
Proof. For this algorithm, the worst case is when the number of nodes in each level is $Ops_i = p + 1$. In this
case at each level we want to steal $p - 1$ nodes from the subsequent level. Given the naïve case where no
stealing occurs this results in two epochs for each level when it is likely that some nodes from the subsequent
level were exposed during the first epoch. Thus the $OPT$ bound for this case is $q_n + 1$ and the estimation is
bounded by $|L| + \sum q_i$ according to Equation (6.14). At a minimum $q_n$ could be as small as $1 + \sum q_i$, and
$\sum q_i$ could be as small as $|L|$. Thus the ratio of the resulting estimated span of the schedule to the optimal
length of the schedule is:
\[ \frac{span}{OPT} \leq \frac{|L| + \sum q_i}{1 + \sum q_i} < 2. \tag{6.19} \]
Proof. Previously we identified two lower bounds on the length of the schedule in Equations (CriticalPath &
TotalWork2). When $Ops_i \leq p$ like the graph shown in Figure 6.5a, then Equation (CriticalPath) is the tighter
bound. Otherwise when $Ops_i > p$ like the graph shown in Figure 6.5b, then Equation (TotalWork2) is tighter.
But, if there is a combination such that some levels contain less than or equal to $p$ operations and some
contain more like the graph shown in Figure 6.5c, then the maximum of the two is used: $\max\{|L|, \lceil N/p \rceil\}$.
Since the bound on the optimal schedule length varies as described previously, we will break down the
proof into the following three cases.
Case 1: when the number of operations in every level is less than or equal to $p$,
$\forall i \in 1..|L|: Ops_i \leq p$
In this case, if $Ops_i < p$ for any level then $r_i \neq 0$ and $q_i = 0$. Thus from Equation (6.13) $epochs_i =
q_i + 1 = 1$. Similarly, if $Ops_i = p$ for any level then $r_i = 0$ and $q_i = 1$. Thus $epochs_i = q_i = 1$. Given these
two formulations, the number of epochs required for any level is 1 and the $span = |L|$. The ratio of span to
$OPT$ for this case will always be
\[ \frac{span}{OPT} = \frac{|L|}{|L|} = 1. \tag{6.20} \]
Thus, if $Ops_i \leq p$ the naïve algorithm calculates the optimal schedule length.
Case 2: when the number of operations in every level is greater than $p$, $\forall i \in 1..|L|: Ops_i > p$
In this case, given that $Ops_i > p$, $0 \leq r_i < p$ and $q_i \geq 1$, thus $\sum q_i \geq |L|$. We will use the $OPT$ bound
from Equation (TotalWork2) for this case since the length of the schedule is not bounded by the critical
path (i.e., number of levels). Both the span and $OPT$ formulations are dependent on the number of leftover
nodes, hence we will analyze four subcases:
2a. $\forall i \in 1..|L|: r_n = r_r = 0$ and $q_r = 0$: Since $q_r = 0$ and $r_r = 0$ then there must not be any leftover
operations in any level hence $span = \sum q_i$ and $OPT = \sum q_i$. Therefore the ratio of span to $OPT$ is
\[ \frac{span}{OPT} = \frac{\sum q_i}{\sum q_i} = 1. \tag{6.21} \]
2b. $\forall i \in 1..|L|: r_n = r_r > 0$ and $q_r = 0$: Since $q_r = 0$ and $r_r > 0$ then $\sum r_i < p$ and the worst case values
are $span = |L| + \sum q_i$ and $OPT \geq 1 + q_r + \sum q_i = 1 + \sum q_i$. Therefore the ratio of span to $OPT$ is
\[ \frac{span}{OPT} \leq \frac{|L| + \sum q_i}{1 + \sum q_i} < 2, \tag{6.22} \]
since $\sum q_i \geq |L|$ given that $Ops_i > p$.
2c. $\forall i \in 1..|L|: r_n = r_r = 0$ and $q_r > 0$: Since $q_r > 0$ and $r_r = 0$ then $\sum r_i \bmod p = 0$ and the worst case
values are $span = |L| + \sum q_i$ and $OPT \geq q_r + \sum q_i$. Therefore the ratio of span to $OPT$ is
\[ \frac{span}{OPT} \leq \frac{|L| + \sum q_i}{q_r + \sum q_i} < 2. \tag{6.23} \]
Note that this ratio cannot be less than 1 since $q_r < |L|$ as established in Equation 6.18.
2d. $\forall i \in 1..|L|: r_n = r_r > 0$ and $q_r > 0$: Since $q_r > 0$ and $r_r > 0$ then the worst case values are
$span = |L| + \sum q_i$ and $OPT \geq 1 + q_r + \sum q_i$. Therefore the ratio of span to $OPT$ is
\[ \frac{span}{OPT} \leq \frac{|L| + \sum q_i}{1 + q_r + \sum q_i} < 2. \tag{6.24} \]
The results for these four subcases prove that for the case when $Ops_i > p$, $span/OPT < 2$.
Case 3: when the number of operations in every level can be less than or equal to $p$ or greater
than $p$, $\forall i \in 1..|L|: Ops_i \leq p$ or $Ops_i > p$
In this case, given that $Ops_i$ may be either less than, equal to, or greater than $p$, we know that $q_i \geq 0$
and $0 \leq q_r < |L|$. If $q_r + \sum q_i \leq |L|$ with $q_r > 0$ and $\sum q_i < |L|$, then some levels have $Ops_i \leq p$ thus
$span < |L| + \sum q_i$ and $OPT = |L|$. The ratio is then
\[ \frac{span}{OPT} \leq \frac{|L| + \sum q_i}{|L|} < 2. \tag{6.25} \]
If $q_r + \sum q_i \leq |L|$ with $q_r = 0$ then every level has $Ops_i = p$ and we would have case 1 if $\sum r_i = 0$. Otherwise
if $q_r + \sum q_i > |L|$ then we have the same formulation as $Ops_i > p$ and so $span/OPT < 2$.
In summary, we’ve shown that in each case the value of the ratio of the estimated span to the optimal
schedule is always less than 2.
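As an informal sanity check of the theorem, the sketch below compares the naïve span against the lower bound
$\max\{|L|, \lceil N/p \rceil\}$ for randomly drawn level vectors. Because the true optimum is at least this lower
bound, the ratio against the true optimum can only be smaller. The sampling ranges, the seed, and the helper
names are arbitrary choices for this illustration, and a random test of this kind does not replace the proof.

```python
import random


def naive_span(ops, p):
    """Sum of ceil(Ops_i / p) over all levels, Equation (6.10)."""
    return sum(-(-count // p) for count in ops)


def opt_lower_bound(ops, p):
    """max(|L|, ceil(N / p)), the tighter of CriticalPath and TotalWork."""
    return max(len(ops), -(-sum(ops) // p))


rng = random.Random(0)
worst = 0.0
for _ in range(10_000):
    p = rng.randint(1, 16)
    ops = [rng.randint(1, 64) for _ in range(rng.randint(1, 20))]
    worst = max(worst, naive_span(ops, p) / opt_lower_bound(ops, p))

print(worst < 2)  # True
```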
Exposure Analysis
In the previous section we proved that our scheduling algorithm as a whole produces schedule length
estimations within two times the optimal length. The main reason for this discrepancy is that the algorithm
naïvely groups operations into epochs. By enabling stealing, or taking operations from subsequent levels, we
can better approximate the optimal schedule. In this section we will investigate the effect that the exposure
calculation has on the rest of the algorithm, including the potential for additional error.
An Initial Bound. The algorithm presented above operates on a subset of information about the graph.
As such, the number of operations exposed is an estimation that is not based on the actual dependencies of
the original graph. First, we will investigate the range of error that can be introduced into the span by the
exposure calculation. From Equation (6.2) we know that the range of the number of operations needed from
subsequent levels to keep all of the functional units busy, need, is:
\[ 0 \leq need \leq p - 1 \tag{6.26} \]
Based on this range for the need, we can calculate the maximum and minimum values that will be used
from Equation (6.4). The worst case is when $exposed \geq p - 1$ and the operations between the current level
$L_i$ and the next level $L_{i+1}$ are fully connected. In this case stealing is not possible. The maximum number
of operations that could be mistakenly stolen is when $used = p - 1$. When related to the span calculation in
Equation (6.7), this does not affect the resulting number of epochs for the current level. Instead, this affects
the next level, which may have fewer operations than actually possible (having removed some by stealing),
leading to one less epoch in the next level. This affects the span by only up to one epoch per level. Finally,
we can bound the error in the span calculation to
\[ 0 \leq error_{span} \leq |L| - 1 \tag{6.27} \]
where $|L|$ is the number of levels in the graph. We subtract one since we cannot introduce any error into
the calculation for the first level.
As shown above, the amount of error introduced by stealing is capped by the number of levels in the
graph. This means that, given a graph with 6 levels, the estimated span could be off by up to 5 epochs. If
the width of the graph is less than or equal to the number of functional units (meaning only 1 epoch per level)
like the one shown in Figure 6.5a, then our naïve estimation algorithm will calculate the optimal schedule length
since there is no opportunity for stealing. If the graph gets a little wider and $width > p$ (meaning at least
one level has 2 or more epochs) like the one shown in Figure 6.5c, then there is some potential for error.
Given that the number of operations in this graph is 26 and there are 6 levels, our estimation algorithm
would produce a span of 9 using 4 functional units. Thus the range of results would theoretically be from
4 to 9 since we could be off by up to 5. But the CriticalPath bound tells us that the span cannot be any
smaller than $|L|$ (which for this example is 6) so our actual range is from 6 to 9. Hence our new range for
the error is
\[ 0 \leq error_{span} \leq \min\left\{ |L| - 1, \; \sum_{i=1}^{|L|} epochs_i - |L| \right\} \tag{6.28} \]
where the summation of the $epochs_i$ represents the naïve approach with no stealing. We can further improve
this formulation since we also know that the span cannot be lower than the TotalWork2 bound $\lceil N/p \rceil$ which
evaluates to $\lceil 26/4 \rceil = 7$ for this example. Thus the final bound on the amount of error introduced is:
\[ 0 \leq error_{span} \leq \min\left\{ |L| - 1, \; \sum_{i=1}^{|L|} epochs_i - \max\left\{ |L|, \left\lceil \frac{N}{p} \right\rceil \right\} \right\} \tag{6.29} \]
and our range for the span is from 7 to 9.
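The final error bound in Equation (6.29) is straightforward to evaluate from the level sizes alone. In the
sketch below, the level vector is a hypothetical stand-in with 26 operations in 6 levels (the exact per-level
counts of the graph in Figure 6.5c are not reproduced here), and `span_error_bound` is an illustrative name.

```python
def span_error_bound(ops, p):
    """Upper limit on the span error from stealing, following Equation (6.29)."""
    levels = len(ops)
    naive = sum(-(-count // p) for count in ops)  # sum of epochs_i, no stealing
    total_work = -(-sum(ops) // p)                # ceil(N / p)
    return min(levels - 1, naive - max(levels, total_work))


ops, p = [5, 5, 5, 4, 4, 3], 4                    # hypothetical: N = 26, |L| = 6
naive = sum(-(-count // p) for count in ops)
bound = span_error_bound(ops, p)
print(bound)                  # 2
print(naive - bound, naive)   # 7 9
```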
As the width of the graph increases the number of epochs also increases, but the amount of error is still
based on the number of levels which has not changed. Therefore, given a graph with 3 epochs per level like
the one shown in Figure 6.5b, the length of the schedule may be 18 but the amount of possible error is 4.
In short, since the naïve algorithm has an upper bound of two times optimal, stealing operations from
subsequent levels can bring the estimation closer to optimal. The naïve algorithm also has the benefit of
always producing results greater than or equal to optimal. However, stealing can result in estimations that
are less than the length of the optimal schedule. Thus when designing an exposure function, one must
engineer safeguards to prevent such impossible-to-achieve results.
A Tighter Bound. If we assume that there are enough operations in the next level $L_{i+1}$ that can
be exposed so that the $need_i$ for level $L_i$ can be met, then we can eliminate the need to expose and steal
operations from levels further down. Under this additional assumption, a tighter bound can be derived.
First, we will bound the possible number of operations exposed and then show how this is minimized in the
algorithm using Equations (6.2, 6.4).
Consider the following sample exposure calculation:
\[ exposed_i = \left\lfloor p \, \frac{Ops_{i+1}}{Ops_i} \left\lfloor \frac{Ops_i}{p} \right\rfloor \right\rfloor \tag{6.30} \]
In this calculation, the first term $p(Ops_{i+1}/Ops_i)$ represents the rate of change in number of operations
between two levels. The second term calculates the number of epochs that will be required for level $L_i$. This
calculation simplistically attempts to estimate how many operations can be exposed after completing each
epoch in level $L_i$. We will analyze our scheduling approach using the exposure calculation shown above.
When considering Equation (6.30), three possible cases emerge as shown in Figure 6.4.
Case (a) One-to-many: In this case the number of operations in the current level $L_i$ is 1 and the
number of operations in the next level $L_{i+1}$ is $n$ as shown in Figure 6.4a. When we plug these values into
the exposed calculation from Equation (6.3) we get:
\[ exposed_i = \left\lfloor p \, \frac{n}{1} \left\lfloor \frac{1}{p} \right\rfloor \right\rfloor \tag{6.31} \]
If we assume that $p = 1$, then the need will evaluate to 0, and thus the result from Equation (6.31) above
will not affect the number of operations used. However, if $p > 1$ then the second term, $\lfloor 1/p \rfloor$, will evaluate
to 0 and thus the number of operations exposed will also be 0. In this case the equations use the correct
number of exposed operations and so no error is introduced into the span calculation.
Case (b) Many-to-one: In this case the number of operations in the current level $L_i$ is $m$ and the
number of operations in the next level $L_{i+1}$ is 1 as shown in Figure 6.4b. When we plug these values into
the exposed calculation from Equation (6.3) we get:
\[ exposed_i = \left\lfloor p \, \frac{1}{m} \left\lfloor \frac{m}{p} \right\rfloor \right\rfloor = \left\lfloor \frac{p}{m} \left\lfloor \frac{m}{p} \right\rfloor \right\rfloor \tag{6.32} \]
Here, we ignore the case when the number of operations in level $L_i$ is a multiple of $p$ according to
Equations (6.2, 6.4). Let’s represent $m$ as such:
\[ m = pq + r, \tag{6.33} \]
where $q$ is the quotient and $r$ is the remainder from dividing $m$ by $p$. After plugging this new value into our
previous equation we get:
\[ exposed_i = \left\lfloor \frac{p}{pq + r} \left\lfloor \frac{pq + r}{p} \right\rfloor \right\rfloor \tag{6.34} \]
Given the nature of the floor function, the second term $\lfloor (pq + r)/p \rfloor$ will always evaluate to $q$. This
reduces our exposed calculation to:
\[ exposed_i = \left\lfloor \frac{pq}{pq + r} \right\rfloor = 0 \tag{6.35} \]
In Equation (6.35) above, since the numerator will always be smaller than the denominator (given our
previous assumptions, $r > 0$) the floor function will always evaluate this equation to 0, making the number of
operations exposed also 0. In this case the equations will always calculate that no operations were exposed.
However, this is not always the case. Figure 6.6 shows that there can be at maximum 1 operation exposed
since we know that the number of operations in $L_{i+1}$ is always 1 in this case.
(a) Levels in the graph (b) Epochs after scheduling
Figure 6.6: One possible case where the exposed calculation produces error for Case (b) Many-to-one with p = 2.
The example in Figure 6.6 shows that 2 epochs are required for level $L_1$ and 0 epochs for level $L_2$.
According to Equation (6.4) level $L_1$ would always require 2 epochs, but level $L_2$ with no stealing would
require 1 epoch. In this case, the calculations for the subsequent level may produce an inaccurate span
calculation but those for the current level will not as long as no operations were used by the previous level.
We will formally bound the span error after the next case.
Case (c) Many-to-many: In this case the number of operations in the current level $L_i$ is $m$ and the
number of operations in the next level $L_{i+1}$ is $n$ as shown in Figure 6.4c. When we plug these values into
the exposed calculation from Equation (6.3) we get:
\[ exposed_i = \left\lfloor p \, \frac{n}{m} \left\lfloor \frac{m}{p} \right\rfloor \right\rfloor \tag{6.36} \]
Just as in the previous case, we ignore the case when the number of operations in level $L_i$ is a multiple
of $p$ according to Equations (6.2, 6.4). However, here some error could be introduced since there could be
as few as $n$ dependencies or as many as $2 \times n$ if every operation is binary (meaning it has two incoming
dependencies for its two operands). So we will bound the number of operations exposed for this case to the
following range:
\[ \left\lfloor p \, \frac{n}{m} \right\rfloor \leq exposed_i \leq n - 1 \tag{6.37} \]
The lower bound can be verified by considering: $p > 1$, $n > 1$ (see Case (b) Many-to-one for $n = 1$),
$m > p$. At a minimum we could have $p = m - 1$, and plugging this value into the second term $\lfloor m/p \rfloor$
of Equation (6.36) gives:
\[ exposed_i = \left\lfloor p \, \frac{n}{m} \left\lfloor \frac{m}{m - 1} \right\rfloor \right\rfloor = \left\lfloor p \, \frac{n}{m} \right\rfloor \tag{6.38} \]
The upper bound is found using a method similar to that used in Case (b) Many-to-one by representing
$m$ as shown in Equation (6.33). The difference here is that we are multiplying by $n$:
\[ exposed_i = \left\lfloor p \, \frac{n}{pq + r} \left\lfloor \frac{pq + r}{p} \right\rfloor \right\rfloor = \left\lfloor n \, \frac{pq}{pq + r} \right\rfloor \tag{6.39} \]
Let’s define $x = (pq)/(pq + r)$. We know that $x < 1$ and so $x n < n$. Therefore $\lfloor x n \rfloor \leq n - 1$.
Now that we have verified the bounds of the range from Equation (6.37) we can say that if the
dependencies between levels $L_i$ and $L_{i+1}$ are fully connected then the number of operations exposed would be 0.
This means that the range shown above is also the potential number of operations used from the next level
that should not be.
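The sample exposure calculation of Equation (6.30) and the range in Equation (6.37) can be checked with
integer arithmetic. The sketch below is illustrative only: the `exposed` function name and the ranges swept
in the loop are assumptions, and the nested-floor form follows Equation (6.39).

```python
def exposed(ops_i: int, ops_next: int, p: int) -> int:
    """floor(p * (ops_next / ops_i) * floor(ops_i / p)), computed with integers."""
    return (p * ops_next * (ops_i // p)) // ops_i


print(exposed(1, 8, 4))  # Case (a) one-to-many with p > 1: 0
print(exposed(3, 1, 2))  # Case (b) many-to-one: 0
print(exposed(5, 4, 4))  # Case (c) many-to-many: 3

# Sweep Case (c): m > p with a non-zero remainder, n > 1, p > 1.
for p in range(2, 9):
    for m in range(p + 1, 41):
        if m % p == 0:
            continue  # multiples of p have need = 0 and are ignored
        for n in range(2, 21):
            assert (p * n) // m <= exposed(m, n, p) <= n - 1  # Equation (6.37)
```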
Based on the results above, when scheduling a graph as described in this work with the additional
assumption that the number of exposed operations is at least the number required for the need, the maximum
error introduced into the span calculation at each level is still only 1. In total, the maximum error in the
span calculation is bounded to the range:
\[ 0 \leq error_{span} \leq N_{steal} - 1 \tag{6.40} \]
where $N_{steal}$ is the total number of levels in the graph in which stealing occurs. Just as above, we subtract
one since we cannot introduce any error into the number of epochs required by the first level in which stealing
occurs, only in subsequent levels.
In the next section, we will experimentally evaluate the accuracy of our schedule length estimation
algorithm. However given that the naïve algorithm produces results very similar to those using stealing,
perhaps stealing is not necessary to achieve sufficient accuracy.
Experimental Results
In this section we evaluate our schedule length estimation using operation DFGs of linear algebra kernels, and
randomly generated graphs using the Erdős-Rényi $G(n, p)$ binomial model [46]. We compared the results of
our scheduling estimation algorithm for these graphs to that of the optimal schedule. Computing the optimal
schedules for validation purposes required the computational resources of over 270 desktop computers totaling
more than 1500 processor cores to perform brute force scheduling and calculate the optimal schedules for
each of the graphs. In total, more than 2.7 years of processor time was required. In contrast, calculating
all of the estimated schedule lengths using our scheduling algorithm on the reduced graph representation
only required a few minutes for all the experiments on a single desktop computer, and the response per
experiment was almost immediate, due to its computational complexity on the order of the number of levels
in the graph, $O(|L|)$. More than 500,000 optimal schedules were computed for the random graphs and more
than 1,000,000 optimal schedules for each linear algebra kernel.
When scheduling tree graphs, the optimal approach goes as deep as possible early on, rather than
first executing level-by-level, to expose as much parallelism as possible. The optimal schedules for these
tree graphs will show larger discrepancies compared to our level-by-level scheduling algorithm. For that
reason, we selected three linear algebra kernels for evaluation (dot product, matrix-vector and matrix-matrix
multiplication) since they are all in-trees. We used a brute force scheduling approach to compute the optimal
schedules for the random graphs and Hu’s algorithm [55] for the linear algebra kernels. For each graph we
determined the optimal schedule using numbers of processing elements ranging from 1 to the width of the
graph (maximum usable) and recorded the length of the schedule. Previously, we proved that for a number
of processing elements larger than the width of the graph, our estimation would be equal to the optimal
schedule, and for the sake of clarity we do not present those results. Initially, we evaluated our scheduling
algorithm with the naïve approach where no stealing occurs in order to better analyze the algorithm itself.
Linear Algebra Graphs. The operation DFGs of the dot product kernels have a binary tree structure
where the top level is composed of multiplications and the lower levels are the additions to sum all of the
products. Since the matrix-vector and matrix-matrix multiplications are formed by an array or grid of these
dot products, their graphs are much larger. For that reason, we were able to test larger data sizes for dot
product than for matrix-vector multiplication and even less for matrix-matrix multiplication. Even though
Hu’s algorithm completes in polynomial time and has a computational complexity O(N), it is based on
node-labeling methods and so each node or edge must be labeled first and then revisited again later. If we
ever want to be able to estimate the length of the schedule for any given graph, we will not be able to visit
every node, further motivating the need for a reduced graph representation.
For the dot product kernel we varied the input data sizes from 4 to 1,536, which resulted in graphs
containing from 7 to 3,071 nodes. For the matrix-vector multiplication data sizes for both the vector and
square matrices ranged from 4 to 158 which resulted in graphs containing from 28 to 49,770 nodes. For the
matrix-matrix multiplication data sizes (where both matrices were of the same size and square) ranged from
4 to 34 which resulted in graphs with from 112 to 77,452 nodes. We used Hu’s algorithm [55] to calculate the
optimal schedule lengths, which has been proven to be optimal for in-trees, out-trees, and opposing forests.
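The node counts quoted above follow from the structure of the kernels: a dot product of n elements has n multiplications and n - 1 additions, and the matrix kernels replicate that tree. A small sketch that reproduces the reported graph sizes (the function names are illustrative, not taken from the dissertation):

```python
def dot_product_nodes(n):
    # n multiplications plus n - 1 additions to sum the products
    return 2 * n - 1

def matrix_vector_nodes(n):
    # one dot product per row of an n x n matrix
    return n * dot_product_nodes(n)

def matrix_matrix_nodes(n):
    # one dot product per element of the n x n result
    return n * n * dot_product_nodes(n)

for name, fn, sizes in [
    ("dot product", dot_product_nodes, (4, 1536)),
    ("matrix-vector", matrix_vector_nodes, (4, 158)),
    ("matrix-matrix", matrix_matrix_nodes, (4, 34)),
]:
    print(name, [fn(s) for s in sizes])
```

The cubic growth of the matrix-matrix graph is what limits the largest data size for which optimal schedules could be computed.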
The results of scheduling DFGs for the three kernels that we evaluated are shown such that each data
point is plotted on a 2D grid with a log-log scale. The x-axis represents the data size of the vector or square
matrix evaluated and the y-axis represents the number of processing elements used for a particular schedule.
The color of each data point reflects the value of the ratio estimated-over-optimal schedule length using a
heat-map where values range from 1.00 (dark blue), meaning an exact match, to 1.40 (light blue) signifying
40% larger estimation than optimal. Theoretically, the ratio of estimated-over-optimal will always be less
(a) Estimated-over-optimal results. (b) Range of estimated-over-optimal results.
Figure 6.7: Results for dot product operation DFGs.
than 2.00, but we experimentally found that the worst case was always less than or equal to 1.40 for these
graphs.
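As a reminder of where a bound of this kind can come from, the following is a sketch of one standard argument, not a restatement of the proof given earlier. It assumes that the level-by-level estimate without stealing is the sum over levels of the ceiling of the level width divided by the number of processing elements; the symbols n_l (operations in level l), N (total operations), and P (processing elements) are introduced here for illustration.

```latex
\begin{align*}
\mathrm{EST}
  &= \sum_{l=1}^{|L|} \left\lceil \frac{n_l}{P} \right\rceil
   < \sum_{l=1}^{|L|} \left( \frac{n_l}{P} + 1 \right)
   = \frac{N}{P} + |L| \\
  &\le \mathrm{OPT} + \mathrm{OPT} = 2 \times \mathrm{OPT},
\end{align*}
% since any schedule needs at least N/P steps (total work)
% and at least |L| steps (one per level of the critical path).
```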
For all of the operation graphs the results on the diagonal represent as many processing elements as
needed to execute the entire width of the graph. The schedule lengths for those results always matched
the optimal results. Compared to our scheduling algorithm which processes in a level-by-level fashion, the
optimal scheduling executed operations further down as early as possible to expose more operations later
on in the schedule. This effect is more pronounced in the results under the diagonal, where the number of
processing elements is one or two less than the width of the graph and the worst case results are found.
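The effect just below the diagonal can be reproduced on a small case. The sketch below rests on several assumptions that are not spelled out in the text: the dot product tree is built by pairwise reduction with an odd node carried to the next round, the level-by-level estimate is the sum of ceil(width / PEs) over levels, and Hu's algorithm is implemented as highest-level-first list scheduling of unit-time operations.

```python
from math import ceil

def dot_product_tree(n):
    """Build the in-tree of an n-element dot product.

    Returns (parent, children, level); level 1 holds the multiplications.
    """
    parent, children, level = [], [], []

    def new_node(kids, lvl):
        parent.append(None)
        children.append(kids)
        level.append(lvl)
        node = len(parent) - 1
        for kid in kids:
            parent[kid] = node
        return node

    current = [new_node([], 1) for _ in range(n)]
    lvl = 1
    while len(current) > 1:
        lvl += 1
        nxt = [new_node([current[i], current[i + 1]], lvl)
               for i in range(0, len(current) - 1, 2)]
        if len(current) % 2:
            nxt.append(current[-1])  # odd node waits for the next round
        current = nxt
    return parent, children, level

def level_by_level_length(level, pes):
    """Estimate without stealing: finish each level before starting the next."""
    widths = {}
    for lvl in level:
        widths[lvl] = widths.get(lvl, 0) + 1
    return sum(ceil(w / pes) for w in widths.values())

def hu_length(parent, children, pes):
    """Highest-level-first list scheduling (optimal for unit-time in-trees)."""
    def distance_to_root(v):
        d = 0
        while parent[v] is not None:
            v, d = parent[v], d + 1
        return d

    done, steps = set(), 0
    while len(done) < len(parent):
        ready = [v for v in range(len(parent))
                 if v not in done and all(c in done for c in children[v])]
        ready.sort(key=distance_to_root, reverse=True)
        done.update(ready[:pes])
        steps += 1
    return steps

parent, children, level = dot_product_tree(6)   # graph width is 6
est = level_by_level_length(level, 5)           # one PE fewer than the width
opt = hu_length(parent, children, 5)
print(est, opt, est / opt)
```

For this six-element example with five processing elements the estimate is 5 steps against an optimum of 4, because the optimal schedule starts two additions while the sixth multiplication is still pending. With six processing elements, or with one, the two lengths coincide.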
The estimated-over-optimal results for dot product operation DFGs are shown in Figure 6.7a. The results
ranged from 1.00 (exact match) to 1.33 times the optimal schedule length. On average, the estimation was
5% longer than the optimal schedule. Figure 6.7b shows the minimum, average, and maximum results for
each data size that we evaluated. Since these graphs are trees, and more specifically binary trees, they are
complete binary trees when the data size is a power of two. These complete binary trees have a very regular
structure that makes it easier to more accurately estimate the length of the schedule and so the averages
approach a value of 1.0 for those data sizes (minimums in the average and worst case plots).
The estimated-over-optimal results for matrix-vector and matrix-matrix multiplication DFGs are shown
in Figures 6.8a & 6.9a. The results ranged from 1.00 (exact match) to 1.40 times the optimal schedule length.
On average, the estimation was 5% longer than the optimal schedule for matrix-vector multiplication and
6% for matrix-matrix multiplication. Compared to the dot product results, these results are much more
accurate for lower numbers of processing elements due to the increased amount of parallelism in this kernel.
Figures 6.8b & 6.9b show the range of results for each data size that we evaluated. The same effect of more
(a) Estimated-over-optimal results. (b) Range of estimated-over-optimal results.
Figure 6.8: Results for matrix-vector multiplication DFGs.
(a) Estimated-over-optimal results. (b) Range of estimated-over-optimal results.
Figure 6.9: Results for matrix-matrix multiplication DFGs.
accurate results around power of two data sizes can be seen here as well. The maximum results show an
inverse logarithmic trend which approaches a value of 1.2.
To summarize the linear algebra kernel results, we found that the estimated-over-optimal results were
within the range given by the proof in Section 6.1.2. In fact, on average the estimated length of the schedule
was 5% longer than the optimal schedule and never went beyond 40%. When moving from dot product to
matrix-matrix multiplication graphs, the amount of parallelism increased dramatically. As a result, the size
of the dark blue region with very accurate results increased. Due to the increasing graph sizes for matrix-vector
and matrix-matrix products, and the necessity to use brute force for validation, we were limited to
relatively small data sizes. However, we can see that the trend is to produce more accurate results as the
number of nodes in the graph increases. It is expected that the estimation will behave even better for larger
data sizes. There is an upper limit to the number of processing elements available in multi-core (10’s) and
many-core architectures (1000’s). Hence, for the larger data sizes that compute-intensive applications must
operate on, the accuracy of this scheduling approach is very good since these cases fall into the bottom dark
blue area of the graphs. Considering that Hu’s algorithm effectively steals operations to achieve optimal
results and our scheduling algorithm does not, we conclude that accounting for operation stealing is not
necessary to achieve sufficient accuracy for these cases. When the parallelism of the kernel is high, stealing
has minimal impact on the overall length of the schedule. As stated above, when the number of processing
elements is larger than the width of the graph, our estimation would be equal to the optimal schedule. We
experimentally observed that every data point above the diagonal of Figures 6.7a, 6.8a, and 6.9a would be equal
to 1.00 (dark blue); these points are not presented in the plots. Including those data points in our results would have
skewed the average to a much lower value and, for the sake of fairness, they were left out since it is logical
to expect that applications will have more parallelism than the number of processing elements available. In
any case, if those data points had been included, they would have only improved our overall results.
Randomly Generated Graphs. For the randomly generated graphs, only a brute force approach
would guarantee optimal results – thus limiting the practicality of analyzing very large graphs. We generated
random graphs using the Erdős-Rényi G(n, p) binomial model [46] with 5 to 30 nodes. We used a range
of probabilities for edge creation from 0.1 to 0.9 in steps of 0.1. For each combination of number of nodes and
edge creation probability we generated 500 graphs. This large range provides breadth to the types of graphs
that we analyze in this work, from completely parallel (all nodes in one level) to completely sequential (only
one node in each level).
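A minimal sketch of how such graphs can be generated and reduced to per-level operation counts is shown below. The text does not say how edges of the G(n, p) model are oriented to obtain a DAG; directing every edge from the lower-numbered to the higher-numbered node is an assumption made here, and the function name is illustrative.

```python
import random
from collections import Counter

def random_dag_levels(n, p, seed=None):
    """Sample a G(n, p) graph, orient edges i -> j for i < j (assumption),
    and return the number of nodes in each level of the resulting DAG."""
    rng = random.Random(seed)
    level = [1] * n
    for j in range(n):
        for i in range(j):
            if rng.random() < p:              # edge i -> j exists
                level[j] = max(level[j], level[i] + 1)
    counts = Counter(level)
    return [counts[l] for l in range(1, max(level) + 1)]

print(random_dag_levels(10, 0.0))   # no edges: one level holding every node
print(random_dag_levels(10, 1.0))   # all edges: one node per level
print(random_dag_levels(30, 0.3, seed=7))
```

The two extreme probabilities give the completely parallel and completely sequential graphs mentioned above; the width of a graph is the largest entry of the returned list.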
We organized our results based on the width of the graph and the number of processing elements used
for each data point. Given that there were many data points with the same width and number of processing
elements used, we calculated the average of these. Figure 6.10a shows the average estimated-over-optimal
results versus the width of the graph. Here also, when the number of processing elements is larger than the
width of the graph, the estimation algorithm produces optimal results. Hence, those data points were not
included. Notice that these average results never go beyond 1.2x optimal, and that there is a majority of
data points with an average of 1.00. Since these results are much better than those of the linear algebra
graphs, the upper bound for the heat-map is set to 1.20. Just like for the linear algebra kernels, the estimated
schedule length for any number of processing elements greater than or equal to the width of the graph was
optimal, as well as for the results with a single processing element for obvious reasons. As expected, these
results are better than those presented with the linear algebra kernels, whose scheduling algorithm differed
significantly from ours.
(a) Estimated-over-optimal results. (b) Range of estimated-over-optimal results.
Figure 6.10: Results for random graphs.
Figure 6.10b shows the range of the results. Few cases presented results greater than 1.5x optimal and
the average for all cases is 1.03x. For 90% of all cases the results were less than 10% longer than optimal.
Due to the fact that we used brute force for validation processes, we were limited on the number of nodes
and width of our graphs. This affected the average of the results when the width of the graph was 13. For
this case, we tested only 132 samples, one of them maxing at 1.67x optimal, while the average was 1.08x
optimal. In comparison, when the width of the graph was 11, we tested 2,859 samples with an average of
1.03x optimal, while having a max of 1.67x optimal. It is clear that these maximum values had a much larger
impact on the first case than on the second, and if more graphs had been tested for width 13, the average
would have been very similar to the rest of experiments. We can also expect that this good behavior will
extrapolate to larger random graphs. Only specific and rare combinations of nodes and edges will provide
bad results of the order of 1.67x optimal. In any case, our algorithm stays below the proven limit of 2x
optimal and within a processing time of the order of the number of levels, O(L).
Overall, the range of edge probabilities shows that the necessity for stealing lies in those graphs with very low
connectivity. As the connectivity increases, our estimation algorithm, without stealing, becomes sufficient
to accurately estimate the length of the schedule.
Summary of Schedule Length Estimation on Identical Cores
We have presented a scheduling approach that uses a novel reduced graph representation and a polynomial
time algorithm for estimating the length of the schedule with O(|L|) complexity. We have proven that
this algorithm estimates schedules within 2 × OPT. Given our unique requirements on scheduling, our
algorithm calculates the length of the schedule without specific assignment information. This is the first
effort to find only the schedule length without the detailed schedule. This approach greatly simplifies
the scheduling algorithm and data storage. Using this scheduling approach, any kernel can be analyzed
quickly and efficiently given its operation DFG, to estimate the performance on a particular multiprocessor
architecture.
We have evaluated the accuracy of our scheduling approach for three linear algebra kernels and random
graphs across a range of graph sizes. We found that for the random graphs, on average the estimated schedule
length was 3% longer than the optimal schedule length, and 92.7% of all results were found to be between
optimal and 1.1x optimal. For the linear algebra graphs, the estimated schedule length was on average 5% longer
than the optimal schedule length. For dot product, matrix-vector multiplication, and matrix-matrix multiplication graphs, the
percentages of all results that were found to be between optimal and 1.1x optimal were 87%, 74%, and 74%
respectively. Thanks to the new reduced complexity of our algorithm on the order of the number of levels
of the graph, O(|L|), the response on a desktop computer was almost immediate for all tested graphs.
Our approach can be used to quickly estimate performance without any implementation or detailed design
description. In the future, we will further investigate the benefits to stealing and exposure functions to further
improve accuracy. We will also investigate how we can apply general processor architecture constraints such
as pipelining, caching, and memory to tailor the estimations for higher accuracy on specific multiprocessor
architectures.
6.1.3 Schedule Length Estimation on Different Cores
In the previous section, the general graph scheduling approach was presented for a processor with identical
cores. In this section we expand this approach for a processor with different cores. In this case, only certain
operations can be assigned to each core. In particular this section focuses on pipelines of these cores, where
each core is just a single functional unit (such as an adder, multiplier, etc.). These pipelines of functional
units form custom hardware accelerators designed to speed up kernels when performance is critical for an
application. First we present the improvements to the base scheduling algorithm to handle different types of
operations and pipelined architectures. Then we present an example use case for a simple pipelined design.
Lastly, we compare the estimated results from scheduling against the actual performance of five pipelined
architectures.
Figure 6.11: Example pipelined architecture design showing only the functional units (control logic not shown), broken
into one single-stage pipeline {+} and one three-stage pipeline {×, ÷, -}.
Improved Scheduling Algorithm
Scheduling the operations from the graph onto the functional units in the pipelines results in a detailed
schedule providing all three items of information identified earlier. The length of this schedule is proportional
to the number of clock cycles required for execution. Hardware systems that have pipelined architectures
contain one or more different types of pipelines. In general each pipeline contains one or more stages, where
each stage contains at least one type of functional unit (adder, subtractor, etc). Figure 6.11 shows the
functional units of an example pipelined architecture design. One way to represent this architecture is as a
single-stage pipeline {+} and a three-stage pipeline {×, ÷, -}.
Rather than represent it as a single four-stage pipeline, we choose to represent it as two pipelines to more
succinctly match the DFG in Figure 6.12b. If the parallelism of the design were to be increased, either by
adding another adder (pipeline type 1) or a multiplier, divider and subtracter set (pipeline type 2), the units of a set must
be added together rather than separately, since otherwise there would be a bottleneck on the divider, voiding
the benefit of having a second multiplier. If we represent the whole design as a single four-stage pipeline, it
does not enable future parallelism exploration except by either replicating the entire four-stage pipeline or
changing the pipeline representation. Ideally, we want to group the largest sets of operations into uses that
can be found replicated within the DFG.
(a) Pipelines and stages for an example architecture. Note stage S3 of pipeline P2 is optional.
(b) Example DFG with four levels L1 to L4, 10 operations, 3 uses of P1, and 3 uses of P2.
(c) Optimal schedule with 5 epochs E1 to E5, and span of 5.
(d) Execution schedule with 6 epochs E1 to E6, span of 6.
Figure 6.12: Pipeline representation (a), operation DFG (b), optimal schedule (c) and execution schedule from initial
architecture design (d).
The representation of this architecture contains the quantity and type of functional units at each stage
of the pipelines. In total, set S combines the quantity of each type of functional unit at each stage of
all pipelines, as shown in Figure 6.12a. In the three-stage pipeline, the last stage can be bypassed and is
marked with a dotted outline. We define a use of a pipeline as the set of operations from the graph that
corresponds to the functional units present in the stages of that pipeline. For this pipelined architecture
design, the operation DFG is shown in Figure 6.12b. It also shows the uses marked on the graph given
the two example pipelines from Figure 6.11. There are three uses of the single-stage pipeline (orange filled
boxes) and three uses of the three-stage pipeline (purple outlined boxes). Additionally, only one use of the
three-stage pipeline contains a subtract operation.
In order to link the operations in the DFG to the functional units in the pipelines, grouping operations
into uses, a matching between the two is required. To simplify this matching, one could assume that the
functional units in each pipeline’s first stage are different. If this assumption is relaxed, then simple operation
matching is not capable of determining which pipeline to use and a pattern matching would be required.
However, our methodology is the same regardless of the complexity of this matching. We assume that the
operations in the graph can be evenly grouped into uses with none left over. We define a graph that meets
these assumptions as well formed. The methodology is not restricted or limited to a specific set of operations
or types of functional units. Given the simplicity of this example architecture and operation DFG of the
kernel, the optimal schedule was produced and is shown in Figure 6.12c with 5 epochs denoted E1 to E5.
An epoch is defined as a set of nodes that are executed concurrently during the same period in time (i.e.,
the same clock cycle). Notice that in this schedule each operation is uniquely identified, by its operation type
and node index, and specifies the exact functional unit that the operation is assigned to.
For pipelined architectures, the graph representation stores the number of each type of operation in a
level as shown in Figure 6.13. In the reduced representation for the example graph the operations from
level L2 are stored as a vector containing the values {2, 3, 0, 0} that corresponds to the set of operation
types T = {+, ×, ÷, -}. Using this reduced representation may result in a loss of edge
information. This effect can be seen in the example graph for nodes in levels L2 and L3 as shown with the
dotted arrows in Figure 6.12b.
To illustrate the savings of our reduced representation, consider the graph for matrix-matrix multiplication
of 8192x8192 matrices. This graph contains more than one trillion operations and more than
one trillion edges. Storing this graph using an adjacency list requiring O(|V| + |E|) storage would equate to
almost 8TB of memory. However, the reduced representation of this graph can be stored using only a 13x2
Table 6.3: Common unary and binary operation types for most programming models
Arithmetic        Relational            Logical     Bitwise
addition          equal                 negation    not
subtraction       not equal             and         and
multiplication    greater than          or          or
division          less than                         xor
modulus           greater than equal                left shift
increment         less than equal                   right shift
decrement
array (104 bytes) since there are 13 levels in the graph containing only addition and multiplication operations.
Figure 6.14 shows the memory footprint of this approach when storing all 22 common operation types
available for most programming models, shown in Table 6.3, compared to the smallest conventional graph
representation: incidence list using the GNU STL Vector data structure. We plotted the memory footprint
over a range of numbers of nodes and associated numbers of levels from a single level (where all nodes are
parallel) to the same as the number of nodes (all nodes sequential). Notice that even the largest graph
stored using the reduced representation is still feasible, albeit on the order of terabytes (TB). Compared to
the reduced representation that only stores the number of operations in a level with memory requirements
O(|L|), this pipelined version requires O(|L| × |T|).
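The figures quoted for the 8192x8192 example can be checked with a short calculation. The sketch below assumes 4 bytes per stored vertex, edge, or counter; that width is not stated in the text, but it is the value that reproduces both the 104-byte array and the "almost 8TB" adjacency list.

```python
BYTES_PER_ENTRY = 4  # assumed width of one stored vertex, edge, or counter

def matmul_graph_size(n):
    """Operations and edges in the DFG of an n x n matrix-matrix multiply."""
    mults = n ** 3              # n products for each of the n*n results
    adds = n * n * (n - 1)      # n - 1 additions to sum each result
    edges = 2 * adds            # every addition consumes two operands
    return mults + adds, edges

vertices, edges = matmul_graph_size(8192)
adjacency_bytes = (vertices + edges) * BYTES_PER_ENTRY
reduced_bytes = 13 * 2 * BYTES_PER_ENTRY        # 13 levels x 2 operation types
all_types_bytes = 13 * 22 * BYTES_PER_ENTRY     # same levels, all 22 types

print(f"{vertices:.3e} operations, {edges:.3e} edges")
print(f"adjacency list: {adjacency_bytes / 2**40:.2f} TiB")
print(f"reduced: {reduced_bytes} bytes; with 22 types: {all_types_bytes} bytes")
```

Even when a counter is kept for every one of the 22 operation types in Table 6.3, the reduced representation of this graph stays near one kilobyte.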
Figure 6.13: Comparing graph representations and schedules from standard scheduling to our reduced approach.
(a) Reduced representation with operation types (b) Incidence List (STL Vector)
Figure 6.14: Memory footprints for reduced representation (a) compared to incidence list (b) plotted over a range
for each factor
There are many existing algorithms that are designed specifically for pipelined scheduling [11][14][15][67]
[107][108]; however, many of them are not feasible for the large real-world graph sizes used in processor performance
modeling. These existing algorithms operate on existing graph representations such as adjacency matrix,
adjacency list, incidence list, or a custom graph representation. As such, none of the current state-of-the-art
scheduling algorithms are able to take advantage of our reduced graph representation where the edge
information is stored implicitly. Algorithm 1 presents a simple scheduling algorithm capable of handling
this representation. This algorithm calculates the number of operations that can be executed at each epoch,
producing the reduced schedule. It operates using the set of levels in the graph, L, the set of operation
types, T , and the reduced graph, a 2D-vector containing the number of operations of each type in each level
of the graph, Ops. It also requires the set of stages in the pipelines, S, and the pipeline representation, a
2D-vector containing the number of functional units of each type in each pipeline stage, p.
Algorithm 1 EstimateSchedule(L, T, Ops, S, p)
1: span := 0
2: sch[][] := {{}}
3: for i := 1 to |L| do
4:   uses[] := matching(T, Ops, p, i)
5:   epochs := max{uses[]}
6:   sch[][] := addToSchedule(sch, T, Ops, S, p, epochs, i)
7:   Ops := removeNodes(T, Ops, S, p, epochs, i)
8:   span := span + max{1, epochs}
9: end for
10: return sch, span
This scheduling algorithm employs an iterative procedure where the operations in the DFG are assigned
to the functional units in the pipelines, grouping operations into uses, via a matching function as shown on
line 4. Once the uses are identified then we calculate the number of epochs that will be required to complete
these uses by taking the max of all the uses as shown on line 5. Then, the operations that can be executed
in each pipeline stage are added to the schedule (line 6) and removed from the DFG (line 7). Lastly, we add
the number of epochs needed for this level with the number required for previous levels on line 8. If the
number of epochs required for a particular level is zero, then we add 1 to the span to account for pipeline
latency. This process continues for each level in the graph.
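The control flow described above can be sketched in a few lines. The helper functions matching, addToSchedule, and removeNodes are not specified in the text, so the version below replaces them with simple stand-ins that treat every functional unit as a single-stage pipeline; the data layout (one dictionary of operation counts per level) is likewise an assumption. It illustrates the loop structure only and does not model the overlap of multi-stage pipelines across levels.

```python
from math import ceil

def estimate_schedule(ops, units):
    """ops: list of levels, each {op_type: count};
    units: {op_type: number of single-stage functional units}."""
    span = 0
    sch = []                                   # one dict of issued ops per epoch
    for level in ops:                          # line 3: visit each level once
        remaining = dict(level)
        # stand-in for matching(): epochs each unit type needs for this level
        uses = [ceil(n / units[t]) for t, n in remaining.items() if n > 0]
        epochs = max(uses, default=0)          # line 5
        for _ in range(epochs):                # stand-ins for lines 6 and 7
            issued = {t: min(n, units[t])
                      for t, n in remaining.items() if n > 0}
            sch.append(issued)
            for t, n in issued.items():
                remaining[t] -= n
        if epochs == 0:
            sch.append({})                     # an empty level still costs an epoch
        span += max(1, epochs)                 # line 8
    return sch, span

# Per-level operation counts of the example graph, one unit of each type.
ops = [{"+": 1}, {"+": 2, "*": 3}, {"/": 3}, {"-": 1}]
units = {"+": 1, "*": 1, "/": 1, "-": 1}
sch, span = estimate_schedule(ops, units)
print(span)
for epoch, issued in enumerate(sch, start=1):
    print(f"E{epoch}: {issued}")
```

Because the stand-ins finish each level before starting the next, this sketch reports 8 epochs for these counts. Grouping operations into uses of the three-stage pipeline lets divisions overlap with later multiplications, which is what shortens the schedule in the pipelined case.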
For the example graph, the presented scheduling algorithm produced the reduced schedule shown in
Figure 6.13. In comparing the optimal schedule to the reduced schedule, notice that there are the same
number of epochs (thus same schedule length) and the epochs in which operations are executed match up
with those in the optimal schedule. Using this reduced schedule, the designer can still understand when
the operations should be executed. This algorithm has a polynomial runtime on the order of the number
of levels in the graph and the number of operation types, O(|L| × |T|). This runtime is much less than the
runtime for existing algorithms that are on the order of the number of operations and/or edges in the graph
[14][67][107].
Example Use Case
In this section we introduce an example use case and analyze a toy filtering algorithm to illustrate how
the graph-based processor model can be used to improve an initial pipelined architecture implementation.
Finally improvements will be formulated and the architecture modified to increase performance.
The filtering algorithm shown in Algorithm 2 accepts two coefficients a and b and a sliding window
of six samples, producing two results m and k. Figure 6.15 shows an initial pipelined architecture design
including the control logic. This design executes the filtering operation in 6 clock cycles and the execution
schedule, produced from the execution of the architecture, is shown in Figure 6.12d. Is 6 clock cycles the
best performance this architecture can achieve? We scheduled the algorithm’s DFG, shown in Figure 6.12b,
Algorithm 2 Filter(a, b, n[6])
1: m = [(a + b) × n1 ÷ n5] - [(a + b + n0) ÷ ((a + b) × n2)]
2: k = (a + b) × n3 ÷ (a + b + n4)
3: return m, k
producing the reduced schedule shown in Figure 6.13. The length of the schedule is 5 and the difference
between this and the performance of the design implies that the design can be improved to perform better.
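For reference, the toy filter can be written out directly. The operator placement below is inferred from the description of the DFG (three additions, three multiplications, three divisions, and one subtraction, with a + b + n0 divided by the product (a + b) × n2), so it should be read as a reconstruction under that assumption; the function name is illustrative.

```python
from fractions import Fraction

def filter_window(a, b, n):
    """Evaluate the toy filter on coefficients a, b and a six-sample window n."""
    s = a + b                                            # shared sum (operation #1)
    m = (s * n[1] / n[5]) - ((s + n[0]) / (s * n[2]))    # two pipeline uses, one subtract
    k = (s * n[3]) / (s + n[4])                          # third pipeline use
    return m, k

# Exact arithmetic makes the result easy to check by hand.
m, k = filter_window(Fraction(1), Fraction(1), [2, 3, 4, 5, 6, 7])
print(m, k)   # 5/14 5/4
```

Computing a + b once and reusing it gives ten operations in total, matching the operation count of the DFG in Figure 6.12b.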
To formulate improvements to the design, we can manually compare when operations were executed in
the execution schedule against the reduced schedule to determine when the operations should be executed for
better performance. Initially, we can see that for the first three epochs, one addition operation is executed
in the reduced schedule but not in the execution schedule. In the design shown in Figure 6.15, reg0 holds the
result of the adder and the output of reg0 goes directly into the multiplier. Next, notice that the addition of
a+b (operation #1 in Figure 6.12b) is held in reg0 to be multiplied by n1 (operation #3), n2 (operation #4),
and n3 (operation #5). This delays the addition of (a+ b) + n0 (operation #2) and (a+ b) + n4 (operation
#6) to epochs E4 and E5 respectively. Additionally, according to the reduced schedule the first division can
be executed in epoch E3 but is delayed by one extra cycle by reg2. By addressing these two problems in a
subsequent revision of the design, further performance can be achieved.
Since the output of the adder is being held in reg4 for feedback to add the result of (a+b) to other inputs,
storing the output again in reg0 for multiplication is unnecessary replication. Adding another multiplexor
between the output of reg0 and the input of the multiplier, as shown in Figure 6.16, will allow the design to
operate more effectively. This modification will allow the other additions to occur earlier in the execution,
addressing the first problem. Now that this change has been made the second register reg2 is no longer
needed between the multiplier and divider, and is moved between reg0 and mux4 to delay the result of the
adder in the case of (a + b + n0), which will be divided by the result of the multiplier, (a + b) × n2. This improved
architecture executes the filtering operation in 5 clock cycles, achieving the same schedule shown in Figure
6.12c.
The addition of the graph analysis to the standard design flow for pipelined architectures provided both
an achievable performance goal and an execution plan in the form of the reduced schedule. By comparing
the execution of the architecture design, the execution schedule, against the reduced schedule the designer
is presented with additional indicators that aid in formulating modifications for increased performance. In
Figure 6.15: Initial pipelined architecture design.
Figure 6.16: Improved pipelined architecture design.
the next section, we will analyze five well researched architectures from the previous work and present
modifications that achieve up to 10.7x speedups over the original designs.
Experimental Results
In this section, we present the results of analyzing a series of benchmark kernels on five well researched
pipelined architectures [37][106][119][124] for dot product, matrix-vector and matrix-matrix multiplication,
Cholesky decomposition, and matrix inversion. These were chosen as representative kernels constrained by
different factors, having various levels of control flow complexity. Since these architectures were designed
with scalability in mind, they were evaluated using 1 to 256 pipelines. We varied data sizes from 4x1
to 8192x1 for vectors, and from 4x4 to 8192x8192 for square matrices. We used square matrices without loss
of generality by using a block-based approach for matrix computation.
Using our methodology we compare the performance of the current design to that achieved by scheduling
the DFG to determine how much more performance can be extracted out of the architecture. We refer to
the performance calculated by scheduling the DFG to the functional units as the estimated performance.
Table 6.4 shows the collected results for each of the five architectures evaluated. The first two columns show
the range of difference in performance between scheduling versus the performance of the original design, and
Table 6.4: Summary of Results. The second column (metric vs Original) shows the difference in percentage of
the number of cycles required by the original architecture versus DFG scheduling. The third column (metric vs
Improved) shows the difference in percentage of the number of cycles required by our improved architecture versus
DFG scheduling. The fourth column (Achieved Speedup) shows the summary of the speedups achieved using the
improved compared to the original architectures.
Kernel       Performance difference between metric and    Achieved
             Original [% diff.]    Improved [% diff.]     Speedup
Dot Prod.    0                     0                      1.0x
M-V Mult.    0 to 84               0                      6.4x
M-M Mult.    0 to 91               0                      10.7x
Cholesky     0 to 55               0 to 10                2.2x
Inverse      50 to 97              0 to 93                3.0x
Note: % diff. = (Arch - Metric)/Arch
Figure 6.17: Original dot product design from [124]
scheduling versus the improved designs. The third column shows the best speedup achieved for that kernel
across the range of data sizes evaluated.
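The relation between the percentage difference and the speedup columns can be illustrated with a short calculation. It assumes that the improved design reaches the scheduled cycle count (a difference of 0%), in which case the speedup equals the original cycle count divided by the metric; the cycle counts used are made-up values chosen only for illustration.

```python
def percent_diff(arch_cycles, metric_cycles):
    """% diff. = (Arch - Metric) / Arch, expressed as a fraction."""
    return (arch_cycles - metric_cycles) / arch_cycles

def speedup_if_metric_reached(diff):
    """Speedup of a design that matches the metric, given the original's diff."""
    return 1.0 / (1.0 - diff)

diff = percent_diff(64, 10)          # hypothetical cycle counts
print(f"{diff:.1%} difference -> {speedup_if_metric_reached(diff):.1f}x speedup")
```

Under that assumption a difference of about 84% corresponds to a speedup a little above 6x and a difference of about 91% to a speedup near 11x, which agrees with the matrix multiplication rows of Table 6.4 up to rounding of the percentages.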
Given the simplicity of the dot product kernel, very little control logic was required. The original
architecture for the dot product kernel is shown in Figure 6.17. We found that the performance of the design was
not degraded by the control logic: the estimated performance from scheduling equaled the number of clock
cycles required by the design for every data point. These results validate that our methodology correctly
identifies a good design that already achieves high performance.
Figure 6.18a shows the original matrix-vector multiplication kernel architecture, and Figure 6.19a
shows the original matrix-matrix multiplication kernel architecture. The performance of both matrix-
vector and matrix-matrix multiplication designs matched the estimated performance when the number of
pipelines was less than or equal to the dimension of the matrices. Both of these architectures require that
the elements from each row or column in the matrix iterate through the pipelines multiple times to perform
the computation. However, the designs for these kernels have an upper limit to the number of pipelines that
can be used for a given data size, and when a larger number than needed is available, the extra pipelines
are unused. The scheduling approach however, is not restricted by this particular design constraint. For a
number of pipelines greater than the number of rows or columns in the matrix, a lower number of cycles is
calculated via scheduling.
Figure 6.18c and Figure 6.19c show the potential speedup that can be achieved by making changes to
the architecture. These plots were created by comparing the original performance of the designs against the
scheduling metric. Comparing the estimated performance to the performance of the original design for data
sizes from 4 to 128, there was potential to achieve increased performance ranging from 50% to 91%. No
performance improvement was possible for data sizes from 256 to 8192 given that the maximum number of
pipelines (256) was not greater than the dimension of the matrices. These differences imply that when more
hardware is available, a different calculation approach can be used to achieve higher performance. When
there are more pipelines than the dimension of the data, the additional pipelines can be used in a fashion
(a) Original from [124] (b) Improved
(c) Speedup of the improved design
Figure 6.18: Matrix-vector multiply designs (a-b) and performance results (c). Note that results overlap at 1x for
data sizes from 256 to 8192 for matrix multiply.
similar to the dot product architecture where pipelines cascade their multiplication results to be summed
by other pipelines. Using this approach, the original designs were improved as shown in Figure 6.18b and
Figure 6.19b. Thus, instead of multiplying the vector element against each element in the row of a matrix
sequentially and summing using a multiply-accumulator, the better approach would multiply all elements
in the row of the matrix against the vector element simultaneously. This approach would take advantage
of the additional pipelines available and achieve speedups of up to 6.4x and 10.7x for matrix-vector and
matrix-matrix multiplication respectively.
The original architecture for the Cholesky decomposition kernel is shown in Figure 6.20a. Compared
to the previous kernels, the calculations for Cholesky decomposition and matrix inverse are much more
(a) Original from [106]
(b) Improved (c) Speedup of the improved design
Figure 6.19: Matrix-matrix multiply designs (a-b) and performance results (c). Note that results overlap at 1x for
data sizes from 256 to 8192 for matrix multiply.
(a) Original from [119]
(b) Improved (c) Speedup of the improved design
Figure 6.20: Cholesky decomposition designs (a-b) and performance results (c).
complex. Using our approach, we found very large differences between the estimated performance and the
performance of the original designs. In general, this means that the designs for both kernels are capable of
achieving better performance. The Cholesky decomposition architecture operates on a column-by-column
basis to expose more parallelism than in the traditional row-by-row method [119]. Each processing element
(PE) pipeline executes the calculations for each element in the column of the input matrix. Since the linear
equations contain more variables for each subsequent row, a single divider pipeline is shared among each
of the pipelines. We found that the difference between the estimated performance and actual performance
ranged from 0.04% to 55% for all data sizes and number of pipelines. For the 4x4 data size there was no
improvement achieved by increasing the number of pipelines. However, an improvement can be achieved for
other data sizes of at least 8x8 across all numbers of pipelines. As the data size increases additional pipelines
are able to improve the performance by utilizing more of the available parallelism. Thus the potential for
the best improvement (highest speedup of improved versus original design) lies with larger data sizes and
numbers of pipelines. Based on these results, we investigated the source of the differences and present
improvements for the architecture. The improved architecture is shown in Figure 6.20b and the speedup
over the original design is shown in Figure 6.20c.
First, we observed that the graph scheduling did not require the first use of a PE pipeline to execute any
add operations, thus bypassing the adder. In comparison, the architecture control logic forced data to pass
through the adder (effectively adding zero to the result from the multiplier) for every use. This one cycle can
be avoided for each column of the matrix and the savings increase as the matrix size increases. Similarly,
the adder in the divider pipeline can also be bypassed for the first use in each column. Our improved
design added multiplexers to allow the adders in both the multiplier and divider pipelines to be bypassed
as needed. Second, we also found that the control logic in the architecture design restricts each column to
be executed separately from other columns. This approach is based on the idea that as long as there is at
least one pipeline for each row, the best performance is achieved. However, when there is a smaller number
of pipelines than rows, multiple cycles are required to complete all of the rows for each column resulting
in reduced performance. In addition, performance degrades further when the division of the rows over the
pipelines is not even, leaving some pipelines unused until work on the next column begins. This effect is
compounded for larger data sizes where each of the many columns has unused pipelines.
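The cost of an uneven division can be illustrated with a small counting sketch. It assumes a deliberately simplified model rather than the cycle-accurate behavior of the design: column k of an n x n matrix has n - k rows to process, each pipeline handles one row per pass, and columns are never overlapped. The function name and the model itself are illustrative assumptions.

```python
from math import ceil

def column_passes(n, pipelines):
    """Count passes and idle pipeline slots when columns are not overlapped.

    Simplified model (an assumption for illustration only): column k of an
    n x n matrix has n - k rows, and each pipeline processes one row per pass.
    """
    passes = 0
    idle_slots = 0
    for k in range(n):
        rows = n - k
        p = ceil(rows / pipelines)          # passes needed for this column
        passes += p
        idle_slots += p * pipelines - rows  # pipeline slots left unused
    return passes, idle_slots
```

Under this model, an 8x8 matrix on 3 pipelines needs 15 passes and leaves 9 of the 45 pipeline slots idle, which is the kind of waste that overlapping columns is meant to recover.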
To eliminate the performance degradation caused by the lack of overlap between columns, only the
control logic needs to be modified. The compute logic and structure of the pipelines can remain unchanged.
Finally, we improved the original architecture to allow bypassing and added the control logic needed
to overlap the execution of multiple columns. After making the improvements to address these problems, we
analyzed the improved Cholesky decomposition design and found that the performance was now within 10%
of the calculated achievable performance. These results show that there is still up to 10% more performance
that can be achieved by further improving the design. Our improvements achieved from 1.0x (no improvement)
to 2.2x speedup over the original design. The data sizes that gained the most performance ranged from 64x64
to 512x512. For the original design, the performance at the larger data sizes was very close to the calculated
achievable performance, thus the improved design was not able to achieve the same level of improvement as
with the smaller data sizes.
The remaining 10% difference in performance can be achieved by improving the single divider that is
shared among the pipelines. In some cases, the single divider restricts the number of pipelines that can be
used for a particular column. Adding a second divider will remove this restriction for the current data sizes.
However, as even larger data sizes start to be used more frequently, the ratio of pipelines to dividers will
need to be reevaluated.
The matrix inverse architecture was designed from an original systolic array design [37]. By collapsing
the triangular systolic array into a linear array, their design achieves improved performance thanks to a
larger array size. Each processing element (PE) of the array can either perform a single division operation
or a multiply and subtract sequence of operations as shown in Figure 6.21a. By evaluating the architecture
using our methodology, we found that the difference between the metric and the actual performance of the
design ranged from 50% to 97% for all data sizes and number of pipelines. In comparing the execution
of the architecture design against that of the scheduling approach, we found that the main bottleneck was
distributing the divider result to the other elements. The division is normally calculated in the left-most
(a) Original from [37]
(b) Improved (c) Speedup of the improved design
Figure 6.21: Matrix inversion designs (a-b) and performance results (c).
element, so multiplexers can be added to allow the result of the division to be distributed in a single clock
cycle as shown in Figure 6.21b.
After making these improvements, we compared the new inverse design to the estimated performance
and found that the differences ranged from 0% to 93% with the majority of the results within 31% of the
calculated achievable performance. Note that for a single pipeline, the improved design exactly matches the
performance calculated by scheduling, thus the difference was zero, and since the y-axis uses a logarithmic
scale, these points are not shown. As the number of pipelines increases for each data size, the percent
difference approaches 93% with a logarithmic trend. These results show that the architecture improvements
achieve performance much closer to the highest possible performance than the original design. The larger
data sizes were able to best take advantage of the new improvements. The range of results for the 8192x8192
data size went from 67% to 83% for the original architecture, down to 0% to 12% using the improved design.
For data sizes from 8x8 to 256x256 the majority of the improvement was achieved using 1 to 16 pipelines.
The improved design was not able to achieve the same level of increased performance for smaller data sizes
using a larger number of pipelines. For these smaller sizes improvements ranged from only 4% to 20%. Our
improvements achieved speedups of 1.6x to 3.0x over the original design as shown in
Figure 6.21c. The larger data sizes gained the most performance for all numbers of pipelines.
These performance results showed that for some cases, up to 93% more performance can still be achieved
by making improvements to the architecture. Similar to the Cholesky decomposition design, the original
architecture for matrix inverse was designed to operate on a single row of the input matrix at a time.
This approach simplified the control logic of the design, yet restricted the amount of overlap that could be
achieved between rows. Enabling this overlap will allow future designs to achieve the next level of increased
performance. Further, each processing element (PE) in the linear array contains a divider unit that requires
a significant amount of resources to implement. Since the majority of the elements will be performing the
multiply and subtract sequence rather than division, further performance improvements can be achieved by
removing the divider from most of them in exchange for a larger number of elements.
Summary of Schedule Length Estimation on Different Cores
The high-level graph-based approach for estimating the performance of pipelined architectures correlates
to the functional units in the pipelines via scheduling. However, using our graph-based processor modeling
approach for real architectures did not present a realizable benefit since the estimated performance was better
than that of the architecture most of the time. Although our processor modeling approach was beneficial
to improve the performance of the existing architectures, it assumed the best case behavior of the design.
Thus we conclude that rather than using this processor modeling approach to just estimate performance, it
is more beneficial to think of it as a performance goal for the designer to work towards.
Given our unique requirements on scheduling, we have presented a reduced graph representation that
allows large graphs to be stored using considerably less memory. We presented a simple polynomial-time
scheduling algorithm that operates on this reduced representation. To the best knowledge of the authors,
this is the first effort to schedule a graph and produce a subset of information about the detailed schedule.
This approach greatly simplifies the scheduling algorithm.
We evaluated five benchmark kernels using our methodology and presented improvements to existing well
researched architectures from the literature. Our results showed speedups of up to 10.7x were achieved over
the original designs. Our method can be used to quickly estimate performance without any implementation
or detailed design description. The same method can also be used with different design goals to improve
the power consumption or other factors. This manual process could also be automated and integrated with
existing synthesis or simulation tools.
6.2 System Modeling Methodology
With processor models to estimate the performance of kernels in the application, system simulation only
requires scheduling the kernels to processors and accounting for data transfer to estimate overall application
performance. For scheduling kernels to processors at the system level, we investigate the impact of scheduling
decisions on a system to improve performance and efficiency. We use a unique task granularity, linear algebra
kernels, and will study scheduling policies for CPU+GPU+FPGA systems. To the best of our knowledge, this
will be the first effort to study scheduling in a CPU+GPU+FPGA system. We explore scheduling kernels
whose execution time on one processor compared to another differ by orders of magnitude. We distinguish
our work from those previous [12][71][112] by using realistic execution times for each kernel that are orders
of magnitude different. These large differences in execution time will put more emphasis on making the right
kernel-to-processor assignments rather than keeping each processor busy to achieve high utilization.
In this section, we present the background and discuss the notation for scheduling kernels in a hetero-
geneous processor system. We use state of the art scheduling policies that have been presented specifically
as solutions for general heterogeneous computing and apply them to the most diverse system possible: one
containing CPUs, GPUs, and FPGAs. Then we experimentally evaluate these policies for real world applications
and present our results. We analyze the contribution each algorithm makes to the overall performance
of the application.
6.2.1 Background
The problem of scheduling kernels from an application in a heterogeneous system can be represented as
$(R \mid prec \mid C_{max})$ in standard scheduling notation. In this problem we are given processors $p_j \in P$ for
$1 \le j \le n_p$, where $n_p$ is the number of processors in the system, and a dataflow graph $G = (V, E)$ where $V$
is the set of kernels and $E$ is the set of dependencies between kernels. Each kernel $v_i \in V$ has an execution
time $t_{ij} \in T$ for processor $j$. The data transfer cost for kernel $v_i$ is $d_{jk} \in D$ when $v_i$'s predecessor is assigned
to processor $p_j$ and $v_i$ is assigned to $p_k$.
Mathematically we can represent a scheduling algorithm as a function $f$ that maps kernels from $V$ to
processors in $P$ as $f : V \to P$ such that each kernel is assigned to exactly one processor. Although we
seek to find the best schedule by minimizing the maximum completion time of any kernel in the application,
there are currently no feasible polynomial time algorithms. There is not much previous work focused on this
problem in particular since the two relaxed simplifications still have no known polynomial time solutions. As
such, this work presents an approach that models the performance of heuristic solutions to this scheduling
problem. These scheduling theory problems are usually studied statically, having access to the entire kernel
dataflow graph (DFG). However, in the real world this may not be feasible and so dynamic scheduling
approaches are also used.
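The notation above can be written down concretely. The following is a minimal sketch, assuming plain Python containers for the sets P, V, E, T, and D and a dictionary for the mapping f; the class name, field names, and helper function are hypothetical and are not part of the framework described in this work.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    processors: list   # P, e.g. ["cpu", "gpu", "fpga"]
    kernels: list      # V
    deps: list         # E, as (predecessor, successor) pairs
    exec_time: dict    # T: (kernel, processor) -> execution time t_ij
    transfer: dict     # D: (source processor, destination processor) -> d_jk

def is_valid_assignment(problem, f):
    """True if f maps every kernel in V to a processor in P.

    A dict holds exactly one value per key, so 'exactly one processor
    per kernel' reduces to covering all kernels with known processors.
    """
    return (set(f) == set(problem.kernels)
            and all(p in problem.processors for p in f.values()))
```

A scheduling policy, static or dynamic, can then be viewed as any procedure that produces such a dictionary f together with an execution order.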
Static Scheduling policies have access to the entire DFG of the application prior to execution. This
category of scheduling policies determines a fixed schedule that is later followed during execution. An early
work by Herrmann et al. [50] investigated scheduling with a special chain dependency structure. Liu et al.
[70] presented a priority rule-based algorithm with arbitrary dependencies.
Dynamic Scheduling policies, compared to those mentioned above, do not have access to the entire DFG
and so must make the best of the current state of the system and the kernels that have been submitted.
Wu et al. [117] presented the Adaptive Greedy policy to minimize waiting time for each kernel and the
Adaptive Random policy which uses random weights and probabilities to assign kernels. They investigated
these algorithms on a system with multiple CPUs and GPUs. However, when kernels are custom written and
cannot be broken down into a combination of standard library routines then kernels will also become resource
constrained and a different type of scheduling approach is required [44]. In our work, focus is placed on
systems in which any processor can handle any kernel.
Since finding the optimal schedule is not feasible in real applications, heuristics are used to attempt
to find a solution with acceptable schedule length. In this work we analyze various dynamic scheduling
algorithms including [70][117] to assign kernels to processors. To evaluate how a scheduling policy performs
for a particular application in a system, the only items required are the kernel DFG, the performance of each
kernel, the size of the data, and the data transfer rates between processors. The kernel DFG represents the
number and type of kernels that need to be executed in the application. Given the different kernels from the
graph, processor models can be used to estimate the performance of each kernel. However, the scheduling
policy will also take data transfer rates into account when assigning a kernel. This can be estimated using
the data sizes and data transfer rates between the processors. Then, using the kernel DFG, data transfer
and performance estimations, the scheduling policy can be evaluated to estimate the overall performance of
an application and find opportunities for improvement: either in choosing the best scheduling policy, or the
configuration of processors in the system.
To estimate the performance of an application scheduled in a system, we define a model of the system.
The difference between the start and stop time of a kernel is known as the execution time and is denoted
as $t_{ij}$ for kernel $i$ executed in processor $j$. The start time of a kernel $v_i$ is denoted as $s_i$ and the completion
time is denoted as $c_i$. For the first kernel $v_a$ assigned to processor $p_j$ at the beginning of the program, we
set the start time at zero and thus $s_a = 0$. Then, the completion time can be calculated as $c_a = s_a + t_{aj}$.
Given a kernel $v_b$ with a dependency on kernel $v_a$ that is then assigned to another processor $p_k$, the start
time is calculated as $s_b = c_a + \lambda_b + d_{jk}$, where $\lambda_b$ is some additional delay time for kernel $v_b$ caused by
scheduling delay and $d_{jk}$ is the delay due to data transfer. This delay time $\lambda$ could be caused by various
factors, two of which are the scheduling delay to process which task should be assigned to which processor
next, or communication delay from the scheduler to the processor to tell it to begin processing and provide
the necessary information. However, in this work we only consider the source of this delay to be from the
Figure 6.22: Diagram of the heterogeneous system model showing the various inputs and performance results as the
output from the system model.
communication required to initiate computation. We assume the scheduler determines what to schedule next
instantaneously or statically prior to the start of execution. These scheduling delays will vary depending on
the scheduling algorithm chosen. We also assume that data transfer $d$ includes only sending the input data
to the processor that will execute some kernel on said data. We include the time that the scheduler takes
to transmit the required information ahead of time to each processor to coordinate this transfer in the $\lambda$
delay. Hence, the only factors that contribute to the total run time of an application are individual kernel
execution time $t$, data transfer cost $d$, and scheduling delay $\lambda$.
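The timing model above (start time equals predecessor completion plus scheduling delay plus data transfer, and completion equals start plus execution time) can be sketched in a few lines. The sketch makes three assumptions that go beyond the text: kernels are visited in a dependency-respecting order, a kernel with several predecessors waits for the latest of them, and contention for a processor is ignored. All names are hypothetical.

```python
def completion_times(order, preds, assign, t, d, sched_delay):
    """Start and completion times for kernels under the additive delay model.

    order       -- kernels in a dependency-respecting order (assumption)
    preds       -- kernel -> list of predecessor kernels
    assign      -- kernel -> processor it runs on
    t           -- (kernel, processor) -> execution time
    d           -- (source processor, destination processor) -> transfer time
    sched_delay -- kernel -> scheduling delay for that kernel
    """
    s, c = {}, {}
    for v in order:
        k = assign[v]
        ready = 0.0  # a kernel with no predecessors starts at time zero
        for a in preds.get(v, []):
            j = assign[a]
            # Missing (j, k) entries, e.g. same processor, cost nothing.
            arrival = c[a] + sched_delay.get(v, 0.0) + d.get((j, k), 0.0)
            ready = max(ready, arrival)
        s[v] = ready
        c[v] = s[v] + t[(v, k)]
    return s, c
```

For example, with a first kernel taking 2.0 on the CPU, a dependent kernel taking 3.0 on the GPU, a scheduling delay of 0.5, and a transfer time of 1.0, the second kernel starts at 3.5 and completes at 6.5.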
The set of execution times T is sourced from processor models. One model for each processor type is
used to estimate the time for each unique combination of linear algebra kernel type and matrix size, vi,
from the set of all kernels V in the application. Since this model is not restricted to a particular set of
processor types, the processor models are external to the system model and considered as an input as shown
in Figure 6.22. The interconnection network may be any configuration of communication interfaces with one
or multiple processors sharing a single connection. As such, the set of data transfer times D is also variable
and is an input to the system model via the system configuration. The system configuration specifies which
type of processors (i.e., CPU, GPU, or FPGA) are contained in the system, the number of each type, and the
data transfer bandwidth from each processor to all others including the central scheduler. The scheduling
delay $\lambda_j \in \Lambda$ is a result of applying the scheduling policy. As mentioned previously, the scheduling policy is
represented as a function $f$, which produces the scheduling delays such that $\Lambda = f(P, T, D, G)$.
6.2.2 Methodology
Modeling system performance has been researched by many, including Foster [41] and Puigjaner [85] among
others. The goal in scheduling is to assign kernels to processors in such a way as to achieve the lowest
overall execution time for the application. We establish that overall execution time is composed of three
parts: kernel compute time, data transfer time, and scheduling delay. Since the first two are straightforward, we
only discuss the scheduling delay. This delay, defined as $\lambda$, could be caused by various factors such as the
time the scheduler takes to decide which task should be assigned to which processor next, communication delay
from the scheduler to the processor to tell it to begin processing and provide the necessary information, or
dependencies on kernels that are being executed in another processor but have not yet completed.
This means that the order in which tasks are assigned impacts the amount of scheduling delay. We compare
the overall impact of this delay on the total execution time for each scheduling policy analyzed. This factor,
$\lambda$, can be adjusted and tailored to model the performance of particular systems.
Summary of selected Scheduling Policies
In this work we analyze two static and four dynamic scheduling policies to assign kernels to processors. All
of these policies assign kernels from a set of independent kernels to a set of available processors. The set
of independent kernels, I, is a subset of V in which each kernel has not yet begun executing and whose
dependencies have already been completed. The set of available processors, A, is a subset of P containing
only processors with no currently executing kernels or data transfers.
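The two sets can be computed directly from the scheduler state. The following is a minimal sketch, assuming the state is tracked as plain Python sets of started kernels, completed kernels, and busy processors; the function names are hypothetical.

```python
def independent_kernels(kernels, preds, started, completed):
    """I: kernels that have not begun executing and whose dependencies are done."""
    return [v for v in kernels
            if v not in started
            and all(a in completed for a in preds.get(v, []))]

def available_processors(processors, busy):
    """A: processors with no currently executing kernel or data transfer."""
    return [p for p in processors if p not in busy]
```

Each policy below can be read as a rule for picking one kernel from I and one processor from A (or for deciding to wait).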
The shortest process next (SPN) policy chooses a kernel from I that has the minimum execution time on
any of the processors from A. Whenever processors are available and there are kernels in I, assignments are
made to keep the system busy. This policy attempts to minimize the $\lambda$ delays by keeping every processor
busy, but it does not integrate the information about the difference in execution time among the processors
into its decision making.
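The SPN selection rule can be sketched as a search over all kernel and available-processor pairs. This is an illustrative reading of the description above, not the original implementation; the function name and the dictionary t of execution times are assumptions.

```python
def spn_pick(I, A, t):
    """Return the (kernel, processor) pair with the smallest execution time.

    I -- independent kernels, A -- available processors,
    t -- (kernel, processor) -> execution time.
    Returns None when nothing can be assigned.
    """
    if not I or not A:
        return None
    return min(((v, p) for v in I for p in A), key=lambda vp: t[vp])
```

Note that the minimum is taken only over processors in A, so a kernel may be placed on a processor that is far from its best one.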
The minimum execution time (MET) policy presented by Braun et al. [20] chooses kernels in arbitrary
order from I and assigns them to the processor with the lowest execution time. Unlike SPN, if the kernel’s
best processor is not currently available it is not assigned to another processor. Instead, the kernel will be
assigned to that best processor at a later time. A processor will sit idle if there are no kernels in I that are
suitable for it. This policy always waits to assign kernels to their best processor. Due to the large differences
in execution times, this will result in lower $\lambda$ delays.
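The MET rule differs from SPN in that the best processor is determined over all processors, not only the available ones. A minimal sketch under that reading follows; the function name and arguments are hypothetical.

```python
def met_pick(I, P, A, t):
    """Assign the first kernel in I whose best processor is currently free.

    The best processor is the one with the lowest execution time over all
    of P. If it is busy, the kernel waits instead of going elsewhere.
    """
    for v in I:
        best = min(P, key=lambda p: t[(v, p)])
        if best in A:
            return v, best
    return None  # every kernel's best processor is busy, so wait
```

Returning None here models a processor sitting idle because no kernel in I is suited to it.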
The serial scheduling (SS) policy presented by Liu et al. [70] follows a more statistical approach. For
each kernel in I, the mean and standard deviation of the compute times are calculated for each kernel-to-
available-processor mapping. Then it chooses the kernel from I with the highest standard deviation and
assigns it to the processor from A in which the kernel has the lowest execution time. Assignments are made
as long as there are kernels in the set and available processors. This policy does not include information
about the difference in execution time among the processors in its calculations; but instead makes a first step
towards this by calculating the standard deviation in execution time among the processors and assigning
kernels to the processor with the least execution time. However, just like SPN, it still assigns kernels to processors
that are not the best when the best processor is busy.
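The SS selection step can be sketched as follows. The description does not state which form of standard deviation is used, so this sketch assumes the population standard deviation, which is also defined when only one processor is available; the function name is hypothetical.

```python
from statistics import pstdev

def ss_pick(I, A, t):
    """Pick the kernel whose execution times vary most across A,
    then place it on the available processor where it runs fastest.

    Assumption: population standard deviation (pstdev), so a single
    available processor gives a deviation of zero rather than an error.
    """
    if not I or not A:
        return None
    v = max(I, key=lambda k: pstdev([t[(k, p)] for p in A]))
    return v, min(A, key=lambda p: t[(v, p)])
```

A kernel with a high deviation is one for which a poor placement is costly, which is why it is assigned first.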
The adaptive greedy (AG) policy presented by Wu et al. [117] maintains queues for each processor and
attempts to make assignments to minimize data transfer and queuing delay. The policy calculates wait time
based on the addition of the queuing delay for each processor and the associated data transfer time and
chooses the one with the lowest total time. The queuing delay is calculated as the sum of the compute times
for all kernels already in the queue. This policy does take the differences in execution time between the
various processors into account, and indirectly ends up making decisions to wait for the best processor.
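The wait-time calculation described for AG can be sketched directly from the text: the queuing delay of a processor is the sum of the compute times of the kernels already queued on it, and the data transfer time is added on top. The sketch assumes the input data currently resides on a single source processor src; that parameter and the function name are illustrative assumptions.

```python
def ag_pick(src, P, queues, t, d):
    """Pick the processor with the smallest queuing delay plus transfer time.

    src    -- processor currently holding the input data (assumption)
    queues -- processor -> list of kernels already queued on it
    t      -- (kernel, processor) -> execution time
    d      -- (source, destination) -> transfer time; missing pairs cost 0
    """
    def wait(p):
        queued = sum(t[(k, p)] for k in queues[p])
        return queued + d.get((src, p), 0.0)
    return min(P, key=wait)
```

Because the queue sums are built from per-processor execution times, a processor that is slow for its queued kernels accumulates a long wait and is avoided.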
The heterogeneous earliest finish time (HEFT) policy presented by Topcuoglu et al. [112] first statically
ranks all kernels and then assigns them to processors in order of highest rank first in I. Then assignments are
made to the processor from A with the least sum of the time remaining for any previous kernel and the execution
time of the current kernel on that processor. This policy was specifically designed to minimize the $\lambda$ delays
in the rank calculations by evaluating dependencies in the DFG.
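The static ranking step can be sketched with the upward rank commonly used for HEFT: the mean execution time of a kernel plus the largest value, over its successors, of the mean transfer cost plus the successor's rank. The sketch assumes a single average transfer cost, mean_d, in place of the per-pair costs in D; the function and parameter names are illustrative.

```python
from functools import lru_cache
from statistics import mean

def upward_ranks(kernels, succs, P, t, mean_d):
    """Upward rank of every kernel in the DFG.

    succs  -- kernel -> list of successor kernels
    t      -- (kernel, processor) -> execution time
    mean_d -- one average transfer cost for every edge (simplifying assumption)
    """
    @lru_cache(maxsize=None)
    def rank(v):
        w = mean(t[(v, p)] for p in P)  # mean execution time over processors
        tail = max((mean_d + rank(s) for s in succs.get(v, [])), default=0.0)
        return w + tail
    return {v: rank(v) for v in kernels}
```

Sorting kernels by decreasing rank gives an order that respects dependencies, since a kernel always ranks higher than each of its successors when execution times are positive.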
The predict earliest finish time (PEFT) policy presented by Arabnejad et al. [6] follows a similar process
to HEFT except that the ranks are instead based on a pre-computed cost table that enables a forecasting
ability. Then assignments are made to the processor from A with the least sum of the value from the cost table
and the execution time of the kernel on that processor. Similar to HEFT, this policy also specifically addresses
$\lambda$ delays in the rank calculations by evaluating dependencies in the DFG.
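The pre-computed cost table can be sketched using the optimistic cost table usually associated with PEFT: for a kernel and a candidate processor, it takes the maximum over successors of the cheapest way to finish that successor, adding a transfer cost only when the successor runs on a different processor. As with the previous sketch, a single average transfer cost mean_d is an assumption, and all names are hypothetical.

```python
def optimistic_cost_table(reverse_topo, succs, P, t, mean_d):
    """Optimistic cost table: (kernel, processor) -> lower bound on remaining cost.

    reverse_topo -- kernels ordered so successors come before predecessors
    mean_d       -- one average transfer cost for every edge (assumption)
    """
    oct_ = {}
    for v in reverse_topo:
        for pk in P:
            oct_[(v, pk)] = max(
                (min(oct_[(s, pw)] + t[(s, pw)] + (0.0 if pw == pk else mean_d)
                     for pw in P)
                 for s in succs.get(v, [])),
                default=0.0,  # exit kernels have nothing left to run
            )
    return oct_
```

During assignment, the table value for a kernel and processor is added to the execution time of the kernel on that processor, which is what gives the policy its look-ahead behavior.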
Sample System and Application for Evaluating Scheduling Policies
In this section, a medical imaging application is used to evaluate multiple static and dynamic scheduling
policies in a distributed CPU+GPU+FPGA system. We define the set of processors $P$ as $\{p_{cpu}, p_{gpu}, p_{fpga}\}$
for our experiments. We will also assume that each processor has a full duplex communication link to every
other processor as well as the scheduler. This assumption simplifies the network complexity to allow us
to focus on the performance of the scheduling policy. We leave the extension to more complex network
topologies as future work. Our goal is to simulate a real world system composed of commercial-off-the-shelf
(COTS) processors with specifications shown in Table 6.5. As such, each communication link is based on PCI
Express (PCIe). This PCIe interface has the ability to perform direct memory access (DMA) transactions,
relieving the processor from the requirement of constantly monitoring the transfer progress. Instead, a single
instruction can initiate the transfer and the processor can return to other tasks. We assume each processor
uses DMA and can initiate a transfer with a single instruction. Moreover, since each processor executes a
very large number of instructions, this extra instruction does not impact the overall execution time. Thus, we
assume that data transfers do not impact the processor sending the data and only delay the computation in
the receiving processor. For simplicity, we assume that the entire data must be transferred before processing
can begin. The model can be extended to allow for a data streaming functionality.
Table 6.5: Processor specifications
Processor   Specifications
CPU         Intel Core i7 2600 3.4GHz; 16GB DDR3 @ 1.333Gbps
GPU         Nvidia Tesla K20 706MHz; 5GB GDDR5 @ 5.2GHz
FPGA        Xilinx Virtex 7 VX485T, VC707; 1GB DDR3 @ 1600Mbps
Figure 6.23: Hardware system level diagram showing the
number of PCIe 2.0 lanes between the CPU and the GPU
and FPGA platforms. The dotted lines show the effective
bandwidth between the various platforms.
The system communication diagram is shown in Figure 6.23 with the number of PCIe lanes marked for
each communication link. Each of the interfaces is PCIe version 2.0 and thus every lane has a bandwidth
of 500 MBps in each direction. The data throughput for CPU to GPU transfers is 8GBps, and 2GBps for
CPU to FPGA and GPU to FPGA transfers. The QuickPath Interconnect (QPI) interface between the CPU
and its chipset is large enough to support the GPU and FPGA bandwidths. We clearly define the CPU and
its chipset as separate chips since the PCIe root complex within the chipset can act as a switch and route
transactions between the various devices without interaction from the CPU [5][17][111]. We assume that this
functionality is enabled within this system and that the data transfer between the GPU and FPGA happens
at the FPGA’s bandwidth and without any CPU interaction.
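The transfer times implied by these bandwidths can be estimated with a short calculation. The sketch below assumes lane counts of 16 and 4, inferred from the stated 8 GBps and 2 GBps at 500 MBps per lane, as well as 8-byte double precision elements, 1 MB = 10^6 bytes, and no protocol overhead. The function names are hypothetical.

```python
LANE_MBPS = 500  # PCIe 2.0 per-lane bandwidth in each direction, as stated above

def link_bandwidth_mbps(lanes):
    """Aggregate link bandwidth in MBps for a given number of lanes."""
    return lanes * LANE_MBPS

def transfer_seconds(n, lanes, bytes_per_element=8):
    """Time to move one n x n matrix over a link.

    Assumptions: 8-byte elements by default, 1 MB = 10**6 bytes,
    and no protocol or DMA setup overhead.
    """
    size_bytes = n * n * bytes_per_element
    return size_bytes / (link_bandwidth_mbps(lanes) * 1e6)
```

Under these assumptions, one 836 x 836 double precision matrix occupies 5,591,168 bytes and takes about 2.8 ms over a 2 GBps link and about 0.7 ms over the 8 GBps link.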
6.2.3 Medical Imaging Application
The identification and characterization of scar tissue within the layers of the heart wall are difficult with
the currently available technologies and procedures. The previous method that produced reliable results
required invasive and dangerous surgical operations. A noninvasive approach has been developed by Wang
et al. [116] using the electrical signals available on the surface of the skin. Using subject-specific anatomical
models, complex and physiologically meaningful data about the electrical activity on the heart muscle can be
produced by solving an inverse propagation problem. The nature of these algorithms still incurs tremendous
computational cost that hinders their clinical use. Previous attempts to improve the performance of the
Noninvasive Transmural Electrophysiological Imaging (NTEPI) application, such as the work by Corraine
et al. [31] using GPUs, resulted in a significant speedup but not enough to allow for real time patient
monitoring.
Within this algorithm, a pattern of linear algebra kernels is repeated thousands of times to filter out the
noise that exists in the electrical measurements. This pattern contains 14 matrix-matrix multiplications, a
matrix inversion, and a Cholesky decomposition with dependencies as shown in Figure 6.24. The size of the
matrix that each kernel operates on relates to the number of spatial data points used in the representation of
the heart. The sample use case that we originally evaluated operated on a data size of 836, although
acceptable sizes range from 500 to 8000 spatial data points. We anticipate that in the future larger data
sizes will be required for more accurate results and thus every performance improvement is needed to make
the implementation feasible for real time patient monitoring in the clinical environment.
6.2.4 Scheduling Experiments and Results
First, we evaluate the performance of the individual kernels on different processors. Then, we evaluate the
ability of six scheduling policies to achieve overall high performance of the application as a whole. Finally,
we discuss the performance of each policy by evaluating the delay due to scheduling.
Kernel Performance
The three types of linear algebra kernels that make up an iteration of the NTEPI algorithm have been
studied extensively. In our previous work [100], we found that for linear algebra kernels on CPU, GPU, and
FPGA platforms with data sizes from 5 to 8000, the best platform is between 2x and 5000x
faster than the second-best platform, and between 140x and 2,500,000x
faster than the worst platform. These large differences are shown in Figure 6.25 for the kernels and data
Figure 6.24: DFG of each iteration of the NTEPI algorithm showing dependencies within a single iteration, and
between iterations. The graph contains matrix-matrix multiplications (MM), Cholesky decompositions (Chol), and
matrix inversions (Inv).
Figure 6.25: Difference in execution time between the best and second best platforms for matrix-matrix multiplication,
Cholesky decomposition, and matrix inverse across a range of data sizes using double precision floating point.
sizes used in the NTEPI application. The best performing platform for each kernel and data size is shown in
Figure 6.26. Due to accuracy concerns in the clinical environment, double precision floating point is required
in order to produce correct results.
In this work, we schedule these kernels, not over simple symmetric cores, but over powerful heterogeneous
processors. As such we distinguish our work from those previous [12][71][112] by using realistic execution
times for each kernel that are orders of magnitude different. These large differences in execution time will put
more emphasis on making the right kernel-to-processor assignments rather than keeping each processor busy
to achieve high utilization. Linear algebra kernels are typical of many other compute-intensive applications
that may also be analyzed using the method presented in this work. We analyze the performance of scheduling
policies as realistically as possible using processor models from previous work [66][97][101] to estimate the
execution time of each kernel.
Application Performance
We modeled a three processor CPU+GPU+FPGA system using the approaches shown in previous work
[41][85]. Execution times were estimated for each combination of kernel, data size, and precision using
processor models. We configured the PCIe communication links in the system as described previously. Data
sizes ranged from 500 to 8000. Since the NTEPI algorithm is iterative in nature, we explored the number
of iterations from 500 to 4000, but there was little variation from 500 to 1500 and from 3500 to 4000. We
scheduled the kernels that make up the NTEPI application in the system using six scheduling policies. Figure
6.27 shows the execution times each scheduling policy achieved using various data sizes. Note that the SPN
and SS results overlap and MET, HEFT, and PEFT overlap. In this section we evaluate the effect that
each scheduling policy had on the overall performance of the application. The total execution time of the
application is a result of the individual kernel execution times and data transfer costs.
Figure 6.26: Design space for CPU, GPU, and FPGA processors for the three kernels evaluated in this work across
a range of data sizes using single precision (SP) and double precision (DP) floating point.[100]
(a) 1500 Iterations (b) 2000 Iterations
(c) 2500 Iterations (d) 3000 Iterations
(e) 3500 Iterations
Figure 6.27: Performance of the six scheduling policies: shortest process next (SPN), serial scheduling (SS), adaptive
greedy (AG), minimum execution time (MET), heterogeneous earliest finish time (HEFT), and predict earliest finish
time (PEFT) across a range of data sizes and number of iterations for the NTEPI algorithm. Note that the SPN and
SS results overlap and MET, HEFT, and PEFT overlap.
The lowest execution times were achieved with the minimum execution time (MET), heterogeneous earliest
finish time (HEFT), and predict earliest finish time (PEFT) policies. The biggest difference between these is
that MET is a very simple dynamic policy and does not need the full DFG to perform effective scheduling.
For that reason, the analysis of the results will be broken down into two sections comparing MET to dynamic
policies, and to static policies.
MET vs. Other Dynamic Policies: Generally speaking, for all of our cases the shortest process next
(SPN) and serial scheduling (SS) policies had very similar performance. The only variation was at the 500
data size, where SPN was worse than SS and, for the 2500 iterations case, the SS policy performed similarly to
the best policies. Given the DFG we can see that there is not a lot of available parallelism, which reduces
the number of possible schedules. For the SS policy, after finding assignments for all but one processor the
algorithm begins to make the same decisions as the SPN policy. For that reason, both policies perform very
similarly for this application. They both made the best assignments for the first kernel, but then continued
assigning kernels to their second or third best platforms, effectively reducing the overall performance.
The adaptive greedy (AG) policy in general performed between SS/SPN and the other policies. In
calculating the queuing time, this policy leaves out the execution time of the kernel that is being assigned,
and only evaluates the times of the kernels already in the queue. Thus, kernels may be assigned to their
worst processor even when the best or second best platforms are available. However, once this worst case
assignment happens, the calculation of the queue length for that processor will be very long and subsequent
kernels will be assigned to other processors. Thanks to this, it does make better assignment decisions than
the SS and SPN policies, although through a process of trial and error.
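A minimal sketch of the queue-time estimate described above, written in Python for illustration. It is a simplified reading of the adaptive greedy policy rather than the implementation evaluated in this work; the function name `ag_assign` and the execution times are assumptions chosen only to show the behavior.

```python
def ag_assign(kernels, exec_time):
    """Assign each kernel to the processor whose queue currently holds the least work.

    kernels:   kernel names in arrival order
    exec_time: exec_time[kernel][processor] -> seconds (made-up values below)
    The estimate ignores the cost of the kernel being placed, as described above.
    """
    procs = list(next(iter(exec_time.values())))
    queued = {p: 0.0 for p in procs}              # work already queued per processor
    assignment = []
    for k in kernels:
        p = min(procs, key=lambda q: queued[q])   # ties go to the first processor listed
        assignment.append((k, p))
        queued[p] += exec_time[k][p]              # the placed kernel now counts toward the queue
    return assignment


# Hypothetical execution times in seconds; not measurements from this work.
times = {
    "MM":   {"GPU": 1.0, "FPGA": 40.0, "CPU": 200.0},
    "Chol": {"GPU": 5.0, "FPGA": 2.0,  "CPU": 60.0},
}
print(ag_assign(["MM", "MM", "MM", "Chol"], times))
```

With these made-up numbers the third matrix-matrix multiplication lands on the CPU, its slowest processor, because the CPU queue is empty at that moment. The long CPU queue that results then steers the following kernel to another processor.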
Given the large differences in the execution of kernels on different platforms, the MET policy achieved
some of the best results. In contrast to SPN, SS and AG, the MET policy selects only the platforms with
the lowest execution time for each kernel and will wait to assign a kernel if the best processor is busy. The
closest that these three other policies (SPN, SS, AG) came to matching the performance of MET was the
AG policy which was 8.2x slower than MET at data size 2000 for the 1500 iterations case in Figure 6.27a.
For this particular case, the MET policy uses the GPU 99% of the time, the FPGA 1%, and does not use
the CPU at all. On the other hand the AG policy uses the GPU 97% of the time, the FPGA 98%, and
the CPU for 84% of the time. Remember that these three processors are independent and can execute in
parallel. These numbers (97%, 98%, and 84%) mean that the three platforms are being used for a majority
of the time simultaneously for AG, in contrast to MET where the other two platforms (CPU and FPGA)
were idle for a majority of the time. Just looking at the utilization numbers, it appears the AG policy did
a better job of dispersing the work. However, even while utilizing mostly just a single processor, the MET
policy still completed 35.7 seconds faster by making better kernel-to-processor assignments.
The SS policy came closest to the performance of the MET policy at data size 500, where it was still 14.9x
slower. The SS policy utilized the GPU 100% of the time, the FPGA 97%, and the CPU 78% of the
time. The MET policy also utilized the GPU 100% of the time, but utilized the FPGA only
1% of the time and did not use the CPU at all. The SS policy assigned other kernels to the CPU and FPGA while
the GPU was processing mostly matrix-matrix multiplications. However, when matrix inverse (best in the
CPU at this data size) and Cholesky decomposition (best in the FPGA) needed to be scheduled, both CPU
and FPGA processors were busy processing matrix-matrix multiplications, and these kernels were assigned
to the GPU which was not the fastest processor.
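The contrast between assigning for lowest execution time and assigning to keep processors busy can be illustrated with a small Python sketch. It assumes independent kernels, ignores data transfers and dependencies, and uses made-up execution times. The names `makespan`, `met`, and `first_idle` are hypothetical, and `first_idle` is only a stand-in for a utilization-driven policy, not SPN, SS, or AG as evaluated here.

```python
def makespan(kernels, exec_time, choose):
    """Total run time when each kernel queues on the processor picked by `choose`."""
    free_at = {p: 0.0 for p in next(iter(exec_time.values()))}
    for k in kernels:
        p = choose(exec_time[k], free_at)
        free_at[p] += exec_time[k][p]     # kernel starts when its processor becomes free
    return max(free_at.values())


def met(times, free_at):
    # Always the fastest processor for this kernel, even if it is currently busy.
    return min(times, key=times.get)


def first_idle(times, free_at):
    # Whichever processor becomes free first, regardless of execution time.
    return min(free_at, key=free_at.get)


# Hypothetical execution times in seconds; not measurements from this work.
times = {"MM": {"GPU": 1.0, "FPGA": 40.0, "CPU": 200.0}}
work = ["MM"] * 10

print("MET:       ", makespan(work, times, met))
print("first idle:", makespan(work, times, first_idle))
```

In this toy case the policy that uses all three processors is limited by the single kernel it placed on the slowest one, while the policy that queues everything on the fastest processor finishes far earlier.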
Figure 6.28: Performance of the three best scheduling policies: minimum execution time (MET), heterogeneous
earliest finish time (HEFT), and predict earliest finish time (PEFT) across a range of data sizes for 1500 iterations
of the NTEPI algorithm with normalized performance.
MET vs. Static Policies: The three best policies were MET, heterogeneous earliest finish time (HEFT),
and predict earliest finish time (PEFT). The performance of these three is shown in Figure 6.28 for 1500
iterations of the NTEPI algorithm normalized to the performance of the best one. We found that PEFT
always performed best, followed by MET and then HEFT. The normalized performance of MET, HEFT, and
PEFT showed similar trends for all iterations and so we only show the case for 1500 iterations for brevity.
At most, the performance of these three policies differed by 1%. Remember that the MET
policy is the simplest of the three and requires almost no calculation in comparison to the HEFT and
PEFT policies. In addition, MET is a dynamic policy, whereas HEFT and PEFT are static policies and
are considerably more complex. As such, HEFT and PEFT did not achieve a large enough performance
increase to make the additional work worthwhile. These results are in contrast to those from
the previous work that showed over 20% increase in performance for PEFT over HEFT [6] and that the
performance of MET is worse than SPN [20]. This leads to our first conclusion: when the differences in
execution times among the various processors reach orders of magnitude, the most important factor to gain
the best performance for this medical imaging application is to achieve the lowest execution time for each
kernel.
Among MET, HEFT, and PEFT, the differences all stem from their capability to
identify the critical path of the application. PEFT was designed to evaluate the mapping between kernels
and processors in its ranking calculation, thus it performed the best overall. For the NTEPI application
the best performance is achieved by overlapping the Cholesky decomposition in the FPGA with the matrix-
matrix multiplications in the GPU. Throughout execution, PEFT ensures that there is always enough work
for the GPU while the FPGA is working on the Cholesky decompositions. HEFT does not have this capability
to evaluate all possible mappings, and instead uses the average execution times of a kernel on all processors
in its calculations. The downfall of HEFT and MET is that, frequently throughout execution, the
independent matrix-matrix multiplications were all completed before the Cholesky decomposition in the
FPGA finished, leaving the GPU idle. The penalty for this is higher for HEFT than for MET, thus HEFT
always performs worse than MET.
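HEFT's use of averaged execution times can be sketched as follows. This Python fragment computes the upward rank from the published HEFT formulation under simplifying assumptions: a single constant communication cost `comm` for every edge, made-up execution times, and a three-kernel chain that is not the NTEPI DFG. The function name `upward_ranks` is illustrative.

```python
from functools import lru_cache


def upward_ranks(succ, exec_time, comm=0.0):
    """Upward rank: mean execution time plus the longest remaining path to an exit kernel."""
    @lru_cache(maxsize=None)
    def rank(k):
        mean = sum(exec_time[k].values()) / len(exec_time[k])   # average over all processors
        return mean + max((comm + rank(s) for s in succ.get(k, ())), default=0.0)
    return {k: rank(k) for k in exec_time}


# Hypothetical execution times in seconds and a hypothetical dependency chain.
exec_time = {
    "Chol": {"GPU": 5.0, "FPGA": 2.0,  "CPU": 53.0},
    "MM":   {"GPU": 1.0, "FPGA": 41.0, "CPU": 198.0},
    "Inv":  {"GPU": 6.0, "FPGA": 30.0, "CPU": 3.0},
}
succ = {"Chol": ["MM"], "MM": ["Inv"]}

ranks = upward_ranks(succ, exec_time)
for kernel in sorted(ranks, key=ranks.get, reverse=True):   # HEFT schedules by decreasing rank
    print(kernel, ranks[kernel])
```

Here the matrix-matrix multiplication contributes its 80-second mean to the ranking even though its best processor needs only 1 second, which shows how averaging can hide order-of-magnitude differences between processors.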
For all of our test cases, the cost of transferring data was never more than 1% of the total execution time
of the application. The granularity of linear algebra kernels gave each processor enough work to minimize
these data transfers. On average each data transfer took about 1% of the time of each kernel. Using
this information, and that for the best scheduling policy only a single processor was in use 100% of the
time (GPU), the results suggest that a rigorous power management strategy could be applied to both the
communication links and processors to reduce overall system power without reducing any performance.
In summary, the PEFT policy achieved the lowest overall application run time but only by 1% compared
to MET and HEFT. Compared to these, the worst case of AG was up to 29x slower. Furthermore, SS
and SPN performed up to 155x slower compared to the three best policies. Due to the granularity of the
three kernels investigated and the large differences between the execution times in each processor, the most
important factor in scheduling the NTEPI application in a heterogeneous CPU+GPU+FPGA system is
to achieve the lowest execution time for every kernel by assigning them to the best processor rather than
keeping all the platforms busy. This result is contrary to the goal with smaller granularities and systems with
symmetric multicore processors, as stated before.
We have evaluated the performance of two static and four dynamic scheduling policies. We found that
HEFT and PEFT (static policies) performed similarly to MET (dynamic policy). We expected that the
dynamic policies would perform worse than the static policies and we found that SPN, SS, and AG did
perform worse. However, MET performed within 0.5% of one and better than another static policy. This
is a result of the large differences of execution times of kernels among the processors and the DFG of the
medical imaging application.
Scheduling Policy Performance
In this section we evaluate how well each policy made scheduling decisions by analyzing the scheduling delay.
Recall that there are three main contributors to the total run time of an application: the individual kernel
compute time, data transfer cost, and scheduling delay. The scheduling delay represents how much time kernels
Figure 6.29: Performance of the six scheduling policies: shortest process next (SPN), serial scheduling (SS), adaptive
greedy (AG), minimum execution time (MET), heterogeneous earliest finish time (HEFT), and predict earliest finish
time (PEFT) showing the average scheduling delay across a range of data sizes and for 1500 iterations of the NTEPI algorithm.
Note that the MET, HEFT, and PEFT results overlap.
had to wait while their dependencies finished executing on other processors, and the processor that they
will execute on sat idle. Moreover, the length of the scheduling delay indicates the amount of overlap simultaneously
executed kernels can achieve. When a processor is idle between kernels, that is wasted time that could have
been spent executing another kernel in parallel if all of the dependencies had been completed earlier. By
minimizing this idle time, a scheduling policy can shorten the overall execution time of the application.
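One way to compute this delay from a finished schedule is sketched below in Python. It follows the description above (a kernel waits on its dependencies while its own processor sits idle) but is an illustrative assumption rather than the measurement code used in this work: it ignores data transfer time, expects the schedule in a valid dependency order, and uses made-up durations. The function name `scheduling_delays` is hypothetical.

```python
from statistics import mean, pstdev


def scheduling_delays(schedule, duration, preds):
    """Per-kernel idle time spent waiting on dependencies.

    schedule: list of (kernel, processor) in dispatch order (dependencies first)
    duration: kernel -> execution time on its assigned processor
    preds:    kernel -> kernels it depends on
    """
    finish, free_at, delays = {}, {}, {}
    for k, p in schedule:
        ready = max((finish[d] for d in preds.get(k, ())), default=0.0)
        idle_from = free_at.get(p, 0.0)            # when this processor finished its last kernel
        delays[k] = max(0.0, ready - idle_from)    # processor idle while dependencies finish
        finish[k] = max(ready, idle_from) + duration[k]
        free_at[p] = finish[k]
    return delays


# Hypothetical schedule and durations in seconds.
schedule = [("chol", "FPGA"), ("mm1", "GPU"), ("mm2", "GPU"), ("inv", "GPU")]
duration = {"chol": 5.0, "mm1": 1.0, "mm2": 1.0, "inv": 2.0}
preds = {"mm2": ["mm1"], "inv": ["chol", "mm2"]}

delays = scheduling_delays(schedule, duration, preds)
values = list(delays.values())
print(delays, mean(values), pstdev(values))
```

In this example the GPU finishes both multiplications before the FPGA completes the Cholesky decomposition, so the final kernel accounts for all of the recorded delay.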
For the NTEPI application and range of data sizes, we also recorded the scheduling delay for each kernel. We
recorded the average delay and the standard deviation of the delays. In general, we found that PEFT,
MET, and HEFT had the lowest delay values followed by (in increasing order) AG, SS, and SPN. This is the same
ordering of scheduling policies ranked by increasing total application run time from the previous section.
However, by comparing the delay values we found a sizable difference between the SS and SPN policies.
In Figure 6.29 we show the average scheduling delays. As mentioned previously, we achieved similar total execution
time results using the SPN and SS policies and determined that they were making effectively the same kernel-
to-processor assignments. But the results of the average scheduling delay imply that the SS policy is ordering the
kernels for assignment in such a way as to prevent delays from dependencies from impacting the performance.
Taking these results and correlating them to performance results from the previous section, we can deduce
that a low scheduling delay is desirable as long as the assignment does not imply a much longer execution time. We conclude
that the SPN policy had generally higher delay values but lower compute times. In comparison, SS had lower delay
values but higher compute times. However, we expect that the SPN and SS policies would achieve different
performance for applications other than NTEPI.
The relationships among MET, HEFT, and PEFT were similar to those shown in Figure
6.28. In comparison to the three best (MET, HEFT, and PEFT), all other policies had at least 2.9x larger
scheduling delay, with a maximum of 1,320,862x. Also, the smaller delay for the best policies means that on average
there was very little wait time for each kernel and that the next kernels generally executed soon after the
previous kernel finished. This is a very interesting result since the MET policy does not use the dependencies
in its calculations, except to maintain correct kernel precedence ordering.
6.2.5 System Modeling Summary
In this section we have presented a study on scheduling policies that are used in distributed heterogeneous
systems to achieve high performance for a medical imaging application. We presented a configurable system
model that can be tailored for any configuration of processors and interconnect network. To the best
knowledge of the authors, this is the first effort to study scheduling in a CPU+GPU+FPGA system. A unique
task granularity, linear algebra kernels, was analyzed to evaluate scheduling kernels with large differences in
execution time on various processors.
Six scheduling policies were presented and used to assign kernels from the medical imaging application
to the CPU, GPU, and FPGA platforms for various data sizes and numbers of iterations. We showed the shortest
process next (SPN) and serial scheduling (SS) had different algorithmic behavior, yet they both had the
same performance results in this study. In our analysis of the adaptive greedy (AG) policy, that was designed
and tested for a CPU+GPU system, we found that it achieved better performance than the previous two
algorithms. Overall, our results show that the minimum execution time (MET), heterogeneous earliest finish
time (HEFT), and predict earliest finish time (PEFT) performed the best and within 1% of each other.
Attempting to improve processor utilization for the sake of keeping the system busy did not ever achieve the
same level of performance as the algorithms that made the best kernel-to-processor assignments.
We also analyzed the scheduling delay due to kernels waiting for their dependencies to finish and
found that one of the best policies (MET) had very low average scheduling delay for the medical imaging
application. However, this policy did not consider these delays in its scheduling calculations, and only chose
the processor with the lowest execution time for each kernel. Using this scheduling delay metric, we found that
SPN and SS had the same total execution time results, but performed differently in reducing the average
idle time. We conclude that for this medical imaging application, a scheduling algorithm that places a higher
value on choosing the best kernel-to-processor assignment yet still considers the scheduling delays due to
dependencies, will achieve high performance in a real world heterogeneous system.
This same approach can be applied to other applications in order to define the best scheduling policy over
heterogeneous systems with very different execution times on the different computing platforms. These new
policies will also be able to achieve more power savings by more efficiently managing the work in the system
to run on processors that will execute a kernel either more quickly or more efficiently. Finally, the modeling process
can be used within a heterogeneous system design framework to automatically configure a system with a
particular number of processors to meet performance, power, or cost constraints.
Chapter 7
Code Generation
In this chapter we present the code generation process that produces the final implementation of an appli-
cation. After analyzing an application and parallelizing the workload, the last step in the design flow is to
generate a working implementation. This generation process includes configuring kernel implementations,
choosing a scheduling and control approach, and producing a system-specific executable for the application
requiring no further user interaction to execute. For the purposes of this work, we consider two types of
systems: fixed configuration and configurable systems. We define a fixed configuration system (such as HPC
systems or clusters) as one whose configuration requires manual/physical effort to connect/disconnect hard-
ware. Whereas a configurable system is one that is software (re)programmable for different configurations
such as those supported by reconfigurable hardware (such as FPGAs or programmable SoCs). In Section
7.1, the code generation approach for fixed configuration systems supported by Matlab is presented, fol-
lowed by an approach for generating implementations for configurable systems supported by multiprocessor
system-on-chips (MPSoCs) that include reconfigurable fabric in Section 7.2.
7.1 Fixed Configuration Systems
Heterogeneous systems that are composed of a fixed set of hardware components provide an opportunity
for libraries of kernel implementations to be collected when implementing various applications. Once these
libraries reach a critical mass, they can be used for a wide variety of applications and reduce the effort to
implement individual kernels in an application. An example of this is MathWorks Matlab, which provides high
performance implementations for a variety of kernels for both CPUs and GPUs. On top of these libraries,
Matlab provides a high level abstract language for the developer to describe the computational work in their
application. During execution, the kernels from the application are mapped to the high performance library
implementations. The rest of this section details research to improve such computational environments with
available libraries of kernel implementations for every type of processor in the system.
Heterogeneous systems are complex and provide the developer with a continuous source of design problems
to overcome. Even for systems with a fixed set of processors there still exist many other design problems
for developers. These problems include kernel mapping, control, communication, and memory management
among others. Consider a simple application consisting of a matrix-matrix multiplication followed by a
matrix inversion. This application can be written using a single line of Matlab code: C = inv(A*B);
where A and B are the input matrices and the result is stored in C. Ideally, the matrix-matrix multiplication
should be mapped to the GPU and the inversion mapped to the CPU. This implies that the initial data
must be transferred to the GPU (from the CPU where it initially lies) and the intermediary result of the
matrix-matrix multiplication transferred back to the CPU (from the GPU). If there were more kernels in this
application, the variables stored in the GPU would need to be freed to make room for further computation.
This example displays the complexity inherent in heterogeneous system implementations that the application
developer or domain-expert would normally need to be concerned with. The goal of code generation for fixed
configuration systems is to handle these tasks automatically for the developer, producing an implementation
that the user can just execute.
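The transfers implied by this two-kernel example can be made concrete with a small generator, sketched here in Python. It emits Matlab text that uses the built-in `gpuArray`, `gather`, and `clear` functions. The generator itself (`emit_matlab`, the `_gpu` naming suffix, and the step format) is a hypothetical illustration and not the code generator developed in this work; it handles only a CPU and a single GPU.

```python
def emit_matlab(steps, placement, inputs):
    """Emit Matlab lines for kernels listed in dependency order.

    steps:     list of (output, expression template, operand names)
    placement: output name -> "CPU" or "GPU"
    inputs:    variables that start in CPU memory
    """
    on_cpu, on_gpu = set(inputs), set()
    lines = []
    for out, template, operands in steps:
        if placement[out] == "GPU":
            for v in operands:
                if v not in on_gpu:                          # copy CPU -> GPU once
                    lines.append(f"{v}_gpu = gpuArray({v});")
                    on_gpu.add(v)
            args = [f"{v}_gpu" for v in operands]
            lines.append(f"{out}_gpu = {template.format(*args)};")
            on_gpu.add(out)
        else:
            for v in operands:
                if v not in on_cpu:                          # copy GPU -> CPU once
                    lines.append(f"{v} = gather({v}_gpu);")
                    on_cpu.add(v)
            lines.append(f"{out} = {template.format(*operands)};")
            on_cpu.add(out)
    if on_gpu:                                               # release GPU copies at the end
        lines.append("clear " + " ".join(f"{v}_gpu" for v in sorted(on_gpu)))
    return lines


# C = inv(A*B) with the multiplication on the GPU and the inversion on the CPU.
steps = [("T", "{} * {}", ["A", "B"]), ("C", "inv({})", ["T"])]
script = emit_matlab(steps, {"T": "GPU", "C": "CPU"}, ["A", "B"])
print("\n".join(script))
```

For simplicity this sketch releases all GPU variables at the end of the script. A generator aiming to make room for later kernels would instead release each variable right after its last use.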
An application can be broken down into three main components: the types of kernels that will be executed,
the data dependencies between these kernels, and the initial data the application begins operating on. In
addition to this, the configuration of the system must be specified, including the type of processors (CPU,
GPU, or FPGA for example), quantity of each type, and data transfer rates between each processor. To
connect the application to the system, the kernels will need to be scheduled onto the processors for execution.
By providing these items separately as shown in Figure 7.1, we can generate the parallel Matlab code to
execute the entire application. We extend the compute capabilities of Matlab by incorporating support for
FPGAs and automating the parallel code generation. This builds on the existing parallel Matlab support
for CPU and GPU processors, while we provide an additional API to perform kernels on the FPGA.
7.1.1 Application Support
Matlab provides a high level abstraction that allows developers to describe kernels and data dependencies
in their application. Behind the scenes, high performance parallelized implementations of kernels are called
Figure 7.1: Code generation flow for heterogeneous Matlab scripts.
to perform the actual computation. We extend this abstraction for FPGAs, ease developer effort for system
control and communication, and characterize the performance of a fully functioning CPU+GPU+FPGA
system using Matlab.
Implementations for a wide range of kernels are already available in Matlab for the CPU and GPU,
leaving the user the task of determining how to split the workload to achieve better performance. Given
an application, implementing it sequentially in the CPU is very simple in Matlab. Extending this CPU
version to also utilize the GPU only requires the use of two functions to interface with the GPU: gpuArray
and gather that transfer data from the CPU to the GPU memory and vice-versa. Kernels are executed on
the GPU by first transferring a variable to the GPU; later, when this variable is operated on, Matlab
automatically performs the computation in the GPU with no other effort required from the user. However, using
this approach not only is the user responsible for application decomposition into kernels, they must also
determine mapping and scheduling of kernels onto the various processors.
Parallelizing an application across a heterogeneous system is a complex and difficult task to perform
manually. In Chapter 5 we described the standard two-level compiler approach introduced with the Stream
Virtual Machine (SVM) [64] that was used in the DARPA Polymorphous Computing Architectures (PCA)
program. Figure 7.2 shows the organization of implementation tasks in this flow. First, at a high level the
kernels are identified from the application. Next, the performance of each kernel is estimated, dictating
which processor each should be assigned to. Then, the order in which each kernel should be executed is
scheduled on the processors. The high level compiler produces an abstract representation of the application
consisting of a dataflow graph (DFG) of the kernels and their data dependencies, and a schedule containing
assignment and ordering information. The low level compiler/linker operates on the abstract representation,
mapping kernels to implementations in existing libraries or provided by the user. From the schedule, the
control threads are constructed to initiate computation and data transfers. Control threads are synchronized
for correct operation by enforcing data dependencies using data transfers.
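The construction of control threads from a DFG and a schedule can be sketched as follows. This Python fragment is an illustrative assumption about one possible low level step, not the compiler described here: every dependency that crosses processors becomes a send on the producer's thread and a matching receive on the consumer's thread. The function name `control_threads` and the tuple format are hypothetical.

```python
def control_threads(schedule, preds):
    """Build one ordered operation list per processor.

    schedule: list of (kernel, processor) in a valid execution order
    preds:    kernel -> kernels whose results it consumes
    """
    proc_of = dict(schedule)
    threads = {p: [] for _, p in schedule}
    for k, p in schedule:
        for d in preds.get(k, ()):
            src = proc_of[d]
            if src != p:                              # dependency crosses processors
                threads[src].append(("send", d, p))
                threads[p].append(("recv", d, src))
        threads[p].append(("run", k))
    return threads


# Hypothetical three-kernel example.
schedule = [("chol", "FPGA"), ("mm", "GPU"), ("inv", "CPU")]
preds = {"inv": ["chol", "mm"]}

threads = control_threads(schedule, preds)
for proc, ops in threads.items():
    print(proc, ops)
```

Because each send and its receive are appended together while walking one global order, every thread's list is a projection of that order, so blocking transfers executed in list order always have a matching partner.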
Figure 7.2: Application implementation and parallelization flow.
In this section, we focus on the last part (red items with dashed outlines in Figure 7.2) of the development
process: specifying the abstract representation, integrating the DFG, system configuration, and schedule,
and generating the parallel implementation necessary to execute the application across multiple processors
in a heterogeneous system. The strength of this approach is that the application developer is responsible
only for the initial sequential application development. It is typically easier to develop an algorithm that
does not require any extra communication or synchronization between operations, and then later determine
which kernels should occur in parallel and on which processors for significant speedup. The goal of this
approach is to allow the developer to focus on their application while gaining the performance improvement
from utilizing different types of processors.
The code generation flow provides the combination of performance and efficiency without the additional
overhead of time and effort to ensure data integrity or data communications. In total, the benefits of this
flow are as follows:
• Hardware Architecture Integration: Developers are able to quickly integrate various types of
processors into their system. This frees the architecture specialist responsible for the kernel develop-
ment from interfacing with the application. Similarly, the domain expert can focus on the application
instead of interfacing with various different hardware architectures.
• Sequential to Parallel Code: Developers are freed from parallelizing their code. The schedule is
produced in a separate step in the flow, either manually or using an automated process to assign
kernels to each processor, ensuring data dependencies are met, and providing communication and
synchronization.
• Processor Selection: The appropriate mapping for each kernel is determined based on data set size
and kernel type. This removes additional development and testing time.
• Device Memory Management: Automating memory management tasks to free memory from each
device after data is no longer needed, preventing errors due to insufficient memory.
• Code Readability: Since the generated code is in Matlab, it remains in a familiar human-readable
format.
In addition to the regular application development issues, developing for parallel systems also increases
the complexity by introducing problems such as deadlocks and memory management. By parsing the DFG,
the high level compiler can also constrain the lifetime of temporary variables from initial write to the last
read. Using this information we can automatically release the memory for these variables after the last
read of the data. Given a cursory understanding of MPI communication, a developer can avoid deadlocks
due to two threads trying to receive from each other before sending. But even experienced developers are
still mired in deadlocks when non-blocking communication transitions to blocking. This problem can be
handled automatically for the developer by scheduling the send and receive function calls appropriately
to not only prevent deadlocks but also minimize wait time for the other threads to begin communicating.
This functionality is common in operating system schedulers using techniques such as gang scheduling or
co-scheduling.
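The lifetime analysis described above can be sketched in a few lines of Python. The sketch assumes a single device and a straight-line list of operations; the function name `insert_frees`, the `keep` parameter, and the operation format are hypothetical and are not taken from the compiler presented in this work.

```python
def insert_frees(ops, keep=()):
    """Place a free after the last read of every variable.

    ops:  list of (output, operand names) in execution order
    keep: variables that must survive (for example, inputs the caller still needs)
    """
    last_read = {}
    for i, (_, operands) in enumerate(ops):
        for v in operands:
            last_read[v] = i                       # later reads overwrite earlier ones
    plan = []
    for i, (out, operands) in enumerate(ops):
        plan.append(("run", out, tuple(operands)))
        for v in dict.fromkeys(operands):          # de-duplicate, keep operand order
            if last_read[v] == i and v not in keep:
                plan.append(("free", v))
    return plan


# Hypothetical operation list: T = f(A, B); U = g(T, A); C = h(U).
ops = [("T", ["A", "B"]), ("U", ["T", "A"]), ("C", ["U"])]
plan = insert_frees(ops, keep=("A",))
for step in plan:
    print(step)
```

Variables that are never read, such as the final result, are never freed by this sketch, which matches the intent of releasing only temporaries.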
7.1.2 FPGA Interfacing and Hardware Support
In many current systems, including an effort from Microsoft [86], the FPGA interfaces with other processors
in the system using a PCI Express (PCIe) high speed serial connection. This is a popular and high throughput
interface and is supported for FPGAs by many existing frameworks that combine a hardware IP component
and a corresponding OS software driver. Many PCIe frameworks are now available including vendor reference
designs, 3rd-party designs, or academic efforts [17][111] including the Reusable Integration Framework for
FPGA Accelerators (RIFFA) [58]. We use RIFFA Version 2.02 in this work with a 128-bit interface. In
addition to the HDL that interfaces the compute logic on the FPGA to PCIe interface, RIFFA also includes
a kernel driver and a library for linking user software. We encapsulate the FPGA interface using the
“Matlab executable” (MEX) API. This allows compiled C code to be called from Matlab scripts. These
functions enable similar functionality to the built-in Matlab GPU functions, gpuArray and gather that
transfer data to and from the GPU, respectively. The FPGA interface functions are described in Table 7.1.
The application developer can therefore focus on their algorithm while the domain specialist focuses on
their implementation in the FPGA, without either needing to develop complex interfaces. As demonstrated
Table 7.1: Matlab FPGA Interface Functions
Name | Args | Description
fpgaDevice | (none) | Enumerates the FPGA devices in the system, providing detailed information about each. Equivalent to gpuDevice.
fpgaOpen | ID | Initializes the FPGA device specified by the given ID and returns a pointer to the data structure used to interface with that device.
fpgaSend | ptr, channel, A | Sends the data A to the specified channel on the FPGA referenced by the pointer ptr. Returns the number of bytes sent. Equivalent to gpuArray.
fpgaRecv | ptr, channel, numel | Receives numel elements of data from the specified channel on the device referenced by the pointer ptr and returns an array of data. Equivalent to gather.
fpgaComp | ptr, CType | Initiates computation for the kernel specified by CType on the device referenced by the pointer ptr.
fpgaReset | ID | Resets the device with the given ID, including the state of the PCIe control logic and all transfers across all channels.
fpgaClose | ptr | Resets the device referenced by the pointer ptr and frees any allocated memory for data structures.
in [111], FPGAs can provide a significant speedup in application execution time. However, there exist
significant hurdles to integrating an FPGA into an existing system. A system development effort requires
careful consideration of not only the algorithm, but integration with other devices, communication, and
data dependencies. These can increase development time, testing time, and costs of projects. However,
in many cases the benefits of introducing significant performance to the overall system often outweigh the
additional effort required. Our approach removes a significant portion of integration effort by providing an
automated integration of RIFFA into system designs. Not only does our flow provide a means to enhance the
performance of an application, but also a means to shorten application integration time. The application developer
can take advantage of already developed kernels published in academic papers or provided as IP by vendors,
while focusing on their own application development efforts.
It is essential for the FPGA to be able to handle any computation needed by the application. But not all
hardware cores can be implemented on the FPGA simultaneously. It is even less advantageous or practical
to have multiple FPGAs to support small subsets of functions, both from a power and space standpoint.
Therefore, we propose the use of partial reconfiguration (PR) to enable swapping hardware cores as needed by
the application. Xilinx provides a mature PR solution in their ISE Design Suite, leveraging Partitions. Using
the multiple channels provided by RIFFA would allow the master processor to provide PR instructions to the
fabric responsible for managing the reconfigured core while computation continues on the other,
unaffected cores. The proposed functionality would be integrated into the auto-generated application,
making it transparent to the developer. This approach would allow for a reduced FPGA size and a larger
supported set of functions in the FPGA.
RIFFA provides the ability to direct data transfers to various channels on the FPGA, based on a user’s
design. Each hardware core can interface with a separate channel to ease communication between multiple
hardware cores and the CPU. This allows multiple simultaneous kernels to execute on the FPGA, overlapping
reconfiguration or data transfer with computation and providing a customized compute capability.
7.1.3 Application Performance Experiments
To analyze the overall performance benefits of this code generation flow in a Matlab based heterogeneous
system, we implemented both medical imaging and fluid dynamics applications in various system config-
urations. In this section we introduce each application, describe its workload and DFG, and how it was
implemented across various types of processors.
Noninvasive Transmural Electrophysiological Imaging
The NTEPI algorithm employs a sequential maximum a posteriori (MAP) estimation of the transmural
action potential (electrical propagation) distributions given the body-surface potential data (as measured
from a standard ECG) [116]. At each time step when a new sample is available, a Cholesky decomposition is
performed, then a set of sample vectors is generated. Each sample vector individually enters into a simulation
of the Aliev-Panfilov model to predict a new set of sample vectors that estimate the future propagation of
electrical potential. Since the sampled ECG measurements contain electrical noise from sources other than the
heart, such as the respiratory muscles located between the electrodes and the heart, a Kalman filter is
used to reduce the impact of random noise in the data. The Kalman update process requires inverting an
MxM matrix, where M is the dimension of the body surface data. These prediction and update processes are
repeated iteratively; a typical patient analysis requires 2000-3000 iterations. This process is shown in Figure
7.3.

Figure 7.3: NTEPI processing flow
Figure 7.4: DFG of each iteration of the NTEPI algorithm showing dependencies within a single iteration, and
between iterations. The graph contains matrix-matrix multiplications (MM), Cholesky decompositions (Chol), and
matrix inversions (Inv).
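For reference, the measurement update of a standard linear Kalman filter can be written as below. The notation is an assumption introduced here for illustration (predicted state and covariance marked with a minus superscript, measurement matrix H, measurement noise covariance R, measurement y); the exact NTEPI formulation is given in [116] and may differ in detail. The point of the sketch is only that the matrix being inverted, S, has the dimension of the measurement, which is why the update requires an MxM inversion.

```latex
\[
\begin{aligned}
S &= H P^{-} H^{\top} + R, \qquad S \in \mathbb{R}^{M \times M} \\
K &= P^{-} H^{\top} S^{-1} \\
\hat{x} &= \hat{x}^{-} + K \left( y - H \hat{x}^{-} \right) \\
P &= \left( I - K H \right) P^{-}
\end{aligned}
\]
```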
The kernels required for the above calculations are common matrix operations such as addition, subtrac-
tion, element-wise and standard multiplication, scaling, inversion, and Cholesky decomposition. Previous
work profiled this algorithm in detail [31] and found that the majority of the execution time is spent on
the Aliev-Panfilov model. Moreover, we further investigated which specific kernels are the bottleneck and
found that 98% of the time is spent on matrix-matrix multiplication, matrix inversion, and Cholesky decom-
position. As such, this work attempts to improve the overall NTEPI algorithm’s performance by focusing
on these three types of kernels. Each iteration requires twelve kernels broken down into one Cholesky de-
composition, one matrix inversion, and ten matrix-matrix multiplications. Figure 7.4 shows the dependency
structure of the kernels in this application. Between iterations there is no opportunity for overlap as every
iteration is dependent on the previous update calculations. However, there is sufficient parallelism within
each iteration to potentially keep all three CPU, GPU, and FPGA processors busy.
Shallow Water Simulation
Shallow water simulations utilize a set of hyperbolic partial differential equations to model the flow of liquid
below a pressure surface, like the force of gravity on the surface of the ocean. These simulations typically
cover a large data set (a large, ocean-sized region) over a period of time, which leads to extensive processing
requirements. These equations are shown below in Equation 7.1.
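\[
\begin{aligned}
\frac{\partial \eta}{\partial t} + \frac{\partial (\eta u)}{\partial x} + \frac{\partial (\eta v)}{\partial y} &= 0 \\
\frac{\partial (\eta u)}{\partial t} + \frac{\partial}{\partial x}\left(\eta u^{2} + \frac{1}{2} g \eta^{2}\right) + \frac{\partial (\eta u v)}{\partial y} &= 0 \\
\frac{\partial (\eta v)}{\partial t} + \frac{\partial (\eta u v)}{\partial x} + \frac{\partial}{\partial y}\left(\eta v^{2} + \frac{1}{2} g \eta^{2}\right) &= 0
\end{aligned}
\tag{7.1}
\]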
These equations calculate the wave propagation using the total fluid column height ($\eta$), the water’s
horizontal velocity as averaged across the vertical column ($u, v$), and the acceleration due to gravity ($g$).
These equations can then be mathematically transformed from this state into simpler, more algorithmic
steps, especially when utilizing the nonconservative form, which is in terms of velocities instead of momenta.
This system then lends itself to a parallel implementation among different processor types. Using an
approach similar to a Runge-Kutta method, at each time step the discrete values are used to calculate
the propagation for a half-step and then used to compute new values for the next time step. Since these
calculations operate on a grid of discretized points, all 104 kernels in the application are element-wise
kernels operating on matrices. Each iteration requires 14 matrix additions (green), 24 matrix subtractions
(blue), 9 element-wise multiplications (brown), 26 matrix scalings (red), 15 matrix divisions (purple), and
16 element-wise squaring kernels (yellow). Figure 7.5 shows the dependency structure of the kernels in this
application. Previous work on improving the performance of this application has mostly focused on multi-core
CPUs and GPUs [65][73][89]. We investigate the performance benefit that adding the FPGA provides for these
simulations.

Figure 7.5: DFG of each iteration of the shallow water algorithm showing dependencies within a single iteration. The
graph contains element-wise kernels: matrix addition (M+, green, circle), matrix subtraction (M-, blue, square),
multiplication (M×, brown, triangle), matrix scaling (Ms, red, pentagon), matrix division (M÷, purple, star), and
squaring (M², yellow, diamond).

Table 7.2: Processor Specifications
Processor   Specifications                                  Implementations
CPU         AMD A10-5800K 3.8GHz, 16GB DDR3 @ 1600MHz       MathWorks Matlab 2012a 32b
GPU         Nvidia GTX480 607MHz, 1280MB GDDR5 @ 1.67GHz    MathWorks Matlab 2012a 32b
FPGA        Xilinx Kintex 325T, KC705, 1GB DDR3 @ 1600MHz   [2][3][37][106][119][124]
Hardware System Characterization
This section describes the characterization of a heterogeneous hardware system in terms of communication
and data management. We constructed a system with one each of a CPU, GPU, and FPGA. The
CPU was installed on the workstation’s motherboard, and the GPU and FPGA were installed as PCIe
cards. The specifications for each processor used in the system are shown in Table 7.2.
The system’s PCIe interconnects are illustrated in Figure 7.6a. The FPGA has been configured for a 4x lane
PCIe 2.0 interface with a theoretical bandwidth of 2GBps. The GPU has a 16x lane PCIe 2.0 interface with
a theoretical bandwidth of 8GBps. The Unified Media Interface (UMI) connects AMD’s CPUs, now referred
to as Accelerated Processing Units (APUs), to the chipset, or Fusion Controller Hub (FCH). This link is
based on PCIe 2.0, giving it a theoretical bandwidth of 2GBps, enough to support the 2GBps bandwidth
of the FPGA.
The runtime for fixed configuration systems contains two types of threads: control threads that manage
computation and data transfer, and compute threads that actually perform computation. Since the GPU
and FPGA cannot initiate computation on their own, it must be initiated by a control thread running on the CPU.
In our implementation, each processor has a control thread and a compute thread. For the CPU, which can initiate
its own computations, both control and compute functionality are integrated into a single thread. Figure
7.6b shows this organization of threads.
(a) System configuration (b) Control and compute threads.
Figure 7.6: Configuration of the AMD-based hardware system showing the configuration of processors (a) and
organization of threads (b). Note that the compute thread in the CPU is also its own control thread.
Data transfer to/from the FPGA is supported using our fpgaSend and fpgaRecv functions as presented
in the previous section. These transfers move data between the processors' control and compute threads.
For example, to transfer data between the FPGA and the CPU compute thread, data must effectively be
transferred twice: once from the FPGA to the FPGA’s control thread, and again from the FPGA’s control thread
to the CPU compute thread. This process is similar for communication between the CPU and GPU, while
FPGA-to-GPU communication requires three transfers. Future efforts will work to integrate a direct GPU-to-FPGA
communication path and also enable CPU threads to exchange data by passing pointers rather than
copying data.
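The transfer counts above can be checked with a small graph model in which every edge represents one copy. This is a sketch under assumptions: the node names are hypothetical, the CPU node stands for its combined control/compute thread, and the two device control threads are assumed to exchange data with each other directly.

```python
from collections import deque

# Hypothetical model of the transfer paths: each edge is one data copy.
LINKS = {
    "FPGA": ["FPGA_ctrl"],
    "FPGA_ctrl": ["FPGA", "CPU", "GPU_ctrl"],
    "GPU": ["GPU_ctrl"],
    "GPU_ctrl": ["GPU", "CPU", "FPGA_ctrl"],
    "CPU": ["FPGA_ctrl", "GPU_ctrl"],
}


def transfers_needed(src, dst):
    """Breadth-first search: minimum number of copies to move data from src to dst."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return hops
        for nxt in LINKS[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    raise ValueError(f"no path from {src} to {dst}")


if __name__ == "__main__":
    for src, dst in [("FPGA", "CPU"), ("GPU", "CPU"), ("FPGA", "GPU")]:
        print(src, "->", dst, ":", transfers_needed(src, dst), "transfers")
```

Under this model, a direct GPU-to-FPGA path would appear as one extra edge between the two devices, reducing that route from three copies to one.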
After developing the MEX interface for the FPGA in Matlab, we analyzed the performance of data
transfers and compared them to both the C++ implementation and the benchmarks provided in [58]. The
difference in performance between the Matlab MEX and C++ implementations was negligible. Figure 7.7
shows the bandwidth for CPU to FPGA transfers achieved as the payload size was varied from 1024B to
1024MB. Notice that transfers to the FPGA (sends) were significantly faster than transfers from it (receives).
The previous work [58] did not include the Kintex FPGA that we used, nor a 4x lane PCIe 2.0 configuration.
Instead we compared to the 8x lane PCIe 1.0 configuration, which has an equivalent theoretical bandwidth,
and found that our performance was much lower, leading us to believe that further customization could
achieve higher bandwidths.
Figure 7.7 also shows the bandwidth for CPU to GPU transfers achieved as the payload size was varied from
1024B to 256MB. We were only able to test up to 256MB transfers, as Matlab produced an error when
transferring 512MB payloads. Since our Nvidia GeForce GTX480 card has 1280MB of memory and can
actually support both 512MB and 1024MB data sizes, we narrowed the source of this discrepancy to the
infrastructure provided by Matlab and concluded that it introduces these additional limitations. Even
though the GPU has four times as many PCIe lanes, it only achieved 2x the bandwidth of the FPGA on
average.

Figure 7.7: Transfer bandwidths for CPU/GPU and CPU/FPGA as a function of payload size.
Though Matlab has supported parallel threading for many years, there are still
several problems that developers need to take into account. Matlab provides a subset of the Message
Passing Interface (MPI) standard to the user to enable communication between processors. In MPI the
MPI_Send and MPI_Recv functions transfer data between threads. The equivalent Matlab functions
are labSend and labReceive. The communication path from CPU to CPU is much more complex than the
CPU/GPU and CPU/FPGA paths, since the GPU and FPGA are constantly listening for transfer requests
to/from the CPU. In comparison, when multiple control threads in the CPU communicate, one thread
might send before the receiving thread is ready. If this happens, the sending thread
may block until the receiving thread is ready. To characterize this we constructed three experiments: (1)
where both threads reach the send/receive at the same time, (2) where one thread sends before the other
receives, and (3) where one thread tries to receive before the other sends. Figures 7.8a-c show these
experiments graphically. We fixed the delays at 5 seconds, and after collecting the time for sending and
receiving we subtracted out the 5 second delay time. The resulting time, therefore, is determined solely by
the underlying implementation of the MPI interface functions. Figure 7.8d shows the bandwidth
achieved as the payload size was varied from 1024B to 256MB. Dark colored lines indicate the results for
transmitting/sending and lighter colors for receiving.
(a) Ex 1: Send/Receive simultaneously
(b) Ex 2: Send before Receive
(c) Ex 3: Send after receive (d) Transfer bandwidths as function of payload size.
Figure 7.8: Matlab MPI bandwidth experiments (a-c) and bandwidth results (d). In each experiment, either the
sender or the receiver or both are delayed by 5 seconds.
For experiment (1) notice that the optimal payload size is 512KB and 256KB for sending and receiving
respectively. Also, the performance for a 64KB payload size drops slightly. At this payload size the memory
Matlab allocates by default is not enough to store the entire data being transmitted. In MPI, each thread
has a mailbox where other threads can leave data being transferred. This removes the requirement for both
threads to be in lock-step during data transmission. Traditionally the mailbox size is configurable by the
user, however the subset of the MPI that Matlab supports does not include this capability. This effect is
more noticeable in experiment (2) where at this payload size performance drops significantly. Also notice
that nowhere do the transmit or receive bandwidths approach those from experiment (1). This result
leads us to believe that there is additional work happening behind the scenes in Matlab’s implementation
of MPI, limiting performance. Experiment (3) confirms this result since the receiver is waiting prior to the
other thread sending the data, yet performance is still degraded. Both the receiver in experiment (2) and
the sender in experiment (3) are not delayed since the other thread has already reached the communication
point, and yet in neither of these cases does the measured bandwidth reach that of experiment (1).
A common issue for programs that use MPI is communication deadlock. When the size of the data being
transferred is larger than the size of the MPI mailbox, the non-blocking communication functions transition
to blocking implementations. This issue is also present in Matlab, where the equivalent function to MPI_Send
is labSend. We found that the labSend function does exhibit this transition to a blocking
approach, and experimentally determined the limit to be 128KB, or a 128x128 double precision matrix. The
subset of MPI that Matlab supports does not include the capability to configure the size of the mailboxes.
Working around this limitation requires forward planning for sending and receiving data. The solution is to
analyze the sending and receiving traffic, determine which features indicate this type of event, and apply
a fix to ameliorate the problem. We therefore examine the initial allocation of sends and receives
associated with the computations being performed. Any deadlocks found are fixed by manipulating the
order of the sends/receives.
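The kind of analysis described above can be illustrated with a small simulation. This is a minimal sketch and not the framework's actual implementation: the schedule format and function name are hypothetical, and it assumes that a send at or below the measured 128KB limit returns immediately while a larger send blocks until the peer posts the matching receive.

```python
from collections import defaultdict

EAGER_LIMIT = 128 * 1024  # bytes; the labSend limit measured in the text


def deadlocks(schedule):
    """Return True if the per-thread operation lists can never all complete.

    schedule maps a thread id to an ordered list of operations, each either
    ("send", peer, nbytes) or ("recv", peer).
    """
    pc = {t: 0 for t in schedule}   # index of the next operation per thread
    mailbox = defaultdict(int)      # (src, dst) -> number of buffered messages

    def current(t):
        return schedule[t][pc[t]] if pc[t] < len(schedule[t]) else None

    progress = True
    while progress:
        progress = False
        for t in schedule:
            op = current(t)
            if op is None:
                continue
            if op[0] == "send":
                _, peer, nbytes = op
                if nbytes <= EAGER_LIMIT:
                    # Small payload: buffered in the mailbox, sender continues.
                    mailbox[(t, peer)] += 1
                    pc[t] += 1
                    progress = True
                elif current(peer) == ("recv", t) and mailbox[(t, peer)] == 0:
                    # Large payload: completes only as a rendezvous with the receiver.
                    pc[t] += 1
                    pc[peer] += 1
                    progress = True
            else:
                _, peer = op
                if mailbox[(peer, t)] > 0:
                    # A buffered message is waiting: consume it.
                    mailbox[(peer, t)] -= 1
                    pc[t] += 1
                    progress = True
    return any(pc[t] < len(schedule[t]) for t in schedule)


if __name__ == "__main__":
    big = 256 * 1024
    both_send_first = {0: [("send", 1, big), ("recv", 1)],
                       1: [("send", 0, big), ("recv", 0)]}
    reordered = {0: [("send", 1, big), ("recv", 1)],
                 1: [("recv", 0), ("send", 0, big)]}
    print(deadlocks(both_send_first), deadlocks(reordered))
```

In the first schedule both threads issue a large send before their receive, so neither can proceed; swapping the order of the send and receive on one thread removes the deadlock, which is the style of fix described in the text.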
A secondary set of problems typically observed during parallel development concerns device memory
management. In comparison to the CPU, both the GPU and FPGA have a limited amount of memory that
can be utilized. Since Matlab’s abstraction limits the available memory storage space, we
found that the GPU is typically more constrained than the FPGA, as the total memory available to the user
may be less than the total memory on the device. The overhead of Matlab’s abstractions consumes
a noticeable amount of space, so a GPU with 1GB of memory will have less available, depending on the number of
concurrent variables stored in the GPU. When operating on large data sizes, or after many kernels execute in
the GPU, this memory becomes fully utilized and errors arise when attempting to allocate more space. Since
the amount of overhead is not discussed in the Matlab documentation, this makes preventing errors due to
insufficient memory difficult. In comparison, the memory in the FPGA is typically partitioned as needed
for inputs to each kernel implementation. For example, in an FPGA implementation of a matrix addition
kernel three matrices are stored, A, B, and C, to compute C = A + B. The result of the addition is always
stored in the same location in memory, overwriting any previous value. But when computed in the GPU,
the result could be stored at any pointer location, leading to memory quickly filling up for large data sizes.
To help avoid this problem, we also handle the task of allocating and freeing the memory used to store the
results of intermediary computations by following kernel dependencies from the first write to the last read.
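The first-write-to-last-read rule can be sketched as a simple lifetime analysis over the kernel schedule. The function name, schedule format, and buffer names below are hypothetical and chosen only for illustration; the sketch assumes each buffer is written once and that buffers never read by a later kernel are final results to be kept.

```python
def plan_frees(kernels):
    """Decide after which kernel each intermediate buffer can be freed.

    kernels is an ordered list of (name, input_buffers, output_buffer) tuples
    in execution order. Returns {kernel index: [buffers to free afterwards]}.
    """
    last_read = {}    # buffer -> index of the last kernel that reads it
    produced = set()  # buffers written by some kernel (first write)
    for i, (_, inputs, output) in enumerate(kernels):
        for buf in inputs:
            last_read[buf] = i
        produced.add(output)

    frees = {}
    for buf, i in last_read.items():
        if buf in produced:  # only intermediates; application inputs are left alone
            frees.setdefault(i, []).append(buf)
    return frees


if __name__ == "__main__":
    schedule = [
        ("MM",   ["A", "B"],   "T1"),
        ("Chol", ["T1"],       "T2"),
        ("MM",   ["T1", "T2"], "T3"),
        ("Inv",  ["T3"],       "OUT"),
    ]
    print(plan_frees(schedule))
```

Freeing each intermediate immediately after its last reader keeps the number of live device buffers bounded by the width of the dependency graph rather than by the total number of kernels executed.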
Performance Analysis
Each application was implemented for various system configurations. Although the system contains all three
processors, single-processor systems utilized only the CPU, GPU, or FPGA to execute kernels (and the
others remained unused). Processors were organized in the following way for two-processor system
configurations: CPU+GPU, CPU+FPGA, CPU+CPU, and GPU+FPGA. We also investigated the benefit of a
three-processor system containing CPU+GPU+FPGA. For each configuration, the same application DFG
was used. Different system configurations and schedules were fed into the Generate flow to produce the
appropriate parallel Matlab scripts.
The Matlab tic and toc functions were used for tracking the total execution time of the application,
the time spent in each thread, and the waiting time each thread experienced. The time spent in each thread
was a combination of the wait time, communication time, and execution time. The waiting time was the
period in which a thread was delayed because data it required had not yet been sent by the
corresponding thread. The CPU and GPU times were both authentic, in the sense that the kernels in the
algorithm were actually executed on those two processors. However, the FPGA operations were simulated,
using the time observed for each data set size and operation from [100].
The results for the Noninvasive Transmural Electrophysiological Imaging (NTEPI) application are shown
in Figure 7.9a normalized to the performance of the fastest system configuration. Contrary to our expec-
tations from the CPU+GPU+FPGA system simulation, our initial results showed that the FPGA was not
able to add any value to the system at any data size. In fact, the best performing system was the CPU+GPU
since the CPU excelled at executing kernels quickly at smaller sizes and the GPU at larger sizes. Notice that
at larger data sizes the GPU and CPU+GPU configurations overlap. This indicates that the CPU was no
longer used in the CPU+GPU configuration.
We investigated the source of the disappointing FPGA performance and found that the overhead required
to initiate computation negated the benefit of using the FPGA. To address this issue, we implemented a simple
controller in the FPGA that accepts a list of kernels to be executed from the CPU control thread and executes
them in order without any further direction from the control thread. This same capability is not possible for
the GPU since it would require merging the currently unavailable source code for kernel implementations to
remove the inter-kernel control requirement. The results with reduced FPGA overhead are shown in Figure
7.9b. Using this reduced overhead approach the single FPGA achieved speedups of 40x over CPU+GPU and
49x over GPU for smaller data sizes. This is due to the overhead of the GPU and extra transfers to/from
the CPU that prevent the other two configurations from achieving the same performance as just the FPGA.
Although this approach achieves better results, it is impractical for the GPU. The complexity of the custom
CUDA kernel routines increases significantly when merging the operations for multiple kernels into a single
monolithic implementation, and the merged implementation also requires recompilation. In comparison, no
changes are required in the FPGA other than the small and simple controller that imitates the control messages
sent by the CPU control thread.
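The effect of the embedded controller can be illustrated with a simple cost model. This is a sketch only: the function names are hypothetical and the numbers in the usage example are illustrative placeholders, not measurements from this work. It assumes a fixed initiation overhead per host-issued command.

```python
def host_driven_time(kernel_times, init_overhead):
    """Every kernel is launched by the CPU control thread and pays the overhead."""
    return sum(init_overhead + t for t in kernel_times)


def batched_time(kernel_times, init_overhead):
    """The kernel list is sent once; the on-device controller sequences the rest."""
    return init_overhead + sum(kernel_times)


if __name__ == "__main__":
    # Illustrative values in microseconds: 12 short kernels, 1000 us per initiation.
    kernels = [100] * 12
    print(host_driven_time(kernels, 1000))  # overhead paid 12 times
    print(batched_time(kernels, 1000))      # overhead paid once
```

When kernel execution times are small relative to the initiation overhead, as for small data sizes, the host-driven total is dominated by overhead, which is consistent with the FPGA only becoming competitive once the controller was added.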
(a) With overhead.
(b) Without overhead.
Figure 7.9: Runtime results of the NTEPI application for various system configurations across a range of matrix data
sizes.
The results for the shallow water application are shown in Figures 7.10a and 7.10b. Figure 7.10a shows the
original performance with the additional overhead of communication. Figure 7.10b shows the performance
with the improvement of adding an embedded controller to reduce the control and communication overhead
necessary to initiate computation. In this application the kernels were much simpler, containing almost
no complex control flow, and resulted in very small execution times. Both the NTEPI and shallow water
applications were evaluated with the same range of data sizes. Thus, data transfers had a much greater
impact on overall application performance for shallow water than for NTEPI: data transfer time was much
larger relative to kernel execution time than in the NTEPI application. As a result, the schedules contained
very little communication between processors, and the only configurations that utilized more than a single
processor were the CPU+CPU and GPU+FPGA configurations. Compared to the single CPU system for the 2000
data size, the dual CPU+CPU system achieved a 1.4x improvement. However, for the approach with the overhead,
the GPU+FPGA configuration performed worse than the single FPGA system due to the additional data
transfers.

(a) With overhead.
(b) Without overhead.
Figure 7.10: Runtime results of the shallow water application for various system configurations across a range of
matrix data sizes.
Overall, for both of these applications, the FPGA improved the performance of the system. We found
that not necessarily all processors will be used simultaneously when trying to tailor the system for high
performance. In fact, we also analyzed all three processors in a CPU+GPU+FPGA system and found that
for small sizes in the NTEPI application, the best performance was achieved using only the FPGA, while
only CPU and GPU were used for medium sizes, and only the GPU was used for larger sizes. Similarly for
the shallow water application, only the GPU and FPGA were used for all sizes. These applications operate
on a constant data size throughout their execution. Given the results presented above, applications with
varying data sizes such as facial recognition applications may utilize all types of processors to achieve better
performance.
7.1.4 Fixed Configuration Systems Summary
We presented the first implementation of the framework generation step. For fixed configuration systems,
a library based approach was used for generating implementations for applications. These implementations
were completely self sufficient and required no further user interaction to execute the entire application.
We extended the commonly used Matlab environment for FPGAs (adding both interfacing functionality
and compute libraries) and presented infrastructure to automatically set up a runtime
environment for control and scheduling. The runtime environment handles data transfers and memory
allocation, and prevents deadlocks. We analyzed two applications and presented performance improvements by tailoring
the system implementation to the quantity and types of kernels found in the application.
7.2 Configurable Systems
Compared to fixed configuration systems, configurable systems allow designers to add or remove processors
as needed to fit the computational needs of an application. The biggest benefits that configurable systems
provide are much tighter integration and support for classes of applications with smaller data sizes and more
diverse kernel types. Kernels initially implemented in software can be reimplemented as custom hardware
accelerators using tools such as high-level synthesis (HLS). Generally, such systems are also referred to as
multiprocessor systems-on-chip (MPSoCs) due to the fact that they contain multiple processors and are
implemented within a single chip (as opposed to multiple chips connected on a board or in a system in
a cluster fashion). In addition to implementing individual kernels, infrastructure components within the
system also need to be implemented. These components include processors, interconnect, control units, and
shared memories among others.
For designing MPSoC systems, one approach is to initially implement the application in software and
then parallelize it among the processors in the system. However, when it comes to heterogeneous systems
this software is not always portable between different types of processors. Given the increasing heterogeneity
of processors like CPUs, GPUs, DSPs, and custom accelerators, cross-compilation tools provide some aid in
porting implementations between processors. Even though these tools have achieved code portability, very
few if any have achieved portable performance. Instead, cross-compiled implementations generally achieve
significantly reduced performance compared to custom hand-coded implementations for those processors.
One type of cross-compilation tool for porting high-level languages to a custom accelerator is called
high-level synthesis (HLS). These HLS tools take an implementation in a language such as C or C++ and construct
a hardware implementation in a hardware description language (HDL). In this work, we recognize three types
of development styles for processors in MPSoCs: existing C implementations for CPU-like processors (either
from libraries or provided by the developer), HLS-produced hardware accelerators (using the same C code
as for CPU-like processors), and custom hardware accelerators (provided by the developer, or from a library).

Figure 7.11: Example Redsharc MPSoC system.
Even as tools such as those described above for designing MPSoCs continue to evolve to meet the
demands of developers, there still exist systematic gaps that must be bridged to provide a more cohesive
hardware/software development environment. In particular, support for design and implementation problems
including system generation, software/hardware compilation and synthesis, and run-time control is lacking
in existing development environments. The Reconfigurable Data-Stream Hardware Software Architecture
(Redsharc) has been previously introduced [62][90] as a solution to meet the performance needs of MPSoCs.
The design process and existing infrastructure of Redsharc make it an ideal foundation for developing a
cohesive build infrastructure and runtime environment for MPSoCs.
We extend our kernel-based development flow to support MPSoCs using Redsharc to provide both soft-
ware and hardware designers a simplified development environment, shifting the focus from system design
and integration to application and kernel development. This includes integrating HLS to rapidly implement
hardware accelerated kernels, automatic system control for user defined scheduling policies, and a build
framework to generate the binaries needed to implement the system. The form of MPSoC systems that can
be created using Redsharc is shown in Figure 7.11. Previous works supported multiple software kernels
assigned to the same processor core. In addition, Redsharc now supports multiple kernels assigned to any type
of core. For hardware cores, this means that two kernels assigned to the same core are physically implemented
side-by-side in the same reconfigurable fabric. For software cores, this means that the two kernels execute
concurrently on the same physical processor core, sharing compute time by context switching.

(a) Current state of design (b) Redsharc improved design
Figure 7.12: Shifting development focus with Redsharc
In this section we present our work to leverage Redsharc and improvements to integrate it into our
heterogeneous development framework. First we describe how Redsharc enables separate developer roles
and simplifies the design and implementation process in Section 7.2.1. The kernel programming model is
presented in Section 7.2.2. The system design model is presented in Section 7.2.3. The build framework
that integrates kernel and system implementations is described in Section 7.2.4. After the implementation
is produced, a custom runtime manages the hardware during execution and is presented in Section 7.2.5.
Lastly, we present example applications implemented and their performance in Section 7.2.6.
7.2.1 Developer Roles
Generally, designing embedded systems requires a wide range of skills. Domain expertise is needed to
understand the problem and craft a solution or algorithm. This algorithm will then need to be decomposed
into kernels implemented on the type of core that meets its computational pattern and needs. At the system
level, expertise is needed to determine the number of cores, types of cores —whether each is a processor or
hardware core— and the policy to schedule kernels on the cores. This also includes device specific expertise
to make sure the developed design properly takes advantage of the rich heterogeneous resources and I/O of
the device. Figure 7.12 shows the differences in complexity and time/effort required for each of the different
development duties.
Designing an entire heterogeneous hardware/software system from scratch normally entails allocating
a majority of the time and effort to integration, testing, and verification at the system level, as shown in
Figure 7.12a, and less on the kernel implementations and overall application design. In contrast, Redsharc
reduces the need to have a strong skill set at the system level by providing proven and validated on-chip
networks, communication interfaces and control necessary to manage execution. The addition of HLS enables
the software kernels to be implemented in hardware reducing the complexity of the kernel implementations
as shown in Figure 7.12b. Moreover, there are a multitude of HLS tools available that support languages
including: C/C++ [22][34][118], Python [56], and Haskell [74] among many others [1][36] reducing the skills
and time/effort needed to implement kernels. The shift in the state of design that Redsharc provides enables
systems to be constructed more quickly, or more time to be spent on refining and improving the performance of
the generated system.
7.2.2 Kernel Development with Redsharc
Whereas in fixed configuration systems extensive libraries are available for the small set of processors, cores
in Redsharc can either be processors or custom hardware accelerators. Since building custom hardware
accelerators is not the primary concern of the application developer, we leverage HLS tools to take existing
software implementations of kernels and generate accelerator implementations. This provides the necessary
implementations in hardware or software that can then be tweaked and tuned for performance later on.
Application implementation begins by decomposing the application into kernels. These kernels can either
be software threads or hardware logic. Then, leveraging the Redsharc API a developer can quickly assemble,
generate, and test the system on the device. This approach allows for rapid development and testing along
with providing vendor-agnostic implementations for ease of platform migration. Furthermore, as HLS tools
continue to mature, the ability to rapidly integrate generated hardware kernels will further alleviate a software
developer's burden of hardware design.
Redsharc supports the design of hardware kernel implementations using HLS by accepting the input
software code, running the HLS tool to generate the core functionality and ensuring that the top level interface
implements the Redsharc HWKI. This procedure enables developers with little hardware experience, or
experienced hardware developers with little time, to design hardware implementations. Currently, Redsharc
supports integration with Vivado HLS by augmenting the generated IP core with the HWKI. At present, a
designer must select the appropriate directives to ensure BRAMs and FIFOs are the primary interface, but
rather than requiring an AXI or bus-based top-level interface Redsharc uses Python scripts to integrate the
necessary HWKI into the hardware design.
Figure 7.13: Redsharc’s hardware abstractions and interfaces for processor cores
Improved Interfaces
Redsharc provides several interfaces to simplify kernel development. We updated the HWKI with AXI4
interfaces so that kernels can use AXI’s full and streaming interfaces without requiring
a developer to manage the signaling in their design. The hardware processor interface (HWPI), shown
in Figure 7.13, has been newly added to abstract the connections from any hard or soft processor core and to
provide DMA components that interface between the processor cores and the Redsharc system. The HWPI
provides a simple mechanism for vendor-agnostic development, since different FPGA vendors support
different soft and hard processors (PowerPC, ARM, MicroBlaze, NIOS, etc.). The software kernel interface
(SWKI) allows software kernels to run on any hard or soft processor core; it is a collection of APIs that
mimic hardware for seamless hardware/software communication. Together, these interfaces are a substantial
step towards raising the level of abstraction for the developer.
7.2.3 System Development with Redsharc
To design a system with Redsharc the designer supplies:
1. Kernel implementations
2. Dataflow graph (DFG) of the kernels in the application
3. Scheduling policy to be applied
4. Configuration of cores in the system
These inputs to Redsharc are depicted in Figure 7.14. The Redsharc APIs are shown in red. The control
kernel operates using the provided DFG and scheduling policy; no other user intervention is required.
Figure 7.14: Overview of Redsharc API showing the various input data required, and how it is utilized to construct
the system
Redsharc provides a DFG API for specifying the dependencies between kernels in the form of a dataflow
graph (DFG). In this graph each node represents a kernel and an edge is a data dependency between two
kernels. During execution the DFG will be traversed by the control kernel to maintain correct operation of
the system. The scheduling policy defines the order in which kernels will be executed and the core to which
each will be assigned.
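The traversal described above can be illustrated with a small sketch. This is not the Redsharc DFG API:
the dfg_t type and the dfg_* functions are hypothetical names introduced here, and the readiness rule
(a kernel may start once every kernel it depends on has finished) is an assumption about how a control
kernel could use the graph.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_KERNELS 16

/* Hypothetical DFG: dep[c][p] is true when kernel c consumes data produced by kernel p. */
typedef struct {
    int  num_kernels;
    bool dep[MAX_KERNELS][MAX_KERNELS];
    bool finished[MAX_KERNELS];
} dfg_t;

void dfg_init(dfg_t *g, int num_kernels) {
    assert(num_kernels >= 0 && num_kernels <= MAX_KERNELS);
    memset(g, 0, sizeof *g);
    g->num_kernels = num_kernels;
}

/* Add an edge: data flows from producer to consumer. */
void dfg_add_edge(dfg_t *g, int producer, int consumer) {
    assert(producer >= 0 && producer < g->num_kernels);
    assert(consumer >= 0 && consumer < g->num_kernels);
    g->dep[consumer][producer] = true;
}

void dfg_mark_finished(dfg_t *g, int kernel) {
    g->finished[kernel] = true;
}

/* A kernel is ready when it has not finished and all of its producers have. */
bool dfg_is_ready(const dfg_t *g, int kernel) {
    if (g->finished[kernel])
        return false;
    for (int p = 0; p < g->num_kernels; p++)
        if (g->dep[kernel][p] && !g->finished[p])
            return false;
    return true;
}

/* Write the IDs of all ready kernels into out (ascending) and return how many there are. */
int dfg_ready_kernels(const dfg_t *g, int out[MAX_KERNELS]) {
    int n = 0;
    for (int k = 0; k < g->num_kernels; k++)
        if (dfg_is_ready(g, k))
            out[n++] = k;
    return n;
}
```

A scheduling policy would then choose among the kernels returned by dfg_ready_kernels and decide which
core receives each one. The sketch does not track kernels that are currently running, which a real
controller would also need to do.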
The system configuration includes the number of processor and hardware cores that will be available to
execute tasks. Through the Redsharc System API, the user specifies the capabilities of each core and the
number of stream and block interfaces that will be needed by each kernel. The actual implementation of the
system will be generated from this configuration utilizing pre-designed processor blocks, hardware modules,
and on-chip networks using a set of makefiles to interface directly with the vendor-supplied compilation tools.
To facilitate rapid construction of MPSoC systems while still providing high performance communication,
Redsharc provides a block switch network (BSN) and a stream switch network (SSN) to transfer data in the
mode needed by the application. We abstract the complications of bus mastering, addressing, and
signaling from the kernel and system developers by providing a simple API representing FIFO and BRAM
transactions for streams and blocks, respectively. In this work the networks have been extended to leverage
the industry standard AXI4 full, streaming, and lite interfaces along with support for alternative on-chip
interconnect topologies. Figure 7.15 shows the improved BSN and SSN interfaces.
(a) Block Switch Network (BSN) (b) Stream Switch Network (SSN)
Figure 7.15: Redsharc network configurations and connectivity
The data ports on the BSN and SSN connect directly to the hardware or processor cores. Thanks to the
full crossbar structure present within the BSN and SSN, any core can be connected to any port. The SSN
uses on-chip resources to store data in FIFOs. The BSN uses on-chip BRAM to store data in addition
to off-chip resources such as DDR or SRAM. The interfaces for these off-chip memories are available in the
form of IP-cores (from vendors and other 3rd party developers) with standard AXI interfaces. In addition,
for devices with hard processor cores such as the Xilinx Zynq and Altera Cyclone5-SoC, the on-board DDR
is made available to the hardware logic via AXI interfaces as well. For this work we have revamped the BSN
to support such AXI interfaces, allowing simple connection to any memory available on-board.
Under certain circumstances, a crossbar switch may not provide an efficient bandwidth-to-resource ratio.
In these cases a bus, mesh, or full-custom network may be preferable. Redsharc now provides support to
select between a bus, mesh, and crossbar topology. A full-custom network is still feasible with Redsharc
when the system designer knows the exact communication path for all kernels in the system, but the build
framework currently only provides support for the aforementioned topologies. Ongoing work is investigating
how to automatically generate full-custom networks for the system.
The Redsharc System API is shown in Table 7.3. Through this API the developer specifies the
important characteristics of the system to implement: the number of cores and the capabilities of each
core. The Redsharc System API does not require the developer to have any HDL, networking, or even
FPGA knowledge, since all cores are connected to the BSN, SSN, and control core. The same capabilities
can be specified for both processor cores and hardware
cores, allowing the scheduling policy to dictate where to assign kernels as needed for best performance at
runtime.
When designing with Redsharc, developers continue to use their existing integrated development
environments (IDEs) and software development kits (SDKs). For hardware design entry and validation,
the same VHDL/Verilog IDE and simulators are used. Once the kernel designs are validated, the source
code is provided to Redsharc and behind the scenes the same vendor-supplied compilation tools (such as
gcc or g++ for software, or xst, ngdbuild, or bitgen for hardware) are called for compilation and synthesis
of the kernel designs, as described in Section 7.2.4.
Table 7.3: Redsharc System API
Function Name     Arguments         Description
initSystem        int numProcs      Initializes the system with numProcs processors.
addCore           int coreID        Adds a core to the system with the given ID, number of
                  int numKnls       simultaneous kernels, and core type.
                  type coreType
                  config cfg
setCapabilities   int coreID        Sets the capabilities of a core (the kernels that it can
                  config cfg        execute).
                  int numCaps
                  type types[ ]
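As an illustration of how the calls in Table 7.3 might be used together, the sketch below provides
stand-in implementations that simply record the system description. Only the function names and
argument order come from Table 7.3. The return types, the concrete definitions behind the `type` and
`config` arguments (core_type_t, kernel_type_t, and a handle to struct system_config), the idea that
initSystem returns the handle, and the helper coreCanRun are all assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

#define MAX_CORES 8
#define MAX_CAPS  8

/* Hypothetical stand-ins for the `type` arguments in Table 7.3. */
typedef enum { CORE_PROCESSOR, CORE_HARDWARE } core_type_t;
typedef enum { KNL_GENERATE, KNL_ENCRYPT, KNL_DECRYPT, KNL_CHECK } kernel_type_t;

typedef struct {
    int           used;
    int           num_kernels;   /* simultaneous kernels supported by the core */
    core_type_t   type;
    int           num_caps;
    kernel_type_t caps[MAX_CAPS];
} core_entry_t;

/* Hypothetical stand-in for the `config` argument: a handle to the system description. */
typedef struct system_config {
    int          num_procs;
    core_entry_t cores[MAX_CORES];
} *config;

static struct system_config g_system;

/* Assumed here to return the handle that the other calls take as cfg. */
config initSystem(int numProcs) {
    memset(&g_system, 0, sizeof g_system);
    g_system.num_procs = numProcs;
    return &g_system;
}

/* Register a core; returns 0 on success, -1 for a bad or duplicate ID. */
int addCore(int coreID, int numKnls, core_type_t coreType, config cfg) {
    assert(cfg != NULL);
    if (coreID < 0 || coreID >= MAX_CORES || cfg->cores[coreID].used)
        return -1;
    cfg->cores[coreID].used = 1;
    cfg->cores[coreID].num_kernels = numKnls;
    cfg->cores[coreID].type = coreType;
    return 0;
}

/* Record which kernel types a previously added core can execute. */
int setCapabilities(int coreID, config cfg, int numCaps, const kernel_type_t types[]) {
    assert(cfg != NULL);
    if (coreID < 0 || coreID >= MAX_CORES || !cfg->cores[coreID].used ||
        numCaps < 0 || numCaps > MAX_CAPS)
        return -1;
    cfg->cores[coreID].num_caps = numCaps;
    for (int i = 0; i < numCaps; i++)
        cfg->cores[coreID].caps[i] = types[i];
    return 0;
}

/* Hypothetical query a scheduling policy could use: 1 if the core can run the kernel. */
int coreCanRun(config cfg, int coreID, kernel_type_t kernel) {
    if (coreID < 0 || coreID >= MAX_CORES || !cfg->cores[coreID].used)
        return 0;
    for (int i = 0; i < cfg->cores[coreID].num_caps; i++)
        if (cfg->cores[coreID].caps[i] == kernel)
            return 1;
    return 0;
}
```

A system with one processor core that can run every kernel and one hardware core that can run only
the encryption kernel would be described by one initSystem call, two addCore calls, and two
setCapabilities calls. Because processor and hardware cores are described the same way, a scheduling
policy can query either kind of core uniformly.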
7.2.4 Build Infrastructure
The construction of MPSoCs can incur long development time when dealing with memory interfaces, PCIe
or other high-speed transceiver IP blocks, and low-level signaling for buses or on-chip interconnect protocols.
Redsharc aims to provide both software and hardware designers a simplified development environment,
shifting the focus from system design and integration to application and kernel development. We extend
Redsharc to support integration of HLS tools to rapidly implement hardware accelerated kernels along with
automatic system control for user defined scheduling policies.
Part of Redsharc includes a build infrastructure to support rapid assembly, configuration, and testing of
developed hardware kernels and full systems. The goal of the build infrastructure is to allow a developer to
spend more time developing kernels, rather than creating test benches and simulation/synthesis project files.
We introduce a build framework, taking hardware/software implementations and hardware specifications,
that automatically compiles, synthesizes, and generates the heterogeneous hardware/software MPSoC. This
framework abstracts away the requirements to manually connect and configure processors, memory
interfaces, and on-chip interconnects. Utilizing Redsharc’s kernel interfaces and on-chip SSN and BSN networks,
the framework is used to assemble full systems for the user. The framework consists of three key stages that
generate the necessary scripts and configuration information to then drive the vendor tools to produce the
final binaries and bitstreams to run on the device. The system produced is fully functional, requiring no
additional user input to set up, configure, or execute the application. Full-system synthesis and
implementation are also supported for both Xilinx and Altera systems, leveraging the Xilinx ISE/Vivado and
Altera Quartus II tool chains.
Redsharc now incorporates a vendor-agnostic system development environment and build framework,
illustrated in Figure 7.16. To start, the user provides core specifications and kernel implementations in
a simplified form to enumerate available kernels and compute resources. These kernels utilize Redsharc’s
APIs, allowing the system to generate hardware cores or software tasks via the HWKI or SWKI. The
user-provided kernel source code is combined with Redsharc’s libraries to generate a collection of IP, spanning
both hardware cores and software to run on specified processors. The interfaces and kernels are provided as
input to the next stage.
Figure 7.16: Redsharc build framework flow diagram
In the second stage the cores are connected together based on the network specifications provided as
input by the user. The user specifies the topology of both the BSN and SSN as a
crossbar, bus, or mesh. In this stage the networks are configured and VHDL source files are
generated for the system. Any custom top-level I/O specified by the user will be passed directly through
to the necessary cores. Memory controllers are also connected directly to the networks to provide high
bandwidth, low latency connectivity for the system. The overall configuration is passed to the third stage
to assist in managing connections at run-time.
In the final stage the controller core is added to the design. The controller uses the specific system
configuration information to control the connections between hardware cores during run-time. At the end
of this stage the configuration information about the newly specified Redsharc system is ready to be passed
into the Redsharc Generate script, producing the necessary vendor specific project files used to compile,
synthesize, and implement the design into software binaries and hardware bitstreams.
Then, leveraging the Redsharc System API, a developer can quickly assemble, generate, and test the
system on the device. This approach allows for rapid development and testing along with providing
vendor-agnostic implementations for ease of platform migration. System generation with Redsharc incorporates
Python scripts to organize and assemble vendor-specific project files before leveraging the vendor tools to
compile and implement the design. The Redsharc API provides the common interface for the tools to identify
and connect cores together with the BSN and SSN networks, and to create Xilinx, Altera, or Achronix project
files for synthesis and implementation. In Figure 7.17, the red box encapsulates the Redsharc build framework.
While the build framework does not expedite vendor tool flow execution, it does reduce the complexity for
a designer, since backend scripts ease overall system development.
Figure 7.17: Redsharc build tool flow
Redsharc utilizes Makefiles and command line execution to support a more systematic batch style
execution that generates the binaries and bitstreams. However, unlike some HLS tools, the outputs of the
intermediate stages are human-readable source files, allowing designers to open projects in vendor GUIs,
such as Vivado, Quartus, or ACE. This also allows designers to add any supplemental custom logic into the
design, if necessary. The Makefile can leverage the Redsharc simulation testbench infrastructure to provide
debugging capabilities at both the kernel and system levels. A generated system includes the control
capabilities to manage the on-chip networks based on the dataflow graph and scheduling algorithm provided
as input during system generation. The result is a run-time system executing the requisite kernels seamlessly
in both hardware and software.
Given the kernel implementations, dependencies, system configuration, and scheduling policy, Redsharc
composes a control kernel to manage communication and execution at runtime. Previously, control kernels
were only implemented in software for simplicity. In this work we removed this restriction and
introduced simplifications to the API to reduce user effort and enable a standardized hardware kernel to
control the system.
7.2.5 System Runtime Operation
Leveraging the HWPI, Redsharc can utilize a single processor to act as the run-time controller core,
responsible for system management tasks such as starting and stopping each kernel in the system. The
controller also configures the networks’ communication paths as needed for the application. The BSN
and SSN each have a single AXI4 control port that provides register interfaces to configure the network
for data communication between cores. The control interfaces are standardized for hardware and processor
cores. The HWKI provides registers for starting, stopping, resetting, and checking the state machine in the
hardware. The HWPI includes an AXI Mailbox that provides two FIFOs in each direction to allow control
commands to be issued from the control core to the processor core, and responses and notifications to be
sent from the processor core back to the control core. With the HWPI, Redsharc can now instantiate a
light-weight soft processor for controlling the system, rather than splitting execution on a larger, more
capable processor, such as the ARM Cortex-A9 cores, that should be allocated for software kernels.
After a system has been designed and implemented, the next task is to get it up and running. After
the initial bitstreams and processor executables have been downloaded, the control kernel begins setting up
BSN/SSN network connections, scheduling and launching worker kernels to execute parts of the application.
Kernels are assigned to the processor or hardware core as specified by the scheduling policy and following
the dependencies in the DFG to ensure correct execution. Before a kernel is started, its block and stream
interfaces are configured in the BSN and SSN appropriately.
After execution has begun no more user interaction is required. The control kernel frees block and stream
resources when both the kernel putting in data and the kernel reading out the data have finished. These
resources are then reused to support communication between other kernels. The control kernel can be
monitored by the user and can signal when final data has been produced or when all kernels have finished executing.
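The release rule stated above (a block or stream is freed only after both its producing and its consuming
kernel have finished) can be sketched as follows. The resource_pool_t type and the pool_* functions are
hypothetical names for this example and are not part of the Redsharc API; the sketch also assumes one
producer and one consumer per resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESOURCES 8

/* One block or stream resource, tracked by the controller. */
typedef struct {
    bool in_use;
    int  producer;        /* ID of the kernel writing into the resource */
    int  consumer;        /* ID of the kernel reading from the resource */
    bool producer_done;
    bool consumer_done;
} resource_t;

typedef struct {
    resource_t res[MAX_RESOURCES];
} resource_pool_t;

void pool_init(resource_pool_t *p) {
    assert(p != NULL);
    for (int i = 0; i < MAX_RESOURCES; i++)
        p->res[i].in_use = false;
}

/* Claim the first free resource; returns its index, or -1 if none is free. */
int pool_allocate(resource_pool_t *p, int producer, int consumer) {
    for (int i = 0; i < MAX_RESOURCES; i++) {
        if (!p->res[i].in_use) {
            p->res[i] = (resource_t){ true, producer, consumer, false, false };
            return i;
        }
    }
    return -1;
}

/* Record that a kernel finished; free every resource whose producer and
 * consumer are now both done. Returns the number of resources freed. */
int pool_kernel_finished(resource_pool_t *p, int kernel) {
    int freed = 0;
    for (int i = 0; i < MAX_RESOURCES; i++) {
        resource_t *r = &p->res[i];
        if (!r->in_use)
            continue;
        if (r->producer == kernel) r->producer_done = true;
        if (r->consumer == kernel) r->consumer_done = true;
        if (r->producer_done && r->consumer_done) {
            r->in_use = false;
            freed++;
        }
    }
    return freed;
}
```

Freed entries become available to the next pool_allocate call, which mirrors the reuse of blocks and
streams by later kernels.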
After a software kernel finishes executing, the RTOS running on the local processor core frees up any
private resources allocated, allowing other kernels to use them. However, launching another hardware kernel
is not so simple. To achieve the same functionality we leverage partial reconfiguration to reconfigure the
FPGA fabric for the incoming hardware kernel. Just as processor cores have a hardware-limited
number of DMA controllers, hardware cores have only a fixed number of physical block and stream
ports that connect to the BSN and SSN. The HWKI supports more block and FIFO interfaces by buffering
and interleaving data on a single physical channel. The specific configuration of the HWKI is generated by
Redsharc automatically during implementation and synthesis based on the system specification.
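One way to picture buffering and interleaving on a single physical channel is to tag each word with
the logical interface it belongs to and to separate the words again at the receiving side. The sketch
below is a software model of that idea only; the tagged word format (flit_t), the round-robin order,
and all function names are assumptions for illustration and do not describe the HWKI implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_LOGICAL 4   /* logical interfaces sharing one physical channel */
#define FIFO_DEPTH  8   /* per-interface buffer depth at the receiver */

/* One word on the shared channel, tagged with its logical interface. */
typedef struct {
    uint8_t  channel;
    uint32_t data;
} flit_t;

/* Receiver-side buffers, one per logical interface. */
typedef struct {
    uint32_t buf[NUM_LOGICAL][FIFO_DEPTH];
    size_t   count[NUM_LOGICAL];
} demux_t;

void demux_init(demux_t *d) {
    memset(d, 0, sizeof *d);
}

/* Round-robin interleave: take one word at a time from every logical source
 * that still has data. Returns the number of tagged words written to out. */
size_t interleave(const uint32_t *src[], const size_t len[],
                  flit_t *out, size_t out_cap) {
    assert(out != NULL);
    size_t pos[NUM_LOGICAL] = {0};
    size_t n = 0;
    int progressed = 1;
    while (progressed) {
        progressed = 0;
        for (int c = 0; c < NUM_LOGICAL; c++) {
            if (pos[c] < len[c]) {
                if (n == out_cap)
                    return n;
                out[n].channel = (uint8_t)c;
                out[n].data = src[c][pos[c]++];
                n++;
                progressed = 1;
            }
        }
    }
    return n;
}

/* Place a tagged word into its logical buffer; -1 on bad tag or full buffer. */
int demux_push(demux_t *d, flit_t f) {
    if (f.channel >= NUM_LOGICAL || d->count[f.channel] == FIFO_DEPTH)
        return -1;
    d->buf[f.channel][d->count[f.channel]++] = f.data;
    return 0;
}
```

Round-robin is only one possible arbitration choice; it keeps any single logical interface from
starving the others, at the cost of one tag per word.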
Once a new system design has been completed the next question is: “Is the performance of the system
what I expect?” In previous work, an extensible performance monitoring infrastructure [91] was developed.
By leveraging this framework, Redsharc provides two types of system generation: Analysis (with performance
monitoring) and Release (without the performance monitoring framework). Additionally, debug
functionality can be enabled through a system configuration setting that directs Redsharc to include debug
capabilities in the control kernel. Through the control kernel, the user can “pause” execution, read or modify
current data in blocks and streams, and perform other debug functions as necessary.
7.2.6 Example Applications in Redsharc
To demonstrate the simplicity and ease of use of Redsharc, we present two applications implemented using
Redsharc. The first is a face recognition application. In this first demonstration, the focus is on the
implementation process from initial software application to final multiprocessor hardware system. The second
application implements a typical cryptographic workflow that generates data, encrypts it, decrypts it, and
performs some basic data checking. This demonstration is focused on the various possible system
configurations and how the workload is managed at runtime.
Face Recognition
To demonstrate the Redsharc implementation process, we designed a system for an example face recognition
application and show the steps required for implementation. In this section we present an overview of the
face recognition algorithm and how it was implemented with sample kernel implementations, kernel setup,
and configuration in the DFG.
Facial recognition is often used in consumer products like Google Picasa, Microsoft Live Gallery, or
Facebook and in law enforcement or military intelligence to identify a person of interest. To process the
massive amount of image data involved, reductions in dimensionality are necessary to effectively analyze as
many images as possible. One way to achieve this is to extract the most important features from the image,
producing eigenfaces as introduced by Turk and Pentland [114]. In their approach principal component
analysis (PCA) is used to produce the eigenvalues for the image, making up the eigenface. We use singular
value decomposition (SVD) to perform PCA. To determine whether a sample face matches a subject in the
reference database, the feature vector for that sample face is computed first. Then the root-mean-square
(RMS) differences between the sample face’s feature vector and the feature vectors for the reference subjects
are computed. The closest matching subject is the one with the lowest RMS difference. Figure 7.18 shows
the DFG for this application.
Figure 7.18: Face Recognition DFG partitioned into software and hardware kernels.
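The matching step described above can be sketched in plain C, independent of the Redsharc kernels.
The function names and the row-major layout of the reference feature vectors are assumptions made for
this example. Because the square root is monotonic, the subject with the lowest mean squared difference
is also the subject with the lowest RMS difference, so the sketch compares mean squared differences and
omits the square root.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of the squared element-wise differences between two feature vectors. */
double mean_square_difference(const double *a, const double *b, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* Return the index of the reference subject closest to the sample, or -1 if
 * there are no subjects. refs holds num_subjects feature vectors of length n,
 * stored one after another (row-major). */
int closest_subject(const double *sample, const double *refs,
                    size_t num_subjects, size_t n) {
    int best = -1;
    double best_msd = 0.0;
    for (size_t s = 0; s < num_subjects; s++) {
        double msd = mean_square_difference(sample, refs + s * n, n);
        if (best < 0 || msd < best_msd) {
            best = (int)s;
            best_msd = msd;
        }
    }
    return best;
}
```

If the RMS value itself is needed, for example to apply a rejection threshold, it is the square root of
the returned mean squared difference.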
Implementing this face recognition application using Redsharc is a step-by-step progression of migrating
the existing implementation to Redsharc primitives which simplify implementation. First we began with an
initial sequential C code application. Then, the code for each kernel was segmented into separate functions
and the shared data variables were moved to a global scope. These global variables were then reimplemented
as Redsharc blocks and streams. Converting the functions into software kernels using the SWKI was a
simple matter of migrating access to the global blocks and streams into local blocks and streams passed as
an argument into each kernel as shown in Listing 7.1. At this point each kernel is independent of any other
kernel, simply reading and modifying the given data structures. The DFG was implemented using the API
by specifying which blocks and streams are produced by one kernel and consumed by another, as shown
in Listing 7.2. The Redsharc implementation of this application produced exactly the same results for a
variety of sample images when compared to a given reference database as the initial single threaded C code
implementation.
The same user-provided DFG and kernel implementations can be run on any system with any number
of processor cores with a simple change to the system configuration. Interesting future work would be to
implement some of the kernels in hardware either manually or using HLS and compare the performance of
various system configurations for a variety of applications. In addition, we are also working towards providing
support for a range of vendor architectures: standard C pthreads for initial testing, Xilinx Zynq/PPC, Altera
Nios/HPS, and ARM soft-cores on Achronix FPGAs.
Listing 7.1: Software Kernel 4 Implementation
void swk4(struct taskData *data) {
    // get references to the data structures
    redsharc_block *pc1 = data->blocks[0];
    redsharc_stream *mean = data->streams[0];
    redsharc_stream *diff = data->streams[1];
    int i, j;
    // do work (m and n are loop bounds defined outside this listing)
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            double tmp0, tmp1, tmp2;
            blockRead(&tmp0, j, pc1);   // from hwk3
            streamPop(&tmp1, mean);     // from swk2
            tmp2 = tmp0 - tmp1;         // difference from the mean
            streamPush(&tmp2, diff);    // to swk5
        }
    }
    notify_kernelFinished(data->handle);
}
Listing 7.2: DFG API for Configuring Kernel 4
// setup kernel 4 with two inputs and one output
initKernel(4, HW4, 2, 1, dfg);
// setup first input as a stream from kernel 2
addStreamDependency(4, 0, 2, 0, dfg);
// setup second input as a block from kernel 3
addBlockDependency(4, 1, 3, 0, dfg);
// setup first output as a stream
addOutputStream(4, 0, DOUBLE, N*totalImages, dfg);
Cryptographic Application
To demonstrate the various system configuration possibilities in Redsharc, a four-kernel encryption
application has been designed. The application’s workload consists of reading some data from memory (the
generate kernel), encrypting and transmitting the packet, receiving a packet and decrypting it, and
performing a check to validate that the same plaintext data was decrypted correctly. The application was
implemented entirely in software initially, and executed on the single processor core systems. Then the
encryption kernel was marked to be implemented in hardware and the system regenerated. Finally, the
application was executed again utilizing both the processor and hardware cores.
To demonstrate the Redsharc build framework for different FPGA devices, four systems have been
constructed targeting Xilinx Virtex-7 and Zynq-7000 devices for both software only and hardware/software
co-design integration. These systems target both hard and soft processor cores utilizing compilation tool
chains from two different vendors (ARM: armcc and armlink, Xilinx: gcc).
This section shows how a user can transition from an initial software only single processor core system, to a
parallel hardware/software MPSoC system. For each system, in addition to the compute cores, a MicroBlaze
processor core was used to host the control kernel in the system. This control core did not perform any
computation; it only implemented the scheduling and control capabilities of the system. The first generated
system consisted of a single ARM processor core (in addition to the control core mentioned previously) in
the Zynq ZC706 system. The next system is an extension of the first to include a single hardware core to
execute a statically scheduled hardware kernel. The third system consisted of a single MicroBlaze processor
core (in addition to the control core, for a total of two MicroBlaze cores) in the Virtex VC707 system. The
fourth system is an extension of the third to include a statically scheduled hardware kernel.
The same C code that was provided for the software encryption kernel was synthesized using Vivado HLS
to produce a hardware kernel. By default, the only directives that the build framework applies automatically
specify a handshake interface for the control of the kernel. Vivado HLS produces a standard
memory interface for any arguments at the top function level. For the encryption kernel, these were the
plaintext, key, and resulting ciphertext. These three memory interfaces were connected directly to the BRAM
interfaces available in the HWKI.
To show the simplicity of the system API, the processor core’s architecture was specified to be either
MicroBlaze or ARM. Both processor cores were connected to the external DDR memory on the board through
their cache hierarchies. Both processor cores were connected to a HWPI with a single DMA controller
connected to the BSN and another DMA controller connected to the SSN.
Both the ARM and MicroBlaze cores had a timer and interrupt controller available to FreeRTOS,
which ran in a baremetal configuration. This RTOS provided thin control capabilities that operated on
the commands received through the mailbox from the control core. The number of tasks was statically
determined to be two, and the tasks were created at initial boot and held in an idle state until commands
were received from the control core to launch kernels. When a kernel completed, a finished message was
generated and sent to the control core to facilitate scheduling and execution progress of the application. A
custom communication protocol was created to allow for minimal data transfer between the control core and
the processor core.
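The source does not give the format of the custom protocol, so the following is only a sketch of what a
compact mailbox message could look like: one 32-bit word carrying an opcode, a task slot, and a kernel
ID. The field layout, the opcode values, and every name below are assumptions for illustration; only the
two statically created task slots and the launch/finished exchange are taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TASKS 2   /* two statically created tasks, as described above */

/* Hypothetical opcodes for the control-core/processor-core exchange. */
typedef enum { CMD_LAUNCH = 1, CMD_STOP = 2, MSG_FINISHED = 3 } opcode_t;

typedef struct {
    opcode_t op;
    uint8_t  task;     /* task slot on the processor core */
    uint16_t kernel;   /* kernel ID from the DFG */
} message_t;

/* Pack a message into one mailbox word: [31:24] op, [23:16] task, [15:0] kernel. */
uint32_t msg_encode(message_t m) {
    return ((uint32_t)m.op << 24) | ((uint32_t)m.task << 16) | (uint32_t)m.kernel;
}

message_t msg_decode(uint32_t word) {
    message_t m;
    m.op     = (opcode_t)(word >> 24);
    m.task   = (uint8_t)((word >> 16) & 0xFFu);
    m.kernel = (uint16_t)(word & 0xFFFFu);
    return m;
}

/* Processor-core side: which kernel, if any, each idle task slot is running. */
typedef struct {
    int      running[NUM_TASKS];
    uint16_t kernel[NUM_TASKS];
} worker_t;

/* Apply one command word; returns 0 on success, -1 if it cannot be honored. */
int worker_handle(worker_t *w, uint32_t word) {
    message_t m = msg_decode(word);
    if (m.task >= NUM_TASKS)
        return -1;
    if (m.op == CMD_LAUNCH) {
        if (w->running[m.task])
            return -1;              /* slot is busy */
        w->running[m.task] = 1;
        w->kernel[m.task] = m.kernel;
        return 0;
    }
    if (m.op == CMD_STOP) {
        w->running[m.task] = 0;
        return 0;
    }
    return -1;
}

/* Called when a kernel completes: return the slot to idle and build the
 * finished message to send back to the control core. */
uint32_t worker_finish(worker_t *w, uint8_t task) {
    assert(task < NUM_TASKS);
    message_t m = { MSG_FINISHED, task, w->kernel[task] };
    w->running[task] = 0;
    return msg_encode(m);
}
```

Keeping each message to a single word means one mailbox FIFO write per command or notification, which
is in the spirit of the minimal data transfer mentioned above.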
Table 7.4: System performance with Redsharc build framework
Operation                                  Elapsed Time
MB Control Bootup                          90 ms
MB Control Send Message                    1.5 µs
MB Control Receive Message                 1.6 µs
ARM Compute Send Message                   0.9 µs
ARM Compute Receive Message                0.9 µs
ARM Compute 128-Byte Block Transfer        3.6 µs
ARM Compute AES generate kernel            6.5 µs
ARM Compute AES encrypt kernel             73.8 µs
ARM Compute AES decrypt kernel             298.3 µs
ARM Compute AES check kernel               6.7 µs
Hardware Compute AES encrypt kernel        42.9 µs
Two methods of data transfer were used in these experiments: writing into BRAM blocks through the
BSN, and writing into virtual blocks created within the processor core’s DDR memory. When two software
kernels that are assigned to the same processor core need to communicate, there is no reason for that data to
leave the cache hierarchy and enter the on-chip network. Instead, memory allocation commands were
implemented in the processor control protocol to allow blocks to be created in a processor’s memory. Since
data transfer through the BSN passes through the DMA controller, the control core only needs to provide a
pointer in the processor’s local address space to direct it where to write data. Whether this is an address in
DDR or the address of the DMA controller is transparent to the software kernel, allowing the execution
to be finely tuned dynamically by the control kernel to optimize performance during run-time. For these
experiments, there was only one processor core and one hardware core connected to the BSN. Thus the
connection between these two cores was fixed and never changed during execution.
The system initialization procedure first boots the processor core followed by the control core. For
both the software-only and hardware/software application implementations the control core first loaded the
configuration of the system and dataflow graph (DFG) of the application. It then began the scheduling process
and execution of kernels. The boot time for the control core was 90ms compared to the compute cores: 56s
boot time for the MicroBlaze processor core and 30s for the ARM core. Table 7.4 shows execution times
for selected operations.
The four-kernel AES application was initially executed on a PC using the Redsharc API to provide a
performance baseline of 112.3ms. Migrating to the FPGA and executing in a dual MicroBlaze system on the
Virtex-7 device resulted in an execution time of 103.6ms. Accelerating the encryption kernel as a hardware
kernel achieved an entire application runtime of 98.2ms. This speedup was achieved by making a simple
change to direct the build framework to produce a hardware core from the software implementation, requiring
no HDL development by the user.
A Zynq system with ARM processor cores further accelerated this execution, running in 90.79ms. Since
the size of the data being transferred was only 128 bytes for the plaintext, key, and ciphertext, the data
transfer rate achieved was only 35.5 MBps using a 32-bit wide datapath. The focus of this work is not strictly
on HLS accelerated performance, but the ability to easily migrate from a pure software implementation to
multiple hardware accelerated kernels, with minimal changes to software kernels in the system. Further
performance could be obtained by implementing custom hardware kernels in place of those generated by
Vivado HLS or by applying further performance enhancing directives such as loop unrolling or pipelining.
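The figures quoted above can be related to one another with a little arithmetic. The sketch below
assumes that the 128-byte block transfer time in Table 7.4 is 3.6 microseconds and that 1 MB is 10^6
bytes; under those assumptions the transfer rate works out to about 35.6 MBps, in line with the quoted
35.5 MBps. The helper names are introduced for this example only.

```c
#include <assert.h>

/* Bytes per microsecond is numerically equal to MB/s when 1 MB = 1e6 bytes. */
double throughput_mbps(double bytes, double microseconds) {
    assert(microseconds > 0.0);
    return bytes / microseconds;
}

/* Speedup of a new configuration relative to a baseline (same time unit for both). */
double speedup(double baseline_time, double new_time) {
    assert(new_time > 0.0);
    return baseline_time / new_time;
}
```

With the reported numbers, throughput_mbps(128.0, 3.6) is roughly 35.6, speedup(103.6, 98.2) is roughly
1.05 for moving the encryption kernel into hardware on the MicroBlaze system, speedup(112.3, 90.79) is
roughly 1.24 for the Zynq system relative to the PC baseline, and speedup(73.8, 42.9) is roughly 1.72
for the encryption kernel alone (hardware versus ARM in Table 7.4). The gap between the kernel-level
and application-level speedups reflects that encryption is only one of the four kernels.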
7.2.7 Configurable Systems Summary
We presented a second version of the framework generation step. For configurable systems, an initial
software-only version of the application is used as a starting point from which the performance of kernels can
be improved by migration to hardware. In addition, the developer can provide any custom or 3rd party
kernel implementations developed using a simple and user friendly hardware interface API. We extended
the Redsharc development environment by abstracting the processor core interface and adding a central
control scheme and runtime environment. The runtime environment handles data transfers, memory allocation,
and initiates computation in processor or hardware cores. We showed the ease with which different systems
can be configured with different compositions of processors for two applications.
Chapter 8
Conclusions and Future Work
Heterogeneous systems can execute compute intensive applications and achieve high performance. The
difficulty comes in deciding which processor each kernel should be assigned to in order to achieve the best
performance of the whole application. We presented a framework to aid in the process of implementing an
application, designing a heterogeneous system to support it, and evaluating its performance. Specifically,
this framework targets compute-intensive scientific applications because their workloads are made up of
coarse grain kernels. This composition simplifies the analysis of the application, easing the effort required
to decompose the application and assign work to each processor in the system. We have shown that this
level of granularity, coarse grain kernels, simplifies the design and implementation problems of heterogeneous
systems. Using this framework flow, we achieved speedups of 20-60x for systems of off-the-shelf processors
and 7-18x for MPSoCs compared to sequential single processor systems with very little user effort.
We presented a front-end compiler that analyzes the initial source code and constructs a graph of the
kernels in the application. Each node in the graph represents a kernel of work that itself can be represented as
a graph of the individual scalar operations. We exploited this hierarchy to simplify the simulation, scheduling
and performance estimation of applications implemented in heterogeneous systems. We presented a
graph-based modeling methodology to estimate the performance of a kernel by scheduling the operation graph
onto the functional units in a processor. This approach forgoes the requirement to have hardware-specific
implementations of each kernel in order to estimate their performance on multiple hardware choices. Instead,
our approach enables a single operation graph of the kernel to be used to estimate performance for any type of
processor. We have proven that the accuracy of this model is on average within 5-10% of the optimal solution
with a theoretical upper bound of less than twice optimal, and that it can be used to model the performance of
real world architectures. Although our goal was to estimate the performance of a kernel in a processor, we
found that many real world architectures cannot achieve the best possible performance. Our model estimates
the best possible performance of a given architecture, and we exploited this to improve the performance of
existing architectures. By using our model for every processor in place of the multiple models normally
used for each processor in a system, further simplification is possible for the simulation of heterogeneous
systems.
We presented a system simulation approach to schedule the kernels onto processors to estimate overall
application performance. The performance of compute-intensive applications can be improved using this
framework by manipulating the configuration of processors in the system. We have also shown that the
preferred scheduling policy on a truly heterogeneous system of processors depends greatly on the degree of
heterogeneity. Lastly, we presented code generation approaches for both fixed configuration and configurable
systems. We validated these approaches by implementing various applications, and achieved performance
improvements when compared to conventional processor systems. Our framework simplifies the design and
implementation of heterogeneous systems and tailors the configuration of processors to improve overall
application performance. We have shown that it is possible to build seamless implementations for truly
heterogeneous systems automatically.
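The system-level step described above, scheduling kernels onto processors to estimate overall application performance, can be sketched in the same way. The following is a simplified illustration rather than the simulator presented in this work: it assumes a greedy earliest-finish-time policy, ignores data transfer costs, and uses hypothetical kernel names and per-processor time estimates.

```python
def simulate_eft(order, deps, est):
    """Assign each kernel to the processor that gives the earliest finish time.

    order: kernel names in a valid topological order of the kernel graph
    deps:  dict kernel -> list of predecessor kernels
    est:   dict kernel -> dict processor -> estimated execution time
    Returns (assignment, makespan). Transfer costs are ignored (assumption).
    """
    proc_free = {}   # time at which each processor becomes idle
    finish = {}      # finish time of each scheduled kernel
    assignment = {}
    for k in order:
        # A kernel may start only after all of its predecessors finish.
        ready = max((finish[p] for p in deps.get(k, [])), default=0)
        best = None
        for proc, t in sorted(est[k].items()):
            end = max(ready, proc_free.get(proc, 0)) + t
            if best is None or end < best[1]:
                best = (proc, end)
        assignment[k] = best[0]
        finish[k] = best[1]
        proc_free[best[0]] = best[1]
    return assignment, max(finish.values(), default=0)


# Hypothetical application of four kernels with assumed time estimates.
order = ["load", "fft", "matmul", "reduce"]
deps = {"fft": ["load"], "matmul": ["load"], "reduce": ["fft", "matmul"]}
est = {
    "load":   {"cpu": 2, "gpu": 5},
    "fft":    {"cpu": 8, "gpu": 3},
    "matmul": {"cpu": 10, "gpu": 2, "fpga": 4},
    "reduce": {"cpu": 1, "gpu": 2},
}
assignment, makespan = simulate_eft(order, deps, est)
print(assignment, makespan)
```

Changing the set of processors listed in `est` and re-running the function is a small-scale analogue of manipulating the configuration of processors in the system to see how the estimated application performance responds.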
We have presented two categories of systems: fixed and configurable. Initially we sought to investigate
both high performance computing and embedded computing systems. Yet, we consistently encountered cases
where any system could be both high performance and embedded. This is consistent with the findings of
others [7]. As the number of cores scales on a single chip, the same problems faced in high performance
computing are now being found at the processor level. Similarly, as embedded computing has shifted to
heterogeneous cores for the benefit of power savings and performance-on-a-budget, so now high performance
computing has begun to delve into the heterogeneous processor space. Thus the line between embedded
and high performance computing is eroding; they are now even sharing the same application workloads.
Our design framework supports both high performance computing and embedded computing systems. In
our development flow, the application only needs to be analyzed once to be implemented in a variety of
system configurations. Moreover, once an application has been analyzed it can be implemented in either a
multi-processor system-on-chip (MPSoC) in the embedded space or a system of interconnected commercially
available (or custom designed) processors in the high performance computing space.
Programming models and associated languages are constantly evolving to abstract the underlying
architecture while still preserving visibility of key elements of the hardware. Moreover, a higher level of
abstraction is presented in this work that separates the kernel from the specific implementation for each processor.
This detachment clearly identifies the organization of the kernels in the application, without any loss of
clarity into the specific workload of the kernel and how it is implemented on different processors. This view
has its roots in array based processing in the matrix laboratory (Matlab) but we use it to create a separate
but unified design space for applications of kernels implemented on heterogeneous processors. Given the 13
Berkeley Dwarfs [7], we can even assign any possible type of kernel to one of a few specific categories. Each
category has its own pattern of computation and communication independent of the size of the workload
and the amount of parallelism. Yet, we have shown that even kernels within a single category cannot be
easily corralled for implementation in a specific type of processor. Our framework is poised to be one way to
navigate the expansive design space and return with an optimized solution for each application. By analyzing
systems containing such diverse processors (CPUs, GPUs, and FPGAs) we have proven the flexibility of this
approach to handle any new processor designs in the future.
Scheduling research has traditionally been dependent on the cost and availability of hardware and access
to a variety of kernel implementations for each processor. This empirical approach follows the conventional
wisdom that any problem can be solved with enough man-hours. Our framework sits at the crossroads of
performance estimation and scheduling to provide an ideal environment to design and test new scheduling
policies. We envision that by eliminating the cost and difficulty of actual hardware, new and clever policies
can be developed for the future computing landscape of heterogeneous processors. But the speed bump
on the road to improved scheduling is application benchmarks. Although the 13 Berkeley Dwarfs identify
the categories of kernels, we need to collect applications that have diverse compositions of dwarfs. In their
discussion of the dwarfs, the authors remark that any significant application (and they present MPEG4 as an
example) will contain multiple dwarfs that constitute the main workload of the application. Yet at the time
of writing, the latest "benchmark suites" such as Rodinia [27] only provide a benchmark of kernels from a
single category of dwarfs. Instead what is needed is a set of applications that are each composed of multiple
different types of dwarfs. The applications we presented in this work, including medical imaging and face
recognition, represent two examples that can be included in future heterogeneous application benchmarks.
In summary, possible future research opportunities include:
• Customizing the graph-based modeling approach for specific processor architectures like ARM Cortex,
TI DSP, Nvidia GPU, or ARM Mali architectures
• New heterogeneous scheduling policies that combine the global view of a static policy with the speed
and simplicity of a dynamic scheduling policy
• Further investigation to enable efficient implementation of applications as new heterogeneous on-chip
systems such as Zynq UltraScale MPSoC grow up to desktop scale and the two approaches for coarse
grain and cohesive integrated systems merge
• Development of a truly heterogeneous benchmark suite that showcases the abilities of such a diverse
set of processors as CPUs, GPUs, and FPGAs from both a high performance (discrete processors) and
tightly integrated (processors on-chip) perspective
Bibliography
[1] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer. Leap Scratchpads: Automatic Memory and
Cache Management for Reconfigurable Logic. ACM/SIGDA International Symposium on Field Programmable
Gate Arrays, Feb. 2011.
[2] N. Alachiotis and A. Stamatakis. Efficient Floating-point Logarithm unit for FPGAs. IEEE International
Symposium on Parallel Distributed Processing, Apr. 2010.
[3] N. Alachiotis and A. Stamatakis. FPGA Optimizations for a Pipelined Floating-point Exponential Unit.
International Conference on Reconfigurable Computing: Architectures, Tools and Applications, Mar. 2011.
[4] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities.
Proceedings of the Spring Joint Computer Conference, Apr. 1967.
[5] R. Ammendola, A. Biagioni, O. Frezza, F. Lo Cicero, A. Lonardo, P. S. Paolucci, D. Rossetti, F. Simula,
L. Tosoratto, and P. Vicini. Design and implementation of a modular, low latency, fault-aware, FPGA-based
Network Interface. International Conference on Reconfigurable Computing and FPGAs, Dec. 2013.
[6] H. Arabnejad and J. Barbosa. List Scheduling Algorithm for Heterogeneous Systems by an Optimistic Cost
Table. IEEE Transactions on Parallel and Distributed Systems, PP(99), Mar. 2013.
[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker,
J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from
Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec.
2006.
[8] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer,
35(2), 2002.
[9] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads using a Detailed GPU
Simulator. IEEE International Symposium on Performance Analysis of Systems and Software, Apr. 2009.
[10] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, M. Haldar, P. Joisha, A. Jones, A. Kanhare,
A. Nayak, S. Periyacheri, M. Walkden, and D. Zaretsky. A Matlab compiler for distributed, heterogeneous,
reconfigurable computing systems. IEEE Symposium on Field-Programmable Custom Computing Machines,
Apr. 2000.
[11] S. Banerjee, T. Hamada, P. Chau, and R. Fellman. Macro Pipelining based Scheduling on High Performance
Heterogeneous Multiprocessor Systems. IEEE Transactions on Signal Processing, 43(6), 1995.
[12] O. Beaumont, V. Boudet, and Y. Robert. The Iso-Level Scheduling Heuristic for Heterogeneous Processors.
Euromicro Workshop on Parallel, Distributed and Network-based Processing, Sept. 2002.
[13] A. Benoit, U. V. Çatalyürek, Y. Robert, and E. Saule. A Survey of Pipelined Workflow Scheduling: Models
and Algorithms. ACM Computing Surveys, 45(4), 2013.
[14] A. Benoit and Y. Robert. Mapping Pipeline Skeletons onto Heterogeneous Platforms. Journal of Parallel and
Distributed Computing, 68(6), 2008.
[15] M. D. Beynon. Supporting Data Intensive Applications in a Heterogeneous Environment. PhD thesis, College
Park, MD, USA, 2001.
[16] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna,
S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The GEM5 Simulator.
SIGARCH Computer Architecture News, 39(2), 2011.
[17] R. Bittner and E. Ruf. Direct GPU/FPGA Communication via PCI Express. International Conference on
Parallel Processing Workshops, Sept. 2012.
[18] C. Boeres, J. Filho, and V. Rebello. A Cluster-based Strategy for Scheduling Tasks on Heterogeneous Processors.
Symposium on Computer Architecture and High Performance Computing, Oct. 2004.
[19] K. Branco and M. Santana. A Novel Simulator for Evaluating Performance Indices on Heterogeneous Distributed
Systems Environments. IEEE International Symposium on Industrial Electronics, July 2006.
[20] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys,
B. Yao, D. Hensgen, and R. F. Freund. A Comparison of Eleven Static Heuristics for Mapping a Class of
Independent Tasks onto Heterogeneous Distributed Computing Systems. Journal of Parallel and Distributed
Computing, 61(6), 2001.
[21] C. Brunelli, F. Cinelli, D. Rossi, and J. Nurmi. A VHDL Model and Implementation of a Coarse-Grain
Reconfigurable Coprocessor for a RISC Core. Research in Microelectronics and Electronics, June 2006.
[22] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson.
LegUp: An Open-source High-level Synthesis Tool for FPGA-based Processor/Accelerator Systems. ACM
Transactions on Embedded Computing Systems, 13(2), 2013.
[23] L.-C. Canon, E. Jeannot, R. Sakellariou, and W. Zheng. Comparative Evaluation of The Robustness of DAG
Scheduling Heuristics. Grid Computing, 2008.
[24] T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley. The Yin and Yang of Power and Performance for
Asymmetric Hardware and Managed Software. International Symposium on Computer Architecture, June
2012.
[25] A. Carbon, Y. Lhuillier, and H.-P. Charles. Hardware Acceleration for Just-In-Time Compilation on
Heterogeneous Embedded Systems. IEEE International Conference on Application-specific Systems, Architectures and
Processors, June 2013.
[26] J. Ceng, J. Castrillon, W. Sheng, H. Scharwachter, R. Leupers, G. Ascheid, H. Meyr, T. Isshiki, and
H. Kunieda. MAPS: An integrated framework for MPSoC application parallelization. ACM/IEEE Design Automation
Conference, June 2008.
[27] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for
heterogeneous computing. IEEE International Symposium on Workload Characterization, Oct. 2009.
[28] E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-Chip Heterogeneous Computing: Does the Future
Include Custom Logic, FPGAs, and GPGPUs? IEEE/ACM International Symposium on Microarchitecture,
Dec. 2010.
[29] B. Cirou and E. Jeannot. Triplet: A Clustering Scheduling Algorithm for Heterogeneous Systems. International
Conference on Parallel Processing Workshops, Sept. 2001.
[30] J. Cong, M. Ghodrat, and M. Gill. CHARM: A Composable Heterogeneous Accelerator-rich Microprocessor.
ACM/IEEE International Symposium on Low Power Electronics and Design, July 2012.
[31] M. Corraine, S. Lopez, and L. Wang. GPU acceleration of transmural electrophysiological imaging. Computing
in Cardiology Conference, Sept. 2012.
[32] S. P. Crago and J. P. Walters. Heterogeneous Cloud Computing: The Way Forward. Computer, 48(1), 2015.
[33] R. Dennard, F. Gaensslen, H.-N. Yu, V. Rideout, E. Bassous, and A. R. Leblanc. Design Of Ion-implanted
MOSFET’s with Very Small Physical Dimensions. IEEE Journal of Solid State Circuits, 87(4), 1999.
[34] K. Denolf, S. Neuendorffer, and K. Vissers. Using C-To-Gates To Program Streaming Image Processing Kernels
Efficiently on FPGAs. International Conference on Field Programmable Logic and Applications, Aug. 2009.
[35] I. Dillig, T. Dillig, and A. Aiken. SAIL: Static Analysis Intermediate Language with a Two-Level Representation.
Retrieved Aug. 6, 2013 from www.cs.wm.edu/~idillig/sail.pdf.
[36] C. Economakos and G. Economakos. FPGA Implementation of PLC Programs Using Automated High-Level
Synthesis Tools. IEEE International Symposium on Industrial Electronics, June 2008.
[37] F. Edman and V. Owall. Implementation of a Highly Scalable Architecture for Fast Inversion of Triangular
Matrices. IEEE International Conference on Electronics, Circuits and Systems, Dec. 2003.
[38] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of
Multicore Scaling. International Symposium on Computer Architecture, June 2011.
[39] G. Falcao, M. Owaida, D. Novo, M. Purnaprajna, N. Bellas, C. Antonopoulos, G. Karakonstantis, A. Burg, and
P. Ienne. Shortening Design Time through Multiplatform Simulations with a Portable OpenCL Golden-model:
The LDPC Decoder Case. International Symposium on Field-Programmable Custom Computing Machines,
Apr. 2012.
[40] C. Fletcher, I. Lebedev, and N. Asadi. Bridging the GPGPU-FPGA efficiency gap. ACM/SIGDA International
Symposium on Field Programmable Gate Arrays, Feb. 2011.
[41] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering.
Addison-Wesley Longman Publishing Co. Inc., 1995.
[42] F. Fummi, M. Loghi, M. Poncino, and G. Pravadelli. A Cosimulation Methodology for HW/SW Validation
and Performance Estimation. ACM Transactions on Design Automation of Electronic Systems, Mar. 2009.
[43] M. R. Garey and D. S. Johnson. Strong NP-Completeness Results: Motivation, Examples, and Implications.
Journal of the ACM, 25(3), July 1978.
[44] Y. Gong, M. E. Pierce, and G. C. Fox. Dynamic Resource-Critical Workflow Scheduling in Heterogeneous
Environments. Job Scheduling Strategies for Parallel Processing, May 2009.
[45] P. Grigoras, X. Niu, J. G. F. Coutinho, W. Luk, J. Bower, and O. Pell. Aspect Driven Compilation for Dataflow
Designs. IEEE International Conference on Application-specific Systems, Architectures and Processors, June
2013.
[46] J. L. Gross, J. Yellen, and P. Zhang. Handbook of Graph Theory, Second Edition. Discrete Mathematics and
Its Applications, 2013.
[47] C. Grozea, Z. Bankovic, and P. Laskov. FPGA vs. Multi-core CPUs vs. GPUs: Hands-on Experience with a
Sorting Application. Facing The Multicore-Challenge, Sept. 2011.
[48] J. Gryba. Methodology for Board Level Functional Simulation and Hardware/Software Co-Verification Using
Seamless. Retrieved July 24, 2013 from http://go.mentor.com/2gnwq.
[49] M. Hariyama and M. Kameyama. Architecture of an FPGA-Oriented Heterogeneous Multi-core Processor
with SIMD-Accelerator Cores. The International Conference on Engineering of Reconfigurable Systems and
Algorithms, July 2010.
[50] J. Herrmann, J. M. Proth, and N. Sauer. Heuristics for Unrelated Machine Scheduling with Precedence
Constraints. European Journal of Operational Research, 102(3), 1997.
[51] B. Holland, A. D. George, H. Lam, and M. C. Smith. An Analytical Model for Multilevel Performance Prediction
of Multi-FPGA Systems. ACM Transactions on Reconfigurable Technology and Systems, 4(3), 2011.
[52] B. Holland, K. Nagarajan, and A. George. RAT: RC Amenability Test for Rapid Performance Prediction. ACM
Transactions on Reconfigurable Technology and Systems, 1(4), 2009.
[53] B. Hong and V. Prasanna. A modular and extensible simulator for performance evaluation of adaptive
applications in heterogeneous computing environments. International Conference on Algorithms and Architectures
for Parallel Processing, Oct. 2002.
[54] S. Hong and H. Kim. An Analytical Model for a GPU Architecture with Memory-level and Thread-level
Parallelism Awareness. International Symposium on Computer Architecture, June 2009.
[55] T. Hu. Parallel Sequencing and Assembly Line Problems. Operations Research, 19(6), 1961.
[56] G. Inggs, D. Thomas, and S. Winberg. Exploring the Latency-Resource Trade-off for the Discrete Fourier
Transform on the FPGA. International Conference on Field Programmable Logic and Applications, Aug. 2012.
[57] R. Inta, D. J. Bowman, and S. M. Scott. The “Chimera”: An Off-The-Shelf CPU/GPGPU/FPGA Hybrid
Computing Platform. International Journal of Reconfigurable Computing, 2012(2012), 2012.
[58] M. Jacobsen and R. Kastner. RIFFA 2.0: A Reusable Integration Framework for FPGA Accelerators. In
International Conference on Field Programmable Logic and Applications, Sept. 2013.
[59] P. G. Joisha, A. Kanhere, P. Banerjee, U. N. Shenoy, and A. Choudhary. The Design and Implementation
of a Parser and Scanner for the Matlab Language in the MATCH Compiler. Retrieved Jan. 20, 2015 from
http://www.ece.northwestern.edu/cpdc/TechReport/1999/09/CPDC-TR-9909-017.html, Sept. 1999.
[60] A. Khokhar, V. Prasanna, M. Shaaban, and C.-L. Wang. Heterogeneous Computing: Challenges and
Opportunities. Computer, 26(6), 1993.
[61] R. Kirchgessner, A. George, and H. Lam. Reconfigurable Computing Middleware for Application Portability
and Productivity. IEEE International Conference on Application-specific Systems, Architectures and Processors,
June 2013.
[62] W. Kritikos, A. Schmidt, R. Sass, E. Anderson, and M. French. Redsharc: A Programming Model and On-Chip
Network for Multi-Core Systems on a Programmable Chip. International Journal of Reconfigurable Computing,
2012.
[63] Y.-K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to
Multiprocessors. ACM Computing Surveys, 31(4), 1999.
[64] F. Labonte, P. Mattson, W. Thies, I. Buck, C. Kozyrakis, and M. Horowitz. The Stream Virtual Machine.
International Conference on Parallel Architecture and Compilation Techniques, Oct. 2004.
[65] M. Lastra, J. M. Mantas, C. Ureña, M. J. Castro, and J. A. García-Rodríguez. Simulation of Shallow-Water
Systems Using Graphics Processing Units. Mathematics and Computers in Simulation, 80(3), 2009.
[66] M. Laurenzano, M. Tikir, L. Carrington, and A. Snavely. PEBIL: Efficient Static Binary Instrumentation for
Linux. IEEE International Symposium on Performance Analysis of Systems & Software, Mar. 2010.
[67] M. Lee, W. Liu, and V. Prasanna. A Mapping Methodology for Designing Software Task Pipelines for Embedded
Signal Processing. International Parallel Processing Symposium, Apr. 1998.
[68] R. Leupers and J. Castrillon. MPSoC programming using the MAPS compiler. Asia and South Pacific Design
Automation Conference, Jan. 2010.
[69] D. Li, K. Sajjapongse, H. Truong, G. Conant, and M. Becchi. A Distributed CPU-GPU Framework for Pairwise
Alignments on Large-Scale Sequence Datasets. IEEE International Conference on Application-specific Systems,
Architectures and Processors, June 2013.
[70] C. Liu and S. Yang. A Heuristic Serial Schedule Algorithm for Unrelated Parallel Machine Scheduling with
Precedence Constraints. Journal of Software, 6(6), June 2011.
[71] G. Q. Liu, K. L. Poh, and M. Xie. Iterative List Scheduling for Heterogeneous Computing. Journal of Parallel
and Distributed Computing, 65(5), Jan. 2005.
[72] D. Llamocca, C. Carranza, and M. Pattichis. Separable FIR Filtering in FPGA and GPU Implementations:
Energy, Performance, and Accuracy Considerations. International Conference on Field Programmable Logic
and Applications, Sept. 2011.
[73] J. Lobeiras, M. Viñas, M. Amor, B. B. Fraguela, M. Arenaz, J. García, and M. Castro. Parallelization of
Shallow Water Simulations on Current Multi-Threaded Systems. International Journal of High Performance
Computing Applications, 27(4), 2013.
[74] S. M. Loo, B. E. Wells, N. Freije, and J. Kulick. Handel-C for Rapid Prototyping of VLSI Coprocessors for
Real Time Systems. Southeastern Symposium on System Theory, Mar. 2002.
[75] J. Maassen, N. Drost, H. E. Bal, and F. J. Seinstra. Towards Jungle Computing with Ibis / Constellation.
Dynamic Distributed Data-intensive Applications, Programming Abstractions, and Systems, June 2011.
[76] D. Majeti, K. S. Meel, R. Barik, and V. Sarkar. ADHA: Automatic Data Layout Framework for Heterogeneous
Architectures. International Conference on Parallel Architectures and Compilation, Aug. 2014.
[77] M. Marques, G. Quintana-Orti, E. Quintana-Ortí, and R. van de Geijn. Solving large dense matrix problems
on multi-core processors. IEEE International Parallel & Distributed Processing Symposium, May 2009.
[78] P. Mattison and W. Thies. Streaming virtual machine specification, version 1.2, technical report, Jan. 2007.
[79] Y. Nakamura and A. Trybulec. A Mathematical Model of CPU. Journal of Formalized Mathematics, 4(1),
1992.
[80] G. C. Necula, S. McPeak, S. P. Rahul, and W. Weimer. CIL: Intermediate Language and Tools for Analysis
and Transformation of C Programs. International Conference on Compiler Construction, Apr. 2002.
[81] Octave community. GNU Octave 3.8.1. Retrieved Jan. 20, 2015 from http://wiki.octave.org/FAQ.
[82] N. Padhariya, K. Paul, and D. Bhardwaj. A FLOPs Based Model for Performance Analysis and Scheduling of
Applications for Single and Multiple CPUs. International Conference on Parallel Processing Workshops, Aug.
2006.
[83] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2014. Section 1.7, pg. 40 Fig. 1.16.
[84] A. Prasad, J. Anantpur, and R. Govindarajan. Automatic Compilation of Matlab Programs for Synergistic
Execution on Heterogeneous Processors. ACM SIGPLAN Conference on Programming Language Design and
Implementation, June 2011.
[85] R. Puigjaner. Performance Modeling of Computer Networks. IFIP/ACM Latin America Conference on Towards
a Latin American Agenda for Network Research, Oct. 2003.
[86] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers,
G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson,
S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger. A Reconfigurable Fabric for Accelerating Large-scale
Datacenter Services. International Symposium on Computer Architecture, Aug. 2014.
[87] P. Ratnalikar and A. Chauhan. Automatic Parallelism Through Macro Dataflow in High-level Array Languages.
International Conference on Parallel Architectures and Compilation, Aug. 2014.
[88] P. Ratnalikar and A. Chauhan. Automatic Parallelism through Macro Dataflow in Matlab. International
Workshop on Languages and Compilers for Parallel Computing, Sept. 2014.
[89] M. L. Saetra and A. R. Brodtkorb. Shallow Water Simulations on Multiple GPUs. In Applied Parallel and
Scientific Computing, volume 7134. 2012.
[90] A. Schmidt, W. Kritikos, R. Sass, E. Anderson, and M. French. Merging Programming Models and On-chip
Networks to Meet the Programmable and Performance Needs of Multi-core Systems on a Programmable Chip.
International Conference on Reconfigurable Computing and FPGAs, Dec. 2010.
[91] A. Schmidt, N. Steiner, M. French, and R. Sass. HwPMI: An Extensible Performance Monitoring
Infrastructure for Improving Hardware Design and Productivity on FPGAs. International Journal of Reconfigurable
Computing, 2012.
[92] K. Shagrithaya, K. Kepa, and P. Athanas. Enabling Development of OpenCL Applications on FPGA platforms.
IEEE International Conference on Application-specific Systems, Architectures and Processors, June 2013.
[93] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-performance Accelerator
Simulator Enabling Large Design Space Exploration of Customized Architectures. International Symposium on
Computer Architecture, June 2014.
[94] C.-Y. Shei, P. Ratnalikar, and A. Chauhan. Automating GPU Computing in Matlab. International Conference
on Supercomputing, June 2011.
[95] H. Shen and Q. Qiu. An FPGA-Based Distributed Computing System with Power and Thermal Management
Capabilities. International Conference on Computer Communications and Networks, July 2011.
[96] W. Sheng, S. Schürmans, M. Odendahl, M. Bertsch, V. Volevach, R. Leupers, and G. Ascheid. A Compiler
Infrastructure for Embedded Heterogeneous MPSoCs. International Workshop on Programming Models and
Applications for Multicores and Manycores, Feb. 2013.
[97] J. Sim, A. Dasgupta, R. Vuduc, and H. Kim. A Performance Analysis Framework for Identifying Potential
Benefits in GPGPU Applications. ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Feb. 2012.
[98] N. Singh, C. Gibbs, D. Pucsek, M. Salois, J. Wall, and Y. Coady. Spinal Tap: High Level Analysis for Heavy
Metal Systems. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Aug.
2011.
[99] R. Sinha, A. Prakash, and H. D. Patel. Parallel Simulation of Mixed-abstraction SystemC Models on GPUs
and Multicore CPUs. Asia and South Pacific Design Automation Conference, Jan. 2012.
[100] S. Skalicky, S. Lopez, M. Lukowiak, J. Letendre, and D. Gasser. Linear Algebra Computations in Heterogeneous
Systems. IEEE International Conference on Application-specific Systems, Architectures and Processors, June
2013.
[101] S. Skalicky, S. Lopez, M. Lukowiak, J. Letendre, and M. Ryan. Performance Modeling of Pipelined Linear
Algebra Architectures on FPGAs. International Symposium on Applied Reconfigurable Computing, Mar. 2013.
[102] S. Skalicky, A. G. Schmidt, and M. French. High Level Hardware/Software Embedded System Design with
Redsharc. International Workshop on FPGAs for Software Programmers, Sept. 2014.
[103] S. Skalicky, A. G. Schmidt, S. Lopez, and M. French. A Unified Hardware/Software MPSoC System
Construction and Run-Time Framework. Conference on Design, Automation and Test in Europe, Mar. 2015.
[104] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha. A Framework for Performance
Modeling and Prediction. ACM/IEEE Conference on Supercomputing, Nov. 2002.
[105] I. Sotiropoulos and I. Papaefstathiou. A Fast Parallel Matrix Multiplication Reconfigurable Unit Utilized in
Face Recognitions Systems. International Conference on Field Programmable Logic and Applications, Aug.
2009.
[106] I. Sotiropoulos and I. Papaefstathiou. A Fast Parallel Matrix Multiplication Reconfigurable Unit Utilized in
Face Recognitions Systems. International Conference on Field Programmable Logic and Applications, Sept.
2009.
[107] M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman, and J. Saltz. Executing Multiple
Pipelined Data Analysis Operations in the Grid. ACM/IEEE Conference on Supercomputing, Nov. 2002.
[108] J. Subhlok and G. Vondran. Optimal Latency-Throughput Tradeoffs for Data Parallel Pipelines. ACM
Symposium on Parallel Algorithms and Architectures, June 1996.
[109] O. Svensson. Hardness of Precedence Constrained Scheduling on Identical Machines. ACM Symposium on
Theory of Computing, June 2010.
[110] Y.-M. Teo, Y. Chen, and X. Wang. On Grid Programming and Matlab*G. Grid and Cooperative Computing,
3251, 2004.
[111] Y. Thoma, A. Dassatti, and D. Molla. FPGA2: An Open Source Framework for FPGA-GPU PCIe
Communication. International Conference on Reconfigurable Computing and FPGAs, Dec. 2013.
[112] H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for
Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 13(3), 2002.
[113] N. Travinin Bliss and J. Kepner. pMatlab Parallel Matlab Library. International Journal on High
Performance Computing Applications, 21(3), 2007.
[114] M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 1991.
[115] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2Sim: A Simulation Framework for CPU-GPU
Computing. International Conference on Parallel Architectures and Compilation Techniques, Sept. 2012.
[116] L. Wang, K. C. L. Wong, H. Zhang, H. Liu, and P. Shi. Noninvasive Computational Imaging of Cardiac
Electrophysiology for 3-D Infarct. IEEE Transactions on Biomedical Engineering, 58(4), 2010.
[117] J. Wu, W. Shi, and B. Hong. Dynamic Kernel/Device Mapping Strategies for GPU-Assisted HPC Systems.
Job Scheduling Strategies for Parallel Processing, May 2012.
[118] J. Xu, N. Subramanian, A. Alessio, and S. Hauck. Impulse C vs. VHDL for Accelerating Tomographic
Reconstruction. IEEE International Symposium on Field-Programmable Custom Computing Machines, May 2010.
[119] D. Yang, G. Peterson, and H. Li. Compressed Sensing and Cholesky Decomposition on FPGAs and GPUs.
Parallel Computing, 38(8), 2012.
[120] D. Yang, J. Sun, and J. Lee. Performance Comparison of Cholesky Decomposition on GPUs and FPGAs.
Symposium on Application Accelerators in High-Performance Computing, July 2010.
[121] M. T. Yourst. PTLsim : A Cycle Accurate Full System x86-64 Microarchitectural Simulator. IEEE International
Symposium on Performance Analysis of Systems & Software, Apr. 2007.
[122] H. Zeng, M. Yourst, K. Ghose, and D. Ponomarev. MPTLsim : A Cycle-Accurate , Full-System Simulator for
x86-64 Multicore Architectures with Coherent Caches. ACM SIGARCH Computer Architecture News, 37(2),
2009.
[123] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. M. Rabaey. A 1-V Heterogeneous
Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing. IEEE Journal of Solid State Circuits,
35(11), 2000.
[124] L. Zhuo and V. K. Prasanna. High-Performance Designs for Linear Algebra Operations on Reconfigurable
Hardware. IEEE Transactions on Computers, 57(8), 2008.
