MULTI-OBJECTIVE DESIGN AUTOMATION FOR RECONFIGURABLE MULTI-PROCESSOR SYSTEMS by PHAM NAM KHANH
MULTI-OBJECTIVE DESIGN AUTOMATION FOR
RECONFIGURABLE MULTI-PROCESSOR SYSTEMS
PHAM NAM KHANH
NATIONAL UNIVERSITY OF SINGAPORE
2016
PHAM NAM KHANH
(B.Eng.(Hons.), South Russian State Technical University, Russia)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL & COMPUTER
ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2016
Supervisors:
Professor Lim Teng Joon, Main Supervisor
Professor Akash Kumar, Co-Supervisor
Dr. Khin Mi Mi Aung, Co-Supervisor
Examiners:
Associate Professor Bharadwaj Veeravalli
Dr. Yajun Ha, I2R, A*STAR
Assistant Professor Sander Stuijk, Eindhoven University of Technology
Declaration
I hereby declare that this thesis is my original work and it has been written by me in its
entirety. I have duly acknowledged all the sources of information which have been used
in the thesis.




As a life-long learner, I have always considered the Ph.D. journey a great opportunity to
expand my knowledge, develop new skills and shape my character.
During the last four years, I have been very lucky to have great mentors and classmates
who have always helped me to make the most of this learning experience. I would like to use the
first part of my thesis to express my deepest gratitude to all of them.
First of all, I would like to express my sincere gratitude to my main supervisor, Dr.
Akash Kumar. His immense knowledge of the field was invaluable in defining the
research topic, his bright ideas and insightful discussions guided me through the
difficulties along the way, and his perceptive feedback and comments were essential to
all of my research work. His passion for teaching and research has inspired me
to set a long-term goal of becoming an equally supportive mentor and enthusiastic researcher.
I would especially like to thank my co-supervisor, Dr. Khin Mi Mi Aung, for her
continuous support and guidance. She generously offered me the chance to embark
on this journey and the opportunity to work in the professional environment of the
Data Storage Institute. Her patience and tolerance of my mistakes helped me to
overcome difficult times. In her, I can see the values of a true leader: taking care of
other people’s interests, making room for their development and continuously supporting
them to achieve their goals. These merits will be the compass for both my career and
personal growth.
My special appreciation also goes to my co-supervisor, Prof. Lim Teng Joon, for
taking care of the administrative work during the final year of my candidature. Without
his responsive feedback, I would not have finished this thesis as scheduled. In addition
to my supervisors, I would like to thank the rest of my Thesis Advisory Committee: Dr.
Ha Yajun and Prof. Bharadwaj Veeravalli. Their comments and suggestions helped
me to widen the scope of my thesis and gain a more comprehensive view of the research
topic.
I would also like to thank my research partner and guide, Dr. Amit Kumar Singh. Working
closely with him on most of my projects, I have learned best practices from his coding
experience, critical thinking, and academic writing skills. Above all, his willingness
to share and his open-minded discussions made our collaboration a joyful and
memorable experience. Furthermore, I would like to acknowledge my fellow lab mates
in the MPSOC research group at NUS, especially Tuan, Siva, Rui, Chin Hau, Liang and
Shyam, for the stimulating discussions during coffee breaks and the enjoyable
hang-out activities.
Last but not least, I would like to express my heartfelt gratitude to my parent and
sister for their unconditional support and encouragement, not only during this journey but
throughout my life. I would like to thank my lovely fiancée, Thanh Duc, for continuously
filling my life with love, hope, and happiness. I am so lucky to have such a great source




Contents
List of Tables
List of Figures
1 Introduction and Background
1.1 Reconfigurable Multiprocessors Systems
1.1.1 Trend on IC Development
1.1.2 Trend on Multiprocessor Systems
1.1.3 Trend on FPGA and Reconfigurable Hardware Accelerator
1.1.4 Classification of Reconfigurable MPS
1.2 Design Automation for Reconfigurable MPS and Challenges
1.2.1 Application Level Analysis
1.2.2 Macro/System Level Synthesis and Exploration
1.2.3 Micro/Device Level Synthesis and Exploration
1.3 Research Objectives and Contributions
1.3.1 Research Objectives
1.3.2 Contributions
1.3.3 Organization of the Thesis
2 Design Automation Tools for Reconfigurable MPS
2.1 Application Level
2.1.1 Application Profiling
2.1.2 Code Restructuring
2.1.3 Loop Optimization Details
2.2 Macro/System Level Synthesis and Exploration
2.2.1 Modelling Tools
2.2.2 HW/SW Partitioning
2.2.3 Mapping and Scheduling
2.2.4 Design Space Exploration Tools
2.3 Micro/Device Level Synthesis and Exploration
2.3.1 DSE for HLS
2.3.2 Auto-generation tools
2.4 Summary
3 Throughput and Energy-Aware Mapping Approach
3.1 Preliminaries
3.1.1 Application Model
3.1.2 Heterogeneous Reconfigurable MPS Model
3.2 Proposed Mapping Strategy
3.2.1 Design-time DSE
3.2.2 Run-time Mapping
3.3 Performance Evaluation
3.4 Summary
4 Multistage Leakage-aware Scheduling Technique
4.1 System Model and Problem Definition
4.2 Proposed Multi-stage Resource Management Approach
4.2.1 Scheduling Stage
4.2.2 Placement Stage
4.2.3 Post-placement Heuristic
4.3 Experimental Results
4.3.1 Leakage Waste and Schedule Length
4.3.2 Post-placement Leakage Waste and Algorithm Runtime
4.3.3 Case-study: Real-life Applications
4.4 Summary
5 Machine Learning and Genetic Algorithm for Multi-objective DSE
5.1 Overall Framework
5.1.1 Phase 1: Generating the Training Database
5.1.2 Phase 2: Building the Predictive Models
5.1.3 Phase 3: Prediction at Execution Time
5.1.4 Advantages
5.2 Phase 1: Generating the Training Database
5.2.1 Generating the Pareto Fronts with Genetic Algorithm
5.2.2 Normalizing the Pareto Fronts to Uniform Curves
5.3 Phase 2: Building the Predictive Models
5.3.1 Build Spline Regression for Pareto Curves
5.3.2 Build Linear Regression for Spline’s Volatile Coefficients and Pareto’s Range
5.4 Phase 3: Applying the ML Models for Prediction at Runtime
5.5 Experimental Results
5.5.1 Results for Scheduling Heuristic
5.5.2 Result for Mapping Heuristic
5.6 Summary
6 Exploiting Loop-Array Dependencies for Design Space Exploration with High Level Synthesis
6.1 Proposed DSE Framework for HLS
6.2 Loop-Array Dependency Graph
6.2.1 Polyhedral Model
6.2.2 Array Access Pattern
6.2.3 Loop-Array Dependency Graph
6.3 Array Parameter Computation Block
6.3.1 Array Partition Strategy
6.3.2 Access Pattern Simulator
6.3.3 Pareto Optimization Filter
6.4 Experimental Results
6.4.1 Quality of Pareto Front
6.4.2 Execution Time and Number of Evaluations
6.5 Summary
7 Auto-generating Hardware Accelerators for Option Pricing Applications
7.1 Option Pricing Basics
7.1.1 Option Overview
7.1.2 Option Pricing Problem
7.1.3 Pricing Methods
7.2 Design Flow and Optimization Framework
7.3 Pricing Engine Architecture
7.3.1 MC Method Overview
7.3.2 HW Design of MC Engine
7.4 Experimental Results
7.4.1 Comparison with SW Implementations
7.4.2 Comparison with other HW Accelerators
7.5 Summary
8 Conclusions and Future Directions
8.1 Conclusions
8.2 Future Directions
8.2.1 System Level Synthesis
8.2.2 Micro Level Synthesis
Summary
With their ability to adapt both hardware and software to application requirements,
Reconfigurable Multi-Processor Systems (MPS) offer high computational performance
and low power consumption at the same time. They also overcome
the high cost and long development time of ASIC solutions, and they
provide adequate robustness and adaptability for modern computing systems. They are
therefore strong candidates for a wide range of applications across various industry
domains. However, along with these advantages, the combination of software and
hardware reconfigurability drastically increases the complexity of the system and
imposes new challenges on design methodologies and design automation tools. Furthermore,
the advancement of modern computing platforms and their spread into all
areas of human life put new emphasis not only on performance but also on
robustness, reliability, and the energy and power requirements of electronic systems.
These challenges have opened a large gap between the advancement of reconfigurable
hardware and the tools supporting its development process.
With the aim of making the design flow for Reconfigurable MPS more productive
and autonomous, this thesis provides algorithms and tools that accelerate the development
process at both the macro and micro levels of synthesis. Moreover, these tools are
developed to cope with different design criteria (throughput, latency, energy consumption,
leakage power, etc.) and enable users to explore the trade-offs between them. To
deal with the high complexity of macro-level synthesis, a mapping algorithm is proposed
for efficiently assigning tasks to processing elements on heterogeneous platforms
with reconfigurable hardware. Furthermore, a heuristic has been developed for scheduling
hardware tasks on reconfigurable devices to explore the trade-off between performance
and leakage power consumption. Using Machine Learning and a Genetic Algorithm, an
optimization framework for priority-based scheduling and mapping heuristics has been
proposed to address the time-consuming design space exploration (DSE) process. For
micro-level synthesis, a framework that exploits the loop structure has been implemented to shorten the
development time with HLS tools. Finally, an accelerator auto-generation tools for op-




List of Tables
1.1 Existing Reconfigurable MPS platforms
2.1 Comparison of various approaches
3.1 Example of 5 actors and 3 tile-types
3.2 Notations to be used
3.3 Complexity
3.4 Covered resource combinations
3.5 Number of mappings with three tile-types
3.6 Energy saving using our runtime technique
4.1 Leakage waste and algorithm runtime of post-placement methods
4.2 Leakage waste and schedule length for real-life applications
5.1 Execution time and quality comparison
5.2 Results for Mapping Heuristic
6.1 Quality of the design space
7.1 Implementation for different Option Types
7.2 Comparison with other HW implementations
List of Figures
1.1 Continuation of Moore’s law [107]
1.2 Example of tightly-coupled architectures [77]
1.3 Example of loosely-coupled architectures
1.4 Example of fine-grained (a) and coarse-grained (b) architectures
1.5 Design flow for Reconfigurable MPS
1.6 Power requirement gap of mobile devices [12]
1.7 Contributions in the thesis
2.1 Automation tools for Application Level
2.2 Design automation tools in Macro/System Level
2.3 Design automation tools in Micro/Device Level
3.1 SDFG model of an H.263 Decoder
3.2 An example multiprocessor platform
3.3 Overall flow of proposed mapping strategy
3.4 Example of Run-time Mapping
3.5 Throughput and energy of MPEG
3.6 Execution time of different DSE strategies
3.7 Throughput and energy consumption for H263 Decoder at different resource combinations for different optimization criteria
4.1 Example of Leakage Waste caused by Prefetching Technique
4.2 Multi-stage Scheduling Scheme
4.3 Leakage and Schedule Length when employing Different Approaches
5.1 Proposed framework
5.2 Original Pareto front of 5 task graphs
5.3 Normalized Pareto fronts and their Spline Models
5.4 Details of Model Building and Prediction Phases
5.5 Boxplot of Spline’s coefficients
5.6 Details of Trace back module
5.7 Result for single TG from fat group
5.8 Combined result of all TGs in test set
5.9 Pareto fronts generated from GA and our framework
5.10 Pareto fronts generated for Mapping Heuristic
6.1 Motivational example
6.2 Overall flow of proposed mapping strategy
6.3 Example of matrix multiplication program
6.4 APCB architecture
6.5 Results for DSE on array parameters
6.6 Overall DSE
7.1 Option pricing procedure
7.2 Generic framework
7.3 Generic architecture of pricing engines
7.4 Poisson Generator block
7.5 GNG block for SV models

1 Introduction and Background
1.1 Reconfigurable Multiprocessors Systems
1.1.1 Trend on IC Development
Since its introduction in the 1970s, Moore’s Law, which predicted that the number of
transistors on a chip would double every 18 months [102], has been the guiding factor of computing
performance for the semiconductor industry, as shown in Fig. 1.1. In the beginning, chip
manufacturers relied on Dennard’s principle [50] to keep up the pace of development
by scaling the transistor size. For nearly three decades, each scaling of the transistor
dimensions to 70% of the previous generation doubled the number of transistors on the chip
and increased the frequency by 40% while the power requirement remained unchanged [50]. However, since the early
2000s, transistor scaling has slowed down due to increasing leakage current, which
led to an era of increasing clock frequency to maintain the pace of computing performance.
Chip manufacturers could not stay in this direction for long because of
the exponential increase in chip power density as the clock frequency was scaled. As a
result, they reached the limits of the available cooling technology and the clock
frequency reached a plateau. Once again, the industry returned to the spatial solution
of increasing the number of processing cores on a chip and providing parallel computing
solutions.
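The scaling figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the classical constant-field scaling rules: linear dimensions, supply voltage and gate capacitance all shrink by the same factor k = 0.7, and clock frequency grows by 1/k. These rules and the function name `dennard_generation` are illustrative assumptions, not taken from this thesis.

```python
def dennard_generation(k: float = 0.7) -> dict:
    """Per-generation scaling factors under ideal constant-field scaling.

    Assumption: dimensions, voltage and capacitance scale by k,
    frequency scales by 1/k (idealized textbook model).
    """
    density = 1.0 / (k * k)        # transistors per unit area
    frequency = 1.0 / k            # shorter gates switch faster
    capacitance = k                # per-transistor capacitance
    voltage = k                    # supply voltage
    # Dynamic power of one transistor: P = C * V^2 * f
    power_per_transistor = capacitance * voltage ** 2 * frequency
    # Power per unit area = power per transistor * transistors per area
    power_density = power_per_transistor * density
    return {
        "density": density,
        "frequency": frequency,
        "power_per_transistor": power_per_transistor,
        "power_density": power_density,
    }


if __name__ == "__main__":
    for name, factor in dennard_generation().items():
        print(f"{name}: x{factor:.2f}")
```

Under these assumptions the transistor count per area roughly doubles and the frequency rises by roughly 40%, while the power per unit area stays constant. Once the supply voltage can no longer be lowered, for example because of leakage, the last property no longer holds.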
Figure 1.1: Continuation of Moore’s law [107]
1.1.2 Trend on Multiprocessor Systems
The era of parallel computing was pioneered by research teams from IBM and Intel
with the concept of multi-core processors. These processing units contain two or more
independent processing components (such as arithmetic and logic units, ALUs) which can
read and execute multiple instructions in parallel. Multi-core processors addressed
the thermal, power and energy problems faced by the frequency scaling approach well.
As an approximation, a dual-core processor running at only half the clock speed can achieve the
same computational performance as its single-core competitor, while consuming less energy
and operating at a lower temperature due to better power distribution and scattered
thermal hotspots. These advantages have driven companies
in this direction, and a number of multi-core and many-core processors have been released: 2 cores
(Intel Core2 Duo, 2006), 4 cores (Intel Core i5, 2009), 6 cores (Intel Core i7, 2010) and
8 cores (Intel Xeon 2820, 2011).
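A back-of-the-envelope model illustrates the dual-core comparison above. The sketch assumes dynamic power P = C·V²·f per core, a perfectly parallel workload, and a supply voltage that can be lowered in proportion to the clock frequency. These are simplifying assumptions introduced here for illustration, and the function names are hypothetical; real savings are smaller because of leakage, imperfect parallelism and voltage floors.

```python
def dynamic_power(cores: int, freq: float, volt: float, cap: float = 1.0) -> float:
    """Total dynamic power of `cores` identical cores: cores * C * V^2 * f."""
    return cores * cap * volt ** 2 * freq


def ideal_throughput(cores: int, freq: float) -> float:
    """Work per unit time, assuming the workload parallelizes perfectly."""
    return cores * freq


# Normalized single-core baseline: f = 1, V = 1.
single_power = dynamic_power(cores=1, freq=1.0, volt=1.0)

# Dual core at half the clock; voltage assumed to scale with frequency.
dual_power_scaled_v = dynamic_power(cores=2, freq=0.5, volt=0.5)

# Dual core at half the clock, voltage left unchanged.
dual_power_fixed_v = dynamic_power(cores=2, freq=0.5, volt=1.0)

if __name__ == "__main__":
    print("throughput single:", ideal_throughput(1, 1.0))
    print("throughput dual:  ", ideal_throughput(2, 0.5))
    print("power single:", single_power)
    print("power dual (V scaled):", dual_power_scaled_v)
    print("power dual (V fixed): ", dual_power_fixed_v)
```

In this idealized model the throughput is identical in all three cases, and the power saving comes entirely from the lower supply voltage that the reduced clock frequency permits.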
The same trend toward parallelism is observed at the system level with the growth
of Multiprocessor Systems (MPS). These systems combine a number of dependent processors
(single-core or multi-core) with hardware subsystems that coordinate these
processors (memory, bus, connections) to provide further concurrent processing power
and serve greater computational demand. Lucent’s Daytona chip [15], introduced
in 2000, is the first known MPS, integrating multiple homogeneous processors. Following
this breakthrough, there was a significant drive towards MPS development,
especially in the early 2000s. Examples include Nexperia [114] from Philips
Semiconductor, OMAP [45] from Texas Instruments and Nomadik [115] from STMicroelectronics.
Since then, MPS technology has matured to a great extent, growing
in complexity and size. Current MPS include UniPhier [110] from Panasonic
Semiconductor and Platform 2012, also known as STHORM [105], from STMicroelectronics.
1.1.3 Trend on FPGA and Reconfigurable Hardware Accelerator
Although the number of transistors in MPS still follows Moore’s Law, doubling every generation,
transistor performance and energy efficiency have not improved at the same
pace. This leads to a new issue called “dark silicon”, where part
of the chip must be powered off at any given time because of the limited chip power budget. A recent study shows
that 50% of a chip might be “dark” within three process generations [56]. To
address this major challenge, the industry has to find a way to use transistors more
efficiently and extract more computing power from existing hardware. This naturally
shifts the industry to a new paradigm of adding customizable components to MPS that
can be specifically configured to fit the requirements of high-level workloads.
When it comes to custom hardware logic, chip designers have several options.
The application-specific integrated circuit (ASIC) was the standard way to
implement custom logic for the specific needs of an application because of its high
performance and power efficiency. However, its high development cost and long time to
market make it infeasible for many applications. Addressing these
problems, the field-programmable gate array (FPGA) offers another way to implement hardware
accelerators and has recently gained significant attention in both academia and
industry. Containing fine-grained logic blocks with programmable interconnections
between them, FPGAs allow designers to implement arbitrary digital components
without fabricating a custom chip. Traditionally, FPGAs were known as a
way of creating prototypes for testing and verification of ASIC systems. Nowadays,
final designs and commercial products are also implemented on FPGAs. Advances
in IC technology allow entire complex computing systems to be implemented
on an FPGA board. State-of-the-art FPGA devices offer both soft-core and hard-core
processors with on-chip memory blocks, peripherals and other functional units that can
meet a wide range of application requirements.
An MPS integrated with reconfigurable logic from an FPGA is called a Reconfigurable
MPS. Run-time reconfigurability is one of the unique features of Reconfigurable MPS:
it allows these systems to be adapted to a specific application, providing
flexibility in the designed system. The main disadvantages of a Reconfigurable
MPS compared with a traditional custom ASIC implementation are longer
execution time and lower power efficiency. However, a Reconfigurable MPS has a number
of advantages over its ASIC competitor.
• Shorter time-to-market: for Reconfigurable MPS, the design procedure does not need
to take into account the manufacturing process of the IC, which substantially
decreases design time.
• Flexibility and reconfiguration: the reconfigurable hardware resource makes the platform
suitable for a wide range of applications. Further, it is possible to customize
each component independently by adding functional units to satisfy the specific
requirements of particular applications.
• Lower cost: the design process is less expensive thanks to the ability to
reuse common IP cores across similar products. Furthermore, the cost of maintenance
and repair is also considerably decreased due to the reconfigurable potential
of the system.
• Scalability: in a Reconfigurable MPS, functionality
can easily be added or removed through reconfiguration.
• Fault tolerance: any detected fault in the accelerator can be repaired through
reconfiguration, by upgrading a component or simply swapping out hardware
functionality without redesigning the physical board.
Taking these advantages into account, the work in this dissertation focuses on
Reconfigurable MPS as the target computing platform.
1.1.4 Classification of Reconfigurable MPS
Because of the heterogeneity of the system itself, Reconfigurable MPS can be
classified according to various aspects: the architecture of the General Purpose
Processors (GPP), the role and architecture of the reconfigurable hardware unit, the
interconnection and cooperation between them, etc. The authors in [40] have
put together a comprehensive classification of Reconfigurable MPS. This section
limits the categorization to the main aspects that are related to the applicability of
our contributions to these platforms. In other words, these classifications help to
differentiate between Reconfigurable MPS that can benefit from our contributions and
those to which our contributions cannot be applied.
Tightly Coupled and Loosely Coupled
The first way to classify Reconfigurable MPS is based on the role and location of the
reconfigurable resources in the top level system and the way they cooperate with general
purpose processors.
• The tightly coupled architectures implement a single instruction
stream. In this class of Reconfigurable MPS, the reconfigurable hardware is usually
placed as a part of the CPU or as a segment inside the Data and Control Path
(DCP) of the GPP to extend the instruction set of the GPP with reconfigurability. As a
result, both the fixed hardware and the reconfigurable parts execute under the
same instruction stream. Two sample implementations of this class are presented
in Fig. 1.2. In Fig. 1.2a, the reconfigurable hardware is placed inside the CPU
and can be either a DCP segment or a Reconfigurable Unit (RFU). In Fig. 1.2b, the
reconfigurable hardware stays outside the CPU but is still under the control of the
CPU and executes the extended instruction set of the CPU.
Because of the tight coupling, these architectures are usually implemented
on the same chip and the communication cost between the fixed and
reconfigurable parts is relatively small. Generally, tightly coupled systems
are used to exploit instruction-level and loop-level parallelism efficiently. The
main disadvantage of these systems is the limited amount of reconfigurable
hardware that can be integrated with the GPP. Furthermore, this type of system
requires a custom GPP design and hence a large development effort for both
the hardware design and the compilation toolset.
• Loosely coupled systems can execute multiple instruction streams, and the
reconfigurable hardware is usually implemented independently as a co-processor of
the GPPs with its own memory and interconnection fabric. Hence, in loosely coupled
architectures, multiple tasks can run simultaneously on the GPPs and the
reconfigurable resources, which makes these systems better suited to exploiting
application-level and task-level parallelism. However, such systems often
experience high communication overheads between the GPPs and the reconfigurable
hardware. Therefore, to benefit from these platforms, the execution speed-up of
the reconfigurable hardware has to compensate for the communication overhead.
Figure 1.2: Example of tightly-coupled architectures [77]: (a) RFU inside CPU, (b) RFU outside CPU
Figure 1.3: Example of loosely-coupled architectures
As a result, they are not good candidates for applications with small computation
tasks and intensive communication. Despite this limitation, loosely coupled
systems have been adopted more widely and have reached mainstream computing
faster, owing to several advantages over their tightly coupled counterparts.
First, loosely coupled architectures allow for larger reconfigurable subsystems
and more parallelism than tightly coupled ones. Second, they can be built from
existing commercial reconfigurable devices with mature design toolkits from the
manufacturers. Given the high potential and popularity of loosely coupled
reconfigurable systems, the contributions of this thesis focus on this kind of
architecture. An example of a loosely coupled architecture is presented in
Figure 1.3.
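The condition that the hardware speed-up must compensate for the communication overhead can be illustrated with a minimal sketch. It assumes a simple additive cost model (accelerated computation time plus a fixed transfer overhead), which is an illustrative simplification and not a model taken from this thesis; all function and parameter names are hypothetical.

```python
def offload_pays_off(sw_time, hw_speedup, comm_overhead):
    """Return True if running a task on the reconfigurable hardware is faster
    than running it on the GPP, under an assumed additive cost model."""
    hw_time = sw_time / hw_speedup          # accelerated computation time
    return hw_time + comm_overhead < sw_time


def break_even_speedup(sw_time, comm_overhead):
    """Speed-up at which offloading breaks even; offloading only pays off
    above this value. Returns None when the overhead alone already takes
    at least as long as the software execution."""
    if comm_overhead >= sw_time:
        return None
    return sw_time / (sw_time - comm_overhead)


# A large task tolerates the overhead, a small one does not (made-up numbers).
print(offload_pays_off(sw_time=100.0, hw_speedup=10.0, comm_overhead=20.0))
print(offload_pays_off(sw_time=5.0, hw_speedup=10.0, comm_overhead=20.0))
```

Under this assumed model, a task whose software execution time is shorter than the transfer overhead can never benefit from offloading, whatever the speed-up, which matches the remark about small computation tasks with intensive communication.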
Coarse-grained and Fine-grained
In this classification approach, Reconfigurable MPS are grouped by the granularity
of their reconfigurable blocks. A reconfigurable block is a combination of basic
reconfigurable functional units, memories and interconnects. The minimum
reconfiguration grain, which is related to the size of the basic blocks, defines
the granularity of the system. Based on the granularity of the reconfigurable
blocks, Reconfigurable MPS can be categorized into fine-grained and coarse-grained
systems.
• In fine-grained systems, the minimum level of reconfiguration is the individual
bit of the reconfigurable hardware (HW), and operations on individual bits or
small groups of bits are available in the system. Examples of fine-grained
reconfigurable systems are the widely available FPGA architectures from commercial
vendors. The reconfiguration units in these platforms are Look-up Tables (LUTs),
switching interconnections, and configurable distributed block memories. Although
reconfiguration at the level of individual bits requires many more hardware
resources and consumes more power, it provides a high level of flexibility.
Fine-grained blocks are very efficient for the manipulation of bit-level data, as
in encryption, image processing, and logic and set-theoretical computations.
Besides, they have a great advantage for mixed-grain applications and for
designing control, interface or data-path circuits that are not implemented in
the GPP.
• In coarse-grained systems, the smallest reconfiguration unit is at the level of
words, and the blocks are designed for bit-parallel operations on whole words. In
coarse-grained Reconfigurable MPS, the functional unit of a reconfigurable block
may be an independent ALU block, as shown in Figure 1.4. Since the reconfigurable
blocks are implemented with word-wide computation, the reconfiguration blocks in
these systems are larger in size but fewer in number, and they require less
interconnection than the fine-grained fabric. As a result, they demand
Figure 1.4: Example of fine-grained (a) and coarse-grained (b) architectures
fewer hardware resources, less energy and less time for reconfiguration. However,
in return, they offer less flexibility than the fine-grained options and are only
efficient for data-path computations of the implemented data width or its
multiples.
Static and Dynamic Reconfigurable MPS
According to whether the MPS can be reconfigured at runtime, we refer to it as a
static or a dynamic reconfigurable system.
• For Static Reconfigurable MPS, the hardware resource is configured before the
actual execution of the system and its configuration stays the same during
runtime. In static MPS, the configuration is loaded into the configurable resource
at compile time.
• For Dynamic Reconfigurable MPS, the configuration of the hardware resource can
be changed at runtime, during the actual execution of the platform. The
architecture can be reconfigured multiple times to satisfy the different
requirements of an application at runtime. Therefore, these systems are also
called run-time reconfigurable systems.
Most commercially available reconfigurable devices provide dynamic reconfiguration
capability because it brings a number of benefits to computing systems:
• Increasing resource sharing: large applications can be partitioned into mutually
exclusive tasks that fit into smaller hardware resources. The configuration for
each task is then dynamically loaded into the hardware resource at runtime to
execute the whole application.
• Saving power and energy: the smaller hardware usage directly reduces the static
power. Moreover, not keeping unused parts of applications in the hardware can also
significantly reduce the power dissipation [158], [117].
• Offering adaptability to changes in the environment: the ability to change the
circuit specialization at runtime can be used to implement adaptive control,
fault-tolerant, self-diagnosing and self-repairing systems, etc. [76]
Since the configuration is loaded at runtime, the time needed for reconfiguration
is very important in dynamic reconfigurable systems. This time is proportional to
the size of the hardware resource and inversely proportional to the granularity of
the reconfigurable unit. One way to reduce this overhead is to preload a
configuration into on-chip memory before it is actually executed. This technique
is known as prefetching and is widely used in different reconfigurable systems
[67]. Another way to decrease the reconfiguration time is to divide the
reconfigurable resource into segments that can be reconfigured independently while
the other segments are executing. This feature is called dynamic partial
reconfiguration and is offered by modern commercial FPGAs (Xilinx, Altera).
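The benefit of hiding reconfiguration behind execution can be illustrated with a minimal sketch. It assumes a deliberately simple model that is not taken from this thesis: tasks run one after another, there are two segments, and the configuration of the next task starts loading into the idle segment when the current task starts executing. The function names and the timing values are hypothetical.

```python
def serial_time(tasks):
    """Total time when every reconfiguration blocks execution.
    Each task is a (reconfig_time, exec_time) pair."""
    return sum(reconfig + execute for reconfig, execute in tasks)


def overlapped_time(tasks):
    """Total time when the next configuration is loaded into a second segment
    while the current task executes (assumed two-segment model)."""
    if not tasks:
        return 0
    total = tasks[0][0]                       # the first load cannot be hidden
    for i, (_, execute) in enumerate(tasks):
        next_reconfig = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        total += max(execute, next_reconfig)  # next load hides behind execution
    return total


tasks = [(4, 10), (6, 3), (2, 5)]             # made-up (reconfig, exec) times
print(serial_time(tasks), overlapped_time(tasks))
```

In this model a reconfiguration only adds to the total time when it takes longer than the execution it overlaps with, which is why the approach helps most when execution times dominate.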
The main contributions of this thesis are design automation tools and techniques
for Reconfigurable MPS. The main target platforms are fine-grained, loosely
coupled and dynamically Reconfigurable MPS because of their wide adoption and
popularity, the readiness of the supporting compilation tools, and the
availability of the hardware.
Table 1.1 lists a number of available Reconfigurable MPS classified by the above
criteria and specifies whether these systems can benefit from our contributions.
Table 1.1: Existing Reconfigurable MPS platforms
Reference                     Coupling   Granularity   Reconfiguration   Applicable
NAPA [134]                    Loosely    Fine          Dynamic           True
Garp [68]                     Loosely    Coarse        Dynamic           False
CHAMELEON [153]               Loosely    Fine          Dynamic           True
MorphoSys [150]               Tightly    Coarse        Dynamic           False
Chimaera [184]                Tightly    Fine          Dynamic           False
Experimental platform [111]   Loosely    Fine          Dynamic           True
Annabella [154]               Loosely    Coarse        Dynamic           False
Smart ChipS [143]             Loosely    Mixed         Dynamic           True
CerberO [30]                  Loosely    Fine          Dynamic           True
Zedboard [8]                  Loosely    Fine          Dynamic           True
Altera SoC [3]                Loosely    Fine          Dynamic           True
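The applicability column can be read as a predicate over the three classification criteria. The sketch below encodes a few rows of Table 1.1 and a rule inferred from the table, namely that a platform is applicable when it is loosely coupled, dynamically reconfigurable, and offers fine-grained resources (fine or mixed granularity). The rule is an inference from the listed rows rather than a definition given in the text, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    coupling: str         # "loosely" or "tightly"
    granularity: str      # "fine", "coarse" or "mixed"
    reconfiguration: str  # "static" or "dynamic"


def is_applicable(p):
    """Assumed rule inferred from Table 1.1 (not stated explicitly in the text)."""
    return (p.coupling == "loosely"
            and p.reconfiguration == "dynamic"
            and p.granularity in ("fine", "mixed"))


# A subset of the rows of Table 1.1.
platforms = [
    Platform("NAPA", "loosely", "fine", "dynamic"),
    Platform("Garp", "loosely", "coarse", "dynamic"),
    Platform("MorphoSys", "tightly", "coarse", "dynamic"),
    Platform("Chimaera", "tightly", "fine", "dynamic"),
    Platform("Smart ChipS", "loosely", "mixed", "dynamic"),
    Platform("Zedboard", "loosely", "fine", "dynamic"),
]
for p in platforms:
    print(f"{p.name:12s} applicable={is_applicable(p)}")
```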
1.2 Design Automation for Reconfigurable MPS and Challenges
As discussed in the previous sections, Reconfigurable MPS can adapt both hardware
and software to the application requirements and thereby offer high computational
performance and low power consumption at the same time. Furthermore, they overcome
the high cost and long development time of ASIC solutions, and they can provide
adequate robustness and adaptability for modern computing systems. With these
capabilities, they are strong candidates for a wide range of applications across
various industry domains: from embedded applications (multimedia, communications,
etc.) to high performance computing (data analytics, financial computing, etc.),
and from consumer software (image and video processing, etc.) to scientific
workloads (cyber security, simulations, etc.).
However, along with these advantages, the combination of software and hardware
reconfigurability drastically increases the complexity of the system and imposes
new challenges on design methodologies and design automation tools. Although a
growing number of studies in both academia and industry address this problem,
there is still a large gap between the advancement of reconfigurable HW and the
tools supporting its development process. In this section, we give an overview of
the typical development process of applications on Reconfigurable MPS and
highlight the important characteristics of the tools required for the process.
Furthermore, we discuss the main challenges of each component in the process,
which lay the foundations for the research targets as well as the main
contributions of this thesis. A more comprehensive review of each component and
of current progress in the field is provided in Chapter 2.
1.2.1 Application Level Analysis
Figure 1.5 presents a general design flow for Reconfigurable MPS and the EDA
components necessary to make the development process efficient and effective. The
development process starts with the Analysis on Application Level, where the
computational characteristics of applications are extracted with Profiling tools.
Then, Code Restructuring tools restructure the high-level representations of
applications written in high-level languages (C, Matlab, Java, etc.) to better
suit the underlying hardware architecture of the reconfigurable resource.
Application profiling and code restructuring are tightly integrated and are based
on the common techniques and theories of compiler front-ends and their
representations. Therefore, the automation tools at this level can be developed by
inheriting well-established techniques from parallel compilers for multiprocessor
systems, with significant adaptation to the unique hardware specification of
reconfigurable platforms.
Figure 1.5: Design flow for Reconfigurable MPS
1.2.2 Macro/System Level Synthesis and Exploration
The second part of the design flow explores different combinations of hardware
components at system level (reconfigurable tiles and CPUs, memories and
communication resources, etc.) to define the most suitable architecture for the
requirements and constraints of the applications. Hence, this process is generally
referred to as Macro or System Level Synthesis and Exploration. It involves a
number of different tools: Hardware/Software (HW/SW) Partitioning, Mapping and
Scheduling, Design Space Exploration tools and Modeling techniques.
Main Components
System Level Synthesis starts with HW/SW Partitioning, where the analysis result
from the Application level is used to decide which part of the application should
be executed on which component of the system (CPU or reconfigurable hardware).
After that, the Mapping process derives the spatial allocation and binding of the
computational processes in the application onto physical hardware units of the
Reconfigurable MPS platform. Thereafter, the Scheduling tool provides the
execution order of these computational processes, either at system level or at
device level where multiple tasks of the application are allocated to the same
hardware component.
HW/SW Partitioning, Mapping and Scheduling are the three main subproblems of
System Level Synthesis and are inherently dependent on each other. They are
usually bound together in an integrated process that requires iterative
exploration and refinement. Therefore, they should be automated to a high degree.
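The three steps above can be sketched as a toy pipeline. This is a minimal illustration under strong assumptions and not the algorithms proposed in this thesis: partitioning simply picks the faster side for each task, mapping binds tasks to units round-robin, and scheduling processes tasks in a given topological order. All names, rules and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: float     # execution time on a CPU (made-up units)
    hw_time: float     # execution time on reconfigurable hardware
    deps: tuple = ()   # names of predecessor tasks


def partition(tasks):
    """HW/SW partitioning: choose the faster side for each task (toy rule)."""
    return {t.name: ("HW" if t.hw_time < t.sw_time else "SW") for t in tasks}


def map_tasks(tasks, side, units):
    """Mapping: bind each task to a concrete unit of the chosen kind, round-robin."""
    counters = {"SW": 0, "HW": 0}
    binding = {}
    for t in tasks:
        kind = side[t.name]
        pool = units[kind]
        binding[t.name] = pool[counters[kind] % len(pool)]
        counters[kind] += 1
    return binding


def schedule(tasks, side, binding):
    """Scheduling: start a task once its predecessors are done and its unit is free.
    The task list is assumed to be in topological order."""
    finish, unit_free, order = {}, {}, []
    for t in tasks:
        unit = binding[t.name]
        ready = max((finish[d] for d in t.deps), default=0.0)
        start = max(ready, unit_free.get(unit, 0.0))
        duration = t.hw_time if side[t.name] == "HW" else t.sw_time
        finish[t.name] = start + duration
        unit_free[unit] = finish[t.name]
        order.append((t.name, unit, start, finish[t.name]))
    return order


tasks = [
    Task("read", 2, 5),
    Task("filter", 10, 3, ("read",)),
    Task("fft", 8, 2, ("read",)),
    Task("write", 1, 4, ("filter", "fft")),
]
units = {"SW": ["cpu0"], "HW": ["tile0"]}
side = partition(tasks)
binding = map_tasks(tasks, side, units)
for entry in schedule(tasks, side, binding):
    print(entry)
```

The sketch also shows why the three subproblems depend on each other: the partitioning decision fixes the durations seen by the scheduler, and the binding decides which tasks compete for the same unit.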
Supporting Tools
Because of the complexity of the above subproblems, the number of candidate
solutions grows exponentially with the number of operations in the computational
processes under consideration. Therefore, the exploration at system level usually
deals with a coarse-grained
abstract model of the application. A list of widely used application models is
presented in Chapter 2, along with a description of the two modeling tools adopted
in our work: Task Graph and Synchronous Data Flow Graph.
To support the multi-objective requirements of System Level Synthesis problems,
Design Space Exploration (DSE) tools provide a systematic mechanism for
constructing, evaluating and comparing different combinations of the design
parameters. Because of the high dimensionality of the design space, the main
challenge for DSE tools is the long execution time and the exponential increase of
complexity with the number of parameters under consideration. Applying Genetic
Algorithm and Machine Learning techniques to mitigate this challenge is one of the
targets of this thesis.
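When several objectives are compared, DSE tools typically retain the design points that are not dominated by any other point. The following minimal sketch shows this Pareto filtering step by brute force, assuming that all objectives are to be minimised; the design points are made-up values and the function names are hypothetical.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better
    in at least one. All objectives are assumed to be minimised."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the design points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up design points as (execution time, energy) pairs.
designs = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
print(pareto_front(designs))
```

The pairwise comparison costs quadratic time in the number of evaluated points, which is negligible next to the cost of evaluating each point but still grows with the size of the explored design space.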
Main Challenges
The main challenge of the components in System Level Synthesis and Exploration is
that they belong to the class of NP-complete problems, so strictly optimal
solutions are feasible only for constrained versions of the problems that apply
to simple platform architectures or narrow application domains. For complex and
heterogeneous platforms, different heuristic approaches need to be explored to
obtain efficient and effective solutions. Developing these heuristics is therefore
essential for System Level Synthesis and is one of the purposes of this thesis.
The second challenge for designers of Reconfigurable MPS comes both from the
advancement of hardware technology and from the application requirements. The
increasing mobility and autonomy requirements of devices put a new emphasis on
the importance of power and energy consumption in electronic design. Furthermore,
the growing complexity of systems that integrate both hardware and software
components and the shrinking size of transistors make electronic systems less
reliable and more sensitive to interference from the environment. However, the
emergence of embedded
systems in critical applications in all fields of human life (implantable devices,
mission-critical systems, security applications, etc.) demands extremely high
reliability and robustness. Therefore, the development of electronic systems in
general, and of Reconfigurable MPS in particular, should be treated as a
multi-objective optimization problem that takes into account not only the
performance but also the robustness, reliability, energy and power requirements of
the applications. In this thesis, we focus on two of the most important
objectives: performance and energy consumption. The following practical use case
shows why they should be tackled first.
Multimedia applications on mobile devices
Nowadays, running multimedia applications on mobile devices is a popular use case.
With ever-increasing quality standards in color depth (8 bit to 16 bit), spatial
resolution (HD 1080p to Quad HD 4Kx2K pixels) and frame rate (30 fps to 60 fps to
120 fps), a mobile device might need to process about a billion pixels per second
(4Kx2Kx120) to meet the quality of service of modern multimedia applications. This
puts tremendous pressure on the computing power of mobile devices and imposes a
high throughput requirement on the implementation of the whole system. On the
other hand, because of the battery constraints of mobile devices, the runtime of
multimedia applications on these devices usually does not meet the expectations of
users [179]. As shown in Figure 1.6, experts in the field also predict a huge gap
between the power requirement and the real capacity of mobile devices. Therefore,
an emphasis on energy-efficient design is critical for these applications. We have
successfully applied the contributions proposed for System Level Synthesis to
several real-life multimedia applications in Chapters 3-5 to address both the
performance and the energy consumption objectives.
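The pixel-rate figure quoted above can be checked with a short calculation. The sketch interprets "4Kx2K" as 4096 x 2048 pixels, which is an assumption about the exact format, and the raw data rate additionally assumes three colour channels of 16 bits each, which is not stated in the text.

```python
def pixel_rate(width, height, fps):
    """Pixels that must be processed per second for an uncompressed stream."""
    return width * height * fps


# "4Kx2K" is taken here as 4096 x 2048 (assumption about the exact format).
rate = pixel_rate(4096, 2048, 120)
print(f"{rate:,} pixels/s")

# Raw data rate if every pixel carried 3 colour channels of 16 bits each
# (the channel count is an assumption, not a figure from the text).
bytes_per_pixel = 3 * 16 // 8
print(f"{rate * bytes_per_pixel / 1e9:.2f} GB/s")
```

With these assumptions the rate is slightly above one billion pixels per second, consistent with the order of magnitude given in the text.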
Figure 1.6: Power requirement gap of mobile devices [12]
1.2.3 Micro/Device Level Synthesis and Exploration
Micro-architecture exploration and synthesis is the process of generating an
executable implementation of the computational processes on each hardware device.
For GPPs, it is the compilation of high-level languages to binary or machine code.
For the reconfigurable unit, the translation from a high-level description to an
RTL implementation needs special support from High Level Synthesis (HLS) tools.
Despite tremendous recent research and development in the HLS domain, the tools
are not yet mature enough to be comparable with manual hardware design. Further,
because the field has emerged only recently, there is a lack of efficient Design
Space Exploration tools specific to the characteristics of the HLS component.
Addressing these challenges will not only improve the performance of HLS tools but
also widen the usability of reconfigurable hardware in general.
The second set of tools widely used to generate RTL implementations on
reconfigurable platforms consists of hardware accelerator generators, or core
generators. Although such a tool is usually developed to support only a specific
type of application, the accelerators it generates can achieve competitive
performance and energy.
The final component in the design flow is the Logic Synthesis and Implementation
tools provided by the FPGA vendors. These tools take the RTL implementation as
input and produce the bitstream implementation specific to the FPGA architecture
of each vendor.
1.3 Research Objectives and Contributions
1.3.1 Research Objectives
With the aim of making the design flow for Reconfigurable MPS more productive and
autonomous, this thesis provides algorithms and tools that accelerate the
development process in both macro level and micro level synthesis. Moreover, these
tools are developed with multiple design criteria in mind (throughput, latency,
energy consumption, leakage power, etc.) to enable users to explore the trade-offs
between them. To this end, we have implemented our contributions as components of
a general EDA flow for Reconfigurable MPS while addressing the above-mentioned
challenges. Figure 1.7 visually summarizes our main contributions as well as the
structure of this thesis. A brief introduction to each component is given in the
next section.
1.3.2 Contributions
Throughput and Energy-aware Mapping
To deal with the high complexity of macro level synthesis, a mapping algorithm is
proposed for efficiently assigning tasks to processing elements on heterogeneous
platforms with reconfigurable hardware. This mapping approach computes multiple
energy-throughput trade-off points (mappings) at design-time and uses one of these
points at run-time, based on the desired throughput and the current resource
availability, while optimizing for the overall energy consumption. A throughput
and energy-aware design space exploration (DSE) strategy has been proposed to
derive the trade-off points. While significantly reducing the complexity of the
DSE, the proposed strategy still evaluates mappings for all the resource
combinations of the platform, providing optimal mapping solutions for all the
scenarios of system architecture at run-time. Moreover, an energy-aware runtime
mapping technique has been proposed that utilizes the DSE results to perform
Figure 1.7: Contributions in the thesis
efficient mapping. Experimental results show that the proposed strategy achieves
better energy-throughput trade-off points, covers all the resource combinations,
and reduces energy consumption by up to 25% at design-time and by an additional
17.8% at run-time when compared to state-of-the-art techniques [148, 159]. This
work is published in [122].
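The idea of choosing one precomputed trade-off point at run-time can be sketched as a simple table lookup. This is only an illustration of the selection principle described above; the actual run-time technique is presented in Chapter 3. The data layout, the function name and the numbers are hypothetical.

```python
def select_mapping(points, required_throughput, available):
    """Pick the lowest-energy precomputed mapping that meets the throughput
    requirement and fits the currently available resources.
    Returns None when no stored point qualifies."""
    feasible = [
        p for p in points
        if p["throughput"] >= required_throughput
        and all(count <= available.get(kind, 0)
                for kind, count in p["resources"].items())
    ]
    return min(feasible, key=lambda p: p["energy"], default=None)


# Hypothetical trade-off points produced at design time.
points = [
    {"resources": {"cpu": 1, "tile": 0}, "throughput": 10, "energy": 9.0},
    {"resources": {"cpu": 1, "tile": 1}, "throughput": 25, "energy": 6.0},
    {"resources": {"cpu": 2, "tile": 2}, "throughput": 40, "energy": 8.5},
]
best = select_mapping(points, required_throughput=20,
                      available={"cpu": 2, "tile": 1})
print(best)
```

Because the expensive exploration happens at design time, the run-time decision in this sketch reduces to a filter and a minimum over a small table.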
Leakage-aware Scheduling
To tackle the NP-hard problem of scheduling hardware tasks on reconfigurable
devices with dynamic Partial Reconfiguration, heuristic approaches are used to
explore the trade-off between performance and leakage power consumption. A
resource management approach containing scheduling, placement and post-placement
stages has been proposed to address the leakage issue. In the scheduling stage, a
leakage-aware cost function is derived to cope with the leakage power. The
placement stage uses a cost function that allows designers to decide on a
trade-off between performance and leakage saving. The post-placement stage employs
a heuristic approach and yields further improvements. Experiments show that our
approach can achieve large leakage savings for both synthetic and real-life
applications with an acceptable extension of the deadline. Furthermore, different
variants of the proposed approach can reduce leakage power by 40-65% when compared
to a performance-driven approach and by 15-43% when compared to existing works
[21, 72, 188]. This work is presented in [123].
ML and GA for Multi-objective DSE
In the above-mentioned System Level Synthesis tools, we extensively applied
list-based heuristics to solve the mapping and scheduling problems because of
their simplicity and efficiency. In our list-based heuristics, a cost/priority
function is used to compute the priority of tasks/jobs and to put them in an
ordered list. The cost function has been developed to be complex enough to cover
the increasing number of constraints in the system design. Moreover, to enable
system designers to examine the trade-off between a number of
design requirements (performance, power, energy, reliability, etc.), we propose a
framework that uses the Genetic Algorithm (GA) to explore the design space and
obtain Pareto-optimal design points. Furthermore, to address the long run time of
the DSE process, multiple Machine Learning techniques are used to build predictive
models for the Pareto fronts. The models are built using training task graph
datasets and applied to incoming task graphs. Our framework has been verified with
both mapping and scheduling heuristics. For the scheduling problem, the Pareto
fronts for incoming task graphs are produced two orders of magnitude faster than
with the traditional GA, with only 4% degradation in quality. For the mapping
problem, our framework runs 25x faster while sacrificing less than 5% of the
quality of the Pareto front [121].
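The structure of a list-based heuristic driven by a cost/priority function can be illustrated with a minimal sketch. It assumes independent jobs and identical units and uses a placeholder priority function; the actual cost functions of this thesis are described in the later chapters. All names and numbers are hypothetical.

```python
def list_schedule(jobs, num_units, priority):
    """Order jobs by a priority function, then place each job on the unit
    that becomes free first. Jobs are assumed to be independent."""
    ordered = sorted(jobs, key=priority, reverse=True)
    unit_free = [0.0] * num_units
    placement = []
    for job in ordered:
        unit = min(range(num_units), key=lambda u: unit_free[u])
        start = unit_free[unit]
        unit_free[unit] = start + job["time"]
        placement.append((job["name"], unit, start))
    return placement, max(unit_free)


def make_priority(w_time, w_energy):
    """Placeholder cost function: a weighted sum of time and energy."""
    return lambda job: w_time * job["time"] + w_energy * job["energy"]


jobs = [
    {"name": "a", "time": 4, "energy": 1},
    {"name": "b", "time": 2, "energy": 5},
    {"name": "c", "time": 3, "energy": 2},
    {"name": "d", "time": 1, "energy": 1},
]
placement, makespan = list_schedule(jobs, num_units=2,
                                    priority=make_priority(1.0, 0.0))
print(placement, makespan)
```

Changing the weights passed to `make_priority` changes the ordering of the list and therefore the resulting schedule, which is one way such a heuristic exposes parameters that an outer search could tune.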
DSE for HLS
For micro-level synthesis, a framework that exploits the loop structure has been
implemented to shorten the development time with HLS tools. Owing to their high
level of abstraction, HLS tools can easily provide multiple hardware designs from
the same behavioral description. Therefore, they allow designers to explore
various architectural options for different design objectives. However, such
exploration has exponential complexity, making it practically impossible to
explore the entire design space. The conventional approaches to reducing the
complexity of design space exploration (DSE) do not analyze the structure of the
design space to limit the number of design points. To fill this gap, we explore
the structure of the design space by analyzing the dependencies between loops and
arrays. We represent these dependencies as a graph that is used to reduce the
dimensions of the design space. Moreover, we examine the access pattern of each
array and use it to find an efficient partitioning of the arrays for each loop
optimization parameter set. The experimental results show that our approach
provides almost the same quality of results as the exhaustive DSE approach while
significantly reducing the exploration time, with an average speed-up of 14x
[124].
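The exponential growth of the design space, and the way a dependency graph can shrink it, can be illustrated with a small sketch. It rests on an assumption made only for illustration: loops that share no array are treated as independent and are explored group by group, so that group sizes add up instead of multiplying. This is not necessarily how the framework in Chapter 6 uses the graph, and all names and numbers are hypothetical.

```python
from math import prod


def exhaustive_size(options):
    """Number of design points when every loop setting is combined with every other."""
    return prod(options.values())


def loop_groups(loops, array_users):
    """Group loops that are connected through shared arrays (union-find)."""
    parent = {loop: loop for loop in loops}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for users in array_users.values():
        for other in users[1:]:
            parent[find(other)] = find(users[0])
    groups = {}
    for loop in loops:
        groups.setdefault(find(loop), []).append(loop)
    return list(groups.values())


def decomposed_size(options, groups):
    """Design points when independent groups are explored separately."""
    return sum(prod(options[loop] for loop in group) for group in groups)


# Four loops with 8 candidate settings each; arrays A and B are each shared by two loops.
options = {"L1": 8, "L2": 8, "L3": 8, "L4": 8}
array_users = {"A": ["L1", "L2"], "B": ["L3", "L4"]}
groups = loop_groups(list(options), array_users)
print(exhaustive_size(options), decomposed_size(options, groups))
```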
Option pricing hardware generator
Finally, an accelerator auto-generation tool for option pricing applications is
implemented to further boost the design productivity of reconfigurable hardware.
Although a number of different FPGA-based option pricing accelerators have been
implemented, none of the existing works covers more than one model or different
types of options, so a separate hardware accelerator has to be implemented for
each model, which limits productivity. To fill this gap, we propose a design flow
for generating efficient hardware accelerators for option pricing applications
with different models and option types. The framework boosts the designer's
productivity and enables quick prototyping on FPGA platforms by providing a
general template architecture for option pricing applications. The architecture
comes with a prebuilt design library, which covers a wide range of popular
financial models. Experimental results for four models show that the accelerators
generated by our design flow outperform their software counterparts with a
speed-up of two orders of magnitude. Compared with existing hardware designs for
the same models, our framework produces accelerators that outperform most manually
designed engines.
1.3.3 Organization of the Thesis
The rest of this thesis is organized as follows. Chapter 2 presents a detailed
discussion of the design flow and automation tools for Reconfigurable MPS. Chapter
3 provides details on the throughput and energy-aware mapping approach for
heterogeneous platforms with reconfigurable hardware. The leakage-aware scheduling
algorithm is presented in Chapter 4. Chapter 5 wraps up the contributions on
System Level Synthesis with the Machine Learning and Genetic Algorithm
optimization framework. The DSE approach for HLS using loop-array dependencies is
introduced in Chapter 6. The auto-generation framework for option pricing hardware
accelerators is presented in Chapter 7. Finally, Chapter 8 concludes this thesis
with an overview of future research directions.
Chapter 2
Design Automation Tools for
Reconfigurable MPS
As mentioned in the previous chapter, in a typical design flow for a
Reconfigurable system, the first task of the designers is to analyze the
computational characteristics of the application with profiling tools. Then, a
partitioning process decides which parts of the application are suitable for
processing on the reconfigurable hardware resource. Thereafter, the hardware
designer needs to implement the RTL description of those parts using a Hardware
Description Language. Subsequently, the real hardware implementation is generated
using the logic synthesis tools that are usually provided by the reconfigurable
hardware vendor. The above process requires the skills of both software and
hardware designers and has many steps that need to be carried out manually.
To significantly boost development productivity and reduce the inefficiency of
this error-prone and time-consuming procedure, a uniform design flow should be developed.
Such a design flow needs to include EDA components that automate
the compilation of the application's high-level description into its architecture-optimized
representation, as well as the synthesis process down to the hardware-level implementation.
Moreover, this design flow needs to give designers the ability to automatically explore
different implementation options for the architecture at both the system (macro) level and
the device (micro) level. In the previous chapter, a brief overview of such a design flow for
Reconfigurable MPS and its main challenges was presented. This chapter is devoted to
detailing the components of the design flow and reviewing the current progress and existing
toolsets available for each stage in the development cycle of Reconfigurable MPS.
The remainder of this chapter is structured as follows. Section 2.1 summarizes the
main tasks at the Application Level, Application Profiling and Code Restructuring, as well
as the existing techniques in this domain. Section 2.2 provides details on each sub-problem
of the System Level Synthesis and Exploration stage (HW/SW Partitioning,
Mapping, Scheduling) along with the supporting tools: Design Space Exploration and
Modeling tools. Moving toward Micro Level Synthesis and Exploration, Section 2.3
reviews the current progress in High Level Synthesis and Hardware Accelerator
Auto-generation tools. Finally, Section 2.4 concludes this chapter.
2.1 Application Level
In this early stage of the design process, the application is analyzed to extract its computational
characteristics and bottlenecks. The result of this analysis is then used to decide
which parts of the application should be assigned to the reconfigurable hardware. Since
the initial description of the application is usually written in a high-level language (C,
Matlab, Java, etc.) with the main purpose of describing its behavioral functionality, part
of it needs to be restructured to better suit the underlying hardware architecture of the
reconfigurable resource. Therefore, two main categories of tools needed at this stage are
application profiling tools and code restructuring tools (Fig. 2.1).
Figure 2.1: Automation tools for Application Level
2.1.1 Application Profiling
Two major approaches to application analysis are static and dynamic. Static analysis
methods use compiler techniques to discover essential information on the computational
metrics of an application, such as the average frequency or execution count of particular
instructions, groups of instructions or functions, the ratio of data-processing instructions to
control instructions, data types and data access methods, etc. Dynamic analysis methods
extract these metrics by executing the application on the real hardware, on an emulation
of the platform or on its simulation model. One of the main purposes of this analysis process is
to provide the information necessary for restructuring and optimizing the application
algorithms or high-level code for execution on reconfigurable hardware.
2.1.2 Code Restructuring
Since the parallel execution model of reconfigurable hardware is inherently different from
the sequential execution model of the GPP, the processes of transforming and optimizing
the application algorithm and code are very important. While reconfigurable
hardware offers a massively parallel, systolic and pipelined computing paradigm with
distributed data structures, the CPU-centric programming model follows a sequential
approach with random-access and pointer-based data structures. As a result, a directly
translated version of algorithms or code written for CPU-centric programs cannot exploit
the advantages of the underlying reconfigurable hardware architecture. Therefore, the
application should be significantly rewritten to achieve the expected acceleration or energy
savings.
One of the primary tasks of code restructuring is to automatically transform a sequential
code portion into its parallelized version. This technique is known as automatic
parallelization, and it has to address a number of challenges related to loop
restructuring, control restructuring, data reorganization and reuse, etc. Because loops
account for most of the computational effort and time in regular applications,
intensive research has focused on loop optimization.
2.1.3 Loop Optimization Details
Loop optimization is usually a combination of two processes: loop analysis and loop
transformation. Loop analysis mainly answers two questions: whether a transformation is
safe and whether it is worthwhile. To answer the first question, a number of dependence
analyses need to be performed (data, pointer, recursion, indirect access, indirect calls) to
determine whether different iterations of a loop can be executed in parallel. To answer the
second question, the execution time and resource usage of the sequential and parallel
options need to be estimated and compared. Loop transformation is a huge topic in the
domain of parallel computing. Its ultimate purpose is to restructure the loop implementation
so that as many iterations as possible can be executed safely and efficiently in parallel. Loop
transformations usually involve loop splitting, loop unrolling, loop tiling, loop fusion,
etc. More details on loop transformation can be found in [23], [22], [116].
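As a minimal illustration, not taken from the cited works, the Python sketch below applies two of the transformations named above, loop fusion and loop tiling, to a simple element-wise computation. The function names and the tile size are hypothetical. The transformed versions return the same result as the original because no iteration depends on the result of another iteration.

```python
def original(a, b):
    """Two separate loops over the same index range."""
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
    for i in range(n):
        d[i] = c[i] * 2
    return d


def fused(a, b):
    """Loop fusion: both statements share one loop. c_i is produced and
    consumed in the same iteration, so the dependence is preserved."""
    n = len(a)
    d = [0] * n
    for i in range(n):
        c_i = a[i] + b[i]
        d[i] = c_i * 2
    return d


def tiled(a, b, tile=4):
    """Loop tiling: iterate over blocks of `tile` elements. The iterations
    inside a block are independent of each other, so parallel hardware
    could process a whole block at once."""
    n = len(a)
    d = [0] * n
    for start in range(0, n, tile):
        for i in range(start, min(start + tile, n)):
            d[i] = (a[i] + b[i]) * 2
    return d
```

A real loop optimizer would first prove the absence of loop-carried dependences (the "safe" question) before applying either transformation; here this holds by construction.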
Another popular way to extract more parallelism from sequential applications is control
restructuring, which involves combining control nodes to increase the potential for
executing the operations related to these nodes in parallel. One example is transforming
a serial nested if-then-else structure into a parallel multi-branch structure with switch and
case. A more general approach is combining together frequently executed sequences
of basic operation blocks to increase instruction level parallelism. The trace scheduling
method [57], superblock [66] and hyper-block formation [95] are representative exam-
ples of this approach.
Several features of the tools at the Application Level are worth highlighting.
Firstly, the processes of application analysis and code restructuring are tightly integrated
because they share common requirements and toolsets. Generally, these steps are based
on the knowledge and techniques of compiler front-ends and their intermediate
representations. As a result, the development of tools at this level requires more knowledge
of the software development domain and can build on well-established techniques
in parallel computing for multiprocessors. However, significant effort needs to
be devoted to adapting them to the unique hardware specifications of reconfigurable platforms.
2.2 Macro/System Level Synthesis and Exploration
On the hardware side of the design flow, the process of exploring different possible
compositions of hardware components, selecting the most suitable one and creating
its real implementation is termed architecture exploration and synthesis. The
selected architecture needs to support the application behaviors and satisfy the different
design constraints and objectives on performance/throughput, energy consumption, area,
reliability, etc.
The procedure and components of architecture exploration and synthesis might
differ depending on the nature of the application domain and the
provided hardware platform. For our targeted Reconfigurable MPS platform, which was
defined earlier in this thesis, this process involves system (macro) level synthesis
and device (micro) level synthesis.
Macro-architecture synthesis and exploration deals with architectural resources at
the system level (reconfigurable tiles and control processing units, memories and
communication resources, etc.). The decisions made at this level include the number of
reconfigurable or fixed processing elements in use, the memories and communication
resources of each type that need to be instantiated on the platform, as well as the specific
instantiation of processors, memories and communication resources. Moreover, the mapping
of different computation processes onto specific processing elements and the coarse
schedule of these computation processes also belong to this level. The main components
of the Macro-architecture synthesis and exploration stage are presented in Fig. 2.2: Modeling
Tools, HW/SW Partitioning, Mapping, Scheduling and Design Space Exploration
(DSE) tools. The details of these components are presented as follows.
Figure 2.2: Design automation tools in Macro/System Level
2.2.1 Modelling Tools
Since all the sub-problems in this stage such as HW/SW partitioning, mapping and
scheduling are known to be NP-complete, the combination of them is also NP-complete
[59]. As a result, the solution space of System Level Synthesis and Exploration grows
exponentially with the number of operations under consideration. Therefore, to
reduce the problem to one with manageable difficulty and reasonable execution time,
the elementary operations have to be grouped into larger macro-operations
(tasks or jobs), and the sub-problems in this stage usually work with a coarse-grained
abstract model of the application. There are numerous ways to describe the behavioral
model of an application, such as the Data Flow Graph (DFG), Control Data Flow Graph
(CDFG) [69], Communicating Sequential Processes (CSP) [70], Petri nets [63] or Kahn
Process Networks [61]. Each of these application models has different features and
supports different kinds of applications and computations at different levels of abstraction.
However, the Directed Acyclic Graph (Task Graph) [54] and the Synchronous
Dataflow Graph [160] are widely used for the sub-problems in the System Level Synthesis
stage. They are compact and simple enough to process for efficient mapping and scheduling
solutions while still capturing adequate information in the model
for highly effective decision making.
In the Task Graph model, an application is represented as a graph where nodes are the
computational units and edges represent the dependencies between them. Each node is a
collection of smaller operations that forms a task in the Task Graph and is usually
annotated with characteristics such as execution time, deadline, period, memory size, etc.
The edges might represent different types of dependencies between tasks, such as
communication dependency, data dependency or time dependency. The features of the tasks
and the dependencies of the edges are problem specific. In general, the graph is acyclic:
the dependencies between tasks are one way, meaning that tasks cannot impose further
dependencies on their antecedents.
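The following Python sketch shows one possible in-memory form of such a task graph. The four tasks, their execution times and the edges are hypothetical values chosen for illustration. Kahn's algorithm is used here as a standard way to verify the acyclic property and to obtain an order that respects all dependencies; the longest-path computation gives a lower bound on the completion time of the application.

```python
from collections import deque

# Hypothetical task graph: node -> execution time, edge = data dependency.
exec_time = {"A": 3, "B": 2, "C": 4, "D": 1}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]


def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(indeg):
        raise ValueError("task graph contains a cycle")
    return order


def critical_path_length(exec_time, edges):
    """Length of the longest path through the DAG (lower bound on makespan)."""
    pred = {n: [] for n in exec_time}
    for u, v in edges:
        pred[v].append(u)
    finish = {}
    for n in topological_order(exec_time, edges):
        finish[n] = exec_time[n] + max((finish[p] for p in pred[n]), default=0)
    return max(finish.values())
```

For the example data, `topological_order(exec_time, edges)` yields A, B, C, D and the critical path A, C, D has length 8.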
Synchronous Data Flow Graphs (SDFGs) are often used for modeling modern DSP
applications and for designing concurrent multimedia applications implemented on mul-
tiprocessor systems. Both pipelined streaming and cyclic dependencies between tasks
can be easily modeled in SDFGs. SDFGs allow analysis of a system in terms of throughput
and other performance properties such as latency and buffer requirements. The nodes
of an SDFG are called actors; they represent functions that are computed by reading
tokens (data items) from their input ports and writing the results of the computation as
tokens on the output ports. The number of tokens produced or consumed in one execution
of an actor is called the port rate and remains constant. The rates are visualized as port
annotations. Actor execution is also called firing and requires a fixed amount of time,
denoted by a number in the actor. The edges in the graph, called channels, represent
dependencies among different actors.
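The firing rule described above can be sketched in a few lines of Python. The two-actor graph, its port rates and the function names are hypothetical and serve only to illustrate token production and consumption; execution times are left out for brevity.

```python
# Hypothetical two-actor SDFG: A --(produces 2, consumes 3)--> B.
# Each channel: (source, destination, production rate, consumption rate,
#                initial tokens).
channels = [("A", "B", 2, 3, 0)]


def can_fire(actor, tokens, channels):
    """An actor is enabled when every input channel holds at least
    as many tokens as the consumption rate of the corresponding port."""
    return all(tokens[i] >= cons
               for i, (_, dst, _, cons, _) in enumerate(channels)
               if dst == actor)


def fire(actor, tokens, channels):
    """Consume tokens on input channels, produce tokens on output channels."""
    for i, (src, dst, prod, cons, _) in enumerate(channels):
        if dst == actor:
            tokens[i] -= cons
        if src == actor:
            tokens[i] += prod


def run(sequence, channels):
    """Fire actors in the given order and return the final token counts.
    Raises RuntimeError if an actor is fired while it is not enabled."""
    tokens = [init for (*_, init) in channels]
    for actor in sequence:
        if not can_fire(actor, tokens, channels):
            raise RuntimeError(f"{actor} is not enabled")
        fire(actor, tokens, channels)
    return tokens
```

With these rates, three firings of A and two firings of B (for example the sequence A, A, B, A, B) return the channel to its initial state, because 3 × 2 tokens are produced and 2 × 3 tokens are consumed.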
In subsequent Sections, we refer to the computational units under consideration (the
tasks in Task Graph or actors in SDF) as computational processes and the representation
of the applications (the graph in Task Graph and SDF) as the network of computational
processes.
2.2.2 HW/SW Partitioning
As the interface between application analysis and System Level Synthesis, the HW/SW
partitioning step takes the analysis results of the high-level description of the application,
decides which part of the application should be executed on which component
of the system, and automatically generates the corresponding implementation
from the original high-level description. Since both partitioning and coordinating
(mapping and scheduling) the different computational parts are NP-complete [59],
strictly optimal solutions have been proposed only for constrained versions of the
problem (where the system includes only a single programmable processor and a single
accelerator [83], [39]). A formal way to solve this problem for heterogeneous systems
is to model it as an integer programming problem. Following this approach, the actual
HW/SW partitioning is performed when estimating the timing of each processing
element while observing the timing constraints; thereafter, the actual scheduling is per-
formed for each partition block. Authors in [26, 109] have implemented this method to
solve various versions of the HW/SW partitioning problem.
For more general and larger heterogeneous systems, heuristic approaches are used
effectively. For example, in [119], based on a Petri-net behavior specification, the authors
modeled the application as a weighted graph of computational processes and their
communication. To partition the graph into sub-graphs corresponding to different
processing elements, they used a simulated annealing heuristic that minimizes the sum of the
weights of all the cut edges (communication overhead between processing elements) and
balances the total weights of the subgraphs (load balancing between
processing elements). After this work, a number of other heuristics have been applied
to this problem: iterative improvement [113, 180], constructive heuristic algorithms
[47, 156] and ant colony optimization [175, 176].
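To make the cut-weight and load-balancing objective concrete, the sketch below applies a generic simulated annealing loop to a two-way partitioning of a small weighted graph. It is a simplified illustration and not the algorithm of [119]: the cost function, the cooling schedule, the parameter values and the restriction to two partitions are all assumptions made here.

```python
import math
import random


def cost(part, weights, edges, balance_factor=1.0):
    """Cut weight plus a penalty for load imbalance between the two sides."""
    cut = sum(w for u, v, w in edges if part[u] != part[v])
    load = [0, 0]
    for node, side in part.items():
        load[side] += weights[node]
    return cut + balance_factor * abs(load[0] - load[1])


def anneal(weights, edges, steps=2000, t0=10.0, alpha=0.995, seed=0):
    """Two-way partitioning by simulated annealing (illustrative only)."""
    rng = random.Random(seed)
    nodes = sorted(weights)
    part = {n: i % 2 for i, n in enumerate(nodes)}    # arbitrary start
    best, best_cost = dict(part), cost(part, weights, edges)
    cur_cost, temp = best_cost, t0
    for _ in range(steps):
        n = rng.choice(nodes)
        part[n] ^= 1                                  # move n to the other side
        new_cost = cost(part, weights, edges)
        delta = new_cost - cur_cost
        # Always accept improvements; accept a worse move with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = dict(part), cur_cost
        else:
            part[n] ^= 1                              # reject: undo the move
        temp *= alpha
    return best, best_cost
```

A call such as `anneal({"a": 2, "b": 2, "c": 2, "d": 2}, [("a", "b", 5), ("c", "d", 5), ("b", "c", 1)])` returns the best partition found and its cost. Being a heuristic, it guarantees only that the result is no worse than the starting partition.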
More recently, constraint programming and evolutionary algorithms have emerged as promising
methods for efficiently tackling this problem with multi-objective requirements. They are
reported to generate high-quality solutions for complex architectures with heterogeneous
components [138], [52], [53], [146].
2.2.3 Mapping and Scheduling
The spatial allocation and binding of the components of the application model (computational
processes and the interactions between them) to the physical units of the platform (processing
elements, network medium, memory devices, etc.) is termed the mapping process.
The result of this process is a set of assignments that must satisfy the specific design
constraints (structural or physical) and optimize the objectives of the quality model in the
context of a specific trade-off priority between these objectives.
Another important resource management task is scheduling. While mapping pro-
vides spatial assignment for computational processes into components of the hardware
platform, scheduling provides the execution order of these computational processes, either
at the system level or at the device level, where multiple computational processes
are allocated to the same hardware component. As a result, the scheduling process often
happens after mapping, when the computational processes are already bound to a
specific hardware component. Similar to the mapping process, scheduling also needs to
take into account the design constraints and the trade-off preferences between objectives
while optimizing these objectives.
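The division of labor between mapping (where a process runs) and scheduling (when it runs) can be sketched as follows. The task graph, the mapping and the processing element names are hypothetical, and communication delays between processing elements are ignored for simplicity.

```python
# Hypothetical task graph and mapping of tasks onto two processing elements.
exec_time = {"A": 3, "B": 2, "C": 4, "D": 1}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
mapping = {"A": "PE0", "B": "PE0", "C": "PE1", "D": "PE0"}
order = ["A", "B", "C", "D"]      # any order that respects the dependencies


def schedule(exec_time, edges, mapping, order):
    """Given a mapping (task -> processing element) and a task order that
    respects precedence, compute start times so that tasks sharing a
    processing element never overlap and every task starts only after
    all of its predecessors have finished."""
    pred = {t: [] for t in exec_time}
    for u, v in edges:
        pred[v].append(u)
    pe_free = {}          # time at which each processing element becomes idle
    start, finish = {}, {}
    for t in order:
        ready = max((finish[p] for p in pred[t]), default=0)
        pe = mapping[t]
        start[t] = max(ready, pe_free.get(pe, 0))
        finish[t] = start[t] + exec_time[t]
        pe_free[pe] = finish[t]
    return start, max(finish.values())
```

For the mapping above the makespan is 8 time units, whereas binding all four tasks to a single processing element serializes them and gives 10. This shows how the mapping decision constrains what the scheduler can achieve.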
There are several common characteristics among the sub-problems in System Level
Synthesis:
• There are inherent dependencies between them, so they are usually closely
bound together in an integrated process that requires iterative exploration and refinement.
Therefore, they should be automated to a high degree.
• All of them belong to the class of NP-complete problems, so strictly optimal solutions
are feasible only for constrained versions of the problems that are applied
to simple platform architectures or narrow application domains. For complex and
heterogeneous platforms, different heuristic approaches need to be explored for
efficient and effective solutions.
Since mapping and scheduling are extremely important in the System Level Synthesis
stage, they are two of the main targets of this thesis. The next two subsections
summarize the state-of-the-art related work in the field and the unique features of our
contributions.
Throughput and Energy-Aware Mapping approach
The earliest DSE strategies that generate multiple mappings were reported in [62, 93,
96, 159]. By generating various mapping solutions at design-time, they can provide
supporting information to handle the dynamism in application throughput requirements and
resource availability at run-time. However, they suffer from several shortcomings. They
target only fixed-architecture platforms, do not scale well with the number of tiles, and
perform duplicate evaluations (mappings with similar task-to-tile allocations at different
locations in the platform) for large platforms. The duplications significantly increase the
number of evaluated mappings and thus the overall evaluation time. In order to
overcome the aforementioned limitations, our DSE strategy considers a generic heterogeneous
platform to provide mapping solutions that are applicable to a variety of target
platforms, which is not possible when considering a fixed platform. The number of tiles in
the generic platform depends on the number of tasks in the application. A tile includes
a processing unit and other elements such as memory or a network interface (NI). The
processing unit may have different hardware realizations, such as a general purpose processor
(GPP), graphics processing unit (GPU), digital signal processor (DSP), reconfigurable hardware
(RH), etc., and it determines the tile type. The results of our DSE analysis for
a platform can be reused for multiple target platforms as long as the tile types and the
maximum distance between tiles of the target platform are a subset of those considered
during the DSE analysis. Therefore, the analysis results are applicable to a variety of target
platforms and repeated evaluations can be avoided. Furthermore, duplications during
the analysis are avoided by not considering a bigger platform than required.
For run-time mapping, a large body of literature exists [38, 104, 112, 186]. These
early studies generate the mapping solution on-line at the arrival of applications without
any prior analysis. Therefore, the result is usually not optimal due to the limited computation
resources at run-time. Recently, mapping strategies have shifted their focus
to hybrid approaches, which use prior evaluation (done at design-time) to support
the mapping decision at run-time [97, 142, 185, 187]. Most of these works perform the
optimization for only one performance metric, such as energy consumption, throughput or
resource usage. The method in [142] provides the optimal mapping in terms of average
power consumption only; therefore, it cannot guarantee the throughput constraint
of applications. In [97] and [187], the DSE strategies take into account multiple quality
parameters at design-time, but leave the resource constraint problem to a controller at
run-time. In contrast, our strategy produces mappings for all the possible resource
combinations, where the mappings at each resource combination represent trade-offs
between energy and throughput. Therefore, it provides better mapping solutions for
several performance metrics. The strategies in [97, 142, 185, 187] target a fixed platform,
whereas our method is applicable to a generic platform. In [147], a general approach is
considered, but it is applicable only to homogeneous platforms. In [148], the authors target
a generic heterogeneous platform, but the DSE is conducted in terms of throughput
optimization only. In contrast, our design-time analysis takes both throughput and energy
consumption into account. Further, the DSE in [148] reduces the number of mappings
significantly while focusing on the high-quality (throughput) mapping solutions. However,
the evaluation of mappings at a number of resource combinations is discarded during
the DSE process. Therefore, the analysis results might not contain mapping solutions
for all the different resource combinations available at run-time. Our proposed strategy
addresses this problem and reduces the energy consumption by mapping
highly communicating tasks onto the closest available tiles.
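The notion of mappings that "represent a trade-off between energy and throughput" can be illustrated with a simple Pareto filter. The sketch below is a generic dominance check and not the DSE strategy of this thesis; the mapping names and the throughput and energy values are invented for illustration.

```python
def pareto_front(points):
    """Return the names of mappings that no other mapping dominates.
    One mapping dominates another if its throughput is no lower and its
    energy is no higher, with at least one of the two strictly better."""
    front = []
    for name, thr, en in points:
        dominated = any(
            t2 >= thr and e2 <= en and (t2 > thr or e2 < en)
            for _, t2, e2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, throughput, energy) triples for one resource combination.
mappings = [
    ("m1", 10, 5.0),
    ("m2", 8, 3.0),
    ("m3", 8, 4.0),
    ("m4", 12, 9.0),
    ("m5", 9, 6.0),
]
```

Here `pareto_front(mappings)` keeps m1, m2 and m4: m3 is dominated by m2 (same throughput, less energy) and m5 is dominated by m1 (higher throughput, less energy). Storing only such non-dominated mappings per resource combination gives a run-time manager a compact set of choices.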
Multistage Leakage-aware Scheduling Technique
Task graph scheduling for FPGAs is an extensively studied topic [17, 21, 151, 157].
In [151], an efficient technique to schedule real-life applications on FPGAs is proposed,
but partial reconfiguration and resource constraints are not considered. Most of
the scheduling methods for FPGAs focus on specific problems related to reconfiguration
overhead and defragmentation. Ahmadinia et al. [17] combined scheduling and
placement for a 2D FPGA architecture using a cluster-based method to improve the
performance by 20% and task rejection by 16.2%. Christoph et al. [157] integrated an
on-line placement into a scheduling algorithm using small-tasks-first and earliest-deadline-first
techniques. However, they do not take into account the prefetching technique and the
resource constraint due to the single reconfiguration controller of a PR FPGA. The
first work that considered both the prefetching technique and the resource constraint was
introduced by Banerjee et al. [21]. The scheduling and placement models are combined with
the partitioning stage to form a complete HW-SW co-design approach for PR systems.
The linear placement model in this work was later adopted by Yuh et al. [188] and Hsieh
et al. [72] to address the leakage power issue.
Yuh et al. [188] first introduced the idea of using a scheduling approach to mitigate the
leakage issue. The authors utilized the scheduling and placement results from [21] and,
on top of that, developed a post-placement heuristic to reduce the delays between the
execution and reconfiguration parts. They also proposed an exact ILP solution to perform
the post-placement in order to verify the effectiveness of the heuristic. Since their work
tackles the leakage optimization after the tasks are already allocated onto the FPGA,
the existing placement results may not allow their approach to significantly reduce
the leakage power. To achieve maximal leakage saving, our work addresses the leakage
problem in all phases of the resource management process: the scheduling stage, the
placement stage and the post-placement stage.
With the same model and target, Hsieh et al. [72] introduced another approach to
reduce the leakage waste. Their method consists of 3 phases: binding, priority dispatch-
ing and split-aware placing. First, the reconfiguration and execution parts of all tasks
are combined together in the binding phase so that the leakage power is minimal. Then,
each task is assigned a priority value based on the position of the task in the task graph.
Finally, while placing the tasks into FPGA architecture, the split-aware placer checks
for the deadline. If the deadline is violated, the placer splits the reconfiguration and the
execution phase of the task. While the work in [72] tried to solve the leakage problem in
the placement phase only, we propose a more complete solution having multiple stages.
Table 2.1: Comparison of various approaches
Features            Ref. [21]            Ref. [188]      Ref. [72]            Our work
Scheduling          Performance driven   No              Performance driven   Leakage aware
Placement           Performance driven   No              Leakage aware        Leakage aware
Post placement      No                   Leakage aware   No                   Leakage aware
Priority of tasks   Dynamic              No              Static               Dynamic
Furthermore, the scheduling algorithm in [72] uses static priorities, which are computed
before the actual scheduling process takes place. The static priority is computed based
on the characteristics of the task graph and remains unchanged during the scheduling
process. In contrast, our algorithm dynamically recalculates the priorities of all available
tasks every time a task is allocated onto the FPGA. Therefore, our algorithm takes the
currently available resources of the FPGA into account, leading to better scheduling decisions.
Table 2.1 summarizes the distinction of our work in comparison to the closely related
works reported in the literature. As can be seen, existing works perform leakage aware
optimization in scheduling, placement, or post-placement stages, whereas our approach
performs optimization in all the stages. Further, unlike most of the approaches that
consider static priorities of tasks, our approach considers dynamic priorities.
2.2.4 Design Space Exploration Tools
To cope with the multi-objective requirements and the above-mentioned high complexity of
the system-level synthesis problem, efficient and effective ways of exploring the design space
need to be developed. They are generally referred to as design space exploration tools.
Basically, they are implemented as mechanisms that allow systematic construction, evaluation
and comparison of different combinations of the design parameters. In this
process, they must consider all the system parameters related to the system synthesis
problems while specifying the preferences of the solutions through constraints, objectives
and trade-off priorities. An essential part of a DSE tool is the decision-making scheme
that intelligently guides the exploration process toward “optimal” solutions in the context
of the specified constraints, objectives and trade-off priorities. Ideally, these DSE tools
should adequately exploit the trade-offs between important system characteristics and result
in more coherent, compact, comprehensive, reliable, robust and lower-cost solutions.
Because of these characteristics, multi-objective optimization methods such as evolutionary
algorithms and particle swarm optimization are widely adopted to implement DSE tools
[39], [81], [80], [137]. Although they provide acceptable solutions, these methods require
a long execution time and their complexity increases exponentially with the dimensionality
of the parameter space under consideration.
Machine Learning framework for DSE
Genetic algorithm (GA) approaches have been used intensively for DSE, especially in
cloud computing systems [65]. However, when it comes to multiprocessor systems
(MPS) with tight timing requirements, the applications of GA are quite limited because
of its time-consuming behavior. Sutar et al. proposed a memetic algorithm that
combines GA with simulated annealing to solve the DSE for the scheduling problem of
precedence-constrained tasks [163]. Zarinzad et al. and Samal et al. proposed frameworks
that use a GA-based scheduling algorithm with a primary-backup scheme to improve the
fault-tolerance of real-time MPS in [189] and [136], respectively. None
of the above-mentioned studies incorporates ML techniques to address the time-consuming
nature of GA methods in the DSE domain. That unique point makes our work
stand out from previous studies, which also try to apply GA approaches for solving DSE
problems.
A comprehensive survey of existing learning-based approaches to the same problems
on cloud computing systems has been conducted by Hormozi et al. [71]. A more specific
overview in the direction of energy minimization is presented by Berral et al. [27].
As summarized in these works, the main applications of ML techniques in scheduling
problems are performance modeling and Quality of Service (QoS) modeling. For performance
modeling, historical data on the execution traces of previous applications
are used to build predictive models that forecast the performance of incoming applications
[79]. On the other hand, the models for QoS are usually built based on the
dependency on available resources (CPU, memory, bandwidth, etc.) and application
requirements [28]. These models are then used to assist the scheduler at runtime to
efficiently allocate the resources. A second approach to applying ML techniques in resource
management is to classify applications and make decisions based on the classification
results [55]. Using unsupervised learning techniques such as Reinforcement Learning
to build autonomic self-management schedulers is another trend, not only in cloud
computing [120] but also in digital system design [46]. Our framework belongs to the first
application of ML in the resource management domain. However, the differences in purpose
and in the interaction between DSE algorithms and ML techniques make our framework
unique and novel. While the existing works try to assist the schedulers by predicting the
performance or QoS of new applications, our framework tries to model the behavior of
schedulers during the GA optimization process and build a predictive model for the result of
that procedure (i.e., the Pareto front).
2.3 Micro/Device Level Synthesis and Exploration
After being assigned in the System Level Synthesis stage, the tasks are transferred to the
appropriate processing elements, where their executable implementations on these devices
are generated. The process of generating the detailed implementations on each device is
termed micro-architecture exploration and synthesis. For GPPs, this is the straightforward
process of compiling the high-level description of the tasks to binary or machine
code. However, for reconfigurable devices (FPGAs), the process of translating from
the high-level description to an RTL implementation requires special treatment and dedicated
tools, as shown in Fig. 2.3.
Figure 2.3: Design automation tools in Micro/Device Level
The RTL-level implementation requires the creation of data-path components and
control-path components. Data-path components perform most of the computation, data
transmission and storage, and can be further divided into computational components
(arithmetic and logic units, multipliers, etc.), memory components (registers, embedded
memory units, etc.) and interconnection components (point-to-point interconnections,
buses, multiplexers, etc.). Control-path components, in contrast, manage the cooperation
between data-path components and generate command signals to select and activate the
appropriate data-path components at runtime.
The methods and tools adopted in micro-architecture synthesis are dependent on
the underlying reconfigurable architectures and domain of applications. As discussed
in the previous chapter, reconfigurable platforms can be classified into tightly coupled
and loosely coupled architectures. For tightly coupled reconfigurable systems, the
representative candidates are extensible or reconfigurable application-specific instruction
set processors (ASIPs). For this type of reconfigurable device, the final result of
micro-architecture synthesis is the hardware implementation of the extended instruction set
that satisfies the application's requirements. For loosely coupled architectures such as
reconfigurable coprocessors or hardware accelerators, the result of micro-architecture
synthesis is the RTL hardware implementation of the data-path and control-path as discussed
above. As mentioned earlier, the scope of this thesis is limited to loosely coupled
reconfigurable architectures due to the wide availability of supporting tools. A comprehensive
survey on micro-architecture synthesis for ASIPs can be found in [78].
The research area that lays the foundation for micro-architecture synthesis of hardware
accelerators is High Level Synthesis (HLS). With the aim of automatically
generating the RTL implementation (VHDL, Verilog) from high-level descriptions (C,
C++, Java, Matlab, etc.), an HLS tool has to perform three main tasks: resource
allocation, operation binding and operation scheduling. Based on the result of application
analysis, resource allocation decides the number and types of hardware resources
needed. Operation binding maps the high-level operations onto specific allocated
hardware instances. The decisions on sharing and reusing
hardware instances are also taken in this step. Operation scheduling decides
the temporal execution order of operations that share the same hardware instance. As
in System Level Synthesis, resource allocation, operation binding and
operation scheduling should be performed in a coherent and iterative manner, taking
into account the design constraints on area and power consumption while maximizing
the design objectives of performance, reliability and robustness according to the
preferred trade-off between these objectives.
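To make the interplay of the three tasks concrete, the following minimal sketch performs resource-constrained list scheduling on a small data-flow graph. It is only an illustration and not the algorithm of any particular HLS tool: the operation names, the unit latency of every operation and the name-order priority are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical data-flow graph for y = (a*b + c*d) + e
OPS = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add"}   # operation -> resource type
DEPS = {"a1": ["m1", "m2"], "a2": ["a1"]}                    # operation -> predecessors


def list_schedule(ops, deps, allocation):
    """Schedule and bind unit-latency operations under a given allocation.

    allocation maps a resource type to the number of allocated instances
    (the resource allocation decision). The result maps each operation to
    (cycle, instance), i.e. the scheduling and binding decisions.
    Assumes an acyclic graph and at least one instance per needed type.
    """
    result = {}
    cycle = 0
    while len(result) < len(ops):
        busy = defaultdict(int)                 # instances used in this cycle
        finished = {op for op, (c, _) in result.items() if c < cycle}
        for op in sorted(ops):                  # simple priority: name order
            if op in result:
                continue
            ready = all(p in finished for p in deps.get(op, []))
            rtype = ops[op]
            if ready and busy[rtype] < allocation[rtype]:
                result[op] = (cycle, f"{rtype}#{busy[rtype]}")
                busy[rtype] += 1
        cycle += 1
    return result


def latency(schedule):
    """Number of cycles needed by a schedule of unit-latency operations."""
    return max(c for c, _ in schedule.values()) + 1


one_mul = list_schedule(OPS, DEPS, {"mul": 1, "add": 1})
two_mul = list_schedule(OPS, DEPS, {"mul": 2, "add": 1})
print(latency(one_mul), latency(two_mul))
```

Allocating a second multiplier shortens the schedule by one cycle at the cost of additional area, which is the kind of trade-off that the iterative allocation, binding and scheduling steps have to balance.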
With the advance of reconfigurable technology and the increasing popularity of
reconfigurable devices, there has been tremendous research and development of HLS tools
in the past few years in both academia and industry. A number of HLS tools are
commercially available, such as Impulse C [11] from Impulse Accelerated
Technologies, Catapult C [29] from Calypto Design Systems, DK Suite and
Handel-C [10] from Agility (now part of Mentor Graphics), Cynthesizer [100] from
Forte Design Systems, Vivado HLS [4] from Xilinx, PICO Extreme FPGA [16] from
Synfora (now part of Synopsys), C-to-Silicon Compiler [5] from Cadence and Altera SDK
for OpenCL [2]. On the academic side, prominent HLS tools include xPilot [41]
(acquired by Xilinx to form Vivado HLS), CHiMPS [130] developed by Xilinx and the
University of Washington, Trident from Los Alamos National Labs [171], Bambu [125],
DWARV [106] and the open-source HLS tool LegUp [36].
2.3.1 DSE for HLS
Optimization and exploration activities are very common in digital system
development and may happen at different levels of the design process. With regard to
HLS, the DSE procedure can be roughly divided into two classes as follows:
DSE inside HLS
Existing works in this class focus on the DSE procedure for the internal tasks of the HLS
tools themselves. As described in [128], the main components of the HLS flow are
allocation, scheduling and binding. Each of these steps can be controlled by different factors
which have a great impact on the performance and hardware usage of the resulting
circuit implementation. Therefore, they are natural candidates for applying different DSE
approaches, and a large body of work applies DSE to different
transformations in these steps to find the optimal hardware implementation generated by
the HLS tools [132, 144]. By their nature, most of the works in this class require
full access to the HLS tool and their results may be applicable to only a specific HLS
flow. In contrast, our framework targets the DSE flow at a higher level, so that it is more
general and can be utilized with various HLS flows. This advantage is also the common
characteristic of the works in the second class.
DSE with HLS
Studies in this category are orthogonal to the works in the previous class since they are
applied at a higher abstraction level, and both techniques can be utilized at the same
time without any conflict. Works classified in this category consider the HLS tool as a
black box and explore the design space of the parameters that control the
optimization techniques offered by the HLS tool. These works have appeared
quite recently in comparison to the previous class, since the HLS tools have only recently
matured sufficiently. The earliest works in this direction tried to address the
time-consuming nature of DSE by applying a heuristic algorithm called adaptive simulated
annealing to prune the suboptimal design points [139]. In [140, 141], Carrion et al.
tried to reduce the complexity of the DSE problem by grouping the components (arrays,
loops, functions) of the original source code into smaller clusters and then running the DSE
for each cluster. Although the evaluation time is reduced, the quality of the solutions is
significantly affected. To mitigate the effect of local optimization in the previous
works, the authors of [35] applied a genetic algorithm to solve the DSE problem. To
further reduce the exploration time, the next generation of works in this direction
applied Machine Learning (ML) techniques to build a predictive model of the
HLS tool [35, 92]. In these works, several initial design points are generated by the HLS
tool to form a learning database for the ML tools. Based on these initial data, different
learning techniques are applied to obtain a predictive model that mimics the behavior
of the HLS tool as closely as possible. The subsequent evaluations are then computed
using the predictive model instead of calling the HLS tool.
Although the learning-based methods can significantly reduce the evaluation time,
the accuracy of the predictive models is usually not comparable to the real execution
of the HLS tools. The main limitation of previous approaches is that they traverse the
design space without any in-depth analysis of the structure of the space itself and without
considering the relationships between the parameters of the DSE process. In contrast, our
work analyzes the most important dependencies between loops and arrays, extracts them and
presents them as a graph. Thereafter, we utilize these correlations to derive the optimal
parameters for the arrays according to the given parameters of the loops. Hence, we
limit the dimensions of the DSE process to the loops only and exponentially shorten the
evaluation time, without sacrificing the quality of the result.
Memory Optimization for HLS
With regard to memory optimization techniques for HLS, several works
aim to optimize the array partitioning for loop pipelining in HLS [91, 177, 192].
However, these works improve the array partitioning process for loop
pipelining only, whereas our work proposes a method to obtain the optimal array
partition factor for loop unrolling. Furthermore, the main target of the earlier works
is to develop code transformation tools that provide better input for HLS tools. In contrast,
our work focuses on reducing the DSE time when using HLS tools.
2.3.2 Auto-generation tools
Despite rapid recent advancements, the RTL implementations generated by HLS
tools still lag far behind manual custom hardware implementations
in terms of speed and energy efficiency (e.g., [191], [94], [42]). Therefore, the
second body of work in micro-architecture synthesis focuses on automatic generators
for reconfigurable hardware accelerators. These tools are more application-oriented and
are usually developed as a template implementation which can be tuned by a set of parameters
that are specific to the application domain. Available tools in this area include different
accelerator core generators and libraries of accelerators such as the SPIRAL project for
signal processing applications [129], Core Generator from Xilinx [6] and MegaCore
from Altera [7]. Despite their narrow application support, accelerator core generators
are reported to significantly boost design productivity while ensuring competitive
performance and power savings ([165], [75]).
For option pricing applications, the closest work to our contribution is the one
proposed by Thomas et al. in [165]. The authors proposed a methodology to
automatically generate reconfigurable hardware for Monte Carlo simulation in financial
applications. The hardware accelerators produced by their design flow achieve an average
speedup of 87 times compared with a software implementation on a 2.66 GHz Pentium IV.
Our framework differs fundamentally from this work in two ways. First, we
focus closely on option pricing applications; hence, our proposed generic hardware
architecture is carefully customized to the computational characteristics of the pricing
problem. This customization brings advantages in both the development time and
the performance of the generated accelerators. Second, we integrate an optimization
process to derive the most efficient design parameters for the hardware accelerators.
2.4 Summary
This chapter has wrapped up the background and introduction part of the thesis by
providing an in-depth review of the functionality of the components in the design flow for
Reconfigurable MPS. State-of-the-art studies and research related to a number of components
have also been presented. To keep the discussion in each chapter coherent and comprehensive,




Throughput and Energy-Aware Mapping
Approach
This chapter opens our contributions on System Level Synthesis with a throughput- and
energy-aware mapping approach. As discussed in previous chapters, mapping is the
process of assigning the computational parts of applications to computing hardware
units so that all application requirements and platform constraints are
satisfied. There are two main kinds of mapping approaches: design-time and run-time.
Design-time strategies [18, 82, 88, 103] consider static workloads (predefined
applications) and thus cannot handle dynamic workload scenarios such as the insertion
of a new application into the system at run-time. Run-time mapping
[38, 104, 112, 186], on the other hand, may not provide a mapping solution that guarantees
the throughput requirement of applications, due to the limited time and computation
power available at run-time. To address the shortcomings of design-time and run-time
approaches, hybrid mapping strategies that use design-time analysis results to support
run-time decisions have been reported [142, 183, 185]. The heterogeneity in the architecture of
Reconfigurable MPS introduces new challenges for these mapping strategies. For
example, the number of mappings that need to be evaluated increases exponentially with
the number of Processing Element (PE) types, i.e., the design space becomes
multi-dimensional, whereas it is linear for the homogeneous case [148]. To overcome these
issues, heuristic approaches have been proposed to prune the mapping space and
thus reduce the evaluation effort [148]. However, while pruning the design space, the
existing approaches skip the evaluation of mappings for a significant number of resource
combinations. Consequently, when the resource combination available at run-time was not
covered during DSE, the run-time mapping process has to find a mapping solution
dynamically. In such situations, the run-time mapping process may take a long time to find
a mapping, which may violate the strict deadlines imposed on the mapping time. This chapter
presents a mapping strategy that addresses the shortcomings of existing strategies through
the following contributions:
• A design-time DSE technique that provides energy-throughput trade-off points for
all possible heterogeneous resource combinations.
• A run-time mapping technique that chooses the best trade-off points from the
design-time analysis results and considers different mapping options on the fly to optimize
the energy consumption.
Most existing works consider only one performance metric, such as energy or
throughput, when performing optimization in the DSE process [142, 148]. Hence, the best
mapping for each resource combination (generated by DSE) may excel in one
performance metric and perform poorly in the other. In our strategy, both throughput
and energy are used in the optimization process to achieve a balanced mapping
solution for the system. Moreover, our approach considers energy optimization both at





To develop the mapping strategy, we have used modeling tools for both the applications
and the hardware platform. Applications are modeled as Synchronous
Dataflow Graphs (SDFGs) [89]. SDFGs facilitate the modeling of streaming multimedia
applications with timing constraints. An SDFG model of an H.263 decoder is shown
in Fig. 3.1. The nodes (VLD, IQ, IDCT and MC) and edges (e1, e2, e3 and e4) model tasks
and dependencies, respectively. The nodes are referred to as actors, which communicate
through tokens sent from one actor to another over the edges. Each actor is
associated with its attributes: execution time and memory requirement when mapped on
a tile. If the actor has several implementation alternatives (e.g., GPP, DSP, RH), its
attributes are listed for each alternative. The implementation alternatives of
an actor refer to the different types of processing tiles on which the actor can be implemented.
Each edge has the following attributes: the size of a token, the memory needed on the tile when
the connected actors are allocated to the same tile, the memory needed in the source and
destination tiles when the connected actors are allocated to different tiles, and the respective
bandwidth requirement between the tiles. An actor fires (executes) when there are sufficient input
tokens on all of its input edges and sufficient buffer space on all of its output edges.
In each firing, the actor consumes a fixed number of tokens from the input edges
(input tokens) and produces a fixed number of tokens on the output edges (output tokens).
These token amounts are referred to as rates. An edge may contain initial tokens.
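The firing rule described above can be sketched in a few lines. The snippet is a minimal illustration only: the Edge class, the two-actor graph A → B, its rates (2 and 3) and the buffer capacity of 4 tokens are hypothetical and are not taken from the H.263 model.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    src: str                   # producing actor
    dst: str                   # consuming actor
    prod: int                  # tokens produced per firing of src (rate)
    cons: int                  # tokens consumed per firing of dst (rate)
    tokens: int = 0            # tokens on the edge (initial tokens at start)
    capacity: int = 1_000_000  # available buffer space on the edge


def can_fire(actor, edges):
    """True if all input edges hold enough tokens and all output edges
    have enough free buffer space for one firing of the actor."""
    for e in edges:
        if e.dst == actor and e.tokens < e.cons:
            return False
        if e.src == actor and e.tokens + e.prod > e.capacity:
            return False
    return True


def fire(actor, edges):
    """Consume input tokens and produce output tokens for one firing."""
    if not can_fire(actor, edges):
        raise RuntimeError(f"actor {actor} is not enabled")
    for e in edges:
        if e.dst == actor:
            e.tokens -= e.cons
        if e.src == actor:
            e.tokens += e.prod


def demo_trace():
    """Try to fire A, A, A, B, A and record (actor, fired, tokens on edge)."""
    edges = [Edge("A", "B", prod=2, cons=3, capacity=4)]
    trace = []
    for actor in ["A", "A", "A", "B", "A"]:
        fired = can_fire(actor, edges)
        if fired:
            fire(actor, edges)
        trace.append((actor, fired, edges[0].tokens))
    return trace


print(demo_trace())
```

In this trace, the third attempt to fire A is blocked by insufficient buffer space on its output edge, and A becomes enabled again only after B has consumed tokens.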
The throughput of an application is determined as the inverse of its long-term period,
which is calculated as the average time needed for one iteration of the application. An
iteration is defined as the minimum non-zero execution that returns the SDFG to its
original state. For the example H.263 decoder, the period is equal to the summation of



























Figure 3.1: SDFG model of an H.263 Decoder
where ExecTime is the execution time of the respective actor. It should be noted that actors
IQ and IDCT have to execute 2376 times in one iteration; the number of executions of an
actor per iteration is referred to as the repetition vector entry of the actor. The calculated
period does not include network and memory access delays. An SDFG with a throughput of
1000 Hz takes 1 millisecond (ms) to complete one iteration, i.e., its period is 1 ms.
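The relation between rates, repetition vector, period and throughput can be illustrated with the following sketch. The port rates and execution times below are assumptions chosen for illustration: they are consistent with the 2376 executions of IQ and IDCT stated above and assume that VLD and MC fire once per iteration, but they are not taken from Fig. 3.1. The period is computed as the sum of the repetition count times the execution time of each actor, as in the summation described above.

```python
from fractions import Fraction
from math import lcm

# (source, destination, production rate, consumption rate); assumed values
EDGES = [("VLD", "IQ", 2376, 1), ("IQ", "IDCT", 1, 1),
         ("IDCT", "MC", 1, 2376), ("MC", "VLD", 1, 1)]

# Hypothetical execution times in arbitrary time units
EXEC_TIME = {"VLD": 100, "IQ": 2, "IDCT": 3, "MC": 50}


def repetition_vector(edges):
    """Smallest positive integer solution of the balance equations
    repV[src] * prod == repV[dst] * cons, for a connected graph.
    Consistency of the graph is assumed, not checked."""
    ratio = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in ratio and dst not in ratio:
                ratio[dst] = ratio[src] * prod / cons
                changed = True
            elif dst in ratio and src not in ratio:
                ratio[src] = ratio[dst] * cons / prod
                changed = True
    scale = lcm(*(r.denominator for r in ratio.values()))
    return {actor: int(r * scale) for actor, r in ratio.items()}


def period(rep_v, exec_time):
    """Time of one iteration when all firings are summed up
    (network and memory access delays are ignored)."""
    return sum(rep_v[a] * exec_time[a] for a in rep_v)


rep_v = repetition_vector(EDGES)
p = period(rep_v, EXEC_TIME)
print(rep_v, p, 1 / p)   # throughput is the inverse of the period
```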
3.1.2 Heterogeneous Reconfigurable MPS Model
A Heterogeneous Reconfigurable MPS is an extension of a Reconfigurable MPS with one
or more additional types of processing elements (e.g., digital signal processors or GPUs).
The multiprocessor platform used in this chapter is a tile-based architecture, as shown in Fig. 3.2.
The platform contains three types of tiles, which are connected by an interconnection
network that facilitates communication amongst the tiles. Each tile contains a
processor (e.g., a general purpose processor (GPP), a digital signal processor (DSP) or
reconfigurable hardware (RH), as shown in Fig. 3.2), a local memory (M) and a network
interface (NI) containing a set of communication buffers that are accessed by both the
interconnect and the local processor. The interconnection network provides end-to-end
connections between the tiles. The latencies of these connections can be modeled
for different networks-on-chip (NoCs).
























Figure 3.2: An example multiprocessor platform
3.2 Proposed Mapping Strategy
This section describes our mapping strategy. It differs from existing mapping
strategies in the following aspects: 1) the design-time DSE is both energy- and
throughput-aware, 2) the DSE results contain mapping solutions for
all possible resource combinations, to cater for different resource availability
scenarios at run-time, and 3) throughput and energy optimization is also performed during
the run-time process.
An overview of our mapping flow is presented in Fig. 3.3. The overall flow has two
main steps: 1) a DSE phase at design-time (Design-time DSE) to analyze the applications,
and 2) run-time mapping of the required applications by utilizing the DSE results (Optimal
Mapping Database) with the help of a platform manager (Run-time Platform Controller
(RTPC)). In the DSE phase, multiple mapping solutions are generated for each
application to be supported on the hardware platform. The run-time phase takes the required
applications, their throughput requirements, the DSE results and the current platform status
(available resources) as input and provides an energy-optimized mapping.
3.2.1 Design-time DSE
The design-time DSE step takes the applications one after another and evaluates a
number of mapping solutions for each of them. The evaluation finds different mappings
along with their throughput and energy consumption. For each mapping, the platform
resources are allocated to the application: actors are bound to tiles, while edges are bound
to connections between tiles or to the local memory of tiles. Based on the resource allocations,
the throughput and energy consumption of the mapping are then computed.
Throughput Computation: For each mapping, a static-order schedule that
orders the execution of the bound actors on each tile is constructed first. Then, all the binding
and scheduling decisions are modeled in a graph called the binding-aware SDFG.
Thereafter, the throughput is computed by self-timed state-space exploration of the binding-aware
SDFG [60].
Energy Consumption Computation: The total energy consumption of a mapping
is computed as the sum of the communication and computation energy for one iteration of
the application. Communication energy is required to transfer data (tokens) from the source
tile to the destination tile, and computation energy is required to process the transferred
tokens on the destination tile. The communication energy for each edge (e) mapped to a
connection (c) is estimated as the product of the number of tokens (in bits) to be transferred
through c, the delay (D) and the power consumption (Pbit) for transferring one bit through c [73].
The total communication energy for all the edges is estimated from Equation 3.2. The
number of tokens for an edge is computed as the product of the repetition vector entry (repV) of
the source (or destination) actor and the source (or destination) port rate (Equation 3.1).
The computation energy for each actor (a)
mapped to tile (t) is estimated as the product of the number of executions of a (repV[a]),
its execution time (ET[a]) and the power consumption (pow) on t. ET and pow can be
different for different types of tiles. The total computation energy for all actors is estimated
from Equation 3.3. The power consumption of a tile is estimated as C × v² × f, where C, v
and f denote the average load capacitance, supply voltage and operating frequency,
respectively. In our approach, we focus on the mapping of applications on the architecture after it
is designed. Therefore, we cannot optimize static energy consumption and restrict our










































Figure 3.3: Overall flow of proposed mapping strategy
focus on optimizing only dynamic energy consumption (Ecomm + Ecomp).
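\[ nrTokens[e] = repV[e \rightarrow srcActor] \times (e \rightarrow srcPortRate) \tag{3.1} \]
\[ E_{comm} = \sum_{e} \left[ \{ nrTokens[e] \times tokenSize[e] \} \times D \times P_{bit} \right] \tag{3.2} \]
\[ E_{comp} = \sum_{a} \left[ repV[a] \times (ET[a] \rightarrow t) \times (pow \rightarrow t) \right] \tag{3.3} \]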
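As an illustration of Equations 3.1 to 3.3, the sketch below evaluates the dynamic energy of a small mapping. All names (the dictionary keys, the two actors A and B, the tile types) and all numeric values are hypothetical and use arbitrary units; only edges bound to a connection between tiles are passed to the communication energy, and a single delay D and per-bit power Pbit are assumed for all connections.

```python
def nr_tokens(edge, rep_v):
    """Equation 3.1: tokens sent over an edge in one iteration."""
    return rep_v[edge["src"]] * edge["src_rate"]


def comm_energy(connection_edges, rep_v, delay, p_bit):
    """Equation 3.2: communication energy over all edges bound to connections."""
    return sum(nr_tokens(e, rep_v) * e["token_size"] * delay * p_bit
               for e in connection_edges)


def comp_energy(binding, rep_v, exec_time, power):
    """Equation 3.3: computation energy; binding maps actor -> tile type."""
    return sum(rep_v[a] * exec_time[a][t] * power[t]
               for a, t in binding.items())


def tile_power(capacitance, voltage, frequency):
    """Dynamic power of a tile, estimated as C * v^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical two-actor application: A sends 4 tokens of 8 bits to B per iteration
REP_V = {"A": 1, "B": 4}
EDGES = [{"src": "A", "dst": "B", "src_rate": 4, "token_size": 8}]
EXEC_TIME = {"A": {"GPP": 10}, "B": {"DSP": 5}}
POWER = {"GPP": 2, "DSP": 1}
BINDING = {"A": "GPP", "B": "DSP"}

e_comm = comm_energy(EDGES, REP_V, delay=2, p_bit=3)
e_comp = comp_energy(BINDING, REP_V, EXEC_TIME, POWER)
print(e_comm, e_comp, e_comm + e_comp)
```

If both actors were bound to the same tile, the edge would be bound to local memory instead of a connection and would be left out of the communication energy sum.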
The proposed DSE flow takes an application and a generic platform model as input and
performs exploration to evaluate mappings while optimizing for both throughput and
energy consumption (Fig. 3.3). The considered heterogeneous platform contains a number of
tiles that depends on the number of actors (n) and the implementation alternatives provided in the
application. To cover all potential mappings for the different possible resource
combinations, a platform with n tiles of each implementation alternative is considered.
Since this platform can exploit all the parallelism present in the application,
considering a bigger platform is not necessary. A smaller platform, on the other hand,
might not exploit all the parallelism, as concurrently executing tasks may get mapped on
the same tile.
The considered platform contains tiles that are separated by a fixed distance from
each other, referred to as the hop distance in this work. Initially, the hop distance is
set to one. However, at run-time, a real-life platform might have available tiles at
varying hop distances; for example, a 2×2 grid (mesh) of tiles may have some
available tiles separated by a hop distance of 1 and others by a hop distance of 2. To
cope with the distance variation between available tiles at run-time, our DSE flow is
repeated for all possible hop distances in the expected target hardware platform.
For example, two available tiles of a 4×4 mesh platform may have a hop distance
varying from 1 to 6, so our DSE is repeated 6 times (for hop distances 1 to
6) to account for the varying resource availability scenarios that might occur at run-time. By
performing the DSE with higher hop distances, the DSE results become applicable
to bigger platforms, but the evaluation time increases. For example, DSE
results (evaluated mappings) with hop distances up to 8 are applicable to any platform
where the maximum separation between tiles is less than or equal to 8 hops, such as meshes
of 2×2, 3×3, 4×4 and 5×5 tiles. The main steps of the DSE flow (shown
in Fig. 3.3) are described subsequently.
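The hop distance figures quoted above can be checked with a short calculation. The sketch assumes that the hop distance between two tiles of a mesh is their Manhattan distance, which is an assumption about the routing and is not stated explicitly in the text; the function names are illustrative.

```python
def max_hop_distance(rows, cols):
    """Largest Manhattan distance between two tiles of a rows x cols mesh
    (corner to opposite corner)."""
    return (rows - 1) + (cols - 1)


def covered_square_meshes(max_hops, largest=10):
    """Sizes k of k x k meshes whose maximum tile separation does not
    exceed the largest hop distance explored during DSE."""
    return [k for k in range(2, largest + 1)
            if max_hop_distance(k, k) <= max_hops]


print(max_hop_distance(4, 4))      # hop distances 1..6 must be explored
print(covered_square_meshes(8))    # mesh sizes covered by DSE up to 8 hops
```

Under this assumption, a 4×4 mesh requires hop distances 1 to 6, and DSE results up to 8 hops cover square meshes of 2×2 to 5×5 tiles, in agreement with the examples above.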
Single Tile-type Evaluation
The mappings using a single tile type (homogeneous tiles) are generated by using
the DSE strategy proposed in [147], as it discards the evaluation of inefficient mappings
(providing less throughput) and performs faster evaluation without missing the efficient
mappings. First, the 1 actor-to-1 tile mapping is evaluated, where the n actors of the application
are mapped onto n homogeneous tiles so that each tile contains exactly one actor and
the edges are mapped onto connections. Then, mappings using a reduced number of tiles
(p = n − 1) are evaluated by taking the best mapping using (p + 1) tiles as input. For
each pair of the (p + 1) tiles, all the actors from one tile are moved to the other to generate
a new mapping. A total of (p + 1)-choose-2, i.e., (p+1)C2, unique pairs exist for
p + 1 tiles, and the same number of mappings using p tiles are evaluated. Out of all the
evaluated mappings using p tiles, the best mapping is chosen to evaluate mappings with a
further reduced tile count, i.e., mappings using p − 1 tiles, by following the same steps.
The process is repeated until the mapping using one tile is evaluated. Thus,
mappings for all numbers of tiles are evaluated. The strategy in [147]
chooses the maximum-throughput mapping as the best one, as its optimization goal is
throughput only. In contrast, we choose as the best mapping the one with the maximum
throughput-to-energy ratio in order to perform throughput- and energy-aware exploration. The same
exploration process is applied to the different types of tiles one after another to
obtain the homogeneous tiles mappings for each tile type.
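The single tile-type exploration described above can be sketched as follows. The sketch is not the implementation of [147]: the mapping representation (a tuple of actor sets, one per tile), the function names and the toy evaluation function, which stands in for the throughput and energy analysis, are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical execution times of four actors on the considered tile type
EXEC = {"a": 4.0, "b": 1.0, "c": 2.0, "d": 3.0}


def toy_evaluate(mapping):
    """Stand-in for the real analysis: throughput is limited by the most
    loaded tile, energy grows with the number of used tiles."""
    load = max(sum(EXEC[a] for a in tile) for tile in mapping)
    energy = sum(EXEC.values()) + 0.5 * len(mapping)
    return 1.0 / load, energy


def explore_homogeneous(actors, evaluate):
    """Return the best mapping per tile count and the number of evaluations."""
    mapping = tuple(frozenset([a]) for a in actors)   # 1 actor-to-1 tile
    evaluate(mapping)
    evaluated = 1
    best = {len(mapping): mapping}
    while len(mapping) > 1:
        candidates = []
        # Merge every pair of tiles of the current best mapping
        for i, j in combinations(range(len(mapping)), 2):
            rest = tuple(t for k, t in enumerate(mapping) if k not in (i, j))
            cand = rest + (mapping[i] | mapping[j],)
            throughput, energy = evaluate(cand)
            candidates.append((throughput / energy, cand))
            evaluated += 1
        # Keep the candidate with the maximum throughput-to-energy ratio
        mapping = max(candidates, key=lambda c: c[0])[1]
        best[len(mapping)] = mapping
    return best, evaluated


best, evaluated = explore_homogeneous(["a", "b", "c", "d"], toy_evaluate)
print(evaluated)
```

For n = 4 actors, the sketch evaluates 1 + 6 + 3 + 1 = 11 mappings, i.e., the initial mapping plus (p+1)C2 mappings for each reduced tile count p.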
Multiple Tile-type Evaluation
Our strategy finds the most efficient mappings for all heterogeneous resource
combinations by using the homogeneous tiles mappings calculated in the earlier step. In a
general platform architecture (A) with m tile-types, all the resource combinations are
represented by an m-dimensional array A(t1, t2, · · · , tm), where ti is the number of used
tiles of the i-th tile-type. If P(k, n) denotes the number of ways to partition n balls into k
slots, then the number of resource combinations in our generic platform with m tile-types
that can cover n-actor applications is $\sum_{k=1}^{n} P(k, n)$. For example, Table 3.1
presents all the possible resource combinations when a 5-actor application and 3 tile-types
(GPP, DSP, RH) are considered.
Table 3.1: Example of 5 actors and 3 tile-types
GPP       DSP       RH        GPP+DSP   DSP+RH    GPP+RH    GPP+DSP+RH
A(5,0,0)  A(0,5,0)  A(0,0,5)  A(4,1,0)  A(0,4,1)  A(1,0,4)  A(3,1,1)
                              A(3,2,0)  A(0,3,2)  A(2,0,3)  A(2,2,1)
                              A(2,3,0)  A(0,2,3)  A(3,0,2)  A(2,1,2)
                              A(1,4,0)  A(0,1,4)  A(4,0,1)  A(1,1,3)
                                                            A(1,2,2)
                                                            A(1,3,1)
A(4,0,0)  A(0,4,0)  A(0,0,4)  A(3,1,0)  A(0,3,1)  A(1,0,3)  A(2,1,1)
                              A(2,2,0)  A(0,2,2)  A(2,0,2)  A(1,1,2)
                              A(1,3,0)  A(0,1,3)  A(3,0,1)  A(1,2,1)
A(3,0,0)  A(0,3,0)  A(0,0,3)  A(2,1,0)  A(0,2,1)  A(1,0,2)  A(1,1,1)
                              A(1,2,0)  A(0,1,2)  A(2,0,1)
A(2,0,0)  A(0,2,0)  A(0,0,2)  A(1,1,0)  A(0,1,1)  A(1,0,1)
A(1,0,0)  A(0,1,0)  A(0,0,1)
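The entries of Table 3.1 can be reproduced by enumerating all tuples (t1, ..., tm) whose total number of used tiles lies between 1 and n. The following sketch does this by brute force; the function name and the grouping by the number of used tile-types are illustrative choices, not part of the proposed flow.

```python
from itertools import product


def resource_combinations(n_actors, n_tile_types):
    """All tuples A(t1, ..., tm) that use between 1 and n_actors tiles."""
    return [t for t in product(range(n_actors + 1), repeat=n_tile_types)
            if 1 <= sum(t) <= n_actors]


combos = resource_combinations(5, 3)   # tile-type order: (GPP, DSP, RH)

# Group by the number of tile-types that are actually used
by_used_types = {}
for t in combos:
    used = sum(1 for x in t if x > 0)
    by_used_types.setdefault(used, []).append(t)

print(len(combos))
print({k: len(v) for k, v in sorted(by_used_types.items())})
```

For 5 actors and 3 tile-types this yields 55 combinations: 15 homogeneous ones, 30 that use two tile-types and 10 that use all three, which matches the columns of Table 3.1.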
To analyze the heterogeneous tiles mappings, we introduce a heuristic approach that
finds the most efficient mapping for all resource combinations while evaluating a
manageable number of mappings. The essence of our heuristic is the generation step, denoted
by the procedure Generate[A(..., ti, ..., tj, ...) → A(..., ti − 1, ..., tj + 1, ...)], which is
presented in Algorithm 1. This procedure takes the best mapping of the previous
resource combination A(..., ti, ..., tj, ...) as input and constructs the best mapping for the
new resource combination A(..., ti − 1, ..., tj + 1, ...). In each execution of the generation
procedure, the number of used tiles of one tile-type is incremented (the destination
tile-type, j-th), while that of another tile-type is decremented (the source
tile-type, i-th).
Given the best mapping for a resource combination, the algorithm first finds the first
empty tile p of the destination tile-type. Thereafter, mappings for the new resource combination
are generated by moving all actors from each tile of the source tile-type to the destination
tile p. The throughput, energy and metric µ of each mapping are computed, stored in
our mapping database and compared with the current best mapping solution
(bestMapping). If the current mapping has a higher µ than bestMapping, it becomes the new
bestMapping and is used for comparison with the subsequent mapping options. At the end of
the generation procedure, bestMapping is the most efficient mapping for the new resource
combination and is stored in the Optimal Mapping Database. By selecting the mapping
with the maximum µ (throughput/energy) at each stage, our heuristic avoids
evaluating a large number of inefficient mappings. Hence, the evaluation time is reduced
significantly.
Algorithm 1 Procedure Generate[A(..., ti, ..., tj, ...) → A(..., ti − 1, ..., tj + 1, ...)]
Input: best mapping for A(..., ti, ..., tj, ...)
Output: best mapping for A(..., ti − 1, ..., tj + 1, ...)
bestMapping = 0, max = 0;
Find free tile p ∈ j-th tile-type
for u = 1 to ti do
    Move all actors from tile u to tile p to generate new mapping b
    Compute throughput and energy for b
    Compute metric µ = throughput / energy
    if µ > max then
        max = µ, bestMapping = b
    end if
end for
Store bestMapping as optimal solution for A(..., ti − 1, ..., tj + 1, ...)
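The generation procedure can be sketched in Python as follows. This is a minimal illustration under several assumptions: a mapping is represented as a dictionary from (tile-type, index) to the set of actors on that tile, the platform offers tiles_per_type tiles of each type, and toy_evaluate with its execution times and power values is a hypothetical stand-in for the throughput and energy analysis.

```python
# Hypothetical execution times per tile type and power per tile type
ET = {"a": {"GPP": 4.0, "DSP": 2.0}, "b": {"GPP": 2.0, "DSP": 2.0}}
POW = {"GPP": 1.0, "DSP": 1.0}


def toy_evaluate(mapping):
    """Stand-in analysis: throughput is limited by the most loaded tile,
    energy is the sum of execution time times power over all actors."""
    loads = [sum(ET[a][typ] for a in actors)
             for (typ, _), actors in mapping.items()]
    energy = sum(ET[a][typ] * POW[typ]
                 for (typ, _), actors in mapping.items() for a in actors)
    return 1.0 / max(loads), energy


def generate(best_mapping, src_type, dst_type, evaluate, tiles_per_type):
    """One generation step: one tile less of src_type, one more of dst_type."""
    used_dst = {idx for (typ, idx) in best_mapping if typ == dst_type}
    # First free tile p of the destination tile-type
    p = (dst_type, next(i for i in range(tiles_per_type) if i not in used_dst))
    best, best_mu = None, 0.0
    for u in [t for t in best_mapping if t[0] == src_type]:
        b = dict(best_mapping)
        b[p] = b.pop(u)                # move all actors from tile u to tile p
        throughput, energy = evaluate(b)
        mu = throughput / energy
        if mu > best_mu:
            best, best_mu = b, mu
    return best, best_mu


# A(2,0) -> A(1,1) for tile-types (GPP, DSP)
start = {("GPP", 0): frozenset({"a"}), ("GPP", 1): frozenset({"b"})}
new_best, mu = generate(start, "GPP", "DSP", toy_evaluate, tiles_per_type=2)
print(new_best, mu)
```

In this toy case two candidates are evaluated, one per used GPP tile, and moving actor a to the DSP tile gives the higher throughput-to-energy ratio.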
Our strategy to evaluate mappings using different types of tiles is presented in
Algorithm 2. First, the heuristic iterates through all resource combinations with different
numbers of used tile-types m′ and total numbers of used tiles, referred to as tile_count. tile_count
varies from the number of actors in the application, n, down to the number of used tile-types m′. Then
we consider all the resource combinations that use m′ tile-types out of the given m tile-types.
The number of such combinations is mCm′. Thereafter, the algorithm conducts
the generation procedure for tile-type i1 and tile-type im′. We define q as the total
number of tiles used in tile-type i1 and tile-type im′; hence, tile_count − q is the number of
tiles available for the remaining (m′ − 2) tile-types. To cover all the resource combinations, the
main generation procedure (explained previously) Generate[A(..., ti1, ..., tim′, ...) →
A(..., ti1 − 1, ..., tim′ + 1, ...)] is repeated for all partitions of the (tile_count − q)
tiles into the (m′ − 2) tile-types that do not participate in the generation step. In the case of
m′ = 2, the partition P(m′ − 2, tile_count − q) is not defined, so the generation step is
performed outside the partition loop (branch if m′ = 2). Algorithm 2 ensures that all the input
mappings for the generation steps are available in the optimal mapping database before they are used.
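For the two tile-type case (m′ = 2), the order of the generation steps can be listed with the short sketch below. It mirrors only the loop structure described above for one ordered pair of tile-types. The count of evaluated mappings assumes that each generation step evaluates one mapping per used tile of the source tile-type; it is derived from that loop structure and is not copied from the complexity equations of this chapter.

```python
def two_type_generation_steps(n_actors):
    """List of (input combination, output combination) pairs A(t_src, t_dst)
    visited for one ordered pair of tile-types when m' = 2."""
    steps = []
    for tile_count in range(n_actors, 1, -1):
        for t_src in range(tile_count, 1, -1):
            t_dst = tile_count - t_src
            steps.append(((t_src, t_dst), (t_src - 1, t_dst + 1)))
    return steps


def inputs_available(steps):
    """Check that every input is either a homogeneous mapping (t_dst == 0)
    or the output of an earlier generation step."""
    produced = set()
    for src, dst in steps:
        if src[1] != 0 and src not in produced:
            return False
        produced.add(dst)
    return True


steps = two_type_generation_steps(3)
evaluated = sum(src[0] for src, _ in steps)   # one mapping per source tile
print(steps, evaluated, inputs_available(steps))
```

For n = 3, the steps are A(3,0) → A(2,1), A(2,1) → A(1,2) and A(2,0) → A(1,1), so every input is either a homogeneous mapping or the result of an earlier step.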
DSE Complexity
The complexity of our algorithm depends on the number of actors n, the number of
tile-types m, and the maximum hop distance h considered for the DSE. Table 3.2 introduces
the notation used for the complexity calculation. The complexity is calculated
in terms of the number of mappings evaluated during the DSE. The number of
homogeneous tiles mappings is calculated by Equation 3.4. In the heterogeneous case, the
number of mappings is computed based on the observation that, in each generation step,
the number of evaluated mappings equals the number of used tiles of the source tile-type.
The number of heterogeneous mappings for combinations of 2 tile-types is calculated
by Equation 3.5. In general, the number of mappings is calculated by Equation 3.6, where
P(m′ − 2, tile_count − q) is the number of ways to partition (tile_count − q) tiles into
(m′ − 2) tile-types if (m′ > 3); otherwise, P(m′ − 2, tile_count − q) = 1. The total
number of mappings is calculated as the sum of all homogeneous and heterogeneous tiles
mappings by Equation 3.7.
Table 3.2: Notations to be used
Notation  Meaning
m         total number of tile-types in platform
n         total number of actors in application
m′        number of used tile-types
tc        number of used tiles in platform (tile_count)
ti1       number of used tiles of i1-th tile-type
q         total number of used tiles of i1-th tile-type and im′-th tile-type
\[ C(1, m, n) = m \left[ 1 + \sum_{p=1}^{n-1} \binom{p+1}{2} \right] \tag{3.4} \]



























P (m′ − 2, tc− q) ∗ q







It can be seen from Equation 3.6 that the total number of mappings is related to
the partition problem solution. Therefore, the general expression for M(m,n) can be
derived if the analytical formula of P (k, n) is available. Based on formulas of P (k, n)
reported in [19], Table 3.3 presents the complexity of our algorithm for m = 1 to 7,
Algorithm 2 Algorithm for multiple tile-type combination
Input: best GPP tile mapping from database
Output: most efficient mapping for multiple tile-type
for m′ = 2 to m do
    for tile_count = n downto m′ do
        for all combinations of m′ used tile-types from m tile-types do
            if m′ = 2 then
                for ti1 = tile_count downto 2 do
                    tim′ = tile_count − ti1
                    Generate[A(..., ti1, ..., tim′, ...) → A(..., ti1 − 1, ..., tim′ + 1, ...)]
                end for
            else
                for q = tile_count − m′ + 2 downto 2 do
                    for all partition ways of (tile_count − q) tiles into (m′ − 2) tile-types do
                        for ti1 = q downto 2 do
                            tim′ = q − ti1
                            Generate[A(..., ti1, ..., tim′, ...) → A(..., ti1 − 1, ..., tim′ + 1, ...)]
                        end for
                    end for
                end for
            end if
        end for
    end for
end for
Table 3.3: Complexity
m   P(m − 2, n), Ref. [19]                                      Θ(P(m − 2, n))   M(m, n)   Θ(M(m, n))
4   ⌊n/2⌋ + 1                                                   n                M(4, n)   n⁵
5   {(n + 3)²/12}                                               n²               M(5, n)   n⁶
6   {(n + 5)(n² + n + 22 + 18⌊n/2⌋)/144}                        n³               M(6, n)   n⁷
7   {(n + 8)(n³ + 22n² + 44n + 248 + 180⌊n/2⌋)/2880}            n⁴               M(7, n)   n⁸
where Θ(M(m, n)) and Θ(P(m − 2, n)) denote the complexity of the whole algorithm
and of the partition problem, respectively.
3.2.2 Run-time Mapping
Run-time mapping of applications onto the platform is handled by the Run-time Platform
Controller (RTPC) (Fig. 3.3). In the platform, one processor is used as the RTPC
(manager), which is responsible for actor mapping, actor scheduling, platform resource control
and configuration control. The status of the resources is updated at run-time whenever an actor is
loaded onto the platform. The RTPC maps the applications on the platform one after
another until all the applications are mapped. This sequential mapping is scalable, as it avoids
the overhead of considering a large number of scenarios containing different
simultaneously active applications. For each application, the RTPC takes its desired throughput,
the platform with the updated resource status and the optimal mapping database (OMDb) as
input (Fig. 3.3) and selects the best mapping satisfying the desired throughput by
following Algorithm 3.
The algorithm selects a mapping having minimum energyConsumption from the
OMDb by iterating tile_count from one to Max_Used_Tiles. The mapping obtained
by this iteration uses the minimum possible number of tiles, resulting in improved
resource utilization. Max_Used_Tiles is set to min(number of actors in the
application, number of available tiles) to avoid unnecessary searches in the OMDb. The
existing approaches allocate actors to tiles based on a selected mapping but do not
consider the relative position of actors, which might require a large amount of communication
energy to facilitate communication amongst them through the edges. In our approach,
we allocate highly communicating actors in close proximity by following Algorithm
3 in order to save communication energy. If no throughput-satisfying mapping is
found, the application cannot be supported with the available platform resources. In
general, throughput computation for a mapping is a time-consuming process. Our approach
selects the best mapping without any throughput computation at run-time
and thus accelerates the overall run-time mapping process. Further, our approach
uses the minimum possible number of tiles and performs energy-aware allocation,
which facilitates efficient mapping.
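The selection step described above can be sketched as follows. This is an illustrative sketch only: the `Mapping` record, the dictionary-based representation of free tiles and the function name `select_mapping` are assumptions made for this example and are not taken from the SDF3 implementation. Throughput and energy are read from the database, so nothing is recomputed at run-time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mapping:
    tiles: Dict[str, int]   # tile-type -> number of tiles used
    throughput: float       # computed at design-time and stored in the database
    energy: float           # computed at design-time and stored in the database

    @property
    def tile_count(self) -> int:
        return sum(self.tiles.values())


def select_mapping(omdb: List[Mapping], free_tiles: Dict[str, int],
                   num_actors: int, desired_throughput: float) -> Optional[Mapping]:
    """Return the minimum-energy mapping that uses the fewest tiles, or None."""
    max_used_tiles = min(num_actors, sum(free_tiles.values()))
    for tile_count in range(1, max_used_tiles + 1):
        candidates = [
            m for m in omdb
            if m.tile_count == tile_count
            and m.throughput >= desired_throughput
            # the required tiles of every type must currently be free
            and all(free_tiles.get(t, 0) >= c for t, c in m.tiles.items())
        ]
        if candidates:
            return min(candidates, key=lambda m: m.energy)
    return None  # the application cannot be supported with the free resources
```

Because the loop stops at the first tile count that yields a feasible mapping, a mapping with more tiles and lower energy is deliberately not considered, which mirrors the preference for minimum tile usage stated above.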
Fig. 3.4 demonstrates an example of run-time mapping of the H.263 decoder on a platform
when employing the existing approach and our approach. Although the communication
overhead (in bits) of e3 (between RH (containing MC) and GPP (containing IDCT)) is greater
than the communication overhead (in bits) of e4 (between DSP (containing IQ, VLD) and
RH (containing MC)), the existing work assigns the connected actors onto two tiles with
hop distance = 2. In contrast, our strategy considers the communication overhead between
actors and tries to map highly communicating actors (on RH and GPP) close to each
other, so that energy consumption can be further reduced.
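The idea of placing heavily communicating actors close together can be illustrated with a small greedy sketch. It is not the thesis implementation: the mesh coordinates, the Manhattan hop metric, the function names and the bit counts in the usage note are all assumptions chosen for illustration. Edges are visited in descending order of transferred bits, and each unplaced endpoint is put on the free tile closest to its already placed peer.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]
Edge = Tuple[str, str, int]   # (source tile group, destination tile group, bits)


def hops(a: Position, b: Position) -> int:
    """Hop distance on a mesh, assumed here to be the Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(placement: Dict[str, Position], edges: List[Edge]) -> int:
    """Transferred bits weighted by hop distance, a simple proxy for energy."""
    return sum(bits * hops(placement[s], placement[d]) for s, d, bits in edges)


def place_by_edge_weight(edges: List[Edge],
                         free_positions: List[Position]) -> Dict[str, Position]:
    """Greedy placement; assumes there are enough free positions for all groups."""
    free = list(free_positions)
    placement: Dict[str, Position] = {}
    for src, dst, _bits in sorted(edges, key=lambda e: e[2], reverse=True):
        for node, peer in ((src, dst), (dst, src)):
            if node in placement:
                continue
            if peer in placement:
                pos = min(free, key=lambda p: hops(p, placement[peer]))
            else:
                pos = free[0]   # first group of the heaviest edge: any free tile
            placement[node] = pos
            free.remove(pos)
    return placement
```

For example, with the hypothetical edges `[("DSP", "RH", 400), ("RH", "GPP", 1200)]` on three tiles in a row, the heavier RH to GPP edge ends up on adjacent tiles and the lighter edge absorbs the two-hop distance.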
3.3 Performance Evaluation
Our strategy has been implemented as an extension of the tool set SDF3, which is pub-
licly available [162]. The experiments are conducted on a Core i5 processor running at
2.4 GHz. As a benchmark, models of real-life multimedia applications H.263 decoder
Algorithm 3 Run-time Mapping
Input: optimal mapping database for single tile-type combination
Output: optimal mapping database for multiple tile-type combination
for tile_count = 1 to Max_Used_Tiles do
  for each mapping µ using tile_count tiles in OMDb do
    Select closest available tile_count tiles used by µ in the platform;
    max_hop = findMaximumHop(selected tiles);
    Mapping_list = Find all throughput satisfying mappings that use tile_count
        tiles separated by max_hop and have the same resource combination as µ;
    if Mapping_list != NULL then
      Select the mapping having minimum energyConsumption;
      Edge_list = Find edges mapped to connections in mapping;
      Sort Edge_list in descending order of number of transferred bits;
      for each edge e in Edge_list do

Figure 3.4: Example of Run-time Mapping
(4 actors), H.263 encoder (5 actors), MPEG-4 decoder (5 actors), JPEG decoder (6 actors)
and sample rate converter (6 actors) have been considered to examine the efficiency
of the proposed strategy. MPEG-4 decoder, JPEG decoder, and sample rate converter will
also be referred to as MPEG, JPEG, and Samplerate, respectively. All the applications
are mapped onto a generic platform with 3 tile-types: GPP, DSP and
RH. A larger number of tile-types can also be considered, as explained earlier. We assume
that all actors of the applications can be implemented on these tile-types and that their
execution times on the different tile-types are known a priori. Since we consider a generic
platform as mentioned in Section IV, the maximum number of processing elements in the
platform depends on the number of actors in the evaluated applications. In the experiments, we
compare our approach with the flows reported in [159] and in [148]. Since the strategy
in [159] considers mapping for scenarios, we applied it to a single scenario, i.e., a single
version of the application that always has the same behavior. The approach in [148]
performs an optimization similar to ours and has therefore been considered for the comparison.
This gives a fair comparison across all approaches. Several experiments have been
performed to evaluate these strategies in terms of throughput, energy consumption and
execution time.
The throughput and energy of the mappings produced by the different DSE flows are
calculated by the SDF3 tool set [162], which is modified according to the evaluated mapping
algorithms. The results for the MPEG decoder at the different possible resource combinations
are illustrated in Fig. 3.5. In this experiment and later in Fig. 3.7, P0, P1 and P2 represent
the 3 types of Processing Element (PE): GPP, DSP and RH, while each resource combination is
referred to as "iP0+jP1+kP2", where i, j, k are the numbers of PEs of each type. Our strategy
provides throughput and energy values at all the resource combinations, which are
shown by two continuous lines. In contrast, the other flows cannot cover all the resource
combinations, so they provide discrete points of throughput and energy values, and there
are no values at uncovered resource combinations. It can be seen that the mappings from
our strategy have lower energy consumption while maintaining the throughput at almost
the same level as that of the other flows. Moreover, we have computed the energy saving
of our DSE over the DSE strategy in [148] to illustrate the improvements. The results
show that our DSE strategy reduces the energy consumption of H263 Decoder,
H263 Encoder, JPEG, MPEG, and Samplerate by 11.32%, 12.63%, 8.26%, 24.93%, and
14.45%, respectively.
For different multimedia applications, Table 3.4 shows the number of resource combinations
covered by the different DSE flows. The number of resource combinations depends
on the number of actors in the applications. The strategy in [159] misses a large
number of resource combinations since it considers only load-balanced mappings and
there are a lot of duplications generated by its flow. The strategy in [148] has bet-

Figure 3.5: Throughput and energy of MPEG
Table 3.4: Covered resource combinations
Applications    Our Flow    Flow in [148]    Flow in [159]
H263 Decoder    34          24               10
H263 Encoder    55          34               10
MPEG            55          34               11
JPEG            83          44               11
Samplerate      83          42               10
(H263 Encoder, MPEG) and 6 actors (JPEG, Samplerate) respectively. In contrast, our
approach is designed to cover all the resource combinations for all the applications. The
number of covered resource combinations is important for a hybrid mapping strategy since
it determines the flexibility of the Platform Manager at run-time under the resource
constraints. Since our flow provides more mapping options at run-time, the RTPC is better
supported.
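The "Our Flow" column of Table 3.4 can be reproduced with a simple enumeration. The sketch below assumes that a resource combination is any choice of tile counts (i, j, k) over the three tile-types that uses at least one tile and at most as many tiles as the application has actors. This counting rule is an assumption made here; it happens to give 34, 55 and 83 combinations for 4, 5 and 6 actors. The function names and the label format with zero counts included are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def resource_combinations(num_actors: int,
                          num_tile_types: int = 3) -> List[Tuple[int, ...]]:
    """All tile-count vectors using between 1 and num_actors tiles in total."""
    combos = []
    for counts in product(range(num_actors + 1), repeat=num_tile_types):
        if 1 <= sum(counts) <= num_actors:
            combos.append(counts)
    return combos


def label(counts: Tuple[int, ...]) -> str:
    """Format a combination in the 'iP0+jP1+kP2' style used in the figures."""
    return "+".join(f"{c}P{i}" for i, c in enumerate(counts))
```

A flow that covers every element of `resource_combinations(n)` leaves the run-time manager with a stored mapping for any subset of tiles that happens to be free.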
One of the most important features that define the efficiency of a DSE strategy is
the number of evaluations performed by the strategy. A DSE with exhaustive search
analyzes all the possible mappings for each resource combination. Therefore, it does not
scale well with the number of actors in the application or the number of tile-types in the
platform. Moreover, a large number of evaluations requires more computation power and
evaluation time at design-time, and more storage memory and more search time in the memory
at run-time. On the other hand, heuristic DSE approaches significantly reduce the number
of evaluated mappings but might not provide an optimal mapping for run-time [159]
or might discard mappings at several resource combinations [148]. Table 3.5 shows the
number of mappings evaluated by the different DSE strategies when three types of tile are
considered.
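To give a feeling for how quickly an exhaustive search grows, the sketch below counts mappings under one simple model: the actors are split into k non-empty groups (one group per used tile) and each group is assigned one of the available tile-types. This model is an assumption made here, not a formula stated in the text, although it reproduces the EDSE column of Table 3.5 for 1 to 9 actors with three tile-types. The function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: ways to split n actors into k groups."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # The n-th actor either joins one of the k existing groups or opens a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def exhaustive_mappings(num_actors: int, num_tile_types: int = 3) -> int:
    """Mappings visited by an exhaustive search under the grouping model above."""
    return sum(stirling2(num_actors, k) * num_tile_types ** k
               for k in range(1, num_actors + 1))
```

The count grows faster than exponentially in the number of actors, which is consistent with the observation that an exhaustive flow becomes impractical beyond roughly ten actors.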
It can be seen from Table 3.5 that the number of mappings evaluated by exhaustive
Table 3.5: Number of mappings with three tile-types
Number of Actors    EDSE               Strategy in [159]    Strategy in [148]    Our strategy
1                   3                  3                    3                    3
2                   12                 42                   10                   12
3                   57                 180                  25                   38
4                   309                372                  51                   90
5                   1,866              615                  91                   178
6                   12,351             918                  148                  313
7                   88,563             1,281                225                  507
8                   681,870            1,704                325                  773
9                   5,597,643          2,187                451                  1,125
10                  48,718,569         2,730                606                  1,578
14                  461,101,962,108    5,502                1,576                4,735
DSE (EDSE) increases exponentially with the number of actors. Therefore, when the number
of actors is large (greater than 10), the exhaustive flow cannot be executed within a
reasonable time. The flow in [159] significantly reduces the number of mappings when
compared to EDSE. However, it still evaluates a large number of mappings in comparison
with our strategy and the strategy in [148]. Although the strategy in [148] is the best flow
in terms of the number of mappings, it does not cover all the resource combinations. The
number of mappings evaluated by our strategy lies between those of the flows in [159] and
[148], but our flow provides mappings of better quality, as demonstrated previously. The
number of mappings is closely related to the execution time of the strategies. Fig. 3.6 shows
the execution time of the different DSE strategies for different applications. Our strategy
provides a speed-up over the strategy in [159], but spends more time to analyze mappings for all

Figure 3.6: Execution time of different DSE strategies
To show the improvement of our flow in terms of energy consumption, we compare
our results with the existing hybrid approach of [148]. Our design-time DSE approach
shows significant energy savings for all the considered applications when compared to
existing approaches, as mentioned earlier. Table 3.6 presents the energy saving at run-time
obtained by our flow for mapping different applications when compared with the
flow in [148]. At run-time, the main goal of our technique is to reduce the communication
energy by allocating highly communicating actors close to each other. Our
technique provides energy savings over existing techniques when at least 3 tiles are used
in the mapping. If fewer than 3 tiles are used, there is no edge for which the communication
overhead can be reduced and our approach provides results similar to those of [148].
In particular, in applications (H263 Encoder, JPEG, Samplerate) where the communication
overhead is high, our technique achieves larger energy savings (up to
17.8%). Similar improvements are obtained for other applications (Table 3.6).
We have also evaluated the efficiency of choosing the parameter µ (throughput/energy) in
the optimization process. We evaluated our strategy with three different optimization criteria:
throughput, energy, and µ. Fig. 3.7 shows the throughput and energy of the best mappings
at different resource combinations for the H263 decoder when different parameters
are chosen. The DSE optimized by energy always provides better results than DSE with

Figure 3.7: Throughput and energy consumption for H263 Decoder at different resource
combinations for different optimization criteria
can be made when throughput is chosen as the optimization criterion. When µ is chosen
as the optimization criterion, the energy results lie between those of Throughput and
Energy Optimization and almost overlap those of Energy Optimization. In terms of
throughput, the µ option sometimes obtains better
results than the Throughput Optimization approach. Due to the heuristic behavior
of our approach, the Throughput Optimization might drop several mapping options and
miss some optimal points which can be found by the µ-Optimization. As a result, using
µ as the guideline for the optimization process generates design points with a better
trade-off between throughput and energy.
3.4 Summary
This chapter presents an efficient mapping strategy for heterogeneous Reconfigurable
MPS platforms. Since the reconfigurable fabric significantly increases the heterogeneity
of the platform, it poses a serious challenge for finding all available resource combinations.
However, our mapping approach covers all the resource combinations at design-time
within a small evaluation time. The quality of the mappings in terms of throughput
and energy is demonstrated by experiments on a series of real-life streaming applications.
In particular, our DSE takes the trade-off between throughput and energy as the optimization
Table 3.6: Energy saving using our runtime technique
Application     Number of tiles    Energy consumption (mJ)                Improvement (%)
                                   Strategy in [148]    Our strategy
H263 Decoder    4 tiles            2.909                2.872             2.82
                3 tiles            2.827                2.786             1.45
H263 Encoder    5 tiles            6.072                5.038             17.03
                4 tiles            5.814                4.779             17.80
                3 tiles            5.038                4.521             10.26
JPEG            6 tiles            0.365                0.334             8.56
                5 tiles            0.360                0.328             8.89
                4 tiles            0.354                0.323             8.76
                3 tiles            0.344                0.318             7.56
MPEG            5 tiles            8.131                7.86              3.33
                4 tiles            8.053                7.821             2.88
                3 tiles            8.015                7.783             2.89
Samplerate      6 tiles            5.857                5.323             9.12
                5 tiles            5.768                5.234             9.26
                4 tiles            5.590                5.145             7.96
                3 tiles            5.501                5.056             8.09
criterion, so that the mapping results achieve a more balanced performance. Moreover,
our run-time mapping technique further reduces the energy consumption of the system
by considering the communication overhead in real time. The experimental results
show that the proposed strategy provides better energy savings and performance than
existing approaches: it achieves better energy-throughput trade-off points, covers all the
resource combinations and reduces energy consumption by up to 24.93% at design-time and
by an additional 17.8% at run-time when compared to state-of-the-art techniques [122].




As mentioned in Chapter 1, we focus on two important optimization objectives and
mainly explore the trade-off between performance and energy consumption. In Chapter
3, where the mapping algorithm is used for resource allocation at the platform level, those
objectives are represented by throughput and dynamic energy consumption. After the
mapping process, a number of tasks may be assigned to physical hardware units, where
a scheduling mechanism is required to define the execution order of those tasks on each
computing unit. In this chapter, we propose a scheduling technique to further reduce the
total energy by targeting the static (leakage) energy at a finer-grained level of the
reconfigurable device, while still maintaining the performance requirement on schedule length.
As defined in Chapter 1, our Reconfigurable MPS includes dynamic partially reconfigurable
FPGA devices, i.e., a configuration can be loaded into part of the device while the
rest of the system continues operating. This feature provides greater flexibility
and more powerful computing ability. However, these advantages come with additional
problems related to reconfiguration time and power dissipation. A drawback of the FPGA
due to its hardware redundancy is its inefficiency in terms of power consumption when
compared to ASIC components [86] [145]. In practice, an FPGA circuit implementation
may use only a fraction of the hardware resources, but power is dissipated in both the
used and the unused components. The total power consumption includes static (leakage)
and dynamic power [149], and their contributions to the total power consumption
depend heavily on the circuit technology. Beyond 65 nm technology, leakage power
becomes an increasingly dominant component of the total power dissipation [9]. This has
motivated us to focus our work on reducing the leakage power dissipation.
Configuration prefetching [67] is a widely adopted technique for reducing the reconfiguration
delay in Partially Reconfigurable (PR) FPGAs. In prefetching, a task is loaded
into the FPGA as soon as possible, so that the configuration part of the waiting task
(to be executed) may overlap with the execution part of the operating tasks,
which reduces the reconfiguration overhead (time). However, even after the task is
loaded (prefetched), it may not execute immediately but wait until a few other tasks
complete because of the dependencies involved. Such waiting introduces a delay between the
configuration and execution parts of the same task. During the delay interval, the SRAM
cells of the FPGA (containing the bits of the waiting task to be executed) cannot be powered
down, since this would cause the loss of configuration data from the cells. Therefore, the
cells dissipate a significant amount of power.
Motivational Example: Fig. 4.1 presents an example to demonstrate the aforementioned
issues. In this example, the task graph on the left-hand side is scheduled on an
FPGA platform with the prefetching technique. During the interval between R3 and E3,
the logic blocks of columns 1 and 2 can be powered down to remove leakage waste.
However, since the SRAM cells of these columns cannot be powered down without losing
the configuration data, they consume a considerable amount of power. As SRAM
cell leakage contributes ≈ 38% of FPGA leakage [173] (up to 44% for the Spartan-3
family [174]), reducing FPGA SRAM leakage is of paramount importance.
In order to reduce leakage, a scheduling approach needs to be developed aiming at
Figure 4.1: Example of Leakage Waste caused by Prefetching Technique
(Ri: reconfiguration phase of task i; Ei: execution phase of task i)
allocating reconfiguration and execution parts as close as possible while taking task
dependencies, timing and architecture constraints into account. Several works have
been proposed to solve this problem [188], [72]. However, these works attempt to address
the leakage problem in a single phase of the resource management process (details
in later sections). As a result, the leakage power cannot be significantly reduced. It
has also been observed that there exists a trade-off between leakage savings and
performance [188]. However, the trade-off achieved by the existing approaches is
not efficient: a high degradation in performance is incurred in order to achieve a small
amount of leakage savings. To tackle the problem from a comprehensive perspective and
achieve high leakage reductions, we propose a multi-stage resource management
approach consisting of three stages. Our main contributions to each stage are as
follows:
• Scheduling: A list-scheduling algorithm has been developed with a specific pri-
ority function that is customized for addressing the leakage power reduction.
• Placement: A cost function has been derived for the placement stage to further
reduce the leakage power. This function gives designers the flexibility to manage
the trade-off between performance and leakage savings.
• Post-placement: A post-placement heuristic has been proposed to improve the
scheduling results (leakage savings) from previous stages.
4.1 System Model and Problem Definition
The targeted architecture used in this work is a one-dimensional (1D) FPGA, where the
configurable logic blocks (CLBs) are arranged in fixed vertical columns, and a task
occupies an integral number of columns. Moreover, the device supports dynamic partial
reconfiguration: a part of the platform can be configured while other parts operate without
interruption. The basic configuration unit is a column. A task can be deployed on
a set of adjacent columns, and the reconfiguration time of the task is proportional to
the number of columns. Such an architecture is similar to the Xilinx Virtex FPGA
family [182]. The device can be configured by a bitstream through configuration ports like
JTAG or ICAP. However, both configuration ports are managed by only one configuration
controller. Therefore, two different tasks cannot be reconfigured at the same time.
This architectural constraint plays a critical role in the process of scheduling and
placement. Another key element in realizing the benefits of the scheduling algorithm on the
FPGA is the sleep transistor. It is assumed that unused CLBs can be totally powered off by the
sleep transistors integrated in the device. Based on this assumption, each column can be
independently controlled by a sleep transistor [188].
Task model: We consider only hardware tasks, i.e., tasks that can be synthesized and
implemented on the FPGA platform. In comparison to software tasks, hardware tasks
have some additional parameters related to the required hardware area and configuration
time. A directed acyclic graph (DAG) is used to represent the task set of an application.
An example of the task graph model is presented in Fig. 4.1. In the DAG, each node u
represents a task, while an edge e(u; v) indicates the dependency between tasks u and v.
A task has two components: reconfiguration and execution. The reconfiguration part is
scheduled under the architectural constraint (only one reconfiguration controller), while
the scheduling of the execution part depends on the data dependencies, where a linear task
placement model similar to that of [21] has been adopted. In the scheduling process, the
communication overhead between tasks is ignored for two reasons: 1) tasks communicate
with each other through a shared memory with the same latency and cost; and 2) this
latency is negligible in comparison to the run-time reconfiguration overhead (time) and
execution time. As a result, all task graphs are computation intensive.
Scheduling Problem
The problem targeted in this contribution considers the following set of inputs, constraints
and objectives.
• Input: The application task graph and FPGA architecture (number of columns, 1
reconfiguration controller and 1D architecture).
• Constraints: Task graph dependency for execution parts, reconfiguration con-
troller constraint for reconfiguration part and sequential relation between the re-
configuration and execution parts of the same task.
• Objective: Minimize the leakage power dissipation caused by the delays between
the reconfiguration and execution parts, and minimize the schedule length.
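The three constraints listed above can be checked mechanically for a candidate schedule. The following is a minimal sketch: the `ScheduledTask` record, its field names and the function `constraint_violations` are hypothetical and chosen only for this illustration, and column overlap between tasks is deliberately left out. The helper `leakage_waste` computes the number of columns multiplied by the delay between the end of reconfiguration and the start of execution, which is the quantity the objective tries to minimize.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass
class ScheduledTask:
    columns: int        # number of adjacent columns occupied by the task
    reconf_start: int   # start of the reconfiguration part
    reconf_time: int    # duration of the reconfiguration part
    exec_start: int     # start of the execution part
    exec_time: int      # duration of the execution part

    @property
    def reconf_end(self) -> int:
        return self.reconf_start + self.reconf_time

    @property
    def exec_end(self) -> int:
        return self.exec_start + self.exec_time


def constraint_violations(tasks: Dict[str, ScheduledTask],
                          deps: List[Tuple[str, str]]) -> List[str]:
    problems = []
    # Sequential relation: a task executes only after its own reconfiguration.
    for name, t in tasks.items():
        if t.exec_start < t.reconf_end:
            problems.append(f"{name}: execution starts before reconfiguration ends")
    # Single reconfiguration controller: reconfiguration intervals must not overlap.
    for (a, ta), (b, tb) in combinations(tasks.items(), 2):
        if ta.reconf_start < tb.reconf_end and tb.reconf_start < ta.reconf_end:
            problems.append(f"{a}/{b}: reconfigurations overlap")
    # Task graph dependency: a child executes only after its parent has finished.
    for parent, child in deps:
        if tasks[child].exec_start < tasks[parent].exec_end:
            problems.append(f"{parent}->{child}: dependency violated")
    return problems


def leakage_waste(t: ScheduledTask) -> int:
    """Columns kept configured but idle between reconfiguration and execution."""
    return t.columns * (t.exec_start - t.reconf_end)
```

A prefetched task that is configured early but has to wait for its parent shows up here as a positive `leakage_waste` even though no constraint is violated.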
4.2 Proposed Multi-stage Resource Management Approach
An overview of the proposed resource management approach is provided in Fig. 4.2.
The approach has 3 stages: Scheduling, Placement and Post-placement. At first, the
application task graph is processed iteratively in the first two stages (Scheduling and
Placement). In each iteration, the Scheduler defines the next task to be passed to the
Figure 4.2: Multi-stage Scheduling Scheme
Placer by a dynamic priority scheme, which means that the priorities of all the schedulable
tasks change after each iteration. The Placer then decides the column where the
task should be mapped and updates the current status of the platform for the Scheduler.
After all the tasks in the task graph are allocated onto the platform, the refinement heuristic
in the Post-placement stage further improves the result from the previous stages.
4.2.1 Scheduling Stage
Algorithm 4 presents our algorithm for the scheduling phase. At each step, all schedulable
tasks, i.e., tasks whose parents have been scheduled, are stored in a set of ready tasks, S.
Then, the scheduler calculates the dynamic priorities of all tasks in set S according to
the priority function defined by Equation 4.1. Thereafter, it chooses the task with the highest
priority to pass to the placer. As mentioned in Section 2.2.3, we use a dynamic priority
function so that the scheduling process can adapt to the current status of the FPGA.
Since the priority function has a strong impact on the schedule quality, it is carefully
designed to address both leakage saving and the performance requirement. The function
includes different components that reflect the effect of the constraints (FPGA architecture
and task graph dependency) as well as the optimization targets (leakage saving and schedule
length) on the scheduling decision. Our priority function is defined as follows:
F = αBT + σC − βEET − γERT − µLK (4.1)
LK = C ∗ (EET − (RT + ERT )) (4.2)
where,
BT : bottom level of the task that represents the length of the longest path in task graph
starting from this task;
EET : earliest execution time of the task;
ERT : earliest reconfiguration time of the task;
C : number of columns required by the task;
RT : the reconfiguration time of the task;
LK : leakage waste caused by scheduling the task. The leakage waste is the product
of the used columns and the delay between reconfiguration and execution parts.
EET, ERT and LK are dynamic factors and are computed during the scheduling process based
on the current status of the partial schedule. Since these variables are fundamental
to the scheduling problem, the details of their calculation can be found in basic textbooks
on task scheduling, such as [152]. α, β, γ, σ, µ are coefficients related to each factor
and are used to determine the intensity of their impact on the priority function. The signs of
the elements in the function are given based on their impact on the schedule: tasks requiring
Algorithm 4 Leakage Aware Task Scheduling Algorithm
Input: Task graph G = (U, V)
Output: Schedule with minimal LK
Put source tasks {t_i ∈ U : pred(t_i) = ∅} into set S
// S: set of schedulable tasks
while S ≠ ∅ do
  Calculate priorities of unscheduled tasks in S (by Equation 4.1)
  Choose the task t with maximum priority
  Choose the best column C for task t (by Algorithm 5)
  Schedule task t starting from column C
  if child tasks of t are not already added to S then
    Add new available tasks to S
  end if
  Remove task t from S
end while
more columns should be placed earlier to leave more space for other tasks; tasks with
a higher bottom level (far from the leaf tasks) should be scheduled first because they strongly
affect the schedule length. Additionally, tasks with minimal EET, ERT and LK should
be chosen for the desired optimization objective. As shown in Fig. 4.2, the output of the
scheduling stage is a set of schedulable tasks with the highest-priority task at
the front of the set. This highest-priority task is then transferred to the Placement stage to
be allocated onto the FPGA. Since we are using a dynamic priority scheme, both the
schedulable task set and the priorities of the tasks in the set change every time a task is
placed on the FPGA.
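Equations 4.1 and 4.2 translate directly into a small ranking routine. The sketch below is illustrative only: the `Candidate` record and function names are hypothetical, and the default coefficient values of 1.0 are placeholders, since no concrete values for α, σ, β, γ and µ are given here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Candidate:
    name: str
    bt: float    # BT: bottom level of the task
    c: int       # C: number of columns required by the task
    rt: float    # RT: reconfiguration time of the task
    ert: float   # ERT: earliest reconfiguration time
    eet: float   # EET: earliest execution time


def leakage_waste(t: Candidate) -> float:
    """Equation 4.2: LK = C * (EET - (RT + ERT))."""
    return t.c * (t.eet - (t.rt + t.ert))


def priority(t: Candidate, alpha: float = 1.0, sigma: float = 1.0,
             beta: float = 1.0, gamma: float = 1.0, mu: float = 1.0) -> float:
    """Equation 4.1: F = alpha*BT + sigma*C - beta*EET - gamma*ERT - mu*LK."""
    return (alpha * t.bt + sigma * t.c
            - beta * t.eet - gamma * t.ert - mu * leakage_waste(t))


def pick_next(ready: Iterable[Candidate], **coefficients: float) -> Candidate:
    """Choose the ready task with the maximum priority."""
    return max(ready, key=lambda t: priority(t, **coefficients))
```

Because EET, ERT and LK depend on the partial schedule, the `Candidate` values would have to be recomputed for every ready task after each placement before `pick_next` is called again.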
4.2.2 Placement Stage
After getting the task with highest priority, the placer applies the steps in Algorithm
5 to allocate the task into physical column(s) of FPGA. When a task comes to this
stage, the algorithm scans all the columns to find available positions for the task and
Algorithm 5 Leakage Aware Placement Algorithm
Input: Task t, set of columns P
Output: Column C with minimal LK
for each column c_i ∈ P do
  Schedule task t starting from column c_i
  Calculate cost of placing t on c_i (by Equation 4.3)
end for
Choose the column C with minimal cost function
for each available position, the cost function is computed. Then, the task is placed into
the position with the minimal cost value. Here, the cost function is also designed to
balance leakage saving and schedule length:
Cost = (a/10) ∗ LK + (1 − a/10) ∗ EST (4.3)
where LK and EST represent the leakage power and the earliest start time for a placement, and
a is the leakage/schedule-length trade-off coefficient, which can be used to provide a
balance between the two optimization goals. Therefore, the cost function not only helps
to reduce the leakage dissipation but also gives the designer the ability to manage
the trade-off between performance (schedule length) and leakage saving. The trade-off
values can be obtained by adjusting the value of a in Equation 4.3. By increasing the
value of a, the designer can save more leakage power at the cost of a longer schedule length.
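The effect of the trade-off coefficient can be illustrated with a short sketch. It assumes that Equation 4.3 weights LK by a/10 and EST by (1 − a/10), with a ranging from 0 to 10. The function names and the candidate values in the usage note are hypothetical.

```python
from typing import Dict, Tuple


def placement_cost(lk: float, est: float, a: float) -> float:
    """Weighted cost of one candidate position (assumed form of Equation 4.3)."""
    w = a / 10
    return w * lk + (1 - w) * est


def choose_column(options: Dict[int, Tuple[float, float]], a: float) -> int:
    """options maps a starting column to its (LK, EST) pair; return the cheapest."""
    return min(options, key=lambda col: placement_cost(*options[col], a))
```

With the hypothetical candidates `{0: (0, 8), 40: (20, 3)}`, a = 10 selects column 0 (no leakage waste, later start), whereas a = 1 selects column 40 (earlier start, more leakage waste), which mirrors the trade-off described above.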
Fig. 4.2 demonstrates the placement results from the first 2 stages of our approach.
A small leakage power is expected as a result of the above optimization techniques,
as shown in the figure.
4.2.3 Post-placement Heuristic
Our post-placement heuristic is presented in Algorithm 6. The heuristic takes the task graph
and the tasks’ placement as input and provides an optimized placement of tasks so that leakage
Algorithm 6 Leakage Aware Post-placement Algorithm
Input: Task graph G = (U, V), tasks’ placement after placement stage
Output: Optimized placement of tasks
for each leaf task t_i ∈ U do
  Schedule configuration and execution of task t_i by considering architectural constraint
  while parents of t_i ≠ ∅ do
    Find reconfiguration costs for parent tasks of t_i by Equation 4.4
    Sort reconfigurations in descending order based on cost
    Schedule reconfigurations considering architectural constraints
    Select parents one by one from maximum to minimum cost as t_i
  end while
  Move executions close to reconfigurations if dependencies are not violated
end for
power due to delays between reconfigurations and executions is further minimized. The
heuristic first schedules the leaf tasks so as to maintain the same finish time and thus meet
the timing deadline. For each leaf task, its parent tasks are evaluated for their
reconfiguration costs and scheduled by taking the architectural constraints into account. The
cost is computed as follows:
C = lw ∗NC − sw ∗ SP (4.4)
where NC and SP are the number of occupied columns and the range of the reconfiguration
space, respectively, and lw and sw are the weights given to NC and SP,
respectively, which determine the leakage power dissipation.
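The ordering step of the heuristic can be sketched from Equation 4.4. In the sketch below, the function names, the dictionary layout and the default weights of 1.0 are assumptions made for illustration, as no concrete values for lw and sw are given here.

```python
from typing import Dict, List, Tuple


def reconfiguration_cost(nc: float, sp: float,
                         lw: float = 1.0, sw: float = 1.0) -> float:
    """Equation 4.4: C = lw * NC - sw * SP."""
    return lw * nc - sw * sp


def order_parents(parents: Dict[str, Tuple[float, float]],
                  lw: float = 1.0, sw: float = 1.0) -> List[str]:
    """parents maps a task name to (NC, SP); return names from max to min cost."""
    return sorted(parents,
                  key=lambda p: reconfiguration_cost(*parents[p], lw=lw, sw=sw),
                  reverse=True)
```

Under this cost, a parent that occupies many columns and has little freedom in where its reconfiguration can be placed is handled first.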
After all the tasks are scheduled, we try to place the executions close to the respective
reconfigurations if dependencies are not violated. This helps us to achieve a placement
in which reconfigurations and executions are close to each other, as shown in Fig. 4.2,
leading to reduced leakage power.
Figure 4.3: Leakage and Schedule Length when employing Different Approaches
4.3 Experimental Results
A series of experiments is conducted to demonstrate the performance of our resource
management approach. Three versions of our scheduling and placement approach with
different values of the constant a in Equation 4.3 (a=1, a=2, a=10) are compared with the
following existing approaches: the performance-driven algorithm (PDA) proposed in [21], the
Enhanced Leakage Aware Algorithm (ELAA) employed in [72], and the ILP and Iterative
Refinement (ITE) heuristic approaches proposed in [188]. The PDA does not consider the
leakage waste in the scheduling process, and has been used as the baseline approach
for comparisons. ELAA demonstrates high performance when dealing with the leakage
problem [72]. One important target in this work is to examine the trade-off between
leakage saving and the schedule length, so no deadline (in terms of schedule length) is
set for the trade-off analysis. The results from our post-placement approach are compared
to those of [188].
Our algorithm is implemented in Java and the experiments are performed on an
Intel Core i7 2.26 GHz CPU with 4 GB RAM. The experiments are performed with real-life
task graphs and synthetic task sets generated by the TGFF tool [54]. For the synthetic
83
4.3 Experimental Results
Table 4.1: Leakage waste and algorithm runtime of post-placement methods
Number of tasks in task graphs
Algorithms 10 20 30 40 50
Leakage Runtime (s) Leakage Runtime (s) Leakage Runtime (s) Leakage Runtime (s) Leakage Runtime (s)
PDA+ILP 0 2.278 40 12.451 60 25.812 0 50.24 60 199.24
PDA+ITE 20 2.46E-04 80 4.32E-04 180 8.17E-04 80 1.14E-03 80 3.69E-03
PDA + 20 2.15E-04 80 4.36E-04 100 3.66E-04 80 4.32E-04 80 5.02E-04
Our heuristic
case, five task sets are considered. Each task set contains 10 task graphs with different
level of parallelism; and each task in the task graph requires 10 to 50 columns and has
the execution time from 1 to 9 time units. The FPGA platform is considered to have
a fixed number of columns as 100. For real-life task graphs, JPEG encoder [21], MP3
decoder [85] and MPEG4 decoder [43] are considered with their specifications provided
in respective references in order to demonstrate the applicability of our approach for
real-life scenarios.
The comparison criteria are schedule length, leakage waste, and algorithm runtime.
The schedule length is measured in time units, while the leakage waste
is measured in energy units, where one energy unit is the power dissipated by one column
during one time unit. The leakage waste of a particular task is computed by Eqn. 4.2. The leakage waste
of a TG after scheduling is the sum of the leakage waste of all its tasks, and the leakage waste
of a task set is the sum over all the task graphs it contains. Further, as sleep
transistors are used to switch off the unused SRAM cells of each column, the leakage
waste of a task before its configuration and after its execution is considered to be zero.
4.3.1 Leakage Waste and Schedule Length
Fig. 4.3 presents the leakage waste and schedule length (in terms of time extension over
the baseline approach PDA) of all the approaches over the five task sets. The whole bars
present the leakage waste obtained after the Scheduling and Placement (S&P) stage, while
the lower parts of the bars show the leakage waste after applying the Post-Placement (PP)
methods. Therefore, for the existing approaches, the whole bars describe the leakage waste
of PDA, and the lower part of each bar is the leakage after post-placement
refinement (PDA+ITE or PDA+ILP). The time extension is the extension of the deadline
required for leakage reduction. It is computed by subtracting the schedule length of the
baseline (PDA) from the schedule length of each approach, and these values are presented
by the columns drawn in the reverse direction (top to bottom). The horizontal axis lists the notations
for the different approaches. For example, the first two notations, PDA+ILP and PDA+ITE,
denote the two approaches used in [188], where PDA is used in the Scheduling and Placement
(S&P) phase and either ILP or ITE is used in the Post-Placement phase.
It can be seen from Fig. 4.3 that all versions of our approach achieve better leakage
saving than the two approaches in [188]. Furthermore, when the
number of tasks is large (greater than 10), our approach with a = 10 reaches the optimal
leakage saving (leakage waste = 0) with a smaller time extension than
ELAA. On average, our approach with a = 1 and a = 2
shows leakage power savings of 40% and 65%, respectively, compared with PDA.
Compared with the existing approach PDA+ITE, our approach achieves
15% and 43% more leakage savings with a = 1 and a = 2, respectively. The
reason for these superior results is that we consider
leakage optimization first in the scheduling and placement stages and then again in the
post-placement stage. The optimization in the scheduling and placement stages minimizes the
delays between configurations and executions, and the post-placement stage tries to further
reduce the remaining delays in order to reduce the leakage dissipation. In contrast, the other
approaches tackle leakage optimization in only one stage (e.g., the placement stage in
ELAA [72] and the post-placement stage in [188]).
4.3.2 Post-placement Leakage Waste and Algorithm Runtime
In this experiment, we examine the leakage saving and runtime of three post-placement
methods: ILP and ITE from [188], and our proposed heuristic. The methods are executed with
the same inputs, which are the placement results from PDA. The deadline of all the task
graphs is set to the schedule length of our approach when it achieves the optimal
leakage saving (i.e., a = 10).
Table 4.1 shows the leakage waste and algorithm runtime of the various post-placement
methods. As can be seen from Table 4.1, in many cases none of the post-placement methods
can totally eliminate the leakage dissipation of the PDA placement. However,
for the same deadline, our multi-stage approach achieves the optimal solution
(leakage waste = 0), as described earlier. This shows the advantage of our comprehensive
strategy, which addresses the leakage problem throughout the resource management
process. Although our scheduling and placement stages achieve high leakage savings,
they can still leave gaps between the reconfiguration and execution parts of many tasks.
Our post-placement stage tries to reallocate reconfigurations and executions so that the
gaps between them are minimized, in order to achieve further leakage savings. Table
4.1 shows that our post-placement heuristic matches the leakage waste of ITE on four of the
five task sets and improves on it for the 30-task set (100 versus 180 energy units).
Additionally, the runtime of our heuristic is comparable to or smaller than that of ITE
(e.g., 5.02E-04 s versus 3.69E-03 s for 50 tasks), and both heuristics are several orders of
magnitude faster than ILP.
4.3.3 Case-study: Real-life Applications
We applied the different scheduling approaches to real-life applications: JPEG encoder
[21], MP3 decoder [85] and MPEG4 decoder [43], as mentioned earlier. Table 4.2 shows
the leakage waste and schedule length for these applications. The notations used in this
experiment are the same as those in the previous experiments. ELAA and our approach
with a = 10 always achieve the optimal leakage waste (zero), at the cost of some extension
in schedule length. Therefore, the leakage in these cases does not need any improvement
by post-placement methods, and not applicable (NA) is reported for them.

Table 4.2: Leakage waste and schedule length for real-life applications

Application             Metric             PDA+ITE    a=1    a=10    ELAA
MPEG (9 tasks)          Schedule length         44     44      53      57
                        Leakage S&P            140     80       0       0
                        Leakage PP               0      0      NA      NA
JPEG (6 tasks)          Schedule length         22     23      24      29
                        Leakage S&P             60     20       0       0
                        Leakage PP              20     20      NA      NA
MP3 decoder (14 tasks)  Schedule length         50     57      61      63
                        Leakage S&P            270     30       0       0
                        Leakage PP             270     30      NA      NA

As can be seen from the table, for MPEG and JPEG, our approach with a = 1
obtains the same post-placement leakage as PDA+ITE. However, when it comes to the MP3
decoder, the advantage of our comprehensive strategy becomes obvious. Due to the low-quality
solution produced in the first two phases, the ITE approach cannot reduce the leakage
left by the initial placement (270 energy units both before and after post-placement). In contrast,
the stages of our approach work together: with a = 1 the leakage is already down to 30 after
scheduling and placement, and with a = 10 it is eliminated.
4.4 Summary
To tackle the high complexity of the scheduling problem, we present a multi-stage
resource management approach with a focus on leakage power savings in Partially
Reconfigurable FPGAs. Our multi-stage approach employs a leakage-aware priority function
in the scheduling stage, a leakage-performance trade-off function in the placement stage and a
heuristic in the post-placement stage. A series of experiments is performed to highlight
the advantages of the proposed approach over existing works. The results demonstrate
that the proposed approach dominates the existing approaches when the application task
graph contains a large number of tasks. Additionally, the experiments show that our
approach can always achieve the optimal value because a comprehensive strategy is adopted,
whereas other single-stage methods may not achieve it. Furthermore,
our approach gives designers the flexibility to trade off
leakage saving against performance. Specifically, different variants of the proposed
approach can reduce leakage power by 40-65% when compared to a performance-driven
approach and by 15-43% when compared to state-of-the-art works [123].
Chapter 5
Machine Learning and Genetic Algorithm
for Multi-objective DSE
Our contributions in System Level Synthesis so far have made extensive use of list-based
heuristics to solve mapping and scheduling problems. In the previous chapter,
we designed the cost functions in these heuristics to cope with different design
requirements: schedule length, leakage savings, throughput and energy consumption.
However, the main limitation of the previously proposed list-based schedulers is that they
produce only one scheduling result for each application; therefore, the trade-offs
between different design objectives cannot be examined. To address this problem, in
this chapter we apply a multi-objective Genetic Algorithm (GA) to explore the
design space of our list-based heuristics. Following the GA approach, the components in
the cost functions are parameterized; hence, for the same task graph, different parameter
sets give different scheduling results. Each combination of choices for
these parameters provides a single option in terms of design objectives (performance,
energy) and forms a specific design point in the design space. Thereafter, the designers
can efficiently traverse the design space and generate the set of points that are not
outperformed in every objective dimension by any other point. These points form the Pareto front, which is the
Holy Grail for system designers since it not only provides insight into the trade-off
between different objectives but also allows them to choose the most efficient design for
different purposes. However, traversing the design space with the GA method
is usually very time-consuming because the number of design points grows exponentially
with the dimension of the space, i.e., the number of coefficients in the priority
functions. The problem is much worse for Reconfigurable MPS since the flexibility of
reconfigurable hardware tremendously widens the design space of the whole platform.
The reconfigurable hardware also introduces new components into the list-based
priority function, making it more complex and increasing the time needed to evaluate each
design point.
To shorten the time needed to generate the Pareto front during GA optimization, we developed
a multi-level Machine Learning framework that uses Spline Regression and Linear
Regression to build predictive models of Pareto fronts from a training set of task graphs
(TGs) and applies these predictive models to estimate the Pareto fronts of new
incoming TGs in a fraction of the time required by the GA approach. The
main contributions of this work are as follows:
• Developing a comprehensive multistage framework for integrating GA and ML
techniques to optimize existing list-based heuristics: from generating data to building
predictive models and predicting Pareto fronts for new TGs;
• Building a systematic representation of Pareto front curves with Spline regression
models;
• Applying Linear Regression techniques to model the dependency between the
Spline model of the Pareto front and the TG's features;
• Validating the capability of our framework by applying it to both mapping and
scheduling problems.

Figure 5.1: Proposed framework
5.1 Overall Framework
In this section, we provide an overview of the workflow and the general functionality
of the components in our framework. In essence, we explain how Genetic Algorithm
(GA) and Machine Learning (ML) techniques are used to optimize the list-based
heuristics. As can be seen from Fig. 5.1, our framework has three main phases. Two of them,
the Generating Pareto Front phase and the Model Building phase, execute at training time, while the
third, the Prediction phase, runs at execution time.
5.1.1 Phase 1: Generating the Training Database
In the first phase, the original list-based heuristic is wrapped by the GA optimization
process, which takes a set of previously generated task graphs (TGs) as input,
iterates through their design spaces and generates the Pareto front for each TG.
The generated Pareto fronts are processed by the Normalizing block and stored in a
database that feeds the Model Building phase. The implementation details of this
phase are discussed in Section 5.2.
5.1.2 Phase 2: Building the Predictive Models
The Model Building phase contains the main contributions and most of the novelties of our
work. The procedure in this phase starts with the Spline Regression block, which takes
the normalized Pareto front curves from Phase 1 as input and builds Spline Regression
models that fit the Pareto curves with acceptably small error. Thereafter, it filters out
the most Volatile Coefficients of the generated Spline models and sends them to the Linear
Regression block, which is the second most important ML block. This block receives
historical data, namely the Volatile Coefficients from the Spline Regression block and the range
of the Pareto front from Phase 1, as well as the Features of the respective TGs in the TG dataset.
From these inputs, it builds Linear Regression models that characterize the dependence
of the Spline Coefficients and the Pareto range on the TG features. The predictive models output
by this block are sent to Phase 3 for use at execution time. The last component of the
Model Building phase is the Feature Extraction block, which computes the most important
metrics of a TG and creates a new, concise and systematic representation of the TG. In the
training phase, this block processes the TGs from the historical dataset and sends the features
to the Linear Regression block, while in the Prediction phase it computes the features of new TGs
and feeds them to the Applying Model block. The components of the Model Building phase
are presented further in Section 5.3.
5.1.3 Phase 3: Prediction at Execution Time
The last phase in our framework uses the results of the previous phases to generate the
Pareto front for a new TG at execution time. The first building block in this phase is the
Applying Model component, which takes the Linear Regression models and the features of
the new TG and builds the estimated Pareto curve. The Trace back block then produces the real
design points on the Pareto front from the previously estimated curve.
5.1.4 Advantages
Our framework is developed in a modular fashion so that designers
can freely customize it by plugging in new schedulers, new multi-objective optimization
approaches or new ML techniques. At the same time, the framework is also
practical in the sense that a designer can quickly apply it to a new scheduling algorithm
using only the built-in components. The only part that might need to be customized is
the Feature Extraction block, which needs to be adapted to the application model (e.g.,
Task Graph, SDF, or Kahn Process Network (KPN)).
5.2 Phase 1: Generating the Training Database
During the design process, system designers need to explore the design space to find
solutions that satisfy the trade-offs between often conflicting criteria such as
performance (throughput, latency), hardware usage, energy consumption and reliability.
The commonly used tools to facilitate this exploration process are multi-objective
optimization algorithms such as Evolutionary Algorithms and Particle Swarm Optimization
(PSO). The result of these optimizations is the Pareto front in the objective space, which
contains the non-dominated design points, i.e., points for which no other design point is
better in all dimensions of the objective space. This section explains how we
apply the GA to generate the Pareto fronts for the Mapping and Scheduling processes, and then
use these Pareto fronts as the training data for the ML procedure. To showcase the
applicability and potential of our proposed framework, we have developed our approach for
both the Mapping and the Scheduling problem. The Mapping algorithm presented in [161]
allocates the actors of an SDF graph to a Multiprocessor System while optimizing the trade-off
between Throughput and Energy Consumption. The energy-conscious scheduling
(ECS) heuristic proposed in [90] schedules the tasks of a Task Graph onto a grid computing
system while considering both the Schedule Length and the Energy Consumption.
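To make the notion of non-dominated design points concrete, the following minimal sketch filters a list of (schedule length, energy) points down to its Pareto front. It assumes both objectives are minimized and uses the common dominance test (no worse in every objective, strictly better in at least one); the function names and the sample points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (schedule_length, energy), both minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


designs = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0), (8.0, 9.0)]
print(pareto_front(designs))  # [(8.0, 7.0), (10.0, 5.0), (12.0, 4.0)]
```

This quadratic filter is only meant to illustrate the definition; a GA such as NSGA-II maintains the front with more elaborate sorting.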
5.2.1 Generating the Pareto Fronts with Genetic Algorithm
TG-Scheduling
To apply the GA to the ECS scheduler, we first need to parameterize the cost function
from [90]. The original cost function, given by Eqn. 5.1, provides only one scheduling
result for each TG, while the parameterized version in Eqn. 5.2 offers various scheduling
solutions for different sets of parameters (α, β, γ, δ, η).
\[
RS(n_i, p_j, v_{j,k}, p', v') =
\frac{E(n_i, p_j, v_{j,k}) - E(n_i, p', v')}{E(n_i, p_j, v_{j,k})}
+ \frac{EFT(n_i, p_j, v_{j,k}) - EFT(n_i, p', v')}
       {E(n_i, p_j, v_{j,k}) - \min\big(EFT(n_i, p_j, v_{j,k}),\, EFT(n_i, p', v')\big)}
\tag{5.1}
\]
\[
RS(n_i, p_j, v_{j,k}, p', v') =
\frac{\alpha \cdot E(n_i, p_j, v_{j,k}) - \beta \cdot E(n_i, p', v')}{E(n_i, p_j, v_{j,k})}
+ \frac{\gamma \cdot EFT(n_i, p_j, v_{j,k}) - \delta \cdot EFT(n_i, p', v')}
       {E(n_i, p_j, v_{j,k}) - \eta \cdot \min\big(EFT(n_i, p_j, v_{j,k}),\, EFT(n_i, p', v')\big)}
\tag{5.2}
\]
where E(ni, pj, vj,k) and E(ni, p′, v′) are the energy consumption of task ni on
processor pj with operating voltage vj,k and that of task ni on p′ with v′, respectively;
similarly, the earliest finish times of the two task-processor allocations are denoted as
EFT(ni, pj, vj,k) and EFT(ni, p′, v′). The relative superiority RS(ni, pj, vj,k, p′, v′) is
the objective function that balances these two performance considerations. More details on
ECS can be found in [90]. The parameters (α, β, γ, δ, η) are chosen to capture all the
important factors that might affect the result of the objective function.

Figure 5.2: Original Pareto front of 5 task graphs
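The parameterized relative superiority of Eqn. 5.2 can be sketched as a small function, shown below. This is an illustrative transcription of the formula as printed above, not the ECS implementation of [90]; the `Allocation` class, the function name and the sample values are hypothetical. With all five parameters equal to 1, the expression reduces to Eqn. 5.1.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """Energy and earliest finish time of one task-processor-voltage choice."""
    energy: float
    eft: float


def relative_superiority(cur: Allocation, alt: Allocation,
                         alpha: float = 1.0, beta: float = 1.0,
                         gamma: float = 1.0, delta: float = 1.0,
                         eta: float = 1.0) -> float:
    """Evaluate Eqn. 5.2 as printed; all-ones parameters give Eqn. 5.1."""
    energy_term = (alpha * cur.energy - beta * alt.energy) / cur.energy
    denom = cur.energy - eta * min(cur.eft, alt.eft)
    if denom == 0:
        raise ZeroDivisionError("time-term denominator is zero for these inputs")
    time_term = (gamma * cur.eft - delta * alt.eft) / denom
    return energy_term + time_term


# Hypothetical allocations of the same task on two processor/voltage pairs.
cur = Allocation(energy=20.0, eft=10.0)
alt = Allocation(energy=15.0, eft=12.0)
print(round(relative_superiority(cur, alt), 6))             # original form
print(round(relative_superiority(cur, alt, gamma=2.0), 6))  # one parameter changed
```

Changing a single parameter changes the ranking signal that the list scheduler sees, which is what lets the GA reach different schedules for the same TG.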
Thereafter, the GA is used to explore the space of these parameter sets to find the
Pareto front in the objective space. In general, the GA encodes the parameters in the
form of a chromosome and uses the objectives as criteria to heuristically search for better
parameters by iterating from generation to generation. Good parameter sets are
passed on through the generations by inheritance, while new potential parameter sets
are explored through mutation. There are many different implementations
of GA, but we use the NSGA-II algorithm because of its proven efficiency and popularity
[155]. The choice of GA depends on the designer's preference and by no means limits the
generality of our framework.
SDF-mapping
Unlike the Scheduling heuristic in the previous subsection, the list-based Mapping
algorithm proposed in [161] already comes with a parameterized cost function, which is given
as follows:
\[ \mathrm{cost}(t, a) = c_1 \cdot l_p(t) + c_2 \cdot l_m(t) + c_3 \cdot l_c(t) + c_4 \cdot l_l(t) \tag{5.3} \]
where cost(t, a) is the cost of binding an actor a to a tile (processor) t, and lp(t), lm(t), lc(t) and
ll(t) denote the processing load, memory load, communication load and latency load of tile
t if actor a is assigned to it. The detailed computation of these components is given
in [161]. c1, c2, c3 and c4 are parameters of the cost function that can be adjusted to control
the trade-off between Throughput and Energy consumption. Since the original Mapping
heuristic is already parameterized, the GA process can be applied directly to generate
the Pareto fronts of the training SDF graphs, in the same way as in the previous subsection.
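The effect of the weights in Eqn. 5.3 can be illustrated with the short sketch below. The load values and tile names are hypothetical, and choosing the tile with the lowest cost is an assumption of this sketch; the actual load computation and binding rule are those of [161].

```python
from typing import Dict, Sequence


def binding_cost(loads: Sequence[float], weights: Sequence[float]) -> float:
    """cost(t, a) = c1*lp(t) + c2*lm(t) + c3*lc(t) + c4*ll(t), as in Eqn. 5.3."""
    if len(loads) != 4 or len(weights) != 4:
        raise ValueError("expected four loads and four weights")
    return sum(c * l for c, l in zip(weights, loads))


# Hypothetical (lp, lm, lc, ll) of each tile if the actor were bound to it.
tile_loads: Dict[str, Sequence[float]] = {
    "tile0": (0.6, 0.2, 0.1, 0.3),
    "tile1": (0.3, 0.5, 0.4, 0.2),
}


def best_tile(weights: Sequence[float]) -> str:
    """Assumed selection rule: pick the tile with the lowest cost."""
    return min(tile_loads, key=lambda t: binding_cost(tile_loads[t], weights))


print(best_tile((1.0, 0.0, 0.0, 0.0)))  # tile1: only processing load counts
print(best_tile((0.0, 1.0, 1.0, 0.0)))  # tile0: memory and communication count
```

Each weight vector (c1, c2, c3, c4) therefore acts as one chromosome for the GA, and different vectors can lead to different bindings of the same actor.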
For ease of presentation and to reduce the level of abstraction, the detailed
implementation of the components in the following sections is presented with
examples that show how to apply our framework to the Scheduling heuristic ECS [90].
Applying our framework to the Mapping heuristic is done in a
similar manner, and only the final results are reported in the experimental Section 5.5.
5.2.2 Normalizing the Pareto Fronts to Uniform Curves
Fig. 5.2 shows an example of the Pareto fronts of 5 TGs, which are the outcome of the GA
block. As can be observed, the general shapes of the Pareto curves are similar,
while the ranges and scales of these curves differ considerably. To overcome
this problem and make the Pareto fronts easier to interpret and more uniform across
the TG dataset, we normalize the curves so that all the Pareto fronts fit in the range
[0, 1] in every objective dimension (Schedule Length and Energy). The formulas
used in the normalization process are given in Eqn. 5.4 and Eqn. 5.5. The Pareto fronts after
the normalizing step are presented in Fig. 5.3. As can be seen, the common pattern of the Pareto
curves becomes more apparent once they are fitted in the range [0, 1].
\[ T_i = (T_i - T_{min}) / (T_{max} - T_{min}) \tag{5.4} \]
\[ E_i = (E_i - E_{min}) / (E_{max} - E_{min}) \tag{5.5} \]

Figure 5.3: Normalized Pareto fronts and their Spline Models
where Ti and Ei denote the Schedule Length and Energy consumption of the i-th
point on the Pareto front (the raw values on the right-hand side, the normalized values on the
left-hand side), and Tmax, Tmin and Emax, Emin represent the range of the Pareto
curve in the two objective dimensions.
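The normalization of Eqn. 5.4 and Eqn. 5.5 amounts to a min-max scaling of each objective, as the following minimal sketch shows. The function name and the sample front are hypothetical; rejecting a front with zero range in one objective is a choice made for this sketch only.

```python
from typing import List, Tuple


def normalize_front(front: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Min-max scale (schedule length, energy) points to [0, 1] (Eqns. 5.4-5.5)."""
    t_vals = [t for t, _ in front]
    e_vals = [e for _, e in front]
    t_min, t_max = min(t_vals), max(t_vals)
    e_min, e_max = min(e_vals), max(e_vals)
    if t_max == t_min or e_max == e_min:
        raise ValueError("degenerate front: an objective has zero range")
    return [((t - t_min) / (t_max - t_min), (e - e_min) / (e_max - e_min))
            for t, e in front]


front = [(100.0, 50.0), (150.0, 30.0), (200.0, 10.0)]
print(normalize_front(front))  # [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
```

The range (Tmin, Tmax, Emin, Emax) has to be stored alongside the normalized curve, since it is needed later to map predicted points back to real values.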
5.3 Phase 2: Building the Predictive Models
As discussed earlier, the Model Building phase contains two main blocks that integrate
Spline Regression and Linear Regression techniques into our framework.
5.3.1 Build Spline Regression for Pareto Curves
After observing the similar pattern in the normalized Pareto curves, we try to quantify the
similarity by transforming the curves into a more systematic representation, namely a
function describing the relationship between the Energy and the Schedule Length of the points
on the Pareto curve. Given the continuity and the curved shape of the Pareto front, a
number of different regression models have been tested to find an appropriate function,
such as piecewise polynomial regression, smoothing splines and local regression [58]. Amongst
them, Cubic Spline Regression fits our framework well due to its balanced
trade-off between accuracy and computational complexity [58].
In general, Spline Regression partitions the whole range of the predictor (Schedule
Length) into distinct intervals. Then, in each interval, it fits a polynomial
function to the data. For Cubic Splines, third-degree polynomials are used. Unlike
ordinary Piecewise Polynomial Regression, a set of continuity constraints is applied
to ensure a smooth transition between intervals. The division points are called knots,
and the choice of their number K and their positions is a very important factor in Spline
Regression. K = 3 has been found empirically to provide the best trade-off between curve
fitting and computation. The general formulation of the Cubic Spline model is given in Eqn. 5.6:
\[
y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i
\]
\[
b_1(x_i) = x_i, \quad b_2(x_i) = x_i^2, \quad b_3(x_i) = x_i^3, \quad
b_{k+3}(x_i) = (x_i - \xi_k)^3_+, \quad k = 1, \dots, K
\]
where
\[
(x_i - \xi_k)^3_+ =
\begin{cases}
(x_i - \xi_k)^3, & \text{if } x_i > \xi_k \\
0, & \text{otherwise}
\end{cases}
\tag{5.6}
\]
where yi is the response and xi is the predictor. In our case, yi and xi are the Energy
and the Schedule Length of the i-th point on the Pareto front; β0, ..., βK+3 are the coefficients of
the model. They differ from TG to TG, and each coefficient set characterizes the
Pareto curve of a specific TG. b1, ..., bK+3 are the basis functions of the model and ξk are
the knots. The basis functions associated with the knots, b4, ..., bK+3, enforce the constraint that
the curve is continuous up to its second derivative at each knot and hence ensure the
smoothness of the curve. From Eqn. 5.6, we need (K + 4) coefficients to define a
unique Cubic Spline, i.e., the Pareto curve of a specific TG.

Figure 5.4: Details of Model Building and Prediction Phases
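The following sketch shows how a cubic spline with K knots can be evaluated from its (K + 4) coefficients. It assumes the standard truncated power basis (1, x, x^2, x^3 plus one truncated cubic term per knot); the knot positions and coefficient values are hypothetical, since the text does not specify them, and the fitting step (least squares on these basis functions) is omitted.

```python
from typing import List, Sequence


def truncated_cubic(x: float, knot: float) -> float:
    """(x - knot)^3 if x > knot, else 0."""
    return (x - knot) ** 3 if x > knot else 0.0


def spline_basis(x: float, knots: Sequence[float]) -> List[float]:
    """Return [1, x, x^2, x^3, (x - xi_1)^3_+, ..., (x - xi_K)^3_+]."""
    return [1.0, x, x ** 2, x ** 3] + [truncated_cubic(x, k) for k in knots]


def spline_value(x: float, coeffs: Sequence[float], knots: Sequence[float]) -> float:
    """Evaluate the spline; K knots require K + 4 coefficients."""
    if len(coeffs) != len(knots) + 4:
        raise ValueError("expected K + 4 coefficients")
    return sum(b * c for b, c in zip(spline_basis(x, knots), coeffs))


knots = [0.25, 0.5, 0.75]                        # K = 3, equally spaced (assumed)
coeffs = [1.0, -2.0, 1.5, -0.5, 0.8, -0.4, 0.3]  # hypothetical beta_0 .. beta_6
print(len(spline_basis(0.6, knots)))             # 7 basis values for K = 3
print(spline_value(0.0, coeffs, knots))          # 1.0, i.e. beta_0
```

Because each truncated term and its first two derivatives vanish at its knot, adding it does not break the continuity of the curve there.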
Fig. 5.4 presents more details on the functionality of the blocks and the process in Phase 2
and Phase 3. After being generated in the Spline Regression block, the Spline Coefficients are
classified into Volatile and Consistent Coefficients. The Consistent Coefficients have a
small variance compared with their average (≤ 10%), i.e., they do not vary much across
different TGs. We can therefore use their mean for new incoming task graphs, and
they are transferred directly to the Predict Normalized Pareto block. In contrast, the
coefficients with a large variance, i.e., more than 10% of their mean, are defined as Volatile
Coefficients. These coefficients change from TG to TG and depend on
the TG features. Therefore, they are sent to the Linear Regression block, and the Linear
Regression model of Subsection 5.3.2 is used to characterize this dependency. The
threshold of 10% is derived empirically and might be tuned for different regression
techniques. Generally, for the majority of regression techniques, the threshold of 10%
gives a good trade-off between computational effort and accuracy of the final results. For
example, Fig. 5.5 shows a boxplot of the coefficients of 3-knot Cubic Splines from a
dataset of 40 TGs. Only 5 out of the 7 coefficients vary noticeably across the TGs;
hence, they are the candidates for Volatile Coefficients.

Figure 5.5: Boxplot of Spline's coefficients
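The split into Consistent and Volatile Coefficients can be sketched as follows. The text does not pin down the exact spread statistic, so this sketch assumes the population standard deviation relative to the absolute mean, compared against the 10% threshold; the function name and the sample coefficient values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def split_coefficients(samples: Dict[str, List[float]],
                       threshold: float = 0.10) -> Tuple[Dict[str, float], List[str]]:
    """Return (means of consistent coefficients, names of volatile ones)."""
    consistent: Dict[str, float] = {}
    volatile: List[str] = []
    for name, values in samples.items():
        mu = mean(values)
        spread = pstdev(values)
        # A zero mean makes the relative spread undefined; treat it as volatile.
        if mu != 0 and spread / abs(mu) <= threshold:
            consistent[name] = mu
        else:
            volatile.append(name)
    return consistent, volatile


# Hypothetical values of two spline coefficients across four training TGs.
samples = {
    "beta0": [1.00, 1.02, 0.98, 1.01],  # nearly constant across TGs
    "beta1": [-2.0, -0.5, -3.5, -1.0],  # varies strongly across TGs
}
consistent, volatile = split_coefficients(samples)
print(sorted(consistent), volatile)  # ['beta0'] ['beta1']
```

Only the volatile names need a regression model; each consistent coefficient is replaced by its stored mean at prediction time.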
5.3.2 Build Linear Regression for Spline’s Volatile Coefficients and
Pareto’s Range
From the above discussion, the real Pareto front of a TG can be rebuilt from 3 types
of parameters: the range of the Pareto curve, the Volatile Coefficients and the Consistent
Coefficients of the Spline Model. Amongst them, only the Consistent Coefficients are
unchanged across the TGs, while the others depend on the features of the TG. Therefore,
we need to build predictive models to capture the dependence of the range of the
Pareto curve and of the Volatile Coefficients on the TG's features. This is the
role of the Linear Regression (LR) block in our framework. As can be seen from Fig. 5.4,
there are 2 sub-modules in this block: Linear Regression for the Pareto range and Linear
Regression for the Volatile Coefficients. The former sub-module takes as input the training
Pareto curves and the associated TG features and generates the LR models for predicting the
minimum and maximum of the Pareto curves. The latter sub-module uses the training Volatile Coefficients
generated by the Spline Regression block to build the models for predicting the normalized
Pareto curve in the Prediction Phase. The general formulation of the LR model is given in
Eqn. 5.7:
\[ Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon_i \tag{5.7} \]
where Yi is the outcome; in our framework, it is the minimum or maximum of the Schedule
Length or Energy of the points on the Pareto curve, or one of the Volatile Coefficients of the Spline
Model. βi are the coefficients of the LR model. In our example, there are 9 models and
coefficient sets in total: 5 for predicting the Volatile Coefficients and 4 for estimating
the range of the Pareto front. Xi refers to the features extracted from the TGs, such as: number
of tasks, number of edges, maximum bottom level, maximum top level, mean of task
size, variance of task size, mean of edge length and variance of edge length. These features
are selected from popular metrics of a TG [87] and its statistics. The reason behind the
choice of the TG's features and of LR is once again the trade-off between accuracy and
computational complexity. In fact, the simple LR model describes the dependency
between the TG's features and the outcomes well, since it explains more than 95% of the
variation in the dataset (R² ≥ 0.95). As mentioned above, the feature selection
depends on the application model used by the original list-based heuristics. For the Mapping
algorithm, the features extracted from an input SDF graph are the number of actors, the
number of channels, the number of actors and channels in the corresponding HSDF graph, the
average rate of ports, the average execution time of actors, the average state size of actors
and the average token size of channels. More details on these features are provided in [161].

Figure 5.6: Details of Trace back module
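The linear model of Eqn. 5.7 and the R² criterion quoted above can be illustrated with the sketch below. It only applies an already fitted model and scores predictions against observations; the coefficient values, the two features used and the function names are hypothetical, and the fitting itself (ordinary least squares) is omitted.

```python
from typing import Sequence


def lr_predict(betas: Sequence[float], features: Sequence[float]) -> float:
    """Y = beta_0 + beta_1*X_1 + ... + beta_n*X_n (error term omitted)."""
    if len(betas) != len(features) + 1:
        raise ValueError("expected one intercept plus one weight per feature")
    return betas[0] + sum(b * x for b, x in zip(betas[1:], features))


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical model for Tmin using two features: number of tasks, number of edges.
betas = [5.0, 2.0, 0.5]
print(lr_predict(betas, [10, 20]))  # 35.0
```

In the framework described here, one such model would exist per predicted quantity, i.e., per Volatile Coefficient and per bound of the Pareto range.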
5.4 Phase 3: Applying the ML Models for Prediction at
Runtime
Fig. 5.6 presents the execution procedure in the third phase of our framework. When a
new TG arrives, its features are extracted and sent to the Applying Model block to generate
an estimated Pareto curve, which is denoted as curve (C). The Trace back module is
introduced to obtain real Pareto points on the curve. The detailed steps of Trace back are
presented in the dotted rectangle. First, the targeted point (unfilled point) and the estimated
curve (curve (C)) are put on the same normalized objective space as the Pareto fronts
of the training TGs. Then, the k nearest neighbors (k-NN) of the targeted point are extracted.
Subsequently, the parameter sets of these k-NN points are retrieved from the historical data and
merged together to form a potential parameter space. Thereafter, a clustering algorithm
called Ordering points to identify the clustering structure (OPTICS) [20] is applied to the
potential parameter space to filter out the m potential parameter sets that have the largest
local density factor. Finally, the scheduling algorithm is called for these m potential
parameter sets to generate the desired points on the objective space that are closest to the
targeted point. The rationale behind the k-NN and OPTICS steps is to extract the most
promising parameter sets from the historical parameter space.
The detailed implementation of the Trace back block is given in Algorithm 7. The first
input of this procedure is the Objective Space SO, which contains the normalized Pareto front Paretoi
of the i-th TG in the training set TrainSet. The second input is the Parameter Space SP,
which contains the parameter sets Pi associated with each Pareto front Paretoi. Another input
is the predicted Pareto range Tmin, Tmax of the new incoming TG, which is generated
by the Applying Model block. The hyper-parameters n, k and m are, respectively, the number
of knots in the interval [Tmin, Tmax], the number of nearest neighbors selected in the k-NN
step and the number of potential parameter sets selected in the density filtering step.
In particular, the two hyper-parameters n and m have a decisive influence on the complexity
and performance of the whole framework. Therefore, their impact is examined more
thoroughly in the experimental section. The expected output of this block is the Pareto
front Paretonew of the new TG. The main part of the algorithm (the outer for loop,
Lines 4-26) is described above and illustrated in the dotted rectangle (Trace back) of
Fig. 5.6.
Algorithm 7 Trace back procedure
Input: Normalized Pareto sets on Objective Space: SO = {Paretoi, i ∈ TrainSet},
1: Parameter sets associated with Pareto points on Parameter Spaces: SP = {Pi, i ∈ TrainSet},
2: Predicted Pareto Range of new TG: Tmin, Tmax,
3: Hyper-parameters: n, k, m.
Output: Pareto set of new TG: Paretonew
4: for each l-th knot (l = 1, ..., n) in interval [Tmin, Tmax] do
5:   Tl = Tmin + l ∗ (Tmax − Tmin)/n
6:   tl = (Tl − Tmin)/(Tmax − Tmin)
7:   /* find the k nearest neighbour points from the training set */
8:   for each i-th TG in training set TrainSet do
9:     for each j-th point Paretoij = (Tij, Eij) on the Normalized Pareto front of the i-th TG do
10:      Compute distance: dij = |Tij − tl|
11:    end for
12:  end for
13:  Sort distance array d
14:  Add the k points with the smallest distance to the kNN set
15:  /* find the potential parameter space for the Pareto front of the new TG */
16:  for each point p in the kNN set do
17:    extract the parameter space Pp of point p
18:    Pnew = Pnew ∪ Pp
19:  end for
20:  reach_dist = OPTICS(Pnew)
21:  Sort reach_dist and extract the m parameter sets with the smallest reachability distance into the potential parameter space Ppotential
22:  for each parameter set q in Ppotential do
23:    Otemp = Original Mapping/Scheduling Heuristics(parameter set q)
24:    Onew = Onew ∪ Otemp
25:  end for
26: end for
27: Extract the Pareto set Paretonew of the new TG from its Objective Space Onew
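As an illustration only, the knot generation, the nearest-neighbour selection and the final Pareto extraction of the Trace back procedure can be sketched in Python as follows. The function names (knots, k_nearest, pareto_front) and the data layout are hypothetical and are not taken from the original implementation. The OPTICS-based density filtering and the call to the original heuristic are deliberately left out, and both objectives are assumed to be minimized.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (T, E): schedule length and energy, both minimized


def knots(t_min: float, t_max: float, n: int) -> List[float]:
    """Knots T_l = T_min + l * (T_max - T_min) / n for l = 1..n."""
    return [t_min + l * (t_max - t_min) / n for l in range(1, n + 1)]


def k_nearest(train_fronts: List[List[Point]], t_l: float, k: int) -> List[Tuple[int, int]]:
    """Return (TG index, point index) of the k training Pareto points whose
    normalized T coordinate is closest to the normalized knot t_l."""
    scored = [
        (abs(t - t_l), i, j)
        for i, front in enumerate(train_fronts)
        for j, (t, _e) in enumerate(front)
    ]
    scored.sort()
    return [(i, j) for _d, i, j in scored[:k]]


def pareto_front(points: List[Point]) -> List[Point]:
    """Non-dominated subset of the evaluated design points."""
    front: List[Point] = []
    best_e = float("inf")
    # After sorting by T, a point is non-dominated only if it improves E.
    for t, e in sorted(set(points)):
        if e < best_e:
            front.append((t, e))
            best_e = e
    return front
```

In this sketch the parameter sets attached to the selected neighbours would then be passed to the density filter, and the m retained sets would be evaluated by the original heuristic before pareto_front is applied to the collected results.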
There are two obvious use cases where our framework can be applied efficiently.
• Generating the most efficient design points that satisfy predefined constraints
on objectives: this use case is similar to the procedure described above, where the
targeted point is defined by the constraints on the objectives. Our approach provides a
large advantage over Multi-objective Algorithms (MOAs) in terms of execution time, since
we only need to evaluate several points on the objective space, while traditional MOAs
need to generate the whole Pareto front before obtaining the design points that satisfy
the constraints. The difference is especially significant because the bottleneck is
usually the scheduling procedure, which is called to generate each design point on the
objective space.
• Generating the whole Pareto front of a new TG: in this scenario, the designer
can divide the estimated Pareto curve into n intervals and run the Trace back procedure
for each knot of these intervals. The results are combined to form the Pareto front of
the new TG. The traditional alternatives for this use case are existing Multi-objective
Algorithms, of which GA is one of the most prominent candidates.
Although our framework is presented with an example of a 2-dimensional objective
space, applying it to a multi-dimensional space is straightforward. The only
major change is in the Spline model for Pareto fronts (Subsection 5.3.1). The steps to
modify our framework for a multi-dimensional objective space are described below:
• Choose one of the objectives as the response (yi);
• Consider the remaining objectives as the predictors (xi, zi, ui, . . . );
• Build the Spline model for the response based on the polynomial of the predictors.
Eqn.5.6 will now include components for (zi, ui, . . .) similar to the ones for (xi);
• Apply the rest of the framework as described above.
The consequence of a multi-dimensional design space is that the number of Volatile
Coefficients might increase and the execution time of the whole framework might
be longer. However, the same implication exists for GA methods as well. Therefore, we
expect the speedup of our framework to remain the same or even improve further.
5.5 Experimental Results
A number of experiments are conducted to evaluate the performance and efficiency of
our framework. Because of limited space, in this section we report only the results
of the experiments conducted for the second use case, where the designer wants to generate
the whole Pareto front. This also allows a direct comparison of our framework with the
GA method. The GA optimization is implemented with the NSGA-II algorithm from the
NGPM package [155] in Matlab 2013 and run with a population size of 50 and 100
generations. The ML techniques are developed with R 3.2 and the Splines
package [131]. All experiments are performed on an Intel Core i7 2.26GHz CPU with 8
GB RAM.
5.5.1 Results for Scheduling Heuristic
We have applied our framework to a list-based scheduler named ECS [90]. The platform
under scheduling has 4 heterogeneous processors that can operate at different levels of
supply voltage, as in the original platform model [90]. The Energy and Schedule Length
are obtained with the energy model and execution model used in the original ECS
scheduler [90]. The criteria for comparison in our experiments are the quality of the
generated Pareto fronts and the execution time of all methods.
GA Method over Original Scheduler
In the first experiment, we examine the efficiency of the GA method and the accuracy of
our Pareto front estimation, which is the result of our ML framework prior to applying
the Trace back module. This experiment is performed with 3 synthesized groups of
TGs, each with 50 TGs, which are generated by the TGFF tool [51] with different levels
of parallelism: fat, medium and slim. Out of the 50 TGs in each group, 40 TGs are used as
the training set in Phase 1 and Phase 2 of our framework. The predictive models are built
with a 10-fold Leave One Out Cross Validation process [58] to assure the generalization
capability of the models. The other 10 TGs are used as new TGs to test the accuracy of
the ML techniques. All the results shown in this Section are from the test set.
Figure 5.7: Result for single TG from fat group
Fig.5.7 shows the design space for one TG of the fat type. The red plus sign represents
the scheduling result of the original ECS, the real Pareto front from the GA method is
marked with blue circles, and the Pareto curve estimated by the ML techniques is drawn
as a continuous green line. The GA clearly provides far better results in all objective
dimensions than the original scheduler. This is expected, since the GA pays a large
price in running time to achieve such a superior result, as shown later in the runtime
analysis. The more interesting observation is that the Pareto curve estimated by our
two-level ML techniques is very close to the real Pareto front generated by the GA.
This result indicates that both the Spline Regression and the Linear Regression model
the dependency between the TG's features and the Pareto front curve well.
Fig.5.8 combines all the results from the 10 TGs in the test set into one plot. As can be
seen from these figures, the superiority of the GA over the original scheduler and the
accuracy of our estimated Pareto curve hold for all the TGs in the test set across the 3
different TG types. Another interesting phenomenon is that the shape of the Pareto fronts
becomes more homogeneous when moving from the fat to the slim group; the results of the
GA also become less dominant over the results of the original scheduler. This can be
explained by the fact that TGs with higher parallelism have more different ways to be
placed on the computational platform; hence, their design spaces are bigger and more
heterogeneous. The chance that the original scheduler produces a result in a suboptimal
region of the design space is also higher.
Figure 5.8: Combined result of all TGs in test set
Our Framework over GA Method
While the previous experiment has shown that the estimated Pareto curves are very close to
the GA Pareto fronts, they are only an intermediate result and need to be processed by the
Trace back module to generate real design points on the objective space. In this
experiment, we examine the final result of our framework, which is generated after the
Trace back module. These results are compared directly with the Pareto fronts from the GA
method. Fig.5.9a and Fig.5.9b show the results for one TG of the fat type and one of the
medium type. The red points are the Pareto front generated by the GA, and the green line
is the estimated Pareto curve from the beginning of the Trace back procedure. All the
points generated by the Trace back step are marked in blue, where the plus signs and
square signs represent the results for m = 1 and m = 10 respectively. As can be seen, the
Pareto fronts generated by our framework are very close to the ones from the GA method.
The figure also shows how the quality of our Pareto fronts improves as m increases.
Since the most time-consuming process in both the GA and our framework is executing
the scheduler to get a design point on the objective space, we designed this experiment
around two hyper-parameters: n-interval and m-potential candidates, which directly de-





(c) Sparse Matrix Solver
Figure 5.9: Pareto fronts generated from GA and our framework
m and n, we quantify the quality of the Pareto fronts using a popular metric in the MOA
domain: the R2-indicator (R2I) [32], where the reference set is the origin of the objective
space. The quality degradation of the results generated by our framework compared
with the GA's Pareto fronts is measured relatively by the Quality Trade-off (QT),
expressed as a percentage of the GA's R2-indicator. The measurements are averaged over
all 10 TGs of the test set and reported along with the execution time in Table 5.1. As
can be observed from both the Table and the Figure, the quality of our framework
approaches that of the GA method as the number of evaluations increases (by
increasing m or n), and the price of that improvement is a nearly linear increase
in execution time. However, to achieve quality comparable to the result of the GA
we need only a fraction of the time. With m = 1 and n = 20, we achieve a speed-up of
2 orders of magnitude over the GA with less than 1% deficiency in the quality of the
Pareto front for all types of task graph. This is possible because all the heavy
computation is moved to the training phase and the framework takes advantage of the ML
models built upon the historical data. As discussed in Section 5.4, the runtime overhead
of the ML method can be broken down into 4 main components: Feature Extraction, applying
the Linear Regression model, applying the Cubic Spline model and Denormalizing. While the
Feature Extraction part has more or less the same complexity as the original scheduler,
the other components are very simple computations: vector multiplications for the Linear
Regression and Denormalizing blocks, and 3-degree polynomial computation for the
Cubic Spline block.
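To make the two quality measures concrete, the following Python sketch computes an R2 indicator and the Quality Trade-off. The text does not spell out the exact R2 formulation, so the sketch assumes the common unary R2 indicator with a weighted Tchebycheff utility, uniformly spaced weight vectors and the origin as reference point. The function names and the number of weight vectors are hypothetical choices. The QT formula (relative difference to the GA's R2I, in percent) is consistent with the values reported in Table 5.1.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # normalized (schedule length, energy), both minimized


def r2_indicator(front: Sequence[Point], num_weights: int = 101,
                 ref: Point = (0.0, 0.0)) -> float:
    """Unary R2 indicator with weighted Tchebycheff utility (lower is better).

    Assumption: weight vectors (lam, 1 - lam) are spread uniformly over [0, 1].
    """
    total = 0.0
    for w in range(num_weights):
        lam = w / (num_weights - 1)
        # Best (smallest) weighted Tchebycheff distance to the reference point.
        total += min(
            max(lam * abs(p[0] - ref[0]), (1.0 - lam) * abs(p[1] - ref[1]))
            for p in front
        )
    return total / num_weights


def quality_tradeoff(r2_ours: float, r2_ga: float) -> float:
    """Quality Trade-off (QT): degradation in percent of the GA's R2 indicator."""
    return 100.0 * (r2_ours - r2_ga) / r2_ga
```

For example, quality_tradeoff(0.4201, 0.4194) evaluates to about 0.17, which matches the QT entry of the fat group for m = 1 and n = 20.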
Real-life Applications
In this experiment, we apply the predictive model built from the fat group to the task
graphs of realistic applications: MP3 decoder [84], robot control, sparse matrix solver
and fpppp from the benchmark [169]. The model from the fat training set is chosen
because the fat type has the largest variance in parallelism among the 3
Table 5.1: Execution time and quality comparison
TGs      Metric   Our approach                                GA
         m        1       5       10      5       5       50
         n        20      20      20      40      80      100
Fat      R2I      0.4201  0.4198  0.4196  0.4195  0.4195  0.4194
         QT (%)   0.17    0.10    0.06    0.04    0.03    0
         Time     24      68      124     132     261     2654
Medium   R2I      0.4179  0.4172  0.4167  0.4166  0.4165  0.4164
         QT (%)   0.34    0.16    0.06    0.03    0.01    0
         Time     21      59      105     112     222     2564
Slim     R2I      0.4202  0.4186  0.4183  0.4180  0.4179  0.4178
         QT (%)   0.59    0.20    0.12    0.07    0.02    0
         Time     19      57      106     115     221     2542
MP3      R2I      0.4209  0.4202  0.4199  0.4196  0.4195  0.4185
         QT (%)   0.58    0.41    0.35    0.27    0.24    0
         Time     16      31      48      58      115     2348
robot    R2I      0.4316  0.4208  0.4189  0.4193  0.4189  0.4164
         QT (%)   3.64    1.04    0.59    0.68    0.58    0
         Time     21      52      92      101     198     2554
sparse   R2I      0.4228  0.4188  0.4185  0.4187  0.4186  0.4169
         QT (%)   1.41    0.44    0.36    0.42    0.39    0
         Time     20      52      93      101     200     2833
fpppp    R2I      0.4250  0.4206  0.4164  0.4205  0.4161  0.4143
         QT (%)   2.57    1.52    0.49    1.48    0.42    0
         Time     30      89      155     215     425     6632
groups; hence, the model built on these TGs is the most flexible and generalizable.
Fig.5.9c presents the qualitative results for the sparse task graph. The figure again
demonstrates the capability of our ML methods to accurately generate the Pareto fronts of
new TGs based only on their features (without any prior information about the TG). The
quantitative results are shown in the second half of Table 5.1 with the same metrics as in
the first experiment. The same speed-up of 2 orders of magnitude is achieved for the
simplest configuration (m = 1, n = 20) with a small degradation in quality (the trade-off
is still less than 4%).
5.5.2 Result for Mapping Heuristic
To validate the capability of our framework for the Mapping problem, we have applied the
framework to a list-based mapping algorithm reported in [161]. In contrast to ECS,
the mapping algorithm operates on the SDF graph as its application model. Since the
original mapping takes into account only the throughput objective, we have applied the
estimation model reported in Chapter 3 to compute the energy consumption as the second
objective in the design space. The computing platform under mapping is a tile-based
platform with a mesh of 4x4 tiles as described in Chapter 3. The criteria for comparison
in our experiments are the quality of the generated Pareto fronts and the execution time
of all methods. The experiments are designed with both synthetic SDF graphs and the SDF
models of real-life applications.
Synthetic Graphs
Similar to the previous experiments, we have generated 50 synthetic SDF graphs with the
SDF3 tool [160]. Then, we used 40 SDF graphs as the training set in Phase 1 and Phase 2
of our framework. The predictive models are built with a 10-fold Leave One Out Cross
Validation process [58] to assure the generalization capability of the models. The other
10 SDF graphs are used as new SDF graphs to test the accuracy of the ML techniques.
Figure 5.10: Pareto fronts generated for Mapping Heuristic
All the results shown in this Section are from the test set.
Varying the same hyper-parameters m and n as in the previous experiments for ECS,
we run our framework with different settings and compare the results with the outcome
of the GA method. An example of the result for one of the synthetic SDF graphs from the
test set is presented in Fig.5.10a: the red circles represent the Pareto front from the GA
method, and the results from our framework are shown in blue, where a simple setting
(m = 5, n = 20) is marked with plus signs and the most complex setting (m = 10, n = 80)
with square signs. As can be seen, the Pareto front generated with the simple setting
(m = 5, n = 20) follows the result of the GA method quite closely. Furthermore,
there are points generated with the most complex setting that surpass the quality of the
GA's Pareto front. The quantitative results are reported in the first group of Table 5.2.
With the simplest setting (m = 1, n = 20), our method generates the Pareto front about
200x faster than the GA while sacrificing only 7.5% in the quality of the result. With the
most complex configuration (m = 10, n = 80), we approach the result of the GA
method with only a 2.5% trade-off in quality while still shortening the execution time by
7 times.
Compared to the results for the Scheduling Heuristic, our framework becomes less
efficient when applied to the Mapping Heuristic. This phenomenon might be explained by
two factors. First, the cost function of the Mapping Heuristic requires a much more
complicated computational process than the cost function of ECS. Therefore, the
predictive models (linear regression model and spline regression model), which work well
in modeling the behavior of ECS, cannot describe all the characteristics of the Mapping
Heuristic. A possible solution for this issue is to apply more sophisticated ML techniques
such as Artificial Neural Networks or Random Forests, which operate as "black box" models
that can capture more complex behavior of the Mapping Heuristic. The second factor
influencing the result of our framework is the complexity of the computational platform
under consideration. In the case of the Mapping Heuristic, the platform includes 4x4 tiles
of heterogeneous processing elements, whereas the platform used in the ECS experiment
contains only 4 processors. The large number of processing elements significantly enlarges
the design space and makes the behavior of the mapping process more unpredictable.
Real-life Applications
In this experiment, we apply the predictive model built from the synthetic SDF graphs to
the SDF graphs of real-life multimedia applications: H.263 encoder (5 actors),
MPEG-4 decoder (5 actors), JPEG decoder (6 actors) and sample rate converter (6 actors)
as in Chapter 3. The qualitative results of 2 applications, the MPEG-4 and JPEG decoders,
are presented in Fig.5.10b and Fig.5.10c respectively. As can be seen, the result from a
simple setting (m = 5, n = 20) of our framework is not very close to the GA Pareto front.
However, with the more complex configuration (m = 10, n = 80), the proposed framework
delivers results comparable to the GA method (or sometimes better, as in the JPEG
example). The detailed results are presented in Table 5.2. The table shows that for a
medium configuration (m = 10, n = 20), our framework just needs to trade off less




Table 5.2: Results for Mapping Heuristic
SDFs            Metric    Our approach                                GA
                m         1       5       10      10      10      50
                n         20      20      20      40      80      100
Synthetic SDFs  R2I       0.1607  0.1572  0.1560  0.1537  0.1532  0.1495
                QT (%)    7.50    5.20    4.37    2.82    2.48    0
                Time      46      181     339     701     1357    9245
                Speedup   201     51      27      13      7       1
H263 Encoder    R2I       0.5035  0.5035  0.5035  0.5035  0.5035  0.4954
                QT (%)    1.63    1.63    1.63    1.63    1.63    0
                Time      59      238     464     994     1997    11558
                Speedup   194     48      25      11      6       1
JPEG            R2I       0.0478  0.0454  0.0430  0.0409  0.0409  0.0410
                QT (%)    16.36   10.72   4.70    -0.35   -0.35   0
                Time      18      40      71      185     424     2658
                Speedup   144     66      37      14      6       1
MPEG4           R2I       0.8748  0.8746  0.7865  0.7849  0.7849  0.7777
                QT (%)    12.48   12.46   1.13    0.92    0.92    0
                Time      96      469     949     2293    4441    25545
                Speedup   265     54      27      11      6       1
samplerate      R2I       0.6623  0.5243  0.5243  0.5222  0.5222  0.5210
                QT (%)    27.13   0.63    0.63    0.24    0.24    0
                Time      144     618     1230    2696    5086    32666




This chapter presents a generic framework that utilizes Genetic Algorithm and Machine
Learning techniques to predict the Pareto front of multi-objective list-based heuristics in
System Level Synthesis. While the GA optimization provides Pareto fronts that are far
better than those of the original schedulers at the cost of a large increase in execution
time, our multilevel ML techniques with Spline Regression and Linear Regression can
accurately generate the Pareto curve with a much lower execution time and a small
degradation in the quality of the Pareto front. Moreover, the framework has been tested
with both Mapping and Scheduling, which are the most important tasks in System Level
Synthesis of digital systems. Although the efficiency and effectiveness of our framework
vary for different problems, all the experimental results show promising potential for
applying ML techniques to improve the performance of list-based System Level Synthesis
tasks. The optimization framework in this chapter also wraps up our contributions on
System Level Synthesis.
Chapter 6
Exploiting Loop-Array Dependencies for
Design Space Exploration with High
Level Synthesis
In this chapter we move to the contributions on Micro or Device Level Synthesis and
Exploration, where High Level Synthesis (HLS) plays an indispensable role. Recently,
HLS tools that can automatically and quickly generate the RTL design from high-level
abstraction languages have emerged as the next mainstream in the world of electronic
system design, with the appearance of many commercial products and academic research
works [37, 44]. With the ability to generate hardware designs quickly and easily,
HLS tools allow designers to evaluate different architectural implementation
alternatives for the same high-level behavioral description in order to satisfy different
performance requirements and design constraints. To generate a new architectural
implementation, the designers need to adjust a set of parameters that control the RTL
generation process, such as: number of execution units, amount of resources to share,
memories to allocate, data types and sizes, algorithm choice, pipeline stages, unrolling
factors, etc. Each combination of choices for these parameters provides a single option
(performance and resource usage) for the generated hardware (HW) engine and forms a
specific design point in the design space. The number of design points grows
exponentially with the number of parameters taken into account. Furthermore, the long
runtime of the HLS tool itself is also a bottleneck that makes the DSE for HLS a very
time-consuming process [141].
Figure 6.1: Motivational example (Loop 1: for (i=0; i<N; i++) with unroll factor U1 and array partition factors Pa, Pb, Pc; the design space of previous works is U1*Pa*Pb*Pc = N^4, which reduces to U1 alone once the loop-array dependency is explored)
To address the time-consuming nature of DSE in HLS, different approaches have
been proposed. Most of them prune the design space using heuristic algorithms
such as simulated annealing, genetic algorithms, etc. [132, 139]. Some approaches
apply machine learning techniques by building predictive models for HLS tools
[35, 92]. Although these approaches can reduce the DSE time, they significantly
sacrifice the quality of their solutions and can reach only suboptimal solutions. The main
reason is that existing approaches do not examine the structure of the design space
and the relationship between the explorable parameters of the problem. One of the most
important relationships, with a significant effect on the performance of HLS tools, is the
dependency between loops and arrays [127, 192].
To illustrate the importance of the loop-array dependency for the DSE of HLS, we
examine a simple example. Fig.6.1 shows the pseudo code for adding two vectors b and
c and storing the result into vector a. If the dimension of the vectors is N, then we
have N options for the unroll factor of Loop 1 and N options for the partition factor of
each of the arrays a, b and c. This implies that if we do not consider the access pattern
of the arrays and the loop-array dependency, the design space complexity is O(N^4).
However, it is easy to see that if the partition factor of an array is greater than the
unroll factor of the loop, the HW engine cannot utilize all the available partitioned
memory banks. Hence, resources are wasted for any array partition factor that is greater
than the unroll factor of the loop accessing the array. Moreover, due to the simplicity of
the code, we can observe that the optimal partition factor of the arrays should be the
same as the unroll factor of the loop. Therefore, it is desirable and sufficient to
traverse the design space for the loop unrolling factor only, and doing so reduces the
size of the design space to O(N).
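The reduction from O(N^4) to O(N) can be checked with a small Python sketch. The two functions below are hypothetical helpers written for this illustration; they only enumerate the candidate tuples (U1, Pa, Pb, Pc) and assume, as in the example above, that every factor ranges over 1..N and that the best partition factor equals the unroll factor.

```python
from itertools import product
from typing import List, Tuple

Config = Tuple[int, int, int, int]  # (U1, Pa, Pb, Pc)


def naive_design_space(n: int) -> List[Config]:
    """All combinations when each of the four factors ranges over 1..n."""
    factors = range(1, n + 1)
    return list(product(factors, repeat=4))


def reduced_design_space(n: int) -> List[Config]:
    """Only the unroll factor is explored; the partition factors follow it."""
    return [(u, u, u, u) for u in range(1, n + 1)]
```

For N = 8, the first function returns 4096 candidate configurations while the second returns only 8, each of which is also contained in the naive space.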
From the example (Fig.6.1), it is clear that the unroll factor of the loop and the
partition factors of the arrays are closely interdependent. Furthermore, when the access
pattern of the arrays is complex, it is difficult to extract the dependency between the
unroll factors of the loops and the partition factors of the arrays, as well as to define
the optimal partition factors that fit the unroll factors. None of the previous works
have exploited such dependencies to improve the DSE process of HLS tools. In this work,
we propose a DSE framework for HLS tools that exploits the loop-array dependency to
reduce the evaluation time while still finding optimal or near-optimal solutions. Within
the framework, the main contributions are the following:
• Loop-array Dependency Graph: A systematic and formal method to represent the
relationship between loops and arrays. We also developed a tool to extract the
graph from C code.
• Array Partition Factor Computation Block: A module that generates the Pareto
optimal array partition factors according to the related loop optimization
techniques.
• Novel Framework for DSE in HLS: A multilevel DSE approach that efficiently
exploits the loop-array dependency to significantly reduce the DSE time.
6.1 Proposed DSE Framework for HLS
DSE with HLS is usually a multi-objective optimization problem with regard to various
conflicting objectives such as performance (throughput, latency), hardware usage,
power consumption and reliability. In this work, we address the two main concerns of
any digital system design: performance and hardware usage. However, the framework
can be extended to include other criteria as well. Our main goal is to produce a Pareto
optimal front of the designs so that the designer can choose between different trade-off
points.
In HLS, optimization techniques related to loops and arrays have the most
significant impact on the performance and resource usage of any hardware
implementation [44, 192]. In our framework, we consider two prominent optimization
techniques for loops: Loop pipelining and Loop unrolling. Similarly, the most important
array optimization technique, Array partitioning, is also taken into account. Another
reason for the choice of these optimization techniques is that they are commonly
available in most HLS tools [98]. Therefore, our framework can be integrated with
a variety of different HLS tools. The proposed framework is a multi-level DSE
solution: first, the normal DSE algorithm runs for the optimization of the loop
parameters. Then, inside each iteration over a loop parameter set, another DSE process
for the array optimization parameters is executed.
The overall framework of our DSE approach is presented in Fig.6.2. The three main
components of the framework are the DSE Algorithm, the Loop-Array Dependency
Extractor (LADE) and the Array Parameters Computation Block (APCB). The DSE
algorithm is the main block that controls the whole DSE process by generating a new
loop parameter set (loop unrolling and pipelining parameters) for each iteration of the
framework. Thereafter, the newly generated loop parameter set is passed to the second
block (APCB) as a reference for the second DSE process, which finds the Pareto optimal
Array partitioning parameters for the corresponding input loop parameter set. The DSE
for the array level runs on a simulator that can guarantee efficient array partition
parameters for each loop parameter set. By using a simulator, the second level of DSE
takes much less time than traditional approaches that need to call the HLS tool
to evaluate each array parameter set. The second level of DSE for the array parameters is
executed in the Array Parameters Computation Block (details in Section 6.3). To initiate
the APCB, we need the Array Access Pattern and the information from the Loop-array
dependency (LAD) graph, which are the outputs of the LADE. In contrast to the iterative
execution behavior of the other blocks in the framework, the LADE is required to run
only once at the beginning of the process to extract the LAD graph and the access pattern
of the arrays, which are the references for the APCB. The architecture and functionality
of the LADE block are further discussed in Section 6.2.
Figure 6.2: Overall flow of proposed mapping strategy
After getting the result from the APCB, the tool passes the information about the
optimization techniques for both the loops and the arrays to the Directive Generator
block to form the directives for the HLS tool. Basically, this block is a script that
depends on the Directive Library of the different HLS tools. Finally, the HLS tool is
called to generate the new hardware implementation as well as to evaluate the performance
and resource usage of the current parameter set. The Pareto Checker compares the results
of the current parameter set with the previous ones to decide whether it belongs to the
Pareto front. Then, control is passed back to the DSE algorithm block to begin a new
iteration of the loop parameter DSE. The framework terminates when the stop condition of
the DSE algorithm block is met.
In this framework, the two blocks LADE and APCB reflect our main contributions,
which utilize the LAD to mitigate the timing issue of the DSE process without affecting
the quality of the solutions. These blocks are designed in a modular way, independent
of the DSE algorithm. Therefore, our framework gives users the freedom to
utilize different DSE algorithms for the loop parameter DSE process, such as exhaustive
search, heuristics (hill climbing, ant colony) or multi-objective algorithms like Genetic
Algorithm or Particle Swarm Optimization.
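The control flow of the two-level framework can be summarized by the following Python sketch. It is only a skeleton under stated assumptions: the callables apcb and hls_eval are hypothetical stand-ins for the simulator-based Array Parameters Computation Block and for the Directive Generator plus HLS tool call, the APCB is assumed to return a single array parameter set per loop parameter set, and the two objectives (for example latency and resource usage) are assumed to be minimized.

```python
from typing import Callable, Dict, Iterable, Tuple

LoopParams = Tuple[int, ...]
ArrayParams = Tuple[int, ...]
Objectives = Tuple[float, float]
Config = Tuple[LoopParams, ArrayParams]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b in every objective and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def run_dse(loop_sets: Iterable[LoopParams],
            apcb: Callable[[LoopParams], ArrayParams],
            hls_eval: Callable[[LoopParams, ArrayParams], Objectives]
            ) -> Dict[Config, Objectives]:
    pareto: Dict[Config, Objectives] = {}
    for loop_params in loop_sets:                     # first level: DSE algorithm
        array_params = apcb(loop_params)              # second level: no HLS call
        result = hls_eval(loop_params, array_params)  # one HLS call per iteration
        # Pareto Checker: skip the new point if an existing one is at least as good.
        if any(dominates(old, result) or old == result for old in pareto.values()):
            continue
        # Drop previously stored points that the new point dominates.
        pareto = {c: o for c, o in pareto.items() if not dominates(result, o)}
        pareto[(loop_params, array_params)] = result
    return pareto
```

Because loop_sets is any iterable, the same skeleton accepts an exhaustive enumeration or the candidates proposed by a heuristic, which mirrors the modular design described above.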
To illustrate the advantages of our approach over the traditional approaches, in this
work we use an exhaustive method for the loop parameter DSE, as described in Algorithm
8. We have customized the traditional exhaustive algorithm to fit the problem of
loop parameter optimization. As mentioned above, we consider the two most important and
popular loop optimization techniques: pipelining and unrolling. When pipelining is
applied to an outer loop, all the loops nested in this loop are automatically
unrolled [1]. Therefore, the pipeline directive is more critical and needs to be traversed
first using the outer For loop (Line 2). First, the case of no pipelined loop is
considered, where all the possible options of unrolling loops are examined using the
TRAVERSE() function (Line 4-7). Then, the loops are pipelined in order from inner-most to
outer-most. Whenever one loop in the loop nest is pipelined (Line 8), the unroll factors
of the loops under it in the hierarchy are set to their maximal values (Line 10), while
the unroll factor exploration is carried out for the loops outside the pipelined loop
(Line 12). The algorithm terminates when the outer-most loop is pipelined, i.e. PL = 1.
TRAVERSE(N, PL, U) is a recursive function that explores all the unroll factors of
the loops outside loop N and stores the current pipeline and unroll factors in the
variables PL and U.
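The enumeration order described above can be sketched in Python as follows. This is not a transcription of Algorithm 8: the recursion of TRAVERSE() is replaced by itertools.product, PL = 0 is used here to encode the case without pipelining, and the unroll factor of the pipelined loop itself is assumed to stay at 1, since the text does not state how that loop is treated.

```python
from itertools import product
from typing import Iterator, List, Tuple


def loop_parameter_sets(max_unroll: List[int]) -> Iterator[Tuple[int, Tuple[int, ...]]]:
    """Yield (PL, U) pairs for a loop nest given the maximal unroll factors.

    max_unroll[0] belongs to the outer-most loop (level 1). PL is the pipelined
    loop level (0 = no pipelining, an encoding chosen for this sketch) and U
    holds one unroll factor per loop level.
    """
    n = len(max_unroll)
    # Case 1: no loop is pipelined, so every unroll combination is explored.
    for u in product(*(range(1, m + 1) for m in max_unroll)):
        yield 0, u
    # Case 2: pipeline one loop, from the inner-most (n) to the outer-most (1).
    for pl in range(n, 0, -1):
        inner = tuple(max_unroll[pl:])  # loops nested under pl: fully unrolled
        outer_ranges = [range(1, m + 1) for m in max_unroll[:pl - 1]]
        for outer in product(*outer_ranges):  # loops outside pl: explored
            # Assumption: the pipelined loop itself keeps unroll factor 1.
            yield pl, outer + (1,) + inner
```

For a two-level nest with maximal unroll factors [2, 3], the generator yields 6 configurations without pipelining, 2 with the inner loop pipelined and 1 with the outer loop pipelined.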
6.2 Loop-Array Dependency Graph
This Section discusses the LAD graph and its usage in the proposed framework.
As shown in Fig.6.2, the LADE block takes the C code as input and generates the Array
Access Pattern and the LAD graph. Before discussing these two forms
of representation, we need some preliminary concepts of the Polyhedral Model, which is
the foundation of these representations.
6.2.1 Polyhedral Model
The Polyhedral Model is an alternative representation of programs that offers high
potential for analysis, expressiveness and flexible transformation of loop nests. The
Polyhedral Model is based on three basic concepts: iteration domain, scattering
function and access function. In the scope of this work, only the iteration domain and
the access function are utilized, so only these two concepts are defined. Readers who are
interested in the whole Polyhedral Model can refer to [24] for more complete
definitions.
• Definition 1: Iteration Domain. Given a nest of N loops, the iteration vector I
Algorithm 8 Exhaustive DSE algorithm for loop optimization
Input: Loop nest with N loops
Output: All possible loop optimization parameter sets
int PL // variable indicating the loop level at which pipelining is applied
int U[N] // array storing the unroll factor of each loop level
for i = N + 1 to 1 do
  ...
  for j = i + 1 to N do
    U[j] = Maxj
  end for
  ...
if N = 1 then
  for i = 1 to Max1 do
  ...
for i = 1 to MaxN do
  U[N] = i
  ...
is defined as: I = (i_1, i_2, ..., i_N), where i_j = 1, ..., Max_j is the iterator of the
j-th loop. All possible values of the iteration vector form the Iteration Domain of the
given loop nest: D_I = {(i_1, i_2, ..., i_N) ∈ Z^N | 0 ≤ i_j ≤ Max_j, j = 1, ..., N},
and each iteration vector represents one iteration instance of the loop nest.
• Definition 2: Array Domain. Given an array A with M dimensions, the access
vector of array A is defined as R_A = (a_1, a_2, ..., a_M). Each particular instance of the
access vector gives access to one element of the array, A[a_1][a_2]...[a_M], and all
possible values of the access vector form the Array Domain:
D_A = {(a_1, a_2, ..., a_M) ∈ Z^M | 0 ≤ a_j ≤ Size_j, j = 1, ..., M}, where Size_j
indicates the size of the array in the j-th dimension.
• Definition 3: Array Access Function. Given the array domain D_A and the iteration
domain D_I, the array access function F_k for the k-th array reference is defined as:
F_k : D_I → D_A. The array access function tells us which element of
array A is accessed in a particular iteration of the loop nest. Since we consider only
affine accesses to arrays, the function F_k has the following form: F_k(I) = X ∗ I + Y, where
\[
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1N} \\
x_{21} & x_{22} & \cdots & x_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
x_{M1} & x_{M2} & \cdots & x_{MN}
\end{pmatrix},
\qquad
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix}
\]
are an M × N coefficient matrix and a constant vector of length M, respectively.
6.2.2 Array Access Pattern
• Definition 4: Array Access Pattern. Given a loop nest I and an array A, all references to array A in the loop nest form the access pattern of loop nest I for array A. The access pattern of each array is therefore the set of all references to that array, and is represented as a set of matrices X and vectors Y.
Loop 1: for (i = 0; i < N; i++)
Loop 2:   for (j = 0; j < N; j++) {
            C[i][j] = 0;
Loop 3:     for (k = 0; k < N; k++)
              C[i][j] = C[i][j] + A[i][k] * B[k][j];
          }
Figure 6.3: Example of matrix multiplication program
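We use a matrix multiplication loop nest as an example to clarify the concepts defined above. Fig.6.3 presents the listing of the loop nest, and the corresponding parameter values are as follows:
Iteration vector: $I = (i, j, k)$
Iteration Domain: $D_I = \{(0, 0, 0), (0, 0, 1), \ldots, (N-1, N-1, N-1)\}$
Access vector of array A: $R_A = (i, k)$
Array Domain of array A: $D_A = \{(0, 0), \ldots, (N-1, N-1)\}$
Array access function of array A:
$$F_A(I) = X_A \cdot I + Y_A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} i \\ k \end{pmatrix}$$
Access pattern for array A: matrix $X_A$ and vector $Y_A$
Similarly, the access patterns of arrays B and C are given by the vectors $Y_B = Y_C = Y_A$ and the matrices
$$X_B = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad X_C = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$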
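The access patterns of the matrix multiplication example can be checked numerically by evaluating the affine access function F(I) = X·I + Y. The following is a minimal Python sketch: the matrices are derived from the subscripts A[i][k], B[k][j] and C[i][j] of the loop nest in Fig.6.3, while the helper name `access` and the sample iteration (2, 3, 5) are illustrative assumptions rather than part of the framework.

```python
def access(X, Y, I):
    """Evaluate the affine access function F(I) = X*I + Y (one result per array dimension)."""
    return tuple(sum(x * i for x, i in zip(row, I)) + y for row, y in zip(X, Y))

# Access matrices of the matrix multiplication nest, with I = (i, j, k).
X_A = [[1, 0, 0], [0, 0, 1]]   # A[i][k]
X_B = [[0, 0, 1], [0, 1, 0]]   # B[k][j]
X_C = [[1, 0, 0], [0, 1, 0]]   # C[i][j]
Y_0 = [0, 0]                   # Y_A = Y_B = Y_C: no constant offsets in this nest

I = (2, 3, 5)                  # arbitrary sample iteration: i=2, j=3, k=5
a_ref = access(X_A, Y_0, I)    # element of A read in this iteration
b_ref = access(X_B, Y_0, I)    # element of B read in this iteration
c_ref = access(X_C, Y_0, I)    # element of C updated in this iteration
```

For the sample iteration, the three references resolve to A[2][5], B[5][3] and C[2][3], which matches the subscripts in the listing.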














6.2.3 Loop-Array Dependency Graph
The LAD graph captures the relationships between the loop levels as well as the dependencies between loops and arrays in the loop nest.
• Definition 5: LAD graph. Given a loop nest with N loops, L = {L1, L2, . . . , LN}, and all the arrays accessed by the loop nest, A = {A1, A2, . . . , AK}, the LAD graph is defined as a directed graph G(V,E), where V = L ∪ A and E is defined as follows:
• If the loops are ordered from outermost to innermost as L1, L2, . . . , LN, then edge (Li, Li+1) belongs to E.
• If array Aj is accessed by the iterator of loop Li, then edge (Li, Aj) belongs to E.
Fig.6.3b illustrates the LAD graph corresponding to the matrix multiplication loop nest given in Fig.6.3a. According to the above definition, there are 3 nodes representing the loops of the loop nest (L1, L2, L3) and 3 nodes representing the arrays (A, B, C). Optimization techniques applied to L1 affect the optimization of loop L2, array A and array C. Similarly, the other edges in the LAD graph indicate the remaining dependencies between loops and arrays. The LAD graph benefits the framework in two ways. First, the loop dependencies are passed to the DSE algorithm to define the order of loop pipelining. For example, when loop L2 is pipelined, the unroll factor of L3 should be maximal, and users do not need to consider other options for unrolling loop L2. Second, the dependencies between loops and arrays allow the APCB block to determine which array parameters should be considered for a given set of loop parameters. For example, whenever optimization techniques are applied to loop L1, only the partition factors of arrays A and C need to be examined, and there is no need to evaluate the optimization techniques for array B. These representations are customized to the DSE problem for HLS and are used efficiently in our framework.
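The two edge rules of Definition 5 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each array subscript is taken to be a single loop iterator, and the names `build_lad_graph` and `arrays_affected_by` are hypothetical, not taken from the framework.

```python
def build_lad_graph(loops, references):
    """Return the edge set E of the LAD graph.

    loops      -- list of (loop name, iterator) pairs, outermost first
    references -- {array name: list of subscript tuples made of iterator names}
    """
    edges = set()
    # Rule 1: consecutive loop levels are linked from outer to inner.
    for (outer, _), (inner, _) in zip(loops, loops[1:]):
        edges.add((outer, inner))
    # Rule 2: a loop is linked to every array indexed by its iterator.
    for loop, iterator in loops:
        for array, refs in references.items():
            if any(iterator in subscripts for subscripts in refs):
                edges.add((loop, array))
    return edges

def arrays_affected_by(edges, loop, arrays):
    """Arrays whose partition factors need to be examined when `loop` is optimized."""
    return sorted(a for a in arrays if (loop, a) in edges)

# Matrix multiplication nest: C[i][j] += A[i][k] * B[k][j]
loops = [("L1", "i"), ("L2", "j"), ("L3", "k")]
references = {"A": [("i", "k")], "B": [("k", "j")], "C": [("i", "j")]}
edges = build_lad_graph(loops, references)
affected = arrays_affected_by(edges, "L1", references)
```

For loop L1 the sketch returns arrays A and C, in line with the observation that array B does not need to be re-evaluated when L1 is optimized.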
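Block partition (memory bank index of each array element):
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
4 4 5 5 6 6 7 7
4 4 5 5 6 6 7 7
Cyclic partition (memory bank index of each array element):
0 1 2 3 0 1 2 3
4 5 6 7 4 5 6 7
0 1 2 3 0 1 2 3
4 5 6 7 4 5 6 7
APS Demonstration: HW instance 1 (left half of the array), HW instance 2 (right half of the array)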
Figure 6.4: APCB architecture
6.3 Array Parameter Computation Block
The APCB block outputs the Pareto-optimal parameters of the array optimization techniques for the current loop parameter set. Fig.6.4 illustrates the architecture as well as the input and output data of this block. In essence, the APCB performs the second level of the DSE process, exploring the array parameters while the loop parameters are fixed. The Partition Strategy Generator traverses the design space of array parameters and exhaustively produces new array partition strategies for the Access Pattern Simulator (APS). Taking as input the access pattern, the current loop optimization parameters and the array partition strategy, the APS simulates the memory access behavior of the given hardware design and outputs the number of cycles needed for memory access. Once the memory access cycles of all partition strategies have been obtained, the results are passed through the Pareto Optimization Filter (POF) to obtain the final Pareto-optimal array partition strategies. The implementation details of these sub-modules are given in the following sections.
6.3.1 Array Partition Strategy
Memory partitioning is a widely used technique to improve memory bandwidth without data duplication. The original array is placed into N non-overlapping banks, and each bank is implemented as a separate memory block to allow simultaneous accesses to different banks. The two most commonly used data partition schemes are block partition and cyclic partition, as shown in Fig.6.4. These schemes provide regular partitions and can therefore be implemented easily. This is desirable for hardware synthesis, as the extra logic required to handle irregular patterns may increase the final design area drastically. Moreover, these schemes are widely supported by HLS tools such as Vivado HLS and LegUp. Therefore, the memory partition strategies in our framework cover both cyclic and block partition schemes and are defined as follows:
• Definition 6: Array partition strategy. Given an array A with M dimensions, an array partition strategy for A is defined as an (M + 1)-tuple PS = (p0, p1, . . . , pM), where:
• pi is the partition factor for the ith dimension (i ≠ 0), and
• p0 = 0 for block partition, or p0 = 1 for cyclic partition.
Mapping function for array partition strategies: Given an array partition strategy, the mapping function defines the index of the memory bank for each element of the array. The mapping function P maps an array address A = (a1, a2, . . . , aM) in the array domain to the partitioned memory banks, that is, P(A) is the index of the memory bank that A belongs to after partitioning: P : DA → Z. The details of the memory mapping functions for the block and cyclic schemes are provided in Algorithm 9.
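A mapping function of this kind can be sketched as follows. This is a minimal Python illustration, not the thesis implementation: the row-major combination of the per-dimension bank indexes and the clamping of leftover elements into the last block bank are assumptions, chosen because they reproduce the bank numbering of a 4 × 8 array split into 8 banks with partition factors (2, 4). The names `bank_index` and `bank_grid` are hypothetical.

```python
def bank_index(strategy, sizes, address):
    """Map an array address to a memory bank index.

    strategy -- (p0, p1, ..., pM): p0 = 0 for block, 1 for cyclic; pi = factor of dimension i
    sizes    -- (Size_1, ..., Size_M)
    address  -- (a1, ..., aM)
    """
    scheme, factors = strategy[0], strategy[1:]
    bank = 0
    for size, p, a in zip(sizes, factors, address):
        if scheme == 0:
            chunk = max(size // p, 1)        # block: elements per bank along this dimension
            local = min(a // chunk, p - 1)   # assumed: leftover elements go to the last bank
        else:
            local = a % p                    # cyclic: consecutive elements rotate over banks
        bank = bank * p + local              # assumed row-major combination of dimensions
    return bank

def bank_grid(strategy, sizes):
    """Bank index of every element of a 2-D array, as a list of rows."""
    rows, cols = sizes
    return [[bank_index(strategy, sizes, (r, c)) for c in range(cols)] for r in range(rows)]

block = bank_grid((0, 2, 4), (4, 8))    # block partition, factors (2, 4)
cyclic = bank_grid((1, 2, 4), (4, 8))   # cyclic partition, factors (2, 4)
```

With these inputs, the first row of `block` is 0 0 1 1 2 2 3 3 and the first row of `cyclic` is 0 1 2 3 0 1 2 3.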
Algorithm 9 Memory mapping function for partition strategies
Input: Array partition strategy PS = (p0, p1, . . . , pM); array access vector A = (a1, a2, . . . , aM)
Output: Memory bank index of element A[a1][a2] . . . [aM]: bank number
if p0 = 0 then
// Block partition
for i = 1 to M do
temp_i = ⌊Size_i ÷ p_i⌋
...
for i = 1 to M do
...
6.3.2 Access Pattern Simulator
The APS uses an analytical function to approximate the memory cycles instead of calling the HLS tool, and thus achieves a much shorter execution time. The results generated by the APS provide a theoretical bound for the actual results obtained by calling the HLS tool. The performance gains and accuracy are examined further in the experimental section.
The implementation details of the APS are provided in Algorithm 10. First, the memory bank index of each array element is computed according to the partition strategy, as described in Algorithm 9 (Line 2). Then, the algorithm traverses all executing iterations of the HW engines (Line 4). In each iteration, it iterates through all the HW instances (Line 5), which are defined by the loop optimization parameters, and all memory references (Line 6), which are defined by the access pattern of the code. For each memory reference of each HW instance, the simulator computes the index of the accessed element in the original array using the array access function of Definition 3 (Line 7), and then derives the memory bank index of this element from the result of Line 2. All the bank indexes accessed in the current iteration are stored in a two-dimensional array R[l][k], where l indexes the HW instances and k indexes the memory references. Thereafter, the algorithm computes the frequency of each distinct value in array R (Line 11). The values in array R are the indexes of the memory banks accessed in the current iteration, and the frequency of each value is the number of accesses to the corresponding memory bank. The algorithm then identifies the memory bank with the maximum number of accesses and stores its frequency in the variable Max (Line 12). The final result, the total number of required memory cycles, is obtained by accumulating the memory cycles of all iterations (Line 13).
An example demonstrating the mechanism of the APS is presented in Fig.6.4. The demonstration considers the Denoise application with an input array of size (8, 4). The loop optimization parameters are an unroll factor of 2 for Loop 2 and no optimization for Loop 1. Hence, there are 2 HW instances working concurrently, as shown in the figure. The first HW instance operates on the left part of the original array, while the second instance processes the right part. The current partition strategy is the cyclic scheme with partition factors P1 = 2 and P2 = 4. Following this partition strategy, the original array is divided into 8 memory banks with indexes from 0 to 7. The number in each memory cell is the index of the memory bank that the cell belongs to. The red circles indicate the elements currently processed by the two HW instances. The surrounding elements marked by yellow squares show all the memory accesses required for processing the current iteration. In this particular iteration, each HW instance needs to access 1 element in bank 4, 1 element in bank 6 and 2 elements in bank 1. Since accesses to different memory banks can happen simultaneously, the memory access in bank 1 is the bottleneck, with 4 elements and 8 cycles (each access requires 1 cycle for transferring the address and 1 cycle for loading the value).
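The cycle count of this example can be reproduced with a short Python sketch of the simulation step. Several details are assumptions made for illustration: the Denoise references are modelled as a 4-neighbour stencil (up, down, left, right), which is inferred from the bank counts quoted in the example; the array is stored as 4 rows by 8 columns; boundary handling is ignored; and the names `cyclic_bank`, `iteration_cycles` and `total_cycles` are hypothetical.

```python
from collections import Counter

CYCLES_PER_ACCESS = 2   # 1 cycle to transfer the address + 1 cycle to load the value

def cyclic_bank(address, factors):
    """Bank index under cyclic partitioning (row-major combination of dimensions)."""
    bank = 0
    for a, p in zip(address, factors):
        bank = bank * p + a % p
    return bank

def iteration_cycles(centres, offsets, factors):
    """Memory cycles of one iteration executed by all HW instances in parallel."""
    banks = [cyclic_bank((r + dr, c + dc), factors)
             for (r, c) in centres        # one processed element per HW instance
             for (dr, dc) in offsets]     # one entry per memory reference
    busiest = max(Counter(banks).values())  # accesses to the same bank are serialized
    return busiest * CYCLES_PER_ACCESS

def total_cycles(iterations, offsets, factors):
    """Accumulate the memory cycles over all executing iterations."""
    return sum(iteration_cycles(centres, offsets, factors) for centres in iterations)

STENCIL = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # assumed Denoise neighbourhood
FACTORS = (2, 4)                               # cyclic partition, P1 = 2 and P2 = 4

# Two HW instances: one in the left half, one in the right half (4 columns apart).
one_iteration = [(1, 1), (1, 5)]
cycles = iteration_cycles(one_iteration, STENCIL, FACTORS)
```

For the iteration shown, both instances touch bank 1 twice, so bank 1 receives 4 accesses and the iteration costs 8 cycles, as stated in the example.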
Algorithm 10 Access Pattern Simulator implementation
Input: Array partition strategies, Array access pattern, Loop parameters
Output: Memory cycles needed
1: for each array element do
2:   Compute memory bank index using Algorithm 9
3: end for
4: for each executing iteration do
5:   for each HW instance l do
6:     for each memory reference k do
7:       Compute access vector A using Equation 6.1
8:       R[l][k] ← memory bank number of A (from Line 2)
9:     end for
10:   end for
11:   Count frequency of each element in array R
12:   Max ← highest frequency in array R
13:   Accumulate the memory cycles of the current iteration into the total
14: end for




6.3.3 Pareto Optimization Filter
After the memory cycles of all partition strategies have been obtained, this block compares the partition strategies based on their partition factors in all dimensions as well as their memory cycles. It then keeps all the non-dominated partition strategies and outputs the Pareto-optimal partition strategies in terms of memory cycles and partition factors. According to [1], resource utilization increases proportionally with the partition factor. Furthermore, since the number of computation cycles does not change for a fixed loop optimization parameter set, the overall performance is characterized by the memory cycles. Therefore, the partition strategies generated by this block are efficient in both latency and resource utilization. In the example demonstration in Fig.6.4, the Pareto-optimal partitions are marked with red circles.
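The filtering step amounts to a standard non-dominated sort over the cost vector (memory cycles, partition factors). The following Python sketch illustrates it; the function names and the cycle counts in the sample data are made up for illustration and are not measurements from the framework.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_filter(evaluated):
    """evaluated: {partition strategy (p0, p1, ..., pM): memory cycles}.

    Returns the non-dominated strategies, sorted for reproducibility.
    """
    # Cost vector = (memory cycles, p1, ..., pM); the scheme flag p0 is not a cost.
    cost = {ps: (cycles,) + tuple(ps[1:]) for ps, cycles in evaluated.items()}
    return sorted(ps for ps in cost
                  if not any(dominates(cost[other], cost[ps]) for other in cost))

# Illustrative data only: cyclic strategies (p0 = 1) with invented cycle counts.
evaluated = {
    (1, 1, 1): 64,
    (1, 2, 2): 32,
    (1, 4, 2): 32,   # same cycles as (1, 2, 2) but larger factors: dominated
    (1, 2, 4): 16,
    (1, 4, 4): 16,   # same cycles as (1, 2, 4) but larger factors: dominated
}
front = pareto_filter(evaluated)
```

Strategies that spend more partition factors without reducing the memory cycles are discarded, so only (1, 1, 1), (1, 2, 2) and (1, 2, 4) remain in this sample.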
6.4 Experimental Results
The results from our framework are compared with exhaustive search and with the following existing approaches: an Adaptive Simulated Annealing (ASA) approach [139] and an approach using a multi-objective Genetic Algorithm (NSGA) [35]. The ASA approach has been implemented in Matlab 2013 using the ASAMIN package [135]. Similarly, the evolutionary computing approach is implemented with the NSGA-II algorithm from the NGPM package [155] in Matlab 2013. The LADE block of our framework is implemented as an extension of the CLAN tool [24], while the DSE algorithm and the APS are developed in Java 8.0. The Vivado HLS tool is used as the high-level synthesis tool, with the Zedboard as the target platform. The experiments are performed for five applications from Polybench [126], a popular benchmark suite for loop- and array-related problems. These applications are chosen to reflect different array access patterns: an image processing function (Denoise), matrix multiplication (Matmul), a triangular solver (Trisolve), the Floyd-Warshall graph analysis algorithm (Floyd) and a 2D Seidel stencil computation (Seidel) [126]. The criteria of the comparison are the quality of the Pareto fronts generated by the above approaches and the effort needed by these approaches.
6.4.1 Quality of Pareto Front
Figure 6.5: Results for DSE on array parameters
The quality of the design spaces generated by the four approaches is compared in Fig.6.5 and Fig.6.6. These figures are plotted in log scale, where the Latency is the number of clock cycles reported by Vivado HLS and the hardware utilization (HW) is characterized by the following formula:
$$HW = \lambda_1 \cdot FF + \lambda_2 \cdot LUT + \lambda_3 \cdot BRAM + \lambda_4 \cdot DSP$$
where FF, LUT, BRAM and DSP are the numbers of flip-flops, LUTs, BRAMs and DSP blocks utilized in the design. The values of $\lambda_i$ are platform-dependent coefficients that are inversely proportional to the FF, LUT, BRAM and DSP resources available on the platform.
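The hardware utilization metric can be illustrated with a small Python sketch. Two assumptions are made here: each coefficient is taken as exactly the reciprocal of the available resource count (the text only states inverse proportionality), and the resource totals are the commonly quoted figures for the Zynq-7020 device on the Zedboard, which should be checked against the actual target before use.

```python
# Assumed resource totals of the target device (Zynq-7020 on the Zedboard).
AVAILABLE = {"FF": 106400, "LUT": 53200, "BRAM": 280, "DSP": 220}

def hw_utilization(used, available=AVAILABLE):
    """HW = sum of lambda_i * resource_i, with lambda_i = 1 / available_i (assumed scaling)."""
    return sum(used[name] / available[name] for name in available)

# Hypothetical design that occupies 10% of every resource type.
design = {"FF": 10640, "LUT": 5320, "BRAM": 28, "DSP": 22}
hw = hw_utilization(design)
```

With this scaling, a design that uses 10% of each of the four resource types scores 0.4, so every resource type contributes equally to the metric regardless of its absolute count.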
Fig.6.5 shows the results obtained from the second DSE level, on array parameters, for the Denoise application with unroll factors u1 = 2 and u2 = 4. As shown in the figure, the design points generated by our framework cover all the points on the Pareto front generated by the exhaustive approach. The NSGA algorithm can also generate Pareto-optimal points but needs more evaluations, while the ASA approach fails to do so. This coverage can be explained by the fact that the APS provides a theoretical bound on the latency of the evaluated points. Therefore, the points generated by our approach (which are obtained from the APS) always include all the optimal points in terms of Latency, which are part of the real Pareto front of the design space.
(a) Design Space for Denoise    (b) Pareto Front for Denoise
(c) Design Space for Matmul     (d) Pareto Front for Matmul
Legend, panels (a) and (c): Pareto front, Exhaustive, NSGA, ASA, APS
Legend, panels (b) and (d): Exhaustive, NSGA, ASA, APS+
Figure 6.6: Overall DSE
Similar results are observed for the overall DSE at the loop level and are presented in Fig.6.6. At this level, our framework clearly shows its advantage over the NSGA by covering a large fraction of the Pareto front, while the NSGA reaches fewer of its points. The quality of the results is described in more detail using two concepts proposed in [141]: Hypervolume and Pareto Dominance. The Hypervolume indicates the fraction of the design space that is dominated by the points generated by each approach, while the Pareto Dominance counts the number of design points from each approach that belong to the Pareto front. The left part of Table 6.1 (Quality) shows that the design points from our framework dominate those of the NSGA and ASA approaches for all the evaluated applications. The advantage of our approach at this level results from the efficient results of the array parameter DSE level. It is also observed that the Pareto points missed by our approach mostly belong to designs with low HW utilization (Fig.6.6d). This indicates that our approach does not efficiently detect the HW-optimal designs; the reason for this limitation is the simple estimation of the HW criterion using the array dimension in the Pareto Optimization Filter. This can be improved by integrating a more accurate HW estimator into the Pareto Filter block, which is one of our directions for future work.
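The two quality indicators can be sketched for the two-objective case (latency, HW) as follows. This is a hedged illustration, not the exact formulation of [141]: the hypervolume is computed as the area dominated by a point set up to an assumed reference point and then expressed relative to the exhaustive result, and all data points are invented.

```python
def pareto_front(points):
    """Non-dominated (latency, hw) points, sorted by increasing latency."""
    unique = set(points)
    return sorted(p for p in unique
                  if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in unique))

def hypervolume(points, ref):
    """Area dominated by `points` and bounded by the reference point `ref`."""
    area, prev_hw = 0.0, ref[1]
    for latency, hw in pareto_front(points):   # HW decreases as latency increases
        if latency < ref[0] and hw < prev_hw:
            area += (ref[0] - latency) * (prev_hw - hw)
            prev_hw = hw
    return area

def pareto_dominance(points, true_front):
    """Number of points of one approach that lie on the true Pareto front."""
    return len(set(points) & set(true_front))

# Invented sample data: (latency, hw) pairs.
exhaustive = [(1, 3), (2, 1), (3, 2), (2, 3)]
approach = [(1, 3), (3, 2)]
ref = (4, 4)   # assumed reference point, worse than every design in both objectives

hv_percent = 100.0 * hypervolume(approach, ref) / hypervolume(exhaustive, ref)
on_front = pareto_dominance(approach, pareto_front(exhaustive))
```

In this sample the approach recovers one of the two true Pareto points and covers 4 of the 7 area units dominated by the exhaustive front.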
6.4.2 Execution Time and Number of Evaluations
One of the most important features that define the efficiency of a DSE strategy is the number of evaluations it performs. The last two columns of Table 6.1 present the percentage of evaluated points and the execution time for all approaches. The number of evaluated points is normalized with respect to the exhaustive approach. As can be seen from Table 6.1, the execution time of each approach is roughly proportional to its number of evaluations. The reason is that the execution time of the algorithms themselves is negligible compared with the time required for each evaluation, which is performed by executing the HLS tool. In Table 6.1, our framework again demonstrates its advantages over the NSGA and ASA approaches in both number of evaluations and execution time. Furthermore, it brings a large improvement over the exhaustive approach, with an average speed-up of 14×. The efficiency of our approach comes from the fact that, in the second level of DSE for array parameters, a significant number of inefficient design points are eliminated by the APCB module. Hence, only promising candidates for the Pareto front are evaluated by the HLS tool in our approach.
Table 6.1: Quality of the design space

                         Quality                            Effort
Application  Approach    Hypervolume (%)  Pareto Dominance  Evaluated Points (%)  Time (s)
Denoise      Exhaustive  100              10                100                   66,484
             NSGA        96.88            5                 17.71                 18,456
             ASA         95.05            2                 15.23                 14,164
             APS         98.32            8                 13.41                 10,766
Matmul       Exhaustive  100              9                 100                   8,159
             NSGA        90.89            6                 17.97                 922
             ASA         88.41            2                 16.41                 679
             APS         96.21            8                 7.29                  371
Floyd        Exhaustive  100              3                 100                   29,134
             NSGA        88.80            1                 15.49                 2,894
             ASA         93.10            0                 16.63                 3,631
             APS         100              3                 6.90                  2,247
Trisolv      Exhaustive  100              4                 100                   193,848
             NSGA        78.26            2                 17.32                 22,098
             ASA         89.97            3                 17.06                 17,401
             APS         100              4                 16.93                 11,234
Seidel       Exhaustive  100              9                 100                   146,180
             NSGA        97.09            4                 16.67                 18,063
             ASA         96.35            2                 15.63                 15,244




This chapter presents an efficient framework that utilizes the loop-array dependencies to address the time-consuming problem of DSE for HLS. The proposed framework includes a systematic and formal representation of the loop-array dependencies and a simulation block that efficiently computes the Pareto-optimal array partition parameters for each loop parameter set. Moreover, a multilevel DSE algorithm is developed to exploit the framework. The experimental results show that our framework provides a better Pareto front in terms of resource utilization and latency in comparison to existing approaches




for Option Pricing Applications
Although they help to reduce development time, HLS tools usually generate hardware designs that are suboptimal in performance and resource usage compared with manual hardware implementations. To improve the performance of auto-generated RTL implementations from high-level descriptions without sacrificing the productivity of the development process, dedicated hardware generator tools need to be developed. The trade-off of such tools compared with HLS is their narrow scope of usage: they are usually developed to generate accelerators for a specific class of applications. A number of examples of auto-generation tools are given in Section 2.3. This chapter presents the implementation of our hardware generator tool for option pricing applications, which has been implemented with the loop-array dependency technique proposed in the previous chapter.
Option trading is a fundamental operation of every financial institution; therefore, evaluating options accurately and fairly is essential for the business of these financial companies and their customers [74]. However, the appearance of highly complex options, which involve multiple underlying assets or contain complicated contractual features, makes the pricing problem a very challenging task. In addition, more and more features have been added to the underlying models that describe stock prices in order to better reflect the real behavior of the market. All of these complexities make option pricing problems computationally intensive.
The demand for low latency is another important factor that drives financial firms to accelerate option pricing methods. In the era of automated trading, the prices of financial products change within a small fraction of a second. In this competitive environment, faster and more accurate pricing can bring a significant advantage and profit to financial firms. Therefore, high performance computing needs to be exploited to support real-time pricing [181].
Among the options for high performance computing in finance, such as GPUs or CPU clusters, the FPGA has proven itself a promising candidate for option pricing applications due to the following features. First, FPGA accelerators can exploit the different levels of parallelism inherent in the pricing methods, from instruction level and thread level to data level with pipeline parallelism [164]. Moreover, they can provide the same performance with much less power, size and cost than commodity CPUs [164]. Last but not least, the potential of a low-latency network interface for updating market information and trading decisions is a unique advantage of FPGA implementations.
Because of its high potential and advantages, the FPGA has been intensively studied over the last few years as a means to accelerate the option pricing problem. Although a number of hardware (HW) implementations for option pricing have been reported with large speedups and energy savings [167], productivity is still a challenge for FPGA design compared with GPU and CPU [108]. Most existing works propose hardware designs for one particular pricing model only: either Black-Scholes [166–168] or Heston [33, 34, 48, 49]. Because the computational characteristics of pricing models differ in nature, when users need to switch between models, the designers have to restart the design process from the beginning, which consumes a lot of time and effort. To address this problem, we introduce a generic framework that helps designers to generate efficient, high-quality hardware accelerators that support different pricing models as well as various types of options. The main contributions of our framework are:
• A generic design flow that can automatically optimize and generate hardware accelerators from a high-level description of an option pricing application;
• A template of a modular and parameterizable hardware architecture that covers the different computational features of the various pricing models;
• A library of hardware implementations of the most popular pricing models: Black-Scholes, Merton, Heston and Bates;
• A heuristic to find the optimization parameters for the above-mentioned hardware designs.
7.1 Option Pricing Basics
7.1.1 Option Overview
An option is a contract between two parties that gives the buyer the right to execute a transaction on one or several underlying assets (stock, currency, index or debt) at a strike price K at a future moment T (maturity or expiry) under specific conditions. An option is a Call Option if the buyer has the right to buy in the future, whereas an option with the right to sell is called a Put Option. The profit from an option at maturity is defined by the payoff, which depends on the exercise condition, the strike price K and the underlying asset price at maturity T. Our framework can be applied to a wide range of European options, which are exercised only at expiry. Based on the definitions in [74], the following option types are implemented in our framework:
• Vanilla Option: the most popular and traditional type of option; the payoff condition depends on the strike price K and the stock price at expiry;
• Asian Option: the strike price is defined as the arithmetic average of the underlying asset price during the contract period;
• Barrier Option: the payoff condition is defined by whether the underlying asset price reaches specific values;
• Binary Option: an option with a discontinuous payoff. A simple example is a cash-or-nothing call, which pays nothing if the asset price ends up below the strike price at time T and pays a fixed amount Q if it ends up above the strike price;
• Lookback Option: the payoff depends on the maximum or minimum asset price reached during the life of the option.
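The option types listed above differ only in how the payoff is computed from a simulated price path. The following Python sketch shows one call-style payoff per type. The concrete variants are assumptions chosen for illustration (an up-and-out barrier, a floating-strike lookback, an average-strike Asian call), and the function names are hypothetical; the framework may implement other variants.

```python
def vanilla_call(path, strike):
    """Payoff depends only on the price at expiry and the strike K."""
    return max(path[-1] - strike, 0.0)

def asian_call(path):
    """Strike is the arithmetic average of the price over the contract period."""
    return max(path[-1] - sum(path) / len(path), 0.0)

def barrier_up_and_out_call(path, strike, barrier):
    """Assumed variant: the option is worthless once the price reaches the barrier."""
    return 0.0 if max(path) >= barrier else max(path[-1] - strike, 0.0)

def binary_call(path, strike, amount):
    """Cash-or-nothing call: pays the fixed amount Q if the price ends above K."""
    return amount if path[-1] > strike else 0.0

def lookback_call(path):
    """Assumed floating-strike variant: buy at the minimum price seen on the path."""
    return path[-1] - min(path)

path = [100, 110, 95, 120]   # illustrative price path, last value is the price at expiry
payoffs = {
    "vanilla": vanilla_call(path, 100),
    "asian": asian_call(path),
    "barrier": barrier_up_and_out_call(path, 100, 115),
    "binary": binary_call(path, 100, 5.0),
    "lookback": lookback_call(path),
}
```

Only the vanilla and binary payoffs depend on the final price alone; the other three need the whole path, which is one reason why path-based simulation suits these option types.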
7.1.2 Option Pricing Problem
Options are among the most widely traded products in the market [74]. Therefore, financial institutions need to determine the fair option price to avoid giving arbitrage opportunities to their competitors. The typical option pricing procedure is illustrated in Figure 7.1.
First, the evolution of the underlying asset price needs to be described by a model. Based on the model and the current asset price S0, the price at maturity ST can be calculated. After that, the option price at maturity CT is computed using the payoff function. Finally, the future option price CT is discounted to obtain the current option price C0. This pricing process shows the central importance of the underlying asset price model. The most popular and widely used models are the Black-Scholes model, the Merton model, the Heston model and the Bates model [74].
Figure 7.1: Option pricing procedure
Black-Scholes Model
The Black-Scholes (BS) model was introduced in 1973 and has been widely adopted by financial institutions for the option pricing problem [74]. In this model, Black and Scholes use the perfect market hypothesis to assume that all known market information is already included in the prices of traded assets. Therefore, the asset price follows a geometric Brownian motion with constant drift µ and volatility σ:
$$\frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t \tag{7.1}$$
where $W_t$ is the Wiener process, whose increment follows N(0, T), a Gaussian distribution with zero mean and variance T:
$$dW_t = W_t - W_0 \sim N(0, T) \tag{7.2}$$
Heston Model
One disadvantage of the BS model is the assumption that the volatility is always constant. This assumption does not reflect the properties of the real market and may introduce a discrepancy between the computed option price and the real price. Therefore, Steven L. Heston introduced stochastic volatility to improve the Black-Scholes model. This improvement describes the stock price behavior much more accurately, and the model is widely accepted in the financial community [74]. The Heston model has two stochastic differential equations:
$$\frac{dS_t}{S_t} = \mu\,dt + \sqrt{v_t}\,dW_t^1 \tag{7.3}$$
$$dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dW_t^2 \tag{7.4}$$
Here, $W^1$ and $W^2$ are two Brownian motions with correlation ρ that model the randomness of the market, t is the time, and the remaining parameters further specify the behavior of the financial market.
Merton Model
Another way to bring the BS model closer to the real market is to add a jump component to the diffusion process of the stock price. This was proposed by Merton [101], and the model is described by Equation 7.5:
$$\frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t + dZ_t \tag{7.5}$$
where $dZ_t$ describes the jump component, which usually follows a Poisson process.
Bates Model
Bates introduced a model that contains both stochastic volatility and a jump component:
$$\frac{dS_t}{S_t} = \mu\,dt + \sqrt{v_t}\,dW_t^1 + dZ_t \tag{7.6}$$
$$dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dW_t^2 \tag{7.7}$$
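One time step of the Bates dynamics can be sketched with a simple Euler discretization. The Python sketch below is an illustration under stated assumptions, not the hardware data path: jump sizes are taken as log-normal with mean α and deviation β and their count per step as Poisson with rate λ, the variance is truncated at zero before taking the square root, no drift compensation for the jumps is applied, and the volatility-of-variance coefficient (σ in the variance equation) is named `xi`. Setting `lam` to 0 reduces the step to the Heston model.

```python
import math
import random

def poisson(rng, mean):
    """Sample a Poisson count with Knuth's method (adequate for small means)."""
    limit, count, product = math.exp(-mean), 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1

def bates_step(s, v, dt, p, rng):
    """Advance price s and variance v by one Euler step of the Bates model."""
    z1 = rng.gauss(0.0, 1.0)
    # Second Brownian increment, correlated with the first through rho.
    z2 = p["rho"] * z1 + math.sqrt(1.0 - p["rho"] ** 2) * rng.gauss(0.0, 1.0)
    dw1, dw2 = z1 * math.sqrt(dt), z2 * math.sqrt(dt)
    vol = math.sqrt(max(v, 0.0))            # truncation keeps the square root real
    # Jump term dZ: sum of relative jumps (assumed log-normal jump sizes).
    dz = sum(math.exp(rng.gauss(p["alpha"], p["beta"])) - 1.0
             for _ in range(poisson(rng, p["lam"] * dt)))
    s_next = s * (1.0 + p["mu"] * dt + vol * dw1 + dz)
    v_next = v + p["kappa"] * (p["theta"] - max(v, 0.0)) * dt + p["xi"] * vol * dw2
    return s_next, v_next

params = {"mu": 0.05, "kappa": 2.0, "theta": 0.04, "xi": 0.3, "rho": -0.7,
          "lam": 0.5, "alpha": -0.1, "beta": 0.15}   # illustrative values only
rng = random.Random(42)
s, v = 100.0, 0.04
for _ in range(252):                                  # one year of daily steps
    s, v = bates_step(s, v, 1.0 / 252, params, rng)
```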
7.1.3 Pricing Methods
These models offer several ways to define the dependency of the option price on the stock price S, the volatility V and the time T. The way this relationship is described also determines the numerical method used to price the option. Stochastic differential equations (SDEs) are the most straightforward way to describe the above-mentioned dependencies. For numerical computation, there are several methods for deriving the option price, namely Monte-Carlo (MC) simulation, Finite Difference methods, the Binomial tree and the Quadrature method. Among them, MC is the most widely used method because of its simplicity and generality: it can be applied to almost all option types as well as underlying stock models [64]. Therefore, our framework focuses on hardware accelerators for the Monte-Carlo method in option pricing applications.
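The Monte-Carlo method can be sketched in software for the simplest case, a vanilla call under the Black-Scholes model. This Python sketch is a reference illustration rather than the accelerator design: it steps the logarithm of the price, which is exact for geometric Brownian motion, and it assumes that the drift µ also serves as the discount rate (risk-neutral pricing). The function name and the parameter values in the usage note are hypothetical.

```python
import math
import random

def mc_vanilla_call(s0, strike, mu, sigma, maturity, n_steps, n_paths, seed=0):
    """Monte-Carlo price of a European vanilla call under the Black-Scholes model."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    drift = (mu - 0.5 * sigma ** 2) * dt      # drift of log(S) per time step
    vol = sigma * math.sqrt(dt)               # standard deviation of log(S) per step
    payoff_sum = 0.0
    for _ in range(n_paths):
        log_s = math.log(s0)
        for _ in range(n_steps):              # simulate one price path
            log_s += drift + vol * rng.gauss(0.0, 1.0)
        payoff_sum += max(math.exp(log_s) - strike, 0.0)   # payoff at maturity
    # Average the payoffs and discount back to today (mu assumed as discount rate).
    return math.exp(-mu * maturity) * payoff_sum / n_paths
```

For example, `mc_vanilla_call(100.0, 100.0, 0.05, 0.2, 1.0, 8, 20000)` should land close to the closed-form Black-Scholes value of about 10.45, with a statistical error that shrinks with the square root of the number of paths. Each path is independent of the others, which is what makes the method attractive for parallel hardware instances.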
7.2 Design Flow and Optimization Framework
In this section, we propose a generic design flow that can generate efficient accelerators for different types of options and all the models mentioned above. In addition, a framework that optimizes the engine parameters according to different design constraints is developed to shorten the FPGA development process. Our proposed design flow is illustrated in Figure 7.2. The input of the design flow is a specification of an option pricing query, which includes all the information needed for generating the HW accelerator and obtaining the option price. The specification is written as an XML file and contains the following details about the pricing request:
• Option Type Parameters: Vanilla, Asian, Barrier, Binary or Lookback;
• Model Parameters: Black-Scholes, Merton, Heston or Bates;
• Option Parameters (general variables for all types of option): strike price K, current stock value S0, volatility σ, mean of expected return µ, number of time steps, number of paths, maturity T;
• Jump Parameters: jump rate λ, jump mean α, jump deviation β;
• Stochastic Volatility (SV) Parameters: reversion rate κ, variance of volatility ξ, long-term variance θ, correlation coefficient ρ;
• User constraints: timing constraint, precision constraint or HW resource constraint.
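To make the structure of such a specification concrete, the following Python sketch parses a small example with the standard `xml.etree.ElementTree` module. The XML schema shown here is entirely hypothetical: the element and attribute names (`pricing_request`, `option_params`, `sv_params` and so on) are invented for illustration, since the actual file format of the framework is not given in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification; tag and attribute names are illustrative assumptions.
SPEC = """
<pricing_request>
  <option type="Asian"/>
  <model name="Heston"/>
  <option_params K="100" S0="100" sigma="0.2" mu="0.05" steps="256" paths="100000" T="1.0"/>
  <sv_params kappa="2.0" xi="0.3" theta="0.04" rho="-0.7"/>
  <constraints error="0.05"/>
</pricing_request>
"""

def numeric_attrs(root, tag):
    """Attributes of the first <tag> child as floats; empty dict if the tag is absent."""
    node = root.find(tag)
    return {} if node is None else {key: float(value) for key, value in node.attrib.items()}

def load_spec(text):
    """Turn the XML specification into a plain dictionary of pricing parameters."""
    root = ET.fromstring(text.strip())
    return {
        "option_type": root.find("option").get("type"),
        "model": root.find("model").get("name"),
        "option_params": numeric_attrs(root, "option_params"),
        "jump_params": numeric_attrs(root, "jump_params"),   # absent for Heston
        "sv_params": numeric_attrs(root, "sv_params"),
        "constraints": numeric_attrs(root, "constraints"),
    }

spec = load_spec(SPEC)
```

Parameter groups that do not apply to the selected model, such as the jump parameters for a Heston request, are simply left empty in this sketch.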
The first main block of the framework is the Analysis and Optimization Module
(AOM). Basically, this module considers the Option Type and Option Model, analyzes
the Design Constraints and generates the optimal Engine Parameters for the next block.
The essence of AOM is an optimization algorithm which is a heuristic search algorithm
that intelligently traverses the design space to find the most efficient design point. The
design space is given by all the available set of Engine Parameters such as: number of
bits for exponential (e) and mantissa (m) of floating point representation; number of
hardware instances pipes and the frequency of the Hardware F . The procedure is de-
scribed in Algorithm 11. The input of the Algorithm is the design constraints given
by the designer in the Specification of pricing request. The constraints might contain
requirement of the error , the available hardware resource HW , and the timing condi-
tion, which is translated to throughput requirement tp. Taking into account the input,
the heuristic produces the most efficient engine Eopt configuration, which is represented
as a tuple of number representation (m, e), number of HW instances pipes, and the op-
erating frequency F . Using profiling method, the Algorithm first defines the number
representation (m0 = 41, e0 = 7), which satisfied the error requirement  = 5% . Then,
it synthesizes the smallest configuration E0 = {(m0, e0), pipes = 1, F = 100} to ob-
tain the HW usage of 1 instance with standard frequency F = 100MHz. The flag of
successfully finding the optimal configuration is set to False in line 3. The maximal fre-
quency that makes the minimal configuration with 1 HW instance satisfy the throughput
148
7.2 Design Flow and Optimization Framework









































Figure 7.2: Generic framework
149
7.2 Design Flow and Optimization Framework
requirement is assigned in line 4. The upper bound on the number of HW instances
that can be implemented is set in line 5. The FOR loop then iterates through the
possible numbers of HW instances (line 6), computes the minimal operating
frequency (line 7), and checks whether the minimal configuration for that number
of HW instances is feasible to implement on the FPGA (line 8). If the minimal
configuration is not feasible, the loop moves to the next number of HW instances
(line 21). Otherwise, it records that a valid configuration has been found
(line 11) and continues searching for the maximal frequency using binary search
(the WHILE loop).
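The profiling step that fixes (m0, e0) is not detailed in this section. Purely as
an illustration, the following Python sketch shows one way such a step could look
for the mantissa width: sample values are rounded to m mantissa bits, and the
smallest m whose worst-case relative error stays within ε is selected. The
function names, the rounding scheme and the use of sample values instead of full
pricing runs are assumptions made for this sketch, not the procedure used by the
framework.

```python
import math

def quantize_mantissa(x: float, m_bits: int) -> float:
    """Round x to m_bits of mantissa (illustrative model of a custom float)."""
    if x == 0.0:
        return 0.0
    mant, exp = math.frexp(x)              # x = mant * 2**exp with 0.5 <= |mant| < 1
    scale = 1 << m_bits
    return math.ldexp(round(mant * scale) / scale, exp)

def smallest_mantissa(samples, eps, max_bits=52):
    """Return the smallest mantissa width whose worst relative error is <= eps."""
    for m_bits in range(1, max_bits + 1):
        worst = max(
            (abs(quantize_mantissa(x, m_bits) - x) / abs(x) for x in samples if x != 0.0),
            default=0.0,
        )
        if worst <= eps:
            return m_bits
    return None                            # no width up to max_bits meets eps
```

In a real flow the error would presumably be measured on the final option price
rather than on individual values, so this sketch only conveys the shape of the
search.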
The output of the AOM is the most efficient set of Engine Parameters, which is
passed to the Automatic HW Generation (AHG) block. This block uses the Engine
Parameters to configure the available HW modules in the HW library, combines
these modules into a complete engine and generates the VHDL file for the engine.
To better explain the mechanism of the framework and the functionality of each
module, we provide an overview of the general architecture of the pricing engines.
As presented in Figure 7.2, the HW library contains the architecture of 4 pricing
engines associated with the 4 pricing models mentioned above. Each pricing engine
has two main parts: the control path and the data path. The data paths mainly
contain arithmetic operations and dominate the HW consumption as well as the
latency of each engine. The control paths manage the data transfer and
synchronization between data path modules. Both the control paths and the data
paths are designed with a focus on flexibility and modularity, meaning that these
modules contain features that can be parameterized with the Engine Parameters. In
the data paths, these features are the constants and the user-defined types, while
in the control paths the I/O enable signals and the routing signals between blocks
are customizable.
Algorithm 11 Finding the most efficient Engine Parameters
Input: Design constraints: HW, tp, ε
Output: Most efficient Engine Parameters: E = {(m, e), pipes, F}
1: Profile to get the number representation (m0, e0) that satisfies the accuracy ε
2: Synthesize the minimal HW configuration with: E0 = {(m0, e0), pipes = 1, F =
100} to get the results HW0
3: available = FALSE
4: Fmax = tp
5: max pipes = HW/HW0
6: for i = max pipes to 1 do
7: Fmin = tp/i
8: if synthesize(Ei = {(m0, e0), pipes = i, F = Fmin}) = feasible then
9: pipesopt = i
10: Fopt = Fmin
11: available = TRUE
12: while (Fmax > Fmin + 1) do
13: if synthesize(Ei = {(m0, e0), pipes = i, F = Fmax}) = feasible then
14: Fopt = Fmax
15: Break
16: else







24: if available=TRUE then
25: Return E = {(m, e), pipesopt, Fopt}
26: else
27: Print("no configuration satisfies the constraints")
28: end if
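The search in Algorithm 11 can be sketched in Python as follows. This is a
minimal sketch under stated assumptions, not the thesis implementation: the
callback is_feasible(pipes, freq) is a hypothetical stand-in for a synthesis
run, feasibility is assumed to be monotone in frequency (if a frequency is
feasible, every lower one is too), and the frequency update inside the WHILE
loop is written as a plain bisection because lines 17 to 23 of the listing are
not reproduced above. Returning at the first feasible number of pipes is
likewise an assumption.

```python
from typing import Callable, Optional, Tuple

def find_engine_parameters(
    hw_budget: float,
    hw_single: float,
    tp: float,
    is_feasible: Callable[[int, float], bool],
) -> Optional[Tuple[int, float]]:
    """Return (pipes, frequency) or None if no configuration is feasible."""
    max_pipes = int(hw_budget // hw_single)        # line 5: upper bound on instances
    for pipes in range(max_pipes, 0, -1):          # line 6
        f_min = tp / pipes                         # line 7: slowest clock meeting tp
        if not is_feasible(pipes, f_min):          # line 8
            continue                               # try fewer instances
        f_max = tp                                 # line 4
        if is_feasible(pipes, f_max):
            return pipes, f_max
        lo, hi = f_min, f_max                      # invariant: lo feasible, hi infeasible
        while hi - lo > 1:                         # line 12: stop within one frequency unit
            mid = (lo + hi) / 2
            if is_feasible(pipes, mid):
                lo = mid
            else:
                hi = mid
        return pipes, lo
    return None
```

With a toy oracle such as `lambda pipes, f: pipes * 30 <= 100 and f <= 150`, a
budget of 100 resource units and tp = 300, the sketch selects 3 pipes and
converges to a frequency of 150.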
7.3 Pricing Engine Architecture
This section provides the implementation details and architecture of the generic
pricing engine mentioned above. First, we review the procedure of MC simulation
applied to the pricing problem with different models; then, the architecture of
the HW accelerator for the MC pricing method is developed with parameterizable
functionality in mind.
7.3.1 MC Method Overview
The MC method is a numerical method that is widely used to simulate stochastic
processes. It works by sampling the underlying random variables, computing the
outcome of the process for each sample, and taking the average of all simulated
outcomes as the required value. MC simulation is used intensively for option
pricing problems because it is robust and stable; it can price options that have
no closed-form formula, and its complexity does not increase exponentially with
the number of underlying assets. Therefore, this method is a promising candidate
for high performance computing accelerators. Moreover, the independence between
simulated paths makes it attractive for parallel computing systems such as FPGAs
and GPUs.
The procedure of the MC method can be described as follows: firstly, the pricing
period is discretized into small time steps δt; then, the continuous stochastic
differential equations (SDEs) of the pricing model are translated into a discrete
version that describes the change of the asset price and volatility in one time
step. After that, all the random movements of price and volatility within the
period [0, T] are accumulated to get the asset price for one simulation path. The
option price for each path at time T is computed using the payoff function. Then,
the expected option price at time T is defined as the average of all simulated
option prices. Finally, the current price of the option is obtained by
discounting its price at time T.
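The steps above can be sketched in plain Python for the simplest case, a
European vanilla call under a constant-volatility model. The sketch uses the
log-Euler step with constant σ (the form of Equation 7.13 without the jump term)
and the vanilla payoff max(ST − K; 0); the function name, parameter names and
seed handling are illustrative assumptions.

```python
import math
import random

def mc_european_call(s0, strike, r, sigma, maturity, n_paths, n_steps, seed=0):
    """Monte Carlo price of a European call with constant volatility."""
    rng = random.Random(seed)
    dt = maturity / n_steps                       # discretize [0, T] into time steps
    drift = (r - 0.5 * sigma * sigma) * dt        # constant per-step drift
    vol = sigma * math.sqrt(dt)                   # constant per-step diffusion scale
    payoff_sum = 0.0
    for _ in range(n_paths):                      # outer loop: simulated paths
        log_s = math.log(s0)
        for _ in range(n_steps):                  # inner loop: time steps
            log_s += drift + vol * rng.gauss(0.0, 1.0)
        payoff_sum += max(math.exp(log_s) - strike, 0.0)   # payoff at maturity
    expected_payoff = payoff_sum / n_paths        # average over all paths
    return math.exp(-r * maturity) * expected_payoff       # discount to today
```

For s0 = strike = 100, r = 0.05, sigma = 0.2 and maturity = 1, the estimate
should approach the closed-form Black Scholes value of about 10.45 as the number
of paths grows.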
7.3.2 HW Design of MC Engine
The HW architecture of the MC engine is presented in Figure 7.3 and closely
follows the procedure described in the previous section. The blocks inside the
dotted rectangle Point are responsible for computing the movement of the stock
price for each time step, while the Payoff Core is only executed at the last time
step to compute the final price of a Path (outer dotted rectangle). In other
words, the Point rectangle is the innermost loop iterating through all the time
steps, while the Path rectangle is the outer loop iterating through all the
simulated paths. The Coeff. Precomputation block is implemented at the highest
hierarchical level and executed only once at the beginning of the pricing
procedure. As suggested by its name, this block precomputes all the parameters
that stay constant during the pricing process, so that they are not redundantly
recomputed in the inner loops. As can be seen in Figure 7.3, there are two types
of input parameters. The Model Parameters and Option Type Parameters (red color)
are configuration inputs, which are used by the HW Generator Block to define the
appropriate architecture that needs to be loaded. The Model Parameters determine
which HW modules from the library need to be configured in the Coeff.
Precomputation block, Variance Core and Price Core. Based on the type of
underlying model, they also decide whether or not to include the Poisson
Generator and Jump Generator in the design, and how many Gaussian Number
Generators (GNG) need to be implemented. Having a smaller impact on the design,
the Option Type Parameters decide the configuration that needs to be loaded in
the Existence Core and Payoff Core. In contrast to these configuration inputs,
the rest of the inputs (Jump Paras, SV Paras, Option Paras) are presented in
black color and affect only the execution phase, when a particular pricing engine
is already loaded into the FPGA. These parameters are fed as input data to the
relevant modules of the configured pricing engine to produce the output results
during execution.
As can be observed from Figure 7.3, the overall architecture of the pricing engine
is developed in a highly modular fashion so that it can accommodate various types of



























Figure 7.3: Generic architecture of pricing engines
























Figure 7.4: Poisson Generator block
option and underlying models. The functionality of each block is described in detail as
follows:
The Poisson Generator block generates a series of numbers following a Poisson
distribution for the next block; the mean θ of the Poisson distribution is set
from the Jump Paras input. The HW implementation of the Poisson Generator is
presented in Figure 7.4, following the algorithm in Figure 3.9 of [64].
Taking into account the Jump Paras and the Poisson number Nt from the previous
block, the Jump Generator computes the sudden change in stock price and passes
the result to the Price Core module. These 2 blocks are available in the pricing
engines of the Merton and Bates models only.
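As a software illustration of these two blocks, the sketch below draws a Poisson
number with the standard inverse-transform method and then sums that many
Gaussian jump sizes. Whether this matches the algorithm in Figure 3.9 of [64] is
not verified here, and the assumption that individual jump sizes are normal with
mean a and standard deviation b is only suggested by the form of Equation 7.12.
All function names are hypothetical.

```python
import math
import random

def poisson_inverse_transform(mean: float, u: float) -> int:
    """Map a uniform sample u in [0, 1) to a Poisson(mean) sample."""
    p = math.exp(-mean)          # P(N = 0)
    cdf = p
    n = 0
    while u > cdf and p > 0.0:   # walk the CDF until it covers u
        n += 1
        p *= mean / n            # P(N = n) from P(N = n - 1)
        cdf += p
    return n

def jump_term(n_jumps: int, a: float, b: float, rng: random.Random) -> float:
    """Sum of n_jumps Gaussian jump sizes (assumed N(a, b^2)); 0 if no jump."""
    return sum(rng.gauss(a, b) for _ in range(n_jumps))

def sample_jump(mean: float, a: float, b: float, rng: random.Random) -> float:
    """Jump contribution J for one time step."""
    n_t = poisson_inverse_transform(mean, rng.random())
    return jump_term(n_t, a, b, rng)
```

The value returned by sample_jump would play the role of J in the exponent of
the price update for jump models.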









Figure 7.5: GNG block for SV models
The GNGs block is used to generate a series of numbers following a Gaussian
distribution. Firstly, uniform random numbers are generated using the Min-Twister
method (block UNG). Then they are converted to Gaussian random numbers using the
Box-Muller method. This block is designed with the algorithms given in [99] and
[190]. The configuration factor of this block, which is controlled by the Model
Paras, is the number of GNG instances. For models with constant volatility (Black
Scholes and Merton), only 1 instance is implemented, while SV models (Heston,
Bates) require two instances per time step. Moreover, the Gaussian numbers
generated for SV models are correlated with each other with correlation
coefficient ρ. The functional scheme of the GNGs block for SV models is presented
in Figure 7.5.
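A minimal software sketch of this block is given below. The Box-Muller transform
is standard; the way the two Gaussian numbers are correlated (mixing two
independent samples with weights ρ and √(1 − ρ²)) is a common construction and
is assumed here, since the scheme of Figure 7.5 is not reproduced in the text.
Python's built-in generator stands in for the UNG block.

```python
import math
import random

def box_muller(u1: float, u2: float):
    """Convert two uniforms (u1 in (0, 1], u2 in [0, 1)) into two independent N(0, 1)."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def correlated_pair(rho: float, rng: random.Random):
    """Return (z_v, z_s): standard normals with correlation rho (assumed mixing scheme)."""
    # 1 - random() lies in (0, 1], which avoids log(0) in box_muller.
    z1, z2 = box_muller(1.0 - rng.random(), rng.random())
    z_v = z1
    z_s = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
    return z_v, z_s
```

For constant-volatility models only one of the two Box-Muller outputs would be
needed per time step.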
The Variance Core is included in the pricing engine by the Model Paras when
working with Stochastic Volatility models. Its function is to compute the
volatility of the next time step. The discretization formula of this block is
given in Equation 7.8 [133] and the optimized version with precomputed parameters
is given in Equation 7.9.
Table 7.1: Implementation for different Option Types
Option Type   Carry Value              Carry Core              Payoff Core
Vanilla       NA                       NA                      C = max(ST − K; 0)
Barrier       Existence                E = E & (St < H)        C = (E) ? max(ST − K; 0) : 0
Binary        NA                       NA                      C = (ST > K) ? Q : 0
Lookback      Min of previous prices   Smin = min(Smin, St)    C = max(Smin − K; 0)





\[ V(t+dt) = V(t) + k_{pd} - V^{*}(t)\,k_{d} + s_{d}\sqrt{V^{*}(t)}\,Z_v \tag{7.9} \]
where $k_{pd} = \kappa\,\theta\,dt$, $k_{d} = \kappa\,dt$ and $s_{d} = \sigma\sqrt{dt}$ do not change over
iterations and are precomputed in the Coeff. Precomputation block.
The Price Core computes the movement of the price for the next time step and has
two different implementation versions for SV models (Heston, Bates) and non-SV
models (BS and Merton). Moreover, the discretization formula of this block also
depends on whether the model has a Jump component. The discretization formulas of
the Black Scholes and Heston models are given in Equations 7.10 and 7.11 [133].
When the Jump component is considered, the coefficient r is adjusted as in
Equation 7.12 [31] and the discretization schemes for the Merton and Bates models
are given in Equations 7.13 and 7.14. The decision on configuring the appropriate
version of this block again depends on the Model Paras input.





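\[ S(t+dt) = S(t)\exp\left((r - 0.5\,V(t))\,dt + \sqrt{dt\,V(t)}\,Z_s\right) \tag{7.11} \]
\[ r_{adj} = r - \lambda\left(\exp(a + 0.5\,b^{2}) - 1\right) \tag{7.12} \]
\[ S(t+dt) = S(t)\exp\left((r_{adj} - 0.5\,\sigma^{2})\,dt + \sigma\sqrt{dt}\,Z_s + J\right) \tag{7.13} \]
\[ S(t+dt) = S(t)\exp\left((r_{adj} - 0.5\,V(t))\,dt + \sqrt{dt\,V(t)}\,Z_s + J\right) \tag{7.14} \]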
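To make the role of the precomputed coefficients concrete, the following Python
sketch performs one time step of the stochastic-volatility model without jumps,
combining the variance update of Equation 7.9 with the price update of Equation
7.11. It is a sketch under stated assumptions: V*(t) is read as max(V(t), 0),
which the text does not define, and the same truncated value is used under the
square root of the price update to keep it real-valued. Function and variable
names are illustrative.

```python
import math

def precompute_coeffs(kappa: float, theta: float, sigma: float, dt: float):
    """Constants computed once per pricing run: (kpd, kd, sd)."""
    return kappa * theta * dt, kappa * dt, sigma * math.sqrt(dt)

def sv_step(s, v, r, dt, coeffs, z_v, z_s):
    """Advance price s and variance v by one time step dt."""
    kpd, kd, sd = coeffs
    v_pos = max(v, 0.0)                                   # assumption: V*(t) = max(V(t), 0)
    v_next = v + kpd - v_pos * kd + sd * math.sqrt(v_pos) * z_v          # Eq. 7.9
    s_next = s * math.exp((r - 0.5 * v_pos) * dt
                          + math.sqrt(dt * v_pos) * z_s)                 # Eq. 7.11
    return s_next, v_next
```

Adding the jump term J to the exponent and replacing r with the adjusted rate of
Equation 7.12 would give the corresponding step of Equation 7.14.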
The Carry Core computes the additional value that needs to be carried along the
pricing path to define special execution conditions. The Carried Values and their
implementations for specific types of options such as Asian Option, Barrier
Option and Lookback Option are presented in Table 7.1.
The Payoff Core is computed only once per simulated path at maturity T and its
configuration depends on the value of the Option Type Parameters. The
implementations for different option types are given in Table 7.1. For the sake
of brevity, only the formulas for Call options are presented. Finally, the prices
of all the simulated paths are accumulated to form the final accumulated price.
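The interplay of the Carry Core and Payoff Core can be illustrated with a small
Python function that follows the rows of Table 7.1 for a single simulated path.
The initial carry values (E starting as true, Smin starting at the first price)
and all names are assumptions made for this sketch; only the option types listed
in the table are covered.

```python
def path_payoff(option_type, prices, strike, barrier=None, payout=None):
    """Call payoff of one simulated path, using the carry values of Table 7.1."""
    exists = True              # carry value E for barrier options (assumed initial value)
    s_min = prices[0]          # carry value Smin for lookback options (assumed initial value)
    for s_t in prices:         # carry update, once per time step
        if option_type == "barrier":
            exists = exists and (s_t < barrier)
        elif option_type == "lookback":
            s_min = min(s_min, s_t)
    s_final = prices[-1]       # payoff, once per path at maturity
    if option_type == "vanilla":
        return max(s_final - strike, 0.0)
    if option_type == "barrier":
        return max(s_final - strike, 0.0) if exists else 0.0
    if option_type == "binary":
        return payout if s_final > strike else 0.0
    if option_type == "lookback":
        return max(s_min - strike, 0.0)
    raise ValueError(f"unsupported option type: {option_type}")
```

For example, with prices [100, 110, 95, 105] and strike 90, the vanilla payoff is
15, while a barrier at 108 knocks the option out because the path reaches 110.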
As can be seen from the implementation of the Variance Core, Price Core and
Existence Core modules, the results of these blocks depend on the values from
previous time steps, which are stored and transferred to them by the Loop back
Block.
7.4 Experimental Results
A series of experiments are conducted to evaluate the efficiency and performance of
the hardware accelerators generated from our framework. In our implementation, the
option pricing request is described in XML format, the proposed design flow and opti-





































Figure 7.6: Comparison with other SW implementations
The generic Pricing Engine in Section 7.3 is developed in the Maxj data flow
language, the Java-based High Level Synthesis language developed by Maxeler
[118]. All 4 pricing engines are developed with C-slow optimization techniques
[178]. The experimental results are obtained by implementing and running the
engines on a Maxeler Workstation model MAX3424A [118], which features a Xilinx
Virtex-6 SX475T FPGA device and an Intel Core i7 870 2.93 GHz CPU with 16GB RAM.
7.4.1 Comparison with SW Implementations
In the first experiment, we compare the throughput of our FPGA accelerators with
a software implementation for CPU. The competitor in this experiment is the
online option pricing service Premia provided by INRIA (the French national
institute for research in computer science and control) [14]. For all the models,
we choose to price a European Vanilla Option with 1 million paths and 100 time
steps. The throughput of both implementations and the speedup of our pricing
engines over the CPU implementation are reported in Figure 7.6. As can be
observed from the figure, all the pricing engines from our framework achieve two
orders of magnitude higher throughput than the SW implementations. The speedup is
more significant for complex models since each simulated
path of these models requires much more computation effort and execution time in
the SW implementation. On the other hand, for highly pipelined hardware
accelerators with one output per clock cycle, the complex data path of these
models does not significantly affect the throughput of the engine.
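The reason can be seen from a simple idealized model, sketched below in Python:
if every pipeline delivers one time step per clock cycle, the throughput depends
only on the number of pipelines and the clock frequency, not on the depth of the
data path. The model ignores pipeline fill latency and host communication, and
the function name and parameters are assumptions made for illustration.

```python
def ideal_pricing_time(paths: int, steps: int, pipes: int, freq_hz: float) -> float:
    """Seconds needed to process paths * steps time steps on a fully pipelined engine."""
    total_steps = paths * steps
    throughput = pipes * freq_hz      # time steps per second, one per cycle per pipeline
    return total_steps / throughput
```

Under these assumptions, 1 million paths with 100 time steps on a single
pipeline clocked at 100 MHz take 1 second, regardless of which pricing model the
pipeline implements.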
7.4.2 Comparison with other HW Accelerators
To further examine the performance of our HW accelerators, we compare the
throughput, hardware usage and energy of our pricing engines with other available
engines in the literature. Since no single work covers all the pricing models as
ours does, we compare our engines with different competitors: for the Black
Scholes engine, the most recent work is reported in [170], while the most
efficient manual design is proposed in [172]; for the Heston pricing engine, [48]
uses the same Monte Carlo method; for the Bates model, the only available work is
from Maxeler [13]. Table 7.2 summarizes the comparison results. Since the results
of the above-mentioned works were reported with different pricing set-ups (option
type, number of simulated paths, number of time steps), we use throughput (number
of computed time steps per second) as the performance metric to put all the
results in the same perspective. As can be seen from the table, our engine for
the Black Scholes model has around 2.25 times better performance than the design
in [170] and achieves about 45% of the performance of the highly customized
design in [172]. For the Heston model, the accelerators from our framework have a
clear advantage over previous works. Part of this improvement comes from the
device technology, but the main explanation is our highly pipelined architecture.
For the Bates model, we use the same technology as the Maxeler implementation but
achieve around 5% improvement in throughput by using a more efficient
discretization scheme and simpler methods of generating volatility.
Although the main advantage of our proposed framework is the productivity gain
and the reduction in development time, it is hard to quantify and compare with
previous work in
Table 7.2: Comparison with other HW implementations
                                     Throughput
Work    FPGA       Frequency (MHz)   BS      Merton   Heston   Bates
[172]   Virtex 5   200               3200    NA       NA       NA
[170]   Virtex 5   80                640     NA       NA       NA
[48]    Virtex 5   100               NA      NA       142.7    NA
[33]    Zynq       100               NA      NA       459      NA
[13]    Virtex 6   175               NA      NA       NA       700
Ours    Virtex 6   115-140           1680    920      1080     738
this aspect. From the authors' experience, by following the framework and modular
architecture, a designer with little knowledge of option pricing applications can
develop a new engine for new models or new option types in less than a week.
7.5 Summary
In this chapter, a framework for generating option pricing hardware accelerators
has been proposed. The framework is combined with a highly modular architecture
design that covers four popular pricing models and numerous types of options.
Moreover, a heuristic for finding a locally optimal parameter set for the pricing
engines is developed to further improve the performance of the generated
accelerators. As a result, the engines developed with our framework achieve a
speedup of 2 orders of magnitude compared to SW implementations. Compared with
existing hardware designs for the same models, our framework produces
accelerators that outperform most manually designed engines.
Chapter 8
Conclusions and Future Directions
8.1 Conclusions
Reconfigurable MPS provides huge potential and functionality for digital system
designers by combining the power of both hardware and software components.
However, the complexity of the systems also increases dramatically and imposes
new challenges on the development process of Reconfigurable MPS. This thesis aims
to address those challenges by developing a number of automation tools and
techniques that make the design flow of Reconfigurable MPS more effective and
efficient. Chapter 1 provides a detailed classification of the different types of
Reconfigurable MPS and then highlights the main challenges in the development
process of Reconfigurable MPS. Chapter 2 presents a uniform design flow and a
detailed description of each of its components: Application Analysis Level,
Macro/System Level Synthesis and Exploration, and Micro/Device Level Synthesis
and Exploration. After laying the foundation and background in the first two
chapters, the main contributions are presented in the remaining chapters of the
thesis.
In Chapter 3, the first contributions on System Level Synthesis have been presented.
A hybrid mapping strategy has been proposed to address both throughput and energy
requirements of the application. A DSE technique has been developed to provide
all energy-throughput trade-off points for all possible heterogeneous resource
combinations at compile time. The strategy is further improved by a runtime
decision scheme that chooses the best trade-off points from the design-time
analysis and considers the system context on the fly to optimize the energy
consumption. As a result, our mapping approach provides better energy-throughput
trade-off points, covers all the resource combinations and reduces energy
consumption by up to 25% at design time and by an additional 17.8% at run time
when compared to state-of-the-art techniques [122].
Chapter 4 presents our contribution to the Scheduling problem in System Level
Synthesis. A multi-stage resource management approach has been developed to
address the leakage energy problem caused by the prefetching technique when
scheduling tasks on Reconfigurable devices. The proposed approach tries to place
the reconfiguration and execution parts of each task as close together as
possible while taking task dependencies, timing and architecture constraints into
account. A list-scheduling algorithm has been developed with a priority function
that is customized for leakage power reduction. Moreover, a cost function has
been derived for the placement stage to further reduce the leakage power. This
function gives designers the flexibility to manage the trade-off between
performance and leakage savings. Finally, a post-placement heuristic has been
proposed to improve the scheduling results (leakage savings) of the previous
stages. The experimental results demonstrate the advantages of the proposed
multi-stage scheduling technique. Specifically, different variants of the
proposed approach reduce leakage power by 40-65% when compared to a
performance-driven approach and by 15-43% when compared to state-of-the-art works
[123].
Chapter 5 wraps up our contributions on System Level Synthesis by describing a
DSE framework for list-based mapping and scheduling heuristics. A comprehensive
multistage framework has been developed that integrates GA and ML techniques to
optimize existing list-based schedulers: from generating data to building
predictive models
and predicting Pareto fronts for new TGs. In the first stage of the framework,
after generating the Pareto front with a Genetic Algorithm, a systematic
representation of the Pareto front curves is built with Spline regression models.
Thereafter, Linear Regression techniques are applied to model the dependency
between the Spline model of the Pareto front and the TG's features. During the
Prediction Stage, a Density-based Clustering Algorithm is used to generate
near-Pareto-optimal design points. As a result, our ML approach achieves a
speed-up of 2 orders of magnitude with only a 4% loss in quality when applied to
the scheduling heuristic. For the mapping problem, our framework is 25x faster
while sacrificing less than 5% of the quality of the Pareto front.
Chapter 6 presents our first contribution on Micro Level Synthesis: a DSE
framework for HLS design. The proposed DSE framework enables designers to exploit
the loop-array dependency to reduce the time consumed by the DSE process while
maintaining the quality of the Pareto front. To this end, the Loop-array
Dependency Graph, a systematic and formal method to represent the relationship
between loops and arrays, has been proposed. We also developed a tool to extract
the graph from C code. Moreover, a module called the Array Partition Factor
Computation Block is developed to generate the Pareto-optimal array partition
factors according to the related loop optimization techniques. Finally, a
multilevel DSE heuristic has been developed on top of the above modules to
efficiently exploit the loop-array dependency and significantly reduce the DSE
time. Consequently, our DSE approach achieves a 14 times speedup when compared
with an exhaustive approach while providing a Pareto front of nearly the same
quality. In comparison with existing works, our DSE excels in both quality of
results and execution time.
Continuing our contributions on Micro Level Synthesis, Chapter 7 presents a
hardware generator tool for option pricing applications. The tool helps hardware
designers become more productive in the development process by automatically
generating efficient and high quality hardware accelerators, which can support
different pricing
models as well as various types of options. First of all, a template of a modular
and parameterizable hardware architecture that covers the different computational
features of various pricing models has been implemented. Thereafter, a library of
hardware implementations of the most popular pricing models (Black Scholes,
Merton, Heston and Bates) is prebuilt. To further ease the job of the developer,
a heuristic to find optimization parameters for the above-mentioned hardware
designs is also developed. Combining all the above-mentioned components, a
generic design flow has been proposed to automatically optimize and generate
hardware accelerators from a high level description of the option pricing
application. Experiments demonstrate that the hardware engines generated by our
tool achieve a speed-up of 2 orders of magnitude when compared with SW
implementations and have superior results in terms of throughput and hardware
usage in comparison with existing HW implementations.
8.2 Future Directions
Although the contributions in this thesis have made the development process of Re-
configurable MPS more efficient and productive, there are a number of ways to further
improve these tools and techniques in both stages: System Level Synthesis and Micro
Level Synthesis.
8.2.1 System Level Synthesis
Increasing Heterogeneity of Platform
Nowadays, more and more different types of processing elements are added to
modern computing platforms to further improve the computing efficiency at the
circuit level. However, the increase in heterogeneity of processing elements also
significantly raises the complexity of the system since it adds a new dimension
to the mapping and scheduling problems. The design space expands exponentially
and gradually becomes too large to be explored with current heuristics. Further,
not every task can be executed on
every type of processing element, so a new set of constraints needs to be
formulated to reflect this limitation. Among the contributions on System Level
Synthesis, the Mapping approach in Chapter 3 already addresses heterogeneous
platforms, but the Scheduling techniques in Chapter 4 need modified cost
functions in the list-based heuristics to cover more types of processing elements
in the devices. Our ML and GA framework for multi-objective DSE is robust and is
expected to function well as the heterogeneity of the platform increases.
3D Architecture
Recently, the rapid advancement of 3D multiprocessor architectures has attracted
considerable research interest in System Level Synthesis and Exploration for 3D
systems. With the ability to pack more computing power on a chip, the 3D
architecture provides performance and energy advantages over the 2D architecture.
However, it raises new concerns about the thermal and reliability aspects of 3D
systems. In general, the approaches applied in the System Level contributions of
this thesis can be adapted to address the mapping and scheduling problems for 3D
systems, but adequate adjustments need to be derived for the new architectural
model and the underlying thermal characterization of 3D systems.
Reliability and Fault-tolerance Metrics
With the advance of multiprocessor architectures into deep sub-micron technology,
computing systems are more and more sensitive to faults and errors, since
escalating power density and temperature variation accelerate wear-out and lead
to a growing prominence of device defects. Therefore, reliability and
fault-tolerance are emerging as the most important concerns for multiprocessor
systems. Although reliability and fault-tolerance are not directly addressed in
our System Level Synthesis contributions, they can be incorporated into the
proposed mapping and scheduling algorithms by extending the cost/priority
functions with additional components related to the system's reliability.
However, to achieve higher level of fault-tolerance and robustness, thorough research
needs to be done on the impact of faults as well as error correction methods on each
component of the Reconfigurable MPS.
8.2.2 Micro Level Synthesis
Hardware Resource Modelling
In both of our contributions on Micro Level Synthesis, although efficient
heuristics for traversing the design space have been proposed, the execution of
the optimization frameworks is still time consuming due to the long runtime of
HLS tools for each design point evaluation. Therefore, an analytical approach for
estimating the hardware resource usage from the high level description of the
application and the HLS parameters (loop unroll, loop pipeline, etc.) would
significantly reduce the time consumption of the tools. In the case of the
loop-array dependency DSE framework in Chapter 6, an accurate HW estimator could
also improve the quality of the generated Pareto front by covering more design
points with lower HW usage. For the option pricing hardware generator in Chapter
7, a shorter time for estimating hardware usage creates room for applying more
sophisticated optimization techniques during design space exploration.
Platform-independent Implementation
Currently, our contributions on Micro Level Synthesis are implemented on
vendor-specific platforms. In particular, the option pricing hardware generator
in Chapter 7 is developed with the Maxj data flow language, the Java-based High
Level Synthesis language developed by Maxeler [118], and targets Maxeler devices,
while the Loop-Array dependency DSE framework in Chapter 6 depends on Vivado HLS
from Xilinx. To make our contributions more flexible and platform-independent,
another direction that we would like to consider is to extend and verify our
contributions with open HW description languages (VHDL or Verilog) and an open
source HLS tool (LegUp [36]).
Bibliography
[1] Xilinx partial reconfiguration user guide.
[2] Altera SDK for OpenCL, 2008.
https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html.
[3] Altera socs, altera, 2008. https://www.altera.com/products/soc/overview.html.
[4] AutoPilot, 2008. http://www.autoesl.com, AutoESL Design Technologies.
[5] C-to-silicon compiler, 2008. http://www.cadence.com/products/sd/ silicon com-
piler/pages/default.aspx.
[6] Core Generator, Xilinx, 2008. http://www.xilinx.com.
[7] MegaCore, Altera, 2008. http://www.altera.com.
[8] Zedboard, xilinx, 2008. http://zedboard.org/.
[9] The International Technology Roadmap for Semiconductors (ITRS), System
Driver Report, 2011, http://www.itrs.net/. Technical report, 2011.
[10] Agility DK design suite, 2013. http://www.agilityds.com/products/cbasedproducts/default.aspx.
[11] Impulse C., 2013. http://www.impulsec.com.
[12] The International Technology Roadmap for Semiconductors (ITRS), System
Integration, 2014, http://www.itrs.net/. Technical report, 2014.
[13] Maxeler App Gallery, 2015. http://appgallery.maxeler.com/.
[14] PREMIA - A platform for pricing financial derivatives, 2015.
www.rocq.inria.fr/mathfi/Premia/index.html.
[15] B Ackland, A Anesko, and D Brinthaupt. A single-chip, 1.6-billion, 16-b MAC/s
multiprocessor DSP. Solid-State Circuits,, 2000.
[16] S Aditya and V Kathail. Algorithmic synthesis using PICO. High-Level Synthesis,
2008.
[17] A. Ahmadinia, C. Bobda, and J. Teich. A dynamic scheduling and placement
algorithm for reconfigurable hardware. Organic and Pervasive Computing–ARCS
2004, pages 443–465, 2004.
[18] Yongjin Ahn et al. SoCDAL: System-on-chip design AcceLerator. ACM TO-
DAES, 13:17:1–17:38, 2008.
[19] George E Andrews and Kimmo Eriksson. Integer partitions. Cambridge Univer-
sity Press, 2004.
[20] Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander.
OPTICS: ordering points to identify the clustering structure. In ACM Sigmod
Record, volume 28, pages 49–60. ACM, 1999.
[21] S. Banerjee, E. Bozorgzadeh, and N. Dutt. Physically-aware hw-sw partitioning




[22] U Banerjee. Loop transformations for restructuring compilers: the foundations.
2007.
[23] U Banerjee. Loop parallelization. 2013.
[24] Cedric Bastoul et al. Putting polyhedral loop transformations to work. In
LCPC16, pages 209–225, October 2003.
[25] David S Bates. Jumps and stochastic volatility: Exchange rate processes implicit
in deutsche mark options. Review of financial studies, 9(1):69–107, 1996.
[26] A Bender. Design of an optimal loosely coupled heterogeneous multiprocessor
system. European Design and Test Conference, 1996., 1996.
[27] Jl Berral et al. Toward Energy-Aware Scheduling Using Machine Learning. En-
ergy Efficient Distributed Computing Systems, pages 215–244, 2012.
[28] Josep Ll. Berral et al. Power-Aware Multi-data Center Management Using Ma-
chine Learning. ICPP, pages 858–867, 2013.
[29] T Bollaert. Catapult synthesis: a practical introduction to interactive C synthesis.
High-Level Synthesis, 2008.
[30] Simone Borgio et al. Hardware dwt accelerator for multiprocessor system-on-
chip on fpga. International Conference on Embedded Computer Systems: Archi-
tectures, Modeling and Simulation, pages 107–114, 2006.
[31] Maya Briani. Numerical methods for option pricing in jump-diffusion markets.
PhD thesis, Università degli Studi di Roma La Sapienza, 2003.
[32] Dimo Brockhoff, Tobias Wagner, and Heike Trautmann. On the properties of
the r2 indicator. In Proceedings of the 14th annual conference on Genetic and
evolutionary computation, pages 465–472. ACM, 2012.
[33] Christian Brugger et al. Hyper: A runtime reconfigurable architecture for monte
carlo option pricing in the heston model. In FPL, pages 1–8. IEEE, 2014.
[34] Christian Brugger et al. Mixed precision multilevel monte carlo on hybrid
computing systems. In Computational Intelligence for Financial Engineering &
Economics (CIFEr), 2014 IEEE Conference on, pages 215–222. IEEE, 2014.
[35] B. C. Schafer and K. Wakabayashi. Machine learning predictive modelling high-
level synthesis design space exploration. IET, page 153, 2012.
[36] A Canis, J Choi, M Aldham, and V Zhang. LegUp: An open-source high-level
synthesis tool for FPGA-based processor/accelerator systems. ACM Transactions
on, 2013.
[37] Andrew Canis et al. Legup: high-level synthesis for fpga-based proces-
sor/accelerator systems. In FPGA, pages 33–36. ACM, 2011.
[38] E. Carvalho and F. Moraes. Congestion-aware task mapping in heterogeneous
MPSoCs. In SoC, pages 1–4, 2008.
[39] KS Chatha and R Vemuri. An iterative algorithm for hardware-software partition-
ing, hardware design space exploration and scheduling. Design Automation for
Embedded Systems, 2000.
[40] Anupam Chattopadhyay. Ingredients of adaptability: A survey of reconfigurable
processors. User Modeling and User-Adapted Interaction, 2013, 2013.
[41] J Cong, Y Fan, G Han, and W Jiang. Platform-based behavior-level and system-
level synthesis. SOC Conference, 2006, 2006.
[42] J Cong and Y Zou. FPGA-based hardware acceleration of lithographic aerial
image simulation. ACM Transactions on Reconfigurable Technology and, 2009.
[43] Jason Cong and Karthik Gururaj. Energy efficient multiprocessor task scheduling
under input-dependent variation. In DATE, pages 411–416, 2009.
[44] Philippe Coussy and Adam Morawiec. High-level synthesis. Springer, 2010.
[45] P Cumming. The TI OMAP Platform Approach to SoC. Winning the SOC Revo-
lution, 2003.
[46] Anup Kumar Das et al. Reinforcement learning-based inter-and intra-application
thermal optimization for lifetime improvement of multicore systems. In Design
Automation Conference, pages 1–6. IEEE, 2014.
[47] BP Dave and NK Jha. COHRA: hardware-software cosynthesis of hierarchical
heterogeneous distributed embedded systems. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 1998.
[48] Christian de Schryver et al. An energy efficient fpga accelerator for monte carlo
option pricing with the heston model. In ReConFig, pages 468–474. IEEE, 2011.
[49] Christian de Schryver et al. A multi-level monte carlo fpga accelerator for option
pricing in the heston model. In DATE, pages 248–253. EDA Consortium, 2013.
[50] RH Dennard, J Cai, and A Kumar. A perspective on today’s scaling challenges
and possible future directions. Solid-State Electronics, 2007.
[51] Robert P Dick et al. Tgff: task graphs for free. In Proceedings of the 6th interna-
tional workshop on Hardware/software codesign, pages 97–101. IEEE Computer
Society, 1998.
[52] RP Dick and NK Jha. CORDS: hardware-software co-synthesis of reconfigurable




[53] RP Dick and NK Jha. MOGAC: a multiobjective genetic algorithm for hardware-
software cosynthesis of distributed embedded systems. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 1998.
[54] R.P. Dick, D.L. Rhodes, and W. Wolf. Tgff: task graphs for free. In CODES,
pages 97–101, 1998.
[55] Heungsik Eom et al. MALMOS: Machine Learning-Based Mobile Offloading
Scheduler with Online Training. International Conference on Mobile Cloud Com-
puting, Services, and Engineering, pages 51–60, 2015.
[56] H Esmaeilzadeh, E Blem, and RS Amant. Power challenges may end the multi-
core era. Communications of the, 2013.
[57] JA Fisher. Trace scheduling: A technique for global microcode compaction. IEEE
Transactions on Computers, 1981.
[58] Jerome Friedman et al. The elements of statistical learning, volume 1. Springer
series in statistics Springer, Berlin, 2001.
[59] MR Garey and DS Johnson. Computers and intractability. 2002.
[60] A. H. Ghamarian et al. Throughput Analysis of Synchronous Data Flow Graphs.
In ACSD, pages 25–36, 2006.
[61] G Kahn. The semantics of a simple language for parallel programming. In
Information Processing, 1974.
[62] Giovanni Beltrame et al. Decision-theoretic design space exploration of multiprocessor
platforms. IEEE TCAD, 29:1083–1095, 2010.
[63] C Girault and R Valk. Petri nets for systems engineering: a guide to modeling,
verification, and applications. 2013.
[64] Paul Glasserman. Monte Carlo methods in financial engineering, volume 53.
Springer Science & Business Media, 2003.
[65] Mateusz Guzek et al. A survey of evolutionary computation for resource man-
agement of processing in cloud computing [review article]. Computational Intel-
ligence Magazine, IEEE, 10(2):53–67, 2015.
[66] RE Hank, SA Mahlke, and RA Bringmann. Superblock formation using static
program analysis. Proceedings of the 26th Annual International Symposium on
Microarchitecture, 1993.
[67] S. Hauck. Configuration prefetch for single context reconfigurable coprocessors.
In FPGA, pages 65–74, 1998.
[68] JR Hauser and J Wawrzynek. Garp: A MIPS processor with a reconfigurable
coprocessor. The 5th Annual IEEE Symposium on, 1997.
[69] SA Hayati and AC Parker. Representation of control and timing behavior with
applications to interface synthesis. Computer Design: VLSI in, 1988.
[70] CAR Hoare. Communicating sequential processes. 1978.
[71] Elham Hormozi et al. Using of machine learning into cloud environment (a sur-
vey): Managing and scheduling of resources in cloud systems. Proceedings -
3PGCIC, pages 363–368, 2012.
[72] J.W. Hsieh et al. An enhanced leakage-aware scheduler for dynamically recon-
figurable fpgas. In ASP-DAC, pages 661–667, 2011.
[73] Jingcao Hu and Radu Marculescu. Energy-aware communication and task
scheduling for network-on-chip architectures under real-time constraints. In
DATE, pages 10234–, 2004.
[74] John C Hull. Options, futures, and other derivatives. Pearson Education India,
2006.
[75] Gordon Inggs, Shane Fleming, David B. Thomas, and Wayne Luk. Is High Level
Synthesis Ready for Business? An Option Pricing Case Study. In FPGA Based
Accelerators for Financial Applications, pages 97–115. Springer International
Publishing, Cham, 2015.
[76] Lech Jozwiak. Life-inspired systems and their quality-driven design. Architecture
of Computing Systems-ARCS 2006, 2006.
[77] Lech Jozwiak. Modern architectures for embedded reconfigurable systems: A
survey. Journal of Circuits, Systems, and Computers, 18(2):209–254, 2009.
[78] Lech Jozwiak, Nadia Nedjah, and Miguel Figueroa. Modern development meth-
ods and tools for embedded reconfigurable systems: A survey. Integration, the
VLSI Journal, 43(1):1–33, 2010.
[79] Nirav H Kapadia et al. Predictive application-performance modeling in a com-
putational grid environment. In International Symposium on High Performance
Distributed Computing, pages 47–54. IEEE, 1999.
[80] I Karkowski and H Corporaal. Design space exploration algorithm for heteroge-
neous multi-processor embedded system design. Proceedings of the 35th annual
Design, 1998.
[81] M Kaul and R Vemuri. Temporal partitioning combined with design space explo-
ration for latency minimization of run-time reconfigured designs. Proceedings of
the conference on Design,, 1999.
[82] Joachim Keinert et al. SystemCoDesigner: an automatic ESL synthesis approach
by design space exploration and behavioral synthesis for streaming applications.
ACM TODAES, 14:1:1–1:23, 2009.
[83] PV Knudsen and J Madsen. PACE: A dynamic programming algorithm for hard-
ware/software partitioning. Proceedings of the 4th International Workshop, 1996.
[84] Pratyush Kumar et al. Thermally optimal stop-go scheduling of task graphs with
real-time constraints. In ASP-DAC, pages 123–128. IEEE, 2011.
[85] Pratyush Kumar and Lothar Thiele. Thermally optimal stop-go scheduling of
task graphs with real-time constraints. In ASP-DAC, pages 123–128. IEEE Press,
2011.
[86] I. Kuon and J. Rose. Measuring the gap between fpgas and asics. TCAD, 26:203–
215, 2007.
[87] Yu-Kwong Kwok et al. Benchmarking and comparison of the task graph schedul-
ing algorithms. Journal of Parallel and Distributed Computing, 59(3):381–422,
1999.
[88] Choonseung Lee et al. A systematic design space exploration of mpsoc based on
synchronous data flow specification. JSPS, 58(2):193–213, 2010.
[89] Edward Ashford Lee and David G. Messerschmitt. Static scheduling of syn-
chronous data flow programs for digital signal processing. IEEE Trans. Comput.,
36:24–35, 1987.
[90] Young Choon Lee et al. Minimizing energy consumption for precedence-




[91] Peng Li, Yuxin Wang, Peng Zhang, Guojie Luo, Tao Wang, and Jason Cong.
Memory partitioning and scheduling co-optimization in behavioral synthesis. IC-
CAD, page 488, 2012.
[92] Hung-Yi Liu and Luca P. Carloni. On learning-based methods for design-space
exploration with high-level synthesis. DAC, page 1, 2013.
[93] Martin Lukasiewycz et al. Efficient symbolic multi-objective design space explo-
ration. In ASP-DAC, pages 691–696, 2008.
[94] J Luu, K Redmond, WCY Lo, and P Chow. FPGA-based Monte Carlo compu-
tation of light absorption for photodynamic cancer therapy. 17th Symposium on
Field Programmable Custom Computing Machines (FCCM), IEEE, 2009.
[95] SA Mahlke, DC Lin, WY Chen, and RE Hank. Effective compiler support for
predicated execution using the hyperblock. ACM SIGMICRO, 1992.
[96] G. Mariani et al. An industrial design space exploration framework for supporting
run-time resource management on multi-core systems. In DATE, pages 196–201,
2010.
[97] Giovanni Mariani et al. Using multi-objective design space exploration to enable
run-time resource management for reconfigurable architectures. In DATE, pages
1379–1384, 2012.
[98] Grant Martin and Gary Smith. High-level synthesis: Past, present, and future.
IEEE Design & Test of Computers, 26(4):18–25, 2009.
[99] Makoto Matsumoto and Takuji Nishimura. Mersenne twister: a 623-
dimensionally equidistributed uniform pseudo-random number generator. ACM




[100] M Meredith. High-level SystemC synthesis with forte’s cynthesizer. High-Level
Synthesis, 2008.
[101] Robert C Merton. Option pricing when underlying stock returns are discontinu-
ous. Journal of financial economics, 3(1):125–144, 1976.
[102] Gordon E Moore. Cramming More Components onto Integrated Circuits. Elec-
tronics, pages 114–117, 1965.
[103] Orlando Moreira et al. Online resource management in a multiprocessor with a
network-on-chip. In SAC, pages 1557–1564, 2007.
[104] Orlando Moreira et al. Scheduling multiple independent hard-real-time jobs on a
heterogeneous multiprocessor. In Proceedings of the International conference on
Embedded software, pages 57–66, 2007.
[105] J Mottin, M Cartron, and G Urlini. The sthorm platform. Smart Multicore Em-
bedded Systems, 2014.
[106] R Nane, VM Sima, B Olivier, and R Meeuws. DWARV 2.0: A CoSy-based
C-to-VHDL hardware compiler. 22nd International Conference on Field Pro-
grammable Logic and Applications (FPL), 2012.
[107] Siva G Narendra, Laura C Fujino, and Kenneth C Smith. Through the looking
glass? the 2015 edition: Trends in solid-state circuits from isscc. IEEE Solid-
State Circuits Magazine, 7(1):14–24, 2015.
[108] Brent Nelson. Fpga design productivity–a discussion of the state of the art and a
research agenda. In Reconfigurable Computing: Architectures, Tools and Appli-
cations, pages 1–1. Springer, 2009.
[109] R Niemann and P Marwedel. Hardware/software partitioning using integer pro-
gramming. Proceedings of the 1996 European conference on Design and Test,
1996.
[110] Y Nishimichi, N Higaki, and M Osaka. UniPhier: series development and SoC
management. 2009 Asia and South Pacific Design Automation Conference, 2009.
[111] V Nollet, P Avasare, and H Eeckhaut. Run-time management of a mpsoc con-
taining fpga fabric tiles. Very Large Scale, 2008.
[112] Vincent Nollet et al. Run-time management of a MPSoC containing FPGA fabric
tiles. IEEE TVLSI, 16:24–33, 2008.
[113] H Oh and S Ha. A hardware-software cosynthesis technique based on heteroge-
neous multiprocessor scheduling. Proceedings of the seventh international work-
shop on, 1999.
[114] JA De Oliveira and H Van Antwerpen. The Philips Nexperia digital video plat-
form. Winning the SoC Revolution, 2003.
[115] Maurizio Paganini. Nomadik: A Mobile Multimedia Application Processor Platform.
In 2007 Asia and South Pacific Design Automation Conference, pages 749–750.
IEEE, January 2007.
[116] S Pande and DP Agrawal. Compiler optimizations for scalable parallel systems:
languages, compilation techniques, and run time systems. 2001.
[117] SR Park and W Burleson. Reconfiguration for power saving in real-time motion
estimation. Proceedings of the 1998 IEEE International Conference on Acoustics,
Speech and Signal Processing, 1998.
[118] Oliver Pell and Vitali Averbukh. Maximum performance computing with dataflow
engines. Computing in Science & Engineering, 14(4):98–103, 2012.
[119] Z Peng and K Kuchcinski. An algorithm for partitioning of application specific
systems. Proceedings of the 4th European Conference on Design Automation,
with the European Event in ASIC Design, 1993.
[120] Julien Perez et al. Utility-based reinforcement learning for reactive grids. In
International Conference on Autonomic Computing, pages 205–206. IEEE, 2008.
[121] Nam Khanh Pham, Akash Kumar, and Khin Mi Mi Aung. Machine learning
approach to generate pareto front for list-scheduling algorithms. In Proceedings
of the 19th International Workshop on Software and Compilers for Embedded
Systems, pages 127–134. ACM, 2016.
[122] Nam Khanh Pham, Amit Kumar Singh, Akash Kumar, and Khin Mi Mi Aung. Incorporating
energy and throughput awareness in design space exploration and
run-time mapping for heterogeneous mpsocs. In DSD, pages 513–521. IEEE,
2013.
[123] Nam Khanh Pham, Amit Kumar Singh, and Akash Kumar. A multi-stage leak-
age aware resource management technique for reconfigurable architectures. In
GLSVLSI, pages 63–68. ACM, 2014.
[124] Nam Khanh Pham, Amit Kumar Singh, Akash Kumar, and Mi Mi Aung Khin.
Exploiting loop-array dependencies to accelerate the design space exploration
with high level synthesis. In DATE, pages 157–162. EDA Consortium, 2015.
[125] C Pilato and F Ferrandi. Bambu: A Free Framework for the High Level Synthesis
of Complex Applications. University Booth of DATE, 2012.
[126] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite. URL:
http://www.cs.ucla.edu/~pouchet/software/polybench/, 2012.
[127] Louis-Noel Pouchet et al. Polyhedral-based data reuse optimization for config-
urable computing. FPGA, page 29, 2013.
[128] Adrien Prost-Boucle, Olivier Muller, and Frédéric Rousseau. Fast and standalone
Design Space Exploration for High-Level Synthesis under resource constraints.
Journal of Systems Architecture, 60(1):79–93, January 2014.
[129] M Püschel, JMF Moura, and JR Johnson. SPIRAL: Code generation for DSP
transforms. Proceedings of the, 2005.
[130] A Putnam, D Bennett, and E Dellinger. CHiMPS: A C-level compilation flow
for hybrid CPU-FPGA architectures. 2008 International Conference on Field
Programmable Logic and Applications, 2008.
[131] R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2015.
[132] DS Harish Ram, MC Bhuvaneswari, and SM Logesh. A novel evolutionary tech-
nique for multi-objective power, area and delay optimization in high level synthe-
sis of datapaths. In ISVLSI, pages 290–295. IEEE, 2011.
[133] Fabrice Douglas Rouah. Euler and milstein discretization. Working paper,
Sapient Global Markets, United States. Retrieved from www.frouah.com,
2011.
[134] CR Rupp, M Landguth, and T Garverick. The NAPA adaptive processing archi-
tecture. FPGAs for Custom, 1998.
[135] Shinichi Sakata. Asamin: a matlab gateway routine to lester ingber’s adaptive
simulated annealing (asa) software, 2014.
[136] Abhaya Kumar Samal et al. Fault tolerant scheduling of hard real-time tasks on
multiprocessor system using a hybrid genetic algorithm. Swarm and Evolutionary
Computation, 14:92–105, February 2014.
[137] A Šaramentovas and P Ruzgys. HSDPA design space exploration and implementation
guidance with Design-Trotter. 6th International Conference on Information,
Communications & Signal Processing, 2007.
[138] MJW Savage, Z Salcic, and G Coghill. Extended genetic algorithm for codesign
optimization of DSP systems in FPGAs. Proceedings. 2004 IEEE International
Conference on Field-Programmable Technology, 2004.
[139] B Carrion Schafer, Takashi Takenaka, and Kazutoshi Wakabayashi. Adaptive
simulated annealer for high level synthesis design space exploration. In VLSI-
DAT, pages 106–109. IEEE, 2009.
[140] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. Design space exploration
acceleration through operation clustering. ICCAD, 29(1):153–157, 2010.
[141] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. Divide and conquer high-
level synthesis design space exploration. TODAES, 17(3):29, 2012.
[142] A. Schranzhofer et al. Dynamic Power-Aware Mapping of Applications onto
Heterogeneous MPSoC Platforms. IEEE Transactions on Industrial Informatics,
6(4):692–707, 2010.
[143] Eberhard Schüler, Ralf König, Jürgen Becker, Gerard Rauwerda, Marcel van de
Burgwal, and Gerard J. M. Smit. Smart Chips for Smart Surroundings 4S. In




[144] Anirban Sengupta and Reza Sedaghat. Integrated scheduling, allocation and bind-
ing in high level synthesis using multi structure genetic algorithm based design
space exploration. In ISQED, pages 1–9. IEEE, 2011.
[145] M. Shafique, L. Bauer, and J. Henkel. Remis: Run-time energy minimization
scheme in a reconfigurable processor with dynamic power-gated instruction set.
In ICCAD, pages 55–62, 2009.
[146] L Shang, RP Dick, and NK Jha. Slopes: hardware-software cosynthesis of low-power
real-time distributed embedded systems with dynamically reconfigurable
fpgas. Computer-Aided Design of, 2007.
[147] Amit Kumar Singh et al. A Hybrid Strategy for Mapping Multiple Throughput-
constrained Applications on MPSoCs. In CASES, pages 175–184, 2011.
[148] Amit Kumar Singh et al. Accelerating throughput-aware runtime mapping for
heterogeneous mpsocs. ACM TODAES, 18(1):9:1–9:29, 2013.
[149] Amit Kumar Singh et al. Energy optimization by exploiting execution slacks in
streaming applications on multiprocessor systems. In DAC, pages 115:1–115:7,
2013.
[150] H Singh, MH Lee, G Lu, and FJ Kurdahi. MorphoSys: an integrated reconfig-
urable system for data-parallel and computation-intensive applications. Comput-
ers, IEEE, 2000.
[151] Amit Kumar Singh et al. Mapping real-life applications on run-time reconfig-
urable noc-based mpsoc on fpga. In FPT, pages 365–368, 2010.
[152] Oliver Sinnen. Task scheduling for parallel systems, volume 60. 2007.
[153] Gerard J.M. Smit, Paul J.M. Havinga, Lodewijk T. Smit, Paul M. Heysters, and
Michel A.J. Rosien. Dynamic Reconfiguration in Mobile Systems. pages 171–
181. Springer Berlin Heidelberg, 2002.
[154] GJM Smit, ABJ Kokkeler, and PT Wolkotte. Multi-core architectures and stream-
ing applications. Proceedings of the 2008 international workshop on System level
interconnect prediction, 2008.
[155] LIN Song. Ngpm–a nsga-ii program in matlab. 2011.
[156] SR Srinivasan and NK Jha. Hardware-software co-synthesis of fault-tolerant real-
time distributed embedded systems. Design Automation Conference, 1995,, 1995.
[157] C. Steiger, H. Walder, and M. Platzner. Heuristics for online scheduling real-time
tasks to partially reconfigurable devices. FPGA, pages 575–584, 2003.
[158] G Stitt and F Vahid. Energy advantages of microprocessor platforms with on-chip
configurable logic. IEEE Design & Test of Computers, 2002.
[159] S. Stuijk et al. A Predictable Multiprocessor Design Flow for Streaming Applications
with Dynamic Behaviour. In DSD, pages 548–555, 2010.
[160] S Stuijk, M Geilen, and T Basten. SDF^3: SDF For Free. 2006.
[161] Sander Stuijk. Predictable mapping of streaming applications on multiprocessors,
2007.
[162] Sander Stuijk et al. SDF3: SDF For Free. In ACSD, pages 276–278, 2006.
[163] S Sutar et al. Task scheduling for multiprocessor systems using memetic al-
gorithms. In 4th International Working Conference Performance Modeling and
Evaluation of Heterogeneous Networks, 2006.
[164] David B Thomas. Acceleration of financial monte-carlo simulations using fpgas.
In High Performance Computational Finance (WHPCF), 2010 IEEE Workshop
on, pages 1–6. IEEE, 2010.
[165] David B Thomas, Jacob A Bower, and Wayne Luk. Automatic generation and
optimisation of reconfigurable financial Monte-Carlo simulations. In Application-
specific Systems, Architectures and Processors, 2007. ASAP. IEEE International
Conf. on, pages 168–173. IEEE, 2007.
[166] Xiang Tian and Khaled Benkrid. Design and implementation of a high perfor-
mance financial Monte-Carlo simulation engine on an FPGA supercomputer. In
FPT, pages 81–88. IEEE, 2008.
[167] Xiang Tian and Khaled Benkrid. High-performance quasi-monte carlo financial
simulation: Fpga vs. gpp vs. gpu. ACM Transactions on Reconfigurable Technol-
ogy and Systems (TRETS), 3(4):26, 2010.
[168] Xiang Tian and C Bouganis. A run-time adaptive fpga architecture for monte
carlo simulations. In FPL, pages 116–122. IEEE, 2011.
[169] Takao Tobita et al. A standard task graph set for fair evaluation of multiprocessor
scheduling algorithms. Journal of Scheduling, 5(5):379–394, 2002.
[170] Jakob Kenn Toft and Alberto Nannarelli. Energy efficient fpga based hardware
accelerators for financial applications. In NORCHIP, 2014, pages 1–6. IEEE,
2014.
[171] JL Tripp, MB Gokhale, and KD Peterson. Trident: From high-level language to
hardware circuitry. Computer, 2007.
[172] Anson HT Tse, David B Thomas, Kuen Hung Tsoi, and Wayne Luk. Efficient
reconfigurable design for pricing asian options. ACM SIGARCH Computer Ar-
chitecture News, 38(4):14–20, 2011.
[173] T. Tuan and B. Lai. Leakage power analysis of a 90nm fpga. In CICC, pages
57–60, 2003.
[174] Tim Tuan, Sean Kao, Arif Rahman, Satyaki Das, and Steve Trimberger. A 90nm
low-power fpga for battery-powered applications. In FPGA, pages 3–11, 2006.
[175] D Wang, S Li, and Y Dou. Collaborative hardware/software partition of coarse-
grained reconfigurable system using evolutionary ant colony optimization. Pro-
ceedings of the 2008 Asia and South Pacific, 2008.
[176] G Wang, W Gong, and R Kastner. Application partitioning on programmable
platforms using the ant colony optimization. Journal of Embedded, 2006.
[177] Yuxin Wang, Peng Li, Peng Zhang, Chen Zhang, and Jason Cong. Memory par-
titioning for multidimensional arrays in high-level synthesis. DAC, page 1, 2013.
[178] Nicholas Weaver et al. Post-placement c-slow retiming for the xilinx virtex fpga.
In FPGA, pages 185–194. ACM, 2003.
[179] Claas Wilke. Energy-Aware Development and Labeling for Mobile Applica-
tions. PhD thesis, Saechsische Landesbibliothek-Staats-und Universitaetsbiblio-
thek Dresden, 2014.
[180] WH Wolf. An architectural co-synthesis algorithm for distributed, embedded
computing systems. Very Large Scale Integration (VLSI) Systems, IEEE, 1997.
[181] Chris Wynnyk and Malik Magdon-Ismail. Pricing the american option using
reconfigurable hardware. In Computational Science and Engineering, 2009.
CSE’09. International Conference on, volume 2, pages 532–536. IEEE, 2009.
[182] Xilinx. Partial reconfiguration user guide. Technical report.
[183] Peng Yang et al. Managing dynamic concurrent tasks in embedded real-time
multimedia systems. In ISSS, pages 112–119, 2002.
[184] Zhi Alex Ye, Andreas Moshovos, Scott Hauck, and Prithviraj Banerjee. CHIMAERA.
In Proceedings of the 27th annual international symposium on Computer architecture -
ISCA ’00, volume 28, pages 225–235, New York, New York, USA, 2000. ACM
Press.
[185] C. Ykman-Couvreur et al. Linking run-time resource management of embedded
multi-core platforms with automated design-time exploration. Computers Digital
Techniques, IET, 5(2):123–135, 2011.
[186] Ch. Ykman-Couvreur et al. Fast Multi-Dimension Multi-Choice Knapsack
Heuristic for MP-SoC Run-Time Management. In SoC, pages 1–4, 2006.
[187] Chantal Ykman-Couvreur et al. Run-time resource management based on design
space exploration. In CODES, pages 557–566, 2012.
[188] P.H. Yuh et al. Leakage-aware task scheduling for partially dynamically recon-
figurable fpgas. TODAES, 14:52, 2009.
[189] Golbarg Zarinzad et al. A novel intelligent algorithm for fault-tolerant task
scheduling in real-time multiprocessor systems. In Convergence and Hybrid In-
formation Technology, volume 2, pages 816–821. IEEE, 2008.
[190] GL Zhang et al. Reconfigurable acceleration for monte carlo based financial
simulation. In FPT, pages 215–222. IEEE, 2005.
[191] W Zhang, V Betz, and J Rose. Portable and scalable FPGA-based acceleration of
a direct linear system solver. ACM Transactions on Reconfigurable, 2012.
[192] Wei Zuo et al. Improving high level synthesis optimization opportunity through
polyhedral transformations. FPGA, page 9, 2013.
Appendix A: List of Publications
List of Journal Papers
J1 Nam Khanh Pham, Amit Kumar Singh, Akash Kumar, and Khin Mi Mi Aung.
Leakage Aware Resource Management Approach with Machine Learning Opti-
mization Framework for Partially Reconfigurable Architectures. In: Micropro-
cessors and Microsystems, 47, 231-243, 2016.
J2 Nam Khanh Pham, Akash Kumar, and Khin Mi Mi Aung. Optimizing List-based
System Level Synthesis Techniques with Genetic Algorithm and Machine Learn-
ing. Under review: IEEE Transactions On Computer-Aided Design of Integrated
Circuits and Systems (TCAD).
List of Conference Papers
C1 Nam Khanh Pham, Amit Kumar Singh, Akash Kumar, and Khin Mi Mi Aung.
Incorporating energy and throughput awareness in design space exploration and
run-time mapping for heterogeneous MPSoC. In Euromicro Conference on Digital
System Design (DSD), pages 513-521. IEEE, 2013.
C2 Nam Khanh Pham, Amit Kumar Singh, and Akash Kumar. A multi-stage leakage
aware resource management technique for reconfigurable architectures. In Pro-
ceedings of the 24th edition of the great lakes symposium on VLSI, pages 63-68.
ACM, 2014. (Best paper award candidate).
C3 Nam Khanh Pham, Amit Kumar Singh, Akash Kumar, and Mi Mi Aung Khin. Ex-
ploiting loop-array dependencies to accelerate the design space exploration with
high level synthesis. In Proceedings of the 2015 Design, Automation & Test in
Europe Conference & Exhibition, pages 157-162. EDA Consortium, 2015.
C4 Nam Khanh Pham, Akash Kumar, and Khin Mi Mi Aung. Machine learning ap-
proach to generate pareto front for list-scheduling algorithms. In Proceedings of
the 19th International Workshop on Software and Compilers for Embedded Sys-
tems, pages 127-134. ACM, 2016. (Best presentation award).
C5 Nam Khanh Pham, Akash Kumar, and Khin Mi Mi Aung. Automatic framework
to generate reconfigurable accelerators for option pricing applications. In: International
Conference on Reconfigurable Computing and FPGAs (ReConFig), Nov
2016.
