











(M.Eng. (Hons.), Imperial College of Science, Technology And Medicine)
(S.M. Singapore-MIT Alliance)
A THESIS SUBMITTED FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
IN COMPUTER SCIENCE
DEPARTMENT OF COMPUTER SCIENCE




Publications related to this thesis:
• Compile-time Design Space Exploration for Dynamically Reconfigurable System-on-a-Chip.
Joon Edward Sim, Tulika Mitra and Weng-Fai Wong. Invited presentation at the Optimizing
Compiler Assisted SoC Assembly Workshop (OCASA), September 2005.
• Defining Neighborhood Relations For Fast Spatial-Temporal Partitioning of Applications on
Reconfigurable Architectures. Joon Edward Sim, Tulika Mitra and Weng-Fai Wong. IEEE
International Conference on Field-Programmable Technology (FPT), December 2008.
• Optimal Placement-aware Trace-based Scheduling of Hardware Reconfigurations for FPGA
Accelerators. Joon Edward Sim, Weng-Fai Wong and Jürgen Teich. 17th IEEE Symposium
on Field-Programmable Custom Computing Machines (FCCM), April 2009.
• Interprocedural Placement-Aware Configuration Prefetching for FPGA-based Systems. Joon
Edward Sim, Weng-Fai Wong, Gregor Walla, Tobias Ziermann and Jürgen Teich. 18th IEEE
Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2010.
Other publications:
• DEP: Detailed Execution Profile. Qin Zhao, Joon Edward Sim and Weng-Fai Wong. 15th
International Conference on Parallel Architectures and Compilation Techniques (PACT),
September 2006.
• An Efficient Framework for Dynamic Reconfiguration of Instruction-Set Customization. Huynh
Phung Huynh, Joon Edward Sim and Tulika Mitra. 7th ACM/IEEE International Conference
on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), October 2007.
• Fast and Accurate Simulation of Biomonitoring Applications on a Wireless Body Area
Network. Kathy D. Nguyen, Ioanna Cutcutache, Saravan Sinnadurai, Shanshan Liu, Cihat Basol,
Joon Edward Sim, Linh Thi Xuan Phan, Teck Bok Tok, Lin Xu, Francis Eng Hock Tay,
Tulika Mitra and Weng-Fai Wong. Proceedings of the 6th International Workshop on Wearable
and Implantable Body Sensor Networks (BSN), June 2009.
• BSN Simulator: Optimizing Application Using System Level Simulation. Ioanna Cutcutache,
Thi Thanh Nga Dang, Wai Kay Leong, Shanshan Liu, Kathy Dang Nguyen, Linh Thi
Xuan Phan, Joon Edward Sim, Zhenxin Sun, Teck Bok Tok, Lin Xu, Francis Eng Hock
Tay and Weng-Fai Wong. Proceedings of the 5th International Workshop on Wearable and
Implantable Body Sensor Networks (BSN), June 2008.
• An Efficient Framework for Dynamic Reconfiguration of Instruction-Set Customization. Huynh
Phung Huynh, Joon Edward Sim and Tulika Mitra. Springer Journal of Design Automation
for Embedded Systems, 2009.
Acknowledgements
I would like to thank my advisor Professor Wong Weng-Fai for his support and guidance
throughout the entire process of producing this thesis, including the years that were spent
in preparing the research behind this document. His enthusiastic and hands-on approach
towards scientific research has been, and will always be, an inspiration for my future
undertakings. I especially appreciate the patience with which he has guided and taught me
over the years.
I wish to thank the members of my thesis committee, Professor P.S. Thiagarajan and Abhik
Roychoudhury, for their invaluable input and discussions right from the early stages of this
work. This thesis would not have been possible without their help.
I wish to thank Professor Tulika Mitra and Professor Jürgen Teich for giving me the
opportunity to work with them over the past years. Their creative and yet careful approach
to research will always be an example to me. I have derived great personal benefit from
working together with them.
I would like to thank all the colleagues, both past and present, whose time in the
embedded systems lab has overlapped with mine. I would like to especially thank Pan Yu,
Kathy Nguyen Dang, Phan Thi Xuan Linh, Ge Zhiguo, Unmesh Dutta Bordoloi, Ankit
Goel, Liang Yun, Sun Zhenxin, Wang Chundong, Qi Dawei, Chen Jie, Sivakuma Achudan,
Liu Shanshan, Andrei Hagiescu and Huynh Bach Khoa. Space does not permit me to list
everyone. I cannot imagine going through the PhD candidature without your company.
Thanks for the foosball games, the meals and the conversations that ranged from technical
discussions and politics to personal life sharing. I would especially like to thank Zhao Qin
and Huynh Phung Huynh, with whom I have co-written some publications, for giving me the
joy and opportunity to work with them.
I would like to thank my Christian friends, especially those from the community of
Covenant Evangelical Free Church. Your prayers, encouragements and concerns are greatly
appreciated. I would like to especially thank Alan Cheng, Tan Gek Woo, Tan Huai Tze,
Gabriel Koh, Wang Yi Tian and Maureen Ng for just being there when I needed them.
My wife Liu Shuyi has been a tremendous support for me throughout the years,
especially during the last stages of the thesis research. The way that she has sacrificially
supported me and loved me has been my greatest motivation.
Last but not least, I thank my God Jesus, who died for me on a shameful cross even
when I do not deserve it.
Contents




1.1 Motivation and Problem Overview . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Run-Time Reconfigurable FPGA . . . . . . . . . . . . . . . . . . . 8
1.1.2 Partially Run-Time Reconfigurable FPGA . . . . . . . . . . . . . . 11
1.2 Contributions and Thesis Organization . . . . . . . . . . . . . . . . . . . . 13
2 Run-Time Reconfigurable Computing and Hardware-Software CoDesign 16
2.1 FPGA Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Run-time Partial Reconfiguration of FPGAs . . . . . . . . . . . . . 18
2.1.2 Heterogeneous Processing Elements . . . . . . . . . . . . . . . . . 20
2.2 Reconfigurable Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Types of Coupling with Host . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Interface with Reconfigurable Logic . . . . . . . . . . . . . . . . . 24
2.2.3 Reconfiguration Latency Hiding . . . . . . . . . . . . . . . . . . . 24
2.3 Hardware Software Partitioning . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Configuration Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Online Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.2 Offline Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Design Space Search for Hardware-Software Partitioning 32
3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 The Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Configurations and Partitions . . . . . . . . . . . . . . . . . . . . . 34
3.2 Fast Evaluation of Neighboring Design Points . . . . . . . . . . . . . . . . 37
3.2.1 Evaluating Partitions . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1.1 Computing Optimal Configuration Instance . . . . . . . 38
3.2.1.2 Loop Trace Compression Using SEQUITUR Graph . . . . 39
3.2.1.3 Counting Reconfigurations . . . . . . . . . . . . . . . . 40
3.2.2 The Neighborhood Relationship . . . . . . . . . . . . . . . . . . . 41
3.2.3 Simultaneous Evaluation of Neighbors . . . . . . . . . . . . . . . 42
3.2.3.1 Labeling Extension and Sequence Enumeration . . . . . 43
3.2.4 Employing the Entire Framework . . . . . . . . . . . . . . . . . . 46
3.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Scaling Hardware Resources and Reconfiguration Time . . . . . . 52
3.3.3 Impact of Using SEQUITUR and Label Extensions . . . . . . . . . . 53
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Optimal Scheduling of Hardware Reconfigurations 55
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Architecture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Scheduling Model . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Reconfiguration Tasks Generation . . . . . . . . . . . . . . . . . . 59
4.2.2 Minimizing Schedule Length . . . . . . . . . . . . . . . . . . . . . 61
4.3 Algorithm MLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1 Bubbles in the Reconfiguration Schedule . . . . . . . . . . . . . . 65
4.3.2 Proof of Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3 Further Clarifications . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.1 H264-encoder Case Study . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3.1 Scaling the Reconfiguration Overhead . . . . . . . . . . 73
4.4.3.2 Scaling the Number of Conflicts . . . . . . . . . . . . . 74
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5 Interprocedural Placement-Aware Configuration Scheduling 77
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.1 Architecture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.2 Reconfiguration Library Support . . . . . . . . . . . . . . . . . . . 80
5.1.3 Interprocedural Control Flow Graphs . . . . . . . . . . . . . . . . 83
5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Interprocedural Placement-Aware Configuration Scheduling . . . . . . . . 93
5.4.1 Finding the Intra Post Dominator Paths . . . . . . . . . . . . . . . 94
5.4.2 Iterative Placement-Aware Estimated Probability Updating . . . . . 96
5.4.3 Prefetch Reduction and Code Generation . . . . . . . . . . . . . . 101
5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6 Conclusions and Future Work 113
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2.1 Granularity of Reconfiguration and Configuration Scheduling . . . 116
6.2.2 Hardware-Software Co-Placement and Partitioning . . . . . . . . . 116
6.2.3 Configuration Management for Multi-core Reconfigurable Computing . . . . . . . 117
List of Figures
1.1 Abstract model of reconfigurable architectures. . . . . . . . . . . . . . . . 4
1.2 Two different FPGAs supporting runtime reconfiguration. . . . . . . . . . . 6
1.3 429.mcf An example control flow graph. Basic blocks are shown in blue.
Hardware regions are shown in red. . . . . . . . . . . . . . . . . . . . . . 7
1.4 Four partitioning strategies for hardware software codesign. . . . . . . . . . 10
2.1 Basic computation units in modern FPGAs. . . . . . . . . . . . . . . . . . 17
2.2 A typical island-style FPGA. The interconnect shown is an abstraction and
not intended to represent realistic implementations of the FPGA. . . . . . . 19
2.3 Different configuration architectures of Xilinx FPGAs. . . . . . . . . . . . 21
2.4 Using pipelining to reduce reconfiguration costs. . . . . . . . . . . . . . . 27
2.5 An example of infeasible placement. . . . . . . . . . . . . . . . . . . . . . 28
3.1 DAG representing a SEQUITUR grammar. . . . . . . . . . . . . . . . . . . . 38
3.2 Pareto-optimal kernel instances. . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 An example of a partition with its neighboring design points and the associated
reconfiguration costs. . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 A SEQUITUR graph labeled with H and S tags given a partition that has put
kernels a, c and d in hardware. . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 An example showing the change in annotation of the SEQUITUR graph
and enumeration of sequences after a move between neighboring design
points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Optimal speedups plotted against increasing hardware resource. . . . . . . 51
3.7 Optimal speedups plotted against increasing reconfiguration time. . . . . . 52
4.1 Architecture model: A CPU (left) controlling the reconfiguration interface
of an FPGA (right) used as a hardware accelerator for an incoming task
sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Example of actor trace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Reconfiguration task generation. . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Dependence relations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Feasible schedule for the problem introduced in Example 4.2.3. . . . . . . . 62
4.6 Example of optimal feasible schedule produced by MLS. . . . . . . . . . . 65
4.7 Induction proof cases for MLS. . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8 Speedup over baseline plotted against increasing reconfiguration time. . . . 73
4.9 Speedup over baseline plotted against decreasing number of conflicts. . . . 75
5.1 Architecture model for interprocedural placement-aware configuration scheduling. . . 79
5.2 HeapSort C code example. . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 HeapSort C code example. . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 HeapSort interprocedural control flow graph. . . . . . . . . . . . . . . . . 87
5.4 How prefetching affects overall execution time. . . . . . . . . . . . . . . . 90
5.5 Motivating examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.6 An example ICFG. The squares represent hardware nodes while ovals represent
basic blocks. The thick edges represent call edges between procedures and the
dotted lines represent the return edges from the procedures. . . . . . . . . . . . 94
5.7 Loading code template for hardware node hw. The condition for is expressed
as a product of sums. . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.8 Cascading ifs code template to be inserted at prefetch points . . . . . . . . 102
5.9 Speedups over baseline for 8-bits wide reconfiguration port running at 100MHz. . . 108
5.10 Speedups over baseline for 32-bits wide reconfiguration port running at
100MHz. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.11 Proximity to optimal by normalizing the range between baseline and optimal
(8 bits wide reconfiguration port). . . . . . . . . . . . . . . . . . . . . 111
5.12 Proximity to optimal by normalizing the range between baseline and optimal
(32 bits wide reconfiguration port). . . . . . . . . . . . . . . . . . . . 112
List of Tables
1.1 The top 8 most computationally-intensive loops for benchmark 429.mcf. . 5
3.1 The running times of exhaustive search, Hill-Climb algorithm and compressed
trace sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Experiment results showing how many times Tabu and Hill Climb slowed
down when not using SEQUITUR and neighborhood relationship. . . . . . 54
4.1 Characteristics of the two traces. . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Resource conflicts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Hardware modules, the hardware area occupied and the number of cycles
taken up in the application. . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 The regions selected for hardware implementation in the h264enc and
429.mcf benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Reconfiguration Overhead of 1 ReCoBus Slot for different bit-widths. . . . 105
5.3 Benchmarks with different placements. . . . . . . . . . . . . . . . . . . . . 107
Abstract
Run-time reconfigurable FPGA-based systems create both opportunities and challenges for
hardware-software codesign. On the one hand, it has been shown that significant speedups
can be obtained when computations are performed on the reconfigurable hardware fabric,
and that this speedup can be achieved without re-fabrication costs. On the other hand,
the virtualization of the hardware resources comes at a price. Hardware computation
modules have to be pre-loaded onto the FPGA prior to execution, and the time taken to
preload these modules can be significant. In order to obtain quality solutions for
implementing applications on these platforms, we need to navigate the trade-off between the
speedup achievable for individual components and the reconfiguration costs required to
load them.
Envisioning that run-time reconfigurable computing will be a major part of mainstream
computing, this thesis studies and proposes methodologies that can be incorporated into the
design process of single, sequential programs written in a high-level programming language
(e.g., Java or C) for reconfigurable computing platforms. This thesis makes the following
contributions.
First, we propose a novel design-space search framework for hardware-software
partitioning of a single, sequential program. A key feature of this framework is that it
facilitates the efficient computation of reconfiguration costs. Our definition of
neighborhood relationships between design points, when coupled with execution traces
encoded in a SEQUITUR grammar, speeds up reconfiguration cost estimation when the search
moves between neighboring design points. To our knowledge, this is the first work that
examines the problem of implementing neighborhood searches of both the temporal and
spatial partitioning space. Our experiments show that searches can be sped up by up to
two orders of magnitude when all the key features of our framework are employed.
Second, although the design-space search framework allows efficient computation of
reconfiguration costs, it assumes that the reconfiguration costs cannot be hidden
through techniques such as configuration prefetching, which can occur in parallel with
computation. In the second part of the thesis, we propose a novel, polynomial-time
algorithm that examines an execution trace and schedules placement-aware configurations to
minimize overall execution time. This algorithm is provably optimal, and our experiments
show a speedup of up to 40% when compared with schedules produced by online scheduling
algorithms that rely on hardware predictors.
Finally, we visit the problem of statically inserting configuration prefetching calls into
an Interprocedural Control-Flow Graph. While the algorithm described above yields
optimal schedules, the schedules that it produces are specific to particular input
execution traces. Using profiled execution frequencies of control-flow edges,
our proposed algorithm estimates, for each basic block, the placement-aware probabilities
of reaching hardware execution. The prefetches are then generated based on these
probabilities. Experiments show that our proposed algorithm makes significant improvements
over the




Reconfigurable computing is an alternative computing paradigm to the traditional von
Neumann model of computation. Typically, a general-purpose processor (GPP) is coupled with
reconfigurable hardware for the purpose of application acceleration. The word
‘reconfigurable’ refers to the fact that the hardware may be configured multiple
times (either during application run-time or prior to execution) to perform different
computations, in a way that is analogous to different programs being loaded into memory for
execution. This ‘softer’ hardware, sometimes termed configware and usually a
Field-Programmable Gate Array (FPGA), carves out a middle ground between the flexibility of
general-purpose processors and the high performance of traditional hardware. The
reconfigurability of the hardware allows rapid modification of the platform, decreasing
time-to-market and prototyping costs. Although the performance of carefully designed custom
hardware (e.g., Application-Specific Integrated Circuits (ASICs)) still surpasses that of
FPGAs, studies have shown that applications can be sped up by orders of magnitude compared
to running them on a general-purpose machine. Because of these advantages, and given the
increasing complexity and requirements imposed by applications, reconfigurable computing is
now considered a viable option, especially in embedded systems.
Although reconfigurable computing has been the subject of intensive research and
development during the past decade, its success remains beset by a dearth of automatic
design tools for the efficient implementation of applications written in
high-level languages (e.g., C, C++). In particular, most current state-of-the-art design
tools still assume that developers have a deep understanding of both hardware and software
design and that it is their responsibility to fully exploit the benefits of this approach.
On the other hand, traditional hardware-software co-design techniques cannot be easily
extended to such architectures. The main issue is that these techniques are applied to
small embedded systems in which the number of applications running in the system is small.
Furthermore, they do not consider run-time reconfiguration in general because their target
platform is ASICs instead of FPGAs. Thus, the overhead of run-time reconfiguration is not
considered in traditional approaches.
In this thesis, we focus on proposing novel methodologies that could be adapted into
design tools for reconfigurable computing platforms. These innovations in hardware-software
partitioning and reconfiguration scheduling seek to exploit the run-time reconfiguration
features of reconfigurable computing platforms for designs that are written in high-level
languages such as C.
In this chapter, we outline the motivation and problem overview of the thesis within
the context of FPGA Run-Time Reconfiguration (RTR) in Section 1.1. In Section 1.2, we
outline the major contributions and the organization of this thesis.
1.1 Motivation and Problem Overview
In traditional hardware-software codesign, it is usual practice to transfer a proportion of
the computation of an application to hardware for the purpose of application acceleration.
Assuming that a proportion P of the computation can be transferred to the custom hardware
(e.g., ASICs) and that this proportion P can be improved by a speedup of Q, Amdahl’s law
gives the overall speedup as
$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{Q}}$$





However, in the case of reconfigurable computing platforms, the situation is different.
The support for run-time reconfiguration of the hardware implies that the resources
available in that hardware can be time-multiplexed and shared by different computations.
Thus, for the same silicon area, having a run-time reconfigurable unit implies that the
proportion P that can be transferred onto the hardware (e.g., FPGA) can be larger than
before. While this may increase the potential for greater speedup, the speedup achievable
is offset by the overhead needed during run-time to configure the FPGA. By denoting this
overhead as R (taken here as a fraction of the original execution time), the achievable
speedup becomes
$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{Q} + R}$$




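To make the trade-off between speedup and reconfiguration overhead concrete, the following
minimal sketch evaluates an Amdahl-style speedup with and without the overhead R. It
assumes that R is normalised to the original software-only execution time and is simply
added to the accelerated run time; the function name and the sample values of Q and R are
illustrative assumptions, while P = 0.62 is the total reported in Table 1.1.

```python
def speedup(p: float, q: float, r: float = 0.0) -> float:
    """Amdahl-style speedup when a fraction p of the run time is accelerated
    by a factor q and run-time reconfiguration adds an overhead r.

    Assumption: r is normalised to the original (software-only) run time.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if q <= 0.0 or r < 0.0:
        raise ValueError("q must be positive and r non-negative")
    return 1.0 / ((1.0 - p) + p / q + r)


# P = 0.62 is the total proportion in Table 1.1; Q and R are made-up values.
static_speedup = speedup(0.62, 10.0)           # no reconfiguration overhead
dynamic_speedup = speedup(0.62, 10.0, r=0.10)  # overhead of 10% of the run time
print(f"without overhead: {static_speedup:.2f}x")
print(f"with overhead:    {dynamic_speedup:.2f}x")
```

Under these assumptions any non-zero R lowers the achievable speedup, which is why a larger
P obtained through time-multiplexing only pays off if the reconfiguration overhead is kept
small or hidden.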
Let us consider a single, sequential program, 429.mcf, that is to be executed on a machine
based on the abstract reconfigurable computing architectural model shown in Figure 1.1.
This application is taken from SPEC2006 [58]. In the architectural model that we are
considering, the CPU and the FPGA co-processor share the same memory address space
through a shared bus connection. The reconfiguration manager is responsible for
configuring the FPGA with hardware modules by loading bitstreams stored in memory onto the
FPGA. Note that the reconfiguration manager allows bitstreams to be loaded
in parallel with CPU execution. Although details may differ in actual implemen-







Figure 1.1: Abstract model of reconfigurable architectures.
architectures), this abstract model is fairly representative of the architectures that we are
interested in for application acceleration.
Table 1.1 shows the eight most computationally intensive loops of 429.mcf and the
proportion of the total computation time taken up by each¹. The loops are
indexed alphabetically, and the loop names are taken from the function names of the C code
and the loop ids suggested by the compiler². We shall use this benchmark information to
illustrate the complexity of implementing an optimized, efficient program on a
reconfigurable architecture.
Figures 1.2(a) and 1.2(b) show two FPGA models: a run-time reconfigurable FPGA
and a partially run-time reconfigurable FPGA. For the sake of convenience, we refer to
the former as rFPGA and the latter as pFPGA for the rest of the thesis. Any reference to
FPGA should be considered generic and applicable to both models of FPGA. We shall now
proceed to outline the factors involved when considering the implementation of 429.mcf
for an rFPGA and then for a pFPGA.
¹ The profile was taken by running 429.mcf on a PowerPC machine, and the number of cycles
was then scaled to the typical clock speed of PowerPC embedded processors.
² Note that while we have identified these loops as the computationally intensive portions
to be implemented in hardware, computationally intensive basic blocks or even functions
could be suitable candidates as well.
Table 1.1: The top 8 most computationally-intensive loops for benchmark 429.mcf.

Index  Loop Name              No. of Cycles         Proportion of Computation
       (func_name-loop_id)    (nearest million)     (fraction of total)
A      primal_bea_mpp-2       24206                 0.36
B      price_out_impl-2       8995                  0.13
C      refresh_potential-2    2682                  0.04
D      sort_basket-2          2632                  0.04
E      primal_iminus-1        1394                  0.02
F      primal_bea_mpp-4       1337                  0.02
G      sort_basket-3          830                   0.01
H      update_tree-1          60                    0.01
Total                                               0.62






(a) rFPGA: only one configuration can be loaded at a time.
(b) pFPGA: organized as an array of reconfigurable regions (numbered 0 to 4) marked by
dotted lines; multiple configurations can be loaded at the same time.
Figure 1.2: Two different FPGAs supporting runtime reconfiguration.
Figure 1.3: 429.mcf An example control flow graph. Basic blocks are shown in blue.
Hardware regions are shown in red.
1.1.1 Run-Time Reconfigurable FPGA
For the rFPGA, only one configuration may be loaded at any one time. However,
a configuration may be shared by several hardware modules. As shown in Figure 1.2(a),
hardware module g (the hardware instance of loop g) occupies a single configuration,
while the hardware instances of loops a and b share a configuration.
An FPGA provides routing resources (e.g., I/O pins, interconnect switches),
computation resources (e.g., lookup tables, hardware multipliers) and storage resources
(e.g., Block RAMs, flip-flops). The amount and type of resources used by a hardware
module depend upon its requirements. For example, modules
that read video input may need to use the I/O pins that are connected to the VGA inputs.
Other modules may require specific hardware resources such as hardware multipliers to
optimize the computation.
In the case where there are sufficient resources on the rFPGA for all the hardware
modules to be loaded, the solution becomes trivial: all the computationally intensive loops
are selected for hardware implementation. However, given the increasing complexity
of applications and the practical constraints imposed by cost and size considerations, this
luxury cannot realistically be enjoyed. In light of this, the following factors
affect the quality of the solution:
• The selection of a subset of the candidate kernels for hardware implementation.
If the rFPGA resources are insufficient for all hardware modules to be loaded at the
same time, we can choose only a subset of the kernels for hardware implementation.
We can still choose to have all the candidate kernels implemented in hardware,
but this entails reconfiguring the rFPGA during run-time so that the
FPGA resources can be virtualized and time-shared.
• The selection of suitable candidate kernels to share a configuration. Although
the rFPGA resources may be insufficient for all hardware modules to be loaded at
the same time, it may have sufficient resources to hold a subset of the hardware
instances of the candidate kernels. Consequently, a good solution needs to determine
which candidate kernels should share a configuration.
• The selection of one of the alternative implementations of the candidate kernels.
The search for a good solution becomes more complex when one considers that
each of these candidate kernels may have alternative implementations. In general,
the more parallelism that is exploited, the more resources are needed.
Thus, a good solution needs to strike a balance between the resources
used on the rFPGA and the speedup that may be obtained.
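The first factor can be illustrated with a small sketch that exhaustively selects the
subset of kernels that fits into a single configuration and covers the largest proportion
of the computation. The proportions for loops A to D come from Table 1.1; the area demands,
the capacity values and the function name are hypothetical and chosen only for
illustration. Exhaustive enumeration is used here purely for clarity and is practical
only for a handful of kernels.

```python
from itertools import combinations

# Proportions of computation for loops A-D, taken from Table 1.1.
PROPORTION = {"A": 0.36, "B": 0.13, "C": 0.04, "D": 0.04}
# Hypothetical area demands in arbitrary units; not taken from the thesis.
AREA = {"A": 6, "B": 4, "C": 3, "D": 2}


def best_static_subset(capacity: int) -> tuple[frozenset[str], float]:
    """Return the kernel subset that fits in one configuration of the given
    capacity and covers the largest proportion of the computation."""
    best: tuple[frozenset[str], float] = (frozenset(), 0.0)
    kernels = sorted(PROPORTION)
    for size in range(1, len(kernels) + 1):
        for subset in combinations(kernels, size):
            if sum(AREA[k] for k in subset) > capacity:
                continue  # this subset cannot be loaded at the same time
            covered = sum(PROPORTION[k] for k in subset)
            if covered > best[1]:
                best = (frozenset(subset), covered)
    return best


chosen, covered = best_static_subset(capacity=10)
print(sorted(chosen), round(covered, 2))
```

This sketch covers only the static case with one implementation per kernel. Allowing
run-time reconfiguration or alternative implementations enlarges the search space
considerably, which is the difficulty the remaining factors describe.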
Therefore, a quality solution needs to consider both temporal and spatial partitioning.
Temporal partitioning is the selection of configurations that are loaded into the
rFPGA during run-time (hence time-sharing or time-partitioning the rFPGA). Spatial
partitioning is the selection of candidate kernels that share the rFPGA resources within
one configuration (hence spatially sharing the rFPGA). When considering both spatial and
temporal partitioning, a design point falls into one of the following categories, as shown
in Figure 1.4:
• Static single kernel (SS): A single kernel is implemented in hardware without
dynamic reconfiguration. Neither spatial nor temporal partitioning is required. In
Figure 1.4, loop a is selected to be realized in hardware. Loops b, c and d are executed
in software.
Strategy   t0      t1      t2      t3
SS         {a}     {a}     {a}     {a}
SM         {a,c}   {a,c}   {a,c}   {a,c}
DS         {a}     {a}     {c}     {c}
DM         {a,b}   {a,b}   {c,d}   {c,d}
Figure 1.4: Four partitioning strategies for hardware software codesign.
• Static multi-kernel (SM): Hardware is spatially partitioned among multiple kernels.
However, there is no dynamic reconfiguration, i.e., no temporal partitioning. In
Figure 1.4, loops a and c share the hardware. Loops b and d are executed in software.
• Dynamic single kernel (DS): Hardware is temporally partitioned such that at any
point exactly one kernel occupies the entire hardware. There is no spatial partitioning
in this case. In Figure 1.4, loop a and loop c occupy the hardware at time t0 and t2,
respectively. Loops b and d are executed in software.
• Dynamic multi-kernel (DM): Hardware is both spatially and temporally partitioned.
In Figure 1.4, loops a and b share the hardware between time t0 and t2, and they are
swapped out for loops c and d at time t2.
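The four categories above follow directly from two yes/no questions: does the loaded
configuration change over time, and does any configuration hold more than one kernel? The
sketch below encodes that reading. The representation of a design point as a list of
kernel sets, one per time step, and the function name are assumptions made for
illustration only.

```python
def classify(schedule: list[frozenset[str]]) -> str:
    """Classify a design point as SS, SM, DS or DM.

    `schedule` lists the set of kernels resident in hardware at each time
    step, e.g. [{a}, {a}, {c}, {c}] for the DS example of Figure 1.4.
    """
    if not schedule or any(len(config) == 0 for config in schedule):
        raise ValueError("every time step needs a non-empty configuration")
    # Temporal partitioning: the resident configuration changes over time.
    temporal = len(set(schedule)) > 1
    # Spatial partitioning: at least one configuration is shared by kernels.
    spatial = any(len(config) > 1 for config in schedule)
    return ("D" if temporal else "S") + ("M" if spatial else "S")


a, c = frozenset({"a"}), frozenset({"c"})
ab, cd = frozenset({"a", "b"}), frozenset({"c", "d"})
for name, schedule in [("single", [a, a, a, a]), ("swap", [a, a, c, c]),
                       ("shared swap", [ab, ab, cd, cd])]:
    print(name, classify(schedule))
```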
1.1.2 Partially Run-Time Reconfigurable FPGA
The configuration architecture of the pFPGA is organized as a one-dimensional array of
minimum columnar reconfigurable regions, marked out by dotted lines in Figure 1.2(b). In
the example shown in the diagram, there are 5 such reconfigurable regions, numbered 0 to
4.
Instead of having to occupy the entire pFPGA, a configuration only needs to span across
a multiple of the minimum reconfiguration regions. One particular feature of such pFPGAs
is their ability to perform computation in parallel with reconfiguration under the condition
that the regions being reconfigured are not the same as those performing the computations.
Thus if hardware module b is to be replaced by loading in hardware module c, hardware
module a can continue computation without interruption. This highlights one of the ma-
jor differences between this and the previous configuration architecture. Namely, while
rFPGAs allow reconfiguration to be in parallel with software execution, pFPGAs allow
reconfiguration to be in parallel with both software and hardware execution.
However, this added advantage increases the complexity of the problem. In
addition to the three factors mentioned above in Section 1.1.1, a good quality solution that
targets the partially RTR FPGA needs to consider the following factors as well.
• Selection of placements of configurations We refer to the exact locations that hardware
modules occupy on the pFPGA as placements. In Figure 1.2(b), module a
occupies reconfiguration regions 0 and 1. Hence the placement of module a begins
at 0. Placements of hardware modules (i.e., the reconfigurable regions they occupy)
are usually decided at design time for three reasons. Firstly, if the hardware modules
require specific I/O, the positions of the I/O pins constrain the region the hardware
module can occupy. Secondly, since the placement of a hardware module is usually
embedded within the configuration bitstream information, relocation of hardware
modules during run-time could be a costly operation. Thirdly, since hardware modules
may require specific resources, as mentioned above, their placement is constrained
to reconfigurable regions that contain such resources.
Hardware modules that share the same reconfigurable regions are said to be in ‘conflict’
with one another (i.e., if a and b are in conflict, they cannot be loaded on the
pFPGA at the same time). The rFPGA could be considered as a pathological case
of having only one reconfigurable region, thus making every distinct configuration
in conflict with every other. In general, a configuration for a pFPGA only conflicts
with a subset of other configurations. Thus, the placements of configurations affect
the number of conflicts and, since configurations in conflict cannot be loaded on
the pFPGA at the same time, the number of conflicts in turn influences the number
of reconfigurations that occur during run-time and hence the overall reconfiguration
overhead as well.
• Reconfiguration scheduling As noted above, pFPGAs make it possible for reconfiguration
to occur in parallel with both hardware and software execution. It is not
trivial to schedule reconfiguration for a single, sequential program that is presented
as a control flow graph, as shown in Figure 1.3, because the control flow of the execution
changes dynamically during run-time.
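The notion of conflict introduced in the first factor above can be made concrete with a small sketch. It assumes a placement is described by a starting region and a width in regions; the Placement class and the placements chosen for modules b and c are hypothetical, and only the placement of module a (regions 0 and 1) is taken from the text.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Placement:
    module: str
    start: int  # index of the first reconfigurable region occupied
    width: int  # number of consecutive regions occupied

    def regions(self) -> Set[int]:
        return set(range(self.start, self.start + self.width))


def in_conflict(p: Placement, q: Placement) -> bool:
    """Two modules conflict if they share at least one region."""
    return bool(p.regions() & q.regions())


def conflict_pairs(placements: Iterable[Placement]) -> Set[Tuple[str, str]]:
    return {(p.module, q.module)
            for p, q in combinations(list(placements), 2)
            if in_conflict(p, q)}


# pFPGA with regions 0..4: a occupies regions 0-1 (as in the text);
# the placements of b and c are made up for illustration.
pfpga = [Placement("a", 0, 2), Placement("b", 2, 2), Placement("c", 3, 2)]
print(conflict_pairs(pfpga))   # only b and c overlap (region 3)

# rFPGA modelled as a single region: every pair of modules conflicts.
rfpga = [Placement(m, 0, 1) for m in ("a", "b", "c")]
print(len(conflict_pairs(rfpga)))
```

In this example module a can keep computing while b is replaced by c, since a shares no region with either of them.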
In summary, in order to implement an efficient, accelerated version of a single, sequential
program for a machine based on the model shown in Figure 1.1, the factors mentioned above,
especially with regard to run-time reconfiguration overhead, need to be considered. It should
be noted that these factors are inter-dependent.
1.2 Contributions and Thesis Organization
The main aim of this thesis is to study the effects of run-time reconfiguration overhead and
to propose novel methodologies that address run-time reconfiguration issues and that can be
incorporated into the design process of applications for reconfigurable computing. Run-time
reconfiguration overhead can be increased by a poor choice of candidate
kernels and by mis-prefetches during configuration scheduling. Consequently, a design
framework for applications on reconfigurable computing platforms needs to factor in these
issues to achieve application acceleration.
The main contributions of this thesis are the development of methodologies that can be
incorporated into such frameworks. Specifically, this thesis makes the following contribu-
tions to the state of the art:
1. For a design space that spans both temporal and spatial partitioning, we have implemented
a framework in which efficient neighborhood searches can be implemented.
By carefully defining the neighborhood relationship between the design points,
the run-time reconfiguration cost can be efficiently computed when moving from
one design point to its neighbor during the search. Our experiments show that
employing this framework speeds up neighborhood searches (specifically, hill
climbing and Tabu search) by up to two orders of magnitude, compared to implementations
of these neighborhood searches that either a) do not use the framework at
all or b) use the framework only partially.
2. We present a novel, polynomial time algorithm for scheduling reconfiguration given
an execution trace of hardware modules that is both provably optimal and placement-
aware. The algorithm includes a dependence analysis to determine whether for each
instance of hardware module execution, a reconfiguration task is needed prior to its
execution in hardware. A formal proof that our scheduling algorithm is optimal with
respect to the application’s overall execution time is also given. Our experiments
demonstrate how previously proposed online scheduling algorithms fare in compari-
son with the optimal algorithm.
3. We present a novel static reconfiguration scheduling algorithm for programs defined
in control-flow graphs. Using profiling information, we first perform an interpro-
cedural, path-sensitive reachability analysis of the control flow graph. The analysis
estimates for each basic block, the probability of reaching a hardware module with-
out encountering conflicting configurations on the way. Our experiments show that
the proposed novel algorithm performs better than current state-of-the-art algorithms
across varying sets of conflicting hardware modules.
This thesis is organized as follows: In Chapter 2, we give an overview of FPGAs and
discuss previous and related work in reconfigurable computing, especially work in
reconfigurable architectures, hardware-software partitioning and configuration prefetching; in
Chapter 3, we present the neighborhood search framework for the efficient implementation
of neighborhood searches of the hardware-software partitioning design space; in Chapter
4, we present MLS, a provably optimal reconfiguration scheduling algorithm for an exe-
cution trace of hardware modules; in Chapter 5, we present a static scheduling algorithm
for a control-flow graph specification. Finally, we conclude the thesis by summarizing the





Reconfigurable computing has been the subject of intensive research for the past decade.
However, research on implementing applications written as control-flow specifications (or
programs written in high-level languages) has often focused on synthesizing the specification
for hardware implementation. In comparison, there have been fewer works that
focus on hardware-software partitioning and configuration scheduling for such applications.
In this chapter, we review the background and related works of this thesis. In
Section 2.1, we begin with an overview of FPGAs, especially with regard to the reconfiguration
technology available in commercial chips such as Xilinx Virtex II Pro and Xilinx
Virtex 4. In Section 2.2, we present a classification of reconfigurable architectures and describe
some of their distinguishing features. In Section 2.3, we review some of the key works in
hardware-software partitioning for run-time reconfigurable computing. In Section 2.4, we















(b) A 4-LUT can be programmed with a 16 bit SRAM
Figure 2.1: Basic computation units in modern FPGAs.
2.1 FPGA Overview
Field-programmable gate arrays (FPGAs) are integrated circuits that are re-programmable
after fabrication (hence field-programmable). In contrast with ASICs, which are set in stone
once fabricated, FPGAs are reprogrammable (i.e., more flexible) and therefore have a low
non-recurring engineering cost. Traditionally, they have been used as prototyping platforms
for chip design, but due to the increasing complexity of application requirements there
have been increasing attempts to use FPGAs as co-processors for program acceleration.
The main reason why FPGAs are ‘programmable’ is the insight that any computation
can be represented as a Boolean equation. In turn, any Boolean equation can be expressed
as a truth table. Lookup tables (LUTs) can be used to implement any truth table. Using
these truth tables as the basic unit of expression of a computation, complex and even
conditional statements (e.g., classic statements such as if-then-else) can be expressed by
combining these LUTs to form a complex computational structure. LUTs give FPGAs the
generality to implement arbitrary digital logic.
LUTs alone are insufficient when state information needs to be held, especially
for recursive/iterative computations that depend on the results from previous states.
However, if we add a flip-flop register and a multiplexer to the LUT as shown in Figure 2.1(a),
the new circuit is able to hold previous state information. Figure 2.1(a) shows a typical
example of a logic block that forms the basic unit of computation for an FPGA. Depending
on the multiplexer input bit, this piece of logic will either output the previous value held in
the flip-flop register or the output from the LUT directly.
This logic block can be programmed through the use of SRAMs. For example, by initializing
the 16-bit SRAM in Figure 2.1(b) with appropriate values, we can implement any
4-input Boolean function. The flip-flop register and multiplexer can also be initialized with
single-bit SRAMs. The size of the LUT was the subject of considerable research[56]. The
4-LUT is the usual size for modern FPGAs, although the new Virtex-5 SRAM-based FPGA
from Xilinx has a 6-LUT architecture. Figure 2.2 shows a typical island-style FPGA where
the individual logic block ‘islands’ are organized in a mesh-like matrix in a ‘sea’ of
interconnects.
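To illustrate how a 16-bit SRAM programs a 4-LUT, and how the flip-flop and multiplexer add state, the following is a minimal behavioural sketch. The names make_lut4 and LogicBlock, the bit ordering of the truth table (input a is the least significant index bit) and the latch-on-every-clock behaviour are assumptions made for illustration; they do not describe any particular device.

```python
def make_lut4(sram_bits: int):
    """Return a 4-input LUT whose truth table is the 16-bit value sram_bits.

    Bit i of sram_bits is the output for the input combination whose
    index is i = (d << 3) | (c << 2) | (b << 1) | a.
    """
    if not 0 <= sram_bits < (1 << 16):
        raise ValueError("a 4-LUT is programmed with exactly 16 bits")

    def lut(a: int, b: int, c: int, d: int) -> int:
        index = (d << 3) | (c << 2) | (b << 1) | a
        return (sram_bits >> index) & 1

    return lut


class LogicBlock:
    """LUT followed by a flip-flop, with a multiplexer selecting the output."""

    def __init__(self, sram_bits: int, use_register: bool):
        self.lut = make_lut4(sram_bits)
        self.use_register = use_register  # multiplexer select bit
        self.flip_flop = 0                # value held from the previous clock

    def clock(self, a: int, b: int, c: int, d: int) -> int:
        """Return the block output for this cycle, then latch the LUT output."""
        lut_out = self.lut(a, b, c, d)
        out = self.flip_flop if self.use_register else lut_out
        self.flip_flop = lut_out
        return out


and4 = make_lut4(0x8000)     # only index 15 (all inputs 1) yields 1
parity4 = make_lut4(0x6996)  # 1 whenever an odd number of inputs is 1
print(and4(1, 1, 1, 1), and4(1, 0, 1, 1), parity4(1, 0, 0, 0))
```

With use_register set, the block outputs the value computed in the previous cycle, which is what allows iterative computations to carry state from one step to the next.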
2.1.1 Run-time Partial Reconfiguration of FPGAs
Xilinx has supported partial run-time reconfiguration since the Virtex II chip[71]. Instead
of reconfiguring the entire FPGA, synthesis tools provided by Xilinx are able to generate
partial bitstreams that configure only parts of the chip. The smallest reconfigurable unit of
the Xilinx FPGA series is called a ‘frame’. In Virtex II chips, the frames are 1-bit wide
and span the entire height of the device, and hardware
module designs must occupy at least one frame. The fact that each frame spans the height
of the FPGA implies a potential inefficiency in terms of resource utilization because the
hardware module design may not be placed and routed in such a way as to fill up the entire
height of the device. So, even if the hardware module utilizes only half the height of the
device, the entire height still needs to be configured. Starting from the Virtex 4[75] chips,
each frame has a fixed height that forms the unit of the column height of the device. This
means that each column consists of a whole number of frames. For example, each column of
Virtex 4 LX25 chips contains 12 frames and each frame has a fixed size of 328 bits. This
configuration architecture improves resource utilization and hence the efficiency of run-time
reconfiguration as well. Figure 2.3 shows the difference between the two architectures.
Figure 2.2: A typical island-style FPGA. The interconnect shown is an abstraction and not
intended to represent realistic implementations of the FPGA.
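The utilization argument above can be illustrated with a small calculation. The sketch below compares how many rows of the device must be configured for a module under full-height frames and under fixed-height frames. The function name configured_rows, the 64-row device and the 16-row frame height are hypothetical values chosen for illustration, not device data.

```python
import math
from typing import Optional


def configured_rows(module_rows: int, device_rows: int,
                    frame_rows: Optional[int] = None) -> int:
    """Rows that must be configured to load a module of module_rows rows.

    frame_rows=None models frames that span the full device height;
    otherwise frames have a fixed height of frame_rows rows.
    """
    if not 0 < module_rows <= device_rows:
        raise ValueError("module must fit within the device height")
    if frame_rows is None:
        return device_rows
    whole_frames = math.ceil(module_rows / frame_rows)
    return min(device_rows, whole_frames * frame_rows)


device_rows = 64   # hypothetical device height
module_rows = 32   # module using half the device height
full_height = configured_rows(module_rows, device_rows)
fixed_height = configured_rows(module_rows, device_rows, frame_rows=16)
print(module_rows / full_height, module_rows / fixed_height)
```

Under these assumed numbers, the half-height module uses only half of what is configured with full-height frames, but all of what is configured with fixed-height frames.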
2.1.2 Heterogeneous Processing Elements
Figure 2.2 shows a basic FPGA architecture consisting of a two-dimensional array of
homogeneous logic blocks, which is reflective of traditional FPGAs. However, modern
FPGAs are increasingly heterogeneous. Besides the traditional LUTs, hardware multipliers,
BRAMs (in Virtex II series) and even hardcore PowerPCs (in Virtex II Pro chips [70])
have been embedded into the design of modern FPGAs. The current Virtex 6 [76] chips
contain DSP processors within their core. Recent research[28] has studied the advantage of
embedded floating point cores within the FPGA. All these show that the trend of having
heterogeneous processing elements is here to stay. These developments provide a greater
challenge for the placement problem described in Chapter 1.
(a) Virtex-II Configuration Architecture
(b) Virtex-4 Configuration Architecture
Figure 2.3: Different configuration architectures of Xilinx FPGAs.
2.2 Reconfigurable Architectures
A reconfigurable system is loosely defined as any architecture in which a general processor
is augmented with reconfigurable fabric. Kiran[5] identified a wide spectrum of architec-
tures which may be termed as such. The architectural details will have an impact upon
hardware software codesign for the system. For example, loose coupling between the pro-
cessor and the reconfigurable hardware will imply communication overhead that needs to
be taken into consideration when designing applications for the system. The following
aspects clarify the distinctions between the various architectures:
2.2.1 Types of Coupling with Host
In most reconfigurable systems, the processor is known as the host. There is a master-
slave relationship between the processor and the attached reconfigurable logic. The host
is expected to configure the programmable logic with the appropriate configuration before
initiating computation. The host is also responsible for marshaling the proper inputs and
reading back outputs produced by the configured computation. The communication between
the host and the logic is a potential bottleneck. Roughly speaking, the types of coupling
can be classified as follows (following the classifications given in [5]):
• Loose Coupling through I/O peripherals A typical example of a system that employs
this kind of coupling would be the situation where an FPGA board is attached to the
CPU, communicating with the CPU through the PCI bus interface. The communication
overhead between the host and the FPGA is high. Also, data access is expensive
for the reconfigurable logic since it does not share the same memory interface as the
host processor. As such, the granularity of computation which can be put into the
hardware must be large; otherwise the communication overhead would outweigh
any speedups that may be gained.
• Loose Coupling on a Chip Level An example of this would be PRISM-1[2], which
consists of a 10 MHz M68010 processor directly linked to four Xilinx 3090 FPGAs
through a direct 16-bit bus connection. Although this is an improvement upon the
previous type of coupling, PRISM[2] reported a communication overhead of 48 to
72 processor cycles to transfer input arguments, which is significant.
• Tight Coupling on Single-Chip Considerable academic and commercial research
effort has produced various architectures where the host and the reconfigurable
fabric are placed on the same chip, effectively reducing the communication
overhead to the minimum. This is part of the trend towards the ‘System
on a Chip’, a trend that was encouraged by the increasing silicon density predicted by
Moore’s law. GARP[27], Chimaera[77], AEPIC[62], eMIPs[51] and MOLEN[66]
are examples of such systems. The advent of these reconfigurable systems, which
promise potential performance gains, creates the need for the
design and implementation of new hardware-software partitioning algorithms.
• Multi-cores including Reconfigurable Fabric This is an extension of the previous
category, in which a single general processor is traditionally tightly coupled with one
reconfigurable co-processor on the same chip. Silicon technology has now developed
to a stage where it is possible for multiple CPU cores to be placed on the
same chip. One of these cores could be a reconfigurable FPGA. The Intel FSB-FPGA[33],
with FPGA accelerators plugged into Intel Xeon processor sockets, is a
prime example of this recent development.
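The granularity argument made for the loosely coupled systems above can be checked with simple arithmetic: offloading a kernel pays off only if the hardware execution time plus the communication overhead is smaller than the software execution time. The sketch below uses the upper end of the 48 to 72 cycle overhead reported for PRISM; the kernel cycle counts and the function name offload_speedup are hypothetical.

```python
def offload_speedup(sw_cycles: float, hw_cycles: float,
                    comm_cycles: float) -> float:
    """Speedup of running a kernel in hardware, including communication.

    A value below 1.0 means offloading makes the kernel slower.
    """
    return sw_cycles / (hw_cycles + comm_cycles)


COMM_CYCLES = 72  # upper end of the overhead reported for PRISM

# Hypothetical kernels, each 2.5x faster in hardware before communication.
small_kernel = offload_speedup(sw_cycles=100, hw_cycles=40,
                               comm_cycles=COMM_CYCLES)
large_kernel = offload_speedup(sw_cycles=1000, hw_cycles=400,
                               comm_cycles=COMM_CYCLES)
print(f"small: {small_kernel:.2f}x, large: {large_kernel:.2f}x")
```

With these assumed numbers the small kernel becomes slower when offloaded, while the large kernel still retains most of its raw hardware speedup.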
2.2.2 Interface with Reconfigurable Logic
The way the host interfaces with the reconfigurable logic depends on the way the
reconfigurable logic is coupled with the processor. In general, reconfigurable logic can be
placed on the datapath of the general processor and serve as a reconfigurable functional
unit (RFU), or it can be a co-processor akin to dual-core architectures. Chimaera[77] and
Stretch[1] are examples of the former while GARP[27] and AEPIC[62] are examples of
the latter.
In order to create new custom instructions for architectures like Chimaera and Stretch,
data-flow graphs are constructed and subgraph patterns are mapped into custom instructions.
The RFUs that execute these instructions usually do not access memory and impose
a restriction on the number of inputs and outputs. The execution of the RFUs is usually
transparent to the CPU. However, the ISA design for these architectures must accommodate
space for opcode extension. As a result, there is a limit to the number of custom instructions
that can be supported by such systems.
By contrast, the interface between the CPU and the reconfigurable logic for architec-
tures such as GARP and AEPIC resembles function call semantics. As the reconfigurable
logic is a co-processor that shares the same bus as the CPU instead of being placed on the
datapath, the communication overhead when executing the reconfigurable logic is higher.
However, the advantage is that larger granularity of computation can be accelerated and the
system places no limits on the number of hardware accelerators supported.
2.2.3 Reconfiguration Latency Hiding
As mentioned in the previous chapter, reconfiguration latency is a potential performance
bottleneck in these systems. Some architectures reduce the reconfiguration latency by
supporting multiple configuration contexts in their configurable hardware. AEPIC’s[62]
reconfigurable hardware is called Multiple-context Reconfigurable Logic Array (MRLA), which
consists of a two-dimensional array of programmable logic and interconnection blocks,
collectively known as programmable elements. With a configuration store associated
with each programmable element, the entire programmable hardware is effectively an
array of FPGAs. When presented with a context id, the various programmable elements
employ the configuration that the context id indexes in the configuration store.
Chimaera[77] reports a similar structure, supporting multiple contexts for its Reconfigurable
Functional Unit (RFU). These hardware features are further supported by the presence
of configuration caches in both cases. That is, when a particular configuration is needed
but not present in the configuration store, it may be swiftly fetched
from the configuration cache, avoiding a full reconfiguration of the hardware. This scheme
can be further improved by a hierarchy of configuration memories analogous
to the memory hierarchies in traditional computer architectures. On the other hand,
not all reconfigurable architectures support multiple contexts through the above
mentioned method. GARP[7] reduces the reconfiguration overhead by having a wide data
path between the GARP array and memory, coupled with the use of its configuration
cache.
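The lookup order described above (configuration store first, then configuration cache, then a full reconfiguration) can be sketched as a two-level hierarchy. The class name ConfigHierarchy, the slot counts and the least-recently-used replacement policy are assumptions made for illustration; the surveyed architectures may use different policies.

```python
from collections import OrderedDict


class ConfigHierarchy:
    """Two-level configuration memory: on-fabric contexts backed by a cache."""

    def __init__(self, store_slots: int, cache_slots: int):
        self.store = OrderedDict()  # contexts held in the configuration store
        self.cache = OrderedDict()  # configurations held in the cache
        self.store_slots = store_slots
        self.cache_slots = cache_slots

    @staticmethod
    def _touch(level: OrderedDict, slots: int, config: str) -> None:
        level[config] = True
        level.move_to_end(config)      # mark as most recently used
        while len(level) > slots:
            level.popitem(last=False)  # evict the least recently used entry

    def activate(self, config: str) -> str:
        """Make config the active context and report where it was found."""
        if config in self.store:
            source = "store"    # context switch only
        elif config in self.cache:
            source = "cache"    # fetched from the configuration cache
        else:
            source = "memory"   # full reconfiguration from main memory
        self._touch(self.cache, self.cache_slots, config)
        self._touch(self.store, self.store_slots, config)
        return source


fabric = ConfigHierarchy(store_slots=2, cache_slots=3)
trace = ["a", "b", "c", "a", "d", "a"]
sources = [fabric.activate(c) for c in trace]
print(list(zip(trace, sources)))
```

In this trace, the second use of a misses the two-slot store but hits the cache, and the third use of a hits the store directly.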
Many of the surveyed reconfigurable architectures are attempts to build better silicon
devices for reconfigurable computing. Unfortunately, while many innovations and advances
have been proposed, as of now, almost none of these systems are commercially
viable. For this reason, this thesis focuses on developing new codesign methodologies
that target systems coupled with FPGAs, especially from the Xilinx family of FPGA
chips.
2.3 Hardware Software Partitioning
The subject of hardware-software partitioning[34, 16] has been extensively researched over
the years. Stated informally, the problem of hardware software partitioning is to find a
suitable assignment of the various portions of an application to either hardware or software
implementation so that certain performance objectives (e.g., minimal overall execution time,
energy savings) can be achieved. In the early 90s, the hardware-software partitioning problem
was being solved for traditional architectures (i.e., CPU with ASIC). While it may be
possible to extend the standard techniques proposed then for these traditional architectures,
run-time reconfigurable architectures present new challenges to an old problem.
As we have indicated so far in this thesis, minimizing reconfiguration overhead is critical
for obtaining a quality solution. However, this overhead is in turn influenced by factors
such as latency hiding and the heterogeneity
of resources available on the FPGA. There have been several works on hardware-software
partitioning [17, 40, 10, 22, 46, 60, 52, 3] over the years. However, most of these works
do not consider both temporal and spatial partitioning. In the following, we offer brief
descriptions of certain notable and related projects.
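As a concrete instance of the informal problem statement above, the following sketch solves a tiny static partitioning problem by exhaustive enumeration: choose the set of kernels to implement in hardware, subject to an area budget, so that total execution time is minimal. It deliberately ignores reconfiguration, so it corresponds to the traditional (CPU with ASIC) setting rather than the run-time reconfigurable one. The function name best_partition and all kernel figures are hypothetical, and exhaustive search is used only because the example is small.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Tuple

# kernel name -> (software time, hardware time, hardware area)
Kernel = Tuple[int, int, int]


def best_partition(kernels: Dict[str, Kernel],
                   area_budget: int) -> Tuple[int, FrozenSet[str]]:
    """Return (total time, kernels in hardware) with minimal total time."""
    names = sorted(kernels)
    # Baseline: everything runs in software.
    best = (sum(sw for sw, _, _ in kernels.values()), frozenset())
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            area = sum(kernels[n][2] for n in subset)
            if area > area_budget:
                continue  # does not fit on the hardware
            total = sum(kernels[n][1] if n in subset else kernels[n][0]
                        for n in names)
            if total < best[0]:
                best = (total, frozenset(subset))
    return best


kernels = {
    "a": (100, 20, 3),
    "b": (80, 30, 2),
    "c": (50, 10, 4),
}
print(best_partition(kernels, area_budget=5))
```

With an area budget of 5, only kernels a and b fit together, giving a total time of 20 + 30 + 50 = 100 against 230 for the all-software solution.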
Nimble
The Nimble compiler[40] is built on top of the garpcc[6] to address hardware-software
partitioning issues not dealt with in garpcc. Given a set of loops within the program code,
the Nimble compiler seeks to choose between the loops for implementation in hardware
through a cost function estimate. This work exploited neither partial reconfiguration
nor pre-fetching. Rather, it heuristically tried to identify loops which may compete for the
hardware through the loop-procedure hierarchy graph. After clustering such loops which
may be in contention, the compiler selects loops to be moved into hardware so that overall
the associated execution model is detailed in the following
sub–sections.
4.1 Execution Model for Loop Handling
When a Lblk structure is encountered in BBIF, the entire
loop is grouped into a single temporal partition if the area
constraint is not violated. This ensures that the correspond-
ing temporal partition spends a significant amount of time
in execution. The execution time can be maximally over-
3
(a) Division of the hardware resource
(b) Pipelining the application into temporal segments
Figure 2.4: Using pipelining to reduce reconfiguration costs.
performance may be optimized, having estimated the reconfiguration costs. The Nimble
compiler is benchmarked against a locally optimal algorithm and an idealized performance
upper bound for the benchmarks. Experiments have shown that the Nimble compiler gives
near-optimal solutions when compared with the performance upper bound.
Physically-aware Hardware-Software Partitioning
Banerjee et al. [3] have focused on solving a co-partitioning and scheduling problem for task
graphs. The proposed algorithm determines, for every task, whether it is implemented in
hardware and, if so, when to configure the task and when to execute it so as to minimize the
length of the entire schedule. The main motivation for their work is the fact that existing scheduling
algorithms may suffer from producing so-called ‘optimal’ schedules that are physically
unrealizable due to placement constraints. Figure 2.5 shows a schedule where T1 and T3
occupy one column each while T2 occupies two columns, assuming that the resource
contains only 4 columns. Although by the time t2, two columns are free to be reconfigured,
Figure 2.5: An example of infeasible placement.
algorithms that do not consider physical placement of hardware may end up with physically
unrealizable reconfiguration schedules. Therefore, Banerjee’s work has focused on tackling
this issue by incorporating hardware placement into the partitioning algorithm, using an
extension of the KLFM heuristic proposed by Vahid et al. [65].
Temporal Partitioning with Pipelining
Vemuri et al. [22] integrate partial reconfiguration with temporal partitioning to improve
overall performance. The key to their technique lies in dividing the pFPGA into two specific
reconfigurable regions, as shown in Figure 2.4(a). By pipelining the application into
temporal partitions (TPs), as shown in Figure 2.4(b), we can overlap computation with
reconfiguration to reduce the reconfiguration overhead. For example, the execution of
TP1 overlaps with the configuration of TP2 on the hardware. The reconfiguration over-




Huynh et al. [30] have focused on the problem of performing both spatial and temporal
partitioning for run-time reconfigurable custom instructions. For a set of N candidate custom
instructions, they seek to find a suitable subset for hardware implementation while
reducing the run-time reconfiguration overhead at the same time. The proposed iterative
algorithm relies on the insight that the optimal solution could choose to implement n
custom instructions, where 0 ≤ n ≤ N. This is the only work we know of that performs
both spatial and temporal partitioning for control-flow specified applications.
2.4 Configuration Scheduling
Techniques to reduce the reconfiguration overhead include configuration compression [41,
35], configuration caching [41], and configuration prefetching [26, 42, 11, 18]. While
configuration compression is supported in commercial FPGAs, the proposed chips that
support configuration caching have yet to be fabricated. We shall focus on configuration
prefetching in this section.
Although configurations can be loaded as and when they are requested, a significant
overhead is incurred when the entire execution stalls while waiting for the loading to com-
plete. Instead, configurations can be requested to be loaded ahead of time in anticipation
of their usage. This process is called prefetching and, as indicated in Chapter 1, is done so
that reconfiguration can occur in parallel with application execution. Although this is an
effective method for reducing the reconfiguration overhead, one must guard against
mis-prefetches, which may evict hardware modules that are needed later.
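As a rough illustration of why prefetching helps, the sketch below compares on-demand loading with an idealized prefetch. It assumes a deliberately simple model that is not taken from the cited works: the application alternates software segments and hardware kernel calls, every kernel call needs a fresh configuration load of fixed duration rho, and a prefetch is issued at the start of the preceding software segment and is never mispredicted. The function and parameter names are hypothetical.

```python
def total_time(segments, rho, prefetch):
    """Total run time for a list of (sw_time, hw_time) pairs.

    Assumed model: each hardware kernel needs one configuration load of
    duration rho. Without prefetching the load starts when the kernel is
    requested; with prefetching it starts together with the preceding
    software segment and overlaps with it.
    """
    total = 0.0
    for sw_time, hw_time in segments:
        if prefetch:
            # Only the part of the load not hidden by software execution stalls.
            stall = max(0.0, rho - sw_time)
        else:
            # The whole load is exposed as stall time.
            stall = rho
        total += sw_time + stall + hw_time
    return total


segments = [(5.0, 2.0), (1.0, 2.0)]
on_demand = total_time(segments, rho=3.0, prefetch=False)
prefetched = total_time(segments, rho=3.0, prefetch=True)
```

In this toy model the first load is hidden completely behind a long software segment, while the second is hidden only partially because the preceding software segment is shorter than rho.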
Early work [26, 63] focused on prefetching for reconfigurable fabrics that can hold only
one configuration at a time. Later works have targeted platforms with multi-context
FPGAs [54, 46, 47, 23]. More recently, research [59, 29] has begun to focus on scheduling
configurations at the operating system level. Prefetching is also
being employed in high-performance computing research [15] that attaches a reconfigurable
device to a supercomputer. The proposed solutions may be divided into two categories:
online (or run-time) and offline approaches.
2.4.1 Online Scheduling
Online scheduling requires hardware support and additional storage to keep track of historical
information and monitor the dynamic state of execution. The advantage of online
approaches is that they are highly adaptive and do not require access to the
application’s source code. Noguera’s technique [46, 47] relies on an event
window by which the system anticipates the next hardware event to be configured. Their
work targets a special hardware architecture with multiple homogeneous rFPGA devices,
and the method prefetches the highest-priority task within the event window
as new incoming tasks occur. Fu [21] proposed another window-based scheduler that
uses a multi-constraint knapsack approach to select the configurations with the best speedup for
the next window of time. Li [42] and Chen [11] proposed hardware-based predictors based
on building a Markov chain and a least-mean-square model at run time. Banerjee et
al. [4] proposed an online scheduling heuristic for a linear sequence of tasks (a task chain)
that takes into consideration the bandwidth required by the tasks as an added constraint.
Although this work targets pFPGAs, it does not prefetch tasks out of order because the
focus is on scheduling the tasks to achieve maximum data parallelism. Resano [54] proposed
a hybrid scheme that schedules prefetches for multiple dynamically reconfigurable
hardware devices, but their approach does not consider Xilinx-like pFPGAs. This thesis shows in
Chapter 4 that the state-of-the-art online prefetching techniques are still considerably suboptimal.
We expect online scheduling to be helpful in particular because reconfigurable
systems are moving into mainstream computing.
2.4.2 Offline Scheduling
Offline scheduling can be classified into two categories. The first takes a task graph as the
specification of the input application, while the second takes a control-flow graph.
This difference is not trivial because the task graph is usually a DAG in
which the edges are merely precedence constraints (i.e., successive tasks can be
fired off immediately once precedence conditions are satisfied), while the control-flow graph is a
directed graph that contains cycles, and the executed control flow is conditional upon input
and run-time information. Furthermore, while task graphs usually contain at most hundreds of
tasks, the number of basic blocks in control-flow graphs can be in the tens of thousands
even for small applications, and the execution traces obtained can be on the order of millions in length.
Consequently, offline scheduling techniques for task graphs that use ILP formulations [18, 53] cannot
be applied to control-flow graph specifications, because such approaches do not scale.
There has been relatively little research on configuration scheduling
for control-flow graphs. The control-flow graph prefetching problem is usually formulated
as an instruction scheduling problem [49, 26, 42] (i.e., where to insert the prefetch instruction
to load configurations in advance). Offline scheduling is important either for systems
that run a relatively small, static set of applications or for environments where the operating
system does not support configuration management.
Chapter 3
Design Space Search for
Hardware-Software Partitioning
When mapping an application that has multiple kernels onto a rFPGA, it may be necessary
to share the hardware among these kernels through spatial partitioning [64] so as to acceler-
ate overall execution. A reconfigurable computing architecture allows for the virtualization
of hardware through run-time reconfiguration. In this case, the kernels can be swapped in
and out of hardware at runtime. This is useful if the total area required to realize different
kernels in hardware exceeds the available area. Run-time reconfigurability adds another
dimension to the already complex design space exploration problem. We need to consider
the temporal partitioning of the kernels in addition to the spatial partitioning. Moreover,
the key to success is to ensure that the benefits derived from hardware acceleration of a
kernel are not overwhelmed by the overhead of runtime reconfiguration [14]. Thus, this
overhead should be taken into account in the partitioning decision.
For a single, sequential program, the executions of these kernels are mutually exclu-
sive, i.e., only one kernel can execute at any point in time. The spatial partitioning problem
looks at the optimal choice and placement of kernels constrained by the amount of available
hardware resource (area). The problem is further complicated by the fact that there often
exist multiple instantiations of a candidate kernel. For example, applying different opti-
mizations (e.g., loop unrolling) on a loop kernel results in a number of different hardware
implementations with varying area and performance. The design space exploration needs
to take all these different instances of the kernels into consideration in order to obtain an
optimal solution.
Neighborhood searches such as GRASP[19] and Tabu[24] have been used to solve com-
plex combinatorial problems effectively. Thus one way to traverse the design space is to
use some form of neighborhood search. A key insight in speeding up such searches is
that all these techniques involve evaluating the neighbors of the current design point. Such
evaluations are often time-consuming. In this chapter, we will propose a way of speeding
up such neighborhood searches.
The rest of the chapter is organized as follows. Section 3.1 states the problem formulation.
Section 3.2 describes the various aspects of the proposed framework that
support fast evaluation of neighboring points, including the key contributions: a neighborhood
relationship among design points and a method for computing the reconfiguration cost.
After that, we show how these aspects are put together to improve the efficiency of
neighborhood searches. In Section 3.3, we present and analyze the experimental results,
where we evaluate our framework with Tabu and Hill-Climb Search. This is
followed by a conclusion.
3.1 Problem Formulation
In this section, we formally define the notions used in the description of our technique in
Section 3.2.
3.1.1 The Design Space
The design space in the context of this chapter is defined in terms of the following param-
eters.
• K_1 ... K_N: Candidate kernels (loops)
• k_{i,1} ... k_{i,m_i}: Different hardware implementation instances of kernel K_i with varying
area and performance
• a(k_{i,j}): Area required by kernel instance k_{i,j}
• s(k_{i,j}): Savings in execution time due to hardware implementation instance k_{i,j} of
kernel K_i over its software execution
• Loop trace indicating the run-time execution sequence of the candidate kernels
• A: Total hardware area constraint
• ρ: Time to perform one reconfiguration
The loop trace and candidate loops can be obtained through profiling [61, 40]. The savings
and area estimates of alternate hardware implementations can be obtained through behav-
ioral synthesis [57] and other methods of estimation. The details of profiling and estimation
are beyond the scope of this chapter and are orthogonal to its contribution. For a particular
reconfigurable hardware, we assume that ρ and A are constants.
3.1.2 Configurations and Partitions
We define a configuration to be a non-empty set of kernels. A configuration instance is a
particular implementation of a configuration, obtained from the
configuration by choosing a particular instance for each member kernel. Using
the example in Figure 1.4, a configuration can be of the form {K_a, K_b}, whereas a configuration
instance will be of the form {k_{a,i}, k_{b,j}}, where k_{a,i} and k_{b,j} are hardware instances of loops a
and b, respectively. The total area required by all the kernel instances in a configuration
instance must not exceed the hardware area constraint. Given a configuration, selecting
an optimal configuration instance is a sub-problem of the entire design-space exploration
problem. Switching from one configuration to another incurs a reconfiguration cost.
A set of configurations is called a partition. Similarly, a partition instance is a particular
implementation of a partition, obtained from the partition by choosing
a particular instance for each member configuration. A partition consisting
of a single configuration corresponds to static configuration. This is SS when the configuration
is a singleton, and SM otherwise. A partition with more than one configuration
implies dynamic reconfiguration. This is DS when all the configurations are singletons, and
DM otherwise. An empty partition implies that no kernel was chosen for hardware
implementation. It should be noted that choosing a partition implies that all other
kernels have been designated for software implementation. For a partition P, we refer to the
set of kernels designated for hardware implementation as HW(P) and the set of kernels
designated for software implementation as SW(P) = {K_1, ..., K_N} − HW(P).
We have chosen to enforce a constraint that a loop kernel can appear in at most one
configuration in a partition. The reason for this constraint is that if a loop were allowed to have
multiple hardware versions, it would become necessary to dynamically infer the context
under which a particular hardware version of the loop should be loaded, which would further
complicate the problem.
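The definitions above map naturally onto plain sets. The following is a minimal sketch, assuming a partition is represented as a list of Python sets of kernel names; the function names and the "software-only" label for the empty partition are illustrative choices, not part of the thesis.

```python
def hw_kernels(partition):
    """HW(P): all kernels that appear in some configuration of the partition."""
    return set().union(*partition)


def sw_kernels(partition, all_kernels):
    """SW(P) = {K_1, ..., K_N} - HW(P)."""
    return set(all_kernels) - hw_kernels(partition)


def is_valid(partition):
    """Configurations are non-empty and no kernel appears in more than one."""
    if any(len(config) == 0 for config in partition):
        return False
    return sum(len(config) for config in partition) == len(hw_kernels(partition))


def classify(partition):
    """Return SS, SM, DS or DM as defined in the text above."""
    if not partition:
        return "software-only"  # empty partition: no kernel goes to hardware
    all_singletons = all(len(config) == 1 for config in partition)
    if len(partition) == 1:
        return "SS" if all_singletons else "SM"
    return "DS" if all_singletons else "DM"
```

For example, classify([{"Ka", "Kb"}, {"Kc"}]) yields "DM", since there is more than one configuration and not all of them are singletons.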
Our approach aims to minimize the total execution time of the application by acceler-
ating candidate loops in hardware. Therefore, we define a set of functions to compute the
savings corresponding to kernel instances, configuration instances, and partitions.
s(k_{i,j}) = t_{sw}(K_i) \times n_{sw}(K_i) - o(k_{i,j}) \times n_{hw}(k_{i,j})
             - t_{hw}(k_{i,j}) \times n_{hw}(k_{i,j}) - t_{sw}(K_i) \times n_{sw}(k_{i,j})    (3.1)
Equation 3.1 shows the savings corresponding to a hardware implementation of a kernel.
The terms on the right-hand side of the equation represent the software execution time,
communication overhead, hardware execution time, and software execution time of the
remainder loops, respectively.
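Equation 3.1 can be evaluated directly once its six quantities are known. The sketch below is one possible reading, in which t_sw, o and t_hw are per-iteration times and the n terms are iteration counts; this interpretation, the parameter names, and the example numbers are assumptions, since the symbols are not defined individually in the text.

```python
def kernel_savings(t_sw, n_sw_total, o, t_hw, n_hw, n_sw_rem):
    """Savings of one hardware kernel instance, term by term as in Equation 3.1.

    t_sw       : software time per iteration of the kernel       (assumed meaning)
    n_sw_total : iterations when the kernel runs fully in software
    o          : communication overhead per hardware iteration
    t_hw       : hardware time per iteration
    n_hw       : iterations executed in hardware
    n_sw_rem   : remainder iterations still executed in software
    """
    software_time = t_sw * n_sw_total
    communication = o * n_hw
    hardware_time = t_hw * n_hw
    remainder_time = t_sw * n_sw_rem
    return software_time - communication - hardware_time - remainder_time


# Hypothetical numbers: 100 iterations, 96 of them moved to hardware.
example = kernel_savings(t_sw=10, n_sw_total=100, o=1, t_hw=2, n_hw=96, n_sw_rem=4)
```

With these made-up values the instance saves 672 time units; a negative result would indicate that the hardware implementation is not worthwhile.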







Equation 3.2 shows the savings of a configuration instance. The savings of a configuration
instance is simply the sum of the savings of its member hardware kernel instances. We
compute the total savings of a partition instance in Equation 3.3 by offsetting the total
reconfiguration time against the total savings of the member configuration instances. n(P)
is the expected number of reconfigurations for partition instance P and ρ is the time to
perform one reconfiguration.
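Written out, the two savings functions described in this paragraph take the following form. This is a reconstruction from the prose only: the symbol C for a configuration instance and the treatment of a partition instance P as a set of configuration instances are assumed notation, and the equation numbers follow the references in the text.

```latex
\begin{align}
s(C) &= \sum_{k_{i,j} \in C} s(k_{i,j}) \tag{3.2} \\
s(P) &= \sum_{C \in P} s(C) \;-\; n(P) \times \rho \tag{3.3}
\end{align}
```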
PROBLEM Develop a design space framework by defining the neighborhood relationships
and an efficient evaluation function to facilitate the implementation of efficient neighborhood
searches that solve the above problem, i.e., maximize the savings of each configuration
so as to minimize the overall execution time.
We shall now describe how the neighbors of a given point in the design space
can be evaluated over a SEQUITUR-compressed trace of loops.
3.2 Fast Evaluation of Neighboring Design Points
We consider the exploration of the design space described above using some neighborhood
search scheme. One of the key components that is common among these search strategies
is the evaluation of the design points within a certain neighborhood. We shall now describe
a neighborhood relationship between partitions that 1) is complete in coverage of the par-
titioning space and 2) does not recompute unnecessarily when evaluating the neighbors of
an evaluated partition. The necessary components of our techniques are:
• Loop traces encoded using SEQUITUR grammar
• Evaluation of a single partition (without any evaluated neighbors)
• The neighborhood relationship proper
• Evaluation of a partition’s neighboring points
We shall now describe each of these.
3.2.1 Evaluating Partitions
We evaluate a partition by determining the optimal way to implement the partition. The
savings of a partition instance depends on the savings of its member configuration instances
and the number of reconfigurations. However, all the partition instances corresponding to
a partition require the same number of reconfigurations for a given loop trace. In the
example shown in Figure 1.4, if loops a and b are put in one configuration, and c and d are
put in another, there will be only one reconfiguration per iteration of the outer while loop,
regardless of the instances of the loops chosen to be implemented in hardware. Therefore,
an optimal partition instance can be obtained by simply choosing optimal configuration













Figure 3.1: DAG representing a SEQUITUR grammar.
instance and a method for calculating the number of reconfigurations. These are described
in the following subsections.
3.2.1.1 Computing Optimal Configuration Instance
Each loop kernel is associated with a number of alternative hardware implementations. A
naive approach to find the optimal instance corresponding to a configuration would be to
enumerate all feasible instances. However, this approach does not scale either with the
number of kernels or with the number of instances corresponding to each kernel.
We handle this problem by pruning the number of instances corresponding to a kernel.
We keep only the Pareto-optimal instances corresponding to each kernel. Intuitively speaking,
these instances are more efficient in terms of area utilization, giving better speedups
with less area. After this pruning, the optimal configuration instance is found by an exhaustive
enumeration of the remaining feasible configuration instances. We do not synthesize






















Figure 3.2: Pareto-optimal kernel instances.
the configuration instances at this stage. Rather, the savings of a particular configuration
instance is estimated using Equation 3.2 along with the area requirement.
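The exhaustive enumeration over the pruned instances can be sketched as follows. The representation is an assumption made for illustration: each kernel maps to a list of (area, savings) pairs, the savings of a configuration instance is the sum of its members' savings, and all names and numbers are hypothetical.

```python
from itertools import product


def best_configuration_instance(kernel_instances, area_limit):
    """Pick one instance per kernel, maximizing total savings within the area limit.

    kernel_instances: dict mapping kernel name -> list of (area, savings) pairs.
    Returns (savings, area, choice) or None if no combination fits.
    """
    best = None
    # Try every combination of one instance per kernel.
    for combo in product(*kernel_instances.values()):
        area = sum(a for a, _ in combo)
        savings = sum(s for _, s in combo)
        if area <= area_limit and (best is None or savings > best[0]):
            best = (savings, area, dict(zip(kernel_instances, combo)))
    return best


instances = {
    "Ka": [(2, 10), (4, 18)],
    "Kb": [(3, 7), (5, 12)],
}
best = best_configuration_instance(instances, area_limit=7)
```

The number of combinations is the product of the per-kernel instance counts, which is why pruning each kernel's list beforehand matters.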
For example, Figure 3.2 shows all the instances corresponding to a loop kernel taken
from the JPEG encoding benchmark. The estimated rFPGA area in terms of slices is plot-
ted against the expected execution time of a single loop iteration for each of the kernel
instances. Among the eight kernel instances, we keep only the ones on the Pareto-optimal
front.
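A Pareto front over (area, execution time) pairs can be extracted with a sort followed by a single sweep. The sketch below assumes that smaller is better on both axes and uses made-up numbers rather than the JPEG data behind Figure 3.2; the function name is hypothetical.

```python
def pareto_front(instances):
    """Keep the (area, time) pairs that no other pair dominates.

    A pair is dominated if another pair has no more area and no more time.
    After sorting by area (ties broken by time), a pair is on the front
    exactly when its time is strictly lower than every time seen so far.
    """
    front = []
    best_time = float("inf")
    for area, time in sorted(instances):
        if time < best_time:
            front.append((area, time))
            best_time = time
    return front


# Eight hypothetical kernel instances: (area in slices, time per iteration).
candidates = [
    (100, 9.0), (150, 9.5), (200, 6.0), (250, 6.0),
    (300, 4.0), (120, 8.0), (400, 4.5), (90, 12.0),
]
front = pareto_front(candidates)
```

Here five of the eight candidates survive; the other three each use at least as much area as some faster instance.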
3.2.1.2 Loop Trace Compression Using SEQUITUR Graph
We can compute the reconfiguration cost of any given partition by going through the entire
trace. However, this step could be costly in terms of computation due to the size of the
traces. Therefore, we compress the loop trace using SEQUITUR, in a format amenable for
reconfiguration cost computation, as shown in the later subsections.
The SEQUITUR algorithm developed by Nevill-Manning [45] compresses a sequences of
symbols (loop ids) by building hierarchical structures of frequently repeated sub-sequences.
The SEQUITUR algorithm represents a finite string σ as a context free grammar. whose lan-
guage is a singleton set {σ}. The SEQUITUR grammar can be represented as a directed
acyclic graph. Each leaf vertex in the DAG corresponds to a candidate loop. Each inter-
mediate vertex in the DAG represents a sub-trace and the root vertex represents the entire
loop trace. An in-order traversal of the sub-graph rooted at a vertex retrieves the corre-
sponding sub-trace. For example, an in-order traversal of the graph shown in Figure 3.1
generates the sequence ababacacbcbcababacacbcbcd. It should be noted that the same
vertex can be a direct sibling of itself, as shown in the figure.
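As an illustration, the sketch below stores such a grammar as a dictionary and recovers the trace by a left-to-right traversal. The rule names A to E and the shape of the rules are assumptions: they form one grammar that is consistent with the sequence quoted above and with the vertex names used later in this chapter, and are not a transcription of Figure 3.1.

```python
# Assumed grammar: each non-terminal maps to the ordered list of its children.
# Symbols that do not appear as keys are leaves (candidate loops).
GRAMMAR = {
    "A": ["B", "B", "d"],
    "B": ["C", "C", "D", "D", "E", "E"],
    "C": ["a", "b"],
    "D": ["a", "c"],
    "E": ["b", "c"],
}


def expand(grammar, symbol):
    """Recover the sub-trace represented by the vertex `symbol`."""
    if symbol not in grammar:  # leaf vertex: a single candidate loop
        return [symbol]
    trace = []
    for child in grammar[symbol]:
        trace.extend(expand(grammar, child))
    return trace


trace = "".join(expand(GRAMMAR, "A"))
```

Note that vertex B lists C, D and E twice in a row, which is the "direct sibling of itself" situation mentioned above.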
3.2.1.3 Counting Reconfigurations
We can efficiently compute the number of reconfigurations of a partition through a single
bottom-up traversal of the SEQUITUR DAG G = (V,E), where V is the set of vertices and
E the set of edges, with complexity O(|V| + |E|). During the traversal for a particular
partition, each vertex v in the DAG is labeled with the following: (1) the first and last
hardware kernel in the loop sub-trace represented by v, and (2) the total number of
reconfigurations for the loop sub-trace represented by v. During the same bottom-up
traversal, we can compute the labels of an intermediate vertex from the labels of its
children as follows. Let v be an intermediate vertex with children $v_1, \ldots, v_k$. Let
n(v), f(v), and l(v) represent the number of reconfigurations, the first configuration and
the last configuration of vertex v, respectively. Then
$$n(v) = \sum_{i=1}^{k} n(v_i) - \sum_{i=1}^{k-1} x_i,$$
where $x_i$ is equal to 1 if $l(v_i) = f(v_{i+1})$ and 0 otherwise.
The leaf vertices would be the base case where the loop sub-trace consists of only one
candidate loop corresponding to the leaf vertex. Let v be a leaf vertex. n(v) would be 1 if
the candidate loop has been designated for hardware, 0 otherwise. f (v) and l(v) would be
the candidate loop if the candidate loop has been designated for hardware, null otherwise.
At the end of the traversal, the label at the root vertex yields the number of reconfigurations
corresponding to the entire loop trace.
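The labeling can be sketched as follows. The grammar is an assumed one that expands to the example trace ababacacbcbcababacacbcbcd, and the partition {{a},{c,d}} is used for illustration. One detail is an assumption of this sketch rather than something stated above: children whose sub-trace contains no hardware kernel (null labels) are skipped when neighbouring labels are compared, so that software kernels lying between two kernels of the same configuration do not count as a reconfiguration. A memoized recursion stands in for the bottom-up traversal; each vertex is still labeled only once.

```python
GRAMMAR = {  # assumed grammar, see lead-in
    "A": ["B", "B", "d"],
    "B": ["C", "C", "D", "D", "E", "E"],
    "C": ["a", "b"],
    "D": ["a", "c"],
    "E": ["b", "c"],
}


def count_reconfigurations(grammar, root, partition):
    """Return n(root) for a partition given as a list of sets of kernels."""
    config_of = {k: idx for idx, cfg in enumerate(partition) for k in cfg}
    labels = {}  # vertex -> (n, f, l); f and l are configuration ids or None

    def label(v):
        if v in labels:
            return labels[v]
        if v not in grammar:  # leaf: one candidate loop
            c = config_of.get(v)
            result = (0, None, None) if c is None else (1, c, c)
        else:
            n, first, last = 0, None, None
            for child in grammar[v]:
                cn, cf, cl = label(child)
                if cf is None:  # child holds software kernels only
                    continue
                n += cn
                if last == cf:  # x_i = 1: same configuration stays loaded
                    n -= 1
                if first is None:
                    first = cf
                last = cl
            result = (n, first, last)
        labels[v] = result
        return result

    return label(root)[0]


def count_on_trace(trace, partition):
    """Reference count obtained by walking the uncompressed trace."""
    config_of = {k: idx for idx, cfg in enumerate(partition) for k in cfg}
    n, current = 0, None
    for k in trace:
        c = config_of.get(k)
        if c is not None and c != current:
            n, current = n + 1, c
    return n


partition = [{"a"}, {"c", "d"}]
n_root = count_reconfigurations(GRAMMAR, "A", partition)
```

The helper `count_on_trace` is only a cross-check on the uncompressed trace; both functions count the initial configuration load as one reconfiguration, following the leaf rule n(v) = 1.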
3.2.2 The Neighborhood Relationship
The neighbor of a partition (in the design space) is obtained by either (1) removing a
hardware kernel from any of the member configurations (removing the entire configuration
if it becomes empty) or (2) adding a kernel currently in software into the partition
(thus designating it for hardware implementation), either into one of the existing
configurations or as a new configuration containing only this new kernel.
Figure 3.3 shows a partition {{a},{b,c}} with all of its neighboring partitions. The
removal of kernel c from the partition gives us the neighboring partition {{a},{b}}. There
are 3 ways to add kernel d into the partition: adding d gives us partitions {{a},{b,c,d}},
{{a,d},{b,c}} and {{a},{b,c},{d}}. Removal of kernel b gives us partition {{a},{c}}
and removing kernel a leaves us with {{b,c}}. There are 6 neighbors in all. The partition
{{c}} cannot be a neighbor of {{a},{b,c}} because they differ by more than one kernel.
In general, a partition P has |SW (P)|× (|P|+1)+ |HW (P)| neighbors, where HW (P) and
SW (P) are the set of hardware and software kernels for partition P, respectively. |P| is the
number of configurations in partition P. This relationship is complete in the sense that any
partition may be constructed starting from an empty partition (by adding the kernels one
by one) and the empty partition may be reached by deconstructing any partition as well (by
removing the kernels one by one).
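The neighborhood relation can be written down directly. The sketch below is illustrative only: it represents a partition as a collection of sets of kernels and returns each neighbor as a frozenset of frozensets so that partitions can be compared and stored in sets; this representation is an assumption, not the one used in our implementation.

```python
def neighbors(partition, kernels):
    """Enumerate all neighbors of `partition` over the candidate `kernels`."""
    configs = [frozenset(c) for c in partition]
    hardware = set().union(*configs)
    result = []
    # (1) Remove one hardware kernel; drop the configuration if it empties.
    for i, cfg in enumerate(configs):
        for k in sorted(cfg):
            rest = cfg - {k}
            new = configs[:i] + ([rest] if rest else []) + configs[i + 1:]
            result.append(frozenset(new))
    # (2) Add one software kernel to an existing or to a new configuration.
    for k in sorted(set(kernels) - hardware):
        for i, cfg in enumerate(configs):
            result.append(frozenset(configs[:i] + [cfg | {k}] + configs[i + 1:]))
        result.append(frozenset(configs + [frozenset({k})]))
    return result


found = neighbors([{"a"}, {"b", "c"}], "abcd")
```

For the partition {{a},{b,c}} with d in software, the function returns the six neighbors listed above, in agreement with |SW(P)| × (|P|+1) + |HW(P)| = 1 × 3 + 3.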
From Figure 3.3, we observe that the reconfiguration cost of the neighboring design
points cannot be computed simply by adding or subtracting the number of occurrences of
the kernel added to or removed from the design point. For example, when kernel c is
removed, the reconfiguration cost does not decrease by 2 even though c occurs twice in the
loop trace. In the next section, we propose a way to compute the reconfiguration cost of
















Figure 3.3: An example of a partition with its neighboring design points and the associated
reconfiguration costs.
3.2.3 Simultaneous Evaluation of Neighbors
In Section 3.2.1.3, we have shown how to compute the number of reconfigurations of a
partition efficiently using a compressed loop trace. However, the number of neighbors of
a partition can be quite large. Therefore, traversing the SEQUITUR graph for each neighbor
can be quite expensive. Instead, given a partition P, we propose a method to compute
the reconfiguration cost of all its neighbors through a single bottom-up traversal of the
SEQUITUR graph.
Our method is based on the observation that only certain sequences in the loop trace
need to be considered in order to compute the reconfiguration cost of a neighboring parti-
tion. Let K be an arbitrary kernel in configuration C of partition P, i.e., K ∈C ∈P. The loop
trace contains many sequences of the form of < Kx,S,K,S′,Ky > where Kx,Ky ∈ HW (P)
and S,S′ are (possibly empty) sequences of software kernels. In each of these sequences,
there are three mutually exclusive possibilities:
1. Kx or Ky is in the same configuration as K. In this case, removing K has no effect on
the number of reconfigurations.
2. Kx and Ky are in the same configuration, but not in the same one as K. In this case,
removing K results in the savings of two reconfigurations.
3. Kx, Ky and K are in distinct configurations. In this case, removing K results in the
saving of one reconfiguration.
The effect of removing a kernel K can thus be computed after identifying all distinct se-
quences of the form s =< Kx,S,K,S′,Ky > and the number of times, w(s), each of these
sequences occurs in the trace. The decrease in the number of reconfigurations, d(s), can
then be computed based on the three cases above. The total savings in the number of
reconfigurations is $\sum_{s} d(s) \times w(s)$. The effect of adding kernels can be
computed in a similar way.
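The three cases translate into a small function. In this sketch, a sequence s is reduced to the triple (Kx, K, Ky), since the software sequences S and S′ do not influence d(s), and the occurrence counts w(s) are supplied as a dictionary. The counts and the configuration assignment in the usage lines are hypothetical, and the sketch applies the three cases exactly as stated, without treating the beginning or the end of the trace specially.

```python
def decrease(kx, k, ky, config_of):
    """d(s): reconfigurations saved per occurrence of <Kx,S,K,S',Ky> if K is removed."""
    cx, c, cy = config_of[kx], config_of[k], config_of[ky]
    if cx == c or cy == c:  # case 1: K shares a configuration with Kx or Ky
        return 0
    if cx == cy:            # case 2: Kx and Ky share one, K sits in another
        return 2
    return 1                # case 3: three distinct configurations


def removal_savings(k, sequence_counts, config_of):
    """Sum of d(s) * w(s) over all enumerated sequences whose middle kernel is k."""
    return sum(
        decrease(kx, kk, ky, config_of) * w
        for (kx, kk, ky), w in sequence_counts.items()
        if kk == k
    )


# Hypothetical partition {{a},{b,c},{d}} and hypothetical occurrence counts w(s).
config_of = {"a": 0, "b": 1, "c": 1, "d": 2}
sequence_counts = {("a", "b", "a"): 4, ("a", "b", "d"): 1, ("c", "b", "a"): 3}
saved = removal_savings("b", sequence_counts, config_of)
```

Here removing b saves 2 × 4 reconfigurations from <a,b,a>, 1 from <a,b,d> and none from <c,b,a>, because c and b share a configuration.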
Therefore, given a partition P, we need to enumerate all sequences of the form <
Kx,S,Ki,S′,Ky > and their frequency for each candidate kernel Ki (1 ≤ i ≤ N). This will
allow us to compute the number of reconfigurations of a partition obtained through addition
(if in software) or removal (if in hardware) of kernel Ki from partition P. This can be
performed efficiently through a single bottom-up traversal of the SEQUITUR graph by
appropriately labeling the vertices through an extension of the labeling proposed in Section
3.2.1.3.
3.2.3.1 Labeling Extension and Sequence Enumeration
We observe that these sequences <Kx,S,Ki,S′,Ky> will always span two consecutive
sub-traces. The extreme case would be one sub-trace having one loop and the other
sub-trace having two loops. Given that each vertex in the SEQUITUR graph
represents a sub-trace, we need to label the vertices in a way that allows such sequences to
be identified easily.
Consider sub-traces represented by (not necessarily distinct) vertices vi and vi+1 that
are next to each other in the original trace (i.e., vi and vi+1 are direct siblings, children
of the same parent vertex). Assume further that the sub-trace represented by vi to be















































Figure 3.4: A SEQUITUR graph labeled with H and S tags given a partition that has put
kernels a, c and d in hardware.
where K1,K2,K3,K4 ∈ HW (P) and S1,S2,S3,S4 represent (possibly empty) sequences of
software kernels. In order to enumerate the <Kx,S,K,S′,Ky> sequences that span these
two sub-traces, we need to consider two cases. If K is in hardware, then both K2 and K3 are
candidates for K. If K is in software, then all kernels occurring in S2 and S3 are candidates
for K. We further note that once K is identified, Kx and Ky can be easily identified by
finding the nearest preceding and subsequent hardware kernels.
The above consideration leads to the conclusion that both the first two and the last two
hardware kernels of the sub-traces are needed to identify Kx, K and Ky. Any software
kernels in the sub-trace occurring before the first hardware kernel and after the last hardware
kernel are also needed. Recall from Section 3.2.1.2 that an in-order traversal of the
sub-graph rooted at any vertex recovers a sub-trace. Thus, we label each vertex, vi, with
an H tag and an S tag, as shown in Figure 3.4, where kernels a, c and d have been chosen
for hardware implementation.
The H tag consists of two pairs of indices. The first pair would be the first two hardware
kernels of the sub-trace represented by vi. The second pair would be the last two hardware
kernels in the same sub-trace. In cases when the sub-trace represented by the vi does not
contain at least 2 hardware kernels (e.g., in the case of leaf vertices), the non-existent
hardware kernels would be labelled with ‘ ’.
The S tag consists of two (possibly empty) sets of indices. The first set contains the
software kernels that occur in the sub-trace represented by vi before the first hardware
kernel. The second set contains the software kernels that occur in the sub-trace represented
by vi after the last hardware kernel. In cases when the sub-trace does not contain any
hardware kernels, all the kernels contained in the sub-trace are added to both sets.
This labeling process, i.e., computing the H and S tags, is done in a single bottom-up
traversal of the SEQUITUR graph. Assuming that all the child vertices of vi are properly
labelled, the H and S tags of vi can be computed using the H and S tags of the first and last
child of vi.
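To make the two tags concrete, the sketch below computes them directly from an expanded sub-trace. This is for illustration only and is not the bottom-up computation described above, which never expands the trace. Two details are assumptions of the sketch: a non-existent hardware kernel is represented by None, and when only one hardware kernel exists it appears as the first entry of the first pair and as the second entry of the last pair.

```python
def h_and_s_tags(subtrace, hardware):
    """Return (H, S) for a sub-trace, given the set of hardware kernels.

    H = ((first, second), (second_to_last, last)) hardware kernels.
    S = (software kernels before the first hardware kernel,
         software kernels after the last hardware kernel).
    """
    positions = [i for i, k in enumerate(subtrace) if k in hardware]
    if not positions:  # no hardware kernel: all kernels go into both sets
        everything = set(subtrace)
        return ((None, None), (None, None)), (everything, set(everything))
    hw = [subtrace[i] for i in positions]
    first_pair = (hw[0], hw[1] if len(hw) > 1 else None)
    last_pair = (hw[-2] if len(hw) > 1 else None, hw[-1])
    before = set(subtrace[:positions[0]])
    after = set(subtrace[positions[-1] + 1:])
    return (first_pair, last_pair), (before, after)


hardware = {"a", "c", "d"}
tags_ab = h_and_s_tags("ab", hardware)
tags_abac = h_and_s_tags("abac", hardware)
```

For the sub-trace ab with a in hardware and b in software, the H tag records a as the only hardware kernel and the second S set is {b}, which is the situation described for vertex C in the text that follows.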
With the H and S tags in hand, we can now enumerate the <Kx,S,Ki,S′,Ky > sequences
of vi. It turns out that this can be done in the same bottom-up traversal by examining the
labels of vi’s siblings and concatenating the possible sequences. For example, according
to the tags of vertex C, there is only one hardware kernel a that occurs in the sub-trace
represented by vertex C and b is the only software kernel that occurs after a. According
to the tags of vertex D, the first hardware kernel of the sub-trace represented by vertex D is
a. Thus, the sequence <a,b,a> spans the two sub-traces represented by vertex C and
D. In fact, the sequence <a,b,a> also spans the sub-traces represented by two consecutive
occurrences of vertex C. Thus, this sequence occurs twice in the sub-trace represented by
vertex B and since vertex B itself has an occurrence count of 2, the sequence <a,b,a>
occurs four times in total.
With all the necessary sequences enumerated, all the neighbors of a design point can be
evaluated easily based on Equation 2.
3.2.4 Employing the Entire Framework
A step of a neighborhood search usually involves the following: evaluating the current
design point, comparing it with neighboring design points, and eventually selecting one
neighboring design point as the next step of the search. Figure 3.5 shows what happens
during such a step. It shows a partial view of the relevant design space, the enumerated
sequences and the labeled SEQUITUR graph for two consecutive steps of a search. The
current design point is shown in bold while the ignored design points























































































Figure 3.5: An example showing the change in annotation of the SEQUITUR graph and
enumeration of sequences after a move between neighboring design points.
Initially, the current partition of the search is {{a},{c,d}}. When considering the move
of adding kernel b into configuration {a} (i.e., the move to {{a,b},{c,d}}), the occurrence
counts of the enumerated sequences <a,b,a> and <c,b,c> are used to compute the change in
the reconfiguration cost of the move. Since kernels a and b would be in the same
configuration, the increase in reconfiguration count is 4. Similar computations can be made for
the other neighbors and are left as an exercise for the reader. The partition {{a,b},{c,d}}
is selected (the criterion depends on the search algorithm) for the next step in the search.
Consequently, the SEQUITUR graph needs to be relabeled according to the methods described
in Section 3.2.3.1. To complete the move, sequences <a,c,b> and <b,a,c> are enumerated
to reflect the fact that kernel b is now in hardware. The search can thus continue after
the move is completed.
3.3 Experimental Evaluation
3.3.1 Experimental Setup
We use four non-trivial benchmarks for our experimental evaluation: a JPEG encoder (cjpeg),
a JPEG decoder (djpeg), an encryption key exchange program (dh), and an MPEG encoder
(mpegenc). cjpeg, djpeg and mpegenc are taken from the Mediabench [38] benchmark
suite while dh is taken from NetBench [44]. We use the Trimaran compiler infrastructure [9]
to generate the input parameters for the design space exploration problem. In particular,
we have implemented a loop profiler that selects a loop kernel (both inner and outer) as a
candidate if its computation time exceeds 1% of that of the entire application.
In view of a lack of estimation tools, we have to pre-generate the area and timing
estimates. We applied loop unrolling with various unroll factors to each candidate
loop kernel. To obtain the hardware performance and area required for each kernel instance,
we automatically generate Handel-C code [8] from Trimaran’s Elcor intermediate
representation. The timing and area estimates of these alternative hardware implementations
are subsequently obtained through synthesis using the Celoxica DK design suite and Xilinx
ISE tools. The target rFPGA for synthesis is the Xilinx 2000E model [69]. To evaluate our
framework, we developed three algorithms: Exhaustive, Hill-Climb and Tabu search.
Exhaustive search In a pre-processing phase, we compute the optimal configuration
instances corresponding to all possible configurations of the candidate kernels using the
method described in section 3.2.1.1. The main phase then enumerates all possible par-
titions. The enumeration algorithm used is by Kreher and Stinson [37]. This algorithm
ensures the proper enumeration of all the partitions. The savings of a partition is defined
as the savings of its optimal instance. Evaluation of the savings of a partition is described
in Section 3.2.1. The partition instance with the maximum savings is chosen as the opti-
mal partition instance. It should be noted that apart from how the optimal configuration
instances are chosen, the Exhaustive search algorithm does not make use of the rest of the
framework.
Hill-Climb search We start with an empty partition. This ensures that our solution is at
least as good as an all-software solution to begin with. We then evaluate all its neighbors
using the technique described in the previous subsection. We choose the neighbor with
the maximum savings (i.e., minimum execution time). The search then moves to this new
design point and examines its neighbors. We always maintain the best partition obtained
so far. The search terminates at a design point if we cannot find any partition in the neigh-
borhood that is better than the current best partition. It should be noted that our Hill-Climb
search draws heavily on the framework described in Section 3.2, making full use of the
neighborhood relationships and the efficient evaluation of the neighbors.
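The control flow of the Hill-Climb search can be sketched generically. The functions `neighbors` and `savings` are placeholders for the neighborhood relation and the savings estimate of Section 3.2; the toy design space in the usage lines (integers on a line with a hypothetical savings table) only demonstrates the termination rule.

```python
def hill_climb(start, neighbors, savings):
    """Move to the most profitable neighbor until none beats the best point so far."""
    best = current = start
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return best
        nxt = max(candidates, key=savings)
        if savings(nxt) <= savings(best):
            return best  # no neighbor improves on the best partition found
        best = current = nxt


# Toy design space: points 0..6 with hypothetical savings, neighbors are x-1 and x+1.
TABLE = [0, 1, 2, 1, 0, 5, 9]


def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(TABLE)]


def toy_savings(x):
    return TABLE[x]


result = hill_climb(0, toy_neighbors, toy_savings)
```

Starting from point 0, the search stops at the local maximum at point 2 and never reaches the better point 6, which is the behaviour that motivates the Tabu search.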
Tabu search We modify the Hill-Climb search to obtain the Tabu search. The main
difference is that the search does not terminate when a local maximum is reached. Instead,
we maintain a tabu list of design points that have been visited, and the most profitable
neighbor is always visited, regardless of whether it yields more savings than
the current design point. If that neighbor is on the tabu list, the next
most profitable neighbor not present on the tabu list is visited. The search terminates when
the number of moves made reaches a certain limit. In our experiments, we fixed
the number of entries on the tabu list at 100 and the limit on the number of moves at the
logarithm of the design space size to base 1.05.
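A minimal sketch of this variant is given below. Several details are assumptions of the sketch and not statements about our implementation: the tabu list is a fixed-size FIFO (the oldest entry is evicted first), the move limit is rounded up to an integer, and `neighbors` and `savings` are placeholders for the machinery of Section 3.2.

```python
import math
from collections import deque


def move_limit(design_space_size, base=1.05):
    """Logarithm of the design space size to base 1.05, rounded up (assumption)."""
    return math.ceil(math.log(design_space_size, base))


def tabu_search(start, neighbors, savings, max_moves, tabu_size=100):
    """Always move to the most profitable non-tabu neighbor; remember the best point."""
    best = current = start
    tabu = deque([start], maxlen=tabu_size)  # FIFO eviction is an assumption
    for _ in range(max_moves):
        allowed = [n for n in neighbors(current) if n not in tabu]
        if not allowed:
            break
        current = max(allowed, key=savings)
        tabu.append(current)
        if savings(current) > savings(best):
            best = current
    return best


# Toy design space: points 0..6 with hypothetical savings, neighbors are x-1 and x+1.
TABLE = [0, 1, 2, 1, 0, 5, 9]


def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(TABLE)]


def toy_savings(x):
    return TABLE[x]


result = tabu_search(0, toy_neighbors, toy_savings, max_moves=10)
```

On the toy space, the search walks through the dip after the local maximum at point 2 and returns point 6, because moves that lower the savings are accepted while visited points stay on the tabu list.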
Benchmark   Num. of     Size of Comp.   Avg. Exhaustive   Avg. Hill-Climb   Avg. Tabu
            Candidate   Trace           Search Time       Search Time       Search Time
            Kernels     (KBytes)        (sec)             (sec)             (sec)
cjpeg       11          1               17719.72          0.24              3.17
djpeg       7           4.3             17.34             0.04              1.06
dh          7           72              3837.87           3.73              112.04
mpegenc     6           74              245.88            0.96              18.44
Table 3.1: Number of candidate kernels, compressed trace sizes, and average running times
of the Exhaustive, Hill-Climb and Tabu searches.
We have implemented the search algorithms in C++, compiled with gcc version 4.1.2. We
run the experiments on a 2.8 GHz Pentium 4 machine in the GNU/Linux environment.
All reported running times of the search algorithms are based on the Pentium’s hardware
cycle counters. The Trimaran framework allows us to define a VLIW machine with 4 integer
units, , 1 branch unit and 1 load/store unit. We obtain the cycle-accurate measure of
the all-software solution based on the simulator reports of the Trimaran framework.
Table 3.1 shows the number of candidate kernels for each benchmark and the average
running times of the implemented searches for all benchmarks. These values are obtained
by running the experiments with the varying input parameters described in Section 3.3.2.
This table demonstrates the infeasibility of the exhaustive search approach. The running
time increases exponentially with the number of kernels: even though cjpeg has the smallest
compressed trace among all the benchmarks, its exhaustive search took close to 5 hours on
average. Table 3.1 shows the average running time of the Hill-Climb search and Tabu search
as well. It should be noted that the length of the trace dominates the running time when the
number of kernels is the same. We can conclude this by observing that the running time
of dh is longer than that of djpeg even though the number of kernels is the same. Our
experiments show that Hill-Climb is able to find the optimal design point more than 90%

























































Figure 3.7: Optimal speedups plotted against increasing reconfiguration time.
3.3.2 Scaling Hardware Resources and Reconfiguration Time
Both Figure 3.6 and Figure 3.7 plot the results of the exhaustive search in order to give an
idea of the design space. The lines on the graphs are labelled with the benchmark
name and the plotted points with shapes that indicate the type of the solution. For example,
in Figure 3.6, benchmark dh yields an SM partition under a resource constraint of
5K slices and then a DS partition under a resource constraint of 6K slices. Beyond the
resource constraint of 16K slices, dh is optimally implemented with SM. It should be noted
that while many of the plotted points show DM to be the optimal partition, the kernels
included in the partition are not uniformly the same for the same benchmark. Sometimes,
as resource constraints increase or decrease, certain kernels have to be moved to software or
hardware, though the resulting partition is still ostensibly DM.
Figure 3.6 plots the speedup of the optimal design point, found through exhaustive search,
against increasing resource area. The reconfiguration time is set at 10 µs. We observe
that placing multiple kernels in hardware yields the optimal results in most cases
except for dh. For the dh benchmark, if the resources available go beyond 15K slices,
SM will give a better speedup than DM. This is because when the area is large enough
to be shared by all the kernels, we no longer gain from dynamic reconfiguration. If the
resources available decrease below 7K slices, DS and SM give better speedups. This could
be because the available resources become too small to hold multiple kernels. It should be
mentioned that though the graph shows DM to be an optimal design point most of the time,
the partition solutions for each benchmark are not the same throughout.
Figure 3.7 plots the speedup of the optimal design point as the reconfiguration time
increases. The area is fixed at 10,000 logic slices. The speedups of cjpeg and djpeg remain
almost constant because the optimal design point is a partition that incurs only a small
reconfiguration cost while still achieving the speedup. As a result, the change in
the reconfiguration cost is insufficient to alter the optimal design point. For the dh
benchmark, if the reconfiguration time is small, it will still employ dynamic reconfiguration
with multiple kernels. The trade-off between more kernels and reconfiguration cost comes in
when the reconfiguration time increases beyond 15 µs.
3.3.3 Impact of Using SEQUITUR and Label Extensions
In order to demonstrate the difference made when the SEQUITUR compressed trace and
label extensions are used, we implemented -trace and -seq versions of the Tabu and
Hill-Climb search. The -trace version traverses the uncompressed loop trace stored in
memory to compute the reconfiguration cost of a design point. The -seq version traverses
the compressed SEQUITUR loop trace and calculates reconfiguration cost without the label
extensions, i.e. using the technique described in section 3.2.1.3 every time a design point
is evaluated. Table 3.2 shows the slowdowns of these implementations compared
to the Tabu and Hill-Climb searches that employ both the SEQUITUR compressed trace
and the neighborhood relationship. The slowdown is significant. Although using the -seq
version gives about an order of magnitude of speedup compared with the -trace version,
employing the neighborhood relationship makes the search a further order of magnitude
faster in general, except in the case of Hill-Climb search for cjpeg.
Benchmark   tabu-trace   tabu-seq   hc-trace   hc-seq
cjpeg       97.24x       10.34x     10.12x     1.13x
djpeg       21.77x       8.57x      6.20x      13.92x
dh          45.35x       11.06x     31.62x     8.33x
mpegenc     439.18x      18.72x     242.92x    9.29x
Table 3.2: Experiment results showing how many times Tabu and Hill-Climb slowed down
when not using SEQUITUR and the neighborhood relationship.
3.4 Summary
In this chapter, we considered the problem of exploring the design space of dynamically
reconfigurable SoCs for spatial and temporal partitioning. Specifically, we proposed a
means of speeding up neighborhood searches of such design spaces by a novel method of
evaluating the design points near the current one using a compressed trace. We showed that
our technique works for both Hill-Climb and Tabu search. On four benchmarks, we found
that using our neighboring design point computation method, the searches were faster by
up to two orders of magnitude while reporting near-optimal solutions most of the time.
The framework we proposed is generic. It can be easily extended to apply to any gran-




Optimal Scheduling of Hardware
Reconfigurations
The design space search framework presented in Chapter 3 has assumed an rFPGA archi-
tecture where the configuration time is not reduced through configuring the FPGA in par-
allel with application execution. In this chapter, we consider the configuration scheduling
problem for pFPGAs.
A typical architecture that we consider here is shown in Figure 4.1. One of the key
challenges in achieving real speedups using such an architecture is that hardware
reconfiguration of today’s massive pFPGAs can be very costly. It often takes thousands, if not
hundreds of thousands, of clock cycles to reconfigure. If this high reconfiguration cost
cannot be reduced, then all benefits of hardware acceleration may be lost as the application has
to wait for reconfiguration to complete. Configuration prefetching [26] seeks to address
this problem by overlapping (partial) reconfiguration with the execution of the application
in pFPGA. However, a prefetch miss is costly because of the additional reconfigurations
that may be needed to recover from the miss. Therefore, the scheduling of reconfiguration
is crucial.
In this context, this chapter solves the following problem. Given a sequence (trace) of
actors, where an actor is an invocation of a hardware module:
• Determine whether, for a given actor in the trace, it is necessary to schedule a
reconfiguration task before it. This will depend on whether part or all of the resources
required by the hardware module are currently being used by another (different)
module, in other words, on whether the two modules’ placements overlap.
• Compute the earliest possible time a required reconfiguration task may be scheduled.
For the current technology, at most one reconfiguration task is typically allowed at
any time.
In essence, we will present a polynomial-time (in terms of the length of the trace and the
number of distinct hardware modules) algorithm that schedules all the required reconfig-
urations such that the overall execution time (latency) of the given actor trace is provably
minimized. To the best of our knowledge, this is the first time an algorithm of this nature
has been proposed.
In the following, we present an overview of the contents of the chapter. Section 4.1
is a preliminary description of our architectural and scheduling model. Next, in Section
4.2, we classify the dependencies between actor invocations into (a) data dependencies, (b)
resource conflicts, and (c) reconfiguration dependencies. Our scheduling
algorithm is then described in Section 4.3. In the same section, we will sketch the proof
of its optimality. Finally, in Section 4.4, we provide a detailed case study to illustrate the




We consider an architecture with one micro-processor that receives a trace of actors (Figure
4.1). For each actor, we assume a corresponding hardware accelerator module that may be









Figure 4.1: Architecture model: A CPU (left) controlling the reconfiguration interface of
an FPGA (right) used as a hardware accelerator for an incoming task sequence.
4.1.2 Scheduling Model
Example 4.1.1 Figure 4.2 shows an example of a given application consisting of a
sequence of five actors (corresponding to four tasks) with data dependencies, and a given
conflict relation concerning the shared use of FPGA resources. For example, the fact that
task B conflicts with task C means that they share some common hardware resources on the
FPGA, which may be I/O pins, memory resources (such as block RAMs), or slices.
Set of tasks = {A, B, C, D, Td}
B conflicts with C; C conflicts with D; Td conflicts with A, B, C and D
Sequence of actors a0 to a5: a0=Td, a1=B, a2=C, a3=C, a4=A, a5=D
Figure 4.2: Example of actor trace.
Assume for now that this conflict relation is given statically, i.e., no module relocation is
allowed. Thus we know the conflicts between every pair of actors at compile time. More
formally, we define an actor trace and the corresponding conflicts as follows:
1. Trace of actors: Sa = (a0,a1,a2, . . . ,an) with ai ∈ T for i ≠ 0. T is a set of tasks, and
|T| = N, where N is the number of tasks. We define a dummy task Td that always
corresponds to the dummy actor a0. Td takes zero execution time and it can always
be inserted at the beginning of any sequence of actors without loss of generality.
2. Resource conflicts: The relation C = {(Ti,Tj)|Ti  Tj} denotes that the placement of
Ti conflicts with placement of Tj. Td is by definition in conflict with all the members
of T .
3. Any actor ai ∈ Sa can only be scheduled for execution on the FPGA if all its preceding
tasks have completed execution. Furthermore, if the corresponding module is not
yet resident on the FPGA, it needs to be loaded, i.e., the corresponding resources
reconfigured, prior to execution.
4.2 Problem Formulation
Before defining the scheduling problem, we need to distinguish three different types of
dependencies: The first one, data dependencies, is obvious. The second is the conflict
relation introduced above that is due to the sharing of FPGA resources among the hardware
modules. Finally, the third kind of dependency arises because some actors cannot begin
execution until their corresponding reconfiguration tasks are completed. In order to compute
this, we first need to discuss the problem of reconfiguration task generation.
4.2.1 Reconfiguration Task Generation
Definition 4.2.1 (True dependence) Given a sequence of actors Sa, ai is called truly
dependent on aj, written aj ≺ ai, iff
∄k, j < k < i : (ak  ai) ∧ (∀k′, k < k′ < i : ak ≠ ai)
True dependence is based on the intuition that, for an actor ai of task t ∈ T , not every
occurrence of conflicting predecessors in the trace matters. It is the conflicting predeces-
sor ak that is closest to ai that will have an impact on the reconfiguration decision for ai.
Furthermore, ai must be the first actor of task type t in the trace subsequent to ak.
Example 4.2.1 In Figure 4.2, a2 is truly dependent on a1 but a3 has no true dependence
because it executes after another actor, a2, of the same task. Also, because task A does not
conflict with any task except Td, actor a4 is truly dependent only on actor a0.
Now, each first appearance of a task in a trace will also necessitate exactly one recon-
figuration task. Hence, the set of required reconfiguration tasks Sr = (r1, . . . ,rl) may be
found by inspecting the given trace once.1
Theorem 4.2.1 (Reconfiguration task instantiation) For an actor ai in a given trace Sa,
there needs to be a corresponding reconfiguration actor (task) ri if, and only if, ∃aj ∈ Sa :
aj ≺ ai, that is, if there exists a predecessor aj in Sa on which ai is truly dependent.
1Note that the subscripts of reconfiguration tasks in Sr are in increasing order but not
necessarily consecutive, as they correspond to the subscripts of the associated actors, and not
all actors need reconfiguration.
For each reconfiguration task ri, two additional dependencies must be created. First,
each ri must complete before the corresponding actor ai starts executing. Second, for aj
such that aj ≺ ai, reconfiguration task ri for ai cannot start earlier than the completion
of aj, because ri affects the execution of aj. The two dependencies are represented by
adding an outgoing edge from ri to ai and an incoming edge from aj to ri.
Example 4.2.2 Figure 4.3 shows the set of reconfiguration tasks generated for the running
example as introduced in Example 4.1.1 as well as the additional scheduling dependencies
in Figure 4.4. Note that actor a3 does not induce a reconfiguration task to be created
because it is preceded by a2, an actor of the same task. It should be noted that while r1 and
r4 should be preceded by a0, the constraint is not enforced in practice since Td has zero
execution time.
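[Figure: the actor trace a0=Td, a1=B, a2=C, a3=C, a4=A, a5=D with the reconfiguration
tasks r1, r2, r4 and r5.]
Figure 4.3: Reconfiguration task generation.
[Figure: the same actor trace and reconfiguration tasks with the dependence edges added.]
Figure 4.4: Dependence relations.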
In summary, we have to consider the following three types of dependencies for schedul-
ing after having all the required reconfiguration tasks generated:
• Sequential precedence:
Ps = {(ai, aj) | (0 ≤ i ≤ n−1) ∧ (j = i+1)};
• Conflict (resource) precedence:
Pc = {(aj, ri) | aj ≺ ai}; and
• Reconfiguration precedence:
Pr = {(ri, ai) | ri ∈ Sr}.
The complete dependence relation is thus P = Ps∪Pr∪Pc.
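For the running example, the true dependences and the relation P can be generated with a short sketch. It follows the intuition given after Definition 4.2.1: scanning backwards from ai, the first actor of the same task means that no reconfiguration is needed, while the first conflicting actor is the one on which ai is truly dependent. Treating the conflict relation as symmetric and encoding actors and reconfiguration tasks as ("a", i) and ("r", i) pairs are assumptions of the sketch.

```python
def true_dependences(trace, conflicts):
    """Map each actor index i to the index j of the actor it truly depends on."""
    def in_conflict(t, u):  # the conflict relation is assumed to be symmetric
        return (t, u) in conflicts or (u, t) in conflicts

    deps = {}
    for i in range(1, len(trace)):
        for k in range(i - 1, -1, -1):
            if trace[k] == trace[i]:
                break  # same task seen first: module is still resident
            if in_conflict(trace[k], trace[i]):
                deps[i] = k  # closest conflicting predecessor
                break
    return deps


def dependence_relation(trace, conflicts):
    """Return (deps, P) with P the union of Ps, Pc and Pr."""
    deps = true_dependences(trace, conflicts)
    n = len(trace) - 1
    ps = {(("a", i), ("a", i + 1)) for i in range(n)}
    pc = {(("a", j), ("r", i)) for i, j in deps.items()}
    pr = {(("r", i), ("a", i)) for i in deps}
    return deps, ps | pc | pr


# Running example of Figure 4.2.
trace = ["Td", "B", "C", "C", "A", "D"]
conflicts = {("B", "C"), ("C", "D")} | {("Td", t) for t in "ABCD"}
deps, P = dependence_relation(trace, conflicts)
```

The keys of `deps` are the subscripts of the generated reconfiguration tasks, here r1, r2, r4 and r5. The edges from a0 to r1 and r4 are kept in P for completeness even though they impose no constraint in practice.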
4.2.2 Minimizing Schedule Length
Given the above, we are now in a position to state the scheduling problem formally. The
following notation will be used throughout this chapter:
• l(ai): latency of actor ai
• s(ai): the start time of actor ai
• f (ai): the end time of actor ai
• l(ri): latency of reconfiguration task ri
• s(ri): the start time of reconfiguration task ri
• f (ri): the finishing time of reconfiguration task ri
Definition 4.2.2 (Feasible schedule) A feasible schedule is an assignment of end times
f (ai) and f (ri), respectively, to every actor ai ∈ Sa and reconfiguration task ri ∈ Sr such that
all the above-mentioned precedence constraints are satisfied, i.e., for all (Xi,Xj) ∈ P,
s(Xj) ≥ f(Xi).
Example 4.2.3 Figure 4.5 shows an example of a feasible schedule for five actors a1, a2, a3, a4, a5.
[Figure: feasible schedule; latencies used:]
l(a1)=20, l(a2)=l(a3)=10, l(a4)=25, l(a5)=15,
l(r1)=15, l(r2)=20, l(r4)=35, l(r5)=30
Figure 4.5: Feasible schedule for the problem introduced in Example 4.2.3.
Obviously, the reconfiguration interface may be regarded as a separate resource. The
aim of a scheduling algorithm for this problem is to find a feasible schedule where f (an) is
minimized for a trace of actors Sa = (a0,a1,a2, . . . ,an).
4.3 Algorithm MLS
We shall now present the main result of this chapter, namely a polynomial time, latency-
optimal scheduling algorithm for actors and reconfiguration tasks that we call Modified List
Scheduling (MLS). The algorithm assumes that reconfiguration tasks can be pre-empted
and resumed later. This is based on the way frame-based reconfigurable devices operate.
Configuration for frame-based devices such as Xilinx FPGAs is achieved by writing a set
of frames into the SRAM configuration memory of the device. It does not matter whether
the reconfiguration process is carried out in 1, 2, or more phases as long as the affected
area is not again rewritten by other module configurations in between. Also, the algorithm
prioritizes reconfiguration tasks by the order of appearance of their corresponding actors in
the actor trace.
The MLS algorithm is shown in Algorithm 1. It consists mainly of two passes through
the actor trace. In the first pass, the algorithm seeks to discover true dependences between
the actors and generate the corresponding reconfiguration tasks Sr. To do this, we maintain
a flag f_t for each task t ∈ T and an index prev_t. We traverse the trace from a1 to an.
Assume that ai is the current actor. If flag f_ai is true, a corresponding reconfiguration
task ri will be created, and if prev_t ≠ −1, ri is to be preceded by actor a_prev_t (i.e., truly
dependent on a_prev_t). prev_t = −1 when the reconfiguration task created is needed for the
first occurrence of ai. Furthermore, we record all ready reconfiguration tasks in a heap
data structure H, ordered by the relative appearance order of the associated actor in the
actor trace. In order to facilitate the preemptive scheduling of reconfiguration tasks, we
maintain a TimeRemaining attribute for each of the tasks and this is initialized to the full
reconfiguration latency required.
The second pass through the trace computes the actual scheduling time using preemptive
scheduling of reconfiguration tasks. currentA is the current ready actor. In the case
when there is no ready actor, we schedule a ready reconfiguration task r whose associated
actor has the earliest appearance order in the actor trace. Otherwise, we schedule actor
currentA. In the time l(currentA), we try to schedule as many reconfiguration tasks
sequentially as possible to configure the FPGA in parallel with the execution of currentA.
However, the time offered by the scheduled actor may not be enough to cover the
TimeRemaining of r. Such r’s are inserted back into H with updated TimeRemaining. The
algorithm terminates when the last actor an is scheduled.
Algorithm 1: MLS algorithm.
Input: Trace of actors: Sa;
       Set of conflicting hardware modules: C;
       Set of tasks: T;
Result: Optimal schedule length
ForAll (f_t, prev_t : t ∈ T) f_t ← true; prev_t ← −1;
for ai ← a1 to an : ai ∈ Sa do
    if f_ai is true then
        CreateReconfigurationTask(ri);
        ri.TimeRemaining ← l(ri);
        if prev_ai ≠ −1 then AddEdge(a_prev_ai, ri);
        AddEdge(ri, ai);
        if ri has no preceding tasks then Insert(H, ri);
        f_ai ← false;
    ForAll (t ∈ T) if (t, ai) ∈ C then f_t ← true; prev_t ← i;
if TaskReady(a1) then currentA ← a1; else currentA ← empty;
length ← 0;
while currentA ≠ an do
    if currentA is empty then
        r ← ExtractMax(H);
        length ← length + r.TimeRemaining;
        currentA ← NextTask(r);
    else
        length ← length + l(currentA);
        T ← l(currentA);
        while H not empty ∧ T ≠ 0 do
            r ← ExtractMax(H);
            if r.TimeRemaining < T then T ← T − r.TimeRemaining;
            else
                r.TimeRemaining ← r.TimeRemaining − T;
                T ← 0;
                Insert(H, r);
        ForAll (r ∈ DependsOn(currentA)) Insert(H, r);
        if TaskReady(NextTask(currentA)) then
            currentA ← NextTask(currentA);














Figure 4.6: Example of optimal feasible schedule produced by MLS.
4.3.1 Bubbles in the Reconfiguration Schedule
Before describing the optimality proof of our algorithm, we will illustrate a key idea used
in the proof by means of an example.
Example 4.3.1 Figure 4.6 shows the result of applying the MLS algorithm to our running
example. In the schedule of the reconfiguration tasks, we can see ‘bubbles’, i.e., time inter-
vals in which there are no reconfiguration tasks occupying the reconfiguration interface of
the FPGA. By the greedy nature of list scheduling, if a ready task exists, it will always be
scheduled. Therefore, a bubble can only exist at a time instant tb when there are no ready
reconfiguration tasks to be scheduled at tb. In the example, a bubble exists between r4 and
r5 because there are no ready tasks before the completion of a3. The figure also shows the
preemption of reconfiguration task r4. r4 is preempted at time 35 before being continued at
time 55.
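The two passes of MLS can be sketched in C as follows. This is an illustrative sketch rather than the implementation used in the thesis: a linear scan over the trace stands in for the heap H, the flag of a task is assumed to be cleared once a reconfiguration task has been created for it, a task with no conflicting predecessor is treated as ready at time 0, and the names mls_length, MAX_ACTORS and MAX_TASKS are hypothetical.

```c
#include <stdbool.h>

#define MAX_ACTORS 64   /* assumed upper bound on the trace length n */
#define MAX_TASKS  16   /* assumed upper bound on the number of tasks */

/* task[i]      : task id executed by actor a_i, for i = 1..n (index 0 unused)
 * exec_lat[t]  : l(a_i) for actors of task t
 * recon_lat[t] : l(r_i) for reconfigurations of task t
 * conflict     : symmetric conflict matrix C over task ids
 * Returns f(a_n), or -1 if n exceeds MAX_ACTORS. */
long mls_length(int n, const int task[], const long exec_lat[],
                const long recon_lat[], int num_tasks,
                bool conflict[MAX_TASKS][MAX_TASKS])
{
    bool need[MAX_TASKS];            /* flag f_t */
    int  prev[MAX_TASKS];            /* index prev_t */
    long remaining[MAX_ACTORS + 1];  /* TimeRemaining of r_i, -1 if no r_i */
    int  pred[MAX_ACTORS + 1];       /* actor that r_i has to wait for */

    if (n > MAX_ACTORS) return -1;
    for (int t = 0; t < num_tasks; ++t) { need[t] = true; prev[t] = -1; }

    /* Pass 1: generate reconfiguration tasks and their dependences. */
    for (int i = 1; i <= n; ++i) {
        int t = task[i];
        remaining[i] = -1;
        pred[i] = -1;
        if (need[t]) {
            remaining[i] = recon_lat[t];
            pred[i] = prev[t];
            need[t] = false;  /* assumption: t stays loaded until evicted */
        }
        for (int u = 0; u < num_tasks; ++u)
            if (conflict[u][t]) { need[u] = true; prev[u] = i; }
    }

    /* Pass 2: walk the trace, overlapping reconfiguration with execution. */
    long length = 0;
    for (int cur = 1; cur <= n; ++cur) {
        if (remaining[cur] > 0) {    /* a_cur has to wait for the rest of r_cur */
            length += remaining[cur];
            remaining[cur] = 0;
        }
        long budget = exec_lat[task[cur]];
        length += budget;
        /* While a_cur executes, serve ready tasks in trace order. */
        for (int j = cur + 1; j <= n && budget > 0; ++j) {
            if (remaining[j] <= 0 || pred[j] >= cur) continue;  /* done or not ready */
            long used = remaining[j] < budget ? remaining[j] : budget;
            remaining[j] -= used;    /* partial progress models preemption */
            budget -= used;
        }
    }
    return length;
}
```

As a usage example, take the trace B, C, C, A, D with the latencies of Figure 4.5 and assume the conflict set {B-C, C-D}. This conflict set is an assumption chosen to agree with the description in Example 4.3.1; it is not read from the figure. The sketch then serves r4 during a1, preempts it at time 35 in favour of r2, resumes it at time 55, leaves a bubble until a3 completes, and returns a schedule length of 120.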
4.3.2 Proof of Optimality
We shall now present the induction proof for the optimality of the MLS algorithm. Note
that while in general list scheduling is but a heuristic, in the case of MLS, the order of an
actor’s appearance in the schedule is pre-determined by the trace. Furthermore, due to the



















[Figure: panels (a) Base case and (c) Inductive step: an has reconfiguration task rn,
annotated with the leftover of rn from filling the bubbles]
Figure 4.7: Induction proof cases for MLS.
by any other once the actor trace and conflicts are known. These two facts combine to
guarantee the optimality of MLS.
Theorem 4.3.1 (Optimality) Given a trace of actors Sa of length n with the required
reconfiguration tasks introduced in accordance with Theorem 4.2.1 and the corresponding
dependencies added, the MLS algorithm with (a) preemption of reconfiguration tasks, and (b)
task priority given by the number of successor tasks in the precedence graph will yield a
schedule with the smallest possible f(an).
Proof.
We show Theorem 4.3.1 by induction on the length of the actor trace Sa.
Base case: |Sa|−1 = n = 1. Figure 4.7(a) shows the base case. The dummy actor a0 = Td
that has zero latency is not shown. r1 followed by a1 is the optimal schedule. MLS would
trivially generate exactly this schedule.
Inductive Step:
Consider now a trace of length n > 1 and assume that f (ai) is minimal for 1 ≤ i < n. We
need to show that when an is considered: (1) the resulting schedule created by including an
is still optimal, and (2) that this schedule can be computed by MLS.
There are 2 possibilities:
• an has no reconfiguration task as a predecessor.
This trivial case is shown in Figure 4.7(b). an is simply appended to the actor sched-
ule. Since f (an−1) is optimal, f (an) is also optimal since there is no bubble between
an−1 and an. It is easy to see that MLS will yield exactly this schedule.
• an has a preceding reconfiguration task rn.
There must then exist an actor ai,0 ≤ i < n such that ai ≺ an. Observe from the
assumption made in the induction step that f (ai) is minimal, i.e., ai cannot be sched-
uled earlier. Due to the precedence constraints and inductive assumption, rn cannot
be scheduled earlier than f (ai). So, the time to schedule rn is in the interval from
f (ai) to s(an). Now preemption allows us to schedule and split the latency l(rn) of
the reconfiguration task rn to fill up all the bubbles between f (ai) and f (an−1) left
in the reconfiguration schedule. Any leftover time of l(rn) will be simply scheduled
after f (an−1). Figure 4.7(c) shows this case. The bubbles between f (ai) and f (an−1)
exist because there is no reconfiguration task rj, j < n, that can use them. Either
the total length of these bubbles is (a) completely sufficient to absorb l(rn), or (b)
insufficient and hence there is some amount of outstanding l(rn). For the former, the
schedule is optimal since all we need to do is to append an immediately after an−1.
For the latter, the schedule obtained by first scheduling the remainder of rn followed
immediately by the start of an is also optimal. Finally, note that scheduling rn using
MLS will not interfere with or preempt other previously scheduled reconfiguration tasks
rj, j < n, as these have higher priorities, and only those bubbles starting from
f(ai) (unusable by any rj, j < n) are used for rn.
Thus, we have shown that if the inductive hypothesis holds, the resulting schedule for Sa of
length n is optimal. Since the base case is true, the property holds for traces Sa
of any length n ≥ 1.
4.3.3 Further Clarifications
The optimality of the proposed algorithm is constrained by 2 factors: 1) available hardware
resources on the FPGA and 2) the conflict relationships between the tasks. Furthermore, it
is important to note that while the pre-emption of reconfiguration tasks frees the reconfig-
uration port so that other reconfiguration tasks can proceed, the configured resources (e.g.
frames, columns) are not freed to be occupied by other tasks. We shall further
clarify the workings of the algorithm with one pathological case here.
In this pathological case, all tasks are placed on the FPGA. Each task requires significant
hardware resources for implementation and thus every task more or less occupies
the whole FPGA. In this case, every task conflicts with every other task. An analogous
situation would be the case where the hardware resources available are so few that all tasks
placed on the FPGA are forced to be in conflict with one another. Our algorithm will still
return an optimal schedule in this case. It should be noted that since each task is in
resource conflict with all other tasks, it is not possible to configure the FPGA in parallel
with the execution of the tasks. The only, and hence optimal, schedule in this case is
to configure the FPGA each time a different task is to be executed.
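As a worked illustration of this fully serialized case, suppose (purely as an assumption for illustration) that the four tasks of the running example conflict pairwise and that the latencies of Figure 4.5 apply. No reconfiguration can then overlap with execution, so the schedule length is the sum of all reconfiguration and execution latencies:

```latex
f(a_5) \;=\; \sum_{i \in \{1,2,4,5\}} l(r_i) \;+\; \sum_{i=1}^{5} l(a_i)
       \;=\; (15 + 20 + 35 + 30) + (20 + 10 + 10 + 25 + 15)
       \;=\; 180
```

Actor a3 contributes no reconfiguration term because it executes the same task as a2.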
4.4 Case Study
4.4.1 H264-encoder Case Study
We use an H.264 [67] encoder application as a case study of the effectiveness of our
algorithm. Based on profiling, we identified 15 loops that take up most of the computation
time in the application. The hardware implementations of these loops were synthesized
using Xilinx’s ISE. Table 4.3 gives the details of the loops. The loops are named by
their containing functions’ names and identifiers assigned by the compiler. For example,
biari_encode_symbol-6 is loop 6 in function biari_encode_symbol. The target device
for synthesis is the Xilinx Virtex-II XC2V6000 device [72].
Table 4.1 shows the characteristics of the application using two actor traces obtained
with the 15 loops. It shows the length of the actor traces and the number of unique patterns
occurring within the trace. A pattern is a maximal acyclic sequence of actors that occurs
repeatedly in the trace. Two patterns are considered different if they differ in at least one
actor. Intuitively speaking, the more unique patterns there are in the trace, the greater
the adaptability and variation for the given input. We obtained the shorter trace by
encoding one frame and the longer trace by encoding two consecutive frames. The frames
are 704 by 576 pixels in size. All the hardware modules are assumed to be running at a
frequency of 50 MHz.
We tested the effectiveness of Algorithm MLS by observing the scheduling results
for different sets of resource conflicts. Prefetching is beneficial only if there is a sufficient
amount of time between the executions of conflicting hardware modules. Otherwise, there
is no concurrency between execution and reconfiguration of the hardware modules. We
obtained different sets of resource conflicts by allowing a conflict (i.e., placement overlap)
between two actors only if the minimum average number of execution cycles between them
is at least a given threshold number of cycles. Table 4.2 shows the resource conflicts
for thresholds from 700 to 1300 cycles. Obviously, the number of conflicts decreases
when the threshold is increased. We shall report the impact of different numbers of conflicts
on the schedule length in Section 4.4.3.2.
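The threshold rule above can be sketched as follows. This is only an illustration: the matrix gap, holding the minimum average number of execution cycles between the actors of each pair of modules, is a hypothetical input that would come from profiling the trace, and the function name allowed_conflicts is not from the thesis.

```c
#include <stdbool.h>

#define NUM_MODULES 15   /* the case study uses 15 loops */

/* Marks a pair of modules as allowed to conflict (i.e., to overlap in
 * placement) when gap[a][b] is at least `threshold` cycles, and returns
 * the number of unordered pairs that are allowed to conflict. */
int allowed_conflicts(long gap[NUM_MODULES][NUM_MODULES], long threshold,
                      bool allowed[NUM_MODULES][NUM_MODULES])
{
    int count = 0;
    for (int a = 0; a < NUM_MODULES; ++a) {
        allowed[a][a] = false;               /* a module never conflicts with itself */
        for (int b = a + 1; b < NUM_MODULES; ++b) {
            bool ok = gap[a][b] >= threshold;
            allowed[a][b] = allowed[b][a] = ok;
            if (ok) ++count;
        }
    }
    return count;
}
```

Raising the threshold can only remove pairs from the allowed set, which matches the observation that the number of conflicts decreases as the threshold increases.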
Trace   Num. of Frames Encoded   Num. of Actors   Num. of Unique Patterns
Short   1                        35,622,092       52
Long    2                        185,232,537      100
Table 4.1: Characteristics of the two traces.
4.4.2 Experiment Setup
To demonstrate the effectiveness of our approach, we compared it against three algorithms:
two different online Least Mean Square predictors and a simple scheduler.
[Table: rows not recoverable; column headers: Minimum avg. cycles, Num. of ...]
Table 4.2: Resource conflicts.
Simple Scheduler The Simple Scheduler does not attempt any form of prefetching. Instead,
it simply maintains a record of the current FPGA configuration and only schedules
a reconfiguration on demand if the actor to be executed is not yet in the FPGA. It is
reasonable to expect that any prefetching approach should do no worse than the Simple
Scheduler. We therefore used the schedule length computed by the Simple Scheduler as the
baseline for our comparisons.
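The on-demand behaviour of the Simple Scheduler can be sketched in a few lines of C. This is a minimal sketch under stated assumptions: loading a module is assumed to evict every module that conflicts with it, reconfiguration and execution never overlap, and the names simple_schedule_length and MAX_TASKS are hypothetical.

```c
#include <stdbool.h>

#define MAX_TASKS 16   /* assumed upper bound on the number of tasks */

/* task[i]      : task id of the i-th actor in the trace, i = 0..n-1
 * exec_lat[t]  : execution latency of task t
 * recon_lat[t] : reconfiguration latency of task t
 * conflict     : symmetric conflict matrix over task ids
 * Returns the schedule length without any prefetching. */
long simple_schedule_length(int n, const int task[], const long exec_lat[],
                            const long recon_lat[], int num_tasks,
                            bool conflict[MAX_TASKS][MAX_TASKS])
{
    bool loaded[MAX_TASKS] = { false };  /* record of the current FPGA configuration */
    long length = 0;

    for (int i = 0; i < n; ++i) {
        int t = task[i];
        if (!loaded[t]) {                /* reconfigure on demand */
            length += recon_lat[t];
            for (int u = 0; u < num_tasks; ++u)
                if (conflict[t][u])
                    loaded[u] = false;   /* conflicting modules are overwritten */
            loaded[t] = true;
        }
        length += exec_lat[t];
    }
    return length;
}
```

For a trace A, B, A with reconfiguration latency 10 and execution latency 5 for both tasks, the sketch yields 35 when A and B do not conflict (two reconfigurations) and 45 when they do (three reconfigurations).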
Least Mean Square Online Predictor A (LMSA-a) This is an online predictor that is
similar to that described in [42, 11]. The Least Mean Square Filter is used as the predictor
function. However, because the target FPGA architecture considered in this thesis is differ-
ent (their architecture [13] supports relocation and defragmentation), our approach does not
use the priority function that is based on the configuration sizes and the different eviction
policies. Rather, the hardware modules evicted are those in conflict with the module
currently being prefetched.
Least Mean Square Online Predictor B (LMSA-b) This is a modification of LMSA-a.
Instead of predicting the next hardware task, the algorithm predicts the next conflicting
task. An important difference here is that instead of keeping historical information for only
the currently executing task, LMSA-b requires the information for all tasks to be kept in
order to predict the next conflicting task.
Loop Name                            Num. of   Num. of Cycles   Num. of Cycles
(func_name-loop_id)                  Slices    (long trace)     (short trace)
biari_encode_symbol-1                1552      1,787,969,064    393,559,992
biari_encode_symbol-6                1486      458,442,739      103,953,560
dct_luma-1                           1597      979,861,500      159,476,940
dct_luma-3                           3316      8,100,188,400    1,318,342,704
dct_luma-4                           1428      740,339,800      120,493,688
dct_luma-5                           3314      2,917,809,800    474,886,888
dct_luma-8                           1052      152,422,900      24,807,524
dct_luma-9                           1388      609,691,600      99,230,096
Mode_Decision_for_4x4IntraBlocks-4   1234      106,317,960      21,263,592
Mode_Decision_for_4x4IntraBlocks-5   1472      567,029,120      113,405,824
reset_coding_state-1                 1268      18,720,876       3,593,348
RDCost_for_4x4IntraBlocks-1          1222      106,317,960      21,263,592
RDCost_for_4x4IntraBlocks-2          1351      425,271,840      85,054,368
writeLumaCoeff4x4_CABAC-1            1812      466,152,012      107,780,832
write_significant_coefficients-1     1985      1,674,754,726    365,849,786
Table 4.3: Hardware modules, the hardware area occupied and the number of cycles taken
up in the application.
4.4.3 Experimental Results
4.4.3.1 Scaling the Reconfiguration Overhead
We seek to show the effect of increasing configuration overhead on the schedule length.
For the FPGA we used, empirical measurements showed a configuration time of about
400 µs per CLB column. With such a high overhead, most applications stand to gain little
from hardware acceleration via dynamic reconfiguration. Fortunately, recent works [43, 12]
have shown that the overhead of configuring partial bitstreams can be potentially reduced
by factors of 20 or more. In particular, assuming the geometry of a Xilinx device, we ran















[Figure: x-axis values 1, 5, 10, 20; series: MLS, LMSA-a and LMSA-b, each for the short and the long trace]
Figure 4.8: Speedup over baseline plotted against increasing reconfiguration time.
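To give a sense of scale for these overheads, the per-column configuration time can be converted into clock cycles. The conversion below assumes that cycles are counted at the 50 MHz clock of the hardware modules; that pairing of clock and cycle count is an assumption made here for illustration.

```latex
400\,\mu\mathrm{s} \times 50\,\mathrm{MHz} = 20{,}000 \ \text{cycles per CLB column},
\qquad
\frac{400\,\mu\mathrm{s}}{20} = 20\,\mu\mathrm{s}
\;\Rightarrow\;
20\,\mu\mathrm{s} \times 50\,\mathrm{MHz} = 1{,}000 \ \text{cycles per CLB column}
```

Under this assumption, a 20-fold reduction brings the cost of configuring one column down to the order of the conflict thresholds of 700 to 1300 cycles considered in the case study.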
Figure 4.8 shows the performance increase of the different approaches over the schedule
produced by the Simple Scheduler. The threshold of the minimum average execution
cycles between two conflicting hardware modules is set to 1000 cycles for this experiment.
We observe that as reconfiguration speed decreases, the performance gain achieved by all
the approaches decreases. With a high reconfiguration overhead, execution just has to wait
till reconfiguration completes. The single reconfiguration port also becomes a bottleneck.
Over the range of reconfiguration overheads we considered, the schedule produced by MLS
outperforms the others in every case. At best, it can be 30 percent better than those pro-
duced by the other schemes.
Another interesting observation is that LMSA-b performs better than LMSA-a in gen-
eral. This could be because predicting the next conflicting hardware module gives prefetch-
ing more time and a miss in the prefetching is less costly. LMSA-a and LMSA-b also per-
form worse in the longer trace because the execution order is more complex, as shown in
the higher number of unique patterns. Another interesting note is that LMSA-a can perform
worse than the Simple Scheduler. Prefetch misses can sometimes increase the number of
reconfigurations beyond what is normally needed because incorrectly predicted prefetches
can evict hardware modules which are otherwise not evicted by the Simple Scheduler.
4.4.3.2 Scaling the Number of Conflicts
Figure 4.9 shows the performance of the different approaches under the different conflict
sets listed in Table 4.2. We set the reconfiguration speed to be 10µs per CLB column for
this experiment. We observe that as the number of conflicts decreases, the performance
gain of the schedules increases for MLS. However, the same cannot be said of LMSA-a
and LMSA-b. We attribute this to the fact that different conflict sets can cause different
mispredictions for the same actor trace. MLS outperforms LMSA-a and LMSA-b in all
cases. In the case of the long trace, the difference between the schedules is significant.
This shows that the more complex the execution order, the more difficult it is for the online
prefetcher to yield a good schedule. Interestingly, in the longer trace, the LMSA-a performs













[Figure: x-axis values 700, 800, 900, 1000, 1100, 1200, 1300; series: MLS, LMSA-a and LMSA-b, each for the short and the long trace]
Figure 4.9: Speedup over baseline plotted against decreasing number of conflicts.
4.5 Summary
In this chapter, we presented an algorithm for the scheduling of reconfiguration tasks for
FPGA-based hardware acceleration at the electronic system level. For a given trace of
complex computational kernels for which there exist hardware accelerators, we classify
the dependencies between actors into (a) data dependencies, (b) resource dependencies
(conflicts), and (c) reconfiguration dependencies. Our algorithm inspects each actor in the
trace and determines whether a reconfiguration task is needed and, if so, schedules such
a task in accordance with the given dependencies. We provided a proof that the algorithm
always yields the optimal result in terms of the overall execution time (latency) of the
trace. Furthermore, it is polynomial in the length of the given trace of actor activations.
A realistic case study using the H.264 encoder has been provided to show the benefits and
sensitivity of the results.
We have assumed that the conflict relations between actors are given. Conflicts depend
on the placement of the corresponding hardware modules in the FPGA. In the future,
we would like to extend and optimize the following scenario as well: assuming that an actor
may be executed either in software or has several instances that can be placed onto different
locations in the reconfigurable fabric, we would like to find the best implementation
and placement decisions that will optimize the overall execution time of the application.
Due to its polynomial nature, our algorithm scales better than previously proposed ILP
approaches. Nonetheless, the analysis of a long trace is still quite time-consuming. Hence,
we would like to investigate heuristics that work at the level of the (static) control flow
graphs of the application. While such heuristics may not always produce the optimal so-
lutions, they may yield solutions that are “good enough” in practice, without the need for





In general, hardware implementations of computation are faster than the equivalent
computation performed on general-purpose processors. However, the speedup obtained by
computation executed in FPGAs is offset by the huge reconfiguration latency required to
configure the FPGA at run-time. Chapters 1 and 2 have shown that pFPGAs provide
an additional opportunity for this reconfiguration to occur in parallel with both hardware
and software execution. Thus reconfiguration scheduling becomes critical for the reduction
of the reconfiguration overhead. The reconfiguration schedule needs to maximize the
parallelism between the necessary reconfigurations and the execution, both hardware and
software, of the application.
The configuration scheduling problem is made complicated by the fact that the hardware
modules that are to be executed may compete with each other for resources on the
FPGA. Such modules are said to be in ‘conflict’. In cases where two hardware modules
A and B conflict with each other and module A is already loaded in the FPGA, it follows
that it is necessary to load B into hardware before B is executed. Therefore, whether a
reconfiguration is necessary depends on the conflict relationships between the hardware
modules.
In this chapter, we propose an algorithm that provides appropriate cues for the compiler
to insert configuration commands into the control flow graph of the application so that
the overall execution time is minimized. This algorithm relies on characteristics of the
compiled software application and the conflict information of the hardware modules to
achieve this goal.
The chapter is organized as follows. Section 5.1 gives the background information that
forms the context of the problem we are solving. After illustrating our motivation with two
examples in Section 5.2, we present the problem formulation in Section 5.3. Section 5.4
describes the proposed algorithm. In Section 5.5, we present the experimental results before
concluding in Section 5.6.
5.1 Background
5.1.1 Architecture Model
We consider the architecture model as shown in Figure 5.1. The model is realistic for
architectures such as the Xilinx Virtex family of FPGAs, especially Virtex-II Pro, Virtex-4 and
Virtex-5. We show the major components of interest in Figure 5.1. Memory is where the software
code and data are stored, together with the bitstreams to be loaded onto the reconfigurable
region. The CPU is the main controller of application execution and is responsible for
initiating reconfiguration of the reconfigurable region. The reconfiguration manager is a






















Figure 5.1: Architecture model for interprocedural placement-aware configuration schedul-
ing.
The reconfigurable region is where the hardware modules of the application are exe-
cuted. It contains n slots where hardware modules can be placed. Each hardware module
must be placed on contiguous slots within the reconfigurable region. Through the bridge
interface, the hardware modules can read the memory in bursts and share the same address
space as the CPU. Although it is possible for hardware modules to be relocated during
run-time, it could be computationally expensive because the bitstreams for relocation need to
be generated at run-time. As such, the placement of the hardware modules is decided
at design time. We consider any two placements of the hardware modules that overlap
with each other to be in ‘physical placement conflict’ (or just ‘conflict’ for the rest of the
chapter). For example, if one hardware module is placed on slots 1 and 2 and another
hardware module is placed on slots 2 and 3, these two hardware modules are in conflict.
Conflicting hardware modules cannot be loaded and run on the reconfigurable region
simultaneously.
Informally, we aim to minimize the execution time of a single, sequential application
for this platform. The application consists of a combination of a program and m hardware
modules. These hardware modules are required to be loaded on the reconfigurable region
prior to their execution.
5.1.2 Reconfiguration Library Support
The architecture described above supports the preemption and subsequent resumption of
the reconfiguration of a hardware module. This is based on the insight that frame-based
devices such as Xilinx FPGAs allow the configuration loading to occur in non-contiguous
temporal segments as long as previously loaded bits are not overwritten. To support the
software control of reconfiguring the reconfigurable region, we define a set of library calls
that interface with the reconfiguration manager and yet hide the underlying architecture
details from the programmer.
The library requires some internal data structures to maintain the following information:
(a) the state of the FPGA (i.e., what hardware modules are currently loaded on the FPGA), (b)
the hardware module being loaded onto the FPGA (if any), (c) the conflict information
between modules, (d) the location and length of the hardware module bitstreams, and (e)
the reconfiguration data required for resumption of reconfiguration. The last one requires
some further explanation. Suppose we preempt the reconfiguration of a hardware module
that consists of 5 CLB columns and 3 columns have been loaded thus far. We need to
remember the information about how much of the hardware module has been reconfigured
so as to support a future resumption of the reconfiguration of this module. We shall now
proceed to explain how the information (d) and (e) mentioned above are stored.
We maintain a structure called HW_load_info for each hardware module to store the
information needed to support the reconfiguration of the module. It has 3 fields: address,
length and recon_unit_size. address is where the bitstream is located in the memory.
length is the size of the bitstream that needs to be loaded. recon_unit_size is the basic
reconfiguration granularity at which the hardware module is to be atomically configured
each time. If recon_unit_size is 1 CLB column, then 1 CLB column of bitstream data
will be loaded onto the reconfigurable region at a time. In other words, when a
reconfiguration is preempted, the library ensures that a multiple of recon_unit_size has
always been loaded. The actual value of recon_unit_size is target-device dependent, given
that different devices have different column lengths. We store the structures of all hardware
modules in an internal data structure HW_LOAD_TABLE that is declared in the following
C-like pseudo code. It should be noted that each hardware module is assigned a unique
hw_id that is also used as an index of this table.
typedef struct {
    int address;          /* type assumed */
    int length;           /* type assumed */
    int recon_unit_size;
} HW_load_info;
HW_load_info HW_LOAD_TABLE[NUM_OF_HARDWARE];
We maintain the information needed to support resumption of reconfiguration in a
structure called HW_resume_info for each hardware module. HW_resume_info is
identical to HW_load_info except that it contains a field called valid that indicates
whether resumption is still valid for the hardware module. RESUMPTION_Q holds all the
HW_resume_infos of the hardware modules.
typedef struct {
    int address;          /* type assumed */
    int length;           /* type assumed */
    int recon_unit_size;
    int valid;
} HW_resume_info;
HW_resume_info RESUMPTION_Q[NUM_OF_HARDWARE];
The following library calls support the software control (i.e., initialization, preemption,
and resumption) of run-time partial reconfiguration:
load(hw_id): This is a non-blocking library call that loads the bitstream of a hardware
module hw_id. load looks up the valid field of hw_id’s entry in the RESUMPTION_Q. If it
is invalid (i.e., the hardware module is not yet loaded on the FPGA), load looks up hw_id’s
entry in HW_LOAD_TABLE for the starting address and length of the bitstream to be passed
to the reconfiguration manager. If it is valid (i.e., configuration resumption is possible),
load will pass the values stored in RESUMPTION_Q to the reconfiguration manager. The
call is non-blocking because the actual loading of the hardware module is carried out by
the reconfiguration manager. When a hardware module is loaded, all its conflicting modules’
entries in RESUMPTION_Q will be invalidated.
currently_reconfiguring(): Returns the id of the hardware module currently being
reconfigured, if any. Returns -1 if there are no hardware modules being reconfigured.
is_loaded(hw): This returns a boolean value indicating whether a hardware module is
loaded on the reconfigurable region.
hw_exec(hw_id, ...): A blocking call that returns upon the completion of the execution
of the hardware module indicated by hw_id. The rest of the parameters are inputs
(usually some register and address values) that need to be transferred to the hardware
module. If hw_id is already loaded on the FPGA, the execution starts immediately. If
hw_id is not yet loaded, its execution may be delayed by either full or partial loading of
the hardware module. This delay is the reconfiguration cost paid in exchange for better
performance. However, we seek to reduce this delay through configuration prefetching.
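The interplay between load, HW_LOAD_TABLE and RESUMPTION_Q can be sketched in C as follows. This is a hedged sketch and not the library's actual implementation: recon_manager_start is a hypothetical stand-in for the reconfiguration manager interface, preempt is a hypothetical helper that is not one of the library calls listed above, CONFLICT is an assumed representation of the conflict information, and the field types are assumed to be int.

```c
#include <stdbool.h>

#define NUM_OF_HARDWARE 4   /* assumed number of hardware modules */

typedef struct { int address; int length; int recon_unit_size; } HW_load_info;
typedef struct { int address; int length; int recon_unit_size; int valid; } HW_resume_info;

HW_load_info   HW_LOAD_TABLE[NUM_OF_HARDWARE];
HW_resume_info RESUMPTION_Q[NUM_OF_HARDWARE];
bool CONFLICT[NUM_OF_HARDWARE][NUM_OF_HARDWARE];  /* conflict information */

static int reconfiguring = -1;       /* module being loaded, -1 if none */
static int req_address, req_length;  /* last request passed to the manager */

/* Hypothetical stand-in for the reconfiguration manager: records the request. */
static void recon_manager_start(int address, int length)
{
    req_address = address;
    req_length = length;
}

/* Non-blocking: hands the bitstream (or its unloaded remainder) to the manager. */
void load(int hw_id)
{
    if (RESUMPTION_Q[hw_id].valid)   /* resume a preempted reconfiguration */
        recon_manager_start(RESUMPTION_Q[hw_id].address, RESUMPTION_Q[hw_id].length);
    else                             /* start from the full bitstream */
        recon_manager_start(HW_LOAD_TABLE[hw_id].address, HW_LOAD_TABLE[hw_id].length);
    reconfiguring = hw_id;
    for (int other = 0; other < NUM_OF_HARDWARE; ++other)
        if (CONFLICT[hw_id][other])
            RESUMPTION_Q[other].valid = 0;   /* their loaded bits get overwritten */
}

int currently_reconfiguring(void) { return reconfiguring; }

/* Hypothetical helper: stop the current reconfiguration after `amount_loaded`
 * units of bitstream, rounded down to a multiple of recon_unit_size. */
void preempt(int amount_loaded)
{
    int hw_id = reconfiguring;
    if (hw_id < 0) return;
    int unit = HW_LOAD_TABLE[hw_id].recon_unit_size;
    if (unit <= 0) unit = 1;
    int done = amount_loaded - amount_loaded % unit;
    if (done > req_length) done = req_length;
    RESUMPTION_Q[hw_id].address = req_address + done;
    RESUMPTION_Q[hw_id].length = req_length - done;
    RESUMPTION_Q[hw_id].recon_unit_size = unit;
    RESUMPTION_Q[hw_id].valid = 1;
    reconfiguring = -1;
}
```

Rounding the loaded amount down to a multiple of recon_unit_size reflects the requirement that a preempted reconfiguration always leaves a whole number of reconfiguration units on the device.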
5.1.3 Interprocedural Control Flow Graphs
The control flow graph (CFG) is a common, intermediate-level data structure used by
compilers to represent applications. The CFG displays all the possible paths that might be
traversed for the procedure that it represents. Every node in the graph is a basic block, which
has only one entry instruction and one exit instruction and no jump instructions in between
the entry and exit instructions.
The CFG usually represents the control flow of a single procedure. While the CFG is
useful for intra-procedural (i.e., within a procedure) optimizations, it may not be suitable
for optimizations that span procedure call boundaries. In such cases, an interprocedural
control flow graph (ICFG) [25] may be a better choice as a representation of the
application for optimization. All possible paths that might be taken during run-time are
represented completely in an ICFG.
As an example, we present in Figure 5.2 the C code for computing HeapSort and its
associated ICFG in Figure 5.3. The ICFG contains the control flow of all the procedures
of the HeapSort program. It should be noted that the CFG of the swap procedure is not
included because it is inlined after applying compiler optimizations. Observe the following




#include <stdio.h>   /* reconstructed */
#include <math.h>    /* reconstructed: needed for floor() */

#define LEFT(i)  (2 * (i))       /* reconstructed: 1-based heap indexing assumed */
#define RIGHT(i) (2 * (i) + 1)   /* reconstructed */

void HEAPSORT(int heap[100], int n);
static inline void swap(int *p, int *q);
int BUILD_HEAP(int heap[100], int n);
void HEAPIFY(int heap[100], int i, int heap_size);   /* reconstructed prototype */

static inline void swap(int *p, int *q)
{
    int t;
    t = *p;    /* reconstructed body */
    *p = *q;
    *q = t;
}

int main(void)   /* reconstructed header */
{
    int i, j, n, heap[110];
    while (1) {
        printf("Enter the number of elements (0 to exit): ");
        scanf("%d", &n);
        if (n == 0) break;
        for (i = 1; i <= n; ++i) scanf("%d", &heap[i]);
        HEAPSORT(heap, n);
        printf("The sorted List is:\n");
        for (i = 1; i <= n; ++i) printf("%d ", heap[i]);
    }
    return 0;
}

void HEAPSORT(int heap[100], int n)
{
    int i, heap_size;
    heap_size = BUILD_HEAP(heap, n);
    for (i = n; i >= 2; --i) {
        swap(&heap[1], &heap[i]);
        --heap_size;
        HEAPIFY(heap, 1, heap_size);
    }
}

int BUILD_HEAP(int heap[100], int n)
{
    int i, heap_size;
    heap_size = n;
    for (i = floor(n / 2); i >= 1; --i) HEAPIFY(heap, i, heap_size);
    return heap_size;
}

void HEAPIFY(int heap[100], int i, int heap_size)
{
    int l, r, largest;
    l = LEFT(i);
    r = RIGHT(i);
    if (l <= heap_size && heap[l] > heap[i]) largest = l;
    else largest = i;
    if (r <= heap_size && heap[r] > heap[largest]) largest = r;   /* reconstructed */
    while (largest != i) {                                        /* reconstructed */
        swap(&heap[i], &heap[largest]);                           /* reconstructed */
        i = largest; l = LEFT(i); r = RIGHT(i);
        if (l <= heap_size && heap[l] > heap[i]) largest = l;
        else largest = i;
        if (r <= heap_size && heap[r] > heap[largest]) largest = r;
    }
}
Figure 5.2: HeapSort C code example.
• Entry and Exit Nodes. In the ICFG, every procedure has a single entry and a single
exit. While it is possible to have multiple return instructions for a procedure written
in a high-level language such as C, we add an additional exit basic block and
replace the multiple return instructions with branches to the added exit basic block.
Also, we have included a start and an end node to indicate where the program begins
and ends normally. Normally, the program begins at the start of the main procedure
and ends after the exit basic block of the main procedure.
• Call and Return Control Flow Edges. Apart from edges that indicate the usual
control-flow transfers (i.e. branch taken or fall-through), ICFGs include two addi-
tional types of control-flow edges. For example, the edge from node 3 to node 7 de-
notes a procedure call being made by the main procedure to the HEAPSORT procedure.
Similarly, the edge from node 18 to node 9 indicates that upon the completion of the
procedure call to BUILD HEAP, the control returns to node 9 of procedure HEAPSORT.
In Figure 5.3, the call edges are indicated by thicker lines in bold while the return
edges are indicated by dotted lines. It should be noted that if a call site makes a call
through a procedure pointer, it is possible for the call site to have multiple out-going
edges.
• Invalid Paths. While the ICFG contains all the possible paths that could be traversed
during run-time by the application, it contains paths that are invalid. For example,
the path 15→ 19→ 20→ 21→ 22→ 26→ 11 is invalid. 15→ 29 is a call edge that
indicates a procedure call made by BUILD HEAP to HEAPIFY. We normally expect the
call to return to the caller, but 26→ 11 is a return edge to HEAPSORT, thus making the
above given path invalid. It should be noted that for every call edge there is exactly








































Figure 5.3: HeapSort interprocedural control flow graph.
• Hardware Nodes. The regions that are designated for hardware implementation are
collapsed into a node in the ICFG. In Figure 5.3, the loop in procedure HEAPIFY
is converted into a single block of code that begins with a call to hw_exec(0),
followed by the necessary control flow to other basic blocks in the ICFG. We refer to
these converted blocks of code, each of which begins with a hw_exec call, as 'hardware
nodes' because they initiate the execution of hardware modules. Note that a hardware
node could contain multiple jump instructions, akin to a superblock [31], which has a
single entry instruction and multiple exit instructions. In figures showing an ICFG,
hardware nodes are depicted as rectangular boxes and software basic blocks as ovals.
Furthermore, every hardware node contains exactly one hw_exec call.
For the rest of the chapter, we denote an ICFG as a directed graph G = (V, E, C, I, U, HW).
V is the set of all the nodes in the ICFG. E is the set of edges denoting all possible
control flow transfers in the program. head(e) and tail(e) refer to the begin and end
nodes of edge e respectively. C ⊂ V is the set of all call sites. I ⊂ V is the set of
the entry nodes of all procedures and U ⊂ V is the set of the exit nodes of all
procedures. HW ⊂ V is the set of all the hardware nodes in the ICFG. Each hardware node
is assigned a unique ID. Recall that every hardware node contains exactly one hw_exec
call. For convenience's sake, we shall refer to the hardware node and its corresponding hardware




After replacing the regions designated to run on the FPGA with hardware nodes, the
compiled ICFG should execute on the reconfigurable platform. However, without load
library calls inserted for optimization, such executables follow what we call
"fetch-on-demand" (FOD) schedules at run-time. Consider the execution sequence abcb for
hardware modules a, b and c. Figure 5.4(a) shows how the execution of the hardware
modules plays out at run-time. According to the hw_exec call semantics, since no
hardware modules are preloaded, the execution of a hardware module is preceded by a
reconfiguration phase if the module is not yet present on the FPGA. Thus, in the example
shown in Figure 5.4(a), we need to reconfigure a, b and c prior to their execution.
However, the last execution of b does not require a reconfiguration: b and c do not
conflict, so hardware module b is still present on the FPGA after the execution of c.
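The fetch-on-demand behaviour described above can be sketched in a few lines of C. This is only an illustrative model, not the run-time library of this thesis: the names fod_state, fod_reconfigurations and MAX_HW are hypothetical, modules are identified by small integers, and we assume that loading a module evicts every module it conflicts with.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_HW 8  /* hypothetical upper bound on the number of hardware modules */

/* conflict[i][j] is true when modules i and j cannot reside on the FPGA together. */
typedef struct {
    bool conflict[MAX_HW][MAX_HW];
    bool loaded[MAX_HW];
} fod_state;

/* Runs the module sequence under fetch-on-demand and returns the number of
 * reconfigurations that were required. */
static int fod_reconfigurations(fod_state *st, const int *seq, size_t len)
{
    int reconfigs = 0;
    for (size_t k = 0; k < len; ++k) {
        int hw = seq[k];
        if (st->loaded[hw])
            continue;                      /* still resident: execute directly */
        for (int other = 0; other < MAX_HW; ++other)
            if (st->conflict[hw][other])
                st->loaded[other] = false; /* assumed: loading evicts conflicting modules */
        st->loaded[hw] = true;
        ++reconfigs;
    }
    return reconfigs;
}
```

With a = 0, b = 1, c = 2 and a conflicting with b, the sequence abcb needs three reconfigurations under this model, and the final execution of b needs none.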
The FOD schedule is sub-optimal and can be improved upon with appropriately placed load
library calls. Figure 5.4(b) shows that inserting a load c call during the execution
of b reduces the overall execution time. This is because the reconfiguration of c can
occur in parallel with the execution of b, as c and b do not conflict with each other.
On the other hand, a misplaced load library call may result in a schedule that is
longer than FOD. Figure 5.4(c) shows a case in which a load a call during the execution
of c results in an additional reconfiguration of b later, hence lengthening the
original FOD schedule. Although load library calls are necessary to improve the
execution time, they must be appropriately placed, and this chapter aims to insert
them so that the overall execution time is minimized. The following examples show how the









[Figure 5.4, panels (a) to (c): timelines for the sequence ex(a) ex(b) ex(c) ex(b),
where a conflicts with b. Legend: ld() = load HW, ex() = execute HW.]
(c) Inserting prefetch loads increases execution time
Figure 5.4: How prefetching affects overall execution time.
To our knowledge, the two works most closely related to ours are those of Panainte
et al. [50] and Li et al. [42]. Panainte proposed a static interprocedural analysis on
call graphs that determines regions not shared between two conflicting hardware
modules. Their paper gives no details as to exactly which basic blocks the load
instructions should be inserted into. In our view, this approach can be too
conservative and misses opportunities to hide more configuration latency.
Figure 5.5(a) shows the control flow graph of a function a that calls either function b
or function c, depending on the branch taken at the beginning of the function.
Functions b and c execute hardware modules HW1 and HW2 respectively. Given that HW1 and
HW2 conflict with each other, the approach by Panainte will not prefetch them beyond
the boundaries of their respective functions (because the call graph loses detailed
path information). However, basic blocks A and B are probably the earliest safe points
at which HW1 and HW2 can be prefetched.
Li proposed a probabilistic algorithm in which a probability is attached to each edge
in the control flow graph to indicate how likely the edge is to be taken once its
source has been executed. After simplifying the control flow graph (which involves
removing all cycles in the graph), the probability of reaching each hardware
reconfiguration can be computed by propagating the probabilities bottom-up. However,
this approach is less satisfactory when there are placement conflicts between hardware
modules. Figure 5.5(b) shows the case where the probabilities of reaching HW1 and HW2
are computed for basic block A. The probabilities indicate that HW2 should be loaded
first. Given that HW1 and HW2 conflict with each other, we should load HW1 first,
contrary to what the computed probabilities indicate. This is because if we reach HW2
from basic block A, we are most likely to reach it through HW1. Furthermore, it is not













[Figure 5.5 label: long paths consisting of many basic blocks]
(b) Motivating example 2
Figure 5.5: Motivating examples.
information. These examples show that both path and conflict information are important in
order to improve the prefetching of configurations.
5.3 Problem Formulation
PROBLEM Given a directed, weighted ICFG G = (V, E, C, I, U, HW), insert load calls into
the ICFG so that a compiler system, together with the reconfiguration library support
described in Section 5.1.2, produces an executable for the platform described in
Section 5.1.1 that minimizes the reconfiguration overhead of the application. One
assumption that we make is that every computation region is executed exclusively
either in software or in hardware.
5.4 Interprocedural Placement-Aware Configuration Scheduling
The algorithm that we propose has the following 5 major stages:
1. Using profiling information, obtain the frequency of executing each control-flow
edge and prune the edges accordingly.
2. Compute the immediate post dominator for every node.
3. Compute the intra post dominator paths (IPDP) i.e., paths that do not extend beyond
the immediate post-dominator of the starting node of the path.
4. Compute for every node on the graph the estimated placement-aware probability of
reaching each hardware node with the IPDP and post-dominator information using
an iterative method.
5. Reduce the redundant prefetches and generate code for prefetching for each basic
block.
In the first stage, we profile the application by inserting profiling instrumentation
code at the beginning of every basic block of the ICFG. This yields the execution trace
and the execution frequency of each control-flow edge. To improve the efficiency of the
algorithm, all edges with zero frequency are removed at this stage. The weight function
w for each edge is computed using the following equation:
$$w(e) = \frac{\text{frequency of edge } e}{\text{total frequency of all outgoing edges of } head(e)}$$























Figure 5.6: An example ICFG. The squares represent hardware nodes while ovals represent
basic blocks. The thick edges represent call edges between procedures and the dotted lines
represent the return edges from the procedures.
A node g of the CFG post-dominates node v if every path from v to the exit node of
their procedure passes through g. We denote the set of post-dominators of each node v
as pdoms(v). For each node v ∈ V − U, there exists an immediate post-dominator g where
g ∈ pdoms(v), g ≠ v, and ∄n ∈ pdoms(v) \ {v, g} : g ∈ pdoms(n) (i.e., g post-dominates v
but does not post-dominate any other post-dominator of v). We denote the immediate
post-dominator of each v as ipdom(v). Classic algorithms [39] exist for obtaining the
post-dominator information. We describe steps 3 to 5 of the algorithm in more detail in
the rest of this section.
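For illustration, post-dominator sets and immediate post-dominators can be obtained with a simple iterative data-flow computation. The sketch below is not the classic algorithm cited above, only a small stand-in; the names cfg, compute_pdoms and ipdom are hypothetical, node sets are bit masks (so at most 32 nodes), and every node is assumed to reach the single exit node.

```c
#include <stdint.h>

#define MAX_NODES 32

typedef struct {
    int n;                     /* number of nodes, at most MAX_NODES */
    int exit_node;             /* the single exit node of the procedure */
    uint32_t succ[MAX_NODES];  /* bit s of succ[v] is set iff there is an edge v -> s */
} cfg;

/* On return, bit g of pdoms[v] is set iff g post-dominates v (reflexively). */
static void compute_pdoms(const cfg *g, uint32_t pdoms[MAX_NODES])
{
    uint32_t all = (g->n == 32) ? UINT32_MAX : ((UINT32_C(1) << g->n) - 1);
    for (int v = 0; v < g->n; ++v)
        pdoms[v] = (v == g->exit_node) ? (UINT32_C(1) << v) : all;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 0; v < g->n; ++v) {
            if (v == g->exit_node)
                continue;
            uint32_t meet = all;   /* intersection over all successors of v */
            for (int s = 0; s < g->n; ++s)
                if (g->succ[v] & (UINT32_C(1) << s))
                    meet &= pdoms[s];
            uint32_t next = meet | (UINT32_C(1) << v);
            if (next != pdoms[v]) {
                pdoms[v] = next;
                changed = 1;
            }
        }
    }
}

/* Returns ipdom(v), or -1 when v has no post-dominator other than itself. */
static int ipdom(const cfg *g, const uint32_t pdoms[MAX_NODES], int v)
{
    uint32_t strict = pdoms[v] & ~(UINT32_C(1) << v);
    for (int c = 0; c < g->n; ++c)
        if ((strict & (UINT32_C(1) << c)) && pdoms[c] == strict)
            return c;   /* every other post-dominator of v also post-dominates c */
    return -1;
}
```

For a diamond-shaped graph 0→1, 0→2, 1→3, 2→3 with exit node 3, this gives pdoms(0) = {0, 3} and ipdom(0) = 3.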
5.4.1 Finding the Intra Post Dominator Paths
As mentioned above, intra post dominator paths are paths that begin at a node v in the
ICFG but never extend beyond the immediate post-dominator of v. Finding these paths is
important for our algorithm. Before defining them formally, we give an intuition as to
why this information is required. Consider the ICFG shown in Figure 5.6, and procedure
e in particular. Suppose that the probability with which node 1 will reach the hardware
node C is to be computed. A naive way of doing so would be to enumerate all possible
paths between node 1 and C. Alternatively, we can observe that C is a post-dominator of
node 1 and hence the probability of reaching C is 1. However, as hinted in Section 5.2,
this is insufficient because this probability is not placement-aware. If C were to
conflict with D or A, we would need to estimate the probability with which node 1
reaches C without encountering A or D on the way. Intuitively, a node should have the
same probability of reaching each hardware node as its immediate post-dominator,
provided that no path from the node to its immediate post-dominator encounters a
conflicting hardware node. To know whether this is the case, we need path information
for every node up to its immediate post-dominator.
Definition 5.4.1 Intra Post Dominator Paths (IPDP) Given an ICFG G = (V, E, C, I, U, HW),
a path p of length j from node m to node n is a sequence of j edges, denoted by
[e_1, e_2, ..., e_j], such that for all i, 1 ≤ i ≤ j − 1, tail(e_i) = head(e_{i+1}). For
convenience, we also define begin(p) = head(e_1) and end(p) = tail(e_j). Although p is a
sequence of edges, we abuse notation by referring to nodes along the path using set
notation. Hence, v ∈ p means that node v occurs in path p. The estimated probability of
taking a path, P_path(p), is the product of the weights of its edges:
$$P_{path}(p) = \prod_{i=1}^{j} w(e_i)$$
An IPDP p is a path as defined above with the following added properties:
a) ∀v ∈ p, ∄n ∈ p : n ≠ v ∧ v ∈ pdoms(n), i.e., no node along the path is a
post-dominator of any other node along the path, and b) P_path(p) > threshold, i.e., the
estimated probability of this path being taken is greater than a threshold value. This
threshold value is set to 0.0005 in our experiments.
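As a small worked sketch of property b), the estimated path probability is simply the running product of the edge weights, compared against the threshold. The function names and the IPDP_THRESHOLD macro are hypothetical; only the value 0.0005 comes from the text.

```c
#include <stdbool.h>
#include <stddef.h>

#define IPDP_THRESHOLD 0.0005  /* threshold value quoted in the text */

/* Product of the edge weights w(e_1) ... w(e_j) along a path of j edges. */
static double path_probability(const double *weights, size_t j)
{
    double p = 1.0;
    for (size_t i = 0; i < j; ++i)
        p *= weights[i];
    return p;
}

/* True when the path is probable enough to be kept as an IPDP candidate. */
static bool passes_threshold(const double *weights, size_t j)
{
    return path_probability(weights, j) > IPDP_THRESHOLD;
}
```

For example, a path with edge weights 0.5 and 0.25 has an estimated probability of 0.125 and is kept, whereas a path with weights 0.01 and 0.01 falls below the threshold and is discarded.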
Algorithm 2 shows the pseudo-code for computing the IPDP information for each node
v ∈ V. The set of IPDPs of each node v ∈ V is denoted IPDP_v. The algorithm consists of
two loops. In the first loop, we initialize IPDP_v for each node v with the outgoing
edges of v whose destination is not a post-dominator of v. The second loop is a classic
work-list loop, in which outgoing edges of end(p) are appended to paths in IPDP_v as
long as the destination of the edge does not post-dominate any node in the path being
extended. Note that the IPDP information does not extend beyond procedure boundaries
(i.e., all paths leading to call sites or exit nodes terminate there).
Algorithm 2: Obtaining Intra-PostDominator Paths
Input: ICFG: (V, E, C, I, U, HW);
Result: Intra-PostDominator Paths collected ∀v ∈ V
forall nodes v ∈ V − U do
    forall outgoing edges (v, s) ∈ E of v do
        if s is not a post-dominator of v, i.e. s ∉ pdoms(v), then
            initialize p with edge (v, s);
            insert path p into set IPDP_v;
Change ← true;
while Change do
    Change ← false;
    forall nodes v ∈ V − U do
        foreach path p ∈ IPDP_v do
            foreach outgoing edge (end(p), s) ∈ E of end(p) do
                if s is either an exit node or call node, i.e. s ∈ U ∨ s ∈ I, then
                    continue;
                if ∀n ∈ p : s ∉ pdoms(n) then
                    concatenate path p with edge (end(p), s) to get path p_new;
                    if probability of path p_new is higher than threshold then
                        insert p_new into the set IPDP_v;
                        Change ← true;
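The two loops of Algorithm 2 can be sketched in C for a single start node v as follows. This is a simplified illustration under stated assumptions rather than our implementation: the types icfg and path, the function collect_ipdps and the fixed array bounds are hypothetical, post-dominator sets are supplied as reflexive bit masks, and the nodes at which extension stops are given by the caller in a stop mask. The output array doubles as the work list, so each path is extended exactly once.

```c
#include <stdint.h>

#define MAX_NODES 16
#define MAX_PATHS 64
#define MAX_LEN   16

typedef struct {
    int n;                           /* number of nodes */
    double w[MAX_NODES][MAX_NODES];  /* w[u][s] > 0 iff edge u -> s exists */
    uint32_t pdoms[MAX_NODES];       /* bit g set iff g post-dominates the node */
    uint32_t stop;                   /* nodes that must not be appended to a path */
} icfg;

typedef struct {
    int nodes[MAX_LEN];  /* visited nodes, starting with v */
    int len;             /* number of nodes, i.e. number of edges + 1 */
    double prob;         /* product of the edge weights along the path */
} path;

/* Returns 1 when s post-dominates some node that is already on p. */
static int postdominates_any(const icfg *g, const path *p, int s)
{
    for (int i = 0; i < p->len; ++i)
        if (g->pdoms[p->nodes[i]] & (UINT32_C(1) << s))
            return 1;
    return 0;
}

/* Collects the IPDPs of v into out[] and returns how many were found. */
static int collect_ipdps(const icfg *g, int v, double threshold, path out[MAX_PATHS])
{
    int count = 0;
    /* First loop: seed with edges whose destination does not post-dominate v. */
    for (int s = 0; s < g->n; ++s) {
        if (g->w[v][s] <= 0.0 || (g->pdoms[v] & (UINT32_C(1) << s)))
            continue;
        if (count == MAX_PATHS)
            return count;
        out[count] = (path){ .nodes = { v, s }, .len = 2, .prob = g->w[v][s] };
        ++count;
    }
    /* Second loop: extend every collected path; new paths are appended to out[]. */
    for (int i = 0; i < count; ++i) {
        int end = out[i].nodes[out[i].len - 1];
        for (int s = 0; s < g->n; ++s) {
            if (g->w[end][s] <= 0.0) continue;
            if (g->stop & (UINT32_C(1) << s)) continue;
            if (postdominates_any(g, &out[i], s)) continue;
            double prob = out[i].prob * g->w[end][s];
            if (prob <= threshold) continue;
            if (count == MAX_PATHS || out[i].len == MAX_LEN)
                return count;   /* hypothetical capacity limit reached */
            out[count] = out[i];
            out[count].nodes[out[count].len++] = s;
            out[count].prob = prob;
            ++count;
        }
    }
    return count;
}
```

Because the post-dominator masks are reflexive, a node that is already on a path is never appended again, so the enumeration terminates even when the graph contains cycles.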
5.4.2 Iterative Placement-Aware Estimated Probability Updating
As mentioned above, each node should have the same estimated probabilities of reaching
hardware nodes as its immediate post-dominator. However, conflicting hardware nodes
may exist on a path between the node and its immediate post-dominator. To avoid
enumerating all possible paths (which may also be interprocedural) between each node
and its immediate post-dominator, we compute the estimated probabilities of reaching
each hardware node through a fixed-point iterative process starting from the hardware
nodes. Algorithm 3 shows a main loop that visits all the nodes in the graph in each
iteration and repeats until a fixed point is reached (i.e., the estimated probabilities
for each node have stabilized). For this stage of the proposed algorithm, we maintain
two two-dimensional vectors, IPDP_Prob and P. IPDP_Prob(v, hw) holds the estimated
probability that node v reaches hardware node hw through its IPDPs. P(v, hw) holds the
estimated probability that node v reaches hardware node hw through all possible paths,
while P(v) refers to the vector of estimated probabilities for node v. All P(v, hw) are
initialized to zero except when v = hw, where P(v, hw) is initialized to 1. A procedure
may have multiple callers. Because the call context is uncertain, we do not update the
estimated probabilities for the exit nodes of procedures.
We distinguish between the general case and call sites when updating the estimated
probabilities. Algorithm 4 shows how the estimated probabilities for a general node v
are updated. The main task is to compute a vector of estimated probabilities temp_p,
which replaces P(v) if the two vectors differ. When P(v) is updated, a change is
reported.
The vector temp_p is computed by determining a max_prob for each hw ∈ HW. During each
iteration of the main loop, max_prob is the greater of two probabilities. The first is
the estimated probability of node v reaching hardware node hw through its IPDPs, given
by new_prob = P_path(p) × P(end(p), hw) for path p. The second is the estimated
probability of reaching hardware node hw through the immediate post-dominator of v,
given by factor × P(ipdom(v), hw), where factor is one minus the sum of the
probabilities of reaching the hardware nodes that conflict with hw through the IPDPs
of v: factor = 1 − Σ_{hw′ conflicts with hw} IPDP_Prob(v, hw′).
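The two candidate probabilities can be illustrated with a short numerical sketch. The helper names via_ipdps, via_ipdom and max_of are hypothetical, and the caller is assumed to have already filtered out IPDPs that contain a node conflicting with hw.

```c
#include <stddef.h>

/* Best estimate over the conflict-free IPDPs of v:
 * the maximum of P_path(p) * P(end(p), hw). */
static double via_ipdps(const double *p_path, const double *p_end_hw, size_t n_paths)
{
    double best = -1.0;  /* same initial value as max_prob in Algorithm 4 */
    for (size_t i = 0; i < n_paths; ++i) {
        double new_prob = p_path[i] * p_end_hw[i];
        if (new_prob > best)
            best = new_prob;
    }
    return best;
}

/* Estimate through the immediate post-dominator, discounted by the probability
 * of meeting a conflicting hardware node on the IPDPs of v. */
static double via_ipdom(double p_ipdom_hw, const double *ipdp_prob_conflicts,
                        size_t n_conflicts)
{
    double factor = 1.0;
    for (size_t i = 0; i < n_conflicts; ++i)
        factor -= ipdp_prob_conflicts[i];
    return factor * p_ipdom_hw;
}

static double max_of(double a, double b)
{
    return a > b ? a : b;
}
```

For instance, with one conflicting hardware node reached through the IPDPs with probability 0.25 and P(ipdom(v), hw) = 1, the post-dominator estimate is 0.75; if the best IPDP estimate is only 0.25, max_prob becomes 0.75.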
Algorithm 3: Iterative Probability Updating
Result: Final Placement-Aware Probabilities computed for each basic block ∀v ∈ V
forall v ∈ V do
    forall hw ∈ HW do
        IPDP_Prob(v, hw) ← 0;
        if v = hw then
            P(v, hw) ← 1;
        else
            P(v, hw) ← 0;
change ← true;
while change do
    change ← false;
    forall v ∈ V do
        if v is an exit node, i.e. v ∈ U, then
            continue;
        else if v is a call site, i.e. v ∈ C, then
            tmp_change ← update_probabilities_for_call_site(v);
        else
            tmp_change ← update_general_probabilities(v);
        if tmp_change then
            change ← true;
return P;
Furthermore, IPDP_Prob(v, hw) is updated whenever a larger estimated probability of
reaching hw through one of the node's IPDPs is found. We compute new_IPDP_prob by
deducting P(ipdom(v), hw) from new_prob. This avoids double counting, since the end of
an IPDP may itself reach a hardware node through the immediate post-dominator of node
v. new_IPDP_prob replaces IPDP_Prob(v, hw) if it is the greater value.
Algorithm 4: update_general_probabilities(v)
Input: v
Result: Update probabilities for v based on post-dominator and IPDP_Prob information
change ← false;
forall hw ∈ HW do
    if v is hw or conflicts with hw, i.e. v = hw ∨ v conflicts with hw, then
        continue;
    max_prob ← −1;
    forall p ∈ IPDP_v do
        if no node in p conflicts with hw (i.e. ∄n : n ∈ p ∧ n conflicts with hw) then
            new_prob ← P_path(p) × P(end(p), hw);
            new_IPDP_prob ← new_prob − P(ipdom(v), hw);
            if new_IPDP_prob > IPDP_Prob(v, hw) then
                IPDP_Prob(v, hw) ← new_IPDP_prob;
            if new_prob > max_prob then
                max_prob ← new_prob;
    factor ← 1.0;
    forall hw′ ∈ HW : hw′ conflicts with hw do
        factor ← factor − IPDP_Prob(v, hw′);
    if factor × P(ipdom(v), hw) > max_prob then
        max_prob ← factor × P(ipdom(v), hw);
    if max_prob < threshold then
        temp_p(hw) ← 0;
    else
        temp_p(hw) ← max_prob;
if temp_p ≠ P(v) then
    P(v) ← temp_p;
    change ← true;
return change;
Similarly, when updating the estimated probabilities for call sites, we compute a
vector temp_p that replaces P(v) if the two vectors differ. The pseudo-code is given in
Algorithm 5. Recall that w(e) is the weight function for edge e. In the first loop,
temp_p is computed using the following equation:
$$temp\_p(hw) = \sum_{e \in out(v)} w(e) \times P(tail(e), hw)$$
Here, out(v) is the set of outgoing edges of node v. Note that while we normally expect
a call site to call only one callee, this is not true in general for call sites that
make calls through procedure pointers. We rely on profiling results to determine which
procedures are called.
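The weighted sum for call sites is straightforward to express in C. The struct call_edge and the function call_site_temp_p are hypothetical names used for illustration; each outgoing call edge carries its profiled weight and a pointer to the probability vector of its destination node.

```c
#include <stddef.h>

typedef struct {
    double weight;         /* w(e), obtained from profiling */
    const double *p_tail;  /* P(tail(e), hw) for every hardware node hw */
} call_edge;

/* temp_p(hw) = sum over outgoing edges e of w(e) * P(tail(e), hw). */
static void call_site_temp_p(const call_edge *out, size_t n_edges,
                             double *temp_p, size_t n_hw)
{
    for (size_t hw = 0; hw < n_hw; ++hw)
        temp_p[hw] = 0.0;
    for (size_t e = 0; e < n_edges; ++e)
        for (size_t hw = 0; hw < n_hw; ++hw)
            temp_p[hw] += out[e].weight * out[e].p_tail[hw];
}
```

As an example, a call site that reaches one callee with weight 0.75 and another with weight 0.25, whose entry nodes have probability vectors (1, 0) and (0.5, 0.5), obtains temp_p = (0.875, 0.125).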
Algorithm 5: update_probabilities_for_call_site(v)
Input: v
Result: Update probabilities for call node v based on post-dominator and IPDP_Prob
information
change ← false;
Initialize all values of array temp_p to 0;
forall outgoing edges e of v do
    foreach hw ∈ HW do
        temp_p(hw) ← temp_p(hw) + w(e) × P(tail(e), hw);
forall hw ∈ HW do
    if temp_p(hw) = 0 then
        val ← P(ipdom(v), hw);
        forall hw′ ∈ HW : hw′ conflicts with hw do
            val ← val − temp_p(hw′);
        if val > 0 then
            temp_p(hw) ← val;
if temp_p ≠ P(v) then
    P(v) ← temp_p;
    change ← true;
return change;
We post-process the temp_p computed thus far by updating it, where needed, with the
estimated probabilities of the corresponding return site of v. Note that the
corresponding return site of v is its immediate post-dominator. temp_p(hw) is updated
only when it is zero. We deduct the estimated probabilities of reaching the hardware
nodes that conflict with hw, as recorded in temp_p, from the estimated probability of
reaching hw from the return site, P(ipdom(v), hw). If the resulting value is greater
than zero, it is used to update temp_p(hw).
5.4.3 Prefetch Reduction and Code Generation
Thus far, we have computed the estimated probabilities of reaching the hardware nodes
from each basic block, and the results are stored in the two-dimensional vector P.
Consider a node v with its associated probability vector P(v). We generate the
prefetching code for node v using the two code templates shown in Figure 5.7 and
Figure 5.8 in pseudo-C code¹. First, we sort the hardware probabilities in descending
order. Next, we insert the code template in Figure 5.8 at the beginning of node v based
on the sorted probabilities. Only hardware nodes that do not conflict with the most
probable hardware node are considered. In other words, we do not generate load calls
for hardware nodes that conflict with the most probable hardware node. We then fill in
the inserted template with the loading template shown in Figure 5.7. The final
condition for calling load for a hardware node depends on a) whether the difference in
probability between reaching this hardware node and reaching a conflicting hardware
node is small enough and b) whether that conflicting hardware node is already loaded.
If both conditions hold, we do not reconfigure the hardware node in question. In
general, we attempt to load the most probable reachable hardware node if it is not
loaded, and only consider loading other hardware nodes if no hardware module is
currently being reconfigured.
Naively, every basic block that has a non-zero probability of reaching a hardware node
should be a candidate for inserting the load library call. However, this is needlessly
expensive. The prefetch points can be reduced by clearing the probabilities for nodes
all of whose parents have the same probabilities of reaching the hardware nodes as the
node itself.
¹ A compiler will insert this code using a low-level intermediate representation. We
omit the details here for the sake of brevity.
//Suppose hardware nodes conflicting with hw
// are c0, c1, c2, ...
if( (P[v][hw]-P[v][c0]>THRESHOLD || !is_loaded(c0)) &&
    (P[v][hw]-P[v][c1]>THRESHOLD || !is_loaded(c1)) &&
    (P[v][hw]-P[v][c2]>THRESHOLD || !is_loaded(c2)) &&
    ... )
    load(hw);
Figure 5.7: Loading code template for hardware node hw. The condition is expressed
as a product of sums.
if(!is_loaded(most probable hardware node A))
    //Insert loading template for A
else if(currently_reconfiguring()==-1)
{
    if(!is_loaded(2nd most probable hardware node B))
        //Insert loading template for B
    else if(!is_loaded(3rd most probable hardware node C))
        //Insert loading template for C
    else if(!is_loaded(4th most probable hardware node D))
        //Insert loading template for D
}
Figure 5.8: Cascading ifs code template to be inserted at prefetch points
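The product-of-sums condition of the loading template in Figure 5.7 can be written as a loop over the conflicting hardware nodes. The sketch below is illustrative only: should_load and its parameters are hypothetical names, the loaded array stands in for the is_loaded library query, and p_v plays the role of the row P[v].

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when hw should be loaded at node v: for every conflicting node c,
 * either hw is sufficiently more probable than c, or c is not currently loaded. */
static bool should_load(const double *p_v, int hw,
                        const int *conflicts, size_t n_conflicts,
                        const bool *loaded, double threshold)
{
    for (size_t i = 0; i < n_conflicts; ++i) {
        int c = conflicts[i];
        bool clause = (p_v[hw] - p_v[c] > threshold) || !loaded[c];
        if (!clause)
            return false;  /* one false sum makes the whole product false */
    }
    return true;
}
```

With this formulation, a hardware node whose probability is only marginally higher than that of an already loaded conflicting node is not loaded, which avoids evicting a module that is almost as likely to be needed.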
5.5 Experimental Evaluation
5.5.1 Experimental Setup
We performed experiments with three applications, 401.bzip2, 429.mcf and h264enc, to
study the effectiveness of our algorithm. 401.bzip2 and 429.mcf were taken from the
SPEC2006 benchmark suite [58]. h264enc was taken from the MediaBench II video benchmark
suite [20]. 401.bzip2 is a block-sorting compression application, while 429.mcf is a
program used for single-depot vehicle scheduling. h264enc [67] is an implementation of
an H.264/AVC (Advanced Video Coding) encoder, a state-of-the-art video compression
standard.
In our implementation of the architecture model, we employed the ReCoBus concept of
Koch et al. [36] to support complex run-time reconfiguration. ReCoBus reconfiguration
regions are organized in terms of reconfigurable slots, i.e., a slot is the smallest
unit of area that a hardware module can occupy on the pFPGA. The minimum size of each
slot is 6 CLB columns. Through profiling, we identified 6 compute-intensive regions
each for 429.mcf and 401.bzip2, and 7 such regions for h264enc. These compute-intensive
regions map to either basic blocks or loops in the original program. Table 5.1 shows
these regions and the estimated number of slots (based on the software code size) that
they occupy on the pFPGA. For our experiments, we assumed that hardware execution is 5
times faster than software execution. We refer to the hardware regions by their indexes
for the rest of this section.
For our experiments, we assumed a hardware device with a geometry similar to that of
Xilinx Virtex II Pro [70] FPGAs (i.e., column-based), organized as a CLB matrix of 80
rows and 56 columns. The PowerPC CPU operates at 300MHz. Every CLB column consists of 22
frames and each frame in turn requires 6,592 bits of configuration data. Thus, each CLB
column requires 145,024 bits of configuration data. Different FPGA architectures
support different reconfiguration bit-widths. For example, the Virtex IV family
supports a bit-width of 32 bits on the SelectMap interface for reconfiguring the FPGA,
while the Virtex II family supports a bit-width of 8 bits. We assumed a single
reconfiguration port running at 100MHz and performed experiments for reconfiguration
bit-widths of 8 and 32. Table 5.2 shows the overhead of reconfiguring a single ReCoBus
slot for each bit-width. The wider the bit-width, the lower the overhead.
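The overhead figures can be reproduced from the parameters above with a few lines of C. This is a back-of-the-envelope sketch: the enum constants and the function name are hypothetical, and we assume a slot of the minimum size of 6 CLB columns and a port that transfers one word of the given width per port cycle.

```c
#include <stdint.h>

enum {
    BITS_PER_FRAME    = 6592,  /* configuration bits per frame */
    FRAMES_PER_COLUMN = 22,    /* frames per CLB column */
    COLUMNS_PER_SLOT  = 6,     /* assumed: minimum ReCoBus slot size */
    CPU_MHZ           = 300,   /* PowerPC clock */
    PORT_MHZ          = 100    /* reconfiguration port clock */
};

/* CPU cycles needed to reconfigure one slot through a port of the given width. */
static uint64_t slot_reconfig_cpu_cycles(unsigned port_width_bits)
{
    uint64_t bits = (uint64_t)BITS_PER_FRAME * FRAMES_PER_COLUMN * COLUMNS_PER_SLOT;
    uint64_t port_cycles = bits / port_width_bits;  /* one word per port cycle */
    return port_cycles * CPU_MHZ / PORT_MHZ;        /* convert to CPU cycles */
}
```

Under these assumptions, an 8-bit port needs 326,304 CPU cycles per slot and a 32-bit port needs 81,576, a quarter of that.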
Benchmark   Index   Region from Procedure           No. of Slots
401.bzip2   B0      mainQSort3                      2
401.bzip2   B1      fallbackSort                    3
401.bzip2   B2      copy_input_until_stop           3
401.bzip2   B3      generateMTFValues               2
401.bzip2   B4      mainSort                        1
401.bzip2   B5      mainSimpleSort                  2
h264enc     H0      FastFullPelBlockMotionSearch    2
h264enc     H1      SetupFastFullPelSearch          2
h264enc     H2      SATD                            2
h264enc     H3      writeRunLevel_CABAC             2
h264enc     H4      biari_encode_symbol             2
h264enc     H5      dct_luma                        3
h264enc     H6      dct_luma                        1
429.mcf     M0      primal_bea_mpp                  2
429.mcf     M1      price_out_impl                  2
429.mcf     M2      sort_basket                     2
429.mcf     M3      refresh_potential               2
429.mcf     M4      primal_iminus                   3
429.mcf     M5      primal_bea_mpp                  2
Table 5.1: The regions selected for hardware implementation in the 401.bzip2, h264enc
and 429.mcf benchmarks.
Bit-width   Reconfiguration Overhead for 1 ReCoBus Slot (PowerPC cycles at 300MHz)
8           (6592 / 8) × 22 × 6 × (300 / 100) = 326,304 cycles
32          81,576 cycles
Table 5.2: Reconfiguration overhead of 1 ReCoBus slot for different bit-widths.
The benchmarks were compiled using the Open IMPACT compiler [48]. While Open IMPACT
targets the Itanium machine [32], we modified the compiler backend to generate code for
the PowerPC 405 [74] instead, which is the CPU core embedded in Xilinx Virtex II Pro
chips. These changes enabled us to compile applications that target Xilinx FPGA
platforms. Information such as the control-flow graph and basic block IDs was obtained
from the Open IMPACT compiler.
Through code instrumentation, we obtained a trace of basic block IDs from the execution
of each application and measured the average execution time of each basic block. For
h264enc, the average execution times were measured by running the instrumented code on
the Xilinx University Program board [68]. For 429.mcf and 401.bzip2, the measurements
were taken by running the instrumented code on a PowerPC machine and the execution
times were then scaled to match a clock frequency of 300MHz. An in-house trace-based
simulator used the trace and the execution time information to compute the expected
execution time. Finally, we evaluated our algorithm by comparing it against three
scenarios: fetch-on-demand (FOD), optimal and the placement-blind probabilistic
algorithm.
Fetch-On-Demand: The Fetch-On-Demand schedule was described earlier in Section 5.2. No
load library calls are made at all. A hardware module is loaded onto the pFPGA when it
is encountered during execution and is not already loaded. It is reasonable to expect
any prefetching approach to improve upon this case. We used the expected execution time
of the Fetch-On-Demand scenario as the baseline for comparison in our experiments.
Optimal: The optimal case is when the entire execution trace is known beforehand and
every prefetching decision is made based upon this foreknowledge. Our implementation of
this scenario relied on the algorithm described in [55]. We do not expect the static
approach described in this chapter to be as good as the optimal case, but the gap
between Optimal and Fetch-On-Demand is useful for gauging the effectiveness of our
approach.
Placement-blind Probabilistic algorithm: The implementation of the placement-blind
probabilistic algorithm is based on [42]. Some changes to the control flow graph, such
as identifying and removing back-edges, need to be made before this algorithm can be
used. It is a bottom-up approach that propagates the probability of reaching the
hardware nodes. This technique was developed for relocatable and defragmentable FPGAs
and not for the Xilinx FPGA architectures. Therefore, it does not account for the
placement conflicts between hardware modules and serves as a good gauge of what happens
when we are not placement-aware.
5.5.2 Experimental Results
The placement of the hardware modules determines the conflict relationships between
them. To evaluate the effect of different conflict sets on our algorithm, we generated
different placements for the selected regions in Table 5.1 such that the number of
conflicts (overlaps) between the hardware modules is minimized. We omit the placement
details and instead abstract them by showing the resulting conflicts in Table 5.3.
Placement Label   Conflicts
bzip2-s3-1        {(B0, B1), (B1, B3), (B0, B3), (B2, B4), (B2, B5)}
bzip2-s3-2        {(B1, B4), (B1, B5), (B2, B3), (B2, B0), (B0, B3)}
bzip2-s3-3        {(B1, B4), (B0, B1), (B2, B3), (B2, B5), (B3, B5)}
bzip2-s4-1        {(B1, B2), (B0, B5)}
bzip2-s4-2        {(B1, B2), (B0, B3)}
bzip2-s4-3        {(B1, B2), (B3, B5)}
h264-s3-1         {(H0, H5), (H3, H5), (H0, H3), (H1, H4), (H2, H4), (H1, H2)}
h264-s3-2         {(H4, H5), (H1, H4), (H1, H5), (H0, H2), (H0, H3), (H2, H3)}
h264-s3-3         {(H0, H3), (H0, H1), (H1, H3), (H4, H5), (H2, H5), (H2, H4)}
h264-s4-1         {(H1, H3), (H0, H5), (H2, H4)}
h264-s4-2         {(H1, H5), (H0, H3), (H2, H4)}
h264-s4-3         {(H1, H4), (H0, H3), (H2, H5)}









Table 5.3: Benchmarks with different placements.
The labels in Table 5.3 for the different placements deserve some explanation. Each
label is named after the corresponding application: labels starting with bzip2- refer
to placements for 401.bzip2, labels starting with h264- refer to placements for
h264enc, and labels starting with mcf- refer to placements for 429.mcf. Placements
labeled 's3' were generated for a ReCoBus implementation with 2 separate reconfigurable
regions of 3 reconfigurable slots each, while placements labeled 's4' were generated
for a ReCoBus implementation with 2 separate reconfigurable regions of 4 reconfigurable
slots each. Each of these placements forms a separate test case for our experiments.
We observe that the number of conflicts decreases when





































































































[Plot residue; x-axis label: different placement sets]












































Figure 5.10: Speedups over baseline for 32-bits wide reconfiguration port running at
100MHz.
Figure 5.9 and 5.10 show us the various speedups/slowdown over the baseline for recon-
figuration bit-widths of 8 bits and 32 bits respectively, after applying the placement-blind
probability, optimal and our placement aware algorithm to the various placement sets. We
make the following observation of the results shown:
• All algorithms performs better in the case when reconfiguration port’s bitwidth is
32 bits. This shows that higher reconfiguration speeds creates more temporal space
during execution for prefetch to occur.
• The performance of our algorithm is worse for the placements of 401.bzip2. However,
the gap between the optimal and the baseline is almost negligible for 401.bzip2, which
implies that there is little room for the execution to be optimized through configuration
prefetching. Despite the narrow gap, our algorithm still manages to perform better than
the baseline in two of the test cases for 401.bzip2. It should also be noted that the size
of the gap between baseline and optimal depends on the hardware-software partitioning
of the application. In this case, the partitioning for 401.bzip2 is not ideal in the first
place.
• Performance degrades seriously when conflicts are not taken into account. The
placement-blind probability algorithm suffers a maximum slowdown of 90% and a 20%
degradation in performance for most of the placement sets tested in our experiments.
This shows the inadequacy of the placement-blind algorithm for the pFPGA architecture
we are targeting.
• For the same benchmark, the speedup that can be gained through configuration scheduling
differs across placement sets. In particular, h264-s3-1 is the best for h264enc,
achieving a speedup of roughly 30% for the optimal algorithm. This shows how the
placements affect both the overall performance and the opportunities available for
configuration prefetching.
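The first observation above can be made concrete with a back-of-the-envelope model of reconfiguration latency. The sketch below is only an illustration under stated assumptions: the port is assumed to accept one word per clock cycle with no setup overhead, and the function name `reconfig_time_us` and the bitstream size are hypothetical rather than taken from the experiments.

```python
def reconfig_time_us(bitstream_bytes: int, port_width_bits: int, clock_mhz: float) -> float:
    """Time to stream a partial bitstream through the reconfiguration port, in microseconds.

    Assumes one port-wide transfer per clock cycle and no setup overhead.
    """
    bits = bitstream_bytes * 8
    # Ceiling division: number of port-wide transfers needed for the whole bitstream.
    transfers = (bits + port_width_bits - 1) // port_width_bits
    # cycles / (cycles per microsecond) = microseconds
    return transfers / clock_mhz


BITSTREAM_BYTES = 40_000  # hypothetical module size, not a measured value

t8 = reconfig_time_us(BITSTREAM_BYTES, 8, 100.0)
t32 = reconfig_time_us(BITSTREAM_BYTES, 32, 100.0)
print(f"8-bit port at 100 MHz:  {t8:.0f} us")
print(f"32-bit port at 100 MHz: {t32:.0f} us")
```

Under this model the 32-bit port needs a quarter of the transfer time of the 8-bit port, so the same stretch of software execution can hide roughly four times as much configuration data.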
Another way to measure the quality of our algorithm is to show how close its result is
to the optimal when it performs better than the baseline. To do this, we compute what we
call the optimal proximity score:
\[
\text{Optimal Proximity Score} =
\frac{\text{Performance increase of placement-aware algorithm}}
     {\text{Performance increase of optimal algorithm}}
\]
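As a small worked illustration, the score can be computed as the ratio of two performance increases measured against the same baseline. This is a sketch under assumptions: the function name `optimal_proximity_score` is hypothetical, both increases are taken to be percentage gains over the baseline, and the input numbers are invented for the example rather than read from the experimental results.

```python
def optimal_proximity_score(aware_gain: float, optimal_gain: float) -> float:
    """Fraction of the baseline-to-optimal range recovered by the placement-aware algorithm.

    Both arguments are performance increases over the same baseline, in the same unit.
    """
    if optimal_gain <= 0:
        raise ValueError("the optimal must improve on the baseline for the score to be defined")
    if aware_gain < 0:
        raise ValueError("the score is only used when the algorithm beats the baseline")
    return aware_gain / optimal_gain


# Hypothetical gains (percent over baseline), chosen only for illustration.
score = optimal_proximity_score(aware_gain=12.0, optimal_gain=20.0)
print(f"optimal proximity score: {score:.0%}")
```

A score of 100% would mean the placement-aware algorithm matches the optimal schedule, while a score near 0% would mean it barely improves on the baseline.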

























































































Figure 5.11: Proximity to optimal by normalizing the range between baseline and optimal
(8-bit wide reconfiguration port). Axis label: Benchmarks with Different Placements.
The optimal proximity score shows where the results of the placement-aware algorithm
fall within the normalized range between the baseline and the optimal. Figures 5.11
and 5.12 show the optimal proximity for 8-bit and 32-bit wide reconfiguration ports
respectively. A higher optimal proximity score indicates better proximity to the optimal.
For example, when the reconfiguration port is 32 bits wide, h264-s3-1 has a score
of 72% while h264-s4-1 has a score of 17%. This shows that the result of h264-s3-1 is




































Figure 5.12: Proximity to optimal by normalizing the range between baseline and optimal
(32-bit wide reconfiguration port). Axis label: Benchmarks with Different Placements.
5.6 Summary
In this chapter, we have described a novel method that statically determines, for each basic
block, what ought to be pre-loaded into the FPGA so as to reduce the reconfiguration overhead.
Our approach is consistently better than baseline and performs better than state-of-the-art
prefetching algorithms based upon static analyses. However, our experiments show that
there is still room for improvement in our approach. For a static approach, it is important
to be neither too conservative nor too speculative. The former leads to less reconfiguration
latency hiding, while the latter causes mis-prefetches that may increase the number of
reconfigurations initiated. A better approach would need to be sensitive to the context of
the execution, i.e. the code becomes ‘aware’ of the phase it is executing in and prefetches
according to the probabilities estimated for that phase instead.
Chapter 6
Conclusions and Future Work
6.1 Conclusion
In this thesis, we have studied hardware-software co-design for FPGA-based systems
with the aim of improving overall execution time by reducing run-time reconfiguration
overhead. This overhead can potentially wipe out any speedup obtained by implementing
the computation in an FPGA. This consideration, and the opportunities afforded by
advances in architectural support for more efficient reconfiguration, form the motivation
of this thesis. The main contributions of this thesis are as follows:
• In Chapter 3, we presented a framework for the efficient implementation of neighborhood
searches of the temporal and spatial partitioning design space. We demonstrated in that
chapter that both temporal and spatial partitioning affect the number
of reconfigurations, thus making it difficult to estimate the run-time reconfiguration
overhead incurred by a particular design point. A naive solution would be to scan
through the entire trace every time a design point is evaluated. Apart from the
intractable size of the design space, this solution does not scale with increasing length
of the execution trace. Our framework provides an efficient way of computing the
run-time reconfiguration cost through (a) a novel definition of the neighboring
relationship between design points, (b) a loop trace encoded with a SEQUITUR grammar,
and (c) an algorithm that leverages the former two to compute the change in run-time
reconfiguration cost when moving between 2 neighboring points. This way, once the
reconfiguration cost has been computed for one design point, we can efficiently compute
the reconfiguration cost for all its neighbors and, transitively, for the rest of the
design space when required. We evaluated the efficiency of this framework by implementing
2 neighborhood searches on top of it, namely hill climbing and Tabu search. Our
experiments showed that hill climbing was able to find the optimal design point in more
than 90% of the cases, while Tabu search found the optimal design point in all of our
experiments. It was also shown that the searches were sped up by up to two orders of
magnitude when the proposed framework was employed.
• While the framework presented in Chapter 3 allows the design space of both temporal
and spatial partitioning to be searched, it does not consider the possibility of
further configuration overhead reduction on pFPGAs. In Chapter 4, we examined the
following sub-problem: Given an execution trace, the associated hardware modules
and their placements on a pFPGA, find the feasible schedule that minimizes the overall
execution time. This problem is complementary and orthogonal to the one solved in
Chapter 3. We presented a novel polynomial-time algorithm that solves it by scheduling
the reconfiguration of hardware modules to occur in parallel with application execution
whenever possible. The resultant schedule is provably optimal. A key to the algorithm is
a dependence analysis that determines, for each instance of a hardware module execution,
whether a prior reconfiguration is needed. Experiments performed using the H.264
benchmark show that the current state-of-the-art online prefetching algorithms perform
considerably worse than the schedule returned by our algorithm. The difference
between the two in terms of speedup over the baseline can be as large as 40%.
• Although the algorithm in Chapter 4 returns an optimal schedule for a particular ex-
ecution trace, a program’s execution pattern may differ with varying inputs. It is not
feasible to schedule for every possible input of a program. In Chapter 5, we examined
the following sub-problem: Given a program that is represented as an interprocedural
control flow graph, together with the hardware modules and their associated place-
ments on a pFPGA, find suitable prefetch points in the graph for the insertion of
library calls that will load the hardware modules ahead of time so that overall execu-
tion time is minimized. By making use of profiled execution frequencies of control-
flow edges, our proposed novel algorithm solves this problem through an iterative
approach that estimates placement-aware probabilities of reaching hardware execu-
tion for each basic block. Placement-aware probability refers to the probability of
reaching the execution of a hardware module without encountering conflicting mod-
ules on the path of the interprocedural control flow graph. Experiments show that our
proposed algorithm makes significant improvements over state-of-the-art prefetching
strategies that do not consider placement conflicts.
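The notion of placement-aware probability in the last contribution can be illustrated with a small fixed-point computation. The sketch below is a simplified, intraprocedural illustration and not the algorithm of Chapter 5: the function name `placement_aware_probability`, the toy control flow graph, its edge probabilities, and the assumption that H0 conflicts with H3 are all hypothetical.

```python
def placement_aware_probability(succ, hw_at, target, conflicts, max_iters=1000, tol=1e-12):
    """Probability, per basic block, of reaching `target` before any conflicting module.

    succ:      {block: [(successor, edge_probability), ...]} from profiled edge frequencies
    hw_at:     {block: hardware module executed in that block}
    conflicts: set of modules whose placement conflicts with `target`
    """
    prob = {b: 0.0 for b in succ}
    for _ in range(max_iters):
        delta = 0.0
        for b in succ:
            module = hw_at.get(b)
            if module == target:
                new = 1.0            # the target module executes here
            elif module in conflicts:
                new = 0.0            # a conflicting module would evict the prefetched target
            else:
                new = sum(p * prob[s] for s, p in succ[b])
            delta = max(delta, abs(new - prob[b]))
            prob[b] = new
        if delta < tol:              # stop once the estimates no longer change
            break
    return prob


# Toy loop: the body calls H0 or H3 with equal probability, then returns to the loop head.
cfg = {
    "entry": [("head", 1.0)],
    "head": [("body", 0.9), ("exit", 0.1)],
    "body": [("call_h0", 0.5), ("call_h3", 0.5)],
    "call_h0": [("head", 1.0)],
    "call_h3": [("head", 1.0)],
    "exit": [],
}
hw = {"call_h0": "H0", "call_h3": "H3"}
p = placement_aware_probability(cfg, hw, target="H0", conflicts={"H3"})
print({block: round(value, 3) for block, value in p.items()})
```

A placement-blind estimate would also count paths that pass through `call_h3` before eventually reaching `call_h0`, and would therefore rate a prefetch of H0 at `entry` as more useful than it is when the two modules cannot coexist.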
6.2 Future Work
6.2.1 Granularity of Reconfiguration and Configuration Scheduling
To our knowledge, there has been no previous work that discusses how the possibility
of splitting a single reconfiguration of a hardware module into distinct temporal phases
affects configuration scheduling. The work presented in this thesis has assumed the
cost of pre-emption and resumption to be minimal compared with the entire reconfiguration
overhead. However, in practice, reconfiguration needs to occur in multiples of frames
and there is a setup cost involved in initiating a reconfiguration. For example, it is
possible to break down the configuration of a hardware module consisting of ten frames
into ten distinct stages (i.e., during each stage, only one frame is loaded). While the
flexibility of scheduling the configuration of this hardware module has increased, the
overall configuration overhead has increased as well because a setup cost has to be paid
for each of the ten stages. Conversely, reconfiguring at a larger granularity results in
a loss of this flexibility. Therefore, the granularity of reconfiguration affects the
problem of configuration scheduling, and this should be considered in future work.
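The trade-off in the ten-frame example can be sketched with a simple cost model. This is an illustration under assumptions: the function name `config_overhead` is hypothetical, the setup cost is assumed to be paid once per stage, every frame is assumed to take the same time to load, and the numeric costs are invented.

```python
def config_overhead(frames: int, stages: int, setup_cost: float, frame_cost: float) -> float:
    """Total reconfiguration time when `frames` frames are loaded in `stages` separate stages."""
    if not 1 <= stages <= frames:
        raise ValueError("stages must be between 1 and the number of frames")
    # One setup cost per initiated stage, plus the time to load every frame once.
    return stages * setup_cost + frames * frame_cost


FRAMES = 10        # the ten-frame module from the example above
SETUP_COST = 5.0   # hypothetical cost of initiating one reconfiguration (time units)
FRAME_COST = 20.0  # hypothetical time to load a single frame (time units)

for stages in (1, 2, 5, 10):
    total = config_overhead(FRAMES, stages, SETUP_COST, FRAME_COST)
    print(f"{stages:2d} stage(s): total overhead {total:.0f}")
```

With these numbers, loading the module in a single stage costs 205 time units while loading it frame by frame costs 250, so a scheduler that splits the reconfiguration must recover at least the extra setup cost through better overlap with execution.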
6.2.2 Hardware-Software Co-Placement and Partitioning
Both Chapters 4 and 5 have assumed that the hardware partitioning is already done and the
placements of the hardware modules are decided beforehand. The focus was on the relative
speedup between the fetch-on-demand schedule and the desired optimized schedule. This
problem is orthogonal to that of selecting a suitable conflict set so that overall execution
time is minimized. Our experimental data from these studies show that the difference in
execution time between 2 different sets of conflicts could be as large as 60%. The problem
of hardware-software co-placement and partitioning could be expressed as follows: Given a
single sequential program and its constituent compute-intensive regions, how do we decide
which of these regions should be implemented in hardware? For the selected hardware
implementations, how shall we place these hardware modules on the FPGA so that the
conflict relationships between them will minimize the run-time reconfiguration overhead?
Obviously, these two questions are inter-related and need to be answered to obtain a quality
solution that minimizes the overall execution time. Therefore it makes sense to combine
the two into a unified co-design problem.
6.2.3 Configuration Management for Multi-core Reconfigurable Computing
This work so far has concentrated on architecture models that consist of a single CPU
and a single FPGA co-processor. However, given the advent of multi-core architectures
like FSB-FPGA [33], the challenge would be for general purpose programs to harness the
potential speedup possible from the attached reconfigurable device. Although there is
ongoing research in this area [29], many open problems remain unsolved, especially in the
domain of general purpose reconfigurable computing, where the set of applications running
in the system changes dynamically according to the demands of the users.
Bibliography
[1] J. M. Arnold. S5: the architecture and development flow of a software config-
urable processor. In ICFPT ’05, Proceedings of International Conference on Field-
Programmable Technology 2005, pages 121–128, Singapore, Dec. 2005. IEEE.
[2] P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction-set
metamorphosis. Computer, 26(3):11–18, Mar. 1993.
[3] S. Banerjee, E. Bozorgzadeh, and N. Dutt. Physically-aware HW-SW partitioning
for reconfigurable architectures with partial dynamic reconfiguration. In DAC ’05:
Proceedings of the 42nd annual conference on Design automation, pages 335–340,
New York, NY, USA, 2005. ACM Press.
[4] S. Banerjee, E. Bozorgzadeh, N. Dutt, and J. Noguera. Selective bandwidth and
resource management in scheduling for dynamically reconfigurable architectures. In
DAC ’07: Proceedings of the 44th annual Design Automation Conference, pages 771–
776, New York, NY, USA, 2007. ACM.
[5] K. Bondalapati and V. K. Prasanna. Reconfigurable computing systems. Proceedings
of the IEEE, 90:1201–1217, 2002.
[6] T. J. Callahan. Automatic Compilation of C for Hybrid Reconfigurable Architecture.
PhD thesis, University of California at Berkeley, Berkeley, California, United States,
2002.
[7] T. J. Callahan, J. R. Hauser, and J. Wawrzynek. The Garp architecture and C compiler.
IEEE Computer, 33(4):62–69, 2000.
[8] Celoxica Ltd., Oxfordshire, UK. DK3: Handel-C Language Reference Manual, 2002.
[9] L. N. Chakrapani, J. Gyllenhaal, W. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M.
Rabbah. Trimaran: An infrastructure for research in instruction-level parallelism.
In LCPC ’04, Proceedings of the 17th International Workshop on Languages and
Compilers for Parallel Computing, volume 3602, pages 32–41. Springer, 2005.
[10] K. S. Chatha and R. Vemuri. Hardware-software codesign for dynamically reconfig-
urable architectures. In FPL ’99: 9th International Workshop on Field-Programmable
Logic and Applications, pages 175–184. Springer, 1999.
[11] Y. Chen and S. Y. Chen. Cost-driven hybrid configuration prefetching for partial
reconfigurable coprocessor. In RAW ’07: Proceedings of International Parallel and
Distributed Processing Symposium Reconfigurable Architectures Workshop (RAW),
pages 194–200, Los Alamitos, CA, USA, 2007. IEEE.
[12] C. Claus, F. H. Müller, J. Zeppenfeld, and W. Stechele. A new framework to accelerate
Virtex-II Pro dynamic partial self-reconfiguration. In RAW ’07: Proceedings of
International Parallel and Distributed Processing Symposium Reconfigurable Archi-
tectures Workshop (RAW), pages 1–7, Los Alamitos, CA, USA, 2007. IEEE.
[13] K. Compton, J. Cooley, S. Knol, and S. Hauck. Configuration relocation and de-
fragmentation for reconfigurable computing. In FCCM ’00: Proceedings of the 8th
IEEE Symposium on Field-Programmable Custom Computing Machines, page 279,
Washington, DC, USA, 2000. IEEE Computer Society.
[14] K. Compton and S. Hauck. Reconfigurable computing: a survey of systems and
software. ACM Computing Surveys (CSUR), 34(2):171–210, 2002.
[15] E. El-Araby, I. Gonzalez, and T. El-Ghazawi. Exploiting partial runtime reconfigura-
tion for high-performance reconfigurable computing. ACM Transactions on Recon-
figurable Technology and Systems, 1(4):1–23, 2009.
[16] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli. System level hardware/software
partitioning based on simulated annealing and tabu search. Kluwer Journal on Design
Automation for Embedded Systems, 2(1):5–32, 1997.
[17] R. Ernst, J. Henkel, and T. Benner. Hardware-software cosynthesis for microcon-
trollers. pages 18–29, 2002.
[18] S. P. Fekete, J. C. van der Veen, J. Angermeier, C. Göhringer, M. Majer, and J. Teich.
Scheduling and communication-aware mapping of HW/SW modules for dynamically
and partially reconfigurable SoC architectures. In ARCS ’07: 20th International Con-
ference on Architecture of Computing Systems 2007, pages 151–160. VDE-Verlag,
Berlin, 2007.
[19] T. Feo and M. Resende. Greedy randomized adaptive search procedures. In Journal
of Global Optimization, volume 6, pages 109–133, 1995.
[20] J. E. Fritts, F. W. Steiling, J. A. Tucek, and W. Wolf. Mediabench II video: Expediting
the next generation of video systems research. Microprocessors and Microsystems,
33(4):301–318, 2009.
[21] W. Fu and K. Compton. An execution environment for reconfigurable comput-
ing. In FCCM ’05: Proceedings of the 13th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines, pages 149–158, Washington, DC,
USA, 2005. IEEE Computer Society.
[22] S. Ganesan and R. Vemuri. An integrated temporal partitioning and partial
reconfiguration technique for design latency improvement. In DATE ’00: Proceedings of the
Conference on Design, Automation and Test in Europe, pages 320–325, New York,
NY, USA, 2000. ACM Press.
[23] S. Ghiasi, A. Nahapetian, and M. Sarrafzadeh. An optimal algorithm for minimizing
run-time reconfiguration delay. ACM Transactions on Embedded Computing Systems,
3(2):237–256, 2004.
[24] F. Glover and M. Laguna. Tabu search. In C. Reeves, editor, Modern Heuristic
Techniques for Combinatorial Problems, Oxford, England, 1993. Blackwell Scientific
Publishing.
[25] M. J. Harrold, G. Rothermel, and S. Sinha. Computation of interprocedural control
dependence. In ISSTA ’98: Proceedings of the 1998 ACM SIGSOFT International
Symposium on Software Testing and Analysis, pages 11–20, New York, NY, USA,
1998. ACM.
[26] S. Hauck. Configuration prefetch for single context reconfigurable coprocessors. In
FPGA ’98: Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on
Field Programmable Gate Arrays, pages 65–74, New York, NY, USA, 1998. ACM.
[27] J. R. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable
coprocessor. In FCCM ’97: Proceedings of the 5th IEEE Symposium on FPGA-Based
Custom Computing Machines, page 12, Washington, DC, USA, 1997. IEEE Com-
puter Society.
[28] C. H. Ho, C. W. Yu, P. Leong, W. Luk, and S. Wilton. Floating-point FPGA:
Architecture and modeling. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 17(12):1709–1718, Dec. 2009.
[29] C. Huang and F. Vahid. Dynamic coprocessor management for FPGA-enhanced com-
pute platforms. In CASES ’08: Proceedings of the 2008 International Conference on
Compilers, Architectures and Synthesis for Embedded Systems, pages 71–78, New
York, NY, USA, 2008. ACM.
[30] H. P. Huynh, J. E. Sim, and T. Mitra. An efficient framework for dynamic recon-
figuration of instruction-set customization. In CASES ’07: Proceedings of the 2007
International Conference on Compilers, Architecture, and Synthesis for Embedded
Systems, pages 135–144, New York, NY, USA, 2007. ACM.
[31] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann,
R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery.
The superblock: An effective technique for VLIW and superscalar compilation. The
Journal of Supercomputing, 7:229–248, 1993.
[32] Intel Corp. Intel Itanium 2 Processor Reference Manual for Software Development,
Document Number 251110-001, June 2002.
[33] Intel Corp. Intel Quickassist Technology, 2008.
http://www.intel.com/technology/platforms/quickassist/index.htm.
[34] A. Kalavade and E. A. Lee. A global criticality/local phase driven algorithm for the
constrained hardware/software partitioning problem. In CODES ’94: Proceedings of
the 3rd International Workshop on Hardware/Software Codesign, pages 42–48, Los
Alamitos, CA, USA, 1994. IEEE Computer Society Press.
[35] D. Koch, C. Beckhoff, and J. Teich. Bitstream decompression for high speed FPGA
configuration from slow memories. In ICFPT ’07, Proceedings of International Con-
ference on Field-Programmable Technology 2007, pages 161–168, Kokurakita, Ki-
takyushu, JAPAN, Dec. 2007. IEEE.
[36] D. Koch, C. Beckhoff, and J. Teich. A communication architecture for complex run-time
reconfigurable systems and its implementation on Spartan-3 FPGAs. In FPGA ’09:
Proceedings of the 2009 ACM/SIGDA International Symposium on Field Programmable Gate
Arrays, pages 253–256, New York, NY, USA, 2009. ACM.
[37] D. L. Kreher and D. R. Stinson. Combinatorial Algorithms Generation, Enumeration
and Search. CRC Press Inc, 1998.
[38] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Mediabench: a tool for evaluating
and synthesizing multimedia and communications systems. In MICRO 30:
Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture,
pages 330–335, Washington, DC, USA, 1997. IEEE Computer Society.
[39] T. Lengauer and R. E. Tarjan. A fast algorithm for finding dominators in a flowgraph.
ACM Transactions on Programming Languages and Systems, 1(1):121–141, 1979.
[40] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, and J. Stockwood. Hardware-
software co-design of embedded reconfigurable architectures. In DAC ’00: Proceed-
ings of the 37th Annual Design Automation Conference, pages 507–512, New York,
NY, USA, 2000. ACM.
[41] Z. Li, K. Compton, and S. Hauck. Configuration caching management techniques for
reconfigurable computing. In FCCM ’00: Proceedings of the 8th IEEE Symposium on
Field-Programmable Custom Computing Machines, page 22, Washington, DC, USA,
2000. IEEE Computer Society.
[42] Z. Li and S. Hauck. Configuration prefetching techniques for partial reconfigurable
coprocessor with relocation and defragmentation. In FPGA ’02: Proceedings of the
2002 ACM/SIGDA Tenth International Symposium on Field Programmable Gate Ar-
rays, pages 187–195, New York, NY, USA, 2002. ACM.
[43] P. Lysaght, B. Blodget, J. Mason, J. Young, and B. Bridgford. Invited paper: En-
hanced architectures, design methodologies and CAD Tools for Dynamic Reconfigu-
ration of Xilinx FPGAs. In FPL ’06: Proceedings of the 2006 International Confer-
ence on Field Programmable Logic and Applications, pages 1–6, 2006.
[44] G. Memik, W. H. Mangione-Smith, and W. Hu. Netbench: a benchmarking suite for
network processors. In ICCAD ’01: Proceedings of the 2001 IEEE/ACM international
conference on Computer-aided design, pages 39–42, Piscataway, NJ, USA, 2001.
IEEE Press.
[45] C. Nevill-Manning and I. Witten. Identifying hierarchical structure in sequences: A
linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997.
[46] J. Noguera and R. Badia. A HW/SW partitioning algorithm for dynamically
reconfigurable architectures. In DATE ’01: Proceedings of the Conference on Design,
Automation and Test in Europe, page 729, Piscataway, NJ, USA, 2001. IEEE Press.
[47] J. Noguera and R. M. Badia. Multitasking on reconfigurable architectures: microar-
chitecture support and dynamic scheduling. ACM Transactions on Embedded Com-
puting Systems, 3(2):385–406, 2004.
[48] UIUC Open IMPACT Effort. The Open IMPACT IA-64 Compiler.
http://gelato.uiuc.edu.
[49] E. M. Panainte, K. Bertels, and S. Vassiliadis. Instruction scheduling for dynamic
hardware configurations. In DATE ’05: Proceedings of the conference on Design,
Automation and Test in Europe, pages 100–105, Washington, DC, USA, 2005. IEEE
Computer Society.
[50] E. M. Panainte, K. Bertels, and S. Vassiliadis. Interprocedural compiler optimization
for partial run-time reconfiguration. Journal of VLSI Signal Processing Systems, 43(2-
3):161–172, 2006.
[51] R. N. Pittman, N. L. Lynch, and R. Forin.
eMIPS, a dynamically extensible processor. Technical report, Microsoft Research,
Redmond, WA, United States, 2006.
[52] D. N. Rakhmatov and S. B. K. Vrudhula. Hardware-software bipartitioning for dy-
namically reconfigurable systems. In CODES ’02: Proceedings of the Tenth Interna-
tional Symposium on Hardware/Software Codesign, pages 145–150, New York, NY,
USA, 2002. ACM.
[53] F. Redaelli, M. D. Santambrogio, and D. Sciuto. Task scheduling with configura-
tion prefetching and anti-fragmentation techniques on dynamically reconfigurable
systems. In DATE ’08: Proceedings of the Conference on Design, Automation and
Test in Europe, pages 519–522, New York, NY, USA, 2008. ACM.
[54] J. Resano, D. Mozos, and F. Catthoor. A hybrid prefetch scheduling heuristic to min-
imize at run-time the reconfiguration overhead of dynamically reconfigurable hard-
ware. In DATE ’05: Proceedings of the conference on Design, Automation and Test
in Europe, pages 106–111, Washington, DC, USA, 2005. IEEE Computer Society.
[55] J. E. Sim, W. F. Wong, and J. Teich. Optimal placement-aware trace-based scheduling
of hardware reconfigurations for FPGA accelerators. In FCCM ’09: Proceedings of the
17th Annual IEEE Symposium on Field-Programmable Custom Computing Machines,
pages 279–282, Napa, CA, USA, Apr. 2009. IEEE Computer Society.
[56] S. Singh, J. Rose, P. Chow, and D. Lewis. The effect of logic block architecture
on FPGA performance. Solid-State Circuits, IEEE Journal of, 27(3):281–287, Mar.
1992.
[57] B. So, M. W. Hall, and P. C. Diniz. A compiler approach to fast hardware design
space exploration in FPGA-based systems. In PLDI ’02: Proceedings of the ACM
SIGPLAN 2002 Conference on Programming Language Design and Implementation,
pages 165–176, New York, NY, USA, 2002. ACM.
[58] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmark Suite,
2006. http://www.spec.org/cpu2006.
[59] C. Steiger, H. Walder, M. Platzner, and L. Thiele. Online scheduling and placement
of real-time tasks to partially reconfigurable devices. In RTSS ’03: Proceedings of the
24th IEEE International Real-Time Systems Symposium, page 224, Washington, DC,
USA, 2003. IEEE Computer Society.
[60] G. Stitt, F. Vahid, and S. Nematbakhsh. Energy savings and speedups from partitioning
critical software loops to hardware in embedded systems. ACM Transactions on
Embedded Computing Systems, 3(1):218–232, Feb. 2004.
[61] D. C. Suresh, W. A. Najjar, F. Vahid, J. R. Villarreal, and G. Stitt. Profiling tools for
hardware/software partitioning of embedded applications. In LCTES ’03: Proceed-
ings of the 2003 ACM SIGPLAN conference on Language, Compiler, and Tool for
Embedded Systems, pages 189–198, New York, NY, USA, 2003. ACM.
[62] S. Talla. Adaptive Explicitly Parallel Instruction Computing. PhD thesis, New York
University, New York, United States, 2001.
[63] X. Tang, M. Aalsma, and R. Jou. A compiler directed approach to hiding configuration
latency in chameleon processors. In FPL ’00: Proceedings of The Roadmap
to Reconfigurable Computing, 10th International Workshop on Field-Programmable
Logic and Applications, pages 29–38, London, UK, 2000. Springer-Verlag.
[64] J. Teich, S. P. Fekete, and J. Schepers. Optimization of dynamic hardware reconfigu-
rations. The Journal of Supercomputing, 19(1):57–75, 2001.
[65] F. Vahid. Modifying min-cut for hardware and software functional partitioning. In
CODES ’97: Proceedings of the 5th International Workshop on Hardware/Software
Co-Design, page 43, Washington, DC, USA, 1997. IEEE Computer Society.
[66] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, and E. M.
Panainte. The MOLEN polymorphic processor. IEEE Transactions on Computers,
53(11):1363–1375, 2004.
[67] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the H.264/AVC
video coding standard. Circuits and Systems for Video Technology, IEEE Transactions
on, 13(7):560–576, 2003.
[68] Xilinx Inc., San Jose, CA, United States. Xilinx Corp, XUP Virtex II Pro Development
System. Available at http://www.xilinx.com.
[69] Xilinx Inc., San Jose, CA, United States. Xilinx Virtex-E 1.8 V Field-Programmable
Gate Arrays DataSheet. Available at http://www.xilinx.com.
[70] Xilinx Inc., San Jose, CA, United States. Xilinx Virtex-II Pro Platform FPGAs: com-
plete data sheet. Available at http://www.xilinx.com.
[71] Xilinx Inc., San Jose, CA, United States. Virtex-II 1.5V Field-Programmable Gate
Arrays, Nov. 2001. Available from http://www.xilinx.com.
[72] Xilinx Inc., San Jose, CA, United States. XC2V6000 data sheet, 2001. Available from
http://www.xilinx.com.
[73] Xilinx Inc., San Jose, CA, United States. Virtex Series Configuration Architecture
User Guide (XAPP151 v1.7), Oct. 2004. Available from http://www.xilinx.com.
[74] Xilinx Inc., San Jose, CA, United States. PowerPC 405 Processor Block Reference
Guide, July 2005. Available at http://www.xilinx.com.
[75] Xilinx Inc., San Jose, CA, United States. Virtex-4 User Guide (UG070 v2.6), 2008.
Available from http://www.xilinx.com.
[76] Xilinx Inc., San Jose, CA, United States. Virtex-6 Family Overview (v2.1), 2009.
Available from http://www.xilinx.com/.
[77] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA: a high-performance
architecture with a tightly-coupled reconfigurable functional unit. In ISCA ’00: Pro-
ceedings of the 27th annual International Symposium on Computer Architecture,
pages 225–235, New York, NY, USA, 2000. ACM.
