Integrated Optimization of Partitioning, Scheduling and Floorplanning
  for Partially Dynamically Reconfigurable Systems by Chen, Song et al.
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL., NO. 1
Integrated Optimization of Partitioning, Scheduling,
and Floorplanning for Partially Dynamically
Reconfigurable Systems
Song Chen, Member, IEEE, Jinglei Huang, Xiaodong Xu, Bo Ding, and Qi Xu
Abstract—Confronted with the challenge of high performance
for applications and the restriction of hardware resources for
field-programmable gate arrays (FPGAs), partial dynamic re-
configuration (PDR) technology is anticipated to accelerate the
reconfiguration process and alleviate the device shortage. In
this paper, we propose an integrated optimization framework
for task partitioning, scheduling and floorplanning on partially
dynamically reconfigurable FPGAs. The partition, schedule, and
floorplan of the tasks are represented by the partitioned sequence
triple P -ST (PS,QS,RS), where (PS,QS) is a hybrid nested
sequence pair (HNSP ) for representing the spatial and temporal
partitions, as well as the floorplan, and RS is the partitioned
dynamic configuration order of the tasks. The floorplanning and
scheduling of task modules can be computed from the partitioned
sequence triple P-ST in O(n^2) time. To integrate the exploration
of the scheduling and floorplanning design space, we use a simu-
lated annealing-based search engine and elaborate a perturbation
method, where a randomly chosen task module is removed from
the partition sequence triple and then re-inserted into a proper
position selected from all the O(n^3) possible combinations of
partition, schedule and floorplan. We also prove a sufficient and
necessary condition for the feasibility of the partitioning of tasks
and scheduling of task configurations, and derive conditions for
the feasibility of the insertion points in a P-ST. The experimental
results demonstrate the efficiency and effectiveness of the proposed
framework.
Index Terms—Partitioning, scheduling, floorplanning, partially
dynamically reconfigurable, FPGAs, partitioned sequence triple
I. INTRODUCTION
In recent decades, reconfigurable hardware, and field pro-
grammable gate arrays (FPGAs) in particular, have received
much attention because of their ability to be reconfigured to
any custom desired computing architecture rapidly [1]. We
can construct an entire hardware system on an FPGA chip or
include an FPGA on a system-on-chip to provide hardware
programmability. Traditionally, FPGAs are exploited using
compile-time (static) reconfiguration, and the configuration
remains the same throughout the running time of an application.
This work was partially supported by the National Natural Science Founda-
tion of China (NSFC) under grant No.61874102, No.61732020 and 61674133.
The authors would like to thank the Information Science Laboratory Center of
USTC for the hardware & software services.
S. Chen, X. Xu, and B. Ding are with the Department of Electronic Science
and Technology, University of Science and Technology of China, China (email:
songch@ustc.edu.cn;xxd0210@mail.ustc.edu.cn;dingbo@mail.ustc.edu.cn).
J. Huang is with State Key Laboratory of Air Traffic Management System
and Technology, China (email:huangjl@mail.ustc.edu.cn).
Q. Xu is with the School of Electronic Science and Applied Physics, Hefei
University of Technology, China (email: xuqi@hfut.edu.cn).
To change the configuration, we have to stop the computation,
reconfigure the chip by means of power-on resetting, and
then start a new application. With the evolution of FPGA
technology, dynamic reconfiguration (DR) has been developed,
which provides more flexibility to reconfigure the FPGA by
changing its predetermined functions at run-time. Through DR,
one large application can be partitioned into smaller tasks; then,
the tasks can be sequentially configured at run-time. In this
process, the entire chip must be reconfigured for each task; thus,
significant reconfiguration overhead is incurred for loading the
configuration each time [2].
To reduce the reconfiguration overhead and improve per-
formance, several techniques are employed in modern FPGA
architectures, such as partial dynamic reconfiguration (PDR),
module reuse, and configuration prefetching, where PDR is
a technique that reconfigures part of the FPGA at run-time
while retaining normal operation of the remaining areas of
the FPGA [3]. By applying the PDR technique, different tasks
can be executed and configured in parallel, and a portion of
the configuration latency can be hidden by careful scheduling
of the configurations and executions of tasks. Hereafter, the
FPGA, with the characteristic of PDR, is regarded as a partially
dynamically reconfigurable FPGA (PDR-FPGA).
To implement a large application composed of task modules
on a PDR-FPGA, we must consider two problems: when the
task modules should be configured and executed and where
the task modules should be placed. The former is a scheduling
problem, and the latter is a floorplanning problem. Unfortu-
nately, both of them are NP-hard [4] [5]. In addition, to enable
PDR, the reconfigurable resources on the FPGA are partitioned
into several reconfigurable regions, which will be dynamically
reconfigured to realize different tasks over time. Therefore, the
number of partitioned reconfigurable regions and their sizes
should be considered in this process.
A. Related Work
Many studies have focused on partitioning, scheduling, and
floorplanning for PDR. R. Cordone et al. [6] proposed an
integer linear programming (ILP) based method and a heuristic
method for partitioning and scheduling task graphs on PDR-
FPGAs, where configuration prefetching and module reuse
are considered to minimize the reconfiguration overhead. A.
Purgato et al. [7] proposed a fast task scheduling heuristic
to schedule the tasks in either the hardware or the software
[arXiv:1803.03748v2 [cs.AR] 26 Dec 2018]
with minimization of the overall execution time on partially
reconfigurable systems. However, the proposed method only
focuses on generating reconfigurable regions to satisfy the
resource requirements, which will easily cause the final result
to fail to produce a valid floorplan. Y. Jiang et al. [8] proposed
a network flow-based multi-way task partitioning algorithm
to minimize the total communication costs across temporal
partitions. However, in this work, the partitioning is simplified
without considering the partial reconfiguration, and it is difficult
to effectively estimate the communication costs without the
floorplan information. All the aforementioned works mainly
focus on partitioning/scheduling of the tasks without consid-
eration of the floorplan, which will often cause the schedule to
fail to be floorplanned effectively, as they do not consider the
resource constraints on the FPGA chips.
E. A. Deiana et al. [9] proposed a mixed-integer linear pro-
gramming (MILP) based scheduler for mapping and scheduling
applications on partially reconfigurable FPGAs, and if the
schedule cannot be successfully floorplanned, the scheduler is
re-executed until a feasible floorplan is identified. However,
the time-consuming MILP based method is impractical for
large applications. In addition, scheduling and floorplanning
are solved separately, which can cause large communication
costs in the spatial domain. M. Vasilko [10] proposed a
temporal floorplanning method for solving the scheduling and
floorplanning of dynamically reconfigurable systems. P. Yuh et
al. [11] [12] modeled the tasks as three-dimensional (3D) boxes
and proposed simulated annealing-based 3D floorplanners to
solve the floorplanning and scheduling problems of the tasks.
However, the task modules are assumed to be reconfigured at
any time and in any region, which may not match practical
reconfigurable architectures. For example, in the Virtex 7 series
FPGA chips from Xilinx [13], the reconfiguration partitions
(dynamically reconfigurable regions) cannot be overlapped.
Given scheduled task graphs, many works have focused on the
floorplanning of partially reconfigurable designs [14]–[19].
The design of reconfigurable systems with PDR generally
involves partitioning, scheduling, and floorplanning of the tasks,
which are interdependent considering communication costs and
system performance. Therefore, these three problems have to be
solved in an integrated optimization framework to effectively
explore the design space. However, the aforementioned works
either solve the three problems sequentially, where, at most, a
simple iterative refinement between scheduling and floorplan-
ning is included, or solve only two of the three problems in an
integrated framework.
B. Contributions
In this paper, we propose an integrated optimization frame-
work for task partitioning, scheduling, and floorplanning on
partially dynamically reconfigurable FPGAs. This paper ex-
pands our previous work [20]. Numerous theoretical analyses
are provided for the feasibility of the P -ST s (defined below).
The main contributions of this paper are outlined as follows.
1). The term P -ST (PS,QS,RS) is proposed to represent
the partitions, schedule, and floorplan of n task modules, where
PS, QS, and RS are the sequences of n task modules. (PS,
QS) is regarded as a hybrid nested sequence pair (HNSP )
representing the floorplan with spatial and temporal partition,
and RS is the partitioned dynamic configuration order of the
tasks. The floorplan can be computed from the HNSP in
O(n log log n) time, and the schedule of tasks can be computed
in O(n^2) time by solving a single-source longest-path problem
on a reconfiguration constraint graph (RCG), which is constructed
based on P-ST and the task precedence graph.
2). We elaborate a perturbation method to integrate the
exploration of the schedule and floorplan design space into
simulated annealing-based searching. In the perturbation, a
randomly chosen task module is removed from a P -ST and is
then re-inserted into the partitioned sequence triple at a proper
position selected from all O(n^3) possible insertion points,
which are efficiently evaluated in O(n^4) time based on an
insertion point enumeration procedure.
3). We prove a sufficient and necessary condition for the
feasibility of the partitioning of tasks and scheduling of task
configurations, which is not included in [20], and derive con-
ditions for the feasibility of the insertion points in a P -ST .
The experimental results demonstrate the efficiency and
effectiveness of the proposed optimization framework.
The remainder of the paper is organized as follows. Section II
describes the target hardware architecture and the problem
definition. Section III discusses the representation of a sequence
triple. Section IV shows the optimization framework to explore
the design space of partitioning, scheduling and floorplanning
of task modules. Experimental results and conclusions are
shown and discussed in Section V and Section VI, respectively.
II. PROBLEM DESCRIPTION
A. Dynamically Reconfigurable Architecture
The dynamically reconfigurable system typically includes a
host processor, an FPGA chip, an external memory, and the
communication infrastructure among them. The host processor
and communication infrastructure could be on-chip or off-chip.
Pre-synthesized task modules are stored in off-chip external
memory in the form of bitstreams. According to the scheduled
sequence and floorplanned locations, the host processor deploys
task modules on the FPGAs.
Modern FPGAs have evolved into complex heterogeneous
and hierarchical devices. However, the basic logic cell still
comprises configurable logic blocks (CLBs) [21]. In the target
architecture, the CLB is the smallest reconfigurable element.
Configuration bitstreams are transferred into FPGAs using one
configuration port, which is either an external Joint Test Action Group
(JTAG) port or an internal configuration access port (ICAP).
On the other hand, PDR is subject to the technology limita-
tion, which is that the configuration process of a task module
must not disrupt the execution of other task modules [14].
Thus, generally, dynamically reconfigurable regions (DRRs),
where the task modules are dynamically reconfigured in a
manner similar to that of a context (time layer) switching
mode, are used for implementing partial reconfiguration. On an
TABLE I: Some frequently used notations.
Notation                              Meaning
$m_i$                                 a task module, $1 \le i \le n$
$w_i, h_i$                            number of CLB rows and CLB columns required by $m_i$
$c_i, t_i$                            configuration span (time) and execution span (time) of $m_i$
$bc_i, bt_i$                          start configuration time and start execution time of $m_i$
$drr_i$                               a dynamically reconfigurable region, $1 \le i \le N$
$drr(m_i)$                            the DRR where $m_i$ is located
$tl^i_j$                              the j-th time layer in $drr_i$
$tl(m_i)$                             the time layer where $m_i$ is located
$ctl^i_j$                             configuration span of time layer $tl^i_j$
$bctl^i_j$                            start configuration time of time layer $tl^i_j$
$CO[tl^i_j]$                          configuration order of time layer $tl^i_j$
$lt\_s^i_j$                           start of lifetime of time layer $tl^i_j$
$lt\_e^i_j$                           end of lifetime of time layer $tl^i_j$
$LT[tl^i_j] = (lt\_s^i_j, lt\_e^i_j)$   lifetime of time layer $tl^i_j$
FPGA chip, we can have multiple DRRs and one DRR can be
dynamically reconfigured while the others continue to execute.
A DRR is a rectangular region on FPGAs because irregular-
shaped reconfiguration regions (such as T or L shapes) can
introduce routing restriction issues [13]. A task can be imple-
mented as a rectangular hardware module on the FPGA. The
module area represents the occupied CLBs (the number of rows
and columns on the FPGA).
B. Problem Definition
The design is composed of pre-synthesized tasks whose
resource usage and internal routing are predetermined. Let
$M = \{m_i \mid 1 \le i \le n\}$ be a set of n tasks. A task $m_i$ has
a physical attribute vector $(w_i, h_i, c_i, t_i)$, whose components are
explained in Table I. The configuration span $c_i$ is proportional to the
module area and is estimated by $c_{clb} \times w_i \times h_i$, where
$c_{clb}$ is the configuration time of a single CLB.
The data dependencies among these tasks are given as a task
dependence graph, $TG = (V_{TG}, E_{TG})$, where $V_{TG} = M$ and
$E_{TG} = \{(m_i, m_j) \mid 1 \le i, j \le n,\ i \ne j,\ \text{and } m_i \text{ must end before } m_j \text{ starts}\}$.
$TG' = (V_{TG}, E_{TG'})$ denotes the transitive closure of TG.
The partitioning, scheduling, and floorplanning of PDR are
formulated as follows:
In the spatial domain, the n tasks are partitioned into DRRs.
Let N be the number of DRRs.
Definition 1. The DRRs are denoted as $DRR = \{drr_i \mid 1 \le i \le N,\ drr_i \subseteq M,\ \bigcup_{i=1}^{N} drr_i = M\}$, where $\forall i \ne j$, $drr_i \cap drr_j = \emptyset$. If $m_k \in drr_i$, we denote the DRR of $m_k$ as $drr(m_k) = drr_i$.
In the temporal domain, the n tasks are partitioned into
different time layers to reuse the resources of DRRs. A time
layer is configured as a whole. Thus, in the same DRR, a time
layer can only be configured after the completion of all the
tasks in the previous time layer. Let li be the number of time
layers in drri.
Definition 2. The time layers are denoted as $TL = \{tl^i_j \mid 1 \le i \le N,\ 1 \le j \le l_i,\ tl^i_j \subseteq drr_i,\ \bigcup_{j=1}^{l_i} tl^i_j = drr_i\}$, where $\forall j_1 \ne j_2$, $tl^i_{j_1} \cap tl^i_{j_2} = \emptyset$. If $m_k \in tl^i_j$, we denote the time layer of $m_k$ as $tl(m_k)$ and the total number of time layers as $|TL| = \sum_{i=1}^{N} l_i$.
For convenience, we define $CO[tl^i_j]$ to be the configuration
order of time layer $tl^i_j$, $1 \le CO[tl^i_j] \le |TL|$, and stipulate that
$CO[tl^i_j] < CO[tl^i_{j+1}]$, $1 \le j < l_i$.
The configuration span (time) of the time layers in a DRR
is proportional to the area of the DRR; we use $c_{drr_i}$ or $ctl^i_j$
($= c_{drr_i}$) to denote the configuration span of a time layer $tl^i_j$.
To reduce the time complexity in the proposed integrated optimization
framework, $ctl^i_j$ is also under-estimated by summing
the configuration times of its task modules:
$$ctl^i_j = \sum_{m_p \in tl^i_j} c_p \qquad (1)$$
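As a minimal sketch of the two estimates above, the following Python fragment computes $c_i = c_{clb} \times w_i \times h_i$ for a module and sums these values over a time layer as in Eq. (1). The function names, the dictionary layout, and the numeric values are illustrative assumptions and do not come from the paper.

```python
from typing import Dict, Iterable, Tuple


def task_config_time(w: int, h: int, c_clb: float) -> float:
    """Estimate c_i = c_clb * w_i * h_i for one task module."""
    return c_clb * w * h


def layer_config_span(layer: Iterable[str],
                      dims: Dict[str, Tuple[int, int]],
                      c_clb: float) -> float:
    """Eq. (1): under-estimate of a time layer's configuration span,
    obtained by summing the configuration times of its modules."""
    return sum(task_config_time(*dims[m], c_clb) for m in layer)


# Hypothetical (w_i, h_i) values and CLB configuration time.
dims = {"m1": (2, 3), "m2": (1, 4)}
span = layer_config_span(["m1", "m2"], dims, c_clb=0.5)
```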
For the scheduling, we consider the following constraints:
(1) The precedence constraints between tasks cannot be
violated, that is, ∀(mi,mj) ∈ ETG, bti + ti ≤ btj .
(2) A task must be configured before execution, that is, ∀1 ≤
i ≤ n, bctl(mi) + ctl(mi) ≤ bti.
(3) Considering the technical limitation of only one config-
uration port, the configuration span of time layers must
be non-overlapped.
(4) In the same DRR, a time layer can only be configured
after the execution of all the tasks in the previous time
layer because they share the same hardware resources.
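The four scheduling constraints can be read as simple inequalities on the start times. The sketch below checks a candidate schedule against them; the data layout (dictionaries with keys `bt`, `t`, `bc`, `c`, `drr`, `index`) and the tiny example are assumptions made for illustration only.

```python
def check_schedule(tasks, layers, edges):
    """Return a list of violated scheduling constraints (1)-(4)."""
    violations = []
    # (1) precedence: bt_i + t_i <= bt_j for every edge (m_i, m_j)
    for a, b in edges:
        if tasks[a]["bt"] + tasks[a]["t"] > tasks[b]["bt"]:
            violations.append(("precedence", a, b))
    # (2) a task starts only after its time layer has been configured
    for name, lay in layers.items():
        for m in lay["tasks"]:
            if lay["bc"] + lay["c"] > tasks[m]["bt"]:
                violations.append(("config-before-exec", name, m))
    # (3) single configuration port: configuration spans must not overlap
    order = sorted(layers, key=lambda k: layers[k]["bc"])
    for prev, nxt in zip(order, order[1:]):
        if layers[prev]["bc"] + layers[prev]["c"] > layers[nxt]["bc"]:
            violations.append(("port-overlap", prev, nxt))
    # (4) within a DRR, the next layer waits for all tasks of the previous one
    by_drr = {}
    for name, lay in layers.items():
        by_drr.setdefault(lay["drr"], []).append(name)
    for names in by_drr.values():
        names.sort(key=lambda k: layers[k]["index"])
        for prev, nxt in zip(names, names[1:]):
            end_prev = max(tasks[m]["bt"] + tasks[m]["t"]
                           for m in layers[prev]["tasks"])
            if end_prev > layers[nxt]["bc"]:
                violations.append(("drr-reuse", prev, nxt))
    return violations


# Hypothetical two-task example sharing one DRR.
tasks = {"a": {"bt": 2, "t": 3}, "b": {"bt": 6, "t": 2}}
layers = {
    "L1": {"drr": 1, "index": 1, "tasks": ["a"], "bc": 0, "c": 2},
    "L2": {"drr": 1, "index": 2, "tasks": ["b"], "bc": 5, "c": 1},
}
ok = check_schedule(tasks, layers, [("a", "b")])
# Configuring L2 at time 4 would overwrite the region while "a" still runs.
bad_layers = {**layers, "L2": {**layers["L2"], "bc": 4}}
bad = check_schedule(tasks, bad_layers, [("a", "b")])
```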
The constraints for the floorplanning process are as follows.
(5) Each DRR occupies a rectangular region, and all the
rectangular regions of the DRRs should be placed without
overlapping each other and should be within the FPGA
chip area, which is defined by the chip width and chip
height (fixed-outline constraint).
(6) The task modules in the same time layer must be non-
overlapped and placed within their corresponding DRR.
Under the above constraints, we solve the partitioning prob-
lem to determine DRR and TL, the scheduling problem to
determine the start configuration time and start execution time
of the tasks (time layers), and the floorplanning problem to
determine the floorplan of DRRs and the floorplan of tasks
inside the DRRs.
We define schedule length to be the time from the beginning
of the configuration process to the end of the executions of all
tasks. The objective is to find a reasonable floorplan of tasks on
a PDR-FPGA while minimizing the schedule length of designs
as well as the communication costs among tasks.
III. PARTITIONED SEQUENCE TRIPLE
A. Representation
In this paper, a partitioned sequence triple (P -ST ) is pro-
posed to represent the partitioning, scheduling, and floorplan-
ning of tasks for partially dynamically reconfigurable designs.
Definition 3. The partitioned sequence triple P -ST is a 3-tuple
of task sequences, (PS,QS,RS), where (PS,QS) forms a
hybrid nested sequence pair (HNSP ) to represent the spatial
partition (DRR), the temporal partition (time layer) and the
floorplan of the task modules, and RS defines the configuration
order of the time layers.
In a P -ST , task partitioning is constrained as follows:
1) The task modules in the same time layer will consecu-
tively appear in PS, QS, and RS.
2) The task modules in the same DRR will consecutively
appear in both PS and QS.
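The two contiguity constraints can be checked directly on the sequences. The sketch below does so for the ten-task example given in Eq. (2) of this section; the helper names are hypothetical, and the check covers only the two constraints listed above, not the feasibility of the configuration order.

```python
from typing import List, Sequence, Set


def is_contiguous(seq: Sequence[int], group: Set[int]) -> bool:
    """True if the members of `group` occupy consecutive positions in `seq`."""
    positions = sorted(seq.index(m) for m in group)
    return positions[-1] - positions[0] + 1 == len(positions)


def partition_is_well_formed(PS, QS, RS,
                             layers: List[Set[int]],
                             drrs: List[Set[int]]) -> bool:
    # Constraint 1: each time layer is consecutive in PS, QS, and RS.
    layers_ok = all(is_contiguous(s, g) for g in layers for s in (PS, QS, RS))
    # Constraint 2: each DRR is consecutive in PS and QS.
    drrs_ok = all(is_contiguous(s, g) for g in drrs for s in (PS, QS))
    return layers_ok and drrs_ok


# Sequences of the example P-ST in Eq. (2), task m_i written as i.
PS = [1, 2, 9, 10, 8, 7, 6, 4, 3, 5]
QS = [8, 7, 2, 1, 9, 10, 3, 5, 6, 4]
RS = [1, 2, 3, 5, 6, 4, 7, 8, 9, 10]
layers = [{1, 2}, {9, 10}, {7, 8}, {6}, {4}, {3}, {5}]
drrs = [{7, 8}, {1, 2, 9, 10}, {3, 5}, {6, 4}]

valid = partition_is_well_formed(PS, QS, RS, layers, drrs)
# Splitting time layer {1, 2} in RS violates constraint 1.
broken = partition_is_well_formed(PS, QS, [1, 3, 2, 5, 6, 4, 7, 8, 9, 10],
                                  layers, drrs)
```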
The structure of P -ST is illustrated as follows:
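$$
\begin{aligned}
\big(\ &\langle \ldots \overbrace{[\ldots \underbrace{(\ldots m_p \ldots m_q \ldots)^i_{j_1}}_{tl^i_{j_1}} \ldots \underbrace{(\ldots m_r \ldots m_t \ldots)^i_{j_2}}_{tl^i_{j_2}} \ldots]^i}^{drr_i} \ldots \rangle, \\
&\langle \ldots \overbrace{[\ldots \underbrace{(\ldots m_q \ldots m_p \ldots)^i_{j_1}}_{tl^i_{j_1}} \ldots \underbrace{(\ldots m_r \ldots m_t \ldots)^i_{j_2}}_{tl^i_{j_2}} \ldots]^i}^{drr_i} \ldots \rangle, \\
&\langle \ldots \overbrace{\underbrace{(\ldots m_p \ldots m_q \ldots)^i_{j_1}}_{tl^i_{j_1}}}^{drr_i} \ldots \overbrace{\underbrace{(\ldots m_t \ldots m_r \ldots)^i_{j_2}}_{tl^i_{j_2}}}^{drr_i}, \ldots \rangle\ \big).
\end{aligned}
$$
In a P-ST, $(\cdot)^i_j$ denotes the sequence of tasks in the time
layer $tl^i_j$ and $[\cdot]^i$ is the sequence of tasks in the DRR $drr_i$.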
An HNSP (PS,QS) imposes the position relationship be-
tween each pair of task modules as follows.
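Definition 4. If $tl(m_i) = tl(m_j)$ or $drr(m_i) \ne drr(m_j)$, then
$(\langle \ldots m_i \ldots m_j \ldots \rangle, \langle \ldots m_i \ldots m_j \ldots \rangle) \rightarrow m_i$ is to the left of $m_j$;
$(\langle \ldots m_j \ldots m_i \ldots \rangle, \langle \ldots m_i \ldots m_j \ldots \rangle) \rightarrow m_i$ is below $m_j$.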
Notice that the relationship between the task modules from
different time layers in the same DRR is not defined, as there
are no non-overlapping constraints involved. Without loss of
generality, we require the task modules in the same time layer
to occur consecutively in PS and QS for clarity in representing
the partitions of time layers and the floorplan of time layers.
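Definition 4 can be sketched as a small lookup on the two sequences. The function below returns the relation of one module to another, or `None` when the pair lies in different time layers of the same DRR, where no relation is defined. The function name and the string labels are assumptions; the sequences and partitions are those of the ten-task example in Eq. (2).

```python
def relation(PS, QS, a, b, tl, drr):
    """Position of module a relative to module b implied by the HNSP (PS, QS)."""
    if not (tl[a] == tl[b] or drr[a] != drr[b]):
        return None  # different time layers of the same DRR: undefined
    a_first_in_ps = PS.index(a) < PS.index(b)
    a_first_in_qs = QS.index(a) < QS.index(b)
    if a_first_in_ps and a_first_in_qs:
        return "left"    # a precedes b in both sequences
    if not a_first_in_ps and not a_first_in_qs:
        return "right"   # b precedes a in both sequences
    if a_first_in_qs:
        return "below"   # a after b in PS, a before b in QS
    return "above"


PS = [1, 2, 9, 10, 8, 7, 6, 4, 3, 5]
QS = [8, 7, 2, 1, 9, 10, 3, 5, 6, 4]
tl = {1: "tl21", 2: "tl21", 9: "tl22", 10: "tl22", 7: "tl11", 8: "tl11",
      6: "tl41", 4: "tl42", 3: "tl31", 5: "tl32"}
drr = {1: 2, 2: 2, 9: 2, 10: 2, 7: 1, 8: 1, 6: 4, 4: 4, 3: 3, 5: 3}

r_2_1 = relation(PS, QS, 2, 1, tl, drr)   # same time layer
r_7_2 = relation(PS, QS, 7, 2, tl, drr)   # different DRRs
r_1_9 = relation(PS, QS, 1, 9, tl, drr)   # same DRR, different time layers
```

With these inputs, m2 is reported below m1 and m7 below m2, which agrees with the floorplan discussion of Fig. 2b.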
The configuration order of a time layer can be represented
by a configuration sequence RS, which is defined as follows:
Definition 5. Given an RS sequence, $\langle \ldots m_i \ldots m_j \ldots \rangle$, the
configuration constraints are defined as follows:
1) if $tl(m_i) = tl(m_j)$, then $m_i$ and $m_j$ are configured simultaneously,
along with the corresponding time layer;
2) if $tl(m_i) \ne tl(m_j)$, then $tl(m_i)$ is configured before $tl(m_j)$,
and the configuration order relationship is $CO[tl(m_i)] < CO[tl(m_j)]$.
In the RS, the ordering of task modules within a time layer
makes no sense because the time layer is configured as a whole.
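A minimal sketch of Definition 5: scanning RS from left to right and numbering each time layer at its first occurrence yields the configuration order CO. The function name and the string labels for time layers (for example `"tl21"` for $tl^2_1$) are assumptions; the RS sequence is the one of the ten-task example in Eq. (2).

```python
def configuration_order(RS, tl):
    """Map each time layer to its 1-based configuration order CO."""
    order = {}
    for m in RS:
        layer = tl[m]
        if layer not in order:          # first module of a new time layer
            order[layer] = len(order) + 1
    return order


tl = {1: "tl21", 2: "tl21", 9: "tl22", 10: "tl22", 7: "tl11", 8: "tl11",
      6: "tl41", 4: "tl42", 3: "tl31", 5: "tl32"}
RS = [1, 2, 3, 5, 6, 4, 7, 8, 9, 10]
CO = configuration_order(RS, tl)
```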
For example, a task graph TG with ten task modules is
shown in Fig. 1 and a P -ST in this example is given as follows:
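$$
\begin{aligned}
\big(\ &\langle [(1\ 2)^2_1\ (9\ 10)^2_2]^2\ [(8\ 7)^1_1]^1\ [(6)^4_1\ (4)^4_2]^4\ [(3)^3_1\ (5)^3_2]^3 \rangle, \\
&\langle [(8\ 7)^1_1]^1\ [(2\ 1)^2_1\ (9\ 10)^2_2]^2\ [(3)^3_1\ (5)^3_2]^3\ [(6)^4_1\ (4)^4_2]^4 \rangle, \\
&\langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6)^4_1\ (4)^4_2\ (7\ 8)^1_1\ (9\ 10)^2_2 \rangle\ \big).
\end{aligned}
\qquad (2)
$$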
For simplifying the notations, we use i to represent the task
module $m_i$ in the examples of P-ST. From the partitioned
sequence triple P-ST, we can obtain the corresponding
configuration order and floorplan on the FPGA as shown in Fig. 2.
[Figure omitted] Fig. 1: Task dependence graph TG with ten task modules
[Figure omitted] Fig. 2: From a P-ST to the configuration order and the
floorplan of DRRs and task modules in time layers. (a) Configuration order
of time layers. (b) Floorplan of DRRs and the tasks in the time layers.
According to Definition 5 and the given configuration sequence
$RS = \langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6)^4_1\ (4)^4_2\ (7\ 8)^1_1\ (9\ 10)^2_2 \rangle$, Fig. 2a
shows the configuration order of the time layers. First, the time
layer $tl^2_1$ from $drr_2$ is configured and, second, the time layer
$tl^3_1$ from $drr_3$ can be configured during the executions of $m_1$
and $m_2$. The computation of the beginning configuration times
of time layers will be discussed in Section III-C.
Considering the relationship between each pair of task mod-
ules defined in Definition 4, Fig. 2b shows the corresponding
floorplan of task modules, where m2 is below m1 because they
are in the same time layer (tl(m1) = tl(m2) = tl21), and m7 is
below m2 because they are in different DRRs (drr(m7) = drr1
and drr(m2) = drr2).
Knowing the dimensions of task modules, we can compute
the floorplan from the HNSP in O(nloglog(n)) time by
solving the longest weighted common subsequence of PS and
QS [22] hierarchically. We can compute the floorplan of task
modules within every DRR to obtain the occupied resource
arrays of DRRs, and then compute the floorplan of DRRs to
determine the total resource usage by regarding each DRR as
a whole. The computation of the schedule will be discussed in
the following subsections.
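To illustrate how a sequence pair determines coordinates, the sketch below evaluates a flat sequence pair with a straightforward O(n^2) longest-path scan. This is a simpler, swapped-in technique, not the O(n log log n) weighted common subsequence method of [22], and it handles one level only (for example the modules of one time layer); the hierarchical evaluation would apply it once per time layer and once more to the DRRs. The function name and the module sizes are assumptions.

```python
def pack(PS, QS, size):
    """Evaluate a flat sequence pair; size[m] = (width, height).

    Returns lower-left coordinates x, y and the bounding width and height.
    """
    pos_p = {m: k for k, m in enumerate(PS)}
    x, y = {}, {}
    for k, b in enumerate(QS):
        x[b] = y[b] = 0
        for a in QS[:k]:                      # a precedes b in QS
            if pos_p[a] < pos_p[b]:           # and in PS: a is left of b
                x[b] = max(x[b], x[a] + size[a][0])
            else:                             # after b in PS: a is below b
                y[b] = max(y[b], y[a] + size[a][1])
    width = max(x[m] + size[m][0] for m in QS)
    height = max(y[m] + size[m][1] for m in QS)
    return x, y, width, height


# Time layer tl^2_1 of the example: PS part (1 2), QS part (2 1),
# so m2 is below m1. The sizes are hypothetical.
size = {1: (3, 2), 2: (4, 1)}
x, y, width, height = pack([1, 2], [2, 1], size)
```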
B. Feasibility of Partition and Configuration Order
Owing to the dependencies between tasks, not all the parti-
tioned sequence triples P -ST are feasible. In this subsection,
we prove a sufficient and necessary condition for the feasibility
of partitions and configuration order of task modules.
1) Lifetime of Time Layers: The task modules in a time layer
can be executed only after the configuration of the time layer
and will be destroyed while configuring the next time layer (if
more time layers exist) in the same DRR. Consequently, we
have the following definition.
Definition 6. Given the spatial partition DRR, the temporal
partition TL, and the configuration order of the time layers, we
define the lifetime of a time layer $tl^i_j$, $LT[tl^i_j] = (lt\_s^i_j, lt\_e^i_j)$,
as follows. $\forall tl^i_j \in drr_i$,
$$
lt\_s^i_j = CO[tl^i_j], \quad
lt\_e^i_j =
\begin{cases}
CO[tl^i_{j+1}], & 1 \le j < l_i; \\
\infty, & j = l_i.
\end{cases}
\qquad (3)
$$
Note that the lifetime of a time layer is also the lifetime of
the task modules in the time layer.
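Eq. (3) translates into a short loop over the time layers of each DRR. The sketch below computes the lifetimes for the configuration order that this section uses in its Fig. 3 example; the function name, the string labels for time layers, and the use of `math.inf` for the open end are assumptions.

```python
import math


def lifetimes(CO, drr_layers):
    """Eq. (3): lifetime (lt_s, lt_e) of every time layer.

    drr_layers maps each DRR to its time layers in layer-index order.
    """
    LT = {}
    for layers in drr_layers.values():
        for cur, nxt in zip(layers, layers[1:] + [None]):
            end = CO[nxt] if nxt is not None else math.inf
            LT[cur] = (CO[cur], end)
    return LT


# Configuration order of the Fig. 3 example.
CO = {"tl21": 1, "tl41": 2, "tl31": 3, "tl42": 4,
      "tl32": 5, "tl11": 6, "tl22": 7}
drr_layers = {1: ["tl11"], 2: ["tl21", "tl22"],
              3: ["tl31", "tl32"], 4: ["tl41", "tl42"]}
LT = lifetimes(CO, drr_layers)
```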
To discuss the feasibility of a configuration order, we define
the dependencies between time layers based on the dependency
graph of tasks given DRR and TL. A dependence graph
LTG(VLTG, ELTG) is constructed as follows.
$V_{LTG} = TL$; $E_{LTG} = \{(tl^{i_1}_{j_1}, tl^{i_2}_{j_2}) \mid$ there exist $m_{k_1} \in tl^{i_1}_{j_1}$ and
$m_{k_2} \in tl^{i_2}_{j_2}$ such that $(m_{k_1}, m_{k_2}) \in E_{TG'}\}$.
Note that $E_{TG'}$ is the edge set of the transitive closure
of TG.
2) Dependencies Between Time Layers: Given a configura-
tion order, the dependencies between time layers fall into two
groups: forward dependencies and backward dependencies.
Definition 7. A dependence $(tl^{i_1}_{j_1}, tl^{i_2}_{j_2}) \in E_{LTG}$ is forward if
$CO[tl^{i_1}_{j_1}] < CO[tl^{i_2}_{j_2}]$, which indicates that the output of a task
module in a time layer, $tl^{i_1}_{j_1}$, is the input to a task module from
a future time layer, $tl^{i_2}_{j_2}$.
Forward dependencies are always feasible because even if
the lifetime of a time layer ends, the computed data can be
stored and used in the future.
Definition 8. A dependence $(tl^{i_1}_{j_1}, tl^{i_2}_{j_2}) \in E_{LTG}$ is backward if
$CO[tl^{i_1}_{j_1}] > CO[tl^{i_2}_{j_2}]$, which indicates that the output of a task
module in a time layer $tl^{i_1}_{j_1}$ is the input to a task module from
an earlier configured time layer, $tl^{i_2}_{j_2}$.
However, backward dependencies are infeasible if there is no
overlapping between the lifetimes of the dependent time layers,
$tl^{i_1}_{j_1}$ and $tl^{i_2}_{j_2}$, that is, $lt\_e^{i_2}_{j_2} < lt\_s^{i_1}_{j_1}$. In this situation, $tl^{i_2}_{j_2}$ is
destroyed (replaced by a new time layer) before the time layer
$tl^{i_1}_{j_1}$ is configured, so the input to a task module is generated
after the task module has been destroyed.
Fig. 3 shows examples of lifetimes of time layers and the
dependencies between time layers. The spatial partition DRR
and the temporal partition TL of the tasks are shown in Fig.
2, and the dependencies between tasks are shown in Fig.
1. The configuration order of the time layers is as follows:
$RS = \langle (1\ 2)^2_1\ (6)^4_1\ (3)^3_1\ (4)^4_2\ (5)^3_2\ (7\ 8)^1_1\ (9\ 10)^2_2 \rangle$ (also shown as
the x-axis in Fig. 3).
The time layers $tl^4_1$ and $tl^3_2$ have a backward dependence
because $m_6$ needs the data from $m_5$, as shown in Fig. 1, and
their lifetimes $LT[tl^4_1] = (2, 4)$ and $LT[tl^3_2] = (5, \infty)$ are
non-overlapped. That is, $m_6$ in $tl^4_1$ is destroyed ($tl^4_2$ in the
same DRR $drr_4$ has occupied the hardware resource) before
the execution of $m_5$ in $tl^3_2$. Consequently, $m_6$ will never receive
the data from $m_5$, so the configuration order of task modules
shown in Fig. 3 is infeasible.
[Figure omitted] Fig. 3: Lifetimes of time layers and an infeasible backward
dependence. The x-axis is the configuration order $CO[tl^i_j]$; the legend
distinguishes forward dependencies, feasible backward dependencies, and
illegal backward dependencies.
3) Condition of Feasibility: We say that a given
spatial partition, temporal partition, and configuration order are
feasible if a schedule of executions and configurations of task
modules can be computed without consideration of resource
constraints. We have the following theorem:
Theorem 1. The given spatial partition, temporal partition,
and configuration order is feasible if and only if there are
no backward dependencies between time layers that have no
lifetime overlap.
Proof. Given a partition, a configuration order, and the
task dependency graph, we can construct a reconfiguration
constraint graph (RCG) for scheduling the configurations of
the time layers and the executions of the task modules, i.e., the
computation of bti, bci, and bctl(mi) defined in Section II-B.
$RCG(V_{RCG}, E_{RCG})$ is constructed by adding to the task graph
$TG = (V_{TG}, E_{TG})$ the vertex set $V_{LTG}$ and three edge sets representing
the scheduling constraints. $V_{RCG} = V_{TG} \cup V_{LTG}$, where $V_{LTG}$
represents the time layers and is defined in Section III-B. $E_{RCG} =
E_{TG} \cup E_{cr} \cup E_{ce} \cup E_{ec}$, where $E_{cr}$, $E_{ce}$, and $E_{ec}$ are defined as
follows.
1) $E_{cr}$ is the set of edges that represents the configuration order.
$E_{cr} = \{(tl^{i_1}_{j_1}, tl^{i_2}_{j_2}) \mid CO[tl^{i_1}_{j_1}] < CO[tl^{i_2}_{j_2}]\}$.
2) $E_{ce}$ is the set of edges indicating that a task $m_k$ can be executed
only after the configuration of the time layer where
$m_k$ is located. $E_{ce} = \{(tl^i_j, m_k) \mid tl^i_j \in V_{LTG},\ m_k \in V_{TG},$
and $tl(m_k) = tl^i_j\}$.
3) $E_{ec}$ is the set of edges indicating that, in a DRR, a time layer
must be configured after the execution of all the tasks
in the previous time layer because they share the same
hardware resources. $E_{ec} = \{(m_k, tl^i_j) \mid tl^i_j \in V_{LTG},\ m_k \in
V_{TG},$ and $tl(m_k)$ is the time layer before $tl^i_j$ in $drr_i\}$.
A schedule can be computed only if the RCG is acyclic
because a cycle produces a conflict in the constraints.
IF. Here we show that if there are backward dependencies
between the time layers that have no lifetime overlap, there
will be a cycle in the RCG and hence the given partition and
configuration order is infeasible.
A pair of time layers, $tl^{i_1}_{j_1}$ and $tl^{i_2}_{j_2}$ with $CO[tl^{i_1}_{j_1}] > CO[tl^{i_2}_{j_2}]$,
have a backward dependence if there are two task modules,
$m_{k_1}$ and $m_{k_2}$, respectively from $tl^{i_1}_{j_1}$ and $tl^{i_2}_{j_2}$, and there is a
direct or indirect data dependence between them ($(m_{k_1}, m_{k_2}) \in
E_{TG'}$). Fig. 4a shows an illustration of this, where a dashed
arrow represents an edge or a path and solid arrows represent
edges. When there is no overlap between the lifetimes of $tl^{i_1}_{j_1}$
and $tl^{i_2}_{j_2}$, the hardware resources occupied by the time layer
$tl^{i_2}_{j_2}$ must be reconfigured to be the next time layer in the same
DRR, $tl^{i_2}_{j_2+1}$, before the configuration of $tl^{i_1}_{j_1}$, and there must be
an edge from $m_{k_2}$ to $tl^{i_2}_{j_2+1}$ (shown as a bold dashed arrow in
Fig. 4a) because a time layer can only be configured after the
execution of all tasks in the earlier time layers in the same DRR.
Accordingly, a cycle is formed, which indicates the conflict of
constraints.
[Figure omitted] Fig. 4: A cycle in RCG and the backward dependences.
(a) Cycle formed by a backward dependence between time layers.
(b) The cycle causes an illegal backward dependence.
ONLY IF. Here, we show that if the given partition and
configuration order is infeasible, there must be backward depen-
dencies between the time layers that have no lifetime overlap.
If the given partition and configuration order is infeasible,
there must be a cycle in RCG. Notice that VRCG = VTG ∪
VLTG. The subgraph induced by VLTG, which includes the
edge set Ecr representing the configuration order of the time
layers, is acyclic. The subgraph induced by VTG (exactly TG),
which includes the edge set ETG representing the dependences
between tasks, is also acyclic. Moreover, all the edges in Ece
are from VLTG to VTG, which represents that a task must be
configured before it is executed, and all the edges in Eec are
from VTG to VLTG, which represents that a time layer can only
be configured after the execution of all the tasks in the earlier
time layers. Consequently, the cycle must include four parts: 1)
a path (one or more edges) from Ecr, 2) a path (one or more
edges) from ETG, 3) an edge from Ece, and 4) an edge from
Eec.
Without loss of generality, we assume that the cycle includes
a path from $tl^{i_1}_{j_1}$ to $tl^{i_2}_{j_2}$ and a path from $m_{k_1}$ to $m_{k_2}$,
respectively constructed by the edges from $E_{cr}$ and $E_{TG}$. The cycle
must also include two edges: $(m_{k_2}, tl^{i_1}_{j_1})$ and $(tl^{i_2}_{j_2}, m_{k_1})$. Fig. 4b
shows an illustration of this. According to the definition of $E_{ce}$,
$m_{k_1}$ is in $tl^{i_2}_{j_2}$ because we have the edge $(tl^{i_2}_{j_2}, m_{k_1})$. On the
other hand, the edge $(m_{k_2}, tl^{i_1}_{j_1})$ indicates that $tl^{i_1}_{j_1}$ is configured
after $m_{k_2}$ is executed, which means that $m_{k_2}$ must be located
in the previous time layer of $tl^{i_1}_{j_1}$ in $drr_{i_1}$, $tl^{i_1}_{j_1-1}$. We can see
that $tl^{i_1}_{j_1-1}$ and $tl^{i_2}_{j_2}$ have a backward data dependence and their
lifetimes are non-overlapping, as the region occupied by $tl^{i_1}_{j_1-1}$
has been reconfigured to be $tl^{i_1}_{j_1}$ before $tl^{i_2}_{j_2}$ is configured.
Proof END.
Note that if RS represents a topological ordering of TG,
the partition and configuration order will always be feasible
because there are no backward dependencies involved.
Corollary 1. Given a partition, a configuration order, and
the task dependency graph, the RCG is acyclic if there is
always lifetime overlap between time layers that have backward
dependencies.
Fig. 5a shows the RCG of the feasible P-ST in Formula
(2), where $RS = \langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6)^4_1\ (4)^4_2\ (7\ 8)^1_1\ (9\ 10)^2_2 \rangle$.
If RS is changed to the configuration order in Fig. 3,
$\langle (1\ 2)^2_1\ (6)^4_1\ (3)^3_1\ (4)^4_2\ (5)^3_2\ (7\ 8)^1_1\ (9\ 10)^2_2 \rangle$, the corresponding
RCG is shown in Fig. 5b, where a cycle $tl^4_2 \rightarrow tl^3_2 \rightarrow m_5 \rightarrow
m_6 \rightarrow tl^4_2$ is formed and no feasible schedule can be found.
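The RCG construction and the acyclicity test can be sketched as follows. Several points are assumptions rather than the paper's implementation: the function names, the string labels, the use of Kahn's algorithm for cycle detection, and the choice to add $E_{cr}$ edges only between consecutively configured time layers (the remaining $E_{cr}$ edges are implied transitively and do not affect acyclicity). The full edge set of Fig. 1 is not reproduced in the text, so the task graph below contains only the single dependence stated there, $m_5 \rightarrow m_6$; the acyclic result for the first order therefore covers this one edge only.

```python
from collections import defaultdict, deque


def build_rcg(tg_edges, drr_layers, layer_tasks, CO):
    """Build the RCG as an adjacency map over task and time-layer vertices."""
    edges = defaultdict(set)
    for a, b in tg_edges:                        # E_TG
        edges[a].add(b)
    ordered = sorted(CO, key=CO.get)
    for a, b in zip(ordered, ordered[1:]):       # E_cr (consecutive pairs)
        edges[a].add(b)
    for layer, members in layer_tasks.items():   # E_ce
        for m in members:
            edges[layer].add(m)
    for layers in drr_layers.values():           # E_ec
        for prev, nxt in zip(layers, layers[1:]):
            for m in layer_tasks[prev]:
                edges[m].add(nxt)
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: True if every vertex can be topologically ordered."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    indeg = {v: 0 for v in nodes}
    for vs in edges.values():
        for v in vs:
            indeg[v] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in edges.get(u, ()):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(nodes)


layer_tasks = {"tl11": ["m7", "m8"], "tl21": ["m1", "m2"],
               "tl22": ["m9", "m10"], "tl31": ["m3"], "tl32": ["m5"],
               "tl41": ["m6"], "tl42": ["m4"]}
drr_layers = {1: ["tl11"], 2: ["tl21", "tl22"],
              3: ["tl31", "tl32"], 4: ["tl41", "tl42"]}
tg_edges = [("m5", "m6")]   # only the dependence stated in the text

co_eq2 = {"tl21": 1, "tl31": 2, "tl32": 3, "tl41": 4,
          "tl42": 5, "tl11": 6, "tl22": 7}
co_fig3 = {"tl21": 1, "tl41": 2, "tl31": 3, "tl42": 4,
           "tl32": 5, "tl11": 6, "tl22": 7}
acyclic_eq2 = is_acyclic(build_rcg(tg_edges, drr_layers, layer_tasks, co_eq2))
acyclic_fig3 = is_acyclic(build_rcg(tg_edges, drr_layers, layer_tasks, co_fig3))
```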
[Figure omitted] Fig. 5: RCG examples. The legend marks the vertex sets
$V_{TG}$ and $V_{LTG}$ and the edge sets $E_{cr}$, $E_{ce}$, $E_{ec}$, and $E_{TG}$.
(a) An example of RCG under feasible partition and configuration order.
(b) An example of RCG with a cycle (no feasible schedule).
C. Computation of the Schedule
The schedule can be computed by finding the longest paths
on the RCG, with vertices weighted as follows.
$$\forall tl^i_j \in V_{LTG},\ wt(tl^i_j) = c_{tl^i_j}; \quad \forall m_i \in V_{TG},\ wt(m_i) = t_i. \qquad (4)$$
Let s be the vertex corresponding to the time layer that is
configured first (having zero in-degree in RCG), and $lp(v_i)$
denote the vertex-weighted longest path from s to a vertex $v_i$.
The schedule ($bc_{tl^i_j}$ and $bt_k$) can be determined by computing
$lp(tl^i_j)$ and $lp(m_k)$, respectively.
The schedule length T of the partially dynamically recon-
figurable system is the maximum of the paths, and can be
calculated as follows:
$$T = \max_{1 \le k \le n} (lp(m_k) + t_k).$$
Given a feasible P-ST, we can construct RCG and compute
the schedule in $O(n^2)$ time if the RCG is acyclic, where n is
the number of task modules.
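The longest-path computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `longest_paths` and `schedule_length`, the vertex labels, and the toy weights are assumptions made for this example, and every zero in-degree vertex is treated as a source with lp = 0.

```python
from collections import defaultdict, deque

def longest_paths(weights, edges):
    """Vertex-weighted longest path lp(v): the largest total weight of the
    vertices that precede v on any path ending at v (v itself excluded)."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in weights}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    lp = {v: 0 for v in weights}
    queue = deque(v for v in weights if indeg[v] == 0)
    visited = 0
    while queue:                      # Kahn-style topological sweep
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            lp[v] = max(lp[v], lp[u] + weights[u])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != len(weights):
        raise ValueError("RCG contains a cycle: no feasible schedule")
    return lp

def schedule_length(weights, edges, tasks):
    """T = max over task modules of lp(m_k) + t_k."""
    lp = longest_paths(weights, edges)
    return max(lp[m] + weights[m] for m in tasks)

# Toy RCG (hypothetical): two time layers with configuration spans 2 and 3,
# two tasks with execution times 5 and 4.
weights = {"tl_a": 2, "tl_b": 3, "m1": 5, "m2": 4}
edges = [("tl_a", "tl_b"),   # configuration order between time layers
         ("tl_a", "m1"),     # a task runs after its layer is configured
         ("tl_b", "m2"),
         ("m1", "m2")]       # task dependency
print(schedule_length(weights, edges, ["m1", "m2"]))  # 11
```

In this toy instance the begin time of m2 is lp(m2) = 7, because the dependency on m1 (2 + 5) dominates the configuration chain (2 + 3), and the schedule length is 7 + 4 = 11. A cyclic graph raises an error, which mirrors the statement that no feasible schedule exists when the RCG has a cycle.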
IV. OPTIMIZATION FRAMEWORK
Definition 9. An insertion point in the partitioned se-
quence triple P -ST (PS,QS,RS) is defined as a four-tuple,
(p, q, r, tlij), where p, q, and r are the positions immediately
after the p-th task module in PS, the q-th task module in QS,
and the r-th task module in RS, respectively, and tlij is the
j-th time layer in drri. p = 0 (or q = 0 or r = 0) indicates
the position before the first task module of the sequence.
In Section IV-B2, we will discuss the feasibility and types
of insertion points in detail.
A. Overall Design Flow
In this work, we modify the perturbation method, Insertion-
after-Remove (IAR) in [23], to explore the design space of
the schedule and floorplan in a simulated annealing-based
search. With the IAR operation, we can perturb the partitioning,
scheduling, and floorplanning of task modules simultaneously.
The detailed steps are as follows:
a. Select and remove a task module mk randomly and
then compute the floorplan and schedule of task modules
without the removed task module mk;
b. Select a fixed number of feasible candidate insertion
points, SCIP = {(p, q, r, tlij)}, for mk by rough eval-
uations of all the feasible insertion points;
c. Choose the best insertion point from SCIP for the
removed task module mk by accurate evaluations.
In step b, the feasible insertion points are evaluated by the
linear combination of resource costs, schedule length, and com-
munication costs. In this step, the resource costs are calculated
accurately. To reduce the time complexity, the communication
cost CC is calculated roughly without updating the floorplan
and schedule of task modules, and the schedule length is
roughly evaluated by under-estimating the configuration spans
of time layers using Formula 1. In step c, all the insertion
points in SCIP will be evaluated accurately based on the entire
floorplan considering the communication costs, and the best one
will be chosen as the candidate insertion point. The feasibility
of insertion points will be discussed in Subsection IV-B.
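Steps b and c amount to a two-stage filter: a cheap estimate ranks all feasible insertion points, and only a fixed number of survivors are evaluated accurately. The sketch below illustrates that pattern only; `choose_insertion_point`, `rough_cost`, and `accurate_cost` are hypothetical names, and the integer "insertion points" and cost functions in the usage lines are placeholders rather than the paper's evaluations.

```python
import heapq

def choose_insertion_point(points, rough_cost, accurate_cost, k=15):
    """Step b: keep the k cheapest points under the rough estimate.
    Step c: re-evaluate only those k accurately and return the best one."""
    candidates = heapq.nsmallest(k, points, key=rough_cost)
    if not candidates:
        return None                      # no feasible insertion point
    return min(candidates, key=accurate_cost)

# Placeholder costs: the rough estimate is slightly off from the accurate one.
rough = lambda p: abs(p - 40)
accurate = lambda p: abs(p - 43)
print(choose_insertion_point(range(100), rough, accurate, k=15))  # 43
print(choose_insertion_point(range(100), rough, accurate, k=3))   # 41
```

With k = 15 the accurate optimum (43) survives the rough filter; with k = 3 it is filtered out and the best remaining candidate (41) is returned, which shows how the candidate set size trades evaluation time against solution quality.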
In the experiments, we set |SCIP| to 15. The
objective function Cost is defined as the linear combination of
the area cost (AC), which depends on the dimensions of all
occupied resources (Col×Row), the schedule length (T ), and
the communication costs (CC):
Cost = α×AC + β × T + γ × CC. (5)
α, β, and γ are balance factors for making a trade-off
between the schedule length and the communication costs. T is
calculated by under-estimating the configuration spans of time
layers using Formula (1). The evaluation of T , CC, and AC
will be discussed in Subsection IV-C.
B. Feasible insertion points in P -ST
Generally, given a P-ST of n − 1 task modules, there are
a total of $O(n^3)$ insertion points for inserting a task module.
However, when considering Theorem 1 and the definition of
P-ST, some insertion points are infeasible. Here we discuss
the feasibility of insertion points in P-STs.
1) Lifetime overlap constraint: First, inserting $m_k$ could
introduce new backward dependencies between time layers.
To ensure the lifetime overlap between the backward-dependent
time layers, we have the following corollary from Theorem 1.
Corollary 2. The lifetime of a time layer $tl^i_j$, where $m_k$ is
inserted, must satisfy the following condition.
$$lt\_s^{i}_{j} \le \min_{tl^{i_1}_{j_1} :\ \exists m_{k_1} \in tl^{i_1}_{j_1} \wedge (m_k, m_{k_1}) \in E_{TG'}} \{ lt\_e^{i_1}_{j_1} \};$$
$$lt\_e^{i}_{j} \ge \max_{tl^{i_1}_{j_1} :\ \exists m_{k_1} \in tl^{i_1}_{j_1} \wedge (m_{k_1}, m_k) \in E_{TG'}} \{ lt\_s^{i_1}_{j_1} \}. \qquad (6)$$
Second, the lifetime of a time layer is changed when a new
time layer is inserted into an existing DRR. To ensure the
lifetime overlap between the time layers that have backward
dependences, we have the following corollary from Theorem
1.
Corollary 3. Given a partition and configuration order, the
lifetime of a time layer $tl^i_j$ must satisfy the following minimum
lifetime constraint, denoted as $LT_{min}[tl^i_j] = (lt\_ls^{i}_{j}, lt\_ee^{i}_{j})$.
$$lt\_ls^{i}_{j} = lt\_s^{i}_{j}.$$
$$lt\_ee^{i}_{j} = \begin{cases} \max_{tl^{i'}_{j'} :\ (tl^{i'}_{j'}, tl^{i}_{j}) \in E_{LTG} \wedge lt\_s^{i'}_{j'} > lt\_s^{i}_{j}} \{ lt\_s^{i'}_{j'} \} + 1, & \text{if there is a backward dependence between } tl^i_j \text{ and some future time layer;} \\ lt\_s^{i}_{j} + 1, & \text{otherwise.} \end{cases} \qquad (7)$$
The minimum lifetime constraints ensure a lifetime overlap
between any two time layers that have a backward dependence.
This constraint cannot be violated after mk is inserted back into
the P -ST . In Section IV-B3, an example is provided.
2) Feasibility of insertion points: Let ps[i], qs[i], and rs[i],
1 ≤ i < n represent the i-th task in PS, QS, and RS,
respectively, with mk removed. For each possible insertion
point (p, q, r, tlij), there exist three possible types of optional
partitions to re-insert mk depending on the time layer tlij .
Type-1: Create a new time layer in a new DRR, $tl^{new}_{new}$, for
$m_k$. In this case, (p, q, r) must be located within the boundary
of task sequences corresponding to different DRRs in P-ST,
i.e.,
$$drr(ps[p]) \ne drr(ps[p+1]),\ drr(qs[q]) \ne drr(qs[q+1]),\ \text{and}\ tl(rs[r]) \ne tl(rs[r+1]). \qquad (8)$$
Without loss of generality, we assume that ps[0] = ps[n] =
−1, and that drr(−1) and tl(−1) correspond to a virtual DRR
and to virtual time layers, respectively. qs[0], qs[n], rs[0], and
rs[n] are dealt with similarly.
This type of insertion point will not change the lifetimes of
any other time layers according to Definition 3. Consequently, if
the constraint (6) in Corollary 2 is satisfied, then $(p, q, r, tl^{new}_{new})$
is feasible. Note that the newly generated time layer $tl^{new}_{new}$ is
configured between tl(rs[r]) and tl(rs[r + 1]).
Type-2: Create a new time layer, $tl^i_{new}$, in an existing DRR,
$drr_i$, for $m_k$. In this case, the insertion point $(p, q, r, tl^i_{new})$
must be located within the boundary of task sequences corre-
sponding to different time layers, i.e., there is a combination
$(p', q', r') \in \{p, p+1\} \times \{q, q+1\} \times \{r, r+1\}$ such that
$$drr(ps[p']) = drr(qs[q']) = drr(rs[r']) \ne drr(-1),\ \text{and}$$
$$tl(ps[p]) \ne tl(ps[p+1]),\ tl(qs[q]) \ne tl(qs[q+1]),\ \text{and}\ tl(rs[r]) \ne tl(rs[r+1]).$$
This type of insertion point will change the lifetime of
the time layer that is immediately before $tl^i_{new}$ in $drr_i$. An
insertion point $(p, q, r, tl^i_{new})$ is feasible if the constraints in
both Corollary 2 and Corollary 3 are satisfied. Note that the
newly generated time layer $tl^i_{new}$ is configured between tl(rs[r])
and tl(rs[r + 1]).
Type-3: Insert $m_k$ into an existing time layer, $tl^i_j$. In this
case, an insertion point $(p, q, r, tl^i_j)$ must satisfy the condition
that there is a combination $(p', q', r') \in \{p, p+1\} \times \{q, q+1\} \times \{r, r+1\}$
such that $tl(ps[p']) = tl(qs[q']) = tl(rs[r']) \ne tl(-1)$.
This type of insertion point will not change the lifetime of
any other time layers. (p, q, r, tlij) is feasible if the constraint
(6) in Corollary 2 is satisfied.
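3) An example: Given the task dependencies shown in Fig. 1,
we have the following P-ST with $m_8$ removed:
$$\begin{array}{l}
(\langle [(1\ 2)^2_1\ (9\ 10)^2_2]_2\ [(7)^1_1]_1\ [(6)^4_1\ (4)^4_2]_4\ [(3)^3_1\ (5)^3_2]_3 \rangle, \\
\ \langle [(7)^1_1]_1\ [(2\ 1)^2_1\ (9\ 10)^2_2]_2\ [(3)^3_1\ (5)^3_2]_3\ [(6)^4_1\ (4)^4_2]_4 \rangle, \\
\ \langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6)^4_1\ (4)^4_2\ (7)^1_1\ (9\ 10)^2_2 \rangle).
\end{array} \qquad (9)$$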
Fig. 6 (black edges) shows the lifetime of time layers with
the task module $m_8$ removed.
According to Corollary 2, for the lifetime, $(lt\_s^{i}_{j}, lt\_e^{i}_{j})$, of
a time layer $tl^i_j$, where the removed task module $m_8$ will be
inserted, we have the following basic constraints.
(i) Because $(m_8, m_4)$ and $(m_8, m_{10})$ are in $E_{TG'}$, and $m_4$
is in $tl^4_2$ and $m_{10}$ is in $tl^2_2$, $lt\_s^{i}_{j}$ has to satisfy:
$lt\_s^{i}_{j} \le \min\{lt\_e^{4}_{2}, lt\_e^{2}_{2}\}$.
As shown in Fig. 6, $LT[tl^4_2] = (5, \infty)$ and $LT[tl^2_2] = (7, \infty)$;
thus, $lt\_s^{i}_{j} \le \min\{\infty, \infty\} = \infty$, which means
that there is no constraint on the beginning of the lifetime.
(ii) Because $(m_5, m_8)$ and $(m_7, m_8)$ are in $E_{TG'}$, and $m_5$
is in $tl^3_2$ and $m_7$ is in $tl^1_1$, $lt\_e^{i}_{j}$ has to satisfy:
$lt\_e^{i}_{j} \ge \max\{lt\_s^{3}_{2}, lt\_s^{1}_{1}\}$.
As shown in Fig. 6, $LT[tl^3_2] = (3, \infty)$ and $LT[tl^1_1] = (6, \infty)$;
thus, $lt\_e^{i}_{j} \ge \max\{3, 6\} = 6$.
Consequently, the task module $m_8$ must be inserted into a
time layer whose lifetime ends no earlier than 6.
Here, we assume that $m_8$ is inserted back into an insertion
point, (6th, 8th, 5th, $tl^i_j$), of the P-ST (PS, QS, RS) shown in
Formula (9). For this insertion point, there are
several optional partitions for $m_8$ considering different $tl^i_j$.
For a Type-1 partition, because the task modules ps[6] = $m_6$
and ps[7] = $m_4$ belong to the same DRR $drr_4$, i.e., drr[$m_6$] =
drr[$m_4$], we cannot create a new DRR for $m_8$.
For a Type-2 partition, a new time layer $tl^4_{new}$ is created for
$m_8$ in the existing DRR $drr_4$ as follows.
$$\begin{array}{l}
(\langle [(1\ 2)^2_1\ (9\ 10)^2_2]_2\ [(7)^1_1]_1\ [(6)^4_1\ (8)^4_{new}\ (4)^4_2]_4\ [(3)^3_1\ (5)^3_2]_3 \rangle, \\
\ \langle [(7)^1_1]_1\ [(2\ 1)^2_1\ (9\ 10)^2_2]_2\ [(3)^3_1\ (5)^3_2]_3\ [(6)^4_1\ (8)^4_{new}\ (4)^4_2]_4 \rangle, \\
\ \langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6)^4_1\ (8)^4_{new}\ (4)^4_2\ (7)^1_1\ (9\ 10)^2_2 \rangle).
\end{array}$$
Without loss of generality, we set the configuration order
$CO[tl^4_{new}]$ of time layer $tl^4_{new}$ to 4.5 in this situation, and the
lifetime of $tl^4_1$ will be changed to (4, 4.5). Fig. 6 (black edges
and red edges) shows the lifetime of the time layers. As the
lifetime of the inserted time layer $tl^4_{new}$ is $LT[tl^4_{new}] = (4.5, 5)$,
where the end of the lifetime is $lt\_e^{4}_{new} = 5 < 6$, the insertion
point (6th, 8th, 5th, $tl^4_{new}$) is infeasible.
For a Type-3 partition, the removed task module $m_8$ is
inserted into the existing time layer $tl^4_1$ or $tl^4_2$.
For $tl^i_j = tl^4_1$, the P-ST is as follows.
$$\begin{array}{l}
(\langle [(1\ 2)^2_1\ (9\ 10)^2_2]_2\ [(7)^1_1]_1\ [(6\ 8)^4_1\ (4)^4_2]_4\ [(3)^3_1\ (5)^3_2]_3 \rangle, \\
\ \langle [(7)^1_1]_1\ [(2\ 1)^2_1\ (9\ 10)^2_2]_2\ [(3)^3_1\ (5)^3_2]_3\ [(6\ 8)^4_1\ (4)^4_2]_4 \rangle, \\
\ \langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6\ 8)^4_1\ (4)^4_2\ (7)^1_1\ (9\ 10)^2_2 \rangle).
\end{array}$$
In this situation, however, as shown in Fig. 6, the lifetime
of the inserted time layer $tl^4_1$ is $LT[tl^4_1] = (4, 5)$, where the
end of the lifetime is $lt\_e^{4}_{1} = 5 < 6$; thus, the insertion point
(6th, 8th, 5th, $tl^4_1$) is infeasible.
For $tl^i_j = tl^4_2$, the P-ST is as follows.
$$\begin{array}{l}
(\langle [(1\ 2)^2_1\ (9\ 10)^2_2]_2\ [(7)^1_1]_1\ [(6)^4_1\ (8\ 4)^4_2]_4\ [(3)^3_1\ (5)^3_2]_3 \rangle, \\
\ \langle [(7)^1_1]_1\ [(2\ 1)^2_1\ (9\ 10)^2_2]_2\ [(3)^3_1\ (5)^3_2]_3\ [(6)^4_1\ (8\ 4)^4_2]_4 \rangle, \\
\ \langle (1\ 2)^2_1\ (3)^3_1\ (5)^3_2\ (6)^4_1\ (8\ 4)^4_2\ (7)^1_1\ (9\ 10)^2_2 \rangle).
\end{array}$$
In this situation, as shown in Fig. 6, the lifetime of the
inserted time layer $tl^4_2$ is $LT[tl^4_2] = (5, \infty)$, where the end
of the lifetime is $lt\_e^{4}_{2} = \infty \ge 6$; thus, the insertion point
(6th, 8th, 5th, $tl^4_2$) is feasible.
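The three feasibility verdicts of this example can be reproduced with a short check of the Corollary 2 bounds. This is a sketch under stated assumptions: the helper names `lifetime_bounds` and `is_feasible` and the `(DRR index, layer index)` keys are illustrative choices, while the lifetime values are the ones quoted from Fig. 6 in the text.

```python
import math

INF = math.inf

def lifetime_bounds(succ_layers, pred_layers, lifetime):
    """Corollary 2 bounds for the time layer that receives m_k.
    succ_layers / pred_layers: layers holding the successors / predecessors
    of m_k. Returns (largest allowed start, smallest allowed end)."""
    max_start = min((lifetime[l][1] for l in succ_layers), default=INF)
    min_end = max((lifetime[l][0] for l in pred_layers), default=-INF)
    return max_start, min_end

def is_feasible(layer_lifetime, bounds):
    start, end = layer_lifetime
    max_start, min_end = bounds
    return start <= max_start and end >= min_end

# Lifetimes with m8 removed; keys are (DRR index, layer index).
LT = {(2, 1): (1, 7), (3, 1): (2, 3), (3, 2): (3, INF), (4, 1): (4, 5),
      (4, 2): (5, INF), (1, 1): (6, INF), (2, 2): (7, INF)}

bounds = lifetime_bounds(succ_layers=[(4, 2), (2, 2)],   # layers of m4, m10
                         pred_layers=[(3, 2), (1, 1)],   # layers of m5, m7
                         lifetime=LT)
print(bounds)                           # (inf, 6)
print(is_feasible((4.5, 5), bounds))    # new layer in drr_4: False
print(is_feasible(LT[(4, 1)], bounds))  # existing layer (4, 1): False
print(is_feasible(LT[(4, 2)], bounds))  # existing layer (4, 2): True
```

Only the layer whose lifetime ends at or after 6 passes the check, which matches the conclusion of the example.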
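[Figure 6: time layers listed by configuration order CO with $m_8$ removed: $tl^2_1(m_1, m_2)$ = 1, $tl^3_1(m_3)$ = 2, $tl^3_2(m_5)$ = 3, $tl^4_1(m_6)$ = 4, $tl^4_2(m_4)$ = 5, $tl^1_1(m_7)$ = 6, $tl^2_2(m_9, m_{10})$ = 7. Lifetimes: $LT[tl^2_1] = (1, 7)$, $LT[tl^3_1] = (2, 3)$, $LT[tl^3_2] = (3, \infty)$, $LT[tl^4_1] = (4, 5)$, changed to (4, 4.5) when $tl^4_{new}$ is inserted, $LT[tl^4_2] = (5, \infty)$, $LT[tl^1_1] = (6, \infty)$, $LT[tl^2_2] = (7, \infty)$, $LT[tl^4_{new}] = (4.5, 5)$. Placing $m_8$ in $tl^4_1$ or $tl^4_{new}$ gives an infeasible backward dependency; placing it in $tl^4_2$ gives a feasible backward dependency.]
Fig. 6: Example of feasible positions for inserting m8.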
4) Discussion of the Reachability of the Solution Space :
Theorem 2. Any two feasible solutions of n task modules,
represented by P -ST s, are reachable to each other through
at most 2n feasible solutions generated by iteratively removing
and re-inserting a task module.
Proof. As discussed in Section III-B3, if the RS is a
topological order of the task dependence graph, the P -
ST will be feasible. From any feasible solution P ′-ST ′:
(PS′, QS′, RS′), we are able to reach another feasible solution
P -ST : (PS,QS,RS), where RS is a topological order, by
iteratively removing and re-inserting some task module. If we
select the task modules for removing and inserting back in
the order of RS′, no backward dependencies between time
layers will be introduced because all the time-layer dependen-
cies introduced by the moved task module are forward ones
(per Definition 7). Consequently, all the intermediate solutions
generated from P -ST , assumed to be P -ST1, P -ST2, · · · , P -
STk (k ≤ n), will always be feasible according to Theorem
1, and from one solution P -ST ′ (PS′, QS′, RS′) we can
reach any other solution P -ST (PS, QS, RS) where RS is a
topological order of TG in at most k (k ≤ n) steps: P -ST ′,
P -ST1, P -ST2 · · · P -STk = P -ST .
On the other hand, we can reach a generic feasible solution
P -ST ′′ (PS′′, QS′′, RS′′) with no constraints on RS′′ from
any solution P -ST (PS, QS, RS) where RS is a topological
order through at most n feasible solutions, which can be
obtained by inverting the sequence of removing and re-inserting
operations from P -ST to P -ST ′′.
Therefore, starting from one generic solution P-ST′ we can
reach another generic solution P-ST′′ (through a solution P-ST
in which RS is a topological order of TG) in at most 2n steps.
Proof END.
C. Evaluation of Insertion Points
1) Computation of T : To reduce the complexity, we use the
area sum of the task modules to underestimate the configuration
span of the time layers instead of accurately computing the
configuration span of the DRRs. In this subsection, we discuss a
method (used in step b of the IAR perturbation shown in Section
IV) to evaluate, in amortized constant time, the schedule length
while inserting a task module into an insertion point.
After a task module mk is removed from the partitioned
sequence triple P -ST , RCG is updated by removing some
edges related to time layers according to the following two
situations:
(i) If mk is the only task module in tl(mk), we remove
the vertex tl(mk) along with its incoming and outgoing
edges and the edges between mk and any vertices that
represent time layers.
(ii) If $m_k$ is not the only task module in time layer tl($m_k$),
the weight of vertex tl($m_k$) is reduced by the
configuration time span of task $m_k$ ($c_k$), and the edges
between $m_k$ and any vertices that represent time layers
are removed.
Let RCG0 be the updated reconfiguration constraint graph.
To simplify the description, we add to RCG0 a source vertex
vs with outgoing edges to all the task modules that have zero
in-degree and a sink vertex vt with incoming edges from all the
task modules that have zero out-degree. Both vs and vt have
zero weight. Let rRCG0 be the graph obtained by reversing
all the edges of RCG0.
We pre-compute the longest paths from $v_s$ to each vertex
$v_i \in$ RCG0, denoted as $lp_0(v_i)$, and the longest paths from $v_t$
to each vertex, $lp^r_0(v_i)$, based on rRCG0, in $O(n^2)$ time using
the longest-path algorithm on directed acyclic graphs.
To evaluate the schedule length T for inserting mk into a
feasible insertion point (p, q, r, tlij) in P -ST , a new recon-
figuration constraint graph RCGnew is generated. Let T0 be
the longest path from vs to vt in RCG0. T can be roughly
and incrementally evaluated from the longest paths in RCG0
and rRCG0 by considering only the paths passing through the
vertex tlij or mk because all the changed edges are related to
either tlij or mk.
$$T = \max\big(T_0,\ lp_{new}(tl^i_j) + lp^r_{new}(tl^i_j) + c_{tl^i_j},\ lp_{new}(m_k) + lp^r_{new}(m_k) + t_k\big), \qquad (10)$$
where $lp_{new}(tl^i_j)$, $lp^r_{new}(tl^i_j)$, $lp_{new}(m_k)$, and $lp^r_{new}(m_k)$ are
incrementally computed based on $lp_0(v_i)$ and $lp^r_0(v_i)$, for the
three types of partitions discussed in Section IV-B.
Type-1: Both a new DRR drri and a new time layer tlij are
created for mk, and the RCGnew can be constructed by adding
three edges (red dotted lines) in RCG0, as shown in Fig. 7a.
Type-2: A new time layer tlij is created in an existing DRR
drri for mk, and the RCGnew can be constructed by adding
some edges (red dotted lines) in RCG0, where there are three
situations respectively shown in Fig. 7b, Fig. 7c and Fig. 7d.
In the situation shown in Fig. 7d, there are at least three time
layers $tl^i_{j-1}$, $tl^i_j$, and $tl^i_{j+1}$ in DRR $drr_i$.
Type-3: mk is inserted into an existing time layer tlij in the
DRR drri. There are two situations for updating RCGnew,
respectively shown in Fig. 7e and Fig. 7f.
In the RCGnew, the computations of $lp_{new}(tl^i_j)$ and
$lp_{new}(m_k)$ are summarized as follows.
$$lp_{new}(tl^i_j) = \begin{cases} lp_0(tl^{i_1}_{j_1}) + c_{tl^{i_1}_{j_1}}, & \text{for (a), (b), (e), and (f);} \\ \max\{ lp_0(tl^{i_1}_{j_1}) + c_{tl^{i_1}_{j_1}},\ \max_{m_x \in tl^i_{j-1}} \{ lp_0(m_x) + t_x \} \}, & \text{for (c) and (d).} \end{cases} \qquad (11)$$
$$lp_{new}(m_k) = \begin{cases} \max\{ lp_0(m_k),\ lp_{new}(tl^i_j) + c_k \}, & \text{for (a), (b), (c), and (d);} \\ \max\{ lp_0(m_k),\ lp_{new}(tl^i_j) + (c_{tl^i_j} + c_k) \}, & \text{for (e) and (f).} \end{cases} \qquad (12)$$
In the second part of Formula (11), the computation can be
performed in amortized constant time because the total number
of $m_x$ is at most n − 1.
The longest paths $lp^r_{new}(tl^i_j)$ and $lp^r_{new}(m_k)$ in the reverse
graph of RCGnew, rRCGnew, can be calculated in constant
time as follows.
$$lp^r_{new}(m_k) = \begin{cases} lp^r_0(m_k), & \text{for (a), (c), and (e);} \\ \max\{ lp^r_0(m_k),\ lp^r_0(tl^i_{j+1}) + c_{tl^i_{j+1}} \}, & \text{for (b), (d), and (f).} \end{cases} \qquad (13)$$
[Figure 7: six panels, each showing the time-layer vertices ($V_{LTG}$) tl(rs[r]), tl($m_k$), tl(rs[r + 1]), $tl^{i_1}_{j_1}$, $tl^{i_2}_{j_2}$ and the task vertices ($V_{TG}$) around $m_k$.]
(a) Type-1: $tl^i_j$ is a new time layer in a new DRR.
(b) Type-2: $tl^i_j$ is the first time layer in $drr_i$.
(c) Type-2: $tl^i_j$ is the last time layer in $drr_i$.
(d) Type-2: $tl^i_j$ is the middle time layer in $drr_i$.
(e) Type-3: $tl^i_j$ is the last time layer in $drr_i$.
(f) Type-3: $tl^i_j$ is not the last time layer in $drr_i$.
Fig. 7: RCGnew for three types of partitions considering insertion points.
$$lp^r_{new}(tl^i_j) = \max\{ lp^r_0(tl^{i_2}_{j_2}) + c_{tl^{i_2}_{j_2}},\ lp^r_{new}(m_k) + t_k \}. \qquad (14)$$
Consequently, the evaluation of T in Formula (10) for all
the possible $O(n^3)$ insertion points can be completed in $O(n^3)$
time.
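To make Formulas (10) to (14) concrete, the sketch below evaluates them on a tiny hand-built graph for a Type-1 insertion (case (a)) and compares the result with a full recomputation. The graph, the vertex names, the weights, and the helper names `longest_before` and `makespan` are assumptions made for this illustration; the source and sink vertices $v_s$ and $v_t$ are omitted because they carry zero weight.

```python
from graphlib import TopologicalSorter

def longest_before(weights, edges):
    """lp(v): largest total weight of the vertices strictly before v
    on any path ending at v."""
    preds = {v: set() for v in weights}
    for u, v in edges:
        preds[v].add(u)
    lp = {}
    for v in TopologicalSorter(preds).static_order():
        lp[v] = max((lp[u] + weights[u] for u in preds[v]), default=0)
    return lp

def makespan(weights, edges):
    lp = longest_before(weights, edges)
    return max(lp[v] + weights[v] for v in weights)

# RCG_0 (hypothetical): m3 was removed from its time layer but keeps its
# task-graph edge m1 -> m3.
w0 = {"tl_a": 2, "tl_b": 3, "m1": 5, "m2": 4, "m3": 6}
e0 = [("tl_a", "tl_b"), ("tl_a", "m1"), ("tl_b", "m2"),
      ("m1", "m2"), ("m1", "m3")]
lp0 = longest_before(w0, e0)
lpr0 = longest_before(w0, [(v, u) for u, v in e0])   # reversed graph
T0 = makespan(w0, e0)

# Re-insert m3 into a new layer tl_c (configuration span 5) that is
# configured between tl_a and tl_b.
c_new = 5
lp_tl = lp0["tl_a"] + w0["tl_a"]                             # Formula (11), (a)
lp_mk = max(lp0["m3"], lp_tl + c_new)                        # Formula (12), (a)
lpr_mk = lpr0["m3"]                                          # Formula (13), (a)
lpr_tl = max(lpr0["tl_b"] + w0["tl_b"], lpr_mk + w0["m3"])   # Formula (14)
T_incremental = max(T0, lp_tl + lpr_tl + c_new,
                    lp_mk + lpr_mk + w0["m3"])               # Formula (10)

# Reference: rebuild the graph with the three added edges and recompute.
w1 = dict(w0, tl_c=c_new)
e1 = e0 + [("tl_a", "tl_c"), ("tl_c", "tl_b"), ("tl_c", "m3")]
T_full = makespan(w1, e1)
print(T0, T_incremental, T_full)   # 13 14 14
```

In this instance the new longest path runs through the inserted time layer (2 + 5 + 7 = 14), so the incremental estimate exceeds $T_0$ and agrees with the full recomputation, while only a constant number of table lookups is needed after the two precomputed passes.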
2) Computation of Communication Costs CC: For an edge
$(m_i, m_j) \in E_{TG}$, the communication cost between $m_i$ and $m_j$,
$CC(m_i, m_j)$, can be evaluated as follows:
$$CC(m_i, m_j) = w_{i,j} \cdot \{ \alpha_d \cdot (|x_i - x_j| + |y_i - y_j|) + \beta_t \cdot (bt_j - et_i) \} \qquad (15)$$
where $w_{i,j}$ is the communication requirement between $m_i$ and
$m_j$, and $(x_i, y_i)$ and $(x_j, y_j)$ are respectively the coordinates of
$m_i$ and $m_j$ on the FPGA chip. $et_i$ ($= bt_i + t_i$) is the ending
execution time of $m_i$.
If the two task modules span multiple time layers, we project
the two task modules onto one time layer and calculate the
Manhattan distance. In the experiments, the parameters αd and
βt are set based on the temporal and spatial dimensions: 1) If
mi and mj are partitioned into the same time layer, we set αd
and βt to 1 and 0, respectively; 2) If mi and mj are partitioned
into different time layers in a DRR, then αd and βt will be set
to 1 and 1.5, respectively; 3) If mi and mj are partitioned into
different DRRs, αd and βt will be set to 3 and 1.5, respectively.
The communication cost $CC_{m_k}$ can be calculated as follows.
$$CC_{m_k} = \sum_{(m_i, m_k) \in E_{TG}} CC(m_i, m_k) + \sum_{(m_k, m_i) \in E_{TG}} CC(m_k, m_i). \qquad (16)$$
Thus, the total communication costs CC can be evaluated as
follows: $CC = \sum_{m_k \in V_{TG}} CC_{m_k}$.
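Formula (15) together with the three parameter settings reported for the experiments can be written as a small function. This is an illustrative sketch: the function name `comm_cost`, its argument layout, and the numbers in the usage lines are assumptions, not part of the original implementation.

```python
def comm_cost(w, pos_i, pos_j, et_i, bt_j, same_drr, same_layer):
    """Communication cost of edge (m_i, m_j) following Formula (15).
    (alpha_d, beta_t) follow the three cases used in the experiments."""
    if same_drr and same_layer:
        alpha_d, beta_t = 1.0, 0.0    # same time layer
    elif same_drr:
        alpha_d, beta_t = 1.0, 1.5    # different time layers in one DRR
    else:
        alpha_d, beta_t = 3.0, 1.5    # different DRRs
    manhattan = abs(pos_i[0] - pos_j[0]) + abs(pos_i[1] - pos_j[1])
    return w * (alpha_d * manhattan + beta_t * (bt_j - et_i))

# Hypothetical edge: requirement 2, modules at (0, 0) and (3, 4),
# m_i ends at time 10 and m_j begins at time 12.
args = dict(w=2, pos_i=(0, 0), pos_j=(3, 4), et_i=10, bt_j=12)
print(comm_cost(**args, same_drr=True, same_layer=True))    # 14.0
print(comm_cost(**args, same_drr=True, same_layer=False))   # 20.0
print(comm_cost(**args, same_drr=False, same_layer=False))  # 48.0
```

The same pair of modules becomes more expensive as it moves from a shared time layer to different layers of one DRR and then to different DRRs, which is the intended bias toward keeping communicating tasks close in both space and time.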
3) Computation of Area Costs: Let Row0 and Col0, re-
spectively, be the number of rows and the number of columns
of the CLB array available in the FPGA chip. The area cost
AC is calculated using a method similar to that in [23]:
$$AC = E_{Row} + E_{Col} \cdot \lambda + C_1 \cdot \max(E_{Row}, E_{Col} \cdot \lambda), \qquad (17)$$
where $E_{Col} = \max(Col - Col_0, 0)$ and $E_{Row} = \max(Row - Row_0, 0)$
are respectively the excessive columns and rows
required by the current solution, $\lambda = Row_0/Col_0$, and $C_1$ is a
user-defined constant.
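A direct transcription of Formula (17) shows that the area cost vanishes whenever the solution fits on the chip. The function name `area_cost`, the chip size of 300 rows by 100 columns, and the value $C_1 = 2$ are hypothetical choices for this example.

```python
def area_cost(rows, cols, rows0, cols0, c1):
    """Area cost AC of Formula (17): penalizes only the rows and columns
    that exceed the available CLB array (rows0 x cols0)."""
    e_row = max(rows - rows0, 0)
    e_col = max(cols - cols0, 0)
    lam = rows0 / cols0               # scales excess columns to row units
    return e_row + e_col * lam + c1 * max(e_row, e_col * lam)

# Hypothetical chip with 300 rows and 100 columns, C1 = 2.
print(area_cost(280, 95, 300, 100, c1=2))    # 0.0, the solution fits
print(area_cost(310, 105, 300, 100, c1=2))   # 55.0 = 10 + 15 + 2 * 15
```

Because AC is zero for any solution that respects the resource constraint, the remaining cost terms alone decide between such solutions.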
The Col and Row are evaluated in constant time by a
method similar to that in [23] and [24]. Considering a feasible
insertion point (p, q, r, tlij), we can first calculate the row
and column within the DRR using a method similar to [24]
in amortized constant time because the time layers in the
DRR are represented by a multi-layer sequence pair. After the
dimensions of the DRR are obtained, we can use a similar
method to that in [23] to calculate the total rows and total
columns required by inserting the removed task module into
the insertion point (p, q, r, tlij) in amortized constant time.
D. Complexity Analysis
In this subsection, we analyze the complexity of an IAR
perturbation on a P-ST. For an insertion point $(p, q, r, tl^i_j)$,
if the conditions for both type-1 and type-3 in Section IV-B2
are satisfied, there will be at most five possible time layer
candidates: one type-1 insertion point, where $tl^i_j$ is a newly
created time layer in a newly created DRR; two type-2 insertion
points, where $tl^i_j$ is a newly created time layer in an existing
DRR; two type-3 insertion points, where $tl^i_j$ is an existing time
layer. Consequently, there are $O(n^3)$ possible insertion points
for a removed task module.
The feasibility of an insertion point can be judged in amor-
tized constant time, as both the minimum lifetime in Corollary
3 and the lifetime constraint in Corollary 2 can be precomputed
in $O(n^2)$ time.
For each insertion point, the cost function shown in Formula
(5) can be calculated in O(n) time, where the area cost (AC) and the
schedule length T can be evaluated in amortized constant time,
and the complexity of computing communication costs is O(n).
Consequently, the evaluation of $O(n^3)$ insertion points can be
performed in $O(n^4)$ time, where n is the number of task modules.
V. EXPERIMENTS AND RESULTS
A. Experimental Setup
The proposed method has been implemented in C
on a Linux 64-bit workstation (Intel 2.0 GHz, 62 GB RAM).
The input consists of a set of tasks with dependencies given
in a task graph (TG), the resource requirements, configura-
tion time cti, and execution time eti of each task module
mi. The benchmarks are constructed by combining the task
graphs generated by Task Graphs For Free (TGFF) [25] and
the standard floorplanning benchmark, GSRC suites [26]. The
dimensions (width and height) of task modules are from GSRC
benchmarks and the width and the height of a task module
respectively define the number of CLB columns and the number
of CLB rows on an FPGA chip. The task dependencies are
generated by TGFF. The execution times of task modules and
the communication requirements between task modules are
randomly generated. Note that we are considering only the
allocation of CLB resources. We also generate two benchmarks
from a popular convolutional neural network model, AlexNet
[27]. One is AN Part1, which includes two convolutional layers
and one pooling layer of the model. The other is AN Part2,
which includes three convolutional layers and two pooling
layers of the model. The task module in convolutional layers
performs a convolution operation over a feature map with a
specified convolutional kernel and the task module in pooling
layers performs a pooling operation over all the output feature
maps of a convolutional layer. The task modules are stipulated
to consume the least hardware resources, and the execution
times are estimated based on a frequency of 200 MHz.
Table II lists the benchmark parameters used in our experi-
ments. Each randomly generated benchmark has three different
implementations, which have the same number of task modules
but different task dependencies. #V and #E are the numbers of
vertices and edges in the task graph TG, respectively. The columns
VWR and EWR give the ranges of the random values for the execution
times of tasks and the communication requirements between tasks,
respectively. The column CPT shows the length of the longest path of
each task graph, with vertices weighted by the execution times.
We take one of the widely used Xilinx Virtex 7 series FPGA
chips, XC7VX485T, as the target chip. There are about 37,950
CLBs and the ratio of rows to columns is 3:1. Therefore,
the CLB array in XC7VX485T has approximately 350 rows
and 117 columns. Configuring all resources of XC7VX485T
requires 50.7 ms through the interface ICAP with the maximum
bandwidth of 3.2 Gb/s [13]. Thus, we consider the time
overhead of reconfiguring one CLB to be 0.0013 ms, and the
configuration time span of a task module is proportional to the
module area because the configuration time is proportional to
the synthesized bitstream of a design.
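The per-CLB overhead quoted above follows from dividing the full-device configuration time by the CLB count. The snippet below reproduces that arithmetic; the helper `config_span_ms` and the 10 x 20 module are hypothetical and only illustrate the stated proportionality between module area and configuration span.

```python
FULL_CONFIG_MS = 50.7   # time to configure the whole device (from the text)
NUM_CLBS = 37_950       # approximate number of CLBs (from the text)

per_clb_ms = FULL_CONFIG_MS / NUM_CLBS
print(round(per_clb_ms, 4))   # 0.0013

def config_span_ms(width_cols, height_rows, per_clb=0.0013):
    """Configuration span of a module occupying width x height CLBs,
    assuming the span is proportional to the module area."""
    return width_cols * height_rows * per_clb

print(config_span_ms(10, 20))  # about 0.26 ms for a 200-CLB module
```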
TABLE II: Benchmark Information
Bench.     imp.  #V    #E    VWR (ms)    EWR          CPT (ms)
t10        1     10    10    (40,55)     (20,30)      259.2
t10        2     10    12    (40,55)     (20,30)      359.1
t10        3     10    8     (40,55)     (20,30)      151.1
t30        1     30    51    (30,350)    (60,610)     1450.8
t30        2     30    72    (40,60)     (20,30)      786.3
t30        3     30    71    (40,60)     (20,30)      727.6
t50        1     50    33    (40,60)     (20,30)      244.0
t50        2     50    51    (20,180)    (50,350)     464.9
t50        3     50    78    (40,60)     (20,30)      595.9
t100       1     100   110   (20,180)    (50,350)     703.1
t100       2     100   134   (20,180)    (50,350)     1265.6
t100       3     100   147   (20,180)    (50,350)     580.5
t200       1     200   312   (10,390)    (30,770)     3556.2
t200       2     200   327   (40,60)     (20,30)      436.4
t200       3     200   403   (10,390)    (30,770)     1482
t300       1     300   416   (20,180)    (50,350)     943.4
t300       2     300   443   (20,180)    (50,350)     1051.5
t300       3     300   735   (20,180)    (50,350)     2814.7
AN Part1   -     353   352   (5.6, 51)   (5.4, 11.8)  89.6
AN Part2   -     738   992   (3.5, 51)   (2.7, 11.8)  104.4
In the experiments, the sum of the three coefficients in
Formula (5) is set to one and the cost values are normalized. To
avoid violating the resource constraints in the final solutions,
the coefficient of area cost, α, is the dominant factor and is
set to around 0.8 to ensure almost 100% success rate. The
coefficients of schedule length and communication costs, β and
γ, are respectively set to 0.15 and 0.05. We can make a trade-
off between the schedule length and the communication cost by
changing β and γ because the area cost computed by Formula
(17) will be zero if the resource constraint is satisfied. The
initial solution is generated randomly and each task module
is partitioned into an individual time layer in a common DRR
and the configuration order is a topological ordering of the task
graph. The starting temperature, the ending temperature, and
the cooling ratio of the simulated annealing are respectively
set to 2000, 0.01 and 0.98. The number of iterations at each
temperature is set to 50 for the benchmarks with fewer than 50
tasks. For the other benchmarks, the number of iterations is increased
slightly as the number of task modules grows.
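The annealing parameters above fix the number of temperature levels and hence the number of perturbations per run. The sketch below counts them for geometric cooling; it assumes that the search stops as soon as the temperature is no longer above the ending temperature, which the text does not state explicitly, and the function name `temperature_levels` is illustrative.

```python
def temperature_levels(t_start=2000.0, t_end=0.01, ratio=0.98):
    """Number of temperature levels visited by geometric cooling,
    assuming the loop runs while the temperature exceeds t_end."""
    levels = 0
    t = t_start
    while t > t_end:
        levels += 1
        t *= ratio
    return levels

levels = temperature_levels()
print(levels)        # 605
print(levels * 50)   # 30250 perturbations with 50 iterations per level
```

Under this assumption, a benchmark with fewer than 50 tasks undergoes about 30,000 IAR perturbations per run.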
B. Results and Analysis
Table III shows the experimental results. The proposed inte-
grated optimization framework is called Int PSF. We execute
the proposed method 10 times independently for each bench-
mark, and list the average results. The columns in Table III
are organized as follows: T is the schedule length of each
design, which corresponds to the longest paths in the RCGs.
RunT is the run-time of the optimization framework. #succ
is the success rate of floorplanning. N is the number of DRRs.
CC indicates the communication costs calculated based on the
temporal and spatial dimensions.
As a baseline situation, we solve the simplified scheduling
problem, where the hardware resources are considered unlim-
ited and every task module occupies an individual DRR, using
an integer linear programming (ILP) formulation similar to that
in [9]. The obtained T indicates the schedule length in the
case when the configuration times are maximally hidden. In
the experiment, Gurobi [28] is used as the ILP solver. The
column ILP shows the results, where '*' means that the solution
is the incumbent found within the one-hour time limit. The results
demonstrate that Int PSF can effectively hide the reconfiguration time
overhead in the dependency-dominated task graphs (the designs
having long CPT) under the resource constraints.
TABLE III: Results of the proposed algorithm
Bench.    imp. ILP                TP PSF                                   Int PSF
               T (ms)    RunT (s) #succ  T (ms)   N     CC        RunT (s) #succ  T (ms)   N     CC        RunT (s)
t50       1    299.45*   3600     40%    462.7    3.8   325436    21.1     100%   456.9    3.3   216047    27.5
t50       2    468.82    12.2     20%    688.2    4.1   3982657   23.5     100%   705.7    3.9   3644317   29.1
t50       3    604.90    2.1      80%    616.5    4.4   610452    25.1     100%   618.9    3.8   589759    29.8
t100      1    706.22    23.5     60%    712.3    9.7   6428845   108.6    100%   715.22   9.3   6443798   102.2
t100      2    1266.29   27.6     100%   1266.3   9.6   7665841   135.3    100%   1266.29  9.4   7629823   113.4
t100      3    581.33    64.9     80%    611.1    9.6   13724146  112.4    100%   603.85   9.3   13664719  106.5
t200      1    3557.37   16.8     100%   3557.4   15.8  53624898  528.5    100%   3557.37  16.9  54086203  615.9
t200      2    436.92    386.4    100%   436.92   20.5  3043017   510.2    100%   436.92   21.6  3091106   532.5
t200      3    1483.24   772.42   70%    1501.1   21.9  87560078  532.1    100%   1489.68  21.2  87362210  568.4
t300      1    1120.05*  3600     40%    1110.3   17.4  37462179  1450.8   100%   1108.97  17.3  36686354  1633.9
t300      2    1379.49*  3600     30%    1135.6   16.9  41404278  1346.9   100%   1126.10  16.9  41972811  1590.3
t300      3    2815.2    2610.7   100%   2815.2   19.0  80345041  1390.7   100%   2815.21  17.8  79287974  1692.8
AN Part1  -    455.87*   3600     100%   542.1    15.9  1230864   1426.9   100%   539.3    21.5  1199978   1113.1
AN Part2  -    NF        3600     100%   1214.25  24.0  2980086   6693.2   100%   1023.3   26.5  2686430   6642.1
Cmp.      -    -         -        72.8%  1        1     1         -        100%   -1.2%    1.03% -0.53%    -
To explore the effectiveness of the proposed integrated
optimization framework, we also implement a two-phase approach
(TP PSF) for partitioning, scheduling, and floorplanning of
task modules. In the first phase, we evaluate the hardware
resources for a time layer using the sum of the task module
areas instead of calculating a floorplan of the task modules.
Consequently, the order of task modules within a time layer
(in PS and QS) is irrelevant, but the partitioning of task
modules and scheduling of the time layers are solved in an
integrated optimization framework. In the second phase, a
simulated annealing-based search is used for placing the task
modules and the DRRs. The initial floorplan of DRRs and
the initial floorplan of time layers are generated randomly. In
the simulated annealing, the IAR perturbation is adopted for a
DRR (as a whole) or a task module, and the task modules
are removed and inserted only within a time layer to keep
the partitioning unchanged. As shown in Table III, Int PSF
achieves a success rate of 100%, whereas the two-phase method
TP PSF achieves a success rate of only 72.8%, with
nearly the same schedule length and CC.
In benchmark t50-2, the T obtained by the proposed method
is noticeably higher than in the baseline situation because t50-2
has high parallelism, whereas FPGA hardware resources
constrain the parallel executions of the tasks. Table IV shows
the detailed experimental results on the relationships between
FPGA resources and performances for applications with differ-
ent degrees of parallelism.
According to our experimental results, the communication
costs CC can be reduced by 33.18% on average by considering
the communication costs in the optimization framework. The
detailed data are listed in the supplementary material (Table I).
To evaluate the impacts of FPGA resources on the schedule
length of designs, we perform the experiments for all test
benchmarks under different FPGA resource constraints, which
are set as 3/4x, 1.0x, 3/2x, and 2.0x of the targeted FPGA
architecture (37,950 CLBs). We execute Int PSF 10 times
independently for each benchmark, and show the average re-
sults in Table IV. The column Resource represents the amount
of resources for the target FPGA architecture. As shown in
Table IV, for all the benchmark circuits, with increasing FPGA
resources, the schedule length T and the communication
costs CC decrease and gradually level off. On the one hand, DRRs can
be executed in parallel and configured independently, thus the
configuration latency can be effectively hidden in the executions
of tasks. With increased FPGA resources, the number of DRRs
N is increasing overall, which will maximize the parallel
execution of tasks and increase the possibility of hiding the
configuration of tasks. On the other hand, the schedule length
should be greater than that in the baseline situation shown in
Table II. For the benchmark t200-1, which involves a long CPT
and fewer data dependencies, the schedule length T remains the
same with increasing FPGA resources and is close to the length
of the corresponding CPT because the configuration latency
is effectively hidden. Furthermore, Int PSF achieves 100%
success rate for the different FPGA resource constraints, which
demonstrates the effectiveness of the method.
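The flattening trend can be read directly from Table IV. The short Python sketch below takes the schedule lengths of t100 (imp. 1) from the table and computes the relative reduction gained by each resource step. The helper name `step_reductions` is ours and is introduced only for illustration.

```python
# Schedule lengths T (ms) for t100, imp. 1, copied from Table IV,
# ordered by resource constraint 3/4x, 1.0x, 3/2x, 2.0x.
T_T100_1 = [887.45, 715.22, 706.22, 706.22]


def step_reductions(values):
    """Percentage reduction of each entry relative to its predecessor."""
    return [100.0 * (prev - cur) / prev for prev, cur in zip(values, values[1:])]


reductions = [round(r, 1) for r in step_reductions(T_T100_1)]
print(reductions)  # [19.4, 1.3, 0.0]
```

For this benchmark, moving from 3/4x to 1.0x shortens the schedule by about 19%, the next step by about 1%, and the last step not at all.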
As discussed in Section IV-C, to reduce the complexity,
we use the area sum of the task modules to underestimate
the configuration span of the time layers instead of accurately
computing the configuration span of the DRRs, which should be
computed using the DRR area. According to our experimental
results, for the same solution, if the DRR area is used instead
of the sum of the module areas, the schedule length increases
by 5%. In the optimization framework, if we use the DRR area
to accurately estimate the configuration span of the time layers,
the obtained schedule length increases by only a negligible 1%.
A possible reason is that the configurations of the time layers
are well hidden in the execution of the task modules. A detailed
analysis is included in the supplementary material (Figure 1).
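The relationship between the two estimates can be sketched as follows. The sketch assumes, purely for illustration, that configuration time is proportional to the configured area with a hypothetical constant `US_PER_CLB`, and that a DRR is a rectangle enclosing the modules of a time layer. Neither the constant nor the function names come from the paper.

```python
from dataclasses import dataclass

US_PER_CLB = 2  # hypothetical configuration time per CLB, in microseconds


@dataclass(frozen=True)
class Module:
    width: int   # in CLB columns
    height: int  # in CLB rows

    @property
    def area(self) -> int:
        return self.width * self.height


def span_from_module_areas(modules):
    """Underestimate: count only the CLBs occupied by the task modules."""
    return US_PER_CLB * sum(m.area for m in modules)


def span_from_drr_area(drr_width, drr_height):
    """Accurate estimate: the whole DRR rectangle is reconfigured."""
    return US_PER_CLB * drr_width * drr_height


# Two modules placed side by side in a 25 x 20 DRR (illustrative numbers).
layer = [Module(10, 20), Module(15, 16)]
estimate = span_from_module_areas(layer)  # 2 * (200 + 240) = 880
accurate = span_from_drr_area(25, 20)     # 2 * 500 = 1000
print(estimate, accurate)                 # 880 1000
```

Because the modules of a time layer fit inside the DRR, the area-sum estimate can never exceed the DRR-based value, which is why it acts as an underestimate.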
TABLE IV: Impacts of FPGA resources on Int PSF
                                t100                     t200                     t300
imp.  Resource  #succ     N   T (ms)        CC     N   T (ms)        CC     N   T (ms)        CC
1     3/4        100%   5.6   887.45   7708821  12.5  3557.37  54554324  11.6  1344.19  41151018
      1.0        100%   9.3   715.22   6443798  16.9  3557.37  54086203  17.3  1108.97  36686354
      3/2        100%  14.1   706.22   5277481  18.9  3557.37  53948392  31.9   944.88  32868833
      2.0        100%  14.7   706.22   5063156  17.4  3557.37  54165018  46.8   944.41  33023986
2     3/4        100%   5.4  1270.60   7705509  13.2   478.45   3312604  11.6  1294.18  46898340
      1.0        100%   9.4  1266.29   6443798  21.6   436.92   3091106  16.9  1126.10  41972811
      3/2        100%   9.5  1266.29   5277481  27.2   436.92   2749060  31.6  1052.78  38627940
      2.0        100%   9.9  1266.29   5063156  27.9   436.92   2747273    37  1052.78  38020370
3     3/4        100%   5.3   804.19  15790269  13.1  1725.29  97307300  12.3  2841.89  81422950
      1.0        100%   9.3   603.85  13664719  21.2  1489.68  87362210  17.8  2815.21  79287974
      3/2        100%  15.3   581.33  11737755  30.4  1483.24  83528144  21.8  2815.21  78553805
      2.0        100%  17.9   581.33  10975717  31.8  1483.24  83050648  22.5  2815.21  78461001
C. Vertically Aligning DRRs to Reconfigurable Frames
As was demonstrated in [13], when applying the Reset After
Reconfiguration methodology, a DRR must vertically align to
reconfigurable frames (aligning vertically to clock regions) for
7-series and Zynq-7000 AP SoC devices. When the DRRs are
vertically aligned to the reconfigurable frames, the height of
a reconfigurable frame (50 rows of CLBs) is adopted as the
measurement unit of DRR height. Table V shows the experimental
results with and without the aligning constraint. The results of
the two cases are comparable. An example result for t50-1 is
shown in the supplementary material (Figure 2).
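Using the frame height as the unit of DRR height amounts to rounding each DRR up to a whole number of frames. A minimal sketch, assuming a DRR is described by its bottom row and its height in CLB rows (the function names are hypothetical and not taken from the paper or from any vendor tool):

```python
FRAME_ROWS = 50  # height of one reconfigurable frame in CLB rows (from the text)


def frames_needed(height_rows: int) -> int:
    """Number of reconfigurable frames needed to cover a height in CLB rows."""
    return -(-height_rows // FRAME_ROWS)  # ceiling division


def aligned_height(height_rows: int) -> int:
    """Height in CLB rows after rounding up to a whole number of frames."""
    return frames_needed(height_rows) * FRAME_ROWS


def is_aligned(bottom_row: int, height_rows: int) -> bool:
    """True if the DRR starts and ends on frame boundaries."""
    return bottom_row % FRAME_ROWS == 0 and height_rows % FRAME_ROWS == 0


print(frames_needed(120), aligned_height(120))   # 3 150
print(is_aligned(50, 100), is_aligned(30, 100))  # True False
```

The rounding can enlarge a DRR, which is one plausible reason why the aligned results in Table V differ slightly from the unconstrained ones.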
TABLE V: Aligning DRRs to Reconfigurable Frames
             Int PSF with aligning           Int PSF
Bench.  imp  #succ  T (ms)     N       CC    #succ  T (ms)     N       CC
t10     1     100%   273.6   3.0    65125     100%   269.8   3.9    54078
        2     100%   370.8   2.5    82022     100%   369.6   2.5    75027
        3     100%   232.5   3.0    38566     100%   201.6   3.7    57750
t30     1     100%  1470.9   2.7  7516360     100%  1459.8   3.0  6806495
        2     100%   799.8   2.8   582164     100%   793.1   3.0   599203
        3     100%   738.7   3.0   602132     100%   734.2   3.0   615542
t50     1     100%   473.6   4.0   257776     100%   456.9   3.3   216047
        2     100%   691.2   4.6  3826090     100%   705.7   3.9  3644317
        3      90%   611.9   4.7   625038     100%   618.9   3.8   589759
-            98.9%   629.2  3.36  1510585     100%   623.3  3.34  1406468
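To put a number on "comparable", the sketch below recomputes the average schedule lengths in the last row of Table V and the relative overhead of the aligning constraint. The data are copied from the table; the variable names are ours.

```python
# Schedule lengths T (ms) from Table V, ordered t10-1..3, t30-1..3, t50-1..3.
T_ALIGNED = [273.6, 370.8, 232.5, 1470.9, 799.8, 738.7, 473.6, 691.2, 611.9]
T_FREE = [269.8, 369.6, 201.6, 1459.8, 793.1, 734.2, 456.9, 705.7, 618.9]


def mean(values):
    return sum(values) / len(values)


mean_aligned = mean(T_ALIGNED)
mean_free = mean(T_FREE)
overhead_pct = 100.0 * (mean_aligned - mean_free) / mean_free

print(round(mean_aligned, 1), round(mean_free, 1))  # 629.2 623.3
print(round(overhead_pct, 2))                       # 0.95
```

On these nine benchmarks the aligning constraint lengthens the average schedule by just under 1%.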
D. Comparison with Previous Work
R. Cordone et al. [6] proposed a partitioning method to
extract cores (isomorphic and non-overlapping subgraphs) from
the task graphs for module reuse and an integer linear pro-
gramming (ILP) based method and a heuristic method for
scheduling task graphs on partially dynamically reconfigurable
FPGAs. The core extraction method provides preprocessing of
the task graphs and can be combined with other scheduling
methods to consider module reuse. However, it is difficult to
extend the scheduling method for processing DRR partitions,
while the task partitioning algorithm in Y. Jiang et al. [8]
can be used only within a DRR. E. A. Deiana et al. [9]
proposed a mixed-integer linear programming (MILP) based
scheduler for mapping and scheduling applications on partially
reconfigurable FPGAs with consideration of DRRs, where only
one task module is involved in each time layer, followed by
floorplanning the DRRs. Consequently, in this study, we make
a comparison with the methodology in [9]. We adapt the ILP
method in [9] to the problem in this work by skipping the
module reuse, and in the proposed optimization framework,
we add a constraint (one task constraint) so that each time
layer includes only one task module. As for the floorplanning
of DRRs, [9] gives no detailed description, so we use the
method in [23], which performs very well in fixed-outline-
constrained floorplanning for FPGAs and takes only a few
seconds to floorplan 100 task modules.
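The one task constraint can be stated as a simple predicate. The sketch below assumes a hypothetical representation in which each DRR maps to an ordered list of time layers and each time layer is a list of task identifiers. This data layout is an assumption made for illustration and is not the P-ST encoding used in the paper.

```python
from typing import Dict, List

Partition = Dict[str, List[List[str]]]  # DRR name -> time layers -> task ids


def satisfies_one_task_constraint(partition: Partition) -> bool:
    """True if every time layer of every DRR holds exactly one task module."""
    return all(
        len(layer) == 1
        for layers in partition.values()
        for layer in layers
    )


ok = {"drr0": [["t1"], ["t4"]], "drr1": [["t2"], ["t3"]]}
bad = {"drr0": [["t1", "t2"], ["t4"]], "drr1": [["t3"]]}
print(satisfies_one_task_constraint(ok), satisfies_one_task_constraint(bad))  # True False
```

In a search-based framework, a check of this kind could be used to reject or repair candidate solutions that place several task modules in one time layer.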
Table VI shows the experimental results, which are averaged
over 10 independent runs. In the ILP+Floorplan method, we
solve the ILP model once and run the floorplanning algorithm
10 times. Because the ILP-based method is time-consuming,
we use the small test cases, t10, t30, and t50 (the largest
test case in [9] includes 50 tasks), for the comparison.
'NF' indicates that the ILP solver fails to find any feasible
solution in a reasonable time. The results show that the
proposed optimization framework achieves much higher success
rates with comparable schedule lengths.
TABLE VI: Comparison between the ILP+Floorplanning [9]
and the proposed Int PSF
             ILP+Floorplan [9]              Int PSF (one task constraint)
Bench.  imp  #succ  T (ms)    N  RunT(s)    #succ      T (ms)    N  RunT(s)
t10     1     100%   269.8  4.0     0.64     100%       269.8  4.0     1.56
        2      90%   369.7  4.0     0.54     100%       369.7  3.0     1.64
        3     100%   176.2  4.0     2.24     100%       202.6  4.0     1.47
t30     1      50%  1459.8  6.0      7.1     100%      1460.1  4.6     6.82
        2      80%   792.8  6.0      7.9     100%       794.4  5.1     6.55
        3      30%   734.2  5.0     17.8     100%       739.1  4.7     6.47
t50     1       NF      NF   NF   >10000     100%       487.6  6.7    16.08
        2       NF      NF   NF   >10000     100%       733.3  6.6    17.17
        3      30%     605  8.0     1085     100%       623.4  6.8    17.61
-            68.6%     630    -        -     100%  637(+1.1%)    -        -
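The summary row of Table VI appears to be consistent with averaging over the seven instances for which the ILP solver returned a solution. This is our reading of the numbers rather than something stated explicitly, and the sketch below only checks that arithmetic.

```python
# (success rate %, T in ms) per benchmark from Table VI; None marks 'NF'.
ILP = [(100, 269.8), (90, 369.7), (100, 176.2),
       (50, 1459.8), (80, 792.8), (30, 734.2),
       None, None, (30, 605.0)]
# Schedule lengths T (ms) of Int PSF (one task constraint), same order.
INT_PSF_T = [269.8, 369.7, 202.6, 1460.1, 794.4, 739.1, 487.6, 733.3, 623.4]

# Indices of the instances that the ILP-based flow solved.
solved = [i for i, row in enumerate(ILP) if row is not None]

ilp_succ = sum(ILP[i][0] for i in solved) / len(solved)
ilp_t = sum(ILP[i][1] for i in solved) / len(solved)
int_t = sum(INT_PSF_T[i] for i in solved) / len(solved)

print(round(ilp_succ, 1), round(ilp_t), round(int_t))       # 68.6 630 637
print(round(100.0 * (round(int_t) / round(ilp_t) - 1), 1))  # 1.1
```

Under this reading, the schedule-length gap of about 1.1% is measured only on instances that both methods solve, while the success rate of Int PSF covers all nine instances.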
VI. CONCLUSIONS
In this paper, we proposed an integrated optimization framework
for partitioning, scheduling, and floorplanning on partially
dynamically reconfigurable FPGAs. The partitioned sequence
triple P-ST (PS,QS,RS) was proposed to represent the
partitions, schedule, and floorplan of the task modules,
and a sufficient and necessary condition was given for the
feasibility of a P-ST with respect to the scheduling problem.
An elaborate perturbation method was proposed to generate new
solutions by simultaneously perturbing the partition, schedule,
and floorplan. Based on the proposed optimization framework, we
integrated the exploration of the spatial and temporal design
spaces to search for optimal solutions of partitioning,
scheduling, and floorplanning. Experimental results demonstrated
the effectiveness of the proposed framework. In future work, we
will further consider the reuse of task modules, variable
dimensions for task modules, and the integration of the
allocation of RAM and DSP resources.
REFERENCES
[1] R. Tessier, K. Pocek, and A. DeHon, “Reconfigurable computing archi-
tectures,” Proceedings of the IEEE, vol. 103, no. 3, pp. 332–354, 2015.
[2] S. Hauck and A. DeHon, Reconfigurable computing: the theory and
practice of FPGA-based computation. Morgan Kaufmann, 2010, vol. 1.
[3] D. Koch, Partial Reconfiguration on FPGAs: Architectures, Tools and
Applications. Springer Science & Business Media, 2012, vol. 153.
[4] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, “Rectangle-
packing-based module placement,” in IEEE/ACM International Confer-
ence on Computer-Aided Design (ICCAD), 1995, pp. 472–479.
[5] M. R. Garey and D. S. Johnson, “Computers and intractability: a guide
to the theory of NP-completeness,” W. H. Freeman & Co., San Francisco,
pp. 90–91, 1979.
[6] R. Cordone, F. Redaelli, M. A. Redaelli, M. D. Santambrogio, and
D. Sciuto, “Partitioning and scheduling of task graphs on partially dy-
namically reconfigurable fpgas,” IEEE Trans. on computer-aided design
of integrated circuits and systems, vol. 28, no. 5, pp. 662–675, 2009.
[7] A. Purgato, D. Tantillo, M. Rabozzi, D. Sciuto, and M. D. Santambrogio,
“Resource-efficient scheduling for partially-reconfigurable fpga-based
systems,” in IEEE International Parallel and Distributed Processing
Symposium Workshops, 2016, pp. 189–197.
[8] Y.-C. Jiang and J.-F. Wang, “Temporal partitioning data flow graphs for
dynamically reconfigurable computing,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 15, no. 12, pp. 1351–1361, 2007.
[9] E. A. Deiana, M. Rabozzi, R. Cattaneo, and M. D. Santambrogio, “A mul-
tiobjective reconfiguration-aware scheduler for fpga-based heterogeneous
architectures,” in IEEE International Conference on ReConFigurable
Computing and FPGAs (ReConFig), 2015, pp. 1–6.
[10] M. Vasilko, “Dynasty: A temporal floorplanning based cad framework for
dynamically reconfigurable logic systems,” in International Workshop on
Field Programmable Logic and Applications, 1999, pp. 124–133.
[11] P.-H. Yuh, C.-L. Yang, and Y.-W. Chang, “Temporal floorplanning us-
ing the t-tree formulation,” in IEEE/ACM International conference on
Computer-aided design (ICCAD), 2004, pp. 300–305.
[12] ——, “Temporal floorplanning using the three-dimensional transitive
closure subgraph,” ACM Transactions on Design Automation of Electronic
Systems (TODAES), vol. 12, no. 4, p. 37, 2007.
[13] Xilinx, “Vivado design suite user guide: Partial reconfiguration,”
[Online]. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_4/,
2017.
[14] A. Montone, M. D. Santambrogio, D. Sciuto, and S. O. Memik, “Place-
ment and floorplanning in dynamically reconfigurable fpgas,” ACM Trans.
on Reconfigurable Technology and Systems, vol. 3, no. 4, p. 24, 2010.
[15] L. Singhal and E. Bozorgzadeh, “Multi-layer floorplanning on a sequence
of reconfigurable designs,” in IEEE International Conference on Field
Programmable Logic and Applications, 2006, pp. 1–8.
[16] P. Banerjee, M. Sangtani, and S. Sur-Kolay, “Floorplanning for partially
reconfigurable fpgas,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 30, no. 1, pp. 8–17, 2011.
[17] N. Liu, S. Chen, and T. Yoshimura, “Resource-aware multi-layer floor-
planning for partially reconfigurable fpgas,” IEICE transactions on elec-
tronics, vol. 96, no. 4, pp. 501–510, 2013.
[18] M. Rabozzi, J. Lillis, and M. D. Santambrogio, “Floorplanning for
partially-reconfigurable fpga systems via mixed-integer linear program-
ming,” in IEEE Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), 2014, pp. 186–193.
[19] M. Rabozzi, A. Miele, and M. D. Santambrogio, “Floorplanning for
partially-reconfigurable fpgas via feasible placements detection,” in IEEE
23rd Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), 2015, pp. 252–255.
[20] X. Xu, Q. Xu, J. Huang, and S. Chen, “An integrated optimization
framework for partitioning, scheduling and floorplanning on partially
dynamically reconfigurable fpgas,” in ACM Great Lakes Symposium on
VLSI, 2017, pp. 403–406.
[21] D. Chen, J. Cong, and P. Pan, “Fpga design automation: A survey,”
Foundations and Trends in Electronic Design Automation, vol. 1, no. 3,
pp. 139–169, 2006.
[22] X. Tang, R. Tian, and D. Wong, “Fast evaluation of sequence pair in
block placement by longest common subsequence computation,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 20, no. 12, pp. 1406–1413, 2001.
[23] S. Chen and T. Yoshimura, “Fixed-outline floorplanning: Block-position
enumeration and a new method for calculating area costs,” IEEE Trans.
on Computer-Aided Design of Integrated Circuits and Systems, vol. 27,
no. 5, pp. 858–871, 2008.
[24] ——, “Multi-layer floorplanning for stacked ics: Configuration number
and fixed-outline constraints,” Integration-the VLSI journal, vol. 43, no. 4,
pp. 378–388, 2010.
[25] R. P. Dick, D. L. Rhodes, and W. Wolf, “Tgff: task graphs for free,” in
IEEE 6th International workshop on Hardware/software codesign, 1998,
pp. 97–101.
[26] A. B. Kahng and I. L. Markov, “Vlsi cad bookshelf,” [Online].
http://vlsicad.eecs.umich.edu/BK.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[28] Gurobi Optimization, Inc., “Gurobi optimizer reference manual,” 2015.
[Online]. Available: http://www.gurobi.com
Song Chen received his B.S. degree in computer
science from Xi’an Jiaotong University, China in
2000. Subsequently, he obtained a Ph.D. degree in
computer science from Tsinghua University, China
in 2005. He served at the Graduate School of Infor-
mation, Production and Systems, Waseda University,
Japan, as a Research Associate from August 2005 to
March 2009 and as an Assistant Professor from April
2009 to August 2012. He is currently an Associate
Professor in the Department of Electronic Science and
Technology, University of Science and Technology of
China (USTC). His research interests include several aspects of VLSI design
automation, on-chip communication system, and computer-aided design for
emerging technologies. Dr. Chen is a member of IEEE and IEICE.
Jinglei Huang received the B.E. degree in electronic
science and technology from Anhui University,
Hefei, China in 2013, and obtained a Ph.D. degree
in electronic science and technology from University
of Science and Technology of China (USTC) in
2018. He is currently an Engineer in the State Key
Laboratory of Air Traffic Management System and
Technology, the 28th Research Institute of China
Electronics Technology Group Corporation. His re-
search interests include network-on-chip synthesis
and air traffic management.
Xiaodong Xu, Photograph and biography not available at
the time of publication.
Bo Ding, Photograph and biography not available at the time
of publication.
Qi Xu received the B.E. degree in microelectronics
from Anhui University in 2012 and the Ph.D. degree
in electronic science and technology from University
of Science and Technology of China in 2018. He
is currently a lecturer in the School of Electronic
Science and Applied Physics, Hefei University of
Technology.
His research interests include physical design au-
tomation and design for reliability for 3-D integrated
circuits.
