Maintaining Virtual Areas on FPGAs using Strip Packing with Delays by Angermeier, Josef et al.
arXiv:1001.4493v1 [cs.AR] 25 Jan 2010
Maintaining Virtual Areas on FPGAs using
Strip Packing with Delays
Josef Angermeier#1, Sándor P. Fekete∗, Tom Kamphans∗2, Nils Schweer∗, Jürgen Teich#
# Department of Computer Science 12, University of Erlangen-Nuremberg
Erlangen, Germany
{angermeier, teich}@cs.fau.de
∗ Department of Computer Science, Braunschweig University of Technology
Braunschweig, Germany
{s.fekete, n.schweer}@tu-bs.de, tom@kamphans.de
Abstract—The computing resources available on dynamically
partially reconfigurable devices increase enormously every year.
In the near future, we expect that many applications will run on a
single reconfigurable device. In this paper, we present a concept
for multitasking on dynamically partially reconfigurable systems
called virtual area management. We explain its advantages, show
its challenges, and discuss possible solutions. Furthermore, we
investigate one problem in more detail: packing modules with
time-varying resource requests. This problem from the field of
reconfigurable computing leads to a completely new optimization
problem that has not been tackled before. ILP-based and heuristic
approaches are compared in an experimental study, and their
drawbacks and benefits are discussed.
I. INTRODUCTION
Reconfigurable devices offer more and more space and
functionality over time and will probably continue to do so
in the future. Already today, huge hardware applications, or even
arrays of processors, can be instantiated on reconfigurable chips.
Furthermore, reconfigurable chips are nowadays
mostly used to instantiate a single application with multiple
modules, which might not all be necessary at each instant
in time. But as reconfigurable devices evolve further, the
offered resources will be large enough to also run completely
different applications simultaneously. This development also
took place in the software world: In the beginning of the
nineties, most personal computers used operating systems
such as MS-DOS, which allowed just one single software
application to be executed, along with some drivers. Since
then, the increased computing resources have allowed us to
run multiple applications simultaneously.
Similar to the software world, multitasking will lead re-
configurable devices to higher efficiency, but will also raise
several challenges that must be solved. For example, in order
to provide reliability and security, we must make sure that
no application can violate the execution of another one. In
software this is solved by running each application in its own
virtual address space. Each application knows only the virtual
addresses of its own resources and parts of the operating
system. The virtual addresses are translated at runtime into
1Supported by DFG grant TE 163/14-2, 14-3, project “ReCoNodes”, as
part of the Priority Programme 1148, “Reconfigurable Computing”.
2Supported by DFG grant FE 407/8-2, 8-3, project “ReCoNodes”, as part
of the Priority Programme 1148, “Reconfigurable Computing”.
physical addresses in the memory. Thus, pages of different
applications may lie physically next to each other, but only
the approved application may access them. Furthermore, the
concept of the virtual address space also facilitates writing
software, as an application need not bother about the
positions of other applications but can assume that it is the only
user of the processor and the memory.
In order to work out concepts for allowing multitasking on
reconfigurable devices, we took our cue from concepts of
the software world such as the virtual address space. However, as there
are some major differences between hardware and software,
one cannot simply transfer a concept unchanged from software to
reconfigurable devices. One important difference is that on
reconfigurable devices, different modules may be executed in
parallel while software usually works more sequentially.
The basic idea of our concept, which we call “Virtual
Area Management”, is to partition the available FPGA area
into different regions. Each hardware application obtains one
contiguous region, which may grow or shrink depending
on the application’s resource needs. Each running hardware
application can reconfigure hardware modules in its region
and free or try to allocate more area. In this process, each
application does not know the physical position of its modules,
but just the relative positions. Thus, it cannot reconfigure any
region that belongs to a different application and is concerned
only with its own resources. Intermodule communication takes
place only within the region of one application. Thus, modules
that must communicate with each other are automatically
placed close to each other.
Communication between the partial modules and with the
input and output periphery is an important point. We assume
that the application modules are provided with some means
to communicate with each other and with the FPGA’s I/O ports
independently of their position (e.g., as described in [1]).
There are several possible ways to achieve this goal. We solved
the communication problems by using our self-developed
platform called ESM (see [2],[3]). Among other things, it offers
a so-called crossbar device, which dynamically routes the input
and output signals to the position of the corresponding module.
Moreover, the modules can communicate with each other using
the crossbar. Thus, we can assume that the modules can be
placed independently of the positions of the I/O periphery
and of the positions of other modules.
A. Related Work
1) Reconfigurable Computing: Brebner [4], [5] addressed
the problems involved in presenting to a software-oriented
user a larger virtual hardware resource that is implemented
using smaller physical FPGA hardware. Their approach is
based on using swappable units, and a prototype operating
system is described that demonstrates operational steps. In
contrast to that work, this paper does not address the problem
of overcoming the physical constraints given by a small FPGA,
but tackles the problem of how to run multiple applications on
a single reconfigurable resource.
Bazargan et al. [6] present fast online placement methods
for dynamically reconfigurable systems, as well as offline
3D placement algorithms. In their setting, partial modules are
placed completely independently of each other, and the inter-module
communication problem is not addressed. Steiger et
al. [7] and Diessel et al. [8] further improve scheduling
methods for partially reconfigurable systems. Our approach is
based on the differentiation between applications and modules.
Modules belonging to the same application are placed nearby,
such that inter-module communication can be as efficient
as possible. Different scheduling subproblems have been ad-
dressed meanwhile; for example, scheduling with respect to
the reconfiguration overhead [9]. Banerjee et al. [10] take into
account hardware-software partitioning decisions for a fast
execution of an application. But in contrast to our paper, all
these approaches still focus on executing a single application.
Some operating systems for reconfigurable embedded plat-
forms were developed [11], [12], [13]. Such an operating
system provides a minimal programming model and a runtime
system. The runtime system performs online task and resource
management. Scheduling problems are formulated for the
1D and 2D resource models and developed heuristics are
compared to each other. Resources of the operating system
and the user applications are clearly differentiated, but that is
not the case for the resource access of multiple applications
running on the FPGA. Our paper suggests a compromise for
the inter-module communication problem: modules belonging
to the same application should be placed near each
other, such that they can exchange data efficiently. Modules
belonging to different applications are not necessarily placed
near each other. Additionally, in contrast to our application model,
applications in these systems cannot shrink or grow at runtime.
Former work focused on hardware and software abstraction,
whereas our focus is on securely running multiple, dynamic
hardware applications.
Many recent works focus on achieving optimal performance
when placing multiple tasks onto one FPGA.
Cordone et al. [14] and Redaelli et al. [15] specified
a new model for partitioning and scheduling on partially dynamically
reconfigurable hardware. The different applications
are represented by a task graph, and the aim is to obtain a total
execution time near optimality by taking reconfiguration and
communication times into account. However, this approach
does not allow dynamic behavior in which the tasks select on
their own, according to the current state of the resources,
which module implementation to reconfigure next and at
which nearby position to place it.
The approach by Cardoso [16] also considers the topic of
resource virtualization on FPGA devices, achievable due to
dynamic reconfiguration capabilities. To this end, a new temporal
partitioning algorithm is proposed. The model by Banerjee
et al. [17] additionally supports HW/SW partitioning
of the tasks. However, both approaches are based on
the assumption that the running time of each task can be
estimated roughly, and that the worst-case execution time does
not differ too much from the average case. In contrast,
our approach can also be applied in the online case, where
this assumption need not hold.
2) Packing: It turns out that placing hardware modules with
growing and shrinking area resources amounts to strip packing.
The classical strip packing problem was first considered by
Baker et al. [18]. In this problem a set of rectangles must be
packed into a strip of semi-infinite height and width 1 such
that the total height of the packing is minimized. They showed
that in the online case the bottom left heuristic does not
guarantee a constant competitive ratio for packing a sequence
of rectangles. For the offline case they proved an upper bound
of 3 for a sequence of rectangles and of 2 on the competitive
ratio for a sequence of squares; both analyses require the
elements to be sorted. Later Kenyon and Rémila designed an
asymptotic fully polynomial-time approximation scheme for the offline
setting [19]. For the online case Baker and Schwarz [20]
introduced the so-called shelf algorithms with a competitive
ratio that can be made arbitrarily close to 1.7. Csirik and
Woeginger [21] showed a lower bound of 1.69103 for any
shelf algorithm and introduced an algorithm whose asymptotic
worst-case ratio comes arbitrarily close to this value.
In the classical game of Tetris the aim is to find online
placements for a sequence of objects—not all having rectan-
gular shape—such that space is utilized as well as possible.
In this process, no item can ever move upward, no collisions
between objects may occur, an item comes to a stop if
and only if it is supported from below, and each item
must be placed before the next item arrives. Obviously, there
is a slight difference in the objective function, as Tetris aims
at filling rows. In actual optimization scenarios, this is less
interesting, as it is not critical whether a row is used to
precisely 100%. Even when disregarding the difficulty of ever-
increasing speed, Tetris is notoriously difficult: As shown by
Breukelaar et al. [22], Tetris is PSPACE-hard, even for the
original, limited set of different objects. Azar and Epstein [23]
considered Tetris-like online packing of rectangles into a strip
where each item must be moved on a collision-free path to its
final position, which does not have to be supported from below.
Just like in Tetris, they considered the situation with or without
rotation of objects. For the case without rotation, they showed
that no constant competitive ratio is possible, unless there
is a fixed-size lower bound of $\varepsilon$ on the side length of the
objects, in which case there is an upper bound of $O(\log \frac{1}{\varepsilon})$.
For the case in which rotation is possible, they showed a 4-
competitive strategy, based on shelf-packing methods, with all
rectangles being rotated to be placed on their narrow sides.
Coffman, Downey, and Winkler [24] considered probabilistic
aspects of online rectangle packing with Tetris constraint,
Fig. 1. Architecture overview of our ESM platform (see [2],[3] for more details). The diagram shows the modules M1 to M3 on the FPGA, the crossbar, SRAM, the BabyBoard, the peripherals, the PowerPC, the reconfiguration manager, flash, and the MotherBoard.
without allowing rotations. If rectangle side lengths are chosen
uniformly at random from the interval [0, 1], they showed that
there is a lower bound of (0.31382733...)n on the expected
height of the strip. Using another kind of level-type strategy,
which arises from the bin-packing–inspired Next Fit Level,
they established an upper bound of (0.36976421...)n on the
expected height. Fekete et al. [25] considered a Tetris-like
online packing with gravity; that is, every item must be
supported from below in its final position. For squares they
gave an algorithm with competitive ratio 2.6154.
Note that none of the previous works allows stretching the
objects in any direction.
II. VIRTUAL AREA MANAGEMENT
A. Main Idea
Our concept is aimed at partially dynamically reconfigurable
architectures, where modules loaded onto the reconfigurable
device may be exchanged at runtime. A typical structure of
such a device is given in Fig. 1. Some available devices
allow the reconfiguration of entire columns only, while newer
platforms also allow the reconfiguration of certain contiguous
cells. The former is called 1D reconfiguration, the latter 2D
reconfiguration. Our concepts are applicable to both architectures.
Furthermore, the platform should contain one control
CPU, which may be placed externally or be integrated into the
reconfigurable device.
To run different applications on the reconfigurable device,
we propose to partition the available reconfigurable area into
so-called virtual areas (VAs). For performance reasons, the
different VAs should consist of contiguous reconfigurable area
units. As the required amount of resources of an application
changes over time, the size of the virtual area may change
dynamically over the running time of its application. The
mapping of virtual area to physical reconfigurable area is done
by the control processor. Each hardware application being
executed on the reconfigurable device has its own software
control thread running on this CPU, see Fig. 2. These threads
request the initially required area and transmit changes in the
requirements. An operating system service on the software
side takes the requests and is in charge of the management of
Fig. 2. Two different hardware applications running on a 1D partially dynamically
reconfigurable device: The first three slots form the first virtual area
(VA1), the remaining the second virtual area (VA2). Modules in one VA can
communicate with each other, e.g. by using their communication point (CP)
belonging to a reconfigurable bus. Neither the first nor the second application
knows the exact position of its VA on the reconfigurable device; only the size
of the available area and the relative positions of the reconfigurable modules
loaded in its virtual area are known.
the virtual areas. This secure operating system unit maps the
virtual area units to the corresponding physical dynamically
reconfigurable area units. The virtual area management unit
can be compared to the memory management unit (MMU) in
the software world: both translate virtual positions into
physical ones. Furthermore, the corresponding
application software threads do not know the actual physical
positions of the reconfigured modules, only the positions of the
reconfigurable modules relative to each other. Each
application simply requests to load a specified module to a
virtual position in the assigned virtual area. See the example
in Fig. 2: The second application with its virtual area VA2
issues a request to load a specified bit-file called “X” to
the virtual address (here called: VSlot) ‘2’. It does not know
that this virtual address corresponds physically to the last
reconfigurable unit on the reconfigurable device. It knows only
that the module loaded there is on the right side of a module
loaded to the virtual address (VSlot) ‘1’. As each application
is allowed to specify only a virtual address corresponding to
its virtual area in which to place the module, it cannot place
a module into an area belonging to a different application.
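The translation and protection step described above can be illustrated with a minimal Python sketch of the 1D case. This is not the implementation used on the ESM platform; the class names `VirtualArea` and `VirtualAreaManager`, the method names, and the use of `PermissionError` are assumptions made for illustration only. The slot layout follows the example of Fig. 2 (VA1 on physical slots 1 to 3, VA2 on slots 4 and 5).

```python
from dataclasses import dataclass


@dataclass
class VirtualArea:
    """A contiguous range of physical slots granted to one application."""
    first_physical_slot: int  # 1-based index of the leftmost physical slot
    size: int                 # number of slots currently granted


class VirtualAreaManager:
    """Hypothetical manager that maps virtual slots to physical slots."""

    def __init__(self):
        self.areas = {}  # application name -> VirtualArea

    def grant(self, app, first_physical_slot, size):
        self.areas[app] = VirtualArea(first_physical_slot, size)

    def translate(self, app, vslot):
        """Return the physical slot of a virtual slot; reject invalid ones."""
        area = self.areas[app]
        if not 1 <= vslot <= area.size:
            raise PermissionError(f"{app}: VSlot {vslot} is outside its virtual area")
        return area.first_physical_slot + vslot - 1


manager = VirtualAreaManager()
manager.grant("VA1", first_physical_slot=1, size=3)
manager.grant("VA2", first_physical_slot=4, size=2)
print(manager.translate("VA2", 2))  # prints 5, the last physical slot
```

Because an application only ever passes a virtual slot number, a request outside its own area is rejected before any physical position is computed.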
The concept is transparent to the applied intermodule com-
munication in each virtual area: Each application can choose
its preferred communication method (e.g., bus system, neigh-
bor to neighbor communication over bus macros). When only
contiguous virtual areas are used, the communicating mod-
ules are grouped together automatically and communication
overhead is minimized. Furthermore, communication between
different applications or heterogeneities of the reconfigurable
area is supported easily by extending the operating system.
Many other approaches to scheduling tasks of different
applications already exist. In contrast to them, our approach
of virtual areas combines a localization strategy, which places tasks
that belong to one application together, with highly dynamic
applications, which can individually make different module
selection and placement decisions based on the current context.
B. Advantages
The idea of virtual area management offers the following
advantages when running multiple applications on a single
dynamically partially reconfigurable device.
• Resource accounting: The concept can be easily used to
prevent applications from using too many resources at
runtime. One might want to restrict the resources granted
to an application for each execution or just in a certain
context. The goal is, for example, to reduce
the power consumption or to accelerate more important
applications in a system where certain applications have
lower priorities than others. The concept of virtual area
management provides a limitation mechanism by granting
just a certain amount of, for example, reconfigurable area
to an application.
• Resource protection: Each application controls only the
reconfiguration within its virtual area, as it can only
specify those virtual positions that are valid in its own
VA. The virtual area management unit checks for each
request if the virtual address is valid and translates it
into a physical position on the reconfigurable device. This
way, no application can load any module to the area of an-
other application and the applications are protected. Thus,
an error in one application cannot harm the execution
of all other applications and lead to a complete system
breakdown.
• Support for differing scheduling and placement proce-
dures: Different applications require different scheduling
and placement methods to increase performance. Instead
of designing complex scheduling algorithms that try to
meet all the requirements (e.g., periodic or aperiodic,
with or without deadlines) by introducing various priority
classes, each application gets its own virtual area and
specifies the best scheduling strategy. This may include
a simpler and faster implementation of new hardware
applications within an existing system.
• Area position transparency and programming dynamic
applications: The independence of the executed modules
from their absolute positions, in the following called area position
transparency, offers a new programming model. It allows
writing dynamic applications that decide on their own how to
proceed, subject to the current resource context.
Depending on the assigned area, an application can either
use an implementation that offers more performance but
needs more area, or another one that takes longer but uses
less area. Another new possibility is that the application
can decide how to increase and shrink depending on
the current area usage context. An application can ask
the virtual area management unit, if it can grow to the
left or right, or at the bottom, and, depending where
some unused area is available, it can decide to put a
partial module there and instantiate some appropriate
communication module to transfer data there and back.
Thus, at each run, an application can differ in
its amount of resource usage. Previously, trade-off decisions
were also possible for a single application, but the new
idea is to let each application decide its next steps in its own
control program, depending on the behavior of the
other applications.
C. Challenges
First, there are some technical problems to be solved. For
area position transparency, the placement of a module should
not be limited to one single position. Furthermore, there might
be heterogeneities on the reconfigurable device which require
different implementations for different positions. A possible
solution to this problem is to generate implementations for
the module for each possible position where the corresponding
virtual area can be placed. The corresponding module imple-
mentation bit-files can be compressed to save some space.
Another option is to apply a single generated module bit-file
and relocate this file; that is, adapt it to the corresponding
position. We solved this technical issue in the following way:
Our experimental board is equipped with a reconfiguration
manager device that manipulates the corresponding bit-files
before the reconfiguration of the corresponding device.
A further technical challenge lies in the communication
of the partial modules to the external periphery (e.g., video,
audio) over the input/output pins at the border of the FPGA.
Our experimental board has a crossbar device that routes I/O
signals dynamically from the periphery to the current position
of the partial module.
A larger challenge is to fulfill the changing resource re-
quirements of the different applications. Every application
has a different resource usage profile which depends on the
inputs specified. An application called with a larger problem
instance to solve also needs more resources. Additionally,
the resource usage profile also depends on the execution
context. The application may behave differently depending
on how many area resources are available at each position.
Furthermore, the resource requirement depending on the inputs
of an application may or may not be known in advance (or at
least can be estimated, e.g., in a numerical application based
on the required precision of the solutions). The former is called
the offline case, the latter the online case.
In the online case, a deadlock is possible when two applica-
tions currently running on the reconfigurable device can only
proceed with their tasks when a resource increase is granted,
see Fig. 3. The blocks represent the occupied area of one
application at different points in time. The leftmost application
needs two more area units, as does the application second
from the left. Furthermore, no application can give up its
accumulated resources, as saving and restoring states is widely
considered too expensive on FPGAs. Such a deadlock must be
prevented. A commonly used solution is to allow hardware
task preemption: If not enough resources are available to
Fig. 3. Both the first and the second application (from the left) request
more area resources in order to proceed. A deadlock occurs because no
more resources are available to meet any request.
meet increased resource demands, the state of one application
is stored in external memory. At a later point in time, the
application is loaded again onto the reconfigurable device and
its state is restored so that it can continue processing. Another
approach is not to wait until a deadlock actually occurs,
but to ensure beforehand that it cannot arise:
assuming that the maximal size of a request is known, an area
shared between two applications is always granted exclusively
to one of them and is never split so that one part goes to one
application and another part to the other.
In the following, we consider the offline problem: The
resource usage profiles for specific inputs of a series of
applications are known a priori, or can be estimated roughly.
An example of application resource profiles is given in Fig. 4.
Note that in the offline case deadlocks cannot occur: We
search for feasible solutions (i.e., schedules where no resource
requests overlap) only.
Hardware task preemption can be used to increase the
area usage. However, saving and restoring the states of the
hardware applications can be very costly in time and memory.
An approach that balances reconfiguration costs and efficient
resource usage is to allow that requests may be delayed by the
scheduler. The application keeps its currently occupied area,
but remains idle until the request is fulfilled. Compare the
two schedules for our example shown in Fig. 5: The schedule
shown on the left hand side is a solution for scheduling the
application modules without delaying requests. On the right
hand side, we allow that requests are postponed. Using this
Fig. 4. Five application modules, each with different resource demands over
time for their specific inputs.
Fig. 5. Left: Schedule of the resource demands on the reconfigurable device
under the assumption that running applications cannot wait for a resource
grant. Right: Schedule for the same demands, but with the option to delay
resource grants.
option for the fourth request of the fourth module, we achieve a
better makespan.
III. PACKING APPLICATION MODULES
We consider the problem FPGATris: Scheduling modules
whose resource requests (i.e., space on an FPGA) vary over
time. This may be, for example, a router module that needs
more resources if the traffic increases. We assume that a
module occupies a certain number of slots on the FPGA, but
requests only complete slots. Thus, we model the FPGA as
a one-dimensional array. Furthermore, we assume that time is
discrete; that is, requests are multiples of a fixed-size time slot.
Now, scheduling a sequence of modules with time-varying
resources corresponds to strip packing: The width of the strip
is the number of slots on the FPGA; the height corresponds
to the time axis. Thus, we use height and time synonymously.
Moreover, we assume that every module occupies a base slot
and extends to the left or to the right of the base slot¹.
We are allowed to delay a request; that is, we may stretch
the modules along the time axis; see Fig. 6. Our goal is to
minimize the makespan (i.e., the time needed to fulfill all
requests). For the strip-packing problem, this goal corresponds
to minimizing the height of the occupied part of the strip.
1For convenience, we consider only the case of growing either to the left or
to the right of the base slot. The generalization to both sides is straightforward.
Fig. 6. We can place the module $m_i$ at the position marked by × (the base
slot of $m_i$), if we delay the third request. That is, the second request stays
on the FPGA until the third request can be fulfilled. (Axes: slots 1 to N horizontally, time vertically.)
A. An ILP
We are given a strip of width $N$ (e.g., an FPGA with $N$
slots and a time axis) and want to place $M$ modules. Each
module, $m_i$, is given as a sequence of requests. Let $\ell_i$ denote
the length of the request sequence for module $m_i$ and $(i, j)$
the $j$th request of $m_i$. The size of $(i, j)$ is given by
$r_{ij} \in \mathbb{Z} \setminus \{0\}$, $1 \le j \le \ell_i$. For $r_{ij} > 0$ the module expands to the
right of the base slot, for $r_{ij} < 0$ to the left.
For the ILP, we introduce four kinds of variables:
• the slot assignment variables, $x_{si}$
• the time assignment variables, $y_{tij}$
• the occupancy variables, $z_{stij}$
• the usage variables, $u_t$
The first two types specify when and where a request
is scheduled. More precisely, setting $x_{si}$ to 1 indicates that
module $m_i$ is scheduled in slot $s$. Setting $y_{tij}$ to 1 indicates
that request $(i, j)$ of module $m_i$ is scheduled at time $t$. That
is, module $m_i$ occupies the following slots at time $t$:
$$s, \dots, s + r_{ij} - 1 \quad \text{for } r_{ij} > 0,$$
$$s + r_{ij} + 1, \dots, s \quad \text{for } r_{ij} < 0,$$
where $s$ is the base slot of module $m_i$. For every $i$ there is
exactly one $x_{si}$ with $x_{si} = 1$ and for every $(i, j)$ there is
exactly one $y_{tij}$ with $y_{tij} = 1$.
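The two slot ranges above can be written as a small Python helper. This is a sketch for illustration; the function name `occupied_slots` is an assumption and does not appear in the paper. Slots are numbered from 1, as in the text.

```python
def occupied_slots(base_slot, r):
    """Slots used by a request of nonzero size r anchored at base_slot."""
    if r == 0:
        raise ValueError("request size must be nonzero")
    if r > 0:
        # Expansion to the right: s, ..., s + r - 1
        return list(range(base_slot, base_slot + r))
    # Expansion to the left: s + r + 1, ..., s
    return list(range(base_slot + r + 1, base_slot + 1))


print(occupied_slots(3, 2))   # [3, 4]
print(occupied_slots(3, -2))  # [2, 3]
```

In both cases the request covers exactly |r| slots and always includes the base slot itself.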
Usually (i.e., if $|r_{ij}| > 1$) a request occupies more than
one slot when executed. Moreover, if the request $(i, j + 1)$ is
delayed, $(i, j)$ remains on the FPGA for more than one time
unit (i.e., it occupies more than one time row). To keep track
of the occupied slots, we set $z_{stij}$ to 1, if slot $s$ is occupied
by request $(i, j)$ at time $t$. The usage variables simply specify
which time steps are used.
Clearly, the FPGA’s size, N , and the number of modules,
M , strongly determine the size of an ILP and, in turn, the
time needed to solve it. In addition, we assume that an upper
bound, T , on the number of time steps is given. The closer
this bound is to the optimum, the smaller the resulting ILP.
This upper bound can be obtained, for example, using the
tabu search in Sect. III-B4.
1) Constraints:
a) Assignment Constraints: Each request must be scheduled
exactly once. That is, for every module $m_i$ we have to set
exactly one $x_{si}$ to 1 (to assign a base slot) and for every
request $(i, j)$ exactly one $y_{tij}$ to 1 (to assign a start time).
The following constraints express these conditions:
$$\sum_{s=1}^{N} x_{si} = 1 \quad \forall i = 1, \dots, M, \tag{1}$$
$$\sum_{t=1}^{T} y_{tij} = 1 \quad \forall i = 1, \dots, M,\ j = 1, \dots, \ell_i. \tag{2}$$
b) Boundary Constraints: Next, we ensure that a request
does not exceed the FPGA’s boundary by forcing all slot
assignment variables that would cause an infeasible placement
to be zero.
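$$\forall i, j,\ s = s_{\mathrm{low}}, \dots, s_{\mathrm{up}}: \quad x_{si} = 0 \tag{3}$$
where
$$s_{\mathrm{low}} := \begin{cases} N - r_{ij} + 2, & r_{ij} > 0 \\ 1, & r_{ij} < 0 \end{cases}
\quad \text{and} \quad
s_{\mathrm{up}} := \begin{cases} N, & r_{ij} > 0 \\ -r_{ij} - 1, & r_{ij} < 0 \end{cases}.$$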
c) Order Constraints: Now, we ensure that the processing
order is maintained; that is, request $(i, j)$ of $m_i$ is not
scheduled before request $(i, j - 1)$ is finished. For every $(i, j)$
there is exactly one $t$ such that $y_{tij} = 1$. Thus, summing up
$t \cdot y_{tij}$ over $t$ for fixed $i$ and $j$ yields the time step where
request $(i, j)$ is scheduled. This yields:
$$\sum_{t=1}^{T} t\, y_{tij} - \sum_{t=1}^{T} t\, y_{ti(j-1)} > 0 \quad \forall i,\ j > 1. \tag{4}$$
d) Occupancy Constraints: If $x_{si} = 1$ and $y_{tij} = 1$,
the request $(i, j)$ occupies $|r_{ij}|$ slots adjacent to $s$ at time
$t$; see Fig. 7. The first step to prevent other modules from
overlapping with $m_i$ is to set the appropriate occupancy
variables as follows.
$$\forall i = 1, \dots, M,\ j = 1, \dots, \ell_i,\ s = 1, \dots, N,\ t = 1, \dots, T,\ s' = s_{\mathrm{low}}, \dots, s_{\mathrm{up}}: \quad x_{si} + y_{tij} - z_{s'tij} \le 1, \tag{5}$$
with
$$s_{\mathrm{low}} := \begin{cases} s, & r_{ij} > 0 \\ \max\{1, s + r_{ij} + 1\}, & r_{ij} < 0 \end{cases}
\quad \text{and} \quad
s_{\mathrm{up}} := \begin{cases} \min\{N, s + r_{ij} - 1\}, & r_{ij} > 0 \\ s, & r_{ij} < 0 \end{cases}.$$
Fig. 7. Occupancy Constraints: If $x_{si} = 1$ and $y_{tij} = 1$, the request $(i, j)$
occupies $|r_{ij}|$ slots to the left or right of $s$, depending on $\operatorname{sgn}(r_{ij})$. For $r_{ij} > 0$ we have $s_{\mathrm{low}} = s$; for $r_{ij} < 0$ we have $s = s_{\mathrm{up}}$.
e) Exclusive Constraints: By setting the appropriate occupancy
variables with the occupancy constraints, we can
ensure that requests do not overlap. We allow at most one
occupancy variable for a fixed slot and a fixed time to be 1.
$$\forall t = 1, \dots, T,\ s = 1, \dots, N: \quad \sum_{i=1}^{M} \sum_{j=1}^{\ell_i} z_{stij} \le 1 \tag{6}$$
f) Delay Constraints: If a request $(i, j + 1)$ is delayed, the
preceding request $(i, j)$ remains on the FPGA until $(i, j + 1)$
is scheduled. Thus, if $z_{stij} = 1$, either $z_{s(t+1)ij}$ must be
set to 1 (because the module still occupies space on the
FPGA) or $(i, j + 1)$ is scheduled at time $t + 1$; that is,
$y_{(t+1)i(j+1)} = 1$ holds. The following constraints keep track
of delayed requests.
$$\forall i = 1, \dots, M,\ j = 1, \dots, \ell_i - 1,\ s = 1, \dots, N,\ t = 1, \dots, T - 1: \quad z_{stij} - z_{s(t+1)ij} - y_{(t+1)i(j+1)} \le 0 \tag{7}$$
g) Usage Constraints: Finally, we introduce some constraints
that define our usage variables. Let $u_t$ be 1, if at least
one $y_{tij}$ is 1 or if $u_{t+1}$ is 1.
$$\forall t = 1, \dots, T,\ i = 1, \dots, M,\ j = 1, \dots, \ell_i: \quad u_t - y_{tij} \ge 0 \tag{8}$$
and
$$\forall t = 2, \dots, T: \quad u_{t-1} - u_t \ge 0. \tag{9}$$
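2) Objective Function: To minimize the makespan, we use
the following ILP:
\[
\min \sum_{t=1}^{T} t\, u_t \quad \text{subject to Eq. (1)--(9)} \tag{10}
\]
\[
\begin{aligned}
x_{si} &\in \{0, 1\} \\
y_{tij} &\in \{0, 1\} \\
z_{stij} &\in \{0, 1\} \\
u_t &\in \{0, 1\} .
\end{aligned}
\]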
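The following short calculation, added as an illustration, shows why minimizing (10) pushes the usage variables down to the latest scheduled time step. The symbol $k$ (the number of usage variables set to 1) is introduced only for this example. By (9), the sequence $u_1, \dots, u_T$ is nonincreasing, so it has the form $u_1 = \dots = u_k = 1$ and $u_{k+1} = \dots = u_T = 0$. Hence

\[
\sum_{t=1}^{T} t\, u_t = \sum_{t=1}^{k} t = \frac{k(k+1)}{2},
\qquad\text{where, by (8),}\qquad
k \ge \max\{\, t : y_{tij} = 1 \text{ for some } i, j \,\}.
\]

Since $k(k+1)/2$ is strictly increasing in $k$, an optimal solution chooses $k$ equal to the latest time step at which any request is scheduled, and among all feasible schedules it prefers one in which this time step is as early as possible.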
B. Heuristic Methods
We implemented several heuristics for our problem: a simple
FirstFit with and without delaying requests, and two more
elaborate heuristics, BestFit and TabuSearch. Our methods
pack the given modules into a semi-infinite strip. The width of
the strip is given by the number of slots on the FPGA; the
height of the strip corresponds to the time axis.
1) FirstFit: Probably the simplest heuristic is to place the
modules, one by one, in a first-fit way into the strip: beginning
with s = 1 and t = 1, we test for every position if the module
that must be placed overlaps with already-placed modules. We
choose the first position where no overlap occurs. Note that
we disregard the possibility of delaying requests.
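A minimal Python sketch of FirstFit without delays is given below. It assumes that request $j$ of a module occupies exactly one time step directly after request $j-1$, that positions are scanned by increasing time and then by increasing slot, and that request sizes are signed as in the occupancy constraints; the function names and the scan order are assumptions, not taken from our implementation.

```python
def cells(module, s, t, n_slots):
    """Cells (slot, time) of a module anchored at slot s, starting at time t.

    Returns None if some request would leave the strip 1..n_slots.
    """
    used = set()
    for j, r in enumerate(module):
        if r == 0:
            raise ValueError("request sizes must be nonzero")
        low, up = (s, s + r - 1) if r > 0 else (s + r + 1, s)
        if low < 1 or up > n_slots:
            return None
        used.update((slot, t + j) for slot in range(low, up + 1))
    return used


def first_fit(modules, n_slots):
    """Place modules one by one at the first non-overlapping position."""
    occupied, placements = set(), []
    for module in modules:
        # Guard against modules that do not fit even into an empty strip.
        if all(cells(module, s, 1, n_slots) is None for s in range(1, n_slots + 1)):
            raise ValueError("module does not fit into the strip")
        t, placed = 1, None
        while placed is None:
            for s in range(1, n_slots + 1):
                used = cells(module, s, t, n_slots)
                if used is not None and not (used & occupied):
                    placed = (s, t, used)
                    break
            t += 1
        s, start, used = placed
        occupied |= used
        placements.append((s, start))
    makespan = max((time for _, time in occupied), default=0)
    return placements, makespan


print(first_fit([[2, 2], [3], [2]], n_slots=4))
# ([(1, 1), (1, 3), (3, 1)], 3)
```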
S = (1, . . . , M)
for i = 0 to M/2
    found = 0
    for j = 1 to M
        if (j, ((j + i) mod M) + 1) are not in the tabu list
            Swap items at pos. j and ((j + i) mod M) + 1 in S
            Calculate makespan of BestFit
            if makespan is the best so far then
                found = j
            Undo swapping
    if found > 0 then
        Swap pos. found and ((found + i) mod M) + 1 in S
        Store (found, ((found + i) mod M) + 1) in the tabu list
Fig. 8. Tabu Search
2) FirstFit with delays: This method works the same as the
method above, but allows the delaying of requests. That is, if
for a certain start position the requests 0, . . . , j− 1 fit into the
strip without overlap, but request j does not fit in time step t′,
we search for the largest j′ < j such that request (i, j′) fits
in t′ and delay every request j′′ = j′ + 1, . . . (i.e., we move
them upwards in the strip); see Fig. 6.
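The effect of a delay on the strip can be illustrated with a small sketch: a request stays on the FPGA until its successor is scheduled, so a delayed successor stretches its predecessor upwards. The function below is hypothetical and assumes that the last request of a module occupies exactly one time step.

```python
def rows_with_delays(sizes, start_times):
    """Map each time step to the request size occupying it.

    sizes[j] is the size of request j and start_times[j] the time step at
    which it is scheduled; start times must be strictly increasing.
    """
    rows = {}
    for j, (size, start) in enumerate(zip(sizes, start_times)):
        # A request is held until its successor starts (last one: one step).
        end = start_times[j + 1] if j + 1 < len(sizes) else start + 1
        if end <= start:
            raise ValueError("start times must be strictly increasing")
        for time in range(start, end):
            rows[time] = size
    return rows


# The third request is delayed by two steps, so the second one is held longer.
print(rows_with_delays([2, 3, 1], [1, 2, 5]))
# {1: 2, 2: 3, 3: 3, 4: 3, 5: 1}
```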
3) BestFit: Similar to FirstFit with delays, we try to find
a nonoverlapping position by testing every possible position.
However, instead of choosing the first feasible position, we
evaluate every position as follows: We separately count the
unoccupied cells to the left and to the right of the placed module
and take the minimum of these two values as the score for the given
position. For example, for the placement of $m_i$ in Fig. 6 there
are 4 unoccupied cells to the left of $m_i$ and 14 unoccupied cells
to the right of $m_i$, yielding a value of 4 for this placement. We
choose the position that yields the minimal score and break ties by
preferring the (first) position with the least number of delays.
To prevent every module from being placed at the left or right
side of the strip (yielding a score of 0), we maintain an upper
limit, $t_{\max}$, for the time. Before we place a new module, $m_i$,
we increase $t_{\max}$ by $\ell_i/2$ and try to place $m_i$ within the
given time bound. If this is not possible, we increase $t_{\max}$ by
$\ell_i$ and try again.
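The scoring step of BestFit can be sketched as follows. The text does not fix which cells count as "left" and "right" of a module; this sketch assumes that, for every time step the module occupies, all unoccupied cells of that row between the module and the respective strip border are counted. The function name and data layout are likewise assumptions.

```python
def best_fit_score(module_cells, occupied, n_slots):
    """Score a tentative placement: min(free cells left, free cells right).

    module_cells and occupied are sets of (slot, time) pairs; slots run
    from 1 to n_slots.
    """
    extent = {}  # time -> (leftmost slot, rightmost slot) of the module
    for slot, time in module_cells:
        low, up = extent.get(time, (slot, slot))
        extent[time] = (min(low, slot), max(up, slot))
    left = right = 0
    for time, (low, up) in extent.items():
        left += sum((slot, time) not in occupied for slot in range(1, low))
        right += sum((slot, time) not in occupied
                     for slot in range(up + 1, n_slots + 1))
    return min(left, right)


print(best_fit_score({(3, 1), (4, 1), (3, 2)}, {(1, 1)}, n_slots=6))  # 3
print(best_fit_score({(1, 1), (2, 1)}, set(), n_slots=6))             # 0
```

In the first call there are 3 free cells to the left and 5 to the right; in the second the module touches the left border, which gives the score 0 mentioned above.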
4) TabuSearch: BestFit inserts the modules in the given
order (i.e., $m_1, m_2, \dots, m_M$). The result of BestFit
depends strongly on the insertion order, so we may get a better
result if we permute the insertion order of the modules. Thus,
we use a tabu search to try several BestFit runs, each one using
a different order for the insertion of modules. Starting with the
sequence $S = (1, \dots, M)$, we swap two items of the sequence
and compute the makespan that is achieved by BestFit. More
precisely, we maintain a swapping distance, $i$, ranging from
$i = 0$ to $M/2$. For a fixed $i$, we swap the items at positions $j$
and $((j + i) \bmod M) + 1$ for $j = 1, \dots, M$, keeping track of
the best makespan achieved so far. For each $i$, we accept the swap
(if any) that achieves the best makespan known so far. A tabu list
ensures that we do not swap an already accepted pair again; see Fig. 8.
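The loop of Fig. 8 can be sketched in Python as follows. Positions are 0-based here, so the partner of position `j` is `(j + i + 1) % M`. The callable `makespan_of`, which stands in for a BestFit run on a given insertion order, is a hypothetical parameter, and initializing the best makespan with the value of the starting order is an assumption.

```python
def tabu_search(n_modules, makespan_of):
    """Search insertion orders by pairwise swaps, following Fig. 8."""
    order = list(range(n_modules))
    tabu = set()
    best = makespan_of(order)
    for i in range(n_modules // 2 + 1):           # swapping distance
        found = None
        for j in range(n_modules):
            k = (j + i + 1) % n_modules
            if (j, k) in tabu:
                continue
            order[j], order[k] = order[k], order[j]   # tentative swap
            value = makespan_of(order)
            if value < best:
                best, found = value, (j, k)
            order[j], order[k] = order[k], order[j]   # undo swapping
        if found is not None:
            j, k = found
            order[j], order[k] = order[k], order[j]   # accept the best swap
            tabu.add(found)
    return order, best


# Toy objective instead of BestFit: position-weighted sum of the sequence.
def toy_cost(order):
    return sum(pos * item for pos, item in enumerate(order))


print(tabu_search(4, toy_cost))  # ([3, 2, 1, 0], 4)
```

Each candidate swap triggers one full evaluation of `makespan_of`, so one pass over all swapping distances costs roughly $M^2/2$ evaluations.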
C. Experimental Results
An example instance and solutions are shown in Fig. 9. The
corresponding ILP was solved in approximately 6 hours on
an Intel(R) Xeon(TM) 3.20 GHz CPU running ILOG CPLEX
10.00 under Linux. Note that TabuSearch yields the same
result in less than one second.
[Figure 9: five panels showing the modules $m_1$, $m_2$, $m_3$, and $m_4$ as input and as packed by each method.]
Fig. 9. From left to right: Example input and packings generated by FirstFit without delays, FirstFit with delays, BestFit, and TabuSearch/ILP.
Delayed modules are shown hatched. Note that $m_3$ is packed by BestFit on top of $m_1$, because this position has value 0 and fits into the strip of height
$t_{\max} + \ell_3/2$. TabuSearch swaps $m_3$ and $m_4$ in the insertion order.
To test our heuristics, we conducted a set of experiments.
For upper limits on the size of a request, $r_{\max}$, ranging from
10% to 90% in steps of 10%, we randomly generated sequences
of 20 modules each. For each value of $r_{\max}$, we generated 20
sequences as follows: For every module, we chose its length,
$\ell_i$, randomly from a normal distribution with expected value
$\mu = 10$ and variance $\sigma^2 = 5$. The size of every request, $r_{ij}$,
was also drawn from a normal distribution, with expected value
$\mu = r_{\max}/2$ and variance $\sigma^2 = r_{\max}/4$. We present the results
for $N = 50$; other FPGA widths showed similar results.
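A hedged sketch of such an instance generator is shown below. The text does not specify how the normally distributed samples were turned into valid integers, so rounding, clamping lengths to at least 1, clamping sizes to the range from 1 to $r_{\max}$, interpreting $r_{\max}$ as a percentage of $N$ converted to slots, and using positive sizes only are all assumptions of this example, as is the function name.

```python
import math
import random


def random_instance(r_max_percent, n_slots=50, n_modules=20, seed=None):
    """Generate one sequence of modules; each module is a list of request sizes."""
    rng = random.Random(seed)
    r_max = r_max_percent * n_slots / 100          # upper limit in slots
    cap = max(1, math.floor(r_max))
    modules = []
    for _ in range(n_modules):
        # random.gauss expects the standard deviation, not the variance.
        length = max(1, round(rng.gauss(10, math.sqrt(5))))
        sizes = [
            min(cap, max(1, round(rng.gauss(r_max / 2, math.sqrt(r_max / 4)))))
            for _ in range(length)
        ]
        modules.append(sizes)
    return modules


instance = random_instance(30, seed=1)
print(len(instance), max(max(module) for module in instance) <= 15)
```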
Heuristic         Average running time
FirstFit          0.25 s
FF with delays    0.33 s
BestFit           2.09 s
Tabu Search       1125.06 s

TABLE I
AVERAGE RUNNING TIME FOR OUR HEURISTICS.
Table I shows the average running time over all experiments, and
Fig. 10 shows the mean value over 20 runs for every heuristic
and value of $r_{\max}$. Fig. 11 shows the mean, maximal, and
minimal values for BestFit and TabuSearch. For comparison,
Fig. 10 also shows an average lower bound computed as the
smallest area needed to pack all requests, divided by the number
of slots; that is,
\[
\mathrm{LB} = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{\ell_i} r_{ij} .
\]
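The lower bound is a one-line computation. The sketch below uses the absolute value of each request size, which is an assumption made so that the function also works with signed sizes; for the positive sizes used in the experiments it agrees with the formula above.

```python
def lower_bound(modules, n_slots):
    """Total requested area divided by the strip width N."""
    total_area = sum(abs(r) for module in modules for r in module)
    return total_area / n_slots


print(lower_bound([[2, 2], [3], [2]], n_slots=4))  # 2.25
```

Since the makespan is an integer number of time steps, rounding the value up with `math.ceil` would give a slightly stronger bound for a single instance.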
Choosing the best-suited strategy depends on the scenario. For
systems that run the same request sequence over and over and that
are produced in large numbers, it may even pay off to use
the ILP. When balancing computation time against solution quality,
the tabu search is the better choice. Nevertheless, it requires that the
requests are known beforehand. If this is not the case, BestFit can
be used, because it works in an offline scenario as well as in
an online setting.
IV. CONCLUSION
Reconfigurable devices increasingly offer enough computing
resources that, in the near future, multiple applications
may be executed on them concurrently instead of just a single
application. However, there has so far been a general lack of
research on how to achieve secure and flexible
execution of multiple applications on dynamically partially
reconfigurable devices. We present an approach called virtual
area management, which is heavily influenced by multitasking
and operating system concepts from the software world. We explain
the advantages of our concept (e.g., support for accounting of
resources, resource protection, multiple scheduling and placement
strategies, and a new programming model) and discuss the challenges
it poses together with possible solutions. Afterwards, a specific
approach to virtual area management in the offline case is presented
in more detail. It is based on the assumption that most applications
can also handle a delayed resource grant. The corresponding
optimization problem of minimizing the total makespan is solved
with an ILP and with heuristics. Both approaches are compared in an
experimental study.
The presented concept is a practicable solution to the
important problem considered here; the proposed methods can be
applied to reconfigurable devices available today. We do
not rely on the assumption that saving and storing hardware
task states will no longer be considered too expensive in the
future. Furthermore, programming models for reconfigurable
architectures, resource accounting and protection, and bitstream
relocation and position independence are all formidable research
problems in their own right; however, the concept is compatible
with, and extendable to, different solution approaches to these problems.
[Figure 10: plot of the height of packing versus the maximal request size (%), with curves for the lower bound, FirstFit, FF with delays, BestFit, and Tabu Search.]
Fig. 10. A comparison of our heuristics (FirstFit without delays, FirstFit
with delays, BestFit, and TabuSearch) in settings with different densities
(i.e., maximal value for a request), averaged over 100 runs per density. For
comparison, a lower bound (ratio of total area by number of slots) is shown.
[Figure 11: plot of the height of packing versus the maximal request size (%), with curves for Tabu Search and BestFit.]
Fig. 11. Mean, maximal, and minimal values, averaged over 100 runs per
density, for BestFit and TabuSearch.
