Reducing the dynamic FPGA reconfiguration overhead by Degryse, Tom et al.
Tom Degryse∗,1, Karel Bruneel∗,1,
Harald Devos∗,1, Dirk Stroobandt∗,1
∗ ELIS, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
ABSTRACT
Dynamic hardware generation reduces the number of FPGA resources needed and speeds up the
application by optimizing the configuration for the exact problem at hand at run-time. If the prob-
lem changes, the system needs to be reconfigured. When this occurs too often, the total recon-
figuration overhead is too high and the benefit of using dynamic hardware generation vanishes.
Hence, it is important to minimize the number of reconfigurations.
We propose a novel technique to reduce the number of reconfigurations by using loop
transformations. Our approach is similar to temporal data locality optimization. By applying our
technique, we can drastically reduce the number of reconfigurations, as indicated by the matrix
multiplication example. After applying the loop transformations, the number of reconfigurations
decreases by an order of magnitude. Combined with a dynamic hardware generation technique
with a very low overhead, our technique obtains a significant speedup over generic circuits.
KEYWORDS: FPGA; dynamic hardware generation; loop transformations
1 Introduction
One of the major advantages of FPGAs (Field Programmable Gate Arrays) is their
reconfigurability. However, the majority of today's FPGA-based systems do not exploit this benefit
or only exploit it on a very large time scale. In a traditional system, the FPGA loads its con-
figuration bits from a local memory when the system starts up. Once the system has booted,
the functionality of the FPGA remains the same during the entire run time of the system.
The FPGA is only reconfigured after a firmware upgrade, but this occurs very infrequently.
Dynamic hardware generation uses this reconfigurability on a much shorter time scale
by exploiting run-time knowledge of the exact problem at hand. At run-time, a specialized
hardware circuit is generated, which is substantially smaller and faster than the generic
counterpart. If the problem at hand changes, the dynamic hardware generation tool creates a
new specialized FPGA configuration and reconfigures the FPGA. When the problem at hand
does not change too often, the total execution time, including the reconfiguration overhead,
is less than the execution time of the generic circuit.
1 E-mail: {Tom.Degryse, Karel.Bruneel, Harald.Devos, Dirk.Stroobandt}@elis.UGent.be
In this work, we make use of so-called parameterizable configurations [BS08]. These are
FPGA configurations, generated off-line, in which some of the configuration bits are ex-
pressed as closed form Boolean functions, the tuning functions, of some of the inputs, called
the parameters. On-line specialization of a parameterizable configuration means evaluating
these tuning functions. This technique has a substantially lower on-line hardware genera-
tion overhead than conventional hardware generation methods, which makes it much more
useful in practice.
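As a purely illustrative sketch of what on-line specialization involves, the Python fragment below models a parameterizable configuration as a list of static bits and tuning functions. The names `TEMPLATE` and `specialize` and the three example functions are hypothetical and are not taken from [BS08]; a real parameterizable configuration contains far more bits.

```python
from typing import Callable, List, Sequence, Union

# A configuration bit is either a constant (0 or 1) or a tuning function,
# i.e. a closed-form Boolean function of the parameter bits.
TuningFunction = Callable[[Sequence[int]], int]
ConfigBit = Union[int, TuningFunction]

# Hypothetical template with two parameter bits p[0] and p[1].
TEMPLATE: List[ConfigBit] = [
    1,                      # static bit, fixed off-line
    lambda p: p[0] & p[1],  # tuning function: AND of both parameters
    0,                      # static bit, fixed off-line
    lambda p: p[0] ^ p[1],  # tuning function: XOR of both parameters
    lambda p: 1 - p[0],     # tuning function: NOT p[0]
]


def specialize(template: List[ConfigBit], params: Sequence[int]) -> List[int]:
    """Evaluate every tuning function for the given parameter values."""
    return [bit(params) if callable(bit) else bit for bit in template]
```

Specialization in this model is a single pass over the template, which is the kind of cheap evaluation step the text describes, in contrast to running a full hardware generation flow on-line.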
Because of the overhead of hardware generation and reconfiguration, it is important
to minimize the number of reconfiguration steps during program execution. This can
be done by maximizing the reuse of the input parameter values, which is very similar to
optimizing temporal data locality in a system with a memory hierarchy. Applying loop
transformations is a well-known technique to optimize data locality,
both in hardware and software. In the next section, we investigate how temporal
data locality optimizations can be used to minimize the number of reconfigurations.
Section 3 continues with the matrix multiplication example we used to evaluate our
technique. This example shows that our technique drastically reduces the number of
reconfigurations and hence significantly reduces the total execution time. By combining our
technique with parameterizable configurations, we achieve a theoretical speedup of more than
37% over the generic counterpart for large matrices. Our work thus improves the
usefulness of dynamic hardware generation techniques. We end this paper with some concluding
remarks in Section 4.
2 Minimizing the number of reconfigurations
Our target, minimizing the number of reconfigurations or, equivalently, maximizing
hardware reuse, is very similar to maximizing the reuse of data elements in a local
memory. Hence, the same techniques that are used to optimize temporal data locality can
be used to reduce the number of reconfigurations. Loop transformations are a well-known
set of transformations used to optimize both spatial and temporal data locality and to introduce
parallelism.
However, there are some differences between temporal data locality optimization
and the minimization of the number of reconfigurations. First of all, there is no
configuration cache in an FPGA. So, if we want to reuse a certain parameter value, the different uses
of this value should be placed right after each other in time; in other words, the reuse
distance must be equal to one. This situation could be improved by using an FPGA architecture
with multiple configuration memories.
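The effect of the missing configuration cache can be sketched with a small counting model. The function below is an illustration, not part of the original work: it counts how often a configuration must be loaded for a given sequence of parameter values. `contexts=1` models an FPGA without configuration cache, larger values model a hypothetical device with several configuration memories, and the LRU replacement policy is an assumption made only for this sketch.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def count_reconfigurations(param_trace: Iterable[Hashable], contexts: int = 1) -> int:
    """Count how often a new configuration must be generated and loaded.

    param_trace: the parameter value (or tuple of values) used per iteration.
    contexts:    number of configuration memories (1 = no configuration cache).
    """
    resident: "OrderedDict[Hashable, None]" = OrderedDict()
    misses = 0
    for value in param_trace:
        if value in resident:
            resident.move_to_end(value)   # reuse: no reconfiguration needed
            continue
        misses += 1                       # miss: generate and load a configuration
        resident[value] = None
        if len(resident) > contexts:
            resident.popitem(last=False)  # evict the least recently used one
    return misses
```

With a single context, the trace `a, b, a, b` costs four reconfigurations, whereas the reordered trace `a, a, b, b` costs only two. This reordering is what the loop transformations aim for.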
In addition, there is a different cost function associated with a miss. First of all, after
a parameter has changed, a new configuration must be generated. The cost of this operation
depends on the number of Boolean functions that must be evaluated and on their complexity.
Both aspects can differ for every parametric input, but are
known once the parameterizable circuit has been generated off-line.
After the new configuration has been generated, the FPGA must be reconfigured. There
is a fixed cost that comes with every reconfiguration, due to, for instance, the overhead to stop
and restart the calculation. The cost to transfer the configuration bits to the FPGA depends
heavily on the FPGA architecture and on the available bandwidth to the configuration
memory. In an FPGA with a RAM-like configuration memory, where every configuration bit can
be addressed individually, the configuration cost is equal to the total number of bits divided
by the configuration memory bandwidth.
for i′ = 1 ... N by T
  for j = 1 ... N by B/8
    for k = 1 ... N by 8
      for i = i′ ... min(i′ + T − 1, N)
        tmp_{1,1} = A_{i,k} × X_{k,j}
        ...
        tmp_{1,8} = A_{i,k+7} × X_{k+7,j}
        Y_{i,j} = Y_{i,j} + tmp_{1,1} + ... + tmp_{1,8}
        ...
Figure 1: Matrix multiplication after loop transformations. There are (N/T) × (N/B) × N
reconfigurations and (B/8) × T intermediate values must be stored on chip.
Current Xilinx FPGAs support partial reconfiguration, but the finest level of
granularity is the frame. To calculate the reconfiguration overhead, we must therefore count the
number of frames that change. Altera does not support partial reconfiguration at the moment, so
every time the FPGA needs to be reconfigured, the whole bitfile must be transferred to the
FPGA. In this case, the reconfiguration overhead is fixed.
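The cost components discussed in this section can be combined into a simple model. The sketch below is only one possible reading of the text: all function and parameter names are hypothetical, the linear cost of evaluating the Boolean functions is an assumption, and for the bit-addressable case "the number of bits" is interpreted as the bits that have to be written.

```python
def bits_written(architecture: str, changed_bits: int, changed_frames: int,
                 frame_bits: int, bitfile_bits: int) -> int:
    """Number of configuration bits transferred for one reconfiguration."""
    if architecture == "bit-addressable":  # RAM-like configuration memory
        return changed_bits
    if architecture == "frame-based":      # partial reconfiguration per frame
        return changed_frames * frame_bits
    if architecture == "full":             # no partial reconfiguration
        return bitfile_bits
    raise ValueError(f"unknown architecture: {architecture}")


def generation_time(num_functions: int, seconds_per_function: float) -> float:
    """Time to evaluate the Boolean functions affected by a parameter change
    (assumes the same average evaluation time for every function)."""
    return num_functions * seconds_per_function


def reconfiguration_time(fixed_s: float, generation_s: float,
                         bits: int, bandwidth_bits_per_s: float) -> float:
    """Fixed stop/restart cost + generation cost + transfer cost."""
    return fixed_s + generation_s + bits / bandwidth_bits_per_s
```

In the "full" case the transfer term does not depend on which parameter changed, which corresponds to the fixed reconfiguration overhead mentioned for devices without partial reconfiguration.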
3 Example
To illustrate our approach, we use matrix multiplication as an example. Matrix
operations serve as basic building blocks for many numerical linear algebra applications that
are heavily used in scientific computing, including the solution of linear systems of
equations, linear least-squares problems and eigenvalue problems. Speeding up these operations
thus accelerates a large class of algorithms found in scientific computing.
Fig. 1 depicts the matrix multiplication after loop transformations. We first partially
unrolled the k-loop by a factor of 8, which is determined by the bandwidth to the external
memory. This transformation introduces a data dependency between the multiplications
and the additions. To overcome this, we used software pipelining on this loop (not shown in
Fig. 1). To use all B processing units, we partially unrolled the j-loop by a factor of B/8, so that
B/8 columns are calculated in parallel. On our target FPGA, the Xilinx Virtex 5 XC5VFX30T, B is
equal to 32 and 64 for the generic and the parameterizable circuit respectively. To maximize
configuration reuse, we interchanged the loops so that the i-loop becomes the innermost
loop. As a result, there are (N/B) × N reconfigurations
and (B/8) × N intermediate values must be stored.
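The interchanged loop nest (before tiling) can be mimicked in software to check the reconfiguration count. The Python sketch below is a sequential model, not the hardware implementation. It assumes that the elements of X used in one (j, k) block act as the parameters, so that one reconfiguration is needed per block; this assumption is consistent with the count of (N/B) × N but is not stated explicitly in the text. Indices are zero-based, unlike in Fig. 1.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def matmul_transformed(A: Matrix, X: Matrix, B: int) -> Tuple[Matrix, int]:
    """Compute Y = A * X with the i-loop innermost (no tiling).

    Returns Y and the number of reconfigurations, assuming one
    reconfiguration per (j, k) block. N must be a multiple of 8 and of
    B // 8, and B must be a multiple of 8.
    """
    n = len(A)
    cols = B // 8                      # columns of Y computed in parallel
    Y = [[0] * n for _ in range(n)]
    reconfigurations = 0
    for j in range(0, n, cols):
        for k in range(0, n, 8):
            # The B values X[k..k+7][j..j+cols-1] are fixed for this block.
            reconfigurations += 1
            for i in range(n):         # innermost loop reuses the configuration
                for jj in range(j, j + cols):
                    Y[i][jj] += sum(A[i][kk] * X[kk][jj]
                                    for kk in range(k, k + 8))
    return Y, reconfigurations
```

For N = 16 and B = 16 the function reports 16 reconfigurations, which equals (N/B) × N. The partial sums accumulated in Y across the k-loop correspond to the intermediate values that must be stored.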
Table 1: Theoretical speedup of the parameterizable circuit over the generic counterpart.
N       Generic circuit   Parameterizable circuit   Speedup
                          Reconfig.    Total
512     0.03 s            0.03 s       0.04 s       −52.10%
1024    0.22 s            0.11 s       0.23 s       −1.05%
2048    1.79 s            0.46 s       1.35 s       24.48%
4096    14.32 s           1.83 s       8.99 s       37.24%
8192    114.53 s          14.62 s      71.88 s      37.24%
16384   916.26 s          116.94 s     575.07 s     37.24%
If N is too large, there is not enough on-chip memory to store all columns. As a solution,
we tiled the i-loop by a factor T, so that the memory usage goes down to (B/8) × T.
The value of T is determined by the size of the FPGA. In our case, T = 8192 and T = 4096
for the generic and the parameterizable circuit respectively. As a drawback, the number of
reconfigurations increases by a factor N/T. Our design makes optimal use of the available
FPGA area, I/O bandwidth and on-chip memory and can easily be scaled to larger devices.
In our experiments we used different matrix sizes N, ranging from 512 to 16384. The
generic and the parameterizable circuits have almost the same maximum clock frequency of
about 150 MHz. For each N, we calculated the execution times for both the generic and the
parameterizable circuit and the reconfiguration overhead (Table 1). The speedup achieved
by using parameterizable configurations depends heavily on the size of the matrices. If N is
too small, there is not enough hardware reuse to compensate for the reconfiguration
overhead. If N ≥ 4096, the speedup is constant (37.24%): from that size on, the loop tiling makes
the number of reconfigurations, (N/T) × (N/B) × N, grow with N³, just like the computation
time. In the unoptimized version of the algorithm, there
are N³/B reconfigurations. As a result, the total reconfiguration time is higher than the
execution time of the generic circuit, so this solution is infeasible.
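The trend in Table 1 can be approximated with a short back-of-the-envelope model. The sketch below rests on assumptions that are not stated in the text: each of the B processing units performs one multiply-accumulate per cycle at 150 MHz (which reproduces the generic column of Table 1), and one reconfiguration takes about 7 µs, a rough value back-calculated from the reconfiguration column rather than a measured figure. All names are hypothetical.

```python
import math


def compute_time(n: int, b: int, f_hz: float = 150e6) -> float:
    """Time for N^3 multiply-accumulates on B units, one per cycle each."""
    return n ** 3 / (b * f_hz)


def reconfigurations(n: int, b: int, t: int) -> int:
    """ceil(N/T) * (N/B) * N reconfigurations for the tiled loop nest."""
    return math.ceil(n / t) * (n * n // b)


def speedup(n: int, t_reconf_s: float = 7e-6) -> float:
    """Relative gain of the parameterizable circuit (B = 64, T = 4096)
    over the generic circuit (B = 32), reconfiguration included."""
    generic = compute_time(n, 32)
    param = compute_time(n, 64) + reconfigurations(n, 64, 4096) * t_reconf_s
    return 1.0 - param / generic
```

Under these assumptions the model gives a negative speedup for N = 512 and a constant speedup of roughly 37% for N = 4096 and above, in line with Table 1. With N³/B reconfigurations instead, the reconfiguration time alone would far exceed the execution time of the generic circuit.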
4 Conclusions
We have presented a technique to reduce the number of hardware generation and FPGA
reconfiguration steps in a dynamic hardware generation environment by means of loop
transformations. Our technique is very similar to temporal data locality optimization. The
matrix multiplication example shows that the number of reconfigurations can be drastically
reduced by transforming the loop nests in a program. This speeds up the application and
thus further improves the usefulness of dynamic hardware generation techniques. When
we combine our approach with a dynamic hardware generation technique with a very low
overhead, we can obtain a speedup of more than 37% over generic circuits for large matrices.
References
[BS08] Karel Bruneel and Dirk Stroobandt. Automatic generation of run-time
parameterizable configurations. FPL, accepted for publication, 2008.
