A Method for Fast Hardware Specialization at Run-time by Bruneel, Karel et al.
A METHOD FOR FAST HARDWARE SPECIALIZATION AT RUN-TIME
Karel Bruneel∗, Peter Bertels†, and Dirk Stroobandt
Department of Electronics and Information Systems
Ghent University
Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
{kbruneel, pbertels, dstrooba}@elis.ugent.be
ABSTRACT
Dynamic hardware generation is a powerful technique that
can substantially reduce both the required hardware resour-
ces and the time needed to perform a calculation, reﬂected
in an improved functional density. This performance im-
provement is a result of additional run-time optimizations
enabled by the knowledge of values at certain inputs at run-
time. However, due to the large overhead conventional hard-
ware generation tools incur, the usability of dynamic hard-
ware generation is limited. We present a dual approach that
combines compile-time generation of generic hardware and
run-time specialization. This drastically decreases the dy-
namic generation overhead. Our approach is used for dy-
namic generation of FIR ﬁlters and compared to a static and
a conventional dynamic implementation. The experiments
clearly show that the dual approach improves the usability
of dynamic hardware generation.
1. INTRODUCTION
During the design of a hardware component, designers aim at
an optimal implementation that takes into account the
specific conditions of the application. While a broad range
of optimizations is possible at design time, many more
emerge only at run-time, when more information about the
specific application becomes available. One of these
optimizations is constant propagation of specific parameters
and/or inputs, which leads to a more efficient use of the
Field Programmable Gate Array (FPGA) fabric.
In this paper we use the design of the FIR filter shown
in figure 1 as a running example. At design time the number
of filter taps is known, as well as the word length of its
coefficients. The word length of the input data is also a
design constraint. This specification leads to an
implementation using generic multipliers. If the designer
also knew the exact values of the coefficients, these values
could be propagated into the design, leading to a specific
implementation which is more compact and possibly faster. In
applications where the filter coefficients are not fixed,
constant propagation is only possible at run-time.

Fig. 1. Finite Impulse Response Filter (coefficient inputs
labelled c1, c2, ..., c14, c15)

∗Karel Bruneel is supported by a BOF grant from Ghent University.
Run-time switching between speciﬁc implementations is
possible due to the reconﬁgurability of FPGAs. For applica-
tions where all possible conﬁgurations can be generated at
design time, this run-time switching is a sufﬁcient solution.
In this paper we focus on applications where more ﬂexibil-
ity is needed. The number of speciﬁc implementations be-
comes unmanageable for these applications. Hence it might
be more efﬁcient to generate the speciﬁc implementations at
run-time.
Run-time optimization improves the quality of the generated
hardware, but also incurs an overhead. Functional density,
introduced in section 2, is a measure of how efficiently the
FPGA fabric is used. It takes both intrinsic hardware
properties and run-time optimization overhead into account.
Using functional density we can show under which conditions
the optimization gain outweighs the hardware generation
overhead.
Standard hardware synthesis tools are very slow, which
makes these conditions too stringent for many practical
situations. To relax these conditions, we propose a faster
method for hardware generation at run-time in section 3. We
achieve this by splitting hardware generation into an
offline initial compilation and a fast run-time refinement
of this compilation. A detailed description of two
techniques we use for this refinement is given in section 4.
Experimental results for the optimization of FIR ﬁlters
in section 5 show that our method can effectively increase
the functional density.
1-4244-1060-6/07/$25.00 ©2007 IEEE.
We situate our work in the broad ﬁeld of run-time hard-
ware generation in section 6 and formulate some conclu-
sions in section 7.
2. FUNCTIONAL DENSITY
We claim run-time generated circuits can use the FPGA fabric
more efficiently than generic, offline generated circuits.
A metric to measure this efficiency is the functional
density D. This metric combines two important properties of
a calculation in hardware: the number of hardware resources,
reflected in the area A, and the time T needed to perform
the calculation:

$$D = \frac{1}{A \cdot T} \tag{1}$$
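For the static case where generic hardware is generated
at compile-time, $A_s$ is the total hardware cost and
$t_{s,exec}$ is the execution time, which gives the static
functional density

$$D_{static} = \frac{1}{A_s \, t_{s,exec}} \tag{2}$$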
For dynamically generated hardware as we propose in
this paper, T consists of the execution time of the
hardware ($t_{d,exec}$), the time needed to generate the
hardware ($t_{generate}$) and the time needed to configure
the FPGA ($t_{conf}$). The area is denoted $A_d$. This leads
to the dynamic functional density

$$D_{dynamic} = \frac{1}{A_d \, (t_{d,exec} + t_{generate} + t_{conf})} \tag{3}$$
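When the generated hardware component is reused several
times (n), the overhead of generating this component is
spread over the n executions:

$$D_{dynamic} = \frac{1}{A_d \left(t_{d,exec} + \frac{t_{generate} + t_{conf}}{n}\right)} \tag{4}$$

At run-time more optimization possibilities emerge,
resulting in potentially smaller and faster hardware:

$$A_d \, t_{d,exec} < A_s \, t_{s,exec} \tag{6}$$

Under this condition, we can see that $D_{dynamic}$
overtakes $D_{static}$ once the number of reuses n exceeds a
break-even point N.

Fig. 2. Method for run-time hardware generation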
As can be seen in section 5, the hardware generation
time tgenerate is very large for a conventional hardware syn-
thesis tool chain. The break even point N can be reduced
by speeding up run-time synthesis. This enables the use of
run-time hardware generation in a new range of applications
where n is relatively small.
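The break-even point can be made explicit. The derivation below is a sketch, assuming that the generation and configuration overhead is amortized evenly over the n reuses, so that the dynamic density is $1/(A_d(t_{d,exec} + (t_{generate} + t_{conf})/n))$. Requiring this to exceed the static density $1/(A_s t_{s,exec})$ gives:

```latex
\begin{align*}
A_d\left(t_{d,exec} + \frac{t_{generate} + t_{conf}}{n}\right)
  &< A_s\, t_{s,exec} \\
\frac{A_d\,(t_{generate} + t_{conf})}{n}
  &< A_s\, t_{s,exec} - A_d\, t_{d,exec} \\
n &> N = \frac{A_d\,(t_{generate} + t_{conf})}
              {A_s\, t_{s,exec} - A_d\, t_{d,exec}}
\end{align*}
```

The last step divides by $A_s t_{s,exec} - A_d t_{d,exec}$, which is positive exactly when the dynamic hardware is intrinsically better than the static hardware. Under this assumption N grows linearly with $t_{generate}$, which is why faster run-time generation lowers the break-even point.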
3. DUAL APPROACH
Dynamic hardware generation enables exploitation of
optimizations that only emerge at run-time. For some
applications, optimizing for specific parameters requires a
completely different design. We aim for another class of
applications, where the optimizations can be seen as a
transformation of an existing design. In this paper we focus
on constant propagation. Example applications include
adaptive filtering, key-specific encryption and many others.
These applications enable our dual approach where we
combine compile-time generation of generic hardware with
run-time specialization as shown in ﬁgure 2. As the optimiz-
ing transformations are far less complicated than a complete
redesign, large speed-ups can be achieved over a conven-
tional hardware synthesis approach. By shifting hardware
generation time from run-time to compile-time, we improve
the functional density where tgenerate was the obstructing
factor as can be seen from equation 4.
For the offline generation of the generic design, we start
with a high level description of the generic problem. This
description is processed by a conventional tool chain
consisting of three consecutive steps: synthesis, placement
and routing. This results in a generic netlist, and generic
placement and routing information, which serve as a head
start for the online specialization steps.
At run-time we perform the following steps: constant
propagation, compaction and incremental routing. Each of
these online steps corresponds to an ofﬂine preparation step
and builds on its results.
The constant propagation step uses the run-time infor-
mation about constant input values to transform the generic
netlist into a specialized netlist. This is done by propagating
the constants into the circuit and simultaneously simplify-
ing the circuit. Some logic blocks are removed due to this
simpliﬁcation.
The compaction step combines the generic placement
and the specialized netlist and produces a compact
specialized placement. This is done in two steps. First, the
generic placement is pruned, retaining only the logic blocks
present in the specialized netlist. The free space that
emerges is fragmented and therefore cannot be used
efficiently. Second, this sparse placement is compacted to
reduce the fragmentation, while trying to preserve the
placement quality. Because compaction decreases the bounding
box area, functional density is improved, as can be deduced
from equation 4.
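The pruning step and its effect on the bounding box can be illustrated with a minimal Python sketch. The dictionary representation of a placement (block name mapped to grid coordinates) and the function names are assumptions made for this illustration, not the data structures of our implementation.

```python
def prune_placement(generic_placement, specialized_blocks):
    """Keep only the blocks that survive constant propagation."""
    return {name: pos for name, pos in generic_placement.items()
            if name in specialized_blocks}


def bounding_box_area(placement):
    """Number of sites in the smallest rectangle covering all blocks."""
    if not placement:
        return 0
    xs = [x for x, _ in placement.values()]
    ys = [y for _, y in placement.values()]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


# Hypothetical generic placement of four blocks on a grid.
generic = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 3)}
# Block "b" was removed by constant propagation.
sparse = prune_placement(generic, {"a", "c", "d"})
```

In this toy example the sparse placement holds one block fewer, yet its bounding box is unchanged, which is why pruning alone does not reduce the area term and a compaction step is needed.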
Optimally compacting a sparse placement is as hard as
finding the optimal placement itself. There is no strict
correlation between the position of a logic block in the
sparse placement and its position in the optimal placement.
Nevertheless, we can use the sparse placement as a head
start for a new suboptimal placement, considering the
generic placement as a condensed representation of the
connectivity information of logic blocks. Based on this
observation we designed a fast and efficient online
heuristic for compaction, presented in section 4.2.
The incremental routing step concludes the online hard-
ware generation. Only those interconnections that were bro-
ken during constant propagation and compaction are rerout-
ed.
The bitstream resulting from our online hardware spe-
cialization can be used to conﬁgure the FPGA after which
the hardware component can start executing. During execu-
tion the dynamic input is processed by the hardware compo-
nent generating the desired output.
4. ALGORITHMS
4.1. Constant propagation
The constant propagator takes a netlist and a list of the con-
stant inputs and their constant values. In the netlist logic
blocks are annotated with the truth tables of their Look Up
Tables (LUTs). The output is a specialized netlist.
First we read the generic netlist and build an internal
data structure containing information about logic blocks –
consisting of LUTs and Flip Flops (FF) – inputs, outputs
and nets. This representation facilitates complex transfor-
mations on the circuit.
Secondly we propagate constant inputs one by one in
the circuit. The truth tables are simpliﬁed simultaneously.
In some cases this leads to opportunities to prune the circuit.
We use eight simple pruning rules:
1. If a LUT becomes independent of one of its inputs,
this input is removed from the sink list of the driving
net.
2. If a net has no sinks, its driving LUT or FF is removed.
3. If a LUT is removed, its inputs are removed from the
sink list of its respective input nets.
4. If a FF is removed, its input is removed from the sink
list of its input net.
5. If a LUT becomes a constant generator, it is removed
except for when it drives an output. The constant out-
put of the LUT is propagated.
6. If a FF has a constant input it is removed and the con-
stant input is propagated.
7. If a LUT has only one input and it simply transfers the
value to the output, the input net and the output net are
merged and the LUT is removed.
8. If all LUTs and FFs of a logic block are removed the
logic block itself is removed.
The eighth rule removes logic blocks from the circuit.
These blocks are pruned from the netlist but are still present
in the generic placement ﬁle. The compaction step will solve
this problem by removing them from the placement.
Another important property is that none of the rules
creates extra logic blocks. Consequently, the set of logic
blocks in the specialized circuit after constant propagation
is a subset of the logic blocks in the original design, and
each logic block in the specialized netlist is still linked
to a place in the generic placement. This results in a valid
initial placement for all logic blocks, which will be
optimized by our compaction algorithm.
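The truth-table simplification that drives these rules can be sketched in a few lines of Python. This is an illustration only: the indexing convention for truth tables and the function names are assumptions of the sketch and are not taken from our constant propagator.

```python
def specialize_lut(truth_table, input_index, value):
    """Fix one LUT input to a constant and return the reduced truth table.

    Convention (an assumption of this sketch): truth_table[i] is the output
    for the input assignment whose bit j equals the value of input j.
    """
    k = len(truth_table).bit_length() - 1      # number of LUT inputs
    assert len(truth_table) == 1 << k and 0 <= input_index < k
    low_mask = (1 << input_index) - 1
    reduced = []
    for row in range(1 << (k - 1)):
        low = row & low_mask                   # inputs below the fixed one
        high = row >> input_index              # inputs above the fixed one
        full = (high << (input_index + 1)) | (value << input_index) | low
        reduced.append(truth_table[full])
    return reduced


def depends_on(truth_table, input_index):
    """True if flipping the given input can change the output (rule 1)."""
    return any(truth_table[i] != truth_table[i ^ (1 << input_index)]
               for i in range(len(truth_table)))


def is_constant(truth_table):
    """True if the LUT has become a constant generator (rule 5)."""
    return len(set(truth_table)) == 1


AND2 = [0, 0, 0, 1]                            # 2-input AND
```

With this convention, fixing input 0 of the AND to 1 leaves a LUT that simply passes on its remaining input (rule 7), while fixing it to 0 turns the LUT into a constant generator (rule 5).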
4.2. Replacement via Compaction
Some logic blocks have become redundant after constant
propagation. Removing these blocks from the generic
placement results in a sparse specialized placement that
uses fewer hardware resources than the original placement.
Because of fragmentation, however, the vacant logic blocks
do not contribute to an improved functional density.
This sparse specialized placement is transformed into
a compacted version by our compaction algorithm. The
resulting placement can effectively improve the functional
density if it has a reduced bounding box and still a good
placement quality. Important factors are routability and the








Fig. 3. Principles of Compaction
As stated in section 3, we designed a fast online heuristic
for compaction that tries to preserve the placement
quality. The generic placement is used as a condensed
representation of the connectivity information of logic
blocks. This placement is produced by a placement algorithm
that obtains good quality by placing strongly connected
blocks close together. We assume that this property still
holds for the sparse placement. Hence good placement quality
can be achieved by keeping blocks that are close together in
the original placement close together during compaction.
Our compaction algorithm is a fast heuristic for achieving
this goal. The pseudo code is shown in figure 4. We first
calculate the bounding box of the sparse placement. Then
we iterate over all logic blocks on the border of this
bounding box and shift them into the bounding box. This is
done by finding the closest free space in the bounding box
and then shifting the logic blocks that are in between at a
right angle (figure 3). Because a logic block is shifted by
at most one position in every iteration, logic blocks that
were close together stay close together during an iteration.
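// Bounding box of the sparse placement
BoundingBox b = findBoundingBox();
// Start with the first logic block on the border of b
LogicBlock l = b.firstLogicBlock();
while (b.freeSpace() != 0) {
    // Closest vacant position inside the bounding box
    Place f = b.findClosestFreeSpace(l);
    // Shift l and the blocks between l and f one position towards f
    shift(l, f);
    l = b.nextLogicBlock();
}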
Fig. 4. Pseudo code of the compaction algorithm
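The shift operation can be illustrated with a one-dimensional simplification in Python. This is a sketch under strong assumptions: a single row of sites stands in for the two-dimensional placement, only the rightmost block is treated as the border block, and the function names are hypothetical. It is not the two-dimensional algorithm of figure 4.

```python
def shift_towards_free(row, block_index, free_index):
    """Shift a border block and the blocks between it and a free site.

    `row` is a list of block names with None for vacant sites. Every block
    from `block_index` up to the free site moves one site towards it, so
    the relative order of the blocks is unchanged.
    """
    assert row[free_index] is None and row[block_index] is not None
    step = 1 if block_index > free_index else -1
    for i in range(free_index, block_index, step):
        row[i] = row[i + step]
    row[block_index] = None
    return row


def compact_row(row):
    """Pull the rightmost block inwards until its bounding box has no gap."""
    row = list(row)
    while True:
        occupied = [i for i, b in enumerate(row) if b is not None]
        if not occupied:
            return row
        border = occupied[-1]
        # Free sites inside the current (one-dimensional) bounding box.
        free = [i for i in range(occupied[0], border) if row[i] is None]
        if not free:
            return row
        closest = max(free)       # free site nearest to the border block
        shift_towards_free(row, border, closest)


compacted = compact_row([None, "a", None, "b", None, "c"])
```

Each iteration moves the border block by exactly one site, so neighbouring blocks remain neighbours and the loop terminates once the bounding box contains no vacant site.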
Intuitively, this algorithm is much faster than conventional
placement algorithms based on simulated annealing. In
section 5 we also show that the placement quality can be
preserved.
5. RESULTS
Our implementation is based on VPR (Versatile Place and
Route) [2]. After synthesis we use Vplace for placing the
generic circuit in the ofﬂine phase. For the online specializa-
tion we use our own constant propagation and compaction
algorithms, described in sections 4.1 and 4.2 respectively.
Currently we have not yet implemented an incremental rout-
ing algorithm. Therefore the specialized circuit is fully rout-
ed at run-time with Vroute, the VPR router.
Although the approach proposed in this paper is architecture
independent, we focus on an implementation for a specific
FPGA in this section. We have chosen a simple architecture¹
with logic blocks containing one 4-LUT and a FF. In the
interconnection network the wire segments only span one
logic block and the channel width is fixed to 20.
FIR ﬁltering is used for validation of our dual approach.
In our experiments we use a generic 16 tap FIR ﬁlter with
8 bit coefﬁcients and an 8 bit input, shown in ﬁgure 1. The
generic multipliers are ripple carry array multipliers.
A generic netlist for this FIR ﬁlter was manually con-
structed. It contains 2704 logic blocks, 137 inputs (one for
the clock, 8 for the input, and 16 times 8 for the coefﬁcients)
and 31 outputs. In the netlist every LUT was annotated with
its truth table. Vplace was used with default settings for the
ofﬂine placement of the generic design.
In order to test the online specialization and constant
propagation, we randomly generated 100 different sets of
16 coefﬁcients. The next subsections present results of ex-
periments with these 100 FIR ﬁlters.
All experiments were executed on an Intel Core 2 pro-
cessor running at 2.13 GHz with 2 GiB of memory.
5.1. Experiment 1: Compaction
Our compaction algorithm can produce a placement of equal
quality compared to Vplace in a much shorter time. We use
the critical path length after routing as the quality measure.
Vplace can be tuned to trade quality for execution time
with the inner_num parameter. We ran Vplace with this
parameter ranging from 0.2 to 10 (the default Vplace
setting) for all coefficient sets. The critical path length
and the placement time were averaged over these FIR filters.
The resulting data points are plotted in figure 5.
The compaction algorithm was also run for all 100
coefficient sets, and the results were averaged over all FIR
filters. The resulting data point is shown in figure 5. Our
compaction algorithm takes 43 ms on average and produces FIR
filters with an average critical path length of 119.3 ns.
The figure clearly shows that our compaction algorithm
can produce a placement of the same quality as Vplace in a
shorter time: for an average critical path of 119.3 ns we
measure an average speedup of 45.8.
We can also see that, compared to the full Vplace run
reported in section 5.2, our compaction algorithm produces
FIR filters with a critical path length that is 36.1% longer
on average.
¹ A description of this architecture is provided with the VPR tool suite
in 4lut_sanitized.arch.
Note that Vplace cannot produce routable placements
Fig. 5. Placement quality versus generation time
5.2. Experiment 2: Intrinsic hardware properties
In our second experiment we compare the intrinsic hardware
properties of the hardware produced by three different de-
sign methods: our dual approach, a full online FPGA tool
chain and a static implementation.
In the dual approach the hardware is generated online in
three steps: constant propagation, compaction, and Vroute.
The full online FPGA tool chain also produces the special-
ized hardware in three steps: Synthesis, Vplace and Vroute.
Because we did not have a synthesis tool available that ﬁts in
the VPR tool chain we replaced it with the constant propaga-
tor. No online hardware generation is done by the static im-
plementation. To achieve the same functionality, the static
hardware implements a generic FIR ﬁlter.
Table 1 shows the average area and critical path length
for these three design methods. We see that on average the
static implementation uses 81.4% more logic blocks than
the two dynamic implementations. On average the hardware
produced by the dual approach is 5.3% faster than the static
implementation and 36.1% slower than the hardware pro-
duced by the online VPR tool chain.
Table 1. Properties of the generated hardware
                      Dual Approach   VPR     Static
Area (logic blocks)   1491            1491    2704
Critical path (ns)    119.3           87.6    126.0
5.3. Experiment 3: Hardware generation time
In the third experiment we compare the generation times of
the dynamic design methods described in section 5.2.
Table 2 gives the average total hardware generation time
and its decomposition. We see that the dual approach is on
average 12 times faster than the full VPR tool chain and that
the bulk of the generation time of this approach is routing
time. Incremental routing could reduce this routing time and
make our method even faster.
Table 2. Duration of the hardware generation process
                             Dual Approach   VPR
Constant propagation (ms)    128             -
Compact / Place (ms)         43              47557
Route (ms)                   3889            2275
Total generation time (ms)   3932            49832
Note that no synthesis time is reported for the VPR tool
chain because we did not have a synthesis tool available
that fits in it. Adding the synthesis time would increase
the total time needed for the online VPR approach, thus
increasing the performance advantage of our dual approach.
5.4. Experiment 4: Functional Density
Finally we compare the functional density of the three de-
sign methods described in section 5.2. We calculated the
functional density as shown in section 2 for several values
of n and averaged the resulting functional densities over all
100 FIR ﬁlters. The result is shown in ﬁgure 6. In our FIR
filter example n denotes the number of input samples that
Fig. 6. Functional density versus hardware reuse
For n > 40 million, our dual approach outperforms
the static approach because of the improved area and timing
with a relatively low overhead. The VPR flow has an even
better performance, in area and timing, but with a much
larger overhead. Therefore its functional density is higher
than that of our approach only for very large reuse values
(n > 1.5 billion).
The range where our dual approach is beneﬁcial can be
split up into two subranges. In the range (40 million <
n < 360 million) dynamic hardware generation is not ben-
eﬁcial without our new approach. In this range our special-
ization technique enables dynamic hardware generation for
a new range of applications. In the second range (360 mil-
lion < n < 1.5 billion) the functional density for online
hardware generation is improved by our method.
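These thresholds can be checked with a short calculation on the averages from Tables 1 and 2. The sketch below rests on assumptions that we state explicitly: the execution time per input sample is taken to be the critical path length, the configuration time is neglected because it is not listed in Table 2, and the overhead is amortized evenly over the n samples. The function names are ours for this illustration.

```python
def break_even(a_static, t_static, a_dynamic, t_dynamic, t_overhead):
    """Reuse count above which a dynamic flow beats the static design.

    Solves A_d * (t_d + t_overhead / n) < A_s * t_s for n.
    """
    gain = a_static * t_static - a_dynamic * t_dynamic
    if gain <= 0:
        return float("inf")       # the dynamic hardware is never better
    return a_dynamic * t_overhead / gain


def crossover(a1, t1, o1, a2, t2, o2):
    """Reuse count at which two dynamic flows have equal density."""
    return (a2 * o2 - a1 * o1) / (a1 * t1 - a2 * t2)


# Averages from Tables 1 and 2: area in logic blocks, times in seconds.
STATIC = (2704, 126.0e-9)         # area, critical path
DUAL = (1491, 119.3e-9, 3.932)    # area, critical path, generation time
VPR = (1491, 87.6e-9, 49.832)

n_dual = break_even(*STATIC, *DUAL)   # dual approach versus static
n_vpr = break_even(*STATIC, *VPR)     # full VPR flow versus static
n_cross = crossover(*DUAL, *VPR)      # VPR flow versus dual approach
```

Under these assumptions the three values come out near the thresholds of 40 million, 360 million and 1.5 billion quoted above. Exact agreement is not expected, because the quoted thresholds are based on functional densities averaged over all 100 filters.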
6. RELATED WORK
The exploitation of the reconﬁgurability of FPGAs to im-
prove performance has been studied intensively in the past
decade.
Run-time reconfiguration of the FPGA by switching between
several statically generated bitstreams has led to
interesting results in the fields of neural networks,
template matching, DNA sequencing and many others [1, 3].
Because the hardware is generated offline, in contrast to
our approach, only the reconfiguration incurs an overhead.
Of course, offline generation provides fewer optimization
opportunities.
Other related work was done in the ﬁeld of hardware
generation at run-time. For applications with a quasi-static
behavior a full hardware generation with a conventional tool
ﬂow was successfully used to improve overall performance.
Examples are key-speciﬁc DES [4], the subgraph isomor-
phism problem [5], boolean satisﬁability [6] and many oth-
ers. This approach cannot be extended for applications with
a more dynamic behavior — the applications focused on
in our work — because the conventional hardware gener-
ation process is far too slow. In the WARP processor [7]
frequently used code is dynamically moved from processor
to FPGA. The hardware generation is done by lean versions
of the conventional FPGA tool chain.
Several authors proposed run-time hardware specialization
on the netlist level based on partial evaluation [4, 8].
McKay and Singh [9] also reuse placement and routing
information but do not compact the sparse placement obtained
after constant propagation.
7. CONCLUSIONS AND FUTURE WORK
In this paper we proposed a dual approach for run-time hard-
ware generation. Our approach combines the generation of
generic hardware at compile-time with run-time specializa-
tion. This allows us to use fast tools at run-time while main-
taining the hardware quality. Hence the hardware generation
overhead can be reduced extensively. We described two of
those tools: the constant propagator and the compactor. The
incremental router in our approach is not yet implemented.
We validated our approach on FIR ﬁlters with changing
coefﬁcients. A static generic implementation is compared
to run-time hardware generation, both with a full tool chain
at run-time (VPR) and our faster dual approach. We have
shown that run-time hardware generation can effectively im-
prove functional density if the generated hardware is sufﬁ-
ciently reused. By reducing the hardware generation over-
head our dual approach makes run-time hardware generation
proﬁtable for a new range of applications.
Although the tools we have implemented are sufficient to
prove the concept of our dual approach, many improvements
can be made. The next step in our research is therefore the
implementation of an incremental router. This could
substantially improve functional density because the bulk of
the run-time generation time is due to routing. We also plan
an in-depth study of constant propagators and compaction
algorithms.
8. REFERENCES
[1] M. J. Wirthlin and B. L. Hutchings, “Improving functional
density using run-time circuit reconfiguration,” IEEE Trans.
VLSI Syst., vol. 6, no. 2, pp. 247–256, 1998.
[2] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD
for Deep-Submicron FPGAs. Kluwer Academic Publishers,
1999.
[3] J. Villasenor, B. Schoner, K.-N. Chia, C. Zapata, H. J. Kim,
C. Jones, S. Lansing, and B. Mangione-Smith, “Configurable
computing solutions for automatic target recognition,” in Proc.
IEEE Symposium on FPGAs for Custom Computing Machines
(FCCM), 1996.
[4] J. Leonard and W. H. Mangione-Smith, “A case study of
partially evaluated hardware circuits: Key-specific DES,” in
Proc. International Workshop on Field-Programmable Logic and
Applications (FPL), 1997, pp. 151–160.
[5] S. Ichikawa and S. Yamamoto, “Data dependent circuit
for subgraph isomorphism problem,” in Proc. International
Conference on Field-Programmable Logic and Applications
(FPL), 2002, pp. 1068–1071.
[6] P. Zhong, M. Martonosi, P. Ashar, and S. Malik,
“Accelerating boolean satisfiability with configurable
hardware,” in Proc. IEEE Symposium on FPGAs for Custom
Computing Machines (FCCM), 1998, p. 186.
[7] R. Lysecky, G. Stitt, and F. Vahid, “Warp processors,”
Transactions on Design Automation of Electronic Systems,
vol. 11, no. 3, pp. 659–681, July 2006.
[8] S. Singh, J. Hogg, and D. McAuley, “Expressing dynamic
reconfiguration by partial evaluation,” in Proc. IEEE Symposium
on FPGAs for Custom Computing Machines (FCCM), 1996.
[9] N. McKay and S. Singh, “Dynamic specialisation of XC6200
FPGAs by partial evaluation,” Lecture Notes in Computer
Science, vol. 1482, p. 298, 1998.
