Exploiting dynamic reconfiguration of platform FPGAs: implementation issues by Miguel L. Silva & João Canas Ferreira
Miguel L. Silva (1) and João Canas Ferreira (1,2)
(1) FEUP/DEEC, (2) INESC Porto
Rua Dr. Roberto Frias, s/n, 4200-465 PORTO, Portugal
mlms@fe.up.pt, jcf@fe.up.pt
Abstract
The effective use of dynamic reconfiguration requires
the designer to address many implementation issues.
The market introduction of full-featured platform
FPGAs equipped with embedded CPU blocks expands
the number of situations where dynamic reconfiguration
may be applied to improve overall performance
and logic utilization. The paper compares the design
of two similar systems supporting dynamic reconfigu-
ration and the issues that were addressed in their im-
plementation. The first system supports 32-bit data
transfers between CPU and the dynamically reconfig-
urable circuits. The other implementation supports 64-
bit transfers, but its effective use is more complicated
and several restrictions must be taken into account.
The work includes a performance comparison of the
two designs on several simple tasks, including pattern
matching, image processing and hashing.
1 Introduction
The present work is concerned with the implementation
issues that arise for designs that try to exploit the
capability of some platform FPGAs for configuration
changes at run-time. The intent is typically to
time-share the available hardware to support multiple
(and mutually exclusive) tasks; alternatively, the de-
signer may be seeking better performance by adapting
the hardware implementation to the actual data being
processed at a given instant.
Work partially funded by the Department of Electrical and Com-
puter Engineering of the Faculty of Engineering of the University
of Porto, under contract DEEC-ID/05/2003, and by FCT schol-
arship SFRH/BD/17029/2004.
In the context of the present work, we are interested
in designs that contain a closely-coupled CPU, typi-
cally as a dedicated block or as a core implemented
on part of the reconfigurable fabric. There are many
hardware and software issues that must be considered
in this context. The present work deals mainly with
the need to establish an appropriate hardware environ-
ment in order to be able to carry out dynamic reconfig-
urations in an orderly fashion, and how the associated
design choices affect the global performance. Since the
actual details of the underlying reconfigurable platform
may be important for the concrete analysis of the is-
sues, the paper presents its discussion in the context of
systems based on Virtex-II Pro devices, an advanced
FPGA family from Xilinx.
The paper is organized around two actual system
designs with similar overall organization. The first one
is implemented on a XC2VP7-FG456-6 device and fea-
tures a 32-bit data bus; the second one is implemented
on a XC2VP30-FF896-7 and uses a 64-bit bus. The
second system design was intended to implement the
same basic approach as the first one, but to achieve
better performance for tasks that depend on run-time
reconfiguration.
Other systems based on dynamically reconfigurable
platform FPGAs have been described in the literature
(see, for instance, [13, 2, 1, 11]). Aspects of the 32-bit
design used for this work have been presented in [5].
The 64-bit system is described here for the first time.
The rest of the paper is organized as follows. Sec-
tion 2 describes the overall context of the work and
presents the global design choices. The 32-bit system
is summarized in section 3, which also characterizes
the system’s performance and presents experimental
results for some simple application fragments. Section
4 then describes the 64-bit system, with emphasis on
the different design choices and how they impacted the
system’s performance. The section also presents
experimental results, including results for the same
application fragments used with the 32-bit version. Finally,
section 5 presents some final remarks and concludes
the paper.
Figure 1. General system architecture
1-4244-0054-6/06/$20.00 ©2006 IEEE
2 Dynamic reconfiguration of platform
FPGAs
2.1 Generic system organization
A generic overview of a system organization used for
this work is shown in figure 1. In addition to the CPU
and the area for run-time reconfiguration (the dynamic
area), the following modules are included:
• Memory interface unit. Depending on the
needs, it may interface to internal and/or exter-
nal memory. Both types of memory may store re-
configuration data (for the dynamic modules) and
application-specific data.
• Configuration control unit. This module per-
forms the actual reconfiguration of the dynamic
area. It can be seen as an interface unit to the
FPGA’s configuration memory.
• External communication unit. This module,
if available, is responsible for communications with
an external system (e.g., a standalone computer)
for data transfer, system control and debugging
operations.
• Dynamic area communication unit. This unit
is responsible for communications with the mod-
ules in the dynamic area. It includes the circuitry
to use one of the data busses and, possibly, a DMA
controller.
2.2 Partial configurations
The process of reconfiguring the dynamic area must
handle the constraints imposed by the FPGA’s archi-
tecture. Virtex-II Pro devices (like other Xilinx de-
vices) are reconfigured by frames. A frame is a set of
configuration bits that control a column of configurable
resources. Each such column covers the entire height
of the device. However, in practice, it is difficult to
have a dynamic area that covers the entire height of
the device, because that would isolate one side of the
device from the other, i.e., circuits on the left of the
dynamic area would not be able to connect to circuits
on the right, and vice-versa. This may not be acceptable
in practice. For instance, the layout of the board may
constrain the layout of the resources inside the FPGA
in such a way as to make a full-height dynamic region
unavailable (typically because some external compo-
nents are connected to pins on the upper or lower sides
of the reconfigurable fabric).
Due to the need to take such layout constraints into
consideration, a dynamic area will typically not occupy
the full height of the device. Therefore, the partial con-
figurations used to reconfigure the dynamic area must
be produced in such a way as to not disturb the circuits
below or above.
Another issue with the reconfiguration process arises
because partial configurations are “differential” config-
urations, that is, they assume an initial state of the
configuration resources, and only specify reconfigura-
tion data for those resources whose state is to be dif-
ferent from the initial one. Since the dynamic area
is used for multiple configurations in an order that is
unknown at the time the partial configurations are pro-
duced, the problem of ensuring the correct state prior
to reconfiguration arises.
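The dependence of a "differential" configuration on the initial state can be illustrated with a toy model. The sketch below is not a description of the actual bitstream format: it assumes, purely for illustration, that the dynamic area consists of four frames and that each frame's contents can be represented by a single integer. The module contents and the function names are hypothetical.

```python
# Toy model: configuration memory is a mapping frame_index -> frame contents.

def make_differential(base, target):
    """Keep only the frames of `target` that differ from the assumed `base`."""
    return {i: bits for i, bits in target.items() if base.get(i) != bits}


def apply_config(memory, partial):
    """Overwrite the frames named in `partial`; other frames are left untouched."""
    updated = dict(memory)
    updated.update(partial)
    return updated


# Hypothetical contents of a 4-frame dynamic area for two modules.
blank = {0: 0x00, 1: 0x00, 2: 0x00, 3: 0x00}
module_a = {0: 0xA1, 1: 0xA2, 2: 0x00, 3: 0x00}
module_b = {0: 0xB1, 1: 0x00, 2: 0x00, 3: 0xB4}

# Differential configuration for B, produced assuming a blank dynamic area.
diff_b = make_differential(blank, module_b)

after_blank = apply_config(blank, diff_b)   # assumed initial state holds
after_a = apply_config(module_a, diff_b)    # module A was loaded before B
after_full = apply_config(module_a, module_b)  # complete (non-differential) configuration
```

In this model, `after_blank` equals `module_b`, but `after_a` still contains frame 1 of module A, because the differential configuration never mentions that frame. The complete configuration restores the correct state regardless of what was loaded before, at the price of writing every frame.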
Several ways to address the problem have been reported
[11, 6, 10, 12]. One way is to use a tool like
BitLinker [12], which can ensure that the
configuration bitstream is complete (i.e., not “differential”).
This has the side effect of increasing the configuration
time, because more configuration data must be written.
The use of a configuration assembly tool may help
solve other problems. For instance, BitLinker ensures
that partial configurations do not disturb the circuits
residing below or above the dynamic area.
It is possible to go further and provide means for as-
sembling the correct partial configuration from the con-
figurations of individual components [12, 6, 10]. In this
way, components can be reused without going through
the complete high-level design flow. This is particu-
larly helpful when multiple similar configurations must
be produced. All the configurations used in the experiments
reported in the next sections have been produced
with the help of BitLinker.
Figure 2. LUT-based bus macros
The assembly of component configurations raises the
problem of establishing communications between them.
This requires a means of ensuring that the compo-
nent’s input and output ports are at fixed locations,
so that the assembled configuration can be produced
by appropriate concatenation of the individual com-
ponents. To ensure that a component has input and
output ports compatible with the assembly procedure,
a so-called “bus macro” is used in the component’s de-
sign. Figure 2 illustrates the situation. In this particu-
lar case, the communication is guaranteed by ensuring
that the output signals of component A (called In(0)
and In(1)) flow out of the component through spe-
cific LUTs and that the input signals of component B
(called Out(0) and Out(1)) flow into the component
from the corresponding LUTs.
Note that the components are designed separately
(using the regular design flow): figure 2 shows the cir-
cuit that corresponds to the assembly of the configu-
rations of the individual components, not the situation
at design time. During the design process for A, no
information about component B is used, except for the
fact that the relative positions of the I/O connections
are fixed by the “bus macro”. The same applies to the
design of component B and, indeed, to the design of
any component that uses the same “bus macro”.
“Bus macros” based on tristate connections have
also been proposed, see [14]. The circuits mentioned
in the next sections use LUT-based bus macros when
necessary, since they consume less area.
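The role of the bus macro as the only shared knowledge between separately designed components can be sketched as follows. This is a simplified model under stated assumptions, not the procedure used by BitLinker: the classes `BusMacro` and `Component`, the representation of port locations as a tuple of positions, and the function `assemble` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BusMacro:
    """Fixed boundary positions (hypothetical indices) of the signals."""
    positions: Tuple[int, ...]


@dataclass(frozen=True)
class Component:
    name: str
    in_macro: Optional[BusMacro]   # macro used at the input boundary
    out_macro: Optional[BusMacro]  # macro used at the output boundary


def assemble(components):
    """Concatenate components if every adjacent pair shares the same bus macro."""
    for left, right in zip(components, components[1:]):
        if left.out_macro is None or left.out_macro != right.in_macro:
            raise ValueError(f"{left.name} and {right.name} use incompatible ports")
    return [c.name for c in components]


macro = BusMacro(positions=(0, 1))
comp_a = Component("A", in_macro=None, out_macro=macro)
comp_b = Component("B", in_macro=macro, out_macro=None)
order = assemble([comp_a, comp_b])
```

Neither component refers to the other: compatibility is decided solely by comparing the macros at the shared boundary, which mirrors the design-time situation described above.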
3 The 32-bit system design
This section describes an implementation of the
generic organization from section 2 around a 32-bit bus
connection between CPU and dynamic area. The im-
plementation described in section 3.1 corresponds to
an updated implementation of the hardware system described
in [5]. A performance assessment of the system
is presented in section 3.2.
Figure 3. The 32-bit system architecture
3.1 Implementation overview
For this system implementation we used a board
with a Xilinx XC2VP7 (speed grade -6) FPGA and
32 MB of external static memory. This FPGA has
4928 slices and 44 RAM blocks (with 18 kb each).
An overview of the system setup is shown in fig-
ure 3. It corresponds roughly to the actual floorplan of
the system. All modules (except the CPU) are imple-
mented on the reconfigurable fabric. The on-chip inter-
module bus system is an implementation of the Core-
Connect Bus Architecture [7] as provided by the Xilinx
Embedded Development Kit (EDK). The two busses
used are the 64-bit Processor Local Bus (PLB) and the
lower-performance, but less resource-consuming, 32-bit
On-Chip Peripheral Bus (OPB).
The PLB connects to a memory controller for on-
chip memory and to the PLB-OPB bridge. The OPB
connects to the external memory controller and to the
serial port. Using the OPB instead of the PLB to access
external memory requires a much smaller controller.
Although not shown on the figure, the OPB also con-
nects to a General-Purpose I/O (GPIO) controller (for
LEDs and push buttons). A Reset Block is also in-
cluded; it can be used to externally reset the CPU and
peripherals without affecting the fabric configuration.
A dedicated block called JTAGPPC is also included;
this special block connects the FPGA’s JTAG port to
the PowerPC core, and is used for data transfers and
debugging.
The configuration memory controller (OPB HWICAP)
is also connected to the OPB. Its purpose is to
allow configurations to be changed internally through
the Internal Configuration Access Port (ICAP), a dedicated
block available in several Xilinx device families.
The Xilinx EDK was used to develop the system, so
many of the necessary modules were already available.
Table 1. Resource usage (32-bit system)
The one remaining module that is directly relevant
for this work is the OPB Dock. The OPB Dock is a
wrapper module that connects the dynamic region
to the rest of the system. It connects to the OPB bus
in order to provide a 32-bit data channel to the dy-
namic region. The wrapper is assigned a fixed range
of the OPB address space, and acts like an OPB slave
peripheral, performing address decoding and I/O op-
erations. The wrapper stores incoming data, so that it
is kept available for processing by the components in
the dynamic region between write operations.
The data communications between the wrapper and
the dynamic region are made through a connection in-
terface with two unidirectional channels, one for write
and the other for read operations. Since the OPB is a
32-bit bus, each channel is 32 bits wide. The connec-
tion interface generates an additional signal, that indi-
cates the occurrence of a write operation on the OPB.
This signal can be used as a clock enable signal for
any flip-flop in the dynamic region. The connection in-
terface is implemented using the previously mentioned
LUT-based bus macros.
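The behaviour of the wrapper, as seen from the bus, can be summarized by a small behavioural model. The sketch below is only an approximation under stated assumptions: the class names, the base address and the address range are hypothetical, and the circuit in the dynamic region is modelled by an arbitrary example (an accumulator) whose register is updated once per write strobe.

```python
class Accumulator:
    """Hypothetical dynamic-region circuit: adds every word written to it."""

    def __init__(self):
        self.total = 0

    def clock(self, data):
        # Invoked once per write strobe, like a flip-flop with clock enable.
        self.total = (self.total + data) & 0xFFFFFFFF

    def output(self):
        return self.total


class DockModel:
    """Slave wrapper with one 32-bit write channel and one 32-bit read channel."""

    def __init__(self, base, size, circuit):
        self.base = base          # start of the assigned address range (assumed)
        self.size = size          # length of the range in bytes (assumed)
        self.circuit = circuit
        self.write_reg = 0        # incoming data is kept between write operations

    def selected(self, addr):
        """Address decoding: does this access target the wrapper?"""
        return self.base <= addr < self.base + self.size

    def write(self, addr, value):
        if not self.selected(addr):
            return False
        self.write_reg = value & 0xFFFFFFFF
        self.circuit.clock(self.write_reg)   # write strobe acts as clock enable
        return True

    def read(self, addr):
        if not self.selected(addr):
            return None
        return self.circuit.output() & 0xFFFFFFFF
```

A write to any address inside the assigned range stores the value and pulses the strobe; accesses outside the range are ignored, so the state of the dynamic region only changes when the CPU actually writes to the wrapper.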
The resource usage of the system implementation is
shown in table 1. The CPU clock frequency is 200 MHz.
Both the PLB and the OPB operate at 50 MHz. We
were not able to obtain better operating frequencies
while still satisfying the layout constraints required to
obtain a dynamic area of useful size.
The dynamic region available in this implementation
contains 6 RAM blocks and 28×11 = 308 Configurable
Logic Blocks (CLBs). A Virtex-II Pro CLB includes
4 slices, each with two 4-input lookup tables and two
flip-flops, so the dynamic area contains 25% of the total
number of slices (and flip-flops).
3.2 Performance characterization
To assess data transfer performance, we measured
the time necessary to transfer sequences of 32-bit values
to/from external memory. Table 2 shows the average
time per transfer for three situations: sequences
of write operations, sequences of read operations and
sequences of interleaved write/read operations. The
results include the overhead of the controlling software.
Note that transfers between external memory and dynamic
area use the data bus twice, since data is fetched
from the origin to the CPU and then from the CPU to
the destination.
Table 2. Measured times for data transfers between
dynamic region and external memory (32 bit)
Table 3. Results for pattern matching in binary
images (32 bit)
The times reported in table 2 allow the developer
to determine a lower bound for the time required to
use the dynamic area. This lower bound can be used
to make a first assessment of the improvements that
can be obtained by moving a function from software to
hardware.
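Such a first assessment amounts to a simple calculation, sketched below. The function names are hypothetical, and the per-transfer times must be taken from the measurements; no measured values are assumed here. The estimate ignores the computation time of the hardware itself, so it bounds the achievable speedup from above.

```python
def transfer_lower_bound(n_writes, n_reads, t_write, t_read):
    """Time spent on data transfers alone (same unit as t_write and t_read)."""
    return n_writes * t_write + n_reads * t_read


def speedup_upper_bound(t_software, n_writes, n_reads, t_write, t_read):
    """Best possible speedup if the hardware computation itself took no time."""
    return t_software / transfer_lower_bound(n_writes, n_reads, t_write, t_read)
```

If the resulting bound is close to or below 1, moving the function to the dynamic area cannot pay off, whatever the quality of the hardware implementation.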
Our first application example concerns a simple pat-
tern matching task for bilevel images, where it is nec-
essary to determine how many pixels of an 8× 8 image
pattern are equal to the corresponding pixels of a win-
dow that slides over a larger image. The hardware im-
plementation is based around a pipeline of eight stages,
each one calculating the number of matching pixels in
a row of the pattern. The results of the eight stages
are summed, producing the number of matching pixels
for one position of the sliding window.
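The computation performed for each window position can be stated compactly in software. The following sketch is a reference model only, not the code used for the measurements: it assumes that the image is given as a list of rows of 0/1 pixels and the pattern as eight 8-bit integers, and the function names are hypothetical. Each call to `row_matches` corresponds to the work of one pipeline stage.

```python
def row_bits(image_row, col):
    """Pack 8 consecutive pixels starting at `col` into an 8-bit integer."""
    value = 0
    for bit in image_row[col:col + 8]:
        value = (value << 1) | bit
    return value


def row_matches(pattern_row, window_row):
    """Matching pixels in one row: 8 minus the number of differing bits."""
    return 8 - bin(pattern_row ^ window_row).count("1")


def match_count(image, pattern, top, left):
    """Sum of the eight per-row results for one window position."""
    return sum(
        row_matches(pattern[r], row_bits(image[top + r], left))
        for r in range(8)
    )


def sliding_match(image, pattern):
    """Match counts for every position of the 8x8 window over the image."""
    rows, cols = len(image), len(image[0])
    return [
        [match_count(image, pattern, top, left) for left in range(cols - 7)]
        for top in range(rows - 7)
    ]
```

The bit packing and population count in this model are the kind of bit manipulations that take several instructions on the CPU, while in hardware each stage reduces to an XOR followed by a small adder tree.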
Table 3 shows the results obtained for a software-
only implementation running on the embedded CPU
versus the hardware/software version, where the dedi-
cated matching pipeline is implemented in the dynamic
area. As can be seen from the table, speedup factors
of more than 26 were obtained. These results can be
explained by noting that: i) the task consists of many
simple independent steps that can be executed in parallel;
ii) a pipelined hardware implementation was used;
iii) the task involves bit manipulations that are
cumbersome to express in the C programming language,
but simple to implement in hardware.
Table 4. Results for hash function (32 bit)
Table 5. Speedups for simple image processing
tasks (32 bit)
As another example consider the task of accelerat-
ing a public domain implementation of a hashing func-
tion that returns a 32-bit value for a variable-length
key [8]. In this case, the whole hashing function was
implemented in hardware. As the results of table 4
show, the speedup in this case is much more modest,
since the original code had been optimized for 32-bit
CPUs (like the PowerPC used in this case) and the
data transfer times are significant when compared to
the original software processing times.
Image processing tasks often involve the concurrent
processing of small data items. The instruction set ar-
chitectures of most desktop CPUs have been extended
to include special instructions to handle packed sets of
such data items [3, 9]. Since the PowerPC 405 core
does not support such an extension, it makes sense to
use the dynamic region to accelerate image processing
tasks that would otherwise be handled by the CPU
alone.
Table 5 presents the results obtained for some simple
grayscale image processing applications (8-bit pixels):
• Brightness adjustment: The hardware adds
an 8-bit unsigned pixel value to a signed constant
value (saturating add). Four pixels are processed
per data transfer.
• Additive blending: This task consists of
adding (with saturation) the pixel values from two
images to produce a third. The hardware receives
four pixel values per transfer (two from each image)
and produces two output pixels. In order to
save on read operations, the resulting pixels are
packed in groups of four, before being read back
by the CPU.
• Fade effect: This task consists of combining the
pixels of two images according to (A − B) × f + B,
where A is a pixel value from the first image, B is
a pixel value from the second, and f is a constant
that specifies the relative contribution of the first
image to the result [9]. The fade-in-fade-out effect
is obtained by processing the source images successively
for different values of f. The data transfer
pattern is identical to the one used in the additive
blending task.
Figure 4. The 64-bit system architecture
Note that the last two tasks require that data from
two sources be combined by the CPU before being
sent to the dynamic area. This overhead is included in the
measured times for the hardware implementation, and
helps to explain the smaller speedups obtained in these
two cases. The additive blending operation is simpler
than the fade effect operation, and hence benefits less
from being implemented in hardware.
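The three operations can be modelled in software as shown below. This is a functional sketch under stated assumptions, not the hardware design: the order of the pixels inside a 32-bit word is assumed (first pixel in the most significant byte; for blending, the two pixels of the first image precede those of the second), f is represented as a floating-point value in [0, 1] for readability, and all function names are hypothetical.

```python
def saturate(value):
    """Clamp a result to the range of an 8-bit unsigned pixel."""
    return max(0, min(255, value))


def unpack4(word):
    """Split a 32-bit word into four 8-bit pixels (most significant first)."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def pack4(pixels):
    """Pack four 8-bit pixels into one 32-bit word."""
    word = 0
    for p in pixels:
        word = (word << 8) | (p & 0xFF)
    return word


def brightness(word, delta):
    """Saturating add of a signed constant to four packed pixels."""
    return pack4([saturate(p + delta) for p in unpack4(word)])


def additive_blend(word):
    """Word carries two pixels of image A, then two of image B (assumed order)."""
    a0, a1, b0, b1 = unpack4(word)
    return [saturate(a0 + b0), saturate(a1 + b1)]


def fade(a, b, f):
    """Fade effect for one pixel pair: (A - B) * f + B, with f in [0, 1]."""
    return saturate(round((a - b) * f + b))
```

For brightness adjustment, one word in yields one word out; for the two-image tasks, each input word yields only two result pixels, which is why the results are packed in groups of four before being read back.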
4 The 64-bit system design
This section describes an implementation of the
generic design from section 2, where the dynamic area
is connected to the 64-bit processor local bus. The
main differences in relation to the 32-bit design are
summarized in section 4.1 and the performance mea-
surements are described in section 4.2.
4.1 Implementation aspects
The design from section 3 had a 32-bit bus, in order
to use fewer resources and have a dynamic area with
a usable size. For the alternative system implementation
presented here, we used a board with a Xilinx
XC2VP30 device and 512 MB of external DDR memory.
This device includes two CPU cores, but only
one is used. The FPGA has 13696 slices (about 2.8
times as many as the previously used device) and
136 internal RAM blocks. The speed grade is also
better (-7).
An overview of the system setup is shown in fig-
ure 4. As for the previous system, this figure corre-
sponds roughly to the actual floorplan of the system.
Again, all modules except the CPU are implemented
on the reconfigurable fabric.
When compared to the previous system, the present
implementation has two main differences: i) the exter-
nal memory controller is located on the 64-bit PLB; ii)
the dynamic region wrapper is also connected to the
processor local bus and has some added functionality
(it is now called the PLB Dock). Minor differences in-
clude the addition of an interrupt controller attached
to the OPB and the absence of the GPIO controller.
The PLB Dock now provides a 64-bit data channel
to the dynamic region. The wrapper is assigned a fixed
range of the PLB address space, and acts like a PLB
master/slave peripheral. Besides performing address
decoding for I/O operations and storing incoming data
(like the OPB Dock implementation), this wrapper has
three additional capabilities:
1. DMA controller: Direct transfers between
memory and PLB Dock are now possible without
CPU intervention.
2. Output FIFO: The results produced by the dynamic
area can be stored in a FIFO for subsequent
DMA transfer to memory.
3. Interrupt generator: The PLB Dock can send
interrupts to the CPU.
The data communication between wrapper and dynamic
region is unchanged, with the obvious difference
that the channels are now 64 bits wide.
The main reason for moving the dynamic region
from the OPB to the PLB was to obtain a better data
transfer performance. Since the CPU does not sup-
port 64-bit wide data transfers at the instruction level
(load and store instructions handle items of size up
to 32 bits), program transfers to/from the dynamic
area cannot directly benefit from the increased data
width. Only transfers that go through the caches use
64-bit transfers. Therefore, communication between
CPU and dynamic region through load and store in-
structions is still made by 32-bit-wide transfers. With-
out further measures the change would only benefit
software implementations.
In order to use the full bus width, the PLB Dock includes
a scatter-gather DMA controller that supports
64-bit transfers. The controller is automatically generated
by the Xilinx development tools. To profit from
the DMA, data transfers to the dynamic area have to
be done as a block; therefore, a memory buffer, the
output FIFO, must be provided to store the results
produced by the dynamic area, before they are sent to
main memory.
Table 6. Resource usage (64-bit system)
Since the CPU is free during DMA transfers, it
can be used for other purposes. To avoid the need
for polling the PLB dock to determine the status of
the transfers, an interrupt generator was added to the
dock, thus requiring the inclusion of an interrupt con-
troller in the design.
As the previous discussion made clear, the perma-
nent circuits implemented on the reconfigurable fab-
ric are larger and more complex for the second design.
Resource usage of the system setup is summarized in
table 6. Since the FPGA device used is faster and
the layout constraints are less severe, the CPU clock
frequency in this case is 300 MHz (vs. 200 MHz in
the previous design) and both the PLB and the OPB
operate at 100 MHz (vs. 50 MHz previously). It is
clear that, without the introduction of DMA transfers,
the design modifications would be more favorable for
software-only implementations.
The dynamic region available in the new version con-
tains 22 BRAMs and 32 × 24 = 768 CLBs, i.e., 3072
slices (22.4% of the total). The use of the remaining
free slices is made more difficult by the presence of the
second CPU core and alternative approaches (like hav-
ing two separate dynamic areas) may be necessary to
put them to use.
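The share of the device taken by the dynamic area in both designs follows directly from the figures given in the text (CLB array dimensions, four slices per CLB, and total slice counts). The short calculation below reproduces it; the function and variable names are hypothetical.

```python
SLICES_PER_CLB = 4  # Virtex-II Pro, as stated in section 3.1


def dynamic_area_share(clb_dim_a, clb_dim_b, total_slices):
    """Return (CLBs, slices, fraction of all slices) for a rectangular area."""
    clbs = clb_dim_a * clb_dim_b
    slices = clbs * SLICES_PER_CLB
    return clbs, slices, slices / total_slices


area_32 = dynamic_area_share(28, 11, 4928)    # XC2VP7 design
area_64 = dynamic_area_share(32, 24, 13696)   # XC2VP30 design
```

The 64-bit design therefore offers about 2.5 times as many slices in the dynamic area (3072 vs. 1232), although the fraction of the device it covers is slightly smaller.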
4.2 Performance assessment
To assess data transfer performance, we again mea-
sured the time necessary to transfer sequences of data
to/from external memory. In this case, two situations
must be considered, according to whether the data
transfers are controlled by the CPU (as is the case with
the 32-bit system) or by the DMA controller.
Table 7. Measured times for 32-bit data transfers
between dynamic region and external
memory (CPU controlled)
Table 8. Measured times for 64-bit data transfers
between dynamic region and external
memory (DMA-controlled)
Table 7 shows the average time taken by
program-controlled transfers: sequences of write operations,
sequences of read operations and sequences of interleaved
write/read operations. In this case, each transfer in-
volves a 32-bit value (as discussed previously). This
operation is the same as the one performed in the 32-
bit system and direct comparison of the values is
legitimate. Transfer times decrease by a factor of 4 to 6,
depending on the transfer type.
Part of this decrease is due to improved bus
speed (a factor of 2) and CPU frequency (a factor of
1.5). The additional improvement presumably comes
from the fact that no PLB-to-OPB bridge is used.
Table 8 shows the average time per transfer when
using DMA. In this method, each transfer involves a
64-bit value, using the data path to the fullest. The in-
terleaved write/read operations are block-interleaved:
the output data is stored in the output FIFO, while the
write operation is being performed; when the FIFO
becomes full, the write operation stops and the data
contained in the FIFO is transferred to the external memory
by a DMA operation. These operations are repeated,
until all the data is transferred. The current output
FIFO stores up to 2047 64-bit values.
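The block-interleaved procedure can be sketched as a simple schedule of alternating DMA phases. The sketch assumes, for illustration only, that the dynamic area produces exactly one 64-bit result per 64-bit input value; the function name and the representation of the schedule as a list of tuples are hypothetical. The FIFO capacity is the one stated above.

```python
FIFO_CAPACITY = 2047  # number of 64-bit values held by the output FIFO


def block_schedule(n_inputs, capacity=FIFO_CAPACITY):
    """List the DMA phases needed to process `n_inputs` 64-bit values."""
    schedule = []
    remaining = n_inputs
    while remaining > 0:
        block = min(remaining, capacity)
        # DMA from memory to the dynamic area; results accumulate in the FIFO.
        schedule.append(("write", block))
        # DMA from the FIFO back to external memory.
        schedule.append(("read", block))
        remaining -= block
    return schedule
```

For example, under this assumption 5000 input values require three write phases and three read phases, the last pair handling the remaining 906 values.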
The times reported in tables 7 and 8 allow the developer
to determine a lower bound for the time required
to use the dynamic area with different transfer methods
and different data widths. This lower bound can be
used to make a first assessment of the improvements
that can be obtained by moving a function from software
to hardware, and to evaluate the gains from using
each of the two data transfer methods.
Table 9. Results for pattern matching in binary
images (64 bit)
Table 10. Results for a hash function implementation
(64 bit)
We took our first two implementation examples from
the 32-bit system, and transferred them to the new
system without any modifications: the data transfers
do not take advantage of the 64-bit bus width and are
controlled by the CPU. Tables 9 and 10 present the
results. Both tasks benefit greatly from the new sys-
tem and both software and hardware implementations
perform considerably better.
In general, the results follow the trends observed
for the transfer times, as expected. In the pattern
matching task, a decrease in the hardware vs. software
speedup is obtained, because the software implementa-
tion benefited more from the quicker access to memory.
The hardware implementations still maintain a consid-
erable performance advantage. The hash value calcu-
lation task, on the other hand, shows only a slightly
better speedup for the hardware implementation.
We also tested the system with the more demand-
ing hash function SHA1 [4]. This hashing algorithm
is geared towards 32-bit implementations. Our imple-
mentation does not fit into the dynamic area of the
32-bit system, so no comparison can be made. The
results of table 11 show a considerable performance gain
for the hardware implementation (using 32-bit
CPU-controlled data transfers). The software implementation
(taken from the RFC document) has a large overhead
for smaller data sets. The overhead’s relative importance
decreases for larger data sets.
Table 11. Results for SHA-1 implementation
Table 12. Results for simple image processing
tasks (64 bit)
A 64-bit DMA-controlled implementation of the im-
age processing tasks presented in section 3.2 was also
made. The results are shown in table 12. For the first
task, there is a clear increase of the speedup obtained
by the hardware (on top of the increased performance
of the software version). The reason for this is that the
64-bit data transfers could be employed without addi-
tional work, since only one image is involved. The other
tasks show a significantly smaller speedup increase, be-
cause the data of the two source images had to be com-
bined by the CPU, before being sent to the dynamic
area. This time overhead appears in the table under the
heading of “data preparation”, and is directly attribut-
able to the constraints of the DMA transfer mode.
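The data preparation step can be pictured as the CPU building a single buffer in which the pixels of the two source images are interleaved, so that one DMA block carries both operands. The layout used below is an assumption made for illustration (four pixels of the first image followed by four of the second in each 64-bit word, by analogy with the two-plus-two arrangement of the 32-bit design); the function name is hypothetical.

```python
def prepare_words(image_a, image_b):
    """Interleave two equally long 8-bit pixel sequences into 64-bit words."""
    if len(image_a) != len(image_b) or len(image_a) % 4 != 0:
        raise ValueError("images must have equal length, a multiple of 4")
    words = []
    for i in range(0, len(image_a), 4):
        word = 0
        # Assumed layout: 4 pixels of A in the upper half, 4 of B in the lower.
        for p in list(image_a[i:i + 4]) + list(image_b[i:i + 4]):
            word = (word << 8) | (p & 0xFF)
        words.append(word)
    return words
```

This pass touches every pixel of both images before any transfer starts, which is the cost reported as data preparation; the single-image brightness task needs no such pass.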
5 Conclusion
The paper discusses several issues that arise when
trying to exploit effectively the run-time reconfiguration
capabilities of platform FPGAs. Two implementations
of the same general approach are presented and
compared with the help of several small studies. The
issues associated with data transfers between embedded
CPU and dynamically reconfigured circuits are shown
to contribute significantly to overall performance.
For the systems considered in this work, the use of
64-bit data transfers is hampered by the fact that the
CPU does not support programmatic 64-bit data trans-
fers; only transfers that go through the caches profit
from the higher bus width. In order to use the 64-bit
bus width effectively to communicate with the dynamic
area, it is necessary to employ DMA transfers. These,
however, pose significant restrictions on data organiza-
tion and access patterns, making the adaptation of the
software-based algorithms more difficult and time-consuming.
When the difficulties can be overcome,
significantly better performance can be achieved. It is also
important to note that the changes from the first to
the second system affect hardware and software in a
different manner, so the relative merits of using a pure
software approach vs. a combined hardware/software
solution also change.
References
[1] B. Blodget, C. Bobda, M. Hübner, and A. Niyonkuru.
Partial and dynamically reconfiguration of
Xilinx Virtex-II FPGAs. In Proceedings FPL’04, pages
801–810, 2004.
[2] E. Carvalho, N. Calazans, E. Brião, and F. Moraes.
PaDReH: a framework for the design and implementation
of dynamically and partially reconfigurable systems.
In Proceedings SBCCI’04, pages 10–15, 2004.
[3] K. Diefendorff, P. Dubey, R. Hochsprung, and
H. Scales. AltiVec extension to PowerPC accelerates
media processing. IEEE Micro, 20(2):85–95, 2000.
[4] D. Eastlake and P. Jones. RFC 3174 — US Secure
Hash Algorithm 1 (SHA1). RFC Editor, Sept. 2001.
[5] J. C. Ferreira and M. M. Silva. Run-time reconfigura-
tion support for FPGAs with embedded CPUs: The
hardware layer. In Proceedings RAW’05, Denver, Col-
orado, Apr. 2005.
[6] E. L. Horta, J. W. Lockwood, and S. T. Kofuji. Using
PARBIT to implement partial run-time reconfigurable
systems. In Proceedings FPL’02, pages 182–191, Lon-
don, UK, 2002. Springer-Verlag.
[7] IBM. The CoreConnect bus architecture, Sept. 1999.
[8] B. Jenkins. Hash functions. Dr. Dobb’s Journal,
22(9):107–109, Sept. 1997.
[9] A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for
multimedia PCs. Commun. ACM, 40(1):24–38, 1997.
[10] A. K. Raghavan and P. Sutton. JPG–a partial bit-
stream generation tool to support partial reconfigu-
ration in Virtex FPGAs. In Proceedings IPDPS’02,
page 192, Washington, DC, USA, 2002. IEEE Com-
puter Society.
[11] P. Sedcole, B. Blodget, J. Anderson, P. Lysaght, and
T. Becker. Modular partial reconfiguration in Virtex
FPGAs. In Proceedings FPL’05, pages 211–216, 2005.
[12] M. L. Silva and J. C. Ferreira. Generation of
hardware modules for run-time reconfigurable hybrid
CPU/FPGA systems. In Proceedings DCIS’05, Lis-
boa, Portugal, Nov. 2005.
[13] M. Ullmann, M. Hübner, B. Grimm, and J. Becker.
An FPGA run-time system for dynamical on-demand
reconfiguration. In IPDPS’04, page 135a. IEEE Computer
Society, 2004.
[14] Xilinx. Two flows for partial reconfiguration: Module
based or small bit manipulations. Application note 290,
Sept. 2004.
