A real-time capable dynamic partial reconfiguration system for an application-specific soft-core processor by Kirchhoff, Michael et al.
  
 
TU Ilmenau | Universitätsbibliothek | ilmedia, 2019 
http://www.tu-ilmenau.de/ilmedia 
Kirchhoff, Michael; Kerling, Philipp; Streitferdt, Detlef; Fengler, Wolfgang: 
A real-time capable dynamic partial reconfiguration system for an application-
specific soft-core processor 
 
Original published in: International journal of reconfigurable computing. - New York, NY [u.a.] : 
Hindawi Publ. Corp.. - (2019), art. 4723838, 14 pp. 
Original published: 2019-08-31 
ISSN: 1687-7209 
DOI: 10.1155/2019/4723838 
[Visited: 2019-11-13] 
 
   
This work is licensed under a Creative Commons Attribution 4.0 
International license. To view a copy of this license, visit  
http://creativecommons.org/licenses/BY/4.0/ 
 
Research Article
A Real-Time Capable Dynamic Partial Reconfiguration
System for an Application-Specific Soft-Core Processor
Michael Kirchhoff,1 Philipp Kerling,1 Detlef Streitferdt,2 and Wolfgang Fengler1
1Technische Universität Ilmenau, Group for Computer Architecture and Embedded Systems, Ilmenau, Germany
2Technische Universität Ilmenau, Group for Software Systems and Process Informatics, Ilmenau, Germany
Correspondence should be addressed to Michael Kirchhoff; michael.kirchhoff@tu-ilmenau.de
Received 15 April 2019; Revised 30 July 2019; Accepted 31 August 2019; Published 22 September 2019
Academic Editor: John Kalomiros
Copyright © 2019 Michael Kirchhoff et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Modern FPGAs (Field Programmable Gate Arrays) are becoming increasingly important when it comes to embedded system
development. Within these FPGAs, soft-core processors are often used to solve a wide range of different tasks. Soft-core processors
are a cost-effective and time-efficient way to realize embedded systems. When using the full potential of FPGAs, it is possible to
dynamically reconfigure parts of them during run time without the need to stop the device. This feature is called dynamic partial
reconfiguration (DPR). If the DPR approach is to be applied in a real-time application-specific soft-core processor, an architecture
must be created that ensures strict compliance with the real-time constraint at all times. In this paper, a novel method that
addresses this problem is introduced, and its realization is described. In the first step, an application-specializable soft-core
processor is presented that is capable of solving problems while adhering to hard real-time deadlines. This is achieved by the full
design time analyzability of the soft-core processor. Its special architecture and other necessary features are discussed.
Furthermore, a method for the optimized generation of partial bitstreams for the DPR as well as its practical implementation in a tool
is presented. This tool is able to minimize given bitstreams with the help of a differential frame bitmap. Experiments that realize
the DPR within the soft-core framework are presented, with respect to the need for hard real-time capability. Those experiments
show a significant resource reduction of about 40% compared to a functionally equivalent non-DPR design.
1. Introduction
Increasing performance with simultaneous miniaturization
and the corresponding increase in the integration density
of microelectronic circuits allow the realization of more
complex and efficient embedded systems. These can solve
problems that previously could only be solved with
large computing systems. At the same time, it is also
necessary to develop new methods and approaches that can
adapt to the more complex development and scale with
increasing complexity. A major problem that occurs in this
context is that system properties are negatively coupled. If, for
example, the computing power has to be increased, the
form factor usually also increases. For this reason and to
achieve the best possible adaptation to the problem, most
embedded systems are developed from scratch, which is
very time and cost intensive. As an alternative approach,
soft-core processors provide an excellent compromise
between task-specific problem adaptation and reusability
and thus cost reduction. In this paper, we present an
application-specific soft-core architecture addressing this
approach in Section 5. In order to maximize the reusability
and therefore minimize the costs for each new project, a
compatible tool-chain is presented in Section 5.2, which is
designed to support the hard real-time compliance of the
soft-core processor.
There are special requirements for the real-time application
field addressed in this article that prevent the use of
general soft-core processors. For the calculation of real-time
critical tasks such as control algorithms or real-time critical
image processing, it is necessary to guarantee full timing
analyzability at design time. Section 5.1 is dedicated to
the special adjustments needed in order to achieve full
design-time analyzability.
A prime example for this field of application is image
processing as part of driver-assistance systems like
AutoVision [1]. In this application, soft-core processors are
used as coprocessors for real-time image analysis of, e.g.,
road tracks. Since the requirements for algorithms and
processing units in driver-assistance systems can change
significantly over time, both processing hardware and
software must be adapted. An example is a transition from a
sunny to a dark road when the car enters a tunnel. This task
can be solved with conventional embedded systems only
with difficulty and with very complex hardware.
This is where a big advantage of FPGAs emerges: Their
dynamic partial reconfiguration (DPR) capability enables
the exchange of parts of the logic at run time and thus allows
a problem-specific dynamic adaptation to changing conditions.
This makes it possible to fully utilize the potential of FPGAs
compared to similar embedded solutions without DPR. This
article combines the advantages of soft-core processors with
the possibilities of DPR and presents a novel method for
efficient implementation and problem-based realization.
After an overview of related work in the respective areas has
been given in Section 3, a method for creating partial
bitstreams is presented in Section 4. A real-time capable
soft-core processor is introduced in Section 5, and the necessary
architectural decisions to ensure compliance with hard
real-time limits are discussed. Subsequently, the practical
implementation of the adaptation of the processor to the
DPR feature is presented in Section 5.3 in conjunction with a
complete DPR system design (Section 6) that delivers partial
bitstreams via a custom partial reconfiguration controller
(PRC). The DPR system is presented in Section 7, and the
PRC is explained in Section 7.2. Selected parts of the
experiments are carried out and their results are analyzed in
Section 8 before the article concludes with a summary and an
outlook on future work.
2. Background
The research presented in this paper is performed on a Xilinx
device of the Zynq-7000 family [2]. It is a System-on-a-Chip
(SoC) that consists of an ARM Cortex-A9 based hard-core
processor and a Kintex-7 series FPGA. The processor (also
referred to as Application Processing Unit (APU)),
peripherals, and bus interconnect including memory
controllers form the Processing System (PS). The FPGA fabric
and all supporting circuitry are collectively called
Programmable Logic (PL). PS and PL can communicate using
several AXI buses that also allow direct access to the PS
memory controller.
DPR requires partial bitstreams to be sent to the FPGA
that only reconfigure a part of the device (the dynamic logic)
while another part (the static logic) continues to operate. The
flow of the Xilinx Vivado development environment used
here requires defining at least one reconfigurable partition
(RP) to host dynamic logic in the form of exchangeable
reconfigurable modules (RMs). Furthermore, each RP
requires the assignment of a resource Pblock, which is a
defined area of logic resources such as logic slices and DSPs
in the FPGA fabric. DPR functionality from within the
device (self-reconfiguration) is available in the Zynq via
dedicated configuration interfaces in PS and PL. In this
paper, the Internal Configuration Access Port (ICAP) in the
PL is used. The ICAP can be instantiated as a hardwired
primitive in hardware designs and receives partial bitstreams
in 4-byte units at a maximum frequency of 100 MHz, thereby
achieving a maximum throughput of 400 MB/s. This
device-specific parameter is the main factor limiting the speed of
DPR since it is a result of the capabilities of the hardware and
cannot easily be influenced.
Sending bitstreams (raw streams of commands and data
for the FPGA configuration controller) to the ICAP puts
configuration data into 1-bit SRAM cells that define the
behavior of the FPGA. These cells can only be addressed in
configuration frames with the size of 3,232 bits.
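As a rough illustration of these figures, the following Python sketch derives the ICAP peak throughput from the word width and clock frequency stated above and estimates the best-case time to write a number of configuration frames. The function names and the example frame count are assumptions made for illustration; bitstream command overhead and memory latency are ignored, and MB is taken as 10^6 bytes.

```python
# Device parameters as stated in the text.
ICAP_WORD_BYTES = 4            # the ICAP receives data in 4-byte units
ICAP_CLOCK_HZ = 100_000_000    # maximum ICAP frequency (100 MHz)
FRAME_BITS = 3232              # size of one configuration frame


def icap_throughput_bytes_per_s() -> int:
    """Peak ICAP throughput: one word per clock cycle."""
    return ICAP_WORD_BYTES * ICAP_CLOCK_HZ


def best_case_write_time_s(num_frames: int) -> float:
    """Lower bound on the time to stream `num_frames` frames of payload."""
    payload_bytes = num_frames * FRAME_BITS // 8
    return payload_bytes / icap_throughput_bytes_per_s()


if __name__ == "__main__":
    print(icap_throughput_bytes_per_s() / 1e6, "MB/s")
    # 1000 frames is a hypothetical partition size, not a value from the paper.
    print(best_case_write_time_s(1000) * 1e6, "microseconds for 1000 frames")
```

Because the ICAP rate is fixed by the hardware, the only remaining lever in such an estimate is the number of frames that have to be written.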
3. Related Work
In this section, we give an overview of other relevant
attempts at both using DPR capabilities in soft-core processors
and implementing partial reconfiguration controllers.
Most similar to the research at hand, exchanging integer
execution units in the open-source LEON3 processor is
discussed in [3]. The LEON3 is a customizable 32-bit RISC
microprocessor suitable for general-purpose computing
applications. It is not designed for hard real-time tasks. That
paper mainly presents an area, energy, and power
analysis showing that savings in these aspects can be
achieved by DPR in practice. It does not discuss
performance aspects or the design of the DPR system and PRC in
detail.
Another good example of an application-specific
instruction set processor (ASIP) is the Invasive Core (i-Core)
[4]. This run-time reconfigurable processor works with two
separate instruction sets (IS): a static (permanently available)
IS and a task-specific (dynamically reconfigurable) IS. The
i-Core is embedded into a heterogeneous loosely-coupled
multicore system which is partitioned into tiles that consist
of processor cores and local shared memory.
Moving on to PRCs, those utilizing the ICAP are often
designed specifically for usage in combination with a
general-purpose CPU that provides commands and/or
bitstream data. For this purpose, soft-cores such as the Xilinx
MicroBlaze [5] included in the Vivado design suite can be
used in combination with a peripheral on an AXI bus that
enables access to the ICAP. The most basic design will use
the AXI HWICAP IP core by Xilinx [6] (or similar) that is
essentially a low-level bridge between the AXI bus and the
ICAP. All control logic is implemented in software (cf.
Figure 1). The MicroBlaze processor is in complete control
of the reconfiguration process and responsible for fetching
bitstream data from the memory and forwarding it to the
ICAP.
More advanced approaches like the RT-ICAP [7] move
some aspects of the process out of the CPU and provide
additional features such as bitstream decompression but still
offer an interface that is best suited for interoperating with a
processor. There are several similar designs along these lines
[8–11]. A variant of this approach is to provide integrated
extensibility of a general-purpose processor via DPR such as
in [12]. Specifically for the Zynq SoC, it is possible to use the
APU to control the reconfiguration process, which is done,
e.g., by the ZyCAP [13].
Although the Zynq includes the Processor Configuration
Access Port (PCAP) to stream configuration data directly
from the APU, its performance is inferior to the ICAP.
Xilinx specifies a typical throughput of the PCAP of around
145 MB/s on the Zynq-7000 [14]. Experiments suggest that
the realistically achievable PCAP data rates are a bit lower in
practice at around 128 MB/s [13, 15]. Since the
reconfiguration time is directly related to the throughput of the access
port and the size of the bitstream, it is necessary to choose
the fastest possible access port. This way, the architecture will
meet stricter timing constraints which are a consequence of
the real-time requirements. For this reason, the ZyCAP, for
example, does not make use of the PCAP, but its general approach still
has the disadvantage of restricting the design to the Zynq
SoC family as pure FPGAs do not contain a comparable
hard-core CPU.
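The relation between access port throughput, bitstream size, and reconfiguration time can be made concrete with a small Python sketch. The throughput values are the ones quoted in the text; the bitstream size and the deadline in the example are hypothetical, the helper names are assumptions, and MB is taken as 10^6 bytes.

```python
# Throughput figures quoted in the text, in MB/s.
PORT_THROUGHPUT_MB_S = {
    "ICAP (maximum)": 400.0,
    "PCAP (typical, per Xilinx)": 145.0,
    "PCAP (measured)": 128.0,
}


def reconfiguration_time_ms(bitstream_bytes: int, throughput_mb_s: float) -> float:
    """Idealized transfer time: size divided by sustained throughput."""
    return bitstream_bytes / (throughput_mb_s * 1e6) * 1e3


def ports_meeting_deadline(bitstream_bytes: int, deadline_ms: float) -> list:
    """Names of all ports whose idealized transfer time fits the deadline."""
    return [
        name
        for name, rate in PORT_THROUGHPUT_MB_S.items()
        if reconfiguration_time_ms(bitstream_bytes, rate) <= deadline_ms
    ]


if __name__ == "__main__":
    size = 1_000_000   # hypothetical partial bitstream of 1 MB
    deadline = 5.0     # hypothetical reconfiguration deadline in ms
    for name, rate in PORT_THROUGHPUT_MB_S.items():
        print(name, round(reconfiguration_time_ms(size, rate), 3), "ms")
    print("meets deadline:", ports_meeting_deadline(size, deadline))
```

Under these assumed numbers only the ICAP stays within the deadline, which mirrors the argument for choosing the fastest available port.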
There are other existing PRCs proposed in the literature
that are designed for standalone usage in hardware designs
without any supporting processor [16–19]. Unfortunately,
their design files are not publicly available, with the
exception of [20]. Additionally, they do not take real-time
constraints into account and often do not come close to the
device limits in terms of configuration speed (or require
bitstreams to fit in FPGA-internal RAM). All of these PRCs
furthermore use custom or older protocols for accessing
bitstreams and do not support the standardized AXI bus.
Xilinx itself also offers a closed-source partial
reconfiguration controller [21] as part of its IP core library. The
timing and speed characteristics are not specified at all. This
turns the controller essentially into a black box, not suitable
for hard real-time systems.
Previous research has leveraged the idea of streaming
bitstream data from external RAM [13, 20, 22, 23]. This
makes it possible to reach very high reconfiguration speeds without
having to store the complete bitstream in BRAM, which is
usually not desirable in terms of FPGA resource usage and
constraints.
The concept of DPR (in combination with PRC) is used
in the AutoVision architecture [1] to dynamically adapt the
integrated soft-core coprocessors to changing conditions. A
driver-assistance system is implemented which allows the
scenery to be analyzed in real time with the help of
coprocessors enabling a hardware acceleration for the image
processing algorithms. The requirements of the algorithms
can change significantly over time (e.g., when entering a
tunnel on a sunny day), which makes an adjustment
necessary. This, in turn, requires adjustment of the hardware
acceleration which is carried out by using DPR. AutoVision
guarantees compliance with the real-time deadlines for
image analysis.
4. The torCombitgen Tool
Partial bitstreams required for DPR can be generated in
several ways despite the result having the same function and
effect. The most important ones are briefly presented in the
course of Section 4.1, followed by the torCombitgen
approach combining the advantages of these methods in
Section 4.2.
4.1. Reconfiguration Flows. The typical generation that
Vivado will perform when using the partial reconfiguration
flow uses a module-based approach that writes one bitstream
per reconfigurable module containing the data for all
configuration frames associated with the RP and ignores
everything else in the design [24]. All frames inside those
partitions are included unconditionally.
A completely different approach oblivious of individual
modules is difference-based generation [25, 26]. It takes
exactly two full top-level bitstreams including the static
design and one or more RPs as input and compares all
configuration frames to find which ones differ from each
other. Only those are written to a partial bitstream file.
Compared to the module-based method, this has the
advantage that potentially fewer frames have to be written in
case there are frames that are identical between the two
analyzed designs. All frames containing only static logic will
therefore be excluded, so the bitstream is never larger than
an equivalent one generated by the module-based approach.
Identical frames in dynamic logic are most likely to be
found when logic is unused since dormant resources always
have the same representation in configuration data.
Additionally, smaller parts of logic that are functionally the same
in multiple RMs might be removed, but it is still an open
research question if this has a measurable impact in practice:
Minute changes in any part of an RM might cause the placer
to position all logic at an entirely different location within
the Pblock. As a consequence, configuration data would be
vastly different also in other unchanged parts, and the
chance of encountering identical frames would be severely
diminished.
When three or more RMs have to be considered, a
bitstream must be generated for every pair of top-level
bitstreams so that it is possible to switch from any RM to
every other one. The number of partial bitstreams necessary
for n configurations is $n(n-1) = n^2 - n$; i.e., it is quadratic
in n. The number of bitstreams quickly becomes
unmanageable with higher n.
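The quadratic growth can be checked by enumerating the ordered pairs of configurations, since a difference-based partial bitstream is needed for each transition from one RM to another. The following Python sketch is only an illustration; the function name and the module labels are assumptions.

```python
from itertools import permutations


def difference_based_transitions(modules):
    """All ordered (source, target) pairs that each need their own partial bitstream."""
    return list(permutations(modules, 2))


if __name__ == "__main__":
    for n in range(2, 7):
        modules = [f"RM{i}" for i in range(n)]  # hypothetical module names
        pairs = difference_based_transitions(modules)
        # The module-based flow needs n bitstreams, one per module.
        print(n, "modules:", len(pairs), "difference-based vs.", n, "module-based")
```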
Figure 1: Exemplary structure of a microprocessor-based DPR
system using the HWICAP solution by Xilinx. (Blocks shown: MicroBlaze
processor, memory controller, memory, AXI_HWICAP, ICAP, AXI bus.)
4.2. Optimized Difference-Based Bitstream Generation.
Claus et al. have suggested another approach in [26] that
effectively combines the advantages of the module-based
and the difference-based generation. Their tool Combitgen
represents an approach in between difference-based
generation that requires many bitstreams and module-based
generation that does not consider potential size savings by
identical frames. Since it was developed only for the now
obsolete Xilinx Virtex-II family, it cannot be used
on modern devices. torCombitgen, developed by the
authors, is the natural successor to Combitgen. It makes it possible to
use the advantages of Combitgen on contemporary Xilinx
FPGAs. Since torCombitgen uses the open-source
framework Tools for Open Reconfigurable Computing (Torc) [27]
for reading, manipulating, and writing Xilinx bitstreams, all
FPGA families implemented in Torc are supported. It can
also be adapted by the community to future FPGAs.
As can be seen in Figure 2, torCombitgen takes multiple
full bitstreams as input, reads the contained data into a
memory array representing the state of the FPGA
configuration cells, and marks frames that are identical in each and
every one of them in a differential frame bitmap. Identical
frames are removed from each individual bitstream to
produce one partial bitstream per top-level one. Similarly to
the difference-based approach, the static logic is guaranteed
to disappear as it must be the same in every input.
On the one hand, if no identical frames can be identified,
the generated bitstreams will be exactly equal to the output
that the module-based approach would have produced. On
the other hand, if there are two input bitstreams, this method
is exactly equal to the difference-based approach. As such,
the size of the bitstreams should lie between these two
extremes.
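The principle of the differential frame bitmap can be sketched in a few lines of Python. This is a toy model and not the torCombitgen implementation: each bitstream is assumed to be a list of equally indexed frames given as bytes, whereas real bitstreams contain commands and frame addresses that are handled via Torc. All names below are hypothetical.

```python
def differential_frame_bitmap(bitstreams):
    """Entry i is True if frame i is identical in every input bitstream."""
    first = bitstreams[0]
    return [
        all(bs[i] == first[i] for bs in bitstreams)
        for i in range(len(first))
    ]


def partial_bitstreams(bitstreams):
    """One partial bitstream per input, keeping only frames not marked identical."""
    bitmap = differential_frame_bitmap(bitstreams)
    return [
        {i: frame for i, frame in enumerate(bs) if not bitmap[i]}
        for bs in bitstreams
    ]


if __name__ == "__main__":
    # Frames 0 and 1: static logic; frame 3: unused in all designs.
    a = [b"S0", b"S1", b"A2", b"\x00", b"X4"]
    b = [b"S0", b"S1", b"B2", b"\x00", b"X4"]
    c = [b"S0", b"S1", b"C2", b"\x00", b"Y4"]
    print(differential_frame_bitmap([a, b, c]))
    print(partial_bitstreams([a, b, c]))
    # With exactly two inputs the result matches the difference-based approach.
    print(partial_bitstreams([a, b]))
```

In the three-input example, frame 4 stays in the partial bitstreams of the first two designs although it is identical between them, because it differs in the third. This is why the resulting size lies between the module-based and the difference-based extremes.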
Claus et al. report an experimentally identified time
saving of 3 to 4% in an exemplary design that consisted of a
clock divider with three different division factor options.
This seems to be due to an additional optimization that
Combitgen performs on the bitstream by converting regular
frame write commands into multiple frame writes, which makes it
possible to remove one frame of padding per write on the target
platform.
5. The ViSARD Soft-Core Processor
The ViSARD (VHDL Integrated Soft-Core Architecture for
Reconfigurable Devices) is a hard real-time
application-specific soft-core processor, which can be configured from a
generic customizable soft-core library. An early state of the
ViSARD has already been published in [28]. Data can be
entered or returned in any numerical format; internal
calculations are performed in IEEE floating-point format [29]
to ensure maximum accuracy. The processor can currently
operate in single-precision (32 bit) or double-precision
(64 bit) mode. The range of possible operations includes
more than 30 instructions. The soft-core was designed in a
way that the arithmetic logic unit (ALU) is fully modular and
can be adapted at any time to an extension or replacement of
the hardware execution units (EUs). That means the scope of
functions can be tailored to a specific problem. The
ViSARD uses a 5-stage pipeline, and every EU is internally
pipelined as well.
Figure 3 shows the architecture of the ViSARD soft-core
processor in the multicore configuration. This processor
consists of a data path (blue) and a control path (orange),
which are described in detail in this section. It has to be
mentioned that, for reasons of clarity, in Figure 3, only one
core is illustrated in detail. Each core communicates
exclusively via the shared memory with other cores, and the
architecture of every core is identical (except for the ALU).
The control path checks the current state of the execution
of the given program code in each clock cycle via the
program counter in combination with a RAM module
(DPRAM (Program)) that stores the actual instructions. This
program is set during design time for each core by the
assembler (explained in detail in Section 5.1). The program
counter passes the current program address to the program
memory in each clock cycle. There, the next machine
instruction is read and passed on to the decoder. This
represents the instruction fetch. The next step, the instruction
decode, is executed subsequently. All necessary addresses
and control flow signals are passed on to the respective
modules in this step. The loading of the necessary data, i.e.,
the content of the memory cells, for the operations and the
transfer of the data to the ALU is done in the operand fetch
step. Finally, the calculated result is stored in the local and/or
shared memory. A special feature in this processing chain is
the set/reset module. This module can manipulate the program
counter, which makes it possible to realize a hardware
loop. Since this is a crucial part to ensure the hard real-time
ability of the soft-core processor, it is explained in detail in
Section 5.1.
There are several optimization modules realized in the
soft-core. The memory activation module (Data RAM
enable) is one of those optimization modules and part of the
control path. The task of this module is to switch off the clock
of the respective data memory with the help of the
information from the decoder, as long as neither a read nor
a write operation is executed on it. This mechanism aims
to reduce the power usage of the architecture. Another
module that aims to minimize the energy consumption is the
ALU OP enable module. It stores information about which
values are currently being calculated in the pipeline of
each EU in the ALU. If the pipeline of an EU is
empty, the module will turn off that EU, further reducing
power consumption.
The main part of the data path is the ALU (fpALU). It is
designed in a way that all EUs are connected to the data
links. A multiplexer within the ALU decides at which clock
cycle the results of which execution units are stored in
memory. Two local memories (Data A and Data B) and an
optional shared memory are used. The memory is flattened,
which means that there is only one memory level used. Other
memory levels such as caches, working storage or hard disks
are deliberately omitted. Memory and register are therefore
identical, meaning that each memory cell corresponds to one
register. Dual-port RAM (DPRAM) is used in each memory
module, resulting in a total of two identical DPRAM
modules (Data A and Data B) and an optional third memory
module. With this architecture, it is possible to read and
write each memory module in every clock cycle. Because the
two memory modules are identical at any time, the soft-core
is able to read up to two arguments for an EU and therefore
is able to start a new operation in each clock cycle.
Additionally, a result can be stored in each clock cycle. External
inputs can be written to the memory via the data path using
corresponding data input ports. Since these must be
addressed by an assembly instruction, the data path goes
through the ALU. The shared memory is only used in the
case of a multi-soft-core configuration of the ViSARD and is
not identical to the processor-internal memory. Due to the
architecture, a maximum of one argument per clock cycle
can be replaced. If neither of the two required arguments is
available in the processor's internal memory, the content of
one of the needed memory cells is copied from the shared
memory into the processor's internal memory with the help
of a preceding instruction (inserted at design time), and the
actual operation is started afterwards. The last module of the
data path is the output module. It contains a number of output
registers that is defined at design time and is used to pass the
results of the soft-core to the surrounding logic. The output
registers can be set independently using an assembly
instruction.
Figure 3: Schematic of one core of the ViSARD multi-soft-core processor (based on [30]).
Figure 2: Functionality of the torCombitgen tool. (Stages shown: top-level
bitstreams, pseudo-FPGA memory with differential frame bitmap, partial bitstreams.)
As depicted in Figure 3, the architecture of the soft-core
processor is designed for modular application-specific
adaptation. The functionality of the ALU can be specially
adapted according to the task, and multicore architectures
can be realized with the shared memory. The fully modular
design also makes it possible to exchange individual modules
over time using the DPR approach. The exchange of modules
using DPR is called time multiplexing, and necessary
adaptations are described in Section 5.3.
5.1. Hardware Loop and Jump Instructions. One of the main
differences between the ViSARD and general soft-core
processors is the hard real-time capability. This
characteristic is derived from the field of application in which the
processor has to operate. The processor is used in domains in
which real-time constraints must be met, such as control
loops or real-time image processing. As previously
mentioned in the introduction, a very good real-world example
to clarify the hard real-time demands is driver assistance
systems like AutoVision [1]. In this system, a camera records
a video with 31 frames per second. This means that the
hardware has to process 31 pictures in a second, resulting in
a total time frame per picture of about 32.26 ms. If such a
picture is comprised of, for example, 1024 × 1024 pixels, the
soft-core has a deadline of 30.76 ns for each pixel.
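The deadline arithmetic of this example can be reproduced with a short Python sketch. The function names are assumptions chosen for illustration; the frame rate and resolution are the values given in the text.

```python
def per_frame_deadline_s(frames_per_second: float) -> float:
    """Time budget for one picture."""
    return 1.0 / frames_per_second


def per_pixel_deadline_s(frames_per_second: float, width: int, height: int) -> float:
    """Time budget for one pixel if a full picture must be finished per frame period."""
    return per_frame_deadline_s(frames_per_second) / (width * height)


if __name__ == "__main__":
    print(round(per_frame_deadline_s(31) * 1e3, 2), "ms per picture")
    print(round(per_pixel_deadline_s(31, 1024, 1024) * 1e9, 2), "ns per pixel")
```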
In order to ensure this real-time characteristic, some
adjustments were needed both in the assembly code and in
the architecture:
The use of conditional jump instructions in the assembly
code is prohibited and thus not supported by the processor.
However, this limitation is an insignificant disadvantage in
the addressed domain, since the tasks realized in this domain
are fixed at design time, and mechanisms can be
implemented to replace the corresponding jumps without
showing a negative effect.
When programming at higher abstraction levels, for
example, at model level, only loops with a fixed number of
iterations may be used. These can then be unrolled by a
compiler or assembler at design time. Since all information
about these loops (e.g., how many iterations they have to
perform) is available during design time, every loop can be
unrolled, and thus no conditional jumps are needed. This
information is available during design time because the
algorithm to be realized does not change over time (e.g.,
control loop tasks) and consequently has a constant number
of iterations for each loop. To avoid overly long assembly
code and to avoid unrolling every loop in the assembly code,
a hardware loop was implemented in the ViSARD. This loop
is controlled by the set/reset module. The program counter is
set to a certain value at fixed clock cycles and thus jumps to
the corresponding position in the code, realizing hardware
loops. Furthermore, the scheduling is performed at design
time by an optimizing assembler [30, 31] and not at run time
by the soft-core itself. Every time a loop is not unrolled and a
hardware loop is realized, the information on where the loop
starts (program counter), where it ends, and how often it
should be run is provided by the optimizing assembler to
the set/reset module during design time. The module is
therefore able to realize these jumps during run time.
With these novel mechanisms, it is possible to perform a
cycle-accurate timing analysis of the processor at design time
and to determine the exact computing time, thereby en-
suring compliance with the hard real-time barrier.
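The idea of a hardware loop driven by fixed clock cycles can be illustrated with a strongly simplified Python model. It assumes one instruction fetch per clock cycle, a single loop, and no pipeline effects; the function names and the way the jump schedule is derived are assumptions for illustration and do not describe the actual set/reset module.

```python
def jump_schedule(loop_start: int, loop_end: int, iterations: int) -> list:
    """Clock cycles (0-based) at which the program counter is set back to loop_start.

    Computed entirely at design time from loop start, loop end, and iteration count.
    """
    body = loop_end - loop_start + 1
    # The first pass reaches loop_end in cycle loop_end; each later pass takes `body` cycles.
    return [loop_end + k * body for k in range(iterations - 1)]


def run(program_length: int, loop_start: int, loop_end: int, iterations: int) -> list:
    """Return the sequence of program addresses fetched, one per clock cycle."""
    schedule = set(jump_schedule(loop_start, loop_end, iterations))
    pc, cycle, trace = 0, 0, []
    while pc < program_length:
        trace.append(pc)
        # No condition is evaluated on data: the jump depends only on the cycle count.
        pc = loop_start if cycle in schedule else pc + 1
        cycle += 1
    return trace


if __name__ == "__main__":
    trace = run(program_length=6, loop_start=2, loop_end=4, iterations=3)
    print(trace)
    print("total cycles:", len(trace))
```

Because the jump cycles are fixed before execution, the total cycle count in this model is known in advance as the program length plus (iterations - 1) times the loop body length, which is the property that a design-time timing analysis relies on.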
5.2. The ViSARD Tool-Chain. To be able to use the soft-core
efficiently within the scope of a new embedded system, a
processing chain, from now on referred to as the tool-chain,
is necessary. This tool-chain will be briefly introduced in the
following section. In every new project, requirements and
constraints have to be defined, see Figure 4. With this input,
it is possible to describe the target application that needs to
execute an embedded algorithm with regard to the hard
real-time characteristics and massive parallelism requirements.
The application sets a scenario, and the user can model the
problem with the first part of the tool-chain, the model-based
assembly code generator, as can be seen in Figure 4. This
MATLAB/Simulink-based tool, which was already
published in [32], gives the user the ability to realize any given
algorithm without special knowledge of any programming
language. The user simply drags and drops blocks that realize
the needed functionality and connects them as desired. With
the help of this tool, it is possible to realize any algorithm as a
data-flow model. As soon as the user-defined model-based
algorithm is finished, it is possible to automatically generate
the special assembly code needed to run the ViSARD
soft-core. In addition, different optimizations can be used to
optimize the graph generated by the model (or even the
model itself) by replacing blocks with faster equivalent logic
or removing dead parts in the model, resulting in a shorter
assembly code. It is also possible to minimize the usage of
variables of the generated assembly code. This minimization
is necessary in order to create a suitable assembly code for
further processing. For detailed information on all
optimizations that are performed by the model-based assembly
code generator, refer to [32].
After the assembly code is generated, it is then translated
using an optimizing assembler. This assembler translates any
given assembly code to machine code for the soft-core. In the
course of this translation process, an optimized schedule is
computed that makes optimal use of the ViSARD-internal
pipelines and thus minimizes the computing time. The
assembler also determines the time points required for setting
up the hardware loop. In case of a multicore setup such as a
dual core, the assembler must be provided with all information on
which EUs are available on which cores. Additionally, the
assembler must know when any core may be reconfigured
and what EUs are available before or after the
reconfiguration. With this information, it can determine an optimized
scheduling for each core, maximizing the pipeline usage and
thus minimizing the execution time, while maintaining full
design time analyzability.
Based on the requirements of the application and the
operations used in the assembly code, the ViSARD EU library
is adapted to the given task. The resulting ViSARD
incorporates only the required EU modules, resulting in a
minimized application-speciﬁc soft-core.
5.3. Adaptation of the ViSARD to DPR. To use the full potential
of FPGAs, a method for the dynamic partial reconfiguration (DPR)
of the ViSARD soft-core had to be developed. For this purpose,
adjustments had to be made in the architecture of the processor,
which are discussed briefly in this section. DPR is well suited
as an optimization in the addressed domain because of the hard
real-time deadline itself. Finishing far ahead of this deadline
does not bring any advantages, because real-time systems
generally have to be only as fast as absolutely necessary and
not as fast as possible. By time-division multiplexing EUs or
even entire soft-core processors, the computing time is
potentially increased, but a large amount of resources can be
saved on the FPGA. In this domain, a reduction of the FPGA
resources required by the processor is always worth more than
faster processing of the algorithm. In some cases, even a
smaller and therefore cheaper FPGA can be used with this method.
As the ViSARD was not initially designed with DPR in
mind, system and component reset and enable signals had to
be included to deal with startup in an undeﬁned state after a
reconfiguration. This also ensured that no spurious writes to
the memory or state changes occur while the core or EUs are
being exchanged. Furthermore, the ability to accommodate
multiple diﬀerent programs in the internal memories
(program and data) had to be added. Finally, as all EUs that
are to be exchanged have to be contained in one recon-
ﬁgurable partition, the corresponding calculation result
inputs of the ALU result multiplexer were connected to this
RP instead of to the default static versions of the EUs.
Compared to solutions like the i-Core [4], which has a
static instruction set (IS) as well as a dynamically
reconfigurable IS, the ViSARD is not limited to those
reconfiguration choices. This is because it is possible to change all
EUs or even more (e.g., one complete core including all EUs
and memory). Therefore, every EU and the respective assembly
instructions can be seen as dynamic. Another prime
difference is the fact that the ViSARD has to fulfill hard
real-time constraints, and in order to do so, the processor
behavior as well as the assembly code must be fully analyzable
at design time. Because of this, it is necessary to define all
reconfiguration points at design time, but it is not
mandatory to predefine static and dynamic IS parts as the
i-Core does. It is even possible to reconfigure complete
cores in a multicore setup, resulting in a complete change of
the IS in that specific core. Since the task domain the soft-core is
designed for does not have requirements that change over
time, it is possible to make all the mentioned decisions at
design time. It has to be mentioned that it is not possible to
realize different reconfiguration granularities at the same
time (and therefore realize DPR regions inside coarser DPR
regions). At design time and with the help of the
estimate from equation (1), which is presented in detail in the
following section, the user has to decide which reconfiguration
granularity is the best fit for the concrete problem.
This reconfiguration granularity determines the DPR regions
that will be realized inside the soft-core architecture.
6. Reconfiguration Method
During the development of an embedded system, it must be
assessed whether the use of DPR will be appropriate in a
speciﬁc project. In a further step, it has to be decided which
parts could be reconﬁgured. With the help of the method
presented below, the timing-related costs T (in μs) of partial
reconﬁguration can be estimated.
\[
T = \frac{H + n \cdot W + m \cdot F + E}{S}, \tag{1}
\]
where H is the size of the header, n · W is the number of write
commands times the size of one command, m · F is the
number of configuration frames times the size of one
frame, E is the size of the end sequence, and S is the
actual speed of the reconfiguration port in bit/µs. It should
be noted that with every new clock region, another write
command is needed. As an example, the necessary values
from equation (1) for the Zynq-7000 FPGA family are given
in Table 1. They were determined by inspecting partial
bitstreams generated by the Vivado design suite.
Figure 4: ViSARD tool-chain (simplified, based on [31]).
T is the minimum time required for the planned
reconfiguration. With this lower limit and the estimated
computation time of the given algorithm, it is possible to
assess whether the planned partial reconfiguration would
violate the hard real-time deadline, i.e., whether the computing
and reconfiguration times together would exceed it. If the
evaluation shows that a partial reconfiguration can be
used and is beneficial, it can be performed at different
reconfiguration granularities, i.e., on different levels.
This estimation has to be done manually
at this time, but with this formula it should be possible to
implement an automated reconfiguration checker.
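The estimate of equation (1) and the deadline check described above can be sketched in a few lines. This is a minimal illustration, not part of the ViSARD tool-chain: the function names are hypothetical, the size constants are taken from Table 1, and the purely additive deadline check is an assumption about how the estimate would be used. Since all sizes are in bits, the speed S must be supplied in bits per microsecond.

```python
# Sizes in bits for the Zynq-7000 family (Table 1).
HEADER_BITS = 960
WRITE_COMMAND_BITS = 256
FRAME_BITS = 3232
END_SEQUENCE_BITS = 736


def reconfiguration_time_us(n_write_commands, m_frames, speed_bits_per_us):
    """Lower bound T of equation (1) in microseconds."""
    total_bits = (HEADER_BITS
                  + n_write_commands * WRITE_COMMAND_BITS
                  + m_frames * FRAME_BITS
                  + END_SEQUENCE_BITS)
    return total_bits / speed_bits_per_us


def fits_deadline(compute_time_us, n_write_commands, m_frames,
                  speed_bits_per_us, deadline_us):
    """Assumed check: computation plus reconfiguration must not exceed the deadline."""
    t = reconfiguration_time_us(n_write_commands, m_frames, speed_bits_per_us)
    return compute_time_us + t <= deadline_us
```

A checker of this kind would be called once per candidate reconfigurable region, with n and m taken from the planned Pblock (one write command per clock region, one frame count per region).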
In the literature, there is only a very general distinction
between ﬁne-grained and coarse-grained reconﬁguration.
Fine-grained means a reconﬁguration of processing ele-
ments working at bit level, and coarse-grained describes the
replacement of complex functional blocks such as an ALU
[33].
In the context of the reconfiguration of soft-core processors,
a more refined classification is necessary. Therefore,
a distinction should be made between the diﬀerent modules
and module levels as presented in Figure 5.
A distinction is made based on the granularity of the
modules. Individual EU modules within the ALU can be
reconﬁgured, which corresponds to the ﬁnest granularity.
Swapping an EU on the hardware level is equivalent to
changing the availability of assembly instructions on the
software level. Because of the time a reconﬁguration of
diﬀerent EUs would take compared to the time a typical
assembly code would need to be computed, it is not useful to
realize this reconﬁguration granularity during an active
computation period of the soft-core processor. Rather, it is
meant to reconﬁgure the soft-core between two computation
runs to adapt the processor to diﬀerent algorithms that
require diﬀerent EU allocations.
The coarser steps replace the entire ALU as a
module or even entire soft-core processors. At the next
granularity level, single cores of a multicore architecture
can be reconfigured as a complete module. The
reconfiguration of one of the cores of a multicore architecture
or even the complete soft-core processor can be
associated with a function and program swap. With
this granularity, the DPR region thus implements a
separate program to be executed. The coarsest level of
granularity is the reconfiguration of entire multicore soft-core
processors in a MIMD processor.
Coarser granularities will naturally lead to a higher
amount of identical logic in the dynamic region of the design,
as, for example, the control units of the processors are the
same or very similar, depending on the configuration of the
soft-core. However, torCombitgen can reduce the influence
of this duplication, and the bigger reconﬁgurable regions can
reduce the resource waste incurred as a result of DPR. In
Section 8, we present experimental results for the ﬁnest
granularity (exchanging EUs) and a medium granularity of
one soft-core processor. Additional research is needed to
evaluate which granularity level delivers the best compro-
mise between logic utilization and reconﬁguration time.
7. DPR System
Partial reconfiguration without an external controller requires
additional logic on the FPGA in the form of a partial
reconfiguration controller (PRC) and accompanying control logic.
Furthermore, a DPR system needs memory for
storing partial bitstreams. This section explains the
system architecture and components.
The resource requirements of all additional components
required for DPR must be kept to a minimum in order to
preserve the main advantage of DPR, a minimized space requirement.
As the analysis of the available PRCs from the literature in
Section 3 has shown, there is no PRC that satisfies the
requirements of the architecture (low overhead, predictable
timing satisfying hard real-time constraints, throughput close
to device limits, designed for standalone operation without a
general-purpose CPU). For this reason, an optimized PRC was
realized and is presented in detail hereafter.
7.1. Memory Solution. Choosing a bitstream storage location
that is suﬃciently large and fast is key to achieving high
reconﬁguration throughput.
Block RAM (BRAM) is internal memory distributed in
the fabric of Xilinx FPGAs that can be initialized with data
speciﬁed at design time and is guaranteed to have no access
latency. However, BRAM is a comparatively scarce resource
in most FPGAs. A full bitstream for the device used in our
projects requires about 4.05MB of storage in contrast to the
available 0.6125MB of BRAM on the FPGA [2], so a partial
bitstream for 10% of the resources already exceeds the
available space. In any case, bitstream storage in BRAM incurs
considerable overhead, occupying valuable resources on the
FPGA that might be required for the actual application.
Figure 5: Soft-core-based reconfiguration granularities (from fine grained to coarse grained: execution unit(s), ALU, soft-core processor, multi-soft-core processor, MIMD soft-core processors).
Table 1: Values of the components.
Symbol  Meaning              Value
H       Header               960 bit
W       Write command        256 bit
E       End sequence         736 bit
F       Configuration frame  3,232 bit
S       Speed                400 bit/µs
The only alternative is to use external memory. Out of
the storage chips available on the Trenz Electronic GigaZee
platform used in this research (QSPI flash, eMMC, and
DDR3 SDRAM), only the DDR3 SDRAM offers sufficient
bandwidth, exceeding the 400 MB/s required for sustained
partial reconfiguration. It is connected to the processing
system (PS) memory controller, so in order to access it from
the programmable logic (PL), one of the PS AXI buses has to
be used. When choosing the high-performance AXI_HP bus
as done here, the maximum read bandwidth is 1,200 MB/s at
a bus clock of 150 MHz [14]. Timing-wise, a maximum
latency for memory accesses can be guaranteed by making
sure that no other components in the SoC (such as the APU
or the DMA controller) use the DDR3 SDRAM or that the
AXI_HP access is given highest priority [14].
Due to the volatile nature of the external RAM, the
bitstreams cannot be saved permanently. Instead, they have to
be copied from some other persistent storage device on
system startup. The eMMC is considered most appropriate for
this purpose since, as opposed to the QSPI flash, which
contains the Zynq boot loader, it does not contain any other
data required for the system to function. Still, the on-chip SD/MMC
controller is not directly accessible to the logic on the FPGA.
In order to copy the bitstream data to the DDR3 SDRAM, a
software routine was implemented that runs once on the APU
when the system is initialized and remains idle afterwards. This
component is referred to as bin provider. The complete
structure of the developed system is illustrated in Figure 6. In
addition to the already introduced components, the
application-specific execution control unit is responsible for
initiating and monitoring both DPR and the execution of
programs on the ViSARD. Typically, the execution control
will observe the activity of the ViSARD and, once a given
program has finished execution (this is indicated by the
ViSARD), stop the processor and request reconfiguration
from the PRC. Once the PRC confirms that the bitstream has
been loaded, the execution control may restart the processor
with a different program matching the new set of EUs. Which
programs require which EUs (and therefore partial
bitstreams) is specific to the application being implemented.
More complicated setups that, e.g., evaluate external input to
decide which program to run are also possible.
To be adaptable to a wide range of DPR-enabled FPGAs,
the design of this system is conceived to be as independent of
the individual hardware components as possible under the
constraint that it can function on the target hardware of this
research. Instead of the eMMC, any other memory that is
reachable by the APU can easily be substituted in the bin provider.
Even if the target device is not an SoC, the PRC does
not have to be modiﬁed due to its capability of loading
bitstreams from any source via the AXI bus. It can be
connected to a memory controller instantiated in the FPGA
design instead of using the one in the PS.
7.2. Partial Reconfiguration Controller. The second key
component that determines the characteristics of the DPR
system, such as the configuration data throughput, is the PRC
itself. It has to be designed so that it can keep up with the
data flow and stream it to the ICAP at maximum speed.
The ICAP takes 32 bits of data in each clock cycle at a
maximum frequency of 100 MHz. For simplicity, the AXI
bus uses the same 100 MHz clock. The AXI_HP bus of the
Zynq-7000 uses a data width of 64 bits, so in each cycle
twice the amount of data needed for the reconfiguration can be
read. This allows for a considerable amount of leeway for
delivering the bitstream to the ICAP. Nevertheless, some
sort of width adaptation/buffering between AXI and ICAP is
necessary. Depending on the concrete bitstream provider on
the AXI bus, there might be a few clock cycles of inactivity
between successive read transactions (this is the case for the
DDR memory controller). A first-in-first-out (FIFO) buffer
with a write port width of 64 bits and a read port width of
32 bits bridges these periods of inactivity, so that bitstream
delivery to the ICAP continues without interruption. For low
resource overhead, the minimum possible
buffer depth of 16 64-bit units was chosen.
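The width adaptation between the 64-bit AXI side and the 32-bit ICAP side can be illustrated with a small behavioral model. This is a sketch only, not the VHDL implementation: the class and method names are hypothetical, and both the word order (lower 32 bits first) and the rule that a slot is freed only after both halves have been read are assumptions made for the illustration.

```python
from collections import deque


class WidthConvertingFifo:
    """Behavioral model of a FIFO with a 64-bit write and a 32-bit read port."""

    def __init__(self, depth_words64=16):
        self.depth = depth_words64
        self.words = deque()        # stored 64-bit words
        self.low_half_next = True   # which half of the head word is read next

    def can_write(self):
        return len(self.words) < self.depth

    def write64(self, word):
        if not self.can_write():
            raise OverflowError("FIFO full")
        self.words.append(word & 0xFFFFFFFFFFFFFFFF)

    def can_read(self):
        return bool(self.words)

    def read32(self):
        if not self.can_read():
            raise IndexError("FIFO empty")
        head = self.words[0]
        if self.low_half_next:
            # Assumed order: lower half first, head word stays in the FIFO.
            self.low_half_next = False
            return head & 0xFFFFFFFF
        # Upper half: the 64-bit slot is released afterwards.
        self.low_half_next = True
        self.words.popleft()
        return head >> 32
```

Because each write supplies two reads, a full buffer of 16 entries in this model keeps the read side busy for 32 cycles without any new data arriving from the bus.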
The basic structure of the custom PRC implemented in
VHDL is shown in Figure 7, which also presents its
interfaces (partial reconfiguration control to the right of the
PRC, bitstream storage above the PRC, and ICAP below the
PRC). Additionally, there are clock and reset inputs (not
depicted). The individual components of the PRC are
explained further below.
One of the advantages of the custom PRC over similar
approaches presented in Section 3 is the usage of the modern
and standard AXI bus to achieve separation of bitstream
access and delivery. It enables adaptation to different system
requirements with less effort by allowing the
storage module to be replaced with another ready-made one. VHDL
generics allow the specification of parameters of the AXI bus
(address width, data base address) as well as the number of
different bitstreams to be loaded.
Partial bitstreams can differ in size depending on the
options used to create them and the resources the logic
occupies. One of the novel contributions of this paper is a
method and tool for delivering partial bitstreams minimized
in size and number with the torCombitgen tool as described in
Section 4. For the PRC to be able to access and load bitstreams
quickly, all of them must be available in a contiguous region of
memory and additionally have to be indexed to allow locating
the start address of an arbitrary bitstream without performing
multiple look-ups or skipping through other bitstreams in
order to find the correct one. The index format used is a plain
table of all available bitstreams that starts at offset 0 from the AXI
data base address. Each bitstream gets one 8-byte entry
consisting of two 4-byte unsigned integer values: the offset of
the first byte of the bitstream and its size in bytes. All entries
are stored successively, enabling instant lookup of the values
of one bitstream by loading 8 bytes of data from the base
address plus eight times the index of the desired bitstream.
The actual bitstreams are stored without any postprocessing
and likewise in succession. Compared to the Xilinx PRC, this
technique enables testing bitstreams of different sizes (e.g.,
generated with different methods and/or settings) without
having to perform any modifications to the PRC
configuration or the implemented netlist.
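The index format described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual image layout used in the experiments: the byte order (little-endian), the interpretation of the offset as relative to the base address, the placement of the first bitstream directly after the table, and all function names are hypothetical.

```python
import struct

# One table entry: two 4-byte unsigned integers (offset, size).
# Little-endian byte order is an assumption of this sketch.
ENTRY = struct.Struct("<II")


def build_image(bitstreams):
    """Return the index table followed by the bitstreams stored back to back."""
    table = bytearray()
    offset = ENTRY.size * len(bitstreams)  # assumed: data starts right after the table
    for data in bitstreams:
        table += ENTRY.pack(offset, len(data))
        offset += len(data)
    return bytes(table) + b"".join(bitstreams)


def locate(image, index):
    """Look up (offset, size) with a single 8-byte read at base + 8 * index."""
    return ENTRY.unpack_from(image, ENTRY.size * index)


def load(image, index):
    """Return the raw bytes of the bitstream with the given table index."""
    offset, size = locate(image, index)
    return image[offset:offset + size]
```

The lookup cost is independent of the number and sizes of the stored bitstreams, which is what allows bitstreams of different sizes to be swapped in without touching the controller.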
The PRC contains three state machines, each responsible
for one channel of information: the AXI address channel, the
AXI data channel, and the ICAP channel (see Figure 7). By
treating the two AXI channels of information separately, it is
possible to make the best use of the bandwidth and burst read
capability of the DDR memory controller. Although the
memory controller requires a number of cycles of latency to
answer read requests, it continuously streams out data once it
begins to send the response. In particular, further reads can be
issued while the response to a previous request is still pending.
The PRC makes sure to always keep as many read requests as
possible in the queue of pending AXI transactions. The latency
of memory read requests therefore does not impact the bitstream
throughput beyond the start of the partial reconfiguration.
When a reconfiguration with a particular partial bitstream is
requested, first the entry in the bitstream table as described
above needs to be read. Only once the table entry information
is available is it possible to begin issuing AXI read commands
for the actual bitstream data, as before that the address to read
the bitstream from is unknown.
The user-facing control interface is very simple to use:
When the PRC is ready to accept a request for
reconfiguration, it indicates this via the “ready” output port. The
application may start such a request using the “request”
input and communicate which partial bitstream to load
using the corresponding bitstream table index. When
reconfiguration has been completed, the “done” output is
asserted for the duration of one clock cycle. This is
guaranteed to happen only after all data was written to the ICAP,
which in turn guarantees that the reconfigured region is
ready to be used when the application receives this
information. The “ready” output is asserted again whenever
the PRC is able to begin processing another request.
Specifically, this might be the case when a reconfiguration that is
currently underway is nearing its end, so that another partial
bitstream can already begin preloading for reduced latency
between successive reconfiguration cycles.
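The handshake described above can be mimicked by a cycle-stepped behavioral model. The sketch below is deliberately simplified and is not derived from the VHDL source: the class name, the per-bitstream cycle counts, and the decision to reassert “ready” together with “done” (i.e., without the preloading overlap mentioned in the text) are all assumptions.

```python
class PrcHandshakeModel:
    """Simplified model of the ready/request/done control interface."""

    def __init__(self, cycles_per_bitstream):
        # Hypothetical mapping: bitstream table index -> cycles needed to load it.
        self.cycles_per_bitstream = cycles_per_bitstream
        self.remaining = 0
        self.ready = True
        self.done = False

    def step(self, request=False, index=0):
        """Advance one clock cycle; a request is accepted only while ready."""
        self.done = False
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.done = True    # asserted for exactly one cycle
                self.ready = True   # simplification: no early preloading
        elif request and self.ready:
            self.remaining = self.cycles_per_bitstream[index]
            self.ready = False
        return self.ready, self.done
```

An execution control unit would, in this model, assert `request` with the desired index while `ready` is high and then wait for the single-cycle `done` pulse before restarting the processor.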
The implementation of the PRC has predictable timing
characteristics and satisfies the hard real-time requirement.
Given the clock frequency f (typically 100 MHz), a partial
bitstream of n bytes, and an AXI slave that responds to
requests in at most d clock cycles and can sustain the
required configuration throughput (i.e., answers with new data
before the FIFO runs empty), the upper bound of the time
required for a reconfiguration cycle is
\[
T_r = \frac{3 + 2 \cdot d + n/4}{f}. \tag{2}
\]
This equation is derived from a small number of cycles of
initial latency needed to communicate between the state
machines and issue AXI requests, the time these requests
take to be answered, and finally the duration of the transfer
of the bitstream data to the ICAP. As described above, two
unpipelined AXI requests need to be answered before data is
streamed continuously (which is why d is multiplied by 2):
firstly, the read of the bitstream table entry, and secondly,
the read of the first bitstream data block.
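Equation (2) translates directly into code. The following sketch uses hypothetical function names; the latency bound d depends on the memory system and has to be supplied by the user, so any concrete value passed in is an assumption.

```python
def reconfiguration_cycles_bound(n_bytes, d_cycles):
    """Numerator of equation (2): upper bound in clock cycles."""
    # 3 cycles of internal latency, two unpipelined AXI reads of at most
    # d cycles each, and one cycle per 4 bytes written to the ICAP.
    return 3 + 2 * d_cycles + n_bytes / 4


def reconfiguration_time_bound_us(n_bytes, d_cycles, f_hz=100e6):
    """Upper bound T_r of equation (2), converted to microseconds."""
    return reconfiguration_cycles_bound(n_bytes, d_cycles) / f_hz * 1e6
```

For example, a bitstream of 247,116 bytes with an assumed latency bound of d = 30 cycles gives 3 + 60 + 61,779 = 61,842 cycles, so the transfer term n/4 dominates the bound by several orders of magnitude.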
Figure 6: Structure of the custom DPR system (components shown: PL toplevel with execution control, ViSARD with execution unit(s), PRC, and ICAP; PS with the bin provider software running on the APU; eMMC and DDR3 SDRAM as board-level components; AXI_HP connection).
Figure 7: Block diagram of the custom PRC and its interfaces
(excluding clock/reset).
8. Experimental Results
A number of experiments were carried out on a Xilinx
Z-7020 SoC to evaluate the benefits of DPR for the ViSARD
and verify the feasibility of the DPR system presented in
Section 7. This section explains two of them in detail and
presents the results achieved in terms of resource
savings, reconfiguration speed, and bitstream sizes. Other
experiments are summarized.
As a first step, the operation of the PRC and its interaction
with the AXI bus were verified in behavioral simulation. The
whole DPR system (including the PRC, software on the
APU, and all utilities) was tested by continuously
exchanging one execution unit in the ViSARD.
The following experiment consisted of one ViSARD core
embedded in the DPR system (Figure 6) using a basic ex-
ecution control module that continually cycles through three
diﬀerent EU conﬁgurations and ViSARD test programs in
the following order (wrapping back to the ﬁrst conﬁguration
after the last one):
(1) Divide
(2) Square root (Sqrt) and sine/cosine (SinCos)
(3) Sqrt, natural exponential function (NatExp) and
multiply
This experiment represents the finest reconfiguration
granularity, as execution units are being reconfigured
(Figure 5). It is adopted from the results presented in
[34]. The configurations were chosen such that they differ in
the number of EUs (to verify that changing not only the EUs
themselves but also the number of EUs in an RP is possible) and
have a compatible resource footprint. Even so, the NatExp
EU is the only EU that uses BRAM resources, which means
that the Pblock must include BRAM that is
unused in configurations 1 and 2.
The soft-core processor is stopped during the
reconfiguration. It is possible to keep all parts of the processor that are
not reconfigured operational during the reconfiguration,
but for the sake of simplicity this
was not done in the experiments. If those parts stay operational, the
assembler will ensure that the processor does not access the
parts being reconfigured during the reconfiguration. Due to the time that the
reconfiguration takes, this scenario makes sense only
with coarser reconfiguration granularities. An example for
this would be a multi-soft-core scenario, where one core is
reconfigured while all other cores stay operational.
Running the experiment on hardware revealed that the
ViSARD produced correct results in all calculations, thereby
verifying that partial bitstreams are forwarded to the FPGA
and the DPR system works.
The speed of reconfiguration was measured by capturing a
histogram of the number of clock cycles it took the DPR system
to fully deliver one partial bitstream, i.e., switch from any EU
configuration to the next. As shown in Table 2, the average
cycle count in 65,536 continuous measurements was 61,833.7
with a standard deviation (SD) of 13.7 cycles. At 100 MHz, this
corresponds to 618.337 µs per reconfiguration, allowing
for (618.337 µs)^-1 = 1,617.24 reconfigurations per second.
The average bitstream throughput was 247,116 bytes /
618.337 µs = 399.646 MB/s or 99.91% of the 400 MB/s
maximum.
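The figures above follow from the measured cycle count by simple arithmetic, reproduced here as a short sketch. The variable names are illustrative; the inputs are the values reported in the text and in Table 2, and 1 MB is taken as 10^6 bytes.

```python
CLOCK_HZ = 100e6            # ICAP/AXI clock used in the experiment
AVG_CYCLES = 61_833.7       # average cycle count (Table 2)
BITSTREAM_BYTES = 247_116   # size of the delivered partial bitstream
MAX_MB_PER_S = 400.0        # 32 bits per cycle at 100 MHz

avg_time_us = AVG_CYCLES / CLOCK_HZ * 1e6
reconfigs_per_second = 1e6 / avg_time_us
throughput_mb_per_s = BITSTREAM_BYTES / avg_time_us  # bytes per µs equals MB/s
share_of_maximum = throughput_mb_per_s / MAX_MB_PER_S
```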
“Transferred to the real-world example AutoVision [1],
reconfiguration would be triggered, e.g., when the car drives
into a tunnel. A picture has to be computed in a total time
frame per picture of 32.25 ms. As one reconfiguration run
would take less than 0.62 ms, the soft-core processor would
be left with 31.63 ms for the actual image processing. This
reconfiguration period would take less than 2% of the
available time frame. The time frame the soft-core processor
has for each pixel is slightly reduced from 30.75 ns to
30.16 ns after the DPR is completed, but it has the advantage
of reduced resource usage, thanks to DPR.”
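The time budget in this example can be checked with a few lines of arithmetic. The sketch assumes that the per-pixel budget scales linearly with the frame time that remains after the reconfiguration; the variable names are illustrative.

```python
frame_time_ms = 32.25          # time frame per picture
reconfig_time_ms = 0.62        # upper estimate for one reconfiguration run
pixel_time_before_ns = 30.75   # per-pixel budget without reconfiguration

remaining_ms = frame_time_ms - reconfig_time_ms
reconfig_share = reconfig_time_ms / frame_time_ms
# Assumption: the per-pixel budget shrinks in proportion to the remaining time.
pixel_time_after_ns = pixel_time_before_ns * remaining_ms / frame_time_ms
```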
As the latency of the PRC is completely deterministic
and static (see Section 7.2), the variance is the result of the
Zynq DDR memory controller having different access latencies
depending on when the request is made. This is a
known property of DDR memory accesses caused mainly by
refresh cycles that must be interleaved with data transfer. As
can be seen from both past research projects [35] and the
results here (30 cycles of variance between fastest and
slowest execution of an operation that takes 61,834 cycles on
average), its eﬀect is minimal. It is therefore not considered
further. In summary, these results show that the PRC is able
to deliver hard real-time reconﬁguration in practice with the
limitation of assuming a bitstream provider that can satisfy
this constraint as well.
The resources used by the design and its components are
listed in Table 3.
For comparison purposes, an equivalent non-DPR de-
sign was included that implements the same functionality
but does not have a reconﬁgurable partition. It uses identical
IP cores and settings, ViSARD test program, and execution
control unit. When correlating the DPR and the non-DPR
design, it is not immediately obvious what numbers should
be examined. Manufacturer tools provide usage broken
down into components, but additional considerations have
to be taken into account. The DPR design has resources in
the Pblock(s) designated for reconfiguration that are blocked
and not usable for implementing additional logic but do not
appear in those numbers. Therefore, we take the static part of
the DPR design and add the resources allocated for
reconﬁguration to it.
Applying this methodology, DPR saved around
3,500 LUTs and 7,000 FFs. This is a very good result that
amounts to a relative saving of 40.7% when directly
comparing the full DPR and non-DPR designs. Both BRAM and
DSP resources showed an increase (8.5 and 5 additional tiles,
respectively), mainly due to BRAM/DSP functional units that
had to be allocated to the DPR Pblock but could not be used
by the dynamic logic. Realistically, these dormant resources
cannot be completely avoided when dealing with diverse
execution units and are considered to be the cost of partial
reconfiguration. However, since these resources are not
needed much within processors in the application domain
and the LUTs are the critical resource, this small increase is
negligible.
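The comparison methodology (static part plus the complete Pblock versus the non-DPR design) can be reproduced from the values in Table 3. The dictionary layout and variable names below are illustrative only.

```python
# Values copied from Table 3.
static_part = {"LUT": 1302, "FF": 2604, "BRAM": 4.0, "DSP": 3}    # row (e)
pblock = {"LUT": 3800, "FF": 7600, "BRAM": 10.0, "DSP": 30}       # row (f)
non_dpr = {"LUT": 8607, "FF": 17214, "BRAM": 5.5, "DSP": 28}      # row (h)

# Row (g): resources effectively occupied by the DPR design.
effective = {k: static_part[k] + pblock[k] for k in static_part}
# Row (i): difference to the equivalent non-DPR design.
delta = {k: effective[k] - non_dpr[k] for k in effective}
# Relative LUT saving reported in the text.
lut_saving = -delta["LUT"] / non_dpr["LUT"]
```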
Of the PRCs presented for comparison in Section 3,
usage ﬁgures on Kintex-7-based FPGAs are available for the
RT-ICAP, the Xilinx PRC, and the ZyCAP. They are
compared in Table 4. Our PRC requires about 270 LUTs plus
one BRAM tile for the FIFO, which is less than 1% of the
resources available on the device in all categories. It is small
and does not constitute considerable overhead. The sustained
throughput of 399.981 MB/s or 99.995% of the device maximum
was measured in another experiment with two ViSARD
cores that allowed for overlapping bitstream delivery to the
ICAP with buffering of the next bitstream to load. Data for the
Xilinx PRC was determined in a test design (settings were
chosen to be most similar to the custom PRC: no optional
features (including AXI control interface), minimum FIFO
depth of 32 implemented as BRAM, 4 clock domain crossing
stages, and 1 virtual socket with 2 modules with 1 hardware
trigger each). The RT-ICAP is competitive in terms of resource
usage, but it was conceived for a different purpose, requires a
supporting general-purpose CPU, and does not have an AXI
memory interface. It additionally requires memory for
communication with the CPU that is not represented in the
reported BRAM usage. Furthermore, its throughput is around
4% below the result of this research. Vastly more logic
resources are needed by the ZyCAP (more than twice as many) and the
Xilinx PRC (more than three times as many). Conceptually, the
proposed PRC is quite similar to a design proposed by Vipin [20]
that was implemented on prior-generation FPGA hardware.
However, that design does not consider hard real-time requirements
and uses a directly integrated DDR memory interface as
opposed to the standard AXI bus, which means that the
bitstream storage cannot easily be replaced.
In the following experiment, which applies the same idea
at the coarser granularity level of a whole soft-core
processor, the principle was kept as above and the
reconfigurable partition was extended to cover the whole ViSARD core.
To keep it simple, only one execution unit was inserted per
core (NatExp, Sqrt, and SinCos). Table 5 lists the sizes of
bitstreams generated by different methods. A module-based
generation always produces the largest bitstream, so
it is taken as reference. Difference bitstreams were
generated only for switching from one configuration to the
next as required by this experiment, reducing the size by
about 24% on average. All-to-all or dynamic change patterns
would require n^2 − n = 3^2 − 3 = 6 partial bitstreams as
previously explained in Section 4.1, but their sizes will not
differ significantly. torCombitgen similarly allowed for a
25% reduction in size, but the three produced bitstreams
are already sufficient for switching from any configuration
to any other one. The results confirm the validity of this
approach and that the sizes of bitstreams generated by this
tool lie between the module-based and the difference-based
method.
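The bitstream counts mentioned above can be made explicit by enumerating the transitions. The sketch below is illustrative: the cyclic order of the three configurations is an assumption, and the lists only model which bitstreams are needed, not how they are generated.

```python
from itertools import permutations

configs = ["NatExp", "Sqrt", "SinCos"]

# Difference-based, arbitrary change patterns: one bitstream per ordered
# (from, to) pair, i.e., n^2 - n bitstreams.
difference_all = list(permutations(configs, 2))

# Difference-based, fixed cyclic schedule (assumed order): n bitstreams.
difference_cyclic = [(configs[i], configs[(i + 1) % len(configs)])
                     for i in range(len(configs))]

# One bitstream per target configuration, usable from any source
# configuration, as stated for torCombitgen in the text.
source_independent = list(configs)
```

With n = 3 this yields 6, 3, and 3 bitstreams, respectively; the gap between the first and the last count grows quadratically with the number of configurations.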
Table 3: Resource utilization in the EU-swapping experiment.
Component                                   LUT     FF      BRAM  DSP
(a) Divide RM                               3,273   6,546   0.0   0
(b) Sqrt-SinCos RM                          3,063   6,126   0.0   0
(c) Sqrt-NatExp-Multiply RM                 3,055   6,110   2.5   25
(d) DPR design (complete with Divide RM)    4,575   9,150   4.0   3
(e) DPR design (static part): (d) − (a)     1,302   2,604   4.0   3
(f) Pblock for DPR                          3,800   7,600   10.0  30
(g) DPR design (effective): (e) + (f)       5,102   10,204  14.0  33
(h) Equivalent non-DPR design               8,607   17,214  5.5   28
(i) DPR vs. non-DPR: (g) − (h)              −3,505  −7,010  +8.5  +5
Table 4: Resource utilization and configuration throughput of the custom and other PRCs.
Controller       LUT  FF   BRAM  Throughput (MB/s)
Custom PRC       273  292  1.0   399.981
RT-ICAP [7]      245  101  0.0   382.2
Xilinx PRC [21]  897  954  0.5   n/a
ZyCAP [13]       620  806  0.0   382
Table 2: Statistics of clock cycle count required for DPR in the EU-swapping experiment.
Minimum  Mode    Average (SD)     Maximum
61,824   61,824  61,833.7 (13.7)  61,854
Table 5: Bitstream sizes (bytes) in the ViSARD-swapping experiment.
Method            NatExp             Sqrt               SinCos
Module-based      191,784 (100.00%)  191,784 (100.00%)  191,784 (100.00%)
Difference-based  144,076 (75.12%)   140,440 (73.23%)   144,076 (75.12%)
torCombitgen      144,060 (75.12%)   144,060 (75.12%)   144,060 (75.12%)
9. Conclusion and Future Work
The ViSARD is a hard real-time capable soft-core processor
that can be adapted for any application in the addressed
domain. With the help of the accompanying tool-chain, it is
possible to adapt the processor to the respective project in a
model-based fashion. The novel mechanisms derived from
the method and realized in the soft-core enable a
cycle-accurate timing analysis of the processor at design time and
determine the exact computing time, thereby ensuring
compliance with the hard real-time barrier. Those
mechanisms also allow the application-specific specialization and
eliminate any disadvantage resulting from the hard real-time
constraints. The soft-core was extended to DPR while preserving
the hard real-time capability. The torCombitgen tool was
developed based on the Combitgen tool as part of the
extension of the soft-core to DPR. This tool can be used to
create minimized partial bitstreams for any DPR scenario
and combines the advantages of the module-based and
difference-based bitstream generation approaches.
The combination of torCombitgen and ViSARD enabled
the implementation of a soft-core processor that adapts to a
changing environment over time. This maximizes the
suitability of the processor for a wide range of tasks and at
the same time minimizes the necessary FPGA resources. The
conducted experiments have shown that even with a
reconfiguration on the finest granularity level, more than
40% of the critical resources can be saved while preserving
identical functionality. Considering that the size of
the Pblock can be adapted more precisely to the respective
problem with a larger area to be reconfigured, i.e., the
overhead is reduced, even better results can be achieved with
coarser granularities.
Within the scope of the performed experiments, the hard
real-time capability of the used DPR approach
(torCombitgen-generated partial bitstream and custom PRC) could
also be demonstrated. The implemented PRC uses minimal FPGA
resources and achieves the best possible reconfiguration speed
compared to other PRCs in the literature (see Table 4).
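To make the timing claim concrete, the sketch below derives the spread of the reconfiguration time from the minimum and maximum cycle counts reported in Table 2. Only those two cycle counts come from the paper; the function names are hypothetical, and the clock frequency is deliberately left as a caller-supplied parameter because it is not stated in this section.

```python
# Cycle counts for one DPR run, copied from Table 2 (EU-swapping experiment).
MIN_CYCLES = 61_824
MAX_CYCLES = 61_854


def jitter_cycles(min_cycles, max_cycles):
    """Absolute spread between the fastest and slowest reconfiguration."""
    return max_cycles - min_cycles


def relative_jitter_percent(min_cycles, max_cycles):
    """Spread as a percentage of the worst-case cycle count."""
    return 100.0 * (max_cycles - min_cycles) / max_cycles


def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle count to microseconds.

    clock_hz is an assumption supplied by the caller; the value used in the
    experiments is not given in this section.
    """
    return cycles / clock_hz * 1e6


if __name__ == "__main__":
    print("jitter (cycles):", jitter_cycles(MIN_CYCLES, MAX_CYCLES))
    print("jitter (% of worst case):",
          round(relative_jitter_percent(MIN_CYCLES, MAX_CYCLES), 4))
```

For a hard real-time budget, the maximum of 61,854 cycles is the figure to plan with; the spread of 30 cycles is below 0.05% of that worst case.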
In the future, the approach presented and all developed
tools will be applied in real-world scenarios and compared to
existing non-DPR solutions. Additional research is needed
to evaluate which granularity level delivers the best
compromise between logic utilization and reconfiguration time.
With these results, it would be conceivable to integrate DPR
into the ViSARD tool-chain as an automated step during the
model-based assembly code generation. An essential part of
this integration will be Equation (1). With this estimation, an
automated algorithm can be implemented that finds the
possible reconfiguration options. This reconfiguration
checker will also handle timing considerations arising from
reconfiguring active soft-core processors and provide this
information to the assembler. Currently, this has to be done
manually by the user at design time.
Data Availability
The data used to support the findings of this study are
available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest
regarding the publication of this paper.
Acknowledgments
We acknowledge support for the article processing charge by
the German Research Foundation (DFG) and the Open
Access Publication Fund of the Technische Universität
Ilmenau.
References
[1] C. Claus, S. Walter, and A. Herkersdorf, “Autovision: a
run-time reconfigurable MPSoC architecture for future driver
assistance systems (Autovision: eine zur Laufzeit
rekonfigurierbare MPSoC-Architektur für zukünftige
Fahrerassistenzsysteme),” IT-Information Technology, vol. 49, no. 3,
pp. 181–187, 2007.
[2] Xilinx, Inc., Zynq-7000 SoC Data Sheet: Overview. DS190
v1.11.1, Xilinx, Inc., San Jose, CA, USA, 2018, https://www.
xilinx.com/support/documentation/data_sheets/ds190-Zynq-
7000-Overview.pdf.
[3] I. Zaidi, A. Nabina, C. N. Canagarajah, and J. Nunez-Yanez,
“Evaluating dynamic partial reconﬁguration in the integer
pipeline of a FPGA-based opensource processor,” in Pro-
ceedings of the 2008 International Conference on Field Pro-
grammable Logic and Applications, pp. 547–550, IEEE,
Heidelberg, Germany, September 2008.
[4] J. Henkel, L. Bauer, M. Hübner, and A. Grudnitsky, “i-Core: a
run-time adaptive processor for embedded multi-core sys-
tems,” in Proceedings of the 2011 International Conference on
Engineering of Reconﬁgurable Systems and Algorithms
(ERSA’11), pp. 163–170, Las Vegas, NV, USA, July 2011.
[5] Xilinx, Inc., MicroBlaze Micro Controller System v3.0 Logi-
CORE IP Product Guide. PG116, Xilinx, Inc., San Jose, CA,
USA, 2017, https://www.xilinx.com/support/documentation/
ip_documentation/microblaze_mcs/v3_0/pg116-microblaze-
mcs.pdf.
[6] Xilinx, Inc., AXI HWICAP v3.0 LogiCORE IP Product Guide.
PG134, Xilinx, Inc., San Jose, CA, USA, 2016, https://www.
xilinx.com/support/documentation/ip_documentation/axi_
hwicap/v3_0/pg134-xihwicap.pdf.
[7] L. Pezzarossa, M. Schoeberl, and J. Sparso, “A controller for
dynamic partial reconﬁguration in FPGA-based real-time
systems,” in Proceedings of the 2017 IEEE 20th International
Symposium on Real-Time Distributed Computing (ISORC),
pp. 92–100, IEEE, Toronto, Canada, May 2017.
[8] M. Hübner, D. Gohringer, J. Noguera, and J. Becker, “Fast
dynamic and partial reconﬁguration data path with low
hardware overhead on Xilinx FPGAs,” in Proceedings of the
2010 IEEE International Symposium on Parallel & Distributed
Processing, Workshops and Phd Forum (IPDPSW), pp. 1–8,
IEEE, Atlanta, GA, USA, April 2010.
[9] A. Nabina and J. L. Nunez-Yanez, “Dynamic reconﬁguration
optimisation with streaming data decompression,” in Pro-
ceedings of the 2010 International Conference on Field Pro-
grammable Logic and Applications, pp. 602–607, Milano, Italy,
August 2010.
[10] G. Wang, L. Dongming, W. Fengzhou, A. Adetomi, and
T. Arslan, “A tiny and multifunctional ICAP controller for
dynamic partial reconﬁguration system,” in Proceedings of the
2017 NASA/ESA Conference on Adaptive Hardware and
Systems (AHS), pp. 71–76, IEEE, Pasadena, CA, USA, July
2017.
[11] L. A. Cardona and C. Ferrer, “AC_ICAP: a ﬂexible high speed
ICAP controller,” International Journal of Reconﬁgurable
Computing, vol. 2015, Article ID 314358, 15 pages, 2015.
[12] S. Liu, R. N. Pittman, A. Forin, and J.-L. Gaudiot, “Minimizing
the runtime partial reconfiguration overheads in
reconfigurable systems,” The Journal of Supercomputing, vol. 61, no. 3,
pp. 894–911, 2012.
[13] K. Vipin and S. A. Fahmy, “ZyCAP: eﬃcient partial recon-
ﬁguration management on the Xilinx Zynq,” IEEE Embedded
Systems Letters, vol. 6, no. 3, pp. 41–44, 2014.
[14] Xilinx, Inc., Zynq-7000 SoC Technical Reference Manual.
UG585 v1.12.2, Xilinx, Inc., San Jose, CA, USA, 2018, https://
www.xilinx.com/support/documentation/user_guides/ug585-
Zynq-7000-TRM.pdf.
[15] J. C. Rodriguez and K. F. Ackermann, “Leveraging partial
dynamic reconﬁguration on Zynq SoC FPGAs,” in Pro-
ceedings of the 2014 9th International Symposium on Recon-
ﬁgurable and Communication-Centric Systems-On-Chip
(ReCoSoC), pp. 1–6, IEEE, Montpellier, France, May 2014.
[16] V. Lai and O. Diessel, “ICAP-I: a reusable interface for the
internal reconﬁguration of Xilinx FPGAs,” in Proceedings of
the 2009 International Conference on Field-Programmable
Technology, pp. 357–360, IEEE, Sydney, Australia, December
2009.
[17] S. Bhandari, S. Subbaraman, S. Pujari et al., “High speed
dynamic partial reconﬁguration for real time multimedia
signal processing,” in Proceedings of the 2012 15th Euromicro
Conference on Digital System Design, pp. 319–326, IEEE,
Izmir, Turkey, September 2012.
[18] J. Tarrillo, F. A. Escobar, F. L. Kastensmidt, and
C. Valderrama, “Dynamic partial reconﬁguration manager,”
in Proceedings of the 2014 IEEE 5th Latin American Sympo-
sium on Circuits and Systems, pp. 1–4, IEEE, Santiago, Chile,
February 2014.
[19] C. Beckhoﬀ, D. Koch, and J. Torresen, “Portable module
relocation and bitstream compression for Xilinx FPGAs,” in
Proceedings of the 2014 24th International Conference on Field
Programmable Logic and Applications (FPL), pp. 1–8, IEEE,
Munich, Germany, September 2014.
[20] K. Vipin and S. A. Fahmy, “A high speed open source con-
troller for FPGA partial reconﬁguration,” in Proceedings of the
2012 International Conference on Field-Programmable Tech-
nology, pp. 61–66, Seoul, South Korea, December 2012.
[21] Xilinx, Inc., Partial Reconﬁguration Controller v1.3. PG193,
Xilinx, Inc., San Jose, CA, USA, 2018, https://www.xilinx.
com/support/documentation/ip_documentation/prc/v1_3/
pg193-partial-reconﬁguration-controller.pdf.
[22] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, “Run-time partial
reconﬁguration speed investigation and architectural design
space exploration,” in Proceedings of the 2009 International
Conference on Field Programmable Logic and Applications,
pp. 498–502, IEEE, Prague, Czech Republic, August 2009.
[23] C. Claus, R. Ahmed, F. Altenried, and W. Stechele, “Towards
rapid dynamic partial reconﬁguration in video-based driver
assistance systems,” in Reconﬁgurable Computing: Architec-
tures, Tools and Applications, vol. 5992, pp. 55–67, Springer
Berlin Heidelberg, Berlin, Heidelberg, 2010.
[24] Xilinx, Inc., Vivado Design Suite User Guide: Partial Recon-
ﬁguration. UG909 v2018.2, Xilinx, Inc., San Jose, CA, USA,
2018, https://www.xilinx.com/support/documentation/sw_
manuals/xilinx2018_2/ug909-vivado-partial-reconﬁguration.
pdf.
[25] E. Eto, Diﬀerence-Based Partial Reconﬁguration. XAPP290
v2.0, Xilinx, Inc., San Jose, CA, USA, 2007, https://www.xilinx.
com/support/documentation/application_notes/xapp290.pdf.
[26] C. Claus, F. H. Müller, and W. Stechele, “Combitgen: a new
approach for creating partial bitstreams in virtex-II pro de-
vices,” in Proceedings of the Workshop on Reconﬁgurable
Computing Proceedings (ARCS 06), pp. 122–131, Delft,
Netherlands, March 2006.
[27] N. Steiner, A. Wood, H. Shojaei, J. Couch, P. Athanas, and
M. French, “Torc: towards an open-source tool ﬂow,” in
Proceedings of the 19th ACM/SIGDA International Sympo-
sium on Field Programmable Gate Arrays—FPGA’11,
pp. 41–44, ACM, Monterey, CA, USA, 2011.
[28] M. Kirchhoﬀ and W. Fengler, “Realization of an embedded
hard realtime softcore processor,” in Proceedings of the 7th GI
Workshop—Autonomous Systems 2014, pp. 33–42, Cala
Millor, Spain, 2014.
[29] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-
2008, IEEE, Piscataway, NJ, USA, 2008.
[30] M. Kirchhoﬀ, L. Wagnler, and W. Fengler, “Compiler opti-
mization on instruction scheduling for a specialized real-time
ﬂoating point soft-core processor,” Advances in Electrical and
Computer Engineering, vol. 19, no. 3, pp. 57–68, 2019.
[31] M. Kirchhoﬀ, N. Kaptsova, D. Streitpferdt, and W. Fengler,
“Optimizing compiler for a specialized real-time ﬂoating
point softcore processor,” in Proceedings of the 2017 8th
Annual Industrial Automation and Electromechanical Engi-
neering Conference (IEMECON), pp. 181–188, Bangkok,
Thailand, August 2017.
[32] M. Kirchhoﬀ, J. Weisensee, D. Streitferdt, W. Fengler, and
E. Rozova, “Increasing eﬃciency in data ﬂow oriented model
driven software development for softcore processors,” in
Proceedings of the 2018 IEEE 42nd Annual Computer Software
and Applications Conference (COMPSAC), pp. 806–811, IEEE,
Tokyo, Japan, July 2018.
[33] D. Koch, Partial Reconﬁguration on FPGAs: Architectures,
Tools and Applications, Springer, Berlin, Germany, 2014.
[34] P. Kerling, “Conceptual design and realization of a dynamic
partial reconﬁguration extension of an existing soft-core
processor,” Master’s thesis, Technische Universität Ilmenau,
Ilmenau, Germany, 2019.
[35] P. Atanassov and P. Puschner, “Impact of DRAM refresh on
the execution time of real-time tasks,” in Proceedings of the
International Workshop on Application of Reliable Computing
and Communication (In Conjunction with PRDC 2001),
pp. 29–34, Seoul, Korea, December 2001.
