FASTER: Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration by Pnevmatikatos, D et al.
Microprocessors and Microsystems 39 (2015) 321–338Contents lists available at ScienceDirect
Microprocessors and Microsystems
journal homepage: www.elsevier .com/locate /micproFASTER: Facilitating Analysis and Synthesis Technologies for Effective
Reconﬁgurationhttp://dx.doi.org/10.1016/j.micpro.2014.09.006
0141-9331/ 2014 Elsevier B.V. All rights reserved.
⇑ Corresponding author.D. Pnevmatikatos a,⇑, K. Papadimitriou a, T. Becker b, P. Böhmb, A. Brokalakis h, K. Bruneel c, C. Ciobanu d,
T. Davidson c, G. Gaydadjiev d, K. Heyse c, W. Luk b, X. Niu b, I. Papaefstathiou h, D. Pau g, O. Pell f,
C. Pilato e, M.D. Santambrogio e, D. Sciuto e, D. Stroobandt c, T. Todman b, E. Vansteenkiste c
a Foundation for Research and Technology-Hellas, Heraklion, Greece
b Imperial College London, London, UK
cGhent University, Ghent, Belgium
dChalmers University of Technology, Göteborg, Sweden
e Politecnico di Milano, Milan, Italy
fMaxeler Technologies, London, UK
g STMicroelectronics, Agrate, Italy
h Synelixis, Chalkida, Greecea r t i c l e i n f o
Article history:







Runtime systema b s t r a c t
The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconﬁguration) EU FP7 pro-
ject, aims to ease the design and implementation of dynamically changing hardware systems. Our moti-
vation stems from the promise reconﬁgurable systems hold for achieving high performance and
extending product functionality and lifetime via the addition of new features that operate at hardware
speed. However, designing a changing hardware system is both challenging and time-consuming.
FASTER facilitates the use of reconﬁgurable technology by providing a complete methodology enabling
designers to easily specify, analyze, implement and verify applications on platforms with general-pur-
pose processors and acceleration modules implemented in the latest reconﬁgurable technology. Our
tool-chain supports both coarse- and ﬁne-grain FPGA reconﬁguration, while during execution a ﬂexible
run-time system manages the reconﬁgurable resources. We target three applications from different
domains. We explore the way each application beneﬁts from reconﬁguration, and then we asses them
and the FASTER tools, in terms of performance, area consumption and accuracy of analysis.
 2014 Elsevier B.V. All rights reserved.1. Introduction
Extending product functionality and lifetime requires constant
addition of new features to satisfy the growing customer needs
and the evolving market and technology trends. Software compo-
nent adaptivity is straightforward, but in many cases it is not
enough. Recent products incorporate hardware accelerators to sat-
isfy performance and energy requirements. These accelerators also
need to adapt to the new requirements. Reconﬁgurable logic allows
the deﬁnition of new functions to be implemented in dynamically
instantiated hardware units, combining adaptivity with hardware
speed and efﬁciency. However, designing a hardware system that
changes over time is a challenging and time-consuming task.
We propose a methodology enabling designers to easily imple-
ment applications on platforms with one or more general-purposeprocessors and multiple acceleration modules implemented in
reconﬁgurable hardware. Our main contribution is that we intro-
duce partial reconﬁguration from the initial design stage all the
way down to the runtime use of the system. Fig. 1 depicts the
tool-chain of FASTER project [1]. Its input is the description of
the application in a high-level programming language; the initial
decomposition into tasks is described in OpenMP. The correspond-
ing task graph is then partitioned in space and time to identify can-
didates for reconﬁguration. FASTER supports coarse-grain
reconﬁguration and ﬁne-grain reconﬁguration. The former allows
for swapping the hardware modules identiﬁed at design time as
reconﬁgurable ones in/out of FPGA regions; this is called region-
based reconﬁguration. The latter allows for reconﬁguring small
parts of the FPGA with circuits synthesized at run-time; this tech-
nique is called micro-reconﬁguration and enables the creation of
specialized circuits containing infrequently changing parameters.
We also address the veriﬁcation of static and dynamic aspects of
Fig. 1. Abstract view of FASTER tool-chain.
322 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338a reconﬁgurable design while minimizing run-time overheads on
speed, area and power consumption. Finally, we developed a
run-time system for managing the various aspects of parallelism
and adaptivity of an application by taking into account the run-
time availability of reconﬁgurable resources. To test our methodol-
ogy we employ applications from the embedded, desktop and
high-performance computing domains, using as metrics perfor-
mance, area consumption and accuracy of analysis at an early stage
of the development cycle.
This paper extends our previous work in [2], providing more
details on all the parts of FASTER project. It is structured as follows:
Section 2 overviews previousworks onmethods and tools for recon-
ﬁgurable system design, and exposes the relevance and novelty of
the FASTER project. In Section 3 we delve into the details of the
front-end tool-chain by presenting the connections between the dif-
ferent stages, and discussing the XML exchange format. Section 4
describes brieﬂy region-based reconﬁguration supported by FPGA
vendors, and extends our previous work on micro-reconﬁguration
with reconﬁguration of routing and with a proﬁler to assist the
designer prior to taking decisions. Section 5 discusses our veriﬁca-
tion approach, and Section 6 presents the runtime system operation
and its input requirements at design- and run-time. Section 7 dis-
cusses the target applications coming from the industrial side, the
way we design them to explore reconﬁguration capabilities, their
performance evaluation, and evaluates some of our tools. Section 8
summarizes our contributions, and Section 9 concludes the paper.2. Related work and motivation
Reconﬁgurable computing has been extensively studied in the
academic literature. The authors in [3] presented a survey covering
reconﬁgurable architectures and design methods. The FASTER pro-
ject targets two system-level architectures; the stand-alone recon-
ﬁgurable logic, and the organization that embeds the processor in
the reconﬁgurable fabric. Regarding design methods, we focus on
run-time customization using partial reconﬁguration. We
developed a run-time system to hide low-level system details
from the designer, which handles scheduling, placement, andcommunication with the reconﬁguration port. In addition, the
work in [3] identiﬁed the compilation tool-chain as an important
challenge in reconﬁgurable systems. Towards this direction, we
propose tools that assist the designer in deciding which part of
the application should be mapped to the reconﬁgurable hardware
and when reconﬁguration should occur. In speciﬁc, we offer a
semi-automatic approach to leverage designer’s familiarity with
the application to ensure high quality of results.
A detailed survey in [4] summarizes research on compilation
techniques for reconﬁgurable architectures and categorizes them
based on their features and target platforms. The authors focus
mainly on High Level Synthesis (HLS) tools both for ﬁne- and
coarse-grained reconﬁgurable hardware. Furthermore, they pres-
ent a generic compilation ﬂow, highlighting the interaction
between the various transformations and optimization stages,
and discussing hardware-software partitioning. That work catego-
rizes relevant research according to the supported programming
languages and the intermediate representations. Also, it provides
a classiﬁcation of the most important code transformations at dif-
ferent levels (bit, instruction, loops, etc.), as well as insights in both
temporal and spatial partitioning techniques. Temporal partition-
ing is relevant in the context of dynamic partial reconﬁguration,
using time-multiplexing when hardware resources are insufﬁcient
while minimizing the reconﬁguration overhead. In addition, the
authors state that the adoption of reconﬁgurable computing is lim-
ited by the cumbersome process of programming these platforms.
Similarly, the work in [5] points out the lack of adequate develop-
ment methodologies and EDA tools for reconﬁgurable systems. Our
work address this issue by providing easy-to-use and ﬂexible tools.
Further information related to reconﬁgurable architectures and
devices, application development and tools is discussed in [6],
while [7] studies the above aspects concentrating on dynamically
reconﬁgurable systems only.
Different frameworks have been proposed to address the con-
current development of architecture and application for heteroge-
neous systems. For example, Ptolemy [8] is an environment for
simulating and prototyping heterogeneous systems with mecha-
nisms for modeling multiple abstraction levels and heterogeneous
mixtures of models of computation. Daedalus [9] is a system-level
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 323design framework, composed of several tools that range from auto-
matic parallelization for Kahn Process Networks to design space
exploration of both the architectural and platform levels, and to
the synthesis of the candidate platform architecture. The different
components are interfaced through XML ﬁles. Within hArtes
project [10] an integrated methodology from the automatic
parallelization to the generation of heterogeneous systems was
developed, but without considering the reconﬁguration aspects.
FASTER extends this approach by adopting different estimation
tools and partitioning algorithms (interfaced with XML ﬁles), while
the task partitioning of the input application is speciﬁed bymeans of
OpenMP pragmas proposed by [11] as produced by the partitioning
methods in hArtes. Another framework for programming heteroge-
neous platforms is OpenCL [12], an open royalty-free standard for
cross-platform, parallel programming of modern processors found
in personal computers, servers and handheld/embedded devices.
EU-funded projects such as hArtes [10], REFLECT [13], ACOTES
[14], ANDRES [15], and Morpheus, conducted research on the
necessary stages of a tool-chain and addressed similar issues with
FASTER, but they focused more on system-level or architectural
aspects of reconﬁguration. Moreover, they do not explicitly
emphasize on the design and runtime aspects of partial and
dynamic reconﬁguration, or, on choosing the best reconﬁguration
grain-size. On the contrary, we introduce partial and dynamic
reconﬁguration from the initial design of the system all the way
down to its runtime use.
To the best of our knowledge, the existing approaches do not
abstract from the designer complex manipulations needed to
control effectively hardware accelerators, in particular when these
are designed as dynamically reconﬁgurable modules. Towards this
direction, we aim at providing a general formulation capable to
deal with different multiprocessor systems and different hardware
implementations for the tasks (also by exploiting micro-architec-
tural optimizations), and proposing a tool-chain that efﬁciently
supports partitioning of the application, while performing explora-
tion on the possible solutions for the problem. In addition, we con-
sider reconﬁguration from the early stages of the design process,
hiding most of the implementation details from the user.3. The FASTER front-end
The present Section discusses the discrete stages and the way
we interconnected them to form the FASTER tool-chain. Then it
explains the structure of the XML exchange format we use to pass
information amongst the stages.3.1. The front-end tool-chain
The input of the FASTER front-end is an application in C – gcc
C11 – whose initial decomposition is described with OpenMP prag-
mas, and an XML ﬁle containing information about the target archi-
tecture, such as the number of HW/SW processing elements,
characteristics of reconﬁgurable regions, and the different imple-
mentations of hardware accelerators. The corresponding task graph
is partitioned to determine which processing element will execute
each application task. Every hardware task is treated as a static IP
core, a region-based reconﬁgurable module, or a micro-reconﬁgu-
rable module. We do not focus on the automatic generation of the
HDL implementations for the hardware cores; these are provided
by the user either using traditional HDL design or high-level syn-
thesis tools. We target systemswith partially reconﬁgurable FPGAs,
either in a single-FPGA or a multi-FPGA environment. In order to
support the analysis for region-based and micro-reconﬁgurable
actions we include additional steps. Note that the solution is
represented by the mapping of each task not only to a processingelement, but also to one of its available implementations; this
allows for exploring alternative trade-offs between performance
and usage of resources. The methodology is outlined in Fig. 2, and
is organized in four phases: Application proﬁling and identiﬁcation
of reconﬁgurable cores, High-level analysis, Optimizations for region-
and micro-reconﬁguration, and Compile-time scheduling and mapping
onto reconﬁgurable regions.
The Application proﬁling and identiﬁcation of reconﬁgurable cores
based on the initial source code of the application and the descrip-
tionof the target architecture, decomposes the application into tasks
and assigns them to the different components of the architecture. It
can also use information about the performance of the current tasks,
and feedback after the execution of the schedule, e.g. how the
partitioning affects the computed schedule, in order to iterate and
improve gradually the solution. In addition, it determines (i) the
proper level of reconﬁguration, i.e. none, region-based, or micro-
reconﬁguration, for each of the hardware cores by including differ-
ent analyses (either static or dynamic), and (ii) the properties of
the identiﬁed tasks, such as the frequency of call functions, the fre-
quency of micro-reconﬁguration parameter change, the resources
required for each implementation, and the execution performance.
The scope of the High-level analysis phase is to explore various
implementation options for applications (or parts of applications)
that target reconﬁgurable hardware and to automatically identify
opportunities for run-time reconﬁguration. The analysis is based
on an application description in the form of a hierarchical Data
Flow Graph (DFG), application parameters such as input data size,
and physical design constraints such as available area and memory
bandwidth. The high-level analysis relies on DFGs for functions to
estimate implementation attributes such as area, computation
time and reconﬁguration time, in order to avoid time-consuming
iterations in the design implementation process. The hierarchical
DFG contains function DFGs to represent algorithm details in appli-
cation functions. For a function DFG, arithmetic nodes are mapped
as data-paths, and data access nodes are mapped as memory archi-
tectures. The area and the bandwidth attributes are estimated
based on the mapped nodes. The computation time is calculated
by relying on data-path performance and data size, and the recon-
ﬁguration time is calculated using area consumption and reconﬁg-
uration throughput. These estimations are used as input to the
previous processing step, i.e. Application proﬁling and identiﬁcation
of reconﬁgurable cores, to perform design optimizations including
arithmetic operations presentation, computational precision, and
parallelism in the implementation. Compared with Design Space
Exploration (DSE) process, our high-level analysis relies on hard-
ware design models to actively estimate design properties. Once
function properties are estimated, the high-level analysis examines
the interaction between application functions to suggest opportu-
nities for reconﬁguration. Application functions are partitioned
into several reconﬁgurable components, to separate functions that
are active at different time. Inside a reconﬁgurable component,
previously idle functions are removed. This can increase the
throughput while using the same area, or, reduce the area while
providing the same throughput. As depicted in Fig. 2,, high-level
analysis interacts with Application proﬁling and identiﬁcation of
reconﬁgurable cores, as it provides key information for the overall
design partitioning and hardware/software co-design process.
Several iterations of these two processing steps might be needed.
Information between them is exchanged through an XML ﬁle.
The third phase, the Optimizations for region- and micro-
reconﬁguration, receives the descriptions of the tasks, i.e. source
code, that could beneﬁt from the reconﬁguration and produces
new and optimized implementations for them to be considered
during task mapping. This analysis also proﬁles the application
tasks to determine the slow-changing parameters for the
micro-reconﬁguration.
Fig. 2. Front-end of tool-chain.
324 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338Finally, the Compile-time scheduling and mapping onto reconﬁgu-
rable regions phase receives information about the application and
the architecture from the two previous processing steps, focusing
on the tasks assigned to the reconﬁgurable hardware and it deter-
mines the task schedule, along with the mapping of the cores onto
the reconﬁgurable regions. It also determines the number and
characteristics of these regions, e.g. size, the number and size of
each input/output point, and also takes into account the intercon-
nection infrastructure of the system, e.g. bus size. Finally, it anno-
tates the tasks with information about the feasibility of the
implementation where the solution is speciﬁed (i.e. if the reconﬁg-
urable region can satisfy the resource requirements), and it pro-












Fig. 3. Interaction between the initial steps of the FASTER tool-chain through the
XML ﬁle structure.3.2. The XML exchange format
We adopt an exchange format based on the Extensible Markup
Language (currently XML v2.0 is supported), which allows to easily
integrate the different methodologies developed in parallel, as well
as the manual decisions performed by the designer. Below we
describe the interfaces between the different activities. The FASTER
XML has a modular format and contains four independent but
related sections that can be processed by different modules:
 tag <architecture>: the architecture is deﬁned here in
advance, at least in terms of the number of processing elements
and area dedicated to hardware cores, either reconﬁgurable or
not. Communication architecture and memory hierarchy are
also provided here;
 tag <application>: this part describes high-level information
about the application, e.g. source code ﬁles, proﬁling informa-
tion, workload or data input characterization, without any con-
nection with the architecture or its implementation;
 tag <library>: it contains the available implementations for
the application tasks, along with the performance and corre-
sponding resource requirements. It takes into account the
implementations derived from the application, e.g. produced
by the partitioning methodology along with the high-level anal-
ysis methods, and the ones from external sources, e.g. obtained
through external tools and provided by the designer to the
methodology;
 tag <partitions>: the structure of the partitioned application
in terms of tasks and data transfers, along with the correspond-
ing static schedule and mapping solutions, both for hardware
and software tasks.The architecture and application parts are independent of each
other, while the library brings together information from the par-
titions (this tag contains info on the tasks) and the architecture
(this tag contains info on the processing elements) by means of
the description of the available implementations. Fig. 3 highlights
how the different parts of the FASTER tool-chain interact through
the XML ﬁle format; it reﬂects to the ﬁrst two processes of Fig. 2
and illustrates how different parts of the XML ﬁle structure are
analyzed, generated or updated:
 Application analysis and proﬁling: it corresponds to the analysis
performed on the initial application code. It includes the proﬁl-
ing of the call graph and the function call parameters to
improve the HW/SW partitioning and the identiﬁcation of cores
that can beneﬁt from micro-reconﬁguration.
 Partitioning and optimization: it includes the HW/SW partition-
ing stage, along with the optimization of the task implementa-
tions, especially for exploiting micro-reconﬁguration.
 High-level analysis: it produces estimates for hardware imple-
mentations of tasks such as area and computation time. This
is based on analyzing the application, its input data and design
constraints. The estimates are used for partitioning and
optimization.
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 3254. Region-based and micro-reconﬁguration support
FASTER supports both region-based and micro-reconﬁguration.
Each of these two options offers advantages in different conditions.
4.1. Region-based reconﬁguration
Region-based reconﬁguration describes the concept of instanti-
ating a new function in a particular region of the FPGA. The gener-
ation of conﬁguration bitstreams takes place at design time. The
designer marks a certain functionality as reconﬁgurable and con-
ﬁnes its logic to a dedicated region on the FPGA by means of ﬂoor-
planning. This is shown in Fig. 4a, while Fig. 4b illustrates that a
number of different reconﬁgurable functions can be implemented
targeting the same region. This region can be reconﬁgured at
run-time with the desired functionality while the rest of the chip
remains operational. An FPGA design can contain multiple recon-
ﬁgurable regions, and in general, reconﬁgurable functions are
loaded only into the region they were originally implemented for.
The challenge when designing such systems is identifying func-
tionality that can be reconﬁgured, effectively allocating them to
dedicated regions and ﬂoorplanning the entire design. The problem
of ﬂoorplanning in the domain of partially reconﬁgurable FPGAs
steadily attracts the interest of researchers [16,17]. In FASTER pro-
ject we use the ﬂoorplanning published in [18], while for support-
ing region-based reconﬁguration we rely mainly on vendor’s tools.
4.2. Micro-reconﬁguration
One or more of the aforementioned regions can also be recon-
ﬁgured in a ﬁner granularity to implement Runtime Circuit Special-
ization (RCS) [19,20]. RCS is a technique for specializing an FPGA
conﬁguration at runtime according to the values of a set of param-
eters. The main idea is that before a task is deployed on the FPGA, a
conﬁguration that is specialized for the new parameter values is
generated. Specialized conﬁgurations are smaller and faster than
their generic counterpart, hence RCS can potentially result in a
more efﬁcient implementation. Currently, the design tools of FPGA
manufacturers support region-based reconﬁguration only, where a
limited number of functionalities are time-shared on the same
piece of FPGA region.
The problem of mapping a hardware speciﬁcation to FPGA
resources is NP-complete [21], and a specialization process could
generate sub-optimal solutions. There is a trade-off between the
resources used for the specialization process and the quality of
the resulted specialized FPGA conﬁguration; the more resources
spent on generating the specialized functionality, the fewer
resources needed to implement the specialized functionality. The
extent to which an optimal implementation can be achieved






Fig. 4. (a) FPGA fabric with a pre-deﬁned Reconﬁgurable Region (RR) and (b) A, B, C
functions are generated off-line and each one can be loaded during run-time into
the RR.the vendor region-based tool-chain, but this is adequate only if
the number of reconﬁgured circuits is limited. In RCS the number
of parameter values grows exponentially with the number of bits
needed to represent the parameter data; generating all conﬁgura-
tions off-line and storing them in a repository becomes infeasible
for real-life applications. Instead, in FASTER project we use special-
ization at run-time, i.e. on-line generation of partial conﬁgurations.
We do this using the method of parameterized conﬁgurations
described in [22] that relies on a simpliﬁed low-overhead run-time
tool-chain. Using parameterized conﬁgurations, RCS implementa-
tions having similar properties to handcrafted applications are
built automatically. We build on the observation that specializa-
tion process actually implements a multivalued Boolean function,
which is called Parameterized Conﬁguration (PC). In fact, both
the input, i.e. a parameter value, and the output, i.e. a specialized
conﬁguration, of the specialization process are bit vectors.
We use the staged compilation illustrated in Fig. 5. First, a
parameterized conﬁguration is constructed and represented in a
closed form starting from a parameterized HDL description, shown
in Fig. 5(a). This is executed at compile time, when the parameter
values are unknown. Then, a specialized conﬁguration is produced
by evaluating the parameterized conﬁguration given a parameter
value, which is shown in Fig. 5(b). This is executed at run-time
after the parameter values have become available. The specialized
conﬁguration is then used to reconﬁgure the FPGA.
Fig. 6 represents the results of each stage of micro-reconﬁgura-
tion. The parameterized conﬁguration generated off-line, is ﬁrst
loaded to the RR shown in Fig. 4(a). It includes a number of static
bits, i.e. the rectangle with the missing circles, and a number of
parameter dependent bits, i.e. the circles; this is illustrated on
the left side of Fig. 6. The PC evaluation step of the on-line stage
determines the values of the parameter dependent bits. Then, the
RR is reconﬁgured at run-time with these values, by overwriting
the parameter dependent bits without disrupting the static bits
of RR. Fig. 6 shows that different parameter dependent bits are
generated, as a result of evaluating on-line the parameter values.Fig. 5. Staged compilation in RCS techniques using parameterized conﬁgurations:
(a) off-line stage of the tool-chain and (b) on-line stage.
Off-line On-line
Parameter-values:
Fig. 6. Off-line and on-line results of micro-reconﬁguration.
326 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338It is important to notice that the parameterized HDL description
of off-line stage is an HDL description that distinguishes regular
input ports from parameter input ports. The parameter inputs
are not inputs of the ﬁnal specialized conﬁgurations. Instead, they
will be bound to a constant value during the specialization stage.
The early version of micro-reconﬁguration ﬂow in [22] was able
to support reconﬁguration of LUTs only. Within the context of
FASTER project we extended it so as to support reconﬁguration
of the routing architecture as well. In particular, in [23] we
extended a technology mapping algorithm so as to use the run-
time reconﬁgurability of the routing infrastructure, which led to
larger area gains as compared to the initial version [22]. Then in
[24], we proposed an efﬁcient router for handling connections that
are reconﬁgured at run-time.
Another limitation of the initial micro-reconﬁguration ﬂow was
the difﬁculty in determining at an early design stage whether an
application will beneﬁt from a micro-reconﬁgurable implementa-
tion. To this direction we developed an RTL proﬁler that analyzes
the RTL description of the application and uses a metric called
functional density in order to compare different micro-reconﬁgu-
rable implementations. Functional density was introduced in [20]
for measuring the amount of computation that can be placed
within a certain logic area. Our RTL proﬁler aims at ﬁnding the
most interesting parameters and estimate the functional density
of the corresponding implementations. The designer then evalu-
ates the most interesting parameter choice, and the extent to
which the design will beneﬁt from micro-reconﬁguration. We also
studied the feasibility of a high-level proﬁler able to explore the
potential gains from micro-reconﬁguration earlier in the design
cycle. This proﬁler focuses on the data path of the high-level
descriptions. Our case study comes from the domain of multi-
mode applications, and in a recent work we discuss the best
parameter candidates for micro-reconﬁguration in this speciﬁc
domain [25]. In our tool-chain, after the high-level proﬁling takes
place, the part of the design that get gains from micro-reconﬁgura-
tion is annotated in the XML ﬁle.5. Veriﬁcation of changing systems
Veriﬁcation ensures that an optimized, reconﬁguring design
preserves the original behavior. In the FASTER workﬂow, there
are two complementary aspects to validate and verify reconﬁgur-
ing designs: ﬁrst, given a set of conﬁgurations, ensuring the correct
one is loaded, verifying the correctness of reconﬁguration at run-
time; second, verifying the correctness of a reconﬁgurable design
compared to a traditional design. The novelty of our approach lies
in (i) verifying streaming designs including metaprogramming; (ii)
verifying designs using run-time reconﬁguration; and (iii) verifying
co-design of systems containing hardware and software.
Section 5.1 outlines our approach to the ﬁrst challenge using
traditional approaches such as checksums; the rest of the Sec-
tion deals with the remaining challenges by combining symbolic
simulation and equivalence checking.Fig. 7. Veriﬁcation design ﬂow.5.1. Micro-architectural support for run-time signature validation
At run-time, FASTER-based systems may change due to region-
based reconﬁguration, micro-reconﬁguration, or Custom Comput-
ing Unit (CCU) relocation. To this end, the FASTER system matches
each CCU with its corresponding signature to check integrity and
validity. The signature type is chosen to suit available resources
and validation requirements: for simple signatures (checksums or
cryptographic hashes), validation checks that the signaturematches
the CCU. For complex signatures (proof traces or complete symbolic
proofs), signature validation also veriﬁes functional correctness.On loading a new partial reconﬁguration, the system ﬁrst vali-
dates its bitstream using the signature. For complex signatures,
the system also veriﬁes functional correctness of the CCU using
the signature. The bitstream will be loaded into the target device
only if this process succeeds.
The FASTER tool-chain provides hardware support to the run-
time system for signature validation and veriﬁcation. Basic support
includes, but is not limited to:
 dedicated storage space for previous veriﬁcation points;
 a signature checker to verify that CCUs and signatures match;
 counters to track the number of veriﬁcations and veriﬁcation
results statistics.
5.2. Equivalence checking of reconﬁgurable streaming designs
Our work concerns the correctness of reconﬁgurable designs
rather than the correctness of the reconﬁguration process. Tradi-
tional approaches to design validation simulate reference and opti-
mized designs with test inputs, comparing the outputs. Such
approaches, e.g. Universal Veriﬁcation Methodology [26], use ver-
iﬁcation goals and automation to improve coverage; however,
there is always a danger that the test inputs do not cover all cases,
or that outputs are only coincidentally correct.
Instead of numerical or logical simulation, our approach com-
bines symbolic simulation with equivalence checking. Symbolic sim-
ulation uses symbolic rather than numeric or logical inputs; the
outputs are functions of these symbolic inputs. For example, sym-
bolically simulating an adder with inputs a and b could result in
aþ b. For larger designs, it is hard to distinguish different but
equivalent outputs (bþ a instead of aþ b) from incorrect ones.
The equivalence checker tests if the outputs of transformed designs
are equivalent to those of the reference design.
Previous work: Industrial tools for formal veriﬁcation include
Formality [27], working with existing hardware ﬂows to ensure
the equivalence of RTL designswith optimized and synthesized net-
lists. An academic approach published in [28], veriﬁes equivalence
of FPGA cores using a model checker, and proposes run-time veriﬁ-
cation by model checking at run-time, which is necessarily
restricted to small designs such as adders. Another approach in
[29] veriﬁes run-time reconﬁgurable optimizations using a theo-
rem prover. Other researchers have considered veriﬁcation of prop-
erties of discrete event systems (such as deadlock freedom) by
model checking [30], verifying programs running on FPGA-based
soft processors [31], verifying declarative parameterized hardware
designs with placement information using higher-order logic [32],
and verifying that hardware requested at run time implements a
particular function using proof-carrying code [33,34].
Our approach relates to work on design validation of imaging
operations using symbolic simulation and equivalence checking
[35]. Thiswork embeds a subset of a C-like language for FPGAdesign
Fig. 8. System model showing the components of the run-time system.
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 327into a theorem prover, using symbolic simulation and an equiva-
lence checker to verify the correctness of transformed designs.
Unlike that work, we verify optimizations of streaming designs,
with our implementation using Maxeler MaxCompiler. This means
that we must (i) preserve the order of inputs to and outputs from
the design and (ii) allow for user metaprogramming, since Maxeler
designs are Java programs. Furthermore, we extend our approach to
hardware-software co-design and run-time reconﬁguration.
Fig. 7 shows our approach, comparing a reference design (the
source) with a transformed design (the target). For the FASTER pro-
ject, we compare designs implemented as Maxeler MaxJ kernels;
our approach could apply to other hardware design descriptions,
or to software. The veriﬁcation happens in four phases:
 Design optimization: the rest of the FASTER design ﬂow trans-
forms a source design to a target;
 Compilation for simulation: compile the MaxJ kernel for the sym-
bolic simulator in two steps: (i) interpret the program to unroll
any compile-time loops in the MaxJ design, and (ii) compile the
design to a symbolic simulation input using a syntax-directed
compile scheme;
 Symbolic simulation: a symbolic simulator applies symbolic
inputs to source and target designs;
 Validation: the Yices equivalence checker [36] compares the
outputs of source and target, resulting in either success (source
and target designs match), or failure, with a counter example
showing why the designs are not equivalent.
5.3. Verifying dynamic aspects of the design
The FASTER tool-chain generates run-time reconﬁgurable
designs that are not supported by symbolic simulators or equiva-
lence checkers. Rather than modifying these tools, we adapt an
approach modeling run-time reconﬁguration using virtual multi-
plexers [37], enclosing mutually-exclusive conﬁgurations within
virtual multiplexer-demultiplexer pairs. We compile the run-time
reconﬁgurable parts of designs to be enclosed by such pairs. We
modify the conﬁguration controller to generate the control inputs
to the multiplexers to choose the appropriate conﬁguration. Our
approach applies equally to static, region-based reconﬁguration,
or micro-reconﬁguration.
5.4. Hardware-software co-design
Hardware designs are rarely developed in isolation; often, soft-
ware is a part of an overall design. Furthermore, designers often
start with a software reference design, (e.g. a textbook algorithm
implementation), which they accelerate with reconﬁgurable hard-
ware. Hence, we extend our approach to verify hardware-software
co-designs.
We model hardware-software codesign by compiling from soft-
ware to the symbolic simulator. We adapt a syntax-directed hard-
ware compilation scheme, which has the advantage that the
number of simulation cycles is statically determinate, making it
easier to compare software and hardware simulation outputs. To
interface hardware and software, we use a synchronous API (appli-
cation programming interface); this limits parallelism but simpli-
ﬁes software design. The API contains three calls:
 load: loads a streaming hardware design compiled with our
hardware compiler,
 run: runs a previously-loaded hardware design for a given cycle
count, with one or more input or output arrays, which must
match stream inputs and outputs on the hardware design,
 set_scalar: sets a scalar hardware input value, which will apply
to the hardware design on the next call to run.To model runtime reconﬁguration, we add an API call to load
multiple streaming hardware designs and switch between them
by writing to a particular scalar input, whose value controls the
virtual multiplexers selecting which design is conﬁgured into the
reconﬁgurable region.
6. Run-time system support
The Run-Time System Manager (RTSM) is a software compo-
nent controlling the execution of application workloads. It under-
takes low-level operations so as to ofﬂoad the programmer from
manually handling ﬁne grain operations such as scheduling,
resource management, memory savings and power consumption.
In a partially reconﬁgurable FPGA-based system, in order to man-
age dynamically the HW tasks, the RTSM needs to support speciﬁc
operations [38]. Fig. 8 illustrates our target system model along
with the components participating in run-time system operation
[39]. The FPGA is managed as a 2D area with regard to the HW task
placement (a HW task corresponds to a HW module). Loading of
tasks is controlled by a General Purpose Processor (GPP), while
programming of FPGA conﬁguration memory is done through the
ICAP conﬁguration port. All tasks can have both SW and HW ver-
sions available. HW tasks are synthesized at compile time, and
stored as partial bitstreams in a repository (omitted from Fig. 8
for clarity), which accords with the restrictions of Xilinx FPGA
technology. Each task is characterized by three parameters: task
area (width and height), reconﬁguration time, and execution time.
In Fig. 8, four distinct components implemented outside the recon-
ﬁgurable area participate in the control of tasks:
 Placer (P): responsible for ﬁnding the best location for the task
in the FPGA.
 Scheduler (S): ﬁnds the time slot in which a task is loaded/starts
execution.
 Translator (T): resolves the task coordinates by transforming a
technology independent representation of the available area
into the low-level commands for the speciﬁc FPGA.
 Loader (L): communicates directly with the conﬁguration port
for FPGA programming.
The system of Fig. 8 is general enough to describe similar sys-
tems. Hence, instead of ICAP, external conﬁguration ports can be
employed such as the SelectMAP or JTAG. The GPP can be a power-
ful host processor (implementing Placer, Scheduler and Translator)
communicating with the FPGA device through a PCI bus (e.g. desk-
top computing with OS), or, it can be an embedded processor (with/
Fig. 9. Technology independent bitstream format.
328 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338without OS) connected to the FPGA through a dedicated point-to-
point link. In any case, communication latency and bandwidth
should be evaluated and used effectively for optimal results.
Our RTSM supports the following operations. Scheduling handles
tasks by placing them in proper time slots; tasks can be immediately
served, or can be reserved for later reconﬁguration or/and execu-
tion. Placement should be efﬁciently supported so as to ﬁnd themost
suitable reconﬁgurable region. Conﬁguration caching, which con-
cerns placing the conﬁguration data that will be required in the
future close to the conﬁguration memory. Different conﬁguration
times have been observed depending on the type of memory used
for caching and the conﬁguration controller [40]. The RTSM also
supports conﬁguration prefetching, which alleviates the system from
the reconﬁguration overhead by conﬁguring a task ahead of time.
6.1. The FASTER architectural interface
The architectural interface consists of the conﬁguration agnostic
ISA extensions and the technology independent bitstream format.
6.1.1. Conﬁguration content agnostic ISA interface
Our ISA interface resembles the Molen programming paradigm
introduced in [41]. It represents a sequential consistency paradigm
for programming Custom Computing Machines (CCUs), consisting
of a General Purpose Processor (GPP) and a number of Reconﬁgu-
rable Units (RUs), implemented using FPGAs. The FPGA is viewed
as a co-processor that extends the GPP architecture. An arbiter
module is introduced between the main memory and the GPP that
partially decodes the instructions and forwards them either to the
GPP or to the reconﬁgurable units. The software instructions are
executed by the GPP, and the hardware operations by the Reconﬁg-
urable Units. In order to pass parameters between the GPP and the
RUs, an additional Register File is used the XREGs. A multiplexor
facilitates the sharing of the main memory between the GPP and
the Reconﬁgurable Units. There are dedicated instructions for pass-
ing values between the GPP and the XREGs. In the case of micro-
reconﬁguration, conﬁguration memory is expressed as a function
of a set of parameters. This function takes the parameter values
as input and outputs an FPGA conﬁguration that is specialized
for these parameter values; the function is called a parameterized
conﬁguration. The corresponding parameterized conﬁguration is
evaluated after the bitstream is loaded from the memory, and a
specialized component generates the ﬁnal reconﬁguration data
before sending them to the conﬁguration port.
6.1.2. FASTER technology independent bitstream format
Fig. 9 illustrates the operation of task creation at design time.
The conﬁguration data and task speciﬁc information are merged
together in a so-called Task Conﬁguration Microcode (TCM) block
introduced in [39]. TCM is pre-stored in the memory at the Bit-
stream (BS) Address. It is assumed that each task requires reconﬁg-
urable area with rectangular shape. The conﬁguration data is
obtained from the vendor speciﬁc synthesis tools. After this, we
can create the technology independent bitstream format, shown
in Fig. 9. The task speciﬁc information includes the length of the
conﬁguration data, the Task Parameter Address (TPA), the size of
the task, the Execution Time per Unit of Data (ETPUD) and a ﬂag
that speciﬁes the reconﬁguration type (RT) region-based or
micro-reconﬁguration.
The length of the conﬁguration data, the throughput of the
reconﬁguration port and the type of reconﬁguration controller
are used to estimate the reconﬁguration time, in terms of clock
cycles. The size of the task is expressed by the width and height
of the task, expressed in terms of atomic reconﬁgurable units,
e.g. CLBs. The TPA contains pointers to the locations of the input
and output parameters of the task. Using the ETPUD in conjunctionwith the size of the input and output parameters, the execution
time (in clock cycles) can be estimated. In the case of micro-recon-
ﬁguration, additional parameters are included. The ﬁrst is the
number of parameters of the parameterized conﬁguration (N), fol-
lowed by N pairs of parameter width/index of the XREG containing
the parameter value. Finally, a binary representation of the param-
eterized conﬁguration data is included.
By utilizing a generic model for the size of the tasks, the low
level details of the FPGA are abstracted, avoiding the need to
expose proprietary details of the bitstreams, which may differ
amongst vendors.
6.2. Input at compile-time and run-time
In the FASTER project we target platforms combining software
and reconﬁgurable elements. The RTSM determines which task of
an application is going to run on which processing element (PE)
of the platform; a PE can be a discrete component such as a CPU,
a static hardware part, or, a partially reconﬁgurable part of an
FPGA. The RTSM is fed with information by the baseline scheduler
created at compile time. Such information is the task graph for rep-
resenting the task dependencies, and the initial mapping of the
tasks in the PEs. During execution, the RTSM ﬁnds the time slot
in which a task should be loaded (for a partially reconﬁgurable
region this means that a reconﬁguration should occur) and the task
execution starting time. Also, in case a request for loading a HW
task to the reconﬁgurable logic is made but there is no space avail-
able, the RTSM either activates a SW version of the task (if it is
available), or reserves the task for future execution based on pre-
dicted free space and speciﬁc starting times.
Below we identify the parameters determined at compile time
that do not change, as well as the ones that do change during run-
time. The former ones called static parameters should be saved in
storage means that will be accessed during execution only for
reading, e.g. initially stored in a ﬁle system and then loaded to
the data segment of the CPU memory, while the latter ones called
dynamic parameters are saved in dynamic allocated storage
updated at runtime. Below we list the set of these parameters
mainly focused to the information needed to manage the reconﬁg-
urable elements of the platform.
Static parameters
 Reconﬁgurable regions: designated at compile time using the
region-based method.
 Reconﬁguration time: this attribute is related directly with the
size of reconﬁgurable region. Other contributing factors are the
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 329throughput of reconﬁguration port, the memory storing the bit-
stream, and the reconﬁguration controller.
 ETPUD (Execution Time Per Unit of Data): it is ﬁxed, as it con-
cerns the time elapsed to process a speciﬁc amount of data,
and affects the task overall execution time. However, the task
overall execution time can be also inﬂuenced by the overall size
of the data to be processed, which might not be ﬁxed, and by
the nature of the data to be processed.
 Tasks that at compile time are assigned to be executed in ﬁxed
PEs, i.e. CPU or static HW taks and tasks assigned to certain
reconﬁgurable areas.
Dynamic parameters
 The current status of each reconﬁgurable region. Possible condi-
tions are: empty, (re)conﬁguring, busy executing, not empty but
idle, the ID of the task placed on the reconﬁgurable region.
 The current status of each task. Possible conditions are: already
placed in a reconﬁgurable region and ready for execution,
(re)conﬁguring, running, to be executed in a partially reconﬁg-
urable region but not yet conﬁgured, the reconﬁgurable region
in which the task is loaded.
 Task execution time. It can depend on: (i) the amount of data to
be processed, such as real-time data entering the system
through a network link, or, (ii) the nature of data to be pro-
cessed, or, (iii) the amount of times a loop iterates before com-
pleting data processing.
7. Evaluating FASTER tools on industrial applications
To asses FASTER applicability we used our tools and explored
reconﬁguration capabilities on three industrial applications.
Beyond describing the applications, the contributions of present
Section are:
 identiﬁcation of the application functions to be accelerated in
hardware;
 the way each application beneﬁts from reconﬁguration and
identiﬁcation of application parts that worth to be reconﬁgured;
 potential parallelism of applications and proﬁling;
 use of FASTER tools for analyzing and implementing the
applications;
 qualitative and quantitative analysis of performance and area
consumption results and their trade-offs;
 evaluation of the high-level analysis tool of Section 3; and
 overhead of the run-time system of Section 6.
The FASTER project aims at serving different application
domains, i.e. high-performance computing, desktop and low-cost
embedded, thus our tools target different platforms.
7.1. Reverse time migration
We employed FASTER tools to implement Reverse Time Migra-
tion (RTM), a seismic imaging technique used in oil and gas indus-
try to detect terrain images of geological structures based on
Earth’s response to injected acoustic waves [42].
7.1.1. Application description
The objective of RTM is to create images of the subsurface of
Earth from acoustic measurements performed at the surface. This
is done by activating a low frequency acoustic source on the sur-
face, and recording the reﬂected sound waves with tens of thou-
sands of receivers for several seconds (typically 8–16 s). Thisprocess is called a ‘‘shot’’ and is repeated many thousands of times
while the source and/or receivers are moved to illuminate different
areas of the subsurface. The resulting dataset is dozens or hun-
dreds of terabytes in size, and the problem of transforming it into
an image is computationally intensive.
The concept behind RTM operation is simple. It starts with a
known ‘‘earth model’’, which is the best known approximation to
the subsurface geology, indicatively represented with acoustic
velocity data. Scientists conduct simultaneously two computational
modeling experiments through the earthmodel, both attempting to
simulate the seismic experiment conducted in the ﬁeld – one from
sources perspective and one from receivers perspective. The source
experiment involves injecting our estimated source wavelet into
the earth and propagating it from t0 to our maximum recording
time tmax, creating a 4D source ﬁeld sðx; y; z; tÞ; typical values for
x; y; z; t are 1000–10000. At the same time, we conduct the receiver
experiment; we inject and propagate the recorded data starting
from tmax to t0, creating a similar 4D volume rðx; y; z; tÞ. We have a
reﬂection where the energy propagated from the source and recei-
ver is located at the same position at the same time, thus an image
can be obtained by summing the correlation of the source and recei-
ver waveﬁeld at every time point and every ‘‘shot’’.
7.1.2. Application analysis and parallelism
A 3D subsurface image is generated by simultaneously propa-
gating two waves through a model of the earth, and correlating
the results of the simulations. These operations are carried out
by the propagate and image kernel respectively. The propagate
kernel computes the waveﬁeld state at the next timestep based
on the current and previous timesteps. The image kernel performs
the cross-correlation of the source and receiver waveﬁelds. These
form the main computational kernels of RTM.
Propagating source and receiver in opposite directions in time
leads to high memory requirements as two state ﬁelds at different
points in time must be maintained. In our implementation, to
avoid storing full 4D data volumes (can be many terabytes in size),
we compute the source waveﬁeld fully forward in time and then
back in time in parallel to the receiver ﬁeld. This approach propa-
gates the source twice, and thus requires 50% more computation
than the naive approach, but avoids the data management problem
and reduces the memory footprint. Algorithm 1 shows the pseudo-
code for running RTM algorithm on a single ‘‘shot’’.
Algorithm 1. RTM pseudo-code for a single ‘‘shot’’.
migrate_shot(shot_id) {
src_curr = zeros(nx,ny,nz); src_prev = zeros(nx,ny,nz);
rcv_curr = zeros(nx,ny,nz); rcv_prev = zeros(nx,ny,nz);
image = zeros(nx,ny,nz,nh);
model = load_earthmodel(shot_id);




swap(curr_src, prev_src); // reverse time direction




if (i % image_step == 0); // typically every 5–10
steps
image (src_curr, rcv_curr, image);
}
}
330 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338A typical RTM has tens of thousands of ‘‘shots’’, with each ‘‘shot’’
taking minutes, hours or days to compute, and this axis is almost
embarrassingly parallel. Each ‘‘shot’’ can be computed indepen-
dently from any other ‘‘shot’’, and eventually all ‘‘shots’’ are com-
bined with a simple addition to form the ﬁnal image result.
Going into further details of the application is out of the scope of
present work; more information is available in [43]. Ideally we
opt to parallelize ‘‘shots’’ over multiple nodes in a cluster. For per-
formance purposes (and when considering the potential impact of
reconﬁguration) it makes sense to consider only the ‘‘shot’’ compu-
tation since this dominates the runtime, and examine a single
‘‘shot’’ as a test case. Thus, in the context of the FASTER project
we restrict the RTM test case to the single shot/single node case.7.1.3. Reconﬁguration opportunities
We focused on implementing the propagate and image kernels
as distinct partially reconﬁgurable modules sharing the same
reconﬁgurable region within an FPGA. There is no feedback of val-
ues within a single timestep, thus we implemented both kernels as
a streaming datapath with a feed-forward pipeline. The two ker-
nels are amenable to time-multiplexing using partial reconﬁgura-
tion because they run sequentially and perform fundamentally
different computations; stating otherwise the two kernels are
mutually exclusive, i.e. when propagate kernel executes, the
imaging kernel is idle and vice versa. Imaging typically runs less
frequently than propagation. This is observed in the above
pseudo-code, which shows that the propagate calculation runs
for all timesteps and the imaging runs only every N timesteps.
Time-multiplexing allows for saving FPGA resources, which can
instead be used to increase the parallelism of the individual kernels
and potentially improve the runtime performance.
7.1.4. Implementation
We implemented the RTM on a platform from Maxeler
Technologies targeting HPC applications. It provides fully inte-
grated computer systems containing FPGA-based dataﬂow engines
(DFEs), conventional CPUs and large storage means. We target one
of the standard compute nodes within such platforms, the MPC-C
series MaxNode containing 12 Intel Xeon CPUs, 144 GB of main
memory, and 4 DFEs. Each DFE utilizes a large Xilinx Virtex-6 FPGA
attached to 48 GB of DDR3 DRAM. DFEs are connected to the CPU
via PCI Express, and in addition have a high-bandwidth low-
latency direct interconnect called MaxRing for communicating
between neighboring engines. The system architecture is shown
in Fig. 10. Maxeler nodes run standard Red Hat/CentOS operating
systems and provide management tools for debugging and event
logging of both the CPU and DFE subsystems. The MaxelerOS run-
time software includes a kernel device driver, status monitoring
daemon and runtime libraries for use by individual applications.Fig. 10. Maxeler system architecture.The MaxelerOS management daemon coordinates resource use,
scheduling and data movement including automatic conﬁguration
of the FPGAs and allocation to different CPU processes.
We broke the RTM application code down into a DFE and a CPU
part. The DFE part executes the RTM kernels. The CPU part is
responsible to load the conﬁguration onto a DFE, transfer data to
and from the DFE, and run the kernels. The CPU program runs on
the main processor calling simple functions to execute operations
on the DFE. Initially, we implemented a static design as baseline for
comparing it with the partially reconﬁgurable design; in this
design the propagate and image kernels co-exist and the proper
kernel is enabled as needed. On the other hand, in the reconﬁgura-
ble version the CPU controls swapping in and out the two kernels
into the same region according to the phase of execution. Reconﬁg-
uration occurs relatively rarely and it is triggered from the CPU,
which streams the corresponding partial bitstream from CPU main
memory to the DFE as needed.
Below we discuss the important ﬁndings from implementing
and executing the partial reconﬁguration version of RTM:
 Due to the nature of application, during reconﬁguration the rest
of the FPGA does not function. The host is also idle, waiting for
imaging data, making it impossible to hide the reconﬁguration
time under any useful operations. Our ﬁndings show that for
small problems this has a signiﬁcant impact on performance,
while for larger problem sizes it becomes negligible. Indica-
tively, for a small problem in which 100 100 100 adjacent
spatial points are computed, 71% of the total end-to-end run-
time was spent in reconﬁguration. This was drastically reduced
down to 0.75% for computing a problem size of 800 800 800
adjacent spatial points. Commercially interesting problem sizes
scale up from the tested sizes up to several thousand cubed. Lar-
ger problem sizes increase the compute time, rendering the
reconﬁguration time a less signiﬁcant portion of the overall wall
clock time.
 Compared to the baseline non-reconﬁgurable implementation,
performance in the partially reconﬁgurable design was reduced
due to that designing with PR affected the clock. In addition,
although in PR design the two kernels share the same region,
while in the non-reconﬁgurable version they co-exist in the
chip, we found that the overhead from the resources added
from the PR ﬂow is considerable, especially for relatively small
problem sizes. Thus, we obtained a modest reduction of the
resources required for the propagate and image reconﬁgura-
tions. This is mainly due to the increased static resource usage
from the extra logic being introduced to perform partial recon-
ﬁguration, and to the additional dummy inputs and outputs
needed for both kernels to maintain a stable IO interface.
 Huge amount of data must be preserved during reconﬁguration.
In the ﬁrst implementation we used the FPGA’s attached DRAM
memories to store the seismic waveﬁelds and earth model vol-
umes during the computation of a ‘‘shot’’. If the FPGA is partially
reconﬁgured during the execution of a ‘‘shot’’ computation, it is
important to preserve the DRAM contents in order to enable the
computation to proceed with the correct data.
 A major trade-off between performance and area comes from
the instantiation of a DRAM controller in the same FPGA that
hosts the RTM kernels. In speciﬁc, in the ﬁrst version we imple-
mented a DRAM memory controller using the FPGA resources.
Holding data, i.e. earth model, current states of the waveﬁeld
propagations and accumulated image, in the on-card memory
connected directly to the FPGA on the DFE card is more efﬁcient
since the DFE memory provides much greater bandwidth than
the PCI Express link. However, we discovered that the DRAM
memory controller consumes a large amount of chip area,
which restricts the area that can be used for implementing
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 331the application kernels and reduces the possible gains from uti-
lizing reconﬁguration. In the second implementation we moved
the storage from the on-card memory to the CPU, and conse-
quently omitted the memory controller. Now, we store the data
in the host CPU memory and thus all transactions between CPU
and FPGA are performed through the PCIe. This reduces the
achievable throughput, but it also relieves the FPGA resources,
thus allowing for more space in designing either more modules,
or more parallelism within the module, or more pipelining or
more parallelized modules running.
7.1.5. Using FASTER high-level analysis tool on RTM
RTSM is based on Earth’s response to injected acoustic waves.
The wave propagation within the tested media is simulated for-
ward, and calculated backward, forming a closed loop to correct
the velocity model, i.e. the terrain image. The propagation of




þ dvvðrÞ252pðr; tÞ ¼ f ðr; tÞ ð1Þ
The propagation involves stencil computation, as the partial differ-
ential equation is approximated with the Taylor expansion. In our
implementation, the propagation is approximated with a ﬁfth-order
Taylor expansion in space, and ﬁrst-order Taylor expansion in time.
The constant coefﬁcients are calculated using ﬁnite difference
methods. In this part, we focus on the forward propagation function
of an RTM application.
The high-level analysis results and the measured results for
RTM are compared in Table 1. For the target RTM application,
design properties are estimated with high-level analyses, and cus-
tomized designs are developed with MaxCompiler version 2012.1
to a Xilinx Virtex-6 SX475T FPGA hosted by a MAX3424A card from
Maxeler Technologies. As shown in Table 1, the design parallelism
(i.e. the number of duplicated data-paths) is limited by LUT and
BRAM resource usage. In current experiments, we set the available
resources to be 90% of the available FPGA resources, to reduce the
design routing complexity. In terms of estimation accuracy, the
resource usage and execution time are more than 90% accurate.
This indicates that the high-level analysis captures the design
properties without going through the time-consuming synthesis
tool chain. Moreover, this implies the efﬁciency of the RTM design.
The listed execution time consists of the execution and reconﬁgu-
ration time of RTM design. The high-level analysis estimates the
design execution time based on the theoretical peak performance:
all implemented data-paths are assumed to be running full speed
in parallel. In other words, with the analytical model taking care
of application and circuit details, the applications are running with
almost the theoretical performance.
7.2. Ray tracing
In modern graphic applications it is important to achieve photo-
realistic rendering in a coherent manner in order to improve pic-
ture quality with increased scene complexity, and visualize
accurately characteristics such as real reﬂection, soft shadows, area
light source, and indirect illumination. Rendering in 3D graphicTable 1
Estimated and analyzed design properties for RTM. The usage of a resource type is
divided by the amount of available FPGA resources of the speciﬁc type.
RTM Frequency (GHz) LUT FF DSP BRAM Time (s)
Estimated 0.1 0.86 0.53 0.4 0.89 33.02
Measured 0.1 0.91 0.54 0.41 0.887 36.05design is the process of generating an image from amodel (or mod-
els in what collectively can be called a scene ﬁle), by means of com-
puter programs; this adds shading, color and lamination to a 2D or
3D wireframe to create life-like images of a screen. A scene ﬁle
contains objects in a strictly deﬁned language or data structure;
it can include geometry, viewpoint, texture, lighting, and shading
information as a description of the virtual scene. The problem of
performing 3D rendering effectively is important due to the contin-
uous strive for realistic images in several domains, as for instance
in movies and video games. In this context we study the ray tracing
scheme, which belongs to the global illumination group of algo-
rithms [44] that aim to add more realistic lighting in 3D scenes.
7.2.1. Application description
Ray tracing simulates the physics of a light ray to produce real-
istic results. Its classical implementation consists in deﬁning a ren-
dering point in a 3D scene and shooting light rays from that point,
simulating their reﬂections and refractions when these rays inter-
sect objects in the scene. The objects are described as a composi-
tion of geometric primitives (2D or 3D) such as triangles,
polygons, spheres, cones and other shapes. The computational
complexity of a rendering scene is proportional to the number
and the nature of these primitives, along with their positions in
the scene itself.
The ray tracing algorithmwe use as benchmark in FASTER starts
from a description of the scene as a composition of certain geomet-
ric 2D/3D primitives: triangles, spheres, cylinders, cones, toruses,
and polygons. Each primitive is described by a set of geometric
properties such as position in the scene, orientation, scale, height
of the primitive or rays of the circles composing the primitive.
The algorithm then performs the following steps:
1. the scene is divided in blocks, called voxels, and the number of
these voxels is one of the contributors to determine the com-
plexity of the algorithm; the more the voxels are, the more
intersections between rays and primitives have to be
computed;
2. the algorithm generates a certain amount of rays from the cur-
rent rendering point of the image, and it computes the set of
voxels traversed for each of these rays;
3. it then iterates all over these voxels and computes the intersec-
tion between the primitives in the voxel and the current ray;
4. the nearest intersection, if any, is considered and the algorithm
computes the reﬂection and refraction of the light ray on the
surface of the object;
5. the rays generated by this physic simulation continue to be
propagated into the image until a maximum number of inter-
section (an input parameter of the application) is reached or
no intersection is found at all.
Ray tracing operation can be thought of as by having a virtual
camera placed into the scene, an image is rendered by casting rays
simulating the reverse path of ray of lights, from the origin of the
camera through every pixel of its virtual focal plane. The color of
a pixel is determined by the potential intersections of the primary
ray cast through it, with the 3D scene. Photorealistic results can be
achieved when sufﬁcient rays are cast, simulating with ﬁdelity the
behavior of light. Realistic description of the objects of the scene
and proper simulation of the materials’ behavior is important for
correct results.
During rendering, the camera creates a single primary ray orig-
inating at the camera location, and pointing in the direction of a
single sample inside a given pixel of a virtual focal plane. The 3D
scene is interrogated for potential intersection with this primary
ray, using a structure that contains the geometric primitives. The



























Ideal Speedup Real Speedup Log. (Real Speedup)
Fig. 11. Ideal and measured speedup from the execution of ray tracing in a multi-
threaded environment in a 4-core Intel CPU. Graph includes a rough estimation of
the expected (or logical) speedup for up to 8 threads.
332 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338is determined by computing the intersection with all the primi-
tives of the scene. If an intersection with a primitive is found, its
shading is computed using the material of the corresponding prim-
itive. Shading computation may generate additional rays, cast in
turn into the scene following a similar process. The result of
shading stage is a color associated with the intersection point.
For a primary ray, this color is used to compute the color of the ori-
ginal pixel by the camera. For a secondary ray (or shadow ray), this
color is used to compute the color of the surface hit by the primary
ray. More speciﬁcally, several rays (shadow rays) are shot from the
intersection point in the direction of the light sources. For area
light sources, a random sample is computed over the surface and
chosen as direction of the corresponding shadow ray. If a shadow
ray reaches its target without occlusion, then the considered point
is receiving direct light from that source; otherwise, the point is in
shadow. The accumulation of contributions of shadow rays permits
the rendering of soft shadows and penumbras. Reﬂections are
taken into account by casting a new ray (secondary ray or reﬂected
ray) in the reﬂection direction, and the process starts again for that
ray. Its contribution is accumulated to its corresponding primary
ray for the ﬁnal result.
7.2.2. Parallelism and proﬁling
To study the potential for parallelism we optimized the soft-
ware code by exploiting multithreaded rendering on a symmetric
multiprocessor architecture. Ray tracing is embarrassingly parallel,
but the level of parallelism that can be exploited depends on plat-
form capabilities. On a multi-processor/multi-threaded platform
the image can be decomposed in independent sub-areas, and each
sub-area can be rendered independently by a single-processor/
thread. Indicatively, we divided the screen in tiles, e.g. 8  8, and
used one thread per tile for accessing shared scene data. Table 2
has the execution time for processing a 800 800 image size in a
4-core Intel CPU using different number of threads, and Fig. 11
shows the achieved speedup in each case. The results indicate that
ray tracing application beneﬁts considerably from deployment
onto parallel processing elements.
Due to the complexity of application, before proceeding with
the hardware implementation we proﬁled it to identify the
most-time consuming functions that worth to be accelerated. We
analyzed it using a proﬁle data visualization tool [45]. Tests were
carried out on a ﬁxed scene by varying the number of lights (one
or three) and shadow sampling and reﬂection depth values (from
two to ten), obtaining 26 proﬁles. We then compared the 26 pro-
ﬁles to identify the 10 most time-consuming functions and the
fraction of execution time spent in each of them. Fig. 12 shows
the proﬁling results.
We examine the self cost of each function without taking into
account their callers. Results can be better explained referring to
the mean value and to the variance of the fraction of execution
time spent by each function. The chart in Fig. 12 depicts the frac-
tion of time per function over the total execution time. We
obtained that the most demanding function is Intersect with an
average value of 21;49%. Variance has a negligible value for all
selected functions except for the three (3) most expensive ones.
This behavior is due to the algorithm operation; Intersect,Table 2
Performance of ray tracing executed in different number
of threads.




4 3195.60Ispointinsidepolygon and Intersectcone show a dependence on the
total amount of generated rays, while the remaining seven (7)
functions keep closer to the mean value. In general, variance
becomes negligible as the number of rays increases; this keeps
constant for different scenes and different settings, and no big vari-
ations in the values can be seen.
A basic ray tracing application must compute the intersections
between a ray and the complete synthetic scene in order to evalu-
ate visibility. It should be noted here that for primary and second-
ary rays, only the intersection closest to the source of the ray must
be found, while for shadow rays, it is sufﬁcient to determine if
there is any intersection between the source and the destination
of the ray. However, in order to ﬁnd the proper intersection a con-
siderable amount of such computations might be performed. In
fact, an important ﬁnding of the proﬁling was that the most com-
putational intensive part is related to the computation of ray inter-
sections within the voxels, i.e. the 3D partition grid we adopted to
reduce the computational cost. The intersection functions are
called potentially hundreds of time per ray, for hundreds of thou-
sands or millions of rays. Ray tracing spends a lot of time in com-
puting intersections between a ray and a geometric primitive of
the scene, as the rendering of an image requires casting rays
through each pixel. Hence, we decided to accelerate the intersec-
tion functions by implementing them in hardware. On the other
hand the complexity of intersection computation depends itself
on the nature of the primitive, e.g. intersection ray-sphere is
cheaper to compute than ray-polygon.
7.2.3. Reconﬁguration opportunities
The ten functions in Fig. 12 are the best candidates to be imple-
mented as hardware IPs. Opportunities for reconﬁguration arise if
we consider that most of them cannot run in parallel with others,
i.e. they are mutually exclusive. For this reason, it is possible to
assign all hardware implementations of these functions to a single
reconﬁgurable region (or a few of them) and reconﬁgure the func-
tions so as to keep the execution time in line with that of a design
in which all functions co-exist implemented as static cores, but
with far fewer resources. This is beneﬁcial in two scenarios:
 The ﬁrst one is when the target device does not ﬁt the whole
architecture. This is useful when ray tracing is ported to small
embedded devices for wearable solutions running for example
augmented reality tasks. The device is reconﬁgured at runtime
on the basis of type of intersections needed at a certain point
in the computation of ﬁnal image. This solution has been eval-
uated with our current implementation.
Fig. 12. Proﬁling of the 10 most CPU-time intensive functions of ray tracing. They consume most of the total execution time.
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 333 The second scenario comes from altering the mix of HW cores
used to speed up the computation at runtime, in order to adapt
their type and number to the current processing needs. In this
case, with a change to the SW of our current implementation,
it is possible to detect at runtime if it is beneﬁcial to parallelize
the computation of a certain number of intersections on multi-
ple HW cores.
The beneﬁt from this type of reconﬁguration is similar to that in
Reverse Time Migration application described above, i.e. time-mul-
tiplexing kernels for saving resources. Different levels of reconﬁg-
uration can be considered to address adaptable aspects of ray
tracing in terms of performance or ﬂexibility, such as reconﬁguring
the type of supported geometric primitives, or, the complexity of
primitives for trading-off intersection computation-cost, or, the
shading complexity with the computation of color at intersection
point for trading-off the level of realism with computation-cost.
7.2.4. Implementation
We created a reconﬁgurable design based on the ﬁrst scenario
described previously targeting the low-cost Zedboard, which is
based on the Xilinx Zynq platform. Zynq is an ARM-based SoC with
limited reconﬁgurable logic. We used our tools for analyzing and
implementing the application in HW, in particular the mapper,
scheduler and ﬂoorplanning. Initially, we restructured the original
code so as to become more parallelizable; this way multiple inter-
sections can be computed at the same time. We then encoded the
task graph along with the related proﬁling information in the FAS-
TER XML format. The mapper and scheduler tool determined that
the application can beneﬁt in terms of performance by implement-
ing the intersections in HW instead of SW. Then we used the ﬂoor-
planning tool to build the partially reconﬁgurable system. We
should note here that we put considerable effort to better under-
stand how to efﬁciently manipulate the ray tracing data structures
for hardware implementation.
The partially reconﬁgurable system alters at run-time the mix
of cores available to compute intersections. At design time we
explored the solution space by varying the number of reconﬁgura-
ble regions. From a preliminary study this seems to vary from 1 up
to 4, but these boundaries depend also on the underlying physical
architecture, mainly on the amount of reconﬁgurable resources.
The interface to the reconﬁgurable regions is ﬁxed and their
resource occupation must be enough so as to serve every core
the designer plans to load into the reconﬁgurable region. In our
case, for sake of ﬂexibility we mapped all intersection cores in allreconﬁgurable regions, i.e. for every intersection we created one
bitstream for every reconﬁgurable region.
7.3. Network intrusion detection
Network Intrusion Detection Systems (NIDS) are widely
adopted, as high-speed and always-on network access demand
more sophisticated packet processing and increased network secu-
rity [46].
7.3.1. System description
Instead of checking only the header of incoming packets (as for
example ﬁrewalls typically do), NIDS also scan the payload to
detect suspicious contents. The latter are described as signatures
or patterns and intrusion signature databases are made available
that include known attacks. The databases are regularly updated
and an NIDS should be able to provide a certain degree of ﬂexibility
to incorporate the updated security information. In the past, NIDS
used mostly static patterns to scan packet payload. Recently, regu-
lar expressions have also been adopted as a more efﬁcient way to
describe hazardous contents. As such, modern rulesets consist of
both static patterns and regular expressions.
The general requirements of NIDS involve high processing
throughput, ﬂexibility in modifying and updating the content
descriptions and scalability as the number of the content descrip-
tions increases. The performance requirements are fundamental to
the correct functioning of an NIDS system, as if it cannot meet
them, the system itself is susceptible to speciﬁc types of attacks,
i.e. overload and algorithmic attacks. However, equally crucial is
the ability of the NIDS to adapt to updated rules and content
descriptions. This is why software NIDS systems have been widely
used, however they require substantial hardware resources (in
terms of general-purpose CPUs) to achieve link-speed
performance.
NIDS implemented in reconﬁgurable hardware have the poten-
tial to combine the high performance of hardware-based systems
with the ﬂexibility of software solutions. Speciﬁc rules can be
mapped to custom logic for maximum performance and rule
changes/updates can be reconﬁgured into the device. Typically, a
NIDS has to deal with: (i) small incremental updates may be
required to add, change or expand certain IP addresses or address
ranges that appear in detection rules, (ii) new static pattern rules
may have to be added to the static rule set or changes to the cur-
rent patterns included in the rule set may have to be applied, (iii)








334 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338mishandling of certain detection rules, (iv) new regular expres-
sions may have to be added, and (v) overall updates to the system
might needed in case of new policies or large scale update to the
operation of the NIDS system.Physical 




Fig. 14. Micro-reconﬁgurable NIDS system.7.3.2. Reconﬁguration opportunities
Depending on the nature of the update (incremental versus
more extensive ones), different reconﬁguration approaches can
be used. If a new rule is added in the system (the usual case),
micro-reconﬁguration can dedicate free resources to this new rule.
Furthermore, micro-reconﬁguration could be used in the cases
where small changes to certain patterns or rules are required,
e.g. changes in an IP address or address range or maybe updates
to certain rules deﬁned by a speciﬁc regular expression. Since these
changes/updates are expected to be frequent, micro-reconﬁgura-
tion is a very promising approach as it can be applied quickly
and minimize system down-time. On the other hand, for major
restructuring of the rules used in the NIDS (either as initial setup,
a major upgrade, or to respond to new requirements of the organi-
zation), partial or even full reconﬁguration is needed. The ability to
support large-scale changes to the operation of the NIDS provides
high value to the system, prolongs system life and protects the
investment of the customer.7.3.3. Implementation
The NIDS system has been tuned to support micro-reconﬁgura-
tion for updating static rules related to IP checking. The original
NIDS was a static hardwired implementation of a subset of the
Snort rules, without any software programmable units. Fig. 13
overviews the main modules of the system. The use of micro-
reconﬁguration mandates the use of a processor for certain tasks
related to reconﬁguration and calculation of parameters, therefore
a number of changes were made to the system. Fig. 14 has the
micro-reconﬁgurable version of NIDS. The main NIDS System unit
is practically the original implementation of the hardwired NIDS
system; however it comes with several changes that include the
interface to the PLB bus, and the use of micro-reconﬁgurable
parameters for the IP checking rules.
The performance of the original non-reconﬁgurable system and
the micro-reconﬁgurable system is identical and set at wire-speed,
thus both systems produce results at the same rate as they receive
packets from the gigabit ethernet interface. Also, the micro-recon-
ﬁgurable system operates under the same timing constraints as the
original. It should be noted that the main NIDS of Fig. 14 operates
on a separate clock domain than the rest of the system (PLB buses,
MicroBlaze processor and hardware ICAP).Fig. 13. Overview of the hAs expected, there is an impact on the physical resources that
the micro-reconﬁgurable system needs compared to the original
implementation. For a Xilinx Virtex-5 XC5VLX110T device, the
overhead is almost 6% for Slice LUTs and 11% for BlockRAMs.
Although these are not negligible values, they can be considered
small.
7.4. Overhead of the run-time system manager
We used the RTSM described in Section 6 to control a partially
reconﬁgurable design. In speciﬁc, we used it to execute an edge
detection application implemented on an FPGAplatformwithXilinx
Virtex-5. Initially, we evaluated it on a x86 desktop system con-
nectedwith a serial interface to the FPGA platform. The desktop sys-
tem was a linux PC with an i3 CPU@3.1 GHz, 3 MB cache, and 2 GB
DDR3. The time we measured per call of RTSMs main core was
1.3us on average, while the best and worst times were 0.9us and
2us respectively. These values correspond to the overhead of the
RTSM itself, i.e. the time it takes to perform one scheduling decision
and update the internal structures representing the status of the
tasks and the FPGA. This measurement excludes the times for task
reconﬁguration and execution, and communication between the
CPU and the FPGA. We then transited to a completely embedded
version implemented on the same FPGA platform by altering the
RTSM code so as to port it on a Microblaze, running at 100 MHz,
32 KB cache. In that case the RTSM overhead is 520us, thus 2 orders
ofmagnitude larger as compared to the x86 implementation. In both
cases the RTSM itself is very low overhead; it is the reconﬁguration
and execution time of the tasks that dominate the total time for
carrying out the application execution.ardware NIDS system.
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 3358. Summary of contributions
This section consolidates the project novelties and exposes its
differences over previous related efforts discussed in Section 2.
Our contributions are:
 The interaction amongst the front-end stages of the tool-chain,
which is performed through the XML ﬁle. During this operation
different segments of the XML structure are automatically ana-
lyzed and updated. Several solutions are explored requiring lim-
ited-to-none intervention from the user, while details and
implications related with dynamic reconﬁguration remain hid-
den. XML format offers a convenient representation and has
been used by similar projects in the past [10]. However, the type
of information contained into FASTER XML that is then extracted
for feeding the different stages of the tool-chain is new, e.g. base-
line scheduler provided in XML format that is then used as input
to the run-time system manager. A new feature of the proposed
XML is that it contains attributes characterizing the HW and SW
tasks, which are used as input to the other FASTER tools for mak-
ing decisions. Such attributes are execution time, reconﬁgura-
tion time, and power consumption per task; the way these are
balanced at compile time affects the system implemented at
the ﬁrst place, but also affects the run-time decisions. This fea-
ture is unique as compared to previous research projects, not
only due to the information contained in XML, but also due to
that it affects both compile- and run-time decisions.
 Micro-reconﬁguration is a technique released some time ago,
but we study for the ﬁrst time its integration into a tool-chain.
We are also extending it so as to support routing resources; pre-
vious version of micro-reconﬁguration was targeting Look-Up-
Tables only. Moreover, we developed a proﬁler to assist the user
in determining whether an application will beneﬁt from a
micro-reconﬁgurable implementation. Finally, we evaluated
micro-reconﬁguration by employing it for the ﬁrst time in one
of the target applications, i.e. Network Intrusion Detection.
 The proposed tool-chain supports different implementation
options, static or reconﬁgurable; for the latter case two options
are available, either region-based reconﬁguration or micro-
reconﬁguration. Putting these sub-ﬂows into work under a uni-
ﬁed environment targeting heterogeneous reconﬁgurable plat-
forms constitutes itself a challenging problem in terms of
complexity. To the best of our knowledge existing integrated
solutions do not support such feature, i.e. analyzing and identi-
fying within the tool-chain which option serves better the given
application/platform, and enabling accordingly the proper sub-
ﬂow.
 A veriﬁcation approach that applies equally to static,
region-based, or micro-reconﬁguration without modiﬁcation.
The novelty of this approach lies into three aspects: (i) verifying
streaming designs using meta-programming; (ii) verifying run-
time reconﬁgurable designs; and (iii) verifying co-design of
HW/SW systems. Sections 5 discusses these aspects in detail,
including an extensive comparison over the dominant solutions.
 We implemented a Run-Time System Manager (RTSM) and
evaluated it on an embedded and a desktop platform. We stud-
ied the extent to which it is feasible to develop a generic library
that can support different platforms. For the transition from the
desktop implementation to the embedded one only a few
changes were needed in the RTSM code. Our code has been
structured in a way that can be extended to support different
scheduling policies. The RTSM basic operation is driven by the
baseline scheduler represented in XML.
 We explored the reconﬁguration aspects of three applications,
which are of great interest to the industrial partners. Wegathered signiﬁcant insights by progressively re-designing and
implementing the partially reconﬁgurable versions on certain
platforms, and assessed two of them in terms of performance
and resource savings. Our experience showed that this consti-
tutes a time-consuming task; important role played the plat-
forms that our tool-chain targets, and the initial available
implementations of the target applications (static HW or SW
only). In addition, we employed our tools for analyzing and
implementing the applications. In particular, we used the
High-level analysis tool for analyzing the Reverse Time Migra-
tion and we found that the user can rely on its output in order
to estimate the expected performance and resource consump-
tion, without going through the time-consuming synthesis tools.
We used our mapper, scheduler and ﬂoorplanning tools to ana-
lyze and implement the ray tracing application. Prior to this we
proﬁled the application, and encoded its task graph along with
the related proﬁling information in the FASTER XML format.
Finally, we designed and evaluated the micro-reconﬁgurable
version of the Network Intrusion Detection System.
Recapitulating, the FASTER project is novel both from research
and practice points of view. The basic functionality for some of
the described tools was provided by the partners already, but we
are continuously modifying them so as to integrate all under a uni-
ﬁed tool-chain. Furthermore, FASTER tools support the analysis
and design of applications from different domains.9. Conclusion
Creating a changing hardware system is a challenging process
requiring considerable effort in speciﬁcation, design, implementa-
tion, veriﬁcation, as well as support from a run-time system. We
attempt to alleviate this effort and streamline the design and imple-
mentation process by providing a new design environment friendly
to reconﬁguration. Our contributions span the analysis phase and
the reconﬁgurable system deﬁnition, the support for multi-grain
reconﬁguration, the veriﬁcation for the changing system, and the
run-time system to handle the reconﬁguration requirements.
Acknowledgement
This work was supported by the European Commission - Bel-
gium in the context of FP7 FASTER project (#287804).
References
[1] FASTER, <http://www.fp7-faster.eu/> (accessed 2014).
[2] D. Pnevmatikatos, T. Becker, A. Brokalakis, K. Bruneel, G. Gaydadjiev, W. Luk, K.
Papadimitriou, I. Papaefstathiou, O. Pell, C. Pilato, M. Robart, M.D.
Santambrogio, D. Sciuto, D. Stroobandt, T. Todman, FASTER: Facilitating
Analysis and Synthesis Technologies for Effective Reconﬁguration, in:
Euromicro Conference on Digital System Design (DSD), 2012.
[3] T. Todman, G. Constantinides, S.J.E. Wilton, O. Mencer, W. Luk, P.Y.K. Cheung,
Reconﬁgurable computing: architectures and design methods, Comput. Digital
Tech., IEE Proc. 152 (2) (2005) 193–207.
[4] J. a. M.P. Cardoso, P.C. Diniz, M. Weinhardt, Compiling for reconﬁgurable
computing: a survey, ACM Comput. Surv. 42 (4) (2010) 13:1–13:65.
[5] L. Jówiak, N. Nedjah, M. Figueroa, Modern development methods and tools for
embedded reconﬁgurable systems: a survey, Integr. VLSI J. 43 (1) (2010) 1–33.
[6] S. Hauck, A. DeHon, Reconﬁgurable Computing: The Theory and Practice of
FPGA-Based Computation, Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 2007.
[7] M. Platzner, J. Teich, N. Wehn (Eds.), Dynamically Reconﬁgurable Systems –
Architectures, Design Methods and Applications, Springer, 2010.
[8] PTOLEMY, <http://ptolemy.eecs.berkeley.edu/> (accessed 2012).
[9] DAEDALUS, <http://daedalus.liacs.nl/> (accessed 2012).
[10] hArtes, <http://hartes.org/hArtes/> (accessed 2012).
[11] M. Sato, OpenMP: parallel programming API for shared memory
multiprocessors and on-chip multiprocessors, in: Proceedings of the 15th
International Symposium on System Synthesis, 2002, pp. 109–111.
336 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338[12] OpenCL, <http://www.khronos.org/opencl/> (accessed 2012).
[13] REFLECT, <http://www.reﬂect-project.eu/> (accessed 2012).
[14] ACOTES, <http://www.hitech-projects.com/euprojects/ACOTES/> (accessed
2012).
[15] ANDRES, <http://andres.ofﬁs.de/> (accessed 2012).
[16] A. Montone, M.D. Santambrogio, F. Redaelli, D. Sciuto, Floorplacement for
partial reconﬁgurable FPGA-based systems, Int. J. Reconf. Comput. 2011 (2)
(2011) 12.
[17] C. Bolchini, A. Miele, C. Sandionigi, Automated resource-aware ﬂoorplanning of
reconﬁgurable areas in partially-reconﬁgurable FPGA systems, in: Proceedings
of the IEEE International Conference on Field Programmable Logic and
Applications (FPL), 2011, pp. 532–538.
[18] M. Rabozzi, J. Lillis, M.D. Santambrogio, Floorplanning for partially-
reconﬁgurable FPGA systems via mixed-integer linear programming, in:
Proc. of the IEEE International Symposium on Field-Programmable Custom
Computing Machines (FCCM), 2014.
[19] P.W. Foulk, Data-folding in SRAM conﬁgurable FPGAs, in: Proceedings of the
IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), 1993, pp.
163–171.
[20] M.J. Wirthlin, B.L. Hutchings, Improving functional density through run-time
constant propagation, in: Proc. of the ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGA), 1997, pp. 86–92.
[21] K. Shahookar, P. Mazumder, VLSI cell placement techniques, Comput. Surv.
(1991) 143–220.
[22] K. Bruneel, Efﬁcient Circuit Specialization for Dynamic Reconﬁguration of
FPGAs, PhD thesis, Ghent University, 2011.
[23] K. Heyse, K. Bruneel, D. Stroobandt, Mapping logic to reconﬁgurable FPGA
routing, in: Proceedings of the International Conference on Field
Programmable Logic and Applications (FPL), 2012, pp. 315–321.
[24] E. Vansteenkiste, K. Bruneel, D. Stroobandt, Maximizing the reuse of routing
resources in a reconﬁguration-aware connection router, in: Proceedings of the
International Conference on Field Programmable Logic and Applications (FPL),
2012, pp. 322–329.
[25] B.A. Farisi, K. Bruneel, J.M.P. Cardoso,D. Stroobandt,Anautomatic toolﬂowfor the
combined implementation ofmulti-mode circuits, in: Proceedings of the Design,
Automation, and Test in Europe Conference and Exhibition, 2013, pp. 821–826.
[26] Accellera, Universal Veriﬁcation Methodology (UVM) 1.1 user’s guide. <http://
www.accellera.org/downloads/standards/uvm>.
[27] Synopsis, Formality: equivalence checking for DC Ultra, Tech. rep., 2012.
[28] S. Singh, C.J. Lillieroth, Formal veriﬁcation of reconﬁgurable cores, in:
Proceedings of FCCM, 1999, pp. 25–32.
[29] K.W. Susanto, T.F. Melham, Formally analyzed dynamic synthesis of hardware,
J. Supercomput. 19 (1) (2001) 7–22.
[30] F. Madlener, J. Weingart, S.A. Huss, Veriﬁcation of dynamically reconﬁgurable
embedded systems by model transformation rules (2010) 33–40.
[31] K.W. Susanto, W. Luk, Automating formal veriﬁcation of customized soft-
processors, in: FPT, 2011, pp. 1–8.
[32] O. Pell, Veriﬁcation of FPGA layout generators in higher-order logic, J. Autom.
Reason. 37 (1–2) (2006) 117–152.
[33] S. Drzevitzky, U. Kastens, M. Platzner, Proof-carrying hardware: towards runtime
veriﬁcation of reconﬁgurable modules, in: ReConFig. 2009, pp. 189–194.
[34] S. Drzevitzky, Proof-carrying hardware: runtime formal veriﬁcation for secure
dynamic reconﬁguration, in: FPL, 2010, pp. 255–258.
[35] K.W. Susanto, T. Todman, J.G.F. Coutinho, W. Luk, Design validation by
symbolic simulation and equivalence checking: a case study in memory
optimization for image manipulation, in: SOFSEM, 2009, pp. 509–520.
[36] B. Dutertre, L. de Moura, The YICES SMT Solver, Tech. rep., Computer Science
Laboratory, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025,
USA, 2006.
[37] W. Luk, N. Shirazi, P.Y.K. Cheung, Modelling and optimising run-time
reconﬁgurable systems, in: Proceedings IEEE Symposium on FPGAs for
Custom Computing Machines (FCCM), 1996, pp. 167–176.
[38] C. Steiger, H. Walder, M. Platzner, Operating systems for reconﬁgurable
embedded platforms: online scheduling of real-time tasks, IEEE Trans.
Comput. 53 (11) (2004) 1393–1407.
[39] T. Marconi, Efﬁcient Runtime Management of Reconﬁgurable Hardware
Resources, PhD thesis, TU Delft, 2011.
[40] K. Papadimitriou, A. Dollas, S. Hauck, Performance of partial reconﬁguration in
FPGA systems: a survey and a cost model, ACM Trans. Reconf. Technol.
Syst.(TRETS) 4 (4) (2011) 36:1–36:24.
[41] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, E. Panainte, The
MOLEN polymorphic processor, IEEE Trans. Comput. 53 (11) (2004) 1363–1375.
[42] E. Baysal, D.D. Kosloff, J.W.C. Sherwood, Reverse time migration, SEG-
Geophysics 48 (11) (1983) 1514–1524.
[43] X. Niu, Q. Jin, W. Luk, Q. Liu, O. Pell, Exploiting run-time reconﬁguration in
stencil computation, in: Proc. of the 22nd IEEE International Conference on
Field Programmable Logic and Applications (FPL), 2012.
[44] K. Myszkowski, T. Tawara, H. Akamine, H.-P. Seidel, Perception-guided global
illumination solution for animation rendering, in: Proceedings of the ACM
Conference on Computer Graphics and Interactive Techniques (SIGGRAPH),
2001, pp. 221–330.
[45] KCachegrind, <http://kcachegrind.sourceforge.net/html/Home.html> (accessed
2014).
[46] I. Sourdis, D. Pnevmatikatos, S. Vassiliadis, Scalable multi-gigabit pattern
matching for packet inspection, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
16 (2) (2008) 156–166.Dionisios Pnevmatikatos is a Professor and former
Chair of the Electronic and Computing Engineering
Department, Technical University of Crete and a
Researcher at the Computer Architecture and VLSI Sys-
tems (CARV) Laboratory of the Institute of Computer
Science, FORTH in Greece. He received his B.Sc. degree
in Computer Science from the Department of Computer
Science, University of Crete in 1989 and M.Sc. and Ph.D.
degrees in Computer Science from the Department of
Computer Science, University of Wisconsin-Madison in
1991 and 1995 respectively. His research interests are
in the broader area of Computer Architecture, where he
investigates the Design and Implementation of High-Performance and Cost-Effec-
tive Systems, Reliable System Design, and Reconﬁgurable Computing. He is cur-
rently the Coordinator of the FASTER project (ICT, STREP) and has participated in
numerous national and European research projects.
Kyprianos Papadimitriou is a Research Associate at the
Computer Architecture and VLSI Systems (CARV) Labo-
ratory, FORTH-ICS, and Scientiﬁc Staff at the School of
Electronic and Computer Enginneering, Technical Uni-
versity of Crete, in Greece. He received his Diploma and
M.Sc. in Electronic and Computer Engineering from
Technical University of Crete, Greece, in 1998 and 2003
respectively. During 1998–1999 he was with the R&D
department of ATMEL working on hardware imple-
mentation of wireless protocols. In 2003 he co-initiated
an effort to establish a spin-off company involved with
motion recognition technology. In 2012 he was granted
with a Ph.D from the Electronic and Computer Engineering Department, Technical
University of Crete, with a special focus on reconﬁgurable computing. He has par-
ticipated in several European and national reserarch projects.Tobias Becker received a Dipl. Ing. degree in Electrical
Engineering from the University of Karlsruhe, Germany
in 2005 and a PhD in computer science from Imperial
College London, UK in 2011. He is currently a research
associate in the Department of Computing at Imperial
College London. His research includes work on recon-
ﬁgurable computing and high-performance computing.
He has extensive experience on FPGA design and he has
worked in collaboration with several industrial partners
including Xilinx and Nokia.Peter Böhm received his Ph.D. from the Department of
Computer Science at Oxford University, UK, and his
M.Sc. (Honour’s) degree from the Department of Com-
puter Science at Saarland University, Germany. Cur-
rently, he is a research associate at Imperial College
London. His academic background is in formal veriﬁca-
tion and speciﬁcation of hardware and communication
architectures. At Imperial College, his research focuses
on the formal veriﬁcation of re-conﬁgurable hardware
designs.Andreas Brokalakis is a Senior Computer Engineer at
Synelixis Solutions Ltd. He graduated from the Com-
puter Engineering and Informatics Department, Uni-
versity of Patras (Greece) in 2004 and received his
master’s degree from the same department in 2007.
Since 2008 he has been working on several European
Research projects through his afﬁliation with the Tele-
communication Systems Institute (Technical University
of Crete) and Synelixis Solutions. His focus is on hard-
ware design for reconﬁgurable systems or ASICs as well
as computer architecture.
D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338 337Karel Bruneel (Ph.D. Ghent University, 2011) is a
postdoctoral researcher at Ghent University, afﬁliated
with the Department of Electronics and Information
Systems (ELIS), Computer Systems Lab (CSL), Hardware
and Embedded Systems (HES) group. His research is
situated in the domain of EDA tools for FPGAs with a
strong focus on dynamic reconﬁguration. He is inter-
ested in a large number of topic including synthesis,
technology mapping, placement and routing.Catalin Ciobanu is a researcher in the Computer Science
and Engineering department at Chalmers University of
Technology, Sweden. His research interests include
computer architecture, reconﬁgurable computing and
parallel processing. He has an M.Sc. in Computer Engi-
neering from Delft University of Technology, and a Ph.D.
on customizable vector register ﬁles.Tom Davidson (Master of Applied Science – Electronics,
Ghent University, 2009) is a Ph.D. researcher at Ghent
University, afﬁliated with the Department of Electronics
and Information Systems (ELIS), Computer Systems Lab
(CSL), Hardware and Embedded Systems (HES) group.
He works on run-time reconﬁguration of FPGAs, more
speciﬁcally Dynamic Circuit Specialization. He has pre-
sented and published several papers on this subject. His
current research focus is on identifying opportunities
for DCS from higher abstraction levels, such as System-C
or C.Georgi Gaydadjiev is a professor at the Department of
Computer Science and Engineering at Chalmers Uni-
versity of Technology. His research interests include
computer systems design, advanced computer archi-
tecture and micro architecture, reconﬁgurable comput-
ing, hardware/software co-design and computer
systems testing. He is a Senior IEEE and ACM member.Karel Heyse (Master of Applied Science – Computer
Science, Ghent University, 2011) is a Ph.D. researcher at
Ghent University, afﬁliated with the Department of
Electronics and Information Systems (ELIS), Computer
Systems Lab (CSL), Hardware and Embedded Systems
(HES) group. His research is about run-time reconﬁgu-
ration of FPGAs and in particular the synthesis tools that
make it possible to use run-time reconﬁguration to
increase performance and functional density of digital
circuits. His other research interests are in coarse grain
architectures and regular expression matching on
FPGAs.Wayne Luk is Professor of Computer Engineering at
Imperial College London. He was a visiting professor
with Stanford University, California, and with Queen’s
University Belfast, United Kingdom. His research
includes theory and practice of customizing hardware
and software for speciﬁc application domains, such as
multimedia, ﬁnancial modeling, and medical comput-
ing. His current work involves high-level compilation
techniques and tools for high-performance computers
and embedded systems, particularly those containing
accelerators such as FPGAs and GPUs. He received a
Research Excellence Award from Imperial College, and
12 awards for his publications from various international conferences. He is a fellow
of the IEEE, the BCS, and the Royal Academy of Engineering.Xinyu Niu received his B.Sc. degree in School of Infor-
mation Science and Technology from Fudan University,
and obtained his M.Sc. degree from Department of
Electrical and Electronic Engineering at Imperial College
London. He is currently completing his Ph.D. studies in
the Department of Computing, Imperial College London.
His research interests include run-time reconﬁguration,
heterogeneous computing, high performance comput-
ing and computer-aided design (CAD) tools for hard-
ware design optimization.Ioannis Papaefstathiou is the Manager of the Systems
Research Groups of Synelixis Solutions Ltd and an
Associate Professor at the ECE Department at the
Technical University of Crete. He was granted a Ph.D.
degree in computer science at the University of Cam-
bridge UK, in 2001, an M.Sc. (Ranked 1st) from Harvard
University, USA, in 1996 and a B.Sc. (Ranked 2nd) from
the University of Crete, Greece in 1996. He has pub-
lished more than 70 papers in IEEE-sponsored journals
and conferences. He has been the prime Guest Editor for
an issue of IEEE Micro Magazine and for one in IEEE
Design & Test Magazine.Danilo Pietro Pau received the Electronic Engineering
degree from Politecnico of Milano in 1992. He has, since
1991, joined STMicroelectronics Advanced System
Technology working on digital video processing algo-
rithms and architectures for Set Top Box and Mobile
application processors. He moved to STMicroelectronics
Ltd, Bristol (UK) for a period of 1 year, in 2004, where he
worked mainly on algorithms and architectural studies
in the ﬁeld of 3D graphics for mobile processors com-
pliant with the worldwide standard OpenGL-ES. He then
moved back to Agrate, where he continued to manage
3D graphics developments and started 2D scalable
graphics algorithms and architectural studies for Set Top Box and Thin Clients. He is
currently managing 3 teams based in UK, Italy and China. During his career he has
authored or co-authored many European, U.S. and Japan granted patents, as well as
international publications in conferences or technical journals.
Oliver Pell is Vice President of Engineering at Maxeler
Technologies. His experience ranges from accelerating
Reverse Time Migration and Lattice-Boltzmann simu-
lations to credit derivatives pricing. He is currently
responsible for Maxeler’s technology research and
development as well as the technical architecture and
project management of acceleration efforts for clients
including Tier 1 oil companies and investment banks.
He has an M.Sc. in advanced computing from Imperial
College London. He’s a member of IEEE and the ACM.
338 D. Pnevmatikatos et al. /Microprocessors and Microsystems 39 (2015) 321–338Christian Pilato is a research assistant at Politecnico di
Milano. His research interests include high-level syn-
thesis, evolutionary algorithms for design space explo-
ration and multi-objective optimization, and design of
heterogeneous and multiprocessor embedded systems.
He has an M.Sc. in Computing Systems Engineering from
Politecnico di Milano, received in 2007, and a Ph.D. in
Information Technology received from the same Uni-
versity in 2011. He is a member of the IEEE and a
member of the IEEE Computer Society.Marco D. Santambrogio received his laurea (M.Sc.
equivalent) degree in Computer Engineering from the
Politecnico di Milano in 2004, his second M.Sc. degree in
Computer Science from the University of Illinois at
Chicago (UIC) in 2005 and his Ph.D. degree in Computer
Engineering from the Politecnico di Milano in 2008. He
was at the Computer Science and Artiﬁcial Intelligence
Laboratory (CSAIL) at MIT as postdoc fellow and he is
now assistant professor at Politecnico di Milano,
research afﬁliate at MIT and adjunct professor at UIC. He
has also held visiting positions at the EECS Department
of the Northwestern University (2006 and 2007) and
Heinz Nixdorf Institut (2006). He is a member of the IEEE, the IEEE Computer
Society (CS) and the IEEE Circuits and Systems Society (CAS). He has been with the
Micro Architectures Laboratory at the Politecnico di Milano, where he founded the
Dynamic Reconﬁgurability in Embedded System Design (DRESD) project in 2004
and the CHANGE project (Self-Aware and Adaptive Computing Systems) in 2010. He
conducts research and teaches in the areas of reconﬁgurable computing, computer
architecture, operating system, hardware/software codesign, embedded systems,
and high performance processors and systems.
Donatella Sciuto is a full professor in Computer Science
and Engineering at the Politecnico di Milano, where she
holds also the position of Vice Rector of the University.
She is a Fellow of IEEE and currently President of the
Council of EDA. Her main research interests cover the
methodologies for the design of embedded systems and
multicore systems, from the speciﬁcation level down to
the implementation of both the hardware and software
components, including reconﬁgurable and adaptive
systems. She has published more than 200 papers. She is
or has been member of different program committees of
ACM and IEEE EDA conferences and workshops. She has
served as Associate Editor of the IEEE Transactions on Computers, and serves now as
associate editor to the IEEE Embedded Systems Letters for the design methodologies
topic area and as associate editor for the Journal of Design Automation of Embedded
Systems, Springer. She is in the executive committee of DATE for the past ten yearsand she has been Technical Program Chair in 2006 and General Chair in 2008. She
has been General Co-Chair for 2009 and 2010 of ESWEEK. She is Technical Program
Co-Chair of DAC 2012 and 2013.
Dirk Stroobandt (Ph.D. Ghent University, 1998) is a
professor at Ghent University, afﬁliated with the
Department of Electronics and Information Systems
(ELIS), Computer Systems Lab (CSL). He currently leads
the research group HES (Hardware and Embedded Sys-
tems) of about 10 people with interests in semi-auto-
matic hardware design methodologies and tools, run-
time reconﬁguration, and reconﬁgurable multiprocessor
networks. He is the inaugural winner of the ACM/SIGDA
Outstanding Doctoral Thesis Award in Design Automa-
tion (1999), and he initiated and co-organized the
International Workshop on System-Level Interconnect
Prediction (SLIP) since 1999. He is also lead editor of a special issue of the Inter-
national Journal of Reconﬁgurable Computing and he has been associate editor of
ACM’s TODAES for three years.Tim Todman received the B.Sc. degree from the Uni-
versity of North London, and the M.Sc. and Ph.D. degrees
from Imperial College London, in 1997, 1998 and 2004,
respectively. He is currently a research associate in the
Department of Computing, Imperial College London,
London, UK. His research interests include hardware
compilation, veriﬁcation and implementation of graph-
ics algorithms on reconﬁgurable architectures.Elias Vansteenkiste (Master of Applied Science – Elec-
tronics, Ghent University, 2011) is a Ph.D. researcher at
Ghent University, afﬁliated with the Department of
Electronics and Information Systems (ELIS), Computer
Systems Lab (CSL), Hardware and Embedded Systems
(HES) group. His research is focused on a new tool ﬂow
for the dynamic reconﬁguration that will allow run-
time reconﬁguration of the FPGA’s interconnect net-
work, in order to increase performance and functional
density of digital circuits. He targets the back end of the
tool ﬂow. Placement and routing algorithms and con-
ﬁguration bitstream generation are his main interests.
