Software Implementation vs. Hardware Implementation: The Avionic Test System Case-Study by Afonso, George et al.
Software Implementation vs. Hardware Implementation:
The Avionic Test System Case-Study
George Afonso, Rabie Ben Atitallah, Jean-Luc Dekeyser
To cite this version:
George Afonso, Rabie Ben Atitallah, Jean-Luc Dekeyser. Software Implementation vs. Hard-
ware Implementation: The Avionic Test System Case-Study. ASPLOS, Mar 2012, London,
United Kingdom. <hal-00665162>
HAL Id: hal-00665162
https://hal.inria.fr/hal-00665162
Submitted on 1 Feb 2012
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
Software Implementation vs. Hardware Implementation:
The Avionic Test System Case-Study
George AFONSO
EADS-INRIA Lille Nord
Europe
george.afonso@inria.fr
Rabie Ben Atitallah
LAMIH, University of
Valenciennes
rabie.benatitallah@univ-
valenciennes.fr
Jean-Luc Dekeyser
LIFL, USTL, INRIA Lille-Nord
Europe
jean-luc.dekeyser@lifl.fr
ABSTRACT
This paper presents a development methodology that helps de-
signers to map efficiently applications onto a heterogeneous
CPU/FPGA system. An industrial case study is presented and aims
at meeting performance and real-time requirements with the help
of our architecture capabilities and avionic model parallelization.
Different avionic model implementations will be presented in or-
der to explain how to find the best trade-off between performance
and design-time.
1. INTRODUCTION
In order to meet real-time requirements, power, and flexibility
goals, combination of general CPU and reconfigurable fabrics like
Field-Programmable Gate Aarrays (FPGAs) is a promising solu-
tion leading to heterogeneous computing. In such systems, multi-
core CPU provides high computation rates while the reconfigurable
logic offers high performance per watt and adaptability to the ap-
plication constraints. Due to the high parallelism rate of the ap-
plication, FPGA technology could offer better performances com-
paring to CPUs or GPUs up to 10x [1] at lower frequencies. De-
signers could exploit the existing partitioning of the architecture
which leads to several possible implementations with different per-
formances. The main focus of this paper is the task mapping taking
into account the different constraints of our application
2. DESIGN METHODOLOGY FOR
CPU/FPGA ARCHITECTURE
In order to perform a complete Test and Simulation session, we
need to simulate each embedded part of the helicopter (i.e. auto-
matic pilot system, navigation, etc.), the environmental parameters
(weather conditions, geographical factors, etc.), and the behaviour
of the aircraft. This simulation models will be implemented in a
software and/or hardware fashion in order to satisfy the timing con-
straints. In order to get the best performance, sequential model pro-
filing must be performed in order to exploit the parallelism level in-
herent to our functions. Thus, we will determine the best trade-off
between parallel software, hardware or software/hardware execu-
tion.
.
T1
Communication synthesis
System 
ressources 
Parallelism
extraction
Hardware code 
generation
Hardware 
tasks
Software 
tasks
Hardware/Software mapping
Application 
contraints
FAST LINK
Multi-Core 
CPU FPGA
T1
T2
T3
T4
T5
T6
T7
T8
T9
C, C++, ...
VfEmbedded
Heuristic
ROCCC
Figure 1: Design methodology for efficient test mapping
To overcome this challenge, we present a design methodology
that covers the different development steps from software specifi-
cation to the system implementation as shown in Fig. 1. First, we
are considering a software application presented as a task graph
containing different communicating functions (T0, T1, etc.). All
applications are not adequate to be implemented onto heteroge-
neous CPU/FPGA architectures; a complete analysis of the source
code is needed to verify what implementation could bring better
performances. In order to leverage the parallelism of multi-core
CPU/FPGA architecture, tools such as Vector Fabrics VfEmbed-
ded [2] can find all data dependencies by analysing the C or C++
source code and extract the parallelism intrinsic in the application.
vfEmbedded analyses partitions and maps applications on specific
platforms from single processors to heterogeneous ones. It can also
estimate the performance of the parallelized software before imple-
menting it. Moreover, it can trim any overhead in your hardware
to reduce cost and ensure that all critical behaviours in your pro-
gram are exercised. The mapping step requires a heuristic method
that takes as inputs system resources, application constraints, and
the results of the analysis step. It will find the best configurations
which satisfy timing requirements and resources utilization. After
the mapping step, we need to develop some user hardware appli-
cations from the existing functions. To make this step more ef-
ficient, tools such as Riverside Optimizing Compiler for Config-
urable Computing (ROCCC) [3] can focus on FPGA-based code
acceleration from a subset of the C language. ROCCC does not
focus on the generation of arbitrary hardware circuits. Its objec-
tives are to maximize parallelism within the constraints of the tar-
get device, optimize clock cycle time by efficient pipelining, and
minimize the used area. Finally, a compilation step using GCC and
ISE from Xilinx, can be easily performed in order to map functions
respectively onto the CPU and the FPGA parts.
3. EXPERIMENTAL RESULTS
In this section, we will analyse different avionic models in order
to obtain different possible implementations required to tune the
design according to the needed performance and the real-time con-
straints. These implementations are also useful to switch between
software and hardware configurations or vice-versa.
First, as our objective is to analyse our software models in
order to get better performance in our heterogeneous multi-core
CPU/FPGA architecture, we will profile six different software
models. To do so, we used VfEmbedded [2] tool presented pre-
viously which offers a "parallelize" function analyses all loops and
parallelizes them, correctly handling the dependencies.
Table 1: Avionic models analysis
Results Speed-up Max. of use-
ful threads
Synchronization
overhead
Model A 1.2 2 1%
Model B 0 1 0%
Model C 2.4 3 0%
Model D 3.5 6 39%
Model E 3 4 29%
Model F 486.7 10000 95%
Table I summarizes the experimental results obtained by
analysing software avionic models on VfEmbedded with x86 In-
tel i5 processor execution support. First, we measure the speed-
up obtained after optimization. Secondly, threads must be created
for parallel implementation strategy. This might be implemented
through the use of POSIX calls creating the threads. The maximum
useful number of threads is directly linked to the parallelism degree
of the application. But threads means synchronisation and more
frequent synchronization of small amounts of data means less de-
lay while waiting for data and therefore less latency. Synchroniza-
tion requires overhead and this could require lot of time and make
parallelization step fail. VfEmbedded shows a 1.2 speed-up for the
model A with low synchronization overhead with only two threads.
Model B cannot make profit from a parallelization strategy. These
two first models are more suitable to be implemented on a single
core architecture. Model C, D and E offer higher parallelism degree
with low synchronisation overhead for model C. These models are
suitable for multi-core architecture implementation or hybrid multi-
core CPU/FPGA implementation by splitting the models in differ-
ent functions and implement them in different calculation nodes.
This, is also possible because of the low synchronization overhead.
Finally, Model F shows a high parallelism degree with high syn-
chronisation overhead, this model is very adapted to a hardware
implementation, in the second part of our results we will generate
VHDL code using ROCCC compilation framework.
Secondly, using our previous results, we decide to implement
model F in a hardware fashion using a C to VHDL translator. The
model F main loop fits perfectly to such transformation. With the
help of the ROCCC tool, a compilation step from C to VHDL is
performed. Furthermore, we make profit from the loop unrolling
feature provided by ROCCC in order to obtain varied implementa-
tions for the model F main loop. We synthesised 17 samples with
varying the loop unrolling value. Implementation results are re-
ported in the Fig. 2. We get an execution time varying from 19
to 14 µs respectively with an area utilization varying from 5,800
to 14,300 slices. Indeed, more we parallelize the model F main
loop, the occupied hardware increases and the execution time is
reduced. The software implementation on the host offers an execu-
!"#$$%
!&#$$%
!'#$$%
!(#$$%
!)#$$%
!*#$$%
!+#$$%
,$#$$%
'#*
,%
(#"
$%
(#)
(%
)#,
'%
)#)
!%
*#,
$%
*#(
&%
+#!
'%
+#(
!%
!$
#!$
%
!$
#''
%
!!
#$&
%
!!
#'!
%
!,
#+&
%
!"
#&!
%
!"
#*+
%
!&
#"'
%
-./01234%25/%6789%
-.
/0
12
34
%2
5
/%
67
89
%
:15;/<%3=%8>?0/8%6@9%
Figure 2: Different implementations for model F
tion time equals to 18 µs. The selection of the best implementation
will depend on the global system constraints, the data mapping,
and the available resources. ROCCC shows some limitations on
more complex models, in this case manual coding becomes neces-
sary. Using our previous results, we decide to implement model A
in order to observe the behaviour of such model in a pure VHDL
hardware implementation. The pure software version, we obtained
a 2 µs execution time with a Quad-core processor (4x2.5 GHz).
Despite VfEmbedded does not show a large parallelism degree for
this model, most of the mathematical operations do not have data
dependency between them. A pure VHDL implementation offers
a 2 µs execution time with 8% space occupation (Xilinx ML605
Board) which is related to the pure software execution time. This
result offers the opportunity to move model A from a processor to
the FPGA in order to bring performance and respect the needed
real-time constraints at the price of the hardware design time.
4. CONCLUSION
In this paper, we have highlighted the benefits of some profiling
tools making Test & Simulation model optimisation easier on the
CPU/FPGA system. In future works, our investigation will con-
cern a run-time task mapping for dynamically reconfigurable sys-
tem which will map optimally highly communicating hardware and
software tasks and deals efficiently with the violation of timing con-
straints.
5. REFERENCES
[1] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance
Comparison of FPGA, GPU AND CPU in Image Processing,”
in 19th IEEE International Conference on Field
Programmable Logic and Applications, FPL, Prague, Czech
Republic, Aug. 2009.
[2] Minjang Kim and Hyesoon Kim and Chi-Keung Luk, “A
scalable approach to dynamic data-depedence profiling,” in
43rd Symposium on Microarchitecture (MICRO), December
2010.
[3] N. W. Villarreal Jason, Park Adrian and H. Robert, “Designing
Modular Hardware Accelerators in C with ROCCC 2.0,” in
18th Symposium on Field-Programmable Custom Computing
Machines (FCCM’10), Washington, USA, May 2010.
