An integrated hardware/software design methodology for signal processing systems by Li, L. et al.
Journal of Systems Architecture 93 (2019) 1–19 
An integrated hardware/software design methodology for signal processing 
systems 
Lin Li a,∗, Carlo Sau b, Tiziana Fanni b, Jingui Li c, Timo Viitanen c, François Christophe e,
Francesca Palumbo d, Luigi Raffo b, Heikki Huttunen c, Jarmo Takala c, Shuvra S. Bhattacharyya a,c
a University of Maryland, ECE Department, College Park, MD 20742, United States 
b University of Cagliari, Department of Electrical and Electronic Engineering, Italy 
c Tampere University, Finland 
d University of Sassari, PolComIng-Information Engineering Unit, Italy 
e Department of Computer Science, University of Helsinki, Finland 
Article info
Keywords:
Dataflow
Model-based design
Hardware/software co-design
Low power techniques
Deep learning
Signal processing systems
Abstract
This paper presents a new methodology for design and implementation of signal processing systems on system-on-chip (SoC) platforms. The methodology is centered on the use of lightweight application programming interfaces for applying principles of dataflow design at different layers of abstraction. The development processes integrated in our approach are software implementation, hardware implementation, hardware-software co-design, and optimized application mapping. The proposed methodology facilitates development and integration of signal processing hardware and software modules that involve heterogeneous programming languages and platforms. As a demonstration of the proposed design framework, we present a dataflow-based deep neural network (DNN) implementation for vehicle classification that is streamlined for real-time operation on embedded SoC devices. Using the proposed methodology, we apply and integrate a variety of dataflow graph optimizations that are important for efficient mapping of the DNN system into a resource constrained implementation that involves cooperating multicore CPUs and field-programmable gate array subsystems. Through experiments, we demonstrate the flexibility and effectiveness with which different design transformations can be applied and integrated across multiple scales of the targeted computing system.
1. Introduction 
Model-based design has been widely studied and applied over the years in many domains of embedded processing. Dataflow is well-known as a paradigm for model-based design that is effective for embedded digital signal processing (DSP) systems [1]. In dataflow-based modeling, signal processing applications are represented as directed graphs (dataflow graphs), and computational functions are modeled as vertices (actors) in these graphs. Actors exchange data packets (tokens) through unidirectional, first-in, first-out (FIFO) communication channels that correspond to dataflow graph edges. Many dataflow-based design methods for DSP systems have been explored in recent years to support various aspects of design and implementation, including modeling and simulation; scheduling and mapping of actors to heterogeneous platforms; and buffer management (e.g., see [1,2]).
The diversity of design scales and dataflow techniques that are relevant to signal processing systems poses major challenges to achieving the full potential that is offered by signal processing platforms under stringent time-to-market constraints. While automated techniques, such as those referred to above for scheduling and buffer mapping, are effective for specialized combinations of platforms and dataflow models (e.g., multicore CPUs and synchronous dataflow, respectively), they are limited in their ability to support more comprehensive assessment of the design space, where the models and target platforms themselves have great influence on addressing implementation constraints and optimization objectives. System designers must therefore resort to ad-hoc methods to explore design alternatives that span multiple implementation scales, platform types, or dataflow modeling techniques.
∗ Corresponding author.
E-mail addresses: lli12311@umd.edu (L. Li), carlo.sau@diee.unica.it (C. Sau), tiziana.fanni@diee.unica.it (T. Fanni), jingui.li@tuni.fi (J. Li), timo.2.viitanen@tuni.fi (T. Viitanen), francois.christophe@helsinki.fi (F. Christophe), fpalumbo@uniss.it (F. Palumbo), raffo@unica.it (L. Raffo), heikki.huttunen@tuni.fi (H. Huttunen), jarmo.takala@tuni.fi (J. Takala), ssb@umd.edu (S.S. Bhattacharyya).
https://doi.org/10.1016/j.sysarc.2018.12.010
Received 27 April 2018; Received in revised form 4 November 2018; Accepted 31 December 2018
Available online 31 December 2018
1383-7621/© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license. (http://creativecommons.org/licenses/by/4.0/)
In this work, we propose a design methodology and an integrated set of tools and libraries that are developed to help bridge this gap. We refer to this methodology as the STMC Methodology or STMCM, which is named after the different institutions across which it is developed (Sassari, Tampere, Maryland, Cagliari). STMCM focuses on enabling experimentation across different levels of abstraction throughout
the design process, and allowing designers to experiment productively and iterate rapidly on complex combinations of design options, including dataflow models, heterogeneous target platforms, and integration with platform-specific languages and back-end tools. Special emphasis is placed on enabling effective experimentation with hardware/software design trade-offs, as well as trade-offs involving performance, resource utilization, and power consumption. These are trade-offs that are especially important and challenging to navigate efficiently in design processes for system-on-chip implementation of signal processing systems.
The utility of STMCM is facilitated by the use of lightweight dataflow (LWDF) programming [3], and its underlying core functional dataflow (CFDF) model of computation [4]. LWDF provides a compact set of application programming interfaces (APIs) that allows one to apply signal-processing-oriented dataflow techniques relatively easily and efficiently in the context of existing design processes, target platforms, and simulation- and platform-oriented languages, such as MATLAB, C, CUDA, and VHDL. Additionally, CFDF is a general form of dataflow that accommodates more specialized forms of dataflow, such as Boolean dataflow [5], cyclo-static dataflow [6], synchronous dataflow [7], and RVC-CAL [8] as natural special cases. This accommodation of different dataflow models in turn provides potential to integrate designs with other dataflow frameworks and DSP libraries, such as those described in [8–13]. Furthermore, LWDF is granularity-agnostic, in the sense that actor complexity does not limit the applicability of the framework.
To demonstrate the capabilities of STMCM in addressing the challenges of mapping practical dataflow-based structures on heterogeneous signal processing platforms, we explore different implementations of a deep neural network (DNN) for vehicle classification on a heterogeneous, embedded system-on-chip (SoC), the Xilinx Zynq Z-7020 SoC. DNN applications pose great challenges in their deployment on embedded devices. Investigation of DNN implementations on embedded SoC devices is challenging due to the limited resources for processing and storage in these devices, and especially, due to the high computational complexity of DNNs. They involve very large and complex signal flow structures that involve intensive computation, data exchange, and multilayer processing. These characteristics make embedded DNN implementation highly relevant as a case study for STMCM.
2. Related work
Dataflow provides valuable model-based design properties for signal processing systems, and has been adopted in a wide variety of tools for both software and hardware design. For example, LWDF APIs for CUDA and C have been targeted in the DIF-GPU tool for automated synthesis of hybrid CPU/GPU implementations [14]. The CAL programming language and the Open RVC-CAL Compiler (Orcc) toolset provide a dataflow environment for generating dataflow implementations in a number of languages, such as C, Jade, and Verilog [8,9,15] (note that the Verilog backend of Orcc has been discontinued and Xronos synthesizer has been replaced). The CAPH language and framework generate hardware description language (HDL) code from high-level dataflow descriptions [10].
The work in [16] presents an integrated design flow and tools for the automatic optimization of dataflow specifications to generate HDL designs. The Multi-Dataflow Composer (MDC) tool is a dataflow-to-hardware framework able to automatically create multi-functional reconfigurable architectures. In addition to this baseline functionality, MDC offers three additional features: (1) a structural profiler to perform a complete design space exploration, evaluating trade-offs among resource usage, power consumption and operating frequency [17]; (2) a dynamic power manager to perform, at the dataflow level, the logic partitioning of the substrate to implement at the hardware level, and apply a power saving strategy [18]; (3) a coprocessor generator to perform the complete dataflow-to-hardware customization of a Xilinx compliant multi-functional IP [16].
All of the methodologies and tools described above are limited by the programming language, adopted dataflow description, or implementation target. For example, HDL code can be highly optimized for a given target (such as a Xilinx FPGA) but not usable for an application specific integrated circuit (ASIC) flow (e.g., see [15,19,20]). Automatic methods and tools require significant effort in development and maintenance of graph analysis and code generation functionality, and may be too costly for models and design approaches that are not mature. Such scenarios may arise for emerging applications or platforms that do not match effectively with the models or methods supported by available tools.
STMCM is complementary to these efforts that emphasize dataflow design automation. By applying LWDF APIs in novel ways, STMCM facilitates implementation of and iterative experimentation with new dataflow-based hardware/software architectures and design optimization techniques. LWDF is applied as an integral part of STMCM because of LWDF’s minimal infrastructure requirements and its potential for rapid retargetability to different platforms and actor implementation languages. Furthermore, LWDF does not have any restriction in terms of actor granularity and can be extended with different combinations of dataflow graph transformations, as well as other forms of signal processing optimizations (e.g., see [1]).
In [21], we presented an efficient integration of the LWDF methodology with hardware description languages (HDLs). Building on this HDL-integrated form of LWDF, we developed methods for low power signal processing hardware implementation, and system-level trade-off exploration. In this paper, we apply the hardware design techniques introduced in [21] as part of a general methodology that spans software, hardware, and mixed hardware/software design, implementation, and trade-off exploration. Thus, while the focus in [21] is on rigorous integration across digital hardware design, lightweight dataflow programming, and low power optimization, the emphasis in this paper is on a methodology for applying LWDF concepts in an integrated manner across complete hardware/software development processes for embedded signal processing systems.
In summary, STMCM provides methods to seamlessly and comprehensively integrate LWDF-based actor implementation techniques with design processes for real-time, resource-constrained signal processing systems. STMCM can be used as an alternative to or in conjunction with more conventional automated dataflow tools (e.g., for disjoint subsystems). STMCM requires more programming effort than fully automated toolchains; however, it provides more agility in terms of retargetability and experimentation, as described above. This is a useful trade-off point to have available for model-based design of complex signal processing systems.
3. Proposed design methodology
Our proposed methodology STMCM is illustrated in Fig. 1. As motivated in Section 1 and Section 2, STMCM is a design methodology that emphasizes LWDF concepts, and is specialized for SoC-based signal processing systems. The upper part of Fig. 1 represents application-specific and algorithmic aspects, while the lower part represents the general part of the methodology that is reusable across different applications. The upper part is illustrated concretely in the context of DNN system design; this part can be replaced with other application/algorithm level design aspects when applying STMCM to other applications.
In STMCM, we apply the LWDF programming model through the Lightweight Dataflow Environment (LIDE). LIDE is a software tool for dataflow-based design and implementation of signal processing systems [3,22]. LIDE is based on a compact set of application programming interfaces (APIs) that is used for instantiating, connecting, and executing dataflow actors. These APIs have been implemented in a variety of implementation languages. For example, LIDE-C [22] and LIDE-V [21] provide C and Verilog language implementations of the LIDE APIs, respectively.
Fig. 1. An illustration of STMCM in the context of DNN system design.
As mentioned in Section 1 and illustrated in Fig. 1, core functional dataflow (CFDF) [4] is the form of dataflow that LWDF is based on. In CFDF, each actor is specified as a set of modes. Each actor firing operates according to one of the specified modes (called the “current mode” associated with the firing), and determines a unique next mode, which will be the current mode for the next firing. The production and consumption rates (dataflow rates) for the actor ports are constant for a given mode. However, different modes of the same actor can have different rates, which allows actors to exhibit dynamic dataflow behavior. We present the switch actor as an example of a CFDF actor. The switch actor has three modes: Control, True and False. In Control mode, the switch actor consumes one token from the Control port. In True or False mode, the switch actor consumes one token from the Data port and forwards that token to the True or False Output port accordingly. The dataflow table and mode transition diagram between CFDF modes of the switch actor are illustrated in Fig. 2.
Fig. 2. Switch actor in CFDF. (a) Switch Actor, (b) Dataflow Table, (c) Mode Transition Diagram between CFDF Modes.
The definition of a CFDF actor includes two functions called the enable function and invoke function of the actor. The enable function checks whether there is sufficient data available on the actor’s input edges and sufficient empty space available on the output edges to fire the actor in its next mode. The invoke function executes an actor firing according to the actor’s current mode, consuming and producing amounts of data that are determined by the fixed dataflow rates of the current mode. The invoke function also determines the actor’s next mode, as described above.
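The switch actor and the enable/invoke split described above can be sketched in a few lines. This is a minimal illustration in Python, not LIDE code: the `Fifo` and `SwitchActor` classes, their method names, the choice of Control as the initial mode, and the return to Control mode after each True or False firing are all assumptions made for the example.

```python
from collections import deque


class Fifo:
    """Bounded FIFO standing in for a dataflow edge (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = deque()

    def population(self):
        return len(self.tokens)

    def free_space(self):
        return self.capacity - len(self.tokens)

    def write(self, token):
        assert self.free_space() > 0, "write to a full FIFO"
        self.tokens.append(token)

    def read(self):
        return self.tokens.popleft()


class SwitchActor:
    """Switch actor with the three modes named in the text."""

    CONTROL, TRUE, FALSE = "control", "true", "false"

    def __init__(self, control_in, data_in, true_out, false_out):
        self.control_in = control_in
        self.data_in = data_in
        self.true_out = true_out
        self.false_out = false_out
        self.mode = self.CONTROL  # assumed initial mode

    def _selected_output(self):
        return self.true_out if self.mode == self.TRUE else self.false_out

    def enable(self):
        """Check input data and output space for the pending mode."""
        if self.mode == self.CONTROL:
            return self.control_in.population() >= 1
        return (self.data_in.population() >= 1
                and self._selected_output().free_space() >= 1)

    def invoke(self):
        """Fire once in the current mode, then select the next mode."""
        if self.mode == self.CONTROL:
            flag = self.control_in.read()
            self.mode = self.TRUE if flag else self.FALSE
        else:
            self._selected_output().write(self.data_in.read())
            self.mode = self.CONTROL  # assumed transition back to Control
```

Each mode has fixed consumption and production rates (one control token, or one data token in and one out), while the sequence of modes depends on the control values. That combination is what gives the actor its dynamic dataflow behavior.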
In the remainder of this section, we discuss in detail the application-, software-, and hardware-specific processes illustrated in Fig. 1.
3.1. Application-specific tools and processes
In Fig. 1, application-specific tools and associated design processes are illustrated by gray blocks. Throughout this paper, we adopt a DNN application as a concrete demonstration of how such application-specific aspects are used as an integral part of STMCM. The DNN-focused design process illustrated in Fig. 1 starts with the derivation of DNN hyperparameters and the network configuration. Then the parameters associated with the derived DNN structure are extracted and the DNN algorithm is carefully validated to ensure that target levels of accuracy are satisfied.
The block labeled “Design Requirements and Constraints” refers to the application- and platform-specific requirements and constraints on the DNN implementation. Examples of these include the accuracy and throughput requirements for image classification DNN systems, and constraints on available power and hardware resources for a targeted SoC platform.
In the remainder of this section, we introduce the software-related and hardware-related design processes that provide the core of STMCM. These processes are applied in an integrated manner for hardware/software co-design, as represented by the lower left hand part of Fig. 1. Detailed explanations of the major components in STMCM are provided in Section 4.3.
3.2. Software-related process
In the next main phase of the proposed design methodology, the DNN network configuration derived using application-specific, algorithm-level tools is mapped to a software implementation using LIDE-C. Note that LIDE-C is in no way restricted to DNN systems, and is instead designed to support a broad class of dataflow-based signal and information processing systems. For example, in the work of [23], the design space exploration of a digital predistortion system for wireless communication is based on implementation using LIDE-C. In [24], LIDE-C is extended to support parameterized synchronous dataflow [25] modeling and applied to the implementation of an adaptive wireless communication receiver. In [26], optimized vectorization techniques are applied to LIDE-based actors for throughput optimization, and demonstrated using an Orthogonal Frequency Division Multiplexing (OFDM) receiver. For more details about LIDE-C and the development of DNN components in LIDE-C, we refer the reader to [22,27].
Working with the LIDE-C implementation of the DNN, a number of optimization processes are carried out iteratively to streamline the software implementation in terms of the relevant design objectives and constraints. This iterative optimization process is illustrated in Fig. 1 by the cyclic path that involves the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. The proposed approach supports efficient application of commonly-used DNN software optimization methods such as for-loop tiling and buffer memory sharing among dataflow graph edges. We refer the reader to Section 4.1 for more details about these optimization methods and their integration with the LIDE-C implementation.
Next, software profiling is performed on the optimized LIDE-C implementation of the DNN system to extract profiling data. This data is extracted for each dataflow component of the DNN architecture. In the profiling process applied in STMCM, the memory sizes of the buffers and execution time of the actors in the graph are measured. According to the characteristics of DNN architecture, the DNN system is divided into multiple computation layers. In our application of STMCM, software profiling is specialized to DNN implementation by measuring the total memory sizes for the buffers both inside each layer and between pairs of adjacent layers. We also measure the total time complexity of each DNN layer.
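As a rough illustration of the layer-level timing measurement described above, the following Python sketch times a chain of stages and converts the measurements into percentage shares of the total. It is not the profiling code used with LIDE-C; the function names `profile_layers` and `percent_shares`, the `repeats` parameter, and the use of `time.perf_counter` are assumptions chosen for the example.

```python
import time


def percent_shares(times):
    """Convert per-layer times into percentages of the total time."""
    total = sum(times.values())
    if total == 0:
        return {name: 0.0 for name in times}
    return {name: 100.0 * t / total for name, t in times.items()}


def profile_layers(layers, first_input, repeats=5):
    """Time each (name, function) stage and feed its output to the next one.

    Returns the mean time per stage in seconds and the percentage shares.
    """
    times = {}
    data = first_input
    for name, fn in layers:
        start = time.perf_counter()
        for _ in range(repeats):
            result = fn(data)
        times[name] = (time.perf_counter() - start) / repeats
        data = result
    return times, percent_shares(times)
```

Keeping the share computation separate from the timing loop makes it easy to apply the same summary to measurements gathered by other means, such as platform-specific timers.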
3.3. Hardware-related process
The dataflow model of the subgraph to be accelerated is implemented in hardware using LIDE-V. Hardware profiling based on the specific implementation platform is performed on the LIDE-V implementation. This profiling is used to collect measurements on hardware performance and help identify possible optimizations. Details on hardware profiling are demonstrated concretely through the case study presented in Section 4.2. Like the software implementation, the hardware implementation will in general go through multiple optimization iterations before it is finalized.
In LIDE-V, the hardware implementation of a dataflow actor is decomposed into implementations of its enable function and invoke function. These components are implemented as two coupled Verilog modules: the actor enable module (AEM) and the actor invoke module (AIM). Dataflow edges are implemented as dataflow edge modules (DEMs); we informally refer to DEMs also as “FIFOs”.
To provide fully distributed scheduling of actors, one can connect a LIDE-V actor scheduling module (ASM) to each actor. The ASM initiates a new firing of its associated actor any time the actor is not already in the firing mode, has sufficient data on its input edges, and has sufficient empty space on its output edges. Scheduling of LIDE-V actors is not restricted to such a fully distributed scheduling approach. For example, with appropriately-designed control logic, subsets of actors can be serialized to allow sharing of resources within the subsets. In this paper,
however, we restrict our attention to fully distributed scheduling. Fully distributed scheduling of dataflow graphs has been analyzed in various contexts. For example, Ghamarian et al. have developed methods for throughput analysis of synchronous dataflow graphs that are scheduled in a fully distributed manner [28]. Such analysis techniques can be applied to hardware subsystems in STMCM.
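The firing rule used in fully distributed scheduling (fire whenever the actor is idle, has enough input data, and has enough output space) can be emulated in software. The sketch below is a sequential Python approximation of what per-actor scheduling modules do concurrently in hardware. The `Edge`, `Actor`, and `run_fully_distributed` names, the one-token-in, one-token-out rates, and the round-based loop are assumptions for illustration and are not part of LIDE-V.

```python
from collections import deque


class Edge:
    """Bounded FIFO edge, optionally preloaded with initial tokens."""

    def __init__(self, capacity, initial=()):
        self.capacity = capacity
        self.tokens = deque(initial)


class Actor:
    """Single-mode actor: consumes one token and produces one token per firing."""

    def __init__(self, func, source, sink):
        self.func = func
        self.source = source
        self.sink = sink
        self.firings = 0

    def enable(self):
        # Sufficient input data and sufficient empty output space.
        return (len(self.source.tokens) >= 1
                and len(self.sink.tokens) < self.sink.capacity)

    def invoke(self):
        self.sink.tokens.append(self.func(self.source.tokens.popleft()))
        self.firings += 1


def run_fully_distributed(actors, max_rounds=1000):
    """In each round, every actor whose enable condition holds fires once.

    The run stops when no actor can fire, or after max_rounds rounds.
    """
    for _ in range(max_rounds):
        fired_any = False
        for actor in actors:
            if actor.enable():
                actor.invoke()
                fired_any = True
        if not fired_any:
            return
```

No actor refers to a global schedule: each one decides to fire from the state of its own edges, which is the property that allows a separate scheduling module to be attached to every actor.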
The orthogonality (separation of concerns) among actor, edge, and scheduler design in LIDE-V lays a valuable foundation for rigorous integration of power management within the associated APIs. In particular, we demonstrated in [29] and [21] that methods for asynchronous design, Globally Asynchronous Locally Synchronous (GALS) design, and clock gating can be applied efficiently through natural extensions of the LIDE-V APIs. We also demonstrated the use of these extensions for power optimization.
To manage complexity and improve reuse of subsystems within and across designs, one can encapsulate subgraphs in LIDE-V within hierarchical actors (HAs). An HA in LIDE-V appears from the outside as a regular (non-hierarchical) LIDE-V actor with an associated AEM, AIM, and ASM. Execution of an HA as an actor in the enclosing dataflow graph is coordinated by the external scheduler associated with the HA. When an HA is fired by its external scheduler, the internal scheduler of the HA coordinates the firings of actors that are encapsulated within the HA (nested actors). The internal scheduler carries out the set of nested actor firings that must be completed for a given firing of the HA. An example of an HA with internal and external schedulers is discussed in detail and illustrated in Fig. 6.
Since it appears from the outside as a regular actor, an HA can be clock gated in exactly the same way, allowing the designer to efficiently switch off the whole subgraph at appropriate times during operation.
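The relationship between an HA's external interface and its internal scheduler can be sketched as follows. This Python fragment is only an analogy for the Verilog structure described above: the `Stage` and `HierarchicalActor` classes, the fixed firing order of the nested actors, and the `clock_enabled` flag used as a stand-in for clock gating are all assumptions made for the example.

```python
from collections import deque


class Stage:
    """Nested actor: maps one input token to one output token per firing."""

    def __init__(self, func):
        self.func = func

    def fire(self, token):
        return self.func(token)


class HierarchicalActor:
    """Wraps a chain of nested actors behind a regular enable/invoke interface."""

    def __init__(self, nested, source, sink, sink_capacity):
        self.nested = nested
        self.source = source
        self.sink = sink
        self.sink_capacity = sink_capacity
        self.clock_enabled = True  # software stand-in for clock gating

    def enable(self):
        # Seen from outside, the HA is enabled like any other actor.
        return (self.clock_enabled
                and len(self.source) >= 1
                and len(self.sink) < self.sink_capacity)

    def invoke(self):
        # Internal scheduler: run every nested firing needed for one HA firing.
        token = self.source.popleft()
        for stage in self.nested:
            token = stage.fire(token)
        self.sink.append(token)
```

An external scheduler only needs `enable` and `invoke`, so gating the whole subgraph reduces to controlling a single actor.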
4. Case study: A deep neural network for vehicle classification
As a concrete demonstration of STMCM, we adopt a DNN use case for automatic discrimination among four types of vehicles: bus, car, truck, and van. This implementation is based on a neural network design presented in [30], where a network configuration (i.e., the number and types of layers and other DNN hyperparameters) was carefully derived and demonstrated to have very high accuracy. The accuracy of the methods was validated with a database of over 6500 images, and the resulting prediction accuracy was found to be over 97%. The work in this paper and the work in [30] have different focuses. The work of [30] focuses on deriving hyperparameters, network design, and demonstrating network accuracy, and does not address aspects of resource-constrained implementation or hardware/software co-design. In this paper, we go beyond the developments of [30] by investigating resource constrained implementation on a relevant SoC platform, and optimized hardware/software co-design involving an embedded multicore processor and FPGA acceleration fabric that are integrated on the platform. In [30], the proposed DNN architectures are evaluated based on the classification accuracy, while in our work on STMCM, the objectives that we are trying to optimize are system throughput, memory footprint and power efficiency. In addition, our work in this paper can be generalized to the design and implementation of arbitrary DNN architectures, and it can also be generalized beyond DNN applications to other signal and information processing applications; the architecture of [30] is selected as a case study to concretely demonstrate the usage of the methodology proposed in this paper.
In relation to Fig. 1, we apply the results from [30] in the block labeled “derivation of hyperparameters and DNN design” as part of the design methodology that is demonstrated in this paper. Fig. 3 illustrates the complete DNN architecture that we implement in this work. For more details about this use case, such as the dataset, the derivation of the DNN architecture and the application of the use case in vehicle classification, we refer the reader to [30].
The DNN network design is composed of two convolutional layers, two dense layers and one classifier layer, as depicted in Fig. 3. The first convolutional layer takes an RGB image (3 × 96 × 96) as input, and produces 32 feature maps, each with dimensions (48 × 48). The second convolutional layer takes these 32 feature maps as input and produces 32 smaller feature maps, each having dimensions (24 × 24). We refer to a subsystem that processes multiple input images to produce a single feature map as a branch. Thus, the first and second convolutional layers have 32 branches each. The two dense layers combine to transform the feature maps into a (1 × 100) vector, which is then multiplied in the classifier layer by a (100 × 4) matrix to determine the (1 × 4) classification result. Each of the four values in the result corresponds to the likelihood that the vehicle in the input image belongs to one of the four vehicle types (i.e., bus, car, truck and van).
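The final step of this pipeline, multiplying the (1 × 100) vector by a (100 × 4) matrix and interpreting the four results as class likelihoods, can be illustrated with a small Python sketch. The function names, the ordering of the class labels, the normalization of the scores with a softmax, and the absence of a bias term are assumptions for this example; the actual weights come from the trained network of [30] and are not reproduced here.

```python
import math

# Assumed index-to-label ordering, following the order used in the text.
VEHICLE_TYPES = ("bus", "car", "truck", "van")


def classifier_layer(features, weights):
    """Multiply a feature vector by a (len(features) x 4) matrix, then apply softmax."""
    assert len(features) == len(weights)
    n_classes = len(weights[0])
    scores = [
        sum(features[k] * weights[k][j] for k in range(len(features)))
        for j in range(n_classes)
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights):
    """Return the label with the highest likelihood."""
    probs = classifier_layer(features, weights)
    return VEHICLE_TYPES[probs.index(max(probs))]
```

With a 100-element feature vector, the matrix product costs 400 multiply-accumulate operations, which is small compared with the work done in the convolutional layers.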
The studied use case is relatively easy to solve compared to common image recognition benchmarks, such as MSCOCO [31], or ImageNet [32]. Therefore, one can reach high accuracy with a relatively simple network requiring significantly lower resources than common network topologies intended for mobile use (such as Mobilenets). As such, the focus of our work is not in mobile devices (e.g., smartphones), but in simpler IoT devices targeted to solving less complex machine learning problems at low cost. For further details on the DNN network design and hyperparameter specifications, we refer the reader to [30].
The specific platform and associated platform-based tools that we employ are based on the Xilinx Zynq Z-7020 SoC. The remainder of this section focuses on details associated with STMCM and its associated design processes. These details are presented concretely through the development of this DNN case study.
4.1. Software implementation and optimization
In this section, we discuss dataflow-graph- and actor-level optimizations and associated design iterations, as illustrated in Fig. 1 by the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. We start with a dataflow graph implementation that is derived using LIDE-C [3,22], which provides a C-language implementation of the LWDF APIs so that CFDF-based actors and dataflow graphs can be implemented in a structured manner using C. The initial (sequential) LIDE-C design is developed in a design phase that corresponds to the block labeled LIDE-C Implementation in Fig. 1.
After validating the correct, dataflow-based operation of the initial DNN dataflow graph implementation in LIDE-C, we experiment with various transformations at the actor, subgraph, and dataflow graph levels. Here, we exploit the orthogonality of actor, edge, and graph implementation in LIDE-C, which allows designers to flexibly and efficiently perform experimentation with a wide variety of transformations, and with different combinations of applied transformations. The actor-level transformations performed here are focused on optimization methods applied to the convolution actor, which is a major performance bottleneck in the design. The subgraph-level transformations involve memory management optimizations performed on FIFOs both inside each subgraph (DNN layer) and between pairs of adjacent layers.
4.1.1. Actor-level optimization
We demonstrate actor-level optimization at this stage of the design process using the convolution actor in our DNN example. In our LIDE-C implementation of this actor, we apply a transformation of the convolution computation that is commonly used to simplify the design, and improve classification speed. The transformation involves loop tiling to reduce the cache miss rate. The utility of loop tiling in DNN implementation has been demonstrated previously, for example, in [33]. Using loop tiling, we decompose the main loop of the convolution computation into an inner loop that iterates within contiguous “strips” of data, and an outer loop that iterates across strips. Applying loop tiling in this way allows one to enhance cache reuse based on an array size (strip length) that fits within the cache.
Fig. 4 shows a segment of code from our application of the tiling transformation to the convolution actor.
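The loop structure that tiling produces can be shown independently of the LIDE-C code in Fig. 4. The sketch below is a generic Python illustration, not a transcription of that figure: the function names, the `strip_len` parameter, and the choice to tile the output-column loop are assumptions. Python is used only to show the index structure; the cache benefit arises in a compiled implementation such as the C actor.

```python
def _window_sum(image, kernel, r, c):
    """Sum of the elementwise product of the kernel and the window at (r, c)."""
    acc = 0.0
    for i in range(len(kernel)):
        for j in range(len(kernel[0])):
            acc += image[r + i][c + j] * kernel[i][j]
    return acc


def conv2d_reference(image, kernel):
    """Untiled 'valid' 2-D convolution (correlation form) on nested lists."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            out[r][c] = _window_sum(image, kernel, r, c)
    return out


def conv2d_tiled(image, kernel, strip_len=16):
    """Same computation, with the column loop split into strips."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for c0 in range(0, out_w, strip_len):      # outer loop: across strips
        c1 = min(c0 + strip_len, out_w)        # last strip may be shorter
        for r in range(out_h):
            for c in range(c0, c1):            # inner loop: within one strip
                out[r][c] = _window_sum(image, kernel, r, c)
    return out
```

Both functions visit the same output elements and accumulate each one in the same order, so their results are identical; only the traversal order over the output changes. In a compiled implementation, `strip_len` would be chosen so that the data touched while processing one strip fits in the cache.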
Fig. 3. DNN for automatic discrimination of four types of vehicles.
Through the orthogonality provided by the model-based design rules in LIDE-C, this transformation can be applied at a late stage in our design process, in a way that is interoperable with previously applied transformations, and in a way that requires no modifications to other actor or edge implementations. In this case, no modification is needed to the dataflow graph scheduler implementation either, although for some transformations, scheduler adjustments can be useful to integrate transformed actors into the overall system in an optimized way. The CFDF-based APIs (enable and invoke functions) in LIDE-C for scheduler implementation allow the designer to experiment efficiently with such scheduling adjustments as needed.
.1.2. Buﬀer memory management 
A major challenge in resource-constrained implementation of a DNN
rchitecture is managing the large volume of data transfers that are car-
ied out during network operation. Each DNN layer typically processes
Fig. 4. The code segment that implements loop 
tiling within the LIDE-C actor for convolution. 
Table 1
Layer-level software profiling. Here, the row labeled “T” gives the execution time of
each layer, and the row labeled “T%” gives the percentage of the total DNN execution time
that is attributed to each layer.
Layer     1       2       3        4        5        Total
T [ms]    18.71   22.08   0.0149   0.0034   0.0036   40.812
T%        45.84   54.10   0.04     0.01     0.01     100

Table 2
Actor-level software profiling.
Layer                   Actor     T_ic [μs]
Convolutional layer 1   Conv      230.10
Convolutional layer 1   Add       0.03
Convolutional layer 1   M&ReLU    0.025
Convolutional layer 2   Conv      59.77
Convolutional layer 2   Add       0.005
Convolutional layer 2   M&ReLU    0.006
Dense layer 3           Mult      5.1
Dense layer 3           ReLU      0.0012
Dense layer 4           Mult      0.029
Dense layer 4           ReLU      0.0012
Output layer 5          Mult      0.0023
Output layer 5          Softmax   0.0031
a large amount of data, and requires memory to store the input data from the previous
layer or subsystem, the intermediate data during the computation processing, and the
computation results that will be transmitted to the following layer or subsystem.

Consider, for example, the buffer memory costs (the storage costs associated with the
dataflow graph edges) for the DNN of Fig. 3. In our LIDE-C implementation, the second
convolutional layer requires the most buffer memory. In this layer, each of the 32
branches is composed of 32 convolution actors, 31 addition actors and one actor
performing both maxpooling and ReLU (Rectified Linear Unit). Given that the size of the
input feature map processed by each branch is 48 × 48 pixels, the buffer memory required
for actor communication inside each branch is
image_size × (number_of_conv_actors + number_of_output_feature_maps),
which is 48 × 48 × (32 + 1) = 76,032 pixels. Thus, the total buffer memory inside the
second convolutional layer is 76,032 × 32 = 2,433,024 pixels. The buffer memory required
for data communication between the first and the second layer can be computed as
48 × 48 × 32 = 73,728 pixels.
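
The arithmetic in this paragraph can be reproduced with a few lines of Python. The function and variable names below are illustrative only; the constants are the ones quoted in the text.

```python
def branch_buffer_pixels(map_height, map_width, conv_actors, output_maps):
    """image_size x (number_of_conv_actors + number_of_output_feature_maps)."""
    return map_height * map_width * (conv_actors + output_maps)


BRANCHES = 32  # branches in the second convolutional layer

per_branch = branch_buffer_pixels(48, 48, conv_actors=32, output_maps=1)
layer2_total = per_branch * BRANCHES
inter_layer = 48 * 48 * 32  # edges between the first and second layers

print(per_branch, layer2_total, inter_layer)  # 76032 2433024 73728
```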
In STMCM, we apply a buffer memory optimization technique that is useful for
resource-constrained DNN implementation. In particular, we incorporate a new FIFO
abstract data type (ADT) implementation in LIDE-C, called shared FIFO, that enables
multiple dataflow edges in a graph to be implemented through FIFO ADT instances that
share the same region of memory. Such buffer sharing in dataflow implementations has been
investigated in different forms for various contexts of automated scheduling and software
synthesis (e.g., see [34–36]). In STMCM, we make it easy for the system designer to apply
buffer sharing explicitly within her or his implementation rather than depending on its
implicit support through the toolset that is used. This is an example of the agility that
is supported in STMCM, as described at the end of Section 2.

Again, by exploiting the orthogonality among dataflow components, buffer sharing in
STMCM is performed only on the targeted dataflow edges and requires no modification to
other actors or subgraphs. Through the support for such separation of concerns in LIDE-C,
different ADT implementations for a FIFO or group of FIFOs can be interchanged without
affecting overall system functionality.

There are three key aspects to our application of shared FIFOs in our LIDE-C DNN
implementation. First, at the input of each convolutional layer L, input data from the
previous layer is stored centrally instead of being copied separately into each branch of
L. Second, edges in different layers share the same memory so that the memory is
time-division multiplexed between the layers: the processing of a given layer overwrites
memory in its shared FIFOs without introducing conflicts that affect the computation
results. Third, actors operate on data from shared input FIFOs directly through their
read pointers into the FIFO (rather than first copying the data locally within the
actor’s internal memory). This kind of copy-elimination is similar to dataflow memory
management techniques introduced by Oh and Ha [35].
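
The three aspects listed above can be illustrated with a small, hypothetical sketch. The classes `SharedRegion` and `SharedFifo` and their methods are invented for this example and are not the LIDE-C shared FIFO ADT; the sketch only shows central storage, reuse of one memory region by FIFOs of different layers, and zero-copy reads through a read pointer.

```python
class SharedRegion:
    """A block of memory that several FIFO views may reuse over time."""

    def __init__(self, capacity):
        self.storage = bytearray(capacity)


class SharedFifo:
    """FIFO view over a shared region (hypothetical, for illustration only)."""

    def __init__(self, region):
        self._view = memoryview(region.storage)
        self.read_index = 0
        self.write_index = 0

    def write(self, data):
        end = self.write_index + len(data)
        if end > len(self._view):
            raise OverflowError("shared region is full")
        self._view[self.write_index:end] = data
        self.write_index = end

    def peek(self, count):
        """Return a zero-copy window starting at the read pointer."""
        end = self.read_index + count
        if end > self.write_index:
            raise ValueError("not enough tokens in the FIFO")
        return self._view[self.read_index:end]

    def consume(self, count):
        self.peek(count)  # validates that the tokens are available
        self.read_index += count


region = SharedRegion(8)

# Aspects 1 and 3: the layer input is stored once; both branches read it
# in place through windows into the same memory, without local copies.
layer1_input = SharedFifo(region)
layer1_input.write(bytes([1, 2, 3, 4]))
branch_a = layer1_input.peek(4)
branch_b = layer1_input.peek(4)

# Aspect 2: once layer 1 has finished with its data, a FIFO belonging to
# the next layer reuses (overwrites) the same region.
layer2_input = SharedFifo(region)
layer2_input.write(bytes([7, 7]))
```

The sketch does not model the scheduling constraint that makes the overwrite safe; in the text, that safety comes from the layer-by-layer order of processing.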
Improvements resulting from our application of shared FIFOs are demonstrated
quantitatively in Section 5.1.

4.1.3. Software profiling

In this subsection, we demonstrate the process of software profiling, as illustrated in
Fig. 1, in the context of our optimized LIDE-C implementation of the DNN architecture.
The implementation platform is an Intel i7-2600K running at 3.4 GHz. Table 1 and Table 2
show layer- and actor-level software profiling measurements, respectively.

In Table 2, T_ic denotes the invoke to firing completion time of a given actor. This is
the average time that elapses between the time that an actor firing is initiated and when
the firing completes. We also refer to T_ic as the average execution time of the
associated actor. The abbreviations Add, Conv, Mult, and M&ReLU stand, respectively, for
Addition, Convolution, Multiplication, and Maxpool-and-ReLU.

Layer- and actor-level software profiling provide insight into the processing complexity
of actors in each layer. According to Table 1, the convolutional layers account for
99.94% of the system execution time. Also, the execution time of Layer 2 is very close to
that of Layer 1. In both convolutional layers, the Conv actors account for most of the
processing time compared with the other two actors (Add and M&ReLU) in the convolutional
layers. Additionally, the average execution time of the Conv actors in Layer 2 is only
about a quarter of that of the Conv actors in Layer 1. This is primarily because each of
the Conv actors in Layer 1 processes input images of size 96 × 96, while the Conv actors
in Layer 2 process input feature maps that have size 48 × 48.
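
The figures quoted in this paragraph can be checked against Tables 1 and 2 with a short calculation. The variable names are illustrative; the numbers are copied from the tables.

```python
# Layer execution times from Table 1, in milliseconds.
layer_ms = {1: 18.71, 2: 22.08, 3: 0.0149, 4: 0.0034, 5: 0.0036}

total_ms = sum(layer_ms.values())
share = {layer: 100.0 * t / total_ms for layer, t in layer_ms.items()}

# Combined share of the two convolutional layers (Layers 1 and 2).
conv_share = share[1] + share[2]

# Average Conv actor times from Table 2, in microseconds.
conv_ratio = 59.77 / 230.10             # Layer 2 Conv relative to Layer 1 Conv
pixel_ratio = (48 * 48) / (96 * 96)     # input size ratio, exactly 0.25

print(f"total = {total_ms:.3f} ms, conv layers = {conv_share:.2f} %")
print(f"Conv time ratio = {conv_ratio:.3f}, pixel ratio = {pixel_ratio:.2f}")
```

The time ratio comes out slightly above the pixel ratio, which is consistent with the statement that input size is the primary, but not necessarily the only, factor.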
4.2. Hardware implementation and design exploration

In this section, we describe the main capabilities of the design flow depicted in Fig. 1
with respect to design and implementation of hardware accelerators. These capabilities
are represented by the blocks in the region labeled “Hardware-related Process”.
Fig. 5. LIDE-V implementation for the accelerated SFM. 
 
Table 3
Measured data associated with actor execution times and waiting (idle) times.
SFM t_tot: 232,831
Actor          T_ic   T_ci   firings   Tot (Tot %)       T_ii / T_ic
Deinterleave   3      2      9216      27,648 (11.87)    1.67
Convolution    2402   2      96        230,592 (99.04)   1.00
Sum            107    2297   96        10,272 (4.41)     22.46
Maxpool&ReLU   195    4613   48        9360 (4.02)       24.66
For example, through a preliminary hardware profiling phase of the DNN application
described in Section 4.2.1, we can identify three hardware design aspects that are
interesting to investigate in detail: the adoption of clock gating techniques,
exploitation of asynchrony that is inherent in dataflows, and exploration of different
levels of actor granularity.

We demonstrate the hardware-related design process of STMCM using a hardware accelerator
that is introduced in [21]. The accelerator provides a subsystem for producing feature
maps from the first convolutional layer of the DNN application. In the remainder of this
paper, we refer to this subsystem as the Subtree for Feature Map (SFM).

Due to the interfacing consistency that is maintained across LIDE actor implementations
in different languages, one can readily convert the LIDE-C based SFM subsystem
implementation into hardware by replacing each software actor with a hardware module that
is designed in LIDE-V, and by connecting the derived hardware actors with LIDE-V FIFOs.
Following the general approach of realizing LIDE actors in hardware, each LIDE-V actor
implementation is decomposed into an AEM and AIM. The AEM is reusable among different
actors in our implementation, although in general it can be useful to have specialized
AEM implementations that are streamlined for the specific requirements of individual
actors [21].

The hardware implementation diverges from the LIDE-C design in two major ways. First, we
feed the input data in an interleaved format, reducing the complexity of the hardware
interface and driver software since there is only one input FIFO to manage. Second, the
hardware actors are designed to produce one row per firing instead of entire images. This
reduces the FIFO size requirements in the first layer from 96 × 96 pixels to only 96
pixels. The hardware actors in our implementation are scheduled using a fully distributed
approach.

The resulting SFM is shown in Fig. 5. The implemented hardware is verified against
reference outputs extracted from the LIDE-C implementation. In this figure, production
and consumption rates (dataflow rates) are annotated next to actor ports, and w is the
input image width. The convolution actor has multiple operating modes (CFDF modes) with
different consumption rates.

4.2.1. Hardware profiling

We employ hardware profiling in STMCM to extract execution time data, which is later used
to guide the process of iterative design optimization. In this section, we demonstrate
hardware profiling in the context of our DNN application. Profiling is performed using
the target platform, which in our demonstration is the Zynq Z-7020 SoC. We profile the
LIDE-C implementation on the ARM A9 MPCores provided by the target platform, develop a
first-version implementation of the SFM on this platform, and extract execution time data
from this implementation.
Table 3 depicts various data associated with execution times and waiting times for the
SFM hardware accelerator illustrated in Fig. 5. Here, the symbol t_tot represents the
total time necessary to execute the SFM; T_ic is the average time period between an actor
invocation and its corresponding firing completion; T_ci is the average time period that
an actor has to wait to be fired after its previous firing completion; firings is the
number of firings of a given actor during execution of SFM; Tot, calculated as
(T_ic) × (firings), gives the total execution time of a given actor during the execution
of SFM; T_ii = (T_ic + T_ci) denotes the average time period between the beginning of one
invocation to the beginning of the next; and the ratio T_ii / T_ic measures the extent of
actor idleness.
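
The derived columns of Table 3 follow directly from these definitions. The sketch below recomputes them from the measured T_ic, T_ci and firings values; the `ActorProfile` class is an illustrative helper, not part of STMCM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorProfile:
    name: str
    t_ic: int      # invoke-to-firing-completion time
    t_ci: int      # waiting time before the next firing
    firings: int   # number of firings during one SFM execution

    @property
    def tot(self):
        """Total execution time: T_ic x firings."""
        return self.t_ic * self.firings

    @property
    def t_ii(self):
        """Time between the starts of successive invocations: T_ic + T_ci."""
        return self.t_ic + self.t_ci

    @property
    def idleness(self):
        """Idleness ratio T_ii / T_ic."""
        return self.t_ii / self.t_ic


T_TOT = 232_831  # total SFM execution time from Table 3

PROFILES = [
    ActorProfile("Deinterleave", 3, 2, 9216),
    ActorProfile("Convolution", 2402, 2, 96),
    ActorProfile("Sum", 107, 2297, 96),
    ActorProfile("Maxpool&ReLU", 195, 4613, 48),
]

for p in PROFILES:
    percent = 100.0 * p.tot / T_TOT
    print(f"{p.name:13s} Tot={p.tot:7d} ({percent:5.2f}%) "
          f"T_ii/T_ic={p.idleness:.2f}")
```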
This rich collection of metrics, which is supported by the underlying CFDF model of
computation, provides various insights on the dataflow-based system architecture and its
implementation. For example, the T_ii / T_ic ratio provides insight on differences in
processing speed that are useful in exploiting the inherent asynchrony between dataflow
actors.

From analysis of our hardware profiling results (Table 3), we can derive different
versions of the SFM hardware accelerator with different trade-offs among power
consumption, system throughput, and hardware resource cost. Firstly, looking at column
Tot %, we see that all of the actors except for Convolution are inactive throughout most
of the execution time. The maximum proportion of active time among these actors is
11.87%, reached by Deinterleave. Gating the clock of these frequently inactive actors can
provide more energy efficient accelerator operation by eliminating dynamic power
consumption during idle phases.

Furthermore, the Deinterleave and Convolution actors have relatively small idleness
levels (T_ii / T_ic), with a waiting time T_ci equal to 2 clock cycles for both of them.
On the other hand, Sum and Maxpool&ReLU exhibit much larger waiting times and idleness
levels. An important hint coming from the T_ci values is that, thanks to the inherent
asynchrony of dataflow actors, it is possible to partition the design into different
clock regions working at different frequencies, thus obtaining a GALS design. In
particular, the Deinterleave and Convolution actors can be placed in one clock region
(Region 1), driven by clock 1, while Sum and Maxpool&ReLU can be placed in another region
(Region 2), driven by clock 2. On the basis of the measured T_ii / T_ic values, we can
set clock 2 to be 20 times slower than clock 1.

Moreover, the subgraph included in Region 2 can be encapsulated into a hierarchical actor
(see Section 3.3). This actor, seen from the “outside”, is like any other LIDE-V actor.
The actor and its encapsulated subsystem can be clock gated or clocked with a different
frequency, providing additional candidate solutions for SFM accelerator optimization.

4.2.2. SFM Exploration

Based on the hardware profiling analysis discussed in Section 4.2.1, we explored six
different variants of the SFM design:
Fig. 6. An illustration of the hierarchical actor associated with Design SFM_h.
• SFM_a: This is an asynchronous design where actors belonging to different logic regions
  run at different clock frequencies. In particular, the clock frequency for clock 1 is
  set to 100 MHz, and the clock frequency for clock 2 is set to 5 MHz. Referring to
  Fig. 5, the only modification required in the design is the replacement of FIFOs that
  are placed between the two clock regions. These FIFOs need to be replaced with
  asynchronous FIFOs; for this purpose, we employ the clock domain crossing (CDC) FIFOs
  presented in [21]. CDC FIFOs are designed with read and write logic that can be driven
  by different clocks. At the same time, their module interfaces conform to standard
  LIDE-V edge interfaces so they can replace other FIFO implementations without requiring
  changes to actors that communicate with them.
• SFM_CG: Based on our hardware profiling results, we apply clock gating to the
  Deinterleave, Sum and Maxpool&ReLU actors. To be clock gated, a LIDE-V actor needs only
  the instantiation of a clock gating module (CGM) [21]. The CGM involves a BUFG
  primitive that physically enables/disables the clock signal in the target SoC. Thus,
  for each clock gated actor A in SFM_CG, a CGM is instantiated and connected to the
  clock inputs of A and to the read- and write-clock inputs, respectively, of the FIFOs
  that A reads from and writes to.
• SFM_aCG: This design incorporates both asynchronous design and clock gating techniques.
  As in SFM_a, the FIFOs between the two clock regions are replaced with CDC FIFOs.
  Additionally, the Deinterleave, Sum and Maxpool&ReLU actors are clock gated as in
  SFM_CG, and a CGM is instantiated for each of these actors.
• SFM_h: This is a hierarchical SFM design, which can be viewed as a baseline for
  evaluating our enhanced hierarchical design SFM_hCG (defined below). In SFM_h, Region 2
  (see Fig. 5) is encapsulated in a hierarchical actor H. An illustration of this
  hierarchical actor is provided in Fig. 6. The subgraph that is encapsulated by H
  contains three actors A1, A2 and B. We denote this subgraph by G_H. Actors A1 and A2
  correspond to Sum_1 and Sum_2, respectively, which are two actors that add outputs from
  the three convolution actors. Actor B corresponds to the Maxpool&ReLU actor.
  When H is viewed as a single actor from the outside, a firing of H starts when the
  internal scheduler I_ASM_HA for G_H receives the invoke_HA signal from the external
  scheduler E_ASM_HA. Inside the subgraph G_H, the invoke_HA signal is received by
  ASM_A1, which is the ASM of actor A1. Once ASM_A1 receives the invoke_HA signal, the
  firing of the subgraph G_H starts.
• SFM_hCG: This design is the same as SFM_h, except that the Deinterleave actor and the
  hierarchical actor are clock gated. It is important to highlight that the application
  of clock gating at the region level is advantageous if the execution times of the
  actors within the region are overlapped. In this design, however, the execution times
  of the three actors are not overlapped. When one actor is executed, the others wait in
  an idle state and waste power. Therefore, we expect that this configuration would not
  be as effective in reducing power consumption as SFM_CG in the targeted DNN case.
  However, we include the test in our explorations to present the wide variety of options
  made available by STMCM (even if some of them may be less efficient than others for
  this particular application scenario).
• SFM_auto: This is a version of the SFM that is synthesized and implemented by enabling
  the automatic power optimization available within the adopted Xilinx Vivado
  environment. This design applies fine-grain clock-gating and fine-grain logic-gating at
  the Verilog level and excludes all of the higher-level, dataflow-based optimizations
  (coarse-grain asynchronous design, clock-gating, and hierarchical decomposition) that
  are applied in the other five investigated designs. Thus, SFM_auto is useful as a
  common baseline to assess the higher-level models and transformations provided by
  STMCM compared to existing off-the-shelf synthesis techniques.
4.3. Joint hardware/software implementation and optimization

This section shows how the proposed design flow (summarized in Fig. 1) provides a variety
of interesting hardware/software co-design implementation choices and optimization
possibilities. In particular, these features are represented by the “Co-design-related
Process” area of Fig. 1. For a given high-level LWDF model, the interaction between
software (see Section 4.1) and hardware (see Section 4.2) actors or subgraphs can be
shaped and refined depending on the specific constraints and requirements of the
application.

In particular, we demonstrate two main implementation aspects that can be efficiently
explored with STMCM: parallelism across actor execution, and the adopted communication
interfaces. The degree of parallelism can be tuned depending on the number of software
and/or hardware cores adopted for the execution of a certain computational step, while
different communication interfaces allow different levels of coupling between hardware
and software actors. Both of these dimensions for exploration therefore represent
important sources of trade-offs to consider during the implementation process.
For the purpose of our co-design explorations, the DNN application has been split into
two parts to be executed respectively in software (PS) and hardware (PL). Here, PS and PL
stand for Processing System and Programmable Logic, respectively. In our experiments, we
consider the SFM subsystem introduced in Section 4.2 as the portion of DNN application
that will be accelerated in the PL, while the remaining part, involving the second
convolutional layer, two dense layers and final classification layer, will be executed by
the PS.

Note that the first convolutional layer constitutes only one of the main computationally
intensive steps of the DNN application. According to software profiling results that are
based on the SoC platform that we applied for hardware/software co-design (see Table 8),
the first convolutional layer only accounts for about 27% of the prediction time. For
this reason, the speedup brought by hardware acceleration to the overall DNN application
is not dramatic, as will be discussed further in Section 5. However, the results
concretely demonstrate how STMCM can be applied to perform extensive design space
exploration across a variety of diverse designs to achieve system performance enhancement
under highly-constrained hardware resource availability.
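
The limit on overall speedup can be made concrete with Amdahl's law, which is a standard bound applied here for illustration rather than an analysis taken from the paper. The sketch assumes that exactly 27% of the prediction time is accelerated and ignores communication overhead between the PS and PL; the local speedup of 10 used in the example is a hypothetical value.

```python
def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: speedup of the whole when one fraction is accelerated."""
    if not 0.0 <= accelerated_fraction < 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1)")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


def speedup_upper_bound(accelerated_fraction):
    """Limit of overall_speedup as the local speedup grows without bound."""
    if not 0.0 <= accelerated_fraction < 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1)")
    return 1.0 / (1.0 - accelerated_fraction)


FIRST_LAYER_FRACTION = 0.27  # share of prediction time quoted in the text

bound = speedup_upper_bound(FIRST_LAYER_FRACTION)
example = overall_speedup(FIRST_LAYER_FRACTION, local_speedup=10.0)
print(f"upper bound = {bound:.2f}x, with a 10x accelerator = {example:.2f}x")
```

Under these assumptions, even an arbitrarily fast accelerator for the first layer cannot improve the overall prediction time by more than a factor of about 1.37.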
The SFM accelerator has been integrated into the LIDE-C design presented in Section 4.1
by replacing the SFM software implementation with function calls to driver software that
is capable of offloading the computation to the PL. We have experimented with using a
Linux kernel driver based on the Userspace I/O (UIO) framework [37], and a driver that is
independent of the Linux kernel and operates by directly accessing memory with the mmap
system call. The UIO approach is more suitable for production use, while mmap works well
for prototyping, and this latter approach has been used in this work for evaluation. The
PS and PL can communicate by means of AXI interfaces exploiting either General Purpose
(GP) ports, which are 32-bit width PS master or slave ports with 600 Mbps bandwidth for
both read and write channels, or High Performance (HP) ports or Accelerator Coherency
Ports (ACP), which are 64-bit width PS slave ports with 1200 Mbps bandwidth for both read
and write channels.

Fig. 7 depicts the reference configuration for the co-design explorations. In order to
integrate the accelerator into the SoC, a generic AXI wrapper for hardware dataflow
subgraphs has to be provided. The wrapper is compliant with the adopted AXI interface and
lets the programmer access the input and output FIFOs of the dataflow graph and monitor
their populations. For this purpose, the wrapper includes all the necessary logic for the
communication management.

In our hardware acceleration approach, we map the SFM subsystem to hardware. This
subsystem produces a 48 × 48 feature map on each execution. Thus, in order to perform the
entire first convolutional layer of the DNN application, which must produce 32 feature
maps of size 48 × 48, the SFM accelerator has to be executed 32 times with the
appropriate convolution coefficients. For each of these SFM executions, the PS will send
the corresponding convolution coefficients to the accelerator. The input image, which
remains the same across all 32 executions, is sent only once from the PS and stored
within a local buffer within the accelerator. All 32 executions of the SFM access the
input image from this local buffer. In this way, we avoid the large amount of data
transfer that would be required if the input image had to be sent separately from the PS
to the PL for each SFM execution. Upon completion of each SFM execution, the PS retrieves
the resulting feature map from the accelerator.

In the remainder of this section, we discuss in detail three different sets of co-design
implementations and optimizations that are facilitated by STMCM:
• the amount of parallelism that is exploited in the software and hardware subsystems;
• two alternative communication interfaces that offer different trade-offs in terms of
  resource requirements and execution speed; and
• local buffering to avoid redundant transmission of common data across different
  branches of the SFM accelerator.
These three sets of co-design explorations are discussed further in Section 4.3.1,
Section 4.3.2, and Section 4.3.3, respectively.
4.3.1. Exploiting parallelism

STMCM allows the designer to experiment efficiently with the amounts of parallelism that
are exploited in both the hardware and software subsystems (see the dashed squares in
Fig. 7). In particular, depending on the specific application requirements, multiple
parallel instances of software cores or hardware accelerators can be utilized. While
software cores are able to execute all DNN application steps, hardware accelerators can
only perform the steps that they have been conceived for. Generally speaking, hardware
accelerators achieve higher efficiency than software cores when executing a given
computational step, both in terms of execution time and resource efficiency (resource
utilization and consumption).

In the targeted Xilinx Zynq Z-7020 SoC platform, a pair of homogeneous cores is
available, so that the maximum degree of software parallelism in our implementations is
2. The available cores are both ARM A9 MPCores with two levels of cache and access to a
512 Mb off-chip DDR RAM. In our experiments, we have exploited software parallelism for
the two most computationally intensive steps of the application: the two convolutional
layers.

When using FPGA fabric, designers have the possibility to utilize as much parallelism as
the FPGA resources allow. In this work, we have investigated three alternative designs
that utilize 1, 2 or 4 parallel SFM instances, respectively, in the same hardware
accelerator. In the first case, the accelerator is executed 32 times in order to complete
the 32 branches of the first convolutional layer. This design executes a different branch
with different convolution coefficients for each accelerator invocation. In the second
case (2 parallel SFM instances), the accelerator execution time is halved, but for each
run, two new sets of convolution coefficients are necessary. Finally, with 4 parallel SFM
instances, only 8 accelerator executions are needed, with each requiring the updating of
four different sets of coefficients.
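
The relationship between the number of parallel SFM instances and the number of accelerator executions can be sketched as follows. The function name is illustrative. The counts for 1 and 4 instances are stated in the text; the count of 16 executions for 2 instances is inferred from the statement that the execution time is halved.

```python
BRANCHES = 32  # branches of the first convolutional layer


def accelerator_executions(parallel_instances):
    """Executions needed when each run completes one branch per SFM instance."""
    if parallel_instances < 1 or BRANCHES % parallel_instances != 0:
        raise ValueError("instance count must evenly divide the branch count")
    return BRANCHES // parallel_instances


for instances in (1, 2, 4):
    runs = accelerator_executions(instances)
    # Each run needs one new set of convolution coefficients per instance.
    print(f"{instances} instance(s): {runs} runs, "
          f"{instances} coefficient set(s) per run")
```

The total number of coefficient sets sent over all runs stays at 32 in every configuration; only their grouping per run changes.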
4.3.2. Communication interfaces

During the process of co-design exploration, STMCM gives the designer significant
flexibility to select interfaces for communicating data between the hardware and software
subsystems. This flexibility is provided by the general dataflow model of computation
that underlies STMCM. Flexibility in selecting a communication interface can be very
useful in the context of resource- or performance-constrained design. This is
demonstrated, for example, by the work of Silva et al., which analyzes trade-offs among
the different AXI interface options [38].

We investigated the usage of two different AXI communication interfaces located at the
extremes of the resource-versus-performance trade-off:
• the memory-mapped AXI4-lite (mm-lite) interface; and
• FIFO-based AXI4-stream (stream) interface.

Compared to the stream interface, the mm-lite interface has lower resource requirements,
but it also exhibits lower performance. The mm-lite interface uses memory-mapped,
one-by-one transfer of data items. The interface is particularly intended for control
signals and small-scale data accesses. It does not need any additional modules beyond
those depicted in Fig. 7, and it uses only one of the PS master GP ports. For example,
the execution of one branch of the first layer requires the input images (3 RGB images
with 96 × 96 pixels each) and kernel coefficients (3 kernels with 5 × 5 coefficients
each). Since the mm-lite interface uses a separate data transfer operation for each
pixel, this results in a total of 3 × 96 × 96 + 3 × 5 × 5 data transfer operations. Once
the accelerator
Fig. 7. Reference conﬁguration for hardware/software co-design exploration in our experiments. 
completes its computation, the mm-lite interface requires 48 × 48 data transfer
operations to enable the processor to read the output feature map.
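
The mm-lite transfer counts for one branch evaluate as follows. The function and parameter names are illustrative; the dimensions are those given in the text.

```python
def mm_lite_transfers(channels=3, image_dim=96, kernel_dim=5, output_dim=48):
    """Per-branch transfer counts when every data item is sent one by one."""
    image_writes = channels * image_dim * image_dim
    kernel_writes = channels * kernel_dim * kernel_dim
    reads = output_dim * output_dim
    return image_writes + kernel_writes, reads


writes, reads = mm_lite_transfers()
print(writes, reads)  # 27723 2304
```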
Unlike the mm-lite interface, which performs data transfers one-by-one, the stream
interface employs a DMA engine that transfers data between processor memory and the
accelerator in blocks, where the block size can be up to 256 bytes. Successive data items
within a block are transferred in consecutive clock cycles. The stream interface requires
a DMA engine, as mentioned above, and additional FIFO buffers, and therefore incurs
significant overhead in terms of resource requirements. Note that the additional hardware
required by the stream interface is not depicted in Fig. 7. The DMA engine is configured
through one of the PS master GP ports, and requires two different PS slave HP ports to
directly access the memory where data to be transferred to/from the accelerator is
stored.

To execute one branch of the first DNN layer, the stream interface performs (a) 96
memory-to-accelerator DMA operations to send the input images, with 96 × 3 pixels for
each DMA operation, and (b) one memory-to-accelerator DMA operation to send 5 × 5 × 3
kernel coefficients. Additionally, the stream interface needs 48 accelerator-to-memory
DMA operations to retrieve the computed feature map, with 48 pixels for each DMA
operation.
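
A short calculation shows that the stream interface moves the same number of data items per branch as the mm-lite interface while issuing far fewer operations. The constant names are illustrative; the counts are taken from the text, and the comparison does not account for DMA setup cost or clock frequency.

```python
# Stream interface, one branch of the first DNN layer.
INPUT_DMA_OPS = 96              # memory-to-accelerator, input images
PIXELS_PER_INPUT_OP = 96 * 3    # 96 x 3 pixels per operation
KERNEL_DMA_OPS = 1              # memory-to-accelerator, coefficients
KERNEL_COEFFS = 5 * 5 * 3
OUTPUT_DMA_OPS = 48             # accelerator-to-memory, feature map
PIXELS_PER_OUTPUT_OP = 48

stream_ops = INPUT_DMA_OPS + KERNEL_DMA_OPS + OUTPUT_DMA_OPS
stream_items = (INPUT_DMA_OPS * PIXELS_PER_INPUT_OP
                + KERNEL_COEFFS
                + OUTPUT_DMA_OPS * PIXELS_PER_OUTPUT_OP)

# mm-lite interface: one operation per data item.
mm_lite_ops = 3 * 96 * 96 + 3 * 5 * 5 + 48 * 48

print(stream_ops, stream_items, mm_lite_ops)  # 145 30027 30027
```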
4.3.3. Local buffering

As mentioned previously, we incorporate local buffering of image pixels in the SFM
accelerator to avoid redundant transmission of common data across different branches of
the accelerator. This local buffering optimization is applied to both the
mm-lite-interface- and stream-interface-based accelerator implementations.

For an accelerator configuration with a single SFM instance, the input image data is
transferred to the accelerator only during execution of the first branch. After being
transferred, this data is retained in a local buffer within the accelerator for reuse by
the remaining 31 executions. For accelerator configurations that have multiple (parallel)
SFM instances, the input image is also transferred only once to the accelerator. For
these configurations, the image data is reused by the remaining executions of all of the
SFM instances. Thus, our incorporation of local buffering optimization eliminates input
image data transfers for all branches except the first one.
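
The saving from local buffering can be estimated by counting input-image data items sent from the PS to the PL over all 32 branches. This is a sketch with illustrative names; it counts data items only and assumes that, without buffering, the full image would be resent for every branch, as described in the text.

```python
IMAGE_ITEMS = 3 * 96 * 96   # input image pixels over three channels
BRANCHES = 32               # branches of the first convolutional layer


def image_items_sent(local_buffering):
    """Input-image items transferred to the accelerator for one prediction."""
    return IMAGE_ITEMS if local_buffering else IMAGE_ITEMS * BRANCHES


without_buffer = image_items_sent(local_buffering=False)
with_buffer = image_items_sent(local_buffering=True)
saved = without_buffer - with_buffer

print(without_buffer, with_buffer, saved)  # 884736 27648 857088
```

Convolution coefficients are not affected by this optimization, since each branch still requires its own set.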
5. Results

In this section, we present experimental results to demonstrate the design and
implementation methods provided by STMCM based on the detailed case study presented in
Section 4. The main contribution of this section is to demonstrate that the proposed
methodology facilitates efficient experimentation with alternative dataflow-based
architectures, design optimization methods, and implementation trade-offs.

5.1. Embedded software implementation

In this section we present results of our experimentation using STMCM to explore
alternative embedded software implementations. We focus specifically on the optimized
application of loop tiling and buffer memory management.
Table 4
Memory requirements (in pixels) for the first two layers. In parentheses in the last
column: the percentage of the memory requirement of the DNN with shared FIFOs with
respect to that of the DNN with common FIFOs.
FIFOs Convolutional layer 1 Convolutional layer 2 Total
Conv. Add Maxpool&ReLU Conv. Add Maxpool&ReLU
Common FIFOs 6,875,136 1,806,720 1,179,648 368,640 5,128,192 2,433,024 92,160 17,883,520
Shared FIFOs 2,525,184 921,984 0 0 2,768,896 0 0 6,216,064 (34.8)
Table 5
Resource utilization. In parentheses: the percentage of utilization with respect to the
resources available on the targeted FPGA.
Design      LUTs           REGs          BUFGs      BRAMs      DSPs
Available   53200          106400        32         140        220
SFM         5188 (9.75)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
SFM_a       5430 (10.20)   3687 (3.47)   2 (6.3)    11 (7.9)   13 (5.9)
SFM_CG      5206 (9.79)    3496 (3.29)   5 (15.6)   11 (7.9)   13 (5.9)
SFM_aCG     5479 (10.30)   3704 (3.48)   6 (18.8)   11 (7.9)   13 (5.9)
SFM_h       5170 (9.72)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
SFM_hCG     5198 (9.77)    3480 (3.27)   3 (9.4)    11 (7.9)   13 (5.9)
SFM_auto    5230 (9.83)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
t.1.1. Loop tiling 
As introduced in Section 4.1.1, in the optimization of our LIDE-C implementation of the DNN application, we explored loop-tiled convolution actor designs with different tile sizes. Specifically, we measured the number of cache load misses and the cache load miss rates during execution of a convolution actor. The valid tile sizes for each convolution actor were those within the range of 1 to D, where D is the dimension of input images to the actor. For example, for the convolution actors in Layer 1, which process input images with size 96 × 96 pixels, we explored tile sizes within the range of 1–96.
Fig. 8 shows the number of cache load misses and cache load miss rate under different tile sizes for convolution actors with different input image dimensions (48 × 48, 96 × 96, 750 × 750, and 1500 × 1500). As we can see from the results, the cache load miss rates are very small for image dimensions D ∈ {48, 96, 750}. This indicates that the data can be fully stored or almost fully stored in the cache with any valid tile size.
For D = 1500, however, there is significant variation in the cache load miss rate across different tile sizes. The rate reaches its lowest value when the tile size is approximately 400. With careful setting of the tile size, loop tiling significantly reduces the cache miss rate for convolution actors that have relatively large image dimensions.
Additionally, we can see that there is a large average CPU cycle count for small tile sizes in all figures. We expect that this is due to the overhead caused by the additional for loops that are introduced by the loop tiling transformation.
In summary, based on our simulation analysis for small image dimensions (96 × 96 and 48 × 48), loop tiling does not help to reduce the cache miss rate on the target platform, and furthermore, it introduces overhead due to the additional for loops. Thus, loop tiling should not be applied to this DNN application for low image dimensions. However, our experiments also show that for larger image dimensions, loop tiling does help to improve the efficiency by reducing the cache load miss rate.
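To make the transformation concrete, the following is a minimal Python sketch of loop tiling applied to a 2D convolution loop nest. It is not the LIDE-C actor code: the function names `conv2d_plain` and `conv2d_tiled`, the square image and kernel, and the "valid" (no padding, no kernel flip) form of the convolution are all assumptions made for illustration. The two extra outer loops over tile origins are the additional for loops referred to above; with a small `tile` value they execute many times, which is one plausible source of the overhead observed for small tile sizes.

```python
def conv2d_plain(image, kernel):
    """Direct 'valid' 2D convolution (no kernel flip) on square nested lists."""
    d, k = len(image), len(kernel)
    out_dim = d - k + 1
    out = [[0.0] * out_dim for _ in range(out_dim)]
    for r in range(out_dim):
        for c in range(out_dim):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out


def conv2d_tiled(image, kernel, tile):
    """Same computation, with the output iteration space split into tile x tile blocks."""
    d, k = len(image), len(kernel)
    out_dim = d - k + 1
    out = [[0.0] * out_dim for _ in range(out_dim)]
    # Two additional loops introduced by tiling: iterate over tile origins.
    for r0 in range(0, out_dim, tile):
        for c0 in range(0, out_dim, tile):
            # Inner loops stay within one tile, so the touched input rows and
            # columns are reused before the computation moves to the next tile.
            for r in range(r0, min(r0 + tile, out_dim)):
                for c in range(c0, min(c0 + tile, out_dim)):
                    acc = 0.0
                    for i in range(k):
                        for j in range(k):
                            acc += image[r + i][c + j] * kernel[i][j]
                    out[r][c] = acc
    return out
```

Both functions visit each output pixel exactly once and accumulate in the same order, so they return identical results for any tile size of at least 1; only the traversal order over output pixels, and therefore the memory access pattern, changes.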
5.1.2. Buffer memory management
Fig. 9 shows the amount of memory required for data storage in each DNN layer. We report memory requirements in this section in terms of pixels. In our experiments, we used a 4-byte floating point data type for each pixel. Fig. 9 also shows the amount of data communication that is needed between adjacent layers, and the amount of memory that must be active simultaneously during the computation associated with each layer. The memory needed for input is calculated as input_image_size × number_of_input_images. The memory needed for execution of each layer is calculated as input_image_size × (number_of_input_images + 1) × number_of_output_feature_maps.
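The two expressions above can be written as small helper functions. This is only a sketch: the function names and the example layer dimensions (a 48 × 48 input image, 32 input images, 32 output feature maps) are illustrative assumptions, not values taken from the description of the network in this section. The 4-byte pixel size is the one stated in the text.

```python
BYTES_PER_PIXEL = 4  # 4-byte floating point pixels, as stated in the text


def input_memory(input_image_size, number_of_input_images):
    """Pixels needed to hold the inputs of a layer."""
    return input_image_size * number_of_input_images


def execution_memory(input_image_size, number_of_input_images,
                     number_of_output_feature_maps):
    """Pixels that must be available while a layer executes."""
    return (input_image_size * (number_of_input_images + 1)
            * number_of_output_feature_maps)


# Hypothetical layer: 48 x 48 pixel inputs, 32 input images, 32 output maps.
size = 48 * 48
in_pixels = input_memory(size, 32)
exec_pixels = execution_memory(size, 32, 32)
print(in_pixels, exec_pixels, exec_pixels * BYTES_PER_PIXEL)
```

Multiplying a pixel count by `BYTES_PER_PIXEL` converts the figures reported in this section into bytes.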
As we can see from Fig. 9, the processing in Layer 2 requires the largest amount of active memory, and a minimum of 2,525,184 pixels must be allocated for buffer storage. The memory size can be optimized subject to this constraint through the application of shared FIFOs, which were introduced in Section 4.1.2. The buffer memory allocation that we propose for this DNN application based on shared FIFOs is illustrated in Fig. 10.
Table 4 summarizes the memory requirements for dataflow edges (FIFO buffers) and actors in the two convolutional layers, which require most of the memory among the five layers. These memory requirements are shown both with and without the use of shared FIFOs. As discussed in Section 4.1.2, actors operate on data from shared input FIFOs directly, without copying the data to their internal memory. Thus, convolution actors only need memory for their intermediate computation results. Add and Maxpool&ReLU actors do not require additional memory. The results presented in this table quantitatively demonstrate the utility of shared FIFOs for this application. In particular, the application of shared FIFOs reduces the memory requirements by 65%.
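The totals and the 65% figure can be rechecked directly from the per-column entries of Table 4. The sketch below only redoes that arithmetic; the variable names are ours, and the conversion to bytes assumes the 4-byte pixel format mentioned earlier in this section.

```python
# Per-column memory requirements in pixels, copied from Table 4
# (FIFOs, then Conv./Add/Maxpool&ReLU for convolutional layers 1 and 2).
common_fifos = [6_875_136, 1_806_720, 1_179_648, 368_640,
                5_128_192, 2_433_024, 92_160]
shared_fifos = [2_525_184, 921_984, 0, 0, 2_768_896, 0, 0]
BYTES_PER_PIXEL = 4  # assumption carried over from the text

total_common = sum(common_fifos)
total_shared = sum(shared_fifos)
ratio_pct = 100.0 * total_shared / total_common
reduction_pct = 100.0 - ratio_pct

print(f"common: {total_common} pixels ({total_common * BYTES_PER_PIXEL} bytes)")
print(f"shared: {total_shared} pixels ({total_shared * BYTES_PER_PIXEL} bytes)")
print(f"shared/common: {ratio_pct:.1f}%, reduction: {reduction_pct:.1f}%")
```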
5.2. Hardware implementation
In this section, we investigate trade-offs among the variants of the SFM design that were introduced in Section 4.2.2. STMCM and the underlying LIDE-V approach allow one to perform such trade-off exploration, based on different combinations of high-level optimization techniques, in a systematic manner. In particular, STMCM allows the designer to focus on different strategies for instantiating, configuring, and coordinating different combinations of actor and buffer (edge) implementations, and eliminates the need for modification inside the actor and edge implementations. We exploited these advantages of STMCM when deriving the results presented in this section.
Table 5 depicts resource utilization data that is extracted from the post-place and route reports generated by the Xilinx Vivado tool using the targeted Zynq Z-7020 SoC. From the results in Table 5, we see that the different design variants all exhibit similar levels of resource cost. The asynchronous designs SFM_a and SFM_aCG incur the highest resource costs due to the additional logic required by the CDC FIFOs. The number of BUFGs varies significantly among the different designs, depending on the number of clock domains and the number of clock gated actors.
Each of the implemented designs has been simulated in order to generate a switching activity file, which has been back-annotated to Vivado Power Estimation to extract power consumption data. Since the designs have different execution times, the energy consumption levels do not vary in the same proportions as the power consumption levels. Table 6 summarizes the power consumption, execution time and energy consumption of the six alternative designs.
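The relation between the power, time and energy columns of Table 6 is simply energy = power × time. The sketch below recomputes the energy for two rows of that table to show why the power and energy percentages differ when the execution time changes; the helper names are ours, and the input numbers are copied from Table 6.

```python
def energy_uj(power_mw, time_ns):
    """Energy in microjoules from average power (mW) and execution time (ns)."""
    # 1 mW * 1 ns = 1e-12 J = 1e-6 uJ
    return power_mw * time_ns * 1e-6


def percent_change(value, baseline):
    """Signed percentage difference of value with respect to baseline."""
    return 100.0 * (value - baseline) / baseline


# (power [mW], time [ns]) for the baseline and one asynchronous variant.
designs = {
    "SFM": (115, 2_329_165),
    "SFM_a_5": (89, 2_407_300),
}

base_power, base_time = designs["SFM"]
base_energy = energy_uj(base_power, base_time)
for name, (power, time_ns) in designs.items():
    energy = energy_uj(power, time_ns)
    print(f"{name}: {energy:.1f} uJ, "
          f"power {percent_change(power, base_power):+.2f}%, "
          f"energy {percent_change(energy, base_energy):+.2f}%")
```

For SFM_a_5, the power drops by about 22.6% while the energy drops by only about 20.0%, because the execution time is about 3.4% longer than that of the baseline.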
In these experiments, the clock frequencies of the synchronous designs and of Region 1 (CLK 1) in the asynchronous designs are all set to 100 MHz, which is the maximum achievable frequency for the targeted platform. For Region 2 (CLK 2) in the asynchronous designs, the frequency is set to 5 MHz. This setting of 5 MHz is derived from the hardware profiling data (see Table 3) as 1/20 of CLK 1. These clock frequencies are specified in Table 6 with the suffix _F, where F represents the frequency value in MHz.
Fig. 8. Performance evaluation of convolution actors with different image dimensions: (a) 48 × 48, (b) 96 × 96, (c) 750 × 750, (d) 1500 × 1500.
Fig. 9. Buffer memory and communication requirements in the DNN architecture.
Fig. 10. Buffer memory allocation for the DNN application.
Table 6
Dynamic power consumption, execution time and energy consumption of the different SFM variants. In parentheses: the percentage difference with respect to the baseline SFM.
           Power [mW]    Time [ns]            Energy [µJ]
SFM        115           2,329,165            268
SFM_a_5    89 (-22.61)   2,407,300 (+3.354)   214 (-20.01)
SFM_CG     89 (-22.61)   2,329,245 (+0.003)   207 (-22.61)
SFM_aCG_5  88 (-23.48)   2,408,100 (+3.389)   212 (-20.89)
SFM_h      117 (+1.74)   2,329,155 (-0.000)   273 (+1.74)
SFM_hCG    105 (-8.70)   2,329,175 (+0.000)   244 (-8.70)
SFM_auto   113 (-1.74)   2,329,165 (+0.000)   263 (-1.74)
According to Table 6, the clock gated designs SFM_CG and SFM_aCG_5 have the best capabilities for saving energy, reducing the total energy consumption by 22.61% and 20.89%, respectively. Design SFM_aCG_5 saves less energy than SFM_CG since the former employs one more BUFG. Furthermore, in SFM_aCG_5, the actors in the slower domain (Region 2) are active for a relatively large portion of the execution time, and thus, they cannot be switched off for large proportions of time. In contrast, according to Table 3, the Deinterleave actor in Region 1 can be switched off for almost 90% of the total execution time.
The designs SFM_aCG_5 and SFM_a_5, both of which employ two clock domains with CLK 1 at 100 MHz and CLK 2 at 5 MHz, have similar capabilities to save energy. The former design is slightly more energy efficient compared to the latter. The results for these two designs show that the energy saved by switching off the actors, when inactive, and also the saving of the unused logic in the CDC FIFOs counterbalance the energy overhead due to the additional circuitry.
As expected, SFM_h has a small amount of energy overhead due to the logic necessary to encapsulate Sum1, Sum2 and Maxpool&ReLU into the hierarchical actor. The design SFM_hCG, among the clock gated designs, is not as advantageous as the previously analyzed designs in terms of energy saving. This is because even though it employs only three BUFGs, the hierarchical actor is switched off only when none of the underlying actors are working. This means that, for instance, while Sum1 is active, the actors Sum2 and Maxpool&ReLU will have an active clock even when the actors are in an idle state (so that they keep wasting energy). Finally, SFM_auto is the design with the smallest energy saving, only 1.74% compared to SFM. Even considering the same optimization technique (clock gating), the level on which it is applied turns out to be fundamental: at a low level (single flip-flops in SFM_auto) only the dynamic power of a restricted number of gates can be saved. On the other hand, at a coarse-grain level (groups of dataflow actors in SFM_CG), it is possible to act also on the clock tree, which is highly effective for improving power saving.
Table 7
Resource occupancy for different SFM accelerator implementations. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA. The bottom part of the table depicts the percentage of variation with respect to HW1-mm.
           LUTs            REGs            BRAMs        DSPs
Available  53200           106400          140          220
HW1-mm     5395 (10.14)    4668 (4.39)     43 (30.71)   13 (5.91)
HW2-mm     10890 (20.47)   8197 (7.70)     54 (38.57)   26 (11.82)
HW4-mm     21474 (40.36)   16331 (15.35)   76 (54.29)   52 (23.64)
Variation with respect to HW1-mm [%]
HW2-mm     +101.85         +75.60          +25.58       +100.00
HW4-mm     +298.04         +249.85         +76.74       +300.00
Table 8
Performance of different co-design solutions. The top part of the table depicts execution time in milliseconds (ms). The bottom part depicts the percentage of execution time variation for each configuration with respect to SW1.
                      input    Layer 1  Layer 2  Layers 3:5  Prediction
Execution time [ms]
SW1                   118.9    640.3    1594.7   34.4        2388.2
SW2(L1)               118.7    368.3    1609.8   34.0        1639.5
SW2(L1,L2)            117.4    354.7    842.1    33.8        1348.0
SW2(L2)/HW1(L1)-mm    118.9    118.5    856.7    35.4        1129.5
SW2(L2)/HW2(L1)-mm    118.4    74.6     866.5    35.1        1094.5
SW2(L2)/HW4(L1)-mm    117.9    54.5     859.0    35.4        1066.8
Variation with respect to SW1 [%]
SW2(L1)               -0.13    -42.48   +0.95    +1.12       -10.75
SW2(L1,L2)            -1.22    -44.61   -47.19   -1.72       -43.56
SW2(L2)/HW1(L1)-mm    -0.00    -81.49   -46.28   +2.89       -52.71
SW2(L2)/HW2(L1)-mm    -0.40    -88.36   -45.66   +2.09       -54.17
SW2(L2)/HW4(L1)-mm    -0.81    -91.48   -46.13   +2.85       -55.33
5.3. Hardware/software co-design results
In this section, we investigate different hardware/software co-design configurations. As anticipated in Section 4.1, depending on the portion of the application that is accelerated in hardware and on the given requirements and constraints, different design choices regarding the hardware/software communication interface lead to different trade-offs between resource requirements and performance. For the SFM accelerator, we investigated several implementation and optimization solutions, exploring three key aspects: exploiting parallelism, communication interfaces and local buffering (see Section 4.3). In this section, by an SFM accelerator, we mean specifically a hardware accelerator.
Different software and hardware configurations that we explored in our co-design exploration are summarized as follows.
• SW1 — The application runs in software on a single ARM core. This
design can be viewed as a baseline design without any optimization
or hardware acceleration. Comparisons between this baseline design
and alternative designs are discussed in the remainder of this section.
• SW2 — The application runs in software by using both of the ARM
cores on the target platform. 
• HW1 — A single-branch SFM accelerator is employed to execute the
ﬁrst convolutional layer. 
• HW2 — An SFM accelerator with two parallel branches. In this con-
ﬁguration, a local buﬀer is shared between the branches. 
• HW4 — An SFM accelerator with four parallel branches. Again, a
local buﬀer is shared among the branches. 
For multicore software implementations and hardware implementations with multiple branches, the layer or layers that are executed in parallel (i.e., intra-layer parallelism is exploited) are indicated in parentheses. Similarly, hardware configurations are annotated with -mm or -s depending, respectively, on whether a memory-mapped AXI-lite communication interface is used, or a FIFO-based AXI-stream interface is used.
For example, SW2(L1, L2) represents a software-only implementation in which intra-layer parallelism is exploited in both layer 1 and layer 2. As another example, SW2(L2)/HW2(L1)-mm represents a hardware/software implementation based on configurations SW2 and HW2; in this implementation, layer 2 is executed across multiple cores, layer 1 is parallelized in hardware with 2 parallel branches, and AXI-lite is used as the communication interface.
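The naming convention described above is regular enough to be decoded mechanically. The following parser is a hypothetical helper written for illustration only; it is not part of STMCM, and the dictionary layout it returns (including the key name `parallelism`, which stands for the number of cores for SW parts and the number of branches for HW parts) is our own choice.

```python
import re

# One part of a label, e.g. "SW2(L1, L2)" or "HW4(L1)" or "SW1".
_PART = re.compile(r"^(SW|HW)(\d+)(?:\(([^)]*)\))?$")

# Interface suffixes as defined in the text.
INTERFACES = {"mm": "memory-mapped AXI-lite", "s": "FIFO-based AXI-stream"}


def parse_configuration(label):
    """Decode a label such as 'SW2(L2)/HW2(L1)-mm' into its components."""
    body = label.strip()
    interface = None
    for suffix, name in INTERFACES.items():
        if body.endswith("-" + suffix):
            interface = name
            body = body[: -(len(suffix) + 1)]
            break
    parts = []
    for token in body.split("/"):
        match = _PART.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognized configuration part: {token!r}")
        kind, degree, layers = match.groups()
        layer_ids = []
        if layers:
            layer_ids = [int(item.strip().lstrip("L")) for item in layers.split(",")]
        parts.append({"kind": kind, "parallelism": int(degree), "layers": layer_ids})
    return {"parts": parts, "interface": interface}


print(parse_configuration("SW2(L2)/HW2(L1)-mm"))
```

A label without a suffix, such as SW2(L1, L2), yields `interface` set to `None`, which matches the fact that software-only configurations do not involve a hardware/software interface.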
Note that the SFM accelerators are able to execute only the first convolutional layer. Thus, in all of the DNN system implementations, the accelerators are coupled with one of the software configurations.
5.3.1. Resource costs of accelerator implementations
Table 7 depicts the resource occupancy in the targeted Zynq Z-7020 device for the different SFM accelerator implementations that we experimented with. As expected, a higher level of parallelism (going from HW1-mm to HW4-mm) requires more resources, and our experiments here help to quantify the associated trends. For example, fine-grained and computation-related resources (LUTs, REGs and DSPs) increase approximately linearly with the number of parallel branches placed in the accelerator (about +100% with one more branch and about +300% with three more branches), while coarse-grained memory resources (BRAMs) exhibit a gentler slope. We expect that this gentler slope results because the primary BRAM-demanding module, the local buffer, is shared across parallel branches.
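The variation rows of Table 7 follow from the absolute resource counts. As a small sketch (the data layout and the function name are ours; the numbers are copied from Table 7), the percentages can be recomputed as follows.

```python
# Absolute resource counts from Table 7.
resources = {
    "HW1-mm": {"LUTs": 5395, "REGs": 4668, "BRAMs": 43, "DSPs": 13},
    "HW2-mm": {"LUTs": 10890, "REGs": 8197, "BRAMs": 54, "DSPs": 26},
    "HW4-mm": {"LUTs": 21474, "REGs": 16331, "BRAMs": 76, "DSPs": 52},
}


def variation_pct(value, baseline):
    """Percentage increase of value with respect to baseline."""
    return 100.0 * (value - baseline) / baseline


baseline = resources["HW1-mm"]
for name in ("HW2-mm", "HW4-mm"):
    row = {kind: variation_pct(count, baseline[kind])
           for kind, count in resources[name].items()}
    print(name, {kind: f"{pct:+.2f}" for kind, pct in row.items()})
```

The BRAM counts grow by 11 from one to two branches and by 22 from two to four, that is, by a constant amount per added branch on top of a fixed base, which is consistent with the explanation that the local buffer is shared among the branches.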
The results above indicate that when the DNN architecture is made deeper (i.e., as the number of convolutional layers is increased), the biggest restriction will be the hardware resource limitations. Usually, as a DNN is made deeper, more parallel branches are needed to complete the computation without compromising the processing speed and more memory resources are needed to store the intermediate feature maps. However, deeper networks do not necessarily imply more computational complexity. For example, the well-known ResNet101, which has 101 layers, needs less computation than the 16-layer VGG16 because the VGG layers are significantly larger [39,40].
5.3.2. Comparison of co-design solutions
Table 8 presents performance results for different software-only and hardware/software solutions that we investigated using STMCM. In particular, the table reports the execution time in terms of milliseconds (ms) for different execution phases: reading the input file (column input), computing the first and the second layers, and computing the deep layers (Layers 3, 4 and 5). The table also reports the execution time of the overall application (prediction) for different degrees of software and hardware parallelism.
The reference time is given by the execution of the entire DNN application on a single ARM core (SW1), which is capable of completing the prediction in about 2.4 seconds. From this reference configuration, it is also possible to appreciate the computational load of the different application phases. The heaviest part is Layer 2, which is responsible for more than 65% of the overall execution time, while most of the remaining load is attributable to Layer 1 (around 25%), and to reading of the input file (about 5%). For this reason, software parallelization has been evaluated only on Layer 1 (SW2(L1)), and on both Layers 1 and 2 together (SW2(L1,L2)).
Table 9
Differences in resource costs between communication interfaces when applied to HW1. In parentheses: percentage of utilization with respect to the resources available on the targeted FPGA. The bottom part of the table depicts the percentage utilization variation with respect to HW1-mm.
                     LUTs           REGs          BRAMs        DSPs
Available            53200          106400        140          220
HW1-mm               5395 (10.14)   4668 (4.39)   43 (30.71)   13 (5.91)
(1) HW1-s            5784 (10.87)   4357 (4.09)   43 (30.71)   13 (5.91)
(2) FIFOs (stream)   212 (0.40)     242 (0.23)    10 (7.14)    0 (0.00)
(3) DMA (stream)     1490 (2.80)    1881 (1.77)   3 (2.14)     0 (0.00)
(1)+(2)+(3)          7486 (14.07)   6480 (6.09)   56 (40.00)   13 (5.91)
Variation with respect to HW1-mm [%]
HW1-s                +7.21          -6.66         +0.00        +0.00
(1)+(2)+(3)          +38.75         +38.81        +53.49       +0.00
 
The execution time needed by each of the major execution phases is almost halved when two cores are adopted. A precise 50% reduction is not reached because of the software overhead necessary to manage multitasking. With software parallelization only, the overall execution time is reduced to about 1.35 seconds (SW2(L1,L2) in Table 8), about 44% less than the SW1 configuration. Hardware acceleration and related parallelization are only applied to the first convolutional layer, while only software parallelization is applied to Layer 2. If we consider only the execution time of layer 1, then SW2(L2)/HW1(L1) reduces execution time by more than 80% compared to SW1, and more than 65% compared to SW2(L1,L2).
If multiple branches of Layer 1 are processed in parallel, the hardware accelerator achieves further performance benefits: a time saving of up to 88% for a 2-branch configuration (SW2(L2)/HW2(L1)-mm), and up to 91% for a 4-branch configuration (SW2(L2)/HW4(L1)-mm). These performance improvements are with respect to SW1. Note that the speed-up obtained by doubling the number of branches (going from 1 to 2 and from 2 to 4) is less than 2 in either case (1.6 from 1 to 2 and 1.4 from 2 to 4). This is due to the software overhead related to managing multiple branches. Due to the limited computational load of Layer 1, the benefits of hardware acceleration and parallelization on the overall system are somewhat limited. The best solution, SW2(L2)/HW4(L1)-mm, requires 1.07 seconds to perform the whole application, 55% less than a full software execution on a single core (SW1) and 21% less than a full software execution on two cores (SW2(L1,L2)).
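The limited system-level benefit can be illustrated with Amdahl's law, which is a standard model that we bring in here for illustration; it is not an analysis performed in the experiments above. The sketch takes the Layer 1 and prediction times from Table 8 and estimates the overall time when only Layer 1 is accelerated. The function and variable names are ours.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speed-up when a fraction of the run time is sped up locally."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Times in milliseconds, copied from Table 8.
total_sw2 = 1348.0      # SW2(L1,L2), prediction
layer1_sw2 = 354.7      # SW2(L1,L2), Layer 1
layer1_hw = {1: 118.5, 2: 74.6, 4: 54.5}  # Layer 1 with 1, 2 and 4 branches

# Speed-up obtained by doubling the number of branches.
print(f"1 -> 2 branches: {layer1_hw[1] / layer1_hw[2]:.2f}x")
print(f"2 -> 4 branches: {layer1_hw[2] / layer1_hw[4]:.2f}x")

# Amdahl estimate for replacing the two-core Layer 1 with the 4-branch accelerator.
fraction = layer1_sw2 / total_sw2
local = layer1_sw2 / layer1_hw[4]
overall = amdahl_speedup(fraction, local)
estimated_total_ms = total_sw2 / overall
print(f"Layer 1 share: {100 * fraction:.1f}%, local speed-up: {local:.2f}x")
print(f"overall speed-up: {overall:.2f}x, estimated total: {estimated_total_ms:.1f} ms")
```

The estimate lands close to, but slightly below, the measured 1066.8 ms for SW2(L2)/HW4(L1)-mm, because the model assumes that all other phases keep exactly the times measured for SW2(L1,L2), whereas Table 8 shows small variations in those phases.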
Another aspect that has been studied in our co-design experiments is the interfacing between system components. As discussed in Section 4.3.2, the adopted communication interface between software and hardware portions of a design can have a significant impact on overall system performance. In our co-design experiments, we have applied two very different interfaces, mm-lite and stream, which are discussed in Section 4.3.2.
Table 9 helps to understand differences between the resource costs of these two interfaces. The first row of this table shows resource availability on the target platform. The second and third rows show resource costs for the HW1-mm accelerator and the HW1-s accelerator. The fourth and fifth rows show resource costs for FIFO and DMA modules (external to the accelerator) that are necessary for the stream interface. The sixth row shows total resource costs induced by use of the stream interface (the sums of the costs in the preceding three rows). The last two rows of the table represent percentage increases in resource costs relative to the HW1-mm accelerator.
From Table 9, we see that the HW1-mm and HW1-s accelerators alone require approximately the same amount of resources: HW1-s requires 7.21% more LUTs and 6.66% fewer REGs compared to HW1-mm. However, when the overhead due to the DMA and FIFO modules necessary for AXI-stream communication is considered, significantly more resources are required when the stream interface is used: about 38% more LUTs and REGs are required by the overall stream design ((1)+(2)+(3)), while over 50% more BRAM cost is incurred.
To make the stream interface a useful option in our system design, its significant increase in resource costs should be accompanied by tangible advantages in execution performance. Table 10 shows results pertaining to the impact of the communication interface on execution time. In order to better expose the effects of the selected communication interface, details on data transfers (input, convolution coefficients and outputs) between the hardware (accelerator) and software subsystems are reported. For the HW1-s design, two different sets of results are reported depending on whether program data is directly accessible by the DMA engine. For one set, the program data is located in a memory that is not directly accessible by the DMA. This scenario corresponds to the design that we have implemented. It requires an additional copy of the program data in a memory that is accessed directly by the DMA. For the other set, the program data is located in a memory that is directly accessible by the DMA. This set is indicated in Table 10 using the annotation HW1-s dir. We have not implemented HW1-s dir; instead, we have estimated the corresponding results to gain some idea about the maximum achievable performance. Details on the estimation approach are omitted for brevity.
Table 10
Results pertaining to the impact of the communication interface on execution time. The top part of the table depicts the execution time of the different DNN application steps. The bottom part depicts the execution time variation of each configuration with respect to HW1-mm.
                            Layer 1
           File input [ms]  input tx [µs]  coeffs tx [µs]  output tx [µs]  total [ms]  Layer 2 [ms]  Layers 3, 4, 5 [ms]  Prediction [ms]
HW1-mm     118.9            5222           15              2339            118.5       856.7         35.4                 1129.5
HW1-s      117.5            854            5               2333            114.6       864.3         34.9                 1131.3
HW1-s dir  118.3            418            3               2333            114.1       869.5         35.0                 1137.0
Variation with respect to HW1-mm [%]
HW1-s      -1.17            -83.65         -66.67          -0.26           -3.27       +0.87         -1.39                +0.16
HW1-s dir  -0.49            -92.00         -80.00          -0.26           -3.69       +1.49         -1.22                +0.66
The results in Table 10 demonstrate the utility of the resource-hungry HW1-s design, and quantify its clear ability to outperform HW1-mm in terms of data transfer. In particular, the transmission of input data and the transmission of convolution coefficients take about 84% and 67% less time, respectively, when the AXI-stream protocol is adopted. This leads to an estimated time saving of up to 92% and 80%, respectively, when the DMA has direct access to the program data (HW1-s dir).
On the other hand, the output data transmission time is nearly the same among all of the reported configurations. We expect that this is because the outputs are produced in a row-by-row fashion (48 data units at a time), and the timing of output production is determined by the computation latency, which is greater than the communication latency for all of the interfacing configurations. However, looking at the total Layer 1 and DNN application execution times, we see that the advantages of adopting the stream interface are no longer visible. Indeed, for the considered SFM accelerator, the input data is transmitted only during the first branch execution due to our use of local buffering. Additionally, even though the coefficients are transmitted for each branch, their transmission requires a relatively small amount of time.
These results involving communication interface selection illustrate the importance of comprehensive system-level evaluation of alternative design options, which is one of the key parts of the design process that is facilitated by STMCM.
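A short calculation helps to see why the faster transfers do not show up at the system level. The sketch below uses the input transfer times and Layer 1 totals reported in Table 10 and the HW1-mm prediction time; the data layout and function names are ours, and the calculation is only a back-of-the-envelope check, not part of the reported experiments.

```python
# Values copied from Table 10 (transfer times in microseconds, totals in ms).
layer1 = {
    "HW1-mm": {"input_tx_us": 5222, "total_ms": 118.5},
    "HW1-s": {"input_tx_us": 854, "total_ms": 114.6},
}
prediction_mm_ms = 1129.5  # overall prediction time with HW1-mm


def input_share_pct(row):
    """Share of the Layer 1 total that is spent transferring the input."""
    return 100.0 * (row["input_tx_us"] / 1000.0) / row["total_ms"]


saved_ms = (layer1["HW1-mm"]["input_tx_us"] - layer1["HW1-s"]["input_tx_us"]) / 1000.0
saved_vs_prediction_pct = 100.0 * saved_ms / prediction_mm_ms

for name, row in layer1.items():
    print(f"{name}: input transfer is {input_share_pct(row):.2f}% of Layer 1")
print(f"time saved on the input transfer: {saved_ms:.3f} ms "
      f"({saved_vs_prediction_pct:.2f}% of the HW1-mm prediction time)")
```

Even though the input transfer itself becomes several times faster, it accounts for only a few percent of the Layer 1 time with HW1-mm, so the absolute saving is a few milliseconds out of more than a second of total prediction time.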
6. Conclusion
In this paper, we have introduced a design methodology, called the STMC Methodology or STMCM, and an integrated set of tools and libraries that support the application of this methodology. STMCM is developed to assist designers of signal processing systems in exploring complex design alternatives that span multiple implementation scales, platform types, and dataflow modeling techniques. We have demonstrated the capabilities of STMCM through a detailed case study involving a deep neural network (DNN) for vehicle classification. The demonstration encompasses dataflow-based application modeling, profiling, embedded software optimization, hardware accelerator design, hardware/software co-design, and hardware/software interface design, all in the context of mapping the given DNN into an efficient implementation on a resource-constrained, system-on-chip platform. Through this case study, it is shown how STMCM provides a unified, model-based framework for conducting comprehensive empirical evaluations of diverse hardware/software design alternatives. Through its application of lightweight dataflow techniques, STMCM is complementary to dataflow tools that emphasize specialized design flows and high degrees of automation. Useful directions for future work involve applying STMCM in novel ways that exploit these complementary relationships. Additionally, we believe that an automatic code generator producing the corresponding hardware/software co-design code given the hyperparameters such as the number of layers and/or number of feature maps would be very impactful. Implementation criteria could be integrated such that the generated network can be optimized based on different constraints and objectives.
Acknowledgments
This research was supported in part by Business Finland (FiDiPro project StreamPro1846/31/2014); US National Science Foundation (CNS1514425); H2020 Program CERBERO (# 732105), ALOHA (# 780788), FitOptiVis (# 783162) Projects; and the Sardinian Regional Project PROSSIMO (POR FESR 2014/20-ASSE I).
References
[1] S.S. Bhattacharyya, E. Deprettere, R. Leupers, J. Takala (Eds.), Handbook of signal processing systems, Springer, 2013.
[2] S. Ha, J. Teich (Eds.), Handbook of hardware/software codesign, Springer, 2017.
[3] C. Shen, W. Plishker, H. Wu, S.S. Bhattacharyya, A lightweight dataflow approach for design and implementation of SDR systems, in: Proceedings of the Wireless Innovation Conference and Product Exposition, 2010, pp. 640–645.
[4] W. Plishker, N. Sane, M. Kiemb, S.S. Bhattacharyya, Heterogeneous design in functional DIF, in: Proceedings of the International Workshop on Systems, Architectures, Modeling, and Simulation, Samos, Greece, 2008, pp. 157–166.
[5] J.T. Buck, E.A. Lee, Scheduling dynamic dataflow graphs using the token flow model, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1993.
[6] G. Bilsen, M. Engels, R. Lauwereins, J.A. Peperstraete, Cyclo-static dataflow, IEEE Trans. Signal Process. 44 (2) (1996) 397–408.
[7] E.A. Lee, D.G. Messerschmitt, Synchronous dataflow, Proc. IEEE 75 (9) (1987) 1235–1245.
[8] J. Eker, J.W. Janneck, Dataflow programming in CAL – balancing expressiveness, analyzability, and implementability, in: Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, 2012, pp. 1120–1124.
[9] H. Yviquel, A. Lorence, K. Jerbi, G. Cocherel, A. Sanchez, M. Raulet, Orcc: multimedia development made easy, in: Proceedings of the ACM International Conference on Multimedia, 2013, pp. 863–866.
10] J. Sérot , F. Berry , S. Ahmed , CAPH: A Language for Implementing Stream-processing
Applications on FPGAs, in: P. Athanas, D. Pnevmatikatos, N. Sklavos (Eds.), Embed-
ded Systems Design with FPGAs, Springer, 2013 . 
11] J. Mcallister , R. Woods , R. Walke , D. Reilly , Multidimensional DSP core synthesis
for FPGA, J VLSI Signal Process Syst Signal Image Video Technol 43 (2–3) (2006) . 
12] M. Pelcat , P. Menuet , S. Aridhi , J.-F. Nezan , Scalable compile-time scheduler for mul-
ti-core architectures, in: Proceedings of the Design, Automation and Test in Europe
Conference and Exhibition, 2009, pp. 1552–1555 . 
13] C. Haubelt , J. Falk , J. Keinert , T. Schlichter , M. Streubühr , A. Deyhle , A. Hadert ,
J. Teich , A systemc-based design methodology for digital signal processing systems,
EURASIP J. Embed. Syst. 2007 (2007) 22 . Article ID 47580. 17 14] S. Lin , Y. Liu , W. Plishker , S.S. Bhattacharyya , A design framework for mapping
vectorized synchronous dataﬂow graphs onto CPU–GPU platforms, in: Proceedings
of the International Workshop on Software and Compilers for Embedded Systems,
Sankt Goar, Germany, 2016, pp. 20–29 . 
15] S. Casale-Brunet , M. Wiszniewska , E. Bezati , M. Mattavelli , J.W. Janneck , M. Canale ,
TURNUS: An open-source design space exploration framework for dynamic stream
programs, in: Proceedings of the Conference on Design and Architectures for Signal
and Image Processing, 2014, pp. 1–2 . 
16] C. Sau, et al., Automated design ﬂow for multi-functional dataﬂow-based platforms,
J. Signal Process. Syst. (2015) 1–23 . doi: 10.1007/s11265-015-1026-0 . 
17] F. Palumbo , T. Fanni , C. Sau , P. Meloni , Power-awarness in coarse-grained reconﬁg-
urable multi-functional architectures: a dataﬂow based strategy, J. Signal Process.
Syst. 87 (1) (2017) 81–106 . 
18] T. Fanni , C. Sau , P. Meloni , L. Raﬀo , F. Palumbo , Power and clock gating modelling
in coarse grained reconﬁgurable systems, in: Proceedings of the ACM International
Conference on Computing Frontiers, 2016, pp. 384–391 . 
19] S.C. Brunet , E. Bezati , C. Alberti , M. Mattavelli , E. Amaldi , J.W. Janneck , Multi-clock
domain optimization for reconﬁgurable architectures in high-level dataﬂow appli-
cations, in: Proceedings of the IEEE Asilomar Conference on Signals, Systems, and
Computers, 2013, pp. 1796–1800 . 
20] E. Bezati , S.C. Brunet , M. Mattavelli , J.W. Janneck , Coarse grain clock gating of
streaming applications in programmable logic implementations, in: Proceedings of
the Electronic System Level Synthesis Conference, 2014, pp. 1–6 . 
21] T. Fanni , L. Li , T. Viitanen , C. Sau , R. Xie , F. Palumbo , L. Raﬀo , H. Huttunen ,
J. Takala , S.S. Bhattacharyya , Hardware design methodology using lightweight
dataﬂow and its integration with low power techniques, J. Syst. Archit. 78 (2017)
15–29 . 
22] S. Lin , Y. Liu , K. Lee , L. Li , W. Plishker , S.S. Bhattacharyya , The DSPCAD Framework
for Modeling and Synthesis of Signal Processing Systems, in: S. Ha, J. Teich (Eds.),
Handbook of Hardware/Software Codesign, Springer, 2017, pp. 1–35 . 
23] L. Li , A. Ghazi , J. Boutellier , L. Anttila , M. Valkama , S.S. Bhattacharyya , Evolutionary
multiobjective optimization for digital predistortion architectures, in: Proceedings of
the International Conference on Cognitive Radio Oriented Wireless Networks, 2016,
pp. 498–510 . 
24] L. Li , A. Sapio , J. Wu , Y. Liu , K. Lee , M. Wolf , S.S. Bhattacharyya , Design and imple-
mentation of adaptive signal processing systems using Markov decision processes,
in: Proceedings of the International Conference on Application Speciﬁc Systems, Ar-
chitectures, and Processors, Seattle, Washington, 2017, pp. 170–175 . 
25] B. Bhattacharya , S.S. Bhattacharyya , Parameterized dataﬂow modeling for DSP sys-
tems, IEEE Trans. Signal Process. 49 (10) (2001) 2408–2421 . 
[26] S. Lin, J. Wu, S.S. Bhattacharyya, Memory-constrained vectorization and scheduling
of dataflow graphs for hybrid CPU-GPU platforms, ACM Trans. Embedded Comput.
Syst. 17 (2) (2018) 50:1–50:25.
[27] R. Xie, H. Huttunen, S. Lin, S.S. Bhattacharyya, J. Takala, Resource-constrained
implementation and optimization of a deep neural network for vehicle classification,
in: Proceedings of the European Signal Processing Conference, Budapest, Hungary,
2016, pp. 1862–1866.
[28] A.H. Ghamarian, M.C.W. Geilen, S. Stuijk, T. Basten, A.J.M. Moonen, M.J.G. Bekooij,
B.D. Theelen, M.R. Mousavi, Throughput analysis of synchronous data flow graphs,
in: Proceedings of the International Conference on Application of Concurrency to
System Design, 2006.
[29] L. Li, T. Fanni, T. Viitanen, R. Xie, F. Palumbo, L. Raffo, H. Huttunen, J. Takala,
S.S. Bhattacharyya, Low power design methodology for signal processing systems
using lightweight dataflow techniques, in: Proceedings of the Conference on Design
and Architectures for Signal and Image Processing, Rennes, France, 2016, pp. 81–88.
[30] H. Huttunen, F. Yancheshmeh, K. Chen, Car type recognition with deep neural networks,
arXiv e-prints (2016), to appear in Proceedings of the IEEE Intelligent Vehicles
Symposium 2016, arXiv:1602.07125v2.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick,
Microsoft COCO: Common objects in context, in: Proceedings of the European
Conference on Computer Vision, 2014, pp. 740–755.
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale
hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2009, pp. 248–255.
[33] T. Chen, et al., DianNao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning, in: Symposium on Architectural Support for Programming
Languages and Operating Systems, 2014, pp. 269–284.
[34] P.K. Murthy, S.S. Bhattacharyya, Shared buffer implementations of signal processing
systems using lifetime analysis techniques, IEEE Trans. Comput. Aided Des. Integr.
Circuits Syst. 20 (2) (2001) 177–198.
[35] H. Oh, S. Ha, Memory-optimized software synthesis from dataflow program graphs
with large size data samples, EURASIP J. Appl. Signal Process. 2003 (6) (2003).
[36] K. Desnos, M. Pelcat, J.-F. Nezan, S. Aridhi, Buffer merging technique for minimizing
memory footprints of synchronous dataflow specifications, in: Proceedings of
the International Conference on Acoustics, Speech, and Signal Processing, 2015,
pp. 1111–1115.
[37] H.-J. Koch, The Userspace I/O HOWTO, Linutronix, 2006.
[38] J. Silva, V. Sklyarov, I. Skliarova, Comparison of on-chip communications in
Zynq-7000 all programmable systems-on-chip, IEEE Embed. Syst. Lett. 7 (1) (2015)
31–34.
[39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2016.
[40] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image
recognition, 2014, arXiv:1409.1556.
 
Lin Li is a Ph.D. student in the DSPCAD research group in the Department of
Electrical and Computer Engineering at the University of Maryland, College Park,
USA. She received her Bachelor’s degree in Electrical Engineering and Automation
from Fudan University, Shanghai, China. Her research has focused on dataflow-based
frameworks for the design, implementation, and optimization of signal processing
systems, including wireless communication systems and machine learning systems.
Carlo Sau is currently an Assistant Professor at the University of Cagliari. He
received his degree in Electronic Engineering in 2012 and his Ph.D. in 2016, both
from the University of Cagliari. Since 2012, he has been working on automated
methodologies for generating dataflow-based reconfigurable platforms. His main
research focus is reconfigurable system design and the development of code
generation tools for advanced reconfigurable hardware architectures.
Tiziana Fanni is a Ph.D. student at the Department of Electrical and Electronic
Engineering of the University of Cagliari. She received her degree in Electronic
Engineering in 2014 at the University of Cagliari. In June 2014 she started a
one-year research grant related to power saving methodologies in dataflow-based
reconfigurable platforms. Her main research focus is reconfigurable systems design
and the development of code generation tools for low power reconfigurable
hardware architectures.
Jingui Li received his Bachelor’s degree (with honors) in Electronic Science and
Technology from Hefei University of Technology, China. He received his Master’s
degree from Tampere University of Technology (TUT). He has worked on a project
that aims to apply dataflow techniques in hardware design using Verilog HDL.
Timo Viitanen received his M.Sc. degree in Embedded Systems from Tampere
University of Technology (TUT) in 2013, and is now a graduate student at the
Department of Pervasive Computing at TUT. He is the recipient of a TUT graduate
school position and was awarded the Nokia Scholarship in 2014. His research
interests include computer architecture and computer graphics.
François Christophe works as a University Researcher at the Department of
Computer Science of the University of Helsinki. His research interests include
computational models for the simulation of complex systems, semi-formal modelling,
and Artificial Intelligence. In 2007, after receiving a double Master’s degree in
Computer and Software Engineering from Brest National Engineering School (France)
and in Artificial Intelligence and Image from University of Rennes I, he decided
to pursue doctoral studies in Systems Engineering at Helsinki University of
Technology, Finland. He received his Ph.D. degrees from Aalto University (Finland)
and Nantes Centrale Engineering School (France) in 2012. He worked as a
post-doctoral researcher at Aalto University from 2012 to 2014 and in the
Department of Pervasive Computing at Tampere University of
Technology from 2014 to 2017.
Francesca Palumbo is an Assistant Professor at the University of Sassari, within
the Information Engineering unit of the Department of Political Sciences,
Communication Sciences and Information Engineering. She received her summa cum
laude “Laurea Degree” in Electronic Engineering in 2005 at the University of
Cagliari, then attended the Master Advanced in Embedded System Design in 2006 at
the Advanced Learning and Research Institute of the University of Lugano before
starting her Ph.D. in Electronic and Computer Engineering at the University of
Cagliari. Her research focus is related to reconfigurable systems and to code
generation tools and design automation strategies for advanced reconfigurable
hardware architectures. For her studies in the fields of dataflow-based
programming and hardware customization, she received two Best Paper Awards at the
Conference on Design and Architectures for Signal and Image Processing, in 2011
and in 2015 respectively, for the works entitled “The Multi-Dataflow Composer
tool: A runtime reconfigurable HDL platform composer” and “MPSoCs for real-time
neural signal decoding: A low-power ASIP-based implementation”. Dr. Palumbo
serves on several Technical Committees of international conferences, and she is a
permanent Steering Committee Member of the ACM Conference on Computing Frontiers
and Associate Editor of the Springer Journal of Signal Processing Systems. At
the moment, Dr. Palumbo is the scientific coordinator of the CERBERO H2020
European Project on Smart Cyber Physical System Design and the Scientific Director
of the Summer School entitled “Designing Cyber-Physical Systems - From concepts to
implementation” that was held in Alghero in September 2017 and will be organized
again in 2018.
Luigi Raffo is a full professor of Electronics at the Department of Electrical
and Electronic Engineering - University of Cagliari (ITALY). He received the
“Laurea degree” in Electronic Engineering at the University of Genoa (ITALY) in
1989 and the Ph.D. degree in Electronics and Computer Science at the same
university in 1994. In 1994 he joined the Department of Electrical and Electronic
Engineering of the University of Cagliari (ITALY) as an assistant professor; he
became an associate professor in 1998 and has been a full professor of electronics
since 2006. He teaches courses on system/digital and analog electronic design and
processor architectures for the courses of study in Electronic and Biomedical
Engineering. He was coordinator of the project EU IST-FET - IST-2001-39266 - BEST
and he was unit coordinator of the project EU IST-FET - SHAPES - Scalable Software
Hardware Architecture Platform for Embedded Systems. He has been local coordinator
of industrial projects in the field (among others: ST-Microelectronics - Extension
of ST200 architecture for ARM binary compatibility, ST-Microelectronics - Network
on chip). He is responsible for cooperation programs in the field of embedded
systems with several other European Universities. He was coordinator of the
MADNESS EU Project (FP7/2007-2013) and local coordinator in the ASAM (ARTEMIS-JU)
and ALBA projects (nationally funded project) and RPCT (regionally funded project).
Heikki Huttunen is an associate professor at Tampere University of Technology,
Finland. He received his M.Sc. and Ph.D. degrees from the University of Tampere
and Tampere University of Technology in 1995 and 1999, respectively. He leads the
Machine Learning Group, and his research interests are in machine learning
deployment, bringing real-time machine learning to embedded and mobile devices.
Jarmo Takala received his M.Sc. (hons) degree in Electrical Engineering and
Dr.Tech. degree in Information Technology from Tampere University of Technology,
Tampere, Finland (TUT) in 1987 and 1999, respectively. From 1992 to 1995, he was
a Research Scientist at VTT-Automation, Tampere, Finland. Between 1995 and 1996,
he was a Senior Research Engineer at Nokia Research Center, Tampere, Finland.
From 1996 to 1999, he was a Researcher at TUT. Since 2000, he has been Professor
in Computer Engineering at TUT, and he is currently Dean of the Faculty of
Computing and Electrical Engineering of TUT. Dr. Takala is Co-Editor-in-Chief for
Springer Journal on Signal Processing Systems. During 2007–2011 he was Associate
Editor and Area Editor for IEEE Transactions on Signal Processing, and in
2012–2013 he was the Chair of IEEE Signal Processing Society’s Design and
Implementation of Signal Processing Systems Technical Committee. His research
interests include circuit techniques, parallel architectures, and design
methodologies for digital signal processing systems.
Shuvra S. Bhattacharyya is a Professor in the Department of Electrical and
Computer Engineering at the University of Maryland, College Park. He holds a
joint appointment in the University of Maryland Institute for Advanced Computer
Studies (UMIACS). He is also a part-time visiting professor in the Department of
Pervasive Computing at the Tampere University of Technology, Finland, as part of
the Finland Distinguished Professor Programme (FiDiPro). His research interests
include signal processing, embedded systems, electronic design automation,
wireless communication, and wireless sensor networks. He received the B.S. degree
from the University of Wisconsin at Madison, and the Ph.D. degree from the
University of California at Berkeley. He has held industrial positions as a
Researcher at the Hitachi America Semiconductor Research Laboratory (San Jose,
California), and Compiler Developer at Kuck & Associates (Champaign, Illinois).
He has held a visiting research position at the US Air Force Research Laboratory
(Rome, New York). He has been a Nokia Distinguished Lecturer (Finland) and
Fulbright Specialist (Austria and Germany). He has received the NSF Career Award
(USA). He is a Fellow of the IEEE.
