Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core
  Platforms With Network-on-Chip Interconnect by Bytyn, Andreas et al.
Andreas Bytyn, René Ahlsdorf, Rainer Leupers, and Gerd Ascheid
Abstract—Machine intelligence, especially using convolutional
neural networks (CNNs), has become a large area of research
over the past years. Increasingly sophisticated hardware acceler-
ators are proposed that exploit e.g. the sparsity in computations
and make use of reduced precision arithmetic to scale down the
energy consumption. However, future platforms require more
than just energy efficiency: Scalability is becoming an increas-
ingly important factor. The required effort for physical imple-
mentation grows with the size of the accelerator making it more
difficult to meet target constraints. Using many-core platforms
consisting of several homogeneous cores can alleviate the afore-
mentioned limitations with regard to physical implementation at
the expense of an increased dataflow mapping effort. While the
dataflow in CNNs is deterministic and can therefore be optimized
offline, the problem of finding a suitable scheme that minimizes
both runtime and off-chip memory accesses is a challenging task
which becomes even more complex if an interconnect system
is involved. This work presents an automated mapping strategy
starting at the single-core level with different optimization targets
for minimal runtime and minimal off-chip memory accesses. The
strategy is then extended towards a suitable many-core mapping
scheme and evaluated using a scalable system-level simulation
with a network-on-chip interconnect. Design space exploration is
performed by mapping the well-known CNNs AlexNet and VGG-
16 to platforms of different core counts and computational power
per core in order to investigate the trade-offs. Our mapping
strategy and system setup are scaled from the single-core
level up to 128 cores, thereby showing the limits of the selected
approach.
Index Terms—Convolutional neural network (CNN), network
on chip (NoC), deep learning, application-specific instruction set
processor (ASIP), dataflow optimization
I. INTRODUCTION
Convolutional neural networks (CNNs) are nowadays
widely used for applications such as face recognition and
object detection. While smaller networks like MobileNet [1]
and SqueezeNet [2] can be efficiently processed using small
dedicated accelerators or even ARM based general-purpose
CPUs [3], larger networks require much more powerful dedicated
hardware. In data-center applications, GPUs are often
the method of choice due to their easy programmability and
massive performance which, however, is paid for by their large
power consumption compared to application-specific accelerators
[4]. For future platforms, it is therefore desirable to have
scalable performance with reasonable power consumption so
that the accelerator can be tailored towards a certain set of
problems.
A. Bytyn, R. Ahlsdorf, R. Leupers, and G. Ascheid are with
the Institute of Communication Technologies and Embedded Systems
(ICE), RWTH Aachen University, 52074 Aachen, Germany, e-mail:
{bytyn,ahlsdorf,leupers,ascheid}@ice.rwth-aachen.de.
This work was supported by the German Federal Ministry of Education
and Research (BMBF) via the PARIS project (16ES0602).
Using a single large core yields high throughput and energy
efficiency if designed for networks with specific dimensions.
However, having a single very large accelerator makes it more
difficult to find suitable mappings for CNNs with dimension-
alities differing significantly from the initial design goals. As
a result, the arithmetic units are often under-utilized as shown
in [5]. This issue can be circumvented by having many smaller
cores that allow a more fine-grained mapping of the available
data-slices. These cores should provide sufficient flexibility,
e.g. by being programmable, in order to allow for multiple
processing schemes to be mapped onto them. One possible
way of achieving this is to use application-specific instruction-
set processors (ASIPs) as shown in [6].
The choice of arithmetic (e.g. limited precision fixed-point,
binary/log quantized etc.) has a large impact on the power
consumption of a system, as does the dataflow scheme. In
this work, the term dataflow is used to describe both the
temporal as well as the spatial distribution of data-packets,
i.e. filter weights and feature maps of convolutional kernels,
in the overall system. By maximizing the spatial correlation of
data and minimizing the temporal dependencies, it is possible
to reduce the number of off-chip accesses and, therefore,
reduce the energy required for data movement. In the context
of a many-core system it is, however, not sufficient to only
optimize the dataflow for a single core but the optimization
must be done for the entire system.
For many-core systems, a very important aspect is the
choice of interconnect: While traditional crossbar based inter-
connects imply a moderate implementation complexity, they
quickly become a bottleneck in terms of system scalability.
Using a robust and well-proven network-on-chip (NoC), on the
other hand, results in a larger initial implementation complex-
ity that is rewarded by a much easier system scalability. For
these reasons, we focus on configurable many-core systems
that use a DRAM-centric NoC with a mesh topology as an
interconnect.
While the dataflow of a single core can be accurately
classified and evaluated based on its specific loop-order, loop-
tiling and related loop-unrolling parameters [7], [8], a sophis-
ticated system-level simulation is required to do so for many-
core systems due to the interconnect. The main reason for
this is that congestion phenomena in the NoC are not easily
predictable and must therefore be simulated. This is because
the NoC requests issued by all the cores affect each other, i.e.
causing stalls within some cores while waiting for data.
arXiv:2006.12274v1 [cs.DC] 18 Jun 2020
The main contributions of this work can be summarized as follows:
• Based on previous work from [7], we provide a mathe-
matical formulation for finding the optimal single-core
mapping scheme for an existing programmable CNN
accelerator with user-definable optimization targets: min-
imal runtime or minimal off-chip accesses.
• The single-core mapping problem is extended towards
the many-core case and a heuristic for finding automatic
mappings is described.
• Case studies for mappings of the well-known CNNs
AlexNet and VGG-16 are presented and quantitatively
evaluated using system-level simulations of a platform
comprised of multiple accelerators and a network-on-
chip.
The remainder of this work is structured as follows: In
Section II, a brief overview of existing CNN accelerators and
dataflow optimization schemes is given. Next, the simulation
setup is elaborated in Section III, whereas the implementation
details of the different system components are presented in
Section III-A. A solution for the single-core mapping problem
is mathematically formulated in Section IV and simulation
results are depicted in Section V. Our proposed mapping
heuristic for extension towards the many-core case is detailed
in Section VI. Comprehensive simulation results for mappings
onto different platform configurations are then presented in
Section VII. Section VIII concludes this work with some final
remarks.
II. BACKGROUND
A. Hardware Acceleration of CNNs
Starting with the widespread use of CNNs in the early
2010s, a number of hardware accelerators have been
presented by both the research community and
industry. In general, the different types of accelerators can
be grouped into one of the following categories: GPUs, FP-
GAs, ASICs and application-specific instruction-set processors
(ASIPs), in which the latter two groups vary in terms of
their data processing scheme. Some well-known examples of
dedicated accelerators are Snowflake [9] (FPGA), Escher [10]
(FPGA), Origami [4] (ASIC), Eyeriss [11] (ASIC), Envision
[12] (ASIP) and ConvAix [6] (ASIP). While they differ
in terms of their specific processing scheme, they are all
optimized towards the same overall goal: Keep data as local as
possible to reduce off-chip transfers and maximize processing
intensity, thereby maximizing the utilization of their arithmetic
units. A key difference in the presented architectures is their
degree of flexibility, ranging from fixed dataflow to runtime
configurable dataflow. Of course, having a more flexible pro-
cessing scheme always comes at some cost in terms of area
(additional control units) and energy (setting up the processing
scheme and multiplexing the data), but it also gives designers
more degrees of freedom. This freedom can then be exploited
when optimizing the mapping which is especially important
for many-core platforms. For this reason, we propose the use
of a flexible dataflow accelerator, e.g. an ASIP, and incorporate
a suitable system-level model of such a core into our overall
simulation setup as described in Section III.
B. Convolutional Layer
Today's CNNs consist of a number of different layers,
amongst them convolutional layers, pooling layers and acti-
vation layers. Due to the fact that the largest computational
demand is generated by the regular convolutional layers, we
focus our investigations on them. The convolutional operation
can be described according to (1) as follows:
$$O(c_o, y_o, x_o) = B(c_o) + \sum_{c_i=0}^{N_{if}} \sum_{k_y=0}^{N_{ky}} \sum_{k_x=0}^{N_{kx}} W(c_o, c_i, k_y, k_x) \cdot I(c_i, y_o \cdot s, x_o \cdot s) \qquad (1)$$
whereby O depicts the entirety of all output feature maps
(ofmaps) indexed by their channel co and position (yo, xo)
within the ofmap, while I represents the set of all input
feature maps (ifmaps) with the same indexing scheme as the
ofmaps. The respective biases for each output channel are
depicted by B, with W being the weights associated with the
ofmap channel co, ifmap channel ci and filter kernel position
(ky, kx). Lastly, s represents the stride of the convolution.
The limits of the sums represent the total number of ifmaps
Nif as well as the filter kernel’s height Nky and width Nkx.
Since the result of (1) only calculates one single output
pixel for one output channel, an even greater number of
multiply-accumulate (MAC) operations is required to calculate
all ofmaps completely. The actual implementation of this
can be represented by many nested for-loops as shown in
Algorithm 1.
Algorithm 1: Nested for-loops of a convolutional layer.
Input: ifmaps I, filter weights W, biases B, stride s
Result: ofmaps O
for( co = 0; co < Nof; co++ )
  for( yo = 0; yo < Noy; yo++ )
    for( xo = 0; xo < Nox; xo++ )
      O(co, yo, xo) = B(co)
      for( ci = 0; ci < Nif; ci++ )
        for( ky = 0; ky < Nky; ky++ )
          for( kx = 0; kx < Nkx; kx++ )
            O(co, yo, xo) += W(co, ci, ky, kx) · I(ci, yo · s, xo · s)
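As a minimal sketch, the loop nest of Algorithm 1 can be written as a small Python function. The function name `conv_layer` and the nested-list data layout are assumptions made for illustration and are not part of the original work. The sketch also assumes the usual kernel offsets (+ky, +kx) in the ifmap index, which the compact notation of (1) does not spell out, and it applies no padding.

```python
def conv_layer(ifmaps, weights, biases, stride):
    """Direct convolution following the loop order of Algorithm 1.

    ifmaps:  nested list indexed as [ci][y][x]
    weights: nested list indexed as [co][ci][ky][kx]
    biases:  list indexed as [co]
    Returns (ofmaps, mac_count). No padding is applied.
    """
    n_of, n_if = len(weights), len(ifmaps)
    n_ky, n_kx = len(weights[0][0]), len(weights[0][0][0])
    n_iy, n_ix = len(ifmaps[0]), len(ifmaps[0][0])
    n_oy = (n_iy - n_ky) // stride + 1
    n_ox = (n_ix - n_kx) // stride + 1

    ofmaps = [[[0] * n_ox for _ in range(n_oy)] for _ in range(n_of)]
    macs = 0
    for co in range(n_of):
        for yo in range(n_oy):
            for xo in range(n_ox):
                acc = biases[co]
                for ci in range(n_if):
                    for ky in range(n_ky):
                        for kx in range(n_kx):
                            # Assumption: explicit kernel offsets (+ky, +kx).
                            pixel = ifmaps[ci][yo * stride + ky][xo * stride + kx]
                            acc += weights[co][ci][ky][kx] * pixel
                            macs += 1
                ofmaps[co][yo][xo] = acc
    return ofmaps, macs
```

The returned MAC count equals Nof · Noy · Nox · Nif · Nky · Nkx, which makes the cost of computing all ofmaps explicit.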
C. Dataflow Optimization
Related work on the topic of dataflow optimization can be
split twofold: First into work related to the topic of mapping
tasks onto multi/many-core systems in general and second into
techniques focusing on the optimal slicing and tiling of CNNs
on the single accelerator level.
TABLE I: Loop parameters for convolutional layers as proposed in [7].
                  Dimension   Tiling   Unrolling
Filter kernels
  Height          Nky         Tky      Pky
  Width           Nkx         Tkx      Pkx
IFMaps
  Height          Niy         Tiy      Piy
  Width           Nix         Tix      Pix
  Channels        Nif         Tif      Pif
OFMaps
  Height          Noy         Toy      Poy
  Width           Nox         Tox      Pox
  Channels        Nof         Tof      Pof
For the first topic, a comprehensive overview is presented
in [13] where the authors introduce a taxonomy that allows
mapping techniques to be classified. A distinction between run-time
and design-time mapping techniques is made and further
differentiation is provided based on the type of target architec-
ture, which can either be a heterogeneous or a homogeneous
many-core system. For highly deterministic tasks like CNNs,
the authors of [13] suggest that using design-time mapping has
the benefit of being able to optimize the overall system instead
of having to rely on a more narrow view of the system. Several
optimization goals such as performance, communication cost,
energy consumption and reliability are elaborated, of which
this work focuses on the first three. The authors
of [14] focus on the CNN-specific task-mapping within a
heterogeneous many-core system, e.g. consisting of multiple
CPUs and FPGAs or GPUs. Their aim is to speed up the
training of CNNs by optimizing the platform mapping, which
is done via a depth-first graph traversal methodology that
takes into account inter-core and inter-memory communication
overhead.
Regarding the second topic, which is the optimized slic-
ing and tiling of a CNN for a specific accelerator, several
authors have proposed their own taxonomies and strategies.
The authors of [8] provide a taxonomy for CNN accelerators
according to how data stationarity is realized, e.g. output sta-
tionary in case the output feature map’s pixels are kept in local
scratchpad memories or registers. In [7], an elaborate set of
design parameters is provided to mathematically describe this
stationarity and subsequently the parameters are explored to
find optimal configurations for given CNNs. These parameters,
together with the order of loops as depicted in Algorithm 1,
fully describe how data is organized within an accelerator and,
therefore, they are used in this work as well. For convenience,
the most important design parameters are summarized in
Table I. While the first column of parameters describes the
CNN layer’s dimensions, the second column describes the
loop tiling applied to the loops in Algorithm 1. The last
column expresses the loop unrolling factors which result in
parallel computation being executed at each loop iteration.
Often, this is equivalent to the fixed hardware parallelism
within an accelerator and must be set at design-time. For
example, unrolling the output width dimension Nox by setting
Pox = Nox would result in Pox MAC units working in parallel
on a single channel’s output row. In this work, a processing
core with design-time configurable Pof and Pox values is used
as described in Section III-B.
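To illustrate how tiling and unrolling partition a single loop dimension, the following sketch enumerates which indices would be processed together. The helper name `tiled_unrolled_indices` is hypothetical, and the clipping of partial tiles at the border is an assumption; the text does not specify how partial tiles are handled.

```python
def tiled_unrolled_indices(n, tile, unroll):
    """Yield (tile_index, indices processed in parallel) for one loop dimension.

    n:      loop bound, e.g. Nox
    tile:   tile size T, e.g. Tox
    unroll: unrolling factor P, e.g. Pox (number of parallel lanes)
    """
    n_tiles = -(-n // tile)  # ceiling division
    for t in range(n_tiles):
        start = t * tile
        stop = min(start + tile, n)  # assumption: clip the last tile
        for base in range(start, stop, unroll):
            yield t, list(range(base, min(base + unroll, stop)))
```

For example, `n=10`, `tile=8` and `unroll=4` give two full groups of four indices in tile 0 and one partial group of two indices in tile 1, so the last group leaves two of the four lanes idle.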
The main focus of this work is the optimization of the
tiling parameters as they are runtime configurable and largely
determine the dataflow. It should be noted that the term
mapping in this work refers to a concrete set of aforementioned
parameters with some additional slicing parameters which are
related to the many-core case as introduced later. Since a
detailed evaluation of all possible different loop orders is not
possible within the scope of this work, we focus on the one
presented in Algorithm 1 with some modifications as described
in Section VI to cope with the many-core mapping.
III. SIMULATION SETUP
To investigate the fitness of a CNN mapping in terms of
runtime and communication cost, a simulation is required that
is capable of accurately modeling both the single-core transac-
tions as well as the actual communication via an interconnect
network. While evaluation of a single-core mapping is possible
based on analytical considerations only, as shown in Section
IV, the many-core case requires accounting for the communication
overhead induced by the NoC, i.e. network congestion
caused e.g. by limitations in data buffers. The most accurate
results for such an analysis could be obtained by using a full
RTL simulation of the NoC and the accelerator cores;
however, this would result in an unreasonably high runtime.
We therefore use a parameterizable system-level simulation in
which different components of the system are implemented
using approximately-timed transaction-level modeling (TLM)
techniques as described in SystemC TLM [15]. Since the
focus of this work is the investigation of the dataflow, each
processing core is modeled in an abstract fashion: An inner
process is defined that imitates the dataflow of the actual
core in the way an external observer would see it. This is
done by traversing the loop structure as previously described
without actually performing any computations. However, the
formation of data-packets (both send- and receive-packets)
is carried out accurately and these packets are injected into
the NoC at the corresponding times. Using a cycle-accurate
instruction-set simulator of the processing core, we verified
the correctness of the generated transactions. All remaining
components, e.g. the NoC router, are modeled in a cycle-
accurate fashion in order to accurately reflect any congestion
phenomena that might occur during execution. Furthermore, to
allow realistic modeling of a complex system, the simulation
supports two clock domains, one for the NoC running at
a higher frequency and one for the processing cores. Also,
sophisticated monitoring and tracing facilities are included,
thereby allowing quantitative evaluation of e.g. the number
of data-packets routed over a certain router, the number of
SRAM and DRAM memory accesses performed, the count of
MAC operations executed per core and the average port buffer
stalling times in the NoC.
A. System Overview
As mentioned before, a 2D mesh-style NoC that uses a
credit-based flow control with the XY routing scheme was
selected as system interconnect. Our NoC implementation is
based on the work presented in [16] with some adaptations made
especially to the direct memory network interface (DMNI)
[17] which is responsible for managing the data flow between
the processing cores and the NoC. One possible configuration
of the interconnect is shown in Fig. 1. Each router has four ports,
one for each direction, and an additional local port connected to
the processing element.
[Figure 1: mesh diagram with a DRAM block, a master core and processing cores 0 to 6 at grid positions 0x1, 0x2, 1x0, 1x2, 2x0, 2x1 and 2x2]
Fig. 1: Example architecture of a 3x3 NoC with 7 processing
cores, 1 master core and a DRAM interface block.
In the following sections, different mesh- and processing
core sizes are investigated. However, since the base compo-
nents always stay the same, they are briefly introduced here.
An exhaustive investigation of different NoC parameters such
as the flit width, packet length, port buffer sizes and the
positioning of the DRAM interface was conducted for this
work. Based on this investigation, we use a flit width of 64 bit
and a packet length of 40 flits per packet. The inport buffer
size of the router is set to 16 flits with the DRAM interface
placed in the center of the mesh. If the mesh-size is increased
beyond the 3x3 configuration depicted in Fig. 1, the DRAM
block is always re-centered and the master core remains at
position (0, 0) (top left). The additional positions within the
mesh-grid are then filled with processing cores. Furthermore,
the system uses two clock domains, one for the processing
cores that runs at 500 MHz and one for the NoC that runs
at a higher frequency of 1 GHz. According to [18], this is a
reasonable choice for implementing a NoC in a modern CMOS
technology.
B. Processing Core Architecture
In order for the mapper to generate the best possible results,
it is desirable to have a processing core with the highest
possible flexibility with regard to the dataflow. We therefore
decided to use an application-specific instruction-set processor
(ASIP) that is very similar to the one presented in [6]. The pro-
cessor uses very long instruction words (VLIW) with 8 slots
in parallel and a RISC-like instruction set architecture (ISA)
that also incorporates some more complex vector instructions
specifically targeted at CNNs.
The total number of parallel MAC operations that can be
scheduled in one cycle depends on the design-time config-
urable unrolling factors Pox ∈ {4, 8, 16, 32}, Pof ∈ {4, 8, 16}
and is calculated as the product of these two factors. At the
inner-most loop level according to Algorithm 1, the ASIP uses
an output row stationary scheme, i.e. one ofmap row of width
Pox for Pof ofmap channels is kept in the register file in
parallel. Each MAC operation uses 16 bit fixed-point multiplier
operands that are accumulated in 32 bit registers. Data for
both weights and ifmaps is provided by a sophisticated on-
chip memory interface and a direct memory access (DMA)
controller that moves data between the on-chip SRAM and
the external memory via the NoC. The size of the SRAM
scales with the vector parallelism (Pox) of the core and is
calculated as follows: Dsram = Pox · 4096 words · Wword with
Wword = 16 bit being the word width. We synthesized, placed
and routed the ASIP using a 28 nm TSMC technology
node for standard operating conditions (1 V, 25 °C) resulting in
a maximum clock frequency of 500 MHz (400 MHz for the
largest core with Pox = 32).
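The two relations stated above (parallel MACs as the product Pox · Pof, and the SRAM size formula) can be evaluated for the listed configurations with a few lines of Python. The function names are illustrative, and the conversion to KiB is added here only for readability.

```python
WORDS_PER_LANE = 4096   # words per unit of Pox, from Dsram = Pox * 4096 words * Wword
WORD_WIDTH_BITS = 16    # Wword


def macs_per_cycle(p_ox, p_of):
    """Number of MAC operations that can be scheduled in one cycle."""
    return p_ox * p_of


def sram_size_kib(p_ox):
    """On-chip SRAM size in KiB for a given vector parallelism Pox."""
    bits = p_ox * WORDS_PER_LANE * WORD_WIDTH_BITS
    return bits / 8 / 1024


for p_ox in (4, 8, 16, 32):
    lanes = [macs_per_cycle(p_ox, p_of) for p_of in (4, 8, 16)]
    print(f"Pox={p_ox}: SRAM={sram_size_kib(p_ox)} KiB, MACs/cycle={lanes}")
```

Under these formulas the smallest core (Pox = 4) has 32 KiB of SRAM and the largest (Pox = 32) has 256 KiB, while the peak parallelism ranges from 16 to 512 MACs per cycle.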
C. Network-on-Chip Components
The relevant components of the NoC used in this work are
the network router, the direct memory access network interface
(DMANI), the DRAM interface handling data accesses to the
external DRAM memory and the master core that schedules
computations onto the processing cores described in Section
III-B. Our implementation is closely based on the work of the
HERMES infrastructure [16] with some adaptions as described
later on. Each packet within the NoC includes a flit containing
the payload size, a header flit with a destination and - in
contrast to HERMES - also a source address which is used
by the DRAM interface as the destination for sending back
data fetched due to a DRAM load request. More information
on the separate components is given below.
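To give a feeling for the packetization overhead, the following sketch estimates how many packets and flits a transfer of 16-bit words would need. It rests on assumptions that the text does not confirm: that the maximum packet length of 40 flits includes two overhead flits (the payload-size flit and the header flit), and that words are packed densely into 64-bit flits. The function name is hypothetical.

```python
FLIT_WIDTH_BITS = 64
MAX_PACKET_FLITS = 40
OVERHEAD_FLITS = 2      # assumption: one payload-size flit plus one header flit
WORD_WIDTH_BITS = 16


def ceil_div(a, b):
    return -(-a // b)


def packets_for_transfer(n_words):
    """Return (number of packets, total flits) for a transfer of n_words words."""
    payload_flits = ceil_div(n_words * WORD_WIDTH_BITS, FLIT_WIDTH_BITS)
    payload_per_packet = MAX_PACKET_FLITS - OVERHEAD_FLITS
    n_packets = ceil_div(payload_flits, payload_per_packet)
    return n_packets, payload_flits + n_packets * OVERHEAD_FLITS
```

Under these assumptions, 1024 words occupy 256 payload flits and are split into 7 packets with 270 flits in total, i.e. roughly 5 % overhead.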
DMANI: The DMANI used in this work is an extension of
the DMNI introduced in [17] whose main task is to offload
the NoC packet-handling from the processing core so that said
core can focus on performing computations. While the original
DMNI required the accelerator to set up every NoC packet,
containing information such as the target address and payload
size, our DMANI does this on its own. It is used on top of
the already existing direct memory access (DMA) controller
contained within the core which is responsible for keeping
tabs on outstanding read and write transactions. Whenever the
core issues a new transaction, it is handed over to the DMANI
which then determines the number of required packets and
returns a request ID to the DMA. All requests handed to the
DMANI are processed in a FIFO fashion. To further reduce
the overhead for the processing core, the DMANI has direct
access to the core’s SRAM memory via the core’s memory
interface as shown in Fig. 2a. So, in contrast to the original
DMNI, no interrupt routine is required for the DMANI to write
or read data to/from the SRAM. Instead, whenever a request is
handed to the DMA of the core, it is ensured in software that a
sufficiently large space within the SRAM is reserved. Access
to the SRAM is arbitrated via the core’s own memory interface
which leaves the possibility of inaccessibility during an ongoing
transaction. To reduce the effects of such conflicts, the
DMANI has a receive buffer for packets being received from
the NoC (as response to an earlier issued DRAM read request)
and a write buffer for pre-fetching data in case of a write to
the NoC. Data contained within a service packet is used to
configure the processing cores at the beginning.
[Figure 2, panel (a): blocks Accelerator, SRAM, DMANI and Memory Interface; panel (b): blocks Arbiter, Routing and Crossbar with ports NORTH, EAST, SOUTH, WEST and LOCAL]
(a) Processing element (ASIP and DMANI). (b) Architecture of NoC router.
Fig. 2: NoC processing elements.
Router: For any NoC, the router is one of the main
components that enables communication between the different
processing entities. As mentioned before, our router design is
based on the original HERMES router presented in [16] which
has four bi-directional ports for connecting to other routers
(North, East, South, West) and a local port that is con-
nected to its processing core as depicted in Fig. 2b. The router
consists of the following sub-components: crossbar, arbiter,
routing module and individual port buffers. Each individual
port buffer is responsible for requesting arbitration based on
its stored requests. Arbitration is then handled according to a
prioritization scheme that is as follows: East, West, North,
South. Afterwards, the list is shifted in a cyclical fashion so
that no starvation of requests occurs. A new packet is routed
via the routing module by determining its destination address
based on the XY routing protocol and the connection between
the ports is established using the crossbar network. In total,
this process, starting at the port buffers up to when the crossbar
connection is established, takes 4 clock cycles.
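The routing and arbitration behavior described above can be sketched in a few lines of Python. This is a simplified illustration and not the HERMES implementation: the coordinate convention (y growing towards North), the rotation by exactly one position per grant, and all names are assumptions.

```python
from collections import deque


def xy_route(cur, dst):
    """Return the output port for the next hop under XY routing (X first, then Y)."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"   # assumption: y grows towards North
    if dy < cy:
        return "SOUTH"
    return "LOCAL"       # packet has arrived at its destination router


class RotatingArbiter:
    """Priority arbiter whose priority list is shifted cyclically after each grant."""

    def __init__(self, order=("EAST", "WEST", "NORTH", "SOUTH")):
        self.order = deque(order)

    def grant(self, requesting):
        """Grant the highest-priority requesting port, then rotate the list."""
        winner = next((p for p in self.order if p in requesting), None)
        if winner is not None:
            self.order.rotate(-1)  # assumption: shift by one position per grant
        return winner
```

Because every port reaches the head of the list within four grants, a permanently requesting port cannot be starved in this sketch.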
DRAM interface: To allow access to an external DRAM
memory, one of the grid-spaces within the NoC mesh is
reserved for a DRAM interface that receives both read- and
write-requests from the entire NoC and stores them in an
internal request buffer that can save exactly one request per
processing element. The interface preferably serves write-
requests in order to minimize the backlog into the NoC that
can occur in case of a long write request arising together with
multiple other write requests. In general, incoming requests are
processed in a FIFO fashion without any further prioritization.
For this work, we assume a DRAM bus width equal to
the NoC’s flit width resulting in a maximum bandwidth of
8 GByte/s at a NoC clock frequency of 1 GHz and a flit width
of 64 bit, which is very reasonable even for a slow DDR
memory.
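The stated bandwidth follows directly from the bus width and the NoC clock: 64 bit per cycle at 1 GHz is 64 Gbit/s, i.e. 8 GByte/s. The short sketch below reproduces this and derives a lower bound on the transfer time; the function names are illustrative, and packet overhead and congestion are deliberately ignored.

```python
def dram_bandwidth_bytes_per_s(bus_width_bits, f_noc_hz):
    """Peak bandwidth if one bus word is transferred per NoC clock cycle."""
    return bus_width_bits * f_noc_hz / 8


def transfer_time_s(n_bytes, bus_width_bits=64, f_noc_hz=1e9):
    """Lower bound on the transfer time; ignores packet overhead and congestion."""
    return n_bytes / dram_bandwidth_bytes_per_s(bus_width_bits, f_noc_hz)
```

For instance, moving 1 MByte (10^6 bytes) takes at least 125 µs at this peak rate.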
Master core: The processing cores must be set up according
to the mapping described in Section VI. In this work, a master-
slave concept is employed in which a master core, that could
be a RISC-like microprocessor, assigns the configurations to
the different cores via a service message using the already introduced
NoC packet structure. Since the actual configuration
is determined offline, the master core only has to send these
pre-calculated configuration packets and is, therefore, modeled
as a simple state machine.
To summarize, the main system parameters used in this
work are depicted in Table II.
TABLE II: System parameter overview.
Parameter                        Value
Max. packet length               40 flits
Flit width (Wflit)               64 bit
NoC clock frequency (fnoc)       1 GHz
Core clock frequency (fcore)     500 MHz
Router inport buffer size        16 flits
DMANI buffer size                64 words
DRAM bandwidth (BWdram)          64 bit/cycle
D. Energy Modelling
Since power and energy consumption are important metrics
to determine the fitness of a mapping, we use a macro-
modeling approach that estimates energy consumption at a
high level. As mentioned before, relevant key figures with re-
gards to energy consumption such as SRAM/DRAM load/store
counts, number of MACs etc. are already traced in our sim-
ulation. These figures are used to estimate the overall energy
consumption for the processing cores and DRAM according
to (2) and (3) as follows:
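$$E_{core} = E_{idle} \cdot N_{cyc} + E_{mac} \cdot N_{mac} + E_{sram\_ld} \cdot N_{sram\_ld} + E_{sram\_st} \cdot N_{sram\_st} \qquad (2)$$
$$E_{dram} = E_{dram\_ld} \cdot N_{dram\_ld} + E_{dram\_st} \cdot N_{dram\_st} \qquad (3)$$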
where Ex represents the energy value associated with a
single event of type x and Nx represents the event-count
during the whole simulation time. The energy values for
the processing core were extracted from time-annotated post-
layout simulation of a design similar to the one presented
in [6]. More details on the processing core are shown in
Section III-B. Because these energy values were obtained
using statistical methods, they already contain the energy
required for e.g. program control of the processor (included in
the idle energy) and register file accesses in case of MAC and
memory operations. For DRAM memory accesses, we use an
energy of 21 pJ/Bit as reported by [19] for LPDDR3 memory.
For estimating the NoC’s energy consumption, we used the
model presented in [20] which associates energies with the
most energy intensive events in NoC routers: Routing a packet
(Eroute), arbitrating a request (Earb), setting up the crossbar
(Exbar_su), switching of the crossbar (Exbar_sw), buffering
an incoming packet (Ebuf) and leakage (Eleak). The energy
values used in this work are summarized in Table III. Since the
original values presented in [20] were extracted from a 90 nm
CMOS technology which is different from the TSMC 28 nm
technology used for the core, all values were scaled to 28 nm
according to
$$E_{new} = E_{old} \left( \frac{V_{new}}{V_{old}} \right)^{2} \frac{N_{new}}{N_{old}}$$
with N being the gate pitch and V being the supply voltage.
TABLE III: Energy values used in the macro-model.
Processing Core & DRAM            Network-on-Chip
Name       Energy                 Name       Energy
Eidle      148.42 pJ/Cycle        Eroute     0.06 pJ/Packet
Esram_ld   0.89 pJ/Bit            Earb       0.22 pJ/Packet
Esram_st   0.46 pJ/Bit            Exbar_sw   0.03 pJ/Bit
Emac       6.42 pJ/Op (a)         Exbar_su   0.16 pJ/Bit
Edram_ld   21 pJ/Bit              Ebuf       0.09 pJ/Bit
Edram_st   21 pJ/Bit              Eleak      0.43 pJ/Cycle
(a) Energy for 1 MAC with 16-bit multiplier operands and 32-bit
accumulator that both use saturating fixed-point arithmetic.
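As a rough sketch, the macro-model of (2) and (3) can be evaluated with the processing core and DRAM values from Table III as follows. Since the SRAM and DRAM energies are given per bit, the sketch assumes that the corresponding event counts are expressed in bits; the function and dictionary names are illustrative only.

```python
# Energy values in pJ, taken from Table III (processing core and DRAM columns).
ENERGY_PJ = {
    "idle": 148.42,     # per cycle
    "mac": 6.42,        # per MAC operation
    "sram_ld": 0.89,    # per bit
    "sram_st": 0.46,    # per bit
    "dram_ld": 21.0,    # per bit
    "dram_st": 21.0,    # per bit
}


def core_energy_pj(n_cyc, n_mac, sram_ld_bits, sram_st_bits):
    """Core energy according to (2); memory traffic is assumed to be given in bits."""
    return (ENERGY_PJ["idle"] * n_cyc
            + ENERGY_PJ["mac"] * n_mac
            + ENERGY_PJ["sram_ld"] * sram_ld_bits
            + ENERGY_PJ["sram_st"] * sram_st_bits)


def dram_energy_pj(dram_ld_bits, dram_st_bits):
    """DRAM energy according to (3); traffic is assumed to be given in bits."""
    return (ENERGY_PJ["dram_ld"] * dram_ld_bits
            + ENERGY_PJ["dram_st"] * dram_st_bits)
```

Loading a single 16-bit word from DRAM costs 336 pJ in this model, compared to about 14 pJ for loading it from the on-chip SRAM, which illustrates why off-chip accesses are an optimization target of their own.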
IV. SINGLE-CORE DATAFLOW MAPPING
In order to find a suitable many-core mapping scheme, we
use a bottom-up flow in which the single-core mapping, i.e.
the determination of suitable tiling factors adhering to the
taxonomy in Table I, is calculated first. The communication
effects of the NoC are not considered in this step, thereby
making it possible to use the high-level dataflow description
from Algorithm 1 and formulate the mapping problem as a
constrained mixed integer non-linear problem (MINLP). We
introduce two different optimization targets: minimum overall
computing time and minimum off-chip memory accesses.
In the following, a derivation for the aforementioned cost-
functions is given which then allows us to find optimal single-
core tiling parameters using a regular MINLP solver.
Using knowledge of the actual software implementation of
the convolutional layer as presented in Section II-B on our
programmable ASIP, the nested loop structure with tiling can
be elaborated as shown in Algorithm 2. Note that for brevity,
the unrolled for-loops according to the selected Pox and Pof
values are not shown. For this work, we selected the following
tiling dimensions as they are the most sensible given current
CNN topologies and the given hardware: tiling along the
ofmap and ifmap channels (T′of, T′if) and the ifmap width (T′ix),
which in turn results in tiling along the ofmap width T′ox =
(T′ix − Nkx)/s + 1 (padding is already included in the ifmap
width T′ix). Furthermore, for examination of the single-core
case, we use primed symbols for the tile size T′x, tile count S′x
and dimension N′x (x ∈ {of, if, ox, ix}) in order to
differentiate these single-core optimization parameters from
the many-core optimization parameters Tx and Sx that are
introduced later. The number of resulting tiles per dimension,
denoted by S′, is calculated according to
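$$S'_{of} = \left\lceil N'_{of} / T'_{of} \right\rceil \qquad (4)$$
$$S'_{if} = \left\lceil N_{if} / T'_{if} \right\rceil \qquad (5)$$
$$S'_{ox} = \left\lceil N'_{ox} / T'_{ox} \right\rceil. \qquad (6)$$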
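A small sketch of (4) to (6), together with the relation between T′ix and T′ox, is given below. The function name and the example tile sizes are hypothetical, and the sketch assumes that (T′ix − Nkx) is divisible by the stride so that integer division is exact.

```python
def ceil_div(a, b):
    return -(-a // b)


def single_core_tiling(n_of, n_if, n_ox, n_kx, stride, t_of, t_if, t_ix):
    """Derive the ofmap tile width and the tile counts S' from the tile sizes T'."""
    # Assumption: (t_ix - n_kx) is a multiple of the stride.
    t_ox = (t_ix - n_kx) // stride + 1
    return {
        "T_ox": t_ox,
        "S_of": ceil_div(n_of, t_of),   # (4)
        "S_if": ceil_div(n_if, t_if),   # (5)
        "S_ox": ceil_div(n_ox, t_ox),   # (6)
    }
```

For a layer with 64 ofmaps, 3 ifmaps, an ofmap width of 224 and a 3x3 kernel with stride 1 (dimensions resembling the first VGG-16 layer), the hypothetical tile sizes T′of = 16, T′if = 3 and T′ix = 34 yield T′ox = 32 and tile counts of 4, 1 and 7.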
By annotating processing-cycle costs, e.g. for data prefetching
and computations as well as data transfer-costs inferred by off-
chip accesses, an overall cost for a tiling can be determined.
Algorithm 2: Convolutional layer with tiling parameters as
implemented on the ASIP.
Input: CNN parameters: Nkx/ky, Nif, N′of etc., stride s
Tiling parameters: T′of/if/ox, S′of/if/ox, Pox, Pof
Result: Associated costs: Nsram_ld/st, Ndram_ld/st, Nmac
 1 for( to = 0; to < S′of; to++ )
 2   for( ti = 0; ti < S′if; ti++ )
 3     → DMA_Load_Filters()
 4     → DMA_Load_Biases()
 5     for( tx = 0; tx < S′ox; tx++ )
 6       → DMA_Load_IFMap_Initial()
 7       → DMA_Load_PSum_Initial()
 8       for( yo = 0; yo < Noy; yo++ )
 9         → DMA_Load_IFMap_Next()
10         → DMA_Load_PSum_Next()
11         for( co = to · T′of; co < (to + 1) · T′of;
12              co+=Pof )
13           for( xo = tx · T′ox; xo < (tx + 1) · T′ox;
14                xo+=Pox )
15             → SRAM_Load_Bias_or_PSum()
16             for( ky = 0; ky < Nky; ky++ )
17               for( ci = ti · T′if; ci < (ti + 1) · T′if;
18                    ci++ )
19                 → Line_Prefetch()
20                 for( kx = 0; kx < Nkx; kx++ )
21                   → MAC(Pox · Pof per cycle);
22             → SRAM_Store_OFMap_or_PSum()
23         → DMA_Store_OFMap_or_PSum_Row()
As depicted in Algorithm 2, each step in the loop can be
associated with a certain action, e.g. the DMA loading filters
(line 3) or the calculation of a certain number of MAC
operations (line 21). We subdivide these loops into an inner
part that does not unconditionally depend on any off-chip
accesses (marked in blue) and an outer part that directly
relies on them (marked in red). Based on this information,
we first derive the total number of off-chip accesses Ndram =
Ndram_init + Ndram_par. It is important to differentiate
between DMA requests that can be handled in parallel to the
computations (lines 9, 10 and 23), denoted by Ndram_par, and
those that must be waited for (lines 3, 4, 6 and 7), denoted by
Ndram_init.
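$$
\begin{aligned}
N_{dram\_init} ={} & N'_{of} \cdot N_{kx} \cdot N_{ky} \cdot N_{if} && \text{(filters)} \\
& + N'_{of} && \text{(biases)} \\
& + S'_{of} \cdot N'_{ix} \cdot N_{ky} \cdot N_{if} && \text{(initial ifmaps)} \\
& + (S'_{if} - 1) \cdot N'_{ox} \cdot N'_{of} && \text{(initial psums)}
\end{aligned} \tag{7}
$$

$$
\begin{aligned}
N_{dram\_par} ={} & S'_{if} \cdot N'_{ox} \cdot N_{oy} \cdot N'_{of} && \text{(ofmap/psum store)} \\
& + S'_{of} \cdot N'_{ix} \cdot (N_{iy} - N_{ky}) \cdot N_{if} && \text{(next ifmaps)} \\
& + (S'_{if} - 1) \cdot N'_{ox} \cdot (N_{oy} - 1) \cdot N'_{of} && \text{(next psums)}
\end{aligned} \tag{8}
$$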
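The access counts in (7) and (8) translate directly into code. The sketch below is a plain transcription under the assumption that all quantities are counted in words; the function name and the small example layer are hypothetical.

```python
def dram_accesses(n_of, n_if, n_ox, n_ix, n_oy, n_iy, n_kx, n_ky, s_of, s_if):
    """Return (N_dram_init, N_dram_par) in words, following (7) and (8)."""
    n_init = (n_of * n_kx * n_ky * n_if              # filters
              + n_of                                 # biases
              + s_of * n_ix * n_ky * n_if            # initial ifmap rows
              + (s_if - 1) * n_ox * n_of)            # initial psums
    n_par = (s_if * n_ox * n_oy * n_of               # ofmap/psum store
             + s_of * n_ix * (n_iy - n_ky) * n_if    # next ifmap rows
             + (s_if - 1) * n_ox * (n_oy - 1) * n_of)  # next psums
    return n_init, n_par


# Hypothetical tiny layer: 4 ofmap channels, 2 ifmap channels, 3x3 kernel.
layer = dict(n_of=4, n_if=2, n_ox=6, n_ix=8, n_oy=6, n_iy=8, n_kx=3, n_ky=3)
no_if_tiling = dram_accesses(**layer, s_of=2, s_if=1)
with_if_tiling = dram_accesses(**layer, s_of=2, s_if=2)
print(no_if_tiling, with_if_tiling)
```

Comparing the two calls shows the effect of tiling along the ifmap channels: as soon as $S'_{if} > 1$, partial sums have to be written to and read back from DRAM, which increases both counts.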
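The required pure computational cycles within the inner loops (marked in blue) can be computed as follows:

$$C_{comp} = (C_{mac} + C_{sram}) \cdot N_{oy} \tag{9}$$

$$C_{mac} = (C_{pfetch} + N_{kx}) \cdot T'_{if} \cdot N_{ky} \cdot \frac{T'_{ox}}{P_{ox}} \cdot \frac{T'_{of}}{P_{of}} \tag{10}$$

$$C_{pfetch} = \left\lceil \frac{s + 1}{2} \right\rceil - 1 \tag{11}$$

$$C_{sram} = 2 \cdot \frac{T'_{ox} \cdot T'_{of} \cdot P_{ox} \cdot P_{of}}{BW_{sram}} \tag{12}$$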
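The following Python sketch evaluates (9)-(12) for one tile. It transcribes the equations exactly as printed, including the plain (unrounded) divisions in (10) and the numerator of (12); the function names and the example tile are assumptions for illustration, and the SRAM bandwidth is passed in as a parameter.

```python
import math


def prefetch_cycles(stride):
    """C_pfetch = ceil((s + 1) / 2) - 1, see (11)."""
    return math.ceil((stride + 1) / 2) - 1


def compute_cycles(t_of, t_if, t_ox, n_oy, n_kx, n_ky, stride, p_ox, p_of, bw_sram):
    """C_comp of one tile according to (9), with C_mac from (10) and C_sram from (12)."""
    c_mac = ((prefetch_cycles(stride) + n_kx) * t_if * n_ky
             * (t_ox / p_ox) * (t_of / p_of))
    c_sram = 2 * t_ox * t_of * p_ox * p_of / bw_sram   # (12) as printed
    return (c_mac + c_sram) * n_oy


# Hypothetical tile on a core with P_ox = 16, P_of = 8 and BW_sram = 2 * P_ox.
cycles = compute_cycles(t_of=8, t_if=64, t_ox=16, n_oy=2, n_kx=3, n_ky=3,
                        stride=1, p_ox=16, p_of=8, bw_sram=32)
print([prefetch_cycles(s) for s in (1, 2, 4)], cycles)
```

Note that a stride of 1 needs no additional prefetch cycle under (11), while larger strides add one cycle per two additional columns.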
[Figure 3: two panels, VGG-16 (top) and AlexNet (bottom), plotted per layer. Axes: Layer, Runtime (ms), Transfers (MByte), Energy (mJ). Legend: runtime, energy and transfers, each for min-dram and min-comp.]
Fig. 3: Resulting single-core runtimes (bars), DRAM transfers (filled areas) and energy consumption (lines) of the CNNs VGG-16 (top) and AlexNet (bottom) for two different optimization targets: min-dram and min-comp.
The prefetch cycle count $C_{pfetch}$ in (11) is specific to the processing core used in this work. The term $BW_{sram}$ represents the bandwidth in words per cycle that the internal SRAM offers and is equal to $2 \cdot P_{ox}$ here, since the on-chip memory is a banked dual-port memory with a bank count equal to $P_{ox}$. As mentioned before, certain DMA requests run in parallel to the computations. To allow an accurate calculation of the processing cycles for the inner loops, the estimated cycle count $C_{dram\_par}$ for these accesses must be calculated as well:

$$C_{dram\_par} = \frac{N_{dram\_par}}{BW_{dram}} \tag{13}$$

in which the term $BW_{dram}$ represents the bandwidth in words per cycle that the DRAM can provide. In this work, this bandwidth is set to the NoC flit width per cycle divided by the word width and multiplied by the clock ratio between the NoC and the core:

$$BW_{dram} = 64\,\tfrac{\text{bit}}{\text{cycle}} \Big/ 16\,\tfrac{\text{bit}}{\text{word}} \cdot \frac{1\,\text{GHz}}{500\,\text{MHz}} = 8\,\tfrac{\text{word}}{\text{cycle}}. \tag{14}$$
The cycle count for the outer loops, $C_{outer\_loop}$, depends solely on the time required to finish the initial DRAM accesses $N_{dram\_init}$:

$$C_{outer\_loop} = \frac{N_{dram\_init}}{BW_{dram}} \tag{15}$$

Depending on the selected tiling parameters, either the computational cycles $C_{comp}$ or the DRAM cycles $C_{dram\_par}$ determine the overall inner-loop cycles $C_{inner\_loop}$. This is modeled in our optimization problem by the two inequality constraints (16) and (17):

$$C_{inner\_loop} \geq C_{comp} \cdot S'_{ox} \cdot S'_{if} \cdot S'_{of} \tag{16}$$

$$C_{inner\_loop} \geq C_{dram\_par} \tag{17}$$

Together with the previously calculated outer-loop cycle count $C_{outer\_loop}$, the total cycle count $C_{total}$ amounts to:

$$C_{total} = C_{outer\_loop} + C_{inner\_loop} \tag{18}$$
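The cycle model of (13)-(18) can be sketched in a few lines of Python. The sketch assumes that a minimizer drives $C_{inner\_loop}$ down to the larger of its two lower bounds (16) and (17), so the pair of inequalities is replaced by a maximum; the function names and the example numbers are hypothetical.

```python
def dram_bandwidth(flit_bits=64, word_bits=16, f_noc_hz=1e9, f_core_hz=500e6):
    """BW_dram in words per core cycle, see (14)."""
    return (flit_bits / word_bits) * (f_noc_hz / f_core_hz)


def total_cycles(c_comp, s_ox, s_if, s_of, n_dram_init, n_dram_par, bw_dram):
    """C_total according to (13) and (15)-(18)."""
    c_dram_par = n_dram_par / bw_dram                     # (13)
    c_outer_loop = n_dram_init / bw_dram                  # (15)
    # (16) and (17) are lower bounds; when minimizing, the larger one is tight.
    c_inner_loop = max(c_comp * s_ox * s_if * s_of, c_dram_par)
    return c_outer_loop + c_inner_loop                    # (18)


bw = dram_bandwidth()
io_bound = total_cycles(1000, 2, 1, 4, n_dram_init=8000, n_dram_par=80000, bw_dram=bw)
comp_bound = total_cycles(1000, 2, 1, 4, n_dram_init=8000, n_dram_par=40000, bw_dram=bw)
print(bw, io_bound, comp_bound)
```

In the first call the parallel DRAM traffic takes longer than the computation (I/O bound), in the second call the computation dominates (computation bound).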
Finally, the allocated SRAM memory, which depends on the tiling parameters, must be constrained since the individual processing cores only have limited on-chip memory available:

$$N_{sram\_alloc} = \underbrace{T'_{of}}_{\text{biases}} + \underbrace{T'_{of} \cdot N_{kx} \cdot N_{ky} \cdot T'_{if}}_{\text{filters}} + \underbrace{T'_{if} \cdot (N_{ky} + s) \cdot T'_{ix}}_{\text{ifmaps}} + \underbrace{3 \cdot T'_{ox} \cdot T'_{of}}_{\text{ofmaps}} \tag{19}$$

$$N_{sram\_alloc} \leq D_{sram} = P_{ox} \cdot 8\,\text{KByte} = P_{ox} \cdot 4096\,\text{word} \tag{20}$$
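The memory constraint of (19) and (20) can be checked as follows. This is a minimal sketch with hypothetical function names and tile sizes; it assumes 16-bit words, which is what makes 8 KByte per bank equal to 4096 words.

```python
def sram_words(t_of, t_if, t_ox, t_ix, n_kx, n_ky, stride):
    """Allocated SRAM words N_sram_alloc according to (19)."""
    biases = t_of
    filters = t_of * n_kx * n_ky * t_if
    ifmaps = t_if * (n_ky + stride) * t_ix
    ofmaps = 3 * t_ox * t_of              # three rows: triple buffering
    return biases + filters + ifmaps + ofmaps


def fits_in_sram(words, p_ox, words_per_bank=4096):
    """Constraint (20): D_sram = P_ox banks of 4096 words each."""
    return words <= p_ox * words_per_bank


# Hypothetical 3x3 tiles on a core with P_ox = 16 (65536 words of SRAM).
small = sram_words(t_of=64, t_if=64, t_ox=28, t_ix=30, n_kx=3, n_ky=3, stride=1)
large = sram_words(t_of=128, t_if=64, t_ox=28, t_ix=30, n_kx=3, n_ky=3, stride=1)
print(small, fits_in_sram(small, p_ox=16))
print(large, fits_in_sram(large, p_ox=16))
```

Doubling the number of ofmap channels per tile roughly doubles the filter storage, which is why the second tile no longer fits.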
As can be seen in (19), three ofmap rows are always allocated because we use a triple-buffering scheme: one row is used for prefetching the next partial sums (line 10 in Algorithm 2), one is used for calculating the current ofmap row, and the third is used as a write-back buffer (needed in line 23).

Depending on the optimization target, it is now possible to find tiling parameters that minimize either the cycle count (hereafter referred to as min-comp) or the total number of DRAM accesses (referred to as min-dram). The final optimization targets are shown in (21) and (22).

$$\min_{T'_{of},\, T'_{if},\, T'_{ox}} C_{total} \quad \text{(min-comp)} \tag{21}$$

$$\min_{T'_{of},\, T'_{if},\, T'_{ox}} N_{dram\_init} + N_{dram\_par} \quad \text{(min-dram)} \tag{22}$$

For brevity, the previously introduced constraints for the minimization targets are not restated here but must be added to the problem formulation.
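To make the two targets concrete, the sketch below evaluates (4)-(12) and (15)-(20) on a small grid of candidate tilings and picks the best feasible one per target. The text does not state which solver is used, so a plain exhaustive search serves as a stand-in here; the layer dimensions, the candidate grid and all names are assumptions for illustration, and $BW_{sram} = 2 \cdot P_{ox}$ is assumed for (12).

```python
import math
from itertools import product

# Hypothetical 3x3 layer and core parameters, chosen for illustration only.
LAYER = dict(n_of=128, n_if=64, n_ox=56, n_oy=56, n_iy=58, n_kx=3, n_ky=3, s=1)
CORE = dict(p_ox=16, p_of=8, bw_dram=8, sram_words=16 * 4096)


def evaluate(t_of, t_if, t_ox, layer=LAYER, core=CORE):
    """Return (C_total, N_dram) for one tiling, or None if (20) is violated."""
    n_of, n_if, n_ox, n_oy, n_iy, n_kx, n_ky, s = (
        layer[k] for k in ("n_of", "n_if", "n_ox", "n_oy", "n_iy", "n_kx", "n_ky", "s"))
    p_ox, p_of, bw_dram = core["p_ox"], core["p_of"], core["bw_dram"]
    t_ix = (t_ox - 1) * s + n_kx           # ifmap tile width belonging to T'_ox
    n_ix = (n_ox - 1) * s + n_kx           # padded ifmap width of the layer

    sram = (t_of + t_of * n_kx * n_ky * t_if
            + t_if * (n_ky + s) * t_ix + 3 * t_ox * t_of)              # (19)
    if sram > core["sram_words"]:                                      # (20)
        return None

    s_of = math.ceil(n_of / t_of)                                      # (4)
    s_if = math.ceil(n_if / t_if)                                      # (5)
    s_ox = math.ceil(n_ox / t_ox)                                      # (6)
    n_init = (n_of * n_kx * n_ky * n_if + n_of
              + s_of * n_ix * n_ky * n_if + (s_if - 1) * n_ox * n_of)  # (7)
    n_par = (s_if * n_ox * n_oy * n_of
             + s_of * n_ix * (n_iy - n_ky) * n_if
             + (s_if - 1) * n_ox * (n_oy - 1) * n_of)                  # (8)

    c_pfetch = math.ceil((s + 1) / 2) - 1                              # (11)
    c_mac = (c_pfetch + n_kx) * t_if * n_ky * (t_ox / p_ox) * (t_of / p_of)  # (10)
    c_sram = 2 * t_ox * t_of * p_ox * p_of / (2 * p_ox)                # (12) as printed
    c_comp = (c_mac + c_sram) * n_oy                                   # (9)
    c_inner = max(c_comp * s_ox * s_if * s_of, n_par / bw_dram)        # (16), (17)
    return n_init / bw_dram + c_inner, n_init + n_par                  # (18), N_dram


def search(objective):
    """objective = 0 minimizes C_total (min-comp), 1 minimizes N_dram (min-dram)."""
    grid = product((8, 16, 32, 64, 128), (8, 16, 32, 64), (14, 28, 56))
    results = [(evaluate(*tiling), tiling) for tiling in grid]
    feasible = [(cost, tiling) for cost, tiling in results if cost is not None]
    return min(feasible, key=lambda item: item[0][objective])


min_comp_cost, min_comp_tiling = search(0)
min_dram_cost, min_dram_tiling = search(1)
print("min-comp:", min_comp_tiling, min_comp_cost)
print("min-dram:", min_dram_tiling, min_dram_cost)
```

Because both searches run over the same feasible set, the min-comp result can never need more cycles than the min-dram result, and the min-dram result can never need more DRAM accesses than the min-comp result.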
V. SINGLE-CORE MAPPING EVALUATION
The previously introduced single-core mapping algorithm is used to map two well-known CNNs, VGG-16 [21] and AlexNet [22], onto a 3x1 system configuration consisting of one processing core, a master core and a DRAM interface block. Using our system-level simulation, the runtime, the number of DRAM accesses and the energy consumption are obtained; the results are presented in Fig. 3. As can be seen, the min-comp target always achieves a faster runtime than the min-dram target, but at the expense of an increased DRAM access count. Since AlexNet is a fairly small CNN compared to the much larger VGG-16, there are no large differences between both optimization targets for it. However, even though DRAM energy consumption is generally considered to be the main contributor to the overall energy, our results show that the total energy for VGG-16 is actually minimized in the min-comp case. This is caused by layers 4_2 and 4_3, which exhibit a much longer runtime when optimized for min-dram. The explanation is as follows: to minimize the number of DRAM accesses, the optimization algorithm uses a configuration in which the width of the ofmap tiles $T'_{ox}$ is fairly small. In doing so, the on-chip SRAM is used to store as many ifmap channels as possible at the same time ($T'_{if}$ very large), thereby minimizing the need for any partial sum (psum) transfers. The small width of the ofmap tiles ($T'_{ox}$), however, causes a poor utilization of the processing core's vALUs ($T'_{ox} < P_{ox}$), which leads to the increased runtime. Finally, the energy that is consumed for baseline processing core operation during this additional runtime results in the overall higher energy consumption.
VI. MANY-CORE DATAFLOW MAPPING
To enable a large-scale evaluation of different CNNs on arbitrary platform configurations, it is indispensable to have an automated method to assign computing tasks to the processing cores. The same taxonomy that was used for tiling the CNN layers in the single-core case (Section IV) is therefore used here as well. While the dashed single-core parameters such as tile size $T'_x$ and tile count $S'_x$ refer to data dimensions within one core, the un-dashed variables have a different meaning. These variables, namely $T_x$ and $S_x$, express the size and count of complete CNN layer slices in the many-core context. Starting from the top level, each CNN layer is subdivided into a number of slices along the ifmap and ofmap width dimension ($S_{ix}$ and $S_{ox}$) as well as the ofmap channel dimension $S_{of}$, as depicted in Fig. 5a. For such a slice, it is then possible to determine an optimal single-core mapping as previously derived. An overview of the many-core mapping heuristic is given in Fig. 4, with a more thorough explanation in the following paragraphs.
[Figure 4 (flow chart): determine the set $\mathcal{T}$ of all possible slices ($S_x$, $T_x$) → take one slice parameter set $T$ out of $\mathcal{T}$ → optimize the single-core tiling ($S'_x$, $T'_x$) for $T$ → allocate slices to $k = 1$ cores → evaluate the cost $C_{k,T}$ of the mapping according to (23) and store it in $\mathcal{C}$ → if $k < N_{cores}$, set $k = \min(2k, N_{cores})$ and allocate again → if $k \geq N_{cores}$, select and store the new minimal overall cost → done if $\mathcal{T}$ is empty.]
Fig. 4: Flow-chart of many-core mapping heuristic.

To determine a suitable mapping, a cost function must be defined first. Since our application is always either computation bound or I/O bound, it makes the most sense to optimize for these targets by incorporating both the overall runtime and the time required for communication via the NoC into the cost function. The resulting optimization goal is:

$$\min_{s} \left( \max_{\forall c \in C} \big( C_{tot\_wo\_dram}(s_c) \big) + \frac{1}{BW_{dram}} \cdot \sum_{\forall c \in C} \sum_{p \in P_c} F(p) \cdot W_{flit} \right) \tag{23}$$
The first term in (23) selects, among all processing cores $c \in C$, the one with the maximum cycle count $C_{tot\_wo\_dram}(s_c)$, not counting any cycles required for DRAM accesses. The abstract mapping vector $s_c$ denotes the assignment of slices to processing cores. Taking the maximum is required because an asymmetric mapping of slices to cores may leave a few cores still computing at the end, which then dictate the runtime of the overall system. The cycle count can be calculated for each core as follows (see (9) and (16) for reference):

$$C_{tot\_wo\_dram} = C_{comp} \cdot S'_{ox} \cdot S'_{if} \cdot S'_{of} \tag{24}$$
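The structure of the cost in (23) can be sketched as follows: the slowest core determines the first term, and the flit counts of all packets of all cores determine the second. The sketch assumes that $W_{flit}$ is the flit width in words (4 words for a 64-bit flit and 16-bit words) and that every packet carries one header flit; both values, like the function names and example numbers, are assumptions for illustration.

```python
import math


def packet_flits(payload_words, w_flit=4, header_flits=1):
    """Flit count F(p) of one packet: header plus payload rounded up to whole flits.

    w_flit (words per flit) and header_flits are assumed values.
    """
    return header_flits + math.ceil(payload_words / w_flit)


def mapping_cost(cycles_per_core, flits_per_core, w_flit=4, bw_dram=8):
    """Cost of one mapping following (23).

    cycles_per_core: C_tot_wo_dram of every active core, see (24).
    flits_per_core:  for every core, a list with F(p) of each of its packets.
    """
    slowest_core = max(cycles_per_core)
    total_flits = sum(sum(packets) for packets in flits_per_core)
    return slowest_core + total_flits * w_flit / bw_dram


one_large = [packet_flits(100)]        # 100 words in a single packet
ten_small = [packet_flits(10)] * 10    # the same 100 words in ten packets
cost = mapping_cost([1000, 1200], [one_large, ten_small])
print(sum(one_large), sum(ten_small), cost)
```

The same payload costs noticeably more flits when it is split into many small packets, which is the overhead that an exact packet list makes visible.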
[Figure 5: (a) sketch of a layer sliced with $S_{ox} = 2$ and $S_{of} = 4$; (b) plot of Runtime (ms) over Cores (4, 16, 32, 64, 128) for VGG16 layers 1_1, 1_2, 2_1, 2_2, 3_1, 3_{2,3}, 4_1, 4_{2,3} and 5_{1,2,3}.]
Fig. 5: (a) Illustration of the many-core slicing. (b) Results for simulation of VGG-16 using a system setup with constant overall computing capabilities and RAM, i.e. $N_{cores} \times (P_{ox} \cdot P_{of})$ and $N_{cores} \times D_{sram}$ are constant (total MAC per cycle = 2048, total SRAM = 1 MByte).

In addition, the second term of (23) models the overall NoC bandwidth requirement. To this end, the number of flits $F(p)$ is calculated for all NoC packets $p \in P_c$ generated by each core $c$. This is done by traversing the loop structure of Algorithm 2 and building an exact list of all packets with their associated lengths, because only then can the overhead of e.g. many small packets with their associated header flits be accounted for. Since the calculation of each core's computation cycles according to (24) requires knowledge of the single-core tiling parameters $S'_x$ and $T'_x$, which in turn can only be calculated based on the slicing parameters $S_x$ and $T_x$, we propose an iterative scheme, as depicted in Fig. 4, that uses a heuristic to determine the latter parameters. This iterative process starts by determining a set $\mathcal{T}$ of all possible slice parameters, based on the constraint that the width $T_{ox}$ and depth $T_{of}$ of a slice should ideally be a multiple of the processing cores' unrolling factors $P_{ox}$ and $P_{of}$ to maximally utilize the cores:

$$\mathcal{T} := \left\{ (m P_{of},\, n P_{ox}) \;\middle|\; \forall\, m \in \left[1, \left\lfloor \frac{N_{of}}{P_{of}} \right\rfloor\right];\ \forall\, n \in \left[1, \left\lfloor \frac{N_{ox}}{P_{ox}} \right\rfloor\right] \right\} \tag{25}$$
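Enumerating the candidate set of (25) is straightforward. The sketch below uses a hypothetical function name; the example dimensions correspond to a layer with 64 ofmap channels and an ofmap width of 224 on a core with $P_{of} = 8$ and $P_{ox} = 16$.

```python
def slice_candidates(n_of, n_ox, p_of, p_ox):
    """All slice parameter sets (T_of, T_ox) according to (25)."""
    return [(m * p_of, n * p_ox)
            for m in range(1, n_of // p_of + 1)
            for n in range(1, n_ox // p_ox + 1)]


candidates = slice_candidates(n_of=64, n_ox=224, p_of=8, p_ox=16)
print(len(candidates), candidates[0], candidates[-1])
```

Restricting the slice sizes to multiples of the unrolling factors keeps the set small (8 x 14 = 112 candidates in this example), which is what makes an exhaustive evaluation of all candidates practical.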
Afterwards, each possible slice parameter set $T = (T_{of}, T_{ox}) \in \mathcal{T}$ is fed into the single-core tiling optimizer from Section IV and the set of optimal tiling parameters ($T'_x$, $S'_x$) is determined. Since each slice can be viewed as a new CNN layer of smaller dimension, the single-core optimizer is fed with the slice's dimensions as follows:

$$N'_{ox} = T_{ox} \tag{26}$$

$$N'_{ix} = (T_{ox} - 1) \cdot s + N_{kx} \tag{27}$$

$$N'_{of} = T_{of} \tag{28}$$

Because the selection of the slice parameters is constrained in this way, it only takes a few seconds to determine all possible tiling parameters. For a given slice parameter set $T$, the number of resulting slices for each dimension is calculated as follows:

$$S_{ox} = \lceil N_{ox} / T_{ox} \rceil \tag{29}$$

$$S_{of} = \lceil N_{of} / T_{of} \rceil \tag{30}$$
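As a short illustration of (26)-(30), the following sketch converts one slice parameter set into the dimensions handed to the single-core optimizer and counts the resulting slices. Function names and example values are hypothetical.

```python
import math


def slice_layer_dims(t_of, t_ox, n_kx, stride):
    """Dimensions of one slice seen as a smaller CNN layer, (26)-(28)."""
    return {"n_ox": t_ox, "n_ix": (t_ox - 1) * stride + n_kx, "n_of": t_of}


def slice_counts(n_of, n_ox, t_of, t_ox):
    """Slice counts (S_ox, S_of) according to (29) and (30)."""
    return math.ceil(n_ox / t_ox), math.ceil(n_of / t_of)


dims = slice_layer_dims(t_of=32, t_ox=48, n_kx=3, stride=1)
s_ox, s_of = slice_counts(n_of=64, n_ox=224, t_of=32, t_ox=48)
last_width = 224 - (s_ox - 1) * 48    # width left for the last slice
print(dims, s_ox, s_of, last_width)
```

Because of the rounding in (29), the last slice along the width is narrower than the others (32 instead of 48 columns in this example).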
As our investigations have shown, especially for small CNNs such as AlexNet, it is not always desirable to distribute tasks to all processing cores. In fact, the cost of the additional communication required by a large number of cores (second term in (23)) can greatly outweigh the potential saving in computation time, thereby rendering a solution with more cores slower than a competing solution with fewer active cores. In addition to being slower, more active cores also consume more energy during idle cycles, which again is not desirable. We found that a waving scheme, which systematically explores configurations with different numbers of active cores, starting with the ones closest to the DRAM interface block, gives better results than using all cores for every layer.

In a final step, for each slice parameter set $T$ and the associated tiling parameters $T'_x$ and $S'_x$, this waving scheme is employed to distribute slices to core instances in the NoC. In the first step, all slices are mapped to a single core ($k = 1$). After this, the number of activated cores $k$ is doubled and all tasks are assigned to the 2 cores closest to the DRAM. The number of activated cores is then doubled again and the former steps are repeated until a configuration with all cores activated has been tested. For each of these iterations, the overall cost $C_{k,T}$ is saved in a set $\mathcal{C}$. Afterwards, the configuration corresponding to the lowest cost $\min\{\mathcal{C}\}$ is selected as the final configuration. It should be noted that during slice assignment our algorithm makes sure to map slices with adjacent boundaries in the ofmap width dimension (neighboring $T'_{ox}$ slices) to the same core; these slices are then stitched together to remove the need for redundant filter loads. This method has proven to be very robust and fairly fast: determining a mapping usually takes only a few minutes, e.g. for VGG-16.
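The control flow of the waving scheme can be sketched as follows. The sketch only covers the enumeration of active-core counts and the selection of the cheapest configuration; the placement of slices on the cores closest to the DRAM and the real cost (23) are not modeled, and `toy_cost` is a made-up stand-in in which computation shrinks with $k$ while communication grows with $k$.

```python
def core_counts(n_cores):
    """Active-core counts that are tried: k = 1, then k = min(2k, n_cores)."""
    k, counts = 1, [1]
    while k < n_cores:
        k = min(2 * k, n_cores)
        counts.append(k)
    return counts


def best_mapping(candidates, n_cores, cost_fn):
    """Evaluate every slice set T with every core count k; return (cost, T, k)."""
    costs = [(cost_fn(t, k), t, k)
             for t in candidates
             for k in core_counts(n_cores)]
    return min(costs, key=lambda entry: entry[0])


def toy_cost(slice_set, k):
    """Hypothetical cost: 4000 compute cycles shared by k cores plus NoC overhead."""
    _, t_ox = slice_set
    return 4000 / k + (8000 / t_ox) * k


best = best_mapping([(8, 16), (8, 32)], n_cores=14, cost_fn=toy_cost)
print(core_counts(14), core_counts(23), best)
```

With this toy cost the cheapest configuration uses only 4 of the 14 available cores, which mirrors the observation that activating all cores is not always beneficial.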
VII. MANY-CORE MAPPING EVALUATION
Using the previously described mapping scheme for the many-core case, a number of benchmarks are evaluated in this section. From a VLSI design perspective, it is much more appealing to build medium-sized processing cores and scale the system by adding more cores. This trend can be seen in the design choices made by manufacturers of large GPUs, which usually consist of hundreds or even thousands of smaller RISC-like cores. Therefore, we simulate a system that employs a varying number of cores (4 to 128) which are scaled up or down (larger or smaller $P_{ox}$ and $P_{of}$ values) in such a way that the overall computing capabilities and on-chip SRAM of the system always stay at 2048 MAC per cycle and 1 MByte, respectively. Looking at the results for running VGG-16 in Fig. 5b, it can be concluded that having many small cores is not optimal due to the increased communication overhead in the NoC, but having only a few large cores does not give the best results either.
A medium-sized configuration with 16 processing cores (each core with configuration $P_{ox} = 16$, $P_{of} = 8$) achieves the fastest runtime for most layers, as this yields the best trade-off between generating additional NoC communication on the one hand and enabling more degrees of freedom for the mapping algorithm on the other hand. It should be noted, though, that for networks with much larger ifmap widths, e.g. CNNs for semantic scene segmentation, it might be more favorable to use larger cores, as they provide a more natural match in terms of processing row width ($P_{ox}$). For these kinds of CNNs, however, the DRAM interface becomes the bottleneck no matter how good the mapping is.

[Figure 6: one panel per layer group, VGG16: 1_1, 1_2, 2_1, 2_2, 3_1, 3_{2,3}, 4_1, 4_{2,3}, 5_{1,2,3} and AN: 1, 2_{1,2}, 3, 4_{1,2}, 5_{1,2}. Axes: Cores (2, 4, 7, 14, 23) and Speedup. Legend: Rel. speedup, Theo. bound, Alloc. cores.]
Fig. 6: Speedups, number of allocated cores and theoretically bounded speedup according to (31) for CNNs AlexNet (AN) and VGG-16 using systems with $N_{cores} \in \{2, 4, 7, 14, 23\}$ processing cores ($P_{ox} = 16$, $P_{of} = 8$, $D_{sram} = 128$ KByte per core). Layers with the same dimensions, and therefore the same mappings and results, are clustered as indicated by e.g. VGG16 5_{1,2,3}.

To further explore the performance scalability of our approach, we select the most promising processing core configuration as determined earlier ($P_{ox} = 16$, $P_{of} = 8$) and simulate the same workload, i.e. the convolutional layers of VGG-16 and AlexNet, for system configurations ranging from a single core up to a 5x5 NoC configuration with 23 processing cores in total. The irregular core counts, e.g. 23 for a 5x5 NoC, result from the 2 positions required for the DRAM interface and the master core. The overall computational capability of each configuration (measured in MAC per cycle) is calculated as $N_{MAC} = N_{cores} \cdot P_{ox} \cdot P_{of}$ (each core has $D_{sram} = 128$ KByte of SRAM). For the smallest system, this results in $N_{MAC} = 128$, while the largest 23-core system offers up to $N_{MAC} = 2944$. To obtain fair values for the following speed-up comparison of a many-core system with the single-core case, the single-core 3x1 configuration is simulated with a very large packet length of 10000 flits per packet. This is done in order to minimize any communication overhead that would not be required in the single-core scenario. The results in Fig. 6 show the achieved speed-up relative to this single-core setup. In addition, a line representing the theoretical bound is plotted, which takes into consideration the tiling and slicing parameters generated by the algorithms introduced in Section IV and Section VI as well as the maximum available DRAM bandwidth. This line denotes the speed-up that is achievable for the given constraints without accounting for any of the overhead generated by the NoC and is calculated as follows:
$$f(N_{cores}) = \frac{C_{single\_core}}{\max\left( C_{tot\_wo\_dram},\ \dfrac{N_{dram}}{BW_{dram}} \right)} \tag{31}$$

where $C_{single\_core}$ represents the number of processing cycles for the single-core setup.
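The bound in (31) can be evaluated with a one-line function. The sketch below uses a hypothetical function name and made-up cycle and access counts; it only illustrates how the bound switches between a computation-limited and a bandwidth-limited regime.

```python
def speedup_bound(c_single_core, c_tot_wo_dram, n_dram, bw_dram=8):
    """Theoretical speed-up f(N_cores) according to (31)."""
    return c_single_core / max(c_tot_wo_dram, n_dram / bw_dram)


# Hypothetical numbers: the single core needs 1.4 M cycles, and the slowest
# core of a 14-core mapping needs 100 k compute cycles.
compute_limited = speedup_bound(1_400_000, 100_000, n_dram=400_000)
dram_limited = speedup_bound(1_400_000, 100_000, n_dram=1_600_000)
print(compute_limited, dram_limited)
```

In the second call the DRAM transfers alone take 200 k cycles, so the achievable speed-up is halved even though the computation itself is distributed perfectly.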
The final simulation results in Fig. 6 show a close match between the theoretical bound as calculated by (31) and the actual system performance. The explanation for the remaining gap between both curves is twofold. On the one hand, the overhead required for NoC communication, including congestion phenomena, slightly diminishes the achievable speedup. On the other hand, the width and depth of the last slices generated according to (25) are not necessarily a multiple of the cores' hardware parallelism $P_{ox}$ and $P_{of}$, resulting in a core under-utilization for these last slices. On average, the difference between the simulation results and the theoretical bound is between 6.59% ($N_{cores} = 2$) and 27.48% ($N_{cores} = 7$) for AlexNet and between 3.28% ($N_{cores} = 2$) and 17.32% ($N_{cores} = 14$) for VGG-16. Another interesting observation is that the mapping heuristic does not choose to activate more than 14 cores for any of the configurations, even if more cores are available. This can be explained by the large additional NoC traffic that would be generated and that would actually slow down the overall system. Looking at the curves of Fig. 6, this trend of diminishing returns can already be observed starting at 7 cores for most layers.
In general, most configurations with more than 7 cores are limited by the DRAM bandwidth, with a few exceptions: layers 1_2 and 2_1 of VGG-16 achieve very high speed-ups of up to 13x and 12.2x, respectively, for configurations with 14 cores. Since these layers are fairly wide ($N_{ix} = \{224, 112\}$) and only have a limited number of output channels ($N_{of} = \{64, 128\}$), the mapping heuristic preferably slices along the width dimension. This minimizes the need for repeated loading of ifmaps, thereby minimizing the bandwidth requirements and enabling the observed speedups. For later layers, this is not possible due to the larger ofmap channel counts and the associated allocation of on-chip memory for filter weights. This limitation could be overcome by enabling each core in the NoC to access the other cores' on-chip memories, which would reduce the traffic going through the DRAM interface by using the cores' SRAM as caches.
VIII. CONCLUSION
Detailed strategies for mapping CNNs onto both single-core and many-core systems were presented and evaluated with regard to their practical feasibility. For the single-core mapping, two strategies, one for DRAM access minimization and one for runtime reduction, were investigated, with the result that DRAM access minimization does not always lead to the desired energy reduction of the overall system. By using a system-level simulator, several system setups with different numbers of processing cores and core configurations were investigated. The results show that, for the investigated CNNs AlexNet and VGG-16, configurations with 128 MAC units per core result in the highest speedups, although this highly depends on the selected hardware unrolling factors $P_x$. Subsequent simulations that use this optimal core size have shown a speedup potential which saturates at a 4x4 system setup comprising 14 processing cores. The maximum speedups observed for this 4x4 system were 8.4x for AlexNet's first layer and 13x for VGG-16's second layer. Both layers have in common that they allow slicing in the ofmap width dimension, which maps well to the given processing core's architecture. Since unrolling in the ofmap width and channel direction is very common among other accelerators as well, similar results could possibly be obtained for them too. As usual for many-core systems, the maximum achievable speed-up is ultimately limited by the available on-chip memory and off-chip bandwidth. The use of a NoC for the investigated application domain has shown promising results, and the presented mapping heuristic has proven to make robust choices for a variety of system configurations.
REFERENCES
[1] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017.
[2] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,” CoRR, vol. abs/1602.07360, 2017.
[3] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. V. Gool, “Ai benchmark: Running deep neural networks on android smartphones,” in ECCV Workshops, 2018.
[4] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional Network Accelerator,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, Nov. 2017.
[5] Y. Shen, M. Ferdman, and P. Milder, “Maximizing cnn accelerator efficiency through resource partitioning,” SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 535–547, Jun. 2017.
[6] A. Bytyn, R. Leupers, and G. Ascheid, “An application-specific vliw processor with vector instruction set for cnn acceleration,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), May 2019, pp. 1–5.
[7] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks,” Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’17, pp. 45–54, 2017.
[8] Y.-h. Chen, J. Emer, and V. Sze, “Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators,” IEEE Micro, vol. 37, no. 3, pp. 12–21, 2017.
[9] V. Gokhale, A. Zaidy, A. X. M. Chang, and E. Culurciello, “Snowflake: An efficient hardware accelerator for convolutional neural networks,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
[10] Y. Shen, M. Ferdman, and P. Milder, “Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer,” in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, Apr. 2017, pp. 93–100.
[11] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016, pp. 367–379, 2016.
[12] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903–914, 2017.
[13] A. K. Singh, M. Shafique, A. Kumar, and J. Henkel, “Mapping on multi/many-core systems: Survey of current and emerging trends,” in 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), May 2013, pp. 1–10.
[14] B. D. Rouhani, A. Mirhoseini, and F. Koushanfar, “Deep3: Leveraging three levels of parallelism for efficient Deep Learning,” 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, 2017.
[15] A. S. Initiative. About systemc tlm. [Online]. Available: https://www.accellera.org/community/systemc/about-systemc-tlm
[16] F. Moraes, N. Calazans, A. Mello, L. Möller, and L. Ost, “HERMES: An infrastructure for low area overhead packet-switching networks on chip,” Integration, the VLSI Journal, vol. 38, no. 1, pp. 69–93, 2004.
[17] M. Ruaro, F. B. Lazzarotto, C. A. Marcon, and F. G. Moraes, “DMNI: A specialized network interface for NoC-based MPSoCs,” Proceedings - IEEE International Symposium on Circuits and Systems, vol. 2016-July, pp. 1202–1205, 2016.
[18] A. Ejaz, V. Papaefstathiou, and I. Sourdis, “Ddrnoc: Dual data-rate network-on-chip,” ACM Trans. Archit. Code Optim., vol. 15, no. 2, pp. 25:1–25:24, Jun. 2018.
[19] M. Schaffner, F. K. Gürkaynak, A. Smolic, and L. Benini, “DRAM or no-DRAM? Exploring Linear Solver Architectures for Image Domain Warping in 28 nm CMOS,” pp. 707–712, 2015.
[20] J. Chan and S. Parameswaran, “NoCEE: Energy macro-model extraction methodology for network on chip routers,” IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, vol. 2005, pp. 254–259, 2005.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2015.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12. USA: Curran Associates Inc., 2012, pp. 1097–1105.
