HERALD: Optimizing Heterogeneous DNN Accelerators for Edge Devices by Kwon, Hyoukjun et al.
Herald: Optimizing Heterogeneous DNN Accelerators
Hyoukjun Kwon1,2, Liangzhen Lai2, Tushar Krishna1 and Vikas Chandra2
1Georgia Institute of Technology, Atlanta, Georgia, USA
2Facebook, Menlo Park, California, USA
hyoukjun@gatech.edu, liangzhen@fb.com, vchandra@fb.com, tushar@ece.gatech.edu
ABSTRACT
New real-time applications such as virtual reality (VR) with
multiple sub-tasks (e.g., image classification, segmentation,
hand tracking, etc.) based on deep neural networks (DNNs)
are emerging. Such applications rely on a different DNN for
each sub-task, which makes the workload heterogeneous, and
must meet a target processing rate (frame rate) for each
sub-task. Thus, the new compound DNN workload imposes two
new challenges on accelerator design: (1) meeting the target
processing rate of the DNN for each sub-task and (2)
efficiently processing heterogeneous layers. As a solution, we
explore heterogeneous DNN accelerators (HDAs). HDAs consist
of multiple accelerator substrates (i.e., sub-accelerators)
that support model parallelism and implement different
mapping styles to adapt to heterogeneous DNN layers.
However, an HDA’s performance and energy heavily depend
on (1) how we partition hardware resources among the
sub-accelerators within a given hardware budget, and (2) how
we schedule layers on the sub-accelerators. That is, HDAs can
result in sub-optimal designs unless carefully co-optimized.
Therefore, we propose an HDA optimization framework,
Herald, which co-optimizes hardware partitioning and layer
scheduling. Herald exploits the simplicity of the dependence
graph of DNN layers to reduce the complexity of the problem;
it requires 9.48 ms per layer on average on a laptop with an
i9-9880H processor and 16GB of memory. Herald can be used
both at design time to perform co-optimization (optimizer)
and at compile time to perform layer scheduling and report
expected latency and energy (cost model and scheduler). In
our case studies, the co-optimized HDAs with the best EDP
that Herald identified provided 56.0% EDP improvements
with 46.82% latency and 6.3% energy benefits on average
across the workloads and accelerators we evaluate, compared
to the best monolithic accelerator for each evaluation setting,
by deploying two complementary-style sub-accelerators in
an HDA.
1. INTRODUCTION
Deep neural network (DNN) accelerators, hardware spe-
cialized for DNN computations, played a key role in provid-
ing high computation capabilities with high energy efficiency
for DNN computations from edge/mobile devices [3, 13] to
[Figure 1 plot: three bar-chart panels, (a) Resnet50, (b) MobileNetV2, (c) UNet. x-axis: Shi, DLA, RS; y-axis: EDP (J x s); legend: Early Layer, Late Layer, Fully-connected, Depth-wise, Up-scale CONV.]
Figure 1: EDP estimation of Shidiannao [8], NVDLA [26], and Eyeriss’s
row-stationary [5] style mapping on Resnet50, MobileNetV2, and
UNet. We apply 256 PEs and 32GBps NoC bandwidth to the MAESTRO
cost model [17] to estimate the energy and latency. The break-down is
in the granularity of unit operations.
cloud-scale devices [15]. The enhanced DNN computation
capabilities of DNN accelerators enabled the deployment
of diverse applications that heavily rely on DNNs, such as
face recognition [38], image super-resolution [28], and so on.
Building on such innovations in hardware, real-time
applications with multiple sub-tasks, each with its own DNN
model, such as AR/VR [40] or autonomous driving [20, 32, 39],
have been enabled. For example, a VR application can require
object detection, hand tracking, speech recognition, pose
estimation, and so on [1, 10, 30], which all rely on different
DNN models to achieve state-of-the-art performance. Because
of the diversity of sub-tasks, the DNN models supporting the
sub-tasks are compound and heterogeneous.
Such compound DNN workloads impose a new challenge
on DNN accelerators since the workloads must be completed
at target frame rates (i.e., they are real-time applications)
and are heterogeneous in layer operations, layer shapes, and
so on. In particular, the heterogeneity of the workload is one
of the major challenges for monolithic DNN accelerators
based on a single processing style (i.e., dataflow [5]) because
they are often over-specialized for specific layer shapes and
operations [17], which is what provides their efficiency boost
in the first place. For example, NVDLA [26] style accelerators
exploit input and output channel parallelism, which provides
near roof-line throughput for CONV2D layers with deep
channels, as shown in Figure 1 (a) and (b). However, when
running a CONV2D layer with a small number of channels, an
NVDLA style accelerator suffers from severe compute unit
under-utilization, which leads to low throughput and energy
efficiency, as the results in Figure 1 (c) imply.
arXiv:1909.07437v2 [cs.DC] 22 Jun 2020
That is, tuning an accelerator for the average case can lead
to uniform inefficiency across all the layers in heterogeneous
DNN workloads.
Such an efficiency drop due to layer heterogeneity in
operation and shape can be significant depending on the target
DNN models. For example, in a recent classification network,
Resnet50 [11], layers with deep channels and low activation
resolution¹ are dominant; all the layers except the first layer
are such layers. Therefore, NVDLA style accelerators provide
high compute unit utilization on all the layers except the
input layer. However, in a segmentation network, UNet [33],
layers with shallow channels and high activation resolution
account for 41.7% of the layers. NVDLA style accelerators
suffer from low compute unit utilization on those less
favorable layers, resulting in a severe efficiency drop (3.6×
slower than Shi-diannao style), whereas NVDLA style is 3.0×
faster than Shi-diannao style on Resnet50 [17]. Flexible
accelerators, which include coarse-grained reconfigurable
architecture (CGRA) style ASIC accelerators [7, 19, 23] and
FPGA accelerators [16, 21, 35], are potential solutions to the
workload heterogeneity problem. However, the extra hardware
components for reconfigurability can cause static efficiency
overheads, which make them less desirable under stringent
energy constraints in edge, mobile, and cloud devices (e.g.,
MAERI required 96% more power compared to a systolic
array design [19]).
To deal with the challenges from layer heterogeneity, we
propose to design heterogeneous DNN accelerators (HDAs)
that contain diverse sub-accelerators within an accelerator
chip. This is a promising option since we can assign each
layer to its most efficient sub-accelerator. Also, by having
multiple sub-accelerators, we can exploit layer parallelism,
which can be extremely useful for new applications with
many DNNs running simultaneously. However, because we
split finite hardware resources within a chip among multiple
sub-accelerators, each sub-accelerator may be less powerful
than a full homogeneous (or, monolithic) accelerator.
Therefore, as reported in a recent heterogeneous accelerator
work for database queries [22], heterogeneous accelerators
can be either more or less efficient than monolithic
accelerators depending on the design parameters (architecture),
compiler mapping (dataflow + tile sizing (loop blocking)),
and workload (models). We observed the same trend for
DNNs as well, as presented in Section 5.2, which implies
two findings: (1) heterogeneous accelerators have potential
latency and energy gains over monolithic accelerators, and
(2) to materialize such benefits, we need a systematic
optimization framework that considers all three of the
aforementioned aspects: architecture, mapping, and DNN models.
Therefore, we propose an HDA optimization framework,
Herald, illustrated in Figure 2, which performs two-phase
co-optimization at design and compile time. As inputs, Herald
receives the total amounts of available hardware resources
(number of PEs, NoC bandwidth, and so on), the mappings of
sub-accelerators, and the target DNN models to run. As
outputs, Herald generates a hardware resource partitioning for
each sub-accelerator, a layer execution schedule, and the
estimated total latency and energy using the MAESTRO cost
model [17]. That is, Herald is an HDA design space explorer
and layer-granularity scheduler for user-specified workloads.
Herald mainly targets co-optimization at design and compile
time to fully exploit the benefits of hardware-schedule
co-optimization, specializing both hardware and schedule for
a target workload. When the workload changes at run time,
Herald can still generate an optimized schedule for the new
workload on the underlying hardware. In our evaluations, we
use Herald to create HDA design points combining multiple
accelerators with complementary dataflow styles. The design
points with the best EDP for each experiment provided
24.9% smaller energy-delay product (16.1% latency and 7.6%
energy benefits) across the complex edge device workloads we
evaluate, compared to the best monolithic design we evaluate.
¹ We compare the number of input channels (C) and the input
activation height (Y) to define deep/shallow channels and high/low
activation resolution: if C > Y, the layer has deep channels and low
resolution, and vice versa.
We summarize the contributions of this paper as follows:
• To the best of our knowledge, Herald is the first work to
explore heterogeneous architecture and layer scheduling
in the DNN accelerator domain.
• Herald automatically searches for optimal hardware
resource partitioning among sub-accelerators in HDAs (it
can be used as an optimizer at design time).
• Herald explores optimal layer schedules that match each
layer with its most efficient sub-accelerator, considering
both local (layer-accelerator matching) and global
(load-balancing) optimization, and reports expected latency
and energy (it can be used as a cost model for HDAs at
compile time).
2. BACKGROUND
2.1 DNN Operations and Layer Sizes
Convolutional neural networks (CNNs) are among the most
popular DNNs for computer vision applications, which are
dominant in edge device applications. Therefore, we focus
on CNNs and the operations in recent CNN models listed
in Table 1. The basis of CNN operations is convolution
(CONV2D), a six-dimensional (or seven-dimensional, if batch
is included) set of multiply-and-accumulate (MAC) operations
over two tensors, the input activation and the filter weights,
as illustrated in Figure 3. That is, the operations are sets of
element-wise multiplications (or, Hadamard products) and
accumulations, as described in the loop nest in Figure 3 (b).
Other operations are based on the same style of computation,
but they accumulate over different dimensions (depth-wise
CONV2D), insert zeros in the input activation (up-scale
CONV), and so on.
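As a runnable counterpart to the loop nest in Figure 3 (b), the following minimal Python sketch spells out CONV2D and its depth-wise variant. It assumes stride 1 and no padding (which is what the indexing Input[c][y+r][x+s] implies); the function names and the tiny all-ones tensors are illustrative and not part of Herald or MAESTRO.

```python
def conv2d(inp, flt):
    """Direct CONV2D (stride 1, no padding), same loop order as Figure 3 (b).

    inp: input activation indexed [c][y'][x']  (C x Y' x X')
    flt: filter weights   indexed [k][c][r][s] (K x C x R x S)
    Returns the output activation indexed [k][y][x] (K x Y x X).
    """
    K, C = len(flt), len(inp)
    R, S = len(flt[0][0]), len(flt[0][0][0])
    Y = len(inp[0]) - R + 1      # output rows
    X = len(inp[0][0]) - S + 1   # output columns
    out = [[[0] * X for _ in range(Y)] for _ in range(K)]
    for k in range(K):
        for c in range(C):
            for y in range(Y):
                for x in range(X):
                    for r in range(R):
                        for s in range(S):
                            out[k][y][x] += inp[c][y + r][x + s] * flt[k][c][r][s]
    return out


def depthwise_conv2d(inp, flt):
    """Depth-wise CONV2D: output channel c depends only on input channel c.

    flt is indexed [c][r][s]; there is no accumulation across channels.
    """
    C = len(inp)
    R, S = len(flt[0]), len(flt[0][0])
    Y = len(inp[0]) - R + 1
    X = len(inp[0][0]) - S + 1
    out = [[[0] * X for _ in range(Y)] for _ in range(C)]
    for c in range(C):
        for y in range(Y):
            for x in range(X):
                for r in range(R):
                    for s in range(S):
                        out[c][y][x] += inp[c][y + r][x + s] * flt[c][r][s]
    return out


# Illustrative data: 2 input channels of 3x3 ones, 2x2 filters of ones.
ones_in = [[[1] * 3 for _ in range(3)] for _ in range(2)]
ones_flt = [[[[1] * 2 for _ in range(2)] for _ in range(2)]]       # K=1, C=2
dw_flt = [[[1] * 2 for _ in range(2)] for _ in range(2)]           # C=2
conv_out = conv2d(ones_in, ones_flt)          # each output sums C*R*S = 8 products
dw_out = depthwise_conv2d(ones_in, dw_flt)    # each output sums R*S = 4 products
```

The only difference between the two functions is the dimension over which products are accumulated, which is the point made in the paragraph above.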
2.2 Dataflow and Mapping
Dataflow collectively refers to loop ordering and spatial
unrolling (partitioning) [5, 41] strategies. Dataflows are often
represented in a loop-nest form [6], as shown in Figure 4,
where the loop bounds are left unfilled. Starting from the base
loop nest (i.e., an “unmapped” loop nest such as the one in
Figure 3 (b)), a series of loop interchanges and
parallelizations modifies how we compute DNN operations while
preserving what we compute. By providing loop bounds to the
representation (i.e., tile size or blocking information), we
obtain a “mapping,” which is an instance of a dataflow that
contains the full
[Figure 2 diagram. Motivation: multi-task real-time applications (AR/VR: eye tracking, image segmentation, ...; autonomous driving: vision-based navigation, lane detection, ...) use a DNN model for each sub-task, forming a real-time and heterogeneous DNN workload. Herald inputs: workload (e.g., Resnet50, UNet), HW resource parameters, mapping strategies. Proposed framework: HDA design space explorer (PE partitioning, BW partitioning), layer scheduler (layer ordering/assignment, idle time elimination), and the MAESTRO cost model, which takes the DNN model, HW parameters, scheduling policy, ... and returns latency, energy, memory occupancy, .... Herald outputs: optimized HDA design parameters, optimized schedule, expected latency/energy.]
Figure 2: Multi-task real-time application motivating heterogeneous DNN accelerators (HDAs), and an overview of Herald, an HDA optimization
framework.
[Figure 3 (a) diagram: filter (R, S, C, K), input activation (X’, Y’, C), and output activation (X, Y, K); the filter slides over the input activation, computing a Hadamard product and accumulation at each position.]
(a) Visual Description of Convolution Operation (CONV2D)
for(k=0; k<K; k++)
  for(c=0; c<C; c++)
    for(y=0; y<Y; y++)
      for(x=0; x<X; x++)
        for(r=0; r<R; r++)
          for(s=0; s<S; s++)
            Output[k][y][x] += Input[c][y+r][x+s] * Filter[k][c][r][s]
(b) Loop Nest Description of Convolution Operation (CONV2D)
<Convention>
K: Output Channel (or, filter)
C: Input Channel
Y’: Input Activation Row
X’: Input Activation Column
Y: Output Activation Row
X: Output Activation Column
R: Filter Row
S: Filter Column
Figure 3: An illustration of the fundamental convolution operation
(CONV2D). (a) shows a visualization of the operation as a sliding
window stencil operation, and (b) shows the precise description of
CONV2D in a loop nest. Note that we illustrate one representative way
(mapping) to compute CONV2D and do not include non-linear
functions in the illustration for simplicity.
information to map a DNN operation on an accelerator [18].
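To illustrate how loop transformations change how CONV2D is computed while preserving what is computed, the sketch below compares the base loop nest with a version whose K and C loops are split into tiles. The tile sizes K0 and C0, the layer dimensions, and the test tensors are hypothetical, and the inner k0/c0 loops run sequentially here, whereas an accelerator would unroll them spatially (pfor in Figure 4).

```python
from itertools import product


def conv2d_base(inp, flt, K, C, Y, X, R, S):
    """Unmapped loop nest, same order as Figure 3 (b)."""
    out = [[[0] * X for _ in range(Y)] for _ in range(K)]
    for k, c, y, x, r, s in product(range(K), range(C), range(Y),
                                    range(X), range(R), range(S)):
        out[k][y][x] += inp[c][y + r][x + s] * flt[k][c][r][s]
    return out


def conv2d_tiled(inp, flt, K, C, Y, X, R, S, K0, C0):
    """Same computation with K and C split into tiles of size K0 and C0.

    The k0/c0 loops are the ones a dataflow would unroll spatially;
    in this sketch they simply run one after another.
    """
    assert K % K0 == 0 and C % C0 == 0  # the sketch assumes exact tiling
    out = [[[0] * X for _ in range(Y)] for _ in range(K)]
    for k1 in range(K // K0):
        for c1 in range(C // C0):
            for y, x in product(range(Y), range(X)):
                for k0 in range(K0):
                    for c0 in range(C0):
                        k, c = k1 * K0 + k0, c1 * C0 + c0
                        for r, s in product(range(R), range(S)):
                            out[k][y][x] += inp[c][y + r][x + s] * flt[k][c][r][s]
    return out


# Hypothetical layer: K=4, C=4, 2x2 output, 2x2 filter (hence a 3x3 input).
K, C, Y, X, R, S = 4, 4, 2, 2, 2, 2
inp = [[[c + 2 * y + x for x in range(X + S - 1)]
        for y in range(Y + R - 1)] for c in range(C)]
flt = [[[[k - c + r * s for s in range(S)] for r in range(R)]
        for c in range(C)] for k in range(K)]

reference = conv2d_base(inp, flt, K, C, Y, X, R, S)
tiled_results = [conv2d_tiled(inp, flt, K, C, Y, X, R, S, K0, C0)
                 for K0, C0 in [(1, 1), (2, 2), (4, 2)]]
```

Each choice of (K0, C0) corresponds to a different mapping of the same dataflow; all of them produce the same output tensor as the base loop nest.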
2.3 DNN Accelerators
DNN accelerators [2,6,8,9,15,19,23,26,27,34,36] are hardware
specialized for the DNN forward pass (or both the forward
and backward passes), which provide tremendous throughput
and energy benefits over CPUs and GPUs. For example, TPU [15]
reported 53.5× and 25.74× higher performance on two CNNs
over CPUs and GPUs on average, respectively. To achieve
such high efficiency, DNN accelerators exploit hundreds
of processing elements (PEs) to handle the billions of
multiply-and-accumulate (MAC) operations in DNN models, and
exploit scratchpad memory and an inter-PE interconnection
network to maximize data reuse and thereby reduce the massive
energy cost of fetching data from DRAM. We observe three
types of DNN accelerators, described below, including the
HDAs we propose.
Monolithic Accelerators. Most previously proposed DNN
accelerators are homogeneous, or monolithic: they have a
regular architecture, as shown in Figure 5 (a), and run only one
dataflow style. Monolithic DNN accelerators can have more
complicated structures, but all of them replicate a unit
structure to scale up.
Flexible Accelerators. Some accelerators such as Flexflow [23],
MAERI [19], and EyerissV2 [7] include extra hardware
components for runtime reconfigurability, which enables them to
run the mapping styles that are efficient for the workload.
Because each DNN layer prefers different mapping styles, as
we discuss in Section 3.3, adapting the mapping style to each
layer can provide latency and energy benefits. For example,
a recent study [17] reported 37% latency and 10% energy
gains assuming zero reconfigurability overhead². Although
flexible accelerators can provide such benefits, the energy
cost can be higher depending on the reconfigurability cost.
For example, 1024-MAC MAERI [17], a flexible accelerator,
synthesized using a 28nm library required 30.2% more
power than 1024-MAC NVDLA [26], a monolithic
accelerator, synthesized using the same library.
for(k1=0; k1<K1; k1++)
 pfor(k0=0; k0<K0; k0++)
  for(c1=0; c1<C1; c1++)
   for(y1=0; y1<Y1; y1++)
    for(x1=0; x1<X1; x1++)
     pfor(c0=0; c0<C0; c0++)
      for(r1=0; r1<R; r1++)
       for(s1=0; s1<S; s1++)
        for(y0=0; y0<Y0; y0++)
         for(x0=0; x0<X0; x0++)
          for(r0=0; r0<1; r0++)
           for(s0=0; s0<1; s0++) {
            k=k1*K0 + k0; c=c1*C0 + c0; 
            … x = x1*X0 + x0;
            Output[k][y][x] += 
            Input[c][y+r][x+s] * Filter[k][c][r][s]; }
(a) NVDLA Style Dataflow
for(k1=0; k1<K1; k1++)
 for(k0=0; k0<K0; k0++)
  for(c1=0; c1<C1; c1++)
   for(y1=0; y1<Y1; y1++)
    for(x1=0; x1<X1; x1++)
     for(c0=0; c0<C0; c0++)
      pfor(y0=0; y0<Y0; y0++)
       pfor(x0=0; x0<X0; x0++)
        for(r=0; r<R; r++)
         for(s=0; s<S; s++) {
          k=k1*K0 + k0; c=c1*C0 + c0; 
          … x = x1*X0 + x0;
          Output[k][y][x] +=
          Input[c][y+r][x+s] * Filter[k][c][r][s]; }
(b) Shi-diannao Style Dataflow
Figure 4: Loop nest representation of dataflows from recent
accelerators [5, 8]. We follow the same convention as the loop nest in
Figure 3. Numbers after loop variables indicate the tile level.
Heterogeneous Accelerators. A heterogeneous DNN
accelerator (HDA) is an accelerator that contains multiple
sub-accelerators that run distinct mapping styles, as illustrated
in Figure 5 (c). Like flexible accelerators, HDAs exploit the
different mapping style preferences of each DNN layer to
optimize latency and energy, but with a different strategy:
HDAs assign each layer to the sub-accelerator with the layer’s
most preferred mapping style. An HDA needs to distribute its
hardware resources among sub-accelerators at design time,
resulting in an array of accelerators that are each smaller
than a monolithic or flexible accelerator with the same
hardware budget. In return, multiple accelerator instances
enable layer parallelism when the workload is a complex
application that includes multiple DNN models. We discuss
such potential benefits of HDAs in Section 3.4.
In DNN accelerators, mapping determines the latency and
energy consumption because it determines the number of
² Since the overhead is implementation-specific, and to show the full
potential, the paper excluded the overhead.
[Figure 5 diagram. (a) Monolithic Accelerator: PEs connected by local interconnects, with a global buffer and a global interconnect to/from DRAM. (b) Flexible Accelerator: PEs with a distribution crossbar and a reconfiguration controller, reconfigured between configurations, with a global buffer and a global interconnect to/from DRAM. (c) Heterogeneous Accelerator: sub-accelerators ACC1, ACC2, ACC3, each with its own PEs, buffer, and NIC (activation labels: ReLu, Sigmoid, Leaky ReLU), sharing a global buffer and a global interconnect to/from DRAM.]
Figure 5: Example monolithic, flexible, and heterogeneous DNN accelerators (HDAs).
buffer accesses, degree of parallelization (mapping utilization
of PEs), buffer size requirements, and so on [5, 17, 23]. In ad-
dition, the efficiency of mapping depends on layer operation
and sizes, which is one of the key rationales toward HDAs.
We discuss such aspects in the following section.
3. MOTIVATION
In this section, we discuss compound workloads that motivate
layer parallelism, the heterogeneity of layer operations
and sizes, and the impact of mapping style on the efficiency of
a DNN accelerator, which together motivate the use of HDAs.
3.1 Realtime and Compound Workloads
The number of applications relying on DNNs is increasing,
and they are often tightly coupled with edge devices
such as self-driving cars, smart phones, VR headsets, or
AR glasses. Computer vision is one of the most popular
application domains, and it includes various tasks like
face recognition, image segmentation, image super resolution,
depth estimation, and so on [14]. Such applications
often consist of multiple computer vision sub-tasks to
implement the desired functionality, resulting in compound and
heterogeneous DNN workloads. For example, a VR headset
simultaneously runs DNN models for hand tracking, pose
estimation, action segmentation, and multiple image
classifications [40], each at its designated frame rate. Another
example is self-driving cars, which perform lane detection [20],
driving scene understanding [32], and so on.
Compound and heterogeneous DNN workloads from such
emerging applications are often based on multiple DNN
models, each targeting a processing rate (or, frame rate)
because the applications are inherently real-time, which
imposes a new design challenge on DNN accelerators. Such
features motivate DNN accelerators to have multiple
substrates that simultaneously run layers from different DNN
models, so that the processing rate requirement of the DNN
model for each sub-task is met. Another challenge is layer
heterogeneity from multiple DNN models for diverse tasks,
which we discuss next.
3.2 Layer Heterogeneity in DNN Models
Because of the diversity of sub-tasks in compound DNN
workloads, the layers in the DNNs are also diverse, or
heterogeneous, while DNN accelerators are often optimized for
specific DNN models to exploit the benefits of specialization.
Therefore, understanding layer heterogeneity and optimizing
for it is crucial to efficiently support compound
DNN workloads.
[Figure 6 diagram: activation and filter shapes from Layer 1 to Layer N for (a) Classification Network and (b) Segmentation Network.]
Figure 6: Trends in layer shape of (a) classification networks such as
Resnet [11] and (b) segmentation networks such as UNet [33].
From recent DNN models, we can observe two classes of
heterogeneity: layer shape (or, the size of each layer
dimension) and layer operation. Table 1 summarizes the two
classes of heterogeneity for some recent DNN models related
to multi-DNN applications such as autonomous driving and
AR/VR devices.
3.2.1 Layer Shape
Classification networks such as Resnet [11] gradually re-
duce the resolution of activation because their goal is to
extract a classification vector where each entry represents the
probability of each class. Also, classification networks tend
to increase the number of channels to exploit as many fea-
tures as possible for accurate classification. Therefore, layers
in classification networks have high-resolution activation and
shallow channels in early layers and low-resolution activation
and deep channels in late layers, as illustrated in Figure 6 (a).
In contrast, segmentation networks such as UNet [33] need
to restore the original resolution of activation because their
goal is to generate masks over target objects in the input
image. However, segmentation networks still need to ex-
tract as many features as those in classification networks
for high accuracy. Therefore, segmentation networks first
follow the same trend as classification networks until the mid-
layer. Afterward, segmentation networks reduce the number
of channels and gradually restore the resolution of activation
using up-scaling methods such as transposed convolution
(a.k.a. deconvolution or up-convolution). As a result, layer
shapes in segmentation networks follow the trend illustrated
in Figure 6 (b).
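The deep/shallow and high/low-resolution labels used above follow the rule stated in footnote 1 (compare the number of input channels C with the input activation height Y). A minimal sketch of that rule is given below; the function names and the (C, Y) pairs are hypothetical, and the handling of C = Y is an assumption because the text does not specify it.

```python
def classify_layer_shape(channels, height):
    """Footnote-1 rule: C > Y means deep channels and low resolution."""
    if channels > height:
        return "deep/low-res"
    # C == Y is not covered by the text; it is assigned here by assumption.
    return "shallow/high-res"


def shallow_fraction(layers):
    """Fraction of layers classified as shallow-channel / high-resolution."""
    shallow = sum(1 for c, y in layers
                  if classify_layer_shape(c, y) == "shallow/high-res")
    return shallow / len(layers)


# Hypothetical (C, Y) pairs, not taken from any model in the paper.
example_layers = [(3, 224), (64, 56), (256, 14), (512, 7)]
labels = [classify_layer_shape(c, y) for c, y in example_layers]
```

Applying shallow_fraction to the layer list of a real model gives the kind of per-model statistic quoted in the introduction for UNet; the numbers above are only illustrative.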
3.2.2 Layer Operation
We list DNN models for computer vision tasks related
to AR/VR in Table 1. As listed in the layer operations column
of Table 1, layer operations in such models are diverse and
heterogeneous. Each layer operation prefers different mappings
Table 1: DNN models selected for case studies motivated by AR/VR workloads [40]. For works without a model name,
we assign names to refer to those works in the rest of the paper.
Task | Model | Layer Shape | Layer Operations
Image Classification | Resnet50 [11] | Classification | CONV2D, FC, Skip-Con.
Image Classification | MobileNetV2 [25] | Classification | CONV2D, DWCONV, Skip-Con.
Image Segmentation | UNet [33] | Segmentation | CONV2D, FC, TRCONV, Concat.
Depth Estimation | Focal Length DepthNet [12] | Segmentation | CONV2D, FC, UPCONV
Hand Pose Estimation | Br-Q HandposeNet [24] | Classification | CONV2D, FC
and hardware [17], which makes such workloads challenging
for monolithic accelerators.
3.3 Mapping and Efficiency of Accelerators
Since mappings dictate the data reuse strategies and PE
utilization, they significantly affect the efficiency of a DNN
accelerator, and each mapping has distinct preferences for
layer shapes and operations [5, 17, 23].
We show examples of such cases in Figure 7, comparing
two example accelerators based on the Shi-diannao [8] and
NVDLA [26] mapping styles. The two accelerators take
distinct approaches to computing MAC operations in DNNs.
As illustrated in Figure 7 (a), a Shi-diannao style accelerator
parallelizes activation width and height using an
output-stationary style mapping, which exploits convolutional
reuse across kernels and maximizes output reuse, while an
NVDLA style accelerator parallelizes input and output
channels using a weight-stationary style mapping, which
exploits activation reuse across output channels and maximizes
filter weight reuse. Such differences in mappings result in
dramatically different utilization of compute units, as shown
in Figure 7 (c). We use the three example operations presented
in Figure 7 (c) to show the impact of mappings. Op1 and Op2
are CONV2D operations with the aspect ratios of early and
late layers in the classification network introduced in Figure 6
(a), respectively. Op3 is a depth-wise CONV2D operation
with the same layer size as Op1.
Based on the parallelization strategy of each example
accelerator and the layer sizes, we observe dramatically
different PE utilization, as shown in Figure 7 (c). We use the
MAESTRO [17] cost model for DNN accelerators to estimate the
latency and energy and compute the energy-delay product (EDP)
as one indicator of overall efficiency, as shown in Figure 7 (b).
Combining the differences in utilization and data reuse
strategies, the two example accelerators yield dramatically
different EDPs, which implies that each prefers different
operations. In addition to the mapping utilization, each
mapping style has dramatically different memory/network-on-chip
(NoC) bandwidth requirements, buffer size requirements, and
so on, which also vary to different degrees with the layer
shape and operation [17].
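The notion of mapping utilization (PEs with computation mapped divided by the total number of PEs) can be made concrete with a small sketch. The array geometries (2 x 8 channel lanes for the NVDLA style, 4 x 4 for the Shi-diannao style) and the three layer shapes below are assumptions chosen for illustration; the functions consider a single full tile and ignore partially filled edge tiles, so they are a simplification and not the MAESTRO model.

```python
def nvdla_style_utilization(K, C, k_lanes, c_lanes, depthwise=False):
    """Utilization when output (K) and input (C) channels are unrolled spatially.

    For depth-wise convolution each output channel reads a single input
    channel, so only one input-channel lane per output channel is busy.
    """
    c_per_output = 1 if depthwise else C
    used = min(K, k_lanes) * min(c_per_output, c_lanes)
    return used / (k_lanes * c_lanes)


def shidiannao_style_utilization(Y, X, rows, cols):
    """Utilization when output rows (Y) and columns (X) are unrolled spatially."""
    used = min(Y, rows) * min(X, cols)
    return used / (rows * cols)


# Hypothetical operations: K/C are channel counts, Y/X the output size.
ops = {
    "Op1": dict(K=2, C=3, Y=4, X=4, depthwise=False),   # shallow channels, larger output
    "Op2": dict(K=2, C=16, Y=2, X=2, depthwise=False),  # deep channels, small output
    "Op3": dict(K=2, C=2, Y=4, X=4, depthwise=True),    # depth-wise convolution
}

utilization = {
    name: (
        nvdla_style_utilization(op["K"], op["C"], k_lanes=2, c_lanes=8,
                                depthwise=op["depthwise"]),
        shidiannao_style_utilization(op["Y"], op["X"], rows=4, cols=4),
    )
    for name, op in ops.items()
}
# Each value is (NVDLA-style utilization, Shi-diannao-style utilization).
```

Under these assumed shapes, the channel-parallel style is fully utilized only on the deep-channel operation, while the output-parallel style is fully utilized only when the output activation covers the PE array, which mirrors the complementary behavior described above.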
Therefore, no single mapping style is good for all the
layers, and we need to optimize the mapping for each layer
in target workloads to maximize efficiency of an accelerator.
However, when the target workload is heterogeneous, the
common practice that optimizes the mapping for the average
case of the workload can result in a consistently inefficient
mapping for all the layers in the workload, which is one
of the major challenges for DNN acceleration for emerging
applications with multiple DNN models.
3.4 Benefits of HDAs
To overcome the limitation of monolithic DNN accelera-
tors implementing only one mapping style, we propose to
design heterogeneous DNN accelerators (HDAs) that contain
multiple sub-accelerators with distinct mapping styles within
a single accelerator chip. HDAs have two potential benefits
over monolithic accelerators.
Selective scheduling. Because layers that differ in operation
and shape prefer different mapping styles and hardware,
running each layer on its most preferred sub-accelerator in an
HDA is an effective way to maximize overall efficiency.
Latency hiding via layer parallelism. Unlike most
monolithic accelerators, which run one layer after another, HDAs
can run multiple layers of different models on their
sub-accelerators in parallel. By running multiple layers in
parallel, a heterogeneous accelerator can overlap the latency of
multiple models, which hides latency among DNN models and
reduces overall latency.
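The latency-hiding argument can be illustrated with simple arithmetic: a monolithic accelerator pays the sum of all layer latencies, whereas an HDA pays the latency of its busiest sub-accelerator. The sketch below assigns whole models to sub-accelerators for brevity (a simplification of layer-granularity scheduling); all latencies, model names, and sub-accelerator names are hypothetical, and the layers are assumed to run slower on the smaller sub-accelerators.

```python
def sequential_latency(per_model_layer_latency):
    """One accelerator runs every layer of every model back to back."""
    return sum(sum(layers) for layers in per_model_layer_latency.values())


def parallel_latency(per_model_layer_latency, assignment):
    """Each model runs on its assigned sub-accelerator; sub-accelerators overlap.

    Models that share a sub-accelerator are serialized on it, and layers
    of one model always run in order.
    """
    busy = {}
    for model, layers in per_model_layer_latency.items():
        acc = assignment[model]
        busy[acc] = busy.get(acc, 0) + sum(layers)
    return max(busy.values())


# Hypothetical per-layer latencies in arbitrary time units.
on_monolithic = {"model_a": [2, 3], "model_b": [4, 1]}
# The same layers on smaller sub-accelerators are assumed to be slower.
on_sub_accels = {"model_a": [3, 4], "model_b": [6, 2]}

mono = sequential_latency(on_monolithic)                     # 2 + 3 + 4 + 1
hda = parallel_latency(on_sub_accels,
                       {"model_a": "acc1", "model_b": "acc2"})   # max(3 + 4, 6 + 2)
# A poor assignment that puts both models on one sub-accelerator loses the benefit.
hda_bad = parallel_latency(on_sub_accels,
                           {"model_a": "acc1", "model_b": "acc1"})
```

In this toy setting the HDA finishes earlier than the monolithic design only when the models are spread across sub-accelerators, which hints at why partitioning and scheduling need to be chosen carefully.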
3.5 Challenges of HDAs
Although HDAs can provide potential benefits through
micro-specialization for each layer, naive HDA designs and
schedulers can lead to inefficiency.
Reduced parallelism within a layer. Given the same number
of PEs in a monolithic accelerator and an HDA, each
sub-accelerator in the HDA has a smaller number of PEs than
the monolithic accelerator since hardware resources need to be
distributed (or partitioned) among the sub-accelerators.
Therefore, the maximum degree of parallelism each
sub-accelerator can exploit for a layer can decrease compared to
a monolithic or flexible accelerator. That can lead to higher
energy consumption since the amount of data reuse can also
decrease (depending on the mapping) if the degree of
parallelism decreases.
Moreover, because each sub-accelerator contains a smaller
amount of hardware resources than a monolithic DNN
accelerator under the same chip-level hardware budget, a
sub-optimal hardware resource distribution can degrade the
overall efficiency of an HDA chip. Also, because multiple
sub-accelerators exist in an HDA, layer scheduling, which
determines the sub-accelerator-layer matching and the
execution order, is now critical for overall efficiency; this
problem did not exist in monolithic accelerators. Therefore, we
propose Herald, an HDA optimization framework that consists
of a design time (hardware resource distribution optimization)
and a compile time (layer scheduling) optimization framework,
performing (1) design and compile time co-optimization when a
user designs a specialized HDA for a given workload or (2)
compile time optimization only, after an HDA is deployed and
the target workload changes. Exploiting those sub-components,
Herald automates HDA design tailored for user-specified
target models and outputs the estimated latency and energy of
the co-optimized design. We discuss the details of Herald next.
[Figure 7 diagram. (a) Example Accelerators: an NVDLA style accelerator that parallelizes input channels (C) and output channels (K), and a Shi-diannao style accelerator that parallelizes output rows (Y’) and output columns (X’). (b) The EDP of each example accelerator on the operations in (c): bar chart with x-axis Op1, Op2, Op3, y-axis EDP (mJ x Sec) from 0 to 12k, and series NVDLA Style and Shi-diannao Style. (c) Three example operations and their mapping onto the two example accelerators: (Op1) CONV2D (early layers in classification networks), (Op2) CONV2D (late layers in classification networks), (Op3) depth-wise convolution; legend: utilized compute unit, under-utilized compute unit; utilization labels: 37.5%, 100%, 100%, 25%, 100%, 12.5%.]
Figure 7: The impact of mapping styles on efficiency. (a) Two example accelerators based on NVDLA [26] and Shi-diannao [8] mapping styles. (b)
The EDP as the indicator of efficiency (lower is better) of the two example accelerators on the three example operations presented in (c). (c) Three example
operations based on CONV2D and depth-wise CONV2D and the mapping utilization of compute units on each example accelerator based on their
mapping styles. We define the mapping utilization as the number of PEs with computation mapped divided by the number of PEs to distinguish it
from the under-utilization based on stalls at execution time due to insufficient network-on-chip (NoC) and memory bandwidth.
4. HERALD FRAMEWORK
4.1 Execution Model and Workloads
We target layer granularity execution on each sub-accelerator
of HDAs because we observe significantly different mapping
preference of layers [17,23] and more fine-grained scheduling
results in high control and scheduling overhead. We assume
the following execution steps of accelerators in Herald.
1. Fetch filter weight values from DRAM and store them
in a global buffer.
2. Distribute filter values to sub-accelerators based on the layer
execution schedule.
3. Fetch activation from DRAM and store them in the
global buffer.
4. Stream activation values to their corresponding sub-
accelerators based on layer execution schedule.
5. Store streamed-out output activation from each sub-
accelerator to the global buffer.
6. While sub-accelerators compute output activation, fetch
the next filter values from DRAM and send them
to the next accelerator (assumes double-buffering).
7. When a sub-accelerator finishes executing a layer, stream
output activation stored in the global buffer as input ac-
tivation of the next layer.
8. Repeat above processes until processing all the layers
of all the models.
For steps 3 and 6, activation is stored in DRAM and loaded
in a tiled manner specified by the mapping in target accelera-
tor if the buffer size is not sufficient to store entire activation.
When output activation is committed to the global buffer,
Herald by default assumes a rearrange buffer that adjusts
the data layout for the next layer if it runs on another
sub-accelerator with a different mapping style. In the evaluation,
we select mappings that have the same inner-loop order so
that we can maintain the same data layout, which eliminates
the sub-accelerator context change overhead caused by differing
data layouts. To account for data layout and miscellaneous context
change overheads when they do occur, Herald also provides an
option to specify the latency and energy penalties for them.
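The benefit of the double-buffered weight prefetch in step 6 can be illustrated with a minimal Python sketch. It assumes a single sub-accelerator, per-layer weight-fetch and compute times given as plain numbers, and perfect overlap between computing one layer and fetching the next layer's weights. The function names and sample numbers are hypothetical and are not part of Herald.

```python
def double_buffered_latency(fetch, compute):
    """Total latency when layer i+1's weights are fetched while layer i computes.

    fetch[i]   -- time to fetch layer i's filter weights from DRAM (assumed known)
    compute[i] -- time for the sub-accelerator to compute layer i (assumed known)
    """
    if len(fetch) != len(compute):
        raise ValueError("fetch and compute need one entry per layer")
    if not fetch:
        return 0
    total = fetch[0]  # the very first fetch cannot be hidden
    for i, compute_time in enumerate(compute):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        # The next layer starts once both the current compute and its own
        # weight fetch have finished.
        total += max(compute_time, next_fetch)
    return total


def serial_latency(fetch, compute):
    """Baseline without overlap: every fetch and compute runs back to back."""
    return sum(fetch) + sum(compute)


# Hypothetical per-layer times in arbitrary units.
fetch_times = [4, 2, 6]
compute_times = [5, 3, 4]
overlapped = double_buffered_latency(fetch_times, compute_times)
serial = serial_latency(fetch_times, compute_times)
print(overlapped, serial)
```

In this toy case the overlapped schedule is shorter than the serial one because only the fetch that exceeds the concurrent compute time remains exposed.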
4.2 Latency and Energy Estimation
We extend MAESTRO [17,18] for latency and energy esti-
mation of HDA designs with given schedules. MAESTRO
is a validated cost model for monolithic DNN accelerators
[Figure 8 plot: EDP (J x s) versus PE partition between Acc 1 and Acc 2, from 0/16384 to 16384/0. Markers show heterogeneous design points, the best HDA partitioning, the best monolithic design, and the naive HDA partitioning; annotated partitions include 8192/8192, 12288/4096, and 4352/12032. One point (3848.6) is out of the plotted range.]
Figure 8: The impact of PE partitioning upon a large accelerator
listed in Table 3 with two sub-accelerators (ACC1: Shi-diannao style,
ACC2: NVDLA style). We use evaluation workload A presented in Section 5.
The left- and right-most points represent the ACC1 and ACC2 monolithic
designs.
with any mapping, which reported 96.1% accuracy against
RTL simulation [19] and real processing time measured on a
chip [6]. Although MAESTRO supports any mapping, it does
not model a multi-DNN sub-accelerator environment, which
is crucial for HDA evaluations. Therefore, we extend MAESTRO
to support a multi-DNN accelerator environment with
heterogeneity. Herald models the memory requirement for
the global buffer and the data movement between the global
buffer and the sub-accelerator buffers. The modeling follows
the methodology proposed by MAESTRO, which identifies the
amount of reuse and derives from it the activity counts
(for energy) and the communication/computation delay
(for latency). In addition to the same analytic equations,
Herald considers the layer execution schedule generated by
the scheduler we develop (discussed in Section 4.4) by
modeling non-synchronized execution of sub-accelerators
(i.e., each sub-accelerator starts processing a layer as soon
as its input data are available). For estimating the latency
and energy of individual sub-accelerator runs, we use the
original MAESTRO cost model.
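Non-synchronized execution can be sketched in Python as follows. This is a simplified illustration, not Herald's implementation: it assumes each model is a linear chain of layers, that per-layer latencies on the assigned sub-accelerator are already known, and that the schedule is an ordered list of layers per sub-accelerator. The function name `simulate` and the sample data are hypothetical.

```python
def simulate(schedule, latency):
    """Derive start/end times when each sub-accelerator runs independently.

    schedule -- {acc: [(model, layer_idx), ...]} in execution order per acc
    latency  -- {(model, layer_idx): time on its assigned sub-accelerator}
    Returns {(model, layer_idx): (start, end)}.
    """
    times = {}
    acc_free = {acc: 0 for acc in schedule}   # when each acc becomes idle
    pos = {acc: 0 for acc in schedule}        # next queue entry per acc
    remaining = sum(len(queue) for queue in schedule.values())
    while remaining:
        progressed = False
        for acc, queue in schedule.items():
            if pos[acc] == len(queue):
                continue
            model, idx = queue[pos[acc]]
            producer = (model, idx - 1)
            if idx > 0 and producer not in times:
                continue  # input data not produced yet; revisit later
            ready = times[producer][1] if idx > 0 else 0
            # A layer starts as soon as its accelerator is free and its
            # input activation is available.
            start = max(acc_free[acc], ready)
            end = start + latency[(model, idx)]
            times[(model, idx)] = (start, end)
            acc_free[acc] = end
            pos[acc] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            raise ValueError("schedule violates the dependence order")
    return times


# Hypothetical two-model, two-accelerator schedule.
example_schedule = {"acc1": [("A", 0), ("B", 1)], "acc2": [("B", 0), ("A", 1)]}
example_latency = {("A", 0): 3, ("B", 0): 5, ("A", 1): 2, ("B", 1): 4}
timeline = simulate(example_schedule, example_latency)
makespan = max(end for _, end in timeline.values())
```

The overall latency of the workload is the largest completion time across sub-accelerators (`makespan` above).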
4.3 Accelerator Design Space Exploration
Herald models accelerators using various hardware parameters
such as the number of PEs, network-on-chip (NoC) bandwidth,
NoC latency, global memory size, and global memory
bandwidth. Herald exposes a unit cost database so that users
can update the unit area/energy costs of each hardware
component and evaluate HDAs under their own environment
(technology node, etc.).
Herald’s design space exploration (DSE) tool receives the
total number of PEs, memory size, and memory/NoC bandwidth
as inputs, which describe the overall hardware budget
for an accelerator chip. Unlike monolithic DNN accelerators,
which devote the full budget to a single accelerator substrate,
HDAs need to distribute these resources among sub-accelerators.
However, evenly distributing the resources (i.e., naive HW
resource partitioning) does not yield optimal HDAs
because each accelerator’s mapping style has a different optimal
balance between the number of PEs, memory size, and
bandwidth, as discussed in Section 3.3. Therefore, Herald
explores the partitioning space for each resource type, which
constitutes a nested combinatorial optimization problem
(a nested resource partitioning problem).
We implement a combinatorial optimization framework
upon MAESTRO [17], an analytic cost model for DNN accelerators
that estimates the latency and energy for a given
layer, mapping, and set of hardware design parameters (number of
PEs, NoC bandwidth, memory size, etc.). In Figure 8, we
show an example design space from PE partitioning over
two sub-accelerators in a 16K-PE HDA, fixing other design
parameters and assuming naive bandwidth partitioning (128/128
GBps). We observe that the cost varies irregularly across the
design space, so even partitioning (8K/8K PEs) does not yield
the most efficient HDA.
Based on user-specified framework options, Herald’s de-
sign space exploration (DSE) tool either performs an exhaus-
tive search, binary sampling-based search, or random search.
Binary sampling-based search first evaluates 2n design points
with a regular interval with user-specified parameter n. After-
ward, Herald selects an interval between two adjacent evalu-
ated design points with the lowest average cost and performs
an exhaustive search over the selected interval. The random
search follows a similar approach as the binary-sampling-
based search but selects random pre-evaluated design points.
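The sampling-based search described above can be sketched in Python for a single partitioning knob. This is a minimal illustration under stated assumptions: the knob is an integer (for example, the number of PEs given to one sub-accelerator), `cost` is any callable standing in for the cost model, and the number of regularly spaced samples is passed directly as `num_samples` rather than derived from the user parameter n. All names are hypothetical.

```python
def sampling_based_search(cost, lo, hi, num_samples, step=1):
    """Coarse sampling followed by an exhaustive pass over one interval.

    cost        -- callable mapping a design point to its cost (e.g., EDP)
    lo, hi      -- inclusive integer range of the partitioning knob
    num_samples -- number of regularly spaced points evaluated first
    step        -- granularity of the exhaustive pass
    """
    if num_samples < 2 or hi <= lo:
        raise ValueError("need hi > lo and at least two samples")
    stride = (hi - lo) / (num_samples - 1)
    samples = sorted({lo + round(i * stride) for i in range(num_samples)})
    costs = {point: cost(point) for point in samples}
    # Select the interval between two adjacent samples with the lowest
    # average cost (comparing sums is equivalent to comparing averages).
    left, right = min(zip(samples, samples[1:]),
                      key=lambda pair: costs[pair[0]] + costs[pair[1]])
    # Exhaustively evaluate every candidate inside the selected interval.
    best = min(range(left, right + 1, step), key=cost)
    return best, cost(best)


# Hypothetical smooth cost with its minimum at 37.
best_point, best_cost = sampling_based_search(lambda p: (p - 37) ** 2, 0, 100, 5)
```

Because the design space in Figure 8 is irregular, the interval with the lowest average sampled cost need not contain the global optimum, so the sample count trades search time against the risk of missing it.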
4.4 Layer Scheduling
The goal of scheduling in Herald is to minimize the energy
consumption and latency of an HDA by exploiting the different
preferences of each layer for accelerators. However, the layer
scheduling space is massive. For example, 2.54×10^21 possible
layer execution schedules exist for AR/VR workload A in
Table 2 even if we only consider permutations of the layers on
a single accelerator. To deal with such a large search space,
we develop a heuristic that exploits the characteristics of DNN
workloads to reduce the scheduling overhead. The two major
characteristics we exploit concern the dependence among layers:
layers form a mostly linear dependence chain in most models, and
they are independent across models. In Figure 9, we present
an overview of the layer scheduling flow, which consists
of three processes: layer assignment to each sub-accelerator,
layer ordering within each sub-accelerator, and re-ordering
as post-processing. Herald implements a two-step scheduler to
perform the three processes.
Step 1: Layer Assignment and Ordering. We describe
the layer assignment and ordering algorithm in Figure 10.
The algorithm iterates over the frontier layers (i.e., the layers
to be executed first, based on the dependence, from each model
in the workload) until it schedules all the layers. For each
frontier layer, the algorithm first queries the cost of execution
on each sub-accelerator and identifies the best-fit sub-accelerator.
Figure 10 uses the energy-delay product (EDP) as the metric
for illustration, while Herald supports various metrics
(e.g., latency, energy, and custom user-defined metrics).
Once it identifies the best-fit sub-accelerator, the scheduler
checks (1) dependence, (2) memory, and (3) load-balancing
conditions. The dependence check tests whether the execution
of the layer preceding the layer to schedule is complete.
The memory check tests whether scheduling the layer requires
more memory space than is available at the time of scheduling,
considering previously scheduled layers.
The load-balancing check tests whether scheduling the layer
results in an extreme load imbalance as defined by a user
parameter, the load-balancing factor, which sets the maximum
ratio between the latencies of the slowest- and
fastest-completing sub-accelerators.
For example, if the load-balancing factor is two, the slowest
sub-accelerator must complete all the scheduled layers within
2 × Latency_fastest_acc, where Latency_fastest_acc is the latency
[Figure 9 diagram: the layers of Model A and Model B pass through layer assignment to Acc1 and Acc2, layer ordering, and schedule construction (initial scheduling), followed by re-ordering for gap-filling (post-processing). The resulting schedules are drawn as energy-over-time timelines for each sub-accelerator.]
Figure 9: An overview of three processes to schedule layers on HDAs, and the boundary of two-phase scheduler implementation of Herald. Circled
numbers represent layers in each model.
Inputs
 - A list of hardware parameters of sub-accelerators (Accs)
 - A list of DNN models to run, sorted in the dependence order (MD)
 - Load-balancing factor (LbF)
Outputs
 - A list of (schedule time, layer ID, model) (Schedule)
 - A list of completion times for each sub-accelerator (Tot_Latency_Acc)
cycle = 0;
while MD.notEmpty do
   assigned_a_layer = false; // Reset the flag for this scheduling pass
   for model in MD do
     layer = model.head;
     // Get EDP/Latency for the layer on each acc
     (EDP, Latency) = MAESTRO_Herald.query(layer, Schedule, Accs);
     best_fit_acc = getAccIndex(min(EDP));
     // Check dependence, memory size, and load-balancing conditions
     dependence_cond = is_prev_layer_complete(Schedule, model, cycle);
     mem_size_cond = MemorySize(cycle, Schedule) + cost.getMemSizeReq
                                      < MemorySize;
     load_balance_cond = max(Tot_Latency_Acc)
                                           < LbF * (Tot_Latency_Acc[acc] + Latency[best_fit_acc]);
     if dependence_cond and mem_size_cond then
       if load_balance_cond then
         Tot_Latency_Acc[best_fit_acc] += Latency[best_fit_acc]; // Assign layer
         PopLayer(MD, layer);
         assigned_a_layer = true;
       else
         // Try the second-, third-, ... best-fit accelerator for load-balancing
       end if
     end if
     if assigned_a_layer then
       rearrange(MD); // Rearrange the order of models based on the layer ordering
                                //  strategy (depth-first, breadth-first, etc.) selected by users
       break;
     end if
   end
   cycle = nextLayerCompletionTime(Schedule); // Failed to schedule; defer execution
end
Figure 10: Layer assignment and ordering algorithm.
of the sub-accelerator completing earliest.
If only the load-balancing check fails, the scheduler tries
the second-best-fit sub-accelerator, and so on, until it finds an
alternative that meets all the conditions. If none is found, the
scheduler advances the scheduling cycle to the next completion
time of a layer on any of the sub-accelerators, tracked by
Tot_Latency_Acc in Figure 10, which represents the completion
time of each sub-accelerator. If all the conditions are met, the
scheduler assigns the layer to the best-fit sub-accelerator and
removes the scheduled layer from the corresponding model in the
model list (MD in Figure 10). The reference cycle then
advances in the same way as after a scheduling failure,
and the scheduler searches for another schedulable layer.
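The assignment step can be sketched in runnable Python as below. This is a simplified illustration rather than the algorithm of Figure 10: it replaces the reference-cycle mechanism with directly computed start times, visits models breadth-first, and folds the memory and load-balancing checks into an optional caller-supplied predicate `can_place`. The names `greedy_schedule`, `can_place`, and the sample cost table are hypothetical.

```python
def greedy_schedule(models, cost, num_accs, can_place=None):
    """Assign each frontier layer to its lowest-EDP acceptable sub-accelerator.

    models    -- {model_name: [layer, ...]}; each model is a linear chain
    cost      -- callable (layer, acc) -> (latency, energy); stands in for
                 the cost model
    num_accs  -- number of sub-accelerators
    can_place -- optional callable (acc, free_at, latency) -> bool standing
                 in for the memory and load-balancing checks
    Returns a list of (start_time, acc, model_name, layer_index).
    """
    if can_place is None:
        can_place = lambda acc, free_at, latency: True
    free_at = [0] * num_accs            # when each sub-accelerator is idle
    ready_at = {m: 0 for m in models}   # when each model's next layer may start
    head = {m: 0 for m in models}       # index of each model's frontier layer
    schedule = []
    while any(head[m] < len(layers) for m, layers in models.items()):
        for m, layers in models.items():
            if head[m] == len(layers):
                continue
            layer = layers[head[m]]
            # Rank sub-accelerators by EDP (latency x energy), best first.
            ranked = sorted(range(num_accs),
                            key=lambda a: cost(layer, a)[0] * cost(layer, a)[1])
            for acc in ranked:
                latency, _ = cost(layer, acc)
                if can_place(acc, free_at, latency):
                    # Dependence: wait for the previous layer of this model.
                    start = max(free_at[acc], ready_at[m])
                    free_at[acc] = ready_at[m] = start + latency
                    schedule.append((start, acc, m, head[m]))
                    head[m] += 1
                    break
            else:
                raise RuntimeError("no sub-accelerator accepts %r" % (layer,))
    return schedule


# Hypothetical (latency, energy) per (layer, sub-accelerator).
table = {("conv", 0): (4, 2), ("conv", 1): (2, 2),
         ("fc", 0): (1, 1), ("fc", 1): (3, 3),
         ("dw", 0): (5, 5), ("dw", 1): (1, 1)}
example = greedy_schedule({"A": ["conv", "fc"], "B": ["dw"]},
                          lambda layer, acc: table[(layer, acc)], num_accs=2)
```

Passing a `can_place` predicate that rejects a sub-accelerator makes the loop fall through to the second-best fit, mirroring the fallback described in the text.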
Before moving on to the next layer, the scheduler changes
the model to be scheduled next to implement the layer-ordering
strategy specified by users. Herald supports depth-first and
breadth-first ordering. Depth-first ordering schedules all
the layers within a model and then moves on to the next model.
Breadth-first ordering schedules one layer from each model
and repeats until the scheduler finishes. Such ordering is
possible without violating dependences since DNN models
mostly have linear dependence among layers. Even if there are
branches as in Inception [37], the layers within each branch
[Figure 11 timelines: energy over time on Acc 1 and Acc 2 for (a) Depth-first Layer Ordering, (b) Breadth-first Layer Ordering, and (c) Layer Ordering Post-processing, where re-ordering yields a latency reduction. The legend distinguishes layers from Model A and layers from Model B.]
Figure 11: Example timelines from different layer ordering meth-
ods on two accelerators and two DNN models. Numbers in each box
represent the layer number.
are also in linear dependence. However, these two ordering
methods mainly serve to reduce the complexity of scheduling
and do not guarantee the optimality of the resulting
schedule. Therefore, after we construct an initial schedule,
we perform post-processing to fix inefficiencies caused by the
simple layer ordering methods.
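The two ordering strategies can be illustrated with a short Python sketch that only produces the order in which layers are visited; it assumes linear dependence within each model and ignores accelerator assignment. The function names are hypothetical.

```python
from itertools import zip_longest


def depth_first(models):
    """Visit all layers of one model before moving on to the next model."""
    return [(m, i) for m, layers in models.items() for i in range(len(layers))]


def breadth_first(models):
    """Visit one layer from each model in turn until all models are done."""
    per_model = [[(m, i) for i in range(len(layers))]
                 for m, layers in models.items()]
    return [entry for group in zip_longest(*per_model)
            for entry in group if entry is not None]


# Hypothetical workload: model A has three layers, model B has two.
workload = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1"]}
df_order = depth_first(workload)
bf_order = breadth_first(workload)
```

Both orders keep the layers of each model in their original sequence, which is why neither violates a linear dependence chain.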
Step 2: Post-processing. Once an initial schedule is generated
by the first step, the scheduler in Herald further optimizes
the schedule by exploring layer execution reordering within
each sub-accelerator, which fills redundant idle time (i.e.,
gaps) caused by the layer ordering strategy chosen for Step 1. We
describe an example in Figure 11. Figure 11 (a) and (b)
show example initial schedules generated by Step 1 with
depth- and breadth-first layer ordering, respectively.
As the examples show, idle time exists in both. However,
some of this idle time is redundant.
For example, in Figure 11 (b), layers 2, 3, and 4 from model
B can be scheduled earlier, provided that doing so does not
violate the memory condition, as shown in Figure 11 (c).
We describe the post-processing algorithm, which eliminates
redundant idle time, in Figure 12. The post-processing
algorithm looks ahead over the scheduled layers from the
completion time of each layer, searching for other schedulable
layers within the idle time. The algorithm performs checks
similar to those the scheduler performs in Step 1 (dependence,
memory, and load-balancing). In addition to the three checks,
the algorithm also checks whether the idle time (i.e., the gap in
the initial schedule) is sufficient for moving a layer from
later in the schedule to the beginning of the idle time.
Inputs
 - A list of hardware parameters of sub-accelerators (Accs)
 - A list of (schedule time, layer ID, model) (Schedule)
 - Look-ahead depth (LA)
Outputs
 - An updated schedule (Schedule)
for acc in Accs do
  for baseLayerIdx in NumLayers(Schedule[acc]) do
    look_ahead = 1
    while look_ahead < LA do
       prev_completion_time = Schedule[acc][baseLayerIdx].completion_time
       test_layer = Schedule[acc][baseLayerIdx + look_ahead]
       // Test dependence, memory, load-balancing, and schedule overlap
       if is_schedulable(test_layer, prev_completion_time, Schedule) then
          // Reorder the test layer
          UpdateSchedule(Schedule, test_layer, prev_completion_time)
       end if
       look_ahead = look_ahead + 1 // Advance to the next candidate layer
    end
  end
end
Figure 12: Post-processing algorithm that minimizes the idle time.
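A simplified version of the gap-filling idea can be written in Python as follows. The sketch handles one sub-accelerator's timeline, checks only dependence and whether the moved layer fits in the gap, and leaves out the memory and load-balancing checks as well as any later compaction of the vacated slot. The data layout and the name `fill_gaps` are assumptions made for illustration.

```python
def fill_gaps(timeline, done, look_ahead):
    """Move later layers into idle gaps on one sub-accelerator.

    timeline   -- list of [start, end, model, layer_idx], sorted by start
    done       -- {(model, layer_idx): completion time} across all
                  sub-accelerators; updated for layers that are moved
    look_ahead -- number of later entries inspected for each gap
    Mutates and returns `timeline`.
    """
    i = 0
    while i + 1 < len(timeline):
        gap_start, gap_end = timeline[i][1], timeline[i + 1][0]
        last = min(i + 2 + look_ahead, len(timeline))
        for j in range(i + 2, last):
            start, end, model, idx = timeline[j]
            # Dependence: the producer layer must have completed.
            ready = done[(model, idx - 1)] if idx > 0 else 0
            new_start = max(gap_start, ready)
            duration = end - start
            if new_start + duration <= gap_end:  # fits without overlap
                entry = timeline.pop(j)
                entry[0], entry[1] = new_start, new_start + duration
                done[(model, idx)] = entry[1]
                timeline.insert(i + 1, entry)
                break
        i += 1
    return timeline


# Hypothetical timeline with an idle gap between time 2 and time 6.
acc_timeline = [[0, 2, "A", 0], [6, 8, "A", 1], [8, 10, "B", 1]]
completion = {("A", 0): 2, ("A", 1): 8, ("B", 0): 3, ("B", 1): 10}
fill_gaps(acc_timeline, completion, look_ahead=2)
```

In the example, layer 1 of model B moves into the gap because its producer finished at time 3 and its two-unit duration fits before time 6.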
Table 2: Heterogeneous DNN workloads used for the
evaluation. We model AR/VR workloads using models
listed in Table 1. We also evaluate computer-vision networks
in MLPerf. The number of batches models the different
target frame rates of each sub-task.
Workload     Model                    # of batches
AR/VR-A      Resnet50                 2
AR/VR-A      Mask-RCNN                4
AR/VR-A      MobileNetV2              4
AR/VR-B      Resnet50                 2
AR/VR-B      Unet                     2
AR/VR-B      MobileNetV2              4
AR/VR-B      BR-Q Handpose            2
AR/VR-B      Focal Length DepthNet    2
MLPerf-CV    Resnet50                 1
MLPerf-CV    MobileNetV1              1
MLPerf-CV    SSD-Resnet34             1
MLPerf-CV    SSD-MobileNetV1          1
5. CASE STUDIES
To show the potential of HDAs as a future-proof option, we perform
case studies on HDA designs and layer execution schedules
generated by Herald using the heterogeneous workloads
listed in Table 2.
5.1 Case Study Settings
Workloads. Based on the AR/VR-motivated DNN models
listed in Table 1, we model AR/VR workloads as listed in Table 2.
For each DNN model, we assign a different number
of batches to model the different target processing rate of each
sub-task. In addition to the AR/VR workloads, we also evaluate
the multi-stream MLPerf inference workload related to computer
vision, considering the motivation toward AR/VR.
Mapping. Although Herald can handle an arbitrary number of
mapping styles in sub-accelerators, we combine two and three
distinct mapping styles from recent DNN accelerators
(Shi-diannao [8], NVDLA [26], and Eyeriss [5]). We select these
mapping styles for their distinct parallel unrolling (or
partitioning) strategies, to maximize synergy among mapping
styles. For example, Shi-diannao parallelizes output rows and
columns over PEs while NVDLA parallelizes input and output
channels, so the two have significantly different preferences
for layer shapes. Also, the Shi-diannao style can run depth-wise
convolution more efficiently than NVDLA; NVDLA is optimized
for channel-wise accumulation, but depth-wise convolution
does not have channel-wise accumulation. However, NVDLA
Table 3: Three hardware parameter settings we use for
the evaluation. For heterogeneous accelerators, each setting
indicates the total amount of hardware resources to
be partitioned among sub-accelerators.
Acc. ID Num. of PEs NoC BW Glob. Memory
Small (Edge) 1024 16 GB/s 4 MiB
Medium (Mobile) 4096 64 GB/s 8 MiB
Large (Cloud) 16384 256 GB/s 16 MiB
[Figure 13 plots: (a) The impact of workload: average latency, energy, and EDP improvements (%) for workloads A and B and their average. (b) The impact of scheduling: average EDP improvements (%) with the baseline scheduler and Herald for workloads A and B on the large and small accelerators. (c) The impact of workload change: EDP (J x s) of the design optimized for workload A, the design optimized for workload B, and the monolithic design, for workloads A and B on the large and small accelerators.]
Figure 13: The impact of workload and scheduling on the EDP of
large and medium HDAs. (a) The average latency, energy, and EDP
improvements of HDAs compared to the best monolithic accelerator
for each workload. (b) The average EDP improvements compared to
the best monolithic design using baseline and Herald’s scheduler.
provides higher efficiency for fully-connected (FC) layers be-
cause FC layers are equivalent to CONV2D layers with only
one output activation per each input and output channel. That
is, combining those two mapping styles in an HDA and run-
ning each layer on an appropriate sub-accelerator can provide
latency and energy benefits.
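The different layer-shape preferences of the two mapping styles can be illustrated with a toy mapping-utilization calculation, following the definition in Figure 7 (PEs with computation mapped divided by the total number of PEs). The sketch assumes a two-dimensional PE array in which each array dimension hosts one parallelized layer dimension, and it ignores tiling over time, NoC, and memory effects. The array size, layer shapes, and function names are hypothetical and do not reproduce the values in Figure 7.

```python
def mapping_utilization(par_dims, pe_rows, pe_cols):
    """Fraction of PEs that receive work in a single pass.

    par_dims         -- sizes of the two layer dimensions being parallelized
    pe_rows, pe_cols -- shape of the (assumed 2-D) PE array
    """
    d0, d1 = par_dims
    used = min(d0, pe_rows) * min(d1, pe_cols)
    return used / (pe_rows * pe_cols)


def nvdla_style(layer, pe_rows, pe_cols):
    # Parallelizes input channels (C) and output channels (K). Depth-wise
    # convolution has no cross-channel accumulation, so only one input
    # channel contributes to each output channel.
    c = 1 if layer["depthwise"] else layer["C"]
    return mapping_utilization((c, layer["K"]), pe_rows, pe_cols)


def shidiannao_style(layer, pe_rows, pe_cols):
    # Parallelizes output rows (Y') and output columns (X').
    return mapping_utilization((layer["Y_out"], layer["X_out"]), pe_rows, pe_cols)


# Hypothetical layer shapes evaluated on a 4x4 PE array.
early = {"C": 3, "K": 64, "Y_out": 112, "X_out": 112, "depthwise": False}
late = {"C": 512, "K": 512, "Y_out": 2, "X_out": 2, "depthwise": False}
dwise = {"C": 32, "K": 32, "Y_out": 56, "X_out": 56, "depthwise": True}
results = {name: (nvdla_style(layer, 4, 4), shidiannao_style(layer, 4, 4))
           for name, layer in [("early", early), ("late", late), ("dwise", dwise)]}
```

Under these assumptions the channel-parallel style is fully utilized only on the channel-rich late layer, while the output-parallel style is fully utilized on layers with large output activations, which is the complementarity an HDA exploits.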
Cost estimation. As we discussed in Section 4.2, we extend
MAESTRO for the latency and energy estimation.
Accelerators. Based on previously proposed cloud and mobile
accelerators, Cloud TPU [15] and Qualcomm Hexagon [29],
we select three accelerator settings with 1K, 4K, and 16K PEs,
as described in Table 3. We also scale the network-on-chip
(NoC) bandwidth and global memory correspondingly. We estimate
the extra energy cost of the reconfigurability of MAERI based
on the open-sourced RTL by running a CAD tool chain with a 28nm
library, as MAESTRO’s reference hardware cost model did. We
also model homogeneous multi-sub-accelerator chips [36] with
evenly partitioned hardware resources and Herald’s scheduler.
Schedulers. We apply the scheduling algorithm we discussed
in Section 4.4 in Herald. We compare the EDP of heteroge-
neous accelerator designs with the best EDP for each experi-
ment based on a baseline scheduler and Herald’s scheduler.
The baseline scheduler performs EDP-greedy layer selection
and depth-first layer ordering discussed in Section 4.4.
5.2 Results
From the data we collected, we highlight
the observations that provide useful insights.
5.2.1 Costs and Benefits of HDA
Figure 14 presents the estimated latency and energy of HDAs.
[Figure 14 plots: latency (sec) versus energy (mJ) in nine panels, (a) through (i), covering the AR/VR-A, AR/VR-B, and MLPerf-CV workloads on the small, medium, and large accelerators. Monolithic design points are labeled RS Style, Shi-diannao Style, and NVDLA Style. Legend: Monolithic Acc., DLA-Shi-RS HDA, DLA-RS HDA, DLA-Shi HDA, Shi-RS HDA, Flexible Acc., Homogeneous Multi-DNN, and Pareto Curve.]
Figure 14: Design space of two- and three-way heterogeneous DNN accelerators on workloads listed in Table 2. Each point represents a design
point (i.e., HW partitioning choice) with an optimized schedule for the design point. We label each monolithic design point in each plot.
Table 4: Hardware resource partitioning optimization
results for two-way HDA based on NVDLA and Shi-
diannao style accelerators. (NVDLA/Shi-diannao)
Setting BW partitioning PE Partitioning
AR/VR-A, Small 4 / 12 128 / 896
AR/VR-A, Medium 40 / 24 1792 / 2304
AR/VR-A, Large 224 / 32 9728 / 6656
AR/VR-B, Small 4 / 12 128 / 896
AR/VR-B, Medium 48 / 16 1536 / 2560
AR/VR-B, Large 128 / 128 12032 / 4352
MLPerf-CV, Small 4 / 12 64 / 960
MLPerf-CV, Medium 32 / 32 1280 / 2816
MLPerf-CV, Large 160 / 96 8192 / 8192
We observe that well-optimized HDA points are always on
the Pareto curve, and monolithic designs are not. The flexible
accelerator was on the Pareto curve on AR/VR-A (small and
medium accelerators) and MLPerf-CV (medium accelerator).
On average, compared to the best monolithic design with the
lowest EDP, the best heterogeneous design provided 56.0%
EDP improvements across all the case studies in Figure 14.
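For readers who want to reproduce this kind of comparison on their own design points, the Python sketch below shows one way to extract a Pareto front in latency-energy space and compute an EDP improvement. The design points are made-up numbers for illustration only, not results from Figure 14, and the function names are hypothetical.

```python
def pareto_front(points):
    """Return the (latency, energy) points that no other point dominates.

    A point is dominated if another, different point is no worse in both
    latency and energy (lower is better for both).
    """
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1]
                        for q in points)
        if not dominated:
            front.append(p)
    return front


def edp(point):
    """Energy-delay product of a (latency, energy) design point."""
    latency, energy = point
    return latency * energy


def improvement_percent(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100 * (baseline - new) / baseline


# Made-up design points as (latency in sec, energy in mJ).
monolithic, hda, flexible, other = (4.0, 3000), (3.0, 2800), (2.5, 3300), (3.5, 3100)
front = pareto_front([monolithic, hda, flexible, other])
gain = improvement_percent(edp(monolithic), edp(hda))
```

With these inputs the front contains one lower-energy and one lower-latency point, mirroring the qualitative relation between HDAs and flexible accelerators reported here.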
Optimal HW Resource Partitioning. As we observed
in Figure 8, naive (i.e., even) partitioning of hardware resources
often leads to sub-optimal HDAs. The case study results on
optimal hardware partitioning show the same trend.
In Table 4, we list the hardware resource partitioning
of the design points with the best EDP for two-way HDAs based
on NVDLA and Shi-diannao style mappings. We observe
that the optimal hardware partitioning is not trivial, which
necessitates a systematic approach like Herald. This is because
a larger number of active PEs requires more bandwidth, and the
number of active PEs is a complex high-dimensional function
of layer operation, layer size, number of PEs, mapping,
and so on [17]. Also, we observe that homogeneous multi-DNN
design points often provide similar latency and energy, as
shown in Figure 14. In such cases, we found that the
optimization framework assigned minimum hardware
resources to all sub-accelerators except one and tried to
maximize the gain from the only large sub-accelerator.
[Figure 15 plots: latency (sec) versus energy (mJ) for (a) UNet, Large Accelerator and (b) Resnet50, Large Accelerator. Monolithic design points are labeled RS Style, Shi-diannao Style, and NVDLA Style. The legend distinguishes heterogeneous design points from monolithic design points.]
Figure 15: Design space on single DNN use cases on (a) UNet and (b)
Resnet50 based on large accelerator settings in Table 3.
Impact of workloads. Each row in Figure 14 shows the
latency-energy space of the monolithic designs, two- and
three-way HDAs, and flexible accelerators we evaluate. As
expected, the design space depends on the workload. In
particular, we observe that a workload with more heterogeneity
and more layers, such as AR/VR-B, is friendlier to
HDAs: it provides 86.8% latency and 6.61% energy improvements
over the best monolithic accelerators for each case study
in Figure 14, compared to 63.26% latency and 4.05% energy
improvements for AR/VR-A and 3.75% latency and 8.13%
energy improvements for MLPerf-CV.
Single-DNN Case. Even for a single DNN, HDAs can still
exploit layer parallelism and heterogeneity within a model by
batch-processing the workload. We run UNet and Resnet50
with a batch size of four on the large accelerator setting.
We observe that the best monolithic accelerator design is
on the Pareto curve, unlike with the compound workloads we
target. However, HDA designs still provide latency and energy
benefits over monolithic designs. For the UNet and Resnet50
workloads, the best HDA provided 26.4% and 48.1% EDP
improvements over the best monolithic design, respectively.
Compared to the flexible accelerators, HDAs provided 11.7%
and 15.8% lower energy for UNet and Resnet50, respectively.
For latency, flexible accelerators provided 22.5% and 29.0%
lower latency than HDAs for UNet and Resnet50, respectively.
Impact of scheduling algorithm. We evaluate the EDP of
HDAs with a baseline scheduler and Herald’s scheduler. For
the baseline, we implement a scheduler with EDP-greedy
Table 5: Average time required for scheduling for each
hardware design point (i.e., HW partition choice).
Workload Layers Sub-accs Time per HW DP (s)
AR/VR-A 448 2 2.89
AR/VR-A 448 3 4.32
AR/VR-B 618 2 3.98
AR/VR-B 618 3 10.74
MLPerf-CV 168 2 1.17
MLPerf-CV 168 3 1.67
layer assignment and depth-first layer ordering discussed
in Section 4.4. In Figure 13 (b), we show the average EDP
improvements of the best HDA design points for each experiment
configuration with the baseline and Herald’s schedulers.
Across all the experiment settings, Herald’s scheduler
identified efficient designs with 24.9% average
EDP improvement, while the baseline scheduler provided
13.6% EDP improvement on average. In addition, on the
small accelerator with workload A, the EDP optimized with the
baseline scheduler was worse than that of the best monolithic
design, while Herald’s scheduler identified an optimized
schedule that improves EDP over that monolithic baseline.
Comparison against flexible accelerators. We evaluate a
MAERI [19] style flexible accelerator using the same hardware
parameters as the large accelerator setting. We plot the
design point of the flexible accelerator in Figure 14. Compared
to the best HDA design points for each evaluation, flexible
accelerator designs provided 22.9%, 21.5%, and 24.0% lower
latency for the AR/VR-A, AR/VR-B, and MLPerf-CV workloads,
respectively. However, flexible accelerator designs required
18.7%, 15.5%, and 18.9% more energy for each workload,
respectively. The extra energy cost of the flexible accelerator
stems from the extra hardware components for reconfigurability.
In contrast, an HDA can keep its sub-accelerators
relatively simple compared to flexible accelerators, which
leads to the energy savings we present.
Results in Figure 14 show that heterogeneous and
flexible accelerators are both Pareto-optimal. HDAs are
beneficial for energy, while flexible accelerators are beneficial
for latency. The size of the latency and energy benefits varies
depending on the workload. Therefore, the choice between flexible
and heterogeneous accelerators depends on the performance
goal, energy constraints, and the target workload.
Impact of workload change. Since DNN models evolve
and applications change their inner implementation accordingly,
the workload can change after the deployment of
an HDA. Figure 13 (c) shows a case study of such a scenario.
That is, the case study represents the use of Herald as a
compile-time optimizer (i.e., scheduler). We observe that the
large accelerator is less sensitive to the workload change, still
showing benefits over the best monolithic design, with a 14.1% EDP
increase on average compared to HDAs originally optimized for
the new workloads. However, when the amount of overall
hardware resources is smaller (e.g., medium accelerator),
we observe a 26.2% EDP increase on average compared to HDAs
originally optimized for the new workloads. The results imply that
heterogeneous DNN accelerators prefer large accelerators.
Execution time of Herald. Although the scheduling in Herald
is designed to be offline, the scheduler is lightweight.
We run Herald on a laptop with an i9-9880H processor and 16GB
of memory and present the time required for scheduling on
each hardware design point in Table 5; we report the time per
design point because the overall runtime heavily depends on
user parameters (e.g., search granularity, strategy, etc.).
On average, the scheduling requires 9.48 ms
per layer and per HDA design point.
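The per-layer figure can be cross-checked against Table 5 with a few lines of Python. The sketch assumes the average is taken over the six table rows after dividing each scheduling time by the layer count; the paper does not state the exact averaging method, so this is only one plausible reading.

```python
# (workload, number of layers, number of sub-accelerators, seconds per HW design point)
table5 = [
    ("AR/VR-A", 448, 2, 2.89),
    ("AR/VR-A", 448, 3, 4.32),
    ("AR/VR-B", 618, 2, 3.98),
    ("AR/VR-B", 618, 3, 10.74),
    ("MLPerf-CV", 168, 2, 1.17),
    ("MLPerf-CV", 168, 3, 1.67),
]

# Scheduling time per layer for each row, in milliseconds.
per_layer_ms = [1000 * seconds / layers for _, layers, _, seconds in table5]
avg_ms = sum(per_layer_ms) / len(per_layer_ms)
print(f"{avg_ms:.2f} ms per layer")
```

Under this reading the mean comes out at roughly 9.47 ms per layer, in line with the reported 9.48 ms; the small difference is presumably due to rounding of the table entries.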
Summary. We summarize our main observations below:
• The design space of HDA is not trivial, which requires
a systematic co-optimization of hardware resource parti-
tioning and layer execution schedule.
• HDAs outperform or match the monolithic and flexible
accelerators, as Pareto curves in Figure 14 show.
• HDA and flexible accelerators are beneficial for energy
efficiency and latency, respectively, while maintaining
similar overall EDP (often both of them are on the Pareto
curve in latency-energy space).
• Simple combination of homogeneous sub-accelerators
does not provide Pareto-optimal design points.
6. RELATED WORKS
DNN Dataflows and Accelerators. Shi-diannao [8] is a
CNN accelerator designed to be embedded near sensors,
which exploits convolutional reuse via an output-stationary
style dataflow. Eyeriss [6] is one of the state-of-the-art low-
energy DNN accelerators that introduced dataflow taxonomy
and a new dataflow style, row-stationary. Fused-layer CNN
accelerator [2] exploited fine-grained pipelined layer par-
allelism that minimizes activation data movement among
memory hierarchy. Flexflow [23] is a DNN accelerator that
supports three distinct dataflow styles on a DNN accelera-
tor with an analytic model used to identify the best dataflow
based on PE utilization. Tensor Processing Unit (TPU) [15] is
a systolic array-based DNN accelerator designed for cloud
workloads in data centers. MAERI [19] is a flexible dataflow
DNN accelerator that also efficiently supports irregular
mappings resulting from sparsity, cross-layer mapping [2], and
so on. Tangram [9] is a DNN accelerator that explored pipelined
layer parallelism within a model with a dataflow optimized for
such back-to-back layer execution. Interstellar [41] presented
the importance of loop blocking (tile sizing) in DNN accel-
erators utilizing Halide [31]. Shen et al. explored the use of
homogeneous multi sub-accelerators termed as convolutional
layer processors for CNN processing in FPGAs [36].
Heterogeneous Accelerators. Chandramoorthy et al. [4]
explored an accelerator-rich chip multiprocessor that includes
various accelerators for different tasks in image processing.
Although the work included a convolution module among
sub-accelerators, the convolution module provides only one
dataflow style, focusing on general image kernels, not DNNs.
Master of None Acceleration [22] explored a heterogeneous
accelerator for analytical query and presented that the design
space of heterogeneous accelerators for the target domain has
both beneficial and disadvantageous design points.
7. CONCLUSION
In this paper, we explored the latency and energy optimization
opportunities of heterogeneous DNN accelerators on
recent real-time applications with multiple sub-tasks based on
DNNs. Because the efficiency of a DNN accelerator depends
on mapping, workload, and hardware design parameters at
the same time, identifying the best heterogeneous DNN
accelerator design point with an optimized schedule is challenging.
Therefore, we developed Herald, an automated design space
exploration and layer scheduling framework for heterogeneous
DNN accelerators. In our case studies, Herald identified
optimized design points and layer schedules, providing 56.0%
EDP benefits compared to the best monolithic design we
compare against. Our results show that the most efficient design
point has non-trivial hardware resource partitioning and that a
naive scheduler can result in EDP degradation, motivating a
systematic approach like Herald.
8. REFERENCES
[1] M. Abrash, “Inventing the future,”
https://www.oculus.com/blog/inventing-the-future, 2017.
[2] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer cnn
accelerators,” in The 49th Annual IEEE/ACM International
Symposium on Microarchitecture. IEEE Press, 2016, p. 22.
[3] Apple, “The future is here - iphone x (neural engine),” https:
//www.apple.com/newsroom/2017/09/the-future-is-here-iphone-x/,
2017.
[4] N. Chandramoorthy, G. Tagliavini, K. Irick, A. Pullini, S. Advani,
S. Al Habsi, M. Cotter, J. Sampson, V. Narayanan, and L. Benini,
“Exploring architectural heterogeneity in intelligent vision systems,” in
2015 IEEE 21st International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2015, pp. 1–12.
[5] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks,” in
International Symposium on Computer Architecture (ISCA), 2016.
[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An
energy-efficient reconfigurable accelerator for deep convolutional
neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1,
pp. 127–138, 2016.
[7] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible
accelerator for emerging deep neural networks on mobile devices,”
IEEE Journal on Emerging and Selected Topics in Circuits and
Systems, vol. 9, no. 2, pp. 292–308, 2019.
[8] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “Shidiannao: Shifting vision processing closer to the
sensor,” in International Symposium on Computer Architecture (ISCA),
2015.
[9] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “Tangram:
Optimized coarse-grained dataflow for scalable nn accelerators,” in
Proceedings of the Twenty-Fourth International Conference on
Architectural Support for Programming Languages and Operating
Systems. ACM, 2019, pp. 807–820.
[10] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril,
D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro et al., “Applied
machine learning at facebook: A datacenter infrastructure perspective,”
in 2018 IEEE International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 770–778.
[12] L. He, G. Wang, and Z. Hu, “Learning depth from single images with
deep neural network embedding focal length,” IEEE Transactions on
Image Processing, vol. 27, no. 9, pp. 4676–4689, 2018.
[13] Huawei, “Hiai,”
https://consumer.huawei.com/en/campaign/kirin-990-series/, 2019.
[14] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and
L. Van Gool, “Ai benchmark: Running deep neural networks on
android smartphones,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018.
[15] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter
performance analysis of a tensor processing unit,” in International
Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.
[16] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, “High-performance cnn
accelerator on fpga using unified winograd-gemm architecture,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27,
no. 12, pp. 2816–2828, 2019.
[17] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and
T. Krishna, “Understanding reuse, performance, and hardware cost of
dnn dataflow: A data-centric approach,” in Proceedings of the 52nd
Annual IEEE/ACM International Symposium on Microarchitecture,
2019, pp. 754–768.
[18] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and
A. Parashar, “Maestro: A data-centric approach to understand reuse,
performance, and hardware cost of dnn mappings,” IEEE Micro,
vol. 40, no. 3, pp. 20–29, 2020.
[19] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible
dataflow mapping over dnn accelerators via reconfigurable
interconnects,” in International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), 2018,
pp. 461–475.
[20] S. Lee, S. W. Oh, D. Won, and S. J. Kim, “Copy-and-paste networks
for deep video inpainting,” in Proceedings of the IEEE International
Conference on Computer Vision, 2019, pp. 4413–4421.
[21] J. Li, Y. Chi, and J. Cong, “Heterohalide: From image processing dsl
to efficient fpga acceleration,” in The 2020 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, 2020, pp. 51–57.
[22] A. Lottarini, J. P. Cerqueira, T. J. Repetti, S. A. Edwards, K. A. Ross,
M. Seok, and M. A. Kim, “Master of none acceleration: a comparison
of accelerator architectures for analytical query processing,” in
Proceedings of the 46th International Symposium on Computer
Architecture. ACM, 2019, pp. 762–773.
[23] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible
dataflow accelerator architecture for convolutional neural networks,”
in 2017 IEEE International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2017, pp. 553–564.
[24] M. Madadi, S. Escalera, X. Baró, and J. Gonzalez, “End-to-end global
to local cnn learning for hand pose recovery in depth data,” arXiv
preprint arXiv:1705.09606, 2017.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” arXiv
preprint arXiv:1801.04381, 2019.
[26] NVIDIA, “Nvdla deep learning accelerator,” http://nvdla.org, 2017.
[27] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An
accelerator for compressed-sparse convolutional neural networks,”
ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp.
27–40, 2017.
[28] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, “Srfeat: Single
image super-resolution with feature discrimination,” in Proceedings of
the European Conference on Computer Vision (ECCV), 2018, pp.
439–455.
[29] Qualcomm, “Qualcomm hexagon 680,”
https://www.hotchips.org/wp-content/uploads/hc_archives/hc27/
HC27.24-Monday-Epub/HC27.24.20-Multimedia-
Epub/HC27.24.211-Hexagon680-Codrescu-Qualcomm.pdf, 2015.
[30] S. Rabii, E. Beigne, V. Chandra, B. D. Salvo, R. Ho, and R. Pendse,
“Computational directions for augmented reality systems,” in 2019
IEEE Symposium on VLSI Circuits, plenary, 2019.
[31] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and
S. Amarasinghe, “Halide: a language and compiler for optimizing
parallelism, locality, and recomputation in image processing pipelines,”
ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519–530, 2013.
[32] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko, “Toward driving
scene understanding: A dataset for learning driver behavior and causal
reasoning,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 7699–7707.
[33] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional
networks for biomedical image segmentation,” in International
Conference on Medical image computing and computer-assisted
intervention. Springer, 2015, pp. 234–241.
[34] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang,
B. Keller, A. Klinefelter, N. Pinckney, P. Raina et al., “Simba: Scaling
deep-learning inference with multi-chip-module-based architecture,”
in Proceedings of the 52nd Annual IEEE/ACM International
Symposium on Microarchitecture, 2019, pp. 14–27.
[35] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao,
A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural
models to fpgas,” in 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.
[36] Y. Shen, M. Ferdman, and P. Milder, “Maximizing cnn accelerator
efficiency through resource partitioning,” in 2017 ACM/IEEE 44th
Annual International Symposium on Computer Architecture (ISCA).
IEEE, 2017, pp. 535–547.
[37] C. Szegedy et al., “Going deeper with convolutions,” in Conference on
Computer Vision and Pattern Recognition (CVPR), 2015.
[38] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing
the gap to human-level performance in face verification,” in
Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 1701–1708.
[39] Y. Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of
deep-neural-network-driven autonomous cars,” in Proceedings of the
40th international conference on software engineering, 2018, pp.
303–314.
[40] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan,
K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., “Machine learning at
facebook: Understanding inference at the edge,” in 2019 IEEE
International Symposium on High Performance Computer Architecture
(HPCA). IEEE, 2019, pp. 331–344.
[41] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao,
H. Ha, P. Raina et al., “Interstellar: Using halide’s scheduling
language to analyze dnn accelerators,” in Proceedings of the
Twenty-Fifth International Conference on Architectural Support for
Programming Languages and Operating Systems, 2020, pp. 369–383.
