Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning
  Training Platforms by Rashidi, Saeed et al.
Saeed Rashidi∗, Srinivas Sridharan§, Sudarshan Srinivasan† , Matthew Denton∗ and
Tushar Krishna∗
∗Georgia Institute of Technology, Atlanta, USA
§Facebook, Menlo Park, USA
†Intel, Bangalore, India
saeed.rashidi@gatech.edu, ssrinivas@fb.com, sudarshan.srinivasan@intel.com,
matthewdenton@gatech.edu, tushar@ece.gatech.edu
ABSTRACT
Deep Learning (DL) training platforms are built by interconnecting
multiple DL accelerators (e.g., GPU/TPU) via fast, customized
interconnects. As the size of DL models and the compute efficiency of
the accelerators have continued to increase, there has also been a
corresponding steady increase in the bandwidth of these interconnects.
Systems today provide 100s of gigabytes per second (GB/s) of
interconnect bandwidth via a mix of solutions such as Multi-Chip
packaging modules (MCM) and proprietary interconnects (e.g., NVLink)
that together form the scale-up network of accelerators. However, as
we identify in this work, a significant portion of this bandwidth goes
under-utilized. This is because (i) using compute cores for executing
collective operations such as all-reduce decreases overall compute
efficiency, (ii) there is memory bandwidth contention between the
accesses for arithmetic operations and those for collectives, and
(iii) there is significant internal bus congestion that increases
the latency of communication operations.
To address this challenge, we propose a novel microar-
chitecture, called Accelerator Collectives Engine (ACE), for
DL collective communication offload. ACE is an extension
to the conventional Network Interface (NIC) tuned to cope
with the high-bandwidth and low latency requirements of
scale-up networks and is able to efficiently drive the various
scale-up network systems (e.g. switch-based or point-to-point
topologies). We evaluate the benefits of the ACE with micro-
benchmarks (e.g. single collective performance) and popular
DL models using an end-to-end DL training simulator. ACE
significantly accelerates the average raw latency of collective
communication operations observed in DL training work-
loads such as all-reduce and all-to-all by 1.53× and 1.21×,
respectively. For modern DL workloads, ACE on average in-
creases the network bandwidth utilization by 1.97×, resulting
in 2.71× and 1.44× speedup in iteration time for ResNet-50
and GNMT, respectively.
1. INTRODUCTION
Deep Learning (DL) and Deep Neural Network (DNN) models are being
deployed pervasively across a wide range of real-world application
domains such as image classification, natural language processing, and
autonomous driving. The size and computational requirements of these
DNN models are growing at an unparalleled rate, 2× every 3.4 months
[5], to handle the unrelenting growth in data and workload
requirements. The advent of energy-efficient inference accelerators
capable of handling these large models and the need to accelerate
training when dealing with 10s to 100s of petabytes of input data are
raising the demand for faster and more efficient DL training solutions.
Figure 1: Baseline system vs. ACE. Each NPU connects to memory (MEM)
over the NPU-Memory bus and to the NIC over the NPU-NIC bus; the NICs
attach to the scale-up fabric, e.g. Torus (TPU) or Switch (DGX-2).
Bandwidth labels in the figure: ~900GB/s, ~500GB/s, 400GB/s
(intra-package), ~100GB/s (inter-package). Baseline: 1) NPU compute
used for collectives, 2) compute and collectives contend for
bandwidth, 3) extra latency due to gradients. ACE: 1) NPU dedicated to
DNN computation, 2) minimal contention for bandwidth, 3) latency
minimized.
Today’s DL training platforms are built by interconnecting
multiple accelerators (e.g., GPUs, TPUs, FPGAs), henceforth
called Neural Processing Units (NPUs) in the interest of gen-
erality, together through different network fabrics, running
distributed DL training algorithms. This is because a sin-
gle accelerator cannot satisfy the compute, memory, and I/O
requirements of today’s state-of-the-art DNNs. Distributed
DL training fundamentally involves splitting the DNN model,
training data, or both across multiple NPUs. These schemes
are referred to as model, data, and hybrid parallel respectively.
The parallelization strategy in turn dictates the commu-
nication required between NPUs. This communication is
"collective" in nature, i.e., all NPUs synchronize input/out-
put/error activations during the forward/backward pass, and
gradients during the backward pass. Specifically, two collective
operations, all-to-all and all-reduce, occur heavily during
distributed DL training. These operations are often latency-sensitive
(since the next layer of the DNN cannot proceed until gradients have
been synchronized¹) and bandwidth-hungry due to the large sizes of the
activations/gradients and limited network bandwidth, and hence can
easily become bottlenecks [30, 42]. There has thus been extensive work
on optimizing DL training communication on current platforms, spanning
algorithms [28, 15], frameworks [21, 4, 23], communication libraries
[22, 8, 35], and high-bandwidth interconnects [29].
arXiv:2007.00156v2 [cs.AR] 2 Jul 2020
Figure 2: Overview of collective communication operations used in DNN
training networks. Panels: reduce-scatter, all-gather, all-reduce, and
all-to-all, each shown across Nodes 0-3.
This work identifies a key challenge with the current mech-
anisms by which collective communication is orchestrated,
and shows that this would become worse in future systems.
Specifically, we focus on scale-up networks that connect accelerators
within one or more chassis directly, through an interconnect fabric,
and do not require CPU host involvement in the communication
data-path.²
Fig. 1 shows an NPU and its connected peripherals. We identify three
sources of inefficiency. (i) Compute: a portion of the compute
resources of the NPU is used for performing collective updates and
driving the network, which reduces the compute available to training
computations (e.g., GEMMs during the forward and backward passes),
(ii) Memory Bandwidth: the received gradients need to be written to
and read from memory, in turn reducing the bandwidth available for the
actual training computations, which are known to be highly
bandwidth-intensive in modern DNNs, and (iii) Communication Latency:
communicating the gradients over the NPU-NIC and NPU-MEM buses adds
latency overhead. To
the best of our knowledge, this is the first work to identify
these issues in both modern systems (via real experiments
on NVIDIA DGX-1 and Intel Xeon Skylake) and future sys-
tems (modeled in a detailed simulator). The implication of
these three issues is under-utilization of the available scale-up
bandwidth as systems scale.
To address these challenges, we propose Accelerator Collectives Engine
(ACE), a microarchitectural extension to the NIC designed to handle
distributed DL training collective operations and cope with the high
bandwidth demand of scale-up networks. With ACE, (i) NPU compute
resources are left entirely to the training algorithm, (ii) the memory
bandwidth is available exclusively for training computation, and (iii)
the NPU-MEM and NPU-NIC buses are significantly less congested and are
used only for the initial and final data transfers to/from ACE on-chip
memory. This is shown in Fig. 1.
Previous works have demonstrated the efficiency of collective offload
to either smart NICs or network switches for the scale-out network
that involves host CPU intervention [46, 20, 16, 30, 42, 32, 25].
However, these CPU-initiated schemes require explicit resource
reservations that will be exacerbated for shared memory systems as the
number of network packets grows, eventually limiting the scalability.
ACE is a smart NIC-based design for accelerator-based systems and can
be integrated in different system topologies such as switch-based or
point-to-point (e.g. Torus) systems, in contrast to switch offload
proposals, whose usability is limited to switch-based networks. We
contrast ACE against prior works in Table 1.
¹ We assume synchronous updates, which provide better guarantees at
convergence than asynchronous updates.
² In contrast, the scale-out network, i.e. the network connecting host
CPUs together through Ethernet or other network fabrics, requires
accelerators to rely on the host for both the control path and the
datapath. Hence, scale-out is used to connect individual scale-up
networks. This will be explored as part of future work.
This paper makes the following contributions:
• We identify a set of key challenges in the end-points
of DL training platforms related to compute, memory
bandwidth and communication latency that can limit the
utilization of available scale-up network bandwidth for
future training platforms.
• We propose a novel offload engine, called Accelerator
Collectives Engine (ACE), tuned to handle collective
communication and efficiently drive scale-up network
utilization.
• ACE achieves 1.53× speedup for all-reduce and 1.21×
for all-to-all collective operations (in terms of raw com-
munication latency). For modern DL workloads such as
ResNet-50 [19] and GNMT [48], ACE gets 2.71× and
1.44× better performance (on average) in iteration time
over the state-of-the-art bandwidth-equivalent baselines,
respectively. ACE improves network bandwidth utiliza-
tion by 1.97× compared to baseline systems.
The rest of the paper is organized as follows: Section 2
presents the necessary background for distributed training
systems. Section 3 establishes the challenges in scaling DL
training, specifically focused on critical bottlenecks in the
endpoint that inhibit efficient network utilization. Section 4
describes the ACE microarchitecture. This is followed by a
detailed description of our evaluation and simulation method-
ology in Section 5 and the experimental results in Section 6.
Next, we compare our work against related work in Section 7.
Finally, we conclude the paper in Section 8.
2. BACKGROUND
Training DNNs involves iteratively refining the parame-
ters (aka weights) of the network by solving a non-convex,
non-linear optimization problem to minimize a loss function.
Here, we provide background on distributed training [37, 11].
Parallelization. The most common parallelization technique for
speeding up DL training is called data parallelism. It replicates the
entire model on multiple nodes to take advantage of the large number
of input samples. Every node computes partially trained weight
gradients for its subset of samples of the input data, aka mini-batch.
At the end of each iteration, nodes exchange their partially trained
weight gradients and perform the SGD operation to update the weights
using the gradients accumulated from all nodes. The updated weights
are then used in the forward pass of the next iteration. In model
parallelism, all nodes have the same datasets and work on the same
mini-batch, but the model is divided among nodes. Each node thus
produces a part of the output activations and input gradients during
the forward pass and back-propagation, respectively, and these values
must be communicated across all nodes to enable forward pass and
back-propagation for all nodes. Table 2 describes the data being
exchanged for various parallelism approaches.

Table 1: Comparison of previous SmartNIC and switch offload schemes against ACE.
Scheme           Workload  Compute       Offload  Protocol       Topology      Aggregation
Mellanox [3]     HPC, DL   CPU or Accel  Switch   Infiniband     Switch-based  Collective
Barefoot [1]     DL        CPU or Accel  Switch   Ethernet       Switch-based  Collective
sPIN [20]        HPC       CPU           NIC      Ethernet       Switch-based  Collective
Triggered [46]   HPC       CPU           NIC      RDMA-based     Flexible      Collective
Inceptionn [31]  DL        Accel         NIC      Ethernet       Tree          Parameter Server
NVIDIA [13]      DL        Accel (GPU)   Switch   Shared Memory  Switch-based  Collective (All-Reduce only)
ACE              DL        Accel         NIC      RDMA-based     Flexible      Collective

Table 2: Communication payloads for different parallelism approaches.
FP: Forward Pass. BP: Back Pass.
Parallelism  Activations (FP)  Weight grad (BP)  Input grad (BP)
Data
Model
Hybrid       partially         partially         partially

Table 3: ResNet-50 communication time and NW utilization.
Number of cores  1     2     4     8     16
Time (ms)        89.8  45.1  22.2  18.4  18.0
Utilization      17%   34%   70%   84%   86%
Collective Communication Operations. Exchanges of input/weight
gradients and output activations among the nodes, depending on the
parallelism approach, are known as "collective communication". In
general, four different collective communication operations are the
main contributors to DNN training communication, as shown in Fig. 2:
(i) reduce-scatter, (ii) all-gather, (iii) all-reduce, and (iv)
all-to-all. Reduce-scatter reduces (e.g. sums) all data, initially
residing in the nodes, such that at the end each node has a portion of
the globally reduced data. All-gather gathers the data, initially
scattered across nodes, such that at the end all nodes have all of the
data. All-reduce can be thought of as a reduce-scatter followed by an
all-gather. In all-to-all, each node needs to send a different portion
of data to each of the other nodes. All-reduce is the dominant
communication pattern observed in DL training for exchanging gradients
and activations in various parallelism schemes. However, all-to-all is
used in some scenarios such as table embedding exchanges for
recommendation models [34].
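The semantics of the four collectives can be made concrete with a small reference model. The following is a minimal sketch in Python, not an implementation from this work: it ignores the network entirely and only shows what each node holds before and after each operation. The function names and the layout `data[i][j]` (the j-th portion held by node i) are assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]


def reduce_scatter(data: List[List[Vector]]) -> List[Vector]:
    """data[i][j] is the j-th portion on node i.
    Node j ends up with the element-wise sum of portion j over all nodes."""
    n = len(data)
    return [
        [sum(vals) for vals in zip(*(data[i][j] for i in range(n)))]
        for j in range(n)
    ]


def all_gather(portions: List[Vector]) -> List[List[Vector]]:
    """Node j starts with portions[j]; every node ends with all portions."""
    return [list(portions) for _ in portions]


def all_reduce(data: List[List[Vector]]) -> List[List[Vector]]:
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(data))


def all_to_all(data: List[List[Vector]]) -> List[List[Vector]]:
    """Node i sends its j-th portion to node j (a transpose of portions)."""
    n = len(data)
    return [[data[i][j] for i in range(n)] for j in range(n)]


# Two nodes, two one-element portions each (illustrative values).
example = [[[1.0], [2.0]], [[10.0], [20.0]]]
reduced = all_reduce(example)
exchanged = all_to_all(example)
```

In this model, all-reduce leaves every node with the same fully reduced data, while all-to-all only rearranges portions without reducing them.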
Topology-Aware Collective Algorithm. Collectives have efficient
implementation algorithms based on the underlying topology. Libraries
like NVIDIA NCCL [35] provide different implementations for
collectives, such as ring-based, tree-based, hierarchical, and direct
all-reduce/all-to-all, to optimize for the available bandwidth in the
underlying topology [45, 7, 38]. We will discuss this further in
Section 5, where
we consider topology-aware collectives for evaluating our
target systems.
Figure 3: Impact of Memory Bandwidth on NVIDIA NCCL Allreduce.
Figure 4: Impact of Memory Bandwidth on Intel MLSL Allreduce.
3. MOTIVATION: POOR SCALE-UP BANDWIDTH UTILIZATION
In this section, we identify some critical challenges in
scaling DL training, both on current and future platforms.
Fig. 1 qualitatively summarizes our findings.
NPU Compute Availability. On modern systems, a frac-
tion of NPU cores (e.g. CUDA cores) asynchronously exe-
cute/orchestrate DL training collective communication op-
erations while the majority of compute cores execute DL
computation (e.g. GEMMs and Convolutions). Collective op-
erations, such as allreduce, are parallelized across these cores
since a single NPU core cannot fully saturate memory band-
width and network bandwidth given various bottlenecks in the
NPU core - memory - network data-path [22, 35]. Each core
iterates over a sequence of send/receive operations to/from
peers followed by an optional computation on data that was
received with locally available data (e.g. reduction sum).
Therefore, the optimal number of NPU cores is a function of network
bandwidth saturated by a single core, memory bandwidth available to a
single core, communication algorithm, message size, and so on. Table 3
presents the number of CPU cores and the effective network bandwidth
utilization when running ResNet-50 allreduce message sizes on 32 nodes
with dual-socket Intel Xeon Skylake processors and 100G Intel
Omni-Path fabric. Given significant latency overheads in executing
collective communication operations, even on latency-optimized Intel
Xeon CPUs, we observe that we need 4-8 cores to efficiently drive the
network, cores that would otherwise be used in executing DL
computation.
Figure 5: Time breakdown for an all-reduce on a 2D Torus topology for
(a) baseline, (b) NW optimized: baseline with 3x network bandwidth,
(c) Memory bandwidth contention: NW optimized with 20% read memory
bandwidth per NPU core.
Memory Bandwidth Contention. Figs. 3 and 4 present the slowdown of the
allreduce collective when run concurrently with the Stream benchmark
[2] (mimicking memory-bandwidth-hungry DL compute) on NVIDIA and Intel
platforms, respectively. On an NVIDIA DGX-1 system, we observed an
average 2× slowdown for NCCL allreduce³ when run concurrently with the
Stream benchmark, relative to an unloaded system running only NCCL
allreduce. The two bars represent the slowdown of NCCL allreduce for
messages from 32KB to 256MB when running with either 32MB or 64MB
Stream buffer sizes. Additionally, we observed a 19% drop in the
achieved Stream memory bandwidth when run concurrently with the NCCL
allreduce benchmark compared to only running the Stream benchmark. A
drop of roughly 20% in memory bandwidth will have a drastic impact on
compute time since most DL compute kernels are memory bound.
Similarly, on Intel Xeon Skylake processors we observed a 2-3×
slowdown for MLSL allreduce [22] when run concurrently with the Stream
benchmark or LIBXSMM convolution kernels running ResNet-50 [19]
operations.
Scale-Up Bandwidth Utilization. We developed a simple analytical model
to highlight some of these issues. The analytical model captures the
time spent performing the local reduction sum (a streaming operation
involving two local memory reads, a sum operation, and a local write),
the remote network write (e.g. RDMA write and synchronization), and
various other per-NPU local latency overheads (e.g. scheduling,
processing message headers etc.) incurred in each step of an allreduce
collective. We assume a 2D-torus topology with 16 NPUs arranged in a
4x4 grid. Each NPU (1GHz, 1TB/s memory bandwidth) has 4 links, with
each link at 25GB/s uni-directional bandwidth and 95% link efficiency.
We assume each NPU core can read 64 bytes/clock and write 32
bytes/clock, resulting in 64GB/s and 32GB/s of read and write
bandwidth respectively. We assume the reduction sum operation and
local write are completely overlapped by the local reads and other
overheads. In an ideal case, we need 4 NPU cores (2 bytes of local
load for each byte of remote write) to achieve sufficient memory
bandwidth to match the network bandwidth (100GB/s). Fig. 5 presents
the time breakdown and % peak bandwidth achieved with 1 to 12 NPU
cores used to parallelize a single 16MB allreduce for three scenarios:
(a) 2us latency for each step in the allreduce (very aggressive for
any modern NPU architecture), (b) 3x network bandwidth, (c) 3x network
bandwidth with 20% read memory bandwidth (12.8GB/s) per NPU core. We
generally observe the bandwidth utilization to drop from 50% to 15% as
we go from Fig. 5(a) to Fig. 5(c), due to increased latency and
reduction overheads. In other words, we believe addressing endpoint
overheads must be prioritized over adding more network bandwidth.
³ We expect to observe similar behavior on DGX-2 as well. The impact
would be higher in general given DGX-2's NVSwitch is more efficient
than the DGX-1 Hybrid Cube Mesh.
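The ideal-case core count quoted in the analytical model follows from simple arithmetic, which the sketch below reproduces in Python. It is only a back-of-the-envelope check under the stated assumptions (2 bytes of local read per byte sent, 64 bytes/clock at 1GHz per core, 4 links at 25GB/s); the function name is hypothetical, and the per-step latency and scheduling overheads that the full model captures are deliberately left out.

```python
import math


def cores_to_match_network(network_gb_s: float,
                           read_gb_s_per_core: float,
                           bytes_read_per_byte_sent: float = 2.0) -> int:
    """Number of cores whose combined read bandwidth covers the local loads
    needed to keep the network busy (latency overheads are ignored)."""
    required_read_gb_s = network_gb_s * bytes_read_per_byte_sent
    return math.ceil(required_read_gb_s / read_gb_s_per_core)


# Parameters taken from the analytical model in the text.
clock_ghz = 1.0
read_bytes_per_clock = 64
read_gb_s_per_core = read_bytes_per_clock * clock_ghz   # 64 GB/s per core
network_gb_s = 4 * 25.0                                  # 4 links x 25 GB/s

ideal_cores = cores_to_match_network(network_gb_s, read_gb_s_per_core)

# Scenario (c): 3x network bandwidth, 20% read bandwidth per core.
contended_cores = cores_to_match_network(3 * network_gb_s,
                                         0.2 * read_gb_s_per_core)
```

Under these assumptions the first call gives the 4 cores stated in the text, and the contended scenario needs far more cores than the 12 considered in Fig. 5, which is consistent with the utilization drop reported there.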
Overall, we observe the challenge of using NPU cores, i.e. GPU CUDA
cores or CPU cores, for executing collective communication: the
increased contention for memory bandwidth when performing the local
reduction sum operation and various other scheduling overheads
negatively affect collective performance across different
architectures. In other words, the complex interplay among compute,
memory, and network affects not only collective performance but also
the performance of compute kernels executing concurrently. This in
turn increases the effective compute time available for overlapping
asynchronous communication operations, hiding deep inefficiencies in
how systems are designed. These challenges are expected to be
exacerbated in future training platforms, where better compute
accelerators and/or hierarchical bandwidths due to emerging MCM
technologies [10] will make it harder to hide communication behind
compute.
4. ACCELERATOR COLLECTIVES ENGINE
We propose a novel micro-architecture for DL collective communication
offload called ACE. Fig. 6 shows the high-level overview of ACE
integrated into the NIC module. Upon activation, ACE loads the initial
data to be communicated into its SRAM (from NPU main memory) via the
NIC TX DMA. The incoming/outgoing traffic is stored in NIC SRAM
temporarily. ACE receives incoming scale-up network
gradients/activations through NIC SRAM, and directly stores and
processes them, requiring no further communication with the NPU. ACE
also writes data back to NIC SRAM for sending to other nodes. Finally,
ACE transfers the results to the NPU main memory through the NIC RX
DMA. This decoupling enables conventional NIC components to focus on
handling the communication protocol (e.g. packetizing, depacketizing)
while allowing ACE to focus on processing and executing the collective
operation. It also allows ACE to be bypassed, with the NIC falling
back to its normal operation mode when needed.
Figure 6: ACE microarchitecture. #1 is the on-chip SRAM, divided into
per-phase partitions (Phase 1 to Phase N). #2 is the NIC TX DMA for
transferring data from main memory to NIC SRAM (normal operation mode)
or ACE SRAM (ACE activated mode). #3 is the ALU. #4 is the NIC RX DMA
for transferring data from NIC SRAM (normal operation mode) or ACE
SRAM (ACE activated mode) to main memory. #5 are the input/output port
buffers. These buffers are allocated per physical link and contain the
packets corresponding to that specific link. #6 is the control unit
logic. The figure legend distinguishes control, data transfer,
compute, and network paths, and groups the blocks into compute units,
data management units, and network units; the conventional NIC
components (control, TX DMA, RX DMA, NIC SRAM, (de)packetizer) sit
between the host NPU/memory and the network.
This section describes the details of operation, and the fea-
tures within ACE that make it flexible across various network
topologies and collective algorithms.
Table 4: Data granularity at different levels of ACE execution.
Granularity            Size                                     Constraint
Payload (variable)     Training Algorithm                       Training Algorithm
Chunk (64kB on avg.)   Parameter for Pipelining                 Storage Element Size (Area/Power)
Message (4kB on avg.)  Parameter - Multiple of Number of Nodes  Topology
Packet (256B)          Link Technology                          Technology
Flit (256B)            Network Buffer Size                      Microarchitecture (Area/Power)
Phit (variable)        Link Width                               Technology
4.1 Data Granularity
Table 4 shows the granularity of data at different levels
of the ACE execution and their determining factor. It also
shows the default value of each level used in ACE. ACE
initiates execution by receiving a command from NPU to
perform a specific collective on a payload. The payload
could be activations or gradients depending on the parallelism
approach and forward/back pass (Table 2). The command
includes the collective type and the address range for data
residing in the main memory. ACE then divides the payload
into multiple chunks and begins processing and scheduling
of each chunk individually and in a pipelined manner.
A chunk itself decomposes into multiple messages and the collective
algorithm runs at message granularity. The number of messages is a
multiple of the number of nodes in the system. For example, if ACE
wants to perform an all-reduce in a ring with 4 NPUs, it can divide
the chunk into 8 messages and execute the all-reduce serially over two
steps⁴. Multiple chunks can, however, be scheduled to run in parallel.
The degree of parallelism for running the chunks depends on the number
of state machines within the ACE control unit (details in Section
4.3.2) that handle the dataflow for each chunk. Each message comprises
one or more packets when it enters the network layer. The unit of data
processing within ACE is the packet. The bus width and SRAM interface
might or might not be equal to the packet size, and data
movement/execution is serialized if they are narrower than a packet.
Each packet is split into multiple flits and each flit consists of
multiple phits (depending on the link speed) as it traverses the
physical link.
⁴ Each step leads to processing and performing all-reduce for a group
of 4 messages, and the algorithm is ring-based. More details on
ring-based all-reduce are provided in Section 4.2.
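The payload, chunk, message, and packet hierarchy described above can be sketched as a small calculation. The Python below is illustrative only: the sizes are the average values from Table 4, while the function name, the rounding policy (round the message count up to a multiple of the node count), and the `serial_rounds` field are assumptions made for this example rather than details stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Decomposition:
    num_chunks: int            # chunks the payload is split into
    messages_per_chunk: int    # always a multiple of the number of nodes
    message_bytes: int         # approximate size of one message
    packets_per_message: int   # packets needed to carry one message
    serial_rounds: int         # groups of num_nodes messages run serially


def decompose(payload_bytes: int, num_nodes: int,
              chunk_bytes: int = 64 * 1024,
              target_message_bytes: int = 4 * 1024,
              packet_bytes: int = 256) -> Decomposition:
    """Split a payload into chunks, messages, and packets (assumed policy)."""
    num_chunks = math.ceil(payload_bytes / chunk_bytes)
    # Start from the target message size, then round the message count up
    # so that it is a multiple of the number of nodes.
    raw_messages = max(1, chunk_bytes // target_message_bytes)
    messages = math.ceil(raw_messages / num_nodes) * num_nodes
    message_bytes = math.ceil(chunk_bytes / messages)
    packets = math.ceil(message_bytes / packet_bytes)
    return Decomposition(num_chunks, messages, message_bytes, packets,
                         messages // num_nodes)


# 1 MiB payload on a 4-node ring (illustrative values).
four_nodes = decompose(1 << 20, num_nodes=4)
three_nodes = decompose(1 << 20, num_nodes=3)
```

With 4 nodes, each 64kB chunk becomes 16 messages of 4kB processed in 4 serial groups; with 3 nodes the message count is rounded up to 18 so that it stays a multiple of the node count.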
4.2 Walk-Through Example for All-Reduce
We describe ACE in action via a detailed walk-through example of
running the ring-based all-reduce collective over a ring for both the
baseline and ACE systems, as shown in Fig. 7. The general concepts and
the main advantages of ACE compared to the baseline are applicable to
any topology/collective, as we describe later in Section 4.5.
Fig. 7a shows the logical flow of the algorithm across the different
nodes. We assume one chunk for simplicity. Since there are three
nodes, there are three messages⁵. An all-reduce can be implemented as
a reduce-scatter followed by an all-gather, as can be seen from
Fig. 2. Steps 1-3 are the reduce-scatter phase. Step 1 initiates the
reduce-scatter; each node simply sends one message to its neighbor and
waits to receive another message from its other neighbor. In step 2,
each node reduces the received message with its local message and
forwards the result to the next node. Step 3 concludes the
reduce-scatter, with each node reducing the last message it has
received. All-gather starts with each node forwarding a copy of its
reduced message to its neighbor (step 4), and then each node keeps a
copy of its received message and forwards it to its neighbor (steps 5
and 6).
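The logical flow above can be checked with a short simulation. The Python sketch below models only the algorithm, not the baseline or ACE data movement: each node holds one scalar per message, and all sends in a transfer step are taken from a snapshot so they behave as if simultaneous. The function name and the indexing convention (in transfer step s, node r sends message (r - s) mod n) are assumptions of this sketch; it counts network transfer steps, so its loop indices do not map one-to-one onto the numbered steps in the walk-through.

```python
from typing import List


def ring_all_reduce(node_data: List[List[float]]) -> List[List[float]]:
    """Simulate ring all-reduce: node_data[r][m] is message m on node r.
    Returns the buffers of every node after the collective completes."""
    n = len(node_data)
    buf = [list(msgs) for msgs in node_data]

    # Reduce-scatter: n-1 transfer steps; the receiver reduces into its copy.
    for step in range(n - 1):
        sends = [((r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(sends):
            buf[(r + 1) % n][idx] += value
    # Now node r holds the fully reduced message (r + 1) % n.

    # All-gather: n-1 transfer steps; the receiver overwrites its copy.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, buf[r][(r + 1 - step) % n])
                 for r in range(n)]
        for r, (idx, value) in enumerate(sends):
            buf[(r + 1) % n][idx] = value
    return buf


# Three nodes (X, Y, Z) with three messages (a, b, c) each.
result = ring_all_reduce([[1.0, 2.0, 3.0],
                          [10.0, 20.0, 30.0],
                          [100.0, 200.0, 300.0]])
```

Every node ends with the same three reduced messages, and each node sends 2(n-1) messages in total, which is the property that makes the ring algorithm bandwidth-efficient.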
Fig. 7b shows this flow from node X’s view in the case of
baseline vs. ACE. It is clear from this figure that in baseline,
in all phases, messages need to go all the way from/to main
memory to/from NIC to be injected/received into/from net-
work. The local reduction is done through a series of back
and forth data movements between NPU and main memory
resulting in congestion on both NPU-Mem and NPU-NIC
busses. This in turn reduces the available memory bandwidth
and compute power for the training algorithm. In contrast,
ACE restricts the data movement only to the first and last
phases (reduced congestion and increased available memory
bandwidth) and allows the training algorithm to make use of
complete NPU compute power.
Fig. 7c shows the internal ACE interactions for node X. Here, the ACE
SRAM is divided into two partitions: the first serves as the source
for the (only) all-reduce phase, and the last holds the final results
to send back to main memory. In step 1, the three messages are brought
into the first partition of ACE SRAM by the TX DMA (sub-step 1.1).
Then, one message is sent out through a series of packets⁶ injected
into the designated output port buffer to be written to NIC SRAM
(sub-step 1.2). In step 2, the received message is reduced with the
local data (sub-step 2.2) and is forwarded to the neighbor (sub-step
2.3). ACE overlaps steps 3 and 4 of the algorithm; after performing a
reduction (sub-step 3.2), it stores the result locally (sub-step 3.2)
and forwards it to the next neighbor (sub-step 4.1) at the same time.
Step 5 is broken into two figures for clarity. In step 5a, the
received message is stored and forwarded (sub-step 5.2) at the same
time, while in step 5b, the final received message is stored and the
whole chunk is sent back to the main memory by the RX DMA (sub-step
5.4).
⁵ Note that there could be multiple groups of 3 messages and, as
described in Section 4.1, they would be executed serially. Here, for
simplicity, we assume only 1 group of 3 messages.
⁶ Note that here packets means packet data. The actual packetization
of this data is the job of the NIC once it wants to send it over the
links.
Figure 7: Steps showing the comparison between baseline vs. ACE for
single ring all-reduce. (a) Indication of all-reduce on a ring. (b)
Flow of compute/mem/network operations for ring all-reduce in the
baseline vs. ACE from the viewpoint of node X. (c) Steps of ring
all-reduce within ACE microarchitecture for node X.
It is clear that in some steps within ACE, multiple resources must be
available for a sub-step to proceed. For example, in sub-step 5.2 of
step 5a, the SRAM input port must be available and the output port
buffer must have free space. In such cases, if any of the resources is
not free, that step is stalled until all resources are available.
Multiple chunks can be executed in parallel to maximize the
utilization of internal hardware resources and link bandwidth, as we
discuss further in Section 4.3.2.
4.3 Parallelism
In order to achieve high network utilization, we need to apply
parallelism at various levels. From the algorithmic perspective, there
are several levels where parallelism is possible. Assuming the
collective algorithm has P phases, multiple chunks can run in parallel
both within a phase and across different phases. Each chunk will
send/receive multiple messages. Hence, multiple in-flight chunks mean
that multiple in-flight messages (belonging to different chunks) are
possible. Parallel chunks in turn mean that multiple packets can be
processed in parallel. Packets are the unit of data transfer within
the network, and parallelism below that level is the network's job.
The ACE memory management and control unit, described next, are
therefore designed to exploit all of these algorithmic parallelism
opportunities.
4.3.1 Memory Management
The SRAM within ACE is partitioned according to the
number of phases of the collective algorithm being run; each
partition serves as the source for one specific phase. In addition,
one final partition is added that holds the final results for RX DMA.
For example, the ring-based all-reduce has 1 phase and hence
needs 2 partitions, as discussed in Section 4.2. More complex
hierarchical topologies can implement the collectives over
multiple phases [40]. One or more state machines coordinate
the dataflow within ACE for each phase.
Chunks are brought from main memory to the first-phase
partition within the SRAM by TX DMA in a pipelined manner.
Chunks reside within the SRAM and move to the next partitions as
they progress to later phases of the collective algorithm. Finally,
after the last phase finishes, the result is sent back to main
memory by RX DMA.
An alternative approach is to allocate a constant space for
a chunk and use that space during all phases of the collec-
tive. However, this is not an efficient way of using valuable
on-chip storage due to one important property of collective
operations: the data size might change during different phases
of the algorithm. Thus, allocating a constant space wastes on-
chip storage or requires sophisticated dynamic allocation/de-
allocation schemes to manage free spaces, adding complexity
to both datapath and control unit.
Using logical partitions for different phases allows for a
fine-grained yet low-complexity storage allocation/de-allocation
scheme as chunks go through different phases, and also simplifies
the control logic. The partition sizes can then be
adjusted according to the requirement of each phase, which
depends on the algorithm and the network speed for that phase
(in case of having links with different speeds).
4.3.2 Control Unit
The control unit comprises multiple programmable state
machines. Each state machine can be programmed for a
specific phase of a specific collective algorithm, and holds
a queue of chunks that it should process in order. Each entry
of this queue holds the context of a chunk, such as its start and
end addresses inside the SRAM and the address range for holding
the final result for the next phase. When a chunk is created
in ACE, it is also assigned the state machines it should go
through for each phase (footnote 7). When entering each phase, the
chunk is inserted into the queue of its assigned state machine
for that phase. The state machines then compete with each
other to access different resources, resulting in overlapping
and out-of-order execution of multiple chunks both within
and across phases. This increases resource and network
bandwidth utilization. The available parallelism is bounded only
by the number of available state machines to manage the
dataflow for each phase.
Footnote 7: It is possible that, for a given workload, different
collective operations (e.g. all-reduce and all-to-all) exist for the
same phase. In this case, the state machines allocated for that phase
should be programmed to handle all collective operations for that phase.
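The bookkeeping described in this subsection can be pictured with a small software model. This is only a sketch under stated assumptions: the class and function names are hypothetical, the round-robin choice of state machine is an assumption (the text does not say how chunks are assigned), and resource arbitration between state machines is not modelled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ChunkContext:
    chunk_id: int
    sram_range: tuple    # (start, end) address of the chunk inside the SRAM
    result_range: tuple  # address range holding the result for the next phase


@dataclass
class StateMachine:
    phase: int
    queue: deque = field(default_factory=deque)  # chunks are processed in this order


def assign_state_machines(chunk_id, machines_per_phase):
    """Pick one state machine index per phase for a new chunk.

    Round-robin by chunk id is an assumption made for this sketch.
    """
    return [chunk_id % len(machines) for machines in machines_per_phase]


def enter_phase(ctx, phase, assignment, machines_per_phase):
    """Insert the chunk into the queue of its assigned state machine for this phase."""
    machines_per_phase[phase][assignment[phase]].queue.append(ctx)


# Two state machines for phase 0 and one for phase 1 (hypothetical configuration).
machines = [[StateMachine(0), StateMachine(0)], [StateMachine(1)]]
for cid in range(4):
    ctx = ChunkContext(cid, (0, 0), (0, 0))
    enter_phase(ctx, 0, assign_state_machines(cid, machines), machines)
```

With two state machines serving phase 0, chunks 0 and 2 queue behind one another on the first machine while chunks 1 and 3 queue on the second, so up to two chunks can make progress in that phase at once.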
4.4 Interface with NPU and NIC
ACE extends the existing NIC interface exposed to the NPU,
as shown in Fig. 6. NIC control forwards the ACE-specific
commands from the NPU to ACE and from ACE to the NPU. The NPU-NIC
command interface is similar to UCX [43] or OFI [18], which
are the standard high-level NIC interfaces. Once a collective
command is received, ACE decides when to load data given
the availability of SRAM space. Finally, ACE signals the
completion of a chunk's collective by raising an interrupt and
forwarding it to the NPU.
In addition, the ACE control unit should consult the NIC control
unit to ensure available space before attempting to write to the
NIC SRAM. Moreover, the NIC should be able to detect whether
incoming packets belong to normal mode or ACE-activated
mode. This is done by including a bit in the packet headers
to distinguish between the two modes.
4.5 Flexibility and Scalability
Recall from Section 2 that there can be different classes of
collectives depending on the parallelism approach, collective
algorithm, and network topology. As DNNs evolve, their
training systems need to be able to evolve as well to run any
collective over any topology. The microarchitecture of ACE
is designed with this goal in mind.
Supporting Different Collectives. The general principles
for running any collective algorithm using ACE remain the
same. For a collective with, say, P phases, the SRAM is
divided into P+1 partitions. Each partition is assigned to a
phase, as described in Section 4.3.1, and state machines are
assigned as described in Section 4.3.2. The only difference between
collective algorithms is that the state machines should
be programmed to perform that specific algorithm.
Supporting Different Topologies. Since ACE handles the
collectives at the endpoints, it is orthogonal to any network
topology (e.g. switch-based, point-to-point, hybrid). From
the logical view, ACE can perform any collective algorithm
(e.g. ring-based all-reduce) on top of any physical topology.
It is then the job of the network protocol and routing algorithm
to deliver the packets accordingly (footnote 8).
5. EVALUATION METHODOLOGY
This section describes our methodology for establishing and
simulating high-performance training systems and evaluating
the benefits of communication acceleration.
Target Training Platforms. We model future DL training
platforms comprising multiple NPUs integrated through
multi-chip packaging technology [10] on a package, and
multiple packages interconnected via a dedicated scale-up
fabric. Given the large design space, we limit the fabric
topologies in this study to a 3D Torus (inspired by Google’s
TPU platform [26]) and a fully connected cross-bar switch
Footnote 8: Note that ACE is compatible with networks with
out-of-order packet delivery.
[Figure 8 diagram: (a) 3D torus topology, showing packages P0-P3, each
containing NPUs M0-M3; (b) switch-based (alltoall) topology, showing
packages P0 and P1 with NPUs M0-M3 connected to switches S0-S3. In (b),
#Switches = #Inter-Package-Links/NPU and
Min. switch ports = #Packages x #NPUs/Package.]
Figure 8: Target Systems: 3D Torus and Fully-connected Switch Topology.
Multiple NPU chips (M0-...) are connected within a package, and multiple
packages (P0-...) are connected together, presenting heterogeneous scale-up
bandwidth across the system. Connectivity is shown for NPU M0. NPUs
M1/M2/M3 have similar connectivity (not shown).
Table 5: Topology details

Topology / Dimension Notation: 3D torus / L x V x H
  (L is the # of NPUs within a package, V is the # of inter-package
  rows, H is the # of inter-package columns)
  Collective Algorithm:
    Hierarchical all-reduce:
      1. Ring-based reduce-scatter in L
      2. Ring-based all-reduce in V
      3. Ring-based all-reduce in H
      4. Ring-based all-gather in L
    Hierarchical all-to-all:
      1. Ring-based all-to-all in L
      2. Ring-based all-to-all in V
      3. Ring-based all-to-all in H
  BW/Dimension:
    1 bi-directional ring in L (using 2 intra-package links/NPU),
    1 bi-directional ring in V (using 2 inter-package links/NPU),
    1 bi-directional ring in H (using 2 inter-package links/NPU)
    NOTE: If V/L=1, then L/V has 2 bi-directional rings.

Topology / Dimension Notation: Alltoall / L x P
  (L is the # of NPUs within a package, P is the # of packages)
  Collective Algorithm:
    Hierarchical all-reduce:
      1. Ring-based reduce-scatter in L
      2. Direct all-reduce in P
      3. Ring-based all-gather in L
    Hierarchical all-to-all:
      1. Ring-based all-to-all in L
      2. Direct all-to-all in P
  BW/Dimension:
    1 bi-directional ring in L (using 2 intra-package links/NPU),
    4 fully connected switches (using 4 inter-package links/NPU)
(inspired by NVIDIA’s DGX-2 platform [36]). Fig. 8 shows
the two selected topologies.
Table 5 shows more details about the topologies such as:
the notation used to define their dimension, the collective
algorithms they are using per dimension, and how intra-
package/inter-package links are used to construct rings or
connect to switches.
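To make the phase structure of Table 5 concrete, the sketch below lists the per-NPU data size entering and leaving each phase of the 3D-torus hierarchical all-reduce. It relies only on the standard properties that a reduce-scatter over L nodes leaves each node with 1/L of the data and an all-gather restores the full size; the function name and the choice of a 64KB chunk are assumptions for illustration.

```python
def hierarchical_all_reduce_sizes(chunk_bytes, L):
    """Per-NPU (phase, input bytes, output bytes) for the 3D-torus hierarchical all-reduce."""
    if chunk_bytes % L != 0:
        raise ValueError("chunk size is assumed to be divisible by L in this sketch")
    shard = chunk_bytes // L
    return [
        ("ring reduce-scatter in L", chunk_bytes, shard),
        ("ring all-reduce in V", shard, shard),
        ("ring all-reduce in H", shard, shard),
        ("ring all-gather in L", shard, chunk_bytes),
    ]


phases = hierarchical_all_reduce_sizes(64 * 1024, L=4)
for name, size_in, size_out in phases:
    print(f"{name}: {size_in} B in, {size_out} B out")
```

The changing size across phases is the property that motivates per-phase SRAM partitions in Section 4.3.1 rather than a fixed allocation per chunk.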
Target Collective Algorithms. As shown in Table 5, we
use hierarchical multi-phase collective algorithms for our
multi-dimensional physical topologies [40]. Each phase
uses a simple ring-based or direct algorithm depending on
the topology for that dimension. The ring-based all-reduce
(as well as ring-based reduce-scatter and all-gather) was
described in Section 4.2. In order to perform direct all-reduce
for N nodes, the chunk is broken into N (or a multiple of N, as
described earlier) messages, and the i’th node is responsible
for the reduction of the i’th message. Hence, all nodes send
their i’th messages to node i simultaneously. After receiving
all i’th messages, node i performs the reduction and then
simultaneously broadcasts the reduced message to all other nodes.
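The direct all-reduce just described can be written as a short functional model. This is a sketch that assumes exactly N messages per node (not a multiple of N) and models only the data movement, not its timing; the function name and data layout are illustrative.

```python
def direct_all_reduce(data):
    """data[src][i] is the i-th message held by node src; returns each node's result."""
    n = len(data)
    # Phase 1: every node sends its i-th message to node i, which reduces them.
    reduced = [sum(data[src][i] for src in range(n)) for i in range(n)]
    # Phase 2: node i broadcasts its reduced i-th message to all other nodes.
    return [list(reduced) for _ in range(n)]


out = direct_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each node is responsible for one message, so the reduction work is spread evenly across the N nodes and the whole collective takes two communication rounds.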
Direct all-to-all is simple, since each node sends all
of the respective messages to their destination nodes at the
same time. Ring-based all-to-all of N nodes consists of N-1
steps where, in step s, each node sends the message that belongs
to the node at distance s in the ring.
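The step structure of the ring-based all-to-all can be sketched as follows. The model is logical only: it records which message is delivered in which step and does not model the intermediate hops a message takes around the ring. The function name and the convention that `data[src][dst]` is the message from `src` destined for `dst` are assumptions.

```python
def ring_all_to_all(data):
    """Returns recv, where recv[dst][src] is the message that src sent to dst."""
    n = len(data)
    recv = [[None] * n for _ in range(n)]
    for i in range(n):
        recv[i][i] = data[i][i]  # a node's own message needs no transfer
    for s in range(1, n):        # N-1 steps
        for src in range(n):
            dst = (src + s) % n  # the node at distance s in the ring
            recv[dst][src] = data[src][dst]
    return recv


recv = ring_all_to_all([["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]])
```

After the N-1 steps every node holds exactly the messages addressed to it, which is the transpose of the initial layout.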
Table 6: Synthesis Results
Component             Area (µm²)    Power (mW)
ALU                   16112         7.552
Control unit          159803        128
4×1MB SRAM banks      5113696       4096
Crossbar switches     1084          0.329
ACE (Total)           5339031       4255
Table 7: System parameters
Parameter             Value
Intra-Package
  Packet size         256 Bytes
  Per link BW         200 GB/s
  Link latency        90 cycles
  # of links/NPU      2
  Link efficiency     94%
Inter-Package
  Packet size         256 Bytes
  Per link BW         25 GB/s
  Link latency        200 cycles
  # of links/NPU      4 (bi-directional)
  Link efficiency     94%
NPU and NIC parameters
  Compute Accel.      256x256 TPU-like
  NPU-NIC BW          500 GB/s
  NPU-MEM BW          900 GB/s
  Message size        4KB
Shared Memory. We assume a shared memory model
across the scale-up fabric in our target platforms, similar to
NVIDIA DGX-2 [36]. This design decision is reasonable for
two reasons. First, optimizing for Non-Uniform Memory Access
(NUMA) and shared memory is challenging, as is evident
from extensive work on HPC workloads for CPUs. Second,
having two separate communication paths, e.g.
shared memory within a package and message passing across
packages, increases the scheduling and optimization
complexity for various collective algorithms. Therefore, we
assume explicit communication between NPUs both within
a package and across packages, through a high-performance
topology-aware communication library. This programming model also
allows scaling the number of modules within a package with
improvements in packaging technology [6].
Simulation Parameters. Table 7 shows the major system
parameters. To model the network communication and
latency, we used ASTRA-SIM [40], a distributed DNN training
simulator, and developed ACE on top of it. For the
compute model, we used SCALE-SIM [41] to find the
compute times (i.e. forward-pass, weight gradient, and input
gradient) of the workloads. The compute model is a
700MHz 256x256 systolic array, used as an estimate for the
4x128x128 systolic array of the TPU-V3 chip. This means
∼91.5 TFLOPS of compute power per node, which is quite
comparable to the state-of-the-art training platforms [33]. For
modeling the bus (i.e. NPU-NIC and NPU-Mem) congestion,
we used the LogGP [9] model with L=50 and o=g=20 cycles
(these values are within the range stated in [20]), and G derived
from the corresponding BW of the NPU-NIC and NPU-Mem
bus, as stated in Table 7.
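As a rough illustration of how these parameters combine, the sketch below evaluates the standard LogGP end-to-end time for a single k-byte message, o + (k-1)G + L + o. It is not a description of how ASTRA-SIM applies the model internally. The bus clock is not stated in the text, so the bus width in bytes per cycle (from which the per-byte gap G is derived) is left as a hypothetical parameter.

```python
def loggp_message_cycles(k_bytes, L=50, o=20, bytes_per_cycle=1.0):
    """Cycles to deliver one k-byte message under the LogGP model.

    L and o default to the values quoted in the text; bytes_per_cycle is an
    assumed bus width used to derive the per-byte gap G = 1 / bytes_per_cycle.
    """
    G = 1.0 / bytes_per_cycle
    return o + (k_bytes - 1) * G + L + o


one_byte = loggp_message_cycles(1)
message_4kb = loggp_message_cycles(4096, bytes_per_cycle=512)
```

For small transfers the fixed terms (2o + L = 90 cycles with the quoted values) dominate, which is why many small bus transactions are costly even on a high-bandwidth bus.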
NPU-Mem bandwidth is chosen based on NVIDIA Volta
[24]. NPU-NIC bandwidth is assumed to be the sum of
all intra-package/inter-package link bandwidths, as a logical choice
to prevent the NPU-NIC bandwidth from being the bottleneck in driving
the network links. Inter-package link BW is assumed to be
the same as NVlink [29], while intra-package bandwidth is
selected based on [44], which gives an expected bandwidth for
high-performance multi-chip package systems.
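As a check of this choice against Table 7, summing the per-NPU link bandwidths (2 intra-package links at 200 GB/s and 4 inter-package links at 25 GB/s) reproduces the NPU-NIC bandwidth listed there:

```latex
\[
\mathrm{BW}_{\text{NPU-NIC}}
  = \underbrace{2 \times 200}_{\text{intra-package}}
  + \underbrace{4 \times 25}_{\text{inter-package}}
  = 500~\text{GB/s}
\]
```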
Target Systems. We considered four different systems:
• Baseline: Represents a conventional system where the
NPU is responsible for performing the collective
communication. Based on our analysis from Section 3, we
conservatively assume that ∼5% of the compute and ∼20% of the
Mem BW is spent on collective communication.
• Baseline++: Considers a baseline system with higher
compute and memory bandwidth than Baseline, such that
the available NPU compute and memory bandwidth are the
same as in ACE. However, the actual communication still
occurs via the memory and NIC, as in the Baseline. This
baseline helps us isolate the effects of latency and
congestion over the NPU-MEM and NPU-NIC buses.
• ACE: Implements our proposed system. In this case,
100% of the NPU resources are dedicated to the training
algorithm. Moreover, memory BW is only used for the initial
load and final writeback to/from ACE. Hence, we assume that
only 5% of the memory BW is available to ACE.
• Ideal: A system where the NIC has unlimited
resources and is able to handle all collective communication
operations in one cycle. This essentially means
that there is no associated latency from the endpoint side
in the collective communication algorithm. This gives an
upper bound for our design. In this case, 100% of compute
and memory is allocated to the training algorithm only.
Target Workloads. In order to evaluate our platforms, we
considered two different sets of workloads: (1) microbenchmarks
that consist of a single all-reduce or all-to-all
collective communication, and (2) the ResNet-50 [19] and GNMT
[48] DNNs. For the DNNs, we consider data parallelism
(i.e. requiring all-reduce for weight gradients) for two
training iterations, with a Last-In-First-Out (LIFO) collective
scheduling policy to give more priority to the collectives of the
first layers during back-propagation. The mini-batch size is
set to 32 per node for both workloads. We present both
end-to-end runtime and a detailed per-layer
breakdown.
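The effect of the LIFO policy can be seen in a toy scheduler. The sketch assumes that only one collective runs at a time, that collectives are not preempted, and that every collective takes the same time; none of these simplifications comes from the text, and the function name and inputs are hypothetical. Back-propagation issues the all-reduce of the last layer first, so the most recently issued pending collective belongs to the earliest layer.

```python
def lifo_order(issue_times, duration):
    """Order in which collectives run under a LIFO policy.

    issue_times maps layer -> time at which its weight-gradient all-reduce is issued.
    """
    pending = sorted(issue_times.items(), key=lambda kv: kv[1])
    stack, order, now, i = [], [], 0.0, 0
    while i < len(pending) or stack:
        if not stack and pending[i][1] > now:
            now = pending[i][1]  # idle until the next collective is issued
        while i < len(pending) and pending[i][1] <= now:
            stack.append(pending[i][0])
            i += 1
        order.append(stack.pop())  # LIFO: newest pending collective first
        now += duration
    return order


# Four layers; back-propagation issues layer 4 first and layer 1 last.
order = lifo_order({4: 0.0, 3: 1.0, 2: 2.0, 1: 3.0}, duration=2.5)
```

In this example layer 1's collective runs before layer 3's even though it was issued later, which matches the stated goal of prioritizing the first layers, whose gradients are needed first in the next forward pass.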
6. RESULTS
This section presents simulation results comparing ACE
against ideal and baseline for micro-benchmarks (i.e. single
collective performance) and for real workloads. But first we
present an area/power analysis of ACE on a 28nm technology node.
ACE: Area/Power Analysis. We ran a number of simulations
to find the best-performing yet reasonably low-overhead
parameters for ACE. Due to space limitations, we do not present
the design space evaluation for ACE and present only the
parameters we chose. We found that 4×1MB SRAM
banks and 16 state machines are sufficient to implement and
drive our target networks and algorithms. The complexity
of each state machine is approximated by an 8KB SRAM, which
is sufficient to implement the various target algorithms. Also, 4
wide ALU units, each consisting of 16 FP32 functional units,
were sufficient for ACE. The interconnect between SRAM
[Figure 9 plot: time (µs) of Baseline, ACE, and Ideal for sizes of 1MB,
4MB, and 8MB on 8 NPUs (4X1X2), 16 NPUs (4X2X2), 32 NPUs (4X2X4), and
64 NPUs (4X4X4), plus the average.]
Figure 9: Baseline, ACE and Ideal total communication latency (µs) of the
all-reduce collective for the torus topology.
[Figure 10 plot: time (µs) of Baseline, ACE, and Ideal for sizes of 64KB,
256KB, and 512KB on 8 NPUs (4X1X2), 16 NPUs (4X2X2), 32 NPUs (4X2X4), and
64 NPUs (4X4X4), plus the average.]
Figure 10: Baseline, ACE and Ideal total communication latency (µs) of the
all-to-all collective for the torus topology.
and functional units consists of wide 64B buses. We implemented
ACE in Verilog and synthesized the design using the
Synopsys Design Compiler in a 28nm technology node. Table 6
shows the area and power estimates for our design, enumerating
individual components as well as ACE itself. Compared
to the area and power of high-end training accelerators
reported in [26, 47], ACE has less than 2% overhead in both
area and power.
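The totals in Table 6 can be cross-checked from the component rows. The sketch below assumes that the ALU row reports a single ALU and that the total counts the four ALUs mentioned in the text, with the other rows counted once; this reading is an inference from the numbers, not something the table states.

```python
# Component values copied from Table 6: (area in µm², power in mW).
ALU = (16112, 7.552)
CONTROL_UNIT = (159803, 128)
SRAM_BANKS = (5113696, 4096)
CROSSBARS = (1084, 0.329)
NUM_ALUS = 4  # assumption: the ALU row is per unit and ACE has four ALUs

total_area = NUM_ALUS * ALU[0] + CONTROL_UNIT[0] + SRAM_BANKS[0] + CROSSBARS[0]
total_power = NUM_ALUS * ALU[1] + CONTROL_UNIT[1] + SRAM_BANKS[1] + CROSSBARS[1]

print(total_area, round(total_power))
```

Under that assumption the sums agree with the "ACE (Total)" row (5339031 µm² and about 4255 mW), and they show that the SRAM banks account for nearly all of the area and power.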
In addition, we used a simple heuristic for SRAM partitioning
that sizes the partition of each phase in proportion to (available
network bandwidth × initial chunk size) for that phase (footnote 9).
Moreover, state machines are evenly distributed across different phases.
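The partitioning heuristic amounts to a proportional split, which the following sketch illustrates. The function name and the input format are assumptions, and the sketch only divides space among the phase partitions; how the extra final-result partition for RX DMA is sized is not specified by the heuristic and is left out here.

```python
def partition_sram(total_bytes, phases):
    """Split SRAM among phases in proportion to bandwidth x initial chunk size.

    phases is a list of (relative_bandwidth, initial_chunk_size) pairs.
    """
    weights = [bw * chunk for bw, chunk in phases]
    total_weight = sum(weights)
    return [total_bytes * w / total_weight for w in weights]


# A phase with 1x bandwidth and 1x chunk size next to one with 2x and 2x.
sizes = partition_sram(4 * 2**20, [(1, 1), (2, 2)])
```

With these inputs the second phase receives four times the space of the first, matching the example given for the heuristic.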
Micro-benchmarks. In this section, we test the performance
of baseline, ACE and Ideal for single all-reduce and
all-to-all collectives (footnote 10).
Fig. 9 presents the ideal, baseline, and ACE all-reduce on
2D/3D Torus topologies. On average, using ACE improves
the total communication latency by 1.51× compared to the
baseline. Ideal is only 1.43× better than ACE, while com-
pared to baseline, Ideal is 2.17× better.
To delve deeper into the reason for this improvement, Fig. 11
shows the detailed time breakdown for Fig. 9. As shown in
the figure, due to better performance, chunk queuing
is reduced by 44.9% on average in ACE compared to the
baseline. In addition, ACE significantly reduces the bus
congestion (total NIC-NPU/Mem-NIC bus queuing + transfer)
by 88.9% on average compared to the baseline. Note that
in general the message network transfer in ACE is smaller
Footnote 9: For example, a phase with 2× link bandwidth and 2× initial
chunk size has a partition 4× greater than a phase with 1× bandwidth and
1× chunk size. Also note that if, for a given phase, there are different
chunk sizes belonging to different collective operations, we use the
average of such chunk sizes.
Footnote 10: All-reduce is the only collective communication operation in
most of the training workloads. However, all-to-all is sometimes used,
especially for recommendation model DNNs, where it is used
for embeddings.
[Figure 11 plot: stacked bars of time (µs) for Baseline, ACE, and Ideal at
1MB, 4MB, and 8MB on 8 NPUs (4X1X2), 16 NPUs (4X2X2), 32 NPUs (4X2X4), and
64 NPUs (4X4X4). Stack components: NIC-NPU bus queueing, NIC-NPU bus
transfer, local reduction, Mem-NIC bus queueing, Mem-NIC bus transfer,
chunk queueing, message network transfer.]
Figure 11: The detailed latency breakdown for all-reduce on torus network. The NIC-NPU/Mem-NIC bus queuing and bus transfer show the average
queuing and average transfer time of transfer requests in the NIC-NPU/Mem-NIC bus, respectively. The local reduction delay refers to the average latency of
messages being reduced by the NPU. Chunk queuing is the average amount of time chunks wait for the chunks ahead of them to finish before they
start executing. Message network transfer refers to the average amount of time messages spend in the network.
[Figure 12 plot: time (µs) of Baseline, ACE, and Ideal for sizes of 1MB,
4MB, and 8MB on 8 NPUs (4X2), 16 NPUs (4X4), 32 NPUs (4X8), and
64 NPUs (4X16), plus the average.]
Figure 12: Baseline, ACE and Ideal total communication latency (µs) of the
all-reduce collective for the alltoall topology.
[Figure 13 plot: time (µs) of Baseline, ACE, and Ideal for sizes of 64KB,
256KB, and 512KB on 8 NPUs (4X2), 16 NPUs (4X4), 32 NPUs (4X8), and
64 NPUs (4X16), plus the average.]
Figure 13: Baseline, ACE and Ideal total communication latency (µs) of the
all-to-all collective for the alltoall topology
than baseline/ideal systems because in ACE, chunk sizes
(and hence message sizes) are smaller because of the limited
on-chip SRAM. However, as Fig. 11 and Fig. 9 show, this
limitation does not prevent ACE from driving the network in
a much more efficient way compared to the baseline.
We observe the same behavior in other collectives and
systems. Fig. 10, Fig. 12, and Fig. 13 show the performance
of the three systems for all-to-all on torus, all-reduce on
alltoall, and all-to-all on alltoall, respectively. On average
ACE improves all-to-all on torus by 1.22×, all-reduce on
alltoall by 1.55× and all-to-all on alltoall by 1.20×, compared
to the baseline system.
Workload performance comparison. In this section, we
evaluate the four different systems mentioned in Section 5.
Fig. 14 shows the total training loop latency for two
iterations of the ResNet-50 DNN training loop running on
different torus network sizes. Each latency is decomposed into
endpoint delay (compute + bus congestion) and exposed
communication latency. The exposed communication
latency is the time during which the training algorithm is forced
to stall while waiting for communication. On average,
ACE improves the overall training latency by 2.71×
compared to the baseline.
[Figure 14 plot: stacked bars of time (µs) for Baseline, Baseline++, ACE,
and Ideal on 8 NPUs (4X1X2), 16 NPUs (4X2X2), 32 NPUs (4X2X4), and
64 NPUs (4X4X4). Stack components: endpoint latency, exposed
communication latency.]
Figure 14: The baseline, baseline++, ACE and ideal latency (µs) for two
training iterations of ResNet-50 training algorithm running on torus topolo-
gies. Note that endpoint latency contains congestion latencies as well.
Fig. 15 shows the layer-wise exposed latency and its
comparison with compute latency for baseline++ and ACE
systems for the 64 NPU case. As the figure shows, switching
from baseline to baseline++ alone does not completely address
the limitations of the NPU-centric communication handling
approach, and there is still a significant bottleneck arising from
congestion in the buses, which is mostly mitigated in ACE.
This difference becomes more evident by looking at the later
layers of ResNet-50 in Fig. 15, which have larger
communication sizes. We note that this is not an indication of
poor performance of the baseline in general. In fact, as
shown in Fig. 14, a training iteration takes ∼9ms on
the 64-node baseline system. This means that one epoch of
training on the ImageNet dataset [12] takes ∼61.5 seconds,
which is comparable to the best training performance-per-node
times reported [33]. However, Fig. 14 points to the available
opportunity to further improve the performance, since
high-performance NPUs (i) leave less time to overlap the
communication with computation and (ii) require more collectives
to be executed in parallel, resulting in more bus congestion
[Figure 15 plot: stacked bars of time (µs) for Baseline++ and ACE per layer
group (layers 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45,
46-50, 51-54). Stack components: fwd compute, wg compute, ig compute,
exposed comm.]
Figure 15: The layer wise exposed communication latency (µs) and its
comparison with compute (excluding bus congestion) latency for baseline++
and ACE running on the 64-NPU (4X4X4) system. The latencies are for 2
training iterations.
[Figure 16 plot: stacked bars of time (µs) for Baseline, Baseline++, ACE,
and Ideal on 8 NPUs (4X1X2), 16 NPUs (4X2X2), 32 NPUs (4X2X4), and
64 NPUs (4X4X4). Stack components: endpoint latency, exposed
communication latency.]
Figure 16: The baseline, baseline++, ACE and ideal latency (µs) for two
training iterations of GNMT training algorithm running on torus topologies.
Note that endpoint latency contains congestion latencies as well.
and queuing for compute and memory resources.
Fig. 16 shows the total endpoint and exposed delays for
GNMT running on the torus topology. On average, ACE achieves
1.44× better performance compared to the baseline.
Baseline++ is only 1.04× better than the baseline.
Finally, Fig. 17 compares the network bandwidth utilization
of baseline, baseline++ and ACE, normalized to their
respective ideal systems, for both the ResNet-50 and GNMT
networks. On average, ACE improves the bandwidth utilization
by 1.97×, increasing it to 74.6% of
the ideal system, compared to the baseline, which drives 37.8% of
the ideal system network bandwidth.
7. RELATED WORK
Previous work has demonstrated the efficiency of collective
offload to either smart NICs or network switches. As
explained in Table 1, ACE is a SmartNIC for accelerator-based
systems and can be integrated in different system topologies,
such as switch-based or point-to-point (e.g. Torus) systems.
SmartNICs and Communication offload. Prior SmartNIC
solutions for collective offload, such as [46, 20], focused
on scale-out networks and HPC workloads. Triggered
collectives [46] is the closest comparison to ACE, but it is specific
to the Portals transport and CPU-based scale-out systems, while
ACE is transport-agnostic and sufficiently flexible to support
hierarchical topologies (local + scale-up). Unlike both
Triggered collectives [46] and ACE, sPIN [20] uses ARM cores
and will not be able to drive the high bandwidths required for DL
[Figure 17 plot: % of network utilization, normalized to the ideal system
network utilization, for Baseline, Baseline++, and ACE on 8, 16, 32, and
64 NPUs. Series: Resnet-Torus, GNMT-Torus.]
Figure 17: The baseline, baseline++ and ACE network bandwidth utiliza-
tion for ResNet-50 and GNMT normalized to their respective ideal system
bandwidth utilization
platforms. INCEPTIONN [31] proposes a SmartNIC-based
compression offload for parameter servers for DL training.
The inherent differences between parameter servers and
collectives make an ACE vs. INCEPTIONN comparison out of
scope for this paper. HPC systems with Torus-based topologies,
such as BlueGene [27], PERCS [39], and Anton2 [17], have
supported collective offload on network routers. SmartNICs
are more flexible than the routers found in such systems, since
routers are purpose-built for the target topology.
Numerous other proposals have demonstrated the benefits
of SmartNICs for communication offload but do not address
the requirements of DL training. Microsoft’s Brainwave [14]
proposes a bump-in-the-wire SmartNIC for distributed
inference. IncBricks [32] proposes distributed in-network caches
on the scale-out network, which store key-value pairs for
frequently accessed data. They allow simple operations such
as increment to be done using a simple in-network processor.
NetCache [25] uses the ASIC within a switch to do dynamic
caching and load-balancing of requests to that switch.
Switch-based collective offload. Switch-based offload
solutions, such as Intel’s Barefoot [1], Mellanox SHARP [3], and
NVIDIA’s shared memory switch offload [13], have proposed
aggregation in switches. In contrast to SmartNIC-based
collective offload solutions, switch-based solutions have two
major disadvantages: (1) they are inherently limited to switch
topologies, while SmartNICs are more flexible and support both
point-to-point (e.g. Torus) and switch topologies, and (2) they
have tighter constraints on compute and memory resources,
given the large number of endpoints connecting to the switch.
SmartNIC-based solutions overcome this by distributing the
resource requirement across endpoints. More specifically, the
Barefoot [1] offload scheme requires operating on quantized
integer values (due to the lack of floating-point arithmetic in the
switch), potentially affecting training accuracy for future DL
models. On the other hand, NVIDIA’s switch-based offload
[13] is purpose-built for shared memory allreduce collectives,
limiting the number of endpoints and types of collectives that
can be supported. Switch offload has also been explored for
accelerating reinforcement learning [30].
8. CONCLUSIONS
One often-ignored aspect of scaling distributed DL training
systems is the complex interplay between (a) the processing
engines used for executing DL compute operations such as
GEMMs/Convolutions versus executing collective
communication, and (b) the contention between them for shared
resources like memory bandwidth. In this paper, we propose
a novel micro-architecture, called ACE, to augment NICs for
DL collective communication offload to efficiently drive the
hierarchical scale-up fabric and free up critical compute and
memory resources for DL computation. We demonstrate that
ACE achieves 2.71× and 1.44× speedup for modern DL
workloads such as ResNet-50 and GNMT, respectively, over the
state-of-the-art bandwidth-equivalent baseline.
9. REFERENCES
[1] [Online]. Available: https://www.barefootnetworks.com/
[2] [Online]. Available: https://github.com/UoB-HPC/BabelStream
[3] Mellanox Scalable Hierarchical Aggregation and Reduction Protocol
(SHARP). [Online]. Available:
https://www.mellanox.com/products/sharp
[4] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng,
“TensorFlow: Large-scale machine learning on heterogeneous systems,”
2015, software available from tensorflow.org. [Online]. Available:
http://tensorflow.org/
[5] D. Amodei and D. Hernandez, “Ai and compute,” 2018. [Online].
Available: https://openai.com/blog/ai-and-compute/
[6] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa,
A. Jaleel, C.-J. Wu, and D. Nellans, “Mcm-gpu: Multi-chip-module
gpus for continued performance scalability,” in Proceedings of the
44th Annual International Symposium on Computer Architecture, ser.
ISCA ’17. New York, NY, USA: ACM, 2017, pp. 320–332. [Online].
Available: http://doi.acm.org/10.1145/3079856.3080231
[7] E. Chan, R. van de Geijn, W. Gropp, and R. Thakur, “Collective
communication on architectures that support simultaneous
communication over multiple links,” in Proceedings of the Eleventh
ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, ser. PPoPP ’06. New York, NY, USA: ACM, 2006, pp.
2–11. [Online]. Available:
http://doi.acm.org/10.1145/1122971.1122975
[8] M. Cho, U. Finkler, S. Kumar, D. S. Kung, V. Saxena, and
D. Sreedhar, “Powerai DDL,” CoRR, vol. abs/1708.02188, 2017.
[Online]. Available: http://arxiv.org/abs/1708.02188
[9] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos,
R. Subramonian, and T. von Eicken, “Logp: Towards a realistic model
of parallel computation,” in Proceedings of the Fourth ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, ser.
PPOPP ’93. New York, NY, USA: ACM, 1993, pp. 1–12. [Online].
Available: http://doi.acm.org/10.1145/155332.155333
[10] W. J. Dally, C. T. Gray, J. Poulton, B. Khailany, J. Wilson, and
L. Dennison, “Hardware-enabled artificial intelligence,” in 2018 IEEE
Symposium on VLSI Circuits, June 2018, pp. 3–6.
[11] D. Das, S. Avancha, D. Mudigere, K. Vaidyanathan, S. Sridharan,
D. D. Kalamkar, B. Kaul, and P. Dubey, “Distributed deep learning
using synchronous stochastic gradient descent,” CoRR, vol.
abs/1602.06709, 2016. [Online]. Available:
http://arxiv.org/abs/1602.06709
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09,
2009.
[13] B. K. et al., “An in-network architecture for accelerating
shared-memory multiprocessor collectives,” in 2020 ACM/IEEE 47th
Annual International Symposium on Computer Architecture (ISCA),
2020. [Online]. Available:
https://www.iscaconf.org/isca2020/papers/466100a996.pdf
[14] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu,
D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil,
P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt,
A. M. Caulfield, E. S. Chung, and D. Burger, “A configurable
cloud-scale dnn processor for real-time ai,” in 2018 ACM/IEEE 45th
Annual International Symposium on Computer Architecture (ISCA),
2018, pp. 1–14.
[15] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski,
A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch
SGD: training imagenet in 1 hour,” CoRR, vol. abs/1706.02677, 2017.
[Online]. Available: http://arxiv.org/abs/1706.02677
[16] R. L. Graham, D. Bureddy, P. Lui, H. Rosenstock, G. Shainer,
G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir,
L. Levi, A. Margolin, T. Ronen, A. Shpiner, O. Wertheim, and
E. Zahavi, “Scalable hierarchical aggregation protocol (sharp): A
hardware architecture for efficient data reduction,” in 2016 First
International Workshop on Communication Optimizations in HPC
(COMHPC), 2016, pp. 1–10.
[17] J. P. Grossman, B. Towles, B. Greskamp, and D. E. Shaw, “Filtering,
reductions and synchronization in the anton 2 network,” in 2015 IEEE
International Parallel and Distributed Processing Symposium, 2015,
pp. 860–870.
[18] P. Grun, S. Hefty, S. Sur, D. Goodell, R. D. Russell, H. Pritchard, and
J. M. Squyres, “A brief introduction to the openfabrics interfaces - a
new network api for maximizing high performance application
efficiency,” in 2015 IEEE 23rd Annual Symposium on
High-Performance Interconnects, 2015, pp. 34–39.
[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online].
Available: http://arxiv.org/abs/1512.03385
[20] T. Hoefler, S. D. Girolamo, K. Taranov, R. E. Grant, and R. Brightwell,
“spin: High-performance streaming processing in the network,” CoRR,
vol. abs/1709.05483, 2017. [Online]. Available:
http://arxiv.org/abs/1709.05483
[21] Intel, “Intel caffe,” 2018. [Online]. Available:
https://github.com/intel/caffe
[22] Intel, “Intel machine learning scalability library (mlsl),” 2018.
[Online]. Available: https://github.com/intel/MLSL
[23] Intel, “Intel nervana graph,” 2018. [Online]. Available:
https://github.com/NervanaSystems/ngraph
[24] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the
NVIDIA volta GPU architecture via microbenchmarking,” CoRR, vol.
abs/1804.06826, 2018. [Online]. Available:
http://arxiv.org/abs/1804.06826
[25] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and
I. Stoica, “Netcache: Balancing key-value stores with fast in-network
caching,” in Proceedings of the 26th Symposium on Operating Systems
Principles, ser. SOSP ’17. New York, NY, USA: Association for
Computing Machinery, 2017, pp. 121–136. [Online]. Available:
https://doi.org/10.1145/3132747.3132764
[26] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal,
R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle,
P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean,
B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann,
R. C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey,
A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy,
J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin,
G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan,
R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, A. Salek, E. Samadiani, C. Severn,
G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan,
G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter,
W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance
analysis of a tensor processing unit,” CoRR, vol. abs/1704.04760,
2017. [Online]. Available: http://arxiv.org/abs/1704.04760
[27] D. J. Kerbyson, K. J. Barker, A. Vishnu, and A. Hoisie, “A
performance comparison of current hpc systems: Blue gene/q, cray
xe6 and infiniband systems,” Future Generation Computer Systems,
vol. 30, pp. 291 – 304, 2014, special Issue on Extreme Scale Parallel
Architectures and Systems, Cryptography in Cloud Computing and
Recent Advances in Parallel and Distributed Systems, ICPADS 2012
Selected Papers. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167739X13001337
[28] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P.
Tang, “On large-batch training for deep learning: Generalization gap
and sharp minima,” CoRR, vol. abs/1609.04836, 2016. [Online].
Available: http://arxiv.org/abs/1609.04836
[29] A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. R. Tallent, and K. J.
Barker, “Evaluating modern GPU interconnect: Pcie, nvlink, nv-sli,
nvswitch and gpudirect,” CoRR, vol. abs/1903.04611, 2019. [Online].
Available: http://arxiv.org/abs/1903.04611
[30] Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang,
“Accelerating distributed reinforcement learning with in-switch
computing,” in Proceedings of the 46th International Symposium on
Computer Architecture, ser. ISCA ’19. New York, NY, USA:
Association for Computing Machinery, 2019, pp. 279–291. [Online].
Available: https://doi.org/10.1145/3307650.3322259
[31] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang, A. Schwing,
H. Esmaeilzadeh, and N. Kim, “A network-centric hardware/algorithm
co-design to accelerate distributed training of deep neural networks,”
10 2018, pp. 175–188.
[32] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya,
“Incbricks: Toward in-network computation with an in-network cache,”
in Proceedings of the Twenty-Second International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’17. New York, NY, USA: Association for
Computing Machinery, 2017, pp. 795–809. [Online]. Available:
https://doi.org/10.1145/3037697.3037731
[33] H. Mikami, H. Suganuma, P. U.-Chupala, Y. Tanaka, and
Y. Kageyama, “Imagenet/resnet-50 training in 224 seconds,” CoRR,
vol. abs/1811.05233, 2018. [Online]. Available:
http://arxiv.org/abs/1811.05233
[34] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman,
J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov,
A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu,
V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia,
L. Xiong, and M. Smelyanskiy, “Deep learning recommendation
model for personalization and recommendation systems,” CoRR, vol.
abs/1906.00091, 2019. [Online]. Available:
http://arxiv.org/abs/1906.00091
[35] NVIDIA, “Nvidia collective communications library (nccl),” 2018.
[Online]. Available: https://developer.nvidia.com/nccl
[36] NVIDIA, “Nvidia dgx-2,” 2019. [Online]. Available:
https://www.nvidia.com/en-us/data-center/dgx-2/
[37] S. Ouyang, D. Dong, Y. Xu, and L. Xiao, “Communication
optimization strategies for distributed deep learning: A survey,” 2020.
[38] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms
for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69,
no. 2, pp. 117–124, Feb. 2009. [Online]. Available:
http://dx.doi.org/10.1016/j.jpdc.2008.09.002
[39] R. Rajamony, L. B. Arimilli, and K. Gildea, “Percs: The ibm
power7-ih high-performance computing system,” IBM J. Res. Dev.,
vol. 55, no. 3, pp. 233–244, May 2011. [Online]. Available:
https://doi.org/10.1147/JRD.2011.2109230
[40] S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, “ASTRA-SIM:
Enabling SW/HW Co-Design Exploration for Distributed DL Training
Platforms,” in IEEE International Symposium on Performance
Analysis of Systems and Software, ISPASS 2020, Boston, MA, USA,
August 22-26, 2020. IEEE, 2020. [Online]. Available:
https://synergy.ece.gatech.edu/wp-content/uploads/sites/332/2020/03/astrasim_ispass2020.pdf
[41] A. Samajdar, Y. Zhu, P. N. Whatmough, M. Mattina, and T. Krishna,
“Scale-sim: Systolic CNN accelerator,” CoRR, vol. abs/1811.02883,
2018. [Online]. Available: http://arxiv.org/abs/1811.02883
[42] A. Sapio, M. Canini, C. Ho, J. Nelson, P. Kalnis, C. Kim,
A. Krishnamurthy, M. Moshref, D. R. K. Ports, and P. Richtárik,
“Scaling distributed machine learning with in-network aggregation,”
CoRR, vol. abs/1903.06701, 2019. [Online]. Available:
http://arxiv.org/abs/1903.06701
[43] P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez,
Y. Itigin, M. Dubman, G. Shainer, R. L. Graham, L. Liss, Y. Shahar,
S. Potluri, D. Rossetti, D. Becker, D. Poole, C. Lamb, S. Kumar,
C. Stunkel, G. Bosilca, and A. Bouteiller, “Ucx: An open source
framework for hpc network apis and beyond,” in 2015 IEEE 23rd
Annual Symposium on High-Performance Interconnects, 2015, pp.
40–43.
[44] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang,
B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G. Tell, Y. Zhang,
W. J. Dally, J. Emer, C. T. Gray, B. Khailany, and S. W. Keckler,
“Simba: Scaling deep-learning inference with
multi-chip-module-based architecture,” in Proceedings of the 52nd
Annual IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO ’52. New York, NY, USA: Association for Computing
Machinery, 2019, pp. 14–27. [Online]. Available:
https://doi.org/10.1145/3352460.3358302
[45] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of
collective communication operations in mpich,” Int. J. High Perform.
Comput. Appl., vol. 19, no. 1, pp. 49–66, Feb. 2005. [Online].
Available: http://dx.doi.org/10.1177/1094342005051521
[46] K. D. Underwood, J. Coffman, R. Larsen, K. S. Hemmert, B. W.
Barrett, R. Brightwell, and M. Levenhagen, “Enabling flexible
collective communication offload with triggered operations,” in 2011
IEEE 19th Annual Symposium on High Performance Interconnects,
Aug 2011, pp. 35–42.
[47] Y. Wang, Q. Wang, S. Shi, X. He, Z. Tang, K. Zhao, and X. Chu,
“Benchmarking the performance and power of ai accelerators for ai
training,” 2019.
[48] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah,
M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo,
H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes,
and J. Dean, “Google’s neural machine translation system: Bridging
the gap between human and machine translation,” CoRR, vol.
abs/1609.08144, 2016. [Online]. Available:
http://arxiv.org/abs/1609.08144
