Accelerating Recommender Systems via Hardware "scale-in" by Krishna, Suresh & Krishna, Ravi
Accelerating Recommender Systems
via Hardware “scale-in”
Suresh Krishna∗, Ravi Krishna∗
Electrical Engineering and Computer Sciences
University of California, Berkeley
{suresh_krishna, ravi.krishna}@berkeley.edu
Abstract—In today’s era of “scale-out”, this paper makes the
case that a specialized hardware architecture based on “scale-
in”—placing as many specialized processors as possible along
with their memory systems and interconnect links within one or
two boards in a rack—would offer the potential to boost large
recommender system throughput by 12–62× for inference and
12–45× for training compared to the DGX-2 state-of-the-art AI
platform, while minimizing the performance impact of distribut-
ing large models across multiple processors. By analyzing Face-
book’s representative model—Deep Learning Recommendation
Model (DLRM)—from a hardware architecture perspective, we
quantify the impact on throughput of hardware parameters such
as memory system design, collective communications latency and
bandwidth, and interconnect topology. By focusing on conditions
that stress hardware, our analysis reveals limitations of existing
AI accelerators and hardware platforms.
I. INTRODUCTION
Recommender Systems serve to personalize user experience
in a variety of applications including predicting click-through
rates for ranking advertisements [34], improving search re-
sults [30], suggesting friends and content on social networks
[30], suggesting food on Uber Eats [53], helping users find
houses on Zillow [54], helping contain information overload
by suggesting relevant news articles [55], helping users find
videos to watch on YouTube [43] and movies on Netflix [59],
and several more real-world use cases [60]. An introduction to
recommender system technology can be found in [58] and a
set of best practices and examples for building recommender
systems in [56]. The focus of this paper is recommendation
systems that use neural networks, referred to as Deep Learning
RecSys, or simply RecSys¹. These have recently been applied
with success to a variety of areas [34] [30].
Due to their commercial importance—by improving the
quality of Ads and content served to users, RecSys directly
drives revenues for hyperscalers, especially under cost-per-
click billing—it is no surprise that recommender systems
consume the vast majority (∼80%) of AI inference cycles
within Facebook’s datacenters [1]; the situation is similar [30]
at Google, Alibaba and Amazon. In addition, RecSys training
now consumes >50% of AI training cycles within Facebook’s
datacenters [10]. Annual growth rates are 3.3× for training
[10] and 2× for inference [31]. The unique characteristics
∗Equal contribution.
¹While the term is often used to denote any recommender system, it is
specifically used for Deep Learning recommenders here.
of these workloads present challenges to datacenter hardware
including CPUs, GPUs and almost all AI accelerators. Com-
pared to other AI workloads such as computer vision or natural
language processing, RecSys tend to have larger model sizes of
up to 10TB [7], are memory access intensive [30], have lower
compute burden [31], and rely heavily on CC² operations [6].
These characteristics make RecSys a poor fit for many existing
systems, as described in Sec. II, and call for a new approach
to accelerator HW architecture.
In this paper, various HW architectures are analyzed using
Facebook’s DLRM [30] as a representative example and the re-
sulting data are used to derive the characteristics of RecSpeed,
an architecture optimized for running RecSys. Specifically, we:
• Describe the DLRM workload in terms of its character-
istics that impact HW throughput and latency.
• Identify HW characteristics such as memory system de-
sign, CC latency and bandwidth, and CC interconnect
topology that are key determinants of upper bounds on
RecSys throughput.
• Use a generalized roofline model that adds communica-
tion cost to memory and compute to show that specialized
chip-level HW features can improve upper bounds on
RecSys throughput 4× by reducing latency and 7× by
improving bandwidth. It is known that a fixed-topology
quadratic interconnect can offer CC all-to-all performance
gains of 2.3×–15× [10].
• Explain why AI accelerators for RecSys would benefit
from supporting hybrid memory systems that combine
multiple forms of DRAM.
• Evaluate the practical implementation of specialized HW
for RecSys and show how this has the potential to
improve throughput by 12–62× for inference and 12–
45× for training compared to the NVIDIA DGX-2 AI
platform.
II. RELATED WORK
There are several published works that describe systems
and chips for accelerating RecSys. Compared to these, our
work focuses on sweeping various hardware parameters for
a homogeneous³ system in order to understand the impact
²Collective Communications.
³System where a single type of processor contributes the bulk of processing
and communications capability.
arXiv:2009.05230v1 [cs.AR] 11 Sep 2020
of each upon upper bound DLRM system throughput. As
such, we do not evaluate, from the standpoint of RecSys
acceleration, any of the other types of systems described
below.
A. Heterogeneous Platforms
Facebook’s Zion platform [2] is a specialized heterogeneous⁴
system for training RecSys. Fig. 1 shows the major
components and interconnect of the Zion platform, which are
summarized in Table I. Zion offers the benefit of combining
very large CPU memory to hold embedding tables with high
compute capability and fast interconnect from GPUs, along
with 8 100GbE NICs for scale-out. AIBox [7] from Baidu is
another heterogeneous system. A key innovation of AIBox is
the use of SSD memory to store the parameters of a model up
to 10TB in size; the system uses CPU memory as a cache for
SSD. The hit rate of this combination is reported to increase as
training proceeds, plateauing at about 85% after 250 training
mini-batches. A single-node AIBox is reported to train a 10TB
RecSys with 80% of the throughput of a 75-node MPI cluster
at 10% of the cost.
TABLE I: Key Features of FB Zion [2] [10].
Feature | Example of Implementation
CPU | 8x server-class processor such as Intel Xeon
CPU memory speed | DDR4 DRAM, 6 channels/socket, up to 3200MHz, ∼25.6GB/s/channel
CPU memory capacity | Typical 1 DIMM/channel, up to 256GB/DIMM = 1.5TB/CPU
CPU interconnect | UltraPath Interconnect, coherent
CPU I/O | PCIe Gen4 x16 per CPU, ∼30GB/s
AI HW accelerator | Variable
Accelerator interconnect | 7 links per accelerator, x16 serdes lanes/link, 25G, ∼350GB/s/card
Accelerator memory | On-package, likely HBM
Accelerator power | Up to 500W @ 54V/48V, support for liquid cooling
Scale-out | 1x NIC per CPU, typically 100Gb Ethernet
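The derived figures in Table I can be checked with simple arithmetic. The sketch below assumes a 64-bit (8-byte) DDR4 channel and reads "25G" as 25 Gb/s per serdes lane; neither is spelled out in the table, and encoding overhead is ignored.

```python
# Back-of-envelope check of the derived values in Table I.
# Assumptions (not stated in the table): a DDR4 channel is 64 bits (8 bytes)
# wide, "25G" means 25 Gb/s per serdes lane, and encoding overhead is ignored.

def ddr4_channel_gb_per_s(mega_transfers_per_s, bytes_per_transfer=8):
    """Peak bandwidth of one DDR4 channel in GB/s."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

def cpu_memory_capacity_tb(channels, dimms_per_channel, gb_per_dimm):
    """Capacity per CPU socket in TB (1 TB = 1024 GB)."""
    return channels * dimms_per_channel * gb_per_dimm / 1024

def link_bandwidth_gb_per_s(links, lanes_per_link, gbit_per_lane):
    """Aggregate one-direction interconnect bandwidth in GB/s."""
    return links * lanes_per_link * gbit_per_lane / 8

channel_bw = ddr4_channel_gb_per_s(3200)        # 25.6 GB/s per channel
capacity = cpu_memory_capacity_tb(6, 1, 256)    # 1.5 TB per CPU
accel_bw = link_bandwidth_gb_per_s(7, 16, 25)   # 350 GB/s per card
print(channel_bw, capacity, accel_bw)
```

Under these assumptions the three results match the ∼25.6GB/s/channel, 1.5TB/CPU, and ∼350GB/s/card entries in the table.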
B. Homogeneous Platforms
NVIDIA’s DGX-A100 and Intel/Habana’s HLS-1 are two
representative examples of homogeneous AI appliances. Ta-
bles II and III respectively summarize their key characteristics.
C. In/Near Memory Processing
RecSys models tend to be limited by memory accesses to
embedding table values that are then combined using pooling
operators [19], which makes the integration of memory with
embedding processing an attractive solution.
The first approach for this is to modify a standard DDR4
DIMM by replacing its buffer chip with a specialized processor
that can handle embedding pooling operations.
Fig. 2 shows the idea behind this concept for a scenario
involving two dual-rank DIMMs sharing one memory channel.
⁴System that combines various processor types, each providing specialized
compute and communications capability.
TABLE II: Key Features of NVIDIA DGX-A100 [76].
Feature | Example of Implementation
CPU | 2x AMD Rome 7742, 64 cores each
CPU memory speed | DDR4 DRAM, 8 channels/socket, up to 3200MHz, ∼25.6GB/s/channel
CPU memory capacity | Typical 1 DIMM/channel, up to 256GB/DIMM = 2TB/CPU
AI HW accelerator | 8x NVIDIA A100
Accelerator interconnect | Switched all-to-all, NVLink3, 12 links/chip, ∼300GB/s bandwidth per chip
Accelerator memory | HBM2 @ 2430MHz [75], 40GB/chip, 320GB total
System power | ∼6.5kW max.
Scale-out | 8x 200Gb/s HDR Infiniband
TABLE III: Key Features of Intel/Habana HLS-1 [25].
Feature | Example of Implementation
AI HW accelerator | 8x Habana Gaudi
Accelerator interconnect | Switched all-to-all, 10x 100Gb Ethernet per chip of which 7 are available for interconnect
Accelerator memory | HBM, 32GB/chip, 256GB total
System power | ∼3kW max.
Scale-out | 24x 100Gb Ethernet configured as 3 links per chip
In this example, the typical buffer chip found on each DIMM
is replaced by a specialized NMP (Near-Memory Processing)
chip such as TensorDIMM [20] or RecNMP [19] that can
access both ranks simultaneously, cache embeddings, and pool
them prior to transfer over the bandwidth-constrained channel.
For a high-performance server configuration with one dual-
rank DIMM per memory channel, simulations [19] indicate
speedups of 1.61× to 1.96×.
It is also possible to provide a memory/compute module in
a non-DIMM form factor. An example is Myrtle.ai’s SEAL
Module [77], an M.2 module specialized for recommender
systems embedding processing. This is built from Bittware’s
250-M2D module [78] that integrates a Xilinx Kintex Ultra-
Scale+ KU3P FPGA along with two 32-bit channels of DDR4
memory. Twelve such modules can fit within an OpenCompute
Glacier Point V2 carrier board [79], in the space typically
taken up by a Xeon CPU with its six 64-bit channels of DDR4
memory.
The second approach is to build a processor directly into a
DRAM die. UpMem [27] has built eight 500MHz processors
into an 8Gb DRAM. Due to the lower speed of the DRAM
process, the CPU uses a 14-stage pipeline to boost its clock
rate. Several factors limit the performance gains available with
this approach. These include the limited number of layers of
metal (e.g. 3) in DRAM semiconductor processes, the need
to place the processor far downstream from sensitive analog
sense amplifier logic, and the lag of DRAM processes behind
logic processes.
[Figure 1 diagram: four repeated groups, each containing DDR memory, two CPUs (CPU#0 through CPU#7 across the groups), two NICs, a PCIe switch, and two AI HW accelerators with attached memory. Labeled links: PCIe x16; 7 links x16 serdes Accelerator Interconnect; CPU UltraPath Interconnect; Scale-out. Labeled boards: Baseboard; Host Interface Board.]
Fig. 1: FB Zion System: Major components and interconnect.
[Figure 2 diagram: two NMP DIMMs (#0 and #1) attached to one DDR4 memory channel. Each DIMM has two ranks (RANK#0, RANK#1) of DRAM chips and an NMP chip labeled "pooling, caching".]
Fig. 2: Near-Memory Processing via Modified
DIMM Buffer Chip. Example shows two DIMMs,
each dual-rank, sharing a single memory channel.
D. DRAM-less Accelerators
DRAM-less AI accelerators include the Cerebras CS1 [15],
Graphcore GC2 [13] and GC200 [16]. The CS1 and GC2 lack
attached external DRAM, and the GC200 likely offers low-
bandwidth access to such DRAM since two DDR4 DIMMs
are shared between four GC200s via a gateway chip.
E. Other approaches
Centaur [61] offloads embedding layer lookups and dense
compute to an FPGA that is co-packaged with a CPU via co-
herent links. NVIDIA’s Merlin [62], while not a HW platform
for RecSys, is an end-to-end solution to support the devel-
opment, training and deployment of RecSys on GPUs. Intel
[6] describes optimizations that improve DLRM throughput
by 110× on a single CPU socket, to a level about 2× that of
a V100 GPU, with excellent scaling properties on clusters of
up to 64 CPUs.
There is also a considerable body of work on domain-
specific architectures for accelerating a broad set of AI ap-
plications and there are several surveys of the field [71] [72]
[73]. An up-to-date list of commercial chips can be found in
[68]. Approaches using FPGAs are described in [70] [69], and
[74] describes a datacenter AI accelerator.
III. OVERVIEW OF RECOMMENDER SYSTEMS
In this section, we provide an overview of a representative
RecSys, Facebook’s DLRM [34], from both application and
algorithm/model perspectives, including relevant deployment
constraints and goals that will guide the development of our
architecture. An overview of several other RecSys can be
found in [30].
A. Black box model of RecSys; Inputs & Output
The focus of this paper is a RecSys for rating an individual
item of content, as opposed to RecSys that process multiple
items of content simultaneously [63]. Inputs to the RecSys
are a description u of a user and a description c of a
piece of content; the RecSys outputs the estimated probability
$P(u, c) \in (0, 1)$ that the user will interact with the content
in some specified way. Both the user and the content are
described by a set of dense features $\in \mathbb{R}^n$ and a set of
sparse features $\in \{0, 1\}^m$. Dense features are continuous,
real-valued inputs such as the click-through rate of an Ad
during the last week [33], the average click-through rate of
the user [33], and the number of clicks already recorded for
an Ad being ranked [35]. Each sparse (or categorical) feature
is a multi-hot vector with a small number of 1s among
many 0s, and represents information
such as user ID, Ad category, and user history such as visited
product pages or store pages [64] [65].
From a conceptual standpoint, the RecSys could be run
on every piece of content considered for a user and the
resulting output probabilities could then be used to determine
which items of content to show that user to achieve business
objectives such as maximizing revenue or user engagement.
Note that the RecSys output may be combined with other
components in order to decide what content to ultimately show
the user, as described in Sec. III-B.
B. Deployment Scenario
[Figure 3 diagram labels: User requests content; Ad inventory; User data; Ad filtering; Ad candidates; Ad rating (recommender system); Top few Ads; Ad auction; Winning Ad to display.]
Fig. 3: Overview of multi-step Ad serving
process consisting of filtering a large set of
Ad content down to a small set of candi-
dates, followed by ranking the candidates,
the Ad auction, and displaying the winning
Ad [1] [30] [51].
A typical deployment scenario of a RecSys model is illus-
trated in Fig. 3:
1) User loads a page, which triggers a request to generate
personalized content. This request is sent from the user’s
device to the company’s datacenter.
2) Based on available content, an input query is generated,
consisting of a set of B (the query size) features, each
one representing a piece of content and the user, and
possibly their interactions. B varies by recommendation
model. There is typically a hierarchy of recommendation
models whereby easier-to-evaluate models rank larger
amounts of content first with high B, and then pass the
top results to more complex models that rate smaller
amounts of content with lower B [1]. B in the low
to mid-hundreds is representative [30], with some B as
large as ∼900.
3) This query is then passed on to the RecSys. This system
outputs, for each of the B pieces of potential content, the
probability that the user will interact with that content
in some way. In the case of advertising, this often
means the probability that the user will click on the Ad.
For video content, it could be metrics related to user
engagement [43].
4) Based on these probabilities, the most relevant content is
returned to the user, for instance “the top tens of posts”
[1]. However, for Ad ranking, the probabilities generated
by the RecSys are first fed to an auction where they are
combined with advertiser bids to select the Ad(s) that
are ultimately shown to the user [42] [51].
C. System Constraints
RecSys operate under strict constraints.
Inference Constraints: The system must return the most
accurate results within the latency constraint (set by the SLA
and thus inviolable) defined statistically in Equation 1, where
$\mathrm{PPF}(D_Q, P)$ is the percentage point function (the inverse
CDF), $D_Q$ is the distribution of the times to evaluate each
query, $P$ is a percentile, such as the 99th or 90th, and $C_{SLA}$
is the latency constraint, typically in the range of “tens to
hundreds of milliseconds” [1]. Tail latencies depend on
the QPS (queries per second) throughput of the serving system
[30]; one method of trading off QPS against tail latency, explored
in [30], is adjusting the query size.
$$\mathrm{PPF}(D_Q, P) \le C_{SLA} \tag{1}$$
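As an illustration of Equation 1, the sketch below checks a set of measured query latencies against an SLA bound. It uses the nearest-rank percentile as an empirical stand-in for the percentage point function; the function names and the sample latencies are hypothetical and not taken from the paper.

```python
import math

def percentile_latency(latencies_ms, percentile):
    """Nearest-rank percentile: an empirical stand-in for PPF(D_Q, P)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def meets_sla(latencies_ms, percentile, c_sla_ms):
    """True when the chosen tail latency is within the SLA bound (Eq. 1)."""
    return percentile_latency(latencies_ms, percentile) <= c_sla_ms

# Hypothetical per-query latencies in milliseconds (illustrative only).
samples = [12, 15, 11, 40, 13, 14, 95, 16, 12, 13]
print(percentile_latency(samples, 90), meets_sla(samples, 90, 50))
print(percentile_latency(samples, 99), meets_sla(samples, 99, 50))
```

The same sample set passes at the 90th percentile and fails at the 99th, which is why the choice of P matters as much as the bound itself.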
Training Constraints: The system must train the most
accurate model within the minimum amount of time. Rec-
ommendation models need to be retrained frequently in order
to maintain accuracy when faced with changing conditions and
user preferences [32], and time spent training a new model is
time when that model is not contributing to revenue.
Total deployment cost for the hardware needed to run the
system is also a consideration. This is measured as the TCO,
or Total Cost of Ownership, of that hardware.
D. Model Overview of DLRM
The DLRM structure was open-sourced by Facebook in
2019 [34]. Fig. 4 illustrates the layers that comprise the
DLRM model. DLRM is meant to be a reasonably general
structure; for simplicity, the “default” implementation is used,
with sum pooling for embeddings, an FM⁵-based [36] feature
interactions layer, and exclusion of the diagonal entries from
the feature interactions layer output.
Inputs to DLRM are described in Sec. III-A. We note
that each sparse feature is effectively a set of indices into
embedding tables. We now describe embedding tables and
the other main components of DLRM.
Embedding tables are look-up tables, each viewed as a c×d
matrix. Given a set of indices into a table, the corresponding
rows are looked up, transposed to column vectors and com-
bined into a single vector through an operation called pooling.
Common pooling operators include sum and concatenation
[30]. An example of a sparse feature is user ID—a single
index denoting the user for whom Ads are being ranked,
corresponding to a single vector in an embedding table [39].
We typically refer to the output of embedding tables as
⁵Factorization Machine.
[Figure 4 diagram: dense features feed the bottom MLP (FC layers + ReLU); sparse features feed embedding tables #1 through #s via lookups and pooling; these outputs enter the feature interactions layer, which feeds the top MLP (FC layers + ReLU, final FC layer + sigmoid) to produce the predicted CTR. Feature size annotations shown: d and s(s+1)/2. Notation: c[i] = number of rows in embedding table #i; d = embedding dimension; s = number of embedding tables. The batch dimension is not shown.]
Fig. 4: DLRM Structure illustrating dense inputs and their associated bottom MLP, categorical inputs
and embedding tables, feature interactions layer, and top MLP [34].
embedding vectors, and embedding tables themselves may
be referred to as embedding layers.
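The lookup and pooling steps described above can be sketched in a few lines. This is a minimal single-sample illustration in plain Python; the table contents and function names are hypothetical.

```python
def pool_sum(table, indices):
    """Look up rows of a c x d embedding table and sum them elementwise."""
    d = len(table[0])
    pooled = [0.0] * d
    for idx in indices:
        row = table[idx]
        for k in range(d):
            pooled[k] += row[k]
    return pooled

def pool_concat(table, indices):
    """Alternative pooling: concatenate the looked-up rows."""
    return [value for idx in indices for value in table[idx]]

# Hypothetical 4 x 3 table (c = 4 rows, d = 3).
table = [
    [0.1, 0.2, 0.3],
    [1.0, 1.0, 1.0],
    [0.5, 0.0, -0.5],
    [2.0, 4.0, 6.0],
]
multi_hot_indices = [1, 3]   # positions of the 1s in the multi-hot vector
print(pool_sum(table, multi_hot_indices))   # [3.0, 5.0, 7.0]
print(pool_concat(table, [2]))              # a single-index feature, e.g. a user ID
```

Sum pooling keeps the output at d values regardless of how many indices are looked up, whereas concatenation grows with the number of indices.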
FC Layers: The DLRM contains two Multi-Layer Percep-
trons (MLPs), which consist of stacked (or composed) fully
connected (FC) layers. The “bottom MLP” processes the dense
feature inputs to the model, and the “top MLP” processes the
concatenation of the feature interactions layer output and the
bottom MLP output. DLRM uses ReLU activations for all FC
layers except the last one in the top MLP which uses a sigmoid
activation; this is to convert the 1-dimensional output of that
layer to a click probability ∈ (0, 1).
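A minimal sketch of such an MLP follows, with ReLU on every layer except an optional final sigmoid. The toy weights and layer sizes are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def fc(x, weights, bias):
    """One fully connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp(x, layers, final_sigmoid):
    """Apply FC layers; ReLU everywhere except an optional sigmoid on the last."""
    for i, (weights, bias) in enumerate(layers):
        x = fc(x, weights, bias)
        last = i == len(layers) - 1
        x = sigmoid(x) if (last and final_sigmoid) else relu(x)
    return x

# Hypothetical toy weights: a 2 -> 2 -> 1 "top MLP".
top_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
p = mlp([2.0, 1.0], top_layers, final_sigmoid=True)
print(p)   # one value strictly between 0 and 1
```

Calling `mlp` with `final_sigmoid=False` gives the bottom-MLP behavior, where every layer uses ReLU.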
Feature interactions layer: This layer is designed to
combine the information from the dense and sparse features.
The FM-based feature interactions layer forces the embedding
dimension of every table and the output dimension of the final
FC layer in the bottom MLP to all be equal. After the dense
features are first run through the bottom MLP, the output size
is b×d, where b is the batch size being used for inference or
training, and d is the common embedding dimension. Further,
each of the s sparse feature embedding tables produces a b×d
output after pooling. These are concatenated along a new dimension
to construct a $b \times (s+1) \times d$ tensor which we will call
$A$. Let $A' \in \mathbb{R}^{b \times d \times (s+1)}$ be constructed by transposing the last
two dimensions of $A$. $F$ is then calculated as the batch matrix
multiplication of $A$ and $A'$ such that $F \in \mathbb{R}^{b \times (s+1) \times (s+1)}$.
Roughly half of these entries are duplicates, and those are
discarded along with optionally the diagonal entries, which
we opt to do. After flattening the two innermost dimensions,
the result is the output matrix $F' \in \mathbb{R}^{b \times \frac{s^2+s}{2}}$. This batch
of vectors, after being concatenated with the bottom MLP
output, is then fed to the top MLP. In Algorithms 1 and 2,
part of Sec. IV-C where we cover the implementation of
DLRM from a HW architecture perspective, for simplicity, this
concatenation is considered as part of the feature interactions
layer.
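The feature interactions computation can be sketched for a single sample (batch dimension omitted) as follows. The function names are hypothetical, and placing the bottom-MLP output first in the concatenation is an assumption rather than something the text specifies.

```python
def interact(vectors, keep_diagonal=False):
    """Pairwise dot products of s+1 vectors, without duplicate pairs.

    vectors: the bottom-MLP output plus s pooled embeddings, each of length d.
    Returns the strictly lower triangle of A A^T (optionally with the diagonal).
    """
    out = []
    for i in range(len(vectors)):
        upper = i + 1 if keep_diagonal else i
        for j in range(upper):
            out.append(sum(a * b for a, b in zip(vectors[i], vectors[j])))
    return out

def feature_interactions(dense_out, pooled_embeddings):
    """Single-sample layer: bottom-MLP output followed by the interactions.

    The concatenation order is an assumption made for this sketch.
    """
    return dense_out + interact([dense_out] + pooled_embeddings)

# Hypothetical sample with s = 2 tables and d = 2.
dense_out = [1.0, 0.0]
pooled = [[0.0, 1.0], [1.0, 1.0]]
print(interact([dense_out] + pooled))            # (s^2 + s) / 2 = 3 entries
print(feature_interactions(dense_out, pooled))   # d + 3 = 5 entries
```

With s = 2 the interaction output has (s² + s)/2 = 3 entries, and keeping the diagonal would add s + 1 more.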
E. RecSys vs. other AI models
The two key factors that set RecSys models apart from
other AI models (such as those for computer vision or natural
language processing) are arithmetic intensity—the number of
compute operations relative to memory accesses—and model
size. RecSys models are significantly larger and have signifi-
cantly lower arithmetic intensity as shown in Table IV, result-
ing in increased pressure on memory systems and interconnect
structure.
IV. DLRM FROM A HARDWARE ARCHITECTURE
PERSPECTIVE
A. Distributed model setup
The characteristics of recommender systems require a com-
bination of model parallelism and data parallelism. On a large
homogeneous system, dense parameters such as FC layer
weights are copied onto every processor. However, embedding
tables are distributed across the memory of all processors,
such that no parameters are replicated. This is because
embedding tables can easily range from several hundred GB to
multiple TB in size [40].
While each processor can compute the dense part of the
model for a batch, it cannot look up the embeddings specified
by the sparse features, because some of those embeddings are
stored in the memory of other processors. Thus, each processor
TABLE IV: Recommender System vs. Other AI Models. Data from Table 1 of [31].
Category | Model Type | Size (#Params) | Arithmetic Intensity | Max. Live Activations
Computer Vision | ResNeXt101-32x4-48 | 43-829M | Avg. 188-380 | 2.4-29M
Language | Seq2Seq GRU/LSTM | 100M-1B | 2-20 | >100K
Recommender | Fully Connected Layers | 1-10M | 2-200 | >10K
Recommender | Embedding Layers | >10B | 1-2 | >10K
needs to query other processors for embedding entries. This
leads to the communication patterns outlined in Sec. IV-B.
There are multiple ways to split up the embedding tables.
The distribution and (if beneficial) replication of tables across
processors must be optimized to avoid system bottlenecks by
evening out memory access loading across the memory sys-
tems of the various processors. One extreme is “full sharding”
of the tables, where tables are split up at the vector level across
the attached memory systems of multiple processors to
distribute table lookups almost uniformly, at the cost
of increased communication. The increased communication
and the resulting stress on the HW arise because, with
full sharding, no processor can pool embeddings locally.
Unpooled vectors, of which there are many more than pooled
vectors (see Sec. III-D), must therefore be
communicated. As such, many systems
today attempt to fit entire tables into the memory of single
processors (which we call “no sharding”), or to break up the
tables as little as possible as shown on slide 6 of [2].
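To make the communication cost of full sharding concrete, the following sketch counts the embedding vectors exchanged for one batch. It is an upper-bound illustration: it assumes that every lookup under full sharding lands on a remote processor and that an unsharded table is pooled locally before its result is sent. The workload numbers are hypothetical.

```python
def vectors_communicated(batch_size, num_tables, lookups_per_table, full_sharding):
    """Count embedding vectors exchanged for one batch (upper-bound sketch).

    Assumes every lookup under full sharding lands on a remote processor and
    that, without sharding, each table is pooled locally before being sent.
    """
    per_sample_per_table = lookups_per_table if full_sharding else 1
    return batch_size * num_tables * per_sample_per_table

# Hypothetical workload: 512 samples, 8 tables, 32 lookups per table.
pooled = vectors_communicated(512, 8, 32, full_sharding=False)
unpooled = vectors_communicated(512, 8, 32, full_sharding=True)
print(pooled, unpooled, unpooled // pooled)
```

Under these assumptions the ratio between the two cases equals the number of lookups per table, since pooling collapses all lookups for a table into one vector.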
B. CC operations
The following CC operations are key to distributed RecSys
throughput. For more information on CC operations, please
see [8]. In all of the scenarios described below, there are n
processors, numbered 1, 2, · · · , n.
All-to-all: This CC primitive essentially implements a
“transpose” operation. Each processor starts out with
n pieces of data. If $A_{ij}$ denotes the jth piece of
data currently residing on the ith processor, then the all-to-all
operation moves $A_{ij}$ to the jth processor, where it is
ordered as the ith piece of data. This
operation is useful when each processor needs to send some
data to every other processor.
All-reduce: All-reduce is an operation which replaces local
data on each processor with the sum of data across all
processors. If processor i contains data $A_i$, then after the
all-reduce operation, all processors will contain $AR = \sum_{i=1}^{n} A_i$.
Efficient all-reduce algorithms exist for ring interconnects [8],
which are popular in AI systems; however, all-reduce can
also be implemented in equivalent time by first performing
a reduce-scatter operation and then an all-gather operation.
Reduce-scatter: This operation may be best described as an
all-to-all operation followed by a “local reduction.” Similar to
the all-to-all setup, the starting point is $A_{ij}$ as the jth piece of
data on processor i. Where all-to-all will result in this being
the ith piece of data on processor j, reduce-scatter performs an
extra reduction, such that the only piece of data on processor
j at the end is $\sum_{i=1}^{n} A_{ij}$.
All-gather: In this operation, the starting point is a single
piece of data $A_i$ on each of the n processors, and the result
of the operation is to have every $A_i$ on every processor.
For example, with 4 processors initially containing $A_1$, $A_2$,
$A_3$, and $A_4$, respectively, after the all-gather operation, each
processor will contain all of $A_1$, $A_2$, $A_3$, and $A_4$.
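The four primitives can be simulated on nested lists to make their data movement explicit. This is a functional sketch only: it models what ends up where, not how a real interconnect schedules the transfers, and processors are 0-indexed here.

```python
def all_to_all(data):
    """data[i][j] is piece j on processor i; the result is the transpose."""
    n = len(data)
    return [[data[i][j] for i in range(n)] for j in range(n)]

def reduce_scatter(data):
    """Processor j ends with the sum over i of data[i][j]."""
    return [sum(column) for column in all_to_all(data)]

def all_gather(pieces):
    """pieces[i] lives on processor i; every processor ends with all pieces."""
    return [list(pieces) for _ in pieces]

def all_reduce(values):
    """values[i] lives on processor i; every processor ends with the sum."""
    total = sum(values)
    return [total for _ in values]

def all_reduce_via_rs_ag(vectors):
    """All-reduce of length-n vectors as reduce-scatter then all-gather."""
    return all_gather(reduce_scatter(vectors))

# Three simulated processors, each holding three pieces.
data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(all_to_all(data))             # [[1, 10, 100], [2, 20, 200], [3, 30, 300]]
print(reduce_scatter(data))         # [111, 222, 333]
print(all_reduce_via_rs_ag(data))   # every processor holds [111, 222, 333]
```

The last call shows the decomposition mentioned above: reduce-scatter leaves each processor with one slice of the sum, and all-gather then distributes every slice to every processor.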
C. Detailed steps for DLRM Inference and Training
Please refer to the Appendix for a description of operators
and an in-depth view of the steps involved in DLRM inference
and training.
D. Key HW performance factors
Our performance model (Sec. V-B) helps identify several
HW characteristics that are major determinants of throughput
for the RecSys that we evaluate. As could be expected from
the DLRM operations described in the previous three sections,
these are CC performance, memory system performance for
embedding table lookups, compute performance for running
the dense portions of the model, and on-chip buffering. The
following sections examine each of these factors in more
detail.
1) CC Performance: RecSys make extensive use of CC for
exchanging embedding table indices, embedding values, and
for gradient averaging during training backprop. Table V sum-
marizes HW factors that impact RecSys CC performance. In
particular, all-to-all with its smaller message sizes is especially
sensitive to latency and scales poorly on interconnect topologies such as
rings [10]. Various lower bounds on the latency and throughput
of these operators have been derived for representative system
architectures [8].
TABLE V: Summary of HW Collective Communications factors for RecSys.
Factor | Impact
Per-processor bandwidth | Sets upper bounds on CC all-reduce and all-to-all throughput
Interconnect topology | Major determinant of all-to-all throughput
Ring interconnect | Well-suited for all-reduce, poorly suited for all-to-all
CC latency | Particularly important for all-to-all due to smaller message sizes
For the purposes of HW analysis on homogeneous⁶ systems,
we note the following [8]:
⁶Systems where all processors share equally in the communications and
arithmetic components of CC.
• For an all-to-all with data volume $V$ and $n$ processors,
the lower bound on the amount of data sent and received
by each processor is $V \times \frac{n-1}{n}$.
• Similarly for an all-reduce, the lower bound is $2 \times V \times \frac{n-1}{n}$.
• The above bounds impose a minimum per-processor
data volume that must be transferred. As a consequence,
the bandwidth per processor will limit overall all-to-all
and all-reduce throughput, even as more processors
are added to a system. As a rule of thumb, for a large
system with a per-processor interconnect bandwidth $BW$,
the maximum achievable system-wide all-to-all and all-reduce
throughputs are roughly $BW$ and $\frac{BW}{2}$, respectively.
The above rule of thumb works well for NVIDIA’s DGX-2
system built from 16 V100 processors [9] since it achieves
an “all-reduce bandwidth⁷” of ∼118GB/s with a maximum
per-processor bandwidth (Fig. 9 of [9]) of 150GB/s from six
NVLink2 interconnects per V100 chip, for an efficiency of
79% of the theoretical peak.
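The lower bounds listed above and the DGX-2 efficiency figure can be reproduced with a few lines of arithmetic. The function names are hypothetical; the numeric inputs (16 processors, ∼118GB/s achieved, 150GB/s peak) come from the text.

```python
def all_to_all_min_bytes(volume, n):
    """Lower bound on data sent and received per processor: V * (n - 1) / n."""
    return volume * (n - 1) / n

def all_reduce_min_bytes(volume, n):
    """Lower bound per processor for all-reduce: 2 * V * (n - 1) / n."""
    return 2 * volume * (n - 1) / n

def efficiency(achieved_gb_per_s, peak_gb_per_s):
    """Achieved bandwidth as a fraction of per-processor peak."""
    return achieved_gb_per_s / peak_gb_per_s

n = 16                                     # V100 processors in a DGX-2
print(all_to_all_min_bytes(1.0, n))        # 0.9375 of V
print(all_reduce_min_bytes(1.0, n))        # 1.875 of V
print(round(100 * efficiency(118, 150)))   # 79 (percent)
```

As n grows, the factor (n−1)/n approaches 1, which is why the per-processor volume approaches V for all-to-all and 2V for all-reduce.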
[Figure 5 plot: all-gather bandwidth in GB/s (0 to 120) versus data volume in KB (100 to 1,000,000). Series: ACTUAL; FIT with 100μs latency and 119.5 GB/s bandwidth.]
Fig. 5: DGX-2 All-gather bandwidth, measured
values [12] compared to simple latency/bandwidth
model. All-gather time for smaller message sizes
is latency dominated.
Fig. 5 illustrates the impact of latency on CC throughput.
Due to the ∼100µs latency for all-gather on the DGX-2, the
time to perform an all-gather for smaller message sizes is
latency dominated. Latencies for a few current systems are
shown in Table VI.
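A simple latency/bandwidth model of this kind can be sketched as below. The functional form (fixed latency plus volume divided by peak bandwidth) and the use of decimal units (1 KB = 1000 bytes) are assumptions; only the fitted values of 100µs and 119.5 GB/s come from Fig. 5.

```python
def transfer_time_s(volume_bytes, latency_s, bandwidth_bytes_per_s):
    """Fixed latency plus serialization time at peak bandwidth."""
    return latency_s + volume_bytes / bandwidth_bytes_per_s

def effective_bandwidth_gb_per_s(volume_bytes, latency_s, bandwidth_bytes_per_s):
    """Delivered bandwidth once the fixed latency is accounted for."""
    seconds = transfer_time_s(volume_bytes, latency_s, bandwidth_bytes_per_s)
    return volume_bytes / seconds / 1e9

LATENCY_S = 100e-6        # fitted all-gather latency from Fig. 5
PEAK_B_PER_S = 119.5e9    # fitted bandwidth from Fig. 5

for kilobytes in (100, 10_000, 1_000_000):
    bw = effective_bandwidth_gb_per_s(kilobytes * 1e3, LATENCY_S, PEAK_B_PER_S)
    print(f"{kilobytes:>9} KB -> {bw:6.1f} GB/s")
```

In this model, delivered bandwidth reaches half of peak when the data volume equals latency × bandwidth, which is about 11.95 MB for the fitted values; below that, transfer time is dominated by latency.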
TABLE VI: Latencies of interest.
System | Latency incl. SW overhead
NVIDIA DGX-2 CC All-reduce | Est. ∼50µs [12]
NVIDIA DGX-2 CC All-gather | Est. ∼100µs [12]
NVIDIA DGX-2 NVLink2 point to point | ∼10µs [12]
NVIDIA DGX-2 PCIe point to point | ∼30µs [12]
Graphcore GC2, 16-IPU system, single-destination Gather | ∼25µs [13]
Graphcore GC2, 16-IPU system, single-destination Reduce | ∼14.5µs [13]
The other important factor for CC, available per-chip peak
bandwidth, is summarized in Table VII. Note that Broadcom’s
⁷Data transfer rate in each direction, per processor, during all-reduce.
Tomahawk 4 switch chip supports an order of magnitude more
bandwidth than any AI HW chip, demonstrating the headroom
available to improve per-chip bandwidth for AI applications.
TABLE VII: HW communication peak bandwidth for various chips.
Chip | Peak HW bandwidth, each direction, per chip
CPU: Intel Xeon Platinum 8180 | 62GB/s UPI, 48GB/s PCIe aggregate [6], [22], [23]
GPU: NVIDIA A100 | NVLink3, 300GB/s [21]
FPGA: Achronix Speedster AC7t1500 | 400GB/s via 112G SerDes [18]
AI HW: Graphcore GC200 | 320GB/s via 10x IPU-Links [14]
AI HW: Intel/Habana Gaudi | 125GB/s via 10x 100GbE [25]
Network Switch: Broadcom Tomahawk 4 | 3,200GB/s [26]
2) Memory System Performance: This section refers to
external DRAM memory attached to an AI HW accelerator
or CPU. RecSys apply considerable pressure on the memory
system due to the large number of accesses to embedding
tables. Furthermore, these accesses have limited spatial locality
(but some temporal locality) [19], resulting in scattered memory
accesses of 64B-256B in size [1] that exhibit poor DRAM
page hit characteristics. Multiple ranks per DIMM and internal
banks and bank groups per memory die help by increasing
parallelism, within limits set by memory device timing parameters
(such as command issue rates) and on-die power distribution.
TABLE VIII: External Memory systems for select AI HW.
Chip | Memory System
Intel Xeon CPU | DDR4, 6 channels for Xeon Gold, up to 1.5TB [5]
NVIDIA A100 | HBM, 5 stacks, 40GB total
NVIDIA TU102 RTX2080Ti | GDDR6, 11 chips, 11GB total
Habana Goya | DDR4, 2 channels, 16GB
Graphcore GC2 | None, 300MB on-chip
Graphcore GC200 | 900MB on-chip, 2x DDR4 channels for 4 chips [16]
Cerebras CS1 | None, all memory is on-wafer, 18GB total
Table VIII shows memory types for a few representative
AI HW chips and Fig. 6 shows achievable effective memory
bandwidth for various memory system configurations and
embedding sizes. HBM has considerably higher performance
for random embedding accesses, while the typical 6-channel
DDR4 server CPU memory system has far lower performance,
especially for smaller embedding sizes. However, HBM and
GDDR6 suffer from limited capacity compared to DDR4 as
shown in Fig. 7.
3) Compute Performance: Table IX shows available com-
pute capability of various chips. For the specific workload that
we analyze, compute capability is not a limiting factor (see
Sec. V-A) and this is believed to be the case for many RecSys
workloads when run on specialized AI accelerators [30].
[Figure 6 plot: peak bandwidth capability in GB/s (0 to 1400) versus embedding size in bytes (32 to 256). Series: HBM2E @ 2400MHz, 4 stacks; GDDR6 @ 16GHz, 12 chips; DDR4 @ 3200MHz, dual-rank DIMM, 6 channels.]
Fig. 6: Peak Random Embedding Access Band-
width for common memory systems based
on memory timing parameters, assuming auto-
precharge. Data transfer frequency shown; device
clock is half for DDR4 & HBM, one-eighth for
GDDR6. DDR4 memory systems have far lower
performance than HBM for embedding lookups.
[Figure 7: chart of total capacity (GB, 0 to 1400) for three memory systems: DDR4 6-channel with 3D TSV DIMM 256GB; HBM2E, 4 stacks; GDDR6, 12 chips.]
Fig. 7: Total capacity of common memory systems.
TABLE IX: Peak Compute capability of various chips.
Chip                               FLOPS capability
CPU: Intel Xeon Platinum 8180      4.1 TFLOPS FP32 [6]
GPU: NVIDIA A100                   19.5 TFLOPS FP32, 312 TFLOPS FP16/32 [21]
FPGA: Achronix Speedster AC7t1500  3.84 TFLOPS FP24 @ 750MHz [17], [18]
TABLE X: Uses for on-chip buffer memory.
Item                                Buffering memory
Dense weights for data parallelism  Model-dependent, replicated across each chip
Embeddings for one mini-batch       Dependent on embedding table sizes, mini-batch size,
                                    and (temporal) locality across mini-batch
Working buffers for data transfers  Used to overlap processing and transfers for input
                                    features and embedding lookups
Data during training                Activations, gradients, optimizer state [44] [52]
4) On-Chip Buffering: Buffering memory serves several
purposes as shown in Table X. While some of these are
straightforward to estimate—such as the number of weights
in the dense layers of a model—others are harder to quantify,
such as the number of unique embedding values across one
or multiple mini-batches. Analyses by Facebook [19] indicate
hit rates of 40% to 60% with a 64MB cache.
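The gap between the worst-case and the locality-aware buffer size for "embeddings for one mini-batch" can be illustrated on an index trace. The sketch below is illustrative only: the function name, the trace layout (one list of row indices per table) and the toy trace are assumptions, not part of the performance model described in this paper. It counts unique rows per table, since a row that is looked up repeatedly only needs to be buffered once.

```python
from typing import Sequence


def minibatch_embedding_bytes(
    indices_per_table: Sequence[Sequence[int]], embedding_bytes: int
) -> tuple[int, int]:
    """Return (worst_case_bytes, unique_bytes) for one mini-batch.

    indices_per_table[t] holds every row index looked up in table t
    across the whole mini-batch (hypothetical trace layout).
    """
    total_lookups = sum(len(rows) for rows in indices_per_table)
    unique_rows = sum(len(set(rows)) for rows in indices_per_table)
    return total_lookups * embedding_bytes, unique_rows * embedding_bytes


# Toy trace (made up): 2 tables, 64B embeddings; table 0 reuses row 7.
toy_trace = [[7, 7, 7, 12], [3, 4, 5, 6]]
worst_case, unique = minibatch_embedding_bytes(toy_trace, embedding_bytes=64)
print(f"worst case: {worst_case}B, with reuse: {unique}B")
```

With zero locality the two numbers coincide, which is the most demanding case for on-chip buffering.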
E. Improving existing AI HW accelerators for RecSys
TABLE XI: Improving existing AI HW accelerators for RecSys.
Chip              Potential Changes
NVIDIA A100       Increase on-chip memory for buffering; add DDR memory support;
                  enable sixth HBM stack; support deeper stacks; reduce compute
                  capability to fit die area budget
Graphcore GC200   Add external high-speed DRAM support
Cerebras CS1      Change from mesh to fully connected topology; add external
                  high-speed DRAM support
All of the above  Increase I/O bandwidth; add HW support for CC; add dual
                  external DRAM support (Sec. VII-A)
Table XI illustrates potential changes to enhance the perfor-
mance of existing AI HW accelerators on RecSys workloads.
V. PERFORMANCE MODEL
A. Representative DLRM Models
Our choice of representative model is Facebook’s DLRM-
RM2 [30]. In order to reveal limitations of various HW
platforms, two model configurations are analyzed. These are
positioned at the low and high end of batch size and embedding
entry size; specifically, 200 and 600 as the batch size
points⁸ and 64B and 256B as the embedding size points⁹. We
refer to the 200 batch size/64B embedding size combination
as “Small batch/Small embeddings”, while the 600 batch
size/256B combination is “Large batch/Large embeddings”.
Similarly, the two extremes of table distribution are an-
alyzed. “Unsharded” refers to each table being able to fit
within the memory attached to an AI accelerator, such that
only pooled embeddings need be exchanged. “Sharded” refers
to “full sharding” (see section IV-A) where each table is fully
distributed across the attached memory of every accelerator—a
worst-case scenario. The reality will likely fall between these
two extremes.
Note that the same batch sizes are used for both our infer-
ence and training performance models and that all parameters
are stored in FP16 for both inference and training. As the
size of the tables for the production model is not publicly
available, we assume that the model is large enough to occupy
the memory of all chips in the system. This is a reasonable
assumption based on other recommendation systems [7].
⁸ Roughly 200 is the median query size; 600 is significantly farther out in
the query size distribution [30].
⁹ Facebook has noted that embedding sizes in bytes are typically 64B-256B
[19].
TABLE XII: DLRM-RM2 configurations.
Parameter                   Value(s)
Number of embedding tables  40
Lookups per table           80
Embedding size              32 FP16 = 64B (small), 128 FP16 = 256B (large)
Number of dense features    256
Bottom MLP                  256-128-32-Embedding dimension
Top MLP                     Interactions-512-128-1
Feature Interactions        Dot products, exclude diagonal
FLOPS/Inference             ∼1.40 MFLOPs (small), ∼2 MFLOPs (large)
Batch size                  200 (small), 600 (large)
B. Overview of Performance Model
We have developed a performance model that computes the
time, memory usage and communications overhead for each
of the steps detailed in Sec. IV-C. Our model takes as input
a specific DLRM configuration as described in Sec. V-A as
well as various parameters that describe batch size, embedding
table sharding, processing engine capabilities and configura-
tion, memory system configuration, memory device timing
parameters from vendor datasheets, communication network
latencies and bandwidths, and system parameters that control
the level of overlap and concurrency within the HW. In order to
maximally stress the HW, our model assumes zero (temporal)
locality within the embedding access stream.
Results are reported in Sec. VI for the stated configura-
tions, assuming the HW and system exploit maximal overlap
within a batch for inference by grouping memory accesses
and overlapping memory activity with communications where
possible, and for training by pipelining the collective commu-
nications with backpropagation computations and parameter
updates. For embedding parameter updates during training, the
originally looked up embeddings are buffered on-chip, thereby
only requiring a write to update them instead of a read-modify-
write. In particular, sufficient on-chip buffering to support the
above capabilities is assumed—doing so over-estimates the
performance capabilities of most current AI accelerator HW.
VI. EVALUATION OF SYSTEM PERFORMANCE
A. Systems Parameters
We focus our performance evaluation on homogeneous
systems, where one type of processor provides the bulk of
compute and communications capability.
Table XIII shows the range of parameters that we consider
for our reference homogeneous system. In terms of CC per-
formance, these ranges are, for the most part, significantly
in excess of what is supported by state of the art AI HW
accelerators. This is consistent with our goal of showing the
benefit of further optimizing these parameters. In particular,
the CC latency range is lower than measured numbers for
NVIDIA’s DGX-2 system (Sec. VII-B). The range of per-chip
bandwidth spans popular training accelerators as well as about
3× beyond NVLink3.
TABLE XIII: Ranges of Key Parameters for Homogeneous Systems.
Parameter           Range
CC all-to-all       Latency of 0.5µs to 10µs, Bandwidth 100 to 1000GB/s
CC all-reduce       Latency of 0.5µs to 10µs, Bandwidth 100 to 1000GB/s
Bandwidth per chip  100 to 1000GB/s aggregated across all links
#chips/system       8
Compute capability  200 TFLOPS FP16
Memory system       6 stacks of HBM2E per chip @ 2400 MHz
B. Inference
Fig. 8 shows upper bounds on achievable system throughput
for inference. For the large batch, large embeddings case,
unsharded, throughput is primarily limited by memory ac-
cesses for embeddings such that interconnect is of secondary
importance. Such models would run well on existing systems
including scale-out topologies with limited bandwidth.
Latency, as opposed to bandwidth, matters most for small
batch/embedding workloads, as detailed in Fig. 9. For the
unsharded case this sensitivity applies at high bandwidth as
well as at low bandwidth, with throughput dropping by almost
5× as latency increases. This is not surprising since the typical
all-to-all message sizes for this configuration are 320KB
of indices per processor and 64KB for pooled embeddings,
small messages that would typically fall within the latency-
dominated regime of CC.
This indicates that when batch sizes are relatively
small and tables are allocated so that each fits within the
memory attached to a single processor, it is more important to
design systems to minimize latency than to push per-chip
bandwidth. Scale-out architectures, with multiple interconnect
hops and long physical distances, make this difficult. On the
other hand, “scale-in” system design can help reduce latency.
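The latency-dominated regime can be illustrated with the simple latency-plus-bandwidth cost model that is commonly used for message transfers (see [8]). This is a sketch, not the performance model used in this paper: it assumes a single message whose transfer time is latency + size / bandwidth, ignores topology, overlap and the number of communication steps, and uses decimal units (1KB = 1000B). The function names are illustrative.

```python
def transfer_time_us(message_bytes: float, latency_us: float, bandwidth_gb_s: float) -> float:
    """Time for one message under t = latency + size / bandwidth (assumed model)."""
    bytes_per_us = bandwidth_gb_s * 1e9 / 1e6  # GB/s -> bytes per microsecond
    return latency_us + message_bytes / bytes_per_us


def latency_share(message_bytes: float, latency_us: float, bandwidth_gb_s: float) -> float:
    """Fraction of the transfer time that is pure latency."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_gb_s)


KB = 1000  # decimal kilobytes (assumption)
pooled_msg = 64 * KB  # pooled-embedding message size quoted in the text

# Sweep the corners of the latency and bandwidth ranges of Table XIII.
for latency_us in (0.5, 10.0):
    for bandwidth in (100.0, 1000.0):
        share = latency_share(pooled_msg, latency_us, bandwidth)
        print(f"latency={latency_us}us bw={bandwidth:.0f}GB/s latency share={share:.0%}")
```

Under these assumptions a 64KB message spends only 0.064µs on the wire at 1000GB/s, so even a 0.5µs latency accounts for most of the transfer time.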
The sharded, small batch/embeddings case is sensitive to
both latency and bandwidth since the exchange of unpooled
embeddings results in an all-to-all message size of ∼5.2MB
per processor. However, even in this situation, there is limited
benefit in improving bandwidth unless such improvement is
also accompanied by reduced latency.
The unsharded, large batch/embedding case is very slightly
sensitive to latency and largely insensitive to bandwidth.
This is because the communication volume, compared to the
unsharded small batch/embeddings case, increases by 3× for
indices and by 12× for embeddings; the resulting message
sizes are still typically within the latency-dominated region of
CC. However, the number of bytes of memory lookups for
embedding tables increases by 12× such that memory lookup
time becomes the dominant term.
The sharded, large batch/embeddings case depends on both
latency and bandwidth due to the larger message sizes from
exchanging unpooled embeddings.
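The message sizes quoted in this section can be approximately reproduced from Table XII. The sketch below assumes 8 chips (Table XIII), an even split of traffic across chips, 4-byte lookup indices, and decimal units (1KB = 1000B). The index width and the even split are assumptions chosen because they match the quoted 320KB and 64KB figures; they are not values stated in the text.

```python
from dataclasses import dataclass

NUM_TABLES = 40          # Table XII
LOOKUPS_PER_TABLE = 80   # Table XII
NUM_CHIPS = 8            # Table XIII
INDEX_BYTES = 4          # assumption: 32-bit lookup indices


@dataclass(frozen=True)
class Config:
    name: str
    batch_size: int
    embedding_bytes: int


def per_chip_message_bytes(cfg: Config) -> dict[str, float]:
    """Per-chip all-to-all payloads for one batch, split evenly across chips."""
    lookups = cfg.batch_size * NUM_TABLES * LOOKUPS_PER_TABLE
    return {
        "indices": lookups * INDEX_BYTES / NUM_CHIPS,
        # Unsharded: one pooled vector per (sample, table) is exchanged.
        "pooled": cfg.batch_size * NUM_TABLES * cfg.embedding_bytes / NUM_CHIPS,
        # Fully sharded: every looked-up (unpooled) vector is exchanged.
        "unpooled": lookups * cfg.embedding_bytes / NUM_CHIPS,
    }


small = per_chip_message_bytes(Config("small", 200, 64))
large = per_chip_message_bytes(Config("large", 600, 256))

for name, sizes in (("small", small), ("large", large)):
    print(name, {k: f"{v / 1e3:.0f}KB" for k, v in sizes.items()})
print("index growth:", large["indices"] / small["indices"])
print("pooled growth:", large["pooled"] / small["pooled"])
```

Under these assumptions the sharded small batch/embeddings case comes out at about 5.1MB per chip, close to the ∼5.2MB quoted above, and the large configuration carries 3× the index traffic and 12× the embedding traffic of the small one.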
[Figure 8: four plots of inference QPS vs. CC latency (µs, 0.5 to 8.0) and bandwidth (GB/s, 100 to 1000). Panels: (a) Small batch, Small embeddings, Unsharded (QPS axis ticks 75K to 175K); (b) Small batch, Small embeddings, Sharded (50K to 150K); (c) Large batch, Large embeddings, Unsharded (23K); (d) Large batch, Large embeddings, Sharded (5K to 15K).]
Fig. 8: Inference performance upper bounds as a function of Collective Communications latency and bandwidth for small/large
batches, with and without Sharding. See text for analysis.
In all cases, higher bandwidth minimizes the throughput
impact of sharding. As shown in Fig. 10, increasing band-
width from 100GB/s to 1000GB/s reduces the impact of full
sharding from about 3.1× loss down to 1.2× for the small
batch/embeddings case. Similarly significant gains are seen
for the large batch/embeddings case with sharding.
C. Training
Fig. 11 shows upper bounds on achievable system through-
put for training. Similarly to inference, we note that the case
of unsharded large batches/embeddings is primarily memory-
bound, hence does not depend much on CC latency or band-
width.
For the other configurations, bandwidth and latency are of
more balanced importance for training than for inference.
For the small batch/embeddings case,
bandwidth matters to training QPS as detailed in Fig. 12. High
bandwidth cuts dense all-reduce times since the message sizes
involved are ∼2.4MB per processor; the impact of bandwidth
is, as expected, most felt at latencies under 4µs.
As with inference, higher bandwidth helps mitigate the
throughput penalty of sharding. In the large batch/embeddings
scenario, overall throughput increases almost proportionally
with bandwidth as shown in Fig. 13. The all-to-all exchange
of unpooled embeddings between processors dominates, with
message sizes of ∼60MB per processor.
VII. RECSPEED: AN OPTIMIZED SYSTEM ARCHITECTURE
FOR RECSYS
This section describes the features of RecSpeed, a hypothet-
ical system architecture for RecSys workloads. The objectives
of RecSpeed are to:
• Maximize throughput for inference and training of large
RecSys models;
[Figure 9: QPS (0K to 200K) and memory utilization (0% to 100%) vs. CC latency (0.5 to 8.5µs) for Small batch/embeddings, Unsharded. Panels: (a) With 100GB/s Bandwidth per chip; (b) With 1,000GB/s Bandwidth per chip.]
Fig. 9: Impact of latency on QPS for small batch size, small
embedding vectors, Unsharded. Latency matters at both high
and low bandwidth, and there is benefit in driving CC latency
down to typical network switch port to port levels.
[Figure 10: QPS (0K to 50K) vs. bandwidth per chip (100 to 1000GB/s) at Latency=10µs for Small batch/embeddings, with two series: QPS Unshard and QPS Shard.]
Fig. 10: High Bandwidth helps minimize the
performance loss from sharding, even at high
latency, due to the all-to-all exchange of unpooled
embedding entries between processors.
• Support future, ever-larger RecSys models;
• Allow implementation using existing process technologies
and fit into common datacenter power envelopes;
• Support existing SW and HW interfaces for datacenter
AI server racks.
A. RecSpeed Architecture Features
Fig. 14 shows a sketch of the proposed chip and interconnect
structure for a 6-chip RecSpeed system. Key features of the
architecture are as follows:
• Fixed-topology quadratic point-to-point interconnect
without any form of switching to minimize latency.
• HW support for Collective Communications to minimize
synchronization and SW-induced latency.
• Fast HBM memory attached to each chip, as many stacks
as practical; as of the writing of this paper, NVIDIA’s
A100 has room for 6 stacks of HBM (of which only 5 are
used), each 4-deep; however, 8-deep stacks are available.
• Slow bulk memory, such as DDR4.
• Optimized packaging and system design to allow “scale-
in”, packing as many RecSpeed chips as possible in close
physical proximity.
High-density physical packaging of chips is particularly
important in order to achieve maximum throughput; when
interconnect moves from intra-board to the system level, the
energy consumed per bit goes up by 25×, bandwidth falls by
over 20×, and overhead increases markedly [29].
Fixed-topology vs. switched all-to-all interconnect: The
proposed interconnect for RecSpeed could reduce latency
compared to the all-to-all switched interconnect found in
NVIDIA’s DGX-2 [9] or Habana’s HLS-1 training system [25].
Specifically, the presence of a switch introduces several hun-
dred nanoseconds of additional latency [45] [46]. A quadratic
interconnect offers performance gains of 2.3× for large CC
all-to-all messages on an 8-node system, compared to a ring
interconnect, and for smaller message sizes the gain can range
as high as 15× [10].
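To make the scaling of the fixed topology concrete, the sketch below enumerates the links of a fully connected point-to-point network. It assumes that "quadratic" refers to one direct link between every pair of chips; the function names are illustrative and not taken from the paper.

```python
from itertools import combinations


def fully_connected_links(num_chips: int) -> list[tuple[int, int]]:
    """One direct link per unordered pair of chips (assumed topology)."""
    return list(combinations(range(num_chips), 2))


def ports_per_chip(num_chips: int) -> int:
    """Each chip needs one port to every other chip."""
    return num_chips - 1


for n in (6, 8, 16):
    links = fully_connected_links(n)
    print(f"{n} chips: {len(links)} links, {ports_per_chip(n)} ports per chip")
```

The link count grows as n(n-1)/2, which is the cost of making every pair of chips a single switchless hop apart.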
HW support for CC: We note that the proposed intercon-
nect structure can efficiently support both CC all-to-all and all-
reduce operations with minimal latency and bandwidth usage
that matches the theoretical lower bound [8].
Implementing High-Bandwidth Links: The proposed
high-bandwidth RecSpeed links can be implemented via ex-
isting technology, amounting to ∼31% of the bandwidth of a
Tomahawk 4 switch chip [26].
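The ∼31% figure follows from dividing the per-chip bandwidth by the switch bandwidth. The short check below takes 1000GB/s per chip as the RecSpeed target and 3,200GB/s for Tomahawk 4 as cited from [26]. The per-link number additionally assumes that a chip's bandwidth is split evenly over direct links to all other chips, which is an assumption of this sketch rather than a stated design parameter.

```python
RECSPEED_CHIP_BW_GB_S = 1000.0   # per-chip bandwidth assumed for RecSpeed
TOMAHAWK4_BW_GB_S = 3200.0       # switch bandwidth cited from [26]


def bandwidth_share(chip_bw: float, switch_bw: float) -> float:
    """Per-chip bandwidth as a fraction of the switch chip's bandwidth."""
    return chip_bw / switch_bw


def per_link_bandwidth(chip_bw: float, num_chips: int) -> float:
    """Assumes the chip's bandwidth is split evenly over links to all other chips."""
    return chip_bw / (num_chips - 1)


share = bandwidth_share(RECSPEED_CHIP_BW_GB_S, TOMAHAWK4_BW_GB_S)
print(f"share of Tomahawk 4 bandwidth: {share:.4f}")
print(f"per link, 16 chips: {per_link_bandwidth(RECSPEED_CHIP_BW_GB_S, 16):.1f} GB/s")
```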
Hybrid Memory Support: Certain embedding tables and
vectors are accessed less often than others [7]. It is therefore
useful to provide a two-level memory system, with large bulk
memory in the form of DDR4 combined with fast HBM
memory. Tables can be allocated to one memory or the other
statically—sharded or not—or the faster HBM can be used as
a cache. Static allocation is preferred, as dealing with such a
large cache structure where the smaller memory has latency
comparable to that of the larger memory may not offer much
benefit, based on prior efforts such as Intel’s Knights Landing
HPC architecture [47]. With the configuration shown, up to
[Figure 11: four plots of training QPS vs. CC latency (µs, 0.5 to 8.0) and bandwidth (GB/s, 100 to 1000). Panels: (a) Small batch, Small embeddings, Unsharded (QPS axis ticks 30K to 70K); (b) Small batch, Small embeddings, Sharded (20K to 60K); (c) Large batch, Large embeddings, Unsharded (8K to 12K); (d) Large batch, Large embeddings, Sharded (4K to 8K).]
Fig. 11: Training performance upper bounds as a function of Collective Communications latency and bandwidth for small/large
batches, with and without Sharding. See text for analysis.
5.5TB of memory can be provided for a RecSpeed system,
of which 27% would be fast HBM. Baidu reports that their
AIBox [7] system is able to effectively hide storage access
time, despite the two orders of magnitude latency difference and
the greater than one order of magnitude bandwidth difference
between SSD and main memory. Since the performance gap
between DDR and HBM memory is smaller, it is reasonable to
assume that careful system design can enable a hybrid memory
system to run closer to the performance of a full HBM system.
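The capacity figures above can be checked from the per-chip numbers in Table XIV (16 chips, each with 96GB of HBM2E and one 256GB DDR4 DIMM). The sketch uses binary terabytes (1TB = 1024GB), an assumption that matches the 1.5TB + 4TB split listed in that table.

```python
NUM_CHIPS = 16          # Table XIV
HBM_GB_PER_CHIP = 96    # Table XIV: 6 stacks of HBM2E
DDR4_GB_PER_CHIP = 256  # Table XIV: one 3D TSV DIMM
GB_PER_TB = 1024        # assumption: binary terabytes

hbm_tb = NUM_CHIPS * HBM_GB_PER_CHIP / GB_PER_TB
ddr_tb = NUM_CHIPS * DDR4_GB_PER_CHIP / GB_PER_TB
total_tb = hbm_tb + ddr_tb
hbm_fraction = hbm_tb / total_tb

print(f"HBM: {hbm_tb}TB, DDR4: {ddr_tb}TB, total: {total_tb}TB")
print(f"HBM fraction: {hbm_fraction:.1%}")
```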
B. RecSpeed Performance vs. DGX-2
Table XIV shows the system characteristics that we use for
computing the performance upper bound for RecSpeed, and
Table XV for the DGX-2. Note that we assume a “modified”
V100 chip with more on-chip buffering memory than the
actual V100.
TABLE XIV: Numbers used for RecSpeed performance upper bounds.
Parameter           Value
CC all-to-all       Latency of 1000ns, bandwidth of 1000GB/s
CC all-reduce       Latency of 1000ns, bandwidth of 1000GB/s
HBM Memory          HBM2E @ 3000MHz, 6 stacks, 96GB
DDR4 Memory         1-channel, up to 256GB 3D TSV DIMM
#chips/system       16
Compute capability  200 TFLOPS FP16
Memory Size         1.5TB HBM2E + 4TB DDR4
The resulting throughput numbers and comparison versus
NVIDIA’s DGX-2 estimated upper bounds are shown in Ta-
ble XVI for inference and Table XVII for training. In our
[Figure 12: two plots for training at Latency=2µs, Small batch/embeddings, Unsharded. (a) QPS (0K to 60K) vs. bandwidth per chip (100 to 1000GB/s). (b) Relative duration of training phases: % of time for FWD, ALLREDUCE and SPARSE UPDT at bandwidths per chip of 100, 200, 500 and 1000GB/s. Note: FWD = Forward Pass, ALLREDUCE = Dense all-reduce and dense update, SPARSE UPDT = Sparse all-to-all or all-gather and sparse update.]
Fig. 12: Impact of Bandwidth on Training, small batch size,
small embeddings, Unsharded. In contrast to inference, bandwidth
helps directly for training due to the all-reduce of large
message sizes from averaging gradients across processors.
TABLE XV: Numbers used for NVIDIA DGX-2 performance upper bounds.
Parameter                     Value
CC all-to-all                 Derived from CC all-gather [12]
CC all-reduce, CC all-gather  Data from [12]
Memory System                 HBM2, 4 stacks @ 2300MHz
Bandwidth per chip            150GB/s
#chips/system                 16
Compute capability            125 TFLOPS FP16
On-chip memory                Assumed sufficient; not the case in practice
model, the DGX-2 is largely bound by its high CC latency,
which can likely be reduced via software optimization.
Limitations: In this section, we do not discuss trade-offs
and issues—important as they may be—relating to die size,
power, and thermal design.
CONCLUSION
This paper reviews the features of a representative Deep
Learning Recommender System and describes hardware ar-
chitectures that are used to deploy this and similar workloads.
The performance of this representative Deep Learning Rec-
[Figure 13: two plots for training at Latency=2µs, Large batch/embeddings, Sharded. (a) QPS (0K to 9K) vs. bandwidth per chip (100 to 1000GB/s). (b) Relative duration of training phases: % of time for FWD, ALLREDUCE and SPARSE UPDT at bandwidths per chip of 100, 200, 500 and 1000GB/s. Note: FWD = Forward Pass, ALLREDUCE = Dense all-reduce and dense update, SPARSE UPDT = Sparse all-to-all or all-gather and sparse update.]
Fig. 13: Impact of Bandwidth on Training, large batch size,
large embeddings, Sharded. In this situation, the all-to-all
exchange of unpooled embeddings between processors dominates.
Increasing bandwidth reduces the time spent in this
all-to-all, thus speeding up the forward pass.
TABLE XVI: RecSpeed Inference Upper Bounds.
Config.      QPS   Mem. Util.  DGX-2 QPS  DGX-2 Mem. Util.  Potential Speedup
Sm. Unshard  300K  67%         4.9K       1.8%              62×
Sm. Shard    207K  47%         4.5K       1.6%              46×
Lg. Unshard  56K   93%         4.7K       15%               12×
Lg. Shard    30K   50%         2.1K       7%                14×
TABLE XVII: RecSpeed Training Upper Bounds.
Config.      QPS  Allred. %Time  DGX-2 QPS  DGX-2 Allred. %Time  Potential Speedup
Sm. Unshard  99K  33%            2.2K       31%                  45×
Sm. Shard    83K  28%            2.1K       30%                  39×
Lg. Unshard  25K  9%             2K         28%                  12×
Lg. Shard    16K  6%             1.2K       18%                  13×
All-reduce refers to dense reduction and update.
[Figure 14: two diagrams. (a) Schematic of chip, with blocks labeled COMPUTE & CC PROCESSORS, HBM STACKS, DDR MEM, SERDES and I/O FABRIC. (b) System interconnect example with 6 chips, each chip labeled RS with attached HBM and DDR.]
Fig. 14: Proposed RecSpeed Architecture.
ommender is investigated with respect to its sensitivity to
hardware system design parameters for training and inference.
We identify the latency of collective communications oper-
ations as a crucial, yet overlooked, bottleneck that can limit
recommender system throughput on many platforms. We also
identify per-chip communication bandwidth, on-chip buffering
and memory system lookup rates as further factors.
We show that a novel architecture could achieve substan-
tial throughput gains for inference and training on recom-
mender system workloads by improving these factors be-
yond state of the art via the use of “scale-in” to pack
processing chips in close physical proximity, a two-level high-
performance memory system combining HBM2E and DDR4,
and a quadratic point-to-point fixed topology interconnect.
Specifically, achieving CC latencies of 1µs and chip-to-chip
bandwidth of 1000GB/s would offer the potential to boost
recommender system throughput by 12–62× for inference
and 12–45× for training compared to upper bounds for
NVIDIA’s DGX-2 large-scale AI platform, while minimizing
the performance impact of “sharding” embedding tables.
ACKNOWLEDGMENT
The authors would like to thank Prof. Kurt Keutzer and
Dr. Amir Gholami for their support and their constructive
feedback.
REFERENCES
[1] Gupta, Udit, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Bran-
don Reagen, David Brooks, Bradford Cottel et al. “The architectural
implications of facebook’s DNN-based personalized recommendation.”
In 2020 IEEE International Symposium on High Performance Computer
Architecture (HPCA), pp. 488-501. IEEE, 2020. See https://arxiv.org/
abs/1906.03109.
[2] “Zion: Facebook Next-Generation Large Memory Training Platform”,
Misha Smelyanskiy, Hot Chips 31, August 19, 2019. See https:
//www.hotchips.org/hc31/HC31 1.10 MethodologyAndMLSystem-
Facebook-rev-d.pdf.
[3] “OCP Accelerator Module Design Specification v1.0”, Whitney Zhao et
al.
[4] “Open Accelerator Infrastructure Universal Baseboard Design Specifi-
cation v0.42”
[5] https://www.samsung.com/semiconductor/dram/module/
M386ABG40M51-CAE/
[6] Kalamkar, Dhiraj, Evangelos Georganas, Sudarshan Srinivasan, Jianping
Chen, Mikhail Shiryaev, and Alexander Heinecke. “Optimizing Deep
Learning Recommender Systems Training On CPU Cluster Architec-
tures.” arXiv preprint arXiv:2005.04680 (2020). See https://arxiv.org/
abs/2005.04680.
[7] Zhao, Weijie, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, and
Ping Li. “AIBox: CTR prediction model training on a single node.” In
Proceedings of the 28th ACM International Conference on Information
and Knowledge Management, pp. 319-328. 2019. See http://research.
baidu.com/Public/uploads/5e18a1017a7a0.pdf.
[8] Chan, Ernie, Marcel Heimlich, Avi Purkayastha, and Robert Van De
Geijn. “Collective communication: theory, practice, and experience.”
Concurrency and Computation: Practice and Experience 19, no. 13
(2007): 1749-1783.
[9] “NVSwitch Accelerates NVIDIA DGX-2” https://developer.nvidia.com/
blog/nvswitch-accelerates-nvidia-dgx2/
[10] Naumov, Maxim, John Kim, Dheevatsa Mudigere, Srinivas Sridharan,
Xiaodong Wang, Whitney Zhao, Serhat Yilmaz et al. “Deep Learning
Training in Facebook Data Centers: Design of Scale-up and Scale-out
Systems.” arXiv preprint arXiv:2003.09518 (2020). See https://arxiv.org/
abs/2003.09518.
[11] Micron Technology Datasheet for MT40A2G4 2Gx4 DDR4 SDRAM.
[12] Li, Ang, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan
R. Tallent, and Kevin J. Barker. “Evaluating modern GPU intercon-
nect: Pcie, nvlink, nv-sli, nvswitch and gpudirect.” IEEE Transactions
on Parallel and Distributed Systems 31, no. 1 (2019): 94-110. See
https://arxiv.org/abs/1903.04611.
[13] Jia, Zhe, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza.
“Dissecting the graphcore ipu architecture via microbenchmarking.”
arXiv preprint arXiv:1912.03413 (2019). See https://arxiv.org/abs/1912.
03413.
[14] https://www.servethehome.com/hands-on-with-a-graphcore-c2-ipu-
pcie-card-at-dell-tech-world/.
[15] https://www.cerebras.net/product/.
[16] https://www.graphcore.ai/posts/introducing-second-generation-ipu-
systems-for-ai-at-scale.
[17] Achronix Machine Learning Processor, https://www.achronix.com/
machine-learning-processor.
[18] Achronix Speedster7t FPGA Datasheet (DS015), https:
//www.achronix.com/sites/default/files/docs/Speedster7t FPGA
Datasheet DS015 2.pdf.
[19] Ke, Liu, Udit Gupta, Benjamin Youngjae Cho, David Brooks, Vikas
Chandra, Utku Diril, Amin Firoozshahian et al. “Recnmp: Accelerating
personalized recommendation with near-memory processing.” In 2020
ACM/IEEE 47th Annual International Symposium on Computer Archi-
tecture (ISCA), pp. 790-803. IEEE, 2020.
[20] Kwon, Youngeun, Yunjae Lee, and Minsoo Rhu. “Tensordimm: A
practical near-memory processing architecture for embeddings and ten-
sor operations in deep learning.” In Proceedings of the 52nd Annual
IEEE/ACM International Symposium on Microarchitecture, pp. 740-753.
2019.
[21] NVIDIA A100 Specifications, https://www.nvidia.com/en-us/data-
center/a100/.
[22] https://ark.intel.com/content/www/us/en/ark/products/120496/intel-
xeon-platinum-8180-processor-38-5m-cache-2-50-ghz.html.
[23] https://www.intel.com/content/www/us/en/products/docs/processors/
xeon/2nd-gen-xeon-scalable-processors-brief.html.
[24] https://www.microway.com/knowledge-center-articles/performance-
characteristics-of-common-transports-buses/.
[25] Eitan Medina, Hot Chips 2019 presentation, Habana.
[26] https://investors.broadcom.com/news-releases/news-release-details/
broadcom-ships-tomahawk-4-industrys-highest-bandwidth-ethernet.
[27] “The true Processing In Memory accelerator”, Fabrice Devaux, Hot
Chips 31, August 19, 2019. See https://www.hotchips.org/hc31/HC31
1.4 UPMEM.FabriceDevaux.v2 1.pdf.
[28] http://monitorinsider.com/HBM.html.
[29] Arunkumar, Akhil, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic,
Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and
David Nellans. “MCM-GPU: Multi-chip-module GPUs for continued
performance scalability.” ACM SIGARCH Computer Architecture News
45, no. 2 (2017): 320-332.
[30] Gupta, Udit, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon
Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, and Carole-
Jean Wu. “DeepRecSys: A System for Optimizing End-To-End At-scale
Neural Recommendation Inference.” arXiv preprint arXiv:2001.02772
(2020). See https://arxiv.org/abs/2001.02772
[31] Park, Jongsoo, Maxim Naumov, Protonu Basu, Summer Deng, Aravind
Kalaiah, Daya Khudia, James Law et al. “Deep learning inference in
facebook data centers: Characterization, performance optimizations and
hardware implications.” arXiv preprint arXiv:1811.09886 (2018). See
https://arxiv.org/abs/1811.09886
[32] Hazelwood, Kim, Sarah Bird, David Brooks, Soumith Chintala, Utku
Diril, Dmytro Dzhulgakov, Mohamed Fawzy et al. “Applied machine
learning at facebook: A datacenter infrastructure perspective.” In 2018
IEEE International Symposium on High Performance Computer Archi-
tecture (HPCA), pp. 620-629. IEEE, 2018.
[33] He, Xinran, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin
Shi et al. “Practical lessons from predicting clicks on ads at facebook.”
In Proceedings of the Eighth International Workshop on Data Mining
for Online Advertising, pp. 1-9. 2014.
[34] Naumov, Maxim, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu
Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang et al.
“Deep learning recommendation model for personalization and rec-
ommendation systems.” arXiv preprint arXiv:1906.00091 (2019). See
https://arxiv.org/abs/1906.00091.
[35] Li, Cheng, Yue Lu, Qiaozhu Mei, Dong Wang, and Sandeep Pandey.
“Click-through prediction for advertising in twitter timeline.” In Pro-
ceedings of the 21th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 1959-1968. 2015.
[36] Rendle, Steffen. “Factorization machines.” In 2010 IEEE International
Conference on Data Mining, pp. 995-1000. IEEE, 2010.
[37] Jain, Paras, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter
Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. “Checkmate:
Breaking the memory wall with optimal tensor rematerialization.” arXiv
preprint arXiv:1910.02653 (2019). See https://arxiv.org/abs/1910.02653.
[38] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua
Bengio. “Empirical evaluation of gated recurrent neural networks on
sequence modeling.” arXiv preprint arXiv:1412.3555 (2014).
[39] Graepel, Thore, Joaquin Quinonero Candela, Thomas Borchert, and Ralf
Herbrich. “Web-scale bayesian click-through rate prediction for sponsored
search advertising in Microsoft’s Bing search engine.” Omnipress,
2010.
[40] Zhao, Weijie, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding,
Mingming Sun, and Ping Li. “Distributed Hierarchical GPU Parameter
Server for Massive Scale Deep Learning Ads Systems.” arXiv preprint
arXiv:2003.05622 (2020). See https://arxiv.org/abs/2003.05622.
[41] Zhou, Guorui, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou,
Xiaoqiang Zhu, and Kun Gai. “Deep interest evolution network for click-
through rate prediction.” In Proceedings of the AAAI conference on
artificial intelligence, vol. 33, pp. 5941-5948. 2019.
[42] https://www.facebook.com/business/help/430291176997542
[43] Zhao, Zhe, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn
Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi,
and Ed Chi. “Recommending what video to watch next: a multitask
ranking system.” In Proceedings of the 13th ACM Conference on
Recommender Systems, pp. 43-51. 2019.
[44] Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong
He. “Zero: Memory optimization towards training a trillion parameter
models.” arXiv preprint arXiv:1910.02054 (2019). https://arxiv.org/abs/
1910.02054.
[45] https://innovium.com/blog-post/low-latency-teralynx-switch-delivers-
best-data-center-application-performance/.
[46] https://www.nextplatform.com/2019/06/11/the-games-a-foot-intel-
finally-gets-serious-about-ethernet-switching/.
[47] Peng, Ivy Bo, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin
Laure, and Stefano Markidis. “Exploring the performance benefit of hy-
brid memory system on HPC environments.” In 2017 IEEE International
Parallel and Distributed Processing Symposium Workshops (IPDPSW),
pp. 683-692. IEEE, 2017.
[48] https://www.systemverilog.io/understanding-ddr4-timing-parameters.
[49] Kim, Yoongu, Weikun Yang, and Onur Mutlu. “Ramulator: A fast and
extensible DRAM simulator.” IEEE Computer architecture letters 15,
no. 1 (2015): 45-49.
[50] https://www.hpl.hp.com/research/cacti/.
[51] Varian, Hal R. “Online Ad auctions.” American Economic Review 99,
no. 2 (2009): 430-34.
[52] Anil, Rohan, Vineet Gupta, Tomer Koren, and Yoram Singer. “Memory
Efficient Adaptive Optimization.” In Advances in Neural Information
Processing Systems, pp. 9749-9758. 2019.
[53] “Food Discovery with Uber Eats: Recommending for the Marketplace”,
https://eng.uber.com/uber-eats-recommending-marketplace/.
[54] “Home Embeddings for Similar Home Recommendations”, https://www.
zillow.com/tech/embedding-similar-home-recommendation/.
[55] “MIND News Recommendation Competition”, https://msnews.github.io/
competition.html.
[56] https://aws.amazon.com/personalize/
[57] “Amazon Personalize Create personalized user experiences faster”, https:
//aws.amazon.com/personalize/.
[58] https://developers.google.com/machine-learning/recommendation.
[59] Gomez-Uribe, Carlos A., and Neil Hunt. “The netflix recommender
system: Algorithms, business value, and innovation.” ACM Transactions
on Management Information Systems (TMIS) 6, no. 4 (2015): 1-19.
[60] “Use Cases of Recommendation Systems in Business Current Appli-
cations and Methods”, https://emerj.com/ai-sector-overviews/use-cases-
recommendation-systems/
[61] Hwang, Ranggi, Taehun Kim, Youngeun Kwon, and Minsoo Rhu.
“Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Per-
sonalized Recommendations.” arXiv preprint arXiv:2005.05968 (2020).
See https://arxiv.org/abs/2005.05968.
[62] “Announcing NVIDIA Merlin: An Application Framework for Deep
Recommender Systems”, https://developer.nvidia.com/blog/announcing-
nvidia-merlin-application-framework-for-deep-recommender-systems/
[63] Hadash, Guy, Oren Sar Shalom, and Rita Osadchy. “Rank and rate:
multi-task learning for recommender systems.” In Proceedings of the
12th ACM Conference on Recommender Systems, pp. 451-454. 2018.
[64] Guo, Lin, Hui Ye, Wenbo Su, Henhuan Liu, Kai Sun, and Hang Xiang.
“Visualizing and understanding deep neural networks in ctr prediction.”
arXiv preprint arXiv:1806.08541 (2018).
[65] Zhou, Guorui, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao
Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. “Deep interest
network for click-through rate prediction.” In Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 1059-1068. 2018.
[66] “FPGA-based computing in the Era of Artificial Intelligence and Big
Data”, Talk by Eriko Nurvitadhi, Intel Labs, ISPD 2019, http://ispd.cc/
slides/2019/6 FPGASpecial Eriko.pdf
[67] Nurvitadhi, Eriko, Jeffrey Cook, Asit Mishra, Debbie Marr, Kevin
Nealis, Philip Colangelo, Andrew Ling et al. “In-package domain-
specific asics for intel stratix 10 fpgas: A case study of accelerating deep
learning using tensortile asic.” In 2018 28th International Conference on
Field Programmable Logic and Applications (FPL), pp. 106-1064. IEEE,
2018.
[68] “AI Chip (ICs and IPs)”, Shang Tang, https://github.com/basicmi/AI-
Chip
[69] Li, Zhengjie, Yufan Zhang, Jian Wang, and Jinmei Lai. “A survey of
FPGA design for AI era.” Journal of Semiconductors 41, no. 2 (2020):
021402.
[70] Fowers, Jeremy, Kalin Ovtcharov, Michael Papamichael, Todd Massen-
gill, Ming Liu, Daniel Lo, Shlomi Alkalay et al. “A configurable cloud-
scale DNN processor for real-time AI.” In 2018 ACM/IEEE 45th Annual
International Symposium on Computer Architecture (ISCA), pp. 1-14.
IEEE, 2018.
[71] “DNN Accelerator Architectures”, Joel Emer, Vivienne Sze, Yu-Hsin
Chen, ISCA Tutorial (2019), http://www.rle.mit.edu/eems/wp-content/
uploads/2019/06/Tutorial-on-DNN-06-RS-Dataflow-and-NoC.pdf
[72] Reuther, Albert, Peter Michaleas, Michael Jones, Vijay Gadepally,
Siddharth Samsi, and Jeremy Kepner. “Survey and benchmarking of
machine learning accelerators.” arXiv preprint arXiv:1908.11348 (2019).
See https://arxiv.org/abs/1908.11348.
[73] Chen, Yiran, Yuan Xie, Linghao Song, Fan Chen, and Tianqi Tang.
“A Survey of Accelerator Architectures for Deep Neural Networks.”
Engineering 6, no. 3 (2020): 264-274.
[74] Jouppi, Norman P., Doe Hyun Yoon, George Kurian, Sheng Li, Nishant
Patil, James Laudon, Cliff Young, and David Patterson. “A domain-
specific supercomputer for training deep neural networks.” Communi-
cations of the ACM 63, no. 7 (2020): 67-78.
[75] https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/.
[76] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
Center/nvidia-dgx-a100-datasheet.pdf.
[77] “SEAL: An Accelerator for Recommender Systems”, https://myrtle.ai/
seal/
[78] “250-M2D Xilinx Kintex UltraScale+ FPGA on M.2 Accelerator Mod-
ule”, https://www.bittware.com/fpga/250-m2d/
[79] “Glacier Point V2 Design Specification v0.1”, http:
//files.opencompute.org/oc/public.php?service=files&t=
c2b97804f70d4a54808e22b0b4d93b07&download
APPENDIX
STEPS AND OPERATORS FOR DLRM
This appendix presents a high-level algorithm for training
a DLRM model in a distributed fashion across n identical
processors. Algorithm 1 covers the forward pass, which is
used in both inference and training; for training, activations
are assumed to be checkpointed as needed and optimally [37]
to facilitate backpropagation. Algorithm 2 covers the backward
pass and weight update. Most operations are performed
concurrently across all processors, since the inference query
or training batch is split across all n processors. For
simplicity, the concatenation of the bottom MLP output with
the pairwise dot products (after duplicate removal, diagonal
removal, and vectorization) is assumed to be performed as part
of the FeatureInteractions operation.
For training, the optimizer used is vanilla SGD. AdaGrad
is reported to achieve slightly better results for DLRM on the
Criteo Ad Kaggle dataset [34]; however, we use vanilla SGD
in our steps and performance model for consistency with the
DLRM repo. The pipelining of collective communications
(i.e., all-reduce operations) with the backpropagation
computations and parameter updates during the dense backward
pass is not shown; in practice, this is feasible as long as the
all-reduce latency is acceptable.
The notations used are shown in Table XVIII.
Other operations
This section introduces additional operations (other than
collective communications operations) mentioned in algo-
rithms 1 and 2.
FC: The forward pass of the FC layer.
FeatureInteractions: The forward pass of the dot-product
feature interactions layer, excluding the diagonal entries of
the feature interactions matrix.
Concat: The concatenation operation of the batched bottom
MLP output and batched pooled embedding vectors along a
new dimension as mentioned in Sec. III-D.
FCBackward: This operator takes in the gradient of the
loss with respect to the output of a given FC layer, uses the
checkpointed input activations to the layer, and returns both
the gradient of the loss with respect to the weights of the layer
and the gradient of the loss with respect to the input to the
layer. The latter is then consumed by the next FCBackward
operator.
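As a minimal sketch of the FC and FCBackward operators described above, the following pure-Python code assumes a bias-free linear layer y = xW with no activation function. The names fc_forward and fc_backward and the list-of-rows matrix representation are illustrative assumptions and are not taken from the DLRM repo.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def fc_forward(x, w):
    """Forward pass: x is (batch x in), w is (in x out)."""
    return matmul(x, w)


def fc_backward(grad_out, x, w):
    """Backward pass for y = x @ w (assumed bias-free, no activation).

    grad_out: gradient of the loss w.r.t. the layer output (batch x out)
    x:        checkpointed input activations (batch x in)
    Returns (grad_w, grad_x): gradients w.r.t. the weights and the input.
    """
    grad_w = matmul(transpose(x), grad_out)  # shape (in x out)
    grad_x = matmul(grad_out, transpose(w))  # shape (batch x in)
    return grad_w, grad_x


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
y = fc_forward(x, w)
grad_w, grad_x = fc_backward([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], x, w)
```

In this sketch, grad_x is what would be handed to the FCBackward operator of the preceding layer, while grad_w is what the weight update would consume.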
FeatureInteractionsBackward: This operator takes in the
gradient of the loss with respect to the output of the feature
interactions layer and returns the gradient of the loss with
respect to each of the batched inputs to the feature interactions
layer, which are the output of the bottom MLP as well as
the pooled embeddings resulting from the embedding lookups
in the model. Note that there are no weights in the feature
interactions layer so this is sufficient for FeatureInteractions-
Backward.
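The FeatureInteractions and FeatureInteractionsBackward operators can be sketched for a single sample as follows. This is a hedged illustration rather than the DLRM implementation: the function names, the pair ordering produced by itertools.combinations, and the omission of the batch dimension are all assumptions made for brevity.

```python
from itertools import combinations


def feature_interactions(vectors):
    """vectors[0] is the bottom MLP output; the rest are pooled embeddings.

    Returns the bottom MLP output concatenated with the dot product of
    every distinct pair (i < j), so diagonal and duplicate entries of the
    interaction matrix are skipped.
    """
    dots = [sum(a * b for a, b in zip(vectors[i], vectors[j]))
            for i, j in combinations(range(len(vectors)), 2)]
    return list(vectors[0]) + dots


def feature_interactions_backward(grad_out, vectors):
    """Return the gradient of the loss w.r.t. each input vector."""
    d = len(vectors[0])
    grads = [[0.0] * d for _ in vectors]
    # The first d outputs are a copy of the bottom MLP output.
    grads[0] = list(grad_out[:d])
    pairs = combinations(range(len(vectors)), 2)
    # d(z_i . z_j)/dz_i = z_j and d(z_i . z_j)/dz_j = z_i.
    for g, (i, j) in zip(grad_out[d:], pairs):
        for k in range(d):
            grads[i][k] += g * vectors[j][k]
            grads[j][k] += g * vectors[i][k]
    return grads


z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = feature_interactions(z)
grads = feature_interactions_backward([1.0] * len(out), z)
```

Because the layer has no weights, the backward function returns only input gradients, matching the description above.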
Expand sparse grads: Because of pooling in this model,
all of the gradients on embeddings that are pooled into a single
output are identical. After communicating only the gradients
on the pooled vectors, these values are simply “expanded”, or
copied, to every unpooled vector that was summed to generate
the pooled vector. This operation also averages the gradients
on the pooled vectors across the batch.
Params: Shorthand operator to denote the parameters asso-
ciated with a given FC layer or embedding table.
TABLE XVIII: Notations for DLRM steps.
Symbol  Usage         Meaning
n_d     Constant      Number of dense features in the model
n_c     Constant      Number of sparse features or embedding tables
c_i     Constant      Cardinality of the ith categorical feature
l_i     Constant      Number of lookups performed on the ith embedding table
l_b     Constant      Number of bottom MLP layers
l_t     Constant      Number of top MLP layers
d       Constant      Embedding dimension
O_B     Fwd           Output of bottom MLP up to a given layer
L_i     Fwd           Local embedding lookup indices for ith table after indices all-to-all
O_E     Fwd           Local embedding lookup vectors (possibly pooled)
V_i     Fwd           Pooled embedding vectors (after second all-to-all) resulting from lookups for the ith table
F       Fwd           Feature interactions input, denoted as A in the feature interactions layer description of Sec. III-D
O_T     Fwd           Output of top MLP up to a given layer
GT_i    Bckwd/update  Gradient of loss w.r.t. output of ith top MLP FC layer
GB_i    Bckwd/update  Gradient of loss w.r.t. output of ith bottom MLP FC layer
GE_i    Bckwd/update  Pooled gradient (i.e. on processor which doesn’t own relevant embeddings) of loss w.r.t. lookups from ith embedding table
LGE_i   Bckwd/update  Pooled gradient of loss w.r.t. lookups from ith table on processor which owns relevant embeddings after all-to-all/all-gather
FGE_i   Bckwd/update  Unpooled/expanded batch-reduced gradient of loss w.r.t. lookups from ith table on processor which owns relevant embeddings after all-to-all/all-gather
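The expansion of pooled gradients described under Expand sparse grads can be sketched as below. The function name, the dictionary keyed by embedding row, and the interpretation of batch averaging as a division by the batch size are assumptions made for illustration only.

```python
def expand_sparse_grads(pooled_grads, indices):
    """Copy each pooled gradient to every embedding row that was summed.

    pooled_grads[s]: gradient on the pooled vector of sample s
    indices[s]:      embedding rows that were summed for sample s
    Returns {row: gradient}, with contributions averaged over the batch
    (assumed here to mean scaling by 1 / batch size).
    """
    batch = len(pooled_grads)
    row_grads = {}
    for grad, rows in zip(pooled_grads, indices):
        for row in rows:
            acc = row_grads.setdefault(row, [0.0] * len(grad))
            for k, g in enumerate(grad):
                acc[k] += g / batch
    return row_grads


# Two samples, embedding dimension 2; row 2 is looked up by both samples.
pooled_grads = [[2.0, 4.0], [6.0, 8.0]]
indices = [[0, 2], [2, 3]]
row_grads = expand_sparse_grads(pooled_grads, indices)
```

Only rows that were actually looked up receive a gradient, so the resulting update touches a small subset of each embedding table.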
Algorithm 1: DLRM forward pass steps.
Input: Number of processors p; b × n_d batch of dense
    features D; n_c sets of sparse features, each one
    called S_i, each one b × l_i; bottom MLP layers
    FCB_i, i ∈ [1, l_b]; top MLP layers FCT_i,
    i ∈ [1, l_t]; n_c embedding tables denoted as E_i,
    with each table representing a c_i × d matrix.
O_B ← D
for i = 1, ⋯, l_b do
    O_B ← FCB_i(O_B)
end
L_1, L_2, ⋯, L_nc ← all_to_all([S_1, S_2, ⋯, S_nc])
O_E ← []
for j = 1, ⋯, n_c do
    O_Ej ← E_j(L_j)
    if no sharding then
        O_Ej ← pool(O_Ej)
    end
    O_E.append(O_Ej)
end
if full sharding then
    V_1, V_2, ⋯, V_nc ← reduce_scatter(O_E)
end
if no sharding then
    V_1, V_2, ⋯, V_nc ← all_to_all(O_E)
end
F ← concat([O_B, V_1, V_2, ⋯, V_nc])
F′ ← FeatureInteractions(F)
O_T ← F′
for i = 1, ⋯, l_t do
    O_T ← FCT_i(O_T)
end
p ← O_T[:, 0]
Output: Predicted click probabilities vector p
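To make the data movement in Algorithm 1 concrete, the following single-process sketch simulates the two collectives on the embedding path. It is not an NCCL or MPI program: the helper names, the nested-list layout send[src][dst], and the reading of "no sharding" as each embedding table residing wholly on one processor are assumptions made for illustration.

```python
def all_to_all(send):
    """send[src][dst] is the chunk that processor src sends to dst.

    Returns recv with recv[dst][src] == send[src][dst].
    """
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def reduce_scatter(contrib):
    """contrib[src][dst] is src's partial vector destined for dst.

    Each dst receives the element-wise sum over all src.
    """
    n = len(contrib)
    return [[sum(vals)
             for vals in zip(*(contrib[src][dst] for src in range(n)))]
            for dst in range(n)]


def pool(table, rows):
    """Sum-pool the embedding rows looked up for one sample."""
    return [sum(col) for col in zip(*(table[r] for r in rows))]


# Two processors; processor j owns table j and holds sample j of the batch.
tables = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # table on processor 0
    [[10.0, 0.0], [0.0, 10.0]],            # table on processor 1
]
# sparse[src][dst]: indices held by src for the table owned by dst.
sparse = [
    [[0, 2], [1]],   # sample 0
    [[1], [0, 1]],   # sample 1
]
local_indices = all_to_all(sparse)          # indices arrive at table owners
pooled = [[pool(tables[j], rows) for rows in local_indices[j]]
          for j in range(2)]                # lookup and pool on each owner
v = all_to_all(pooled)                      # pooled vectors return per sample

# Full-sharding case: partial sums from every processor are combined.
partial = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
summed = reduce_scatter(partial)
```

After the second all_to_all, v[s] holds one pooled vector per table for the sample on processor s, which is the input that concat and FeatureInteractions expect.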
Algorithm 2: DLRM backward pass and weight update steps.
Input: Number of processors p; b × n_d batch of dense
    features D; n_c sets of sparse features, each one
    called S_i, each one b × l_i; bottom MLP backward
    operators FCBackwardB_i, i ∈ [1, l_b]; top MLP
    backward operators FCBackwardT_i, i ∈ [1, l_t];
    n_c embedding tables denoted as E_i, with each
    table representing a c_i × d matrix; learning rate γ;
    predictions p ∈ (0, 1)^n; labels l ∈ [0, 1]^n; loss
    function
    $L(p, l) = \frac{1}{n} \sum_{i=1}^{n} \left[ l_i \log p_i + (1 - l_i) \log(1 - p_i) \right]$
    with predictions p and labels l.
L_BCE ← L(p, l)
$\nabla_p L \leftarrow \frac{1}{n} \sum_{i=1}^{b} \left( \frac{l_i}{p_i} - \frac{1 - l_i}{1 - p_i} \right)$
GT_{l_t+1} ← ∇_p L
for i = l_t, l_t − 1, ⋯, 1 do
    GT_i, ∇_{FCT_i} L ← FCBackwardT_i(GT_{i+1})
end
GB_{l_b+1}, GE_1, ⋯, GE_nc ←
    FeatureInteractionsBackward(GT_1)
if no sharding then
    LGE_1, ⋯, LGE_nc ←
        all_to_all([GE_1, ⋯, GE_nc])
end
if full sharding then
    LGE_1, ⋯, LGE_nc ←
        all_gather([GE_1, ⋯, GE_nc])
end
FGE_1, ⋯, FGE_nc ←
    expand_sparse_grads([LGE_1, ⋯, LGE_nc])
for i = l_b, l_b − 1, ⋯, 1 do
    GB_i, ∇_{FCB_i} L ← FCBackwardB_i(GB_{i+1})
end
GB_1, GB_2, ⋯, GB_lb, GT_1, GT_2, ⋯, GT_lt ←
    (1/p) all_reduce([GB_1, GB_2, ⋯, GB_lb,
    GT_1, GT_2, ⋯, GT_lt])
for i = 1, ⋯, l_b do
    Params(FCB_i) ← Params(FCB_i) − γ ∇_{FCB_i} L
end
for i = 1, ⋯, l_t do
    Params(FCT_i) ← Params(FCT_i) − γ ∇_{FCT_i} L
end
for i = 1, ⋯, n_c do
    Params(E_i) ← Params(E_i) − γ FGE_i
end
Output: Current loss L_BCE
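The averaging all-reduce and the vanilla SGD update at the end of Algorithm 2 can be illustrated with a single-process simulation. The sketch below assumes that each processor holds an identical replica of a dense layer's weights, flattened to a list, and that the quantity being averaged is that layer's weight gradient; the helper names all_reduce_mean and sgd_step are hypothetical.

```python
def all_reduce_mean(per_proc_grads):
    """Average one flattened gradient across simulated processors.

    Every processor ends up with the same averaged copy, mirroring
    the (1/p) * all_reduce step.
    """
    p = len(per_proc_grads)
    mean = [sum(vals) / p for vals in zip(*per_proc_grads)]
    return [list(mean) for _ in range(p)]


def sgd_step(params, grad, lr):
    """Vanilla SGD: params <- params - lr * grad."""
    return [w - lr * g for w, g in zip(params, grad)]


# Two processors hold identical replicas of one layer's weights but
# compute different local gradients from their slices of the batch.
replicas = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
local_grads = [[2.0, 0.0, -2.0], [4.0, 2.0, 0.0]]
averaged = all_reduce_mean(local_grads)
replicas = [sgd_step(w, g, lr=0.5) for w, g in zip(replicas, averaged)]
```

Because every replica applies the same averaged gradient, the dense parameters stay synchronized across processors without any further communication.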
