Dagger: Towards Efficient RPCs in Cloud Microservices with Near-Memory
  Reconfigurable NICs by Lazarev, Nikita et al.
Nikita Lazarev, Neil Adit, Shaojie Xiang, Zhiru Zhang, and Christina Delimitrou
Abstract— Cloud applications are increasingly relying on hundreds of loosely-coupled microservices to complete user requests that meet
an application’s end-to-end QoS requirements. Communication time between services accounts for a large fraction of the end-to-end
latency and can introduce performance unpredictability and QoS violations. In this work, we present our early work on Dagger, a hardware
acceleration platform for networking, designed specifically with the unique qualities of microservices in mind. The architecture relies on an
FPGA-based NIC, closely coupled with the processor over a configurable memory interconnect, designed to offload and accelerate RPC
stacks. Unlike the traditional cloud systems with PCIe links as the de facto NIC I/O interface, which lack the efficiency needed for
fine-grained microservices and add non-negligible overheads, we leverage memory-interconnected FPGAs as networking devices to make
I/Os more efficient, transparent and programmable. We show that this considerably improves CPU utilization, performance, and RPC
scalability in cloud microservices.
Index Terms—Microservices, programmable NICs, RPC, memory interconnects, FPGA, near-memory processing.
1 Introduction
The past few years have seen a major shift in the way cloud applications are designed, from traditional monolithic architectures
to microservices. Large applications are split into many loosely-coupled
and single-purpose components communicating with each other over
the network. While microservices have several advantages, including
improved development and deployment cycles, and error isolation, they
also introduce non-negligible system overheads. Since individual mi-
croservices are typically not overly computationally intensive [10], [24],
this overhead is mostly introduced in the networking stack, with the latter
gaining increased attention both by application and system designers.
Remote Procedure Calls (RPCs) are one of the most common communication
techniques in microservices. A variety of RPC frameworks is available;
however, since these frameworks were not designed with microservices
specifically in mind, they do not address their unique resource
requirements. Compared to traditional monolithic and coarse-grained
distributed applications, microservices exhibit different traffic patterns
and performance requirements. Thus, they demand a fundamental
reconsideration of the design and architecture of networking systems.
As the demand for high-bandwidth and low-latency networking in the
cloud continues to grow, research from both industry and academia has
offered numerous proposals that approach the problem at different levels
of the networking stack. Some of these proposals optimize transport
protocols [3], [12], [17] for low latency networking, while others leverage
hardware-assisted system design solutions, such as user-space network-
ing [5], [13], RDMA [8], [15], reconfigurable FPGA NICs [6], [21], and
multicore smartNICs [7], [16]. While these works demonstrate the strong
potential of hardware/software co-design to improve the performance and
efficiency of cloud networking, these systems are designed for traditional
monolithic applications, and are all based on PCIe-attached NICs, so
they inherit the I/O inefficiencies coming from this de facto CPU-NIC
interconnection method [18]. Moreover, most of the aforementioned
proposals require laborious software engineering to work on a given
NIC, which is especially challenging given that most commercial NIC
implementations are closed source [14].
In this work, we focus on a hardware-accelerated platform for RPCs
in the context of interactive microservices. We observe that today’s PCIe-based
CPU-NIC interconnects lack the efficiency required for microservices
with µs-scale performance and are limited to MMIO+DMA/DDIO
data transferring modes, which introduce significant overheads when
dealing with small RPC requests under strict latency and throughput
requirements. Instead, we propose to leverage memory interconnects
(NUMA) as the interface between CPUs and NIC, where the latter
accommodates the RPC stack and is integrated into the processor’s
memory sub-system. Compared to PCIe interfaces, memory interconnects
provide more efficient, weak memory consistency models with
relaxed message ordering that can (1) dramatically speed up NIC I/O,
(2) reduce CPU-side network queueing, and (3) improve the end-to-end
performance of the whole networking stack. While prior work has shown
the potential of integrating NICs into the processor’s memory sub-system
in simulation [19], we specifically focus on RPC-optimized architectures
for granular services, and on the need for the CPU-NIC interfaces to
be programmable and transparent so that the protocol and consistency
model best fit a given application.
• All authors are with the Department of Electrical and Computer Engineering,
Cornell University, Ithaca, NY, 14853.
E-mail: {nl524, na469, sx233, zhiruz, delimitrou}@cornell.edu
Manuscript received June XXX, 2020; revised August XXX, 2020.
With this in mind, we propose Dagger, a hardware-accelerated
networking fabric which leverages FPGAs that are closely coupled with
CPUs over memory interconnects as a fully reconfigurable networking
device to offload the entire RPC stack onto. Unlike previous
proposals around FPGA-based NICs, which only leverage hardware to
speed up transport layers and/or to perform in-network processing, we
also make the actual NIC I/O interface reconfigurable. This allows
us to implement efficient software-NIC interaction schemes for the
requirements and design principles of the currently-running applications,
which is essential given the diversity and frequent development cadence
of interactive microservices [23].
We characterize the unique properties of microservice traffic using
the Social Network and Media Service applications from the DeathStarBench
suite [10], and prototype our platform on the Intel® Broadwell®
integrated CPU/FPGA architecture. We demonstrate in practice that
offloading networking to a near-memory FPGA increases throughput for
the small requests common in microservices by 3.8-5.7× compared to
both specialized hardware platforms [15] and optimized software
protocols [5], [13]. Our solution yields a single-core RPC goodput of
12.4Mrps, scales up to 40Mrps with 8 cores, and provides
state-of-the-art µs-scale end-to-end latency.
2 Network Characteristics in Microservices
Microservices have distinct network requirements and traffic com-
pared to monolithic applications and traditional distributed systems.
First, every user request in microservices is propagated through a
large graph of tiers, with per-node processing and communication delays
being accumulated into the end-to-end latency. As a result, the Quality
of Service (QoS), which is usually defined in terms of tail latency
under a certain load (Queries per Second, QPS), critically depends on
the performance of every communication channel between each pair of
microservices on the call path. Hence, even a small latency increase in
the networking stack translates to a considerable increase in end-to-end
latency, as shown in Figure 1, which plots the networking and application
(including queueing) fractions of end-to-end latency with respect to load.
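As a rough illustration of this accumulation, the sketch below sums per-tier delays along a serial call path. The path depth and the per-tier timings are hypothetical values chosen for the arithmetic, not measurements from the applications studied here, and real call graphs also contain parallel fan-out that this simple model ignores.

```python
def end_to_end_latency_us(hops: int, app_us: float, net_us: float) -> float:
    """Sum per-tier processing and per-hop networking delay on a serial call path."""
    return hops * (app_us + net_us)


HOPS = 10       # hypothetical call-path depth
APP_US = 50.0   # hypothetical per-tier processing time, in microseconds

base_us = end_to_end_latency_us(HOPS, APP_US, net_us=20.0)
slower_us = end_to_end_latency_us(HOPS, APP_US, net_us=25.0)

# A 5 us increase per hop is multiplied by the number of hops on the path.
extra_us = slower_us - base_us
```

Under these assumed numbers, a per-hop increase of 5 us adds 50 us to the end-to-end latency, because every hop on the path pays the increase.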
Second, even though RPC request and response payloads in typical
datacenter applications are already relatively small, ranging from
hundreds of bytes to a few kilobytes [11], [17], [28], in microservices they
are even smaller, as shown in Figure 2 for the Social Network and
Media Service applications from DeathStarBench [10].
[Figure 1 plot: application and networking time versus load (QPS, 100 to 900); left panel y-axis: median latency (ms), right panel y-axis: 90th-percentile latency (ms).]
Fig. 1: Networking as fraction of end-to-end median (left) and tail (right)
latency.
[Figure 2 plot: % of messages (0 to 100) versus RPC size (0 to 1280 bytes), for RPC requests and RPC responses of SocialNetwork and MediaService.]
Fig. 2: Distribution of RPC sizes across microservices in [10].
As seen from Figure 2, more than 70% of RPC requests are smaller
than 256B and almost all requests are within 1280B. Responses are
even smaller: nearly 100% of messages are less than 256B with 95%
of messages fitting 64B. These tiny messages introduce high pressure
on networking stacks at all levels, and previous work has shown that
commodity networking systems cannot efficiently handle this traffic, due
to high per-packet overheads [5], [13], [17].
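To relate these sizes to a cache-line-granular interconnect, the sketch below counts how many cache lines a payload occupies. The 64-byte line size is an assumption about the target platform, and the helper name is hypothetical; headers and metadata are ignored.

```python
import math

CACHE_LINE_BYTES = 64  # assumed transfer granularity of the interconnect


def cache_lines_needed(rpc_bytes: int) -> int:
    """Number of whole cache lines needed to carry an RPC payload."""
    return max(1, math.ceil(rpc_bytes / CACHE_LINE_BYTES))


# Size points taken from the discussion of Figure 2.
sizes = [64, 256, 1280]
lines = {size: cache_lines_needed(size) for size in sizes}
```

Under this assumption, a response that fits in 64B travels in a single cache line, so any fixed per-transfer cost is paid once per RPC rather than being amortized over a large payload.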
Finally, microservices are by design very diverse in terms of design
patterns and performance requirements [23]. In particular, there is a rich
variety of thread models [25], network queueing architectures [27], and
different strategies of mapping microservices to the available hardware
resources. Performance requirements also vary a lot, with some microser-
vices being latency-critical while others are treated as batch. Commodity
networking systems were initially designed with generality and com-
patibility in mind, in order to fit any request features and, therefore,
do not necessarily provide the most efficient solution for a particular
application class. This has caused programmable networking systems to
become more popular [4], [20]. Such systems allow flexibly adjusting
networking primitives depending on the currently-running applications.
Even so, one part of the networking stack that is always fixed in silicon,
even in programmable network controllers, is the NIC I/O interface.
3 Dagger: A Near-Memory Reconfigurable NIC for Interactive Microservices
In this work, we propose to leverage FPGAs closely coupled with
server-class CPUs as reconfigurable RPC acceleration fabrics, which are
designed with the unique properties and requirements of interactive mi-
croservices in mind. Such FPGA-enabled platforms are already available
in commercial server systems like Intel® Xeon®.
3.1 Motivation for Near-Memory NICs
PCIe is the current de facto standard interface between CPU and NICs
or accelerators. Unfortunately, the PCIe bus has limited functionality:
sending data chunks to the device requires multiple bus transactions
and memory synchronization primitives, which makes per-packet
overheads large [14]. The regular way to send data over PCIe to a
peripheral is by using DMA transfers together with expensive initiation
transactions explicitly issued by the processor as MMIO requests. This
MMIO+DMA combination, along with the mandatory synchronization
instructions, favors strong memory consistency over high
performance for fine-grained data transfers.
The ideal CPU-NIC interconnect for modular interactive services
should avoid any CPU-initiated requests and explicit memory synchro-
nization, and should track data transfers entirely in hardware buffers
without the processor’s intervention. In addition, it can sacrifice strong
consistency for a weaker memory consistency model in favor of perfor-
mance, since interactive services have high request-level parallelism and
are tolerant to message ordering. Most commercially-available memory
interconnects satisfy these requirements: they provide efficient relaxed
memory consistency and implement cache coherency state machines that
can track updates in networking buffers without CPU intervention.
3.2 The Need for Reconfigurable and Transparent NICs
As shown in Section 2, microservices are very diverse in terms of traffic
patterns and performance requirements. Prior work has also established
that specially optimized transports are required to get high through-
put [12], and that such protocols usually do not work well for latency-
critical applications [5], [17], highlighting the need for reconfigurable
transport layers [4].
We argue that the same holds for CPU-NIC interfaces. For example,
the standard MMIO+DMA mode works well when transferring large
packets; however, it performs poorly when delivering a large
number of small requests. The consistency model requirements can also
vary across applications: for requests that fit into the interconnect MTU
and for applications that do not require strict RPC ordering, a weak
memory consistency model will achieve the highest performance. Finally,
software design patterns also differ: some systems provision network
buffers on a per-connection basis [8], [15], while others use shared or
single-queue solutions to handle load balancing and improve scalability
[27]. Whichever queue provisioning scheme is chosen, hardware support
on the NIC is required to run it efficiently [26].
This variability introduces the potential to tailor the acceleration
fabric and CPU-FPGA interface to the requirements of a given application
model. At the same time, it introduces non-trivial complexity,
which users should not have to manage themselves. In light of the
above considerations, we present our early work on a reconfigurable
and transparent-to-the-user hardware acceleration fabric for microservice
RPCs, with a tunable CPU-NIC interface which can be easily altered to
fit the requirements of a given application.
3.3 Dagger Platform Overview
We implement a prototype of our FPGA NIC on the Intel® Broadwell®
integrated CPU/FPGA architecture, with different programmable options
for the CPU-NIC interface over the CCI-P [1] bus, which is based on both
commodity PCIe and coherent memory (UPI) interconnects. We select CCI-P
because of the following four features: (1) Relaxed memory consistency
that can enable much faster transfer of small RPC requests in settings
where message ordering is not critical. (2) The ability to strengthen
consistency models (up to sequential) on-the-fly, when required by the
application. (3) The possibility to monitor invalidation transactions and
use them to initiate data transfers entirely in hardware, instead of the slow
software-issued MMIOs. (4) A flexible choice between PCIe and UPI as the
low-level physical and link layers of the interconnect. We use CCI-P
to communicate ready-to-use RPC objects with the processor’s LLC.
Figure 3 shows the top-level overview of our prototype.
[Figure 3 diagram. Blocks shown: CPU with APP interfaces, LLC, and DRAM; CCI-P over UPI/PCIe; FPGA NIC with CPU-NIC interface, RPC, Transport, PHY, and QSFP; connection configuration and packet capture.]
Fig. 3: Top-level diagram of the RPC acceleration fabric. The triangle
shows the CCI-P protocol stack, and the stacked blocks denote reconfigurable
hardware units.
Our stack includes network transport and physical layers, the RPC
layer, and the CPU-NIC interface. The transport layer implements a ver-
sion of the UDP/IP protocol. The RPC unit maintains connections, keeps
metadata, and does payload (de)serialization. Since transport protocols
and data transformation are not the focus of this work, we drastically
simplify this part in our first prototype. We plan to handle support for
transport reliability and congestion control as part of future work. Note
that deployment of reliable transports with congestion control can be
done using open-source layers for FPGAs such as [4], [22]. However,
we also plan to implement our own RPC-oriented transport on top of
UDP/IP, whose efficiency has been shown in prior work [13].
For the CPU-NIC interface, we have designed a set of hardware state
machines implementing different CCI-P messaging schemes, including
the standard MMIO+DMA/DDIO mode over PCIe for fair comparison.
We have also implemented the corresponding set of software drivers for
these modes. Our design supports both synchronous and asynchronous
RPCs. The interface’s SW/HW data and control diagrams are shown in
Figure 4.
[Figure 4 diagram. Software (CPU) side: TX controller, non-blocking RPC requests, RPC responses, completion queue, and the TX and RX buffers in the LLC. Hardware (NIC) side: TX FSM, RX FSM, TX completion, interface buffers bookkeeping, and load balancer (LBalancer), plus connection set-up and memory configuration. Steps 1 to 8 are marked; the legend distinguishes the reconfigurable part, the RPC data path, and hardware configuration.]
Fig. 4: CPU-NIC interface diagrams for asynchronous RPCs. TX path:
the TX controller writes new RPC requests (2) to a free entry in the TX buffer,
as read from the TX completion ring (1). The TX FSM on the FPGA
fetches RPC objects from the TX buffer using one of the CCI-P messaging
schemes (3); it also does the asynchronous bookkeeping (4) to release
previously-fetched entries. RX path: the hardware RX FSM puts newly
received RPC objects into the RX buffer by using one of the available
CCI-P modes (5), and asynchronously fetches the next free entries via
bookkeeping (6). The RPC payload is then delivered to the completion
queue via AVX-enhanced parallel memcpy (7).
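Steps 1 to 4 of the TX path in the Fig. 4 caption can be modeled in software as a fixed-size buffer plus a ring of free entries. This is a minimal sketch, not the hardware design: the class and method names are hypothetical, the `pending` queue stands in for whatever CCI-P mechanism tells the NIC which entries are new, and bookkeeping is done immediately rather than asynchronously.

```python
from collections import deque


class TxPath:
    """Toy model of the asynchronous TX path: TX buffer plus completion ring."""

    def __init__(self, num_entries: int):
        self.tx_buffer = [None] * num_entries
        self.completion_ring = deque(range(num_entries))  # all entries start free
        self.pending = deque()  # entries written by software, not yet fetched

    def send(self, rpc) -> bool:
        """Software side, steps (1) and (2). Returns False if no entry is free."""
        if not self.completion_ring:
            return False
        slot = self.completion_ring.popleft()  # (1) read a free entry index
        self.tx_buffer[slot] = rpc             # (2) write the RPC request
        self.pending.append(slot)
        return True

    def nic_fetch(self):
        """NIC side, steps (3) and (4). Returns the fetched RPC, or None."""
        if not self.pending:
            return None
        slot = self.pending.popleft()
        rpc = self.tx_buffer[slot]             # (3) fetch the RPC object
        self.tx_buffer[slot] = None
        self.completion_ring.append(slot)      # (4) bookkeeping: release the entry
        return rpc


tx = TxPath(num_entries=2)
accepted = [tx.send("a"), tx.send("b"), tx.send("c")]  # third send finds no free entry
first = tx.nic_fetch()
retry = tx.send("c")  # succeeds once the NIC has released an entry
```

The model shows why the completion ring matters for non-blocking calls: software never overwrites an entry the NIC has not yet fetched, and it learns about freed entries without issuing any request to the device.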
The red part in Figure 4 is the CCI-P interface between the pro-
cessor’s LLC and the NIC together with the corresponding TX and
RX FSMs. This part of the design is reconfigurable and is the main
contribution of the current Dagger design. Different CCI-P messaging
schemes and state machines can be used to fetch/deliver RPC payloads
from/to the processor. The choice of the desired I/O scheme is controlled
via SystemVerilog macros, summarized below.
Dimension A: CCI-P messaging
(1) NIC polls TX buffers from CPU to get updated RPC requests.
This gives low latency and high throughput, but burns CCI-P cycles. TX
buffer updates are explicitly tracked by single-bit “dirty” flags.
(2) NIC polls TX buffers allocated in its local cache in Shared
state and relies on invalidation messages sent by CPU caches when the
application writes new RPC requests. This gives higher latency due to
additional invalidation transactions, but saves CCI-P cycles.
(3) NIC snoops CCI-P bus for Invalidation messages; upon receiving
a snoop, it either initiates a DMA read or polls the buffer. Functionally,
it is similar to MMIO+DMA, but it relies on hardware invalidation
transactions instead of software-initiated MMIO requests.
(4) Uncacheable write from CPU to NIC. The CPU writes data directly
through to the NIC cache. This mode is similar to pure MMIO
writes [9], but is done over our optimized memory interconnect. This
mode potentially gives the best performance of the write path, but
requires additional hardware support.
The CCI-P messaging scheme is chosen depending on the current
networking traffic, application performance and power consumption
requirements, as well as the hardware support availability in the target
platform (our current platform supports the first two schemes). The
messaging scheme is defined by a set of states in the TX and RX
FSMs and the types of CCI-P requests. The same FPGA configuration
can run multiple different messaging schemes. In addition, the CCI-P
interconnect allows choosing the batch size of data transfers, which can
be fine-tuned to achieve the required latency/throughput trade-off.
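Messaging scheme (1) of Dimension A can be sketched as a polling loop over TX entries that carry a dirty flag. The code below is a software model under stated assumptions: the `TxEntry` layout and function names are hypothetical, and the count of entries read per pass is only a stand-in for the CCI-P cycles that polling consumes.

```python
from dataclasses import dataclass


@dataclass
class TxEntry:
    payload: bytes = b""
    dirty: bool = False  # set by software on write, cleared by the NIC on fetch


def cpu_write(buf, index, payload):
    """Software side: place a request in an entry and mark it as updated."""
    buf[index].payload = payload
    buf[index].dirty = True


def nic_poll(buf):
    """One polling pass. Returns (fetched payloads, number of entries read)."""
    fetched = []
    for entry in buf:
        if entry.dirty:
            fetched.append(entry.payload)
            entry.dirty = False
    return fetched, len(buf)


buf = [TxEntry() for _ in range(4)]
cpu_write(buf, 2, b"req")
first_pass = nic_poll(buf)   # finds the new request
second_pass = nic_poll(buf)  # finds nothing, but still reads every entry
```

The second pass returns no payloads yet still reads all four entries, which is the cost the text refers to as burning CCI-P cycles. Scheme (2) avoids these idle reads by waiting for invalidation messages instead.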
Dimension B: RPC threading model
(1) Synchronous: every RPC call blocks the connection until the
response is received from the server.
(2) Asynchronous: connections do not block the calling threads and
might have many outstanding requests.
The RPC threading model choice depends on the application de-
sign [25]. The architecture of the asynchronous model is shown in
Figure 4. The synchronous model is simpler: it does not have completion
queues and rings, all buffers contain only one entry per connection,
and there are no bookkeeping CCI-P transactions. As a result, the
synchronous model simplifies the hardware, and improves propagation
latency and FPGA resource utilization.
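The difference between the two threading models can be sketched with two small connection classes. This is an illustrative model only: the class names and the `depth` parameter are hypothetical, and a rejected call (return value False) stands in for a caller that would block in the synchronous case.

```python
class SyncConnection:
    """Single-entry buffer: a new call is refused until the response arrives."""

    def __init__(self):
        self.outstanding = None

    def call(self, request) -> bool:
        if self.outstanding is not None:
            return False  # the connection is blocked on the previous RPC
        self.outstanding = request
        return True

    def on_response(self):
        done, self.outstanding = self.outstanding, None
        return done


class AsyncConnection:
    """Multi-entry buffer: up to `depth` requests may be in flight at once."""

    def __init__(self, depth: int):
        self.depth = depth
        self.outstanding = []
        self.completion_queue = []

    def call(self, request) -> bool:
        if len(self.outstanding) >= self.depth:
            return False
        self.outstanding.append(request)
        return True

    def on_response(self, request):
        self.outstanding.remove(request)
        self.completion_queue.append(request)  # delivered for later pickup


sync_conn = SyncConnection()
sync_calls = [sync_conn.call("r1"), sync_conn.call("r2")]

async_conn = AsyncConnection(depth=4)
async_calls = [async_conn.call("r1"), async_conn.call("r2")]
async_conn.on_response("r2")  # responses may complete out of order
```

The synchronous class needs no completion queue and no free-entry tracking, which mirrors why that model simplifies the hardware.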
Dimension C: NIC buffer provisioning and load balancing
(1) On connection basis: one tx/rx queue pair per connection.
(2) On CPU core basis: one tx/rx queue pair per CPU core.
(3) Single-buffer provisioning.
NIC hardware varies a lot depending on the buffer provisioning
scheme. For example, Figure 4 shows connection-based provisioning
as the currently-implemented scheme. In this case, all incoming RPC
requests are uniformly distributed to the corresponding connection
buffers and cores in a fair round-robin fashion (LBalancer). Since our
architecture is reconfigurable, any buffer provisioning scheme along with
the desired load balancing can be supported.
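Round-robin distribution of incoming RPCs across connection buffers, as described for the connection-based scheme, can be sketched as follows. The function name is hypothetical and Python lists stand in for the hardware buffers.

```python
from itertools import cycle


def balance_round_robin(rpcs, num_connections):
    """Assign incoming RPCs to connection buffers in round-robin order."""
    buffers = [[] for _ in range(num_connections)]
    targets = cycle(range(num_connections))
    for rpc in rpcs:
        buffers[next(targets)].append(rpc)
    return buffers


# Seven incoming RPCs (labeled 0 to 6) spread over three connection buffers.
buffers = balance_round_robin(list(range(7)), 3)
```

With seven requests and three buffers, the buffer sizes differ by at most one, which is the fairness property a round-robin balancer provides when requests have similar cost.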
4 Preliminary Results
We built Dagger using an Intel Broadwell CPU/FPGA chip (Xeon® E5-2600,
2.3GHz), available in the HARP academic research platform. Due
to hardware limitations of the platform, we present only a subset
of the CCI-P schemes described in Section 3.3. In our experiments, we
run a simple concurrent P2P client-server application sending 64B echo
RPCs.
4.1 Single Request Latency
Figure 5 shows the round trip time of 64B synchronous (blocking) RPC
requests with different CCI-P messaging schemes.
[Figure 5 plot: round trip time (us, 0 to 4) at the median, 90th, and 99th latency percentiles for four configurations: PCIe MMIO + DMA; PCIe, mode #1; UPI, mode #1; UPI, mode #2.]
Fig. 5: Different percentiles of round trip times for synchronous 64B
RPCs across different CCI-P messaging schemes.
First, all examined CCI-P schemes perform noticeably better than the
standard MMIO+DMA that we also implemented in the same system.
The overhead of invalidation messages when the NIC polls the TX buffers
in its local cache (option #2, dimension A in Section 3.3) is only 17%
compared to simple polling. Note that invalidation messages play exactly
the same role as MMIO transactions in traditional DMA-based systems:
they notify the NIC about new data available in TX buffers. However,
they achieve 42% better single-request median latency. Interestingly,
even within the same messaging mode, the memory bus achieves 10-26%
better latency than PCIe, although they are both based on the same
chip. This demonstrates that in addition to supporting relaxed consistency
and a variety of messaging schemes, memory interconnects are
physically faster than PCIe buses, even when both are on-chip. Table 1
compares our median round-trip times with the results presented in three
related papers using either hardware accelerated or software-optimized
networking stacks: IX [5], eRPC [13], and NetDIMM [2]. We also show
the TOR network delays assumed in each work for fair comparison. Our
approach improves the round trip latency of small requests, even when
compared to in-memory integrated NICs (NetDIMM), and outperforms
software-based networking solutions.
Table 1: Round trip times of synchronous RPCs vs. related work.
            IX         eRPC (CX4)   eRPC (CX5)   NetDIMM    Dagger
Objects     64B msgs   32B RPC      32B RPC      64B msgs   64B RPC
TOR delay   N/A        0.3 us       0.22 us      0.1 us     0.1 us
RTT         11.4 us    3.7 us       2.3 us       2.2 us     1.93 us
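The relative differences implied by Table 1 can be computed directly from the RTT row. The sketch below does only that arithmetic; since the systems use different message sizes and assume different TOR delays, the percentages are indicative rather than a like-for-like comparison.

```python
# Median round trip times from Table 1, in microseconds.
rtt_us = {
    "IX": 11.4,
    "eRPC (CX4)": 3.7,
    "eRPC (CX5)": 2.3,
    "NetDIMM": 2.2,
    "Dagger": 1.93,
}


def reduction_pct(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


reductions = {
    name: round(reduction_pct(rtt, rtt_us["Dagger"]), 1)
    for name, rtt in rtt_us.items()
    if name != "Dagger"
}
```

The reductions range from roughly 12% against NetDIMM to roughly 83% against IX.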
4.2 Throughput
As seen from Table 2, Dagger’s memory interconnect outperforms the
DMA+MMIO transfer by 7-9×. We confirm this observation by
running another microbenchmark where we write to a remote memory
location using MMIO compared to UPI. Similarly, writing over UPI
offers 10-14× better throughput. Table 2 also shows that CPU LLC
polling (mode #1, Dimension A) achieves goodput of up to 18.4 Mrps,
which is 3× better than mode #2. Even though the polling mode may
not be practical due to the high bus bandwidth consumption, it shows the
potential of memory interconnects for hardware-accelerated networking.
Moreover, since the CCI-P messaging scheme can be easily configured
on the fly, the NIC can dynamically switch from mode #2, which is more
energy efficient, to mode #1 when the load increases.
Table 2 also shows the RPC receiving rate of Dagger, which reaches
12.4Mrps for B = 4. Since this is lower than the throughput of sending
Dagger noticeably outperforms previous solutions [5], [13], [15]. We also
observe that in our current implementation, the goodput is limited by the
software side of the networking stack, i.e., writing/reading TX and RX
buffers. We plan to further optimize the software stack as part of our
future work.
Table 2: Goodput of asynchronous transfer of 64B RPCs on a single core;
B denotes CCI-P batching.
     Sending rate                            Recv. rate
B    MMIO/DMA     Mode #2     Mode #1
1    0.43 Mrps    4.1 Mrps    11.8 Mrps      8.2 Mrps
2    0.61 Mrps    5.3 Mrps    14.3 Mrps      9.4 Mrps
4    0.83 Mrps    6.2 Mrps    18.4 Mrps      12.4 Mrps
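The speedups quoted in the text can be checked against the rows of Table 2. The sketch below computes the per-batch ratios; the dictionary keys are hypothetical labels for the table columns.

```python
# Single-core goodput from Table 2, in Mrps, keyed by CCI-P batch size B.
goodput_mrps = {
    1: {"mmio_dma": 0.43, "mode2": 4.1, "mode1": 11.8, "recv": 8.2},
    2: {"mmio_dma": 0.61, "mode2": 5.3, "mode1": 14.3, "recv": 9.4},
    4: {"mmio_dma": 0.83, "mode2": 6.2, "mode1": 18.4, "recv": 12.4},
}


def speedup(batch: int, mode: str, baseline: str = "mmio_dma") -> float:
    """Ratio of the sending rate in `mode` to the rate in `baseline`."""
    row = goodput_mrps[batch]
    return row[mode] / row[baseline]


mode2_vs_mmio = {b: round(speedup(b, "mode2"), 1) for b in goodput_mrps}
mode1_vs_mode2 = {b: round(speedup(b, "mode1", "mode2"), 1) for b in goodput_mrps}
```

Mode #2 comes out at roughly 7.5 to 9.5 times the MMIO/DMA rate across batch sizes, and mode #1 at roughly 2.7 to 3 times mode #2, in line with the ranges stated above.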
Figure 6 (Left) shows the latency-goodput curves across loads. Since
we run a simple echo microbenchmark here, the system immediately
blocks the caller thread when goodput is saturated, shown with vertical
dotted lines. The latency remains stable across the entire load range. The
figure also shows how CCI-P batching might affect system performance:
B = 4 allows high goodput; however, it increases the request latency
under low QPS. Since Dagger can reconfigure the batch size on-the-fly,
it can dynamically adjust the CCI-P batching depending on the current
load, as shown by the green dashed line.
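One possible policy for load-dependent batching is to pick the smallest batch size that still leaves headroom at the current load. The sketch below is an assumption-laden illustration, not the policy behind the B = auto curve: the capacity values and the headroom factor are placeholders, and the function name is hypothetical.

```python
# Hypothetical saturation goodput per batch size, in Mrps. These are
# placeholder values for illustration, not measurements.
ASSUMED_CAPACITY_MRPS = {1: 4.0, 2: 6.0, 4: 8.0}


def pick_batch_size(load_mrps: float, headroom: float = 0.8) -> int:
    """Return the smallest batch size whose assumed capacity covers the load.

    Small batches are preferred because they keep latency low at low load;
    larger batches are used only when the load requires their goodput.
    """
    for batch in sorted(ASSUMED_CAPACITY_MRPS):
        if load_mrps <= headroom * ASSUMED_CAPACITY_MRPS[batch]:
            return batch
    return max(ASSUMED_CAPACITY_MRPS)


low_load_batch = pick_batch_size(1.0)
mid_load_batch = pick_batch_size(4.0)
high_load_batch = pick_batch_size(7.0)
```

The design choice is to trade goodput for latency only when needed: at low load the policy stays at B = 1, and it steps up as the offered load approaches each assumed saturation point.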
[Figure 6 plots: (left) median latency (us, 1 to 5) versus goodput (Mrps, 2.5 to 10.0) for B = 1, B = 2, B = 4, and B = auto; (right) goodput (Mrps, axis ticks 0 to 60) versus number of CPU cores (2 to 8) for RPC issue, end-to-end with scheme #1, and end-to-end with scheme #2.]
Fig. 6: (Left) Latency-Goodput curves for single-core asynchronous
round-trip 64B RPCs; B denotes CCI-P batching, dotted lines show the
saturation point. (Right) Multi-core scalability of sending 64B requests.
The purple graph shows RPC issue rate; end-to-end client-server throughput
for the CCI-P messaging scheme #1 (black), and scheme #2 (red).
Figure 6 (right) shows the goodput scalability of our system with
the number of CPU cores. With 8 cores, Dagger achieves an RPC issue
rate of 68Mrps on the client and an end-to-end client-server median
goodput of 40Mrps. This result is 3.8× better than the best RDMA-based
solution, FaSST [15], and 5.7× better than DPDK-based IX [5].
When using a more efficient CCI-P messaging scheme based on invalidation
messages (mode #2, Dimension A in Section 3.3), Dagger still
outperforms previous works with end-to-end median goodput of 25Mrps
on 8 cores.
5 Conclusion
In this paper, we present the implementation of a prototype RPC system
on a closely-coupled near-memory FPGA and show its performance
advantages over existing RPC frameworks built on top of commodity
PCIe-attached NICs, therefore demonstrating in practice why NICs
should be integrated into the processor’s memory sub-system. The
approach significantly increases RPC goodput while showing state-of-
the-art round-trip latency. Our prototype outperforms existing user-space
networking- and RDMA-based solutions based on peripheral networking
devices and, in addition, provides transparent NIC I/O at the hardware
level. Overall, we show that closely-coupled CPUs and FPGAs can
be used as efficient programmable networking devices that drastically
improve networking I/O for fine-grained workloads.
References
[1] “Intel acceleration stack for intel xeon cpu with fpgas core cache interface
(cci-p) reference manual,” accessed May, 2020, https://www.intel.com.
[2] M. Alian and N. S. Kim, “Netdimm: Low-latency near-memory network
interface architecture,” in Proceedings of the 52nd Annual IEEE/ACM
International Symposium on Microarchitecture, 2019.
[3] M. Alizadeh, A. Javanmard et al., “Analysis of dctcp: Stability, convergence,
and fairness,” in Proceedings of the ACM SIGMETRICS Joint International
Conference on Measurement and Modeling of Computer Systems, 2011.
[4] M. T. Arashloo, A. Lavrov et al., “Enabling programmable transport proto-
cols in high-speed nics,” in 17th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 20), 2020.
[5] A. Belay, G. Prekas et al., “IX: A protected dataplane operating system for
high throughput and low latency,” in 11th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 14), 2014.
[6] M. Blott, K. Karras et al., “Achieving 10gbps line-rate key-value stores with
fpgas,” in Presented as part of the 5th USENIX Workshop on Hot Topics in
Cloud Computing, 2013.
[7] A. Caulfield, P. Costa et al., “Beyond smartnics: Towards a fully pro-
grammable cloud: Invited paper,” 2018.
[8] A. Dragojević, D. Narayanan et al., “Farm: Fast remote memory,” in 11th
USENIX Symposium on Networked Systems Design and Implementation
(NSDI 14), 2014.
[9] M. Flajslik and M. Rosenblum, “Network interface design for low latency
request-response protocols,” in 2013 USENIX Annual Technical Conference
(USENIX ATC 13), 2013.
[10] Y. Gan, Y. Zhang et al., “An open-source benchmark suite for microservices
and their hardware-software implications for cloud and edge systems,” in
Proceedings of the Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, 2019.
[11] Q. Huang, K. Birman et al., “An analysis of facebook photo caching,” in
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems
Principles, 2013.
[12] E. Jeong, S. Wood et al., “mtcp: a highly scalable user-level TCP stack
for multicore systems,” in 11th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 14), 2014.
[13] A. Kalia, M. Kaminsky et al., “Datacenter rpcs can be general and fast,” in
16th USENIX Symposium on Networked Systems Design and Implementa-
tion, 2019.
[14] A. Kalia, M. Kaminsky et al., “Design guidelines for high performance
RDMA systems,” in 2016 USENIX Annual Technical Conference (USENIX
ATC 16), 2016.
[15] A. Kalia, M. Kaminsky et al., “Fasst: Fast, scalable and simple distributed
transactions with two-sided (rdma) datagram rpcs,” in Proceedings of the
12th USENIX Conference on Operating Systems Design and Implementation,
2016.
[16] M. Liu, S. Peter et al., “E3: Energy-efficient microservices on smartnic-
accelerated servers,” in Proceedings of the 2019 USENIX Conference on
Usenix Annual Technical Conference, USA, 2019.
[17] B. Montazeri, Y. Li et al., “Homa: A receiver-driven low-latency transport
protocol using network priorities,” in Proceedings of the 2018 Conference of
the ACM Special Interest Group on Data Communication, 2018.
[18] R. Neugebauer, G. Antichi et al., “Understanding pcie performance for end
host networking,” in Proceedings of the 2018 Conference of the ACM Special
Interest Group on Data Communication, 2018.
[19] S. Novakovic, A. Daglis et al., “Scale-out numa,” in Proceedings of the
19th International Conference on Architectural Support for Programming
Languages and Operating Systems, 2014.
[20] P. M. Phothilimthana, M. Liu et al., “Floem: A programming system for
nic-accelerated network applications,” in Proceedings of the 12th USENIX
Conference on Operating Systems Design and Implementation, 2018.
[21] A. Putnam, A. Caulfield et al., “A reconfigurable fabric for accelerating
large-scale datacenter services,” in Proceeding of the 41st Annual International
Symposium on Computer Architecture (ISCA), June 2014.
[22] D. Sidler, Z. István et al., “Low-latency tcp/ip stack for data center applications,”
in 2016 26th International Conference on Field Programmable Logic
and Applications (FPL), 2016.
[23] A. Sriraman, A. Dhanotia et al., “Softsku: Optimizing server architectures
for microservice diversity at scale,” in Proceedings of the 46th International
Symposium on Computer Architecture, 2019.
[24] A. Sriraman and T. F. Wenisch, “µsuite: A benchmark suite for microser-
vices,” 2018.
[25] A. Sriraman and T. F. Wenisch, “µtune: Auto-tuned threading for OLDI
microservices,” in 13th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 18), 2018.
[26] B. Stephens, A. Singhvi et al., “Titan: Fair packet scheduling for commodity
multiqueue nics,” in 2017 USENIX Annual Technical Conference (USENIX
ATC 17), 2017.
[27] M. Sutherland, S. Gupta et al., “The nebula rpc-optimized architecture,”
[Proceedings of ISCA 2020].
[28] Y. Xu, E. Frachtenberg et al., “Characterizing facebook’s memcached work-
load,” IEEE Internet Computing, 2014.
