University of New Mexico

UNM Digital Repository
Computer Science ETDs

Engineering ETDs

Fall 12-13-2020

Intelligent Networks for High Performance Computing
William Whitney Schonbein
University of New Mexico - Main Campus

Follow this and additional works at: https://digitalrepository.unm.edu/cs_etds
Part of the Digital Communications and Networking Commons

Recommended Citation
Schonbein, William Whitney. "Intelligent Networks for High Performance Computing." (2020).
https://digitalrepository.unm.edu/cs_etds/108

This Thesis is brought to you for free and open access by the Engineering ETDs at UNM Digital Repository. It has
been accepted for inclusion in Computer Science ETDs by an authorized administrator of UNM Digital Repository.
For more information, please contact disc@unm.edu.

Whit Schonbein
Candidate

Computer Science
Department

This dissertation is approved, and it is acceptable in quality
and form for publication:
Approved by the Dissertation Committee:

University of New Mexico

Trilce Estrada
Chairperson

Dorian Arnold

Emory University

Ryan E. Grant

University of New Mexico

Emory University

Jinho D. Choi


INTELLIGENT NETWORKS FOR HIGH PERFORMANCE
COMPUTING

by

WHIT SCHONBEIN
B.A., University of Wisconsin, 1994,
M.A., Washington University in St. Louis, 1999,
Ph.D., Washington University in St. Louis, 2002,
M.S., University of New Mexico, 2016

DISSERTATION
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
Computer Science
The University of New Mexico
Albuquerque, New Mexico
December, 2020

ACKNOWLEDGEMENTS
I would like to thank the members of my committee, without whom this work would
not have been possible: Dorian Arnold, Ryan Grant, Trilce Estrada, and Jinho Choi.
My gratitude also extends to my collaborators, Matthew Dosanjh, Scott Levy, and Pepper Marts. Thanks to Matthew Dosanjh for identifying target kernels and calculating
potential application speedups reported in Section 7.1, and to David DeBonis and Ronnie Garduño for providing the profiling data and source code for the study reported in
Section 7.2.


INTELLIGENT NETWORKS FOR HIGH PERFORMANCE
COMPUTING
by
Whit Schonbein
B.A., University of Wisconsin, 1994,
M.A., Washington University in St. Louis, 1999,
Ph.D., Washington University in St. Louis, 2002,
M.S., University of New Mexico, 2016

Doctor of Philosophy, Computer Science
There exists a resurgence of interest in ‘smart’ network interfaces that can operate on
data as it flows through a network. However, while smart capabilities have been expanding, what they can do for high-performance computing (HPC) is not well understood. In
this work, we advance our understanding of the capabilities and contributions of smart
network interfaces to HPC. First, we show current offloaded message demultiplexing can
mitigate (but not eliminate) overheads incurred by multithreaded communication. Second, we demonstrate current offloaded capabilities can be leveraged to provide Turing
complete program execution on the interface. We elaborate with a framework for offloading arbitrary compute kernels to the NIC: In-Network Compute Assistance (INCA).
We show INCA can accelerate host applications by offloading components to the network. Moreover, INCA supports the offloading of autonomous machine learning kernels
for predicting network properties, and by doing so, takes a significant first step towards
realizing intelligent, adaptive networks.


LONG ABSTRACT
The past half-decade has witnessed a resurgence of interest in ‘smart’ network interfaces (‘SmartNICs’), i.e., NICs that not only move data, but also perform potentially
complex work with or on that data. What sets this new wave of research apart from
earlier attempts at diversifying network functionality is its scope. Where once making
the network intelligent meant offloading characteristically network-oriented applications
– packet forwarding, firewalls, segmentation, bits and pieces of specific protocols, collective communications, and so on – researchers are currently broadening the vision of
what a smart network can be expected to do. Perhaps, for example, the network itself
can handle web searches local to the cell phones that issued them, or perform edge detection for guiding autonomous vehicles, or instantiate machine learning algorithms to
dynamically adapt itself to changing workloads.
Networks characteristic of high performance computing (HPC) are no exception to the
trend towards increasing intelligence. Current state-of-the-art HPC network interfaces
are smart in the traditional sense that they offload network applications such as collective
communication or message demultiplexing. However, emerging products and proposals
aim to expand this domain by including more flexible offloading capabilities, e.g., on-NIC CPUs executing user-defined kernels for manipulating data as it passes between the
network and the host.
While on-NIC capabilities have been expanding, what exactly they can do for HPC is
not well understood. For example, to what degree do current smart capabilities address
anticipated future paradigms such as multithreaded communication? Can existing on-NIC capabilities provide additional flexibility for supporting novel offloaded functionality,
or must we appeal to general-purpose compute hardware such as CPUs? Finally, when
such flexibility is secured, what can be done with it to service HPC applications?
In this work, we advance our understanding of the capabilities and contributions of
smart network interfaces to HPC by addressing these questions. First, we show that
current offloaded capabilities – specifically, message demultiplexing – can help mitigate
(but not entirely eliminate) overheads incurred by allowing individual threads to engage
in inter-process communication. Second, we demonstrate that message matching, when
coupled with other standard HPC offloaded capabilities, can enable fully generalized (i.e.,
Turing complete) program execution on the NIC. That is, current offloaded smart capabilities typical of HPC NICs can be leveraged to provide the flexibility characteristic
of today’s SmartNICs, without appeal to hardware such as FPGAs or CPUs. On this
foundation, we build a framework for expressing and offloading arbitrary compute kernels
to the NIC: In-Network Compute Assistance (INCA). INCA is unique in the SmartNIC
landscape because it leverages existing offloaded, task-specific processing elements to
provide Turing complete compute capabilities. Moreover, despite being located on the
data processing pathway, these capabilities are not subject to deadlines imposed by network speeds. We show that INCA affords the possibility of accelerating host application
performance by offloading parts of those applications to the network. Furthermore, we
show that INCA supports the offloading of autonomous machine learning kernels capable
of predicting network features, and by doing so, takes a significant first step towards
realizing intelligent, adaptive networks. In short, once fully general-purpose compute
capabilities are secured, the future of SmartNICs in HPC looks promising.
To summarize, the contributions of this work are:
1. A comprehensive survey of SmartNIC hardware, applications, and general architectures that situates contemporary SmartNICs in the offloading tradition.
2. The design and implementation of two benchmarks for assessing the performance
impact of multithreaded MPI communication under standard halo exchange patterns. We show that the overheads incurred by a full multithreaded halo exchange
involving 9 or 27 nodes can be approximated by a ‘low-cost’ benchmark that emulates the full exchange using only two nodes. In either case, these overheads are
onerous for larger thread counts. Moreover, we use the low-cost benchmark to assess
whether contemporary offloaded message-matching capabilities can mitigate these
overheads, concluding that, while such a smart capability can reduce processing
times, in some conditions it actually exacerbates the overhead.
3. In-Network Compute Assistance (INCA), a novel SmartNIC design that leverages
‘smart’ offloaded capabilities common to HPC – message matching, atomic operations, and triggered operations – to provide arbitrary program execution to assist
host applications or execute in-network applications. On the basis of a formal
model and proof of Turing completeness, we identify a modest set of changes to
current Portals-compliant NICs sufficient to secure Turing completeness.
4. The design of assembly and high-level languages for expressing INCA kernels, a
compiler for transforming code written in the latter to the former, and an assessment – via a simulator interpreting INCA assembly code – of the performance
of INCA executing a selection of HPC-centric kernels under various operating assumptions. We conclude that with the right selection of software and hardware
optimizations, INCA kernel runtimes can, for some kernels, be made comparable to
those of contemporary CPUs. Moreover, all else being equal, INCA runtimes will
only decrease as network speeds increase.
5. An assessment of the potential speedups afforded by INCA for a selection of scientific applications or miniapps executing on the host. We show that, by offloading
parts of host applications, INCA can in principle accelerate some applications by
35% or more. We also consider a second class of applications, those that utilize
the network only infrequently during their runtime, meaning the network is usually
idle. In a preliminary study, we profile one such application to locate functions
that consume a significant portion of the overall runtime, and consider speedups
afforded by offloading a subset of those calls to an INCA-enabled SmartNIC. Results from a simulation study show speedups of up to 1.13× and 1.23×, without
and with additional hardware acceleration, respectively.
6. The first known design and implementation of fully-offloaded machine learning kernels for enabling ‘self-learning’ networks by efficiently and accurately predicting
local network traffic rates. Specifically, we profile the amount of data received by
the NIC during the execution of a selection of scientific applications, proxy applications, and miniapps, and show that machine learning kernels based on static
or rolling linear regression can predict the amount of incoming data with a normalized RMSE ranging from less than 2.5% in the worst case to less
than 0.1% in the best-performing cases. Moreover, INCA kernels for generating
predictions regarding future network traffic execute with runtimes less than 40 µs
in the most computationally demanding scenarios, and in many cases, less than 1
µs, with memory requirements never exceeding 12 KiB.
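As a rough illustration of the rolling linear regression idea named in contribution 6, the following Python sketch fits a least-squares line to the most recent window of traffic samples and extrapolates it to predict a later sample. It is a minimal sketch under stated assumptions, not the kernels evaluated in this work: the function names, the `window` and `horizon` parameters, and the choice to normalize RMSE by the observed range are all hypothetical choices made for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate window (a single x value): fall back to a flat line.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rolling_predictions(samples, window, horizon=1):
    """Predict sample t + horizon from a line fit to the `window` samples ending at t.

    `samples` is assumed to be a list of per-interval traffic measurements
    (for example, bytes received). Returns (target_index, predicted_value) pairs.
    """
    preds = []
    for end in range(window, len(samples) - horizon + 1):
        xs = list(range(end - window, end))
        ys = samples[end - window:end]
        a, b = fit_line(xs, ys)
        target = end - 1 + horizon
        preds.append((target, a + b * target))
    return preds


def normalized_rmse(samples, preds):
    """RMSE of the predictions divided by the observed range (an assumed normalization)."""
    errs = [(samples[t] - p) ** 2 for t, p in preds]
    rmse = math.sqrt(sum(errs) / len(errs))
    span = max(samples) - min(samples)
    return rmse / span if span else 0.0
```

For example, `rolling_predictions(samples, window=3)` on a perfectly linear series reproduces the series exactly, so `normalized_rmse` is zero; real traffic traces would produce nonzero error that varies with the window size.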


TABLE OF CONTENTS

List of Figures
List of Tables
1 Introduction
2 Directions in networking
  2.1 Hardware
    2.1.1 Application-Specific Integrated Circuits
    2.1.2 CPUs
    2.1.3 FPGAs
  2.2 Applications
    2.2.1 Core network applications
    2.2.2 Host applications
    2.2.3 Independent applications
  2.3 Architectures
    2.3.1 On-path
    2.3.2 Off-path
    2.3.3 In-path
  2.4 Discussion: The path forward
3 Message matching
  3.1 Background
  3.2 Related work
  3.3 Analysis
  3.4 Benchmarks and methodology
  3.5 Results
    3.5.1 Resource usage
    3.5.2 Items searched
    3.5.3 Matching overheads
    3.5.4 Discussion
  3.6 Case study: assessing hardware offloaded message matching
    3.6.1 Discussion
  3.7 Conclusion
4 INCA: In-Network Compute Assistance
  4.1 The big picture
  4.2 The triggered operation machine
  4.3 The Turing completeness of TOM
  4.4 Implementing INCA
  4.5 Conclusion
5 The INCA Ecosystem
  5.1 The INCA ecosystem
  5.2 INCA-Q
  5.3 INCA-A
  5.4 Compiler and interpreter
  5.5 Conclusion
6 INCA: Kernel Performance
  6.1 INCAsim
  6.2 Kernels and INCAsim configurations
    6.2.1 Kernels
    6.2.2 INCAsim configurations
  6.3 Performance Evaluation
    6.3.1 Base and scratchpad performance
    6.3.2 Network speeds
    6.3.3 Reducing data movement
    6.3.4 Hardware optimizations
  6.4 Conclusion
7 INCA: Host Applications
  7.1 Offloading near-network functions
  7.2 NIC as co-processor
  7.3 Conclusion
8 Independent Applications for Adaptive Networks
  8.1 Constraints on ML kernels
  8.2 Applications
  8.3 Data collection
  8.4 ML kernels
    8.4.1 Ordinary LR
    8.4.2 Rolling LR
  8.5 Results: Ordinary linear regression
    8.5.1 Method performance
    8.5.2 INCA kernel resources
  8.6 Results: Rolling linear regression
    8.6.1 Method performance
    8.6.2 INCA kernel resources
  8.7 Related work
  8.8 Conclusion
9 Conclusion and future work
Appendices
  Appendix A. The INCA-A language
  Appendix B. The INCA-Q Language
References

LIST OF FIGURES
2.1 Diagram of an on-path NIC architecture.
2.2 Diagram of an off-path NIC architecture.
2.3 Diagram of an in-path NIC architecture.
3.1 MPI message matching queues.
3.2 Singlethreaded and multithreaded nine-point halo exchanges.
3.3 Average core hours for real-world and low-cost benchmarks.
3.4 Median total number of items searched.
3.5 Average number of searches by number of items searched.
3.6 Median queue drain times.
3.7 Average time per search.
3.8 Comparison of message processing time with and without hardware acceleration.
3.9 Median time spent processing messages by message size.
4.1 Portals processing pipeline (data structure view)
4.2 An INCA instruction.
4.3 Message processing flow for the triggered operation machine.
5.1 The INCA ecosystem.
6.1 The extended LogGP model.
6.2 The baseline INCA architecture.
6.3 The scratchpad INCA architecture.
6.4 INCA instruction rates.
6.5 INCA SIMD unit.
8.1 Examples of raw bytes received data.
8.2 Illustration of rolling linear regression.
8.3 Ordinary linear regression results.
8.4 The impact of startup plateaux.
8.5 Error as a function of window size for rolling LR.
8.6 Error for rolling LR methods.

LIST OF TABLES
2.1 Summary of survey of SmartNIC applications.
2.2 Summary of survey of SmartNIC architectures.
3.1 Message counts for multithreaded halo exchange.
4.1 Triggered operation machine notation.
4.2 Summary of the reduction of URMs to TOM.
6.1 INCA kernels and instruction counts.
6.2 Table of INCAsim parameters.
6.3 INCA kernel runtimes.
6.4 Impact of network speed on kernel runtimes.
7.1 Potential impact of INCA on applications.
7.2 FPGA exponential offloaded latencies.
7.3 Potential speedups for applications with idle networks.
8.1 Mean probability of data arrival.
8.2 Best-performing window sizes for each prediction point.
8.3 Selected small and large window sizes.
8.4 INCA runtimes for rolling LR methods.

Chapter 1
Introduction
The past half-decade has witnessed a resurgence of interest in ‘smart’ network interfaces (‘SmartNICs’), i.e., NICs that not only move data, but also perform potentially
complex work with or on that data. What sets this new wave of research apart from
earlier attempts at diversifying network functionality is its scope. Where once making
the network intelligent meant offloading characteristically network-oriented applications
– packet forwarding, firewalls, segmentation, bits and pieces of specific protocols, collective communications, and so on – researchers are currently broadening the vision of
what a ‘smart’ network can be expected to do. Perhaps, for example, the network itself
can handle web searches local to the cell phones that issued them, or perform edge detection for guiding autonomous vehicles, or instantiate machine learning algorithms to
dynamically adapt itself to changing workloads.
Networks characteristic of high performance computing (HPC) are no exception to
the trend towards increasing intelligence. Current state-of-the-art HPC network interfaces are ‘smart’ in the traditional sense that they offload network applications such as
collective communication or message demultiplexing. However, emerging products and
proposals aim to expand this domain by including more flexible offloading capabilities,
e.g., on-NIC CPUs executing user-defined kernels for manipulating data as it passes
between the network and the host.
While on-NIC capabilities have been expanding, what exactly they can do for HPC is
not well understood. For example, to what degree do current ‘smart’ capabilities address
anticipated future paradigms such as multithreaded communication? Can existing on-NIC capabilities provide additional flexibility for supporting novel offloaded functionality,
or must we appeal to general-purpose compute hardware such as CPUs? Finally, when
such flexibility is secured, what can be done with it to service HPC applications?
In this work, we advance our understanding of the capabilities and contributions of
smart network interfaces to HPC by addressing these questions. First, we show that
current offloaded capabilities – specifically, message demultiplexing – can help mitigate
(but not entirely eliminate) overheads incurred by allowing individual threads to engage
in inter-process communication. Second, we demonstrate that message matching, when
coupled with other standard HPC offloaded capabilities, can enable fully generalized
(i.e., Turing complete) program execution on the NIC. That is, current offloaded ‘smart’
capabilities typical of HPC NICs can be leveraged to provide the flexibility characteristic of today’s SmartNICs, without appeal to hardware such as FPGAs or CPUs. On
this foundation, we build a framework for expressing and offloading arbitrary compute
kernels to the NIC: In-Network Compute Assistance (INCA). INCA is unique in the
SmartNIC landscape because it leverages existing offloaded, task-specific processing elements to provide Turing complete compute capabilities. Moreover, despite being located
on the data processing pathway, these capabilities are not subject to deadlines imposed
by network speeds. We show that INCA affords the possibility of accelerating
host application performance by offloading parts of those applications to the network.
Furthermore, we show that INCA supports the offloading of autonomous machine learning kernels capable of predicting network features, and by doing so, takes a significant first
step towards realizing intelligent, adaptive networks. In short, once fully general-purpose
compute capabilities are secured, the future of SmartNICs in HPC looks promising.
We begin, in Chapter 2, by surveying the field of intelligent network interfaces, focusing on three distinct aspects: the hardware used to underwrite NIC intelligence, the
types of applications offloaded to that hardware, and the general architectures these solutions adopt. This survey underwrites the observation made above, that the concept
of a SmartNIC has evolved from one involving NICs offloading task-specific, core network applications (e.g., packet forwarding, filtering, etc.) to one implicating flexible,
general-purpose kernel execution. Moreover, this survey situates our proposed approach
to enabling fully general on-NIC offloading – In-Network Compute Assistance – within
the hardware, application, and architectural landscapes of current and past SmartNICs.
Trends towards larger core counts and faster but lower-capacity memory (e.g., high-bandwidth memory [1]) discourage traditional ‘MPI-everywhere’ approaches, where each
Message Passing Interface (MPI) process instantiates its own set of communication resources. Instead, under a multithreaded strategy, a single set of resources is used, and
multiple threads are permitted to engage in communication with other MPI processes.
In Chapter 3, we design and implement benchmarks for assessing the message processing overheads incurred by multithreaded communication in a standard halo exchange.
Moreover, we deploy one of these benchmarks to address whether contemporary message
matching offloading hardware can help mitigate these overheads, thereby addressing the
broader question of what SmartNICs can (and cannot) do.
In Chapter 4 we describe how In-Network Compute Assistance – INCA – leverages
offloaded message demultiplexing and other existing task-specific network applications to
provide on-NIC general-purpose compute capabilities. To this end, we develop a formal
computational model – the triggered operation machine – and demonstrate it is Turing
complete. On the basis of this model, we articulate a set of modest adjustments to
existing NIC hardware sufficient to enable INCA-style offloading.
Chapter 5 builds on the prior theoretical results to develop an ecosystem for expressing
INCA kernels to be executed on NIC hardware. Based on the formal model, we define
a low-level ‘assembly’ language – INCA-A – for defining INCA kernels, and implement
an interpreter for this language. To assist in developing INCA kernels, we also define a
high-level language – INCA-Q – and a compiler for translating kernels written in INCA-Q
into INCA-A.
The remaining chapters are dedicated to putting INCA to work. In Chapter 6, we
assess the performance of INCA-enabled SmartNICs by developing a simulation model,
identifying a set of representative kernels (matrix multiplication, convolution, etc.), and
interpreting the execution of these kernels within the simulation. In the process, we
consider various software and hardware optimizations, and their impact on INCA kernel
runtimes. In Chapter 7, we identify two possible scenarios where INCA could enable
host application speedups by offloading parts of those host applications, and assess those
speedups. Finally, in Chapter 8, we show that INCA enables adaptive, intelligent networks by offloading machine learning techniques for predicting network traffic entirely to
the NIC.
To summarize, the contributions of this work are:
1. A comprehensive survey of SmartNIC hardware, applications, and general architectures that situates contemporary SmartNICs in the offloading tradition, and
underscores the novelty of In-Network Compute Assistance. Whereas contemporary SmartNICs invariably adopt one of two standard designs – on-path, where
processing elements are placed on the packet processing pathway, and off-path,
where they are located off that pathway – INCA represents a synthesis of the two
by leveraging on-path processing elements to provide compute capabilities that
are nonetheless not constrained by network speed.
2. The design and implementation of two benchmarks for assessing the performance
impact of multithreaded MPI communication under standard halo exchange patterns. We show that the overheads incurred by a full multithreaded halo exchange
involving 9 or 27 nodes can be approximated by a ‘low-cost’ benchmark that emulates the full exchange using only two nodes. In either case, these overheads are
onerous for larger thread counts; e.g., approaching or exceeding 1 ms. Moreover,
we use the low-cost benchmark to assess whether contemporary offloaded message
matching capabilities (as provided by NVIDIA Mellanox ConnectX-5 NICs [2]) can
mitigate these overheads, concluding that, while such a smart capability can reduce
processing times, in some conditions it actually exacerbates the overhead.
3. In-Network Compute Assistance (INCA), a novel SmartNIC design that leverages
‘smart’ offloaded capabilities common to HPC – message matching, atomic operations, and triggered operations – to provide arbitrary program execution to assist
host applications or execute in-network applications. On the basis of a formal
model and proof of Turing completeness, we identify a set of modest changes to
current HPC NICs based on the Portals network programming API [3] sufficient to
secure general-purpose compute capabilities.
4. The design of assembly and high-level languages for expressing INCA kernels, a
compiler for transforming code written in the latter to the former, and an assessment – via a simulator interpreting INCA assembly code – of the performance
of INCA executing a selection of HPC-centric kernels under various operating assumptions. We conclude that with the right selection of software and hardware
optimizations, INCA kernel runtimes can, for some kernels, be made comparable to
those of contemporary CPUs. Moreover, all else being equal, INCA runtimes will
only decrease as network speeds increase [4].
5. An assessment of the potential speedups afforded by INCA for a selection of scientific applications or miniapps executing on the host. We show that, by offloading
parts of host applications, INCA can in principle accelerate some applications by
35% or more. We also consider a second class of applications, those that utilize
the network only infrequently during their runtime, meaning the network is usually
idle. In a preliminary study, we profile one such application to locate functions
that consume a significant portion of the overall runtime, and consider speedups
afforded by offloading a subset of those calls to an INCA-enabled SmartNIC. Results from a simulation study show speedups of up to 1.13× and 1.23×, without
and with additional hardware acceleration, respectively.
6. The first known design and implementation of fully-offloaded machine learning kernels for enabling ‘self-learning’ networks by efficiently and accurately predicting
local network traffic rates. Specifically, we profile the amount of data received by
the NIC during the execution of a selection of scientific applications, proxy applications, and miniapps, and show that machine learning kernels based on static
or rolling linear regression can predict the amount of incoming data with an accuracy ranging from less than 2.5% normalized RMSE in the worst case, to less
than 0.1% in the best-performing cases. Moreover, INCA kernels for generating
predictions regarding future network traffic execute with runtimes less than 40 µs
in the most computationally demanding scenarios, and in many cases, less than 1
µs, with memory requirements never exceeding 12 KiB.

Chapter 2
Directions in networking
The primary purpose of a network interface card (NIC) is to push and receive data
to and from the network. For an ‘ignorant’ NIC – i.e., one that services only these
functions – anything more complicated (such as protocol enforcement, packet processing,
filtering, demultiplexing, etc.) is the responsibility of the host CPU. The past quarter-century has witnessed a series of increasingly complex network interfaces with enhanced
processing capabilities designed to relieve host CPUs of some of these tasks. Originally,
this offloading was modest, including fixed-field packet forwarding, packet filtering, or
simple header manipulations. More recently, this trend in so-called ‘SmartNICs’ (also
sometimes referred to as ‘intelligent NICs’ (iNICs) or ‘data processing units’ (DPUs))
has culminated in NICs with complex processing engines, including fully-programmable
CPUs, FPGAs, or even specialized ‘engines’ dedicated to tasks such as neural network
processing.
The term ‘SmartNIC’ can be traced at least as far back as the mid-1990s, where it was
used to refer to NICs offering basic offloaded functionality such as TCP (de)segmentation,
header matching for purposes of flexible forwarding, error checking, or simple packet
modification [5], [6]. So, as a starting point, let us define a SmartNIC (/iNIC/DPU) as
a network interface or host adapter that offloads functionality traditionally performed by
a host CPU.
This is obviously a broad definition, and during the course of this survey of SmartNICs, we will identify a narrower conception that better captures the contemporary (ca.
2020) understanding of the term. To accomplish this, we view the concept of a SmartNIC
from three different perspectives: (1) hardware (Section 2.1), (2) the tasks, functions, or
services offloaded (Section 2.2), and (3) general architecture (Section 2.3). The results
of this survey motivate and situate the present project both historically and within the
contemporary SmartNIC landscape.
This survey comes with several caveats. First, we do not attempt to catalog every
instance of a network appliance that (self-identified or otherwise) qualifies as a SmartNIC
(cf. [7]). Instead, the present goal is to highlight salient general features of the SmartNIC landscape. Second, SmartNICs are one aspect of a broader collection of research
into ‘smart’ network appliances in general, such as programmable routers, switches, or
middleboxes [8]–[13] and active networks [14]–[21]. Building on these technologies, researchers have proposed offloading tasks such as consensus algorithms [22], [23], collective
communication/data aggregation [24], [25], caches or distributed key-value stores [26]–
[28], policy [29], intrusion detection [30], and network monitoring [31]. With occasional
exceptions, this survey is limited to network interfaces or host adapters, i.e., to discrete
or integrated network appliances residing on network endpoints. Finally, because the
focus of this work is on high-performance computing (HPC), we emphasize research in
that domain.

2.1

Hardware

The first dimension along which SmartNICs vary is the hardware they use to enable
offloading. For purposes of this survey, we distinguish between three broad categories:
application-specific integrated circuits (ASICs), central processing units (CPUs), and
field-programmable gate arrays (FPGAs).

2.1.1

Application-Specific Integrated Circuits

ASICs are specialized circuits designed to carry out specific, ‘hardwired’ tasks, e.g., executing CRC checks on packet headers of a predefined size, incrementing counters, or
performing packet reassembly. Despite being task-specific, ASICs may nonetheless be
‘configurable’ in the sense that within that limited domain of functionality, various parameters may be adjusted. An example of a configurable ASIC that is extremely common
in high-speed networks is a content addressable memory (CAM) or ternary content addressable memory (TCAM). CAMs and TCAMs allow for efficient table lookups, e.g.,
for Layer 2 and Layer 3 forwarding, applying access control lists (ACLs), packet classification, or message demultiplexing in HPC systems [32]–[34]. These units qualify as
ASICs because they service, in hardware, one and only one task: table lookup. They are
‘configurable’ because entries can be added, removed, or modified.
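The ternary lookup described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names (`TernaryTable`, `add`, `remove`, `lookup`) are hypothetical, the first matching entry is assumed to win, and the sequential loop stands in for a comparison that hardware would perform across all entries in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TernaryEntry:
    value: int   # bit pattern to compare against (already masked)
    mask: int    # 1-bits are compared, 0-bits are "don't care"
    result: str  # hypothetical lookup result, e.g., an output port


class TernaryTable:
    """Software model of a TCAM; assumes the first matching entry wins."""

    def __init__(self) -> None:
        self.entries: List[TernaryEntry] = []

    def add(self, value: int, mask: int, result: str) -> None:
        # 'Configurable': entries can be added at run time.
        self.entries.append(TernaryEntry(value & mask, mask, result))

    def remove(self, value: int, mask: int) -> None:
        # 'Configurable': entries can also be removed.
        self.entries = [
            e for e in self.entries
            if not (e.value == (value & mask) and e.mask == mask)
        ]

    def lookup(self, key: int) -> Optional[str]:
        # Hardware compares every entry at once; this loop is sequential.
        for e in self.entries:
            if (key & e.mask) == e.value:
                return e.result
        return None


table = TernaryTable()
table.add(0x0A000100, 0xFFFFFF00, "port2")  # 10.0.1.0/24
table.add(0x0A000000, 0xFF000000, "port1")  # 10.0.0.0/8
print(table.lookup(0x0A000105))  # matches the /24 entry
print(table.lookup(0x0A020304))  # falls through to the /8 entry
```

Ordering the more specific prefix first is one way to obtain longest-prefix behavior from a first-match table; real devices may resolve priority differently.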
Examples of contemporary NICs that may be ASIC-based are those designed to implement the Portals network programming API [3], [35]. The Portals specification identifies
several primitive capabilities – message matching, basic arithmetic and logical operations, etc. – that support offloading network applications. Because the context is HPC
and the high-speed networks supporting HPC applications, it is reasonable to expect
these capabilities will be implemented in hardware (cf. Chapter 4). Netronome Agilio
SmartNICs [36], [37] are sometimes identified as ASIC-based, but for reasons discussed
below, we categorize them as involving CPUs.
In contrast to other types of hardware, ASICs have the advantage of being fast (e.g.,
a CAM lookup takes 5 or fewer nanoseconds [38], [39]) and hence are good candidates
for latency-sensitive tasks requiring processing at line rate. However, because ASICs
must be designed from the ground up to fulfill novel tasks, they can incur significant
development costs and their time-to-market may be slow. Moreover, despite being configurable, ASICs are also relatively inflexible, e.g., in comparison to standard CPUs. This
means they cannot easily be adapted to changes in protocols (perhaps impacting their
time-in-market), and are poor candidates for prototyping.
Motivated by the desire to retain the benefits of ASICs (speed, energy-efficiency),
while enabling additional flexibility, application-specific instruction-set processors (ASIPs)
are defined as ASICs extended to be configurable through software, i.e., ‘programmable’ [40].
By being programmable, the promise is that the same hardware can be re-configured to
serve different roles within the same narrow task domain. While ASIPs may represent a
further degree of configurability beyond that illustrated by content-addressable memory,
because they are still designed to service a specific application, for purposes of this survey
we place ASIPs in the same category as (configurable) ASICs, reserving the notion of
‘programmable’ for hardware that can execute arbitrary programs.

2.1.2

CPUs

Any given SmartNIC is likely to include a general-purpose CPU, if only to service control functions. However, some SmartNICs utilize CPUs as the primary hardware for
supporting offloaded applications. Often, the instruction sets and capabilities of these
CPUs may be extended or tailored to provide network-specific instructions such as CRC
checks, matching bit fields, traversing data structures, or queue management [41]. Insofar
as these ‘network processors’ (NPs, or ‘network processing units’ (NPUs)) have specialized, application-specific aspects, they are comparable to ASICs. However, because they
are based on general-purpose compute hardware configurable through software, these
NICs can provide flexibility lacking in ASIC-based approaches.
In HPC, influential earlier-generation examples of NICs with programmable processors include those used by Meiko CS-2 [42], Myrinet [43], Quadrics [44], [45], and
SeaStar [46], [47] networks. Meiko and Quadrics NICs included versions of the Elan
network interface, which (in the Quadrics incarnation) provided a 32-bit RISC processor
designed to foster the offloading of arbitrary higher-level messaging libraries. Similarly,
Myrinet NICs provided a 133 MHz processor that could be used to offload various network functions [48]. Finally, according to [47], SeaStar featured a 500 MHz processor
programmable through firmware; while the primary purpose of this processor was to
support the offloading of functions identified in the Portals API, its inherent flexibility
offered the possibility of offloading other sorts of network services [49]. In each of these
cases, the flexibility of CPU-based offloading was a motivating design consideration.
More recent examples of SmartNIC designs featuring general-purpose CPUs include
NVIDIA’s Mellanox Bluefield and Bluefield-2 SmartNICs [50], [51], Broadcom’s Stingray
PS225 SmartNIC [52], and Marvell’s LiquidIO II series (using the cnMIPS chip which
is heavily specialized for network processing) [53]. A more recent example of an HPC-oriented SmartNIC design featuring general-purpose CPUs is sPIN (Streaming Processing
in-Network) [54]. On a sPIN NIC, incoming packets are initially processed by a header
handler, which then dispatches each packet to a different CPU core executing user-defined packet handlers. When the work is completed, the (potentially modified) packets
are pushed to host memory using DMA. Again, the emphasis is on flexibility: packet
handlers are written in high-level languages (e.g., C), and compiled to execute on the
NIC.
A final example is Netronome’s Agilio SmartNICs [36], [37]. These NICs embody the
‘match-plus-action’ processing model popularized by software-defined networking [8], [9]:
incoming packets are classified by configurable packet processors and dispatched to programmable cores (e.g., 60 cores with 8 threads/core) for further processing. These cores
are domain specific, targeted towards network processing and security (consequently, they
are sometimes referred to as ‘ASIC-based’ [55]). However, they are also programmable using standard software tools (e.g., C, P4), and the vendor notes these specialized cores
could in principle be replaced with general-purpose RISC cores [56], [57]. For these
reasons, we include them as examples of CPU-based SmartNICs.
As already emphasized, the primary advantage of deploying CPUs for NIC offloading
is flexibility. In contrast to ASICs (configurable or otherwise), this flexibility means
CPU-based SmartNICs can be readily adapted to changing network offloading demands,
and used for prototyping. However, CPUs also tend to be slower than ASICs on similar
tasks (e.g., table lookups [58]).

2.1.3

FPGAs

Field programmable gate arrays (FPGAs), when deployed on NICs, aim towards providing the flexibility of CPUs with the speed of ASICs. An FPGA comprises a set of
millions of logic elements (LEs), each of which has a lookup table and several registers.
Each LE can be configured to compute arbitrary combinatorial logic. The FPGA also
has onboard block RAM (megabytes), as well as a variety of digital signal processors
for performing more complex mathematical operations (numbering in the thousands).
FPGAs are typically an order of magnitude slower than CPUs, but they make up for this
by trading the temporal organization of the CPU for a ‘spatial’ or massively parallel
organization made possible by the LEs and DSPs. FPGAs are thus a popular choice for
the design and implementation of hardware for network offloading.
Underwood et al. [59], [60] propose an intelligent NIC equipped with an FPGA for
performing operations on data as it moves through the NIC. Other examples of FPGA-enabled SmartNICs include the NetFPGA platform for education and research [61], the
various accelerator cards offered by Xilinx [62], [63], NVIDIA’s Mellanox Innova-2 Flex
card [64], and the NICs used to implement Microsoft’s Virtual Filtering Platform [65],
[66]. Also within Microsoft, the Catapult network appliances are FPGAs placed on the
data path between local switch and NIC [67].

2.2

Applications

In addition to hardware, SmartNICs can be distinguished according to the types of tasks
or applications that have been offloaded (or proposed for offloading). These can be
grouped into three categories: core network applications, host applications, and independent applications. As with our discussion of SmartNIC hardware, we do not claim
these categories are mutually exclusive, i.e., there may exist edge cases situated on the
boundaries.

2.2.1

Core network applications

For purposes of this survey, a core network application is broadly defined as one that is
involved in the ingress or egress of data to and from a host application, i.e., in performing
a task that addresses the traditional notion of what a NIC is supposed to do, namely,
deliver data. This class of applications includes tasks at all layers of the OSI model, and
extends to communication middleware such as the Message Passing Interface (MPI) and
OpenSHMEM [68], [69]. However, the majority of offloaded core network applications
target the lower layers, e.g., L2 and L3.
Core network functions include CRC checks, atomic operations (e.g., decrementing TTL, bit shifts, addition), segmentation/desegmentation, traffic shaping and packet
scheduling (QoS), forms of congestion control (e.g., rate adjustment), receive-side scaling
or packet steering, aspects of firewall implementation (e.g., dropping packets), offloading
collective communications such as multicast or reduction, encryption/decryption, message demultiplexing (e.g., MPI message matching), TCP offloading, and other forms of
protocol processing. Since the majority of these functions involve forms of packet classification – i.e., matching on header or payload bit fields for purposes of differentially
processing those packets – we include in the category of core network functions tasks
that naturally cohere with this processing strategy, e.g., applying access control lists or
performing forms of intrusion detection. However, we also acknowledge that some of
these applications may span multiple categories.
Core network functions comprise the bulk of applications offloaded to SmartNICs.
In the remainder of this subsection we highlight several classes of offloaded network
applications.
2.2.1.1

Basic packet processing

Basic packet processing applications involve tasks such as forwarding, filtering, receive-side scaling (RSS) [70], simple header manipulation (e.g., decrementing TTL), or segmentation. The capacity to handle applications such as these is the foundation of the
notion of a SmartNIC. For example, in an early use of the term ‘smart network interface’,
Connery et al. [5] patent a method for performing on-NIC TCP segmentation. The bar
for qualifying as ‘smart’ is not high.
Most if not all contemporary NICs include hardware support for some form of basic
packet processing. Recent work in this vein includes intelligent forwarding of
packets to available cores [71], and support for offloading network virtualization [65], [72].
2.2.1.2

TCP offloading

As the name implies, TCP offload engines (TOEs) are designed to offload TCP functionality. The primary motivation behind TOEs is to free up CPU resources and hence accelerate host applications by avoiding interrupts or the need to calculate checksums, perform
out-of-order packet assembly, or copy data. Despite being an occasionally contentious
subject ([73]–[75]), work on TOEs has continued over multiple decades. For example,
Connery et al. [5] described a method for offloading TCP (de)segmentation, and Ang [76]
explored the potential benefits of offloading TCP congestion control (i.e., windows) as
well as (de)segmentation. Hoskote et al. [77] proposed a chip design for offloading the
ingress side of a TOE, and Freimuth et al. [78] presented a system that offloads the entire
TCP/IP stack. Feng et al. [79] performed a performance analysis of a then-current TOE
that utilizes a combination of TCAM and CPU configurable through firmware, showing
that the TOE provides higher bandwidth and lower latency than the kernel implementation of the stack. The iWARP protocol builds on top of TOEs to bring zero-copy,
kernel-bypass, RDMA to TCP/IP, and Rashti et al. and Grant et al. [80], [81] show how
this can be extended to operate using UDP, and hence avoid the overhead of maintaining
state. Jang et al. [82] propose a hybrid TOE architecture that provides ASIC-based Tx
and Rx engines coupled with programmable processors to provide flexibility to adapt to
changes in the protocol. More recently, Moon et al. [83] explore the benefits of offloading
not the entire TCP/IP stack, but rather ‘peripheral’ TCP functions such as connection
setup and teardown (in addition to the usual checksums and (de)segmentation), showing that doing so makes performance on short-lived connections comparable to that of
longer-lived connections.
2.2.1.3

Security

Another class of core network applications that has received significant attention for
SmartNIC offloading are distributed security applications such as cryptography, IPsec,
firewalls, and intrusion detection systems (IDS). While some of these applications may
involve nothing more than header matching (e.g., basic firewalls), others have more intensive memory or processing requirements, e.g., matching strings in payloads may require
TCP stream reconstruction and more complex searches. For example, Clark et al. [84]
use a network processor ([85]) for initial header-based filtering, TCP desegmentation, and
implementing a decision procedure for determining whether the incoming data should be
subject to further analysis, and an FPGA for implementing the more computationally demanding process of searching payloads for suspect patterns. Bos et al. [86] investigate
a signature detection system that executes without assistance of an FPGA, and Bruijn et
al. [87] demonstrate an offloaded solution that provides deep packet inspection (e.g., for
detecting polymorphic buffer overflows) at gigabit rates. Friedman and Nagle [88] present
a firewall implementation that runs on the local SmartNIC. Burnside and Keromytis [89],
[90] advocate for offloaded cryptography services, focusing on discrete cards but noting
the functionality could be provided by the NIC itself. Chaignon et al. [91] address the
problem of supporting distinct security domains on a single server (e.g., for different
virtual machine instances), proposing a means for offloading nearly arbitrary packet filters to on-NIC CPUs while maintaining fairness under the standard run-to-completion
model. Dimolianis et al. [92] implement a DDoS attack detection method using match-plus-action programs on a contemporary Netronome Agilio SmartNIC; the system is
capable of handling millions of messages per second on 10 Gb/s links.
2.2.1.4

Collective and rendezvous communication

Collective communications (i.e., ‘collectives’) include broadcast, multicast, reduction,
barrier, and other paradigms involving the dissemination or retrieval of data to or from
multiple distributed processes. Rendezvous communication occurs when a sender cannot
assume a receiver has available buffer space to handle the payload, so must initiate a
‘request-to-send’ to the receiver, and wait for a ‘clear-to-send’ response, at which point
the payload proper can be sent. Both forms of communication present opportunities
for overlapping host computation with communication by offloading the latter onto the
NIC. For instance, in a tree-structured broadcast, a process in the middle of the tree is
responsible for passing incoming data to its children; if this process can be autonomously
handled by the NIC, CPU resources are made available to continue servicing the host
application. Likewise, computation can be overlapped with rendezvous communication
by offloading the process of confirming available buffer space.
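The rendezvous handshake described above can be sketched as a toy model. The names used here (`Receiver`, `handle_rts`, `handle_payload`, `rendezvous_send`) are hypothetical and do not correspond to any particular NIC or MPI implementation; the sketch assumes the receiver reserves buffer space when it grants a clear-to-send, and it models the exchange as direct function calls rather than network messages.

```python
class Receiver:
    """Toy receiver that tracks free buffer space in bytes."""

    def __init__(self, free_bytes: int) -> None:
        self.free_bytes = free_bytes
        self.delivered = []

    def handle_rts(self, size: int) -> bool:
        # Answer a request-to-send: grant clear-to-send only if the
        # payload fits, reserving the space so later requests see less.
        if size <= self.free_bytes:
            self.free_bytes -= size
            return True
        return False

    def handle_payload(self, payload: bytes) -> None:
        # The payload proper arrives only after clear-to-send was granted.
        self.delivered.append(payload)


def rendezvous_send(receiver: Receiver, payload: bytes) -> bool:
    """Send payload via request-to-send / clear-to-send; False if refused."""
    if not receiver.handle_rts(len(payload)):
        return False  # no clear-to-send; the sender may retry later
    receiver.handle_payload(payload)
    return True


rx = Receiver(free_bytes=8)
print(rendezvous_send(rx, b"12345"))  # fits in the 8 free bytes
print(rendezvous_send(rx, b"12345"))  # only 3 bytes remain, so refused
```

If the logic in `handle_rts` runs on the NIC rather than the host, the receiving CPU never has to participate in the handshake, which is the overlap opportunity noted above.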
With its embedded CPU, the Myrinet NIC [43] prompted significant research on
the offloading of collective and rendezvous communication. For instance, Verstoep et
al. [93], Bhoedjang et al. [94], and Buntinas et al. [95] each explore strategies for offloading
multicast. Keppitiyagama and Wagner [96] and Tourancheau and Westrelin [97] consider how on-NIC CPUs can facilitate asynchronous MPI progress, and hence enable
computation/communication overlap in multicast or rendezvous communication. Similarly, Buntinas et al. [98], [99] research performance benefits for offloading barriers onto
Myrinet processors.
Outside of the Myrinet sphere of influence, Graham et al. [100] describe how Mellanox ConnectX-2 hardware facilitates the offloading of barriers (cf. [24]). NICs based
on the Portals network programming interface [3] include ‘triggered operations’, e.g., the
capacity to issue sends or perform other actions when a local buffer has been filled, independently of the host processor (cf. Chapter 4). Using these building blocks, Hemmert
et al. [101] implement offloaded barrier and broadcast collectives, Underwood et al. [102]
do the same for allreduce, and Barrett et al. [103] provide an initial attempt at handling
rendezvous communication.
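To make the notion of a triggered operation concrete, the following sketch models a counter that fires deferred actions once a threshold is reached, without any further involvement from the code that registered them. It is an illustration only: `Counter`, `trigger_at`, and `increment` are hypothetical names rather than Portals API calls, and the threshold-on-a-counter formulation is an assumption made for this example.

```python
from typing import Callable, List, Tuple


class Counter:
    """Toy event counter that fires registered actions at thresholds."""

    def __init__(self) -> None:
        self.count = 0
        self.pending: List[Tuple[int, Callable[[], None]]] = []

    def trigger_at(self, threshold: int, action: Callable[[], None]) -> None:
        # Register an action to run once the count reaches the threshold.
        self.pending.append((threshold, action))
        self._fire()

    def increment(self, n: int = 1) -> None:
        # Called on each arrival (e.g., a message landing in a buffer).
        self.count += n
        self._fire()

    def _fire(self) -> None:
        ready = [a for t, a in self.pending if self.count >= t]
        self.pending = [(t, a) for t, a in self.pending if self.count < t]
        for action in ready:
            action()  # each action runs exactly once


# Interior node of a tree with two children: forward to the parent only
# after both children have reported in.
sent = []
arrivals = Counter()
arrivals.trigger_at(2, lambda: sent.append("to_parent"))
arrivals.increment()
print(sent)  # nothing sent yet
arrivals.increment()
print(sent)  # the deferred send has fired
```

Chaining such counters, so that one triggered send increments a counter on another node, is one way collectives such as barriers could be assembled from this primitive.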
2.2.1.5

Message matching

The final type of offloaded core network applications we identify is that of message
matching. Broadly construed, this category could include any offloaded packet forwarding
mechanism based on matching bit fields, but here we target matching as defined by higher-level communication middleware, primarily the Message Passing Interface (MPI) [69].
Under MPI, incoming messages are demultiplexed, and payloads placed in the correct
buffers, according to a process identifier (the ‘rank’) and a user-specified tag. Since MPI
is by far the most common user-level network interface used by developers of scientific
applications, MPI message matching has been the subject of considerable research into
on-NIC offloading.
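The demultiplexing step can be sketched as follows. This is a simplified model, not the behavior of any specific MPI library: the `Matcher` class and its methods are hypothetical, the `ANY` wildcard stands in for MPI's `MPI_ANY_SOURCE` and `MPI_ANY_TAG`, and the use of a posted-receive list plus an unexpected-message list, each searched in order, is an assumption about a typical implementation.

```python
ANY = -1  # stand-in for MPI_ANY_SOURCE / MPI_ANY_TAG


class Matcher:
    """Toy rank/tag matcher with posted and unexpected queues."""

    def __init__(self) -> None:
        self.posted = []      # (rank, tag, buffer) receives awaiting a message
        self.unexpected = []  # (rank, tag, payload) messages awaiting a receive

    @staticmethod
    def _matches(want_rank: int, want_tag: int, rank: int, tag: int) -> bool:
        return want_rank in (ANY, rank) and want_tag in (ANY, tag)

    def post_recv(self, rank: int, tag: int, buffer: list) -> None:
        # A new receive first searches messages that already arrived.
        for i, (r, t, payload) in enumerate(self.unexpected):
            if self._matches(rank, tag, r, t):
                del self.unexpected[i]
                buffer.append(payload)
                return
        self.posted.append((rank, tag, buffer))

    def arrive(self, rank: int, tag: int, payload) -> None:
        # An incoming message searches posted receives, oldest first.
        for i, (r, t, buffer) in enumerate(self.posted):
            if self._matches(r, t, rank, tag):
                del self.posted[i]
                buffer.append(payload)
                return
        self.unexpected.append((rank, tag, payload))


m = Matcher()
buf_a, buf_b = [], []
m.post_recv(1, 7, buf_a)    # only rank 1, tag 7
m.post_recv(ANY, 7, buf_b)  # any rank, tag 7
m.arrive(2, 7, "from rank 2")
m.arrive(1, 7, "from rank 1")
print(buf_a, buf_b)
```

The linear searches in `post_recv` and `arrive` are the work that offloading proposals aim to move off the host, e.g., by replacing the list traversal with a hardware lookup.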
Tourancheau and Westrelin [97] show that offloading matching to an on-NIC CPU
can allow MPI to progress independently of the host application calling into the library,
enabling offloaded rendezvous communication. Underwood et al. [32] propose a TCAM-based
architecture for offloading message matching, and Tanabe et al. [104] explore an
offloading strategy based on separating headers and payloads. Klenk et al. [105] consider
how GPUs may accelerate matching if current MPI semantics can be relaxed, and current Atos Bull and NVIDIA Mellanox ConnectX-5 NICs provide offloaded MPI message
matching functionality [2], [35].

2.2.2

Host applications

In the previous subsection we considered the offloading of core network applications,
which include functions ranging from low-level packet processing to communication middleware. A second class of offloaded applications has received less attention: host applications. The characteristic feature of this type of offloaded application is that rather
than targeting network functionality, it encapsulates some part of the host application.
For purposes of illustration, a simple example of an offloaded host application is
one that performs specialized data packing and unpacking. For instance, an application
process may need to exchange data with neighboring processes, where this data comprises
elements of different sizes, irregularly distributed across local memory. To perform a send,
the application may first allocate a temporary buffer, pack this buffer by copying the
various elements into the contiguous memory, and then send the contents of the buffer to
a receiving process. Likewise, the receiver must unpack the payload, placing the contents
into the appropriate memory locations. Because these packing and unpacking routines
are not part of the communication software stack, offloading them to the NIC constitutes
offloading part of the host application.
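The packing and unpacking routines described above can be sketched in a few lines. The representation of the irregular layout as a list of `(offset, length)` pairs, and the function names `pack` and `unpack`, are assumptions made for illustration; a real application would derive the layout from its own data structures.

```python
from typing import List, Tuple

Layout = List[Tuple[int, int]]  # (offset, length) of each element in memory


def pack(memory: bytearray, layout: Layout) -> bytes:
    """Copy scattered elements into one contiguous send buffer."""
    out = bytearray()
    for offset, length in layout:
        out += memory[offset:offset + length]
    return bytes(out)


def unpack(payload: bytes, memory: bytearray, layout: Layout) -> None:
    """Scatter a contiguous payload back into its memory locations."""
    pos = 0
    for offset, length in layout:
        memory[offset:offset + length] = payload[pos:pos + length]
        pos += length


# Three elements of different sizes, irregularly placed in a 9-byte region.
layout = [(0, 2), (4, 3), (8, 1)]
sender_memory = bytearray(b"AAxxBBBxC")
payload = pack(sender_memory, layout)

receiver_memory = bytearray(9)
unpack(payload, receiver_memory, layout)
print(payload, receiver_memory)
```

Both loops are pure data movement driven by a fixed description, which is what makes them plausible candidates for execution on the NIC instead of the host CPU.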
While there has been significant research into offloading host applications to specialized appliances (e.g., discrete cards for accelerating mathematical operations or neural
network computation [106]–[110]), host application offloading to network appliances has
received less attention. Indeed, because of the relative rarity, the novelty is sometimes
explicitly noted, e.g., by Dang et al. [60], in their work on offloading consensus algorithms
to intelligent switches.
Because of their importance in reducing datacenter server workload, the most popular host applications targeted for offloading are key-value stores or caches, such as Memcached [111]. For example, Kim et al. [112] and Choi et al. [113] exploit increases in
on-NIC memory to cache recently accessed web data, potentially avoiding expensive
transactions over PCIe. Fukada et al. [114], Chalamalasetti et al. [115], and Istvan et
al. [116] explore FPGA-based NICs for offloading aspects of Memcached, e.g., using an
on-NIC FPGA (and associated DRAM) as an additional, near-network memory cache; if
no match is found on the NIC, the request is passed to the host for processing as usual.
Noting that processing key lookups is a bottleneck, Lim et al. [117] propose an ASIC solution, to be included on the NIC, for accelerating request handling. Similarly, the central
motivating case for the FlexNIC architecture proposed by Kaufmann et al. [118] is accelerating Memcached by providing the NIC with the capability to inspect headers and
intelligently direct lookup requests to cores dedicated to specific key ranges. Lavasani et
al. [119] propose implementing a ‘fast path’ for processing Memcached requests on-NIC,
passing the request to the host CPU only if necessary. Li et al. [120] explore using SmartNIC offloading capabilities to extend RDMA semantics to include primitives for doing
key lookup and allowing the NIC to retrieve values directly from main memory (rather
than caching those values on-NIC). Liu et al. [121] implement an offloaded key-value store
and distributed transactions as part of their work on the iPipe SmartNIC programming
model. Likewise, Choi et al. [122] use Netronome Agilio SmartNICs to offload database
transactions and decrease tail latencies. Roughly, the strategy is that a copy of a transaction
sent to the master is kept on the local NIC, and the master then negotiates with the NIC
to determine any ordering constraints, thus allowing the client to proceed as if the transaction had already been finalized. Finally, in HPC environments, key-value stores form
the basis for global address spaces. With that target in mind, Larkins et al. [123], [124]
propose leveraging the offloaded message demultiplexing provided by NICs implementing
the Portals network API [3] to serve as a key-value store for implementing a version of
such a space.
Other possible host application offloads have been less well-explored. Some of the
more complex intrusion-detection systems surveyed in Section 2.2.1.3 could be viewed as
host application offloading. Underwood et al. [59], [60] offload FFT to FPGA-equipped
intelligent NICs. By embedding the FPGA in the path from host memory to NIC,
the NIC is able to perform the necessary matrix transformations on the fly, as data is
moving between network and host. More recently, Di Girolamo et al. [125] utilize the
sPIN SmartNIC architecture to accelerate the packing and unpacking of application-defined MPI datatypes. Glebke et al. [126] present research into offloading convolution
for purposes of edge detection in mobile vehicle guidance to a match-plus-action NIC
(Netronome Agilio). Sanvito et al. [127] analyze CNN network processing to determine
when the host CPU is least efficient, and propose methods for offloading those stages
to the NIC, essentially finishing the network computation as the results are delivered to
their final destination.

2.2.3

Independent applications

The final category of applications offloaded to SmartNICs is the natural extension of
offloading parts of host applications: let the NICs themselves be the hosts. In other
words, treat on-NIC compute resources as independent from the host CPUs, allowing
them to autonomously execute arbitrary applications. For example, if not needed to
accelerate network or host applications, these resources could be allocated like any other
compute resource by a job scheduler. Alternatively, they could be used to execute on-NIC
kernels for dynamically tuning the network as host applications execute.
Recent work in network offloading points towards a future where network compute
resources are treated as independent. For example, Bhattacharyya et al. [128] propose
offloading functions as diverse as clickstream analysis, financial data analysis, and sensor
data processing. While in this work these applications are not implemented on SmartNICs, they cite sPIN [54] as a potential platform.
In cloud computing, ‘serverless compute’ refers to services that allow developers to
deploy code without having to consider the full infrastructure they run on. In principle,
one can execute arbitrary code under such a service (although in practice, there are
limits placed on time and memory). Choi et al. [55] propose a mechanism by which
SmartNICs handle the execution of these so-called ‘lambda’ workloads. SmartNICs are
again targeted for the execution of arbitrary workloads that may be independent of
applications executing on the host.
As a final example, Microsoft Catapult inserts an FPGA in the pathway between
switch and NIC [67], [129]. In addition to accelerating ‘local’ applications on the near
host, these resources can be allocated to handle ‘global’ applications, i.e., for purposes
independent of the local host. While not technically a SmartNIC according to the definition adopted here, this work nonetheless illustrates how SmartNIC compute resources
may be treated as independent of their hosts.

2.3

Architectures

The third and final dimension along which we compare SmartNICs is their general architecture. In this section we distinguish between three broad categories of SmartNIC
architecture, discussing the benefits and limitations of each.

2.3.1

On-path

The first category of architecture is on-path in the sense that processing elements (PEs)
are situated on the packet processing pipeline [121]. Figure 2.1 provides an abstract
representation of an on-path architecture. In this scenario, incoming data passes through
some initial handler A, is directed to one or more processing elements (PE), and then
handed to B for delivery to the host. On-path architectures are perhaps the most common
architecture in the SmartNIC ecosystem, because they cohere with the original goal of
processing packets in-flight: packets arrive and are forwarded, classified, filtered, checked
for errors, etc., and then delivered.
[Diagram: network Rx/Tx, handler A, processing element (PE), handler B, and Host, connected in series.]
Figure 2.1: Abstract representation of an on-path SmartNIC architecture.
The decision to place PEs on the processing pipeline has consequences regarding
the nature of the work that can be done. First, because these PEs are engaged only
when there is traffic in the pipeline, these resources are wasted when the network is
idle. Second, processing is fundamentally deadline-based. That is, when data arrives
at the PE, it must be ready to handle that data, for otherwise the data stream may be
invalidated, resulting in an unrecoverable, undefined data state. Since the PE is expected
to keep up with line rate, there is a limited amount of time the PE has to perform work.
For example, for a 200 Gb/s network with 256B packets, a single-issue 2.5 GHz core can
execute approximately 26 instructions before it must be ready to service the next packet.
There are several strategies for addressing this deadline. One is to arrange PEs in
a pipeline: each PE is subject to the deadline, and the chain of PEs is responsible for
executing the entire program. For example, extending the previous example, given a
200 Gb/s network and 256B packets, a chain of 10 2.5 GHz cores could execute approximately
260 instructions on each packet. Netronome Agilio SmartNICs are an example of a
contemporary SmartNIC that adopts a pipelined architecture [36]. A second strategy is
to arrange PEs in parallel; each additional PE buys another gap’s worth of processing
time. For example, with 32 2.5 GHz cores, each with an IPC of 1, operating in parallel on a 200 Gb/s
network with 256B packets, each core can execute at most 819 instructions before it
must be available to process the next packet. sPIN follows this strategy [54].
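The instruction budgets quoted above follow from simple arithmetic, sketched below. This is an illustrative model only: the function name and its parameters are hypothetical, and it assumes back-to-back packets with no framing overhead or inter-packet gap, so that one packet arrives every (packet size / line rate).

```python
def instruction_budget(line_rate_gbps, packet_bytes, clock_ghz, ipc=1.0, cores=1):
    """Instructions available per packet before the next packet must be serviced.

    Hypothetical helper: assumes back-to-back packets and ignores framing
    overhead, so one packet arrives every packet_bits / line_rate.
    """
    # 1 Gb/s is one bit per nanosecond, so this is the inter-packet gap in ns.
    gap_ns = packet_bytes * 8 / line_rate_gbps
    # 1 GHz is one cycle per nanosecond; IPC converts cycles to instructions.
    per_core = gap_ns * clock_ghz * ipc
    # A pipeline of N cores provides N gaps' worth of work per packet in total;
    # N parallel cores give each core N gaps before its next packet. In both
    # cases the budget scales linearly with the number of cores.
    return per_core * cores


single = instruction_budget(200, 256, 2.5)               # about 25.6
pipelined = instruction_budget(200, 256, 2.5, cores=10)  # about 256
parallel = instruction_budget(200, 256, 2.5, cores=32)   # about 819.2
```

Under this simplified model, the budget is inversely proportional to the line rate, so any increase in network speed that is not matched by clock speed or core count shrinks the work available per packet.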
Of course, neither of these strategies removes the deadline; they only delay it. This leads
to a third consequence, regarding code portability. For instance, when network speeds
increase to 400 Gb/s [4], all else being equal, the number of available instructions in the
parallel example just given is nearly halved, to approximately 420. Unfortunately, as
network speeds increase, we cannot reasonably expect core clock speeds or the number
of cores to increase proportionally. Consequently, packet processing kernels will not
necessarily be forward-compatible.

2.3.2

Off-path

The obvious way to avoid network-imposed deadlines is to move the processing element
off the packet processing pipeline. Figure 2.2 is an abstract depiction of such a NIC:
incoming data passes from A to B, at which point it can either proceed to the host,
or be diverted to the processing element. Contemporary SmartNICs that utilize this
architecture include NVIDIA’s Mellanox Bluefield data processing units [50], [51] and
Broadcom’s Stingray NICs [52]. In each of these cases, the processing element is a CPU
executing a version of Linux, and communication with other components in the NIC
(e.g., B and the host) uses the same mechanism as the host OS (e.g., InfiniBand Verbs).
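[Diagram: network Rx/Tx, A, and B connected in series to the Host, with the processing element (PE) attached to B, off the main pipeline.]
Figure 2.2: Abstract representation of an off-path SmartNIC architecture.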
Off-path architectures avoid deadlines by moving processing out of the pipeline that
imposes them in the first place. This has two consequences. First, since one can no
longer assume the PEs will operate within those deadlines, the system must allow for
more state than an on-path architecture (where the required memory can be carefully
tailored to the task and network speed). Second, having to divert data to the off-path
PEs can introduce additional latency [121].

2.3.3

In-path

The second strategy for avoiding processing deadlines – and hence enabling offloaded
general-purpose compute on the NIC – is to reuse on-path PEs. Such an in-path architecture is represented in Figure 2.3: data passes through the PEs as it travels from
A to B, and has the option of being ‘recycled’ back into the pipeline for additional
processing.

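[Diagram: network Rx/Tx, A, PE, B, and Host connected in series, with a return path that recycles data back into the pipeline.]
Figure 2.3: Abstract representation of an in-path SmartNIC architecture.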
In-path architectures avoid deadlines by adjusting the scope of what constitutes
program execution: rather than view execution as something that happens within the
pipeline, program execution occurs over that pipeline. That is, while the execution of
individual ‘instructions’ is subject to the deadline, overall program execution is not.
One consequence is that the architecture has the benefit of leveraging existing on-path PEs to provide general-purpose compute capability. Second, code developed for one
generation is, in principle, portable to future generations. Third, we can assume that
existing on-path PEs will adapt to increasing network speeds. Therefore, as network
speeds increase, so will program execution speeds. Fourth, as discussed in depth in
Chapter 6, because network speeds (i.e., message rates) are currently relatively slow in
comparison to CPU speeds (MHz versus GHz), in-path code execution will likewise be
slow. This is somewhat mitigated by projected increases in network speeds, but also
suggests additional hardware for accelerating kernel execution will be useful. Finally,
like off-path architectures, in-path architectures gain independence from deadlines at the
cost of increased state, e.g., for caching data during program execution.

2.4

Discussion: The path forward

In this chapter we’ve considered intelligent network interfaces from three perspectives:
the hardware underwriting intelligence, the types of offloaded applications, and the general architecture of SmartNICs.
Regarding hardware, we’ve highlighted three broad categories: ASICs, CPUs, and
FPGAs. While we emphasize some types of hardware over others while providing examples of each, it is safe to say that in most (if not all) cases, SmartNICs include some
combination of the three, a trend that has a long history. For example, the influential Intel IXP family of network processors includes modified RISC cores for fast packet
processing, and an ARM processor for executing arbitrary code [85], [130].
The speed of ASICs or ASIPs makes them good candidates for traditional packet
processing tasks that at this point in the evolution of network technology are unlikely to
change. However, as this survey reveals, flexibility has become the most important feature
for determining whether a NIC is ‘smart’. For example, it is currently not uncommon to
find SmartNICs defined as NICs including CPUs or FPGAs [120], [122]. Moreover, there
currently exist options that, from a hardware perspective, include ‘everything but the
kitchen sink’, e.g., Xilinx’s Versal NICs include an FPGA, realtime and standard CPUs,
and configurable ASICs dedicated to artificial intelligence processing [131]. Consequently,
while we began the chapter with a broad definition of a SmartNIC as a network interface
that offloads core network applications traditionally performed by the host CPU, from
a contemporary perspective, a SmartNIC is one that not only offloads, but also does
so using hardware (CPUs, FPGAs) that affords the execution of arbitrary (or nearly
arbitrary) programs.
| Application type | Characteristics | Examples |
|---|---|---|
| Core network | Offloads L2-L4 protocol processing, collectives, message demultiplexing, etc. | Packet forwarding, TCP segmentation, virtualization (Section 2.2.1.1); TOEs (Section 2.2.1.2); Security (Section 2.2.1.3); Collectives (Section 2.2.1.4); Message matching (Section 2.2.1.5) |
| Host | Offloads components of host applications | Key-value stores, memcached, web caching, consensus algorithms, data packing/unpacking, AI (Section 2.2.2) |
| Independent | Offloads applications executing autonomously of host applications | Arbitrary jobs (Section 2.2.3) |

Table 2.1: Summary of survey of SmartNIC applications (Section 2.2).

The emphasis on flexibility is further highlighted by trends in offloaded applications, summarized in Table 2.1. Initially, an intelligent network interface was simply one that
offloaded core network applications. This situation was encouraged by slow NIC processing speeds relative to host CPUs: why bother offloading complex tasks when there
is capacity to spare on the host? [76], [95] However, as the limits of Moore’s law approach, support for offloaded applications has climbed up the network stack, increasing
in the complexity of computation and the amount of state required. This shift in focus
is reflected by recent research that looks beyond core network functions to offloading
parts of host applications onto the NIC, or even to treating the NIC as a platform in its
own right, i.e., as a compute resource running applications that are independent of any
traditional host CPU. Perhaps unsurprisingly, trends towards diversity in the types of
offloaded applications parallel those towards flexibility in available hardware.
Finally, we surveyed the abstract architectural approaches that have been pursued
for making NICs intelligent. The primary conclusions of this survey are summarized in
Table 2.2. On-path architectures situate processing elements on the packet processing
pipeline, meaning there are deadlines on the amount of work they can do, imposed by
the network speed. Off-path architectures avoid these deadlines by moving processing
elements off the packet processing pipeline, e.g., by equipping the NIC with an onboard CPU. In-path approaches – of which the design proposed below (Chapter 4) is,
to our knowledge, the sole representative – synthesize the on- and off-path approaches
by leveraging on-path processing elements to nonetheless provide deadline-free kernel
execution.

| Architecture | Examples | On processing pipeline? | Deadline-based? |
|---|---|---|---|
| On-path | Quadrics [45]; Netronome Agilio [36]; Marvell LiquidIOII [53]; Myrinet [43]; sPIN [54] | Y | Y |
| Off-path | Broadcom Stingray [52]; NVIDIA Mellanox Bluefield [50], [51] | N | N |
| In-path | INCA [132] | Y | N |

Table 2.2: Summary of survey of SmartNIC architectures (Section 2.3).
Many HPC NICs are smart in the traditional sense of offloading core network applications, e.g., collective communication or message demultiplexing. In the next chapter,
we begin our inquiry into what SmartNICs can do by identifying message processing
overheads incurred by emerging multithreaded approaches to inter-process communication, and assessing the ability of a state-of-the-art NIC offering offloaded support for MPI
message matching to deal with these overheads.

Chapter 3
Message matching
Many scientific applications rely on the Message Passing Interface (MPI) for managing
data exchange between processes. As noted in Chapter 2, the importance of MPI to
these applications has led to the offloading of parts of MPI to the NIC or host adapter,
e.g., message matching. NICs providing such offloaded capabilities are thus ‘smart’ in
the traditional sense of the term.1
These smart capabilities may increase application performance given the present state of
practice, but – as noted in Chapter 1 – we are also concerned with assessing how such capabilities apply to future scenarios. In this chapter, we explore one such future paradigm
– multithreaded communication in scientific applications – and assess the ability of a
contemporary SmartNIC offering offloaded support for MPI message matching to handle
the overheads incurred by multithreaded communication.
MPI includes support for allowing user-level threads to concurrently call into an MPI
library [69]. While this level of thread support (MPI_THREAD_MULTIPLE) is rarely used in
current practice, a recent survey of application developers in the United States’ Department of Energy Exascale Computing Project indicates a majority (86%) are interested
in taking advantage of the opportunities it affords [136]. Specifically, trends towards
larger core counts and faster-yet-lower-capacity memory (e.g., high-bandwidth memory
(HBM) [1]) discourage traditional ‘MPI-everywhere’ approaches, where each MPI process instantiates its own set of communication resources. Instead, memory requirements
can be reduced by adopting a multithreaded approach, where a single set of resources is
used. Therefore, understanding the implications of adding multithreading to MPI codes
is important for future application development.

1 Results reported in this chapter are drawn from [133], [134], and [135].

Since under MPI_THREAD_MULTIPLE, individual threads can issue sends and receives,
this mode of operation may lead to significant increases in the number of messages exchanged. Furthermore, since many threads in a process may be issuing receives or sends
concurrently, multithreaded communication may result in deviations from the expected ordering of messages, making the standard strategy of posting receives in the order incoming
messages are anticipated to arrive ineffective. This suggests that utilizing multithreaded
communication can result in more time spent in MPI message matching, with potential
consequences for application performance. This calls for methods for assessing the performance impacts of pursuing multithreaded communication under MPI, including for
hardware offloading.
One such method is to emulate a multithreaded halo exchange. This approach has
the benefit of being ‘low-cost’ in the sense it requires fewer resources to execute, e.g., two
nodes rather than the 9 or 27 required by full exchanges. Consequently, the benchmark
can be deployed on systems where acquiring a full allocation is difficult (e.g., because it
is under high utilization) or even impossible (e.g., because it lacks the necessary number
of nodes).
An alternative is to provide a benchmark implementing the complete, ‘real-world’
multithreaded halo exchange. In comparison to the low-cost approach, this strategy has
the benefit of imposing more realistic demands on MPI and the underlying network, but
also incurs additional resource costs.
In this chapter we make the following contributions:
• A formal analysis of thread and message counts under multithreaded halo communication patterns;
• The design and implementation of a ‘low-cost’ benchmark for assessing MPI communication performance under typical multithreaded halo exchanges;
• The design and implementation of a ‘real-world’ benchmark for assessing MPI communication performance under those same communication patterns;
• An evaluation of the low-cost benchmark relative to the real-world baseline, across
multiple architectures; and
• A case study using the low-cost benchmark to examine the impact of message
matching offloading on multithreaded halo exchange.
The low-cost and real-world benchmarks will be included in the next release version
of the Sandia MPI Micro-Benchmark Suite (SMB) [137].
The remainder of the chapter is structured as follows. In Section 3.1, we provide a
summary of MPI message matching and how multithreaded communication might affect
it. Section 3.2 provides a brief survey of related work, and in Section 3.3, we provide a
formal analysis of features of multithreaded halo exchanges such as number of threads
participating in inter-process communication and number of messages exchanged. In Section 3.4, we describe the benchmarks and the systems upon which they were run. Results
from both benchmarks are presented and discussed in Section 3.5, and in Section 3.6, we
consider results from running the low-cost benchmark on a system with hardware offload
support for message matching.

3.1

Background

Message matching is MPI’s receiver-side data-placement mechanism, used primarily to
support point-to-point communication. To send a message (e.g., through MPI_Send or
MPI_Isend), an MPI process specifies a buffer containing data to be sent, a destination
ID (‘rank’), and a placement identifier (‘tag’). The receiving MPI process posts a corresponding receive (e.g., MPI_Recv or MPI_Irecv) specifying a buffer where data will be
placed, the rank of the sender, and the tag of the expected message. The communication
is completed when the receiver matches the sending rank and tag of an incoming message
to that of a posted receive, and the payload is delivered to the specified buffer.
The MPI specification imposes several constraints on receiver-side message matching.
First, messages with the same matching fields must be matched in the order their receives
are posted. Second, the matching mechanism must allow wildcards for both rank and
tag. To handle these requirements, traditional implementations use two linked lists: a list
of outstanding receive requests in a posted receive queue (PRQ), and a list of messages
that have arrived and failed to match any receive request in the unexpected message
queue (UMQ).
[Diagram: a new message is checked against the PRQ and a new receive request against the UMQ; each queue is a list with a Head and a Tail.]

Figure 3.1: MPI message matching queues. Given an incoming message, the
receive requests posted in the PRQ are searched, and if no match is found,
the message is enqueued at the tail of the UMQ. Likewise, given a new posted
request, the UMQ is searched to determine whether the matching message has
already arrived, and if not, the request is appended to the PRQ.

As shown in Figure 3.1, when an MPI process posts a receive, its UMQ is traversed to
determine whether a message with the desired sending rank and tag has already arrived,
and if not, the receive is appended to the PRQ. When a message arrives at that process,
the PRQ is traversed to determine whether a receive with the required rank and tag
has already been posted, and if not, the information is appended to the UMQ. MPI
ordering and wildcard semantics are guaranteed by initiating searches from queue heads
and appending to their tails. For the purposes of this paper, we use a traditional dual-queue model for message matching, based on the model used by MPICH [138] and its
derivatives. Some other implementations have opted for different models. For example,
Open MPI [139] utilizes an array of lists, indexed by sending rank, which can reduce
average search depth at the cost of increased memory. The benchmarks and results
presented in this chapter can provide a better understanding of how these optimized
models will impact next-generation applications.
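The dual-queue behavior just described can be sketched as follows. This is a minimal illustration, not the MPICH or Open MPI implementation: the class, method, and constant names are hypothetical, Python lists stand in for the linked lists, and the `ANY` sentinel stands in for the rank and tag wildcards.

```python
from dataclasses import dataclass

ANY = -1  # hypothetical sentinel standing in for the rank/tag wildcards


@dataclass
class Entry:
    rank: int
    tag: int


def _matches(recv, msg):
    # A posted receive matches a message if each field is equal or wildcarded.
    return recv.rank in (ANY, msg.rank) and recv.tag in (ANY, msg.tag)


class MatchEngine:
    def __init__(self):
        self.prq = []  # posted receive queue, head at index 0
        self.umq = []  # unexpected message queue, head at index 0

    def post_receive(self, rank, tag):
        """Search the UMQ from its head; append to the PRQ tail if no match."""
        recv = Entry(rank, tag)
        for i, msg in enumerate(self.umq):
            if _matches(recv, msg):
                return self.umq.pop(i)  # message had already arrived
        self.prq.append(recv)
        return None

    def message_arrives(self, rank, tag):
        """Search the PRQ from its head; append to the UMQ tail if no match."""
        msg = Entry(rank, tag)
        for i, recv in enumerate(self.prq):
            if _matches(recv, msg):
                return self.prq.pop(i)  # earliest-posted matching receive
        self.umq.append(msg)
        return None
```

Because both searches start at the head and both insertions happen at the tail, the earliest posted receive (or earliest arrived message) always wins, which is how this structure preserves ordering in the presence of wildcards.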
Obvious concerns for this approach are situations where (i) the PRQ or UMQ grow
large, and (ii) the order in which messages arrive or requests are posted is such that either
no matches are found, or that when matches are found, they tend to occur near the end of
a list. In these cases, message processing latencies increase due to time spent searching,
and this can disrupt application performance. The introduction of multithreading to
communication via MPI_THREAD_MULTIPLE prima facie encourages this problem in two
ways. First, because individual threads are now participating in communication, there
is a corresponding increase in the number of messages exchanged, and queue sizes grow.
Second, a rule of thumb for efficient MPI programming is to post sends in the same order
as receives so that even if there are many messages to be processed (and the PRQ or
UMQ grows large), matches are always found at or near the head of the list, keeping time
spent searching to a minimum. With multithreaded communication, however, threads
are issuing sends or receives concurrently, introducing nondeterminism into this ordering,
and rendering the standard strategy ineffective. One goal of this research is to ascertain
just how disruptive this nondeterminism is.
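The effect of arrival order on search cost can be illustrated with a toy experiment. The sketch below is an assumption-laden simplification: all receives are pre-posted, every message carries a unique tag, there are no wildcards, and a uniformly random shuffle stands in for thread interleaving (real interleavings need not be uniform). The function name is hypothetical.

```python
import random


def total_search_depth(posted_tags, arrival_tags):
    """Sum of PRQ entries inspected while matching every arriving message."""
    prq = list(posted_tags)       # pre-posted receives, head at index 0
    inspected = 0
    for tag in arrival_tags:
        i = prq.index(tag)        # linear scan from the head of the queue
        inspected += i + 1        # entries examined, including the match
        del prq[i]
    return inspected


tags = list(range(64))
in_order = total_search_depth(tags, tags)        # every match at the head
reverse = total_search_depth(tags, tags[::-1])   # every match at the tail

shuffled = tags[:]
random.Random(0).shuffle(shuffled)               # fixed seed for repeatability
mixed = total_search_depth(tags, shuffled)       # somewhere in between
```

With 64 messages, in-order arrival inspects 64 entries in total, while fully reversed arrival inspects 64 + 63 + ... + 1 = 2080; any other ordering falls between these bounds.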

3.2

Related work

The possibility that message matching overheads may negatively affect application performance has led to research on both understanding MPI message matching behavior,
and into strategies for mitigating the impact of performing queue searches. For example,
initial work by Underwood et al. [140] explored the performance impact of long queues.
Balaji et al. [141] look at search overheads on BlueGene/P, and further studies by Barrett
et al. [142] showed the impact of match list length on a variety of system architectures.
Brightwell et al. [143] measure queue characteristics of several applications, and Dosanjh
et al. [144] consider the impacts of temporal and spatial locality on match engine performance. Ferreira et al. [145] and Levy et al. [134] use trace-based simulations to gain
insights into message matching costs for a variety of scientific applications, noting that
in most (single-threaded) cases queue lengths tend to remain small. Finally, Bridges et
al. [146] present a model for assessing MPI queue performance on many-core systems.
Strategies for mitigating message matching costs include both software refinements
and, more recently, hardware offloading. Proposed software-side optimizations include
using a dedicated thread for message processing [147], incorporating data structures such
as hash tables [148]–[152], leveraging underlying InfiniBand queue pairs to satisfy MPI
semantics [153], using vector units to check multiple items in parallel [154], or enabling
more fine-grained locks on queue data structures [155].
Regarding hardware-based approaches, Ferreira et al. [156] use simulation results to
identify requirements on hardware solutions (e.g., amount of memory). Tourancheau
and Westrelin [97] address rendezvous communication by proposing message matching
and associated queues be offloaded to an on-NIC CPU. Underwood et al. [32] propose a
TCAM-based architecture for offloading message matching, Tanabe et al. [104] explore an
offloading strategy based on separating headers and payloads, Klenk et al. [105] consider
how GPUs may accelerate matching if current MPI semantics can be relaxed, and current Atos Bull and NVIDIA Mellanox ConnectX-5 NICs provide offloaded MPI message
matching functionality [2], [35].
Several works have addressed multithreading support in MPI by improving implementation internals [157]–[159], and proposing new interfaces [160], [161]. In addition
to traditional send/receive multithreading and matching overheads, work has also examined multithreading in the context of MPI one-sided communication [162], [163]. Other
MPI multi-threaded benchmarks have been proposed for basic MPI functionality and
performance testing [137], [164], [165].
In the research described in the remainder of this chapter, we contribute to the general
understanding of message processing overheads by investigating a natural extension of
halo exchange communication to multithreaded contexts. Moreover, as illustrated by the
case study provided in Section 3.6, the benchmarks developed here allow one to assess
how hardware offload solutions perform.

3.3

Analysis

As noted in Section 3.1, multithreaded communication, when applied to typical halo
exchange communication patterns, increases the number of messages exchanged. To
emulate such a pattern, then, we need to establish both the number of threads and the
number of messages exchanged, for a variety of common 2D and 3D stencils. In this
section, we present this analysis.
The halo exchange communication pattern is common amongst scientific applications [166], [167]. Figure 3.2a illustrates a simple single-threaded, 9-point, two-dimensional halo exchange between process p4 and its neighbors. In a typical bulk-synchronous processing (BSP) application using this stencil, after each process completes
its assigned work, it exchanges data with its eight nearest neighbors. Once this exchange
is complete, it continues into a new work phase.
[Figure panels: (a) a single-threaded, nine-point halo exchange among processes p0 through p8; (b) a multithreaded, nine-point halo exchange in which p4 is divided into threads t0 through t15.]
Figure 3.2: Single-threaded (a) and multithreaded (b) nine-point halo exchanges.
Figure 3.2b represents a straightforward multithreaded version using the same stencil
communication pattern. In this scenario, work on p4 is divided between 16 threads (t0
through t15). During the communication phase, each thread is responsible for exchanging
messages with its neighboring threads, based on the same 9-point stencil pattern shown
in Figure 3.2a.
Assuming that inter-thread communication within the same process is handled outside
of the MPI matching engine (e.g., via shared memory), one can calculate the number
of sending and receiving threads, and the total number of messages exchanged between
processes. Specifically, a decomposition can be viewed as a collection of ‘faces’ of different
dimensions. For example, a square (2D) decomposition comprises 4 faces of dimension
0 (corners, each containing a single thread, e.g., t0, t3, t12, and t15 in Figure 3.2b), 4 faces of dimension 1 (edges, each containing a one-dimensional line of threads, e.g., t1-t2 in Figure 3.2b), and 1 face of dimension 2 (the interior, containing a two-dimensional grid of threads, e.g., t5, t6, t9, and t10 in Figure 3.2b). If we
assume, for purposes of exposition, that all dimensions have the same length, x, then the
total number of threads, t(k), on faces of dimension k can be expressed as a variation on
the standard equation for calculating the faces of a hypercube of dimension d:
\[
t(k) = (x - 2)^{k} \, 2^{d-k} \binom{d}{k} \tag{3.1}
\]
for x ≥ 1. In this equation, the first factor ((x − 2)^k) is the volume of the k-dimensional
face, expressed as the number of threads it contains. The remainder of the equation
computes the number of k-dimensional faces. The total number of threads is simply the
sum of the number of threads for each possible dimension:
\[
\sum_{k=0}^{d} t(k) \tag{3.2}
\]

The number of threads involved in inter-process communication depends on the dimensionality of the stencil used to define the communication. For this work, we limit our
analysis to standard 5 and 9-point 2D stencils, and 7 and 27-point 3D stencils. If the
dimensionality of the stencil (d_s) is greater than the dimension of the thread decomposition (d), then the set of participating threads includes every thread in the decomposition.
Otherwise, the number of participating threads is equal to Equation 3.2, where the upper
bound of the summation is reduced to d − 1, i.e., the single d-dimensional face is excluded.
The case where d_s < d is beyond the scope of this work.
The number of messages exchanged by each thread depends on the face it belongs to
and the stencil used. For communication limited to the Von Neumann neighborhood (5
and 7-point stencils), the number of messages is given by Equation 3.3. For communication within the Moore neighborhood (9 and 27-point stencils) the number of messages is
given by Equation 3.4.

\[
m(k) = 2 d_s - (d + k) \tag{3.3}
\]

\[
m(k) = 3^{d_s} - 3^{k} \, 2^{d-k} \tag{3.4}
\]

That is, the number of messages exchanged by a thread is equal to the total number of
messages (2d_s for the Von Neumann neighborhood, and 3^{d_s} for the Moore neighborhood)
minus the number of intra-process messages. The total number of messages exchanged
is therefore the sum of the messages processed by each thread on each face:
\[
\sum_{k=0}^{d} t(k) \, m(k) \tag{3.5}
\]
If d_s = d, the upper bound of the summation is reduced to d − 1.
This analysis can be straightforwardly (if tediously) extended to handle the case where
the number of threads in each dimension is not equal, e.g., a rectangular (rather than
square) decomposition in two dimensions. For present purposes, Table 3.1 summarizes
the number of messages processed by the receiver for each of the decompositions and
stencils considered in the experiments reported in this chapter, as calculated using the
analysis just provided. These data illustrate how multithreading can significantly increase
the number of messages exchanged.
Square decomposition (2D stencils)
Decomposition   1x1   2x1   2x2   4x2   4x4   8x4   8x8   16x8   16x16
5pt               4     6     8    12    16    24    32     48      64
9pt               8    14    20    32    44    68    92    140     188

Cube decomposition (3D stencils)
Decomposition   1x1x1   2x1x1   2x2x1   2x2x2   4x2x2   4x4x2   4x4x4   8x4x4   8x8x4
7pt                 6      10      16      24      40      64      96     160     256
27pt               26      50      92     152     272     464     728    1256    2072

Linear decomposition (3D stencils)
Decomposition   1x1x1   1x1x2   1x1x4   1x1x8   1x1x16   1x1x32   1x1x64   1x1x128   1x1x256
7pt                 6      10      18      34       66      130      258       514      1026
27pt               26      50      98     194      386      770     1538      3074      6146

Table 3.1: Number of messages processed by receiver under different thread-level
decompositions and stencils.
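The message counts in Table 3.1 can be reproduced in two ways. The sketch below, which is illustrative and not taken from the benchmark source, evaluates Equation 3.5 in closed form for equal-sided decompositions with d_s = d, and also counts messages by brute force for arbitrary (including rectangular and linear) decompositions by enumerating every thread and every stencil offset whose neighbor falls outside the local grid. The function names are assumptions made for this example.

```python
from itertools import product
from math import comb


def closed_form_messages(x: int, d: int, moore: bool) -> int:
    """Equation 3.5 with d_s = d, so the sum runs over k = 0..d-1."""
    ds = d
    total = 0
    for k in range(d):
        t = (x - 2) ** k * 2 ** (d - k) * comb(d, k)       # Equation 3.1
        if moore:
            m = 3 ** ds - 3 ** k * 2 ** (d - k)            # Equation 3.4
        else:
            m = 2 * ds - (d + k)                           # Equation 3.3
        total += t * m
    return total


def brute_force_messages(shape: tuple, moore: bool) -> int:
    """Count (thread, neighbor) pairs whose neighbor lies outside the local grid."""
    d = len(shape)
    offsets = [o for o in product((-1, 0, 1), repeat=d) if any(o)]
    if not moore:
        # Von Neumann neighborhood: exactly one non-zero coordinate.
        offsets = [o for o in offsets if sum(abs(c) for c in o) == 1]
    count = 0
    for thread in product(*(range(n) for n in shape)):
        for off in offsets:
            if any(not 0 <= c + o < n for c, o, n in zip(thread, off, shape)):
                count += 1
    return count


if __name__ == "__main__":
    print(closed_form_messages(4, 3, moore=True))       # 4x4x4, 27-point
    print(brute_force_messages((8, 8, 4), moore=True))  # 8x8x4, 27-point
    print(brute_force_messages((1, 1, 256), moore=False))
```

The brute-force count is slower but covers the unequal decompositions in the table, which the closed form as stated does not.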

3.4 Benchmarks and methodology

To assess the impact of multithreading on MPI message matching performance, we designed and implemented two multithreaded halo exchange benchmarks. Each benchmark
follows a BSP pattern, where receives are pre-posted in anticipation of incoming messages.
Recent versions of Open MPI include an option to utilize a traditional dual-queue
matching engine of the sort described in Section 3.1, optimized for SIMD operations
(see [154]). To collect data on the number of items searched and the time spent searching, we
instrumented this traditional dual-queue matching engine, without enabling these optimizations (see footnote 5). Our instrumented Open MPI reports: (i) the number of items searched in the
PRQ before a match is found for each incoming message; and (ii) the total amount of time
spent searching the PRQ. Since all of the receives are pre-posted in these benchmarks,
all incoming messages will have a match in the PRQ.
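To make the "items searched" metric concrete, the following is a simplified sketch of a posted receive queue (PRQ) that is searched linearly for each incoming message. It is a toy model, not Open MPI's matching engine: matching is reduced to equality on a single integer, whereas real MPI matching involves source, tag, communicator, and wildcards. The function name is an assumption for this example.

```python
import random


def total_items_searched(posted: list, arrivals: list) -> int:
    """Linearly search the PRQ for each arrival; return the total items inspected."""
    prq = list(posted)                 # receives, in the order they were posted
    searched = 0
    for msg in arrivals:
        depth = prq.index(msg) + 1     # entries inspected up to and including the match
        searched += depth
        del prq[depth - 1]             # a matched receive leaves the queue
    return searched


if __name__ == "__main__":
    receives = list(range(728))        # e.g., one receive per message for 4x4x4, 27-point
    ideal = total_items_searched(receives, receives)

    shuffled = receives[:]
    random.Random(0).shuffle(shuffled) # stand-in for non-deterministic arrival order
    observed = total_items_searched(receives, shuffled)

    print(ideal, observed, observed / ideal)
```

When arrivals follow the posting order, every search matches on the first item and the total equals the number of messages; any other arrival order can only increase the total.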
The first benchmark is real-world in the sense that it performs an actual multithreaded
halo exchange. That is, nine processes are used for 2D stencils, and 27 processes are used
for 3D stencils. Depending on the dimensionality of the process-level decomposition, each
process is partitioned into x × y or x × y × z threads. Using MPI_THREAD_MULTIPLE,
threads on the perimeter of the decomposition participate in inter-process message exchanges with those of neighboring processes, posting receives and issuing sends. Communication between processes is non-periodic (i.e., non-toroidal). Because the goal is to
assess MPI matching engine performance, and this engine can be bypassed when MPI
processes reside on the same node, each process is situated on a distinct node. At the
thread level, all receives are pre-posted in the same order as the corresponding sends, and
data is collected from the center process via the modified Open MPI described above.
Based on the analysis presented in Section 3.3, the low-cost benchmark emulates an
actual multithreaded halo exchange using only two MPI processes: a sender and a receiver
(cf. [133]). Both use multithreaded MPI for communication, but their thread counts
differ. The number of receiver-side threads is determined by a version of Equation 3.2,
and is equal to the number of threads used by the center process in the real-world
benchmark. The number of threads used by the sender is the total number of threads
issuing sends in the real-world case, as calculated using minor variations on Equations 3.1
and 3.2. In other words, while the receiver corresponds to the center process in an actual
halo exchange (e.g., p4 in Figure 3.2a), in the low-cost benchmark, the threads from
surrounding processes (e.g., p0-p3 and p5-p8 in Figure 3.2a) are collected onto a single
sending process.

5 Open MPI hash dd74c6252f8947e213e83f470024f3a4ce78b10b
There are multiple ways that an application developer can decompose their problem with multiple threads. We consider three possible decompositions: two-dimensional
‘square’ (e.g., 1×1, 2×1, 2×2, 4×2, . . .), three-dimensional ‘cube’ (e.g., 1×1×1, 2×1×1,
2×2×1, 2×2×2, . . .), and three-dimensional ‘linear’, which scales only along the z axis
(e.g., 1×1×1, 1×1×2, 1×1×4, . . .). Square and cube decompositions represent typical
ways that applications decompose problems. The linear decomposition represents a less
efficient corner case.
To compare the real-world and low-cost benchmarks, we ran both on a Cray XC40.
This system comprises two compute node partitions. In the first, each compute node has
two sockets of Intel Xeon E5-2698 v3 (Haswell) processors operating at 2.3GHz. Each
processor has 16 cores with 2 hardware threads per core, for a total of 64 thread contexts per node.
In the second partition, each node has a single socket containing a 68 core Intel Xeon
Phi 7250 (Knights Landing) processor operating at 1.4GHz. Each core has 4 hardware
threads, for a total of 272 thread contexts. The nodes are connected via a Cray Aries
network. All tests were conducted on a single cabinet from one of the Cray XC40’s
partitions. Therefore the results are for a fully connected network (not a dragonfly, as
would be the case for communication that spanned multiple cabinets).
The low-cost and real-world benchmarks were executed on each partition, using one
MPI process per node, with each of the three types of decomposition. Each decomposition
was scaled up until the point oversubscription would occur on the receiver side. For the
low-cost benchmark, this means sender-side oversubscription is permitted. For the real-world
benchmark, this means that oversubscription is not permitted, since all processes
are both senders and receivers. For the square decomposition, data was collected for
5 and 9-point stencils; for the cube and linear decompositions, data was collected for
7 and 27-point stencils. Because internal nodes in a regular halo exchange have an
equal communication overhead, and assessing this overhead is the goal of the benchmark,
scaling beyond 9 nodes (for 2D decompositions) or 27 nodes (for 3D decompositions) is
not necessary. For each configuration, the benchmarks were run 50 times, with each run
executing two trials, i.e., two halo exchanges. All reported results are for the second of
the two trials, and all messages contained 8-byte payloads.
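The scaling procedure can be illustrated with a short sketch. It assumes the doubling rule implied by the example sequences given earlier (double the leftmost smallest dimension for square and cube decompositions, and only the z dimension for linear ones) and stops before the thread count would exceed the receiver's hardware thread contexts (64 on the Haswell nodes, 272 on the KNL nodes). The rule and the function name are inferred for illustration, not taken from the benchmark scripts.

```python
from math import prod


def scaled_decompositions(kind: str, contexts: int) -> list:
    """List decompositions of the given kind that fit in `contexts` thread contexts."""
    dims = {"square": [1, 1], "cube": [1, 1, 1], "linear": [1, 1, 1]}[kind]
    result = []
    while prod(dims) <= contexts:
        result.append("x".join(str(n) for n in dims))
        if kind == "linear":
            index = len(dims) - 1          # scale only along the z axis
        else:
            index = dims.index(min(dims))  # leftmost smallest dimension
        dims[index] *= 2
    return result


if __name__ == "__main__":
    for kind in ("square", "cube", "linear"):
        print(kind, "Haswell:", scaled_decompositions(kind, 64))
        print(kind, "KNL:    ", scaled_decompositions(kind, 272))
```

Under these assumptions each decomposition type yields seven configurations on Haswell and nine on KNL, matching the columns of Table 3.1.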
For the case study on offloaded message matching (Section 3.6), we ran the low-cost
benchmark on an ARM-based testbed, with offloaded matching disabled and enabled.
On this system, each compute node has two sockets. Each socket contains a 28-core
Cavium ThunderX2 CN9775 processor operating at 2GHz. The ARM testbed’s network utilizes Mellanox ConnectX-5 interfaces, which provide dedicated hardware assistance for MPI message matching [168]. This capability is accessed via OpenUCX [169].
For the experiments in this paper, we used UCX 1.7.0, configured with optimizations
(--enable-optimizations) and multithreaded support (--enable-mt) enabled. We
used the same version of MPI as in the previous experiments. However, because Open
MPI’s message matching is bypassed by UCX, our MPI instrumentation was also bypassed. Therefore, for these results, we report the time spent processing messages as
measured in the benchmark code. This includes a barrier (to ensure the sender cannot begin issuing messages before the timer is started) and a call to MPI_Waitall. In
addition to supporting hardware offloading, UCX also includes message matching optimizations in software (in the form of binning messages into 1021 bins according to
tag and source), and these optimizations are reflected in the reported times. Hardware
offloading of message matching is controlled via the UCX_RC_MLX5_TM_ENABLE environment variable. The default threshold recommended by Mellanox for engaging offloading
is 1024 bytes. For our experiments, we kept this default value, meaning that even when
offloading is enabled, it is not used for messages whose size is less than 1KiB.

3.5 Results

In this section, we compare results derived from the two benchmarks, confirming that
the low-cost benchmark provides a reasonable approximation of a genuine multithreaded
halo exchange while requiring fewer resources.

3.5.1 Resource usage

[Figure 3.3 (plot not reproduced): average core hours (left axis, 0.0 to 2.0) for the
real-world and low-cost benchmarks with a 27-point stencil, and the reduction in resource
usage as a ratio (right axis, 0x to 15x), plotted against thread decomposition (1x1x1,
2x1x1, 2x2x1, 2x2x2, 4x2x2, 4x4x2, 4x4x4).]

Figure 3.3: Average core hours for the real-world and low-cost multithreaded
halo exchange benchmarks executing a 27-point 3D halo exchange.
Figure 3.3 compares the average (over 30 runs) number of core hours for the real-world
and low-cost benchmarks executing a multithreaded 27-point, 3D halo exchange.
Note that results were obtained on 2.3GHz Intel Xeon E5-2698 v3 processors using Cray
MPICH v7.7.6. As the size of the thread decomposition increases, the low-cost benchmark
remains under 0.25 core hours, while the real-world benchmark approaches 2 core hours.
The reduction in resource usage afforded by the low-cost benchmark ranges from 9.1x to
16.1x relative to the real-world benchmark. This reduction in resource usage decreases as
the number of threads increases, because the amount of work each process in the low-cost
benchmark has to do increases faster than the work done by each of the processes in the
real-world benchmark. However, we expect the low-cost benchmark to be used at small
scale, mostly by MPI researchers and for MPI performance regression testing. MPI halo
exchange performance is often tested/monitored on supercomputers, which can now use
a lower-cost benchmark to achieve this goal.

[Figure 3.4 (plots not reproduced): six panels, one for each decomposition type (square,
cube, linear) on each architecture (KNL, Haswell). Each panel plots median total items
searched (log scale) against thread decomposition for the real-world and low-cost
benchmarks under each stencil.]

Figure 3.4: Median total number of items searched under the low-cost and
real-world benchmarks, for 2D (5 and 9-point stencils) and 3D (7 and 27-point
stencils) decompositions, on KNL (left) and Haswell (right) architectures.

3.5.2 Items searched

The total number of items searched during a halo exchange, as well as search depths
for the processing of individual messages, are useful for understanding the overheads
due to the non-deterministic behavior that is introduced by multi-threaded communication [133]. Consequently, we begin by contrasting total items searched as reported by
the low-cost and real-world benchmarks. Figure 3.4 summarizes this data for square,
cube, and linear decompositions for KNL and Haswell architectures. Data points represent median search depth (over 50 runs), and error bars are first and third quartiles.
Note that the y-axis is logarithmic, and differs between KNL and Haswell architectures.
Furthermore, the decomposition for the KNL architecture has two extra data points over
Haswell due to the greater number of available execution contexts.
Both benchmarks confirm the hypothesis that the non-deterministic ordering introduced by multithreaded communication leads to increased search depths [133]. For instance, under the real-world benchmark on Haswell, a 4x4x4 cube decomposition with
27-point stencil communication results in a total number of items searched that, on average, is 170.5 times larger than the ideal, where each search matches on the first item.
Comparing the two benchmarks, the mean absolute error across all stencils and decompositions, expressed as a ratio to the median reported by the real-world benchmark, is
16.4% (σ = 11.1). We observe that disagreement between the low-cost and real-world
benchmarks occurs under larger (9 and 27-point) stencils, and this typically involves the
low-cost benchmark underestimating the total number of items searched. For instance,
the average error of 5-point stencils versus 9-point stencils for square decompositions,
across both architectures, is 7.6% and 16.6%, respectively. Likewise, for 7-point stencils
versus 27-point stencils and cube decompositions, these values are 6.3% and 27.7%.
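The error metric used in this comparison can be sketched as follows: for each configuration, the absolute difference between the low-cost and real-world medians is expressed as a percentage of the real-world median, and these percentages are then averaged. The input values below are invented for illustration and are not measurements from this study; the use of the population standard deviation for σ is likewise an assumption.

```python
from statistics import mean, pstdev


def relative_errors(real_world: list, low_cost: list) -> list:
    """Absolute error of each low-cost median, as a percentage of the real-world median."""
    return [100.0 * abs(lc - rw) / rw for rw, lc in zip(real_world, low_cost)]


if __name__ == "__main__":
    # Hypothetical medians of total items searched, one entry per configuration.
    real_world_medians = [100, 200, 400]
    low_cost_medians = [90, 230, 380]

    errors = relative_errors(real_world_medians, low_cost_medians)
    print(f"mean absolute error: {mean(errors):.1f}% (sigma = {pstdev(errors):.1f})")
```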
Figure 3.5 offers a more detailed view of the data for three different cube decompositions on the Haswell architecture, selected because they clearly manifest this trend.
In these histograms, the x-axis specifies bins of numbers of items searched during the
processing of each incoming message during an exchange, and the y-axis the number of
searches of that depth, averaged across all 50 trials. In all three cases, there is strong
agreement between low-cost and real-world benchmarks for 7-point stencil communication.
In contrast, for 27-point stencils, the low-cost benchmark exhibits greater numbers of
shallower searches, and correspondingly fewer deep searches, in comparison to
the real-world benchmark. Furthermore, the maximum search depths for the real-world
benchmark consistently exceed those of the low-cost benchmark.

[Figure 3.5 (plots not reproduced): three histogram panels for Haswell, each plotting
number of searches against search depth for the real-world and low-cost benchmarks under
7-point and 27-point stencils: 2x2x1 cube (bins=25, bin size=3.68 items), 2x2x2 cube
(bins=30, bin size=5.07 items), and 4x2x2 cube (bins=30, bin size=9.07 items).]

Figure 3.5: Histograms showing average number of searches at each bin of items
searched.

[Figure 3.6 (plots not reproduced): six panels, one for each decomposition type (square,
cube, linear) on each architecture (KNL, Haswell). Each panel plots median total search
time in µsecs (log scale) against thread decomposition for the real-world and low-cost
benchmarks under each stencil.]

Figure 3.6: Median queue drain times for 2D (5 and 9-point stencils) and 3D (7
and 27-point stencils) decompositions, for KNL (left) and Haswell (right).
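The bin sizes quoted on the axes of Figure 3.5 (3.68, 5.07, and 9.07 items) are consistent with dividing the largest possible search depth, i.e., the 27-point message count from Table 3.1 (92, 152, and 272), by the number of bins. The sketch below shows one way such a histogram could be built; the exact binning convention used for the figure is not stated, so the treatment of bin edges here is an assumption, as are the function names.

```python
def bin_width(max_depth: int, bins: int) -> float:
    """Width of each histogram bin when [0, max_depth] is split evenly."""
    return max_depth / bins


def depth_histogram(depths: list, max_depth: int, bins: int) -> list:
    """Count search depths per bin; the deepest searches fall in the last bin."""
    width = bin_width(max_depth, bins)
    counts = [0] * bins
    for depth in depths:
        index = min(int(depth / width), bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    for max_depth, bins in ((92, 25), (152, 30), (272, 30)):
        print(max_depth, bins, round(bin_width(max_depth, bins), 2))

    # Hypothetical per-message search depths for a 2x2x1 cube, 27-point exchange.
    print(depth_histogram([1, 1, 2, 40, 92], max_depth=92, bins=25))
```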

3.5.3 Matching overheads

Determining search times can give a user an idea of how expensive MPI overhead is
for multiple decomposition strategies. Even close agreement between low-cost and real-world
benchmarks with respect to mean or median number of items searched does not
guarantee that the temporal overhead of searching the match lists is similar, because
the temporal costs of searching may not be linear in the number of items searched. Consequently, to assess how the low-cost
benchmark relates to the real-world version regarding processing times, we compare
time to process all items in the posted receive queue (i.e., ‘queue drain times’) for each
architecture, stencil, and decomposition. Results are summarized in Figure 3.6. As
before, data points represent medians, and error bars are first and third quartiles. The
y-axis is logarithmic and differs between architectures, and KNL has two additional data
points.
Previous work has observed that the costs of multi-threaded communication can be
prohibitively expensive, requiring more time for message processing than current scientific applications allocate for an entire compute-plus-communication cycle [133]. The results reported here confirm this observation across both benchmarks
and architectures. For example, on Haswell, 27-point communication on a 4x4x4 cube
decomposition exceeds 1 millisecond for queue processing.
Comparing the benchmarks, the mean absolute error across all stencils and decompositions (calculated as in Section 3.5.2) is 24.9% (σ = 18.6). However, as shown in
the subfigures, discrepancies between benchmarks are more pronounced for decompositions and stencils with fewer messages; as the number of messages increases,
disagreement decreases. For example, whereas the average error across 5-point stencils
is 36.0%, this decreases to 17.6% for 27-point cube decompositions, and to 16.7% for
27-point linear decompositions.
Figure 3.7 shows results from a square and two cube decompositions, all executed on
Haswell, chosen because they are representative of the trend towards agreement. As these
figures illustrate, a notable contributor to discrepancies between low-cost and real-world
results is outliers: on all stencils, for lower message counts, the low-cost benchmark has
more extreme slow outliers than the real-world benchmark, and this gap closes as the
number of messages increases. For example, for the 4x4 decomposition, the maximum
real-world search times under 5 and 9-point stencils are 436ns and 828ns, respectively,
while the corresponding low-cost values are 12686ns and 9367ns, exhibiting gaps of one
or two orders of magnitude. For the 4x2x2 cube decomposition, the maximum real-world
7-point search time is 909ns while the low-cost is 7213ns, and for 27-point these are 11051ns
and 12941ns. Finally, for the 4x4x4 cube decomposition, the maximums are 6251ns vs.
7817ns (7-point) and 23080ns vs. 24061ns (27-point), indicating that by this point, the
gaps have closed significantly.

[Figure 3.7 (plots not reproduced): three histogram panels for Haswell, each plotting mean
number of searches against search time (µs) for the real-world and low-cost benchmarks
under each stencil, with the maximum search time of each series annotated: 4x4 square
(bins=150, size=0.08 µs), 4x2x2 cube (bins=150, size=0.09 µs), and 4x4x4 cube (bins=150,
size=0.16 µs).]

Figure 3.7: Histograms showing average time per search for selected decompositions. Annotations show the minimum and maximum search times for real-world
(rw) and low-cost (lc) benchmarks, for each stencil.
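The narrowing of the outlier gap can be made explicit by computing the ratio of the low-cost maximum to the real-world maximum for each configuration quoted above. The sketch below uses only the maximum search times reported in the text; the dictionary layout and names are chosen for illustration.

```python
# Maximum search times in nanoseconds: (real-world, low-cost), as quoted in the text.
MAX_SEARCH_NS = {
    ("4x4", "5pt"): (436, 12686),
    ("4x4", "9pt"): (828, 9367),
    ("4x2x2", "7pt"): (909, 7213),
    ("4x2x2", "27pt"): (11051, 12941),
    ("4x4x4", "7pt"): (6251, 7817),
    ("4x4x4", "27pt"): (23080, 24061),
}


def outlier_ratios(table: dict) -> dict:
    """Ratio of low-cost to real-world maximum search time per configuration."""
    return {key: low_cost / real_world for key, (real_world, low_cost) in table.items()}


if __name__ == "__main__":
    for (decomposition, stencil), ratio in outlier_ratios(MAX_SEARCH_NS).items():
        print(f"{decomposition:6s} {stencil:5s} low-cost/real-world = {ratio:.2f}x")
```

The ratios fall from roughly 29x and 11x for the 4x4 square decomposition to within a few percent for the 27-point stencil on the 4x4x4 cube.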

3.5.4 Discussion

Previous work has studied the impact of multithreading on message queues and processing times on a KNL system [133]. As part of the present work, we’ve reproduced
that study. While our results support the conclusion that multithreaded communication
leads to increased search depths and potentially problematic message processing times,
observed message processing times trend higher than in that earlier work. This is likely
due to patches mitigating the Spectre/Meltdown exploit, a variant of which affects look-ahead pointer dereferencing [170]. In contrast to the data presented here, results gathered for the earlier
work were obtained prior to the exploit being patched. Unfortunately we were unable to
test this hypothesis due to the unavailability of unpatched nodes.
As noted above, the results indicate a user of the low-cost benchmark should be aware
that the benchmark may, under larger message counts, underestimate the number of items
searched during message processing. This is not unexpected: whereas in the low-cost
benchmark, the non-deterministic order of message arrivals at the receiver is due entirely
to thread-level competition at the single sender, the real-world benchmark adds network-induced non-determinism, i.e., variation in message arrivals due to taking different routes
through the network. We hypothesize that as adaptive routing gains traction in HPC,
this divergence between benchmarks will decrease, assuming processes participating in
the low-cost benchmark are sufficiently far removed in the network topology.
Finally, the user should also keep in mind that the low-cost benchmark can overestimate message processing time for smaller message counts in comparison to the real-world
benchmark. Again, this is not unexpected: because there is only a single sender, in the
low-cost benchmark messages arrive at a higher rate than in the real-world benchmark,
where arrival times are diluted by having multiple senders. This means in the low-cost
benchmark, more memory accesses are occurring in a smaller window of time, introducing
contention and leading to the outliers noted in Figure 3.7. As the number of messages increases, the real-world benchmark trends towards the behavior exhibited by the low-cost
benchmark.

3.6 Case study: assessing hardware offloaded message matching

As described in Section 3.4, we ran the low-cost benchmark on a ConnectX-5 system with
hardware matching enabled and disabled. Because this system bypasses our instrumented
MPI, we record the time spent processing incoming messages within the benchmark
rather than within MPI. Furthermore, since offloading is only active above a certain
threshold of message size (1024 B, the default), we also vary message sizes to span this
transition point (from 8 B to 1 MiB). For each set of parameters, the benchmark was
executed 50 times, with 11 emulated halo exchanges per run; the data presented here
discards the 1st exchange from each run, for a total of 500 trials.
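The parameter sweep described above can be summarized in a few lines. The sketch assumes the message sizes are 8 B followed by powers of two from 512 B to 1 MiB (consistent with the axis of Figure 3.9), and that offloading is engaged only when it is enabled and the message size is at least the 1024 B threshold. The names are illustrative and are not UCX or MPI identifiers.

```python
OFFLOAD_THRESHOLD_BYTES = 1024      # default threshold kept in the experiments
RUNS = 50
EXCHANGES_PER_RUN = 11
DISCARDED_PER_RUN = 1               # the first exchange of each run is discarded


def message_sizes() -> list:
    """8 B, then powers of two from 512 B (2^9) to 1 MiB (2^20)."""
    return [8] + [2 ** exponent for exponent in range(9, 21)]


def offload_engaged(size_bytes: int, enabled: bool,
                    threshold: int = OFFLOAD_THRESHOLD_BYTES) -> bool:
    """Hardware matching is used only when enabled and the message is large enough."""
    return enabled and size_bytes >= threshold


if __name__ == "__main__":
    trials = RUNS * (EXCHANGES_PER_RUN - DISCARDED_PER_RUN)
    print("trials per parameter set:", trials)
    for size in message_sizes():
        print(size, offload_engaged(size, enabled=True))
```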
Results for three message sizes – small (512 B), medium (16 KiB), and large (1 MiB) –
are presented in Figure 3.8. The left column shows results for square decompositions, and
the right for cube decompositions. Plotted values are medians, and error bars represent
the 1st and 3rd quartiles.
As expected, when message sizes are below the threshold for engaging offloaded matching (<1024 B), processing time is not affected by hardware matching being enabled
(Figure 3.8, row 1). When message sizes are modestly above the threshold (e.g., 16
KiB), hardware matching accelerates processing across most square and cube decompositions and stencils (row 2). Within each type of decomposition, speedup is more
pronounced for larger stencil sizes (larger numbers of messages). For square decompositions, the average speedup for the 5-point stencil is 1.13×, and for the 9-point stencil it
is 1.23×. Similarly, for cube decompositions, the average speedup for the 7-point stencil
is 1.21× while that for the 27-point is 1.39×.

[Figure 3.8 (plots not reproduced): six panels plotting median processing time (µs)
against thread decomposition, with hardware matching enabled (HWM+) and disabled (HWM−)
for each stencil. Rows correspond to 512B, 16KiB, and 1MiB messages; the left column
shows square decompositions (5pt, 9pt) and the right column cube decompositions (7pt,
27pt).]

Figure 3.8: Median time spent processing messages with hardware matching
enabled (HWM+) and disabled (HWM−), for square and cube decompositions
and small (512B), medium (16KiB), and large (1MiB) messages.

[Figure 3.9 (plots not reproduced): two panels plotting median processing time (µsecs,
log scale) against message size (8 B, then 512 B to 1048576 B), with hardware matching
enabled (HWM+) and disabled (HWM−): square decomposition with a 9pt stencil (1x1, 2x2,
4x4) and cube decomposition with a 27pt stencil (1x1x1, 2x2x1, 4x4x2).]

Figure 3.9: Median time spent processing messages by message size, square and
cube decompositions, with hardware matching enabled (HWM+) and disabled
(HWM−).
However, our benchmark also reveals (Figure 3.8, row 3) that when messages become
large (e.g., 1 MiB), hardware-assisted message matching may actually slow down message
processing relative to software message matching. For 1 MiB messages, the average
slowdown is remarkably consistent across types of decomposition and stencils: 1.85× for
5-point square, 1.87× for 9-point square, 1.87× for 7-point cube, and 1.88× for 27-point
cube. These results suggest that, on this system, there exists a ‘window of effectiveness’
of hardware offloading as regards handling the overheads incurred by multithreaded communication patterns.
To better characterize this window, we plot processing time across all message sizes
for selected square and cube decompositions in Figure 3.9 (note the gap between 8 B and
512 B). We observe that offloaded matching typically reduces processing time for message
sizes beginning near the default threshold of 1024 B, but this benefit disappears between
32 KiB and 64 KiB, at which point offloaded matching incurs additional overheads in
comparison to not using offloading. This threshold effect is observed regardless of the
number of messages being processed.
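One way to locate such a window from measurements is to compute, for each message size, the ratio of the median processing time without offloading to the median time with offloading, and to keep the sizes where that ratio exceeds one. The sketch below illustrates the calculation; the timing values are invented placeholders, not data from this study, and the function name is an assumption.

```python
def effective_window(times_us: dict) -> list:
    """Return message sizes where hardware matching is faster than software matching.

    times_us maps message size in bytes to (hwm_enabled_us, hwm_disabled_us).
    """
    window = []
    for size, (enabled_us, disabled_us) in sorted(times_us.items()):
        speedup = disabled_us / enabled_us   # > 1 means offloading helps
        if speedup > 1.0:
            window.append(size)
    return window


if __name__ == "__main__":
    # Hypothetical median processing times (µs), for illustration only.
    hypothetical = {
        512: (100.0, 100.0),
        1024: (90.0, 100.0),
        16384: (80.0, 100.0),
        32768: (95.0, 100.0),
        65536: (150.0, 100.0),
        1048576: (188.0, 100.0),
    }
    print(effective_window(hypothetical))
```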

3.6.1 Discussion

These results confirm the potential benefits of hardware-assisted message matching in
handling the increased overhead of multithreaded applications. However, they also suggest that there are situations where offloading may be detrimental to application performance (e.g., when message sizes exceed 32 KiB).
Previous work by Marts et al. [168] has also considered the impact of ConnectX-5
message matching offloading. The results presented here are consistent with this earlier
study. Marts et al. saw similar performance benefits that were limited to a window of
message sizes between 1 KiB and 16 KiB. For messages larger than 16 KiB, they showed
that hardware-assisted message matching was slower than software message matching.
The same effect is seen in Figure 3.9. These results are also consistent with Mellanox’s
default threshold which limits hardware-assisted message matching to messages that are
larger than 1024 B.
Marts et al. also observed that even for message sizes within this window, UCX tag
binning collisions can decrease the performance of hardware message matching. Therefore, the fact that we observe a speedup for these medium-sized messages suggests that
our benchmark is balancing its tags across the bins effectively. Provided that a user sends
messages in this ideal size window and ensures that their tag use is consistent with low
binning collisions, Mellanox’s hardware offloading has the potential to alleviate some of
the aforementioned concerns in MPI matching overhead.

3.7 Conclusion

Because of the centrality of MPI to HPC, vendors are now offering NICs that are ‘smart’
in the traditional sense of providing hardware support for MPI message matching. A
goal of the present work is to assess what these sorts of offloaded capabilities can do,
especially with respect to future programming paradigms.
To this end, in this chapter we’ve presented the design and implementation of two
benchmarks for assessing the potential performance impact of multithreaded MPI communication (using MPI_THREAD_MULTIPLE) under common 2D and 3D halo exchanges.
The ‘real-world’ benchmark implements a full halo exchange using 9 or 27 nodes, while
the ‘low-cost’ version builds on the analysis of thread and message counts provided in
Section 3.3 to emulate the exchange using only two processes. Results from both benchmarks executed on multiple architectures show the two are comparable with respect to
number of items searched and time spent searching.
The benchmark results confirm and extend earlier work suggesting that multithreaded
halo exchanges can incur unacceptable latencies, severely impacting application performance. For example, Schonbein et al. [133] note that on current systems, molecular
dynamics applications (e.g., [171]) typically budget approximately 10^5 to less than 10^3 µs
per simulated femtosecond; this includes a complete halo exchange. The results reported
above reveal multiple scenarios where message matching overhead alone can exceed the iteration’s entire time budget. Taken together with other known inefficiencies in implementations of multithreaded MPI (e.g., [160]), these results provide additional motivation for
exploring other uses for multithreading in MPI, such as partitioned communication [161].
Finally, the offloading case study indicates that hardware-based matching can be
effective at reducing message processing latencies, even under the increased message
counts incurred by multithreaded communication. However, message processing times
remain high, and (at least in the case of the specific hardware considered in this study)
larger message sizes may actually incur additional overheads when hardware offloading
is enabled.
These results provide a more nuanced understanding of SmartNIC capabilities as
regards offloaded message matching. In the following chapters, we show that hardware-based
message matching has another potential future life in HPC, besides coping with
multithreaded communication: when bundled with other offloaded network applications
typical of HPC, message matching enables NICs whose offloading capabilities were not
general, in the sense invoked by the contemporary notion of a SmartNIC, to be ‘made
intelligent’, i.e., made capable of executing arbitrary programs to assist host applications,
or even of executing independent programs.

Chapter 4
INCA: In-Network Compute Assistance
In the previous chapter, we explored what a particular offloaded ‘smart’ capability –
message matching – can contribute to the emerging paradigm of multithreaded communication in scientific computing. In this chapter, we consider what offloaded message
matching can accomplish when coupled with other hardware-assisted smart capabilities.
State-of-the-art NICs deployed in HPC systems provide offloaded support for several
network applications of critical importance to scientific workflows. These include support for MPI message matching (as explored in Chapter 3), and for collective operations
such as barriers, broadcasts, gathers, and reductions. NICs supporting the OpenFabrics
standard [172], such as those based on the Portals network programming API [3], [35], provide this offloaded support through three basic capabilities. The first is direct support for
message matching, e.g., via TCAMs [32]. Support for collectives, including reductions and
barriers, is provided through two additional offloaded capabilities: triggered operations
and atomic operations. A NIC supporting triggered operations can autonomously generate outgoing messages in response to events (e.g., buffer updates) caused by incoming
traffic. Triggered operations thus provide primitives for constructing offloaded collectives and rendezvous messaging [102], [103]. A NIC supporting atomic operations can
perform basic arithmetic or logical operations (e.g., summation, compare-and-swap) on
message payloads. In conjunction with triggered operations, atomic operations facilitate
the offloading of reductions, distributed mutexes, and so on.
Considered individually, each of these offloaded capabilities renders the NIC ‘smart’ in
the traditional sense of the term (Chapter 2). However, in this chapter, we show how these
three task-specific capabilities – message matching, atomic operations, and triggered
operations – can be leveraged to make a NIC intelligent in the contemporary sense, i.e.,
being capable of executing arbitrary, user-defined programs. That is, offloaded general-purpose compute capabilities do not require appeals to CPUs or other general-purpose
hardware; existing task-specific capabilities common to HPC NICs may be sufficient.
The demonstration proffered in this chapter provides the formal basis for INCA (In-Network Compute Assistance), a framework for SmartNIC offloading that is further
developed in subsequent chapters. INCA is unique in the SmartNIC landscape in that
it offers deadline-free kernel execution while nonetheless utilizing on-path processing elements. In the terminology of Chapter 2, INCA is in-path. To our knowledge, while the
strategy of recirculating data through an on-path architecture makes rare appearances in
the literature (e.g., [126]), INCA is the first contemporary SmartNIC design to explicitly
adopt an in-path architecture.¹
We begin with an overview of the Portals message processing pipeline in Section 4.1.
In Section 4.2 we make the strategy for securing general-purpose compute capabilities
concrete by presenting the Triggered Operation Machine (TOM), a formal model of
computation based on the primitive capabilities of message demultiplexing, triggered
operations, and atomic operations. We prove the TOM is Turing complete by reducing
the well-known universal register machine model of computation [173] to the TOM in
Section 4.3. The practical outcome of this proof is a clear specification of a set of basic
changes to existing Portals NICs sufficient to secure general-purpose compute capabilities;
these are discussed in Section 4.4.

¹ Some of the results reported in Chapters 4, 5, 6, and 7 appear in [132].

4.1 The big picture

NICs supporting the OpenFabrics standard [172], such as those adopting the Portals
network programming API [3], typically provide hardware support for three primitive
network applications: (1) message demultiplexing (cf. Chapter 3), (2) triggered operations, and (3) atomic operations.
[Figure 4.1 diagram. Labels: msg (incoming), Portals Table, PTE, Match List, ME (buf_i, buf_j, buf_k), atomic Op, counter++, Triggered Operations (θ), msg (outgoing); steps numbered 1-5.]
Figure 4.1: The Portals message processing pipeline. Numbers indicate steps in
the processing pipeline, as described in the main text.
Figure 4.1 provides a data-structure view of the Portals message processing pipeline,
including these functions. (1) An incoming message matches against a list of Portals
table entries (PTEs), and then (2) against a list of matching elements (MEs) attached
to that PTE. In this illustration, the incoming message matches the second ME. Each
ME specifies a local buffer that is the destination for the payload of the incoming message. When the incoming message matches the ME, an optional atomic operation (3) is
performed on the message payload and (in the case of a binary operation) the contents
of the local buffer specified by the ME, and the result stored in that buffer. (4) The
act of writing to the buffer increments a counter attached to that buffer. Finally, (5) a
table stores a list of operations along with the counter thresholds at which point they
are triggered. If the counter is greater than or equal to a given operation’s threshold, the
operation is triggered, and an outgoing message generated.
[Figure 4.2 diagram. Labels: Triggered operation, Matching, Atomic, ++ (counter increment).]
Figure 4.2: Program execution under INCA.
The basic insight behind INCA is that the three primitive operations of triggering
messages, demultiplexing incoming messages by matching tags, and applying atomic operations, when taken together, can be interpreted as executing an instruction. Figure 4.2
illustrates how this is the case. First, a triggered operation generates a message, perhaps
containing a payload. Second, the message matches against a PTE/ME. When the message is matched, we have (i) an operand, i.e., the contents of the buffer specified by the
ME; (ii) an (optional) second operand, contained in the payload of the triggered message,
and (iii) an operation to apply to one or both of those operands, which (in a departure from
current Portals semantics) we assume is also specified by the ME. The atomic unit then
executes the operation, writing the results to the buffer indicated by the ME. In this way,
the sequence of generating a message, matching, and performing an atomic operation can
be viewed as the execution of an instruction.
The counter provides the means to sequence instructions into programs by serving as
an analog to a traditional program counter. Let a sequence of instructions be ordered
by the triggering threshold of each, and suppose all of these instructions share the same
counter. Then, as each instruction executes, its completion triggers the next instruction
in the program.
In sum, the basic strategy behind INCA is to treat triples of triggered messages,
matching elements, and atomic operations as instructions. By sharing a common
counter, these instructions can be sequenced to execute a program.
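
A minimal Python sketch of this strategy follows. It is illustrative only and is not Portals code: the names `Instr` and `run`, the buffer names, and the convention that the counter starts at 1 and fires only the instruction whose threshold equals it are all assumptions introduced here. Each instruction bundles a trigger threshold, the buffer named by its ME, an optional payload, and an atomic; the shared counter steps through them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instr:
    threshold: int                       # counter value at which the message is triggered
    dest: str                            # buffer named by the ME (first operand, overwritten)
    payload: Optional[str]               # buffer sent as the message payload (second operand)
    op: Callable[[float, float], float]  # atomic operation applied when the message matches


def run(program: List[Instr], buffers: Dict[str, float]) -> Dict[str, float]:
    by_threshold = {ins.threshold: ins for ins in program}
    counter = 1                          # shared counter, playing the role of a program counter
    while counter in by_threshold:
        ins = by_threshold[counter]      # the triggered operation fires and its message matches
        second = buffers[ins.payload] if ins.payload is not None else 0.0
        buffers[ins.dest] = ins.op(buffers[ins.dest], second)  # atomic writes to the ME buffer
        counter += 1                     # the write bumps the counter, triggering the next one
    return buffers


bufs = run(
    [Instr(1, "x", "y", lambda a, b: a + b),   # x = x + y
     Instr(2, "x", "z", lambda a, b: a * b)],  # x = x * z
    {"x": 2.0, "y": 3.0, "z": 4.0},
)
print(bufs["x"])  # 20.0
```

The two instructions share one counter, so the completion of the addition is what releases the multiplication; no host involvement is modeled between them.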

4.2 The triggered operation machine

In this section we present a formal model capturing the intuitive notion of program execution given above: the Triggered Operation Machine (TOM). The TOM stands to INCA
as, e.g., the traditional register machine stands to typical RISC-based program execution. Specifically, first, it demonstrates that when properly organized, these primitives
are indeed capable of general-purpose computing (i.e., are Turing complete); and second,
by specifying the basic capabilities required of any practical realization, the model serves
as a guide for what is required of any hardware implementation (Section 4.4).

[Figure 4.3 diagram. Labels: message m_j (in), Interpret, Atomic Unit, trigger, Interpret, Triggering Unit, message m_k (out).]
Figure 4.3: Message processing flow for the triggered operation machine. An
incoming message (m_j) carrying an optional operand is interpreted (‘matched’)
by the Atomic Unit, and an atomic operation applied. This event updates the
contents of a trigger, which, when interpreted by the Triggering Unit, causes a
new message (m_k) to be generated.
Formally, a TOM comprises two components. First, a finite set of atomic unit (AU)
entries. Second, a finite set of triggering unit (TU) entries. These sets are then associated
through messages (which link TU entries to AU entries) and triggers (which link AU
entries to TU entries). Figure 4.3 shows the model. An incoming message (m_j) is
interpreted (‘matched’) against the AU entries, and an atomic operation applied. This
event updates a trigger, which, when interpreted against the TU entries, generates a new
message, and the process repeats.
An incoming message m_j is a pair of an operand (o_j) and a tag (µ_j). Each member of
the AU set is a 4-tuple comprising a tag (µ_i), an operand (o_i), an atomic operation (α_i),
and a trigger (R_i). Finally, each member of the TU set is also a 4-tuple, each comprising
a trigger (R_k), a threshold (θ_k), an operand (o_k), and a tag (µ_k). This notation is
summarized in Table 4.1.

Symbol                        Meaning
o_i                           Operand location
O                             Set of operand locations o_0, o_1, ...
⟨o_i⟩                         Contents of operand o_i
µ_i                           Tag
m_j = (o_j, µ_j)              Message
α                             Atomic operation (unary or binary)
R_i                           Trigger
a_i = (o_i, µ_i, α, R_i)      Atomic unit (AU) tuple
A                             Set of AU tuples
θ_k                           Trigger threshold
t_k = (θ_k, R_k, o_k, µ_k)    Triggering unit (TU) tuple
T                             Set of TU tuples

Table 4.1: Triggered operation machine (TOM) notation.

In this notation, operands and triggers are locations, and angled brackets designate
the contents of those locations. Let the set of tuples of the AU be A. Given an incoming
message m_j = (o_j, µ_j), the operation of the AU can be represented as updating the state
of all operands o_i referenced by the members of A:

\[
\forall a_i \in A : \langle o_i \rangle \leftarrow
\begin{cases}
\alpha(\langle o_i \rangle, \langle o_j \rangle) & \mu_i = \mu_j \\
\langle o_i \rangle & \text{otherwise}
\end{cases}
\tag{4.1}
\]

Unary operations are accommodated by operating only on o_i, ignoring o_j. Note that
‘matching’ on a tag or set of match bits (e.g., on a PTE and ME in Portals as in
Figure 4.1) is integrated into this piecewise function as the condition µ_i = µ_j.
The contents of the trigger specified in the AU tuple are updated according to some
function; since the present goal is to have a trigger emulate a program counter that is
updated after each instruction, we use the successor function to count the number of
atomics initiated using the same trigger, where R is the set of triggers:

\[
\forall R_i \in R : \langle R_i \rangle \leftarrow
\begin{cases}
\langle R_i \rangle + 1 & \mu_i = \mu_j \\
\langle R_i \rangle & \text{otherwise}
\end{cases}
\tag{4.2}
\]


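Finally, letting T be the set of 4-tuples of the TU, the operation of the TU can then be
represented as updating the state of out based on the status of the triggers identified
by each tuple:

\[
\forall t_k \in T : \langle \mathit{out} \rangle \leftarrow
\begin{cases}
(o_k, \mu_k) & \theta_k = \langle R_k \rangle \\
\varnothing & \text{otherwise}
\end{cases}
\tag{4.3}
\]
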

Note that these equations are slightly underconstrained; this is a deliberate attempt to
anticipate future variations on the model. For present purposes, let us further constrain
the definition of a TOM as follows. First, all tags µ_i in the members of the AU set
are unique (i.e., multiple atomic operations cannot be induced by the same incoming
message). Second, the set of triggers R is unary, containing only a single trigger, the
‘program counter’. Third (as is implied by the previous restriction), all members of the
AU set specify the sole member of R. Finally, all thresholds θ_k in the members of the TU
set are also unique (i.e., multiple messages cannot be generated off of the same trigger
update).
Given these definitions, a token TOM is a set of AU tuples, a set of TU tuples, and
a rule for updating the sole trigger (namely, succession).
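
The constrained model can be written out as a short Python sketch. This is a hedged illustration, not the author's implementation: the function names (`au_step`, `tu_step`, `run_tom`), the use of a dictionary for operand locations, the convention that a message carries the name of an operand location, and the `max_steps` guard are assumptions made here. `au_step` follows Equations 4.1 and 4.2, and `tu_step` follows Equation 4.3 with strict equality between threshold and trigger.

```python
from typing import Callable, Dict, List, Optional, Tuple

Message = Tuple[str, str]                             # (operand location o_j, tag mu_j)
AUEntry = Tuple[str, str, Callable[[int, int], int]]  # (o_i, mu_i, alpha); sole trigger implicit
TUEntry = Tuple[int, str, str]                        # (theta_k, o_k, mu_k); sole trigger implicit


def au_step(au: List[AUEntry], mem: Dict[str, int], trigger: int, msg: Message) -> int:
    """Eqs. 4.1 and 4.2: apply alpha where the tags match, then increment the trigger."""
    o_j, mu_j = msg
    for o_i, mu_i, alpha in au:
        if mu_i == mu_j:                  # tags are unique, so at most one entry matches
            mem[o_i] = alpha(mem[o_i], mem[o_j])
            trigger += 1
    return trigger


def tu_step(tu: List[TUEntry], trigger: int) -> Optional[Message]:
    """Eq. 4.3: emit (o_k, mu_k) for the entry whose threshold equals the trigger."""
    for theta_k, o_k, mu_k in tu:
        if theta_k == trigger:            # thresholds are unique, so at most one entry fires
            return (o_k, mu_k)
    return None


def run_tom(au, tu, mem, trigger=1, max_steps=1000):
    for _ in range(max_steps):            # guard added for the sketch; not part of the model
        msg = tu_step(tu, trigger)
        if msg is None:                   # no threshold equals the trigger: the machine stops
            break
        trigger = au_step(au, mem, trigger, msg)
    return mem, trigger


add = lambda a, b: a + b
au = [("acc", "t1", add), ("acc", "t2", add)]
tu = [(1, "one", "t1"), (2, "one", "t2")]
mem, trigger = run_tom(au, tu, {"acc": 0, "one": 1})
print(mem["acc"], trigger)  # 2 3
```

Starting the trigger at 1 is a choice made for this sketch so that the entry with threshold 1 fires first.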

4.3 The Turing completeness of TOM

The definition of the TOM model given above does not specify the set of available atomic
operations. Here, we establish the Turing completeness of the TOM model, and in
doing so, simultaneously determine a sufficient set of atomics. Moreover, as discussed
in Section 4.4, the proof has implications for how current hardware can be extended to
secure arbitrary program execution.
Our demonstration proceeds by reducing the well-known universal register machine
(URM) model to TOM [173]. A simple URM is defined as follows. First, let R =
{r_0, r_1, ...} be an unbounded set of registers, each capable of storing a member of ℕ_0.
Second, the behavior of the machine is determined by a finite list of indexed instructions
I = {i_1, i_2, ..., i_k}, each of which is one of the following four possibilities, where r_i, r_j ∈ R:
1. Zero: Z(r_i): ⟨r_i⟩ ← 0.
2. Successor: S(r_i): ⟨r_i⟩ ← ⟨r_i⟩ + 1.
3. Transfer: T(r_i, r_j): ⟨r_i⟩ ← ⟨r_j⟩.
4. Jump: J(r_i, r_j, q), where i_q ∈ I: letting p be the index of the current instruction,
if ⟨r_i⟩ = ⟨r_j⟩, jump to instruction i_q; otherwise, proceed to instruction i_{p+1}.
For all instructions but jump, the instruction index is incremented by one after the
instruction is executed. The machine begins at instruction i_1, and continues until there
is no instruction with the current index to execute. The result of the computation is the
contents of the registers.
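
The URM definition above is small enough to execute directly. The following Python sketch is an illustration under stated assumptions: the tuple encoding of instructions, the names `run_urm` and `ADD_PROGRAM`, and the `max_steps` guard are introduced here and do not come from [173]. The example program adds r_1 to r_0 using r_2 as a loop counter; its final instruction is a no-op transfer so that every jump target is a member of I.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def run_urm(program: List[Tuple], init: Dict[int, int], max_steps: int = 10_000) -> Dict[int, int]:
    r = defaultdict(int, init)            # unbounded registers, each defaulting to 0
    p = 1                                 # instructions are indexed from 1
    for _ in range(max_steps):            # guard added for the sketch
        if not 1 <= p <= len(program):    # no instruction with this index: halt
            return dict(r)
        ins = program[p - 1]
        kind = ins[0]
        if kind == "Z":
            r[ins[1]] = 0
        elif kind == "S":
            r[ins[1]] += 1
        elif kind == "T":
            r[ins[1]] = r[ins[2]]         # <r_i> <- <r_j>, as defined in the text
        elif kind == "J":
            if r[ins[1]] == r[ins[2]]:
                p = ins[3]                # jump: the index is set rather than incremented
                continue
        else:
            raise ValueError(f"unknown instruction {ins!r}")
        p += 1
    raise RuntimeError("step limit reached")


ADD_PROGRAM = [
    ("Z", 2),        # 1: r2 <- 0
    ("J", 1, 2, 6),  # 2: if r1 == r2, jump to 6
    ("S", 0),        # 3: r0 <- r0 + 1
    ("S", 2),        # 4: r2 <- r2 + 1
    ("J", 0, 0, 2),  # 5: unconditional jump to 2
    ("T", 2, 2),     # 6: no-op, after which the machine runs off the end
]
print(run_urm(ADD_PROGRAM, {0: 3, 1: 4}))  # {0: 7, 1: 4, 2: 4}
```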
URMs are reduced to the TOM as follows. First, for each URM register r_0, r_1, ...
designate a corresponding TOM operand location o_0, o_1, .... Second, let all AU and TU
tuples utilize the same trigger, R_pc. ⟨R_pc⟩ can be viewed as a program counter storing
the index of the current URM instruction in the list of instructions comprising a URM
program. Third, each TOM operation has two parts: an entry in the TU and a matching
entry in the AU (connected by the message generated by the TU). A TOM instruction
i_j is therefore a pair (t_k, a_i) = ((θ_k, R_pc, o_k, µ_k), (o_i, µ_i, α, R_pc)). This can be simplified
as follows. First, since the tag for a message cannot vary between the AU tuple and
the TU tuple, µ_i = µ_k, so the tag can be dropped. Second, since all instructions use the same
trigger, it is also omitted. Finally, rearranging terms for clarity and dropping superfluous
parentheses gives a simplified definition: i_j = (θ_k, o_i, o_k, α). Fourth, a TOM program is
defined as a list of n ≥ 1 TOM instructions i_1, i_2, ..., i_n beginning with θ_k = 1, and
concluding with θ_k = n. Since under this definition, ⟨θ_k⟩ = j, the TU threshold can be
omitted, and instruction j of a TOM program – corresponding to instruction j of a URM
program – is represented simply as: i_j = (o_i, o_k, α).
The remainder of the reduction involves the selection of an appropriate set of primitive
operations for possible values of α. Recall incoming messages are PUTs from location o_j
to location o_i, where the contents of the latter are replaced with the result of some
atomic operation performed on one or both operands. Consequently, if α is identity, the
result is merely ⟨o_i⟩ ← ⟨o_j⟩. Therefore, for each URM instruction of the form Z(r_i), let
the corresponding TOM instruction be (o_i, 0, =); and for each URM instruction of the
form T(r_i, r_k), let the corresponding TOM instruction be (o_i, o_k, =).
Addition is a standard atomic operation in current hardware, so the URM successor
function can be simulated by letting α be addition, and setting ⟨o_k⟩ = 1. Each URM
instruction of the form S(r_i) thus has a corresponding TOM instruction (o_i, 1, +).
To reduce the URM jump instruction, first note that compare-and-swap is a typical
atomic operation, so testing whether the contents of two operand locations are equal is
a reasonable atomic operation for the TOM. Consequently, the equality test of a URM
jump can be simulated by a TOM instruction of the form (o_i, o_k, ==), where ⟨o_i⟩ == ⟨o_k⟩
is defined as

\[
\langle o_i \rangle \leftarrow
\begin{cases}
1 & \langle o_i \rangle = \langle o_k \rangle \\
0 & \text{otherwise}
\end{cases}
\]

Second, since triggers are ‘registers’ in the same
sense as operand locations, we define a special atomic operation that takes advantage of
this fact: [o_i, q, > 0], interpreted as follows:

\[
\langle R_{pc} \rangle \leftarrow
\begin{cases}
q & \langle o_i \rangle > 0 \\
\langle R_{pc} \rangle + 1 & \text{otherwise}
\end{cases}
\tag{4.4}
\]

Note that since each URM jump results in two TOM instructions, the values of q and j
for all instructions must be adjusted appropriately. To simplify exposition, we exclude
these details.

          URM instruction        TOM instruction(s)
Zero      Z(r_i)            ⇒    (o_i, 0, =)
Successor S(r_i)            ⇒    (o_i, 1, +)
Transfer  T(r_i, r_k)       ⇒    (o_i, o_k, =)
Jump      J(r_i, r_k, q)    ⇒    (o_i, o_k, ==)
                                 [o_i, q, > 0]

Table 4.2: Summary of the reduction of URMs to TOM.

Table 4.2 summarizes the reduction of URMs to TOMs. This reduction tells us
the TOM model can be made Turing complete if, at a minimum, it supplies atomic
operations for identity (i.e., PUT), addition, and equivalence, as well as the capacity to
conditionally modify the contents of a location (the trigger) depending on whether the
contents of another are greater than zero. In the next section, we use these requirements
to guide the construction of INCA, a concrete implementation of the TOM model within
the Portals networking API.
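
The mapping in Table 4.2, together with the index adjustment mentioned above, can be sketched as a small translator. This is one possible way to fill in the adjustment and is an assumption of this sketch, as are the function name `urm_to_tom` and the tuple and string encodings: each URM instruction is assigned the TOM index at which its translation begins, and every jump target q is rewritten to that index. The sketch only translates; it does not execute the resulting TOM program, and it assumes every jump target is an instruction in I.

```python
from typing import Dict, List, Tuple


def urm_to_tom(program: List[Tuple]) -> List[Tuple]:
    # First pass: the TOM index at which each URM instruction's translation starts.
    start: Dict[int, int] = {}
    next_index = 1
    for idx, ins in enumerate(program, start=1):
        start[idx] = next_index
        next_index += 2 if ins[0] == "J" else 1   # a jump expands to two TOM instructions

    # Second pass: emit TOM instructions following Table 4.2.
    tom: List[Tuple] = []
    for ins in program:
        kind = ins[0]
        if kind == "Z":
            tom.append((f"o{ins[1]}", "0", "="))
        elif kind == "S":
            tom.append((f"o{ins[1]}", "1", "+"))
        elif kind == "T":
            tom.append((f"o{ins[1]}", f"o{ins[2]}", "="))
        elif kind == "J":
            i, k, q = ins[1:]
            tom.append((f"o{i}", f"o{k}", "=="))       # equality test
            tom.append((f"o{i}", start[q], ">0"))      # conditional update of the trigger
        else:
            raise ValueError(f"unknown instruction {ins!r}")
    return tom


urm = [("Z", 2), ("J", 1, 2, 6), ("S", 0), ("S", 2), ("J", 0, 0, 2), ("T", 2, 2)]
for j, instr in enumerate(urm_to_tom(urm), start=1):
    print(j, instr)
```

For this six-instruction URM program the translation has eight TOM instructions, and the URM jump target 6 becomes TOM index 8.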

4.4 Implementing INCA

INCA is an implementation of the TOM model on top of the Portals network programming API. In other words, INCA shows how, with relatively modest modifications to
existing offloaded capabilities, current Portals-compliant network interfaces can provide
general-purpose compute capabilities of the sort offered by existing on- and off-path
SmartNIC architectures.
Portals-compliant NICs offer message matching, atomic operations, and triggered
operation functionality (Section 4.1). Consequently, much of the TOM model maps
straightforwardly onto Portals. A TOM instruction corresponds to a pair of a Portals
triggered operation (PtlTriggeredPut) and an ME, linked by unique match bits. All
instructions (i.e., all MEs and triggered operations) use the same trigger, i.e., a program
counter. One TOM operand (o_i) is specified as the ME buffer, and the other (o_k) is
given by the message generated by the triggered operation. TOM transfers and the zero
operation are thus standard Portals PUT operations.
In a slight modification to current Portals semantics (which has the sender of a message specify an optional atomic operation (α) to be performed at the receiver), we stipulate that the ME can specify the atomic. The reason for this change is to maintain
consistency with how program execution is typically understood, namely, the local PEs
store and execute the program, not remote hosts.
The atomics included as part of the Portals standard include addition, so it is trivial
to simulate the URM successor function; however, rather than limit ourselves to URM
operations, our implementation allows α to be any of the atomics provided by the Portals
specification. In addition to arithmetic operations (addition, subtraction, multiplication),
these include bitwise, logical, and comparison operators.
Implementing the remainder of the TOM model on top of Portals can be accomplished
with three additional modifications to the Portals specification. First, under the current
specification, a Portals triggered operation is discharged when its threshold is ≤ the
counter value. This is problematic from the perspective of program execution, because
if these entries are persistent (i.e., they do not disappear when triggered), incrementing
the program counter would result in all prior instructions being ‘re-executed’ each time
the counter is incremented. Therefore, we modify this behavior to allow ‘strict’ indexing
(i.e., ==), as required by Equation 4.3 of the TOM model.
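
The difference between the two triggering rules can be seen in a few lines of Python. The function name `fired` and the representation of a program as a list of thresholds are assumptions of this sketch; entries are taken to be persistent, as in the scenario described above.

```python
from typing import List


def fired(thresholds: List[int], counter: int, strict: bool) -> List[int]:
    """Thresholds of the persistent triggered operations that fire at this counter value."""
    if strict:
        return [t for t in thresholds if t == counter]   # strict indexing (Equation 4.3)
    return [t for t in thresholds if t <= counter]       # current Portals rule


program = [1, 2, 3]
for counter in (1, 2, 3):
    print(counter, fired(program, counter, strict=False), fired(program, counter, strict=True))
# 1 [1] [1]
# 2 [1, 2] [2]
# 3 [1, 2, 3] [3]
```

Under the non-strict rule, every earlier instruction fires again at each increment; under strict indexing exactly one instruction fires per counter value.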
Second, while the current Portals standard allows MEs to be used multiple times,
triggered operations are by default ‘use once’, and are removed from the set of registered
operations when discharged. Since TOM supports backwards jumps, we extend Portals
to include persistent triggered operations (Portals already allows for persistent MEs).
Finally, to realize jumps, the TOM requires a method for testing for equality, and
modifying a counter value conditional on the outcome of this test (Equation 4.4). While
Portals provides an atomic operation for testing equality, it does not support the required
counter modification. To address this, we introduce a new atomic operation, ‘branch if
≤ zero’ (BLEZ). An ME specifying this atomic also provides an instruction index. If
the value contained in the operand identified by the ME is zero or less, the counter is
updated to this instruction index; otherwise, it is incremented by one. Obviously, any
equivalent branching atomic could be substituted for BLEZ.
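
A hedged sketch of the BLEZ behavior just described, written in Python. The function `blez` and the three-instruction countdown loop are hypothetical; they model only the counter update, not the Portals data path. The loop also shows why persistence matters: instruction 1 must be able to fire again after the backward jump.

```python
def blez(value: float, target: int, counter: int) -> int:
    """Next counter value: the instruction index `target` if value <= 0, else counter + 1."""
    return target if value <= 0 else counter + 1


# Countdown loop: 1 is BLEZ n -> 4 (exit), 2 decrements n, 3 jumps back to 1.
n, counter, trace = 3, 1, []
while counter <= 3:
    trace.append(counter)
    if counter == 1:
        counter = blez(n, target=4, counter=counter)
    elif counter == 2:
        n -= 1
        counter += 1
    else:
        counter = 1   # backward jump; relies on instruction 1 being persistent

print(n, trace)  # 0 [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
```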
In sum, the majority of the TOM model can be realized within Portals with no
modification to the current specification. The remaining proposed changes represent a
relatively small delta from that specification: adding strict indexing for triggered operations, making those operations persistent, and adding a single new atomic operation. As
a proof of concept, we have modified the current Portals reference implementation [174] to
include these changes, although at the time of writing, this modified version is not yet
publicly available.

4.5 Conclusion

In this chapter, we have shown (via the Triggered Operation Machine model) that when
effectively coordinated, standard task-specific offloaded capabilities – namely, message
demultiplexing, atomic operations, and triggered operations – are capable of enabling
Turing complete, deadline-free, on-NIC program execution. That is, many standard HPC
NICs already provide the basic building blocks required to be ‘smart’ in the contemporary
sense of a SmartNIC, without the need for additional processors.
Moreover, the demonstration given in this chapter serves to identify a set of modest changes to the current Portals standard sufficient for securing Turing completeness.
These changes include the strict indexing of triggered operations, persistent triggered
operations, and a single additional atomic operation whose result is written to a counter.
The demonstration offered in this chapter introduces the core concepts underwriting
In-Network Compute Assistance. In the next chapter, we build on this foundation to
design and implement an ‘ecosystem’ for exploiting the expanded smart capabilities of
an INCA-enabled NIC.

Chapter 5
The INCA Ecosystem
The triggered operation machine provides a formal foundation for INCA, and, with modest modifications, Portals-compliant NICs can implement this model (Chapter 4). However, just as ‘programming’ a Turing machine or working directly in assembly language
can be difficult, so can working at the level of the TOM. Indeed, programming with
triggered operations can itself be counterintuitive, and require significant specialized
knowledge regarding the underlying hardware.
To facilitate the development of INCA kernels, we designed and implemented an
‘ecosystem’ comprising various languages for expressing these kernels, a compiler, and
an interpreter. In this chapter we provide a broad overview of these components; more
details can be found in Appendices A and B.

5.1 The INCA ecosystem

Figure 5.1 is an overview of the INCA ecosystem. INCA-Q is a high-level language based
on C that provides a familiar means to express INCA kernels. The ‘Q-compiler’ compiles
INCA-Q code into a lower-level INCA ‘assembly’ language – INCA-A – that directly
mirrors the structure of the TOM. INCA-A code, in turn, can be executed through an
interpreter. INCA-A code can also be executed through a modified version of the Portals
reference implementation; however, we do not discuss the reference implementation in
this document.
In the remainder of this chapter, we take a closer look at these components.

[Figure 5.1 diagram. Labels: INCA-Q, Q-Compiler, INCA-A, INCA Interpreter, Reference Implementation, Triggered Operation Machine.]
Figure 5.1: The INCA ecosystem.

5.2 INCA-Q

INCA-Q is a high-level language for expressing INCA kernels. As noted in the previous
chapter, registering an INCA program with the NIC involves setting up a collection
of Portals matching elements (each of which specifies a buffer), atomic operations, and
triggered operations that generate messages containing payloads. INCA kernels then
execute using the resources specified in this ‘program’.
Since INCA does not itself allocate memory, the INCA-Q language divides programs
into two parts. The first is a list of ‘directives’ establishing the execution environment,
i.e., the available buffers and variables, their types, and (if needed), their values. The
second part is the code that executes in that environment, i.e., the INCA kernel itself.
For example, Algorithm 1 is INCA-Q code for performing dot product on two 10-element vectors. Lines 1-4 are prefixed with ‘%’, and define the execution environment.
For example, i is a 16-bit integer, c a float, and the two arrays are arrays of floats.
These directives do not specify what the contents of these variables and arrays are. Lines
5-11 comprise the INCA kernel itself, defining the operations to be performed within the
environment defined by the directives. Lines 5 and 6 initialize two of the variables, and
the loop calculates the dot product. At the end of execution, c contains the result.

Algorithm 1 INCA-Q Dot Product
 1: % i16 i
 2: % f c
 3: % f A[10]
 4: % f B[10]
 5: i = 0
 6: c = 0.0
 7: while i < 10 do {
 8:     c = c + (A[i] * B[i])
 9:     i = i + 1
10: }
11: end

The full INCA-Q grammar is given in Appendix B. INCA-Q supports while loops
and conditionals. At present, it does not include the capability to define function calls;
all code is inline.

5.3 INCA-A

INCA-A is a low-level ‘assembly’ language for expressing INCA kernels. Like INCA-Q,
INCA-A does not allocate memory, so code is divided into two sections, the directives
defining the execution environment, and the code specifying the algorithm to execute
within that environment. Moreover, while INCA-Q allows the use of literals, all values
in INCA-A are stored in locations in memory; therefore, INCA-A must define, as part of
its directives, locations containing any literals used in INCA-Q, designated with a C
prefix.
Finally, note that atomic operations as defined in the Portals specification are clobbering in the sense that the value of one of the operands in the operation will be overwritten
with the result. Consequently, INCA-A code may contain ‘registers’, designated with an
R prefix, that specify locations used to temporarily store values to avoid clobbering.

Algorithm 2 INCA-A Dot Product
 1: d i16 C0
 2: d C0 = 0
 3: d i16 C1
 4: d C1 = 10
 5: d i16 C2
 6: d C2 = 1
 7: d f A[10]
 8: d f B[10]
 9: d f c
10: d i16 i
11: 1 PUTL i, C0, i16
12: 2 PUTL R0, i, i16
13: 3 LT R0, R0, C1, i16
14: 4 BLEZ R0, 10
15: 5 PUTL R1, A[i], f
16: 6 MUL R1, R1, B[i], f
17: 7 ADD c, c, R1, f
18: 8 ADD i, i, C2, i16
19: 9 JMP 2
20: 10 END

Algorithm 2 is INCA-A code that corresponds to the INCA-Q code given in Algorithm 1, and illustrates the points just raised. Lines 1-10 are directives defining the
execution environment for the code specified on lines 11-20. The INCA-Q kernel appeals
to three literals – 0, 10, and 1 – so the INCA-A directives declare and initialize three
constants, C0, C1, and C2. The INCA-A kernel proper appears on lines 11-20.

To illustrate the syntax of an INCA-A instruction, consider the multiplication (MUL)
on line 16. From left to right, the first item (6) is the threshold for triggering the instruction. The second item is the instruction (MUL) to be executed, followed by the destination
where the result of the operation should be placed (R1), the location containing the
first operand (also R1), and the location containing the second operand (B[i]). The
final item is the type, in this case float.
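
The field layout just described can be made concrete with a small parsing sketch in Python. It is not the INCA-A grammar (see Appendix A for the instruction list); the names `Instruction` and `parse_instruction`, and the assumption that fields are separated by spaces and commas exactly as printed in Algorithm 2, are introduced here for illustration. Only the two type names that appear in Algorithms 1 and 2 are recognized.

```python
from typing import NamedTuple, Optional, Tuple


class Instruction(NamedTuple):
    threshold: int               # counter value that triggers the instruction
    opcode: str                  # e.g. MUL, PUTL, BLEZ
    operands: Tuple[str, ...]    # destination first, then source operand(s)
    dtype: Optional[str]         # trailing type name, if the instruction carries one


TYPES = {"i16", "f"}             # assumption: only the types used in Algorithms 1 and 2


def parse_instruction(line: str) -> Instruction:
    head, _, rest = line.strip().partition(" ")
    opcode, _, args = rest.strip().partition(" ")
    fields = [a.strip() for a in args.split(",") if a.strip()]
    dtype = fields.pop() if fields and fields[-1] in TYPES else None
    return Instruction(int(head), opcode, tuple(fields), dtype)


print(parse_instruction("6 MUL R1, R1, B[i], f"))
print(parse_instruction("4 BLEZ R0, 10"))
print(parse_instruction("10 END"))
```

For the MUL example, the parsed destination and first operand are both R1, which is the clobbering pattern discussed next.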
Because the destination location and the location containing the first operand are
the same, the first operand is clobbered, i.e., this example illustrates the clobbering
semantics of Portals atomic operations. Note that all of the binary operations appearing
in the kernel share this feature. To avoid losing the overwritten data, the INCA-A code
introduces temporary storage locations (‘registers’), e.g., R0 and R1, and data to be
preserved is first copied to one of these locations before it is operated on. For example,
line 15 copies an element of array A to register R1, so that the multiplication performed
on line 16 does not destroy the data. In some cases, clobbering is permissible, e.g., in
accumulating the products into c on line 17.
We note that an alternative, non-clobbering semantics for INCA instructions presents
an obvious opportunity for optimization: if instructions allowed the destination of an
operation to be a location besides the first operand, the additional memory would not
be necessary, and the copying of data could be avoided. While all of the work presented
in this document retains the standard clobbering semantics, INCA-A syntax is designed
to allow for such a distinction in the future.
The current version of INCA-A includes directives for defining the execution environment (declarations, initializations), core operations for data manipulation (arithmetic,
logical, bitwise), program control operations (branch, jump, end), and data movement
operations (put). A list of INCA-A instructions is included in Appendix A.

5.4 Compiler and interpreter

To translate kernels written in INCA-Q to INCA-A, we designed and implemented a
‘Q-compiler’. The Q-compiler is written in Python using the Lark library [175], which
handles the tasks of tokenization and parsing.
The compiler makes an initial pass over the INCA-Q code to assemble tables for variable types (for type checking), and for arrays and their dimensions (for bounds checking). Next, an AST is generated for the program, and traversed to do type and bounds
checking, and to generate INCA code. During code generation, the compiler generates
locations to store any literals, as described in the previous section. Furthermore, the
compiler automatically generates temporary registers and code for copying data to and
from those locations, to address the clobbering issue discussed above.
At the time of writing, the Q-compiler is optimized in that effort has been made
to avoid superfluous data movement. For example, in the expression x = x + y, the
Q-compiler recognizes the final destination of the sum is x, so there is no need to avoid
clobbering x (compare: z = x + y). Similarly, the compiler recognizes statements such
as x = x do not require data movement.
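
The clobber-avoidance decision can be sketched as follows. This is a hypothetical illustration rather than the Q-compiler's code: the function names, the fixed temporary `R0`, the omission of thresholds, and the exact instruction sequence emitted in the non-clobbering case are assumptions. The operand order (destination, first operand, second operand, type) follows the INCA-A syntax described in Section 5.3.

```python
from typing import List

OPS = {"+": "ADD", "*": "MUL"}   # only the two operators needed for the examples


def gen_assign(dest: str, lhs: str, op: str, rhs: str, dtype: str, tmp: str = "R0") -> List[str]:
    """Emit INCA-A-style operations for `dest = lhs op rhs` (thresholds omitted)."""
    mnemonic = OPS[op]
    if dest == lhs:
        # The result overwrites lhs anyway, so clobbering it is harmless.
        return [f"{mnemonic} {dest}, {lhs}, {rhs}, {dtype}"]
    # Otherwise operate on a copy so that lhs survives the clobbering atomic.
    return [
        f"PUTL {tmp}, {lhs}, {dtype}",
        f"{mnemonic} {tmp}, {tmp}, {rhs}, {dtype}",
        f"PUTL {dest}, {tmp}, {dtype}",
    ]


def gen_copy(dest: str, src: str, dtype: str) -> List[str]:
    """Emit operations for `dest = src`; a self-assignment needs no data movement."""
    return [] if dest == src else [f"PUTL {dest}, {src}, {dtype}"]


print(gen_assign("x", "x", "+", "y", "f"))   # one instruction, no temporary
print(gen_assign("z", "x", "+", "y", "f"))   # copy, operate, copy back
print(gen_copy("x", "x", "f"))               # []
```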
In addition to the compiler, we also developed two methods for executing INCA-A
kernels. First, we modified the Portals reference implementation to support INCA. As
noted in Chapter 4, this required extending Portals to support strict indexing, persistent
triggered operations, and an additional atomic to enable program branching. Since the
Portals reference implementation is written in C, we also developed a tool to translate
INCA-A kernels into C code that uses the extended version of the Portals API. The
resulting code can be compiled and executed over an InfiniBand or Ethernet network.
The second method is an INCA-A interpreter. This interpreter is implemented in
Python, again with assistance from the Lark library to lex and parse the INCA-A code.
The resulting collection of ASTs is traversed to generate a representation of the initial
state of the variables at the start of program execution. The ASTs are then traversed
again, and the operations executed in the context of the interpreter. The interpreter
reports the initial state of all variables, and the final state of those variables. Furthermore,
the interpreter records the INCA-A instructions appearing in the kernel as well as how
many times each is executed; this information is then used for performance modeling, as
described in Chapter 6.
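
To make the interpreter's role concrete, the following Python sketch executes the kernel of Algorithm 2 and counts how often each instruction runs. It is not the INCA-A interpreter (which parses INCA-A source with Lark); the kernel is hand-encoded as tuples, and the semantics are assumptions made here: LT writes 1 to its destination if the first operand is less than the second and 0 otherwise, BLEZ jumps to the given instruction index when its operand is zero or less, and c is set to 0.0 in the initial environment.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Kernel of Algorithm 2; an instruction's list position + 1 is its threshold.
KERNEL: List[Tuple] = [
    ("PUTL", "i", "C0"),
    ("PUTL", "R0", "i"),
    ("LT", "R0", "R0", "C1"),
    ("BLEZ", "R0", 10),
    ("PUTL", "R1", "A[i]"),
    ("MUL", "R1", "R1", "B[i]"),
    ("ADD", "c", "c", "R1"),
    ("ADD", "i", "i", "C2"),
    ("JMP", 2),
    ("END",),
]

BINARY = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b, "LT": lambda a, b: int(a < b)}


def load(env: Dict, ref: str):
    """Read a scalar location, or an array element indexed by a scalar (e.g. 'A[i]')."""
    if ref.endswith("]"):
        name, index = ref[:-1].split("[")
        return env[name][env[index]]
    return env[ref]


def run(kernel: List[Tuple], env: Dict, max_steps: int = 100_000):
    counts, pc = Counter(), 1
    for _ in range(max_steps):            # guard added for the sketch
        op, *args = kernel[pc - 1]
        counts[pc] += 1                   # record one execution of this instruction
        if op == "END":
            return env, counts
        if op == "JMP":
            pc = args[0]
        elif op == "BLEZ":
            pc = args[1] if load(env, args[0]) <= 0 else pc + 1
        else:
            if op == "PUTL":
                env[args[0]] = load(env, args[1])
            else:                         # clobbering binary operation
                env[args[0]] = BINARY[op](load(env, args[1]), load(env, args[2]))
            pc += 1
    raise RuntimeError("step limit reached")


env = {"C0": 0, "C1": 10, "C2": 1, "i": 0, "c": 0.0, "R0": 0, "R1": 0.0,
       "A": [float(k) for k in range(10)], "B": [2.0] * 10}
env, counts = run(KERNEL, env)
print(env["c"], sum(counts.values()))  # 90.0 85
```

With these inputs the loop test (instructions 2-4) runs 11 times and the loop body (instructions 5-9) runs 10 times; per-instruction counts of this kind are what a performance model would consume.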

5.5 Conclusion

In Chapter 4, we demonstrated how traditional, task-specific smart capabilities can be
leveraged to enable a Portals-compliant NIC to be intelligent in the contemporary sense
of the term. In this chapter, we’ve provided a brief overview of a set of tools – languages,
a compiler, and an interpreter – designed to facilitate the deployment of these general-purpose compute capabilities.
The tools in the INCA ecosystem include high- and low-level languages for expressing
INCA kernels, a compiler, and an interpreter. An example kernel written in the high-level language (INCA-Q) was discussed, as was a corresponding kernel written in the
low-level language (INCA-A). In the process, we highlighted some potential opportunities
for optimization, e.g., modifying Portals semantics to allow results of an operation to be
placed in a location distinct from that of the operands. We also briefly described the
Q-compiler for compiling INCA-Q code to INCA-A code, and the methods developed to
execute INCA-A code.
In the following chapters, we use this ecosystem to develop a variety of INCA kernels.
By supplementing the INCA-A interpreter with a simulator, we provide runtime estimates
for these kernels (Chapter 6). Moreover, we investigate the impact such offloaded kernels
may have for host application performance (Chapter 7), and explore how INCA affords
the offloading of entirely independent applications to the network, e.g., for purposes of
predicting network traffic (Chapter 8).

Chapter 6
INCA: Kernel Performance
Traditional offloaded core network applications – message demultiplexing, atomic operations, and triggered operations – can be leveraged to provide Turing complete and
deadline-free on-NIC compute capabilities (Chapter 4). In this chapter, we utilize the
INCA ecosystem tools described in Chapter 5 to refine our understanding of what an
INCA-enabled SmartNIC can do by implementing a selection of representative kernels,
and assessing their runtimes under different configurations of network speeds, memory
access times, and other hardware optimizations. To achieve this evaluation, we designed
and implemented a simulator – INCAsim – that interfaces with the INCA-A interpreter
described in the previous chapter.
The results of this simulation study show that ‘vanilla’ INCA – i.e., the result of
making the bare minimum modifications to contemporary NICs to make them INCA-compliant – is, not surprisingly, slow relative to standard CPUs. However, by taking
advantage of currently-unused on-NIC silicon area to include hardware optimizations
(e.g., SIMD units), and through intelligent data staging, the execution speeds can be
made comparable to CPUs, and in certain cases, exceed them.
In Section 6.1 we describe the simulator used to derive performance estimates. In Section 6.2 we introduce the set of representative kernels selected for consideration, as well as
the various INCAsim configurations used for evaluation. The results of the performance
evaluation are discussed in Section 6.3.

6.1 INCAsim

INCAsim is based on the LogGP model of parallel computation [176], which itself is an
extension of the original LogP model [177]. Under LogGP, communication performance
is modeled by five parameters:
• L: message latency;
• o: message processing overhead;
• g: the inter-message gap, i.e., the inverse of bandwidth;
• G: the inter-byte gap, i.e., the inverse of bandwidth calculated at the byte level;
and
• P : the number of processes.
The total time to send a small message is estimated as the sum of latency and send- and receive-side overheads: L + 2o. If m messages are sent, then the total time spent in
communication is m(L + 2o) + (m − 1)g. The inter-byte gap G is used to characterize
communication time for larger messages; the time to send a large message of b bytes is
L + 2o + (b − 1)G.
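The three LogGP estimates above can be written as a minimal sketch. The function names and the numeric values in the usage lines are illustrative assumptions, not values taken from INCAsim; all times are in the same unit (nanoseconds here).

```python
def small_message_time(L, o):
    """Time for one small message: latency plus send- and receive-side overhead."""
    return L + 2 * o


def many_small_messages_time(count, L, o, g):
    """Time for `count` small messages, with a gap g between consecutive sends."""
    return count * (L + 2 * o) + (count - 1) * g


def large_message_time(num_bytes, L, o, G):
    """Time for one large message of `num_bytes` bytes, using the inter-byte gap G."""
    return L + 2 * o + (num_bytes - 1) * G


# Illustrative parameters only (ns): L=50, o=5, g=5, G=0.04.
print(small_message_time(50, 5))
print(many_small_messages_time(3, 50, 5, 5))
print(large_message_time(1024, 50, 5, 0.04))
```

Keeping the three cases as separate functions mirrors the way the text introduces them; a simulator would typically pick one of them per message based on payload size.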
Because the phases in executing INCA instructions are precisely those involved in
communication, the LogGP model maps naturally to INCA program execution. Figure 6.1 shows this relationship by annotating the original INCA instruction execution
illustration given in Chapter 4 (Figure 4.2) with the parameters of the LogGP model.
Latency L is the time it takes the message generated by the triggering unit to travel
to the matching unit. Overhead o is the time spent matching, performing the atomic,
and triggering any further message. And the gap (g or G) characterizes how quickly the
messages (or bytes) can leave the triggering unit. Note that o is bounded above by g,
because the NIC hardware is itself designed to operate at line rate, i.e., it should take no
more time than g to match, apply an atomic, and trigger a new message.

Figure 6.1: The extended LogGP model used by INCAsim.

To capture interactions between the execution pipeline and memory, we extend the
LogGP model to include a parameter for memory access, m. In principle, an INCA
instruction could involve zero, one, or two fetches per instruction depending on the
instruction. Figure 6.1 illustrates a two-fetch scenario. Suppose the instruction being
executed is addition. We assume one operand is contained in the incoming message, so no
fetch is required. Suppose, however, the second operand, as specified by the ME, is not
present in cache; the result is a fetch prior to addition being applied. Furthermore, if the
data to be sent in a triggered message is not already present – e.g., if the outgoing data
is not the result of the operation or one of the two operands – then the triggering unit
must also fetch. We assume writing results out to local memory or to host through the
DMA engine can proceed independently of program execution. Therefore, in the example
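illustrated in Figure 6.1, the time to execute the instruction is L + 2o + 2m + g, and an upper
bound on the time to execute a program of n instructions is n(L + 2o + 2m) + (n − 1)g, where n denotes the instruction count (written n to keep it distinct from the memory access time m).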
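The extended model can be sketched as follows. This is a minimal illustration of the two expressions in the preceding paragraph, not INCAsim's implementation: the memory access time is named `mem` and the instruction count `n` so the two quantities cannot be confused, the `fetches` argument is an added convenience for the zero-, one-, and two-fetch cases, and the numeric values are illustrative.

```python
def instruction_time(L, o, mem, g, fetches=2):
    """Time for one instruction: L + 2o + fetches*mem + g (two fetches by default)."""
    return L + 2 * o + fetches * mem + g


def program_upper_bound(n, L, o, mem, g, fetches=2):
    """Upper bound for n instructions: n(L + 2o + fetches*mem) + (n - 1)g."""
    return n * (L + 2 * o + fetches * mem) + (n - 1) * g


# Illustrative values in ns: L=50, o=5, mem=250, g=5.
print(instruction_time(50, 5, 250, 5))          # two-fetch instruction
print(program_upper_bound(10, 50, 5, 250, 5))   # ten such instructions
```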
The INCA-A interpreter records which instructions are executed, and how many
times each is executed. The INCAsim simulator takes as inputs (1) a configuration file
specifying network speed (GB/s) or message rate (messages/s), PCIe configuration or
local memory access speed, cable lengths, switch latencies, etc., and (2) the instruction
counts provided by the interpreter, and applies the extended LogGP model to generate
an estimate of total kernel runtimes.
This model is deployed conservatively by INCAsim. For example, we assume that the
triggering unit must always perform a fetch. Likewise, we assume there is no pre-fetching
of operands specified by MEs registered in the matching unit. Consequently, every binary
operation (for instance) always involves two fetches. This situation represents a clear
opportunity for future optimization: since MEs are associated with instructions, and
instructions are ordered, sequences of operands could be pre-fetched into the atomic
unit concurrently with the actions of other units.

6.2 Kernels and INCAsim configurations

In this section we describe the set of representative kernels selected for performance
evaluation, and the set of INCAsim configurations used to generate runtime estimates.

6.2.1 Kernels

The kernels we study are listed in Table 6.1 in order of complexity. Given an incoming
array of data, the filter kernel replaces each item in host memory if the new value is
within a user-defined neighborhood surrounding the current value; incoming data outside
the neighborhood is considered invalid. The matrix-unpack kernel unpacks an incoming
array containing data from an eastern neighbor participating in a three-dimensional halo
exchange. The linear-interpolation kernel performs a linear interpolation on an incoming array of data, interpolating two intermediate points between each adjacent pair of
points in the incoming data. The convolution kernel applies a standard 3x3 edge detection filter to an incoming matrix, with wraparound for edge pixels. hadamard-product
is the entrywise product of two matrices. The remaining kernels, vector-dot-product,
matrix-transpose and matrix-multiplication, are self-explanatory.

                                              Payload size
Kernel                      128B    256B    512B   1024B   2048B   4096B   8192B
vector-dot-product           132     260     516    1028    2052    4100    8196
matrix-transpose             136     268     460     916    1684    3364    6436
hadamard-product             168     332     588    1172    2196    4388    8484
filter                       208     406     808    1647    3271    6555   13112
matrix-unpack                246     486     966    1926    3846    7686   15366
matrix-multiplication        696    2700    4748   18836   35220  140580  271652
convolution                  808    1616    3088    6176   12064   24128   47680
linear-interpolation         826    1690    3418    6874   13786   27610   55258

Table 6.1: Number of instructions executed for each kernel by payload size.
Shaded cells indicate instruction counts that exceed the deadline-imposed limit
on a 32 core, 2.5GHz, 200Gb/s, 128B packet deadline-based SmartNIC.

Table 6.1 also reports instruction counts, as reported by the INCA-A interpreter, for
payloads ranging from 128 to 8192 bytes. The grey-shaded entries indicate instruction
counts that exceed the processing deadline for on-path architectures on a 32 core, 2.5GHz,
200Gb/s, 128B packet system; most of these kernels cannot be executed on such a system.
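To make the kernel descriptions in this section concrete, three of them can be sketched as host-side reference functions. These are illustrative Python versions, not the INCA-Q or INCA-A kernels used in the study. The inclusive neighborhood test in the filter, the even spacing of the interpolated points, and all function and parameter names are assumptions.

```python
def filter_kernel(current, incoming, radius):
    """Accept an incoming value only if it lies within `radius` of the current value."""
    return [new if abs(new - cur) <= radius else cur
            for cur, new in zip(current, incoming)]


def linear_interpolation(points):
    """Insert two evenly spaced points between each adjacent pair of inputs."""
    if not points:
        return []
    out = []
    for a, b in zip(points, points[1:]):
        step = (b - a) / 3
        out.extend([a, a + step, a + 2 * step])
    out.append(points[-1])
    return out


def hadamard_product(x, y):
    """Entrywise product of two equally shaped matrices (lists of rows)."""
    return [[p * q for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


print(filter_kernel([10, 20, 30], [11, 50, 29], radius=2))
print(linear_interpolation([0.0, 3.0, 9.0]))
print(hadamard_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In the filter example the middle value (50) falls outside the neighborhood of 20 and is rejected, which corresponds to the "invalid" incoming data mentioned above.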

6.2.2 INCAsim configurations

For the studies presented here, we consider a variety of hardware configurations, beginning with a ‘baseline’ configuration, ‘base’. The baseline configuration represents the
smallest delta from existing architectures; i.e., it reflects a scenario where current state-of-the-art Portals NIC hardware is modified to achieve Turing completeness, as detailed
above, with no further optimizations.
Figure 6.2 depicts the architecture of a Portals-compliant NIC, modified to support
the base INCA configuration. The Portals Unit (center of diagram, shaded) contains
the matching, atomic, and trigger units comprising the foundation of INCA program
execution.

Figure 6.2: Architecture of a Portals NIC underlying the base INCA configuration. Adapted from [102].

In accordance with the results given in Chapter 4, the trigger unit now includes strict
indexing, MEs and triggered operations are persistent, and the atomic unit provides the
capacity to modify the contents of a trigger if those of another buffer are ≤ 0. Otherwise,
the base configuration works with the capabilities already supplied by the NIC. This
has two implications for execution speeds. First, triggered messages must bounce off
or loopback from the local switch, a process that increases latency (L in the extended
LogGP model). Second, because there is no NIC-local memory dedicated to INCA kernel
execution, all operands must be fetched from main memory via PCIe, and memory access
times (m in the extended LogGP model) will therefore be significant. Note we assume
access to main memory leverages the Tx DMA engine.
The second model configuration, ‘scratchpad’, accounts for several optimizations, diagrammed in Figure 6.3. First, a fast path is available for message loopback, eliminating
the trip through the switch. This fast path includes a low-priority queue for the pending
operation, allowing incoming data to take precedence over INCA programs. Second, we
assume the Portals system is extended with a scratchpad memory similar to fast cache
(SRAM) as employed in contemporary stream models [52], [54]. This memory provides
a low-latency, NIC-local space for caching current data, and operands specified by an
INCA program can be pre-loaded to this local memory, e.g., by traversing the matching
list.

Figure 6.3: Architecture of a Portals NIC underlying the scratchpad INCA configuration. (Labeled elements include a 1 MiB, 1 ns SRAM scratchpad and an INCA LPQ.)

Figure 6.4: INCA instruction rates for 8, 44, and 64 byte packets, and message
rates for several current network products. (Axes: Network Bandwidth (Gb/s) versus GInst/s; series: 8B, 44B, 64B, Mellanox, OmniPath.)

In both configurations, Portals overhead and gap are bounded by network speed, because a Portals-enabled NIC is expected to process incoming requests and issue outgoing
data at least at message rate [102]. Figure 6.4 shows the ideal processing speeds at
network bandwidths targeted by the InfiniBand roadmap [4]. The three message sizes
correspond to 8 B operands, 44 B messages containing the minimum Portals header information (according to the current Portals reference implementation) and an operand,
and the common 64 B cache line. Message rates reported by Intel (100 Gb/s OmniPath)
and Mellanox (100 Gb/s EDR and 200 Gb/s HDR) are shown in colored triangles. For
current hardware, Portals overhead can therefore be expected to fall somewhere between
5 ns (the gap for 200 million messages/s) and 1.76 ns (the gap for 44 B messages). For the present
evaluation, we use the conservative larger overhead value.
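The two bounds quoted above can be reproduced with a short calculation. This is a minimal sketch, assuming the gap is the reciprocal of the message rate or, equivalently, the message size divided by the link bandwidth. The function names are illustrative, and pairing the 44 B message with a 200 Gb/s link is an assumption that happens to reproduce the 1.76 ns figure.

```python
def gap_from_message_rate(messages_per_second):
    """Inter-message gap in ns for a given message rate."""
    return 1e9 / messages_per_second


def gap_from_bandwidth(message_bytes, gigabits_per_second):
    """Inter-message gap in ns: bits divided by Gb/s yields nanoseconds."""
    return message_bytes * 8 / gigabits_per_second


print(gap_from_message_rate(200e6))   # 200 million messages/s
print(gap_from_bandwidth(44, 200))    # 44 B message, assumed 200 Gb/s link
print(gap_from_bandwidth(64, 400))    # 64 B message at 400 Gb/s
print(gap_from_bandwidth(64, 1000))   # 64 B message at 1000 Gb/s
```

The last two calls use the 64 B message size and the 400 Gb/s and 1000 Gb/s speeds considered in Section 6.3.2.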
Configuration parameters are summarized in Table 6.2. Loopback latency for base
reflects the time to traverse the local switch, and Portals overhead is assessed as just
described. Memory latency for base estimates the cost of retrieving an operand via
generation 5 x32 PCIe [178], while scratchpad memory latency is based on the value
adopted in [54].

                    Base      Scratchpad
Loopback latency    50 ns     0
Portals overhead    5 ns      5 ns
Memory latency      250 ns    1 ns
gap                 5 ns      5 ns

Table 6.2: Parameters used for INCAsim.

6.3 Performance Evaluation

Using INCAsim with the parameters described in the previous section, we first calculated
INCA runtimes for the base and the scratchpad configurations. We then calculated runtimes for a series of optimizations. These optimizations are presented incrementally so
that benefits of each can be assessed independently, e.g., as regards potential implementation costs. Results for all configurations and optimizations are collected in Table 6.3.
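As a sanity check on how the parameters combine, the base matrix-transpose runtimes in Table 6.3 can be recovered from the instruction counts in Table 6.1 by charging each instruction one loopback latency, one overhead, one memory access, and one gap, using the base values from Table 6.2. This per-instruction accounting is inferred from the published numbers rather than stated in the text, and it does not reproduce the other kernels exactly (filter, for example, comes out lower), so it should be read as an approximation for this one kernel only.

```python
# Base-configuration parameters from Table 6.2, in nanoseconds.
BASE = {"L": 50, "o": 5, "m": 250, "g": 5}

# matrix-transpose instruction counts (Table 6.1), payloads 128 B to 8192 B.
MTRANS_COUNTS = [136, 268, 460, 916, 1684, 3364, 6436]

# matrix-transpose base runtimes in microseconds (Table 6.3).
MTRANS_BASE_US = [42.16, 83.08, 142.6, 283.96, 522.04, 1042.84, 1995.16]


def per_instruction_ns(params):
    """Inferred accounting: one latency, one overhead, one fetch, one gap."""
    return params["L"] + params["o"] + params["m"] + params["g"]


def estimate_us(count, params):
    """Estimated runtime in microseconds for `count` instructions."""
    return count * per_instruction_ns(params) / 1000.0


estimates = [estimate_us(n, BASE) for n in MTRANS_COUNTS]
for est, reported in zip(estimates, MTRANS_BASE_US):
    print(f"estimated {est:9.2f} us   reported {reported:9.2f} us")
```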

                                                           Payload size
Kernel       Optimization        128B      256B      512B     1024B     2048B     4096B     8192B  Avg spdup  Avg spdup
                                                                                                    wrt base   wrt spad
m-trans      base               42.16     83.08     142.6    283.96    522.04   1042.84   1995.16
             scratchpad          1.52      3.00      5.14     10.24     18.81     37.58     71.89     27.74×
filter       base               65.98    130.36    260.23    525.57   1046.01    2091.8   4190.72
             scratchpad          2.34      4.58      9.12     18.56     36.88     73.88    147.81     28.36×
m-unpack     base               96.51    190.91    379.71    757.31   1512.51   3022.91   6043.71
             scratchpad          2.80      5.54     11.01     21.96     43.84     87.62    175.17     34.48×
convolution  base              328.48    657.96   1274.28   2549.56   5014.84  10030.68   19891.8
             scratchpad          9.26     18.52     35.45     70.91    138.62    277.24    548.09     35.94×
lin-inter    base              301.31    617.15   1248.83   2512.19   5038.91  10092.35  20199.23
             scratchpad          9.38     19.18     38.80     78.03    156.50    313.42    627.28     32.18×
hadamard     base               56.08    110.92    198.28    395.32    744.76   1488.28   2886.04
             scratchpad          1.89      3.73      6.61     13.18     24.70     49.36     95.44     30.00×
             clobber             1.54      3.03      5.21     10.37     19.07     38.09     72.91     38.13×      1.27×
             parallel            0.02      0.03      0.05      0.09      0.18      0.35      0.69   3918.83×    116.99×
             para-clobber        0.01      0.02      0.03      0.05      0.09      0.18      0.35   7208.41×    240.11×
dot-prod     base               48.92      96.6    191.96    382.68    764.12    1527.0   3052.76
             scratchpad          1.50      2.96      5.87     11.69     23.34     46.64     93.23     32.70×
             clobber             1.32      2.60      5.17     10.29     20.52     41.01     81.97     37.18×      1.14×
             parallel            1.16      2.25      4.45      8.84     17.63     35.21     70.37     43.09×      1.32×
             para-clobber        1.15      2.24      4.42      8.80     17.54     35.04     70.03     43.33×      1.33×
             adv-parallel        0.04      0.05      0.08      0.12      0.21      0.42      0.84   2807.45×     85.81×
m-mult       base              247.76     965.0   1727.88   6863.16   12966.2   51771.8 100596.12
             scratchpad          7.89     30.61     53.91    213.88    400.25   1597.64   3088.59     32.06×
             parallel            1.13      3.61      7.07     26.59     52.92    208.86    417.68    246.12×      7.68×
             para-clobber        1.08      3.48      6.86     25.84     51.52    203.24    406.43    254.01×      7.92×
             adv-parallel        0.29      1.10      1.50      5.89      7.82     31.29     47.42   1354.69×     42.09×

Table 6.3: Kernel runtimes in µs, under different optimizations, and speedups
with respect to the base and scratchpad (‘spad’) configurations. Kernel abbreviations are as follows. m-trans: matrix transposition; m-unpack: matrix unpack;
lin-inter: linear interpolation; hadamard: hadamard product; dot-prod: dot
product; and m-mult: matrix multiplication. Optimization abbreviations are:
para-clobber: parallel operations with clobbering; adv-parallel: advanced
parallel optimizations. See main text for description of kernels and optimizations.

6.3.1 Base and scratchpad performance

The base configuration reflects INCA performance if current generation HPC networks
were extended to support the TOM model as described in Chapter 4. Perhaps unsurprisingly, the base runtimes range from slow (42 µs for 128 B matrix transpose) to very
slow (100 ms for 8192 B matrix multiplication). As a point of comparison, 8 KiB matrix
multiplication on a 2.3 GHz Intel Haswell CPU takes between 10.56 µs and 139.49 µs,
depending on gcc compiler optimization level.
While below we present a series of significant optimizations, we observe that even the
performance of the base configuration may not be problematic. An application running
on the host with sufficient latent parallelism still can take advantage of relatively slow
compute assistance. Moreover, since INCA programs are intrinsically preemptible, a
host application can always assume control over NIC compute resources. For instance,
Portals includes an event notification queue that can be polled by host applications
to determine the status of posted requests. This event queue can be used to provide
information regarding the current state of an executing INCA program. Consequently,
given a mechanism for signaling the NIC to halt INCA program execution, an idle host
could steal partially-completed work from the NIC, picking up where the INCA program
left off.
The scratchpad configuration assumes an INCA NIC is equipped with a local scratchpad memory to minimize memory access penalties, and a fast loopback path to avoid
routing instructions through the switch. Unsurprisingly, enabling these features has a
significant performance impact, reducing runtimes by one or two orders of magnitude
across all kernels considered.

                           Network Bandwidth
Kernel                     400Gb/s    1000Gb/s
dot-product                   32.2       19.66
matrix-transpose              24.0       14.12
hadamard-product             32.32       19.28
filter                       50.26       30.11
matrix-unpack                60.85       37.25
linear-interpolation        216.16      131.28
matrix-multiplication       1067.5      650.24
convolution                 193.35      120.12

Table 6.4: Runtimes (µs) for kernels with scratchpad configuration, 64 B messages at 400 Gb/s and 1000 Gb/s, and initial payload of 8 KiB.

6.3.2 Network speeds

As illustrated in Figure 6.4, INCA execution times are expected to decrease as network
speeds increase. To investigate the impact of network bandwidth on INCA kernel runtimes, we modified model parameters to execute kernels on the scratchpad configuration
with 64B messages at 400 Gb/s and 1000 Gb/s network speeds; gap and Portals overhead
both become 1.28 ns at 400 Gb/s and 0.512 ns at 1000 Gb/s. These parameters reflect the assumption that
loopback for INCA program execution is optimized to avoid overhead. Table 6.4 shows
results for kernel execution on an initial 8 KiB payload of 8 B operands. Note that, at
1000 Gb/s, the 1 ns scratchpad memory latency starts to become a bottleneck (although
at these network speeds, we also expect cache latency to be less).

6.3.3 Reducing data movement

Atomic operations overwrite the contents of the first operand, so, to preserve standard
program semantics, INCA programs copy operands before operating on them. However,
this data copy may introduce unnecessary overhead. When possible, we modified kernels
to work in place under traditional atomic semantics. For example, the dot product code
given in Chapter 5 is modified to allow clobbering of array A, as shown in Algorithm 3.
Some copying is still necessary since overwriting the contents of the index during the
loop condition evaluation would invalidate the loop. Note that the INCA-A directives
are not shown in Algorithm 3.

Algorithm 3 INCA-A Dot Product w/ clobbering
1 PUTL i, C0, i16
2 PUTL R0, i, i16
3 LT R0, R0, C1, i16
4 BLEZ R0, 9
5 MUL A[i], A[i], B[i], f        ▷ Contents of A are clobbered
6 ADD c, c, A[i], f
7 ADD i, i, C2, i16
8 JMP 2
9 END

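The control flow of Algorithm 3 can be traced with a small emulator that also tallies executed instructions, in the spirit of the counts recorded by the INCA-A interpreter. This is a hedged sketch, not the interpreter itself. The assumed semantics are: the first operand is the destination, LT writes 1 or 0, BLEZ jumps when its register is ≤ 0, the constants are C0 = 0, C1 = vector length and C2 = 1 (the directives defining them are not shown in the text), and END is not counted.

```python
from collections import Counter


def run_dot_product_clobber(a, b):
    """Emulate Algorithm 3 step by step; return (c, per-opcode execution counts)."""
    a = list(a)               # local copy, because the kernel clobbers A
    C0, C1, C2 = 0, len(a), 1  # assumed constants: start index, length, stride
    i = r0 = 0
    c = 0.0
    counts = Counter()
    pc = 1                    # label of the next instruction to execute
    while pc != 9:            # 9 END (not counted in this sketch)
        if pc == 1:           # 1 PUTL i, C0
            counts["PUTL"] += 1
            i = C0
        elif pc == 2:         # 2 PUTL R0, i
            counts["PUTL"] += 1
            r0 = i
        elif pc == 3:         # 3 LT R0, R0, C1
            counts["LT"] += 1
            r0 = 1 if r0 < C1 else 0
        elif pc == 4:         # 4 BLEZ R0, 9
            counts["BLEZ"] += 1
            if r0 <= 0:
                pc = 9
                continue
        elif pc == 5:         # 5 MUL A[i], A[i], B[i]  (clobbers A[i])
            counts["MUL"] += 1
            a[i] = a[i] * b[i]
        elif pc == 6:         # 6 ADD c, c, A[i]
            counts["ADD"] += 1
            c = c + a[i]
        elif pc == 7:         # 7 ADD i, i, C2
            counts["ADD"] += 1
            i = i + C2
        elif pc == 8:         # 8 JMP 2
            counts["JMP"] += 1
            pc = 2
            continue
        pc += 1
    return c, counts


result, counts = run_dot_product_clobber([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(result, sum(counts.values()))
```

Under these assumptions the loop body costs seven instructions per element plus four for setup and the final failed test. For 16 operands this sketch executes 116 instructions, against the 132 reported in Table 6.1 for the non-clobbering kernel at 128 B (assuming 8 B operands).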
Table 6.3 shows runtimes for the kernels that can be straightforwardly modified to
allow clobbering: hadamard-product and dot-product (rows labeled ‘clobber’). In
comparison to scratchpad, which shows an average speedup relative to base of 30.0×
and 32.7× for the two kernels, respectively, clobber brings this speedup to 38.13× and
37.18× (1.27× and 1.14× speedup with respect to scratchpad), demonstrating that when
possible, avoiding data movement through clobbering is an effective optimization. These
results also provide an initial indication of the speedups available if a non-clobbering
semantics were available, i.e., one that allowed the destination location for a result to
differ from that of the operand(s).
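The average speedups quoted here can be approximately reproduced from the Hadamard-product rows of Table 6.3. The sketch below assumes the reported average is the arithmetic mean of the per-payload runtime ratios. That procedure is inferred from the numbers rather than stated in the text, and because the table entries are rounded the recomputed values agree only to about two decimal places.

```python
from statistics import mean

# Hadamard-product runtimes in microseconds from Table 6.3 (128 B to 8192 B).
HADAMARD_US = {
    "base": [56.08, 110.92, 198.28, 395.32, 744.76, 1488.28, 2886.04],
    "scratchpad": [1.89, 3.73, 6.61, 13.18, 24.70, 49.36, 95.44],
    "clobber": [1.54, 3.03, 5.21, 10.37, 19.07, 38.09, 72.91],
}


def average_speedup(reference, optimized):
    """Mean of per-payload ratios reference/optimized (assumed averaging rule)."""
    return mean(r / o for r, o in zip(reference, optimized))


print(round(average_speedup(HADAMARD_US["base"], HADAMARD_US["scratchpad"]), 2))
print(round(average_speedup(HADAMARD_US["base"], HADAMARD_US["clobber"]), 2))
print(round(average_speedup(HADAMARD_US["scratchpad"], HADAMARD_US["clobber"]), 2))
```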

6.3.4 Hardware optimizations

While integrating scratchpad memory and allowing data to be overwritten when possible
promise significant boosts to INCA kernel performance, reasonable hardware enhancements are also promising. In particular, faster NICs require more pins, so as process
technologies scale to smaller sizes, the overall size of a NIC chip may not scale; instead,
the NIC circuitry simply occupies less of the overall area. So, for instance, ALUs of
the sort utilized by a Portals NIC occupy approximately 1/8th of the die space typically
occupied by the compute logic circuitry of a modern CPU core.
Consequently, there is space to include additional hardware for accelerating INCA
kernel execution. Here, we consider the impact of general purpose acceleration hardware,
namely ALUs used in parallel to provide SIMD or simple MIMD functionality.
Algorithm 4 INCA-A Parallel Dot Product (n ≤ 256)
1 PUTLM T[0], A[0], f, 256        ▷ Bulk copy.
2 MULM T[0], T[0], B[0], f, 256   ▷ SIMD multiply.
3 PUTL i, C0, i16
4 PUTL R0, i, i16
5 LT R0, R0, C1, i16
6 BLEZ R0, 10
7 ADD c, c, T[i], f
8 ADD i, i, C2, i16
9 JMP 4
10 END


As an illustrative example, consider the INCA microcode for dot product in Algorithm 2 (Chapter 5). With SIMD multiplication available, we can move the multiplication
out of the inner loop, substituting ⌈k/A⌉ SIMD multiplications – where A is the width
of the SIMD instruction – for the original k single-operand instructions. The innermost
addition cannot be SIMD-parallelized easily because it accumulates to a single memory
location, so the loop is not entirely eliminated. The resulting parallel algorithm for inputs
with 256 or fewer operands is shown in Algorithm 4. The contents of array A are copied
to a temporary location (line 1), T, which is then clobbered (line 2) by a new SIMD INCA
instruction, MULM (‘multiply multiple’). The remaining code accumulates the result.
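The substitution of ⌈k/A⌉ SIMD multiplications for k scalar ones can be illustrated with a chunked elementwise product. This is a minimal sketch with hypothetical function names: each chunk of at most `width` elements stands in for one MULM-style instruction, and no claim is made about how the INCA interpreter implements its SIMD instructions.

```python
def simd_multiply_count(k, width):
    """Number of width-`width` SIMD multiplications needed for k operands: ceil(k/width)."""
    return (k + width - 1) // width


def simd_multiply(a, b, width):
    """Elementwise product computed chunk by chunk; returns (result, chunks issued)."""
    out = []
    issued = 0
    for start in range(0, len(a), width):
        chunk_a = a[start:start + width]
        chunk_b = b[start:start + width]
        out.extend(x * y for x, y in zip(chunk_a, chunk_b))
        issued += 1  # one SIMD instruction per chunk
    return out, issued


print(simd_multiply_count(1024, 256))
print(simd_multiply([1, 2, 3, 4, 5], [2, 2, 2, 2, 2], width=2))
```

With a width of 256, any input of up to 256 operands needs a single multiplication, which is the case Algorithm 4 covers.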
We parallelized dot-product and hadamard-product in the manner just described.
We also parallelized matrix-multiplication under the assumption data is provided in
a format that facilitates SIMD operations, e.g., the second input matrix is transposed,
and distinct matrices corresponding to rows of the first input matrix are provided (so
that the Hadamard product can be used to concurrently multiply all columns by a row).
For comparison (see below), we also implemented parallel versions with clobbering for
each of these three kernels, which eliminates the need to copy vectors or matrices to
temporary locations.
The other kernels were excluded from parallelization considerations:
matrix-transpose and matrix-unpack are primarily data-movement kernels. filter is
parallelizable assuming a ternary compare-and-swap operation; currently this is not an
available atomic operation under Portals. linear-interpolation contains a backwards
dependency, and convolution invokes non-trivial logic for handling edge data.
To evaluate the impact of parallelization, we extended the INCA interpreter to include
SIMD versions of arithmetic instructions (MULM, ADDM, etc.). To estimate execution time,
we incorporated inter-byte gap for ‘multiple’ instructions (G in the extended LogGP
model described in Section 6.1), because those instructions involve larger payloads. Runtime calculations use whichever parameter (G or g) gives a longer execution time.
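One way to read the rule "use whichever parameter gives a longer execution time" is sketched below. The exact expression used by INCAsim is not given in the text, so the per-byte term (b − 1)G is borrowed from the LogGP large-message formula in Section 6.1 as an assumption, and the function name and numeric values are illustrative.

```python
def effective_gap_ns(payload_bytes, g, G):
    """Return the larger of the per-message gap g and the per-byte term (b - 1) * G."""
    return max(g, (payload_bytes - 1) * G)


# Illustrative values: g = 5 ns per message, G = 0.04 ns per byte.
print(effective_gap_ns(8, g=5.0, G=0.04))      # small payload: g dominates
print(effective_gap_ns(2048, g=5.0, G=0.04))   # large payload: G term dominates
```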
We posit 256 parallel ALUs, corresponding to approximately half the die space of
a common 32-core CPU, leaving enough die space for NIC logic, processing pipelines
and other miscellaneous circuitry. For payloads with more than 256 operands, multiple
instances of the INCA SIMD instructions are used, unrolled rather than iterated. A
diagram of the proposed SIMD parallel ALU design is shown in Figure 6.5. It should be
noted that due to the reactive nature of computation used by INCA, data for the local
operands can be staged to scratchpad or even local ALU cache (when predictable) prior
to the remote operands arriving in network messages. Local operand caches can range from
a few operands up to a few KiB. For example, a reasonable design may use 4
KiB caches. Feeding 256 such ALU caches at a reasonable rate would typically require
a scratchpad of at least 1 MiB. Designing a scratchpad of 1 MiB up to several MiB within the timing
requirements is feasible, as this matches the size and speed requirements
of modern CPU cache designs.

Figure 6.5: INCA SIMD Unit. Rx FIFO inputs are those from the incoming
data in the message. Local operands are staged well in advance of the operations
in the scratchpad memory and pipelined into the local ALU operand caches on
demand.
Operation notes from the figure:
1) Operands are staged in scratchpad by the host when the program is loaded on the INCA NIC or whenever those operands become available.
2) When the program is triggered by an incoming message, the operands are loaded (pipelined) into the local ALU caches to match the data in the incoming FIFO.

Revisiting Table 6.3, for the parallel-clobber kernels we observe average speedups
(compared to base) of 254.01× for matrix multiplication, 43.33× for dot product,
and 7208.41× for hadamard product. Compared to scratchpad, the respective speedups
are 7.92×, 1.33×, and 240.11×. The large speedup for Hadamard product is due to the
fact the product can often be calculated in a single operation.
We can also compare (for dot product and hadamard product) the clobber and parallel optimizations, and the combination thereof. We observe that while using hardware
SIMD parallelization offers benefits over simply minimizing data copies, the combined impact of parallelization and clobbering renders only a marginal benefit over parallelization
alone. This is because the use of SIMD instructions itself eliminates the data movement
associated with executing the arithmetic instructions serially.
Our basic parallelizations for dot-product and matrix-multiplication are limited
by the fact that their summation phases remain serialized. While this phase cannot be fully
parallelized, it can be restructured as a binary tree reducing n operands to an accumulated
value in ⌈log2 n⌉ steps. To investigate the potential impact of specialized ALU hardware
for effecting this process, we extended the INCA interpreter to include a logarithmic
addition instruction. We assume that each stage in the reduction incurs the same Portals
overhead cost as for the scratchpad configuration (5 ns), but that these stages occur in
a cascade, without requiring re-injection of new instructions (so the gap and memory
latency costs are only incurred once, at the initiation of the instruction). Again referring
to Table 6.3, these advanced parallel dot-product and matrix-multiplication kernels
achieve average speedups over base of 2807.45× and 1354.69×, respectively (85.81× and
42.09× over scratchpad).
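The binary-tree restructuring of the summation phase can be sketched in a few lines. This is an illustrative software version, not the logarithmic addition instruction added to the interpreter: each pass of the loop corresponds to one stage of the reduction, and the function name is hypothetical.

```python
import math


def tree_reduce(values):
    """Sum values by pairwise reduction; return (total, number of stages)."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        # Add adjacent pairs; an unpaired last element is carried to the next stage.
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return (level[0] if level else 0), stages


total, stages = tree_reduce(range(1, 257))
print(total, stages, math.ceil(math.log2(256)))
```

For 256 operands the reduction takes 8 stages instead of 256 serial additions, which is the source of the gain attributed to the adv-parallel rows.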

6.4 Conclusion

In this chapter, we have continued to explore the capabilities of SmartNICs, in particular,
NICs supporting INCA. To this end, we presented INCAsim, a simulator that works with
the INCA-A interpreter to estimate INCA kernel runtimes using a version of the LogGP
model, extended to include memory access costs. Using the INCA ecosystem (INCA-Q, Q-compiler, INCA-A, interpreter), we developed a set of representative kernels and
evaluated their runtimes under different workloads, software optimizations, and hardware
optimizations. As a result of this exploration, we highlight the following conclusions.
First, it is clearly desirable to have on-NIC memory dedicated to INCA program execution. While runtimes will decrease as network speeds increase (Section 6.3.2), and it
may be possible to reduce the costs of accessing main memory by aggregating accesses or
pre-fetching, given the current runtimes, it is not clear these strategies are worth pursuing. Instead, since low-latency on-NIC memory is already standard within SmartNICs,
including a scratchpad is a reasonable solution.
Second, avoiding unnecessary data movement significantly reduces runtimes. For
example, with clobbering, 8 KiB Hadamard product is approximately 20% faster than
the non-clobbering version, and dot product about 10% faster. Adjusting hardware to
allow ternary INCA-A instructions – e.g., placing the result of a binary operation in
a location distinct from that of either of the two operands – is even more effective at
avoiding data movement. For example, avoiding data movement to temporary locations
in the 8 KiB scratchpad dot product kernel reduces runtimes from 93.23 µs to 70.8 µs,
making the optimization comparable to introducing SIMD instructions.
The third conclusion is that it is reasonable to take advantage of available on-chip silicon area to include additional hardware for accelerating INCA kernel execution. Through
a combination of hardware acceleration and intelligent data staging, INCA performance
can, in at least some cases, be made comparable with contemporary CPUs, e.g., 8 KiB
matrix multiplication is 47.42 µs on INCA (Table 6.3, adv-parallel) versus 10.56 to 139.49 µs on a contemporary
CPU.
The results presented in this chapter shed light on what one can expect from INCA
kernels with respect to their runtimes. In the following chapter, we explore their potential
impact on host application performance.

Chapter 7
INCA: Host Applications
In previous chapters, we investigated how one particular ‘smart’ capability – message
matching – copes with overheads incurred by emerging multithreaded communication
paradigms (Chapter 3), demonstrated that when coordinated with other task-specific
offloaded capabilities, message matching enables general-purpose, deadline-free computation on the NIC (Chapter 4), and showed that a SmartNIC based on this model –
i.e., an INCA SmartNIC – can execute some kernels with runtimes comparable to those
of contemporary CPUs (Chapter 6). In this chapter, we continue this progression of
inquiry by considering what INCA offloading affords with respect to speedups of host
applications, namely, by offloading components of those applications to the NIC.
There are at least two scenarios where utilizing INCA to offload parts of host applications has the potential to benefit those applications. The first is by offloading ‘near-network’ parts of host applications, i.e., work that the host application does on data
immediately before sending or immediately after receiving, e.g., data packing and unpacking, performing an initial matrix operation, etc. Since the data being operated on is
passing through the NIC, it is reasonable to offload these near-network components and
have the NIC deliver the transformed results. If this offloaded work can be overlapped
with other work done on the host CPU, the result is an accelerated runtime.
The second scenario involves applications with idle networks. For example, an application
processing sensor data (e.g., from satellites) may periodically receive an influx of
new data. However, outside of that relatively short burst, the network is idle; applications executing on host CPUs are compute-bound, attempting to produce useful results
before the next influx. In cases such as these, it may be possible to utilize NIC resources
as a co-processor by loading a kernel to the NIC and assigning it some chunk of the inmemory data to process. This could speed up the application by allowing it to complete
earlier, or it could free up CPU resources to do other work that leads to better results.
In this chapter we explore both of these scenarios. In Section 7.1, we evaluate the
potential speedups afforded by INCA for a selection of applications and miniapps when
near-network functions are offloaded. In Section 7.2, we present a preliminary investigation into the potential for INCA to accelerate an application with mostly-idle networks.

7.1 Offloading near-network functions

By providing the capacity to execute arbitrary kernels, INCA offers opportunities for
accelerating applications by overlapping host application compute with more than mere
core network applications. To assess the potential for the acceleration of applications,
we studied a set of proxy applications identified by the Exascale project [179] and one full
application.
MiniAMR [180] is an adaptive mesh refinement code designed to represent a range of
applications. It should be noted that a general solver is used in this application, as the
target of the proxy application is to capture the behavior of an adaptive mesh, rather
than the computational kernels. MiniAMR uses pack/unpack and interpolation routines
similar to the INCA kernels described above.
MiniMD [180] is a molecular dynamics code that is a subset of the application
LAMMPS [171]. This code runs the Lennard Jones Liquid solver from LAMMPS that
is based on a halo exchange pattern. This application was selected because it runs a
common simulation mode of a major application. To analyze the potential for INCA,
we adapted the computation to separate internal data dependencies, the overlap target, from external dependencies, the INCA kernel target. This process also allows for
communication and computation to be overlapped.
MiniFE [180] is a finite element code built around a conjugate gradient solver. This application was selected for its broad applicability to major production applications. In particular, MiniFE’s solver features a matrix-vector multiply that can be split into two
separate computation phases, one handling intra-rank data dependencies and the other
handling inter-rank data dependencies. Because these phases do not depend directly
on each other, the inter-rank computation can be translated directly into an INCA
kernel without restructuring the application. MiniFE spends an observable amount of
time in its setup phase, time that would be negligible in a production application.
Therefore, to examine the impact on real applications, we consider only the section that
dominates the runtime of production codes: the solver.
LAMMPS [171] is a full molecular dynamics application on which MiniMD is based.
For this test, we ran the same problem as with MiniMD. The adaptations here mirror
those of MiniMD discussed above. Many of the solvers in LAMMPS share the same
structure and can be converted to INCA in a similar manner.
For each application, we identified code that could potentially be offloaded as an INCA
kernel, subject to several criteria. First, these potential “INCA targets” should come
directly before or after a communication region, i.e., they are ‘near-network’. Second,
the target needs to take a significant portion of the application’s runtime. Finally, the
target has to be separable from other computation. For MiniFE, MiniMD, and LAMMPS,
there were computational regions that could easily be separated to internally dependent
(intranode) and externally dependent (internode) computation. As an initial evaluation,
we measured the target kernel as just the external computation. Additionally, for these
applications we analyzed the maximum possible speedup if we could also refactor the code
to better split the computation between the INCA processor and the CPU. This analysis
has the caveat that it doesn’t account for variations in the throughput between an INCA
processor and a CPU. In contrast to the other three apps, MiniAMR does not have easily
separable regions, so MiniAMR required a refactor to stagger data dependencies to allow
INCA to run in parallel with the CPU.
For code modification, we set a limit of no more than 10 lines of source code (not
including the INCA kernel) to meet our definition of not refactoring the code. Non-refactored code represents a ‘first pass’ INCA implementation expected to take a few
hours to several days of developer time. This requires modifying the code to allow communication to overlap with internal computation, preventing external computation from
occurring on the host, and implementing the INCA kernel. The effort required to address the first
two is relatively minimal: MiniFE already overlaps communication with computation, and
required removing nine lines to address the external computation. Similarly, MiniMD required under 10 lines to effect communication overlap, and two additional lines to remove
externally dependent computation from the CPU code. Converting one of the LAMMPS
solvers requires a similar effort to MiniMD.
For each application, we timed the INCA target kernel, communication, and the computation phase the target kernel would overlap. Table 7.1 shows the results of profiling
these applications averaged over 10 runs. For non-refactored code, we observe a potential runtime improvement of up to 2.98% in MiniFE, up to 11.0% in MiniMD, and
up to 11.5% in the LJ cut solver for LAMMPS. With a structural refactoring, we can
leverage the INCA’s processing power to further parallelize these applications. With
MiniFE and MiniMD our refactor predictions are the result of moving some of the internal computation to the INCA kernel. For MiniAMR the process is more complex,
where computation starts and continues as an unpack INCA kernel is making incoming
data available and a pack INCA kernel is preparing completed data for transfer. With
refactoring, we project speedups of 26%, 37.2%, 25.7%, and 28.9% for MiniAMR, MiniMD, MiniFE,
and LAMMPS, respectively.

                                      MiniAMR   MiniMD    MiniFE    LAMMPS
Runtime                               174 s     43.1 s    37.0 s    39.2 s
Communication                         52.8 s    6.98 s    4.36 s    6.23 s
INCA Target                           45.4 s    4.73 s    1.10 s    3.80 s
Overlap Target                        76.3 s    34.3 s    22.3 s    21.5 s
Potential speed-up without refactor   N/A       11.0%     2.98%     11.5%
Potential speed-up with refactor      26%       37.2%     25.7%     28.9%

Table 7.1: Potential application impact

There are complexities that this analysis doesn’t account for, namely performance
variation from laggard processes and network congestion. When data arrival is delayed,
the amount of time available to run INCA kernels before the host needs the data
may be reduced such that the data must be handed off to the host before processing is
complete, lengthening overall runtime because less work can be overlapped.
However, even in the unlikely case where extreme performance variation halved the performance improvement shown in this analysis, leveraging INCA kernels
could still have a significant impact on the runtime of applications at scale.

7.2 NIC as co-processor

The applications considered in the preceding section follow a bulk synchronous processing
pattern where work phases alternate with communication phases. Consequently, we chose
targets for offloading that are ‘near-network’ in the sense they involve working on data
that is either incoming or departing the host.
However, some applications have predominantly idle networks because they only engage in communication for a small portion of their overall runtime. For example, satellite
data may only be delivered once an hour, with the ingress phase lasting minutes, and
the remaining time dedicated to processing by the host application. During the non-communication phase of these applications, the network is idle. Moreover, the resolution
of processing done on incoming data is indirectly constrained by the requirements of
finishing the processing prior to the next influx of data, so gains on the order of microseconds can be high impact. Since INCA harvests idle network resources, this type of
application presents a prima facie opportunity for an INCA NIC to serve as a coprocessor
(rather than near-network processor) to which work could be offloaded.
As a preliminary investigation into this NIC-as-coprocessor scenario, we profiled an
application tasked with processing satellite data for purposes of enabling over-the-horizon
radar. Data arrives hourly, ingress lasts approximately four minutes, and the system is
expected to finish processing before the next batch arrives approximately 56 minutes
later. Incoming data is subject to some initial complex processing (which,
given the results reported above, we do not currently consider a candidate for INCA
offloading), at which point it is placed in memory. Once in memory, the data is traversed
to perform a version of standard ray tracing. This portion of the processing pipeline
was profiled to identify bottlenecks. On the basis of this profiling, we identified the
range_doppler function as a potential bottleneck (763.70 ms/call). Within this function, solve_normal_dist was called over two million times, at 0.0393 µs/call, for a total
time spent in the function of more than 78600 µs. The convolve_normal function was
also identified as time-consuming, at 16.63 µs per call. The question, then, is whether
INCA can help alleviate these bottlenecks.
For these simulations, we assumed a scratchpad configuration with local loopback, as
described in Chapter 6. Network speed is InfiniBand HDR (400Gb/s) and packets are
64B.
Algorithm 5 is a pseudocode depiction of the original solve_normal_dist
code identified by the profile. While the logic is simple, the exponential function (exp) is
non-trivial. To accommodate exp under the INCA simulator, we inspected the standard
C math library using objdump and, given those results, modified INCAsim to charge for
50 binary instructions. In comparison to the roughly 40 ns/call reported by the profiling,
the INCA implementation requires 300 ns/call. So, assuming offloaded INCA kernels
overlap with the same function executed on the host, using INCA could trim 9104
µs from the execution time of the application by offloading solve_normal_dist, i.e.,
approximately 11% of the current time spent in the function.

Algorithm 5 Solve Normal Distribution
procedure SolveNormalDist(x, sigma, x0, mag)
    result ← 0.0
    xoffset ← x − x0
    twosigsq ← 2.0 ∗ sigma ∗ sigma
    if twosigsq ≠ 0.0 then
        result ← mag ∗ exp(−(xoffset ∗ xoffset)/twosigsq)
    end if
    return result
end procedure
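As a minimal sketch, Algorithm 5 can be transcribed into Python as follows. The function and argument names mirror the pseudocode; the use of Python's math.exp is an assumption made for illustration and says nothing about how the original code or the INCA kernel evaluates the exponential.

```python
import math


def solve_normal_dist(x: float, sigma: float, x0: float, mag: float) -> float:
    """Scaled Gaussian term from Algorithm 5; returns 0.0 when sigma is zero."""
    result = 0.0
    x_offset = x - x0
    two_sig_sq = 2.0 * sigma * sigma
    # Guard against division by zero for a degenerate (zero-width) distribution.
    if two_sig_sq != 0.0:
        result = mag * math.exp(-(x_offset * x_offset) / two_sig_sq)
    return result
```

At the centre of the distribution (x equal to x0) the exponent is zero, so the function returns mag; the single call to exp is the step that dominates the cost discussed in the text.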
In the INCA-A implementation of solve_normal_dist, the exponential function call
is by far the most expensive: 231.56 ns, or 77.2% of the total kernel execution time. In
Chapter 6 we noted that on-chip real estate is available for additional acceleration hardware. Furthermore, because of its importance to scientific computing, the exponential
function has been the target of acceleration attempts, usually involving FPGAs.
Publication
Detry & Dinechin [181]
Jamro et al. [108]
Pottathuparabil & Sass [182]
Wielgosz et al. [183]
De Dinechin & Pasca [107]
Alachiotis et al. [106]
Yuan & Xu [184]

Precision

Freq (MHz)

Latency (cycles)

Time (ns)

Single
Double
Double
Double
Double
Double

100
161
100
200
310
252
230

27
258
30
35
224
23

85
167.7
2580
150
112.9
888.9
100

Table 7.2: Latencies for FPGA offloaded exponential function.

Table 7.2 summarizes a selection of research on offloading the exponential function
to FPGA. As this survey indicates, execution times for FPGA-accelerated exponential
functions can be near 100ns. Assuming, then, the NIC is equipped with hardware acceleration support for the exponential function, the kernel runtime is reduced to 172ns.
Under this scenario, total time spent in the function is reduced to 63981 µs, saving over
14 ms, a 19% improvement.
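The totals quoted for solve_normal_dist are consistent with a simple load-balancing model, sketched below. The model rests on an assumption that the text does not state explicitly: calls are divided between host and NIC so that both finish at the same time. The function name, the parameter names, and the choice of exactly two million calls are illustrative.

```python
def overlapped_time_us(calls: int, host_us: float, nic_us: float) -> float:
    """Time to finish `calls` invocations when host and NIC work concurrently.

    Assumes the calls are split so both sides finish together, in which case
    the host handles a fraction nic_us / (host_us + nic_us) of the calls.
    """
    host_only_total = calls * host_us
    host_share = nic_us / (host_us + nic_us)
    return host_only_total * host_share


CALLS = 2_000_000   # "over two million" calls, taken here as exactly two million
HOST_US = 0.0393    # profiled host time per call, in microseconds

scratchpad = overlapped_time_us(CALLS, HOST_US, 0.300)  # 300 ns INCA kernel
hardware = overlapped_time_us(CALLS, HOST_US, 0.172)    # 172 ns with an exp accelerator
print(round(scratchpad), round(hardware))
```

Under these assumptions the sketch gives about 69496 µs for the scratchpad configuration and about 63981 µs with hardware support, i.e., savings of roughly 9104 µs and 14.6 ms relative to the 78600 µs host-only total.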
Algorithm 6 Convolve Normal
procedure ConvolveNormal(impsize, imp, in, numpts)
    maxindex ← numpts − 1
    midpoint ← impsize/2
    for i ← 0, maxindex do
        sumval ← 0.0
        sumdat ← 0.0
        for j ← 0, impsize do
            if i ≤ midpoint then
                if midpoint − i − j > 0 then
                    val ← in[midpoint − i − j]
                else
                    val ← in[j − midpoint + i]
                end if
            else if i ≥ maxindex − midpoint then
                if i − midpoint + j > maxindex then
                    val ← in[2 ∗ maxindex − i + midpoint − j]
                else
                    val ← in[j − midpoint + i]
                end if
            else
                val ← in[i − midpoint + j]
            end if
            sumval ← sumval + imp[j] ∗ val
            sumdat ← sumdat + val
        end for
        result[i] ← sumval − (sumdat/impsize)
    end for
    return result
end procedure
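For readers who prefer executable code, the following Python sketch follows the structure of Algorithm 6. It makes several assumptions that are not in the original: sizes are taken from the lengths of the sequences rather than passed as arguments, the midpoint uses integer division, the inner loop visits each impulse tap exactly once, and the three-way branch is collapsed into a single index reflection (which should agree with the pseudocode for inputs comfortably longer than the impulse response).

```python
from typing import List, Sequence


def convolve_normal(imp: Sequence[float], data: Sequence[float]) -> List[float]:
    """Convolve `data` with impulse response `imp`, reflecting at both edges."""
    impsize = len(imp)
    maxindex = len(data) - 1
    midpoint = impsize // 2
    result = [0.0] * len(data)
    for i in range(len(data)):
        sumval = 0.0
        sumdat = 0.0
        for j in range(impsize):
            k = i - midpoint + j          # unreflected input index
            if k < 0:
                k = -k                    # reflect about the first sample
            elif k > maxindex:
                k = 2 * maxindex - k      # reflect about the last sample
            val = data[k]
            sumval += imp[j] * val
            sumdat += val
        # Subtract the mean of the samples that fell under the window.
        result[i] = sumval - sumdat / impsize
    return result
```

Because the window mean is subtracted at every output point, a constant input convolved with a kernel whose taps sum to one yields zeros everywhere, which is a convenient sanity check.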

Algorithm 6 is a pseudocode representation of the original convolve_normal
C++ code. As can be seen, the logic for this function is considerably more complicated
than for solve_normal_dist. According to the profile, the convolve_normal function
is called 1713 times at 16.63 µs/call, for a total time of 28487.19 µs.

                                 Time in function
Function             Original    INCA scratchpad       INCA hardware
solve_normal_dist    78600 µs    69496 µs (1.13×)      63981 µs (1.23×)
convolve_normal      28487 µs    28470 µs (1.0006×)    23628 µs (1.21×)

Table 7.3: Time spent in function call without and with offloading to INCA
under the scratchpad configuration (‘INCA scratchpad’) and INCA under the
scratchpad configuration with additional hardware acceleration support (‘INCA
hardware’). Speedups relative to the original time are given in parentheses.
We converted the convolve_normal code into INCA-Q and compiled an INCA-A
kernel. The estimated execution time of this kernel is 25962.06 µs. Although this is significantly
slower than the CPU-side code, a single INCA call could be overlapped with the host-side calls,
reducing the time spent in the function by 16.63 µs.
Motivated in part by the rise of deep neural networks, there has been considerable research into accelerating varieties of convolution using FPGAs or other hardware.
Such hardware is capable of outperforming GPUs on, e.g., convolutional neural network
processing [109]. Mohammad et al. [185] report a 19.298 ns latency for their FPGA
implementation of a 4x4 convolution, and the parallel FPGA approach developed by
Strom [186] shows 47000 µs latency for processing an 852x480 image. convolve_normal
operates over 512 elements, so roughly extrapolating from the per-element timings, we
project that an FPGA-based convolution accelerator applied to this particular task would
incur a latency between 700 ns and 60 µs. Since this form of acceleration would effectively transform the majority of the INCA-A kernel into a single use of a specialized
INCA-A instruction (CONV or similar), any remaining overhead is due to data staging or
movement. Suppose, for the sake of discussion, that the total accelerated INCA kernel
runtime is 80 µs. Under this scenario, execution for the 1713 calls to convolve_normal
is reduced to about 23628 µs.
Table 7.3 summarizes the results of this preliminary study. In this table, speedups
are relative to the amount of time spent in the function call in the original code. All
INCA use cases offer modest reductions of time spent in the bottleneck function relative
to the original. However, as noted above, even modest speedups can be high impact by
enabling higher-resolution results.
This is only a preliminary study. For example, INCA kernel execution times do not
take into account the costs of ‘calling’ the INCA kernel. Moreover, speedups are calculated under the assumption that calls to the same function can be overlapped. However,
in the actual application, it may be possible to offload more calls to these functions
because they can be overlapped with other work to be done by the host. Identifying
these opportunities will require a more thorough examination of the target application,
a project left to future work.

7.3 Conclusion

In this chapter, we’ve extended our inquiry into what SmartNICs can do by exploring two
scenarios where INCA may accelerate application performance: offloading ‘near-network’
aspects of host applications under BSP communication patterns, and treating the NIC
as a co-processor for addressing bottlenecks in applications where the network is more
often idle than not. In the former case, we found that under ideal conditions, INCA offers
speedups of roughly 3% to 37%. In the latter, we presented preliminary results suggesting that by
harvesting idle network resources, INCA can offer speedups up to 1.23×, depending on
available hardware acceleration support.
Both scenarios considered in this chapter involve offloading parts of host applications
for purposes of accelerating those applications. However, INCA also affords the opportunity to offload applications that execute autonomously from host applications, entirely
within the network. In Chapter 8, we explore this opportunity in more depth.

Chapter 8
Independent Applications for Adaptive Networks
Variations in network workloads can impact overall quality of service. For example, large
(‘elephant’) flows can disrupt applications by consuming available network resources for
extended periods of time, large numbers of unexpected messages arriving in a short
time period may exhaust receive-side buffers (the ‘ingress problem’), or incoming RDMA
traffic may create contention across the receiver’s memory bus, decreasing application
performance [187]. Likewise, distinct jobs executing on the same network can interfere
with each other, making scheduling and resource allocation (e.g., job placement on the
network topology) important considerations [188], [189].
Given their location between host application and network, as well as their advanced
processing capabilities, in-path, deadline-free NICs such as INCA make attractive candidates for addressing some of these issues. Specifically, applying techniques from machine learning (ML) and artificial intelligence (AI) may enable adaptive or self-learning
networks, i.e., networks that anticipate and dynamically adapt to changing network configurations. For example, the capacity to predict a large amount of data will arrive over
a short period of time may allow the NIC to intelligently schedule DMA transfers (to
address memory contention), preemptively adjust credits assigned to senders (to avoid
congestion), or schedule processors (in a sPIN-type NIC [54]) to allow additional time
for packet processing.
In this scenario, offloaded ML kernels are independent applications in the sense defined
in Chapter 2: they neither service traditional network applications, nor are they parts
of host applications. Rather, they are applications that execute autonomously, within
the network. Moreover, the deadline-free nature of INCA kernel execution is essential.
What ought to drive such applications is not the need to operate within the gap defined
by the network speed, but rather the need to generate predictions that are useful at
whatever time scale is required, and this requirement is independent of gap. That is,
constraining a solution by network speed is an artificial constraint on processing. By
providing deadline-free kernel execution, INCA avoids this unnecessary constraint.
These considerations raise three questions. First, is it feasible to offload ML kernels
for purposes of traffic prediction to an INCA-enabled SmartNIC? Second, do these ML
kernels generate reasonably accurate predictions? And third, how well do these kernels
address one or more of the issues raised above, e.g., in terms of increasing QoS or decreasing application runtimes? In this chapter, we take an important first step towards
enabling adaptive networks by addressing the first and second of these questions. We
design and implement a series of INCA kernels that accurately and effectively predict
local network traffic. These kernels reside entirely in the NIC, operating independently
of any host applications. To our knowledge, this is the first study demonstrating the feasibility of fully offloading ML kernels into the network for purposes of enabling adaptive
networks.
In Section 8.1, we identify constraints on acceptable ML kernels imposed by INCA
and the task demands, broadly construed. In Section 8.2 we list the applications whose
traffic patterns we chose to profile. The system used to execute the applications, the
data collected, and the means by which it was collected is described in Section 8.3. In
Section 8.4 we identify two candidate ML kernels – simple linear regression and variations
on rolling linear regression – selected on the basis of a consideration of features of the raw
data as well as the constraints established previously. The results of each ML method are
presented in Sections 8.5 and 8.6. For each, we discuss the accuracy of the predictions
generated by the ML method, followed by an analysis of the resource requirements of the
corresponding INCA kernel(s).

8.1 Constraints on ML kernels

Since INCA’s compute capabilities are Turing complete (Chapter 4), an INCA SmartNIC
can in principle run any ML algorithm. Moreover, because the architecture is in-path, kernel execution is not subject to deadlines imposed by network speeds (Chapter 2). In practice,
however, several constraints limit the types of ML kernels that can be executed.
First, as noted in Chapter 6, we assume on-NIC memory is limited (e.g., ≤ 1MiB). Second, predictions must be generated quickly enough to be useful for the task they
support. For example, adjusting credits in anticipation of a possible ingress
event requires sufficient time to accomplish that adjustment prior to the event occurring,
and this includes sending messages to senders with updated credit information. InfiniBand EDR (100 Gb/s) point-to-point latency is roughly 1 µs for small messages, so any
prediction must allow for at least 2 µs before the predicted event.
Third, there are interactions with sampling rates that must be taken into consideration, especially if the model is dynamic, i.e., updated to take into account incoming
data. With dynamic methods, the ML kernel must both update the model and generate
predictions with sufficient time to handle the next incoming data point. Alternatively,
incoming data could be buffered and processed in batches, and predictions generated
between batches. Finally, a third possibility is to have two distinct kernels executing,
one to update the model, and another to generate predictions using the most recently
available parameters. In this study, we choose the simplest solution, namely, to keep the
ML kernel execution time within the sampling period.
For the purposes of this study, these considerations imply two constraints on ML
kernel selection. First, a suitable kernel should have a relatively small memory footprint.
Second, kernels with short (e.g., less than 50 ms) runtimes are desirable.

8.2 Applications

For this study, we investigated the network traffic of seven applications, proxy applications, and miniapps (henceforth collectively referred to as ‘applications’). The HPCG
(High Performance Conjugate Gradients) benchmark is a widely-used I/O bound benchmark that models data access patterns of common scientific workloads [190]. LAMMPS
is a classical molecular dynamics code [191]. For this study, we ran the Lennard-Jones
(atomic fluid) and Rhodo (rhodopsin protein) benchmarks. Lulesh is a shock hydrodynamics proxy application [192]. MILC is a quantum chromodynamics code [193]. The
final three miniapps are taken from the Mantevo miniapps suite [194]. MiniAMR performs adaptive mesh refinement, MiniFE is an unstructured implicit finite element
solver, and MiniMD solves a molecular dynamics problem.

8.3 Data collection

All applications were executed on an ARM-based system, where each node has two
sockets, and each socket contains a 32-core Cavium ThunderX2 ARM CPU operating
at 2GHz. The network is 4x EDR (100 Gb/s) InfiniBand using NVIDIA Mellanox
ConnectX-5 NICs. All applications were executed on eight nodes with one MPI process per node.
CPU hardware performance counters are a popular source of input data for ML tasks
in HPC and networking, appearing in studies involving memory contention, resource
allocation, and intrusion detection [195]–[198]. Like CPUs, ConnectX-5 NICs also include
performance counters. These counters track features such as number of bytes and packets
sent and received, number of dropped packets, number of buffer overruns, packet sequence
errors, and so on. For this study, we focus solely on predicting the amount of received
data.
One way to access the ConnectX-5 hardware counters is through a dedicated system call, perfquery, which requires root access. Counters are also exposed through
the filesystem, so can be accessed without root privileges using cat. Based on a simple
benchmark that queries each of these access methods, we found that the maximum sampling rate using the system call method is 22.22 Hz (period = 45.01 ms), and through the
filesystem the maximum is 105.93 Hz (period = 9.5 ms). Of course, any kernel operating
on-NIC will not be constrained by either of these methods, and will be capable of much
higher sampling frequencies.
For the results reported here, we use the filesystem access method with a sampling frequency of 20.0 Hz. We focus on bytes received, as reported by the NIC’s port_rcv_data
counter. Our data gathering tool is pinned to one core on one socket, and does not itself
engage in any communication through the IB port. The MPI processes of the applications
are pinned to cores on the other CPU. The hardware counters used for this study only
collect data for bytes received by processes residing on the CPU used by the application.
Each application was executed 11 times, providing 11 data sets for each rank of each
application.
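A user-space sampler along the lines described above might look like the following minimal sketch. It is not the tool used in the study. The sysfs path illustrates where Linux typically exposes InfiniBand port counters; the device name (mlx5_0), the port number, and all function names are assumptions, and a simple sleep between reads will drift slightly from the nominal rate.

```python
import time
from pathlib import Path
from typing import Callable, List


def read_counter(path: Path) -> int:
    """Read one integer counter value exposed as a text file."""
    return int(path.read_text().strip())


def sample_counter(read: Callable[[], int], rate_hz: float, n_samples: int,
                   sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Collect n_samples readings at roughly rate_hz, relative to the first."""
    period = 1.0 / rate_hz
    baseline = read()
    samples = [0]  # totals are reported relative to the start of sampling
    for _ in range(n_samples - 1):
        sleep(period)
        samples.append(read() - baseline)
    return samples


if __name__ == "__main__":
    # Hypothetical device and port; adjust for the system at hand.
    counter = Path("/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data")
    if counter.exists():
        print(sample_counter(lambda: read_counter(counter), rate_hz=20.0, n_samples=5))
```

Passing the reader and the sleep function as parameters keeps the sampling loop testable without InfiniBand hardware; at 20 Hz the period works out to 50 ms per sample.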

8.4 ML kernels

Figure 8.1 shows raw counter data (bytes received) for a single run of each application.
Since the applications are executed with one MPI process per host, the eight hosts shown
in each graph show the data received by each MPI rank in the application (plus any
additional network traffic).
Most of the applications exhibit similar gross behavior: after an initial startup phase,
data arrives at a more-or-less steady rate, and then levels off as the application enters
a teardown phase. The duration of the startup phase before sustained communication
begins varies from around 5 samples (250 ms) in the case of Lulesh to over 4000 samples
(200 s) for MiniFE. This period likely includes samples from before the application launch
simply because the tool for gathering data must be launched before the application. While
the startup period may appear flat, it can involve the reception of small amounts of data.
For instance, in the run of HPCG shown in Figure 8.1, sustained communication begins at
596 samples (29.8 s), but by 200 samples nearly all nodes have received multiple influxes
of small amounts of data (< 600 B). Likewise, by 200 samples half of the MiniFE nodes
have received more than one influx of data (< 4 KiB). As with the startup phase, the
teardown phase may differ in length across applications, and in the raw data, includes
any additional samples that might occur between exiting the application and terminating
the data gathering tool.

[Figure 8.1: one panel per application for Run 0 (HPCG, LAMMPS-lj, LAMMPS-rhodo, Lulesh, MILC, MiniAMR, MiniFE, MiniMD), each plotting port_rcv_data (MiB) against sample number, with one curve per node.]

Figure 8.1: Examples of raw data (bytes received) acquired from the NIC
port_rcv_data hardware counter for the first run of each application. Byte
totals are adjusted to the start of the application run.

Application     Pr(data)   Pr(¬data)
HPCG            0.15       0.85*
LAMMPS-lj       0.03       0.97*
LAMMPS-rhodo    0.09       0.91*
Lulesh          0.93*      0.07
MILC            1.00*      0.00
MiniAMR         0.79*      0.21
MiniFE          0.14       0.86*
MiniMD          0.06       0.94*

Table 8.1: Mean probability of data arriving at any given point during the main
work/communication phase, calculated across all participating NICs for all runs.
Standard deviation was ≤ 0.02 for all means. An asterisk marks the category
with the higher probability.
During the main work and communication loop, nodes for some applications exhibit
the ‘stairway-with-landings’ pattern one expects from bulk synchronous processing applications (LAMMPS-lj, LAMMPS-rhodo, MiniMD). Likewise, the pattern of data received
across MiniAMR nodes reflects the fact that the miniapp performs adaptive mesh refinement
during execution, so workloads are adjusted across nodes.
To gain a better understanding of the data, we calculated the probability of data
arriving at any given point in the application execution (not including startup and teardown phases). The results of this analysis are shown in Table 8.1. As this analysis shows,
the applications fall into two categories: those that maintain a more-or-less steady influx
of data (Lulesh, MILC, and to a lesser extent, MiniAMR), and those where data arrives
infrequently (HPCG, LAMMPS-lj, LAMMPS-rhodo, MiniFE, and MiniMD). In what
follows, we refer to the former class as ‘dense’ applications, and the latter as ‘sparse’.
This distinction will play a role in understanding some of the ML techniques explored
below.
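One plausible way to estimate this probability from a trace of cumulative byte counts is the fraction of samples at which the counter increased. The sketch below assumes that definition, which the text does not spell out. The function names and the 0.5 threshold separating 'dense' from 'sparse' are illustrative choices, although the threshold does match the higher-probability category reported in Table 8.1.

```python
from typing import Sequence


def arrival_probability(counts: Sequence[int]) -> float:
    """Fraction of sampling intervals in which the cumulative counter grew."""
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    if not deltas:
        return 0.0
    arrivals = sum(1 for d in deltas if d > 0)
    return arrivals / len(deltas)


def classify(probability: float, threshold: float = 0.5) -> str:
    """Label a trace 'dense' when data arrives in at least `threshold` of samples."""
    return "dense" if probability >= threshold else "sparse"


# Toy trace of cumulative bytes received, already trimmed to the main phase.
trace = [0, 0, 10, 10, 10, 30]
p = arrival_probability(trace)
print(p, classify(p))
```

In the toy trace the counter grows in two of the five intervals, so the estimate is 0.4 and the trace is labelled sparse.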
The main work and communication loop is of primary importance for a traffic prediction task, since that is where the vast majority of data arrives. Focusing only on this
phase, given the gross behavior of the data as shown in Figure 8.1, and the constraints on
memory and execution time given in Section 8.1, we determined that ordinary linear
regression (LR) and variations thereof are reasonable candidate ML methods: during this
phase data arrives at a more-or-less steady rate, and LR-based methods have small memory
and runtime footprints. In
the remainder of this section, we describe the regression methods used in the study.

8.4.1 Ordinary LR

The first LR method we consider is ordinary LR, i.e., simple linear regression using the
ordinary least squares method. This is a static method in the sense the model is not
updated during application execution. Instead, a possible use case is to have a library of
pre-trained models (which are nothing more than sets of pairs of intercepts and slopes),
one of which is loaded at application launch time to the NIC along with the INCA kernel
that performs predictions when requested.
As previously noted, the raw data includes a main phase of sustained communication where the majority of data arrives, with (perhaps optional) plateaux on either side representing startup and teardown phases. Since our target is the sustained
work/communication phase, our training script pre-processes the raw data to remove
these plateaux by trimming the starting phase up until the first sample at which data
is received, and trimming the teardown phase after the last sample at which data is
received.

[Figure 8.2: timeline showing a training window of preceding samples ending at the latest sample, followed by prediction points p0 to p4.]

Figure 8.2: Rolling linear regression. As each new sample is acquired, the model
is trained on the contents of a training window comprising the preceding k
samples, and predictions calculated for some number of points pi in the future.
For each application, for each rank in each of the 11 runs, the model is trained on
the data from that rank, and then tested against the same rank data from the remaining
10 runs. For each rank, root mean squared error (RMSE) is calculated for both the 11
training instances and the 110 testing instances.
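The train-on-one-run, test-on-another procedure can be sketched in Python as follows. This is a minimal sketch, not the training script used in the study: the function names and the two example runs are hypothetical, and the sample index is assumed to serve as the independent variable.

```python
import math


def trim_plateaux(samples):
    """Keep the span from the first to the last sample at which data arrived.

    `samples` holds cumulative received-data counts; data "arrived" at
    index i when samples[i] > samples[i - 1].
    """
    arrivals = [i for i in range(1, len(samples)) if samples[i] > samples[i - 1]]
    if not arrivals:
        return list(samples)
    return list(samples[arrivals[0]:arrivals[-1] + 1])


def fit_ols(ys):
    """Ordinary least squares over x = 0, 1, ..., len(ys) - 1 (needs >= 2 points).

    Returns (intercept, slope).
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, ys):
    """Root mean squared error of the line `model` against the samples `ys`."""
    b0, b1 = model
    return math.sqrt(sum((b0 + b1 * x - y) ** 2 for x, y in enumerate(ys)) / len(ys))


# Hypothetical cumulative traces from the same rank in two different runs.
train_run = trim_plateaux([0, 0, 0, 2, 4, 6, 8, 8, 8])
test_run = trim_plateaux([0, 3, 5, 7, 9, 9])
model = fit_ols(train_run)
train_error = rmse(model, train_run)
test_error = rmse(model, test_run)
```

In the study itself this would be repeated for every rank and every choice of training run, giving the 11 training and 110 testing instances per rank.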

8.4.2

Rolling LR

In addition to the static method described above, we also considered two dynamic LR
techniques: rolling (ordinary) LR, and rolling weighted LR. These methods attempt to
reduce error by adapting the model to changes in the amount of received data as they
occur, and hence may provide more accurate prediction, e.g., when in plateau phases
such as those seen in LAMMPS-lj and MiniMD (Figure 8.1).
Both types of rolling LR methods share the same basic strategy, shown in Figure 8.2.
A window of size k samples is ‘rolled’ along the sequence of samples as they arrive. At
each sample, the model is trained on the samples in the training window, and predictions
are generated for one or more points pi in the future.
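The rolling strategy can be sketched in Python as follows. This is an illustrative sketch only: the names `fit_line` and `rolling_predictions` are hypothetical, the window is assumed to end at (and include) the latest sample, and x-coordinates are taken to be positions within the window.

```python
def fit_line(ys):
    """Least-squares line over x = 0..len(ys)-1; returns (intercept, slope)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rolling_predictions(samples, k, horizons=(1,)):
    """For each time t with a full window of k >= 2 samples, refit the line on
    samples[t-k+1 .. t] and predict the value h samples ahead for each h."""
    predictions = {}
    for t in range(k - 1, len(samples)):
        b0, b1 = fit_line(samples[t - k + 1:t + 1])
        # The latest sample sits at x = k - 1, so t + h maps to x = k - 1 + h.
        predictions[t] = [b0 + b1 * (k - 1 + h) for h in horizons]
    return predictions


ramp = rolling_predictions([0, 1, 2, 3, 4, 5], k=3, horizons=(1, 2))
plateau = rolling_predictions([5, 5, 5, 5], k=2, horizons=(1,))
```

On the plateau the refitted slope is zero, which is how a rolling model can follow quiescent phases that a static model cannot.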
In the ordinary rolling LR condition, the LR performed at each step is the same as
in the static LR case. In the weighted condition, the error at each point in the training
window is adjusted by a vector of weights, affecting their influence on the result. For example, a vector of weights that reduces the importance of the most recent sample relative
to the others will be less responsive to changes and may ‘smooth’ the estimation given
transient perturbations. For weighted rolling LR, we considered two types of weights:
exponentially increasing and exponentially decreasing. In the former, older samples are
discounted while the most recent sample is amplified, and in the latter, the oldest sample is given the greatest weight. Weights are calculated using a standard exponential
function, scaled so that the maximum is 1.0.
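The weighting scheme can be sketched in Python as below. This is a sketch under assumptions: the text specifies only "a standard exponential function" scaled to a maximum of 1.0, so the `rate` parameter, the function names, and the example window are hypothetical; the fit shown is ordinary weighted least squares.

```python
import math


def exp_weights(k, increasing=True, rate=1.0):
    """Exponential weights for a window of k samples (oldest first),
    scaled so that the largest weight is 1.0."""
    raw = [math.exp(rate * i) for i in range(k)]
    if not increasing:
        raw.reverse()  # oldest sample gets the greatest weight
    peak = max(raw)
    return [w / peak for w in raw]


def fit_weighted(ys, weights):
    """Weighted least squares over x = 0..len(ys)-1; returns (intercept, slope)."""
    total = sum(weights)
    mean_x = sum(w * x for x, w in enumerate(weights)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for x, (w, y) in enumerate(zip(weights, ys)))
    sxx = sum(w * (x - mean_x) ** 2 for x, w in enumerate(weights))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


w_inc = exp_weights(4, increasing=True)
w_dec = exp_weights(4, increasing=False)
burst = [0, 0, 0, 10]  # hypothetical window: data arrives only at the latest sample
slope_inc = fit_weighted(burst, w_inc)[1]
slope_dec = fit_weighted(burst, w_dec)[1]
```

With the burst window above, the increasing weights yield the steeper slope, i.e., the model reacts more strongly to the most recent sample, while the decreasing weights smooth it out.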
These methods raise the issue of preferred window sizes: which training window
sizes perform best for predictions at which point (pi in Figure 8.2) in the future? One
possibility is to have the kernel itself dynamically adapt the training window size to
minimize error for a given prediction point as the application is running. However, for
this study we adopt a use case similar to that proposed for static, ordinary LR. Through
some initial analysis, we determine performant window sizes for two classes of predictions:
short-term and long-term. Then, these window sizes (and their weights, in the case of
rolling weighted LR) are loaded from a library at the launch of the application. Given the
two window sizes, we then evaluate the methods by calculating the average RMSE across
all ranks of all 11 runs of each application. Results from both the initial analysis of
window sizes, and the performance given windows selected on the basis of that analysis,
are given below. As in the case of static ordinary LR, the raw data is trimmed to remove
startup and teardown plateaux.


8.5

Results: Ordinary linear regression

In this section we present and discuss the results of the ordinary (static) linear regression
experiments described above. We first consider (1) how well the model captures the
data, and (2) the resource requirements (time and memory) for an INCA kernel that
implements the inference phase.

8.5.1

Method performance

Figure 8.3 shows normalized RMSE (NRMSE) for all ranks of all applications. Results
are averaged across 11 runs for the training results, and across 110 runs for the test
results. Error bars are standard deviation, although this may be sufficiently low as to
not be visible.
We see that training and testing normalized RMSE never exceeds 2.5% (MiniAMR),
and for some apps remains close to 0.5% (LAMMPS-rhodo, Lulesh). Test NRMSE is
always higher than training, and for some applications this suggests overfitting (HPCG,
MILC, MiniFE, and some ranks of MiniMD). One possible contributor to this outcome
is the method for trimming startup plateaux described in Section 8.4: the raw
data is trimmed by removing samples up to the first positive
change, but in some cases this is insufficient. For example, Figure 8.4
shows the result of training on rank 0 data from a specific run of MiniFE (left), and the
application of the result of that training to data from the same rank on a different run
(right). As the figure makes clear, despite the relatively good fit on the training data,
and the prima facie good fit to the test data, the lack of a startup plateau in the
testing data offsets the predictions, resulting in increased error. One possible method for
addressing this issue is to have the application signal to the NIC when the model should
be active, i.e., after any startup phase.

[Figure 8.3 plots: one panel per application (HPCG, Lammps LJ, Lammps Rhodo, Lulesh, MILC, MiniAMR, MiniFE, MiniMD), each showing Train and Test normalized RMSE (y-axis, 0.0 to 3.0) by rank (x-axis, 0 to 7).]
Figure 8.3: Results from the static, ordinary LR study.

[Figure 8.4 plots: two panels, "MiniFE, Training, Rank 0, Job 1313649" (left) and "MiniFE, Testing, Rank 0, Job 1313787" (right), each showing Data and Fit in MiB (y-axis, 0 to 40) against Sample (x-axis).]
Figure 8.4: An illustration of how the startup plateau can offset the learned
model. The training data (left) has a long startup plateau, so despite the relatively good fit, when applied to another data set with no such plateau (right),
there is significant error.

8.5.2

INCA kernel resources

The static, ordinary LR method was chosen not only because the data suggested it,
but also because it promises to have a light resource footprint. To confirm this, we
implemented the inference phase as an INCA kernel, and estimated memory requirements
and runtimes using INCAsim.
Algorithm 7 INCA-A Ordinary Linear Regression
PUTL output, b0, f              ▷ b0 is the y-intercept.
PUTL R1, b1, f                  ▷ b1 is the slope.
MUL  R1, R1, x, f
ADD  output, output, R1, f
END

The INCA-A kernel is shown in Algorithm 7. The kernel takes advantage of the fact
that the output can be clobbered, so it requires only four instructions to complete. The memory
required is extremely modest: assuming 64-bit operands, the total amount of memory
required is less than 64 B. Regarding runtimes, on a 200 Gb/s NIC with local scratchpad
memory and 64 B packets, the estimated runtime of this kernel is 26.48 ns. When network
speeds increase to 400 Gb/s [4], all else being equal, the runtime is reduced to 16.24 ns.
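For readers less familiar with INCA-A, the data flow of Algorithm 7 can be mirrored in a few lines of Python. This is only an emulation sketch: it assumes PUTL copies its second operand into its first and that MUL and ADD write their first operand, and it treats memory as a dictionary; none of this is a statement about how INCAsim executes the kernel. The footprint estimate simply counts the five named locations at 8 bytes each.

```python
def lr_inference(mem):
    """Emulate the four data instructions of Algorithm 7 on a dict 'memory'.

    Expects mem to contain 'b0' (intercept), 'b1' (slope), and 'x' (input).
    """
    mem["output"] = mem["b0"]                   # PUTL output, b0
    mem["R1"] = mem["b1"]                       # PUTL R1, b1
    mem["R1"] = mem["R1"] * mem["x"]            # MUL  R1, R1, x
    mem["output"] = mem["output"] + mem["R1"]   # ADD  output, output, R1
    return mem["output"]


OPERAND_BYTES = 8  # 64-bit operands
LOCATIONS = ("output", "b0", "b1", "x", "R1")
footprint_bytes = OPERAND_BYTES * len(LOCATIONS)

prediction = lr_inference({"b0": 2.0, "b1": 0.5, "x": 10.0})
```

Under these assumptions the five locations occupy 40 B, which is consistent with the stated bound of less than 64 B.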


8.6

Results: Rolling linear regression

One of the limitations of a static LR approach is it does not adjust to changes in the
incoming data stream. The ‘offset problem’ illustrated in Figure 8.4 highlights this
drawback. If a process is significantly delayed in sending data to a receiver, the result
may be a plateau, and lacking any mechanism for adapting to the change, subsequent
predictions are shifted.
Rolling LR methods avoid these issues by dynamically updating the model as new
data is acquired. Prediction accuracy is a function of both the size of the training window
and the future point to be predicted. To assess training window sizes with respect to
prediction points, for each rank in each application run, we execute the prediction method
while varying training window size.
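The window-size sweep can be sketched in Python as follows. This is a sketch under assumptions rather than the analysis code used for the study: the function names are hypothetical, plain RMSE is used instead of NRMSE, and error is accumulated from the first time step at which the largest candidate window is full so that all candidates are scored on the same predictions.

```python
import math


def fit_line(ys):
    """Least-squares line over x = 0..len(ys)-1; returns (intercept, slope)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def best_window(samples, horizon, window_sizes):
    """Return (window size, RMSE) of the candidate (each >= 2) with the lowest
    error when predicting `horizon` samples ahead with rolling ordinary LR."""
    start = max(window_sizes) - 1
    best = None
    for k in window_sizes:
        sq_err, count = 0.0, 0
        for t in range(start, len(samples) - horizon):
            b0, b1 = fit_line(samples[t - k + 1:t + 1])
            predicted = b0 + b1 * (k - 1 + horizon)
            sq_err += (predicted - samples[t + horizon]) ** 2
            count += 1
        if count == 0:
            continue
        error = math.sqrt(sq_err / count)
        if best is None or error < best[1]:
            best = (k, error)
    return best


# Hypothetical trace: a quiescent phase followed by steady arrivals.
step = [0] * 10 + list(range(1, 11))
short_term_choice = best_window(step, horizon=1, window_sizes=(2, 8))
```

For this trace and a one-sample horizon, the small window is selected because it recovers from the change in slope after a single misprediction.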
Figure 8.5 shows the results of this process for a single rank (rank 3) from each
application, averaged across 11 runs, for prediction points ranging from t0 + 1 (p0) to
t0 + 10 (p9), and window sizes ranging from 2 to 250. Error bars show standard deviation,
and for each prediction point, a red ‘×’ indicates the best performing window size. For five
of the eight applications (HPCG, LAMMPS-lj, LAMMPS-rhodo, MiniFE, and MiniMD),
best performing window sizes fall into two groups: smaller window sizes for short-term
predictions, and larger window sizes for long-term predictions. This is to be expected
given the raw data shown in Figure 8.1 and the analysis provided in Table 8.1. NICs for
these applications are mostly quiescent with respect to data arrival. Consequently, smaller window sizes
allow the regression to react to sudden increases in amounts of incoming data, and to
return to quiescence (i.e., zero slope), while larger window sizes allow the regression to
ignore plateaux where no data arrives, capturing the longer-term trends. In contrast, in
cases where data is nearly constantly arriving (Lulesh, MILC, MiniAMR), there is not
a huge distinction between small and large window sizes, because the slope of the data
does not change.

[Figure 8.5 plots: one panel per application for rank 3 (HPCG, Lammps LJ, Lammps Rhodo, Lulesh, MILC, MiniAMR, MiniFE, MiniMD), each showing NRMSE (y-axis) against training window size (x-axis, 2 to 250) with one line per prediction point p0 through p9.]
Figure 8.5: Error as a function of window size, for predictions ranging from t0 + 1
(p0) to t0 + 10 (p9) samples and window sizes ranging from 2 to 250. Results
are averaged across 11 runs, and error bars show standard deviation. The best
performing window size for each prediction point is indicated by a red ‘×’.
We executed the window size search (depicted in Figure 8.5 for a single MPI rank
from each application using ordinary LR) for all MPI ranks of all runs of all applications,
and determined the best performing window size for each prediction point under the
three types of rolling LR. The results of this analysis are given in Table 8.2, which shows
the median best performing window size for each prediction point for each application
and rolling LR method. Again, the pattern observed in Figure 8.5 is evident insofar
as applications either exhibit two distinct groupings of window sizes, one for short-term
predictions and the other for long-term predictions, or do not exhibit such an obvious
distinction (Lulesh, MILC, MiniAMR). Moreover, we see that weighted rolling LR with
increasing (exp-inc) weights tends to increase the best performing window size, while
decreasing weights (exp-dec) tends to decrease the best performing window size. As
before, this is expected. For example, the increasing weights emphasize the most recent
sample, so, on applications where data arrival is relatively infrequent, larger window
sizes are required to compensate for the prima facie overemphasis on that sample. The
opposite holds for the decreasing case.
The fact many applications have apparently distinct groupings of window sizes corresponding to short- and long-term prediction points suggests a strategy where an INCA
kernel can use two different windows, one for each type of prediction. To determine
these window sizes, we applied the Fisher-Jenks algorithm to each set of median window
sizes given in Table 8.2 to provide guidance for establishing a boundary between the two
classes of window. We then selected the median window size from each of those groups
to establish a small and large window size for each app, for each type of rolling LR. The
results of this analysis are given in Table 8.3. In one case – MiniFE – we overrode the
results of the algorithm to move the boundary by one location to better correspond to
the intuitive class divisions.¹

¹ For example, in the MiniFE ord case, the algorithm placed 6 and 54 into the small category, and 55
through 61 into the large category; we manually adjusted the boundary to occur between 6 and 54.
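The grouping step can be illustrated with a small Python sketch. For two classes, the Fisher-Jenks objective (minimizing the within-class sum of squared deviations) can be found by simply trying every split point of the sorted values, which is what this sketch does instead of the general dynamic program. The function name is hypothetical, and the example values are the hpcg exp-inc row of Table 8.2.

```python
from statistics import median


def two_class_break(values):
    """Split sorted values into two contiguous groups that minimize the total
    within-group sum of squared deviations (two-class natural breaks,
    found by exhaustive search over split points)."""
    xs = sorted(values)

    def ssd(group):
        mean = sum(group) / len(group)
        return sum((v - mean) ** 2 for v in group)

    split = min(range(1, len(xs)), key=lambda i: ssd(xs[:i]) + ssd(xs[i:]))
    return xs[:split], xs[split:]


hpcg_exp_inc = [8, 12, 14, 17, 19, 247, 250, 250, 250, 250]
small_group, large_group = two_class_break(hpcg_exp_inc)
small_window = median(small_group)
large_window = median(large_group)
```

For this row the sketch yields 14 and 250, matching the hpcg exp-inc entries of Table 8.3. Groups with an even number of members have a fractional median, and how such values are rounded to a window size is not specified here.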

                           Prediction Point
Application    Weights    p0   p1   p2   p3   p4   p5   p6   p7   p8   p9
hpcg           ord         5    6    6   13  151  150  149  148  148  147
               exp-inc     8   12   14   17   19  247  250  250  250  250
               exp-dec     2    2    5  117  148  147  146  146  145  144
lammps-lj      ord         6    5    5    5    5    6    6    6    6  242
               exp-inc     8    9    8    8    9   10   13   18   26   34
               exp-dec     2    5    4    5    5    6    6    6    7    7
lammps-rhodo   ord         2    4    4    5    5    5  107  106  106  105
               exp-inc     5    7    7    7    8    8   10  248  250  250
               exp-dec     2    2    2    2    3    4    4   99   98   97
lulesh         ord        45   15    5   48   12   11   56   15   28   59
               exp-inc   138   68   11  143   53   25  148   41   66  147
               exp-dec    13   15   14   16   32   31   33   32   31   30
milc           ord         2    4    5    5    5    5    5    5    5    5
               exp-inc     5    7    7    8    8    7    7    7    7    7
               exp-dec     2    2    2    2    3    3    4    4    4    4
miniamr        ord         2    4    6    6    6    7    7    8    8   23
               exp-inc     2    8    9   10   10   11   12   15   39   45
               exp-dec     2    2    2    2    2    4    5    6    6    7
minife         ord         4   61   60   59   58   57   57   56   55   54
               exp-inc     8  131  153  174  174  173  173  171  170  168
               exp-dec    60   59   58   57   56   55   54   53   76   75
minimd         ord         2    3    3    9  105  105  104  104  103  102
               exp-inc     4    5   11   17   22  233  238  241  244  245
               exp-dec     2    2    2    2   99   98   97   96   95   94

Table 8.2: Best-performing window sizes for each prediction point, for ordinary (unweighted) linear regression (ord), exponentially increasing weights
(exp-inc), and exponentially decreasing weights (exp-dec).

                    ord             exp-inc           exp-dec
Application     Small  Large     Small  Large     Small  Large
hpcg                6    149        14    250         2    146
lammps-lj           6    242         9     30         3      6
lammps-rhodo        5    106         7    250         2     98
lulesh             14     52        47    145        15     31
milc                3      5         7      8         2      4
miniamr             6     23        10     42         2      6
minife              4     57         8    171        57     76
minimd              3    104        11    241         2     97

Table 8.3: Small and large window sizes selected on the basis of the best-performing
window size parameter sweep, for ordinary (unweighted) linear regression
(ord), exponentially increasing weights (exp-inc), and exponentially decreasing
weights (exp-dec). Values determined using the Fisher-Jenks algorithm; class
boundaries for MiniFE were slightly adjusted by hand.

8.6.1

Method performance

Given the small and large window sizes specified in Table 8.3, we executed each type of
rolling LR on the data from each rank from each application, and calculated the NRMSE
for each prediction point. To facilitate comparison, we started accumulating error only
at sample 251, the first prediction available at the largest considered window size; the
NRMSE reported for ordinary LR (Section 8.5) is also calculated in this way.
Figure 8.6 shows average NRMSE for small and large training window sizes, for
each application, prediction point, and rolling LR type (ordinary (ord), exponentially
increasing weighted (exp-inc), and exponentially decreasing weighted (exp-dec)). We
note, first, that the crossover point between lines representing predictions made on the
basis of LR on small and large training windows corresponds to the division between
groups of window sizes given in Table 8.2. For instance, the crossover between small
and large in the ord method applied to LAMMPS-rhodo is between p5 and p6. Second,
for short-term predictions, exponentially increasing weighted LR tends to perform better
than ord or exp-dec. Moreover, this holds regardless of whether the application is
sparse or dense. This result is intuitive, because emphasizing the more recent samples in
comparison to older samples improves the ability of the method to react to incoming data.
Third, regarding longer-term predictions, for three of the eight applications (HPCG,
LAMMPS-rhodo, and MiniMD), ord performs better than either exp-inc or exp-dec.
These are all sparse applications. In dense applications, exp-inc performs better for the
most part. Fourth, rarely does exp-dec perform better than the others for either short- or
long-term predictions, with MiniFE being the most notable exception. Finally, Lulesh’s
oscillating errors are an obvious departure from those of the other applications. Unlike the
other applications surveyed here, Lulesh shows significant differences in the amount of data
each rank receives by the time the application terminates (Figure 8.1). We hypothesize
the unique behavior with respect to NRMSE is due in part to the current strategy of
using a single pair of training window sizes (small and large) for all ranks; testing this
hypothesis is left to future work.

[Figure 8.6 plots: one panel per application (HPCG, Lammps LJ, Lammps Rhodo, Lulesh, MILC, MiniAMR, MiniFE, MiniMD), each showing NRMSE (y-axis) against prediction point p0 through p9 (x-axis), with six lines: ord, inc, and dec, each for the small and large window sizes listed in Table 8.3.]
Figure 8.6: Average error for ordinary (ord), exponentially increasing weighted
(exp-inc), and exponentially decreasing weighted (exp-dec) rolling LR methods, for small and large training window sizes, and prediction points ranging
from t0 + 1 (p0, 50 ms) to t0 + 10 (p9, 500 ms).
In comparison to the static, ordinary LR method (Figure 8.3), the rolling LR methods
represent a significant reduction in error. For example, with ordinary LR, testing NRMSE
for HPCG approached 1%; in comparison, the best performing rolling LR methods remain
below 0.1%. Similarly, whereas NRMSE for ordinary LR for LAMMPS-lj can exceed 1%,
the best performing rolling methods remain below 0.3%. In general, in comparison to
static ordinary LR, the rolling LR methods secure reductions in NRMSE for all of the
applications considered here. Because rolling methods (ord and exp-inc in particular)
secure reduced prediction error, and by being dynamic address the issue of responding to
unexpected events raised at the beginning of this section, we believe a rolling LR method
is preferable to the static method as an offloaded kernel. The question, then, is how
the increased complexity of the methods impacts resource requirements.

8.6.2

INCA kernel resources

Whereas the ordinary LR approach considered in Section 8.5 required the INCA kernel to
do little more than perform a multiplication and an addition, rolling LR methods require the
kernel to perform the actual regression. To assess the impact of this increased complexity, we
implemented both ordinary and weighted rolling LR kernels in INCA, taking advantage
of SIMD instructions (Chapter 6) when applicable.
                                              Runtime (µs)
Rolling LR Method    Network (Gb/s)    5 Sample Window    250 Sample Window    Memory (KiB)
ord                  200               0.68               21.87                < 4
                     400               0.41               13.46
ord-parallel         200               0.28                0.69                < 4
                     400               0.17                0.34
weighted             200               1.17               38.10                < 8
                     400               0.71               23.22
weighted-parallel    200               0.74               15.72                < 12
                     400               0.45                9.65

Table 8.4: Runtimes (µs) for INCA kernels implementing ordinary (ord) and
weighted (weighted) rolling LR methods, for current and next-generation InfiniBand network speeds (64 B packets). The parallel versions use SIMD instructions.

Table 8.4 shows the simulation results for rolling LR INCA kernels using ordinary
(ord) and weighted (weighted) linear regression, for current (200 Gb/s) network speeds,
and the upcoming next generation (400 Gb/s). These kernels generate a single prediction.
Runtimes are estimated for both a small training window size (5 samples) and a large
window size (250 samples). The use of dot products in performing linear regression affords
opportunities for parallelization through SIMD instructions (Chapter 6); the parallel
versions take advantage of these opportunities.
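To make the role of dot products concrete, the following Python sketch writes ordinary least squares entirely in terms of them. It is an illustration of the standard closed-form solution, not the INCA kernel itself; the function names are hypothetical. Each `dot` call is an element-wise multiply followed by a reduction, which is the pattern that SIMD instructions can accelerate.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def fit_by_dot_products(ys):
    """Least-squares line over x = 0..len(ys)-1 using only dot products.

    Returns (intercept, slope); needs at least two samples.
    """
    n = len(ys)
    xs = list(range(n))
    ones = [1.0] * n
    sx = dot(xs, ones)    # sum of x
    sy = dot(ys, ones)    # sum of y
    sxx = dot(xs, xs)     # sum of x^2
    sxy = dot(xs, ys)     # sum of x*y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


intercept, slope = fit_by_dot_products([1, 3, 5, 7])
```

Because the x-coordinates within a window are fixed, the sums over x could be precomputed per window size, leaving only the two dot products involving the samples to be evaluated per prediction.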
These results show that INCA kernels generating predictions from small training window sizes have very modest runtimes, ranging from less than a microsecond to just over one microsecond (Table 8.4). Runtimes for large training windows span tens of microseconds, but can be reduced by more than half by taking advantage of parallel (SIMD) instructions.
In the implementations considered here, the ordinary rolling LR requires less than
4 KiB of memory, weighted rolling LR requires approximately 8 KiB, and because of
additional space required as destinations for SIMD instructions, the weighted parallel
kernel requires less than 12 KiB. The memory requirements of these kernels are thus well
within the 1 MiB limit proposed in Chapter 6.
Finally, we note that while the number of instructions executed for kernels using the
small (5 samples) window size is in the low hundreds, the number for the large window size (250
samples) is in the thousands or tens of thousands. The larger number of instructions
puts these kernels outside the capabilities of, e.g., a 32 core 2.5 GHz on-path SmartNIC, which (assuming 200 Gb/s and 64 B packets) has a deadline of approximately
205 instructions (Chapter 2). This outcome provides further motivation for preferring
deadline-free SmartNIC architectures such as INCA for enabling adaptive networks.
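One way to arrive at a deadline of this size is the following back-of-the-envelope calculation, shown here as a Python sketch. It assumes packets arrive back-to-back at line rate, are distributed evenly across the cores, and that each core retires one instruction per cycle; the derivation in Chapter 2 may rest on different assumptions.

```python
LINK_RATE_BITS_PER_S = 200e9   # 200 Gb/s
PACKET_BYTES = 64
CORES = 32
CLOCK_HZ = 2.5e9               # 2.5 GHz

# Time between consecutive 64 B packets at line rate.
packet_interval_s = PACKET_BYTES * 8 / LINK_RATE_BITS_PER_S

# With packets spread evenly over the cores, each core sees one packet
# every CORES intervals and must finish before its next packet arrives.
per_core_budget_s = packet_interval_s * CORES

# Assumed: one instruction per clock cycle.
deadline_instructions = per_core_budget_s * CLOCK_HZ
```

Under these assumptions the budget is 2.56 ns × 32 = 81.92 ns per packet per core, or roughly 205 instructions, far fewer than the thousands required by the large-window kernels.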

8.7

Related work

A popular target for ML methods in HPC is job scheduling. Krishnaswamy et al. [199]
use a collection of scripts with known runtimes and a similarity metric. Rodrigues et
al. [200] use job submission features (user ID, resources requested, etc.) and consider
k-means, SVMs, MLPs, and random forests. Wyatt et al. [201] convert job submission
scripts into ‘pictures’, and then use a convolutional neural network to generate runtime
predictions.
Closer to the sorts of network-side issues targeted in this chapter, Jain et al. [196]
use random forests to predict which node allocation will result in better application
performance. Dickov et al. [202] propose using n-grams to detect patterns in MPI calls
issued by a host application to deactivate and reactivate NIC links to conserve power.
Kiran et al. [203] use a Gaussian mixture model to classify flows as elephant or mouse
for purposes of segregating them. Groves et al. [195] show that random forests trained
on CPU performance counter data can distinguish between situations where network-induced memory contention will and will not occur, potentially allowing the dynamic
selection of solutions. None of these works proposes offloading to the NIC.
Some of these works share a strategy of using hardware performance counter values
as inputs [195], [196]. Performance counters have also been found to be useful for
potentially detecting malicious attacks, e.g., Spectre or Rowhammer [197], [198]. The
work described here is, as far as we know, the first to take advantage of on-NIC hardware
counters.

8.8

Conclusion

Adaptive networks are networks that deploy techniques from ML or AI to dynamically
adapt to changing network conditions. As ‘gatekeepers’ situated between the network
and host applications, SmartNICs can play a pivotal role in enabling adaptive networks.
In this chapter, we’ve demonstrated that INCA-enabled NICs can offload ML kernels
for the purposes of predicting network traffic. Even without significant hardware acceleration support, these kernels can execute in sub-microsecond times, with modest (<1
MiB) memory requirements. Moreover, these kernels execute independently of any host
application.


Chapter 9
Conclusion and future work
The past half-decade has witnessed a resurgence of interest in SmartNICs, i.e., network
interfaces that not only move data, but also perform potentially complex work with or on
that data. What sets this new wave of research apart from earlier attempts at diversifying
network functionality is its scope. Where once making the network intelligent meant
offloading characteristically network-oriented applications – packet forwarding, firewalls,
segmentation, bits and pieces of specific protocols, collective communications, and so
on – researchers are currently broadening the vision of what a ‘smart’ network can be
expected to do. Rather than only perform core network functions, NICs and other
network appliances are increasingly being asked to offload parts of host applications, or
even execute applications that are entirely independent of those running on host CPUs.
While on-NIC capabilities have been expanding, what exactly they can do for high
performance computing is not well-understood. For example, to what degree do current
‘smart’ capabilities address anticipated future paradigms such as multithreaded communication? Can existing on-NIC capabilities provide additional flexibility for supporting
novel offloaded functionality, or must we appeal to general-purpose compute hardware
such as CPUs? Finally, when such flexibility is secured, what can be done with it to
service HPC applications?
In this work, we’ve explored answers to these questions. In Chapter 3, we showed
that offloaded message matching may help mitigate message processing overheads, but
does not eliminate them entirely. In Chapter 4 we established that offloaded message matching, when coordinated with other offloaded capabilities (namely, atomic and triggered
operations), can elevate a NIC from being smart in the traditional sense of handling
task-specific network applications, to the contemporary sense of supporting the execution of arbitrary programs. Finally, in Chapters 6 through 8, we explored what these
general-purpose, deadline-free compute capabilities can offer as regards accelerating host
applications or enabling the offloading of machine learning kernels.
To conclude, we briefly highlight some of the additional research opportunities afforded by INCA. First, INCA currently exists in theory and software. An obvious future
direction of research is to secure a physical implementation of an INCA-capable NIC. To
this end, we are currently collaborating with academic partners to realize this goal in the
form of an FPGA implementation.
Second, we’ve presented INCA as an alternative to on-path or off-path approaches,
but in fact these approaches are not mutually exclusive. Aside from mandating the
primitive capabilities required to secure general-purpose compute capabilities, INCA is
neutral with respect to the on- or off-path processing elements that are actually available.
So, for instance, in Chapter 7, we proposed including FPGA or ASIC-based accelerators
for performing certain functions, integrated into INCA as new INCA-A instructions. In
the same way, INCA could schedule resources such as CPU cores in a sPIN-style system,
making the resulting hybrid deadline-free. More generally, from this perspective, INCA
can be viewed as an overarching system for coordinating the deployment of potentially
heterogeneous on-NIC compute resources. Exploring how this type of hybrid ‘INCA+X’
SmartNIC can address the issues raised above is a potentially fruitful direction for future
work.
Third, deadline-free SmartNICs such as INCA have potential applications to middleware runtimes, and these deserve close consideration. One example is tolerance to node
failures. For instance, it is possible that the host system panics while leaving the NIC
still operational. In such a situation, the NIC is in an opportune position to both detect
and mitigate the failure: noticing the host is defunct, an INCA kernel could intercept and
forward incoming traffic to a replacement node while also alerting the sender of the new
location for the destination process. If the CPU panics but leaves the memory system
operational, the NIC may be able to handle the task of rescuing state through its DMA
engine, moving it to the new location. Similarly, an INCA kernel may be able to take
over the task of handling checkpoint I/O: rather than having the host application spend
time pausing to write out a checkpoint, it could provide an INCA kernel with the memory
location of the data to be written out, along with the size, and then continue doing work,
letting the NIC (through DMA) stream the checkpoint to the filesystem. We explore
additional examples of future directions in runtime offloading afforded by deadline-free
approaches such as INCA in [204].
Fourth, in this work we’ve emphasized local INCA kernel execution, i.e., kernels
that run on a single NIC. But the underlying compute model is not restricted in this
way: rather than looping back to the origin, a message generated by a triggered operation could be sent to a distinct NIC. In principle, then, an INCA program could be
distributed across a network, with different NICs responsible for different parts, and
perhaps operating in parallel. One possible application for such a distributed execution
model is the application of machine learning techniques to network monitoring. INCA
kernels might periodically issue messages containing network status data, and as this
data flows through the network, it is aggregated and processed by intermediate INCA
kernels that classify the statuses. By the time it is delivered to its ultimate destination,
the receiving application has actionable intelligence regarding network state.
Finally, the research on offloading machine learning kernels presented in Chapter 8 is
just the first step towards adaptive, intelligent networks. There are many more techniques
to investigate. While some are likely too computationally or memory intensive (n-grams,
random forests), an intriguing alternative is to treat network traffic as a ‘language’
(e.g., a sequence of symbols representing incoming data or no incoming data) and then
apply techniques from natural language processing (NLP) to the task of predicting
upcoming symbols. Such an approach need not be especially computationally expensive;
for example, it could use traditional ‘simple recurrent networks’ of the sort used in
pioneering research in neural network NLP [205]. Other obvious alternatives include
Markov models and simple Bayesian networks, but clearly the space of possibilities is
vast and potentially rich.
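As a rough illustration of the ‘traffic as language’ idea, the sketch below fits a first-order Markov model to a binary symbol stream (1 = incoming data in a time slot, 0 = no incoming data) and predicts the next symbol. The symbol encoding, the function names, and the example trace are assumptions made for illustration only; they are not taken from the experiments reported in this dissertation.

```python
from collections import Counter, defaultdict


def fit_markov(symbols):
    """Count first-order transitions: symbol -> following symbol."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(symbols, symbols[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, current, default=0):
    """Return the most frequent successor of `current`, or `default` if unseen."""
    successors = counts.get(current)
    if not successors:
        return default
    return successors.most_common(1)[0][0]


# Hypothetical trace: three busy slots followed by one idle slot, repeated.
trace = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]
model = fit_markov(trace)
prediction_after_idle = predict_next(model, 0)
```

A model of this kind needs only a small table of counters per symbol, which is one reason such techniques may be plausible candidates for resource-constrained NICs.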

LIST OF APPENDICES

Appendix A. The INCA-A Language
Appendix B. The INCA-Q Language

Appendix A:
The INCA-A language
This appendix includes a description of the INCA-A language.

A.1 INCA-A: Instruction categories
Table 1 summarizes the basic categories of INCA-A instruction. The current implementation includes directives, core, and control instructions. Additional instructions under
consideration for future versions are enclosed in square brackets.

A.2 INCA-A: Semantics
In this section we present the semantics for the current version of INCA-A. Instructions
include those from the core, control, and data movement categories.
| Category | Examples | Use Cases |
|---|---|---|
| Directives | Variable declarations | Required for interpretation |
| | Variable initializations | |
| Core | Arithmetic operations | Data manipulation |
| | Logical operations | |
| | Bitwise operations | |
| Control | Branch (BLEZ) | Turing completeness |
| | Jump | Application interaction |
| | Post END | |
| | [Post handoff] | |
| | [Set counter] | |
| | [Get counter] | |
| Data Movement | Put (local) | Data manipulation |
| | [Put (remote)] | Distributed programs |
| Environment | [Get NIC id] | Collective communication |
| | [Get NIC addressing mode] | Adaptive networks |
| | [Get hardware counter X] | |
| | [Get/set MTU] | |
| | [Get time] | |
| Header manipulation | [Get/set initiator] | Traffic control |
| | [Get/set target] | Rendezvous communications |

Table 1: INCA instruction categories. Items enclosed in square brackets are
proposed for inclusion in future versions of INCA-A.

All variables are locations, i.e., L-values. Locations in memory are designated $r_i$, $r_j$, $r_k$.
Angled brackets refer to the contents of a memory location, e.g., $\langle r_i \rangle$. $t$ is a datatype; a list of
supported types can be found in Appendix B. For remote operations, $h_i$ is the destination
host handle. $I$ is the set of valid instruction indices for a program, and $p, q \in I$
are specific instruction indices. ‘Single’ instructions operate over single instances of a
datatype. ‘Multiple’ instructions operate over multiple instances of a datatype, i.e., are
SIMD instructions. For these instructions, $s$ is the size or number of instances operated
over. Finally, instructions marked with an asterisk represent atomic operations not in
the current (4.2) Portals network programming API.
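1. Arithmetic operations (single)
(a) ADD $r_i, r_j, r_k, t$ : Add. $\langle r_i \rangle \leftarrow \langle r_j \rangle + \langle r_k \rangle$
(b) SUB $r_i, r_j, r_k, t$ : Subtract. $\langle r_i \rangle \leftarrow \langle r_j \rangle - \langle r_k \rangle$
(c) MUL $r_i, r_j, r_k, t$ : Multiply. $\langle r_i \rangle \leftarrow \langle r_j \rangle * \langle r_k \rangle$
(d) (*) DIV $r_i, r_j, r_k, t$ : Divide. $\langle r_i \rangle \leftarrow \langle r_j \rangle / \langle r_k \rangle$
(e) (*) EXP $r_i, r_j, t$ : Exponentiation. $\langle r_i \rangle \leftarrow e^{\langle r_j \rangle}$
(f) (*) SIN $r_i, r_j, t$ : Sine. $\langle r_i \rangle \leftarrow \sin(\langle r_j \rangle)$
(g) (*) COS $r_i, r_j, t$ : Cosine. $\langle r_i \rangle \leftarrow \cos(\langle r_j \rangle)$
(h) (*) TAN $r_i, r_j, t$ : Tangent. $\langle r_i \rangle \leftarrow \tan(\langle r_j \rangle)$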
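2. Arithmetic operations (multiple)
(a) (*) ADDM $r_i, r_j, r_k, t, s$ : Add multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle + \langle r_k[m] \rangle$
(b) (*) SUBM $r_i, r_j, r_k, t, s$ : Subtract multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle - \langle r_k[m] \rangle$
(c) (*) MULM $r_i, r_j, r_k, t, s$ : Multiply multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle * \langle r_k[m] \rangle$
(d) (*) DIVM $r_i, r_j, r_k, t, s$ : Divide multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle / \langle r_k[m] \rangle$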
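3. Bitwise operations (single)
(a) AND $r_i, r_j, r_k, t$ : Bitwise AND. $\langle r_i \rangle \leftarrow \langle r_j \rangle \mathbin{\&} \langle r_k \rangle$
(b) OR $r_i, r_j, r_k, t$ : Bitwise OR. $\langle r_i \rangle \leftarrow \langle r_j \rangle \mid \langle r_k \rangle$
(c) XOR $r_i, r_j, r_k, t$ : Bitwise XOR. $\langle r_i \rangle \leftarrow \langle r_j \rangle \mathbin{\text{\textasciicircum}} \langle r_k \rangle$
(d) (*) NOT $r_i, r_j, t$ : Bitwise NOT. $\langle r_i \rangle \leftarrow \neg \langle r_j \rangle$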
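4. Bitwise operations (multiple)
(a) (*) ANDM $r_i, r_j, r_k, t, s$ : Bitwise AND multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle \mathbin{\&} \langle r_k[m] \rangle$
(b) (*) ORM $r_i, r_j, r_k, t, s$ : Bitwise OR multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle \mid \langle r_k[m] \rangle$
(c) (*) XORM $r_i, r_j, r_k, t, s$ : Bitwise XOR multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \langle r_j[m] \rangle \mathbin{\text{\textasciicircum}} \langle r_k[m] \rangle$
(d) (*) NOTM $r_i, r_j, t, s$ : Bitwise NOT multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \neg \langle r_j[m] \rangle$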
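5. Logical operations (single)
(a) LAND $r_i, r_j, r_k, t$ : Logical AND.
$\langle r_i \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j \rangle > 0 \text{ and } \langle r_k \rangle > 0 \\ 0 & \text{otherwise} \end{cases}$
(b) LOR $r_i, r_j, r_k, t$ : Logical OR.
$\langle r_i \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j \rangle > 0 \text{ or } \langle r_k \rangle > 0 \\ 0 & \text{otherwise} \end{cases}$
(c) LXOR $r_i, r_j, r_k, t$ : Logical XOR.
$\langle r_i \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j \rangle > 0 \text{ or } \langle r_k \rangle > 0 \text{ but not both} \\ 0 & \text{otherwise} \end{cases}$
(d) (*) LNOT $r_i, r_j, t$ : Logical NOT.
$\langle r_i \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j \rangle = 0 \\ 0 & \text{otherwise} \end{cases}$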
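6. Logical operations (multiple)
(a) (*) LANDM $r_i, r_j, r_k, t, s$ : Logical AND multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j[m] \rangle > 0 \text{ and } \langle r_k[m] \rangle > 0 \\ 0 & \text{otherwise} \end{cases}$
(b) (*) LORM $r_i, r_j, r_k, t, s$ : Logical OR multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j[m] \rangle > 0 \text{ or } \langle r_k[m] \rangle > 0 \\ 0 & \text{otherwise} \end{cases}$
(c) (*) LXORM $r_i, r_j, r_k, t, s$ : Logical XOR multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j[m] \rangle > 0 \text{ or } \langle r_k[m] \rangle > 0 \text{ but not both} \\ 0 & \text{otherwise} \end{cases}$
(d) (*) LNOTM $r_i, r_j, t, s$ : Logical NOT multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \begin{cases} 1 & \text{if } \langle r_j[m] \rangle = 0 \\ 0 & \text{otherwise} \end{cases}$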
7. Control flow operations
(a) (*) BLEZ $r_i, q$ : Jump if less-than-or-equal-to zero. If $\langle r_i \rangle \le 0$, $pc = q$, else
$pc = p + 1$
(b) (*) JMP $q$ : Jump to instruction index $q$.
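8. Data movement operations
(a) PUTL $r_i, r_j, t$ : Put local (1 item). $\langle r_i \rangle \leftarrow \langle r_j \rangle$
(b) PUTLM $r_i, r_j, t, s$ : Put local multiple ($s$ items). $\langle r_i \rangle \leftarrow \langle r_j \rangle$
(c) PUTR $r_i, r_j, t, h_i$ : Put remote (1 item). $\langle r_i \rangle \leftarrow \langle r_j \rangle$, where $r_i$ is on host $h_i$.
(d) PUTRM $r_i, r_j, t, s, h_i$ : Put remote multiple ($s$ items). $\langle r_i \rangle \leftarrow \langle r_j \rangle$, where $r_i$ is
on host $h_i$.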
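9. Comparison operations (single)
(a) EQ $r_i, r_j, r_k, t$ : Equal local (1 item). If $\langle r_j \rangle == \langle r_k \rangle > 0$, $\langle r_i \rangle = 1$, else
$\langle r_i \rangle = 0$.
(b) NE $r_i, r_j, r_k, t$ : Not equal local (1 item). If $\langle r_j \rangle \neq \langle r_k \rangle > 0$, $\langle r_i \rangle = 1$, else
$\langle r_i \rangle = 0$.
(c) GE $r_i, r_j, r_k, t$ : Greater than or equal local (1 item). If $\langle r_j \rangle \ge \langle r_k \rangle > 0$,
$\langle r_i \rangle = 1$, else $\langle r_i \rangle = 0$.
(d) GT $r_i, r_j, r_k, t$ : Greater than local (1 item). If $\langle r_j \rangle > \langle r_k \rangle > 0$, $\langle r_i \rangle = 1$, else
$\langle r_i \rangle = 0$.
(e) LE $r_i, r_j, r_k, t$ : Less than or equal local (1 item). If $\langle r_j \rangle \le \langle r_k \rangle > 0$, $\langle r_i \rangle = 1$,
else $\langle r_i \rangle = 0$.
(f) LT $r_i, r_j, r_k, t$ : Less than local (1 item). If $\langle r_j \rangle < \langle r_k \rangle > 0$, $\langle r_i \rangle = 1$, else
$\langle r_i \rangle = 0$.
(g) MIN $r_i, r_j, r_k, t$ : Minimum local (1 item). $\langle r_i \rangle \leftarrow \min(\langle r_j \rangle, \langle r_k \rangle)$.
(h) MAX $r_i, r_j, r_k, t$ : Maximum local (1 item). $\langle r_i \rangle \leftarrow \max(\langle r_j \rangle, \langle r_k \rangle)$.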
10. Comparison operations (multiple)
(a) (*) MINM $r_i, r_j, r_k, t, s$ : Minimum local multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \min(\langle r_j[m] \rangle, \langle r_k[m] \rangle)$
(b) (*) MAXM $r_i, r_j, r_k, t, s$ : Maximum local multiple ($s$ items).
$0 \le m < s : \langle r_i[m] \rangle \leftarrow \max(\langle r_j[m] \rangle, \langle r_k[m] \rangle)$
(c) (*) Multiple versions of EQ, NE, GE, GT, LE, and LT are also in INCA-A.
11. Finishing operations
(a) (*) END : Terminate program.
(b) (*) NOP : No operation.
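The semantics above can be exercised with a small interpreter. The following is a minimal sketch, not the INCA implementation: it assumes instructions are encoded as Python tuples, that memory is a dictionary mapping location names to values (or lists, for the ‘multiple’ variants), and that the datatype argument t can be ignored. Only a subset of the instructions is covered.

```python
import operator

# Single-instance binary instructions covered by this sketch.
BINARY = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul,
          "MIN": min, "MAX": max}


def run(program, mem, max_steps=10_000):
    """Interpret a list of instruction tuples against the memory dict `mem`."""
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        next_pc = pc + 1                      # default: pc = p + 1
        if op == "END":
            return mem
        if op == "NOP":
            pass
        elif op in BINARY:                    # e.g. ADD ri, rj, rk, t
            ri, rj, rk, _t = args
            mem[ri] = BINARY[op](mem[rj], mem[rk])
        elif op.endswith("M") and op[:-1] in BINARY:   # e.g. ADDM ri, rj, rk, t, s
            ri, rj, rk, _t, s = args
            fn = BINARY[op[:-1]]
            for m in range(s):
                mem[ri][m] = fn(mem[rj][m], mem[rk][m])
        elif op == "PUTL":                    # PUTL ri, rj, t
            ri, rj, _t = args
            mem[ri] = mem[rj]
        elif op == "BLEZ":                    # BLEZ ri, q
            ri, q = args
            if mem[ri] <= 0:
                next_pc = q
        elif op == "JMP":                     # JMP q
            (next_pc,) = args
        else:
            raise ValueError(f"unsupported instruction: {op}")
        pc = next_pc
    raise RuntimeError("step limit exceeded")


# Hypothetical kernel: acc = n + (n - 1) + ... + 1, using BLEZ/JMP as the loop.
sum_to_n = [
    ("BLEZ", "n", 4),
    ("ADD", "acc", "acc", "n", "i32"),
    ("SUB", "n", "n", "one", "i32"),
    ("JMP", 0),
    ("END",),
]
result = run(sum_to_n, {"n": 4, "acc": 0, "one": 1})
```

The loop is built solely from BLEZ and JMP, which illustrates how the two control instructions suffice to express iteration over the core operations.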

Appendix B:
The INCA-Q Language
This appendix includes the INCA-Q language syntax.

B.1 INCA-Q: Program
An INCA-Q program has two parts: directives defining the execution environment of the
kernel, and the kernel code itself. Every INCA-Q program must end with an end or fin
keyword. Comments begin with # and consist of any valid string, terminated by a newline.
⟨program⟩          ::= ⟨instructions⟩ ⟨end-instruction⟩

⟨instructions⟩     ::= ⟨instructions⟩ ⟨instruction⟩
                     | ⟨instruction⟩

⟨instruction⟩      ::= ⟨directive⟩
                     | ⟨expression⟩
                     | ⟨statement⟩
                     | ⟨comment⟩
                     |

⟨end-instruction⟩  ::= end
                     | fin

⟨comment⟩          ::= # ...

B.2 INCA-Q: Directives
INCA-Q directives define the execution environment of an INCA kernel, i.e., the variables
and buffers it will access, as well as their types.
⟨directive⟩                    ::= % ⟨directive-declaration⟩
                                 | % ⟨directive-array-declaration⟩
                                 | % ⟨directive-assignment⟩
                                 | % ⟨directive-wait⟩

⟨directive-declaration⟩        ::= % ⟨type⟩ ⟨variable⟩

⟨directive-array-declaration⟩  ::= % ⟨type⟩ ⟨variable⟩ ⟨directive-array-dims⟩

⟨directive-array-dims⟩         ::= ⟨directive-array-dims⟩ ⟨directive-array-dim⟩
                                 | ⟨directive-array-dim⟩

⟨directive-array-dim⟩          ::= [ ⟨unsigned-integer⟩ ]

⟨directive-assignment⟩         ::= % ⟨variable⟩ = ⟨integer⟩
                                 | % ⟨variable⟩ = ⟨float⟩

⟨directive-wait⟩               ::= % wait ⟨unsigned-integer⟩
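To make the four directive forms concrete, the sketch below classifies a single directive line with regular expressions. It is an illustration only, not the INCA-Q parser: it assumes each directive is written with one leading `%`, that identifiers may contain underscores, and that the type names are those listed in Table 2. The function name `parse_directive` and the example lines are hypothetical.

```python
import re

# Type names as listed in Table 2; longest first so "ui16" is tried before "i16".
TYPES = ("ui16", "ui32", "ui64", "i16", "i32", "i64", "ui8", "i8", "ld", "f", "d")
TYPE = "|".join(TYPES)
VAR = r"[A-Za-z]+[0-9A-Za-z_]*"   # assumption: underscore allowed in identifiers

PATTERNS = [
    ("wait", re.compile(r"%\s*wait\s+(?P<count>[0-9]+)$")),
    ("array", re.compile(
        rf"%\s*(?P<type>{TYPE})\s+(?P<name>{VAR})\s*(?P<dims>(\[\s*[0-9]+\s*\]\s*)+)$")),
    ("declaration", re.compile(rf"%\s*(?P<type>{TYPE})\s+(?P<name>{VAR})$")),
    ("assignment", re.compile(
        rf"%\s*(?P<name>{VAR})\s*=\s*(?P<value>[+-]?[0-9]+(\.[0-9]+)?)$")),
]


def parse_directive(line):
    """Return (kind, fields) for one directive line, or raise ValueError."""
    text = line.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            fields = match.groupdict()
            if kind == "array":
                # Turn "[4][8]" into the list of dimension sizes [4, 8].
                fields["dims"] = [int(d) for d in re.findall(r"[0-9]+", fields["dims"])]
            return kind, fields
    raise ValueError(f"not a valid directive: {line!r}")


example = parse_directive("% f weights[4][8]")
```

Values in assignment directives are kept as strings here, since whether a literal is an integer or a float would be decided against the declared type of the variable.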

B.3 INCA-Q: Expressions
The INCA-Q expressions are given below. This is a compressed version of the grammar for expressions, leaving out the details that enforce operator precedence. Operator
precedence, from highest to lowest, is
1. !, ~, exp, sin, cos, tan
2. \, *
3. +, -
4. >, >=, <, <=, ==
5. &, ^, |
6. &&, ||, min, max.

⟨expression⟩  ::= ⟨expression⟩ || ⟨expression⟩
                | ⟨expression⟩ \ ⟨expression⟩
                | ⟨expression⟩ * ⟨expression⟩
                | ⟨expression⟩ - ⟨expression⟩
                | ⟨expression⟩ + ⟨expression⟩
                | ⟨expression⟩ < ⟨expression⟩
                | ⟨expression⟩ <= ⟨expression⟩
                | ⟨expression⟩ > ⟨expression⟩
                | ⟨expression⟩ >= ⟨expression⟩
                | ⟨expression⟩ == ⟨expression⟩
                | ⟨expression⟩ & ⟨expression⟩
                | ⟨expression⟩ ^ ⟨expression⟩
                | ⟨expression⟩ | ⟨expression⟩
                | ⟨expression⟩ && ⟨expression⟩
                | ⟨expression⟩ || ⟨expression⟩
                | ⟨expression⟩ min ⟨expression⟩
                | ⟨expression⟩ max ⟨expression⟩
                | ! ⟨expression⟩
                | ~ ⟨expression⟩
                | exp( ⟨expression⟩ )
                | ( ⟨expression⟩ )
                | ⟨variable⟩
                | ⟨array-element⟩
                | ⟨number⟩
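The precedence levels listed above can be illustrated with a small precedence-climbing evaluator. This is a sketch under several assumptions rather than the INCA-Q implementation: it handles only integer literals, parentheses, and the binary operators; it treats the backslash as division; it assumes all binary operators are left-associative; and it borrows the ‘greater than zero’ convention of LAND and LOR from Appendix A for && and ||.

```python
import operator
import re

# Larger number = binds tighter. Levels 2..6 of the precedence list map to 5..1;
# the unary operators of level 1 are omitted from this sketch.
BINARY = {
    "\\": (5, operator.truediv),   # assumption: backslash denotes division
    "*": (5, operator.mul),
    "+": (4, operator.add),
    "-": (4, operator.sub),
    ">": (3, lambda a, b: int(a > b)),
    ">=": (3, lambda a, b: int(a >= b)),
    "<": (3, lambda a, b: int(a < b)),
    "<=": (3, lambda a, b: int(a <= b)),
    "==": (3, lambda a, b: int(a == b)),
    "&": (2, operator.and_),
    "^": (2, operator.xor),
    "|": (2, operator.or_),
    "&&": (1, lambda a, b: int(a > 0 and b > 0)),
    "||": (1, lambda a, b: int(a > 0 or b > 0)),
    "min": (1, min),
    "max": (1, max),
}

TOKEN = re.compile(r"\s*(\d+|&&|\|\||>=|<=|==|min|max|[\\*+\-<>&^|()])")


def tokenize(src):
    """Split an expression string into number, operator, and parenthesis tokens."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected input: {src[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(src):
    """Evaluate an expression using precedence climbing."""
    tokens = tokenize(src)
    pos = 0

    def primary():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token: {tok!r}")

    def expr(min_prec):
        nonlocal pos
        left = primary()
        while (pos < len(tokens) and tokens[pos] in BINARY
               and BINARY[tokens[pos]][0] >= min_prec):
            prec, fn = BINARY[tokens[pos]]
            pos += 1
            right = expr(prec + 1)   # prec + 1 makes the operator left-associative
            left = fn(left, right)
        return left

    value = expr(1)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return value


mixed = evaluate("2 + 3 > 4 && 1")
```

In the last line, addition is applied first, then the comparison, and && last, following the order of levels 3, 4, and 6 in the list.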

B.4 INCA-Q: Statements
⟨statement⟩         ::= ⟨if-statement⟩
                      | ⟨while-statement⟩
                      | ⟨assignment⟩

⟨if-statement⟩      ::= if ⟨expression⟩ ⟨then-statement⟩

⟨then-statement⟩    ::= then { ⟨program-block⟩ }
                      | then { ⟨program-block⟩ } else { ⟨program-block⟩ }

⟨while-statement⟩   ::= while ⟨expression⟩ do { ⟨program-block⟩ }

⟨assignment⟩        ::= ⟨variable⟩ = ⟨expression⟩
                      | ⟨array-element⟩ = ⟨expression⟩

⟨variable⟩          ::= [A-Za-z]+[0-9A-Za-z_]*

⟨array-element⟩     ::= ⟨variable⟩ ⟨array-dims⟩

⟨array-dims⟩        ::= ⟨array-dims⟩ ⟨array-dim⟩
                      | ⟨array-dim⟩

⟨array-dim⟩         ::= [ ⟨expression⟩ ]

⟨unsigned-integer⟩  ::= [0-9]+

⟨integer⟩           ::= [+|-] ⟨unsigned-integer⟩

⟨unsigned-float⟩    ::= ⟨unsigned-integer⟩ . ⟨unsigned-integer⟩

⟨float⟩             ::= [+|-] ⟨unsigned-float⟩

B.5 INCA-Q: Additional information
INCA-Q includes a no-op function (inca_nop) to facilitate placeholding when developing
INCA-A kernels that may use functionality not currently included in INCA-Q. In addition
to this function, the following strings are reserved and cannot be used as variable or array
names:
min, max, exp, sin, cos, tan
INCA-Q uses types as defined by the atomic operations appearing in the Portals
specification, not including complex numbers. These are listed in Table 2.
| Type | Description |
|---|---|
| i8 | 8-bit signed integer |
| ui8 | 8-bit unsigned integer |
| i16 | 16-bit signed integer |
| ui16 | 16-bit unsigned integer |
| i32 | 32-bit signed integer |
| ui32 | 32-bit unsigned integer |
| i64 | 64-bit signed integer |
| ui64 | 64-bit unsigned integer |
| f | 32-bit float |
| d | 64-bit float |
| ld | 64-bit float |

Table 2: INCA-Q types.
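For readers who want to manipulate buffers of these types from a host-side script, the widths in Table 2 can be mirrored with Python's standard `struct` module. The mapping below is a sketch: the little-endian byte order is an assumption, as is packing `ld` as a double, which simply follows the 64-bit width given in the table.

```python
import struct

# INCA-Q type name -> struct format code (standard sizes).
STRUCT_CODE = {
    "i8": "b", "ui8": "B",
    "i16": "h", "ui16": "H",
    "i32": "i", "ui32": "I",
    "i64": "q", "ui64": "Q",
    "f": "f", "d": "d",
    "ld": "d",  # assumption: packed as a double, per the 64-bit width in Table 2
}


def width_bits(type_name):
    """Width in bits of one value of the given INCA-Q type."""
    return 8 * struct.calcsize("<" + STRUCT_CODE[type_name])


def pack_values(type_name, values):
    """Pack a sequence of values as a contiguous little-endian buffer."""
    return struct.pack(f"<{len(values)}{STRUCT_CODE[type_name]}", *values)


buffer = pack_values("i32", [1, 2, 3])
```

Such a buffer could serve, for example, as the initial contents of an array declared with a directive, although how buffers are actually handed to a kernel is outside the scope of this sketch.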

References

[1] JEDEC. (2016). JEDEC Updates Groundbreaking High Bandwidth Memory
(HBM) Standard - JEDEC, [Online]. Available: https://www.jedec.org/news/pressreleases/jedec-updates-groundbreaking-high-bandwidth-memory-hbm-standard.
[2] NVIDIA. (2018). Understanding MPI Tag Matching and Rendezvous Offloads
(ConnectX-5), [Online]. Available: https://community.mellanox.com/s/article/understanding-mpi-tag-matching-and-rendezvous-offloads-connectx-5-x (visited on 06/28/2019).
[3] B. W. Barrett, R. Brightwell, R. E. Grant, S. Hemmert, K. Pedretti, K. Wheeler,
K. Underwood, R. Riesen, T. Hoefler, A. B. Maccabe, and T. Hudson, “The
Portals 4.2 Network Programming Interface,” Tech. Rep. SAND2018-12790, 2018.
[4] InfiniBand Trade Association. (2018). InfiniBand Roadmap, [Online]. Available:
http://www.infinibandta.org/content/pages.php?pg=technology_overview.
[5] G. W. Connery, W. P. Sherer, G. Jaszewski, and J. S. Binder, “Offload of TCP
Segmentation to a Smart Adapter,” US Patent 5,937,169, Aug. 1999.
[6] J. Freeman, “An Industry Analyst’s Perspective on Network Processors,” in Network Processor Design: Issues and Practices, P. Z. Onufryk, H. Hadimioglu, and
P. Crowley, Eds., Burlington: Morgan Kaufmann, 2002, pp. 191–218, isbn: 978-1-55860-875-7. doi: 10.1016/B978-155860875-7/50027-2.
[7] M. Ahmadi and S. Wong, “Network Processors: Challenges and Trends,” in
Proceedings of the 17th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc 2006, 2006, pp. 223–232.
[8] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica,
and M. Horowitz, “Forwarding Metamorphosis: Fast Programmable Match-action
Processing in Hardware for SDN,” in Proceedings of the ACM SIGCOMM 2013
Conference on SIGCOMM, New York, NY, USA: ACM, 2013, pp. 99–110, isbn:
978-1-4503-2056-6. doi: 10.1145/2486001.2486011.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger,
D. Talayco, A. Vahdat, G. Varghese, and D. Walker, “P4: Programming Protocol-independent Packet Processors,” SIGCOMM Comput. Commun. Rev., vol. 44,
no. 3, pp. 87–95, Jul. 2014, issn: 0146-4833. doi: 10.1145/2656877.2656890.
[10] S. Han, K. Jang, K. Park, and S. Moon, “PacketShader: A GPU-Accelerated
Software Router,” SIGCOMM Comput. Commun. Rev., vol. 40, no. 4, pp. 195–
206, Aug. 2010, issn: 0146-4833. doi: 10.1145/1851275.1851207.
[11] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, “The Click
Modular Router,” ACM Trans. Comput. Syst., vol. 18, no. 3, pp. 263–297, 2000,
issn: 0734-2071. doi: 10.1145/354871.354874.
[12] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya, “IncBricks:
Toward In-Network Computation with an In-Network Cache,” in Proceedings of
the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY, USA: ACM, 2017,
pp. 795–809, isbn: 978-1-4503-4465-4. doi: 10.1145/3037697.3037731.
[13] G. Lu, C. Guo, Y. Li, Z. Zhou, T. Yuan, H. Wu, Y. Xiong, R. Gao, and Y. Zhang,
“ServerSwitch: A Programmable and High Performance Platform for Data Center
Networks,” in Proceedings of the 8th USENIX Conference on Networked Systems
Design and Implementation, USA: USENIX Association, 2011, pp. 15–28.
[14] S. Bhattacharjee, K. L. Calvert, and E. W. Zegura, “An Architecture for Active Networking,” in High Performance Networking VII: IFIP TC6 Seventh International Conference on High Performance Networks (HPN ‘97), 28th April –
2nd May 1997, White Plains, New York, USA, A. Tantawy, Ed., Boston, MA:
Springer US, 1997, pp. 265–279, isbn: 978-0-387-35279-4. doi: 10.1007/978-0-387-35279-4_17.
[15] K. L. Calvert, S. Bhattacharjee, E. Zegura, and J. Sterbenz, “Directions in active
networks,” IEEE Communications Magazine, vol. 36, no. 10, pp. 72–78, Oct. 1998,
issn: 0163-6804. doi: 10.1109/35.722139.
[16] S. Eichert, O. N. Ertugay, D. Nessett, and S. Vobbilisetty, “Commercially Viable
Active Networking,” SIGOPS Oper. Syst. Rev., vol. 36, no. 1, pp. 8–22, 2002,
issn: 0163-5980. doi: 10.1145/844166.844167.
[17] R. Jaeger, S. Bhattacharjee, J. K. Hollingsworth, R. Duncan, T. Lavian, and
F. Travostino, “Integrating Active Networking and Commercial-Grade Routing
Platforms,” in Proceedings of the Workshop on Intelligence at the Network Edge,
USA: USENIX Association, 2000, p. 2.
[18] V. Jeyakumar, M. Alizadeh, Y. Geng, C. Kim, and D. Mazières, “Millions of Little
Minions: Using Packets for Low Latency Network Programming and Visibility,”
in Proceedings of the 2014 ACM Conference on SIGCOMM, New York, NY, USA:
Association for Computing Machinery, 2014, pp. 3–14, isbn: 978-1-4503-2836-4.
doi: 10.1145/2619239.2626292.
[19] U. Legedza, D. Wetherall, and J. Guttag, “Improving the performance of distributed applications using active networks,” in Proceedings. IEEE INFOCOM
’98, the Conference on Computer Communications. Seventeenth Annual Joint
Conference of the IEEE Computer and Communications Societies. Gateway to
the 21st Century (Cat. No.98, vol. 2, 1998, 590–599 vol.2.
doi: 10.1109/INFCOM.1998.665079.
[20] N. Shalaby, L. Peterson, A. Bavier, Y. Gottlieb, S. Karlin, A. Nakao, Xiaohu
Qie, T. Spalink, and M. Wawrzoniak, “Extensible routers for active networks,” in
Proceedings DARPA Active Networks Conference and Exposition, 2002, pp. 92–
116.
[21] D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J.
Minden, “A survey of active network research,” IEEE Communications Magazine,
vol. 35, no. 1, pp. 80–86, Jan. 1997, issn: 0163-6804. doi: 10.1109/35.568214.
[22] H. T. Dang, D. Sciascia, M. Canini, F. Pedone, and R. Soulé, “NetPaxos: Consensus at Network Speed,” in Proceedings of the 1st ACM SIGCOMM Symposium on
Software Defined Networking Research, New York, NY, USA: ACM, 2015, 5:1–5:7,
isbn: 978-1-4503-3451-8. doi: 10.1145/2774993.2774999.
[23] H. T. Dang, M. Canini, F. Pedone, and R. Soulé, “Paxos Made Switch-y,” SIGCOMM Comput. Commun. Rev., vol. 46, no. 2, pp. 18–24, May 2016, issn: 0146-4833. doi: 10.1145/2935634.2935638.
[24] R. L. Graham, D. Bureddy, P. Lui, H. Rosenstock, G. Shainer, G. Bloch, D.
Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir, L. Levi, A. Margolin, T.
Ronen, A. Shpiner, O. Wertheim, and E. Zahavi, “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction,”
in Proceedings of the First Workshop on Optimization of Communication in HPC,
Piscataway, NJ, USA: IEEE Press, 2016, pp. 1–10, isbn: 978-1-5090-3829-9. doi:
10.1109/COM-HPC.2016.6.
[25] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, “In-Network Computation is a Dumb Idea Whose Time Has Come,” in Proceedings of the 16th ACM
Workshop on Hot Topics in Networks, New York, NY, USA: ACM, 2017, pp. 150–
156, isbn: 978-1-4503-5569-8. doi: 10.1145/3152434.3152461.
[26] S. Bhattacharjee, K. Calvert, and E. Zegura, “Self-organizing wide-area network
caches,” in Proceedings. IEEE INFOCOM ’98, the Conference on Computer Communications. Seventeenth Annual Joint Conference of the IEEE Computer and
Communications Societies. Gateway to the 21st Century (Cat. No.98, vol. 2, Mar.
1998, 600–608 vol.2. doi: 10.1109/INFCOM.1998.665080.
[27] X. Li, R. Sethi, M. Kaminsky, D. G. Andersen, and M. J. Freedman, “Be fast,
cheap and in control with SwitchKV,” in Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation, Santa Clara, CA:
USENIX Association, Mar. 2016, pp. 31–44, isbn: 978-1-931971-29-4.
[28] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica, “NetCache: Balancing Key-Value Stores with Fast In-Network Caching,” in Proceedings
of the 26th Symposium on Operating Systems Principles, Shanghai, China: Association for Computing Machinery, Oct. 2017, pp. 121–136, isbn: 978-1-4503-5085-3.
doi: 10.1145/3132747.3132764.
[29] S. Eichert, D. Nessett, W. Luo, and E. Lusher, “Dynamic policy management
apparatus and method using active network devices,” US Patent 6393474B1, May
2002.

[30] Â. C. Lapolli, J. A. Marques, and L. P. Gaspary, “Offloading Real-time DDoS
Attack Detection to Programmable Data Planes,” in 2019 IFIP/IEEE Symposium
on Integrated Network and Service Management (IM), 2019, pp. 19–27.
[31] M. Ghasemi, T. Benson, and J. Rexford, “Dapper: Data Plane Performance Diagnosis of TCP,” in Proceedings of the Symposium on SDN Research, New York, NY,
USA: ACM, 2017, pp. 61–74, isbn: 978-1-4503-4947-5. doi: 10.1145/3050220.3050228.
[32] K. D. Underwood, K. S. Hemmert, A. Rodrigues, R. Murphy, and R. Brightwell,
“A Hardware Acceleration Unit for MPI Queue Processing,” in 19th IEEE International Parallel and Distributed Processing Symposium, Apr. 2005, 96b–96b.
doi: 10.1109/IPDPS.2005.30.
[33] B. Gamache, Z. Pfeffer, and S. P. Khatri, “A fast ternary CAM design for IP
networking applications,” in Proceedings. 12th International Conference on Computer Communications and Networks (IEEE Cat. No.03EX712), 2003, pp. 434–
439.
[34] D. Chang and P. Wang, “TCAM-Based Multi-Match Packet Classification Using Multidimensional Rule Layering,” IEEE/ACM Transactions on Networking,
vol. 24, no. 2, pp. 1125–1138, 2016.
[35] S. Derradji, T. Palfer-Sollier, J. P. Panziera, A. Poudes, and F. W. Atos, “The
BXI Interconnect Architecture,” in 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, Aug. 2015, pp. 18–25. doi: 10.1109/HOTI.2015.15.
[36] Netronome. (2020). Agilio CX SmartNICs, [Online]. Available: https://www.netronome.com/products/agilio-cx/.
[37] ——, (2018). Netronome 25gbe SmartNICs with Open vSwitch Hardware Offload
Drive Unmatched Cloud and Data Center Infrastructure Performance, [Online].
Available: https://www.netronome.com/m/documents/WP_Netronome_25GbE_
SmartNICs_with_Open_vSwitch_Hardware_Offload.pdf.
[38] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3, pp. 712–727, 2006.
[39] M. A. Foysal, M. Z. Anam, M. S. Islam, I. Tahmid, and K. Mondal, “Performance
analysis of ternary content addressable memory (TCAM),” in 2015 International
Conference on Advances in Electrical Engineering (ICAEE), 2015, pp. 105–108.
[40] M. Gries and K. Keutzer, Building ASIPS: The mescal methodology. Springer,
2005, isbn: 0-387-26057-9.
[41] P. Crowley, M. A. Franklin, H. Hadimioglu, and P. Z. Onufryk, “Network processors: An introduction to design issues,” in Network Processor Design: Issues and
Practices, P. Crowley, M. A. Franklin, H. Hadimioglu, and P. Z. Onufryk, Eds.,
vol. 1, Burlington: Morgan Kaufmann, 2003, pp. 1–8.
[42] E. Barton, J. Cownie, and M. McLaren, “Message Passing on the Meiko CS-2,”
Parallel Comput., vol. 20, no. 4, pp. 497–507, Apr. 1994, issn: 0167-8191. doi:
10.1016/0167-8191(94)90025-6.
[43] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic,
and W.-K. Su, “Myrinet: A gigabit-per-second local area network,” IEEE Micro,
vol. 15, no. 1, pp. 29–36, Feb. 1995, issn: 0272-1732. doi: 10.1109/40.342015.
[44] F. Petrini, Wu-chun Feng, A. Hoisie, S. Coll, and E. Frachtenberg, “The Quadrics
network (QsNet): High-performance clustering technology,” in HOT 9 Interconnects. Symposium on High Performance Interconnects, 2001, pp. 125–130.
[45] F. Petrini, W.-c. Feng, A. Hoisie, S. Coll, and E. Frachtenberg, “The Quadrics
Network: High-Performance Clustering Technology,” IEEE Micro, vol. 22, no. 1,
pp. 46–57, Jan. 2002, issn: 0272-1732. doi: 10.1109/40.988689.
[46] R. Brightwell, K. Pedretti, and K. D. Underwood, “Initial performance evaluation of the Cray SeaStar interconnect,” in 13th Symposium on High Performance
Interconnects (HOTI’05), Aug. 2005, pp. 51–57. doi: 10.1109/CONECT.2005.24.
[47] R. Brightwell, K. Pedretti, K. Underwood, and T. Hudson, “SeaStar Interconnect: Balanced Bandwidth for Scalable Performance,” IEEE Micro, vol. 26, no. 3,
pp. 41–57, May 2006, issn: 0272-1732. doi: 10.1109/MM.2006.65.
[48] A. Wagner, H.-W. Jin, D. K. Panda, and R. Riesen, “NIC-based offload of dynamic user-defined modules for Myrinet clusters,” in 2004 IEEE International
Conference on Cluster Computing (IEEE Cat. No.04EX935), Sep. 2004, pp. 205–
214. doi: 10.1109/CLUSTR.2004.1392618.
[49] K. T. Pedretti and T. Hudson, “Developing Custom Firmware for the Red Storm
SeaStar Network Interface,” in Proceedings of the 47th Cray User Group Annual
Technical Conference, 2005.
[50] NVIDIA. (2020). NVIDIA Mellanox BlueField DPU, [Online]. Available:
https://www.mellanox.com/products/bluefield1-overview (visited on 09/28/2020).
[51] ——, (2020). NVIDIA Mellanox BlueField-2 DPU, [Online]. Available: https://
www.mellanox.com/products/bluefield2-overview (visited on 09/28/2020).
[52] Broadcom. (2019). Stingray SmartNIC, [Online]. Available: https://www.broadcom.com/products/ethernet-connectivity/smartnic/ps225.
[53] Marvell. (2020). LiquidIO II 10/25GbE Adapter family, [Online]. Available:
https://www.marvell.com/products/ethernet-adapters-and-controllers/
liquidio-smart-nics/liquidio-ii-smart-nics.html.
[54] T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, and R. Brightwell, “sPIN:
High-performance Streaming Processing In the Network,” in Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, New York, NY, USA: ACM, 2017, 59:1–59:16, isbn: 978-1-4503-5114-0. doi: 10.1145/3126908.3126970.
[55] S. Choi, M. Shahbaz, B. Prabhakar, and M. Rosenblum, “Lamda-NIC: Interactive
Serverless Compute on SmartNICs,” in Proceedings of the ACM SIGCOMM 2019
Conference Posters and Demos, New York, NY, USA: Association for Computing
Machinery, 2019, pp. 151–152, isbn: 978-1-4503-6886-5. doi: 10.1145/3342280.3342341.
[56] Netronome. (2018). Programming Netronome Agilio SmartNICs, [Online]. Available: https://www.netronome.com/m/documents/WP_NFP_Programming_Model.pdf.
[57] ——, (2018). Netronome NFP Silicon and Composable IP Blocks, [Online]. Available: https://www.netronome.com/m/documents/FAQ_NFP_IP_Blocks_.pdf.
[58] V. C. Ravikumar and R. N. Mahapatra, “TCAM architecture for IP lookup using
prefix properties,” IEEE Micro, vol. 24, no. 2, pp. 60–69, 2004.
[59] K. D. Underwood, R. R. Sass, and W. B. Ligon, “A reconfigurable extension to
the network interface of Beowulf clusters,” in Proceedings 2001 IEEE International
Conference on Cluster Computing, Oct. 2001, pp. 212–221. doi: 10.1109/CLUSTR.2001.959980.
[60] K. Underwood, R. Sass, and W. Ligon, “Acceleration of a 2D-FFT on an
Adaptable Computing Cluster,” in The 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’01), Mar. 2001, pp. 180–189.
[61] G. Gibb, J. W. Lockwood, J. Naous, P. Hartke, and N. McKeown, “NetFPGA—An
Open Platform for Teaching How to Build Gigabit-Rate Network Switches and
Routers,” IEEE Transactions on Education, vol. 51, no. 3, pp. 364–369, 2008.

[62] N. Mehta. (2015). Xilinx UltraScale architecture for high-performance, smarter
systems, [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp434-ultrascale-smarter-systems.pdf.
[63] Xilinx. (2020). Alveo Accelerator Cards, [Online]. Available: https://www.xilinx.com/products/boards-and-kits/accelerator-cards.html.
[64] NVIDIA. (Jul. 2019). NVIDIA Mellanox Innova-2 Flex Open Programmable
SmartNIC, [Online]. Available: http://www.mellanox.com/page/products_
dyn?product_family=276 (visited on 07/05/2019).
[65] D. Firestone, “VFP: A Virtual Switch Platform for Host Sdn in the Public Cloud,”
in Proceedings of the 14th USENIX Conference on Networked Systems Design and
Implementation, Berkeley, CA, USA: USENIX Association, 2017, pp. 315–328,
isbn: 978-1-931971-37-9.
[66] D. Firestone, A. Putnam, S. Mundkur, D. Chiou, A. Dabagh, M. Andrewartha, H.
Angepat, V. Bhanu, A. Caulfield, E. Chung, H. K. Chandrappa, S. Chaturmohta,
M. Humphrey, J. Lavier, N. Lam, F. Liu, K. Ovtcharov, J. Padhye, G. Popuri,
S. Raindel, T. Sapre, M. Shaw, G. Silva, M. Sivakumar, N. Srivastava, A. Verma,
Q. Zuhair, D. Bansal, D. Burger, K. Vaid, D. A. Maltz, and A. Greenberg, “Azure
Accelerated Networking: SmartNICs in the Public Cloud,” in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton,
WA: USENIX Association, 2018, pp. 51–66, isbn: 978-1-939133-01-4.
[67] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman,
S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, D. Lo, T. Massengill, K. Ovtcharov,
M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger, “A cloud-scale
acceleration architecture,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct. 2016, pp. 1–13. doi: 10.1109/MICRO.2016.7783710.
[68] OpenSHMEM. (2020). OpenSHMEM Application Programming interface Version
1.5, [Online]. Available: http://www.openshmem.org/.
[69] Message Passing Interface Forum. (2015). MPI: A Message-Passing Interface Standard Version 3.1, [Online]. Available: https://www.mpi-forum.org/docs/.
[70] Microsoft. (2019). Introduction to Receive Side Scaling, [Online]. Available:
https://docs.microsoft.com/en-us/windows-hardware/drivers/network/
introduction-to-receive-side-scaling.
[71] J. T. Humphries, K. Kaffes, D. Mazières, and C. Kozyrakis, “Mind the Gap: A
Case for Informed Request Scheduling at the NIC,” in Proceedings of the 18th
ACM Workshop on Hot Topics in Networks, New York, NY, USA: Association
for Computing Machinery, 2019, pp. 60–68, isbn: 978-1-4503-7020-2. doi: 10.1145/3365609.3365856.
[72] Y. Dong, X. Yang, X. Li, J. Li, K. Tian, and H. Guan, “High performance network
virtualization with SR-IOV,” in HPCA - 16 2010 The Sixteenth International
Symposium on High-Performance Computer Architecture, Jan. 2010, pp. 1–10.
doi: 10.1109/HPCA.2010.5416637.
[73] J. C. Mogul, “TCP Offload is a Dumb Idea Whose Time Has Come,” in Proceedings
of the 9th Conference on Hot Topics in Operating Systems - Volume 9, Berkeley,
CA, USA: USENIX Association, 2003, pp. 5–5.
[74] P. Shivam and J. S. Chase, “On the Elusive Benefits of Protocol Offload,” in
Proceedings of the ACM SIGCOMM Workshop on Network-I/O Convergence: Experience, Lessons, Implications, New York, NY, USA: ACM, 2003, pp. 179–184.
doi: 10.1145/944747.944750.
[75] P. Sarkar, S. Uttamchandani, and K. Voruganti, “Storage Over IP: When Does
Hardware Support Help?” In Proceedings of the 2nd USENIX Conference on File
and Storage Technologies, USA: USENIX Association, 2003, pp. 231–244.
[76] B. S. Ang, “An evaluation of an attempt at offloading TCP/IP protocol processing
onto an i960rn-based iNIC,” Hewlett Packard, Tech. Rep., 2001.
[77] Y. Hoskote, B. A. Bloechel, G. E. Dermer, V. Erraguntla, D. Finan, J. Howard,
D. Klowden, S. G. Narendra, G. Ruhl, J. W. Tschanz, Sriram Vangal, V. Veeramachaneni, H. Wilson, Jianping Xu, and N. Borkar, “A TCP offload accelerator
for 10 Gb/s Ethernet in 90-nm CMOS,” IEEE Journal of Solid-State Circuits,
vol. 38, no. 11, pp. 1866–1875, 2003.
[78] D. Freimuth, E. Hu, J. LaVoie, R. Mraz, E. Nahum, P. Pradhan, and J. Tracey,
“Server Network Scalability and TCP Offload,” in Proceedings of the Annual Conference on USENIX Annual Technical Conference, USA: USENIX Association,
2005, p. 15.
[79] W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and D. K. Panda, “Performance Characterization of a 10-Gigabit Ethernet TOE,” in Proceedings of the 13th Symposium
on High Performance Interconnects, Washington, DC, USA: IEEE Computer Society, 2005, pp. 58–63, isbn: 978-0-7695-2449-8. doi: 10.1109/CONECT.2005.30.
[80] M. J. Rashti, R. E. Grant, A. Afsahi, and P. Balaji, “iWARP redefined: Scalable
connectionless communication over high-speed Ethernet,” in High Performance
Computing (HiPC), 2010 International Conference on, IEEE, 2010, pp. 1–10.
[81] R. E. Grant, M. J. Rashti, A. Afsahi, and P. Balaji, “RDMA Capable iWARP over
Datagrams,” in 2011 IEEE International Parallel Distributed Processing Symposium, May 2011, pp. 628–639. doi: 10.1109/IPDPS.2011.66.
[82] H. Jang, S.-H. Chung, D. K. Kim, and Y.-S. Lee, “An Efficient Architecture
for a TCP Offload Engine Based on Hardware/Software Co-design,” Journal of
Information Science and Engineering, vol. 27, pp. 493–509, 2011.
[83] Y. Moon, S. Lee, M. A. Jamshed, and K. Park, “AccelTCP: Accelerating Network Applications with Stateful TCP Offloading,” in 17th USENIX Symposium
on Networked Systems Design and Implementation (NSDI 20), Santa Clara, CA:
USENIX Association, Feb. 2020, pp. 77–92, isbn: 978-1-939133-13-7.
[84] C. Clark, W. Lee, D. Schimmel, D. Contis, M. Koné, and A. Thomas, “A hardware
platform for network intrusion detection and prevention,” in Proceedings of the
3rd Workshop on Network Processors and Applications (NP3), 2005.
[85] Intel, “Intel IXP1200 Network Processor Family,” Intel Corporation, Tech. Rep.,
Dec. 2001.
[86] H. Bos and K. Huang, “Towards Software-Based Signature Detection for Intrusion
Prevention on the Network Card,” in Recent Advances in Intrusion Detection, A.
Valdes and D. Zamboni, Eds., Berlin, Heidelberg: Springer, 2006, pp. 102–123,
isbn: 978-3-540-31779-1. doi: 10.1007/11663812_6.
[87] W. de Bruijn, A. Slowinska, K. van Reeuwijk, T. Hruby, L. Xu, and H. Bos,
“SafeCard: A Gigabit IPS on the Network Card,” in Recent Advances in Intrusion
Detection, D. Zamboni and C. Kruegel, Eds., Berlin, Heidelberg: Springer, 2006,
pp. 311–330, isbn: 978-3-540-39725-0. doi: 10.1007/11856214_16.
[88] D. Friedman and D. Nagle, “Building firewalls with intelligent network interface
cards,” Tech. Rep. CMU-CS-00-173, 2001.
[89] M. Burnside and A. Keromytis, “Accelerating application-level security protocols,” in The 11th IEEE International Conference on Networks, 2003. ICON2003.,
Oct. 2003, pp. 313–318. doi: 10.1109/ICON.2003.1266209.
[90] M. Burnside and A. D. Keromytis, “High-speed I/O: The Operating System As
a Signalling Mechanism,” in Proceedings of the ACM SIGCOMM Workshop on
Network-I/O Convergence: Experience, Lessons, Implications, New York, NY,
USA: ACM, 2003, pp. 220–227. doi: 10.1145/944747.944756.

[91] P. Chaignon, D. Adjavon, K. Lazri, J. François, and O. Festor, “Offloading Security Services to the Cloud Infrastructure,” in Proceedings of the 2018 Workshop on
Security in Softwarized Networks: Prospects and Challenges, New York, NY, USA:
Association for Computing Machinery, 2018, pp. 27–32, isbn: 978-1-4503-5912-2.
doi: 10.1145/3229616.3229624.
[92] M. Dimolianis, A. Pavlidis, and V. Maglaris, “A Multi-Feature DDoS Detection
Schema on P4 Network Hardware,” in 2020 23rd Conference on Innovation in
Clouds, Internet and Networks and Workshops (ICIN), 2020, pp. 1–6.
[93] K. Verstoep, K. Langendoen, and H. Bal, “Efficient reliable multicast on Myrinet,”
in Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing,
vol. 3, Aug. 1996, pp. 156–165. doi: 10.1109/ICPP.1996.538571.
[94] R. A. F. Bhoedjang, T. Ruhl, and H. E. Bal, “Efficient multicast on Myrinet using
link-level flow control,” in Proceedings. 1998 International Conference on Parallel
Processing (Cat. No.98EX205), Aug. 1998, pp. 381–390. doi: 10.1109/ICPP.1998.708509.
[95] D. Buntinas, D. K. Panda, J. Duato, and P. Sadayappan, “Broadcast/Multicast
over Myrinet Using NIC-Assisted Multidestination Messages,” in Network-Based
Parallel Computing. Communication, Architecture, and Applications, Springer,
Berlin, Heidelberg, Jan. 2000, pp. 115–129. doi: 10.1007/10720115_9.
[96] C. Keppitiyagama and A. S. Wagner, “Asynchronous MPI Messaging on Myrinet,”
in Proceedings of the 15th International Parallel & Distributed Processing Symposium, Washington, DC, USA: IEEE Computer Society, 2001, pp. 50–, isbn:
978-0-7695-0990-7.
[97] B. Tourancheau and R. Westrelin, “Support for MPI at the Network Interface
Level,” in Recent Advances in Parallel Virtual Machine and Message Passing
Interface, Y. Cotronis and J. Dongarra, Eds., Springer Berlin Heidelberg, 2001,
pp. 52–60, isbn: 978-3-540-45417-5.
[98] D. Buntinas, D. K. Panda, and P. Sadayappan, “Performance benefits of NIC-based barrier on Myrinet/GM,” in Proceedings 15th International Parallel and
Distributed Processing Symposium. IPDPS 2001, Apr. 2001, pp. 1717–1724. doi:
10.1109/IPDPS.2001.925159.
[99] ——, “Fast NIC-based barrier over Myrinet/GM,” in Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, Apr. 2001.
doi: 10.1109/IPDPS.2001.924993.
[100] R. L. Graham, S. Poole, P. Shamis, G. Bloch, N. Bloch, H. Chapman, M. Kagan, A. Shahar, I. Rabinovitz, and G. Shainer, “Overlapping computation and
communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities,”
in 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), Apr. 2010, pp. 1–8. doi: 10.1109/IPDPSW.2010.5470854.
[101] K. S. Hemmert, B. Barrett, and K. D. Underwood, “Using Triggered Operations
to Offload Collective Communication Operations,” in Recent Advances in the Message Passing Interface, Springer, Berlin, Heidelberg, Sep. 2010, pp. 249–256. doi:
10.1007/978-3-642-15646-5_26.
[102] K. D. Underwood, J. Coffman, R. Larsen, K. S. Hemmert, B. W. Barrett, R.
Brightwell, and M. Levenhagen, “Enabling Flexible Collective Communication
Offload with Triggered Operations,” in 2011 IEEE 19th Annual Symposium on
High Performance Interconnects, Aug. 2011, pp. 35–42. doi: 10.1109/HOTI.2011.15.
[103] B. W. Barrett, R. Brightwell, K. S. Hemmert, K. B. Wheeler, and K. D. Underwood, “Using Triggered Operations to Offload Rendezvous Messages,” in Recent
Advances in the Message Passing Interface, Springer, Berlin, Heidelberg, Sep.
2011, pp. 120–129, isbn: 978-3-642-24448-3. doi: 10.1007/978-3-642-24449-0_15.
[104] N. Tanabe, A. Ohta, P. Waskito, and H. Nakajo, “Network Interface Architecture
for Scalable Message Queue Processing,” in 2009 15th International Conference on
Parallel and Distributed Systems, Dec. 2009, pp. 268–275. doi: 10.1109/ICPADS.2009.140.
[105] B. Klenk, H. Fröening, H. Eberle, and L. Dennison, “Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors,” in 2017
IEEE International Parallel and Distributed Processing Symposium (IPDPS), May
2017, pp. 855–865. doi: 10.1109/IPDPS.2017.94.
[106] N. Alachiotis and A. Stamatakis, “FPGA Optimizations for a Pipelined Floating-Point Exponential Unit,” in Reconfigurable Computing: Architectures, Tools and
Applications, A. Koch, R. Krishnamurthy, J. McAllister, R. Woods, and T. El-Ghazawi, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 316–327,
isbn: 978-3-642-19475-7.
[107] F. de Dinechin and B. Pasca, “Floating-point exponential functions for DSP-enabled FPGAs,” in 2010 International Conference on Field-Programmable Technology, 2010, pp. 110–117.
[108] E. Jamro, K. Wiatr, and M. Wielgosz, “FPGA Implementation of 64-Bit Exponential Function for HPC,” in 2007 International Conference on Field Programmable
Logic and Applications, 2007, pp. 718–721.
[109] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A
Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning,”
in Proceedings of the 19th International Conference on Architectural Support for
Programming Languages and Operating Systems, New York, NY, USA: Association
for Computing Machinery, 2014, pp. 269–284, isbn: 978-1-4503-2305-5. doi: 10.1145/2541940.2541967.
[110] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O.
Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” in 2015
ACM/IEEE 42nd Annual International Symposium on Computer Architecture
(ISCA), Jun. 2015, pp. 92–104. doi: 10.1145/2749469.2750389.
[111] B. Fitzpatrick, “Distributed Caching with Memcached,” Linux J., vol. 2004,
no. 124, p. 5, 2004, issn: 1075-3583.
[112] H.-y. Kim, V. S. Pai, and S. Rixner, “Increasing web server throughput with
network interface data caching,” ACM SIGPLAN Notices, vol. 37, no. 10, pp. 239–250,
Oct. 2002, issn: 0362-1340. doi: 10.1145/605432.605423.
[113] G. S. Choi, J.-H. Kim, D. Ersoz, M. S. Yousif, and C. R. Das, “Exploiting NIC
Memory for Improving Cluster-Based Webserver Performance,” in 2005 IEEE
International Conference on Cluster Computing, Sep. 2005, pp. 1–10. doi: 10.1109/CLUSTR.2005.347067.
[114] E. S. Fukuda, H. Inoue, T. Takenaka, D. Kim, T. Sadahisa, T. Asai, and M. Motomura, “Caching memcached at reconfigurable network interface,” in 2014 24th
International Conference on Field Programmable Logic and Applications (FPL),
2014, pp. 1–6. doi: 10.1109/FPL.2014.6927487.
[115] S. R. Chalamalasetti, K. Lim, M. Wright, A. AuYoung, P. Ranganathan, and M.
Margala, “An FPGA Memcached Appliance,” in Proceedings of the ACM/SIGDA
International Symposium on Field Programmable Gate Arrays, New York, NY,
USA: Association for Computing Machinery, 2013, pp. 245–254, isbn: 978-1-4503-1887-7. doi: 10.1145/2435264.2435306.

[116] Z. István, G. Alonso, M. Blott, and K. Vissers, “A flexible hash table design for
10Gbps key-value stores on FPGAs,” in 2013 23rd International Conference on
Field Programmable Logic and Applications, 2013, pp. 1–8.
[117] K. Lim, D. Meisner, A. G. Saidi, P. Ranganathan, and T. F. Wenisch, “Thin
servers with smart pipes: Designing SoC accelerators for memcached,” in Proceedings of the 40th Annual International Symposium on Computer Architecture,
Tel-Aviv, Israel: Association for Computing Machinery, Jun. 2013, pp. 36–47,
isbn: 978-1-4503-2079-5. doi: 10.1145/2485922.2485926.
[118] A. Kaufmann, S. Peter, N. K. Sharma, T. Anderson, and A. Krishnamurthy, “High
Performance Packet Processing with FlexNIC,” in Proceedings of the Twenty-First
International Conference on Architectural Support for Programming Languages
and Operating Systems, New York, NY, USA: ACM, 2016, pp. 67–81, isbn: 978-1-4503-4091-5. doi: 10.1145/2872362.2872367.
[119] M. Lavasani, H. Angepat, and D. Chiou, “An FPGA-based In-Line Accelerator
for Memcached,” IEEE Computer Architecture Letters, vol. 13, no. 2, pp. 57–60,
Jul. 2014, issn: 1556-6064. doi: 10.1109/L-CA.2013.17.
[120] B. Li, Z. Ruan, W. Xiao, Y. Lu, Y. Xiong, A. Putnam, E. Chen, and L. Zhang,
“KV-Direct: High-Performance In-Memory Key-Value Store with Programmable
NIC,” in Proceedings of the 26th Symposium on Operating Systems Principles,
New York, NY, USA: ACM, 2017, pp. 137–152, isbn: 978-1-4503-5085-3. doi:
10.1145/3132747.3132756.
[121] M. Liu, T. Cui, H. Schuh, A. Krishnamurthy, S. Peter, and K. Gupta, “Offloading Distributed Applications onto SmartNICs Using IPipe,” in Proceedings of the
ACM Special Interest Group on Data Communication, New York, NY, USA: Association for Computing Machinery, 2019, pp. 318–333, isbn: 978-1-4503-5956-6.
doi: 10.1145/3341302.3342079.

[122] S. Choi, S. J. Park, M. Shahbaz, B. Prabhakar, and M. Rosenblum, “Toward
Scalable Replication Systems with Predictable Tails Using Programmable Data
Planes,” in Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019,
New York, NY, USA: Association for Computing Machinery, 2019, pp. 78–84,
isbn: 978-1-4503-7635-8. doi: 10.1145/3343180.3343181.
[123] D. B. Larkins and J. Dinan, “Extending a Message Passing Runtime to Support
Partitioned, Global Logical Address Spaces,” in Proceedings of the First Workshop
on Optimization of Communication in HPC, Piscataway, NJ, USA: IEEE Press,
2016, pp. 11–16, isbn: 978-1-5090-3829-9. doi: 10.1109/COM-HPC.2016.7.
[124] D. B. Larkins, J. Snyder, and J. Dinan, “Efficient Runtime Support for a Partitioned Global Logical Address Space,” in ICPP 2018: 47th International Conference on Parallel Processing, Eugene, Oregon: ACM, 2018.
[125] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek,
M. Besta, L. Benini, D. Roweth, and T. Hoefler, “Network-Accelerated Non-Contiguous Memory Transfers,” in Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis, New York,
NY, USA: Association for Computing Machinery, 2019, isbn: 978-1-4503-6229-0.
doi: 10.1145/3295500.3356189.
[126] R. Glebke, J. Krude, I. Kunze, J. Rüth, F. Senger, and K. Wehrle, “Towards Executing Computer Vision Functionality on Programmable Network Devices,” in
Proceedings of the 1st ACM CoNEXT Workshop on Emerging In-Network Computing Paradigms, New York, NY, USA: Association for Computing Machinery,
2019, pp. 15–20, isbn: 978-1-4503-7000-4. doi: 10.1145/3359993.3366646.
[127] D. Sanvito, G. Siracusano, and R. Bifulco, “Can the Network Be the AI Accelerator?” In Proceedings of the 2018 Morning Workshop on In-Network Computing,
New York, NY, USA: Association for Computing Machinery, 2018, pp. 20–25,
isbn: 978-1-4503-5908-5. doi: 10.1145/3229591.3229594.
[128] S. Bhattacharyya, D. Katramatos, and S. Yoo, “Why wait? Let us start computing
while the data is still on the wire,” Future Generation Computer Systems, vol. 89,
pp. 563–574, Dec. 2018, issn: 0167-739X. doi: 10.1016/j.future.2018.07.024.
[129] M. Liu, S. Peter, A. Krishnamurthy, and P. M. Phothilimthana, “E3: Energy-efficient microservices on SmartNIC-accelerated servers,” in Proceedings of the
2019 USENIX Conference on Usenix Annual Technical Conference, Renton, WA,
USA: USENIX Association, Jul. 2019, pp. 363–378, isbn: 978-1-939133-03-8.
[130] A. Janian, A. Hasan, U. Sheth, and A. Barrett, “IXP2400 network processor package development,” in 2004 Proceedings. 54th Electronic Components and Technology Conference (IEEE Cat. No.04CH37546), vol. 1, Jun. 2004, pp. 306–313.
doi: 10.1109/ECTC.2004.1319356.
[131] Xilinx. (2018). Versal: The first adaptive compute acceleration platform (ACAP),
[Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf.
[132] W. Schonbein, R. E. Grant, M. G. F. Dosanjh, and D. Arnold, “INCA: In-network
compute assistance,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado: Association for Computing Machinery, Nov. 2019, pp. 1–13, isbn: 978-1-4503-6229-0.
doi: 10.1145/3295500.3356153.
[133] W. Schonbein, M. G. F. Dosanjh, R. E. Grant, and P. G. Bridges, “Measuring
Multithreaded Message Matching Misery,” in Euro-Par 2018: Parallel Processing, M. Aldinucci, L. Padovani, and M. Torquati, Eds., Springer International
Publishing, 2018, pp. 480–491, isbn: 978-3-319-96983-1.

[134] S. Levy, K. B. Ferreira, W. Schonbein, R. E. Grant, and M. G. F. Dosanjh, “Using
simulation to examine the effect of MPI message matching costs on application
performance,” Parallel Computing, vol. 84, pp. 63–74, May 2019, issn: 0167-8191.
doi: 10.1016/j.parco.2019.02.008.
[135] W. Schonbein, S. Levy, W. P. Marts, M. G. Dosanjh, and R. E. Grant, “Low-Cost MPI Multithreaded Message Matching Benchmarking,” in High Performance
Computing and Communications (HPCC 2020), 2020.
[136] D. E. Bernholdt, S. Boehm, G. Bosilca, M. G. Venkata, R. E. Grant, T. Naughton,
H. P. Pritchard, M. Schulz, and G. R. Vallee, “A survey of MPI usage in the
US exascale computing project,” Concurrency and Computation: Practice and
Experience, vol. 0, no. 0, e4851, 2018, issn: 1532-0634. doi: 10.1002/cpe.4851.
[137] Sandia National Laboratories. (2020). Sandia MPI Micro-Benchmark Suite
(SMB), [Online]. Available: http://www.cs.sandia.gov/smb/index.html.
[138] MPICH. (2020). MPICH: High-Performance Portable MPI, [Online]. Available:
https://www.mpich.org/.
[139] Open MPI. (2020). Open MPI: Open Source High Performance Computing, [Online]. Available: https://www.open-mpi.org/.
[140] K. D. Underwood and R. Brightwell, “The impact of MPI queue usage on message
latency,” in International Conference on Parallel Processing, 2004. ICPP 2004.,
vol. 1, Aug. 2004, pp. 152–160. doi: 10.1109/ICPP.2004.1327915.
[141] P. Balaji, A. Chan, W. Gropp, R. Thakur, and E. Lusk, “The Importance of
Non-Data-Communication Overheads in MPI,” The International Journal of High
Performance Computing Applications, vol. 24, no. 1, pp. 5–15, Feb. 2010, issn:
1094-3420. doi: 10.1177/1094342009359528.

[142] B. W. Barrett, R. Brightwell, R. Grant, S. D. Hammond, and K. S. Hemmert,
“An evaluation of MPI message rate on hybrid-core processors,” International
Journal of High Performance Computing Applications, vol. 28, no. 4, Nov. 2014,
issn: 1094-3420. doi: 10.1177/1094342014552085.
[143] R. Brightwell, S. Goudy, and K. Underwood, “A preliminary analysis of the MPI
queue characterisitics of several applications,” in 2005 International Conference
on Parallel Processing (ICPP’05), Jun. 2005, pp. 175–183. doi: 10.1109/ICPP.2005.13.
[144] M. G. F. Dosanjh, S. M. Ghazimirsaeed, R. E. Grant, W. Schonbein, M. J. Levenhagen, P. G. Bridges, and A. Afsahi, “The Case for Semi-Permanent Cache
Occupancy: Understanding the Impact of Data Locality on Network Processing,” in Proceedings of the 47th International Conference on Parallel Processing, New York, NY, USA: ACM, 2018, 73:1–73:11, isbn: 978-1-4503-6510-9. doi:
10.1145/3225058.3225130.
[145] K. B. Ferreira, S. Levy, K. Pedretti, and R. E. Grant, “Characterizing MPI Matching via Trace-based Simulation,” in Proceedings of the 24th European MPI Users’
Group Meeting, New York, NY, USA: ACM, 2017, 8:1–8:11, isbn: 978-1-4503-4849-2. doi: 10.1145/3127024.3127040.
[146] P. G. Bridges, M. G. F. Dosanjh, R. Grant, A. Skjellum, S. Farmer, and R.
Brightwell, “Preparing for Exascale: Modeling MPI for Many-Core Systems Using
Fine-Grain Queues,” in Proceedings of the 3rd Workshop on Exascale MPI, New
York, NY, USA: Association for Computing Machinery, 2015, isbn: 978-1-4503-3998-8. doi: 10.1145/2831129.2831134.
[147] K. Vaidyanathan, D. D. Kalamkar, K. Pamnany, J. R. Hammond, P. Balaji,
D. Das, J. Park, and B. Joó, “Improving concurrency and asynchrony in multithreaded MPI applications using software offloading,” ACM Press, 2015, pp. 1–12,
isbn: 978-1-4503-3723-6. doi: 10.1145/2807591.2807602.
[148] M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda, “Adaptive and
Dynamic Design for MPI Tag Matching,” in 2016 IEEE International Conference
on Cluster Computing (CLUSTER), Sep. 2016, pp. 1–10. doi: 10.1109/CLUSTER.2016.69.
[149] H.-V. Dang, M. Snir, and W. Gropp, “Eliminating Contention Bottlenecks in
Multithreaded MPI,” Parallel Comput., vol. 69, no. C, pp. 1–23, Nov. 2017, issn:
0167-8191. doi: 10.1016/j.parco.2017.08.003.
[150] M. Flajslik, J. Dinan, and K. D. Underwood, “Mitigating MPI Message Matching
Misery,” in High Performance Computing, Springer, Cham, Jun. 2016, pp. 281–299.
doi: 10.1007/978-3-319-41321-1_15.
[151] S. M. Ghazimirsaeed, R. E. Grant, and A. Afsahi, “A Dedicated Message Matching Mechanism for Collective Communications,” in Proceedings of the 47th International Conference on Parallel Processing Companion, New York, NY, USA:
Association for Computing Machinery, 2018, isbn: 978-1-4503-6523-9. doi: 10.1145/3229710.3229712.
[152] J. A. Zounmevo and A. Afsahi, “A fast and resource-conscious MPI message queue
mechanism for large-scale jobs,” Future Generation Computer Systems, vol. 30,
pp. 265–290, Jan. 2014, issn: 0167-739X. doi: 10.1016/j.future.2013.07.003.
[153] M. J. Koop, J. K. Sridhar, and D. K. Panda, “TupleQ: Fully-asynchronous and
zero-copy MPI over InfiniBand,” in 2009 IEEE International Symposium on Parallel Distributed Processing, May 2009, pp. 1–8. doi: 10.1109/IPDPS.2009.5161056.
[154] M. G. F. Dosanjh, W. Schonbein, R. E. Grant, P. G. Bridges, S. M. Ghazimirsaeed, and A. Afsahi, “Fuzzy Matching: Hardware Accelerated MPI Communication Middleware,” in 2019 19th IEEE/ACM International Symposium on Cluster,
Cloud and Grid Computing (CCGRID), May 2019, pp. 210–220. doi: 10.1109/CCGRID.2019.00035.
[155] M. G. Dosanjh, R. E. Grant, W. Schonbein, and P. G. Bridges, “Tail queues: A
multi-threaded matching architecture,” Concurrency and Computation: Practice
and Experience, vol. 0, no. 0, e5158, 2019. doi: 10.1002/cpe.5158.
[156] K. Ferreira, R. E. Grant, M. J. Levenhagen, S. Levy, and T. Groves, “Hardware
MPI message matching: Insights into MPI matching behavior to inform design,”
Concurrency and Computation: Practice and Experience, vol. 0, no. 0, e5150, 2019,
issn: 1532-0634. doi: 10.1002/cpe.5150.
[157] T. Patinyasakdikul, D. Eberius, G. Bosilca, and N. Hjelm, “Give MPI Threading a
Fair Chance: A Study of Multithreaded MPI Designs,” in 2019 IEEE International
Conference on Cluster Computing (CLUSTER), 2019, pp. 1–11. doi: 10.1109/CLUSTER.2019.8891015.
[158] M. Si, A. J. Peña, J. Hammond, P. Balaji, M. Takagi, and Y. Ishikawa, “Casper:
An Asynchronous Progress Model for MPI RMA on Many-Core Architectures,”
in 2015 IEEE International Parallel and Distributed Processing Symposium, 2015,
pp. 665–676.
[159] S. Sridharan, J. Dinan, and D. D. Kalamkar, “Enabling Efficient Multithreaded
MPI Communication through a Library-Based Implementation of MPI Endpoints,” in SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2014, pp. 487–498.
[160] J. Dinan, R. E. Grant, P. Balaji, D. Goodell, D. Miller, M. Snir, and R. Thakur,
“Enabling Communication Concurrency through Flexible MPI Endpoints,” International Journal of High Performance Computing Applications, vol. 28, no. 4,
pp. 390–405, 2014, issn: 1094-3420. doi: 10.1177/1094342014548772.
[161] R. E. Grant, M. G. F. Dosanjh, M. J. Levenhagen, R. Brightwell, and A. Skjellum, “Finepoints: Partitioned Multithreaded MPI Communication,” in High Performance Computing, M. Weiland, G. Juckeland, C. Trinitis, and P. Sadayappan,
Eds., Springer International Publishing, 2019, pp. 330–350, isbn: 978-3-030-20656-7.
[162] M. G. F. Dosanjh, T. Groves, R. E. Grant, R. Brightwell, and P. G. Bridges,
“RMA-MT: A Benchmark Suite for Assessing MPI Multi-threaded RMA Performance,” in 2016 16th IEEE/ACM International Symposium on Cluster, Cloud
and Grid Computing (CCGrid), May 2016, pp. 550–559. doi: 10.1109/CCGrid.2016.84.
[163] N. Hjelm, M. G. F. Dosanjh, R. E. Grant, T. Groves, P. Bridges, and D. Arnold,
“Improving MPI Multi-Threaded RMA Communication Performance,” in Proceedings of the 47th International Conference on Parallel Processing, New York,
NY, USA: Association for Computing Machinery, 2018, isbn: 978-1-4503-6510-9.
doi: 10.1145/3225058.3225114.
[164] T. Patinyasakdikul, X. Luo, D. Eberius, and G. Bosilca, “Multirate: A Flexible
MPI Benchmark for Fast Assessment of Multithreaded Communication Performance,” in 2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI), 2019, pp. 1–11.
[165] R. Thakur and W. Gropp, “Test Suite for Evaluating Performance of MPI Implementations That Support MPI_THREAD_MULTIPLE,” in Proceedings of the 14th
European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Berlin, Heidelberg: Springer-Verlag, 2007, pp. 46–55, isbn:
3-540-75415-6.
[166] M. Bianco, “An Interface for Halo Exchange Pattern,” Swiss National Supercomputing Centre, Tech. Rep., 2013.

[167] P. G. Raponi, F. Petrini, R. Walkup, and F. Checconi, “Characterization of the
Communication Patterns of Scientific Applications on Blue Gene/P,” in 2011
IEEE International Symposium on Parallel and Distributed Processing Workshops
and Phd Forum, 2011, pp. 1017–1024.
[168] W. P. Marts, M. G. F. Dosanjh, W. Schonbein, R. E. Grant, and P. G. Bridges,
“MPI tag matching performance on ConnectX and ARM,” in Proceedings of the
26th European MPI Users’ Group Meeting, Zürich, Switzerland: Association for
Computing Machinery, Sep. 2019, pp. 1–10, isbn: 978-1-4503-7175-9. doi: 10.1145/3343211.3343224.
[169] OpenUCX. (2020). UCX: Unified Communication X, [Online]. Available: http://www.openucx.org/.
[170] V. G. Vergara Larrea, M. J. Brim, W. Joubert, S. Boehm, M. Baker, O. Hernandez, S. Oral, J. Simmons, and D. Maxwell, “Are we witnessing the spectre
of an HPC meltdown?” Concurrency and Computation: Practice and Experience,
vol. 31, no. 16, e5020, 2019. doi: 10.1002/cpe.5020.
[171] S. Plimpton, “Fast parallel algorithms for short-range molecular dynamics,” Journal of computational physics, vol. 117, no. 1, pp. 1–19, 1995.
[172] OpenFabrics Alliance. (2020). OpenFabrics Alliance, [Online]. Available: https://www.openfabrics.org/ (visited on 10/15/2020).
[173] J. C. Shepherdson and H. E. Sturgis, “Computability of Recursive Functions,”
J. ACM, vol. 10, no. 2, pp. 217–255, Apr. 1963, issn: 0004-5411. doi: 10.1145/321160.321170.
[174] Portals. (2019). Portals4/portals4, [Online]. Available: https://github.com/Portals4/portals4 (visited on 10/17/2020).
[175] Lark. (2020). The Lark Parser, [Online]. Available: https://github.com/lark-parser/lark (visited on 10/20/2020).
[176] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, “LogGP: Incorporating long messages into the LogP model—one step closer towards a realistic
model for parallel computation,” ACM Press, 1995, pp. 95–105, isbn: 978-0-89791-717-9. doi: 10.1145/215399.215427.
[177] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, “LogP: Towards a Realistic Model of Parallel Computation,” in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, New York, NY, USA: ACM, 1993, pp. 1–12,
isbn: 978-0-89791-589-2. doi: 10.1145/155332.155333.
[178] K. P. Srinivasan, “Creating a PCI Express interconnect in the Gem5 simulator,”
PhD thesis, University of Illinois at Urbana-Champaign, 2018.
[179] ECP Project. (2019). ECP Proxy Applications, [Online]. Available: https://proxyapps.exascaleproject.org/.
[180] M. Heroux and R. Barrett. (2019). Mantevo project, [Online]. Available: https://mantevo.github.io/.
[181] J. Detrey and F. de Dinechin, “Parameterized Floating-Point Logarithm and
Exponential Functions for FPGAs,” Microprocessors and Microsystems, vol. 31,
no. 8, pp. 537–545, Dec. 2007, issn: 0141-9331. doi: 10.1016/j.micpro.2006.02.008.
[182] R. Pottathuparambil and R. Sass, “Implementation of a CORDIC-based Double-Precision Exponential Core on an FPGA,” in Proceedings of RSSI 2008, Urbana,
Illinois, 2008.
[183] M. Wielgosz, E. Jamro, and K. Wiatr, “Highly Efficient Structure of 64-Bit Exponential Function Implemented in FPGAs,” in Reconfigurable Computing: Architectures, Tools and Applications, R. Woods, K. Compton, C. Bouganis, and
P. C. Diniz, Eds., Berlin, Heidelberg: Springer, 2008, pp. 274–279, isbn: 978-3-540-78610-8. doi: 10.1007/978-3-540-78610-8_28.
[184] W. Yuan and Z. Xu, “FPGA based implementation of low-latency floating-point
exponential function,” in IET International Conference on Smart and Sustainable
City 2013 (ICSSC 2013), 2013, pp. 226–229.
[185] K. Mohammad and S. Agaian, “Efficient FPGA implementation of convolution,”
in 2009 IEEE International Conference on Systems, Man and Cybernetics, 2009,
pp. 3478–3483.
[186] H. Ström, “A Parallel FPGA Implementation of Image Convolution,” PhD thesis,
Linkoping University, 2016.
[187] T. Groves, R. E. Grant, and D. Arnold, “NiMC: Characterizing and Eliminating
Network-Induced Memory Contention,” in 2016 IEEE International Parallel and
Distributed Processing Symposium (IPDPS), May 2016, pp. 253–262. doi: 10.1109/IPDPS.2016.29.
[188] A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs, “There goes the neighborhood: Performance degradation due to nearby jobs,” ACM Press, 2013, pp. 1–12,
isbn: 978-1-4503-2378-9. doi: 10.1145/2503210.2503247.
[189] A. Bhatele, A. R. Titus, J. J. Thiagarajan, N. Jain, T. Gamblin, P. Bremer, M.
Schulz, and L. V. Kale, “Identifying the Culprits Behind Network Congestion,”
in 2015 IEEE International Parallel and Distributed Processing Symposium, May
2015, pp. 113–122. doi: 10.1109/IPDPS.2015.92.
[190] J. Dongarra, M. A. Heroux, and P. Luszczek, “High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems,” The International Journal of High Performance Computing Applications,
vol. 30, no. 1, pp. 3–10, 2016. doi: 10.1177/1094342015593158.

[191] LAMMPS. (2020). LAMMPS, [Online]. Available: https://lammps.sandia.gov/.
[192] I. Karlin, J. Keasler, and R. Neely, “LULESH 2.0 Updates and Changes,” Livermore, CA, Tech. Rep. LLNL-TR-641973, Aug. 2013, pp. 1–9.
[193] NERSC. (2020). Running the NERSC MILC Benchmarks, [Online]. Available:
https://github.com/lattice/quda.
[194] P. S. Crozier, H. K. Thornquist, R. W. Numrich, A. B. Williams, H. C. Edwards,
E. R. Keiter, M. Rajan, J. M. Willenbring, D. W. Doerfler, and M. A. Heroux,
“Improving performance via mini-applications,” Tech. Rep., 2009, SAND2009-5574. doi: 10.2172/993908.
[195] T. L. Groves, R. E. Grant, A. Gonzales, and D. Arnold, “Unraveling Network-Induced Memory Contention: Deeper Insights with Machine Learning,” IEEE
Transactions on Parallel and Distributed Systems, vol. 29, no. 8, pp. 1907–1922,
Aug. 2018, issn: 2161-9883. doi: 10.1109/TPDS.2017.2773483.
[196] N. Jain, A. Bhatele, M. P. Robson, T. Gamblin, and L. V. Kale, “Predicting Application Performance Using Supervised Learning on Communication Features,”
in Proceedings of the International Conference on High Performance Computing,
Networking, Storage and Analysis, New York, NY, USA: ACM, 2013, 95:1–95:12,
isbn: 978-1-4503-2378-9. doi: 10.1145/2503210.2503263.
[197] V. Jyothi, X. Wang, S. K. Addepalli, and R. Karri, “BRAIN: BehavioR Based
Adaptive Intrusion Detection in Networks: Using Hardware Performance Counters
to Detect DDoS Attacks,” in 2016 29th International Conference on VLSI Design
and 2016 15th International Conference on Embedded Systems (VLSID), 2016,
pp. 587–588. doi: 10.1109/VLSID.2016.115.
[198] C. Li and J.-L. Gaudiot, “Detecting Malicious Attacks Exploiting Hardware Vulnerabilities Using Performance Counters,” in 2019 IEEE 43rd Annual Computer
Software and Applications Conference (COMPSAC), vol. 1, 2019, pp. 588–597.
doi: 10.1109/COMPSAC.2019.00090.
[199] S. Krishnaswamy, S. W. Loke, and A. Zaslavsky, “Estimating computation times
of data-intensive applications,” IEEE Distributed Systems Online, vol. 5, no. 4,
2004. doi: 10.1109/MDSO.2004.1301253.
[200] E. R. Rodrigues, R. L. F. Cunha, M. A. S. Netto, and M. Spriggs, “Helping HPC
Users Specify Job Memory Requirements via Machine Learning,” in 2016 Third
International Workshop on HPC User Support Tools (HUST), 2016, pp. 6–13.
doi: 10.1109/HUST.2016.006.
[201] M. R. Wyatt II, S. Herbein, T. Gamblin, A. Moody, D. H. Ahn, and M. Taufer,
“PRIONN: Predicting Runtime and IO Using Neural Networks,” in Proceedings of the 47th International Conference on Parallel Processing, New York, NY,
USA: ACM, 2018, 46:1–46:12, isbn: 978-1-4503-6510-9. doi: 10.1145/3225058.3225091.
[202] B. Dickov, M. Pericàs, P. Carpenter, N. Navarro, and E. Ayguadé, “Software-Managed Power Reduction in Infiniband Links,” in 2014 43rd International Conference on Parallel Processing, Sep. 2014, pp. 311–320. doi: 10.1109/ICPP.2014.40.
[203] M. Kiran and A. Chhabra, “Understanding flows in high-speed scientific networks:
A Netflow data study,” Future Generation Computer Systems, vol. 94, pp. 72–79,
May 2019, issn: 0167-739X. doi: 10.1016/j.future.2018.11.006.
[204] R. E. Grant, W. Schonbein, and S. Levy, “RaDD runtimes: Radical and different
distributed runtimes with SmartNICs,” in Fourth Annual Workshop on Emerging
Parallel and Distributed Runtime Systems and Middleware (IPDRM 2020), 2020.
[205] J. L. Elman, “Distributed representations, simple recurrent networks, and grammatical structure,” Machine Learning, vol. 7, pp. 195–225, 1991.