Packet Transactions: High-level Programming for Line-Rate Switches by Sivaraman, Anirudh et al.
Packet Transactions: High-level Programming for
Line-Rate Switches
Anirudh Sivaraman*, Mihai Budiu†, Alvin Cheung‡, Changhoon Kim†, Steve Licking†,
George Varghese++, Hari Balakrishnan*, Mohammad Alizadeh*, Nick McKeown+
*MIT CSAIL, †Barefoot Networks, ‡University of Washington, ++Microsoft Research, +Stanford University
ABSTRACT
Many algorithms for congestion control, scheduling, net-
work measurement, active queue management, security, and
load balancing require custom processing of packets as they
traverse the data plane of a network switch. To run at line
rate, these data-plane algorithms must be in hardware. With
today’s switch hardware, algorithms cannot be changed, nor
new algorithms installed, after a switch has been built.
This paper shows how to program data-plane algo-
rithms in a high-level language and compile those programs
into low-level microcode that can run on emerging pro-
grammable line-rate switching chipsets. The key challenge
is that these algorithms create and modify algorithmic state.
The key idea to achieve line-rate programmability for state-
ful algorithms is the notion of a packet transaction: a se-
quential code block that is atomic and isolated from other
such code blocks. We have developed this idea in Domino, a
C-like imperative language to express data-plane algorithms.
We show with many examples that Domino provides a con-
venient and natural way to express sophisticated data-plane
algorithms, and show that these algorithms can be run at line
rate with modest estimated die-area overhead.
1. INTRODUCTION
Network switches and routers in modern datacenters, en-
terprises, and service-provider networks perform many tasks
in addition to standard packet forwarding. The set of re-
quirements for routers has only increased with time as net-
work operators seek greater control over performance and
security. Performance and security may be improved using
both data-plane and control-plane mechanisms. This paper
focuses on data-plane algorithms. These algorithms process
and transform packets, creating and maintaining state in the
switch. Examples include active queue management [53, 52,
67, 74, 76], scheduling [88], congestion control with switch
feedback [64, 93, 60, 28], network measurement [98, 51,
50], security [34], and traffic load balancing [27].
An important requirement for data-plane algorithms is the
ability to process packets at the switch’s line rate (typically
10–100 Gbit/s on 10–100 ports). As a result, these algo-
rithms are typically implemented using dedicated hardware.
Hardware designs are rigid, however, and not reconfigurable
in the field. Thus, to implement and deploy a new algorithm
today, or to even modify a deployed one, the user must invest
in new hardware—a time-consuming and expensive propo-
sition.
This rigidity affects many stakeholders adversely: ven-
dors [7, 9, 4] building network switches with merchant-
silicon chips [13, 14, 21], network operators deploying
switches [85, 80, 58], and researchers developing new
switch algorithms [64, 74, 96, 99, 60].
To run data-plane algorithms after a switch has been built,
researchers and companies have attempted to build pro-
grammable routers for many years, starting from efforts on
active networks [94] to network processors [82] to software
routers [66, 11, 46]. All these efforts sacrificed performance
for programmability, typically running an order of magni-
tude (or worse) slower than hardware line rates. Unfortu-
nately, this reduction in performance has meant that these
systems are rarely deployed in production networks, if at all.
Programmable switching chips [17, 26, 36, 1, 22,
15] competitive in performance with state-of-the-art fixed-
function chipsets [13, 14, 21] are now becoming available.
These chips implement a few low-level hardware primi-
tives that can be configured by software into a processing
pipeline, and are field-reconfigurable [6, 5, 91]. Building
a switch with such a chip is attractive because it does not
compromise on data rates [36].
In terms of programmability, these chips today allow the
network operator to specify packet parsing and forwarding
without restricting the set of protocol formats or the set of
actions that can be executed when matching packet headers
in a match-action table. Languages such as P4 are emerg-
ing as a way to express such match-action processing in a
hardware-independent way [35, 23, 87].
There is a gap between this form of programmability and
the needs of data-plane algorithms. In contrast to packet
header parsing and forwarding, which don’t modify state in
the data plane, many data-plane algorithms create and mod-
ify algorithmic state in the switch as part of packet process-
ing. For such algorithms, it is important for programmabil-
ity to directly capture the algorithm’s intent without requir-
ing it to be “shoehorned” into hardware constructs such as
a sequence of match-action tables. Indeed, this is how such
data-plane algorithms are expressed in pseudocode [53, 92,
3, 67, 52], and implemented in software routers [66, 11, 46],
network processors [47, 56], and network endpoints [8].
arXiv:1512.05023v2 [cs.NI] 30 Jan 2016
By studying the requirements of data-plane algorithms
and the constraints of line-rate hardware, we introduce a
new abstraction to program and implement data-plane algo-
rithms: a packet transaction (§3). A packet transaction is a
sequential code block that is atomic and isolated from other
such code blocks (i.e., any visible state is equivalent to a se-
rial execution of packet transactions across packets). Packet
transactions allow the programmer to focus on the opera-
tions needed for each packet without worrying about other
concurrent packets.
We have designed and implemented Domino, a new
domain-specific language (DSL) for data-plane algorithms,
with packet transactions at its core. Domino is an imperative
language with C-like syntax, perhaps the first to offer such a
high-level programming abstraction for line-rate switches.
This paper makes three further contributions. First, Ban-
zai, a machine model for line-rate programmable switches
(§2). Banzai generalizes and abstracts essential features of
line-rate programmable switches [36, 26, 17]. Banzai also
models practical constraints limiting stateful operations at
line rate. Informed by these constraints, we introduce the
concept of atoms to represent a programmable switch’s in-
struction set.
Second, a compiler from Domino packet transactions to a
Banzai target (§4). The Domino compiler introduces all-or-nothing
compilation: every packet transaction is either accepted
by the compiler and guaranteed to run at line rate, or rejected outright.
There is no “slippery slope” of running network algorithms
at lower speeds as with traditional network processors or
software routers: when compiled, a Domino program runs
at the line rate, or not at all. Performance is not just pre-
dictable, but is guaranteed.
Third, an evaluation of Domino (§5). We evaluate
Domino’s expressiveness by programming a variety of data-
plane algorithms (Table 4) in Domino and compare with P4.
We find that Domino provides a more concise and easier pro-
gramming model for stateful data-plane algorithms. Next,
because no existing programmable switch supports the set
of atoms required for our data-plane algorithms, we design
a set of compiler targets (§5.2) based on Banzai and show
that these are feasible in a 32 nm standard-cell library with
< 15% estimated chip area overhead. Finally, we compile
data-plane algorithms written in Domino to these targets to
show how the choice of atoms in a target determines which
algorithms it can support.
2. A MACHINE MODEL FOR LINE-RATE
SWITCHES
Banzai is a machine model for programmable line-rate
switches that serves as the compiler target for Domino pro-
grams. Banzai’s design is inspired by recent programmable
switch architectures such as RMT [36], Intel’s FlexPipe [17],
and Cavium’s XPliant Packet Architecture [26]. Banzai ab-
stracts these architectures and extends them with stateful
processing units to implement data-plane algorithms. These
processing units, called atoms, precisely model the set of op-
erations that a hardware target can execute at line rate; they
function as the target instruction set for the Domino com-
piler.
2.1 Background: Programmable switches
Packets arriving at a programmable switch (Figure 1) are
parsed by a programmable parser that turns packets into
header fields. These header fields are first processed by
an ingress pipeline consisting of match-action tables ar-
ranged in stages. Processing a packet at a stage may mod-
ify its header fields as well as some persistent state at that
stage. Each stage has access only to its own local state.
To share state between stages, it must be carried forward in
packet headers. Following the ingress pipeline, the packet is
queued. Once the packet is dequeued by the switch sched-
uler, it is processed by a similar egress pipeline before being
transmitted.
To reduce chip area, the ingress and egress pipelines are
shared across switch ports. Each pipeline handles aggregate
traffic belonging to all ports on the switch, at all packet sizes.
For instance, a 64-port switch with a line rate of 10 Gbit/s
per port and a minimum packet size of 64 bytes needs to
process around a billion packets per second [36]. Equiva-
lently, with a clock frequency of 1 GHz, each pipeline stage
needs to process one packet every clock cycle (1 ns). The
need to handle one packet per clock cycle is typical because
switches are designed for the highest port count and line rate
for a given chip area. We assume one packet per clock cycle
throughout the paper.1
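The packet-rate figure in the paragraph above can be checked with a short calculation. The following is only an illustrative sketch: the function name is hypothetical, and the 20 bytes of per-packet framing overhead (preamble, start-of-frame delimiter, and inter-frame gap) are standard Ethernet values that the text does not spell out.

```python
def packets_per_second(ports, gbit_per_port, packet_bytes, overhead_bytes=0):
    """Aggregate packet rate when every port carries back-to-back packets."""
    bits_per_packet = (packet_bytes + overhead_bytes) * 8
    return ports * gbit_per_port * 1e9 / bits_per_packet

# 64 ports at 10 Gbit/s with minimum-size 64-byte packets.
raw = packets_per_second(64, 10, 64)                         # ignores framing
on_wire = packets_per_second(64, 10, 64, overhead_bytes=20)  # assumed Ethernet framing

print(f"{raw / 1e9:.2f} billion packets/s without framing overhead")
print(f"{on_wire / 1e9:.2f} billion packets/s with framing overhead")
```

Either way the result is on the order of one packet per nanosecond, which is why a pipeline clocked at 1 GHz has to accept a new packet on every clock cycle.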
Having to process a packet every clock cycle in each stage
greatly constrains the operations that can be performed on
each packet. In particular, any packet operation that modi-
fies state visible to the next packet must finish execution in
a single clock cycle (see §2.3 for details). Because of this
restriction, programmable switching chips provide a small
set of processing units or primitives for manipulating pack-
ets and state in a stage, unlike in software routers. These
processing units determine what algorithms can run on the
switch at line rate.
The challenge here is to determine primitives that allow
a broad range of data-plane algorithms to be implemented,
and build a compiler to map a user-friendly description of an
algorithm to the primitives provided by a switch.
2.2 The Banzai machine model
Banzai (the bottom half of Figure 1) models the data-plane
components of an ingress or egress switch pipeline, consist-
ing of a number of stages executing synchronously on every
clock cycle. Each stage processes one packet every clock
cycle (1 ns) and hands it off to the next, until it exits the
pipeline. Banzai models the computation within a match-action
table in a stage (i.e., the action half of the match-action
table), but not the match semantics (e.g., direct or ternary);
we discuss how to embed these computations in a standard
match-action pipeline in §3.3. Banzai does not model packet
parsing and assumes that packets arriving at it are already
parsed.
1 For concreteness, we assume a 1 GHz clock frequency.
[Figure 1, image not reproduced. Top panel, "The architecture of a programmable switch": a parser turns bits into headers, which pass through an ingress pipeline of match-action tables (a match followed by an action built from VLIW primitives), the queues, and an egress pipeline of match-action tables before transmission. Bottom panel, "The Banzai machine model": physical stages 1 to n, each holding atoms (state plus atom body) that operate on packet headers.]
Figure 1: The Banzai machine model and its relationship to programmable switch architectures.
2.3 Atoms: Banzai’s processing units
Each pipeline stage in Banzai contains a vector of atoms.
All atoms in the vector execute in parallel on every clock cy-
cle. Informally, an atom is an atomic unit of packet process-
ing supported natively by a Banzai machine. The atoms pro-
vided by a Banzai machine form its instruction set. Atoms
may modify persistent state stored on the switch. In contrast
to instruction sets for CPUs, GPUs, DSPs, and NPUs, the
atoms for a Banzai machine need to be substantially richer
to run real-world data-plane algorithms at line rate. We ex-
plain why with an example.
Suppose we need to atomically increment a state variable
stored on the switch to count packets. One approach would
be to have hardware support for three simple single-cycle
operations: read some memory in the first clock cycle, add
one in the next, and write it to memory in the third. This
approach, however, does not provide atomic isolation. To
see why, suppose packet A increments the counter from 0 to
1 by executing the read, add, and write operations at clock
cycles 1, 2, and 3 respectively. If packet B issues the read
at time 2, it will increment the counter again from 0 to 1,
when it should be 2. Locks over the shared counter are a po-
tential solution. However, locking causes packet B to wait
during packet A’s increment, and the switch no longer sus-
tains line rate of one packet every clock cycle.2 CPUs em-
ploy microarchitectural techniques such as operand forward-
ing to address this problem, but these techniques suffer from
occasional pipeline stalls, which militates against line-rate
performance.
The only way to provide an atomic increment is to ex-
plicitly support it in hardware with an atom to read memory,
increment it, and write it back in a single stage within one
clock cycle. The same observation applies to any other line-
rate atomic operation.
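The lost update described above can be reproduced with a small simulation. This is a toy model under stated assumptions, not a description of any real switch: one packet arrives per clock cycle, a write commits at the end of its cycle (so a read in the same cycle still sees the old value), and both function names are hypothetical.

```python
def three_cycle_increment(num_packets):
    """Read, add, and write take one clock cycle each; one packet arrives per cycle."""
    counter = 0
    pending_writes = {}  # clock cycle -> value committed at the end of that cycle
    for cycle in range(num_packets + 2):
        if cycle < num_packets:
            # The packet arriving now reads the counter; its write lands two cycles later.
            pending_writes[cycle + 2] = counter + 1
        if cycle in pending_writes:
            counter = pending_writes.pop(cycle)
    return counter


def single_cycle_atom(num_packets):
    """Read, increment, and write all complete before the next packet arrives."""
    counter = 0
    for _ in range(num_packets):
        counter += 1
    return counter


print(three_cycle_increment(2), single_cycle_atom(2))
```

With two back-to-back packets the three-cycle version ends at 1 while the single-cycle atom ends at 2, matching the packet A and packet B example in the text.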
This observation motivates why we represent an atom as
a body of sequential code. An atom completes execution of
the entire body of code and modifies a packet before process-
ing the next packet. An atom may also contain internal state
that is local to that atom alone and persists across packets.
An atom’s body of sequential code fully specifies the atom’s
behavior and serves as an interface between the compiler and
the programmable switch hardware.
2 Wait-free objects [59] are an alternative to locking, but are typically too complex for hardware.
Using this representation, a switch counter that wraps
around at a value of 100 can be written as the atom:3
    if (counter < 99)
        counter++;
    else
        counter = 0;
Similarly, a stateless operation like setting a packet field
(e.g. P4’s modify_field primitive [23]) can be written as
the atom:
p.field = value;
Table 3 provides more examples of atoms.
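One way to picture an atom is as a small callable that runs its whole body on one packet before it sees the next. The sketch below models the two atoms above in Python; the class and function names, and the use of a dictionary for the packet, are assumptions made for illustration.

```python
class WrapCounterAtom:
    """Stateful atom: `counter` is internal state that persists across packets."""

    def __init__(self):
        self.counter = 0

    def __call__(self, pkt):
        # The whole body runs to completion for one packet.
        if self.counter < 99:
            self.counter += 1
        else:
            self.counter = 0
        return pkt


def set_field_atom(field, value):
    """Stateless atom: writes a packet field and keeps no state of its own."""
    def atom(pkt):
        pkt[field] = value
        return pkt
    return atom


count = WrapCounterAtom()
for _ in range(100):
    count({})
print(count.counter)  # the counter has wrapped back around after 100 packets
```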
We note that—unlike stateful atomic operations such
as a counter—stateless atomic operations are easier to
support with basic packet-field arithmetic. Consider,
for instance, the operation pkt.f1 = pkt.f2 + pkt.f3 -
pkt.f4. This operation does not modify any persistent
switch state because it only reads and writes packet fields.
It can be implemented without violating atomicity by using
two atoms: one atom to add fields f2 and f3 in one pipeline
stage (clock cycle), and another to subtract f4 from the result
in the next—without having to provide one large atom that
supports the entire operation.
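The two-atom split can be sketched as a tiny pipeline model in which each stage holds at most one packet per clock cycle. The temporary field tmp and the helper run_pipeline are hypothetical names introduced only for this illustration.

```python
def add_atom(pkt):
    pkt["tmp"] = pkt["f2"] + pkt["f3"]   # stage 1
    return pkt


def sub_atom(pkt):
    pkt["f1"] = pkt["tmp"] - pkt["f4"]   # stage 2
    return pkt


def run_pipeline(stages, packets):
    """Advance every packet by one stage per clock cycle."""
    slots = [None] * len(stages)   # packet currently held by each stage
    feed, done = list(packets), []
    while feed or any(slot is not None for slot in slots):
        if slots[-1] is not None:
            done.append(slots[-1])             # packet leaves the pipeline
        for i in range(len(stages) - 1, 0, -1):
            prev = slots[i - 1]
            slots[i] = stages[i](prev) if prev is not None else None
        slots[0] = stages[0](feed.pop(0)) if feed else None
    return done


out = run_pipeline([add_atom, sub_atom],
                   [{"f2": 5, "f3": 4, "f4": 2}, {"f2": 1, "f3": 1, "f4": 1}])
print([p["f1"] for p in out])
```

The two packets are in flight at the same time, yet each gets the right result, because the intermediate value travels with the packet and no state is shared between packets.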
2.4 Constraining atoms
Computational limits: To provide line-rate performance,
atom bodies must finish execution within one clock cycle.
We constrain atom bodies by defining atom templates (§4.3).
An atom template is a program that always terminates and
specifies exactly how the atom is executed. One example
is an ALU with a restricted set of primitive operations to
choose from (Figure 2a). Atom templates allow us to cre-
ate Banzai machines with different atoms. In practice, atom
templates will be designed by an ASIC engineer and exposed
as a machine’s instruction set (§5.2). As programmable
switches evolve, we expect that atoms will evolve as well,
but constrained by the clock-cycle requirement (§5.4).
[Figure 2a, image not reproduced: an adder and a subtractor each combine the state variable x with constant; a 2-to-1 mux controlled by choice selects between the add result and the sub result.]
(a) Circuit for an atom that can add or subtract a constant from a state variable.
    bit choice = ??;
    int constant = ??;
    if (choice) {
        x = x + constant;
    } else {
        x = x - constant;
    }
(b) Circuit representation as an atom template.
Figure 2: Atoms and atom templates
Resource limits: For any real machine, we also need to
limit the number of atoms in each stage (pipeline width) and
the number of stages in the pipeline (pipeline depth). This is
similar to limits on the number of stages, number of tables
per stage, and amount of memory per stage in programmable
switch architectures such as RMT and FlexPipe [62].
3 We use p.x to represent field x within a packet p and x to represent a state variable x that persists across packets.
2.5 What can Banzai not do?
Like real programmable switches, Banzai is a good fit for
data-plane algorithms that modify a small set of packet head-
ers and carry out small amounts of stateful or stateless com-
putation per packet. Data-plane algorithms like deep packet
inspection and WAN optimization require a switch to parse
and process the packet payload as well—effectively pars-
ing a large “header” consisting of each byte in the payload,
which is challenging at line rates of 1 GHz. Such algorithms
are best left to general-purpose CPU platforms [75, 84, 55].
Some algorithms require complex computations, but not on
every packet. For example, consider a measurement algo-
rithm that periodically scans a large table to perform garbage
collection. Banzai’s atoms model small computations that
occur on every packet, and are not suitable for such opera-
tions that span many clock cycles.
3. PACKET TRANSACTIONS
To program a data-plane algorithm, a programmer would
write code in Domino using packet transactions (Figure 3a)
and then use the Domino compiler to compile to an atom
pipeline for a Banzai machine (Figure 3b). We first describe
packet transactions in greater detail by walking through an
example (§3.1). Next, we discuss constraints in Domino
(§3.2) informed by the domain of line-rate switches. We
then discuss triggering packet transactions (§3.3) and han-
dling multiple transactions (§3.4).
3.1 Domino by example
We use flowlet switching [86] as an example. Flowlet
switching is a load-balancing algorithm that sends bursts of
packets (called flowlets) from a TCP flow on different paths,
provided the bursts are separated by a large enough time in-
terval to ensure packets do not arrive out of order at a TCP
receiver. Figure 3a shows flowlet switching in Domino. For
simplicity, we hash only the source and destination ports; it
is easy to extend it to the full 5-tuple.
This example demonstrates the core language constructs
in Domino. All packet processing happens in the context of a
packet transaction (the function flowlet starting at line 17).
The function’s argument type Packet declares the fields in
a packet (lines 5–12)4 that can be referenced by the function
body (lines 18–32). The function body can also modify per-
sistent switch state using global variables (e.g. last_time
and saved_hop on lines 14 and 15, respectively).
Conceptually, the switch invokes the packet transaction
function one packet at a time, with no concurrent packet
processing. To the programmer, the function modifies the
passed-in packet argument and runs to completion before
processing the next packet. The function may invoke
intrinsics such as hash2 on line 23 to use hardware
accelerators such as hash generators. The Domino compiler uses
an intrinsic’s signature to infer dependencies and supplies a
canned run-time implementation, but otherwise does not
analyze an intrinsic’s internal behavior. When targeting a
Banzai machine, the compiler converts the code in Figure 3a
to the atom pipeline in Figure 3b.
4 We use fields to refer to both packet headers such as source port (sport) and destination port (dport) and packet metadata (id).
 1 #define NUM_FLOWLETS 8000
 2 #define THRESHOLD 5
 3 #define NUM_HOPS 10
 4
 5 struct Packet {
 6   int sport;
 7   int dport;
 8   int new_hop;
 9   int arrival;
10   int next_hop;
11   int id; // array index
12 };
13
14 int last_time[NUM_FLOWLETS] = {0};
15 int saved_hop[NUM_FLOWLETS] = {0};
16
17 void flowlet(struct Packet pkt) {
18   pkt.new_hop = hash3(pkt.sport,
19                       pkt.dport,
20                       pkt.arrival)
21                 % NUM_HOPS;
22
23   pkt.id = hash2(pkt.sport,
24                  pkt.dport)
25            % NUM_FLOWLETS;
26
27   if (pkt.arrival - last_time[pkt.id]
28       > THRESHOLD)
29   { saved_hop[pkt.id] = pkt.new_hop; }
30
31   last_time[pkt.id] = pkt.arrival;
32   pkt.next_hop = saved_hop[pkt.id];
33 }
(a) Flowlet switching written in Domino
Stage 1: pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
         pkt.new_hop = hash3(pkt.sport, pkt.dport, pkt.arrival) % NUM_HOPS;
Stage 2: pkt.last_time = last_time[pkt.id];                            (stateful)
         last_time[pkt.id] = pkt.arrival;
Stage 3: pkt.tmp = pkt.arrival - pkt.last_time;
Stage 4: pkt.tmp2 = pkt.tmp > 5;
Stage 5: pkt.saved_hop = saved_hop[pkt.id];                            (stateful)
         saved_hop[pkt.id] = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop;
Stage 6: pkt.next_hop = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop;
(b) 6-stage Banzai pipeline for flowlet switching. Control flows from top to bottom. Stateful atoms are in grey in the original figure and are marked (stateful) here.
Figure 3: Programming flowlet switching in Domino
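Because a packet transaction is just sequential code over one packet, the flowlet transaction of Figure 3a can be modelled directly in an ordinary language. The Python sketch below mirrors it line by line; the bodies of hash2 and hash3 are placeholder stand-ins chosen for this example (Domino treats them as opaque intrinsics), and a dictionary plays the role of the Packet struct.

```python
NUM_FLOWLETS = 8000
THRESHOLD = 5
NUM_HOPS = 10

# Persistent switch state (the global arrays of Figure 3a).
last_time = [0] * NUM_FLOWLETS
saved_hop = [0] * NUM_FLOWLETS


def hash2(a, b):
    return (a * 31 + b) & 0x7FFFFFFF             # placeholder hash, an assumption


def hash3(a, b, c):
    return ((a * 31 + b) * 31 + c) & 0x7FFFFFFF  # placeholder hash, an assumption


def flowlet(pkt):
    """Runs to completion on one packet before the next one is processed."""
    pkt["new_hop"] = hash3(pkt["sport"], pkt["dport"], pkt["arrival"]) % NUM_HOPS
    pkt["id"] = hash2(pkt["sport"], pkt["dport"]) % NUM_FLOWLETS
    if pkt["arrival"] - last_time[pkt["id"]] > THRESHOLD:
        saved_hop[pkt["id"]] = pkt["new_hop"]    # a new flowlet picks a new hop
    last_time[pkt["id"]] = pkt["arrival"]
    pkt["next_hop"] = saved_hop[pkt["id"]]
    return pkt


for t in (10, 12, 30):
    p = flowlet({"sport": 1234, "dport": 80, "arrival": t})
    print(t, p["next_hop"])
```

Packets of one flow that arrive within THRESHOLD of each other keep the same next hop; only a larger gap lets the flow move to a freshly hashed hop.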
3.2 Constraints on the language
The syntax of Domino is similar to C, but with several
constraints (Table 1). These constraints are required for de-
terministic performance. Memory allocation, unbounded it-
eration counts, and unstructured control flow all cause vari-
able performance, which may prevent an algorithm from
achieving line rate. Additionally, Domino constrains array
modifications by requiring that all accesses to a given array
within one execution of a transaction, i.e. one packet, must
use the same array index. For example, all read and write ac-
cesses to the array last_time use the index pkt.id, which
is constant for each packet, but can change between pack-
ets. This restriction mirrors restrictions on memories, which
don’t typically support distinct read and write addresses ev-
ery clock cycle.
No iteration (while, for, do-while).
No goto, break, or continue.
No pointers.
No dynamic memory allocation / heap.
Array index is constant for each transaction execution.
No access to packet data, i.e., the unparsed portion of the packet.
Table 1: Restrictions in Domino
3.3 Triggering packet transactions
Packet transactions specify how to process packet head-
ers and/or state. To specify when to run packet transactions,
we provide a guard: a predicate on packet fields that trig-
gers the transaction whenever a packet matches the guard.
For example, pairing the guard (pkt.tcp_dst_port == 80) with a
heavy-hitter detection transaction would run that transaction on
all packets with TCP destination port 80. This guard can be implemented using an exact match in
a match-action table, with the actions being the atoms result-
ing from compiling the packet transaction. Guards can be of
various forms, e.g., exact, ternary, longest-prefix and range-
based matches, depending on the match semantics supported
by the match-action pipeline. Because guards map rather
straightforwardly to the match key in a match-action table,
this paper only focuses on compiling packet transactions.
3.4 Handling multiple transactions
So far, we have discussed a single packet transaction cor-
responding to a single data-plane algorithm. In practice, a
switch would run multiple data-plane algorithms—each pro-
cessing its own subset of packets. To accommodate multi-
ple transactions, we envision a policy language that speci-
fies pairs of guards and transactions. Realizing a policy is
straightforward when all guards are disjoint. When guards
overlap, multiple transactions need to execute on the same
subset of packets, requiring a mechanism to compose trans-
actions. One semantics for composition is to concatenate
the two transaction bodies in an order specified by the user,
providing the illusion of a larger transaction that combines
two transactions. We leave a detailed exploration of this and
alternative semantics to future work, and focus only on com-
piling a single packet transaction.
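A policy of guard and transaction pairs might look like the following Python sketch. The paper leaves the policy language to future work, so everything here is hypothetical: the two transactions, the second guard, and the run_policy helper are invented for illustration, and running matching transactions in list order is only one possible way to model the concatenation semantics mentioned above.

```python
state = {"http_packets": 0}   # persistent state owned by the first transaction


def count_http(pkt):
    state["http_packets"] += 1


def mark_bulk(pkt):
    pkt["priority"] = 0


# Each entry pairs a guard (a predicate on packet fields) with a transaction.
policy = [
    (lambda pkt: pkt["tcp_dst_port"] == 80, count_http),
    (lambda pkt: pkt["tcp_dst_port"] >= 8000, mark_bulk),
]


def run_policy(policy, pkt):
    """Run every transaction whose guard matches, in the order listed."""
    for guard, transaction in policy:
        if guard(pkt):
            transaction(pkt)
    return pkt


print(run_policy(policy, {"tcp_dst_port": 9000}))
```

The two guards in this example are disjoint, which is the straightforward case; overlapping guards would make the order of the list significant.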
4. THE DOMINO COMPILER
The Domino compiler compiles from Domino programs
to Banzai targets. The compiler provides an all-or-nothing
model: if compilation succeeds, the compiler guarantees that
the program will run at line rate on the target. If the program
can’t be run at line rate, the compiler rejects the program
outright; there is no smooth tradeoff between a program’s
performance and its complexity. This all-or-nothing compi-
lation model is unusual relative to other substrates such as a
CPU, GPU, or DSP. But, it reflects how routers are used to-
day. Routers are rated for a particular line rate, regardless of
the enabled feature set. The all-or-nothing model gives up
some programmability in exchange for guaranteed line-rate
performance, in contrast to software routers that provide greater
flexibility but unpredictable run-time performance [45, 97].
The Domino compiler has three passes (Figure 4). First,
normalization simplifies the packet transaction into a restric-
tive three-address code form while retaining the sequential
nature of packet transactions, i.e., processing one packet at
a time. Second, pipelining transforms the normalized code
into code for a pipelined virtual switch machine (PVSM).
PVSM is an intermediate representation that models a switch
pipeline with no computational or resource limits. Third,
code generation transforms this intermediate representation
into configuration for a Banzai machine, given as inputs the
machine’s computational and resource constraints, and re-
jects the program if it can’t run at line rate on that Banzai ma-
chine. The Domino compiler uses many existing compiler
techniques, but adapts and simplifies them in important ways
to suit the domain of line-rate switches (§4.4). Throughout
this section, we use flowlet switching as a running example
to demonstrate compiler passes.
4.1 Normalization
[Figure 4, image not reproduced: Domino code -> Normalization (4.1) -> three-address code -> Pipelining (4.2) -> codelet pipeline for the Pipelined Virtual Switch Machine -> Code generation (4.3), which also takes the computational and resource limits as input -> atom pipeline for a Banzai machine.]
Figure 4: Passes in the Domino compiler
Branch removal: A packet transaction’s body can contain
(potentially nested) branches (e.g., Lines 27 to 29 in
Figure 3a). Branches alter control flow and complicate
dependency analysis, i.e., whether a statement should precede
another. We transform branches into the conditional operator,
starting from the innermost if and recursing outwards
(Figure 5). This turns the transaction body into straight-line
code with no branches. Straight-line code simplifies the
rest of the compiler, by simplifying dependency analysis and
conversion to static single-assignment form.
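As a rough illustration of branch removal, the sketch below writes the guarded update from lines 27 to 29 of Figure 3a once with a branch and once with a conditional expression, and checks that the two agree. The function names and the scalar arguments are assumptions made for the example; the compiler itself operates on Domino code, not Python.

```python
def with_branch(arrival, last_time, saved_hop, new_hop, threshold=5):
    if arrival - last_time > threshold:
        saved_hop = new_hop
    return saved_hop


def branch_free(arrival, last_time, saved_hop, new_hop, threshold=5):
    # The branch condition becomes a value, and the assignment becomes
    # unconditional: it keeps the old value when the condition is false.
    cond = arrival - last_time > threshold
    saved_hop = new_hop if cond else saved_hop
    return saved_hop


agree = all(
    with_branch(a, l, 3, 7) == branch_free(a, l, 3, 7)
    for a in range(20)
    for l in range(20)
)
print(agree)
```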
Rewriting state variable operations: We now identify
state variables in a packet transaction, such as last_time
and saved_hop in Figure 3a. For each state variable, we
create a read flank to read the state variable into a temporary
packet field. For an array, we also move the index expression
into the read flank using the fact that only one array index
is accessed by each packet. Within the packet transaction,
we replace the state variable with the packet temporary, and
create a write flank to write the packet temporary back into
the state variable (Figure 6). After this, the only operations
on state variables are reads and writes; all arithmetic happens
on packet fields. Restricting stateful operations simplifies
handling of state during pipelining.
Converting to static single-assignment form: We
next convert the code to static single-assignment form
(SSA) [43], where every packet field is assigned exactly
once. To do so, we replace the target of every assignment to a packet
field with a fresh packet field and propagate the new name to later
uses until the next assignment to the same field (Figure 7). Because every field is
assigned exactly once, SSA removes Write-After-Read and
Write-After-Write dependencies. Only Read-After-Write
dependencies remain, simplifying dependency analysis.
Flattening to three-address code: Three-address
code [25] is a representation where all instructions are either
reads from or writes to state variables, or operations on packet
fields of the form pkt.f1 = pkt.f2 op pkt.f3; where op
can be an arithmetic, logical, relational, or conditional
operator.5 We also allow either one of pkt.f2 or pkt.f3 to be
an intrinsic function call. To convert to three-address code,
we flatten expressions that are not in three-address code
using temporaries (Figure 8).
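To make the effect of normalization concrete, the sketch below gives a hand-worked Python rendering of the flowlet transaction after branch removal, read and write flanks, SSA renaming, and flattening. It is an approximation written for this example, not output of the Domino compiler: the field names with numeric suffixes, the placeholder hash functions, and passing the state arrays as arguments are all assumptions.

```python
def hash2(a, b):
    return (a * 31 + b) & 0x7FFFFFFF             # placeholder hash, an assumption


def hash3(a, b, c):
    return ((a * 31 + b) * 31 + c) & 0x7FFFFFFF  # placeholder hash, an assumption


def flowlet_original(pkt, last_time, saved_hop):
    pkt["new_hop"] = hash3(pkt["sport"], pkt["dport"], pkt["arrival"]) % 10
    pkt["id"] = hash2(pkt["sport"], pkt["dport"]) % 8000
    if pkt["arrival"] - last_time[pkt["id"]] > 5:
        saved_hop[pkt["id"]] = pkt["new_hop"]
    last_time[pkt["id"]] = pkt["arrival"]
    pkt["next_hop"] = saved_hop[pkt["id"]]


def flowlet_normalized(pkt, last_time, saved_hop):
    pkt["new_hop"] = hash3(pkt["sport"], pkt["dport"], pkt["arrival"]) % 10
    pkt["id"] = hash2(pkt["sport"], pkt["dport"]) % 8000
    # Read flanks: copy each state variable into a packet temporary.
    pkt["last_time0"] = last_time[pkt["id"]]
    pkt["saved_hop0"] = saved_hop[pkt["id"]]
    # Straight-line code: every field is assigned once, one operation per line.
    pkt["tmp"] = pkt["arrival"] - pkt["last_time0"]
    pkt["tmp2"] = pkt["tmp"] > 5
    pkt["saved_hop1"] = pkt["new_hop"] if pkt["tmp2"] else pkt["saved_hop0"]
    pkt["last_time1"] = pkt["arrival"]
    pkt["next_hop"] = pkt["saved_hop1"]
    # Write flanks: copy the final temporaries back into state.
    last_time[pkt["id"]] = pkt["last_time1"]
    saved_hop[pkt["id"]] = pkt["saved_hop1"]
```

Running both versions on the same packet sequence with separate copies of the state should give the same next_hop for every packet and the same final state, which is a quick way to check a hand normalization like this one.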
5 Conditional operations alone have 4 arguments.
4.2 Pipelining
At this point, the normalized code is still sequential in
that it operates on a single packet at a time without using a
pipeline to process packets concurrently. Pipelining turns
sequential code into a pipeline of codelets, where each codelet
is a sequential block of three-address code statements. This
codelet pipeline corresponds to an intermediate representation
(IR) we call the Pipelined Virtual Switch Machine
(PVSM). PVSM places no computational or resource constraints
on the pipeline, much like IRs such as LLVM place
no restriction on the number of virtual registers. Later, during
code generation, we map these codelets to atoms available
in a Banzai machine, while respecting its constraints.
We create PVSM’s codelet pipeline using the steps below.
1. Create a dependency graph of all statements in the nor-
malized packet transaction. First, create a node for
each statement. Second, add a pair of edges between
any two nodes N1 and N2, where N1 is a read from
a state variable and N2 is a write into the same vari-
able, to capture the notion that state should be inter-
nal to a codelet/atom. Third, create an edge (N1, N2)
for every pair of nodes N1, N2 where N2 reads a vari-
able written by N1. We only check read-after-write
dependencies because we eliminate control dependen-
cies6 through branch removal, and write-after-read and
write-after-write dependencies don’t exist after SSA.
Figure 9a shows the resulting dependency graph.
2. Generate strongly connected components (SCCs) of
this dependency graph and condense them into a di-
rected acyclic graph (DAG). This captures the notion
that all operations on a state variable must be confined
to one codelet/atom because state cannot be shared be-
tween atoms. Figure 9b shows the resulting DAG.
3. Schedule the resulting DAG using critical path
scheduling [65] by creating a new pipeline stage when
one operation needs to follow another. This results in
the codelet pipeline shown in Figure 3b.7
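The three steps above can be sketched in a few dozen lines of Python. This is a simplified illustration rather than the compiler's implementation: statements are described by hand as (name, fields read, fields written, state variable) tuples for a normalized flowlet transaction, strongly connected components are found by a naive mutual-reachability check instead of a linear-time algorithm, and stages are assigned by longest path from the inputs as a stand-in for critical path scheduling. All names are hypothetical.

```python
from itertools import product

# (name, packet fields read, packet fields written, state variable touched)
STMTS = [
    ("id", {"sport", "dport"}, {"id"}, None),
    ("new_hop", {"sport", "dport", "arrival"}, {"new_hop"}, None),
    ("read_last_time", {"id"}, {"last_time0"}, "last_time"),
    ("write_last_time", {"id", "arrival"}, set(), "last_time"),
    ("tmp", {"arrival", "last_time0"}, {"tmp"}, None),
    ("tmp2", {"tmp"}, {"tmp2"}, None),
    ("read_saved_hop", {"id"}, {"saved_hop0"}, "saved_hop"),
    ("saved_hop1", {"tmp2", "new_hop", "saved_hop0"}, {"saved_hop1"}, None),
    ("write_saved_hop", {"id", "saved_hop1"}, set(), "saved_hop"),
    ("next_hop", {"saved_hop1"}, {"next_hop"}, None),
]


def build_edges(stmts):
    edges = set()
    for (a, _, a_writes, a_state), (b, b_reads, _, b_state) in product(stmts, repeat=2):
        if a == b:
            continue
        if a_writes & b_reads:                 # read-after-write dependency
            edges.add((a, b))
        if a_state is not None and a_state == b_state:
            edges.update({(a, b), (b, a)})     # read/write of one state variable
    return edges


def reachable(start, edges):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def codelet_pipeline(stmts):
    names = [s[0] for s in stmts]
    edges = build_edges(stmts)
    reach = {n: reachable(n, edges) for n in names}
    # Step 2: a codelet is a set of mutually reachable statements (an SCC).
    comp = {n: frozenset(m for m in names
                         if m == n or (m in reach[n] and n in reach[m]))
            for n in names}
    stage = {}

    def stage_of(c):
        # Step 3: one stage later than the latest codelet it depends on.
        if c not in stage:
            preds = {comp[s] for s, d in edges if comp[d] == c and comp[s] != c}
            stage[c] = 1 + max((stage_of(p) for p in preds), default=0)
        return stage[c]

    pipeline = {}
    for c in set(comp.values()):
        pipeline.setdefault(stage_of(c), []).append(sorted(c))
    return [sorted(pipeline[s]) for s in sorted(pipeline)]


for number, codelets in enumerate(codelet_pipeline(STMTS), start=1):
    print("Stage", number, codelets)
```

For this input the sketch produces six stages, with the reads and writes of last_time grouped into one codelet and those of saved_hop into another, which agrees with the shape of the pipeline in Figure 3b.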
The codelet pipeline implements the packet transaction on
a switch pipeline with no computational or resource con-
straints. We handle these constraints next.
4.3 Code generation
To determine if the codelet pipeline can be compiled to a
Banzai machine, we consider two constraints in any Banzai
machine: resource limits, i.e., the pipeline width and depth,
and computational limits on atoms within a pipeline stage,
i.e., the atom templates provided by a Banzai machine.
Resource limits: To handle resource limits, we scan each
pipeline stage in the codelet pipeline starting from the first
to check for pipeline width violations. If a stage violates the
pipeline width, we insert as many new stages as required and
spread its codelets evenly across these stages. We continue
until the number of codelets in every stage is at most the pipeline
width, and reject the program if we exceed the pipeline depth.
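The width and depth check can be sketched as follows. This is a minimal illustration under assumptions: a pipeline is a list of stages, each stage is a list of codelet names, and the helper name and the round-robin spreading are choices made for this example rather than details given in the text.

```python
import math


def fit_to_width(pipeline, width, depth):
    """Split stages that are too wide; reject if the result is too deep."""
    fitted = []
    for stage in pipeline:
        parts = max(1, math.ceil(len(stage) / width))
        # Codelets within one stage do not depend on each other, so they
        # can be spread over consecutive stages without reordering.
        for i in range(parts):
            fitted.append(stage[i::parts])
    if len(fitted) > depth:
        raise ValueError("program rejected: pipeline depth exceeded")
    return fitted


print(fit_to_width([["a", "b", "c", "d", "e"], ["f"]], width=2, depth=8))
```

A stage with five codelets on a machine of width two becomes three stages holding two, two, and one codelets; the same program is rejected if the machine has fewer than four stages.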
Computational limits: Next, we determine if codelets
in the pipeline map one-to-one to atoms provided by the
Banzai machine. In general, codelets have multiple
three-address code statements that need to execute atomically. For
instance, updating the state variable saved_hop in Figure 3b
requires a read followed by a conditional write. It is not
apparent whether such codelets can be mapped to an available
atom. We develop a new technique to determine the
implementability of a codelet, given an atom template.
6 An instruction A is control dependent on a preceding instruction B if the outcome of B determines whether A should be executed or not.
7 We refer to this both as a codelet and an atom pipeline because codelets map one-to-one to atoms (§4.3).
Each atom template has a set of configuration parameters,
where the parameters determine the atom’s behavior. For
instance, Figure 2a shows a hardware circuit that can per-
form stateful addition or subtraction, depending on the value
of the constant and which output is selected from the mul-
tiplexer. Its atom template is shown in Figure 2b, where
choice and constant represent configuration parameters.
Each codelet can be viewed as a functional specification of
the atom. With that in mind, the mapping problem is equiva-
lent to searching for the value of the parameters to configure
the atom such that it implements the provided specification.
We use the SKETCH program synthesizer [90] for this
purpose, as the atom templates can be easily expressed us-
ing SKETCH, while SKETCH also provides efficient search
algorithms and has been used for similar purposes in other
domains [89, 38, 39, 78]. As an illustration, assume we
want to map the codelet x=x+1 to the atom template shown
in Figure 2b. SKETCH will search for possible parame-
ter values so that the resulting atom is functionally iden-
tical to the codelet, for all possible input values of x. In
this case, SKETCH finds the solution with choice=0 and
constant=1. In contrast, if the codelet x=x*x were supplied
as the specification, SKETCH would return an error because no
parameter values make the atom implement it.
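The mapping problem can be illustrated with a small Python sketch in which exhaustive enumeration stands in for SKETCH's search. The template encoding below (choice=0 selects addition, choice=1 selects subtraction) is an assumption chosen to match the solution quoted above, since Figure 2b is not reproduced here; the bounded input range and the 5-bit unsigned constant are likewise assumptions, and SKETCH itself reasons over all inputs rather than sampling a range.

```python
from itertools import product


def atom(x, choice, constant):
    """Hypothetical add/subtract atom template with two parameters."""
    return x + constant if choice == 0 else x - constant


def synthesize(codelet, inputs=range(-64, 65), const_bits=5):
    """Return parameters that make the atom match the codelet, else None."""
    for choice, constant in product((0, 1), range(2 ** const_bits)):
        if all(atom(x, choice, constant) == codelet(x) for x in inputs):
            return {"choice": choice, "constant": constant}
    return None  # no configuration implements the codelet


print(synthesize(lambda x: x + 1))
print(synthesize(lambda x: x * x))
```

With this encoding the first call finds choice=0 and constant=1, and the second returns None because no setting of the two parameters squares its input.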
4.4 Related compiler techniques
The Domino compiler employs many techniques from the
compiler literature, but adapts and simplifies them in new
ways to suit the domain of line-rate switches (Table 2). The
use of SCCs is inspired by software pipelining for VLIW
architectures [68, 79]. The size of the largest SCC affects
the maximum throughput of the pipelined loop in software
pipelining. For Domino, it affects the circuit area of the
atom required to run a program at line rate. Domino trades
off an increase in space for line-rate performance.
Program synthesis was used for code generation in
Chlorophyll [78]. Code generation for Domino has goals similar
to those of technology mapping [70, 41, 40] and instruction
selection [73]. However, prior work maps a code sequence
to multiple instructions/tiles, using heuristics to minimize in-
struction count. Domino’s problem is simpler: we map each
codelet to a single atom using SKETCH. The simpler prob-
lem allows a non-heuristic solution: if there is any way to
map the codelet to an atom, SKETCH will find it.
Branch removal resembles if-conversion [30], a technique
used in vectorizing compilers. This procedure is easier in
Domino because there is no backward control transfer (goto,
break, continue). Domino’s SSA computation operates on
straight-line code and doesn’t handle branches, which con-
siderably complicate SSA algorithms [43].
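Because the code is already straight-line, SSA conversion reduces to a single renaming pass. The Python sketch below illustrates this on the statements of Figure 7; the tuple encoding of statements and the name to_ssa are assumptions for illustration, and a target of None marks a write to a state variable, which is not renamed.

```python
def to_ssa(stmts):
    """Rename packet fields so that each one is assigned exactly once.

    Each statement is (target packet field or None, packet fields read).
    """
    version = {}  # latest version number of each assigned packet field
    out = []
    for target, operands in stmts:
        # Reads refer to the most recent version of a field, if any.
        renamed = [f"{f}{version[f]}" if f in version else f for f in operands]
        if target is not None:
            version[target] = version.get(target, -1) + 1
            target = f"{target}{version[target]}"
        out.append((target, renamed))
    return out


figure7 = [
    ("id", ["sport", "dport"]),     # pkt.id = hash2(...) % NUM_FLOWLETS
    ("last_time", ["id"]),          # pkt.last_time = last_time[pkt.id]
    ("last_time", ["arrival"]),     # pkt.last_time = pkt.arrival
    (None, ["id", "last_time"]),    # last_time[pkt.id] = pkt.last_time
]
for stmt in to_ssa(figure7):
    print(stmt)
```

The output uses id0, last_time0, and last_time1 in the same positions as the rewritten code in Figure 7. Fields that are never assigned, such as arrival, keep their names.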
if (pkt.arrival - last_time[pkt.id] > THRESHOLD) {
  saved_hop[pkt.id] = pkt.new_hop;
}
=⇒
pkt.tmp = pkt.arrival - last_time[pkt.id] > THRESHOLD;
saved_hop[pkt.id] = pkt.tmp
                    ? pkt.new_hop
                    : saved_hop[pkt.id];   // Rewritten
Figure 5: Branch removal
pkt.id = hash2(pkt.sport,
               pkt.dport)
         % NUM_FLOWLETS;
...
last_time[pkt.id] = pkt.arrival;
...
=⇒
pkt.id = hash2(pkt.sport,                  // Read flank
               pkt.dport)
         % NUM_FLOWLETS;
pkt.last_time = last_time[pkt.id];         // Read flank
...
pkt.last_time = pkt.arrival;               // Rewritten
...
last_time[pkt.id] = pkt.last_time;         // Write flank
Figure 6: Rewriting state variable operations
pkt.id = hash2(pkt.sport,
               pkt.dport)
         % NUM_FLOWLETS;
pkt.last_time = last_time[pkt.id];
...
pkt.last_time = pkt.arrival;
last_time[pkt.id] = pkt.last_time;
=⇒
pkt.id0 = hash2(pkt.sport,                 // Rewritten
                pkt.dport)
          % NUM_FLOWLETS;
pkt.last_time0 = last_time[pkt.id0];       // Rewritten
...
pkt.last_time1 = pkt.arrival;              // Rewritten
last_time[pkt.id0] = pkt.last_time1;       // Rewritten
Figure 7: Converting to static single-assignment form
1 pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
2 pkt.saved_hop = saved_hop[pkt.id];
3 pkt.last_time = last_time[pkt.id];
4 pkt.new_hop = hash3(pkt.sport, pkt.dport, pkt.arrival) % NUM_HOPS;
5 pkt.tmp = pkt.arrival - pkt.last_time;
6 pkt.tmp2 = pkt.tmp > THRESHOLD;
7 pkt.next_hop = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop;
8 saved_hop[pkt.id] = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop;
9 last_time[pkt.id] = pkt.arrival;
Figure 8: Flowlet switching in three-address code. Lines 1 and 4 are flipped relative to Figure 3a because pkt.id is an array
index expression and is moved into the read flank.
[Diagram (a): Dependency graph whose nodes are the nine statements of Figure 8. Edges are read-after-write dependencies.]
=⇒
[Diagram (b): DAG after condensing SCCs, drawn over the same nine statements.]
Figure 9: Code Pipelining
Technique | Prior Work | Differences
--- | --- | ---
Conversion to straight-line code | If Conversion [30] | No backward control flow (gotos, break, continue)
SSA | Cytron et al. [43] | SSA runs on straight-line code with no branches
Strongly Connected Components | Lam [68], Rau and Glaeser [79] | Scheduling in space vs. time
Code generation using program synthesis | Chlorophyll [78], technology mapping [70, 41, 40], instruction selection [73] | Optimal vs. best-effort mapping; One-to-one mapping vs. tiling
Table 2: Domino’s compiler in relation to prior work
Atom | Description | Area (µm²)
--- | --- | ---
Stateless | Arithmetic, logic, relational, and conditional operations on packet/constant operands | 1384
Read/Write | Read/Write packet field/constant into single state variable. | 250
ReadAddWrite (RAW) | Add packet field/constant to state variable (OR) Write packet field/constant into state variable. | 431
Predicated ReadAddWrite (PRAW) | Execute RAW on state variable only if a predicate is true, else leave unchanged. | 791
IfElse ReadAddWrite (IfElseRAW) | Two separate RAWs: one each for when a predicate is true or false. | 985
Subtract (Sub) | Same as IfElseRAW, but also allow subtracting a packet field/constant. | 1522
Nested Ifs (Nested) | Same as Sub, but with an additional level of nesting that provides 4-way predication. | 3597
Paired updates (Pairs) | Same as Nested, but allow updates to a pair of state variables, where predicates can use both state variables. | 5997
Table 3: Atom areas in a 32 nm standard-cell library. All
atoms meet timing at 1 GHz. Each of the seven compiler
targets contains one of the seven stateful atoms (Read/Write
through Pairs) and the single stateless atom.
5. EVALUATION
We have performed a number of experiments to evaluate
Domino. First, we evaluate Domino’s expressiveness by us-
ing it to program several data-plane algorithms (Table 4),
and comparing it to writing them in P4 (§5.1). To validate
that these algorithms can be implemented at line rate, we de-
sign a concrete set of Banzai machines that we use as com-
piler targets for Domino (§5.2). We estimate that these ma-
chines are feasible in hardware today because their atoms
incur modest chip area overhead. Next, we use the Domino
compiler to compile the algorithms in Table 4 to these tar-
gets (§5.3). We conclude by quantifying the tradeoff be-
tween a target’s programmability (the space of data-plane
algorithms that it can run at line rate) and the target’s perfor-
mance (the maximum line rate it can support) (§5.4).
5.1 Expressiveness
To evaluate Domino’s expressiveness, we express several
data-plane algorithms (Table 4) using Domino. These algo-
rithms encompass a variety of data-plane functionality in-
cluding data-plane load balancing, in-network congestion
control, active queue management, security, and measure-
ment. We also used Domino to express the priority
computation for programming scheduling using the push-
in first-out queue abstraction [88]. In all these cases, the al-
gorithms are already available as blocks of imperative code
from online sources; translating them to Domino syntax was
straightforward.
In contrast, expressing any of these algorithms in P4 re-
quires manually teasing out portions of the algorithm that
can reside in independent match-action tables and then
chaining these tables together. In essence, the programmer
manually carries out the transformations in Domino’s com-
piler. Of the algorithms in Table 4, only flowlet switching
has a publicly available P4 implementation [12] that we can
compare against. This implementation requires 231 lines of
uncommented P4, in comparison to the 37 lines of Domino
code in Figure 3a. Moreover, using P4 requires the
programmer to manually specify tables, the actions within
the tables, how tables are chained, and what headers are
required—all to implement a single data-plane algorithm.
As the Domino compiler shows, this process can be au-
tomated; to demonstrate this, we developed a backend for
Domino that generates the equivalent P4 code (lines of code
for these auto-generated P4 programs are listed in Table 4).
Lastly, data-plane algorithms on software platforms today
(NPUs, Click [66], the Linux qdisc subsystem [8]) are pro-
grammed in languages resembling Domino—hence we are
confident that the Domino syntax is already familiar to net-
work operators.
5.2 Compiler targets
We design a concrete set of compiler targets for Domino
based on the Banzai machine model. First, we specify com-
putational limits on atoms in each compiler target using atom
templates. Using the Synopsys Design Compiler [10], we
quantify each atom’s area in a 32 nm standard-cell library
when running at 1 GHz. Second, using an individual atom’s
area and a switching chip’s area [57], we determine the ma-
chine’s resource limits, i.e., the pipeline width for each atom
and the pipeline depth.
Computational limits: Stateless atoms are easier to de-
sign because arbitrary stateless operations can be spread
out across multiple pipeline stages without violating atomic-
ity (§2.3). We design a stateless atom that can support simple
arithmetic (add, subtract, left shift, right shift), logical (and,
or, xor), relational (>=, <=, ==, !=), or conditional operations
(C’s “?” operator) on a set of packet fields. Any packet field
can also be substituted with a constant operand.
Designing stateful atoms is more involved because it de-
termines which algorithms the switch can support. A more
complex stateful atom can support more data-plane algo-
rithms, but occupies greater chip area. To illustrate this, we
design a containment hierarchy of stateful atoms, where each
atom can express all stateful operations that its predecessor
can. When synthesized to a 32 nm standard-cell library, all
of our designed atoms meet timing at 1 GHz and their area
increases with the atom’s complexity (Table 3).
Algorithm | Description | Least expressive atom | # of stages, max. atoms/stage | Ingress or Egress Pipeline? | Domino LOC | P4 LOC
--- | --- | --- | --- | --- | --- | ---
Bloom filter [37] (3 hash functions) | Set membership bit on every packet. | Write | 4, 3 | Either | 29 | 104
Heavy Hitters [98] (3 hash functions) | Increment Count-Min Sketch [42] on every packet. | RAW | 10, 9 | Either | 35 | 192
Flowlets [86] | Update saved next hop if flowlet threshold is exceeded. | PRAW | 6, 2 | Ingress | 37 | 107
RCP [93] | Accumulate RTT sum if RTT is under maximum allowable RTT. | PRAW | 3, 3 | Egress | 23 | 75
Sampled NetFlow [24] | Sample a packet if packet count reaches N; Reset count to 0 when it reaches N. | IfElseRAW | 4, 2 | Either | 18 | 70
HULL [29] | Update counter for virtual queue. | Sub | 7, 1 | Egress | 26 | 95
Adaptive Virtual Queue [67] | Update virtual queue size and virtual capacity | Nested | 7, 3 | Ingress | 36 | 147
Compute priorities for weighted fair queueing [88] | Compute packet’s virtual start time using finish time of last packet in that flow. | Nested | 4, 2 | Ingress | 29 | 87
DNS TTL change tracking [34] | Track number of changes in announced TTL for each domain | Nested | 6, 3 | Ingress | 27 | 119
CONGA [27] | Update best path’s utilization/id if we see a better path. Update best path utilization alone if it changes. | Pairs | 4, 2 | Ingress | 32 | 89
CoDel [74] | Update: Whether we are marking or not, Time for next mark, Number of marks so far, Time at which min. queueing delay will exceed target. | Doesn’t map | 15, 3 | Egress | 57 | 271
Table 4: Data-plane algorithms
Resource limits: We design one compiler target for each
combination of a stateful atom along with the single stateless
atom in Table 3. We determine resource limits for stateful
and stateless atoms separately. For the stateless atom, assuming
a chip area of 200 mm² (the smallest area given by
Gibb et al. [57]), and an acceptable overhead of 7% (the area
overheads for actions in RMT [36]), we can support ~10000
stateless atoms, given the area of 1384 µm² per instance. If
these 10000 atoms were spread across the same number of
stages (32) as RMT, we could support up to ~300 stateless
atoms per stage.
A similar analysis for the stateful atoms yields 70 stateful
atoms per stage for the most complex stateful atom (Pairs)
with an area of 5997 µm². However, stateful atoms access
per-stage memory banks storing state. Providing 70 inde-
pendent memory banks per stage supporting one read and
write per clock is prohibitive. Furthermore, given an overall
stage memory budget, slicing it into many small banks re-
duces the amount of memory accessible to each atom. This
is problematic for hash-based algorithms that need to hash
into a large memory space. Taking these into account, we
limit the number of stateful atoms to around 10 per stage,
which is still sufficient for the data-plane algorithms that
we are interested in. The area overhead of these 10 state-
ful atoms is ~1%.
We next look at the multiplexers to route inputs to these
atoms from specific packet fields and route outputs from
these atoms to specific packet fields. For this, we rely on
RMT [36], which estimates a crossbar area of 6 mm² for a
32-stage pipeline with 224 action units. Scaling this proportionally
to 300 atoms, we estimate a crossbar area of 8 mm²
with a 4% area overhead.
In summary, we assume 32 stages in total, 300 stateless
atoms per stage and 10 stateful atoms per stage for all com-
piler targets with an area overhead of 12% (7% for stateless
atoms, 1% for the stateful atoms, and 4% for the crossbars).
By no means is this the only design. We only claim that
this is feasible and show that it can be used to implement
a variety of data-plane algorithms, which is far beyond a
fixed-function switch today. We anticipate Banzai machines
evolving as data-plane algorithms demand more of the hard-
ware.
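The area estimates above follow from simple arithmetic, which the Python sketch below reproduces. It assumes that the stateful analysis uses the same 7% budget as the stateless one (the text says only that the analysis is similar); the constant and function names are illustrative.

```python
CHIP_AREA_UM2 = 200e6   # 200 mm^2 expressed in square micrometers
STAGES = 32             # same number of stages as RMT


def atoms_for_budget(atom_area_um2, budget_fraction=0.07):
    """Total atoms that fit in the area budget, and atoms per stage."""
    total = int(budget_fraction * CHIP_AREA_UM2 / atom_area_um2)
    return total, total // STAGES


def stateful_overhead(atoms_per_stage, atom_area_um2):
    """Fraction of the chip used by the stateful atoms."""
    return atoms_per_stage * STAGES * atom_area_um2 / CHIP_AREA_UM2


def crossbar_area_mm2(atoms, ref_area_mm2=6.0, ref_units=224):
    """Scale RMT's crossbar area proportionally to the number of atoms."""
    return ref_area_mm2 * atoms / ref_units


stateless_total, stateless_per_stage = atoms_for_budget(1384)   # stateless atom
_, pairs_per_stage = atoms_for_budget(5997)                     # Pairs atom
pairs_overhead = stateful_overhead(10, 5997)
crossbar = crossbar_area_mm2(300)
crossbar_overhead = crossbar * 1e6 / CHIP_AREA_UM2

print(stateless_total, stateless_per_stage)   # roughly 10000 and 300
print(pairs_per_stage)                        # roughly 70
print(round(100 * pairs_overhead, 2), "%")    # roughly 1%
print(round(crossbar, 2), "mm^2,", round(100 * crossbar_overhead, 2), "%")
```

The exact figures are 10115 stateless atoms (316 per stage), 72 Pairs atoms per stage, 0.96% for ten Pairs atoms per stage, and 8.04 mm² (4.02%) for the crossbar, which the text rounds to ~10000, ~300, 70, ~1%, and 8 mm² (4%).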
5.3 Compiling Domino programs to Banzai
machines
We now consider every target from Table 3, and every
data-plane algorithm from Table 4 to determine if the algorithm
can run at line rate on a particular Banzai machine.
We say an algorithm can run at line rate on a Banzai ma-
chine if every codelet within the data-plane algorithm can
be mapped (§4.3) to either the stateful or stateless atom pro-
vided by the Banzai machine. Because stateful atoms are
arranged in a containment hierarchy, we list the least expres-
sive stateful atom/target required for each data-plane algo-
rithm in Table 4.
We note two lessons for designing programmable
switches from Table 4. First, atoms supporting stateful op-
erations on a single state variable are sufficient for several
data-plane algorithms. For instance, the algorithms from
Bloom Filter through DNS TTL Change Tracking in Table 4
can be run at line rate using the Nested Ifs atom that manip-
ulates a single state variable. Second, there are algorithms
that need to update a pair of state variables atomically. One
example is CONGA, whose code we reproduce below:
if (p.util < best_path_util[p.src]) {
best_path_util[p.src] = p.util;
best_path[p.src] = p.path_id;
} else if (p.path_id == best_path[p.src]) {
best_path_util[p.src] = p.util;
}
Here, best_path (the path id of the best path for a particu-
lar destination) is updated conditioned on best_path_util
(the utilization of the best path to that destination)⁸ and vice
versa. These two state variables cannot be separated into
different stages and still guarantee a packet transaction’s se-
mantics. The Pairs atom, where the update to a state variable
is conditioned on a predicate of a pair of state variables, al-
lows us to run CONGA at line rate.
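To make the transaction's semantics concrete, the Python sketch below transcribes the CONGA fragment and applies it to packets one at a time, which is the serial behavior a packet transaction guarantees. The dictionary-based state, the initial values (utilization 100, path id -1), and the packet tuples are assumptions for illustration and do not come from the paper.

```python
def conga(pkt, best_path_util, best_path):
    """One packet transaction: both state variables are read and updated
    together, so they cannot be split across pipeline stages."""
    src, util, path_id = pkt
    if util < best_path_util[src]:
        best_path_util[src] = util
        best_path[src] = path_id
    elif path_id == best_path[src]:
        best_path_util[src] = util


best_path_util = {1: 100}   # assumed initial utilization for source 1
best_path = {1: -1}         # assumed initial path id for source 1

packets = [
    (1, 50, 3),   # better path: record utilization 50 on path 3
    (1, 70, 3),   # same path, utilization changed: update to 70
    (1, 60, 5),   # better path again: record utilization 60 on path 5
    (1, 80, 3),   # worse, and not the best path: no change
]
for pkt in packets:
    conga(pkt, best_path_util, best_path)

print(best_path_util[1], best_path[1])
```

Each branch tests one state variable and may write the other, which is why an atom that updates only a single state variable cannot express this transaction.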
While the targets in Table 3 are sufficient for several data-
plane algorithms, there are algorithms that they can’t run at
line rate. An example is CoDel, which cannot be imple-
mented because it requires a square root operation that isn’t
provided by any of our targets. One possibility is a look-up
table abstraction that allows us to approximate such mathe-
matical functions. We leave this exploration to future work.
Compilation time: Compilation time is dominated by
SKETCH’s search procedure. To speed up the search, we
limit SKETCH to search for constants (e.g., for addition) of
size up to 5 bits, given that the constants seen within stateful
codelets in our algorithms are small. Our longest compilation
time is 10 seconds; it occurs when CoDel fails to map to a
Banzai machine with the Pairs atom, because SKETCH has
to rule out every configuration in its search space. This time
will increase if we increase the bit width of constants that
SKETCH has to search; however, because the data-plane al-
gorithms themselves are small, we don’t expect compilation
times to be a concern.
⁸ p.src is the address of the host originating this message, and
hence the destination for the host receiving it and executing
CONGA.
5.4 Performance vs. programmability
Atom | Min. delay (picoseconds) | Programmability (# of algorithms implemented by atom) | Performance (Max. line rate in billion pkts/sec)
--- | --- | --- | ---
Write | 176 | 1 | 5.68
ReadAddWrite (RAW) | 316 | 2 | 3.16
Predicated ReadAddWrite (PRAW) | 393 | 4 | 2.54
IfElse ReadAddWrite (IfElseRAW) | 392 | 5 | 2.55
Subtract (Sub) | 409 | 6 | 2.44
Nested Ifs (Nested) | 580 | 9 | 1.72
Paired updates (Pairs) | 609 | 10 | 1.64
Table 5: Programmability increases with more complex
atoms, but performance decreases.
While powerful atoms like Pairs can implement more
data-plane algorithms, they have a performance cost. A
more expressive atom incurs longer signal propagation de-
lays and implies a lower clock frequency or line rate (the
inverse of propagation delay). To quantify this intuition,
we consider each stateful atom from Table 3 and synthe-
size a circuit with the lowest possible delay. As we increase
the complexity of the atom, the number of algorithms from
Table 4 that it can implement increases (programmability),
while at the same time, its achievable line rate (performance)
decreases (Table 5).⁹ This decrease in line rate can be explained
by looking at the simplified circuit diagrams for the
first three atoms (Table 6), which show an increase in circuit
depth with atom complexity.
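Both derived columns of Table 5 can be recomputed from data given elsewhere in the paper, as the Python sketch below shows. The line rate is taken as the inverse of the minimum delay (one packet per clock cycle), and programmability counts the algorithms in Table 4 whose least expressive atom is at or below a given atom in the containment hierarchy. The dictionary layout and the function name are assumptions for illustration.

```python
HIERARCHY = ["Write", "RAW", "PRAW", "IfElseRAW", "Sub", "Nested", "Pairs"]

# Minimum delay of each atom in picoseconds (Table 5).
MIN_DELAY_PS = {"Write": 176, "RAW": 316, "PRAW": 393, "IfElseRAW": 392,
                "Sub": 409, "Nested": 580, "Pairs": 609}

# Least expressive atom required by each algorithm (Table 4).
LEAST_ATOM = {
    "Bloom filter": "Write", "Heavy Hitters": "RAW", "Flowlets": "PRAW",
    "RCP": "PRAW", "Sampled NetFlow": "IfElseRAW", "HULL": "Sub",
    "Adaptive Virtual Queue": "Nested", "Weighted fair queueing": "Nested",
    "DNS TTL change tracking": "Nested", "CONGA": "Pairs",
    "CoDel": None,  # does not map to any target
}


def table5():
    rows = []
    for rank, atom in enumerate(HIERARCHY):
        supported = sum(1 for least in LEAST_ATOM.values()
                        if least is not None and HIERARCHY.index(least) <= rank)
        # 1 / (delay in ps) = 1000 / delay billion packets per second.
        rate = round(1000.0 / MIN_DELAY_PS[atom], 2)
        rows.append((atom, supported, rate))
    return rows


for row in table5():
    print(row)
```

The recomputed rows match the programmability and performance columns of Table 5, including the slightly higher rate of IfElseRAW relative to PRAW.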
6. RELATED WORK
Abstract machines for line-rate switches: NetASM [83]
is an abstract machine and intermediate representation (IR)
for programmable data planes that is portable across network
devices—FPGAs, virtual switches, and line-rate switches.
Banzai is a machine model for line-rate switches alone, and
hence models practical constraints required for line-rate for-
warding that NetASM doesn’t. For instance, Banzai ma-
chines don’t permit sharing state between atoms and use
atom templates to limit computations that can happen at line
rate. Further, while NetASM’s dataflow framework focuses
only on target-independent middle-end optimizations such
as dead-code elimination, the Domino compiler implements
a compiler back-end for line-rate switches (§4.3).
Programmable data planes: Software data planes such
as Click [66], RouteBricks [46], and Fastpass [77] are flexi-
ble but lack the performance required for large-scale deploy-
ments. Network Processors [19, 20] (NPUs) were an attempt
to bridge the gap. NPUs are faster than software routers; yet,
they remain ~10× slower than switching chips [36].
⁹ The slightly non-monotonic behavior between PRAW and
IfElseRAW is because the logic synthesis tool is not optimal and
employs many heuristics.
Atom | Circuit | Min. delay in picoseconds
--- | --- | ---
Write | [circuit diagram: inputs pkt_1 and Const, one 2-to-1 Mux, state x] | 176
ReadAddWrite (RAW) | [circuit diagram: inputs pkt_1, Const, and 0, one Adder, two 2-to-1 Muxes, state x] | 316
Predicated ReadAddWrite (PRAW) | [circuit diagram: inputs pkt_1, pkt_2, Const, and 0, two 3-to-1 Muxes, one Adder, one RELOP, three 2-to-1 Muxes, state x] | 393
Table 6: Minimum delay of an atom increases with circuit
depth. MUX stands for a multiplexer, RELOP stands for a
relational operation between two operands.
Eden [33] provides a programmable data plane using com-
modity switches by programming end hosts alone. Domino
targets programmable switches that increase the scope of
programmable data planes relative to an end-host-only solu-
tion. For instance, Domino permits us to express in-network
congestion control, AQM, and congestion-aware load bal-
ancing (CONGA), which are beyond Eden’s capabilities.
Tiny Packet Programs (TPP) [61] allow end hosts to embed
small programs in packet headers, which are then executed
by the switch. TPPs are written in a restricted instruction
set to facilitate switch execution; we show that switch in-
structions must and can be substantially richer (Table 3) to
support stateful data-plane algorithms.
An alternative is to utilize hardware such as FPGAs;
examples include NetFPGA [69], Switchblade [31], and
Chimpp [81]. These designs are slower than switching
ASICs, and are rarely used in production network equip-
ment. The Arista 7124 FX [2] is a commercial switch with
an on-board FPGA, but its capacity is limited to 160 Gbit-
s/sec when using the on-board FPGA—10× less than the
multi-terabit capacities of programmable switch chips [26].
Jose et al. [62] focus on compiling P4 programs to pro-
grammable data planes such as the RMT and FlexPipe ar-
chitectures. Their work focuses on compiling stateless data-
plane tasks such as forwarding and routing, while Domino
focuses on stateful data-plane algorithms.
Packet-processing languages: Many programming lan-
guages target the network control plane. Examples include
Frenetic [54], Pyretic [71], and Maple [95]. Domino focuses
on the data plane instead, which requires different program-
ming constructs and compilation techniques.
Several DSLs target the data plane. Click [66] uses C++
for packet processing on software routers. NOVA [56],
packetC [47], Intel’s auto-partitioning C compiler [44], Pa-
cLang [48, 49], and Microengine C [16, 18] target network
processors [19, 20]. Domino’s C-like syntax and sequential
semantics are inspired by these DSLs. However, by target-
ing line-rate switches, Domino is more constrained: e.g., it
needs to ensure that the compiled programs can run at line
rate; hence the language forbids loops and includes no synchronization
constructs, as there is no shared state in Banzai
machines.
The SNAP system [32] programs stateful data-plane algo-
rithms using a network transaction: an atomic block of code
that treats the entire network as one switch [63] and uses a
compiler to translate network transactions into rules on each
switch. SNAP doesn’t compile these switch-local rules into
a switch’s pipeline. Domino can be used to compile SNAP’s
switch-local rules to an atom pipeline and is an enabler for
SNAP and other network-wide abstractions. FAST [72] is
another system that provides switch support and software abstractions
for state machines. Banzai’s atoms support more
general stateful processing beyond state machines, which enables
a much wider class of data-plane algorithms to be implemented.
7. CONCLUSION
This paper presented Domino, a C-like imperative lan-
guage that allows programmers to write packet-processing
code using packet transactions, which are sequential code
blocks that are atomic and isolated from other such code
blocks. The Domino compiler compiles packet transactions
to be executed on Banzai, which is a machine model based
on programmable line-rate switch architectures [17, 26, 36].
Our results suggest that it is possible to have both a fa-
miliar programming model and line-rate performance, pro-
vided that the algorithm can indeed run at line rate. Packet-
processing languages are still in their infancy; we hope these
results will prompt further work on programming abstrac-
tions for packet-processing hardware.
8. REFERENCES
[1] 100g data planes, dp 6440, dp 6430 | corsa technology.
http://www.corsa.com/products/dp6440/.
[2] 7124fx application switch.
https://www.arista.com/assets/data/pdf/
7124FX/7124FX_Data_Sheet.pdf.
[3] Appendix: Codel pseudocode.
http://queue.acm.org/appendices/codel.html.
[4] Arista - arista 7050 series. https:
//www.arista.com/en/products/7050-series.
[5] Cavium and XPliant introduce a fully programmable
switch silicon family scaling to 3.2 terabits per
second. http://www.cavium.com/newsevents-
Cavium-and-XPliant-Introduce-a-Fully-
Programmable-Switch-Silicon-Family.html.
[6] Cavium XPliant switches and Microsoft azure
networking achieve SAI routing interoperability.
http:
//www.cavium.com/newsevents-Cavium-XPliant-
Switches-and-Microsoft-Azure-Networking-
Achieve-SAI-Routing-Interoperability.html.
[7] Cisco nexus family.
http://www.cisco.com/c/en/us/products/
switches/cisco_nexus_family.html.
[8] Components of Linux Traffic Control.
http://tldp.org/HOWTO/Traffic-Control-HOWTO/
components.html.
[9] Dell force10. http://www.force10networks.com/.
[10] Design compiler - synopsys.
http://www.synopsys.com/Tools/
Implementation/RTLSynthesis/DesignCompiler/
Pages/default.aspx.
[11] DPDK: Data plane development kit.
http://dpdk.org/.
[12] Flowlet switching in p4.
https://github.com/p4lang/tutorials/tree/
master/SIGCOMM_2015/flowlet_switching.
[13] High Capacity StrataXGS®Trident II Ethernet Switch
Series. http://www.broadcom.com/products/
Switching/Data-Center/BCM56850-Series.
[14] High-density 25/100 gigabit ethernet StrataXGS
tomahawk ethernet switch series.
http://www.broadcom.com/products/Switching/
Data-Center/BCM56960-Series.
[15] High performance gdn 100g top-of-rack (tor) switch
for datacenter | algo-logic systems inc.
http://algo-logic.com/gdn-100g-tor-switch.
[16] Intel enhances network processor family with new
software tools and expanded performance.
http://www.intel.com/pressroom/archive/
releases/2001/20010220net.htm.
[17] Intel FlexPipe.
http://www.intel.com/content/dam/www/public/
us/en/documents/product-briefs/ethernet-
switch-fm6000-series-brief.pdf.
[18] Intel internet exchange architecture.
http://www.intel.com/design/network/papers/
intelixa.pdf.
[19] Intel IXP2800 network processor.
http://www.ic72.com/pdf_file/i/587106.pdf.
[20] IXP4XX Product Line of Network Processors.
http://www.intel.com/content/www/us/en/
intelligent-systems/previous-generation/
intel-ixp4xx-intel-network-processor-
product-line.html.
[21] Mellanox Products: SwitchX-2 Ethernet Optimized
for SDN. http://www.mellanox.com/page/
products_dyn?product_family=146&mtag=
switchx_2_en.
[22] New cisco asic has a programmable data plane.
http://searchnetworking.techtarget.com/news/
2240177388/New-Cisco-ASIC-has-a-
programmable-data-plane.
[23] P4 Specification.
http://p4.org/spec/p4-latest.pdf.
[24] Sampled netflow.
http://www.cisco.com/c/en/us/td/docs/ios/
12_0s/feature/guide/12s_sanf.html.
[25] Three-address code. https:
//en.wikipedia.org/wiki/Three-address_code.
[26] XPliant™Ethernet Switch Product Family.
http://www.cavium.com/XPliant-Ethernet-
Switch-Product-Family.html.
[27] ALIZADEH, M., EDSALL, T., DHARMAPURIKAR,
S., VAIDYANATHAN, R., CHU, K., FINGERHUT, A.,
LAM, V. T., MATUS, F., PAN, R., YADAV, N., AND
VARGHESE, G. CONGA: Distributed
Congestion-aware Load Balancing for Datacenters. In
SIGCOMM (2014).
[28] ALIZADEH, M., GREENBERG, A., MALTZ, D. A.,
PADHYE, J., PATEL, P., PRABHAKAR, B.,
SENGUPTA, S., AND SRIDHARAN, M. Data Center
TCP (DCTCP). In SIGCOMM (2010).
[29] ALIZADEH, M., KABBANI, A., EDSALL, T.,
PRABHAKAR, B., VAHDAT, A., AND YASUDA, M.
Less is more: Trading a little bandwidth for ultra-low
latency in the data center. In Proceedings of the 9th
USENIX Symposium on Networked Systems Design
and Implementation (NSDI 12) (San Jose, CA, 2012),
USENIX, pp. 253–266.
[30] ALLEN, J. R., KENNEDY, K., PORTERFIELD, C.,
AND WARREN, J. Conversion of control dependence
to data dependence. In Proceedings of the 10th ACM
SIGACT-SIGPLAN Symposium on Principles of
Programming Languages (New York, NY, USA,
1983), POPL ’83, ACM, pp. 177–189.
[31] ANWER, M. B., MOTIWALA, M., TARIQ, M. B.,
AND FEAMSTER, N. Switchblade: A platform for
rapid deployment of network protocols on
programmable hardware. In SIGCOMM (2011).
[32] ARASHLOO, M. T., KAROL, Y., GREENBERG, M.,
REXFORD, J., AND WALKER, D. Snap: Stateful
network-wide abstractions for packet processing.
arXiv:1512.00822.
[33] BALLANI, H., COSTA, P., GKANTSIDIS, C.,
GROSVENOR, M. P., KARAGIANNIS, T.,
KOROMILAS, L., AND O’SHEA, G. Enabling
end-host network functions. In Proceedings of the
2015 ACM Conference on Special Interest Group on
Data Communication (New York, NY, USA, 2015),
SIGCOMM ’15, ACM, pp. 493–507.
[34] BILGE, L., KIRDA, E., KRUEGEL, C., AND
BALDUZZI, M. EXPOSURE: finding malicious
domains using passive DNS analysis. In Proceedings
of the Network and Distributed System Security
Symposium, NDSS 2011, San Diego, California, USA,
6th February - 9th February 2011 (2011).
[35] BOSSHART, P., DALY, D., GIBB, G., IZZARD, M.,
MCKEOWN, N., REXFORD, J., SCHLESINGER, C.,
TALAYCO, D., VAHDAT, A., VARGHESE, G., AND
WALKER, D. P4: Programming Protocol-independent
Packet Processors. SIGCOMM Comput. Commun.
Rev. 44, 3 (July 2014), 87–95.
[36] BOSSHART, P., GIBB, G., KIM, H.-S., VARGHESE,
G., MCKEOWN, N., IZZARD, M., MUJICA, F., AND
HOROWITZ, M. Forwarding Metamorphosis: Fast
Programmable Match-action Processing in Hardware
for SDN. In SIGCOMM (2013).
[37] BRODER, A., AND MITZENMACHER, M. Network applications
of bloom filters: A survey. In Internet Mathematics
(2002), pp. 636–646.
[38] CHEUNG, A., SOLAR-LEZAMA, A., AND MADDEN,
S. Using program synthesis for social
recommendations. In Proceedings of the 21st ACM
International Conference on Information and
Knowledge Management (New York, NY, USA,
2012), CIKM ’12, ACM, pp. 1732–1736.
[39] CHEUNG, A., SOLAR-LEZAMA, A., AND MADDEN,
S. Optimizing database-backed applications with
query synthesis. In Proceedings of the 34th ACM
SIGPLAN Conference on Programming Language
Design and Implementation (New York, NY, USA,
2013), PLDI ’13, ACM, pp. 3–14.
[40] CLARKE, E. M., MCMILLAN, K. L., ZHAO, X.,
FUJITA, M., AND YANG, J. Spectral transforms for
large boolean functions with applications to
technology mapping. In Design Automation, 1993.
30th Conference on (1993), IEEE, pp. 54–60.
[41] CONG, J., AND DING, Y. FlowMap: An optimal
technology mapping algorithm for delay optimization
in lookup-table based FPGA designs. Computer-Aided
Design of Integrated Circuits and Systems, IEEE
Transactions on 13, 1 (1994), 1–12.
[42] CORMODE, G., AND MUTHUKRISHNAN, S. An
improved data stream summary: The count-min sketch
and its applications. J. Algorithms 55, 1 (Apr. 2005),
58–75.
[43] CYTRON, R., FERRANTE, J., ROSEN, B. K.,
WEGMAN, M. N., AND ZADECK, F. K. Efficiently
computing static single assignment form and the
control dependence graph. ACM Transactions on
Programming Language Systems 13, 4 (1991),
451–490.
[44] DAI, J., HUANG, B., LI, L., AND HARRISON, L.
Automatically partitioning packet processing
applications for pipelined architectures. In
Proceedings of the 2005 ACM SIGPLAN Conference
on Programming Language Design and
Implementation (New York, NY, USA, 2005), PLDI
’05, ACM, pp. 237–248.
[45] DOBRESCU, M., ARGYRAKI, K., AND RATNASAMY,
S. Toward predictable performance in software
packet-processing platforms. In Proceedings of the 9th
USENIX Conference on Networked Systems Design
and Implementation (Berkeley, CA, USA, 2012),
NSDI’12, USENIX Association, pp. 11–11.
[46] DOBRESCU, M., EGI, N., ARGYRAKI, K., CHUN,
B.-G., FALL, K., IANNACCONE, G., KNIES, A.,
MANESH, M., AND RATNASAMY, S. RouteBricks:
Exploiting parallelism to scale software routers. In
Proceedings of the ACM SIGOPS 22nd Symposium on
Operating Systems Principles (New York, NY, USA,
2009), SOSP ’09, ACM, pp. 15–28.
[47] DUNCAN, R., AND JUNGCK, P. packetC Language
for High Performance Packet Processing. In 11th
IEEE International Conference on High Performance
Computing and Communications (2009).
[48] ENNALS, R., SHARP, R., AND MYCROFT, A. Linear
types for packet processing. In Programming
Languages and Systems, D. Schmidt, Ed., vol. 2986 of
Lecture Notes in Computer Science. Springer Berlin
Heidelberg, 2004, pp. 204–218.
[49] ENNALS, R., SHARP, R., AND MYCROFT, A. Task
partitioning for multi-core network processors. In
Compiler Construction, R. Bodik, Ed., vol. 3443 of
Lecture Notes in Computer Science. Springer Berlin
Heidelberg, 2005, pp. 76–90.
[50] ESTAN, C., AND VARGHESE, G. New directions in
traffic measurement and accounting: Focusing on the
elephants, ignoring the mice. ACM Trans. Comput.
Syst. 21, 3 (Aug. 2003), 270–313.
[51] ESTAN, C., VARGHESE, G., AND FISK, M. Bitmap
algorithms for counting active flows on high-speed
links. IEEE/ACM Trans. Netw. 14, 5 (Oct. 2006),
925–937.
[52] FENG, W.-C., SHIN, K. G., KANDLUR, D. D., AND
SAHA, D. The BLUE active queue management
algorithms. IEEE/ACM Trans. Netw. 10, 4 (Aug.
2002), 513–528.
[53] FLOYD, S., AND JACOBSON, V. Random early
detection gateways for congestion avoidance.
IEEE/ACM Trans. Netw. 1, 4 (Aug. 1993), 397–413.
[54] FOSTER, N., HARRISON, R., FREEDMAN, M. J.,
MONSANTO, C., REXFORD, J., STORY, A., AND
WALKER, D. Frenetic: A Network Programming
Language. In ICFP (2011).
[55] GEMBER-JACOBSON, A., VISWANATHAN, R.,
PRAKASH, C., GRANDL, R., KHALID, J., DAS, S.,
AND AKELLA, A. OpenNF: Enabling innovation in
network function control. In Proceedings of the 2014
ACM Conference on SIGCOMM (New York, NY,
USA, 2014), SIGCOMM ’14, ACM, pp. 163–174.
[56] GEORGE, L., AND BLUME, M. Taming the IXP
network processor. In Proceedings of the ACM
SIGPLAN 2003 Conference on Programming
Language Design and Implementation (New York,
NY, USA, 2003), PLDI ’03, ACM, pp. 26–37.
[57] GIBB, G., VARGHESE, G., HOROWITZ, M., AND
MCKEOWN, N. Design principles for packet parsers.
In Architectures for Networking and Communications
Systems (ANCS), 2013 ACM/IEEE Symposium on (Oct
2013), pp. 13–24.
[58] GREENBERG, A., HAMILTON, J. R., JAIN, N.,
KANDULA, S., KIM, C., LAHIRI, P., MALTZ, D. A.,
PATEL, P., AND SENGUPTA, S. VL2: A scalable and
flexible data center network. In SIGCOMM (2009).
[59] HERLIHY, M. Wait-free synchronization. ACM Trans.
Program. Lang. Syst. 13, 1 (Jan. 1991), 124–149.
[60] HONG, C.-Y., CAESAR, M., AND GODFREY, P. B.
Finishing flows quickly with preemptive scheduling.
In SIGCOMM (2012).
[61] JEYAKUMAR, V., ALIZADEH, M., GENG, Y., KIM,
C., AND MAZIÈRES, D. Millions of Little Minions:
Using Packets for Low Latency Network
Programming and Visibility. In SIGCOMM (2014).
[62] JOSE, L., YAN, L., VARGHESE, G., AND
MCKEOWN, N. Compiling Packet Programs to
Reconfigurable Switches. In NSDI (2015).
[63] KANG, N., LIU, Z., REXFORD, J., AND WALKER,
D. Optimizing the "one big switch" abstraction in
software-defined networks. In Proceedings of the
Ninth ACM Conference on Emerging Networking
Experiments and Technologies (New York, NY, USA,
2013), CoNEXT ’13, ACM, pp. 13–24.
[64] KATABI, D., HANDLEY, M., AND ROHRS, C.
Congestion Control for High Bandwidth-Delay
Product Networks. In SIGCOMM (2002).
[65] KELLEY JR, J. E., AND WALKER, M. R.
Critical-path planning and scheduling. In Papers
presented at the December 1-3, 1959, eastern joint
IRE-AIEE-ACM computer conference (1959), ACM,
pp. 160–173.
[66] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J.,
AND KAASHOEK, M. F. The Click Modular Router.
ACM Trans. Comput. Syst. 18, 3 (Aug. 2000),
263–297.
[67] KUNNIYUR, S. S., AND SRIKANT, R. An adaptive
virtual queue (AVQ) algorithm for active queue
management. IEEE/ACM Trans. Netw. 12, 2 (Apr.
2004), 286–299.
[68] LAM, M. Software pipelining: An effective
scheduling technique for VLIW machines. In
Proceedings of the ACM SIGPLAN 1988 Conference
on Programming Language Design and
Implementation (New York, NY, USA, 1988), PLDI
’88, ACM, pp. 318–328.
[69] LOCKWOOD, J. W., MCKEOWN, N., WATSON, G.,
GIBB, G., HARTKE, P., NAOUS, J., RAGHURAMAN,
R., AND LUO, J. NetFPGA–An Open Platform for
Gigabit-Rate Network Switching and Routing. In
IEEE International Conf. on Microelectronic Systems
Education (2007).
[70] MICHELI, G. D. Synthesis and Optimization of
Digital Circuits, 1st ed. McGraw-Hill Higher
Education, 1994.
[71] MONSANTO, C., REICH, J., FOSTER, N., REXFORD,
J., AND WALKER, D. Composing Software-defined
Networks. In NSDI (2013).
[72] MOSHREF, M., BHARGAVA, A., GUPTA, A., YU,
M., AND GOVINDAN, R. Flow-level state transition
as a new switch primitive for SDN. In Proceedings of
the 2014 ACM Conference on SIGCOMM (New York,
NY, USA, 2014), SIGCOMM ’14, ACM, pp. 377–378.
[73] MUCHNICK, S. S. Advanced Compiler Design and
Implementation. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, 1997.
[74] NICHOLS, K., AND JACOBSON, V. Controlling
Queue Delay. ACM Queue 10, 5 (May 2012).
[75] PALKAR, S., LAN, C., HAN, S., JANG, K., PANDA,
A., RATNASAMY, S., RIZZO, L., AND SHENKER, S.
E2: A framework for NFV applications. In Proceedings
of the 25th Symposium on Operating Systems
Principles (New York, NY, USA, 2015), SOSP ’15,
ACM, pp. 121–136.
[76] PAN, R., NATARAJAN, P., PIGLIONE, C., PRABHU,
M., SUBRAMANIAN, V., BAKER, F., AND
VERSTEEG, B. PIE: A lightweight control scheme to
address the bufferbloat problem. In High Performance
Switching and Routing (HPSR), 2013 IEEE 14th
International Conference on (July 2013), pp. 148–155.
[77] PERRY, J., OUSTERHOUT, A., BALAKRISHNAN, H.,
SHAH, D., AND FUGAL, H. Fastpass: A Centralized
“Zero-queue” Datacenter Network. In SIGCOMM
(2014).
[78] PHOTHILIMTHANA, P. M., JELVIS, T., SHAH, R.,
TOTLA, N., CHASINS, S., AND BODIK, R.
Chlorophyll: Synthesis-aided compiler for low-power
spatial architectures. In PLDI (2014), pp. 396–407.
[79] RAU, B. R., AND GLAESER, C. D. Some scheduling
techniques and an easily schedulable horizontal
architecture for high performance scientific
computing. In Proceedings of the 14th annual
workshop on Microprogramming, MICRO 1981,
Chatham (Cape Cod), Massachusetts, USA (1981),
pp. 183–198.
[80] ROY, A., ZENG, H., BAGGA, J., PORTER, G., AND
SNOEREN, A. C. Inside the social network’s
(datacenter) network. In Proceedings of the 2015 ACM
Conference on Special Interest Group on Data
Communication (New York, NY, USA, 2015),
SIGCOMM ’15, ACM, pp. 123–137.
[81] RUBOW, E., MCGEER, R., MOGUL, J., AND
VAHDAT, A. Chimpp: A Click-based programming
and simulation environment for reconfigurable
networking hardware. In ANCS (2010).
[82] SHAH, N. Understanding network processors.
[83] SHAHBAZ, M., AND FEAMSTER, N. The case for an
intermediate representation for programmable data
planes. In SOSR (2015), pp. 3:1–3:6.
[84] SHERRY, J., HASAN, S., SCOTT, C.,
KRISHNAMURTHY, A., RATNASAMY, S., AND
SEKAR, V. Making middleboxes someone else’s
problem: Network processing as a cloud service. In
Proceedings of the ACM SIGCOMM 2012 Conference
on Applications, Technologies, Architectures, and
Protocols for Computer Communication (New York,
NY, USA, 2012), SIGCOMM ’12, ACM, pp. 13–24.
[85] SINGH, A., ONG, J., AGARWAL, A., ANDERSON,
G., ARMISTEAD, A., BANNON, R., BOVING, S.,
DESAI, G., FELDERMAN, B., GERMANO, P.,
KANAGALA, A., PROVOST, J., SIMMONS, J.,
TANDA, E., WANDERER, J., HÖLZLE, U., STUART,
S., AND VAHDAT, A. Jupiter rising: A decade of Clos
topologies and centralized control in Google’s
datacenter network. In Proceedings of the 2015 ACM
Conference on Special Interest Group on Data
Communication (New York, NY, USA, 2015),
SIGCOMM ’15, ACM, pp. 183–197.
[86] SINHA, S., KANDULA, S., AND KATABI, D.
Harnessing TCP’s Burstiness using Flowlet Switching.
In 3rd ACM SIGCOMM Workshop on Hot Topics in
Networks (HotNets) (San Diego, CA, November
2004).
[87] SIVARAMAN, A., KIM, C., KRISHNAMOORTHY, R.,
DIXIT, A., AND BUDIU, M. Dc.p4: Programming the
forwarding plane of a data-center switch. In
Proceedings of the 1st ACM SIGCOMM Symposium
on Software Defined Networking Research (New York,
NY, USA, 2015), SOSR ’15, ACM, pp. 2:1–2:8.
[88] SIVARAMAN, A., SUBRAMANIAN, S., AGRAWAL,
A., CHOLE, S., CHUANG, S.-T., EDSALL, T.,
ALIZADEH, M., KATTI, S., MCKEOWN, N., AND
BALAKRISHNAN, H. Towards programmable packet
scheduling. In Proceedings of the 14th ACM Workshop
on Hot Topics in Networks (New York, NY, USA,
2015), HotNets-XIV, ACM, pp. 23:1–23:7.
[89] SOLAR-LEZAMA, A., RABBAH, R., BODÍK, R.,
AND EBCIOĞLU, K. Programming by sketching for
bit-streaming programs. In Proceedings of the 2005
ACM SIGPLAN Conference on Programming
Language Design and Implementation (New York,
NY, USA, 2005), PLDI ’05, ACM, pp. 281–294.
[90] SOLAR-LEZAMA, A., TANCAU, L., BODIK, R.,
SESHIA, S., AND SARASWAT, V. Combinatorial
sketching for finite programs. In Proceedings of the
12th International Conference on Architectural
Support for Programming Languages and Operating
Systems (New York, NY, USA, 2006), ASPLOS XII,
ACM, pp. 404–415.
[91] STANLEY, S. Roving reporter: Reference platforms
for SDN and NFV.
https://embedded.communities.intel.com/community/en/hardware/blog/2013/06/03/roving-reporter-reference-platforms-for-sdn-and-nfv.
[92] STOICA, I., SHENKER, S., AND ZHANG, H.
Core-stateless fair queueing: A scalable architecture to
approximate fair bandwidth allocations in high-speed
networks. IEEE/ACM Trans. Netw. 11, 1 (Feb. 2003),
33–46.
[93] TAI, C., ZHU, J., AND DUKKIPATI, N. Making Large
Scale Deployment of RCP Practical for Real
Networks. In INFOCOM (2008).
[94] TENNENHOUSE, D. L., AND WETHERALL, D. J.
Towards an active network architecture. In DARPA
Active NEtworks Conference and Exposition, 2002.
Proceedings (2002), IEEE, pp. 2–15.
[95] VOELLMY, A., WANG, J., YANG, Y. R., FORD, B.,
AND HUDAK, P. Maple: Simplifying SDN
programming using algorithmic policies. In
Proceedings of the ACM SIGCOMM 2013 Conference
on SIGCOMM (New York, NY, USA, 2013),
SIGCOMM ’13, ACM, pp. 87–98.
[96] WILSON, C., BALLANI, H., KARAGIANNIS, T., AND
ROWTRON, A. Better never than late: Meeting
deadlines in datacenter networks. In SIGCOMM
(2011).
[97] WU, W., HE, K., AND AKELLA, A. PerfSight:
Performance diagnosis for software dataplanes. In
Proceedings of the 2015 ACM Conference on Internet
Measurement Conference (New York, NY, USA,
2015), IMC ’15, ACM, pp. 409–421.
[98] YU, M., JOSE, L., AND MIAO, R. Software defined
traffic measurement with OpenSketch. In Proceedings
of the 10th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 13)
(Lombard, IL, 2013), USENIX, pp. 29–42.
[99] ZATS, D., DAS, T., MOHAN, P., BORTHAKUR, D.,
AND KATZ, R. DeTail: Reducing the flow completion
time tail in datacenter networks. In SIGCOMM (2012).
