Relaxing constraints in stateful network data plane design by Cascone, Carmelo et al.
Relaxing constraints in stateful network data plane design
Carmelo Cascone‡, Roberto Bifulco∗, Salvatore Pontarelli+
‡ Politecnico di Milano, ∗ NEC Laboratories Europe, + CNIT/Univ. Roma Tor Vergata
carmelo.cascone@polimi.it, roberto.bifulco@neclab.eu, salvatore.pontarelli@uniroma2.it
1. INTRODUCTION
Modern network devices have to meet stringent perfor-
mance requirements while providing support for a growing
number of use cases and applications. In such a context, a
programmable network data plane has emerged as an impor-
tant feature of modern forwarding elements, such as switches
and network cards. Bosshart et al. [1] introduced RMT,
a first example of a high-performance programmable data
plane. RMT provides a reconfigurable switching ASIC that
can parse and modify arbitrary packet headers in a pipeline
of match-action tables (MAT). Interestingly, [1] shows that
such programmability can be supported with performance
comparable to state-of-the-art fixed-function chips: it can
process packets at a line rate of 640 Gb/s.
More recently, Sivaraman et al. [2] presented an abstrac-
tion for a switching ASIC, named Banzai, which supports
the programming of stateful packet processing functions. The
statefulness lies in the ability to create and modify state
while processing a packet, enabling the definition of func-
tions that depend on the history of previously received pack-
ets. Such functions enable complex applications such as
stateful firewalls, active queue management, scheduling, mon-
itoring, etc.
Banzai extends RMT’s MATs by adding stateful actions,
named “atoms”. Each atom, as the name suggests, performs
state operations atomically. The atomicity is required to
guarantee consistency, i.e., read and write operations to an
atom’s memory area cannot be performed by different pack-
ets at the same time. In effect, Banzai requires the serial pro-
cessing of all the packets. This model is convenient since a
forwarding element’s data plane is already processing pack-
ets in a serial manner.
However, to meet a given performance target, the serial
processing model requires the definition of a strict time bud-
get for the processing of each packet. For instance, in the
case of RMT, the switching ASIC is dimensioned to process
640 Gb/s with minimum size Ethernet packets (64 bytes),
which translates to a time budget of 1 ns per packet (see
footnote 1). Likewise, the chip clock frequency is dimensioned
according to the desired target throughput. In the previous
case, a 1 GHz clock is used to provide the 640 Gb/s. The final
outcome is
that each atom has to perform state read, modification and
write operations in at most 1 ns, i.e., 1 clock cycle.
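The time budget follows directly from the line rate and the packet size. The short Python sketch below computes it; the function name `time_budget_ns` is ours and is introduced only for illustration. Under this arithmetic a 64-byte packet gives 0.8 ns at 640 Gb/s, while an 80-byte unit gives exactly 1 ns, which appears consistent with the remark in footnote 1 that the minimum size used in the switch may be larger.

```python
def time_budget_ns(line_rate_gbps: float, packet_bytes: int) -> float:
    """Time available per packet, in ns, when packets arrive back-to-back.

    Hypothetical helper for illustration; not part of RMT or Banzai.
    """
    # bits divided by Gb/s yields nanoseconds
    return packet_bytes * 8 / line_rate_gbps


if __name__ == "__main__":
    print(time_budget_ns(640, 64))  # budget for 64-byte packets
    print(time_budget_ns(640, 80))  # budget for 80-byte units (one cycle at 1 GHz)
```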
Unfortunately, while providing line rate guarantees, Banzai
cannot implement more complex functions whose atoms cannot be
executed in the available time budget. E.g., as explained in
[2], it is not feasible to implement a square root operation in
1 clock cycle at 1 GHz with a standard 32-nm process.
Footnote 1: Actually, the minimum size used in the switch may
be larger.
A solution to this problem could be the partitioning of a
complex action’s execution over multiple clock cycles. How-
ever, if read-modify-write operations happen in two or more
distinct clock cycles, the system can easily end up in an in-
consistent state. For example, consider an action that imple-
ments a packet counter in two cycles, and assume two pack-
ets arriving back-to-back. The second packet would cause
a read from memory before the processing of the first could
write back the incremented counter value, leading to wrong
counting. Locking the memory would prevent inconsisten-
cies, but it would also cause the second packet to wait for
the first one to update the memory, potentially reducing the
switch’s throughput.
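The lost update in the counter example can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: the function `count_packets` is hypothetical, a packet reads the counter in the cycle it arrives, writes it back `depth - 1` cycles later, and a read issued in a given cycle does not see a write committed in that same cycle.

```python
def count_packets(arrival_cycles, depth):
    """Simulate an unlocked read-modify-write packet counter.

    arrival_cycles: clock cycles at which packets enter the action.
    depth: number of cycles between the read and the write-back (inclusive).
    Returns the counter value once every packet has written back.
    """
    if not arrival_cycles:
        return 0
    memory = 0
    pending = []  # (cycle in which the write commits, value to write)
    arrivals = set(arrival_cycles)
    for cycle in range(max(arrival_cycles) + depth):
        if cycle in arrivals:
            # Read phase: the packet sees the value stored at the start of the cycle.
            pending.append((cycle + depth - 1, memory + 1))
        # Write phase: commit the write-backs that are due in this cycle.
        for due, value in pending:
            if due == cycle:
                memory = value
        pending = [(due, value) for due, value in pending if due > cycle]
    return memory


if __name__ == "__main__":
    print(count_packets([0, 1], depth=1))  # single-cycle atom, back-to-back packets
    print(count_packets([0, 1], depth=2))  # two-cycle action, back-to-back packets
    print(count_packets([0, 2], depth=2))  # two-cycle action, one idle cycle in between
```

In this model the two-cycle action counts two back-to-back packets as one, while a single idle cycle between them is enough to restore the correct count.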
While there seems to be little that could be done to improve
on the consistency/throughput trade-off, we notice that the
design described above dimensions the data plane for a
workload that is rarely observed in practice. In many common
scenarios, network packets have an average size that is larger
than the minimum size. Therefore, in this work we explore
opportunities to relax the constraint of requiring stateful ac-
tions to complete execution in 1 clock cycle. Instead, we al-
low functions that span multiple clock cycles and verify the
outcome of such a decision when processing a real traffic
trace. In particular, we assess both the risk of harming con-
sistency, in RMT-like architectures, and the potential cost in
terms of throughput when introducing locking to still pro-
vide consistency guarantees.
2. CONTRIBUTION
Background: store-and-forward architectures. A design that
requires the execution of stateful actions in only 1 clock
cycle derives from the worst-case assumption that all packets
have minimum size and that they arrive back-to-back, i.e.,
with no inter-packet gaps. Instead, packets produced by
today’s applications have variable size, e.g., spanning from
64 bytes to 1500 bytes in a typical case.
Switches like RMT consume packets in a store-and-forward
fashion, i.e., packets are first completely read from input
ports, then they are parsed and their headers processed by
MATs. With RMT, to sustain a line rate of 640 Gb/s with a chip
clocked at 1 GHz, we can assume packets are read in chunks of
at most 80 bytes (i.e., 80 × 8 bit × 1 GHz = 640 Gb/s).
Consequently, it will take 1 clock cycle to read packets of
size ≤ 80 bytes, while it will take more cycles to read longer
packets, e.g., 19 for 1500 bytes. Still, the switch’s pipeline
can process the parsed headers in one clock cycle,
independently of the packet size. The result is that, even
when all packets arrive back-to-back, the variability of the
packet size will cause the pipeline to experience one or more
idle cycles. Can we exploit this consideration to relax the
atomicity constraint?
arXiv:1702.02347v1 [cs.NI] 8 Feb 2017
Figure 1: Trace-based simulation results. (a) Cumulative
fraction vs. packet size (bytes). (b) % concurrency hazard vs.
pipeline depth (clock cycles). (c) Relative throughput vs.
pipeline depth (clock cycles). Legend: α = 0.0, 0.2, 0.4, 0.6,
0.8, 1.0.
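The store-and-forward arithmetic above (80-byte chunks, one chunk per clock cycle at 1 GHz) can be checked with a few lines of Python. The helper names `read_cycles` and `idle_cycles` are ours; the sketch assumes, as the text does, that the pipeline needs one cycle per parsed header.

```python
CHUNK_BYTES = 80  # bytes read per clock cycle (from the text)
CLOCK_HZ = 1e9    # 1 GHz chip clock (from the text)


def read_cycles(packet_bytes, chunk_bytes=CHUNK_BYTES):
    """Clock cycles needed to read a whole packet, one chunk per cycle."""
    return max(1, -(-packet_bytes // chunk_bytes))  # ceiling division


def idle_cycles(packet_bytes):
    """Cycles with no new header to process while this packet is being read.

    Assumes the pipeline spends exactly one cycle on each parsed header.
    """
    return read_cycles(packet_bytes) - 1


if __name__ == "__main__":
    print(CHUNK_BYTES * 8 * CLOCK_HZ / 1e9)  # line rate in Gb/s
    for size in (64, 80, 1500):
        print(size, read_cycles(size), idle_cycles(size))
```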
Trace-based simulations. We run simulations using real
traffic traces in order to (i) assess the risk of harming con-
sistency when using stateful functions that span many clock
cycles, and to (ii) evaluate the throughput when introducing
locking to provide consistency guarantees. We used packet
traces from a US backbone link collected in 2015 [3]. We ac-
celerate the trace to drive 100% utilization, removing inter-
packet gaps. Figure 1a shows the cumulative distribution
function (CDF) of the packet size found in the trace. We fur-
ther define α as a parameter to modify the size of each packet
in order to gradually approximate the worst case. When
α = 0 all packets have minimum size, i.e., our worst case.
Instead, with α = 1, we use the trace’s original packet size
distribution. Finally, we model the packet size modification
function such that intermediate values of α represent
realistic cases, e.g., when α = 0.2 the CDF is similar to that
found in Facebook’s datacenter [4].
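The exact form of the size modification function is not given here. Purely as an assumption, one simple function that matches the two stated endpoints (minimum size at α = 0, original size at α = 1) is a linear interpolation, sketched below with the hypothetical name `scale_size`. The function actually used for the simulations may differ, and a linear map is not guaranteed to reproduce the Facebook-like CDF at α = 0.2.

```python
MIN_SIZE = 64  # minimum Ethernet packet size in bytes (from the text)


def scale_size(original_bytes, alpha, min_size=MIN_SIZE):
    """Shrink a packet towards the minimum size (assumed linear form).

    alpha = 0 maps every packet to min_size; alpha = 1 keeps the original size.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return round(min_size + alpha * (original_bytes - min_size))


if __name__ == "__main__":
    for alpha in (0.0, 0.2, 0.5, 1.0):
        print(alpha, scale_size(1500, alpha))
```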
Assuming a stateful action implemented as a pipeline of many
instructions, where the first reads from the memory and the
last writes back, we define a “concurrency hazard” as the
event in which the first instruction of the action pipeline
processes a packet while another packet is still travelling in
the same pipeline. Figure 1b shows the percentage of time such
an event occurs with the chosen traffic trace, while varying
the pipeline depth for different values of α. Clearly, when
α = 0 the risk of concurrency hazard is already maximum with
pipelines of 2 instructions, since we have exactly one packet
per clock cycle to process. However, as α tends to 1, the
hazard grows more slowly with the pipeline depth. When α = 1,
the risk of concurrency hazard is lower than 5% with action
pipelines of up to 19 clock cycles. This result suggests that a
locking strategy to prevent inconsistencies could be
applicable, with marginal harm to the throughput, if any.
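The hazard definition above can be turned into a small counting routine. The sketch below is not the simulator used for Figure 1b: it is a simplified model with hypothetical names (`read_cycles`, `hazard_fraction`), it measures hazards per packet rather than per unit of time, and it assumes back-to-back packets read in 80-byte chunks, so it should not be expected to reproduce the reported percentages.

```python
def read_cycles(packet_bytes, chunk_bytes=80):
    """Clock cycles needed to read a whole packet, one chunk per cycle."""
    return max(1, -(-packet_bytes // chunk_bytes))  # ceiling division


def hazard_fraction(packet_sizes, depth):
    """Fraction of packets entering the action pipeline while it is occupied.

    A packet entering at cycle c occupies the pipeline for cycles
    c .. c + depth - 1 (assumption: one instruction per cycle).
    """
    if not packet_sizes:
        return 0.0
    hazards = 0
    busy_until = -1  # last cycle in which an earlier packet is in the pipeline
    cycle = 0
    for size in packet_sizes:
        cycle += read_cycles(size)  # store-and-forward: enter once fully read
        if cycle <= busy_until:
            hazards += 1
        busy_until = max(busy_until, cycle + depth - 1)
    return hazards / len(packet_sizes)


if __name__ == "__main__":
    worst_case = [64] * 10  # synthetic input, not taken from the trace
    for depth in (1, 2, 5):
        print(depth, hazard_fraction(worst_case, depth))
```

With minimum-size packets only, every packet after the first runs into a hazard as soon as the depth reaches 2, which matches the α = 0 behaviour described in the text.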
We now consider the case of a trivial locking strategy that
allows only 1 packet at a time in the action pipeline. Clearly,
locking affects throughput but, most importantly, voids the
guarantee of deterministic performance. Figure 1c shows that
the throughput is not reduced with action pipelines of up to
10 clock cycles when α = 1, i.e., when processing traffic with
characteristics similar to a backbone link. In fact, larger
packets provide the pipeline with additional time to process
smaller packets that may have been queueing because of the
locking. Conversely, maximum throughput is guaranteed only up
to 3 clock cycles when α = 0.2, i.e., with traffic similar to
that found in Facebook’s datacenters.
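The effect of the trivial lock can be illustrated with a minimal queueing sketch. The model below is our own simplification, not the simulator behind Figure 1c: it assumes an unbounded queue in front of the action pipeline, back-to-back packets read in 80-byte chunks, and it reports the ratio between the completion time without contention and the completion time with the lock. The names `read_cycles` and `relative_throughput` are hypothetical.

```python
def read_cycles(packet_bytes, chunk_bytes=80):
    """Clock cycles needed to read a whole packet, one chunk per cycle."""
    return max(1, -(-packet_bytes // chunk_bytes))  # ceiling division


def relative_throughput(packet_sizes, depth):
    """Completion time without contention divided by completion time with a lock.

    The lock admits one packet at a time and holds it for `depth` cycles;
    waiting packets are queued without loss (assumption).
    """
    if not packet_sizes:
        return 1.0
    ready = 0         # cycle in which the current packet has been fully read
    lock_free_at = 0  # first cycle in which the action pipeline is free again
    for size in packet_sizes:
        ready += read_cycles(size)
        start = max(ready, lock_free_at)  # wait if the pipeline is locked
        lock_free_at = start + depth
    ideal_finish = ready + depth  # last packet never waits in the ideal case
    return ideal_finish / lock_free_at


if __name__ == "__main__":
    print(relative_throughput([64] * 1000, depth=2))      # synthetic worst case
    print(relative_throughput([64, 64, 1500], depth=2))   # a large packet absorbs the backlog
    print(relative_throughput([1500] * 100, depth=10))    # synthetic large packets only
```

In the second call the 1500-byte packet takes 19 cycles to read, which gives the pipeline enough time to drain the packet queued behind the lock.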
3. FUTURE WORK
Our early results suggest that there could be several cases
in which switches may apply operations that require multi-
ple clock cycles per packet, while still maintaining line-rate
throughput and consistency. Therefore, blocking architec-
tures could be a viable option for implementing complex
stateful actions in forwarding elements.
Our work is, however, at an early stage. Besides the need to
verify our assumptions against a larger number of traffic
traces from different scenarios, there are still a number of
unexplored issues. For instance, in several cases packets
belonging to different flows do not read/write each other’s
state, a characteristic that could be exploited to implement a
smarter locking scheme. Furthermore, packet processing actions
may be different in length, with some taking
longer than others. If multi-cycle actions are uncommon,
there could be the possibility to implement some very com-
plex actions together with a larger set of simpler ones.
Exploring the above points and providing an effective hard-
ware implementation that can leverage the corresponding
findings is part of our future work.
References
[1] P. Bosshart et al. “Forwarding Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN”. In: ACM SIGCOMM ’13.
[2] A. Sivaraman et al. “Packet Transactions: High-Level Programming for Line-Rate Switches”. In: ACM SIGCOMM ’16.
[3] The CAIDA UCSD Anonymized Internet Traces - 2015-02-19. http://www.caida.org.
[4] A. Roy et al. “Inside the Social Network’s (Datacenter) Network”. In: ACM SIGCOMM ’15.
