Towards a Stateful Forwarding Abstraction to Implement Scalable Network
  Functions in Software and Hardware by Petrucci, Luca et al.
Luca Petrucci‡, Nicola Bonelli∗, Marco Bonola‡, Gregorio Procissi∗, Carmelo Cascone+,
Davide Sanvito+, Salvatore Pontarelli‡, Giuseppe Bianchi‡, Roberto Bifulco†
∗ CNIT/University of Pisa ‡ CNIT/University of Rome Tor Vergata
+ CNIT/Politecnico di Milano † NEC Laboratories Europe
ABSTRACT
An effective packet processing abstraction that leverages soft-
ware or hardware acceleration techniques can simplify the
implementation of high-performance virtual network func-
tions. In this paper, we explore the suitability of SDN switches’
stateful forwarding abstractions to model accelerated func-
tions in both software and hardware accelerators, such as op-
timized software switches and FPGA-based NICs. In partic-
ular, we select an Extended Finite State Machine abstraction
and demonstrate its suitability by implementing Linux’s
iptables interface. By doing so, we provide the accelera-
tion of functions such as stateful firewalls, load balancers
and dynamic NATs. We find that supporting a flow-level
programming consistency model is an important feature of
a programming abstraction in this context. Furthermore,
we demonstrate that such a model simplifies the scaling of
the system when implemented in software, enabling efficient
multi-core processing without harming state consistency.
1. INTRODUCTION
Network functions, commonly implemented using hardware
middleboxes, are being transformed into virtualized software
appliances that run on commodity servers [33]. Network
operators are supporting this trend [42], usually called
Network Function Virtualization (NFV) [3]. Virtual network
functions (VNFs) have a number of advantages when com-
pared to legacy hardware ones. They can be dynamically
created, updated, migrated and run on commodity servers.
However, developing VNFs for carriers is a hard task [41]. A
relevant challenge is the need to meet strict requirements in
terms of performance and reliability, while running in gen-
eral purpose, multi-tenant (i.e., virtualized) environments.
Furthermore, the required packet forwarding performance
is a continuously rising bar. A commodity server is usually
equipped with a couple of 10Gbps network interfaces, and
40Gbps interfaces are becoming common. Unfortunately,
current general purpose systems’ speed is not growing as fast
as the network interfaces speed [59]. Therefore, carriers in-
creasingly consider the introduction of network accelerators,
based on smart NICs, in their VNF designs [40, 39, 58]. In
fact, smart NICs help in supporting the implementation of
hardware accelerators as required by future network func-
tions1, without waiting for the long implementation cycles
experienced with traditional NICs [26].
An effective way to simplify the implementation of a high-
performance network function is to separate the function’s
fast path from its control path [29]. Here, the data and control
planes separation of the OpenFlow’s architecture is a rele-
vant example [34]. The OpenFlow controller implements the
slow path using high-level languages such as C, Java, Python,
etc. The switch implements the fast path using a model that
comprises a pipeline of match-action tables (MATs). There-
fore, a network function’s developer can rely on the already
optimized implementation of the MAT abstraction, e.g., pro-
vided by a fast software switch [43], for implementing a
high-performance function’s fast path.
Unfortunately, OpenFlow (and similar SDN abstractions)
is too constrained. In essence, only packet classification
based on L2-L4 protocol headers and some header rewriting
capabilities are exposed, limiting its applicability only to the
definition of trivial VNFs [23].
Fast Path Abstractions Ideally, we would like to define a
fast path abstraction that simplifies the implementation of
high-performance VNFs, while enabling a seamless adoption
of future hardware accelerators. In other words, we want to
answer the following questions: is it possible to define a com-
mon abstraction to efficiently model software and hardware
VNFs fast path functions? If such an abstraction exists, can it
model a relevant set of functions? Can the abstraction support
the performance, runtime reconfiguration and multi-tenancy
requirements of NFV environments?
An important observation is that the already mentioned
OpenFlow’s abstraction, based on MATs, is a good match
for many of the requirements we need to fulfill. First, it can
be implemented in software with high performance [43] and
is supported by hardware implementations. As a matter of
fact, NIC vendors are already leveraging OpenFlow-like mod-
els for the programming of NIC’s functions [12]. Second,
OpenFlow devices are configured at runtime by changing
the flow entries in the MATs, without interrupting the device
functions. This enables scenarios in which VNFs are dynamically
deployed or migrated. Third, MAT-based abstractions
have been proved effective to implement virtualization and
multi-tenancy [49].
1Such a need has been recently recognized also by standard
organizations such as ETSI ISG NFV, which is defining hardware
accelerators’ capabilities in their specifications for NFV platforms [15].
arXiv:1611.02853v1 [cs.NI] 9 Nov 2016
Unfortunately, many of the abstractions that extend a MAT-based
model to support more flexible functions target only
hardware implementations (cf. Sec. 2). This is particularly
true when the abstraction supports the definition of network
functions that explicitly manage some sort of algorithmic
state, i.e., for so-called stateful abstractions. As a result,
there is either a lack of a suitable consistency model to de-
scribe stateful functions, or there is a consistency model that
cannot be easily implemented (and scaled) in, e.g., a software
environment based on general purpose CPUs.
Contribution In this paper, we build on top of an extension
of OpenState [4], i.e., the Open Packet Processor (OPP) [5,
32], which maintains an OpenFlow-like model for program-
ming network devices, but tweaks it for the implementation of
an Extended Finite State Machine (EFSM) abstraction [13].
That is, we believe that OPP may improve on the lack of
flexibility of OpenFlow, while still retaining most of its pre-
viously outlined benefits. Furthermore, OPP introduces a
flow-level state consistency model, which eases a scalable
implementation of the abstraction in software.
Therefore, we provide two main contributions. First, we
demonstrate that the OPP abstraction is flexible and can in
fact model a relevant set of commonly used network functions.
In particular, we implement Linux’s iptables interface (cf.
Sec. 3) on top of OPP. Supporting iptables has the following
benefits: it enables the entire set of functions supported by ipt-
ables, such as stateful firewalls, load balancers and dynamic
NATs, to name a few; it provides an interface that is already
commonly used [18] and which is well-known by developers,
while still enabling the implementation of additional, perhaps
more innovative, new functions2. Second, we provide an
OPP software implementation that can dynamically scale to
run on multiple CPU cores, maintaining the consistency of
the implemented stateful network functions (cf. Sec. 4). We
identify in the OPP’s flow level state consistency the key ab-
straction to enable a scalable software implementation. Since
there is already an OPP implementation for the NetFPGA
SUME [58], our software implementation demonstrates that
OPP can be implemented efficiently in both hardware and
software.
2. STATEFUL ABSTRACTIONS
In this section, we briefly recap our requirements for a
stateful forwarding abstraction for the data plane, and review
currently available ones that suit our needs. Finally, we
provide a primer of OPP, which is our selected abstraction.
2We remark that such an approach has been already proven suc-
cessful in the past, since familiar abstractions enable one to re-use
current implementations, policies and configurations without modi-
fication [30].
Requirements. As previously mentioned, we are interested
in selecting an abstraction that is flexible enough to model
several network functions’ data plane while fulfilling the
following requirements: (i) it is efficiently implementable
in both software and hardware (i.e., smart NICs); (ii) it en-
ables the deployment of new functions at runtime; (iii) it
can support multiple concurrent functions. The OpenFlow’s
MAT-based abstraction fulfills the above three requirements,
but fails to model an important set of network functions that
need to keep algorithmic state. We refer to this set of
functions as stateful functions. Therefore, a stateful abstraction is
one that can model stateful functions.
Given the good match of a MAT-based abstraction with
our requirements, it is reasonable to explore extensions of
such abstraction. Hopefully, one could retain the required
properties while adding the flexibility that is lacking in the
current OpenFlow abstraction.
P4. Recently, a number of proposals started to investigate
stateful forwarding abstractions in the fast path, using a
pipeline of MATs as base model. Most notably, the P4 lan-
guage [9] allows a programmer to describe stateful functions
using a pipeline of configurable MATs, which is defined in a
hardware independent manner. P4 can be then compiled to
a number of targets, for instance a smart NIC [28] or a soft-
ware switch [47]. To describe stateful functions, P4 provides
registers that can be accessed within the MATs pipeline to
keep state (and manipulate it).
While the hardware independent nature of P4 makes it
appealing, we found that none of the currently available ver-
sions of P4 [54, 53] defines a concurrency model that helps in
managing the stored state. In other words, P4 cannot model
assumptions on how the data (registers) is read or written
and, therefore, it is hard to provide guarantees in terms of
state consistency in actual implementations. This is a fun-
damental need when dealing with state that can be shared
and manipulated concurrently, e.g., by different stages of
a packet processing pipeline. In fact, such information is
used to arbitrate and optimize memory accesses in actual
implementations. For instance, NetASM [48], an abstract
intermediate representation for programmable pipelined data
planes, captures this information by distinguishing between
per packet and persistent states.
As a result, in P4 the only way in which state consistency,
hence algorithm correctness, is guaranteed is to assume that
the pipeline processes one packet at a time. Such an assumption
is of course not acceptable, since pipelining of packets is
required to provide high performance in hardware3.
POF. POF [52] enables the definition of flexible MATs, with
actions that could be used to read and modify state. Since
considerations similar to those made for P4 apply here, in the
interest of space we do not describe POF in detail in this paper.
3We are aware that the P4 Language Design Working Group is
currently working on a concurrency model for the next release of the
language. However, at the time of writing, this aspect is still work
in progress and, to the best of our knowledge, we are not aware of
any target hardware platform implementing such specification.
Banzai. Another relevant proposal is Domino/Banzai [51].
Here, the authors propose both a Domain Specific Language
(DSL), Domino, and a machine model, Banzai, for the def-
inition and implementation at line rate of stateful packet
processing functions. The Banzai machine exposes a set of
specialized stateful processing instructions called “atoms”,
used to implement the action part of a MAT. Atoms have
two main limitations: they must execute their operations
atomically and cannot share state with other atoms. Differ-
ently from P4, a program written in Domino, if accepted by
the Domino’s compiler, offers strong consistency guarantees.
The compiler is responsible for finding a mapping of the portions
of a Domino program to the available atoms. Depending on
the expressiveness of the atoms, [51] shows that a fair amount
of stateful functions can be expressed.
Unfortunately, the Banzai model has a few issues when
applied to our context. First, the model is designed to work
with line rate switches. That is, it is based on the assumption
of having a sequential pipelined processor. In software, this
assumption is not true anymore. As a result, unless only one
of the CPU’s cores is used to process sequentially all the
packets, limiting scalability, the assumption that atom’s state
can be atomically updated would require a software imple-
mentation to resort to expensive resource locking techniques.
That is, it would need to provide mutually exclusive access
from different CPU’s cores to the atom’s state. Furthermore,
such locking should happen for each processed packet and
for each pipeline’s stage. Second, having a consistency guar-
antee based on the atomic execution of state updates frees
the programmer from stating their assumptions on how state
is accessed. This simplifies the programming of functions
in Domino, but brings in the same issues mentioned for P4.
That is, it leaves unspecified which portion of the data a given
function reads or writes. Finally, at the time of writing, we
are not aware of any Banzai machine implementation for
smart NICs.
2.1 Open Packet Processor
The abstractions mentioned so far are all lacking explicit
information about which part of the overall state is used for
the implementation of a given algorithm. Nonetheless, in
most of the cases each packet actually belongs to one of
several different state contexts. For instance, in the case
of a stateful firewall, each bi-directional flow has its own
state, e.g., new or established, and two flows don’t usually
affect each other’s state. As in the case of NetASM, using an
abstraction that can identify and isolate these contexts helps
in solving state consistency problems on a per-context level
rather than on a global level.
Notable examples of stateful forwarding abstractions that
provide the identification of state contexts are FAST [38]
and OpenState [4]. Since the approaches are quite similar,
we will be focusing on the latter, with which we are more
familiar. Also, OpenState recently evolved into the Open
Packet Processor (OPP), which provides a more powerful
abstraction while maintaining the core of the OpenState ideas.
In OPP, Extended Finite State Machines (EFSM) are used
to model stateful forwarding algorithms. We leave aside the
details of the EFSM implementation, to which we will come
back shortly, to point instead to an important property of
the OPP’s abstraction. That is, the need to explicitly define
which part of the overall device’s state a packet is going to
modify during its processing. As in FAST, such operation is
done by defining the state on a per-flow basis. An arbitrary
combination of packet’s header values4 is used to define the
portion of the state such packet is going to (and allowed to)
use. The state identified in this way is called flow context.
Furthermore and differently from FAST, in OPP a packet can
read one flow context and update a different one. In fact, it is
possible to specify different combinations of packet’s header
values for the read and update operations of a flow context.
This last property is named cross-flow state update.
The explicit handling of flow contexts and the support for
cross-flow state updates helps in the implementation of a
number of functions. For instance, one could implement a
simple L2 learning switch, or more complex applications,
as we will show later presenting an implementation of the
iptables’ interface (Sec. 3). The rest of this section presents
the OPP’s machine model.
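As an illustration of a cross-flow state update, the following Python sketch models the L2 learning switch mentioned above with a single stateful stage. It is only a sketch under our own assumptions: the class name, the packet dictionary layout and the encoding of the state label (port number plus one, with 0 meaning "unknown destination") are hypothetical and are not taken from the OPP implementation.

```python
# Hypothetical sketch: names and the state encoding are illustrative assumptions.
FLOOD = "flood"


class LearningSwitchStage:
    """One stateful stage whose flow context holds only a state label."""

    def __init__(self):
        # Flow context table: key -> state label (port + 1, 0 = unknown).
        self.contexts = {}

    def process(self, pkt):
        # Lookup key extractor: the context is selected by the destination MAC.
        state = self.contexts.get(pkt["eth_dst"], 0)
        out = FLOOD if state == 0 else state - 1
        # Update key extractor: the context written is the one of the source
        # MAC (cross-flow state update), recording the sender's port.
        self.contexts[pkt["eth_src"]] = pkt["in_port"] + 1
        return out


stage = LearningSwitchStage()
first = stage.process({"eth_src": "aa", "eth_dst": "bb", "in_port": 0})
reply = stage.process({"eth_src": "bb", "eth_dst": "aa", "in_port": 2})
again = stage.process({"eth_src": "aa", "eth_dst": "bb", "in_port": 0})
```

The first packet is flooded because nothing is known about its destination, while the following ones are sent to the port learned from earlier traffic in the opposite direction.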
2.1.1 Machine model
The OPP machine model extends the MATs pipeline model
assumed by OpenFlow. MATs are substituted with stages,
which can be either stateless or stateful. A stateless stage is in
fact an OpenFlow-like MAT. The pipeline processes packets’
headers to define corresponding forwarding behaviors. OPP
assumes packets headers are already parsed when passed to
the pipeline, therefore, OPP can potentially leverage related
work on programmable packet parsing and reconfigurable
match tables [19, 10]. A stateful stage (Fig. 1) adds a number
of elements to a plain MAT.
Flow context. First, when a packet header enters the stage,
a lookup key extractor builds a key that uniquely identifies a
flow context for such packet. The extractor is programmed
at runtime by specifying a list of relevant header fields and
packet’s metadata (e.g., the TCP/UDP 4-tuple or the in port).
The key is then used to extract the flow context from the flow
context table. The context includes a state label $s$ and an
array of registers $\vec{R} = \{r_0, r_1, \ldots, r_{k-1}\}$. If no context is
found for a given key, a default context is used (i.e., with
all values set to 0). Furthermore, each flow context can be
associated with a (hard or idle) timeout.
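The lookup step can be sketched in Python as follows. This is a simplified model under our own assumptions: the class and field names are hypothetical, the current time is passed explicitly, only idle timeouts are modelled (hard timeouts are omitted), and a dictionary stands in for the flow context table.

```python
from dataclasses import dataclass, field

K = 4  # registers per flow context (illustrative value)


@dataclass
class FlowContext:
    state: int = 0
    registers: list = field(default_factory=lambda: [0] * K)
    idle_timeout: float = 0.0  # 0 means "no timeout" in this sketch
    last_seen: float = 0.0


class FlowContextTable:
    def __init__(self, key_fields):
        # The list of fields is what would be programmed at runtime.
        self.key_fields = tuple(key_fields)
        self.table = {}

    def extract_key(self, pkt):
        return tuple(pkt[f] for f in self.key_fields)

    def lookup(self, pkt, now):
        key = self.extract_key(pkt)
        ctx = self.table.get(key)
        expired = (
            ctx is not None
            and ctx.idle_timeout > 0
            and now - ctx.last_seen > ctx.idle_timeout
        )
        if expired:
            del self.table[key]  # idle timeout elapsed: evict the context
            ctx = None
        if ctx is None:
            return key, FlowContext()  # default context, all values set to 0
        ctx.last_seen = now
        return key, ctx


table = FlowContextTable(["ip_src", "ip_dst", "tp_src", "tp_dst"])
pkt = {"ip_src": "10.0.0.2", "ip_dst": "8.0.0.5", "tp_src": 123, "tp_dst": 678}
key, ctx = table.lookup(pkt, now=0.0)
```

Changing the list passed to the constructor changes which packets share a context, which is the effect of reprogramming the key extractor.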
4Including metadata such as the port the packet was received on.
Figure 1: Architecture of an OPP processing block
Conditions. Second, the packet’s header and metadata, which
now include the flow context, are passed to a condition block.
The condition block can be programmed at runtime for the
evaluation of up to $m$ expressions with mathematical
operators like >, <, =. For example, a condition can check
whether a port number is greater than the value stored in
a given flow context’s register. The output of the condition
block is a Boolean vector $\vec{C} = \{c_0, c_1, \ldots, c_{m-1}\}$, where
$c_i = 1$ if the $i$-th condition is true, otherwise $c_i = 0$.
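A minimal Python sketch of such a condition block is given below. The operand encoding (tagged tuples for header fields, registers and constants) and the function names are assumptions made for this example only.

```python
import operator

# Comparison operators available to conditions in this sketch.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def resolve(operand, pkt, registers):
    """An operand is ("field", name), ("reg", index) or ("const", value)."""
    kind, value = operand
    if kind == "field":
        return pkt[value]
    if kind == "reg":
        return registers[value]
    return value


def evaluate_conditions(conditions, pkt, registers):
    """Return the Boolean vector as a list of 0/1, one entry per condition."""
    return [
        int(OPS[op](resolve(a, pkt, registers), resolve(b, pkt, registers)))
        for a, op, b in conditions
    ]


conditions = [
    (("field", "tp_dst"), ">", ("reg", 0)),     # port greater than register 0?
    (("field", "tp_dst"), "=", ("const", 80)),  # port equal to 80?
]
C = evaluate_conditions(conditions, {"tp_dst": 80}, [1024, 0, 0, 0])
```

With destination port 80 and register 0 holding 1024, the first condition is false and the second is true, so the resulting vector is [0, 1].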
EFSM table. Third, the packet’s header and metadata, plus
$\vec{C}$ and state label $s$, are passed to the EFSM table. Such
table is a typical MAT that supports ternary matching on
the just mentioned values. For each entry in the table, a
programmer can specify (i) a list of OpenFlow-like actions to
be executed on the packet, (ii) the next state label $s$ in which
the flow context shall be set, and (iii) a list of instructions
to update the flow context registers $\vec{R}$. Furthermore, the
update functions can also operate on the global variables
$\vec{G} = \{G_0, G_1, \ldots, G_{h-1}\}$. Since the variables $\vec{G}$ are global,
their read and update operations happen atomically. In effect,
the EFSM table describes the transitions of the state machine.
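The ternary lookup in the EFSM table can be sketched as follows. This Python fragment is only illustrative: the entry layout, the first-match priority rule and the string form of the update instructions are our assumptions, not a description of the OPP data structures.

```python
def ternary_match(value, pattern, mask):
    """True if value equals pattern on every bit that is set in mask."""
    return (value & mask) == (pattern & mask)


class EfsmEntry:
    def __init__(self, match, actions, next_state, updates):
        self.match = match            # field name -> (pattern, mask)
        self.actions = actions        # OpenFlow-like actions
        self.next_state = next_state  # next state label
        self.updates = updates        # register update instructions


def efsm_lookup(entries, fields):
    """Return the first entry whose (pattern, mask) pairs all match."""
    for entry in entries:
        if all(
            ternary_match(fields[name], pattern, mask)
            for name, (pattern, mask) in entry.match.items()
        ):
            return entry
    return None


entries = [
    # State 0 and condition c0 true: forward, move to state 1, bump a counter.
    EfsmEntry({"state": (0, 0xFF), "c0": (1, 1)}, ["forward"], 1, ["r0 = r0 + 1"]),
    # Wildcard entry (no constraints): drop and reset the state.
    EfsmEntry({}, ["drop"], 0, []),
]
hit = efsm_lookup(entries, {"state": 0, "c0": 1})
fallback = efsm_lookup(entries, {"state": 3, "c0": 1})
```

A mask of 0 on a field makes it a wildcard, which is how a single entry can cover many combinations of state label and condition bits.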
Context update. Finally, the packet’s header and metadata,
the action to be applied, the update instructions and the new
value of the state label are passed to the update logic block.
Here, an array of Arithmetic and Logic Units (ALUs)
performs the required update instructions to update the values
stored in both the flow context registers $\vec{R}$ and global
registers $\vec{G}$, using arithmetic functions. Such functions can range
from simple integer sums, for instance to update the value
of a register representing a packet or byte counter, to more
complex ones, e.g., floating point processing, depending on
the specific implementation and required performance. The
result of this block is then used to update the flow context
identified by the key generated by the update key extractor.
Such extractor works in the same way as the lookup key
extractor.
2.1.2 Remarks
The OPP model defines three different types of states, with
corresponding consistency models.
First, the packet state, which has a non-concurrent access
model. That is, it can be only accessed and updated during
the current packet processing. By definition, such state is
created when a packet enters the stage and destroyed once
the processing is complete. The packet state is practically
described by the packet’s metadata. Here, it is worth
noticing that all the discussed abstractions, including OpenFlow,
define the packet state.
Second, the global state, which is common to all the pack-
ets. This state exists for the entire life of the stage and is
stored by the $\vec{G}$ registers. As in Banzai, in OPP the global
state can be only updated with atomic operations. In prac-
tice, a programmer that uses global states should be well
aware that the strong consistency guarantees may introduce
performance overheads in some implementations.
Third, OPP defines a flow state. The flow state only exists
for the life of a flow and is captured in the OPP’s flow context.
Here, the consistency is guaranteed on a per-flow basis, since
the model does not allow concurrent accesses to the flow
state. However, multiple packets accessing and modifying
different flow states can be processed in parallel. Which
packets can execute in parallel can be easily derived. Recall
that the programmer has to explicitly specify beforehand the
need to read and write a flow state during the processing of a
packet. Such operation is concretely performed configuring
the lookup and update key extractors.
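One way to derive which packets can execute in parallel is sketched below in Python. The check is deliberately conservative (two packets that only read the same context are also treated as conflicting), and the function names and packet layout are hypothetical.

```python
def touched_contexts(pkt, lookup_fields, update_fields):
    """Keys of the flow contexts a packet reads (lookup) and writes (update)."""
    lookup_key = tuple(pkt[f] for f in lookup_fields)
    update_key = tuple(pkt[f] for f in update_fields)
    return {lookup_key, update_key}


def can_run_in_parallel(p1, p2, lookup_fields, update_fields):
    """Two packets are independent if they touch disjoint sets of contexts."""
    a = touched_contexts(p1, lookup_fields, update_fields)
    b = touched_contexts(p2, lookup_fields, update_fields)
    return a.isdisjoint(b)


# Example configuration: read the context of the destination MAC,
# update the context of the source MAC.
LOOKUP, UPDATE = ["eth_dst"], ["eth_src"]
p1 = {"eth_src": "aa", "eth_dst": "bb"}
p2 = {"eth_src": "cc", "eth_dst": "dd"}
p3 = {"eth_src": "bb", "eth_dst": "ee"}
independent = can_run_in_parallel(p1, p2, LOOKUP, UPDATE)
conflicting = can_run_in_parallel(p1, p3, LOOKUP, UPDATE)
```

Here p1 and p2 touch different contexts and may be processed concurrently, whereas p3 writes the context that p1 reads, so the two must be serialized.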
Finally, notice that these states are all contained within one
stage. In fact, only n bits of the packet metadata (i.e., the
packet state) can be used as a means to move state from one
stage to the next.
2.2 NetFPGA prototype
An OPP hardware prototype is available for the NetFPGA
SUME [58]. Describing the details of the prototype is out
of scope for this paper. However, we report a few important
data that help putting the OPP abstraction in a more concrete
perspective in terms of hardware requirements and constraints.
First, the prototype can forward the packets of four 10GbE ports
at line rate, providing a throughput of 156.25 Mpps (million
packets per second). The system may support flow context
consistency by adding queues before a stage. Such queues
have to hold packets that can potentially access the same
flow contexts concurrently. However, notice that: (i) the
pipeline can be meanwhile used by other packets belonging
to different flows; (ii) the packets queued belong to the same
flows and don’t get therefore reordered within their flows;
(iii) the additional introduced delay per packet is negligible,
being of few clock cycles (in our prototype a packet would
need to wait at most 6 clock cycles).
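As a rough sanity check of the line-rate claim, the worst-case packet rate of the four ports can be estimated as below. The calculation assumes standard Ethernet framing, i.e., 64-byte minimum frames plus 20 bytes of preamble and inter-frame gap per packet; these figures come from the Ethernet standard and not from the prototype’s description.

```python
# Per-packet overhead on the wire: 8 bytes preamble/SFD + 12 bytes inter-frame gap.
ETH_OVERHEAD_BYTES = 8 + 12


def max_pps(link_bps, frame_bytes=64):
    """Maximum packets per second on a link for a given frame size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


per_port = max_pps(10e9)   # one 10GbE port with minimum-size frames
aggregate = 4 * per_port   # four ports
headroom = 156.25e6 / aggregate
```

Under these assumptions a single port carries about 14.88 Mpps and the four ports together about 59.5 Mpps, which is well below the reported 156.25 Mpps.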
The hash tables used to store the flow contexts are imple-
mented using RAM blocks. The EFSM tables are instead
implemented using very small TCAMs. That is, an EFSM
TCAM has 32 entries of 160 bits. Indeed, TCAM imple-
mentation over FPGAs is very inefficient and is currently a
widely open research issue [24, 55, 25]. Still, in Sec. 3, we
will demonstrate that this small number of TCAM entries
can easily capture the implementation of an iptables interface.
The prototype fixes the parameters of the machine model as
shown in Table 1.
param. | value | descr.
k | 4 | Number of flow context’s registers. Each register is 32 bits long.
m | 8 | Number of conditions. Each condition is in the form var1 op var2, with op being one of >, <, =, and variables being packet header’s fields, registers or constants.
n | 32b | Size of packet’s metadata moved between stages.
h | 8 | Number of global registers. Each register is 32 bits long.
ALUs | 5 | Number of ALUs. Each ALU performs an operation in the form res = var1 op var2. res and vars can be: global register, flow context’s register or packet’s fields (including metadata). op can be one of +, −, shift, etc.
Table 1: Parameters of an OPP hardware prototype
resource type | Reference switch | OpenState | OPP switch
# Slice LUTs | 49436 (11%) | 62637 (14%) | 71712 (16%)
# Block RAMs | 194 (13%) | 245 (16%) | 393 (26%)
Table 2: Hardware cost comparison of OPP, NetFPGA
SUME ref. switch and OpenState.
The whole system has been synthesized using the standard
Xilinx design flow. Table 2 reports the logic and memory
resources (in terms of absolute numbers and fraction of
available FPGA resources) used by the OPP FPGA
implementation, and compares these results with those required
for the NetFPGA SUME single-stage reference switch and
an OpenState stage. We remark that OPP is an evolution of the
original OpenState stage. The synthesis results confirm the
trend already shown by [10]: the hardware area is dominated
by memory, while adding intelligence/features in the logic
requires a small silicon overhead. Notice that the reported
resources include the overhead of several blocks, such as the
microcontroller for OPP configuration, the input/output FIFO
for the 10GbE interfaces etc., which are required to operate
the FPGA and do not need to be replicated for each stage. In
fact, given the required resources, a NetFPGA SUME can
currently host up to 6 stateful OPP Stages.
3. THE IPTABLES USE CASE
In this section, we present a possible implementation of
the iptables interface using the OPP abstraction. While we
support almost entirely the interface, for space constraints we
limit the discussion to a relevant subset of functions. Nonethe-
less, we present the full implementation of three relevant use
cases: a stateful firewall, a load balancer and a dynamic NAT.
We begin the section with a short primer of iptables and then
describe the three use cases.
3.1 iptables primer
Iptables is a well-known Linux user interface to control
the Netfilter module, which is responsible for processing
packets traversing Linux’s networking subsystem. In
cooperation with the conntrack module, Netfilter supports
a wide range of network functions such as filtering, NAT,
stateful firewalling, load balancing, anomaly detection, etc.
Iptables uses rules to configure Netfilter. The first of the
matching rules triggers the execution of a corresponding
action. Rules are specified as follows:
iptables $command $table $match $target
where command is, e.g., insert, delete, etc. and includes
also the specification of a Netfilter’s hook. A hook indicates
the position within the IP networking subsystem, e.g., PRE-
ROUTING, FORWARD, POSTROUTING. Each table, e.g.,
nat, filter, etc., provides a given set of packet processing ca-
pabilities. The match is somewhat similar to an OpenFlow
rule’s match part, although being more flexible. Finally, the
target specifies the action to be applied, including: DROP,
SNAT (source NAT), DNAT (destination NAT) and others.
We support almost completely the iptables interface. Nonethe-
less, we highlight that some functions have not been imple-
mented, such as those that act on the packet’s payload.
3.2 iptables implementation
Our iptables implementation is conceptually simple. A
thin software layer translates the iptables rules into a set of
entries for the OPP stages. The software layer is a regular
OpenFlow controller extended to support OPP. In fact, we
modify RYU5 and its OpenFlow implementation [32], adding
new protocol messages to populate and inspect EFSM tables,
flow context tables, etc.
The most interesting part of the software layer is probably
the way the translation is actually performed. In fact, the
rule-based nature of iptables does not require any complex
process, e.g., one resembling the compilation of a programming
language. Instead,
the translation is very often an almost 1-to-1 mapping with the
OPP entries, provided that some relevant functional blocks
are identified in advance. For example, the first use case
implements the function for the Linux’s conntrack module
using a set of 7 EFSM table’s entries. Therefore, in the rest
of the section we focus on describing the translation from
a relevant set of iptables rules to entries for the OPP stages.
More specifically, using the RYU-powered iptables interface,
we implement three different network functions combining 5
iptables’ rules. In particular, we implement a stateful firewall
(2 rules), a load balancer (2 rules) and a dynamic NAT (1 rule).
Our implementation requires 4 OPP stateful stages, plus a
stateless one, to support the three functions altogether6.
To describe the OPP stages configuration, we represent
just the EFSM tables and their entries, including the lookup
and update key extractors configuration. Using the example
5https://osrg.github.io/ryu/
6The number of stages may vary if a particular iptables function is
used. E.g., the iptables’ function RECENT actually defines a new
table, hence, we need to allocate a new OPP stage to implement it.
scenario of Fig. 3, the full configuration of our use cases is
shown in Fig. 2. We assume that the node R of Fig. 3 is
implemented as a VNF using OPP, with the network inter-
faces (e0, e1, e2) mapped on the OPP abstraction’s ports (0,
1, 2). The colors identify the OPP entries used to implement
the different use cases. When an entry has multiple colors, it
means that it is shared by the corresponding use cases. In the
rest of the section, we describe the three use cases, one by
one, in increasing level of complexity.
Figure 3: Use cases scenario (node R with interfaces e0 towards
the Internet, e1 towards the DMZ 8.0.0.0/24 and e2 towards the
LAN 10.0.0.0/24; addresses shown in the figure: 1.0.0.1, 8.0.0.1,
8.0.0.5, 8.0.0.6, 10.0.0.1, 10.0.0.2, 10.0.0.3)
LAN/DMZ isolation. The first and simplest network
function is a stateful firewall. The firewall allows a host in
the DMZ to communicate with a host in the LAN only if the
latter initiated the communication. In iptables, this is realised
by the following rules:
iptables -A FORWARD -i e2 -o e1 -j ACCEPT
iptables -A FORWARD -i e1 -o e2 -m state
--state ESTABLISHED -j ACCEPT
The translation of these rules only requires the use of stage
0 (7 rules) and of stage 4 (4 rules), which is stateless. Stage
0 is configured with a bidirectional 4 tuple flow7 for both
the update and lookup key extractors. Such a configuration
provides the same flow context’s key regardless of the order
of the address and port fields. I.e., both directions of the flow
are mapped to the same flow context. The entries of stage 0
actually implement a connection tracking state machine, i.e.,
Linux’s conntrack function. When at least one packet per
direction is exchanged, the connection is considered
established. Such information is then passed to stage 4 using
the packet’s metadata, where a forwarding decision is taken.
For example, assuming that the source 10.0.0.2:123 (from
the LAN) sends a packet to the destination 8.0.0.5:678 (from
the DMZ), the system would work as follows. The first
packet from 10.0.0.2:123 hits the stage 0, and not having any
pre-existing flow context, is assigned with a default context
identified by the lookup key {10.0.0.2:123,8.0.0.5:678}. The
default context’s state label is set to 0, therefore, the EFSM
table 0’s entry #1 is matched. The corresponding actions set a
new state for the flow context (with state label value 12) and
send the packet to stage 4. Here, notice that the flow context
being updated is again {10.0.0.2:123,8.0.0.5:678}, since the
lookup and update keys are configured to be the same. In
stage 4, the packet is forwarded towards its destination, to
port 1 (EFSM table 4’s entry #2).
7Source and destination addresses and ports.
The DMZ host’s response packet will be associated with
the same flow context of the first packet, having the same
key {10.0.0.2:123,8.0.0.5:678}. Here, recall we configured
the key extractors to use a bidirectional flow hashing. The
packet is associated with the already existing flow context,
whose state label is set to 12. Therefore, EFSM table 0’s
entry #5 is matched. The entry actions (i) update the state
label to 2, which actually means "established", (ii) set the
packet’s metadata (b0−3) to 1 and (iii) send the packet to
stage 4. Here, the metadata value is used to signal that the
flow is established, so the DMZ host’s packet matches entry
#1 and is forwarded to the LAN. All the remaining packets
belonging to this (bidirectional) flow will be associated with
a state label 2 in stage 0, and then finally forwarded by stage
4’s entries, effectively enforcing the firewall policy.
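The walk-through above can be condensed into the following Python sketch. It models only the transitions described in the text (state labels 0, 12 and 2) and derives the stage 4 decision from the two iptables rules; timeouts and all other EFSM entries are omitted, and the class, constant and field names are hypothetical.

```python
LAN_PORT, DMZ_PORT = 2, 1                       # e2 -> port 2, e1 -> port 1
NEW, LAN_INITIATED, ESTABLISHED = 0, 12, 2      # state labels used in the text


def bidir_key(pkt):
    """Same key for both directions of a flow (order-independent 4-tuple)."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return tuple(sorted([a, b]))


class Firewall:
    def __init__(self):
        self.contexts = {}  # flow context table of stage 0: key -> state label

    def stage0(self, pkt):
        """Connection tracking; returns the metadata bit passed to stage 4."""
        key = bidir_key(pkt)
        state = self.contexts.get(key, NEW)
        established = 0
        if state == NEW and pkt["in_port"] == LAN_PORT:
            state = LAN_INITIATED
        elif state == LAN_INITIATED and pkt["in_port"] == DMZ_PORT:
            state, established = ESTABLISHED, 1
        elif state == ESTABLISHED:
            established = 1
        self.contexts[key] = state
        return established

    def stage4(self, pkt, established):
        """Stateless forwarding: output port, or None to drop."""
        if pkt["in_port"] == LAN_PORT:
            return DMZ_PORT
        if pkt["in_port"] == DMZ_PORT and established:
            return LAN_PORT
        return None

    def process(self, pkt):
        return self.stage4(pkt, self.stage0(pkt))


fw = Firewall()
request = {"src_ip": "10.0.0.2", "src_port": 123,
           "dst_ip": "8.0.0.5", "dst_port": 678, "in_port": LAN_PORT}
response = {"src_ip": "8.0.0.5", "src_port": 678,
            "dst_ip": "10.0.0.2", "dst_port": 123, "in_port": DMZ_PORT}
unsolicited = {"src_ip": "8.0.0.6", "src_port": 999,
               "dst_ip": "10.0.0.3", "dst_port": 22, "in_port": DMZ_PORT}
results = [fw.process(unsolicited), fw.process(request), fw.process(response)]
```

The unsolicited DMZ packet is dropped, the LAN request is forwarded to port 1, and the DMZ response is forwarded to port 2 once the flow has reached the established state.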
Notice that the flow context set by stage 0’s entries is
always associated with a 20-second idle timeout (see footnote
8), e.g., action SET_STATE(2, 20s). That is, when no processed
packet is associated with such a flow context for more than
20 seconds, the context is evicted.
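The walk-through above can be approximated with a short Python sketch. It is only an illustration under stated assumptions: the names (Firewall, TABLE0, biflow_key) are hypothetical, the transition table is transcribed from entries #0 to #6 of EFSM table 0 as read from Fig. 2, and the idle timeout is modelled by storing a last-seen timestamp. It is not the OPP implementation.

```python
import ipaddress

LAN = ipaddress.ip_network("10.0.0.0/24")
DMZ = ipaddress.ip_network("8.0.0.0/24")
LAN_PORT, DMZ_PORT = 2, 1      # switch ports, as read from Fig. 2
IDLE_TIMEOUT = 20.0            # seconds, the example value used in the text

# (state, inport) -> (next state, or None to keep it; metadata b0-3)
TABLE0 = {
    (0, DMZ_PORT): (11, 0),    # entry #0
    (0, LAN_PORT): (12, 0),    # entry #1
    (11, DMZ_PORT): (None, 0), # entry #2
    (11, LAN_PORT): (2, 1),    # entry #3
    (12, LAN_PORT): (None, 0), # entry #4
    (12, DMZ_PORT): (2, 1),    # entry #5
}

def biflow_key(src, dst):
    """Direction-independent key: both directions give the same tuple."""
    return tuple(sorted((src, dst)))

class Firewall:
    def __init__(self):
        self.contexts = {}     # biflow key -> [state label, last-seen time]

    def process(self, inport, src, dst, now):
        key = biflow_key(src, dst)
        ctx = self.contexts.get(key)
        if ctx is not None and now - ctx[1] > IDLE_TIMEOUT:
            del self.contexts[key]          # idle context is evicted
            ctx = None
        state = ctx[0] if ctx else 0        # default context: state label 0
        if state == 2:
            next_state, metadata = None, 1  # entry #6: established flow
        else:
            # Ports other than 1 and 2 are not modelled in this sketch.
            next_state, metadata = TABLE0.get((state, inport), (None, 0))
        if next_state is not None:
            state = next_state
        self.contexts[key] = [state, now]
        return self.stage4(metadata, inport, dst)

    @staticmethod
    def stage4(metadata, inport, dst):
        dst_ip = ipaddress.ip_address(dst[0])
        if dst_ip in LAN and inport == DMZ_PORT:
            return LAN_PORT if metadata == 1 else "DROP"  # entries #1, #0
        if dst_ip in DMZ:
            return DMZ_PORT                               # entry #2
        if dst_ip in LAN:
            return LAN_PORT                               # entry #3
        return 0                                          # entry #6: default

fw = Firewall()
first = fw.process(2, ("10.0.0.2", 123), ("8.0.0.5", 678), now=0.0)
reply = fw.process(1, ("8.0.0.5", 678), ("10.0.0.2", 123), now=1.0)
unsolicited = fw.process(1, ("8.0.0.9", 1), ("10.0.0.7", 2), now=1.0)
```

In this sketch the LAN-initiated packet is sent to port 1, its reply is let through to port 2, and a DMZ-initiated packet towards the LAN is dropped, mirroring the three cases discussed in the text.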
Load balancer. In this use case, we configure a load balancer
function that assigns TCP connections to two web servers in
a round-robin fashion. Directing traffic to a given web server
is as easy as configuring a static NAT rule. The complexity of
the use case lies instead in enforcing the round-robin
selection, and in forwarding consistently once a connection
has been assigned to a server. That is, the destination web
server is selected when the connection’s first packet is
received, and all the remaining packets of that flow must be
forwarded to the same web server.
In iptables, the following rules configure such a function:
iptables -t nat -A PREROUTING -i e0 \
    -d 1.0.0.1 -p tcp --dport 80 \
    -m statistic --mode nth \
    --every 2 -j DNAT \
    --to-destination 10.0.0.3:80
iptables -t nat -A PREROUTING -i e0 \
    -d 1.0.0.1 -p tcp --dport 80 \
    -m statistic --mode nth \
    --every 1 -j DNAT \
    --to-destination 10.0.0.2:80
Our implementation requires 2 stateful OPP stages (0 and
3) and a stateless stage (4) to translate the above rules. In
stage 0, a global variable (G0) is used to keep a counter that
is incremented whenever a new flow starts. The value of the
counter is then passed in the packet’s metadata to stage 3.
Here, the server assignment decision is made by looking at the
last bit of the packet’s metadata, which contains the counter
value (see footnote 9). Depending on that value, a NAT operation is applied
to the incoming flows from the public Internet, translating the
destination address from the external facing address to one
of the internal servers’ addresses. Finally, stage 4 performs
the forwarding decision and applies the static NAT rules for
the traffic in the opposite direction, i.e., it rewrites the
source address, replacing the internal servers’ addresses with
the external facing address.
8 The timeout values used throughout this section are just an
example and are not motivated by any rationale.
9 Notice that the OPP hardware implementation does not support
arbitrary modulo operations. Using the EFSM table (TCAM) mask
feature, we implement a mod 2 operation. However, if we had to
load balance among 3 servers, we could not use the same entries
configuration shown in Fig. 2. Fortunately, it is possible to
emulate a Montgomery modular multiplication [35] by setting the
right values for the G0 increments, together with a proper mask
for the metadata matching in stage 3. Such a technique allows a
programmer to describe, e.g., an arbitrary number of servers in
this use case.

EFSM Table 0 (lookup/update = biflow)
#   state  inport  src_addr  src_port  dst_addr     dst_port  actions
0   0      1       -         -         10.0.0.0/24  -         SET_STATE(11, 20s); SET_METADATA(b0-3=0); GOTO 4;
1   0      2       -         -         8.0.0.0/24   -         SET_STATE(12, 20s); SET_METADATA(b0-3=0); GOTO 4;
2   11     1       -         -         10.0.0.0/24  -         SET_METADATA(b0-3=0); GOTO 4;
3   11     2       -         -         8.0.0.0/24   -         SET_STATE(2, 20s); SET_METADATA(b0-3=1); GOTO 4;
4   12     2       -         -         8.0.0.0/24   -         SET_METADATA(b0-3=0); GOTO 4;
5   12     1       -         -         10.0.0.0/24  -         SET_STATE(2, 20s); SET_METADATA(b0-3=1); GOTO 4;
6   2      *       -         -         -            -         WRITE_METADATA(b0-3=1); GOTO 4;
7   0      0       -         -         1.0.0.1      80        SET_STATE(1, 20s); SET_METADATA(b0-3=G0); UPDATE(G0+1 → G0); GOTO 3;
8   1      0       -         -         1.0.0.1      80        GOTO 3;
9   0      2       10.0.0.2  80        -            -         GOTO 4;
10  0      2       10.0.0.3  80        -            -         GOTO 4;
11  0      2       -         -         -            -         SET_STATE(1, 20s); SET_METADATA(b16:b31=G1); UPDATE(G1+1 → G1); GOTO 1;
12  1      2       -         -         -            -         GOTO 3;
13  0      0       -         -         1.0.0.1      -         GOTO 2;

EFSM Table 1 (lookup: b16-b31)
#  inport  actions
0  2       SET_METADATA(b16-b31=state_label); GOTO 2;

EFSM Table 2 (lookup = (ip.src, ip.proto, tcp.src); update = (ip.dst, ip.proto, tcp.dst))
#  state  inport  actions
0  0      2       SET_STATE(1, 20s); UPDATE(ip.src → R0, tcp.src → R1); GOTO 3;
1  1      0       SET_HEADER(ip.dst=R0, tcp.dst=R1); GOTO 4;

EFSM Table 3 (lookup/update = (ip.src, tcp.src))
#  state  b0-3  inport  actions
0  0      ***0  0       SET_HEADER(ip.dst=10.0.0.2, tcp.dst=80); UPDATE(10.0.0.2 → R0, 80 → R1); SET_STATE(1, 20s); GOTO 4;
1  0      ***1  0       SET_HEADER(ip.dst=10.0.0.3, tcp.dst=80); UPDATE(10.0.0.3 → R0, 80 → R1); SET_STATE(1, 20s); GOTO 4;
2  1      -     0       SET_HEADER(ip.dst=R0, tcp.dst=R1); GOTO 4;
3  0      -     2       SET_STATE(1, 20s); UPDATE((b16:b31) → R1); SET_HEADER(ip.src=1.0.0.1, tcp.src=(b16:b31)); GOTO 4;
4  1      -     2       SET_STATE(1, 20s); SET_HEADER(ip.src=1.0.0.1, tcp.src=R1); GOTO 4;

EFSM Table 4 (stateless)
#  b0-3  inport  src_addr  src_port  dst_addr     actions
0  0     1       -         -         10.0.0.0/24  DROP;
1  1     1       -         -         10.0.0.0/24  OUT 2;
2  -     -       -         -         8.0.0.0/24   OUT 1;
3  -     -       -         -         10.0.0.0/24  OUT 2;
4  -     2       10.0.0.2  80        -            SET_HEADER(ip.src=1.0.0.1, tcp.src=80); OUT 0;
5  -     2       10.0.0.3  80        -            SET_HEADER(ip.src=1.0.0.1, tcp.src=80); OUT 0;
6  -     -       -         -         -            OUT 0;

A dash marks a match field that is left unspecified (wildcard).
Use cases covered: LAN/DMZ isolation, load balancer, dynamic NAT.
Figure 2: OPP stages for the iptables use cases

Consider now a source on the Internet, e.g., 2.0.0.7:678,
sending its first TCP packet to establish a connection to the
load balancer’s address 1.0.0.1:80. The packet reaches stage 0,
where a default flow context is assigned to it (with key
{2.0.0.7:678,1.0.0.1:80}). Therefore, it matches entry #7,
which applies four actions. First, the flow context is updated
with state label 1, which stands for “first packet received”.
Second, the value contained in G0 is copied into the metadata
(b0−3). Third, the G0 value is incremented
(UPDATE(G0 + 1 → G0)). Finally, the packet is sent to stage 3.
It is important to point out that the second and third
operations happen at the same time. In fact, the G0 value
before the increment is copied into the metadata (parallel
read).
In stage 3, the packet is associated with a new default flow
context, identified by the key {2.0.0.7:678} (see footnote 10).
Assuming that the G0 value copied in the packet’s metadata was
originally 0, entry #0 is matched. The entry’s actions
(i) rewrite the destination address to that of one of the
actual web servers (10.0.0.2:80), (ii) copy the corresponding
address and port number into the flow context’s registers R0
and R1, (iii) set the flow context’s state label to 1, meaning
“server assigned”, and (iv) send the packet to stage 4. Here,
the packet, with the rewritten destination address, matches
entry #3 and gets forwarded to the selected web server.
10 Recall that each flow context exists only in the scope of
the stage within which it was created. That is, each stage has
its own flow context table and key extractors configuration.
The server’s response packet is associated with a default
flow context in stage 0. In fact, the packet is associated
with the context key {10.0.0.2:80, 2.0.0.7:678}, whereas the
first packet created a context with key
{2.0.0.7:678,1.0.0.1:80}. Therefore, entry #9 is matched and
the packet is sent to stage 4. We remark that, since the
entry’s actions do not perform any flow context update, no
entry for this context is created in the flow context table.
In stage 4, the packet matches entry #4, which restores the
source address to 1.0.0.1 before forwarding it to the Internet.
When a new packet from 2.0.0.7:678 is received, it will
be associated with the flow context {2.0.0.7:678,1.0.0.1:80},
whose state label is 1. The packet will match EFSM table 0’s
entry #8, which sends it to stage 3. Here, the flow context
with key {2.0.0.7:678} is retrieved. This context’s state label
is 1, as it was created by the first packet. Furthermore, flow
context’s registers R0 and R1 contain the assigned server
address for such flow. Therefore, the packet matches entry
#2, whose actions rewrite the destination address and port
with the values contained in the registers and send the packet
to stage 4, where it is finally forwarded to the correct server.
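The forward direction of the load balancer can be summarised with the following Python sketch. It is a simplified model under stated assumptions: the class and attribute names are hypothetical, timeouts and the reverse (server to client) direction are omitted, and the mod 2 selection is written as a bit mask to mirror the TCAM entries ***0 and ***1 of stage 3.

```python
SERVERS = [("10.0.0.2", 80), ("10.0.0.3", 80)]
VIP = ("1.0.0.1", 80)          # external facing address of the load balancer

class LoadBalancer:
    def __init__(self):
        self.g0 = 0            # global counter G0, kept in stage 0
        self.stage0 = {}       # biflow key -> state label
        self.stage3 = {}       # (ip.src, tcp.src) -> (state label, R0, R1)

    def from_internet(self, src, dst=VIP):
        """Return the (address, port) the packet is rewritten to."""
        key0 = tuple(sorted((src, dst)))
        metadata = 0
        if self.stage0.get(key0, 0) == 0:   # stage 0, entry #7: new flow
            self.stage0[key0] = 1
            metadata = self.g0              # parallel read: old value ...
            self.g0 += 1                    # ... while G0 is incremented
        # Stage 0, entry #8 (state 1) only forwards the packet to stage 3.
        ctx = self.stage3.get(src)
        if ctx is None:                     # stage 3, entries #0 and #1
            server = SERVERS[metadata & 1]  # last metadata bit picks a server
            self.stage3[src] = (1, server[0], server[1])
            return server
        _, r0, r1 = ctx                     # stage 3, entry #2: reuse R0, R1
        return (r0, r1)

lb = LoadBalancer()
a1 = lb.from_internet(("2.0.0.7", 678))     # first flow
a2 = lb.from_internet(("2.0.0.7", 678))     # same flow, later packet
b1 = lb.from_internet(("2.0.0.8", 999))     # second flow
```

Here the first flow is pinned to 10.0.0.2:80 for all of its packets, while the second flow is assigned 10.0.0.3:80, since the global counter is consulted only when a flow starts.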
Dynamic NAT. The last function we present performs dynamic
NAT between the LAN and the Internet of Fig. 3, translating
local source addresses into a public IP address with a
dynamically selected source port, and vice versa. The port is
selected from a pre-configured bucket of available ports. In
iptables, this is described with the following rule:
iptables -t nat -A POSTROUTING \
    -i e2 -o e0 -j MASQUERADE
In this case, the rule translation uses 4 stateful stages plus
the stateless stage. For the implementation of this function,
however, we had to perform a special trick to represent the
bucket of available ports. We use a dedicated OPP stage (1)
and configure it to work as a memory stack. While the OPP
machine model allows a programmer to play such a trick, we
recognize that this is a stretch of the abstraction, which we
will further discuss in Sec. 5.
In particular, stage 1 is configured by the controller, using
OpenFlow-like messages, to pre-populate the entries in the
flow context table. The entries’ state labels are in fact port
numbers. The stage’s lookup key extractor is configured to
use a value contained in the packet’s metadata. Such value is
used as a stack pointer and should be set in a previous stage.
Whenever a packet is received, the flow context pointed by the
metadata value is attached to the packet. Such context’s state
label is in fact a port number. The port number is assigned
to the packet’s metadata by the matching entry’s action, and
therefore it can be used in a following stage. The controller
periodically runs a helper function that monitors the state of
the stages to track port usage. The ports that have become
free (for instance because flow contexts were evicted) are
pushed back into stage 1’s flow context table.
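A minimal sketch of how stage 1 can act as a pool of ports is given below. All names are hypothetical, and two points are assumptions rather than details given in the text: the order in which freed ports are reused, and the behaviour when no port is available. Only the mechanism (a table indexed by a pointer held in G1, pre-populated and refilled by the controller) follows the description above.

```python
class PortStage:
    """Stage 1 modelled as a table: pointer value -> port number."""

    def __init__(self, ports):
        # Pre-populated by the controller: the state label of each
        # flow context is a port number.
        self.table = dict(enumerate(ports))
        self.g1 = 0                     # pointer kept in global register G1
        self.next_slot = len(ports)     # controller-side write position

    def assign(self):
        """Data path: look up the context pointed to by G1, then advance."""
        if self.g1 not in self.table:
            # Assumed failure handling; the text does not cover this case.
            raise LookupError("no free port")
        port = self.table[self.g1]      # state label copied into metadata
        self.g1 += 1                    # UPDATE(G1+1 -> G1) in stage 0
        return port

    def release(self, port):
        """Controller helper: push a freed port back into the table."""
        self.table[self.next_slot] = port
        self.next_slot += 1

stage1 = PortStage([444, 445])
p1 = stage1.assign()
p2 = stage1.assign()
stage1.release(p1)      # e.g., after the flow context of p1 was evicted
p3 = stage1.assign()
```

The data path only reads the table and advances the pointer; every write to the table is performed by the controller, which matches the split between fast path and helper function described in the text.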
Having clarified the role of stage 1, we can now describe the
other stages. Stage 0 is used to extract the “stack pointer”
used in stage 1 to dynamically assign a port number. The
pointer is stored in the global register G1. After traversing
stage 1, the packet’s metadata contains the assigned port
number. Stage 2 is the only example, in the presented cases,
in which a cross-flow state update is required. Stage 2 is
responsible for restoring the original NATted host’s address
and port number for packets coming from the Internet (port 0).
This is done by storing the original values in two flow
context registers when the first packet of a flow (coming from
port 2) is received. However, the outgoing flow (from port 2)
has to set such values for the incoming flow (from port 0).
Therefore, stage 2’s lookup key extractor is configured to use
the source address and port of the packet, while the update
key extractor uses the destination address and port. Stage 3
is instead in charge of performing the NAT for the outgoing
flow. That is, it writes the external facing address and the
assigned port number into the packet header. Finally, stage 4
performs the usual forwarding decision.
As in the previous cases, we walk through an example and
assume that the source 10.0.0.4:123 is sending a packet to the
destination 2.0.0.1:678. In stage 0, the packet is associated
with a new flow context and matches entry #11. The entry’s
actions are similar to the load balancer case, but this time a
different global variable, G1, is copied into the packet’s
metadata. In stage 1, the packet’s metadata is used to look up
a pre-populated flow context. This context’s state label is
copied into the packet metadata by the (only) matching entry
and the packet is sent to stage 2. Here, a new context with
key {10.0.0.4:123} is associated with the packet, which
therefore matches entry #0. The entry’s action just updates a
flow context before sending the packet to stage 3. However,
the flow context update happens on a different flow than the
one to which the packet belongs. In fact, the update key
extractor is configured to use the packet’s destination
address and port. Hence, while the flow context {10.0.0.4:123}
is attached to the packet, the action updates the flow context
{2.0.0.1:678}. The update sets the context’s state label to 1,
meaning “translation entry stored”, and stores the origin
host’s address and port number in the R0 and R1 registers. The
processing in stages 3 and 4 proceeds as in the previous
cases, therefore we skip the description for brevity.
Now, assume that the port extracted for the first packet was
444; a response packet from 2.0.0.1:678 is then sent to
1.0.0.1:444. The packet is associated with a new flow context
in stage 0, and therefore it matches entry #13, whose action
sends the packet to stage 2. Here, the packet is associated
with the flow context {2.0.0.1:678}, which was updated by the
first packet. Therefore, entry #1 is matched and its actions
(i) write the values stored in registers R0 and R1 into the
packet header’s destination address and port number, and
(ii) send the packet to stage 4, where it is correctly
forwarded to the LAN host.
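The cross-flow update performed in stage 2 is the least intuitive step, so a small Python sketch may help. It is an illustration only: the names are hypothetical, a plain list stands in for the port assignment done by stages 0 and 1, and timeouts are ignored. The point it shows is that the context is looked up with the source endpoint but written under the destination endpoint.

```python
PUBLIC_IP = "1.0.0.1"

class DynamicNat:
    def __init__(self, ports):
        self.free_ports = list(ports)   # stands in for stages 0 and 1
        self.stage2 = {}                # endpoint -> (state label, R0, R1)
        self.stage3 = {}                # (ip.src, tcp.src) -> port (R1)

    def outgoing(self, src, dst):
        """Packet from the LAN (port 2): returns the rewritten (src, dst)."""
        if src not in self.stage3:      # first packet of the flow
            self.stage3[src] = self.free_ports.pop(0)
            # Stage 2, entry #0: the lookup key is the packet's source,
            # but the update key is its destination, so the state is
            # stored for the flow that will come back from the Internet.
            self.stage2[dst] = (1, src[0], src[1])
        return (PUBLIC_IP, self.stage3[src]), dst

    def incoming(self, src, dst):
        """Packet from the Internet (port 0): restore the original host."""
        state, r0, r1 = self.stage2.get(src, (0, None, None))
        if state != 1:
            return None                 # no translation entry stored
        return src, (r0, r1)            # stage 2, entry #1

nat = DynamicNat([444])
out = nat.outgoing(("10.0.0.4", 123), ("2.0.0.1", 678))
back = nat.incoming(("2.0.0.1", 678), ("1.0.0.1", 444))
```

In this run the outgoing packet leaves with source 1.0.0.1:444, and the response from 2.0.0.1:678 is rewritten back to 10.0.0.4:123, as in the example of the text.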
4. SOFTWARE IMPLEMENTATION
While the previous section demonstrated the flexibility
of the OPP abstraction, in this section we assess its
scalability when implemented in software. We do so by
presenting an OPP software implementation that scales across
multiple CPU cores. The section describes the implementation
and the provided optimizations, including the support
for multi-core processing. Finally, we report the results of
experimental performance tests.
Our software implementation extends OFSoftSwitch [1]
(OFSS), an OpenFlow compliant software switch. OFSS is
a popular tool in the academic community, as it provides a
clean and flexible implementation of OpenFlow that makes
it suitable for functional experimentation. However, the
software was not originally designed with performance in mind.
Therefore, in addition to implementing the functional
extensions required to support OPP, we optimized OFSS for
performance.
In a nutshell, the architecture of OFSS consists of two
processes: one (ofprotocol) is in charge of communicating
with an external controller and setting general
configurations, while the other (ofdatapath) implements the
actual switching operations. The ofdatapath module is
designed as a single-process application and relies on the
netdev library to access network devices.
The remainder of this section elaborates on the main
performance impairments found in OFSS, together with the
countermeasures taken to remove or mitigate their effect and
accelerate the switch.
4.1 Code Optimizations
The first step to improve the performance of OFSS
was to virtualize its netdev library in order to integrate
the I/O engine PFQ [7], using its accelerated pcap adaptation
layer. This allowed us to replace the slow AF_PACKET
sockets with the accelerated PFQ sockets without changing
the OFSS code. Nonetheless, a thorough code analysis of
OFSS revealed a significant number of possible performance
bottlenecks. In the following, we list the major performance
modifications to the original software architecture.
Dynamic Memory Allocation. The datapath of OFSS makes
extensive use of dynamic memory allocations and related
memory releases. This dramatically impacts the packet
forwarding performance, as the cost of each pair of calls is
around 200-500 CPU cycles. We implemented a zero-malloc
optimization that allows OFSS to generally run without
performing dynamic memory allocations. Wherever required, the
semantics of the data structures have been changed in order
to cope with memory buffers without ownership, which are
passed along the datapath as managed memory. Furthermore,
the packet handler has been re-designed in order to fit into
a single chunk of memory, replacing the original scattered
model. This saves two additional memory allocations and
deallocations.
Hash Map Refactoring. Hash maps are pervasively used
throughout the OFSS datapath. Wherever possible, hash
maps have been replaced with more efficient struct data
types, to save very frequent memory indirections when
accessing specific protocol fields. In addition, the remaining
hash maps have been equipped with a set of small managed
memory nodes, which are allocated at construction time. The
size and number of such nodes are instead configurable at
compile time. Since hash tables are aware of whether the
memory nodes are managed or not (using an annotation on each
single node), they can concurrently use both the small set
of pre-allocated nodes and the additional nodes allocated
on demand. In the case of the hash tables associated with the
packet handler, this optimization saves (on average)
up to 3 or 4 table rehashes, as they systematically host around
a dozen entries for each received packet.
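The idea of mixing pre-allocated ("managed") nodes with nodes allocated on demand can be sketched as follows. This is an illustration in Python with hypothetical names, not the OFSS code (which is written in C): the managed flag plays the role of the per-node annotation mentioned above.

```python
class Node:
    __slots__ = ("key", "value", "managed")

    def __init__(self, managed):
        self.key = None
        self.value = None
        self.managed = managed      # annotation: does the pool own this node?

class NodePool:
    def __init__(self, preallocated):
        # Nodes created once, at construction time.
        self._free = [Node(managed=True) for _ in range(preallocated)]
        self.on_demand = 0          # number of fallback allocations

    def acquire(self):
        if self._free:
            return self._free.pop() # fast path: no allocation
        self.on_demand += 1
        return Node(managed=False)  # slow path: allocate on demand

    def release(self, node):
        node.key = node.value = None
        if node.managed:            # only managed nodes go back to the pool
            self._free.append(node)

pool = NodePool(preallocated=2)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()
pool.release(c)                     # on-demand node: simply dropped
pool.release(a)                     # managed node: returned to the pool
```

As long as the number of live entries stays within the pre-allocated set, no allocation takes place on the packet path; the on-demand path only serves the uncommon larger cases.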
Zero Copy. The semantics of both pcap and PFQ allow
one to take advantage of the memory persistency of a packet
during the call of a pcap handler. This property has been
leveraged to refrain from saving a copy of the payload of
each packet when not strictly required (see footnote 11).
Strictly speaking, the zero-copy optimization consists in
removing a pair of malloc/free together with a memcpy.
Batch processing. The original version of OFSS processes
one packet at a time. Instead, we enabled batch processing
of packets. The forwarding function of OFSS has therefore
been changed in order to consume, for each port, a batch
of packets up to a configurable number, before switching to
another port. The beneficial effects of such an optimization
are mainly due to the increased cache locality that occurs
while processing packets. In addition, on modern CPUs, this
mode of operation allows one to take advantage of packet
pre-fetching. That is, the CPU is explicitly instructed to
pre-fetch data while doing some other processing. As a result,
the CPU can retrieve one or more consecutive packets while
the current one is being processed.
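The control flow of the batched forwarding loop can be sketched as below. The function name, the queue representation and the default batch size are assumptions made for illustration; the cache-locality and pre-fetching benefits discussed above are properties of the C implementation and cannot be reproduced in Python.

```python
from collections import deque

def forward_in_batches(port_queues, handle, batch_size=32):
    """Drain each port's queue in batches of at most batch_size packets."""
    processed = 0
    while any(port_queues.values()):        # some queue is still non-empty
        for port, queue in port_queues.items():
            count = min(batch_size, len(queue))
            batch = [queue.popleft() for _ in range(count)]
            for packet in batch:            # consume the batch, then move on
                handle(port, packet)
            processed += len(batch)
    return processed

order = []
queues = {0: deque(range(5)), 1: deque(range(3))}
total = forward_in_batches(queues, lambda p, pkt: order.append((p, pkt)),
                           batch_size=2)
```

With a batch size of 2, the loop serves two packets from port 0, then two from port 1, and so on, instead of alternating ports after every packet.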
4.2 Scalability
One of the advantages of using PFQ is to enable fine–
grained parallel computation in a simple and programmable
way. Unlike other accelerating alternatives, PFQ integrates an
in–kernel processing stage that is fully programmable through
a high–level functional language. Such a processing engine
is mainly intended as a “pre–processing stage” and allows
the execution of dynamic and hot–swappable (i.e., atomically
upgradable at run–time) processing pipelines (computations).
11 E.g., when the packet is consumed in the context of the
forwarding thread of execution.
Steering functions are particularly relevant to enable
parallelism, as they allow packets to be delivered to groups
of sockets using a hash-based load balancing algorithm. More
generally, steering can be performed according to arbitrary
criteria, with the overall target of distributing the
processing and avoiding state sharing across cores. This
feature turned out to be straightforwardly applicable to the
scaling of our OPP software switch. In fact, we retrieve
PFQ’s steering criteria (keys) directly from the lookup and
update key extractor configurations.
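To illustrate why steering on the flow key avoids state sharing, the following Python sketch maps a bidirectional flow key to a core index. It does not use the PFQ API: the function name and the choice of CRC32 as hash are assumptions, chosen only to show that both directions of a flow reach the same core when the steering key matches the key extractor configuration.

```python
import zlib

def steer(src, dst, n_cores):
    """Pick a core from a direction-independent hash of the flow key."""
    a, b = sorted((src, dst))               # same order for both directions
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    return zlib.crc32(key) % n_cores

forward = steer(("10.0.0.2", 123), ("8.0.0.5", 678), n_cores=4)
reverse = steer(("8.0.0.5", 678), ("10.0.0.2", 123), n_cores=4)
```

Since every packet of a given flow lands on the same core, that core can own the corresponding flow context without locks, which is the property exploited by the multi-core implementation.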
4.3 Evaluation
We carried out an extensive experimental campaign, under
different scenarios, to understand the absolute performance
of our prototype and its scalability. The experimental test
bed consists of two identical PCs with 8-core Intel Xeon
E5-1660 v3 CPUs (3.0GHz), each equipped with a pair of
identical Intel 82599 10G NICs. One of the PCs runs the
software switch, while the other is a load generator. Both
systems run a Linux Debian stable distribution (kernel
v. 3.16). A third server runs the controller and is connected
to the switch’s server using a 1G control network interface.
[Plot: throughput (Mpps) versus packet size (64, 128, 256, 512,
1024, 1500 bytes); series: Line Rate, OFSS, aOFSS with 1, 2, 3
and 4 cores.]
Figure 4: OpenFlow pipeline throughput
[Stacked histogram: acceleration contribution (%) versus packet
size (64, 128, 256, 512, 937, 1024, 1500 bytes); series: PFQ,
Zero Malloc, Batch Proc., Hash Map, Zero Copy.]
Figure 5: Acceleration contributions to OFSS
OpenFlow performance. The first set of experiments consists
of pure speed tests to benchmark the performance of the
accelerated version of OFSS (aOFSS) when running a standard
OpenFlow pipeline. Figure 4 shows the achieved throughput
when varying the number of cores and the packet sizes. Both
the original OFSS performance and the line rate limits are
also reported for comparison. The implemented acceleration
techniques provide a dramatic performance improvement, with
the throughput nearly hitting line rate in all the cases. This
amounts to a speedup of up to 90x with respect to the original
OFSS.
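As a reference for the line rate limits mentioned above, the packet rate of a 10 Gbps Ethernet link can be computed as in the following sketch. It assumes that the packet sizes in the figures are Ethernet frame sizes including the FCS, and it adds the standard 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter and inter-frame gap); the function name is hypothetical.

```python
def line_rate_mpps(frame_bytes, link_bps=10e9, overhead_bytes=20):
    """Maximum packet rate, in Mpps, for a given Ethernet frame size."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire / 1e6

rate_64 = line_rate_mpps(64)        # about 14.88 Mpps
rate_1500 = line_rate_mpps(1500)    # about 0.82 Mpps
```

Under these assumptions, the 64-byte case is by far the most demanding one, with a per-packet budget of roughly 67 ns on a single core.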
The contribution of each optimization technique to the total
throughput is reported in Figure 5. The results are obtained
by selectively switching off one contribution at a time on
the accelerated version of OFSS running on a single core
and measuring the observed performance drop. Data are then
normalized to give a fair visualization of each term as
a stacked histogram. It is worth noticing that, in this case,
multi-core acceleration is not accounted for, since the
experiment was run on one core. The results show that the
generic I/O acceleration provided by PFQ and the zero-malloc
optimization have an almost equally beneficial impact on the
performance boost. For shorter packet sizes, the impact of the
other optimizations is far from negligible, as they contribute
up to 40% of the overall performance improvement.
OPP performance. The second set of experiments measures
the aOFSS performance when using OPP, including multiple
pipeline configurations. Figure 6 shows the throughput
achieved by aOFSS when using stateless OPP stages. The
system still hits line rate for packets of realistic sizes,
bigger than 128B. In the most critical case of the shortest
packet size, performance decreases linearly with the number of
stages, but still reaches well above 10 Mpps with 4 running
cores. Notice that the performance for one stage is comparable
to that of OpenFlow. In fact, a stateless OPP stage is
functionally equivalent to an OpenFlow table.
Figure 7 shows the performance for a pipeline of stateful
OPP stages. As expected, the performance decreases, with
line rate achieved for packet sizes bigger than 256B. The
degradation is more significant as the number of stages in-
creases. However, we remark that in our test we measured
a somewhat worst–case behavior. That is, every packet per-
forms a state update. As shown in Sec. 3, in real use cases
most of the packets perform just a state lookup, with only
few of them actually triggering a state modification.
Finally, we note that the above results provide a lower
bound for the previously described use cases, namely LAN/DMZ
isolation (1 stateful stage + 1 stateless stage), load
balancer (2 stateful stages + 1 stateless stage) and dynamic
NAT (4 stateful stages + 1 stateless stage). However, we did
not test the actual use cases’ rules. Furthermore, the tests
did not include the evaluation of operations on global
variables. Testing the use cases’ performance, including the
operations on global variables, would have required actual
traffic traces to be meaningful. Since we lacked significant
traffic traces for the presented use cases, we leave the
evaluation of the impact of such operations for future work.
5. DISCUSSION
In this section we discuss the advantages and the limits of
the OPP abstraction when using it for realizing scalable (vir-
tual) software network functions, report on lessons learned
and possible future work.
[Bar chart: throughput (Mpps) for Line Rate and for pipelines
of 1, 2, 3 and 4 stages, grouped by packet size (64, 128, 256,
512, 1024, 1500 bytes); series: 1, 2, 3 and 4 cores.]
Figure 6: Stateless OPP stages throughput
[Bar chart: throughput (Mpps) for Line Rate and for pipelines
of 1, 2, 3 and 4 stages, grouped by packet size (64, 128, 256,
512, 1024, 1500 bytes); series: 1, 2, 3 and 4 cores.]
Figure 7: Stateful OPP stages throughput
Abstraction flexibility. Apart from the presented use cases,
iptables allows a network administrator to express fairly
complex network functions. Using OPP, we implemented an almost
complete iptables interface and a number of network functions.
Therefore, we are quite confident that the OPP abstraction
can capture a large number of functions. Furthermore,
although not presented in the examples, we could also
implement functions that require packet generation
instructions, since our prototype integrates the InSP API [6].
Software scalability. The OPP machine model can be
efficiently implemented in software. While a hardware
implementation that provides line rate performance is
available for many abstractions, a scalable software
implementation builds on completely different requirements.
The concept of flow state greatly helps the scaling of the OPP
machine model in software, allowing the distribution of the
load to different cores, a feature we could straightforwardly
implement in the majority of the cases using standard software
acceleration libraries.
However, it is worth noticing that the access to the global
variables is likely to introduce an important performance
overhead in software implementations. Admittedly, we did not
test the performance impact of global variable operations. As
a general rule, a programmer should always try to express her
algorithms avoiding the use of global variables. Otherwise,
she should be well aware of these overheads. For example, in
the load balancer example of Sec. 3, a global variable is used
only on the first packet of a new flow. Hence, most of the
traffic does not incur the mentioned overheads. Conversely, a
traffic pattern with a large number of new flows starting in a
short time, as may be the case during a Denial of Service
(DoS) attack, could severely impact the overall system
forwarding performance. Nonetheless, we point out that such an
issue is not specific to our implementation, and is typical
for this kind of system [27]. In fact, a form of DoS detection
and mitigation function is usually deployed to protect the
system from such attacks. Incidentally, our iptables
implementation can deploy such a protecting network function,
on the very same pipeline, using the proper set of iptables
rules.
Software performance. Our software prototype does not
shine in terms of absolute performance numbers when compared
to the state of the art. However, our work did not aim
at improving the state of the art in software acceleration. In
fact, we selected a non-optimized software switch for our
prototype. Our focus was to study whether the selected
abstraction, OPP, was suitable for scaling in multi-core
systems, and why. Furthermore, we notice that our system
cannot be fairly compared to traditional software switches,
which apply mostly “stateless” processing. Systems such as
SoftFlow [23] are a fairer term of comparison.
Slow path. In addition to the rules translation, our iptables
implementation requires the controller to also perform some
helper tasks on the slow path. In a production environment,
we expect such tasks to be run by the different VNFs that use
OPP as a means to implement their fast paths.
Programming complexity. Programming functions in OPP
is hard, or at least not as straightforward as it could be
with a high-level language. We believe that approaches such as
Domino are very promising to address the issue. Introducing
support for specifying flow-level consistency requirements
in such languages is indeed an interesting area for future
research.
Abstraction maturity. While some things are done right in
OPP, others still have room for improvement. We provide
two examples. First, ALUs support the same operations on
global registers and on flow context registers. In hardware
implementations, the complexity of operations on global
registers is limited by the need to execute them atomically in
one clock cycle. With the flow state concept, such a
requirement is relaxed for operations on flow context
registers. Supporting more complex operations at the flow
level could open up the space for additional use cases.
Second, we had to use a full stage in the dynamic NAT example
to implement a memory stack. Such a function could probably be
better provided as a dedicated function in the abstraction,
e.g., implemented as an action. These two examples motivate
our intention to further refine the OPP abstraction in the
future.
FPGA implementation. The most resource-consuming FPGA
blocks of an OPP stage are those used for the TCAM
(EFSM table) and the hash table (flow context table). In
particular, the TCAM is generated using only the FPGA logic
blocks and flip-flops. With current FPGA technologies, we
expect to be technically limited to 32-64 TCAM entries per
OPP stage. Improvements could be achieved by using suitable
FPGA-tailored TCAM implementations, such as the one recently
presented in [25]. Including on-chip TCAM memories in
FPGA-based smart NICs is also a possible future option that
would drastically reduce the issue. Scaling the hash tables is
instead easier. Already today, the OPP prototype, based on the
latest generation NetFPGA board, could theoretically provide
up to 13 OPP stages, with 16k entries of 256b for each stage.
Furthermore, newer FPGAs, such as the Xilinx Virtex
UltraScale+ family [56], can provide much larger memory
blocks.
Implementation options. While we leveraged an FPGA-
based hardware implementation and provided a software im-
plementation, it would be interesting to verify how different
architectures would implement the OPP abstraction. For
instance, using NPUs, GPUs or even ASICs.
We also tried to describe the OPP pipeline using P4.
Unfortunately, as mentioned in Sec. 2, at the time of writing
the language could not capture the required consistency
models. In fact, in [47] the authors do not implement stateful
operations when compiling P4 to a software switch target.
6. RELATED WORK
We discussed stateful forwarding abstractions already in
Sec. 2. In this section, we briefly review related work on NFV,
the implementation of high-performance functions, related
abstractions and issues in supporting the virtualization and
sharing of hardware accelerators.
VNFs implementation. SoftFlow [23] is probably the closest
related work. In SoftFlow, Open vSwitch is extended to
integrate more flexible processing blocks called SoftFlow
actions, which can implement complex stateful functions.
Combining SoftFlow actions with an OpenFlow-like pipeline
of MATs enables a developer to implement arbitrary network
functions. However, each SoftFlow action is in fact a black
box, which consumes and produces packets, i.e., as if it were a
VM attached to an OpenFlow switch. Conversely, in our case
network functions are entirely programmed using a white box
approach based on the EFSM abstraction. In fact, SoftFlow
can only offload packet classification to hardware NICs, and
just for packets entering the system, while we can potentially
offload both packet classification and stateful operations to
the smart NIC.
Click [37] adopts a model in which arbitrary functional
blocks, called elements, can be composed into graphs to
implement a network function. ClickNP [31] uses Click’s
abstraction but adds the possibility to implement some
elements as hardware functions to be run on, e.g., a smart
NIC. However, in SoftFlow, Click and ClickNP, implementing
actions or elements is still a complex task, which has to be
performed if there are no pre-implemented modules that meet
the developer’s need. To address this issue, NetBricks [41]
defines as its abstraction a set of more fine-grained
primitives that, combined, can describe a large number of
software network functions. The primitives’ implementation is
optimized, therefore functions expressed using the NetBricks
abstraction provide high performance. In this sense, our
approach is similar, since we adopt the set of fine-grained
MAT-based OPP functions to describe network functions. Still,
the approaches differ in flexibility and hardware support.
NetBricks is more flexible and expressive, but targets pure
software functions. Our OPP-based solution can express only
network functions that deal with packet headers, but provides
full hardware support.
Hardware acceleration. Using hardware acceleration to
meet performance requirements has been explored in the
past for specific cases, e.g., firewall implementation [2].
Recently, the increase in network speed, combined with the
NFV trend, is fueling a new stream of research on exploiting
the hardware acceleration features provided by NICs. For
example, Dragonet [50] is an OS network protocol stack
that takes into account the NIC’s hardware capabilities for the
implementation of network protocols. Our work goes in a
similar direction, exploring a different part of the solution
space: we select a less flexible abstraction that captures only
a subset of the NIC’s hardware capabilities. Nonetheless,
such an abstraction is easier to leverage for the programming of
network functions and has readily available implementations.
FlexNIC [28] envisions the support of RMT in future NICs,
providing a way to execute the RMT-based processing while
exchanging packets between the NIC and the host’s memory.
We consider such work orthogonal to our contribution, since
the NIC could use an OPP-like processing model instead of
one based on P4.
Software acceleration. An extensive comparison of
software-accelerated packet capture techniques can be found in
[11, 36, 17]. Relevant software-accelerated engines are PF_RING [16],
PF_RING ZC (Zero Copy) [14], Netmap [44], DPDK [21]
and PFQ [7]. PF_RING ZC, Netmap and DPDK bypass the
operating system by memory-mapping the NIC ring descriptors
into user space, allowing even a single CPU to receive
64-byte packets at full 10 Gbps line rate. In addition,
DPDK provides a set of libraries for fast packet processing
on multi-core architectures under Linux. Netmap and DPDK
have been successfully used to accelerate software switches, as in
the case of the VALE switch [46] and mSwitch [20] (Netmap),
and CuckooSwitch [57] and DPDK vSwitch [22] (DPDK).
Netmap was also used to accelerate packet forwarding in
Click [45]. PFQ, instead, relies on vanilla device drivers and
leverages different levels of parallelism to accelerate packet
I/O. In addition, PFQ is equipped with a native functional
language to program in-kernel early-stage packet processing [8].
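As a worked example of what 10 Gbps line rate means for minimum-size frames, the short calculation below derives the packet rate and the per-packet time budget. It assumes standard Ethernet framing, where each frame carries 20 additional bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap); the function name is ours.

```python
# Per-frame overhead on the wire for standard Ethernet:
# 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def max_packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Upper bound on the frame rate of a link for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


rate = max_packets_per_second(10e9, 64)
budget_ns = 1e9 / rate
print(f"{rate / 1e6:.2f} Mpps")         # 14.88 Mpps
print(f"{budget_ns:.1f} ns per packet")  # 67.2 ns per packet
```

A single CPU that sustains line rate with 64-byte packets therefore has roughly 67 ns to receive and process each packet.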
7. CONCLUSION
In this paper, we demonstrated that an abstraction that
defines flow-level states can be efficiently implemented in
software. Furthermore, for functions that work only with
flow-level states, such an implementation can be easily scaled to
run on multi-core systems.
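The intuition behind this scaling property can be sketched as follows: if every packet of a flow is dispatched to the same core, each core owns the state of its flows exclusively and no locking is needed. The code is a simplified illustration under that assumption, not the dispatching logic of our implementation; `core_for_flow`, `Worker` and `dispatch` are hypothetical names, and the workers are simulated sequentially rather than run on separate cores.

```python
import zlib
from typing import Dict, List, Tuple

# Flow key: source IP, destination IP, source port, destination port, protocol.
FlowKey = Tuple[str, str, int, int, int]


def core_for_flow(flow: FlowKey, num_cores: int) -> int:
    """Deterministically map a flow to a core by hashing its key."""
    digest = zlib.crc32(repr(flow).encode())
    return digest % num_cores


class Worker:
    """Stands in for one core; its flow state is private to it."""

    def __init__(self) -> None:
        self.flow_state: Dict[FlowKey, int] = {}

    def handle(self, flow: FlowKey) -> None:
        # Toy per-flow state: a packet counter.
        self.flow_state[flow] = self.flow_state.get(flow, 0) + 1


def dispatch(packets: List[FlowKey], workers: List[Worker]) -> None:
    """Send each packet to the worker that owns its flow."""
    for flow in packets:
        workers[core_for_flow(flow, len(workers))].handle(flow)


workers = [Worker() for _ in range(4)]
flow_a = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)
flow_b = ("10.0.0.3", "10.0.0.2", 5678, 443, 6)
dispatch([flow_a, flow_b, flow_a, flow_a], workers)
```

After the call, the counter of `flow_a` lives in exactly one worker and equals 3. State shared across flows would break this ownership property, which is why the claim is restricted to functions that use only flow-level states.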
The OPP software implementation is available at
github.com/beba-eu/beba-switch. We plan to upstream the
performance enhancements to OFSoftSwitch. A subset of the
OPP API (mainly OpenState) is currently under discussion
for inclusion in OpenFlow v1.6.
8. REFERENCES
[1] OFSoftSwitch.
https://github.com/CPqD/ofsoftswitch13.
[2] K. Accardi, T. Bock, F. Hady, and J. Krueger. Network
processor acceleration for a linux* netfilter firewall. In
Architecture for networking and communications
systems, 2005. ANCS 2005. Symposium on, ACM
ANCS 2005, pages 115–123, 10 2005.
[3] AT&T, BT, CenturyLink, China Mobile, Colt,
Deutsche Telekom, KDDI, NTT, Orange, Telecom
Italia, Telefonica, Telstra, and Verizon. Network
function virtualization - white paper.
http://www.tid.es/es/Documents/NFV_White_PaperV2.pdf.
[4] G. Bianchi, M. Bonola, A. Capone, and C. Cascone.
Openstate: Programming platform-independent stateful
openflow applications inside the switch. ACM
SIGCOMM CCR, 44(2):44–51, 4 2014.
[5] G. Bianchi, M. Bonola, D. Sanvito, C. Cascone,
G. Katsikas, D. Kostic, L. Polcak, and R. Bifulco.
Extended BEBA abstraction API. BEBA project
deliverable D2.3, 2016.
[6] R. Bifulco, J. Boite, M. Bouet, and F. Schneider.
Improving sdn with inspired switches. In Proceedings
of the Symposium on SDN Research, SOSR ’16, pages
11:1–11:12, New York, NY, USA, 2016. ACM.
[7] N. Bonelli, S. Giordano, and G. Procissi. Network
traffic processing with pfq. IEEE Journal on Selected
Areas in Communications, 34(6):1819–1833, 6 2016.
[8] N. Bonelli, S. Giordano, G. Procissi, and L. Abeni. A
purely functional approach to packet processing. In
Proceedings of the Tenth ACM/IEEE Symposium on
Architectures for Networking and Communications
Systems, ACM ANCS ’14, pages 219–230. ACM, 2014.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown,
J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat,
G. Varghese, et al. P4: Programming
protocol-independent packet processors. ACM
SIGCOMM CCR, 44(3):87–95, 2014.
[10] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese,
N. McKeown, M. Izzard, F. Mujica, and M. Horowitz.
Forwarding metamorphosis: Fast programmable
match-action processing in hardware for SDN. In ACM
SIGCOMM ’13, pages 99–110.
ACM, 2013.
[11] L. Braun et al. Comparing and improving current
packet capturing solutions based on commodity
hardware. In ACM IMC ’10, IMC 2010, pages 206–217.
ACM, 2010.
[12] Broadcom. Broadcom Announces Scalable 25/50G
Ethernet Controller Product Family.
https://www.broadcom.com/press/release.php?id=s923886, 7 2015.
[13] K. T. Cheng and A. S. Krishnakumar. Automatic
functional test generation using the extended finite state
machine model. In ACM DAC, ACM DAC 1993, pages
86–91. ACM, 1993.
[14] L. Deri. PF_RING ZC (zero copy).
http://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/.
[15] ETSI. Network functions virtualisation (NFV);
acceleration technologies; VNF interfaces specification,
3 2016.
[16] F. Fusco and L. Deri. High speed network traffic
analysis with commodity multi-core systems. In Proc.
of IMC ’10, ACM IMC 2010, pages 218–224. ACM,
2010.
[17] S. Gallenmüller, P. Emmerich, F. Wohlfart, D. Raumer,
and G. Carle. Comparison of frameworks for
high-performance packet io. In Proceedings of the
Eleventh ACM/IEEE Symposium on Architectures for
Networking and Communications Systems, IEEE
ANCS ’15, pages 29–38. IEEE Computer Society,
2015.
[18] R. Gandhi, Y. C. Hu, and M. Zhang. Yoda: A highly
available layer-7 load balancer. In ACM EuroSys ’16,
pages 21:1–21:16. ACM, 2016.
[19] G. Gibb, G. Varghese, M. Horowitz, and N. McKeown.
Design principles for packet parsers. In ACM/IEEE
ANCS ’13.
[20] M. Honda, F. Huici, G. Lettieri, and L. Rizzo. mswitch:
A highly-scalable, modular software switch. In
Proceedings of the 1st ACM SIGCOMM Symposium on
Software Defined Networking Research, ACM SOSR
’15, pages 1:1–1:13. ACM, 2015.
[21] Intel. Intel DPDK: Data Plane Development Kit.
http://dpdk.org, 7 2016.
[22] Intel Corporation. Packet Processing. Intel DPDK
vSwitch - OVS.
https://github.com/01org/dpdk-ovs, 6
2015.
[23] E. J. Jackson, M. Walls, A. Panda, J. Pettit, B. Pfaff,
J. Rajahalme, T. Koponen, and S. Shenker. Softflow: A
middlebox architecture for open vswitch. In 2016
USENIX Annual Technical Conference (USENIX ATC
16), USENIX ATC’16, pages 15–28. USENIX
Association, 6 2016.
[24] B. Jean-Louis. Using block RAM for high performance
read/write TCAMs, 2012. Xilinx XAPP204.
[25] W. Jiang. Scalable ternary content addressable memory
implementation using fpgas. In Architectures for
Networking and Communications Systems (ANCS),
2013 ACM/IEEE Symposium on, pages 71–82, Oct
2013.
[26] L. Jose, L. Yan, G. Varghese, and N. McKeown.
Compiling packet programs to reconfigurable switches.
In 12th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 15), USENIX NSDI
’15, pages 103–115. USENIX Association, 5 2015.
[27] F. Kargl, J. Maier, and M. Weber. Protecting web
servers from distributed denial of service attacks. In
Proc. 10th International Conference on World Wide
Web, WWW ’01, pages 514–524, New York, NY, USA,
2001. ACM.
[28] A. Kaufmann, S. Peter, T. Anderson, and
A. Krishnamurthy. FlexNIC: rethinking network DMA.
In USENIX HotOS, USENIX HotOS 2015, 2015.
[29] S. Keshav and R. Sharma. Issues and trends in router
design. IEEE Communications Magazine,
36(5):144–151, 5 1998.
[30] T. Koponen, K. Amidon, P. Balland, M. Casado,
A. Chanda, B. Fulton, I. Ganichev, J. Gross, P. Ingram,
E. Jackson, A. Lambeth, R. Lenglet, S.-H. Li,
A. Padmanabhan, J. Pettit, B. Pfaff, R. Ramanathan,
S. Shenker, A. Shieh, J. Stribling, P. Thakkar,
D. Wendlandt, A. Yip, and R. Zhang. Network
virtualization in multi-tenant datacenters. In 11th
USENIX Symposium on Networked Systems Design and
Implementation (NSDI 14), USENIX NSDI’14, pages
203–216. USENIX Association, 4 2014.
[31] B. Li, K. Tan, L. L. Luo, Y. Peng, R. Luo, N. Xu,
Y. Xiong, and P. Cheng. Clicknp: Highly flexible and
high-performance network processing with
reconfigurable hardware. In ACM SIGCOMM ’16.
[32] B. Marco, B. Valerio, S. Davide, P. Salvatore,
B. Giuseppe, C. Carmelo, C. Antonio, P. Viktor,
P. Libor, and B. Pavel. Extended BEBA abstraction
proof of concept prototype. BEBA project deliverable
D2.4, 2016.
[33] J. Martins, M. Ahmed, C. Raiciu, V. Olteanu,
M. Honda, R. Bifulco, and F. Huici. Clickos and the art
of network function virtualization. In Proceedings of
the 11th USENIX Conference on Networked Systems
Design and Implementation, USENIX NSDI’14, pages
459–473. USENIX Association, 2014.
[34] N. McKeown, T. Anderson, H. Balakrishnan,
G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and
J. Turner. Openflow: Enabling innovation in campus
networks. ACM SIGCOMM CCR, 38(2):69–74, 3 2008.
[35] P. L. Montgomery. Modular multiplication without trial
division. Mathematics of computation,
44(170):519–521, 1985.
[36] V. Moreno, J. Ramos, P. Santiago del Rio,
J. Garcia-Dorado, F. Gomez-Arribas, and J. Aracil.
Commodity packet capture engines: Tutorial, cookbook
and applicability. Communications Surveys Tutorials,
IEEE, 17(3):1364–1390, 8 2015.
[37] R. Morris, E. Kohler, J. Jannotti, and M. F. Kaashoek.
The click modular router. In ACM Trans. on Computer
Systems, 2000.
[38] M. Moshref, A. Bhargava, A. Gupta, M. Yu, and
R. Govindan. Flow-level state transition as a new
switch primitive for sdn. In Proceedings of the Third
Workshop on Hot Topics in Software Defined
Networking, ACM HotSDN ’14, pages 61–66. ACM,
2014.
[39] Netronome. Agilio™ CX 2x40GbE intelligent server
adapter.
https://www.netronome.com/media/redactor_files/PB_Agilio_CX_2x40GbE.pdf.
[40] openNFP. openNFP. http://open-nfp.org/, 6
2016.
[41] A. Panda, S. Han, K. Jang, M. Walls, S. Ratnasamy,
and S. Shenker. Netbricks: Taking the v out of nfv. In
12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 16), USENIX OSDI’16.
USENIX Association, 2016.
[42] L. Peterson, A. Al-Shabibi, T. Anshutz, S. Baker,
A. Bavier, S. Das, J. Hart, G. Parulkar, and W. Snow.
Central office re-architected as a data center. IEEE
Communications Magazine, 54(10):96–101, 10 2016.
[43] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou,
J. Rajahalme, J. Gross, A. Wang, J. Stringer, P. Shelar,
K. Amidon, and M. Casado. The design and
implementation of Open vSwitch. In USENIX NSDI ’15,
pages 117–130. USENIX
Association, 5 2015.
[44] L. Rizzo. Netmap: a novel framework for fast packet
i/o. In Proc. of USENIX ATC’2012, USENIX ATC’12,
pages 1–12. USENIX Association, 2012.
[45] L. Rizzo, M. Carbone, and G. Catalli. Transparent
acceleration of software packet forwarding using
netmap. In INFOCOM, 2012 Proceedings IEEE, IEEE
INFOCOM 2012, pages 2471–2479, 3 2012.
[46] L. Rizzo and G. Lettieri. Vale, a switched ethernet for
virtual machines. In Proceedings of the 8th
International Conference on Emerging Networking
Experiments and Technologies, ACM CoNEXT ’12,
pages 61–72. ACM, 2012.
[47] M. Shahbaz, S. Choi, B. Pfaff, C. Kim, N. Feamster,
N. McKeown, and J. Rexford. Pisces: A programmable,
protocol-independent software switch. In ACM
SIGCOMM ’16, 2016.
[48] M. Shahbaz and N. Feamster. The case for an
intermediate representation for programmable data
planes. In ACM SIGCOMM SOSR ’15.
[49] R. Sherwood, G. Gibb, K.-K. Yap, G. Appenzeller,
M. Casado, N. McKeown, and G. Parulkar. Can the
production network be the testbed? In USENIX OSDI
’10, USENIX OSDI ’10, pages 365–378. USENIX
Association, 2010.
[50] P. Shinde, A. Kaufmann, T. Roscoe, and S. Kaestle. We
need to talk about nics. In Proceedings of the 14th
USENIX Conference on Hot Topics in Operating
Systems, USENIX HotOS’13, pages 1–1. USENIX
Association, 2013.
[51] A. Sivaraman, A. Cheung, M. Budiu, C. Kim,
M. Alizadeh, H. Balakrishnan, G. Varghese,
N. McKeown, and S. Licking. Packet transactions:
High-level programming for line-rate switches. In ACM
SIGCOMM ’16, pages 15–28.
ACM, 2016.
[52] H. Song. Protocol-oblivious forwarding: Unleash the
power of sdn through a future-proof forwarding plane.
In ACM SIGCOMM HotSDN ’13.
[53] The P4 Language Consortium. The P4 Language
Specification, version 1.1.0.
http://p4.org/wp-content/uploads/2016/03/p4_v1.1.pdf, January 2016.
[54] The P4 Language Consortium. The P4 Language
Specification, version 1.0.2.
http://p4.org/wp-content/uploads/2015/04/p4-latest.pdf, March 2015.
[55] Z. Ullah, M. Jaiswal, Y. Chan, and R. Cheung. FPGA
Implementation of SRAM-based Ternary Content
Addressable Memory. In IEEE 26th International
Parallel and Distributed Processing Symposium
Workshops & PhD Forum (IPDPSW), IEEE IPDPSW
2012, 2012.
[56] Xilinx. UltraScale Architecture and Product Overview.
https://www.xilinx.com/support/documentation/data_sheets/ds890-ultrascale-overview.pdf, 2016.
[57] D. Zhou, B. Fan, H. Lim, M. Kaminsky, and D. G.
Andersen. Scalable, High Performance Ethernet
Forwarding with CuckooSwitch. In Proceedings of the
Ninth ACM Conference on Emerging Networking
Experiments and Technologies, ACM CoNEXT ’13,
pages 97–108. ACM, 2013.
[58] N. Zilberman, Y. Audzevich, G. A. Covington, and
A. W. Moore. Netfpga sume: Toward 100 gbps as
research commodity. IEEE Micro ’14, 34(5):32–41,
2014.
[59] N. Zilberman, P. M. Watts, C. Rotsos, and A. W. Moore.
Reconfigurable network systems and software-defined
networking. Proceedings of the IEEE,
103(7):1102–1124, 7 2015.
