Network function virtualization allows network functions to be implemented and managed flexibly as a service function chain (SFC) in the data plane to process flows. However, software-based SFCs lead to poor performance compared to proprietary middleboxes. Moreover, existing solutions tackling performance issues suffer from the development complexity incurred by hardware details. To address these problems, we leverage both the high performance of P4-capable devices and the high flexibility of the P4 language. In this paper, we present a P4 Service Chaining framework (P4SC), which tackles multiple challenges for P4 to support the SFC implementation. P4SC provides a suite of primitives allowing efficient SFC expression, and a converter and a generator converting input SFC requests to the corresponding P4 program. Here, an algorithm based on longest common subsequence (LCS) is used to allow multiple SFCs to be implemented simultaneously. Moreover, P4SC offers a runtime manager for flexible SFC management at runtime. It also provides an automatic integration mechanism to integrate P4-based NFs into P4SC. We implement a P4SC prototype, which supports three types of P4-capable devices. The experimental results show that P4SC outperforms the state of the art with orders-of-magnitude SFC performance improvement while maintaining high flexibility.
The associate editor coordinating the review of this manuscript and approving it for publication was Honglong Chen.
significantly alleviates hardware expenses. As of today, SFCs have been widely adopted in production networks, such as mobile networks [4] and data center networks [5].
However, software-based NFs and SFCs suffer from poor performance [6]-[8], which impedes their usability in applications with tight performance requirements (e.g., distributed memory caches [8]). For example, Ananta software Muxes introduce a latency from 200µs to 1ms [6], [8], while DPDK [9] incurs a packet processing latency of hundreds of microseconds in the worst case [10]. On the other hand, recent solutions exploit host-based hardware devices, such as GPU [11] or FPGA [12], to accelerate SFCs. However, such techniques require domain knowledge of the underlying hardware architecture [12], while the development work is burdensome and prone to bugs. In short, no existing solution provides high performance and high flexibility simultaneously.
To achieve both high performance and high flexibility, we explore P4-based programmable networks to implement SFCs. P4 [13] is a programming language that enables operators to define the packet processing logic of P4-capable devices. Although P4 appears to have the potential to offer both high performance and high flexibility, there are four challenges when realizing SFCs atop P4: (i) describing SFCs in native P4 has to be done manually, so high development complexity remains; (ii) constructing multiple SFCs in one P4 program requires resolving NF dependency conflicts, which brings additional development effort; (iii) the variety of device control APIs makes it complicated for operators to manage SFCs at runtime; and (iv) dynamically integrating new P4-based NFs should not interfere with existing NFs, which requires considerable time and effort.
In this paper, we introduce the design and development of a P4 Service Chaining framework (P4SC). P4SC addresses the above four challenges. Specifically, P4SC includes: (i) a suite of primitives to describe SFCs in the form of SFC implementation requests, each of which specifies the composition of NFs and the SFC structure; (ii) a converter and a generator to convert input requests to the corresponding P4 program, and an algorithm based on longest common subsequence (LCS) [14] to enable the simultaneous implementation of multiple SFCs; (iii) a runtime manager to provide a suite of control commands that shield device heterogeneity and enable convenient SFC management; and (iv) an automatic integration mechanism to integrate P4-based NFs into P4SC without conflicts.
We implement a P4SC prototype, which supports three types of commonly-used P4-capable devices. We build six real-world SFCs to evaluate P4SC with comparisons to software-based SFCs and DPDK-based SFCs. The experimental results show that P4SC outperforms software-based SFCs with orders of magnitude performance improvement and reduces the per-packet processing latency by up to 92.30% compared to DPDK-based SFCs, meanwhile maintaining high flexibility.
In summary, our contributions are six-fold:
• We identify four challenges of implementing SFCs on P4-capable devices. To tackle these challenges, we propose P4SC, a high-performance and flexible NFV framework, that enables and eases the implementation of SFCs on P4-capable devices. P4SC is the first work that extends the capability of P4-capable devices to support SFCs. It also encourages more development and use of SFCs on P4-capable devices.
• We design a suite of P4SC primitives for operators to describe SFCs in an intuitive way without domain knowledge and substrate details.
• We design both a converter and a generator in P4SC to convert input SFC implementation requests to corresponding P4 programs and support simultaneous implementation of multiple SFCs with correctness preserved.
• We design a runtime manager in P4SC to reduce the burdens of managing SFCs at runtime.
• We design an automatic integration mechanism in P4SC to support the integration of new NFs without conflicts.
• We implement a P4SC prototype and perform extensive experiments to evaluate it. The experimental results show that P4SC provides orders-of-magnitude performance improvement compared to the state of the art.
The remainder of this paper is organized as follows. We give an overview of the background of SFC and P4 in Section II. The design of P4SC is articulated in Section III, which includes an overview and all the components. The implementation details and evaluations are presented in Sections IV and V, respectively. We present some discussions in Section VI, followed by a summary of related work in Section VII. The paper is concluded in Section VIII.
II. BACKGROUND AND DESIGN CHALLENGES
In this section, we present the background of SFC and the procedure of implementing SFCs on the P4-capable device. We then elaborate the design challenges. 
A. SERVICE FUNCTION CHAIN (SFC)
In NFV, NFs (a.k.a. service functions) are often chained according to a given packet processing order. Such an NF chain is referred to as a Service Function Chain (SFC), which enables high-level creation and composition of NFs and applies value-added services to selected flows [3]. An SFC contains service function paths (SFPs), each of which traverses one or more SFC components, including the SFC classifier, NFs, and service function forwarders (SFFs). Fig. 1 shows a typical SFC, where three NFs and two SFPs are available. At the ''Begin'', the SFC classifier tags each incoming packet based on classification rules by appending metadata fields (tags) to the packet. The tags enable the information exchange between different NFs. They also allow SFFs to forward packets to the corresponding SFPs. The figure shows that SFFs direct two flows to the NFs on two SFPs, respectively. The flows along SFP1 will be processed by the IDS and the Firewall, while the flows along SFP2 will be processed by the Load Balancer (LB). The NFs on an SFP perform actions on received packets after parsing both packet headers and the historical processing information recorded in tags. For example, the Firewall checks packet 5-tuples and drops malicious packets. Moreover, the forwarding rules at SFFs are dynamically configured and controlled by a controller at runtime so that SFPs can be dynamically selected. Fig. 1 also shows that to implement SFCs on the P4-capable device, operators are supposed to describe both SFPs and complete SFC components (i.e., the SFC classifier, NFs, and SFFs), and enable SFC management at runtime.
B. IMPLEMENTING SFCS ON THE P4-CAPABLE DEVICE
P4 [13] is a domain-specific language that empowers operators to customize the packet processing pipelines of P4-capable devices. In a P4 program, operators are able to develop their own network protocols using headers and parsers. Moreover, they can use P4 to express NFs, such as in-network cache [15] and load balancer [16]. Specifically, operators conduct a three-phase procedure to deploy an NF on the P4-capable device. (i) In the first phase (i.e., the development phase), operators declare several match-action tables, each of which matches packets by reading specific header fields and performs actions on matched packets. They invoke P4 tables via control flows to indicate the processing order among P4 tables. (ii) In the second phase (i.e., the compile phase), the P4 compiler takes the P4 program as input and produces device configurations to deploy the P4-capable device. (iii) In the third phase (i.e., the runtime phase), operators dynamically modify entries recorded in P4 tables by invoking control APIs to deploy NF policies on the P4-capable device. Now, to implement SFCs, operators should describe the SFC classifier and SFPs in a P4 program in addition to NFs. To describe the SFC classifier, operators can define a P4 table that matches specific packet fields, e.g., the source IP address, to classify packets. According to the match results, this table invokes a compound action to tag matched packets by changing the value of a specific metadata field. Then, they can invoke the P4 tables used to realize NFs and specify the execution order between P4 tables in P4 control flows. Note that SFFs are not required because packets are automatically delivered to NFs according to the SFPs defined in P4 control flows. Furthermore, operators are able to manage SFCs at runtime by dynamically selecting SFPs and deploying NF policies via the control APIs exposed by P4-capable devices.
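The classifier-plus-control-flow idea described above can be sketched in a few lines of Python. This is only an illustrative model, not P4 code: the `Packet` class, the `sfc_tag` metadata key, and the classifier entries are hypothetical, and the two paths mirror SFP1 (IDS, Firewall) and SFP2 (LB) from Fig. 1.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    src_ip: str
    meta: dict = field(default_factory=dict)   # stands in for P4 metadata
    trace: list = field(default_factory=list)  # NFs that processed the packet

# Classifier table: exact match on the source IP address -> tag value.
# The entries are hypothetical and would be installed at runtime.
CLASSIFIER = {"10.0.0.1": 1, "10.0.0.2": 2}

def classify(pkt):
    """Tag the packet by writing a metadata field (0 means 'no SFP')."""
    pkt.meta["sfc_tag"] = CLASSIFIER.get(pkt.src_ip, 0)

def make_nf(name):
    """Return a stand-in NF that only records that it saw the packet."""
    def apply(pkt):
        pkt.trace.append(name)
    return apply

ids, firewall, lb = make_nf("IDS"), make_nf("Firewall"), make_nf("LB")

def ingress(pkt):
    """Control flow: the tag selects the SFP, so no SFF is needed."""
    classify(pkt)
    tag = pkt.meta["sfc_tag"]
    if tag == 1:        # SFP1: IDS then Firewall
        ids(pkt)
        firewall(pkt)
    elif tag == 2:      # SFP2: Load Balancer
        lb(pkt)
    return pkt
```

Because the execution order is fixed by the branches of `ingress`, selecting an SFP at runtime reduces to changing classifier entries.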
C. DESIGN CHALLENGES
When exploiting the high performance and programmability of P4-capable devices for SFCs, the design and implementation of P4SC encounter challenges at each of the three aforementioned phases, because native P4 lacks the following critical features, which need to be provided by P4SC.
1) EXPRESSIVE AND SIMPLE SFC DESCRIPTION
A flexible SFC implementation requires a simple but expressive method to describe SFCs to minimize development efforts. However, the native P4 language only offers a straightforward method of describing SFCs manually by composing P4 tables and sequentially assigning positions to NFs in the P4 program. This method introduces additional development overheads. Therefore, an expressive and simple approach to describing SFCs is needed. In response, we design a suite of primitives to describe SFCs efficiently, while shielding unnecessary details (Section III-B1).
2) EFFECTIVE AND EFFICIENT CONVERSION MECHANISM
There is often a need to implement multiple SFCs jointly on the same P4-capable device. For example, in the scenario of telecom clouds, operators can maximize the utilization rate of network resources by implementing multiple SFCs in parallel [17]. Thus, the conversion mechanism of P4SC should precisely represent multiple SFCs in the output P4 program. However, this mechanism may violate the P4 grammar due to the NF dependency conflicts between different SFCs. Therefore, care must be taken in the design of the conversion mechanism. To this end, we design both a converter and a generator to correctly represent SFPs and SFC components in the output P4 program. In particular, we design an LCS-based algorithm to efficiently merge SFCs under the premise of the P4 grammar, to enable the simultaneous implementation of multiple SFCs (Sections III-B2 and III-C).
3) UNIVERSAL AND FLEXIBLE SFC MANAGEMENT
At runtime, operators need to select SFPs to process packets and update NF rules based on SFC policies. However, different P4-capable devices vary in control APIs, which leads to inflexible SFC management. Moreover, these control APIs are coupled with P4 program details, which brings management difficulties. Therefore, we are challenged to provide universal and flexible SFC management at runtime. In response, we design several universal control commands to provide flexible SFC management without involving the heterogeneity of underlying devices (Section III-D).
4) CONFLICT-FREE NF INTEGRATION
According to dynamic application requirements, operators may need to integrate a new NF into the SFC to offer a specific service function. However, since newly imported NFs may have conflicts (e.g., the parsing conflict) with existing NFs, we are challenged to orchestrate NFs in one P4 program to integrate the new NF into the SFC. In response, we design an automatic integration mechanism to help operators conveniently integrate their new NFs into P4SC with high flexibility in development (Section III-E).
III. DESIGN OF P4SC
In this section, we present a system-wide overview of P4SC and the design of its three main components. The three components are the converter, the generator, and the runtime manager, respectively. In addition, an automatic integration mechanism is designed to assist the integration of P4-based NFs into P4SC.
A. P4SC OVERVIEW
Fig. 2 presents an overview of the P4SC framework. The upper part of the figure shows the three phases of P4SC at work, while the lower part is an example of an SFC. First, P4SC provides a suite of primitives for operators to describe SFCs in the form of SFC implementation requests. Each request indicates a specific SFC, which includes SFPs and SFC components. Second, the converter acquires SFPs and SFC components from the requests, and the generator represents these features in the output P4 program under the premise of the P4 grammar. The P4 program will be deployed to the P4-capable device. Third, operators can use the runtime manager to control SFCs and conduct required reconfigurations at runtime. In addition, the architecture components of P4SC are shown in Fig. 3, and with it, we give brief overviews of each component below. The further details of these components are introduced in the subsequent subsections, respectively.
Converter (Section III-B). The converter provides a suite of primitives for operators to describe SFCs without involving complex details. After receiving the SFC implementation requests from operators, the converter extracts SFPs and SFC components from input requests and represents them in directed acyclic graphs (DAGs). The converter then transforms DAGs into the intermediate representation (IR), which is a straightforward representation of SFCs. To observe the P4 grammar during the workflow of the converter, we design an LCS-based algorithm that introduces a small and acceptable number of duplicate P4 tables to merge DAGs to IR.
Generator (Section III-C). The generator is responsible for producing the P4 program based on IR. In P4SC, each P4-based NF is maintained in a P4SC block, which contains the NF name, P4 source codes, and the names of P4 control flows. The generator takes IR as input and extracts related P4 source codes from the P4SC blocks indicated by the NF nodes of IR. Thereafter, it composes the P4 control flows based on extracted P4 source codes, and produces the output P4 program to deploy the P4-capable device.
Runtime manager (Section III-D). The runtime manager is responsible for enabling operators to manage the SFCs running on P4-capable devices at runtime. To enable high flexibility, we design several unified control commands in the runtime manager. These high-level control commands shield device and infrastructure heterogeneity. By invoking these commands, operators can efficiently manage NFs and SFCs running on P4-capable devices at runtime.
Automatic integration mechanism (Section III-E). The automatic integration mechanism is designed to help operators import new P4-based NFs to P4SC. This mechanism automatically resolves conflicts between a newly imported NF and the NFs that already exist in P4SC. By means of this mechanism, P4SC significantly mitigates the burdens of integrating new NFs.
B. CONVERTER DESIGN
The converter provides a suite of primitives used to construct SFC implementation requests, each of which describes the features of a specific SFC. The converter converts each request to a DAG. It uses an LCS-based algorithm to merge these DAGs to IR, and delivers IR to the generator. 
1) CONVERTING SFC IMPLEMENTATION REQUESTS TO DAGS
When implementing SFCs, operators need to describe SFPs and SFC components to construct SFCs. In response, we design a suite of primitives in the converter, which are presented in Table 1. These primitives are used to describe SFCs in SFC implementation requests, each of which corresponds to a specific SFC. With these primitives, operators can describe various kinds of SFCs, such as an SFC with multiple endpoints. For example, to describe an SFC with multiple endpoints, operators can use the primitive ''NF1 before NF2'' to define the execution order between different NFs, while exploiting the primitive ''NF1 then NF2 or NF3'' to describe SFPs.
The converter supports the implementation of arbitrary SFCs, including DAG-based SFCs and non-DAG SFCs. The structure of a DAG-based SFC is a DAG, while that of a non-DAG SFC is a directed graph with cycles. For DAG-based SFCs, the converter directly converts them to the corresponding DAGs. However, non-DAG SFCs are prohibited by the P4 grammar in order to guarantee line-rate performance. In response, the converter first converts non-DAG SFCs to DAG-based SFCs, and then transforms the DAG-based SFCs into DAGs. Specifically, we summarize non-DAG SFCs in two forms: (1) an NF appears multiple times in an SFC, or (2) the SFC has loop conditions. In the former scenario, the converter requires operators to rename an NF that has already been invoked by appending a serial number when writing requests. For example, assume an NF named ''Firewall'' has been invoked two times. Operators need to use the serial number ''3'' to invoke this NF again in an SFC implementation request: ''Firewall_3''. In the latter scenario, the converter provides the primitive ''NF1 loop'' for handling loop conditions, and uses a node attribute to indicate the start of a loop. By this means, the converter converts non-DAG SFCs to DAG-based SFCs with the P4 grammar observed.
Thereafter, the converter allocates a unique SFC ID to every request, and creates an NF node for each NF in a request. An NF node is associated with several attributes: the NF name; the node length, which is equal to the number of P4 tables occupied by this NF; an SFC ID array used to identify the DAGs that utilize this node; and a pointer list used to connect to other NF nodes. Meanwhile, the converter strips the serial numbers from NF names and produces the DAG by connecting NF nodes according to the order defined in the input request.
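A minimal sketch of the NF node and DAG construction described above is given below, in Python. The `NFNode` field names, the `build_dag` helper, and the edge-list input format are assumptions made for illustration; the sketch also assumes that serial numbers take the form of a trailing ''_<number>'' suffix, as in ''Firewall_3''.

```python
import re
from dataclasses import dataclass, field

@dataclass
class NFNode:
    name: str                                     # NF name without serial number
    length: int                                   # number of P4 tables occupied
    sfc_ids: list = field(default_factory=list)   # DAGs that use this node
    next: list = field(default_factory=list)      # pointers to successor nodes

def strip_serial(nf_name):
    """Drop a trailing serial number such as '_3' (assumed naming scheme)."""
    return re.sub(r"_\d+$", "", nf_name)

def build_dag(sfc_id, edges, table_counts):
    """Build one DAG from (src, dst) pairs taken from a request.

    Nodes are keyed by the name used in the request, so a repeated
    invocation such as 'Firewall_2' becomes a distinct node whose
    `name` attribute is still 'Firewall'.
    """
    nodes = {}
    for src, dst in edges:
        for raw in (src, dst):
            if raw not in nodes:
                base = strip_serial(raw)
                nodes[raw] = NFNode(base, table_counts[base], [sfc_id])
        nodes[src].next.append(nodes[dst])
    return nodes
```

Keeping the request-level name as the dictionary key while storing the stripped name in the node is one simple way to let an NF appear more than once without creating a cycle.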
2) MERGING DAGS TO IR
P4SC is supposed to provide the ability of simultaneously implementing multiple SFCs on the same P4-capable device. To achieve this goal, we design the converter to merge DAGs, each of which corresponds to the features of a specific SFC, to represent multiple SFCs on the output P4 program.
However, merging DAGs may incur NF dependency conflicts, which bring the failure of SFC implementation. The NF dependency conflict indicates reverse NF invocation [13] , [18] . For example, assume DAG1 specifies that network address translator (NAT) is executed before load balancer (LB), while DAG2 indicates the reverse. Therefore, merging DAG1 and DAG2 encounters the problem of determining the execution order between NAT and LB: if the merged DAG executes NAT first, then the correctness of DAG2 is violated; vice versa. One straightforward solution is to invoke NFs multiple times, but it is not allowed by the P4 grammar [18] . Therefore, we are challenged to tackle the NF dependency conflicts between different DAGs under the premise of the P4 grammar.
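To make the notion of an NF dependency conflict concrete, the following Python sketch reports every pair of NFs that two topological orders invoke in opposite order. The function names are hypothetical, and the sketch assumes each NF appears at most once per order.

```python
from itertools import combinations

def ordered_pairs(seq):
    """All (earlier, later) pairs of a sequence."""
    return set(combinations(seq, 2))

def dependency_conflicts(order1, order2):
    """Pairs (a, b) where order1 runs a before b but order2 runs b before a."""
    pairs2 = ordered_pairs(order2)
    return sorted((a, b) for (a, b) in ordered_pairs(order1) if (b, a) in pairs2)
```

For the NAT and LB example above, `dependency_conflicts(["NAT", "LB"], ["LB", "NAT"])` reports the single conflicting pair, while two orders that agree on every shared pair yield an empty list.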
a: THE STRAWMAN METHOD
To address the above problem, we first present the strawman method of merging DAGs. As is shown in Fig. 4 , this method introduces a pre-visiting node to distribute flows and connects this node with original DAGs in parallel. To resolve NF dependency conflicts, this solution creates duplicate nodes to observe the P4 grammar. However, as a compromise, lots of resources are wasted due to the exponential number of duplicate P4 tables. For example, the experimental results in Section V-D show that the strawman method introduces more than 200 duplicate P4 tables when merging 32 SFCs, which is costly and unacceptable. 
b: ALGORITHM
To this end, we develop an LCS-based algorithm, described in Algorithm 1, to avoid massive overheads. The converter iterates this algorithm to merge DAGs and produces an IR in the end. When merging NF sequences, Algorithm 1 uses the LCS algorithm to find the LCS. Since the LCS is shared among NF sequences, this algorithm only needs to create duplicate nodes for the other NFs that do not exist in the LCS, such that it minimizes the total number of duplicate nodes. We compare our solution with the strawman method in Section V-D. We detail Algorithm 1 in what follows.
The converter takes two DAGs as the input of Algorithm 1. First, it acquires the topological sequences of DAGs (lines 2-3). By referring to NF node length, the LCS produces ''sharedOrder'', which is an NF node sequence that occupies maximum P4 tables (line 4). If ''sharedOrder'' is empty, the converter connects the two DAGs to a pre-visiting node and ends up the procedure (lines 5-7). Otherwise, it combines the two NF node sequences. The converter regards the shorter node sequence as ''Attach'' and the longer node sequence as ''Base'' (lines 8-12). The node sequence between the first node and the last node of ''sharedOrder'' on ''Attach'' is named as ''mainSegment''. Meanwhile, ''first'' is the node sequence before ''mainSegment'', and ''follow'' is the node sequence after ''mainSegment'' (line 13). The converter copies ''first'' and inserts the replica before the first NF node of ''sharedOrder'' on ''Base''. Similarly, the replica of ''follow'' is placed after the last NF node of ''sharedOrder'' (line 14). The converter uses a pointer ''pos'' to point to the place after ''first'' on ''Base'' (line 15). For every node on ''mainSegment'', it determines if this node exists in ''sharedOrder''. If so, the converter combines the SFC ID array of this node with that of the same node on ''Base'' (lines 17-20). If false, the replica of this node is inserted to the place indicated by ''pos'' (lines 21-22). Then ''pos'' is moved to the next node on ''Base'' (line 24). Finally, the converter recovers the structures of input DAGs on ''Base'' and produces IR (lines 26-27).
Algorithm 1
 1: function [...](DAG1, DAG2)
 2:     order1 ← Topological_Sort(DAG1)
 3:     order2 ← Topological_Sort(DAG2)
 4:     sharedOrder ← LCS(order1, order2)
 5:     if sharedOrder is None then
 6:         return Simple_Merge_DAG(DAG1, DAG2)
 7:     end if
 8:     if order1.length ≥ order2.length then
 9:         Base, Attach ← order1, order2
10:     else
11:         Base, Attach ← order2, order1
12:     end if
13:     mainSegment, first, follow ← Attach.Split(sharedOrder)
14:     Base.Insert_First_and_Follow(first, follow)
15:     pos ← Base.Index(first.Last_Node()) + 1
16:     for each node in mainSegment do
17:         if node in sharedOrder then
18:             pos ← Base.Index(node)
19:             [...]
20:             [...]
21:         else
22:             Base.Insert(node)
23:         end if
24:         pos++
25:     end for
26:     IR ← Add_Links(Base, DAG1, DAG2)
27:     return IR
28: end function
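The merging idea can be approximated on plain NF name sequences with the Python sketch below. It is a simplification under stated assumptions rather than the algorithm itself: it works on topological orders only (no SFC ID arrays, no link recovery, no pre-visiting node), each NF is assumed to appear once per order, and the names `weighted_lcs`, `merge_orders`, and `duplicates` are hypothetical. The LCS is weighted by node length so that the shared sequence covers as many P4 tables as possible.

```python
def weighted_lcs(a, b, length):
    """LCS of two name sequences, maximizing the total node length."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = length[a[i]] + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    shared, i, j = [], 0, 0
    while i < n and j < m:            # walk the table to recover one LCS
        if a[i] == b[j]:
            shared.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return shared

def merge_orders(order1, order2, length):
    """Merge two orders; shared nodes appear once, the rest are replicated."""
    shared = weighted_lcs(order1, order2, length)
    if len(order1) >= len(order2):
        base, attach = order1, order2
    else:
        base, attach = order2, order1
    merged, i, j = [], 0, 0
    for s in shared:
        while base[i] != s:           # nodes that only Base has at this point
            merged.append(base[i])
            i += 1
        while attach[j] != s:         # replicas of Attach-only nodes
            merged.append(attach[j])
            j += 1
        merged.append(s)              # shared node, emitted once
        i, j = i + 1, j + 1
    return merged + attach[j:] + base[i:]

def duplicates(merged):
    """Number of extra copies of NFs in a merged sequence."""
    return len(merged) - len(set(merged))

def is_subsequence(sub, seq):
    """True if `sub` can be obtained from `seq` by deleting elements."""
    it = iter(seq)
    return all(x in it for x in sub)
```

As a usage example, merging the orders NF1, NF2, NF3, NF4 and NF1, NF3, NF2 (all of length 1) yields a sequence with a single duplicate node in which both input orders remain subsequences, whereas placing the two sequences in parallel would duplicate all three common NFs.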
c: EXAMPLES
First, we present an example in Fig. 5 (a) to illustrate the differences between P4SC and the strawman method and how the NF dependency conflicts are addressed. The two input DAGs, DAG1 and DAG2, indicate different execution orders between ''NF2'' and ''NF3'', which introduces an NF dependency conflict. To merge DAG1 and DAG2, the strawman method connects the two DAGs in parallel by adding three duplicate nodes, which significantly adds overheads. In contrast, Algorithm 1 only introduces one duplicate node to resolve the NF dependency conflict.
Second, we elaborate the converter workflow that includes Algorithm 1 in Fig. 5(b). At first, the converter extracts SFPs and SFC components from input requests and represents them in DAGs. Thereafter, Algorithm 1 acquires topological sequences using topological sorting, and feeds the sequences to LCS to produce ''sharedOrder''. By referring to ''sharedOrder'', Algorithm 1 splits the shorter node sequence, ''Attach'', into three subsequences, ''first'', ''mainSegment'', and ''follow'' (''follow'' is empty in this case), and individually merges the three subsequences into the longer node sequence, ''Base''. To handle ''first'' and ''follow'', Algorithm 1 directly copies them and inserts the replicas into ''Base''. To handle ''mainSegment'', Algorithm 1 iterates over every node of ''mainSegment'' and determines whether the current node exists in ''sharedOrder''. If so, Algorithm 1 skips this node (e.g., ''NF1'' and ''NF3'' in ''mainSegment''). Otherwise, Algorithm 1 copies this node and inserts the replica into ''Base'' (e.g., ''NF6'' in ''mainSegment'').
In the end, Algorithm 1 recovers original DAG structures on the merged sequence and produces IR.
3) HANDLING DUPLICATE NF NODES IN IR
After producing IR, the converter handles NF nodes that appear multiple times in IR. It searches the P4SC blocks by the name of duplicate nodes and creates a unique block replica for every duplicate node in IR. The name of block replica is appended with a serial number to distinguish it from the original name.
C. GENERATOR DESIGN
According to IR, the generator uses the workflow described in Algorithm 2 to produce the P4 program. We illustrate Algorithm 2 as follows.
Algorithm. First of all, it produces the NF node sequence of IR using topological sorting, and records SFPs in linked lists (line 2). A linked list is assigned a path ID that corresponds to an SFP (line 3). For each node in the NF sequence, the generator obtains P4 control flow codes from relevant P4SC blocks based on the node name. It modifies P4 control flow codes with if-else statements of SFC IDs to indicate the boundary between different SFCs, and appends these codes to the node (lines 4-7). Subsequently, the generator introduces an empty control flow pair for ingress and egress control flows (line 8). It selects the nodes that exist in all linked lists and marks them in the NF node sequence (line 9). Then it traverses the NF node sequence and identifies whether a node is marked. If so, the generator directly populates the codes of P4 control flows recorded in this node to the control flow pair (lines 11-13). Otherwise, it acquires path IDs of the linked lists in which this node exists (lines 15-20), and inserts the codes of P4 control flows into the control flow pair, using if-else statements on path IDs to set the boundary of SFPs (lines 21-22). If a node has an attribute that indicates a loop, the generator adds the loopback action to the last P4 table cited by this node (lines 24-26). Furthermore, the generator initializes the output P4 program based on the control flow pair (line 28), and combines this program with a target-dependent backbone program, which provides target-dependent definitions such as the standard metadata (line 29). Operators can change this backbone program to accommodate other P4-capable devices. Finally, the generator inserts a P4 table used to implement the SFC classifier at the beginning of the P4 ingress control flow and produces the output program (lines 30-31). Note that the match fields of the SFC classifier can be flexibly changed according to operator demands.
Algorithm 2
 1: function [...](IR)
 2:     NFNodeSeq, SFPLists ← Topological_Sort_IR(IR)
 3:     SFPLists ← Set_Path_IDs(SFPLists)
 4:     for each node in NFNodeSeq do
 5:         node.ingress ← Add_Ingress(node.name, node.SFCID)
 6:         node.egress ← Add_Egress(node.name, node.SFCID)
 7:     end for
 8:     ingress, egress ← Initial_Control_Flows()
 9:     Mark_Nodes(NFNodeSeq, SFPLists)
10:     for each node in NFNodeSeq do
11:         if node.marked is True then
12:             ingress.Add_Marked_Node(node.ingress, node.SFCID)
13:             egress.Add_Marked_Node(node.egress, node.SFCID)
14:         else
15:             pathIDs ← ∅
16:             for each SFP in SFPLists do
17:                 if node in SFP then
18:                     pathIDs.Append(SFP.pathID)
19:                 end if
20:             end for
21:             ingress.Add_Node(node.ingress, pathIDs, node.SFCID)
22:             egress.Add_Node(node.egress, pathIDs, node.SFCID)
23:         end if
24:         if node.loopback is True then
25:             egress.Set_Loopback()
26:         end if
27:     end for
28:     program ← Init_Program(ingress, egress)
29:     program ← Replace_Metadata_Macros(program)
30:     program ← Add_SFC_Classifier(program)
31:     return program
32: end function
Example. To better illustrate the workflow of the generator, we present an example of producing the P4 ingress control flow according to IR. As is shown in Fig. 6, IR consists of two SFPs, ''SFP1'' and ''SFP2'', which originate from ''SFC1'' and ''SFC2'', respectively. First of all, the generator sorts IR to produce the NF node sequence. Secondly, the generator assigns each node the corresponding P4 control flow codes. In this step, the boundary between ''SFC1'' and ''SFC2'' is set by using if-else statements of SFC IDs on the control flow codes. Thirdly, the generator marks both ''NF1'' and ''NF4'' since they exist in all SFPs. Finally, the generator fills the control flow with the codes appended on each node. It directly populates the codes of ''NF1'' and ''NF4'' because the two nodes are marked. Meanwhile, the codes of ''NF2'' and ''NF3'' are limited by path IDs to represent SFPs.
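The marking and path-ID guarding steps can be sketched in Python as follows. This is a rough illustration that covers only the ingress side and omits SFC ID guards, loopback handling, and the backbone program; the emitted text is merely P4-like, and the names `generate_ingress`, `meta.path_id`, `sfc_classifier`, and the `<NF>_ingress()` calls are hypothetical. The sketch assumes at least one SFP and, for the usage shown in the test, that SFP1 is NF1, NF2, NF4 and SFP2 is NF1, NF3, NF4.

```python
def generate_ingress(node_seq, sfps):
    """Emit a P4-like ingress control flow as text.

    node_seq: NF names in topological order.
    sfps:     dict mapping a path ID to the list of NF names on that SFP.
    """
    # A node is "marked" when it lies on every SFP; it needs no guard.
    marked = set.intersection(*(set(path) for path in sfps.values()))
    lines = ["control ingress {", "    apply(sfc_classifier);"]
    for node in node_seq:
        call = f"{node}_ingress();"
        if node in marked:
            lines.append(f"    {call}")
        else:
            # Guard the NF with the IDs of the paths that contain it.
            ids = [pid for pid, path in sfps.items() if node in path]
            cond = " or ".join(f"meta.path_id == {pid}" for pid in ids)
            lines.append(f"    if ({cond}) {{ {call} }}")
    lines.append("}")
    return "\n".join(lines)
```

With the two paths assumed above, the calls for ''NF1'' and ''NF4'' are emitted unconditionally, while ''NF2'' and ''NF3'' are each wrapped in a condition on their own path ID.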
D. RUNTIME MANAGER DESIGN
As is shown in Fig. 7 , the runtime manager is responsible for providing control commands to operators to manage NFs and SFCs running on P4-capable devices at runtime. However, we encounter two problems in the design of the runtime manager. We present them as follows. 
1) DEVICE HETEROGENEITY
Operators may need to implement SFCs on a heterogeneous infrastructure that consists of different P4-capable devices. However, device heterogeneity brings non-trivial management difficulties since different P4-capable devices vary in control APIs. In this scenario, operators are required to be familiar with the control APIs of every kind of P4-capable device to implement SFC policies on the heterogeneous infrastructure. Moreover, when SFCs crash due to unknown reasons, operators have to spend a long time, ranging from a few hours to several days, to troubleshoot SFCs.
2) LOW-LEVEL P4 PROGRAM DETAILS
The universal approach of enforcing high-level policies on P4-capable devices is to leverage the control APIs generated by the P4 compiler. However, these APIs are tightly coupled with P4 program details (e.g., the name of a P4 table). These details are supposed to be transparent to operators, who only care about high-level SFC policies in terms of NF rules. For example, operators focus on NF rules (e.g., ''forbid ARP packets'') rather than P4 table entries (e.g., ''ruleType:table_add, table:firewall_table, match:etherType = 0x0806, action:drop'').
In response, the runtime manager provides two types of control commands, which enable device management, and SFC management, respectively, for operators to control SFCs at runtime. We briefly introduce two types of control commands below.
3) CONTROL COMMANDS FOR DEVICE MANAGEMENT
The runtime manager provides a set of unified control commands to shield device heterogeneity instead of burdening operators to manage the infrastructure through various control APIs. We also design some control commands used to check the device status in terms of CPU utilization and memory usage. These commands can be easily implemented atop the device management CLI.
4) CONTROL COMMANDS FOR SFC MANAGEMENT
In addition, the runtime manager offers some control commands to populate NF rules and implement SFC policies without involving complex P4 program details. Operators can easily issue these commands in a script written in a high-level programming language like Python to populate NF rules. Also, operators are capable of selecting an SFC or an SFP to process incoming traffic based on the flow kind and policies. For example, they can indicate an SFP, where the video optimizer lies in, to process video traffic.
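To make the gap between an NF rule and a P4 table entry concrete, the following sketch translates the ''forbid ARP packets'' rule mentioned earlier into the corresponding table entry. It is only an illustration under assumptions: the template dictionary and the `translate_nf_rule` function are hypothetical and do not reflect the actual command set of P4SC.

```python
# Hypothetical mapping from (NF name, high-level rule) to P4 program details.
# The single entry mirrors the ARP example given in the text.
RULE_TEMPLATES = {
    ("firewall", "forbid ARP packets"): {
        "ruleType": "table_add",
        "table": "firewall_table",
        "match": {"etherType": 0x0806},
        "action": "drop",
    },
}


def translate_nf_rule(nf: str, rule: str) -> dict:
    """Return the low-level table entry for a high-level NF rule."""
    try:
        template = RULE_TEMPLATES[(nf, rule)]
    except KeyError:
        raise ValueError(f"no template for rule {rule!r} of NF {nf!r}") from None
    # Copy so that callers cannot mutate the shared template.
    return dict(template, match=dict(template["match"]))


entry = translate_nf_rule("firewall", "forbid ARP packets")
print(entry["table"], hex(entry["match"]["etherType"]), entry["action"])
```

Operators then work only with the NF name and the rule text, while table and action names stay inside the translation layer.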
E. INTEGRATING NETWORK FUNCTIONS INTO P4SC
In addition, we design an automatic integration mechanism in the generator to help operators integrate their NFs into P4SC. The automatic integration mechanism converts input configurations that describe a specific NF to a reusable P4SC block. It shields low-level details and provides operators with a simple way to import customized NFs. However, the design of the automatic integration mechanism must resolve three problems concerning the coexistence of the newly imported NF and the NFs that already exist in P4SC. We elaborate on them in what follows.
1) PARSING CONFLICT
In a P4 program, operators define headers and parsers and use them to compose the parse graph (PG), i.e., the finite state machine (FSM) that describes the parsing logic. Since different NFs vary in processing intents, their P4 implementations use different PGs to parse incoming packets. However, the P4 grammar only permits one PG per P4 program. This situation causes the parsing conflict, i.e., different P4-based NFs cannot coexist in a P4 program because their PGs differ. To illustrate, assume that the NFs built in P4SC only support parsing IPv4 packets. When operators attempt to import NAT64, which implements the parsing of IPv6 packets, the parsing conflict happens since the PG of NAT64 differs from that of the NFs existing in P4SC.
2) STANDARD METADATA DIFFERENCE
The P4 tables used to implement NF may use some standard metadata fields to perform packet processing actions. However, these fields are associated with the details of device architecture and vary between different P4-capable devices. For example, the NF that forwards packets is supposed to modify the value of a standard metadata field to set the output port. In BMv2 [19] , a software-based P4 switch, this field is standard_metadata.egress_spec, while NetFPGA-SUME [20] uses sume_metadata.dst_port to set the output port. This difference leads to the incompatibility between the NF and the target P4-capable device.
3) MULTIPLE P4 TABLE INVOCATION
To describe the packet processing logic of an NF, operators need to declare several P4 tables and invoke them in the P4 control flow. However, different P4-based NFs may invoke the same P4 table multiple times. This situation is forbidden by the P4 grammar and thus leads to implementation failures.
To resolve the above problems, we design the automatic integration mechanism with four steps: parse graph union, metadata replacement, name replacement, and P4SC block generation.
4) PARSE GRAPH UNION
To avoid the parsing conflict, we design the parse graph union in the generator. In our design, all NFs maintained by P4SC share the same PG, which is defined in a file, F. When operators input a P4 program to import a new NF, the generator first extracts the PGs from both F and the input P4 program. Recall that the PG is an FSM that consists of a set of nodes and a set of edges. Each node contains the header structure and the parse state, while each edge represents the parsing transition condition between two nodes. Thereafter, the generator combines the node set and the edge set of F with those of the input P4 program using the set union operation. According to the merged sets, the generator produces the new PG and stores it in a new F. To illustrate this step, we present an example in Fig. 8. In this case, F has three nodes and two edges, and the input P4 program has four nodes and three edges. The figure shows two PGs (i.e., PG A and PG B) that represent the parsing logic of F and that of the input P4 program, respectively. The generator merges the nodes and edges of the two PGs by means of the set union operation and produces the PG of the new F, which maintains complete processing intents and resolves parsing conflicts.
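The union step can be sketched with plain sets. The sketch below is illustrative only: the `ParseGraph` class is hypothetical, and the concrete nodes and edges are assumed for the example (they match the node and edge counts stated above, but are not taken from Fig. 8).

```python
from dataclasses import dataclass, field


@dataclass
class ParseGraph:
    """A parse graph as an FSM: parse states plus labelled transitions."""
    nodes: set = field(default_factory=set)   # parse state / header names
    edges: set = field(default_factory=set)   # (source, condition, destination)

    def union(self, other: "ParseGraph") -> "ParseGraph":
        # Set union keeps shared states and transitions only once.
        return ParseGraph(self.nodes | other.nodes, self.edges | other.edges)


# Assumed PG of F: three nodes, two edges.
pg_f = ParseGraph(
    nodes={"ethernet", "ipv4", "tcp"},
    edges={("ethernet", "etherType=0x0800", "ipv4"), ("ipv4", "protocol=6", "tcp")},
)

# Assumed PG of the input P4 program: four nodes, three edges.
pg_input = ParseGraph(
    nodes={"ethernet", "ipv4", "ipv6", "udp"},
    edges={
        ("ethernet", "etherType=0x0800", "ipv4"),
        ("ethernet", "etherType=0x86dd", "ipv6"),
        ("ipv4", "protocol=17", "udp"),
    },
)

merged = pg_f.union(pg_input)
print(sorted(merged.nodes))
print(len(merged.edges))
```

In this assumed example the merged PG has five nodes and four edges, because the shared Ethernet and IPv4 states and the transition between them are stored once.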
5) METADATA REPLACEMENT
Moreover, the generator identifies the standard metadata fields used by the input P4 program. It replaces these fields with corresponding macros such as INGRESS_PORT. When generating the P4 program, the generator associates these macros with target-specific standard metadata fields according to the type of the target P4-capable device (line 29 of Algorithm 2).
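A rough sketch of this two-way substitution is given below. The two output-port field names are the ones mentioned for BMv2 and NetFPGA-SUME in the discussion of standard metadata differences; the macro name `EGRESS_SPEC`, the target keys, and both functions are assumptions made for illustration.

```python
import re

# Target-specific field behind each macro. EGRESS_SPEC is a hypothetical
# macro name; the field names are those quoted in the text.
TARGET_FIELDS = {
    "bmv2": {"EGRESS_SPEC": "standard_metadata.egress_spec"},
    "netfpga-sume": {"EGRESS_SPEC": "sume_metadata.dst_port"},
}


def to_macros(source: str, target: str) -> str:
    """Replace the target-specific fields of an imported NF with macros."""
    for macro, field_name in TARGET_FIELDS[target].items():
        source = source.replace(field_name, macro)
    return source


def from_macros(source: str, target: str) -> str:
    """Bind macros to the fields of the device the program is generated for."""
    for macro, field_name in TARGET_FIELDS[target].items():
        source = re.sub(rf"\b{re.escape(macro)}\b", field_name, source)
    return source


nf_source = "standard_metadata.egress_spec = port;"
generic = to_macros(nf_source, "bmv2")
print(generic)
print(from_macros(generic, "netfpga-sume"))
```

Storing the NF in its macro form keeps one copy of the source that can be bound to whichever device is selected at generation time.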
6) NAME REPLACEMENT
Furthermore, the generator rewrites the names of tables and actions to avoid multiple P4 table invocation. It takes a straightforward solution that simply appends the NF name to the original name of tables and actions. To illustrate, assume that the P4 program used to implement NAT declares a table called ''modify_IP_address''. The generator rewrites its name to ''NAT_modify_IP_address''.
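The renaming rule amounts to a whole-word substitution over the NF source. The sketch below assumes the list of declared table and action names is already known (for instance, from the JSON produced by the compiler); the `prefix_names` function and the sample P4 fragment are hypothetical.

```python
import re


def prefix_names(source: str, nf_name: str, names: list) -> str:
    """Prefix every declared table/action name with the NF name."""
    if not names:
        return source
    # Longest names first so that a name that is a prefix of another
    # cannot shadow it inside the alternation.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")
    return pattern.sub(lambda m: f"{nf_name}_{m.group(1)}", source)


nat_source = (
    "table modify_IP_address { actions = { rewrite; } } "
    "apply { modify_IP_address.apply(); }"
)
print(prefix_names(nat_source, "NAT", ["modify_IP_address", "rewrite"]))
```

Because the substitution runs in a single pass and matches whole identifiers only, an already prefixed name such as `NAT_modify_IP_address` is not rewritten a second time.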
7) P4SC BLOCK GENERATION
Finally, the generator creates a new P4SC block consisting of the NF name, P4 source codes, and the names of P4 control flows, based on the input P4 program. After that, the new NF is successfully integrated into P4SC.
IV. IMPLEMENTATION
We implement a P4SC prototype with 1500 lines of code (LoC). Our implementation consists of two parts.
(1) The converter and the generator are implemented atop the state-of-the-art P4 compiler, P4C [24]. In our implementation, operators can select the version of the output P4 program produced by the generator. Moreover, when integrating new NFs into P4SC, the generator first invokes P4C to convert the input P4 program to a JSON file, which is a straightforward translation of the P4 source code, in order to support both P4_14 and P4_16.
(2) The runtime manager relies on Apache Thrift [25] to establish communication channels with P4-capable devices. Currently, P4SC supports three types of P4-capable devices: BMv2 [19], NetFPGA-SUME [20], and the Tofino-based switch [26]. BMv2 is a software-based P4 switch, while NetFPGA-SUME and the Tofino-based switch are two P4-capable hardware devices. We have published the source code of P4SC at [27] and integrated 25 NFs extracted from open-source P4 programs into P4SC.
V. EVALUATION
We conduct extensive experiments to evaluate P4SC and repeat each experiment 100 times. Our experimental results include: [29] , with higher SFC performance (Section V-G).
A. SETUP
1) TESTBED
P4SC runs on a server configured with twelve 2.3 GHz CPU cores and 128 GB RAM. Our experiments are conducted on the testbed shown in Fig. 9, which consists of one P4-capable device and two servers. We select a Tofino-based switch as the P4-capable device to evaluate P4SC. Moreover, we use MoonGen [30] to generate test traffic at 10 Gbps. In our testbed, one server runs the MoonGen sender, while the other runs the MoonGen receiver. By analyzing the traffic statistics reported by the MoonGen receiver, we acquire real-time SFC performance in terms of throughput and per-packet processing latency.
2) NFs AND SFCs
We select eight NFs to compose SFCs.
• L2fwd matches the destination MAC address of the packet using exact match to determine the output port with 100 rules.
• L3fwd matches the source and destination IP addresses of the packet using longest prefix match (LPM) to determine the output port with 100 rules.
• Firewall matches packet 5-tuples using exact match based on 100 rules, and drops packets when table miss happens.
• NAT translates the source IP address of the packet according to 100 rules. It reads a 4-tuple of packet fields, including the source and destination MAC addresses, and the source and destination IP addresses, using exact match.
• LB hashes the 5-tuple of the packet to balance the traffic load. It matches the destination IP address using LPM and selects the output port based on the result of crc32 hashing. LB is configured with 100 rules.
• VPN realizes the encapsulation and decapsulation of generic routing encapsulation (GRE) used in IPSec VPN. It identifies non-GRE packets and encapsulates them with a GRE header.
• Monitor reports the information of incoming packets. In our implementation, this NF generates a 5-tuple digest for every packet and sends this digest to a specific port. The controller application listens to the port and prints the packet information based on the received digest.
• IDS realizes the port scanner identification built into Bro [31]. At runtime, this NF counts both SYN packets and RST packets using P4 registers. If the total packet number recorded in the register exceeds a pre-defined threshold, it raises an alert by copying the packet and sending the packet replica to the controller port. In our experiments, we let this NF generate a replica for every incoming packet to simulate the heaviest load.
Moreover, we select six real-world SFCs and write corresponding SFC implementation requests, as listed in Table 2. We detail these SFCs as follows.
• SFCs for data center (DC). There are two kinds of traffic in the DC, the east-west traffic between servers, and the north-south traffic from the outside of the DC. We present four SFCs that provide security services for DC traffic, SFC1 [5] , SFC2 [21] , and SFC3 [22] for the north-south DC traffic, and SFC4 [22] for the east-west DC traffic.
• SFC for HTTP services. We present SFC5 for HTTP services [21]. This SFC is composed of LB, firewall, and NAT. At runtime, LB distributes the HTTP traffic and the non-HTTP traffic to two SFPs. To enhance the performance, one SFP forwards the HTTP traffic through a performance enhancement proxy (PEP). The non-HTTP traffic in the other SFP skips the operations of PEP. Thereafter, firewall applies security strategies, and NAT executes the private-to-public address translation.
• SFC for Gi-LAN. The Gi interface is a major mobile traffic carrier between the external packet data network and the gateway general packet radio service (GPRS) support node [23]. Considering the requirements of service-level agreements (SLAs), the Gi-LAN requires the dynamic deployment of SFCs to accommodate traffic growth. We present SFC6 for Gi-LAN extracted from [23]. It schedules flows to go through NAT, L2fwd, LB, L3fwd, and firewall in this order.
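As a rough illustration of the LB behavior described in the NF list above (hashing the packet 5-tuple with crc32 and picking an output port from the result), consider the following sketch. It models only the hash-and-select step in software; the function name, the byte layout of the hash key, and the port list are assumptions, and a hardware crc32 unit may use different parameters than `zlib.crc32`.

```python
import ipaddress
import struct
import zlib


def select_port(src_ip, dst_ip, src_port, dst_port, proto, ports):
    """Pick an output port from the crc32 hash of the packet 5-tuple."""
    # Assumed key layout: both IPv4 addresses, both L4 ports, then the protocol.
    key = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    return ports[zlib.crc32(key) % len(ports)]


output_ports = [1, 2, 3, 4]  # hypothetical candidate output ports
flow = ("10.0.0.1", "10.0.1.9", 40000, 80, 6)
print(select_port(*flow, output_ports))
```

Since the hash depends only on the 5-tuple, all packets of one flow map to the same port, which keeps a flow on a single path while different flows spread across the ports.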
B. EXPRESSIVENESS OF P4SC
In this experiment, we show that P4SC makes it easier to describe SFCs by comparing the LoC required to implement SFCs. We use P4SC, naive P4 language, and DPDK to describe the six real-world SFCs, respectively. Moreover, we compare their LoC in describing complicated SFCs. We vary the number of SFPs of complicated SFCs from 20 to 100 to change the SFC complexity. As shown in Fig. 10, P4SC requires far fewer LoC than naive P4 language and DPDK. It reduces development LoC by two orders of magnitude compared to the other two methods. This is because P4SC provides several primitives that shield complex substrate details, such that operators can customize their SFCs in an intuitive way.
C. PERFORMANCE BENEFITS OF P4SC
In this experiment, we demonstrate that P4SC can significantly improve the performance of SFCs. We use P4SC to implement SFCs on a Tofino-based switch. For comparison, we use BMv2 to implement software-based SFCs and use another NFV framework, BESS, to implement DPDK-based NFs and SFCs. We stress-test SFCs with 40 Gbps traffic and measure their throughput and per-packet processing latency.
1) REAL-WORLD SFCs
First, we evaluate the performance of the six real-world SFCs. As shown in Fig. 11, P4SC substantially improves SFC performance. Compared to software-based SFCs, P4SC provides a two-orders-of-magnitude improvement in both throughput and per-packet processing latency. Moreover, P4SC outperforms DPDK-based SFCs with up to 7× throughput increase and 99% latency reduction. Note that the performance benefits of P4SC decrease as the packet size increases because all of these solutions reach line rate.
2) COMPLICATED SFCs
Second, we evaluate the performance benefits of P4SC for complicated SFCs. We generate two types of complicated SFCs, i.e., very long linear SFCs and SFCs with massive SFPs. To vary the complexity of SFCs, we set the number of NFs and the number of SFPs from 20 to 100, respectively. We use P4SC, BMv2, and BESS to implement these complicated SFCs and measure the SFC performance. The results in Fig. 12 and Fig. 13 show that P4SC maintains high performance regardless of the increase in SFC complexity, while the performance of the other two solutions degrades. The reason is that P4SC offloads SFCs to P4-capable devices, which offer line-rate performance.
D. EXECUTION TIME OF P4SC
In this experiment, we evaluate the timeliness of P4SC when implementing SFCs. First, we write SFC implementation requests for the six real-world SFCs and input them to P4SC. We measure the execution time of converting the input requests to the corresponding P4 programs. Fig. 14(a) shows that P4SC generates the P4 program for any of the real-world SFCs within 1 s. Second, we measure the time of deploying complicated SFCs using P4SC. Since the execution time of P4SC depends on the number of NFs, we vary the number of NFs from 20 to 100 in complicated SFCs. Fig. 14 (3) An NF occurs only once in an SFC. We vary the number of SFCs to be merged from 2^1 to 2^7 to simulate massive SFC implementation requests. Fig. 15(a) shows that P4SC is capable of merging a hundred SFCs in less than 0.5 s, which is fast and acceptable. Second, we measure the overheads of P4SC when merging SFCs. Recall that the overheads of P4SC are caused by introducing duplicate P4 tables to preserve the correctness of output P4 programs. We compare P4SC with the strawman method that inserts a pre-visiting node to merge SFCs. If one SFC has n NFs while another SFC has m NFs, then the output IR produced by the strawman method has n+m+1 nodes. We use the two methods individually to merge SFCs. Fig. 15(b) shows that, compared to the strawman method, P4SC significantly reduces the number of duplicate P4 tables when merging SFCs, which avoids massive overheads.
For example, in the case of merging 128 SFCs, P4SC only uses 13.15% of duplicate tables introduced by the strawman method.
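The following sketch illustrates why sharing a longest common subsequence (LCS) between two chains needs fewer nodes than the strawman count of n+m+1. It does not reproduce P4SC's merging algorithm: the two chains are hypothetical, and the count n+m-|LCS| is simply the length of a shortest sequence that contains both chains as subsequences, used here as an assumed lower-overhead alternative.

```python
def lcs(a, b):
    """Longest common subsequence of two NF sequences (dynamic programming)."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one LCS.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def strawman_nodes(a, b):
    # Both chains kept whole, plus one pre-visiting node (n + m + 1).
    return len(a) + len(b) + 1


def shared_nodes(a, b):
    # NFs on the LCS are instantiated once and shared by both chains.
    return len(a) + len(b) - len(lcs(a, b))


# Hypothetical chains; each NF occurs only once per chain.
chain_a = ["firewall", "NAT", "LB", "L3fwd"]
chain_b = ["firewall", "LB", "L3fwd", "monitor"]
print(lcs(chain_a, chain_b))
print(strawman_nodes(chain_a, chain_b), shared_nodes(chain_a, chain_b))
```

For these two assumed chains the common subsequence has three NFs, so sharing it yields five nodes, whereas the strawman count is nine.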
Third, we evaluate the performance overheads incurred by merging SFCs. We randomly select three of the real-world SFCs and simultaneously implement them on the P4-capable device. We compare the performance of the merged SFC against that of unmodified SFCs. Our experiments indicate that P4SC introduces less than 1% performance overheads in merging SFCs.
F. EFFICIENCY OF RUNTIME MANAGEMENT
In this experiment, we evaluate the efficiency of runtime management. First, we evaluate the timeliness of the runtime manager. We choose three basic control commands, i.e., Add_NF_Rule, Delete_NF_Rule, and Modify_NF_Rule, and measure their execution time. We use the runtime manager of P4SC to execute these control commands. As shown in Fig. 16(a), the runtime manager introduces an additional 10 ms to manage NF rules. The reason is that the runtime manager needs to translate control commands into low-level configurations. Since SFC management costs hundreds of milliseconds, the latency overhead of the runtime manager is negligible. Second, we measure the time of integrating new NFs into P4SC based on the automatic integration mechanism. We generate new NFs with respect to the behaviors of real NFs. We vary the number of new NFs to be integrated from 20 to 100. Fig. 16(b) shows that even with 100 new NFs, the automatic integration mechanism of P4SC takes less than 1.5 s to complete the integration.
G. SUPPORT FOR NETWORK SERVICE HEADER
In this experiment, we demonstrate that P4SC outperforms another NFV framework, BESS [28], by providing high-performance SFC implementation for NSH [29]. Specifically, NSH is an SFC technique that describes a header format for delivering packets to corresponding SFPs. It runs atop NFV frameworks to support high-level traffic scheduling. We build NSH on both P4SC and BESS and implement the very long linear SFCs used in Section V-C. We measure the performance of SFCs. The results in Fig. 17 show that, compared to BESS, P4SC provides NSH with high-throughput and low-latency SFC implementation.
VI. DISCUSSION
P4SC achieves high-performance SFC implementation based on the high packet processing performance of P4-capable devices. However, some P4-capable devices (e.g., NetFPGA-SUME) have limited memory resources and programming constraints that impede the implementation of complex NFs. Thus, a concern about P4SC's support for complex NFs may arise. As mentioned in Section III-E, P4SC is able to adopt arbitrary NFs written in P4 by means of the automatic integration mechanism. Therefore, whether P4SC can support complex NFs depends on the P4 grammar, rather than the capability of the P4-capable device. In other words, an NF can be supported by P4SC as long as it can be described in P4. Many previous research efforts [15], [16], [32] have demonstrated that P4 can describe many complex NFs.
Nevertheless, another potential concern may arise from the fact that P4 provides limited support for queue scheduling and per-flow state management, such that some complex NFs cannot be realized using P4, and hence not using P4SC. This concern can be alleviated in three ways: (i) Using extern objects in P4 to integrate target-dependent functions (e.g., queue management). (ii) Invoking P4SC to implement these NFs on P4-capable software devices to satisfy resource and programmability requirements. (iii) Given the recent development trend of programmable hardware [33], it is plausible that future versions of P4 will support complex packet processing operations and make it feasible to describe complex NFs.
VII. RELATED WORK
There are two major categories of related work, namely, NF orchestration and SFC acceleration. We briefly review these works below and compare them with ours.
A. NF ORCHESTRATION
Many recent works [22], [34]-[39] propose NF orchestration technologies for SFC implementation. PGA [35] provides a graph-based abstraction to express network policies for NF orchestration. By translating the graph produced by PGA to corresponding P4SC primitives, P4SC can adopt PGA as a frontend system to support more comprehensive NF orchestration. Moreover, ClickP4 [40] orchestrates NFs using P4. It modularizes P4-based NFs to enable on-demand orchestration. Unlike ClickP4, P4SC is designed to generate the P4 program based on the SFPs and SFC components described in SFC implementation requests. Besides, P4NFV [41] manages NFs on the P4-enabled data plane. Although P4NFV is an NF management framework, it may also be utilized to implement SFCs. However, compared to P4SC, P4NFV has three main flaws. (1) P4NFV relies on the network topology to maintain the SFC structure, which is less flexible compared to P4SC. (2) P4NFV regards each P4-capable device as an individual NF node (i.e., each P4-capable device only runs one NF), which results in low utilization of device resources. (3) P4NFV does not support the simultaneous implementation of multiple SFCs on the same infrastructure. These shortcomings compromise the feasibility of P4NFV as an SFC implementation framework.
B. SFC ACCELERATION
A recent trend advocates the use of data plane programmability and programmable devices, including multicore servers [42], FPGA [10], [12], [43], GPU [44], and programmable switches [45], [46], for improving SFC performance. NetVM [42] eliminates redundant packet transmissions among NFs to accelerate SFCs on multi-core servers. The FPGA-based solutions [10], [12], [43] and GPU-based solutions [44] offload NFs to FPGA or GPU to improve SFC performance. Nevertheless, the above solutions fail to meet the tight performance requirements of some applications, e.g., microsecond-level clock synchronization [47], due to the limited processing capability of these devices. In contrast, P4SC uses P4-capable devices such that it empowers SFCs with Tbps-level throughput and ultra-low latency (a few microseconds). Metron [45] deploys offloadable NFs on OpenFlow switches for SFC acceleration. However, Metron can only deploy a small portion of NFs on OpenFlow switches, which are less flexible and programmable compared to P4-capable devices. Instead, P4SC can bring more performance benefits for SFCs since most NFs can be offloaded to P4-capable devices. Furthermore, some studies exploit the notion of modularizing or parallelizing NFs to accelerate SFCs [7], [22], [48]. P4SC is complementary to these approaches and can be deployed with them.
In addition, compared to the previous version of this paper [49] , seven major enhancements are made in this manuscript. (1) An overview of SFC and its implementation are presented along with an illustration example (Section II-A). Detailed background of the steps of implementing SFCs on the P4-capable device are also added, which will help the understanding of P4SC greatly because the steps are the foundation of our work (Section II-B).
(2) A detailed workflow of the generator in Algorithm 2 is presented (Section III-C). (3) A runtime manager is added in P4SC for managing SFCs at runtime (Section III-D). It provides operators with a suite of control commands that shield the device heterogeneity and add flexibility for SFC management. (4) An automatic integration mechanism is developed for operators to conveniently integrate their P4-based NFs into P4SC (Section III-E). (5) Implementation details are added, including the support for three types of P4-capable devices (Section IV) and the six SFCs used in the evaluation (Section V-A). (6) Extensive experiments are conducted to demonstrate that P4SC is able to provide high-performance SFC implementation on the P4-capable device while maintaining high flexibility (Section V). (7) A discussion section is added on topics relating to the extensibility of P4SC (Section VI).
VIII. CONCLUSION
In this paper, we proposed P4SC, a high-performance and flexible framework for implementing SFCs. It builds on top of the programmability of P4 to offer great flexibility and efficiency for the SFC implementation, naturally utilizing the high packet processing performance of the P4-capable device. P4SC includes primitives to specify SFCs in an SFC implementation request, a converter and a generator to take the requests to produce a P4 program, and a runtime manager to control and reconfigure SFCs at execution. In addition, an LCS-based algorithm is developed to allow multiple SFCs to be implemented in parallel on the same device following the P4 grammar correctly. An automatic integration mechanism is also developed to integrate P4-based NFs into P4SC. Six real-world SFCs are implemented on various P4-capable devices via P4SC. The experimental results show that P4SC provides high-performance and efficient SFC implementation, and enables high flexibility of SFC management at runtime. 
