Packet classification is widely used in Software-Defined Networking (SDN). At present, packet classification is mainly implemented in software, such as Open vSwitch, which suffers from low performance. The Field Programmable Gate Array (FPGA) platform offers reconfigurability and high processing performance. This work proposes FPGA-based Bit-Vector algorithms for packet classification, which offer deterministic latency and high throughput.
I. INTRODUCTION
Packet classification is one of the core functions required by many commercial network services and is also the central problem of Software-Defined Networking (SDN) [1] based on OpenFlow [2]. SDN requires multiple fields of each packet to be checked against a predefined rule set, which contains thousands of rules and wildcards. The significant challenges of traditional packet classification include: (1) saving memory resources, (2) sustaining high performance, and (3) facilitating dynamic updates.
In the past decade, the academic community has proposed many effective methods to address these challenges. (1) To save memory resources, researchers proposed the decision tree algorithms [3]-[10]. Based on rule features, these algorithms construct a hierarchical tree structure, traverse it with a tree search algorithm, and obtain the matching rules at the leaf nodes. However, an increasing number of fields also increases the processing delay. (2) To sustain high performance, researchers proposed solutions based on Ternary Content Addressable Memory (TCAM). TCAM has been widely adopted in industry [14], [15]. It provides line-speed parallel lookup, but it is expensive, power-hungry, and not suitable for representing rules with range fields. (3) To facilitate dynamic updates, researchers proposed the tuple space algorithms [11]-[13]. These algorithms partition the rules into tuples according to the length of the wildcard and search each tuple to obtain the matching rule with the highest priority. They support dynamic updates, but the search time increases with the number of tuples.
The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.
The decision tree algorithms and the tuple space algorithms are difficult to map onto hardware memory resources. This paper therefore focuses on FPGA-based schemes. The Field Programmable Gate Array (FPGA) has the advantage of reconfigurability and is widely used in network processing [16]-[19]. Lakshman and Stiliadis [20], Jiang and Prasanna [21], and Ganegedara et al. [22] proposed bit-vector-based (BV-based) algorithms for FPGA. The advantage of the BV-based algorithms is that they can use a pipeline [23] composed of Processing Engines (PEs) to improve throughput. However, a BV-based algorithm needs to store at least 2 × L bit-vectors of N bits each (L represents the total number of bits in all matching fields, and N represents the total number of rules in the rule set). The memory consumption rises rapidly as L and N increase. To the best of our knowledge, few schemes currently optimize this memory consumption.
Each bit in the rule set can take one of three states ("0", "1", and "*"), where "*" is known as the don't-care bit or wildcard. The "*" state is always regarded as matched irrespective of the input bit. Fig. 1 shows the percentage of wildcards in the 10K rule sets and 100K rule sets synthesized by WeeBV [26] and by this paper, respectively. The rule sets are generated by ClassBench [24] and ClassBench-ng [25]. The percentage of wildcards differs slightly between generated rule sets because the generators delete duplicate rules [24], [25]. Wildcards occupy a significant proportion of a typical rule set: the average wildcard ratio is 39.13% for the traditional five-tuple rule set and 49.19% for the OpenFlow rule set.
Because a wildcard PE does not affect the matching result, its result can be output directly without accessing memory [26]. This paper proposes a memory compression scheme called Memory-shared Bit-Vector (MsBV) for BV-based classification lookup tables and introduces a new Memory-shared Homogeneous Two-dimensional Pipeline architecture (MsTP). In MsTP, every two MsPEs constitute a PE pair, and each is the other's partner PE. The two MsPEs share one local memory to achieve memory compression. All MsPEs in MsTP connect to the two-level memory block through the bus architecture MsBus. When a PE pair consists of two standard MsPEs, the single local shared memory cannot serve both MsPEs at the same time. Therefore, one of the MsPEs accesses the two-level memory block to ensure the correctness of the scheme.
We make the following contributions:
• We propose a Memory-shared Homogeneous Two-dimensional Pipeline for matching rules, MsTP. MsTP includes Memory-shared Processing Engines (MsPEs) and a Memory-shared Bus architecture (MsBus) that connects the MsPEs to the two-level memory. MsTP allows multiple MsPEs to share memory blocks to save memory resources.
• We model the rule set as a bit matrix and propose a rearrange technology for MsTP that reduces memory consumption as far as possible. We optimize the ordering of the Bit Matrix (BM) columns by exploiting the fact that the matching result of the BV-based algorithm is independent of the order in which the sub-fields of the packet are matched. Each rearranged sub Bit Matrix (sub-BM) is mapped to the corresponding MsPE for execution to optimize memory usage.
• Based on ClassBench [24] and ClassBench-ng [25], we synthesize 100K 5-tuple rules and 100K OpenFlow rules to evaluate MsBV. Through simulation experiments, we determine the MsBus architecture parameters that minimize the conflict rate. Based on typical rule sets, we compare MsBV with StrideBV [22] to quantify the memory compression.
The rest of the paper is organized as follows: Section II presents the background; Section III describes the MsBV scheme and its implementation architecture; Section IV presents the experiments and evaluation results; Section V reviews related work on memory compression schemes; and Section VI draws conclusions.
II. BACKGROUND
SDN requires extremely high processing performance. On FPGA, this is achieved by tightly integrating the available logic and memory resources to perform rule matching. The BV algorithm is a widely used FPGA-based algorithm because it has a deterministic search delay.
Lakshman and Stiliadis [20] proposed an FPGA-based algorithm called the Bit-Vector (BV) search algorithm. The BV search algorithm returns a bit-vector for each search. The bit-vector is an N-bit vector, and each of its bits represents the match result for one rule. A rule that matches the input packet header is represented by "1," while a rule that does not match the input packet header is represented by "0."
The final summation operation of the BV-based algorithm is costly. When the number of rules (or the number of matching fields) increases, the length of the bit-vector (or the number of bit-vectors) rises, the clock period required for the operation grows, and the latency of processing a single packet increases as well. Ganegedara [22] proposed the StrideBV algorithm to deal with this delay. All the matching fields are divided into L/s sub-fields, where s (1 ≤ s ≤ L) denotes the bit length of a sub-field. K_j (j = 0, 1, ..., L/s − 1) denotes the s bits in sub-field j. StrideBV reduces the number of matching searches and thereby decreases search latency. Fig. 2a is an example of a rule set. An example of StrideBV with stride = 2 is illustrated in Fig. 2b. The matching result for the rule set is BV = {0001}, indicating that only the rule R_3 matches the input packet header.
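The stride-based lookup described above can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware design of [22]: the function names `build_stride_tables` and `stride_bv_lookup`, the 4-bit rule set, and the assumption that s divides L are all introduced here for illustration and do not correspond to Fig. 2.

```python
from itertools import product

def build_stride_tables(rules, s):
    """Build one lookup table per s-bit sub-field from ternary rule strings.

    rules: list of equal-length strings over '0', '1', '*'.
    Returns tables[j][key] -> list of N match bits (1 = rule matches sub-field j).
    """
    L = len(rules[0])
    assert L % s == 0, "sketch assumes the stride divides the field length"
    tables = []
    for j in range(0, L, s):
        table = {}
        for bits in product("01", repeat=s):
            key = "".join(bits)
            # A rule bit matches when it is a wildcard or equals the key bit.
            table[key] = [
                int(all(r in ("*", k) for r, k in zip(rule[j:j + s], key)))
                for rule in rules
            ]
        tables.append(table)
    return tables

def stride_bv_lookup(tables, header, s):
    """AND the per-stage bit-vectors for every s-bit sub-field of the header."""
    n_rules = len(next(iter(tables[0].values())))
    bv = [1] * n_rules
    for stage, table in enumerate(tables):
        key = header[stage * s:(stage + 1) * s]
        bv = [a & b for a, b in zip(bv, table[key])]
    return bv

# Hypothetical 4-bit rule set (not the one in Fig. 2a).
rules = ["00**", "01*1", "1***", "110*"]
tables = build_stride_tables(rules, s=2)
result = stride_bv_lookup(tables, "1101", s=2)   # -> [0, 0, 1, 1]
```

Each stage stores 2^s bit-vectors of N bits, which is the source of the memory cost discussed next.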
The memory consumption of the BV table in the StrideBV algorithm can be expressed as equation (1):
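With the definitions above, each of the L/s sub-fields requires 2^s bit-vectors of N bits. Assuming that s divides L, equation (1) can therefore be written as the following reconstruction (the symbol M_StrideBV is a name chosen here):

```latex
M_{\mathrm{StrideBV}} = 2^{s} \times \frac{L}{s} \times N \quad \text{bits} \tag{1}
```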
Based on StrideBV [22], Qu and Prasanna [23] proposed a two-dimensional pipeline architecture. The rule set is divided into N/n subsets, where n (1 ≤ n ≤ N) represents the number of rules in a sub-rule set. The two-dimensional pipeline architecture reduces the length of a single bit-vector and improves scalability with respect to the rule set size. The architecture example shown in Fig. 2c is a two-dimensional pipeline composed of homogeneous PEs. PE[i, j] denotes the PE located in the i-th row and the j-th column. Each PE[i, j] integrates an SRAM memory to store the BV table BV(i,j). The contents of BV(i,j) are the results of matching sub-field K_j against subset i. As illustrated in Fig. 2c, s is set to 2, n is set to 2, and the final result is BV = {0001}.
The memory consumption of the two-dimensional pipeline architecture can be expressed as equation (2):
Equations (1) and (2) show that StrideBV [22] has the same space complexity as the two-dimensional pipeline architecture [23]. The memory consumption is proportional to the number of rules and to the total number of bits in all matching fields. Memory resources in FPGAs are so limited that it is challenging to provide the memory required for SDN-based applications, in which the number of rule fields and the size of rule sets are increasing dramatically [10]. Furthermore, wildcards occupy a significant proportion of a typical rule set. This paper therefore proposes MsBV, a memory compression scheme built on a memory-shared homogeneous two-dimensional pipeline.
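Since the architecture contains N/n rows of L/s PEs and each PE stores a BV table of 2^s × n bits, equation (2) can be written as the following reconstruction, assuming that n divides N and s divides L (the symbol M_2D is a name chosen here):

```latex
M_{\mathrm{2D}} = \frac{N}{n} \times \frac{L}{s} \times \left( 2^{s} \times n \right) = 2^{s} \times \frac{L}{s} \times N \quad \text{bits} \tag{2}
```

The factor n cancels, so the total equals the StrideBV cost of 2^s × (L/s) × N bits.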
III. MEMORY-SHARED BIT-VECTOR-BASED SCHEME
The design of the Processing Engine in the two-dimensional pipeline architecture [23] is shown in Fig. 3. ... BV(1,0) in the local memory. MsPE[1, 1] connects to the two-level memory block through the data bus and stores the bit-vector table BV(1,1) in the two-level memory block. The last component of MsTP is the priority encoder (PrEnc) [22], which is located at the end of each horizontal pipeline and reports the local highest-priority match result. The vertical pipeline of PrEncs collects the matching results to obtain the final result.
An MsPE contains the following components:
(1) Controller, which writes initial or updated rules to the corresponding memory and determines which memory the MsPE accesses based on the state of the PE pair and the data bus;
(2) n-bit AND, a logic unit used to perform the AND operation;
(3) Packet Reg, which stores the s-bit packet header;
(4) SubBV Reg, which stores the n-bit sub bit-vector that is the matching result;
(5) Local memory, which stores the BV table. The size of the BV table is the amount of memory required for one MsPE matching operation. In MsTP, two MsPEs share one local memory;
(6) Two-level memory, which the MsPE accesses when the corresponding local memory is already occupied.
In the ideal case, a wildcard MsPE and a standard MsPE form a PE pair, so that as few two-level memory blocks as possible are accessed. However, real-life rule sets rarely achieve this ideal case, so we need a preprocessing technology to process the rule set.
B. BIT MATRIX REARRANGE TECHNOLOGY
1) BASIC TECHNOLOGY
There are four types of match in multi-field packet classification: prefix match, range match, wildcard match, and exact match. To enable the hardware to support range matching, we convert each range match to prefix matches [27]. Wildcard match and exact match can be handled as two special cases of prefix match. Therefore, this paper converts all matching types of rules to prefix matches, and the remainder of this paper considers only prefix matching.
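To illustrate the range-to-prefix conversion, the following Python sketch splits an inclusive integer range into ternary prefixes using the standard aligned-block decomposition. It is shown only as an example and is not necessarily the method of [27]; the function name `range_to_prefixes` is an assumption made here.

```python
def range_to_prefixes(lo, hi, width):
    """Split the inclusive integer range [lo, hi] into ternary prefix strings."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... shrunk until it fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        fixed = width - size.bit_length() + 1   # number of fixed leading bits
        head = format(lo >> (width - fixed), "0{}b".format(fixed)) if fixed else ""
        prefixes.append(head + "*" * (width - fixed))
        lo += size
    return prefixes

# The 4-bit range [1, 6] becomes four prefixes covering 1, 2-3, 4-5, and 6.
example = range_to_prefixes(1, 6, 4)   # -> ['0001', '001*', '010*', '0110']
```

A single range rule can thus expand into several prefix rules, which increases N in the bit matrix.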
As illustrated in Fig. 6, after we convert the matching type of each field to a prefix match, the entire rule set becomes a Bit Matrix (BM) of size L × N. Once s and n are set, the BM can be decomposed into several sub-BMs of size s × n by traversing each field from front to back and from top to bottom. We call a sub-BM a wildcard sub-BM when it consists of s × n wildcards; otherwise, it is a standard sub-BM. In MsTP, sub-BM(i, j) generates a BV table BV(i,j) for matching, and the corresponding MsPE[i, j] performs the table lookup on BV(i,j). Thus, there is a one-to-one mapping between sub-BM(i, j) and MsPE[i, j]. Fig. 7a shows the initial sub-BM distribution, where a white box represents a standard sub-BM and a gray box represents a wildcard sub-BM. Each sub-BM is marked as [x, y, z], denoting the sub-BM in the x-th row and the y-th column of the z-th field.
The output of the MsBV scheme is independent of the order in which packet fields are matched. The sub-BM(i, j) mapped to MsPE[i, j] changes accordingly as we adjust the column order in Fig. 7a. We place standard sub-BMs and wildcard sub-BMs adjacent to each other as far as possible. If one standard MsPE and one wildcard MsPE form a PE pair, the local memory is fully utilized. The overall performance of MsBV therefore varies considerably with the arrangement.
We refer to an MsPE mapped to a wildcard sub-BM as a wildcard MsPE, and to an MsPE mapped to a standard sub-BM as a standard MsPE. If the initial sub-BM(i, j) is mapped directly to the MsPE[i, j] in the corresponding position without the bit matrix rearrange technology, a large number of PE pairs with a "bad combination" will appear. For example, when two standard MsPEs form a PE pair, one of them needs to access the two-level memory block through the data bus. When two wildcard MsPEs form a PE pair, the local shared memory stays empty, which wastes memory resources.
We aim to have as many PE pairs with a "perfect combination" as possible. A "perfect combination" means that one standard MsPE and one wildcard MsPE form a PE pair.
According to the statistics of WeeBV [26], wildcards appear more often at the end of each field. The pseudo-code of the bit matrix rearrange technology is as follows:
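The pair categories defined above can be counted with a short Python sketch. It assumes, for illustration only, that the PE pairs of one pipeline row are the adjacent columns (0, 1), (2, 3), and so on; the function name `classify_pairs` and the category labels are hypothetical.

```python
def classify_pairs(wildcard_row):
    """Classify adjacent MsPE pairs in one pipeline row.

    wildcard_row: list of booleans, True where the sub-BM is all wildcards.
    Pairs are assumed to be columns (0, 1), (2, 3), ...
    """
    counts = {"perfect": 0, "standard-standard": 0, "wildcard-wildcard": 0}
    for j in range(0, len(wildcard_row) - 1, 2):
        a, b = wildcard_row[j], wildcard_row[j + 1]
        if a != b:
            counts["perfect"] += 1             # one standard, one wildcard MsPE
        elif a:
            counts["wildcard-wildcard"] += 1   # local memory stays empty
        else:
            counts["standard-standard"] += 1   # one MsPE needs two-level memory
    return counts

counts = classify_pairs([False, False, True, True, False, True])
```

The rearrange technology aims to raise the "perfect" count and lower the other two.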
Algorithm 1 Bit Matrix Rearrange Technology
Require: Matrix_bit: old bit matrix; s: stride of the StrideBV; n: the number of rules in a sub-rule set; Matrix_new: empty bit matrix.
Ensure: Matrix_new: new bit matrix; Output_Pkt: packets output to the next module Output Engine.
A, B, C = 0;
for i = (0, N − 1, n) do
  for j = (0, L − 1, s) do
    if Matrix_bit(i, j) == * then
      Matrix_new(i/n, j/s) = 1;
    ...
  end for
end for
for Column = (0, L/(2s) − 1/2) do
  for Row = (0, N/n − 1) do
    if ...
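The first loop nest of Algorithm 1 marks which sub-BMs are wildcard sub-BMs. A minimal Python sketch of that marking step follows. It assumes the bit matrix is stored with one rule per row, and it checks all s × n cells of each sub-BM according to the definition of a wildcard sub-BM rather than only the cell Matrix_bit(i, j); the function name `wildcard_map` is hypothetical, and the later pairing step is not covered.

```python
def wildcard_map(bit_matrix, s, n):
    """Return an (N/n) x (L/s) 0/1 map; 1 marks an all-wildcard sub-BM.

    bit_matrix: list of N ternary strings of length L (one rule per row).
    """
    N, L = len(bit_matrix), len(bit_matrix[0])
    return [
        [
            int(all(bit_matrix[r][c] == "*"
                    for r in range(i, min(i + n, N))
                    for c in range(j, min(j + s, L))))
            for j in range(0, L, s)
        ]
        for i in range(0, N, n)
    ]

# Hypothetical 4-rule, 4-bit matrix with s = 2 and n = 2.
marks = wildcard_map(["01**", "0***", "11**", "1*0*"], s=2, n=2)   # -> [[0, 1], [0, 0]]
```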
2) OPTIMIZED TECHNOLOGY
a: SEGMENT REARRANGE TECHNOLOGY
The wildcard sub-BMs are concentrated in the tail of each field. The segment rearrange technology takes advantage of this and adopts the methods of "cutting" and "recombination." "Cutting" evenly divides the sub-BMs of each field in the initial sub-BM map into two parts by column: "the head" and "the tail." "Recombination" places "the head" of all the fields together and "the tail" of all the fields together. Fig. 9 is an example of the segment rearrange technology for packets with two fields. The sub-BM distribution in the initial state is subjected to "cutting" and "recombination" to obtain the sub-BM distribution in the intermediate state illustrated in Fig. 9b. We then process the sub-BM distribution in the intermediate state in Fig. 9b ... [3, 4, 1]). The second column of sub-BMs and the fifth column of sub-BMs are made adjacent, and the third column is made adjacent to the fourth column. We then get the sub-BM distribution in the final state illustrated in Fig. 9c.
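The "cutting" and "recombination" steps can be sketched in Python as follows. The sketch covers only these two steps and not the subsequent column pairing that leads to Fig. 9c; the function name `segment_rearrange`, the column labels, and the choice to give the head the extra column when a field has an odd number of columns are assumptions made here.

```python
def segment_rearrange(field_columns):
    """Cut each field's column list in half and group heads, then tails.

    field_columns: list of lists; each inner list holds the sub-BM column
    labels of one field, in their original order.
    """
    heads, tails = [], []
    for cols in field_columns:
        mid = (len(cols) + 1) // 2      # odd lengths give the head the extra column
        heads.extend(cols[:mid])        # "cutting": front half of the field
        tails.extend(cols[mid:])        # "cutting": back half of the field
    return heads + tails                # "recombination": all heads, then all tails

# Two hypothetical fields with four and two sub-BM columns.
order = segment_rearrange([["a0", "a1", "a2", "a3"], ["b0", "b1"]])
# -> ['a0', 'a1', 'b0', 'a2', 'a3', 'b1']
```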
b: FIELD REARRANGE TECHNOLOGY
Furthermore, we observe that the wildcard ratio differs markedly between the fields of the OpenFlow rule set. In Fig. 10, the wildcard ratio ranges from 14.86% to 100%, and the wildcard ratios of 10 fields are above 80% or below 20%. To make full use of this characteristic of the OpenFlow rule set, we propose the field rearrange technology. Through preprocessing, we change the order of the fields according to their proportion of wildcards. An example of the field rearrange technology is illustrated in Fig. 11. Field 2 has the lowest percentage of wildcards and field 0 has the highest. We therefore reorder the fields so that Field 2, Field 1, and Field 0 are arranged from left to right.
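The field reordering in this example amounts to sorting the fields by ascending wildcard ratio. A minimal Python sketch under that reading follows; the function name `field_rearrange` and the sample field contents are hypothetical and are not taken from Fig. 11.

```python
def field_rearrange(fields):
    """Order fields by ascending wildcard ratio.

    fields: dict mapping a field name to the list of ternary strings
    that the rules hold for that field.
    """
    def wildcard_ratio(values):
        total = sum(len(v) for v in values)
        return sum(v.count("*") for v in values) / total

    return sorted(fields, key=lambda name: wildcard_ratio(fields[name]))

# Hypothetical three-field rule set: field0 is mostly wildcards, field2 mostly exact.
fields = {
    "field0": ["****", "0***"],
    "field1": ["01**", "1***"],
    "field2": ["0101", "011*"],
}
order = field_rearrange(fields)   # -> ['field2', 'field1', 'field0']
```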
C. UPDATE STRATEGY
First, we consider the n rules handled by a particular horizontal pipeline consisting of L/s MsPEs. We use n valid bits to keep track of these n rules. A valid bit is a binary digit indicating the validity of a specific rule. A rule is valid only if its corresponding valid bit is set to "1" [23]. We restate the problem of dynamic updates as three subproblems:
(1) Deletion. To delete a rule, we reset its corresponding valid bit to "0." An invalid rule cannot produce any match result. (2) Insertion. First, we find a rule whose valid bit is set to "0"; then the controller replaces that rule by updating the BV tables; finally, we set the valid bit to "1." (3) Modification. We regard a modification as one deletion followed by one insertion.
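The three update operations can be sketched in Python as follows. This is a software illustration only: the class `PipelineRow`, its methods, and the `rules` list that stands in for the BV table contents are hypothetical names introduced here.

```python
class PipelineRow:
    """Tracks the n rule slots of one horizontal pipeline with valid bits."""

    def __init__(self, n):
        self.valid = [0] * n
        self.rules = [None] * n      # stand-in for the BV table contents

    def insert(self, rule):
        slot = self.valid.index(0)   # raises ValueError if the row is full
        self.rules[slot] = rule      # the controller rewrites the BV tables first
        self.valid[slot] = 1         # then the rule is made visible
        return slot

    def delete(self, slot):
        self.valid[slot] = 0         # table contents are left in place

    def modify(self, slot, rule):
        self.delete(slot)            # one deletion ...
        return self.insert(rule)     # ... followed by one insertion

    def mask(self, bv):
        """Suppress matches reported by invalid slots."""
        return [b & v for b, v in zip(bv, self.valid)]

row = PipelineRow(3)
row.insert("r0")
row.insert("r1")
row.delete(0)
masked = row.mask([1, 1, 1])   # -> [0, 1, 0]
```

Note that in this sketch a modified rule may land in a different slot from the one it was deleted from, since insertion takes the first free slot.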
IV. EVALUATION
A. EXPERIMENTAL SETUP
1) IMPLEMENTATION PLATFORM
All simulation experiments are run on a machine with an Intel i7-4720HQ CPU @ 2.60GHz running Ubuntu 14.04. In addition, we conduct experiments using Quartus (Prime 17.0), targeting the Intel Arria II GX FPGA, which provides 2.8Mb of Block RAM, 36100 Combinational ALUTs, and 27369 Registers. The minimum block memory size on Arria II GX FPGAs is 20Kb.
2) SYNTHETIC CLASSIFIERS
This paper carries out simulation experiments, using the seed files of ClassBench [24] and ClassBench-ng [25] to generate 100K 5-tuple rules and 100K OpenFlow rules. Rule sets include Access Control List (ACL), Firewall (FW), IP chain (IPC), and OpenFlow 1.0. The ACL, FW, and IPC rule sets contain 5 fields and the OpenFlow rule sets contain 12 fields. These rule sets can be viewed on the open-source website GitHub [28] , [29] .
3) PARAMETERS
The initial parameters s and n need to be determined for the MsTP architecture, where s (1 ≤ s ≤ L) represents the bit length of a sub-field, and n (1 ≤ n ≤ N) represents the number of rules in a subset. The minimum size of a memory block on Arria II GX FPGAs is 20Kb, and the minimum depth of a memory block is 512. One 20Kb memory block can be configured from 2^9 × 40 bits (minimum depth: 512) to 20K × 1 bits (maximum depth: 20K). One MsPE needs 2^s × n bits to store the BV table, which means that the minimum value of s is 9. In addition, we refer to the results of the comparative experiments [23], [26], which were conducted on FPGA with comprehensive consideration of main factors such as throughput and delay. Those results show that performance decreases as s increases, so the parameters are set to s = 9 and n = 40.
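The choice s = 9 and n = 40 can be checked with a few lines of arithmetic. The Python sketch below assumes that 20Kb means 20 × 1024 bits; the constant and function names are hypothetical.

```python
BLOCK_BITS = 20 * 1024   # one 20Kb memory block, assuming 1Kb = 1024 bits
MIN_DEPTH = 512          # minimum block depth stated above

def bv_table_bits(s, n):
    """Bits needed by one MsPE: 2^s entries of n bits each."""
    return (1 << s) * n

s, n = 9, 40
fills_one_block = bv_table_bits(s, n) == BLOCK_BITS   # 512 x 40 = 20480 bits
depth_is_minimum = (1 << s) == MIN_DEPTH
```

With these values, one BV table fills exactly one minimum-size memory block.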
B. PARAMETER EVALUATION EXPERIMENT OF MSBUS ARCHITECTURE
Owing to its physical characteristics, each memory device has two read ports. We define a "conflict" as follows. The local memory never conflicts because only two MsPEs are connected to it. For a two-level memory block, a conflict occurs if three or more MsPEs access the block at the same time, which requires arbitration and increases the pipeline delay. The conflict rate can be reduced by adding buses and two-level memory blocks, but this increases resource consumption. Therefore, a balance needs to be struck between resource consumption and the conflict rate.
In the MsTP architecture, each row has W1 or W2 data buses and two-level memory blocks. W1 is the parameter for the 5-tuple rule set, and W2 is the parameter for the OpenFlow rule set. Different values of W1 or W2 correspond to different styles of MsBus architecture. The conflict rates of the basic technology, the segment rearrange technology, and the field rearrange technology under different MsBus architecture parameters are measured through simulation experiments and are illustrated in Fig. 12. As the parameter W2 increases, the conflict rate of the basic technology decreases significantly. This is because many fields are unused in the OpenFlow rule set and their contents are all wildcards, so that PE pairs with a "perfect combination" appear as often as possible. The two optimized technologies are still applicable to the OpenFlow rule set, and their effect for the 5-tuple rule set becomes more evident as the parameter W1 rises. The software simulation experiments show that the conflict rate is 0 for the ACL rule set when W1 = 2, for the FW or IPC rule set when W1 = 3, and for the OpenFlow rule set when W2 = 1.
Based on these experimental results, to avoid access conflicts each row of the MsTP architecture requires 2 or 3 buses and 2 or 3 two-level memory blocks for the 5-tuple rule sets, and only one bus and one two-level memory block for the OpenFlow rule set. Because the MsBus parameter of the OpenFlow rule set is smaller, the compression effect for the OpenFlow rule set is particularly significant.
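One way to estimate a conflict rate in software is sketched below in Python. The model is an assumption made here for illustration, not the simulator used in the experiments: PE pairs are adjacent columns, one MsPE of every standard-standard pair must use the two-level memory, these MsPEs are spread evenly over the w two-level memory blocks of the row, and a row conflicts when they outnumber the available read ports. The function name `conflict_rate` is hypothetical.

```python
def conflict_rate(wildcard_rows, w, ports=2):
    """Fraction of rows whose overflow MsPEs exceed the two-level read ports.

    wildcard_rows: one list of booleans per pipeline row (True = wildcard MsPE).
    w: number of two-level memory blocks (and buses) per row.
    """
    conflicted = 0
    for row in wildcard_rows:
        # One MsPE of every standard-standard pair must use two-level memory.
        overflow = sum(
            1 for j in range(0, len(row) - 1, 2)
            if not row[j] and not row[j + 1]
        )
        if overflow > w * ports:
            conflicted += 1
    return conflicted / len(wildcard_rows)

# Three hypothetical rows with 3, 0, and 2 standard-standard pairs.
rows = [
    [False] * 6,
    [False, True, False, True, False, True],
    [False] * 4 + [True, True],
]
rate_w1 = conflict_rate(rows, w=1)   # one of three rows conflicts
rate_w2 = conflict_rate(rows, w=2)   # no row conflicts
```

Under this model, adding a second two-level memory block removes the conflict at the cost of extra memory, which is the trade-off described above.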
C. RESOURCE ANALYSIS
Owing to the physical characteristics of the memory, the two-level memory block has two ports. The MsBus architecture sets the capacity of the two-level memory block to twice the capacity of the local shared memory and supports simultaneous access by two MsPEs.
We consider two aspects together: the most appropriate parameters W1 and W2 obtained from the simulation experiments, and the capacity of the two-level memory block. We implement MsBV on the FPGA platform. As illustrated in Table 1, for the OpenFlow rule sets MsBV saves 57.53% of the ALUTs, 37.59% of the Registers, and 43.69% of the memory resources compared with StrideBV [22]. MsBV thus has a significant compression effect for the OpenFlow rule set. On the other hand, the improvement is not evident for the FW rule set and the IPC rule set.
V. RELATED WORK
Other bit-vector-based memory compression schemes exist. For example, Li et al. [26] proposed the WeeBV scheme. Because the bit-vector does not change after a wildcard PE processes the input bit-vector, Li et al. [26] remove the SRAM memory of the wildcard PE to compress the memory. However, WeeBV [26] has certain limitations in the face of dynamic updates. WeeBV needs to reconfigure a wildcard PE when it becomes a standard PE, so its update performance is poor. Although WeeBV proposes a sink-update algorithm, that algorithm requires plenty of empty rules, which increases memory consumption.
VI. CONCLUSION
This paper proposes MsBV, a memory compression scheme for the BV-based packet classification approach that supports multi-field rule matching and reduces memory consumption while ensuring processing performance. MsBV constructs a memory-shared homogeneous two-dimensional pipeline architecture, MsTP. In MsTP, every two MsPEs form a PE pair and share one local memory, thus effectively reducing memory overhead. For OpenFlow rules, MsBV saves 43.69% of the memory resources, 57.53% of the ALUTs, and 37.59% of the Registers compared to StrideBV [22].
