Abstract-A branching program machine (BM) is a special-purpose processor that uses only two kinds of instructions: branch and output instructions. Thus, the architecture of the BM is much simpler than that of a general-purpose processor (MPU). Since the BM uses dedicated instructions for a special-purpose application, it is faster than the MPU. This paper presents a packet classifier using a parallel branching program machine (PBM). To reduce computation time and code size, a set of rules for the packet classifier is first partitioned into groups, which are then evaluated by the PBM in parallel. This paper also shows a method to estimate the number of BMs necessary to realize the packet classifier. The PBM32, consisting of 32 BMs, has been implemented on an FPGA and compared with Intel's Core2Duo@1.2GHz. The PBM32 is 8.1-11.1 times faster than the Core2Duo, and requires only 0.2-10.3 percent of the memory used by the Core2Duo.
I. INTRODUCTION
Packet classification is a key technology in routers and firewalls. A packet header includes a protocol number, a source address, a destination address, and a port number. The packet classifier performs a predefined action for the rule that matches the header. Applications of the packet classifier include a firewall (FW), an access control list (ACL), and an IP chain for IP masquerading. Different uses require systems with different performance, so different architectures should be used. In data centers and ISPs (Internet Service Providers), the required throughput is more than tens of gigabits per second. Thus, CAMs, FPGAs, or ASICs are used. These devices dissipate much power or require a high development cost. On the other hand, low-end users, including SOHO (small office and home office) users, rely on embedded processors or general-purpose processors. In this research, we consider the packet filter for low-end users, so we compare the performance with that of a general-purpose processor (MPU). The throughput of the state-of-the-art packet classifier using an MPU is at most hundreds of megabits per second [4], so it cannot keep up with the increasing speed of the Internet. This paper shows a packet classifier using a parallel branching program machine (PBM) [12]. A branching program machine (BM) is a special-purpose processor that uses only two kinds of instructions [2], [1], [21]. Thus, the BM has a simpler architecture than the MPU. Since the BM has dedicated branch instructions that are frequently used in the packet classifier, it is faster than the MPU. To realize the packet classifier by the PBM, a set of rules for the packet classifier is first partitioned into groups, which are then evaluated by the PBM in parallel.
The rest of the paper is organized as follows: Section 2 defines the packet classifier; Section 3 introduces the PBM; Section 4 shows the realization of the packet classifier using the PBM; Section 5 compares the PBM with Intel's Core2Duo; and Section 6 concludes the paper.
II. PRELIMINARY

A. Packet Classifier
A packet classification table consists of a set of rules. Each rule has six input fields: source address (SA), destination address (DA), source port (SP), destination port (DP), protocol number (PRT), and flag number (FLG). It also generates a rule number (Rule). Each field of a rule holds an entry. In this paper, since we consider a realization of the packet classifier for the Internet protocol version 4 (IPv4), SA and DA have 32 bits; DP, SP, and FLG have 16 bits; and PRT has 8 bits. An entry for SA and DA is specified by an IP address; that for SP and DP is specified by a range of port numbers; that for PRT is specified by a protocol number; and that for FLG is specified by a bit vector [18]. Thus, SA and DA are detected by a longest prefix match (LPM); SP and DP are detected by a range match; and PRT and FLG are detected by an exact match. A packet classifier detects matched rules using the packet classification table. When two or more rules match, it selects the rule having the highest priority. In this paper, we assume that the rule with the largest number has the highest priority. Note that any packet matches a default rule whose rule number is zero. Obviously, the default rule has the lowest priority.
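The matching and priority rules described above can be illustrated by a minimal software sketch. The rule set below is hypothetical (it is not Table I), the field width is reduced to four bits, and FLG is simplified to an exact value or a don't-care; the names `Rule`, `prefix_match`, and `classify` are introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

WIDTH = 4  # toy field width (real IPv4 addresses use 32 bits)


@dataclass
class Rule:
    number: int                 # larger number = higher priority
    sa: Tuple[int, int]         # (prefix value, prefix length)
    da: Tuple[int, int]
    sp: Tuple[int, int]         # inclusive port range (low, high)
    dp: Tuple[int, int]
    prt: Optional[int] = None   # None stands for "matches anything"
    flg: Optional[int] = None


def prefix_match(x, prefix):
    """Prefix match: compare only the leading `length` bits."""
    value, length = prefix
    shift = WIDTH - length
    return (x >> shift) == (value >> shift)


def matches(rule, header):
    sa, da, sp, dp, prt, flg = header
    return (prefix_match(sa, rule.sa) and prefix_match(da, rule.da)
            and rule.sp[0] <= sp <= rule.sp[1]      # range match
            and rule.dp[0] <= dp <= rule.dp[1]
            and rule.prt in (None, prt)             # exact match or don't-care
            and rule.flg in (None, flg))


def classify(header, rules):
    """Return the largest matching rule number; 0 is the default rule."""
    return max((r.number for r in rules if matches(r, header)), default=0)


# Hypothetical rules, for illustration only.
RULES = [
    Rule(1, sa=(0b0000, 2), da=(0b1000, 1), sp=(0, 15), dp=(0, 15)),
    Rule(2, sa=(0b1000, 1), da=(0, 0), sp=(0, 15), dp=(0, 15), prt=2),
    Rule(3, sa=(0b0000, 1), da=(0b1010, 4), sp=(8, 8), dp=(0, 15), prt=1),
]

print(classify((0b0000, 0b1010, 8, 8, 1, 0b1111), RULES))
```

In this sketch, a header that matches rules 1 and 3 is classified as rule 3, and a header that matches no rule falls back to the default rule 0.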
Example 2.1: Table I shows an example of the packet classification table, where an asterisk '*' in an entry matches both 0 and 1, while a dash '-' in a field matches any pattern. Note that each field has four bits, rather than the actual number of bits, to simplify the example. Consider the packet classification table shown in Table I. The packet header with SA = 0000, DA = 1010, SP = 8, DP = 8, PRT = TCP, and FLG = 1111 matches rule 3, rule 1, and the default rule. Since rule 3 has the highest priority, rule 3 is detected.
Example 2.2:
B. Representation of Entries by Interval Functions
An entry of a rule can be represented by an interval function [16] . First, we define the interval function.
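Definition 2.1: An interval function IN(X : A, B) is defined as

$$\mathrm{IN}(X : A, B) = \begin{cases} 1 & (A \le X \le B) \\ 0 & (\text{otherwise}) \end{cases} \qquad (1)$$

where X is an n-bit integer, and A and B are integers that satisfy $0 \le A \le B \le 2^n - 1$. Next, we represent any entry by the interval function. Suppose that the packet header is represented by the 6-tuple $(X_{SA}, X_{DA}, X_{SP}, X_{DP}, X_{PRT}, X_{FLG})$. Since the entries for SP and DP are specified by range matches, they can be directly represented by interval functions. When A = B in Expr. (1), it represents an exact match. Let b be the protocol number. The entry for PRT is represented by

$$\mathrm{IN}(X_{PRT} : b, b). \qquad (2)$$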
Similarly, any entry for FLG can be represented by an interval function. Let 
Note that $e_i$ is an r-bit unit vector, where only the i-th bit is one and the other bits are zeros.
Example 2.3: Table II represents the entries in Table I using Exprs. (1), (2), and (3). Note that PRT is represented by integers: TCP = 1, UDP = 2, and ICMP = 3. In Table II, $e_i$ denotes the unit vector corresponding to the rule number.
Example 2.4:
By assigning entries shown in Table II to Expr. (4), we have a vectorized packet classification function F , where k = 6 and r = 5. When a packet header has values SA=0000, DA=1010, SP=8, DP=8, PRT=1, and FLG=1111, we have F = (0, 1, 0, 1). It means that rule 3, rule 1, and the default rule are matched.
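As a rough illustration of the interval representation, the following sketch converts prefix entries to intervals and evaluates a vectorized classification function as a bitwise OR of unit vectors. The four rules are hypothetical stand-ins (Table II is not reproduced here), and we assume that bit i-1 of the result corresponds to rule i and that the default rule is not included in the vector.

```python
def IN(x, a, b):
    """Interval function: 1 when a <= x <= b, otherwise 0."""
    return 1 if a <= x <= b else 0


def prefix_to_interval(bits):
    """'10**' -> (8, 11). Assumes the asterisks are trailing (a prefix)."""
    low = int(bits.replace("*", "0"), 2)
    high = int(bits.replace("*", "1"), 2)
    return low, high


def vectorized_classify(header, rules):
    """OR together the unit vector e_i of every rule whose fields all match."""
    f = 0
    for i, intervals in enumerate(rules):
        if all(IN(x, a, b) for x, (a, b) in zip(header, intervals)):
            f |= 1 << i          # unit vector for rule i+1
    return f


ANY = (0, 15)  # a 4-bit field that matches any value
P = prefix_to_interval

# Hypothetical rules 1..4 over (SA, DA, SP, DP, PRT, FLG), for illustration only.
RULES = [
    [P("00**"), P("1***"), ANY, ANY, ANY, ANY],
    [P("1***"), ANY, ANY, ANY, (2, 2), ANY],
    [P("0***"), P("1010"), (8, 8), ANY, (1, 1), ANY],
    [ANY, P("01**"), (0, 7), ANY, ANY, ANY],
]

header = (0b0000, 0b1010, 8, 8, 1, 0b1111)
print(format(vectorized_classify(header, RULES), "04b"))
```

Read from the most significant bit, the printed vector lists rules 4 down to 1, so a header matching rules 3 and 1 yields 0101.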
C. Priority Encoder Function
A packet header may match multiple rules. To detect the rule with the highest priority, we use a priority encoder function. The priority encoder function for r rules generates a $\log_2 r$-bit binary number.
Example 2.5: When the vector F = (0, 1, 0, 1) is applied to the priority encoder function shown in Table III, we have (0, 1, 1). This means that rule 3 is detected.
The priority encoder function can be represented by the interval function.
Example 2.6: Table IV shows an example of the priority encoder function for r = 4.
By using the vectorized classification function and the priority encoder function, we can realize the packet classifier with the specified priority.
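A priority encoder of this kind can be sketched in a few lines. We assume here that the match vector is held as an integer whose bit i-1 corresponds to rule i, so that the position of the highest set bit gives the selected rule number; the function name and the explicit output width are choices made for this illustration.

```python
def priority_encode(f, width):
    """Return the highest matching rule number as a tuple of `width` bits.

    f     : match vector as an integer (bit i-1 set when rule i matches)
    width : number of output bits
    An all-zero vector yields 0, which stands for the default rule.
    """
    index = f.bit_length()  # position of the highest set bit, 0 if none
    return tuple((index >> k) & 1 for k in reversed(range(width)))


print(priority_encode(0b0101, 3))
```

With the vector (0, 1, 0, 1) of Example 2.5 written as 0b0101, the sketch returns (0, 1, 1), that is, rule 3.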
D. Number of Rules
The embedded packet classifier implemented by the general purpose processor [4] , [5] uses 100-300 rules [6] . To compare the performance, we also assume that the number of rules is 200.
III. PARALLEL BRANCHING PROGRAM MACHINE [12]
The packet classifier is realized by a parallel branching program machine (PBM). First, each field is converted to a decision diagram. Then, these decision diagrams are evaluated in parallel.
A. MTQDD
An arbitrary n-variable logic function can be represented by a BDD (Binary Decision Diagram) [3]. An MTBDD (Multi-Terminal Binary Decision Diagram) can evaluate many outputs at a time. Evaluation of the MTBDD requires n table look-ups. In this paper, we consider that the evaluation time for the BDD is proportional to the longest path length (LPL). Definitions and optimization techniques are shown in [9].
To further speed up the evaluation, an MDD (Multi-valued Decision Diagram) [8] is used. In the MDD(q), q variables are grouped to form a $2^q$-valued super variable. Note that a BDD is equivalent to an MDD(1). When the function is represented by an MDD(q), at most $\lceil n/q \rceil$ table look-ups are necessary to evaluate an input vector [7]. The evaluation time can be reduced by increasing q. However, a node of the MDD(q) requires a number of pointers proportional to $2^q$. For many benchmark functions, the total memory size is minimized by the MDD(2) [10]. Since the MDD(2) has four branches, it is called a QDD (Quaternary Decision Diagram). The QDD machine is known to be the best in terms of area-time complexity [11].
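The effect of grouping q variables can be seen by splitting an n-bit input into super variables, one per table look-up. The helper below is a small sketch under the assumption that variables are grouped from the most significant bit and that the input is zero-padded at the top when q does not divide n; the function name is hypothetical.

```python
def super_variables(x, n, q):
    """Split an n-bit integer x into 2**q-valued digits, most significant first."""
    steps = -(-n // q)                 # ceil(n / q): number of table look-ups
    padded = steps * q                 # zero-pad at the top if q does not divide n
    mask = (1 << q) - 1
    return [(x >> (padded - q * (k + 1))) & mask for k in range(steps)]


print(super_variables(0b1011, 4, 2))        # two quaternary digits
print(len(super_variables(0, 32, 1)))       # BDD: one look-up per bit
print(len(super_variables(0, 32, 2)))       # QDD: half as many look-ups
```

For a 32-bit address, the sketch gives 32 look-ups for q = 1 and 16 for q = 2.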
Example 3.7: Fig. 1 shows an example of the MTBDD. Fig. 2 shows the MTQDD that is derived from the MTBDD in Fig. 1.
B. Instructions for the Branching Program Machine [17]
Three instructions are used to evaluate an MTQDD. A 2-address binary branch instruction (B_BRANCH) and a 3-address quaternary branch instruction (Q_BRANCH) evaluate a non-terminal node, while a dataset instruction (DATASET) evaluates a terminal node. Mnemonics and their internal representations for B_BRANCH, Q_BRANCH, and DATASET are shown in Fig. 3.
B_BRANCH performs a binary branch: if the value of the variable specified by INDEX is equal to 0, then GOTO ADDR0, else GOTO ADDR1. DATASET performs an output operation and a jump operation: first, DATASET writes DATA (16 bits) to the register specified by REG; then, GOTO ADDR. Q_BRANCH jumps to one of four addresses: three of them are specified by ADDR0, ADDR1, and ADDR2, and the fourth is the next address (PC+1); SEL specifies which of the four branches goes to PC+1 (Fig. 4).
The accompanying figure shows the MTQDD with address assignment for Q_BRANCH instructions, where SEL has the same meaning as in Fig. 4. For A6, a B_BRANCH instruction is used for an unconditional jump, since the terminal node '10' is already assigned to A3. Thus, the program in Fig. 9 evaluates the MTQDD.
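To make the behavior of the three instructions concrete, the following is a minimal software interpreter. It is only a sketch: the tuple encoding of the instructions, the convention that INDEX is the position of the lowest bit of the tested variable(s), the mapping of the three non-SEL values to ADDR0, ADDR1, and ADDR2 in ascending order, and the choice to stop after the first DATASET are all assumptions made for illustration, not the internal representation of Fig. 3.

```python
def run(program, x, max_steps=1000):
    """Evaluate a branching program on the input integer x; return the registers."""
    regs = {}
    pc = 0
    for _ in range(max_steps):
        op = program[pc]
        if op[0] == "B_BRANCH":
            _, index, addr0, addr1 = op
            pc = addr1 if (x >> index) & 1 else addr0
        elif op[0] == "Q_BRANCH":
            _, index, sel, addr0, addr1, addr2 = op
            v = (x >> index) & 0b11              # value of the 2-bit super variable
            if v == sel:
                pc += 1                          # fall through to the next address
            else:
                others = [u for u in range(4) if u != sel]
                pc = (addr0, addr1, addr2)[others.index(v)]
        else:                                    # DATASET
            _, reg, data, _addr = op
            regs[reg] = data & 0xFFFF            # DATA is 16 bits wide
            return regs                          # assumption: stop at the first output
    raise RuntimeError("program did not reach a DATASET instruction")


# Hypothetical program over a 4-bit input (x3 x2 x1 x0).
PROGRAM = [
    ("Q_BRANCH", 2, 0b11, 2, 2, 4),   # 0: test (x3, x2); 11 -> PC+1
    ("DATASET", 0, 0b10, 0),          # 1: output 10
    ("B_BRANCH", 0, 3, 4),            # 2: test x0
    ("DATASET", 0, 0b00, 0),          # 3: output 00
    ("DATASET", 0, 0b01, 0),          # 4: output 01
]

print(run(PROGRAM, 0b1100))
```

Each evaluated node costs one instruction, so the number of executed instructions follows the path length in the decision diagram.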
By changing the addresses and the SEL as shown in Fig. 6, we can remove the unconditional jump. In this way, the code for the 3-address quaternary branch can be optimized. The number of unconditional jumps can be minimized by the optimization method shown in [17].
C. Branching Program Machine (BM)
Fig. 10 shows a branching program machine (BM). It consists of the instruction memory, which stores up to 256 words of 32 bits; the instruction decoder; the program counter (PC); and the register file. In our implementation, two clocks are used to execute each instruction of the BM. Double-rank flip-flops [13] are used to implement the output register: the values are first written into L 1 latches by using C Clock, and when all the outputs and state variables are evaluated, the values of L 1 are sent to L 2 latches by using S Clock.
D. 8 BM
Fig. 11 shows the architecture of the 8 BM, which consists of eight BMs. The output registers of BMs are connected to the inputs of the following BMs through programmable routing boxes. Also, each BM can operate independently.
A programmable routing box implements the bitwise AND and the bitwise OR operations. It also generates constant values: in the programmable routing boxes (highlighted in gray in Fig. 11), constant 1s are generated to perform the bitwise AND operation, while constant 0s are generated to perform the bitwise OR operation. Since the BMs are connected to each other by sharing a register, each BM can send a signal to another BM within an 8 BM in one clock. Since the BM uses two clocks to execute an instruction, the communication delay can be neglected.
E. Parallel Branching Program Machine
... network interface (PHY/MAC). Each 8 BM has external outputs connecting to the programmable interconnection and the system bus. In addition, the host MPU is used to control the whole system.
IV. REALIZATION OF PACKET CLASSIFIER USING PBM
A. Packet Classification Table Implemented by 8 BM
Since the packet classification table has many inputs and outputs, a direct realization by a single MTQDD is infeasible. Our strategy is as follows: First, we partition the set of rules into several groups (Fig. 13, Step 1). Second, we partition each group into six fields (Fig. 13, Step 2). Third, we convert them to MTQDDs, and load the data into the 8 BM in the PBM (Fig. 13, Step 3). Finally, we use the PBM to evaluate them in parallel.
Theorem 4.1: Consider a vectorized packet classification function F . Let k be the number of fields, and r be the number of rules, then we have the relation: product-of-sums (POS):
By converting the above POS, we have the sum-of-products (SOP) whose product consists of forms $f_{ai,bj}$, where
□
From the interval functions shown in Table II, by Theorem 4.1, Expr. (4) can be converted to
Note that, in Expr. (6), each sum corresponds to a field in the packet classification table. A function representing a sum is a vectorized field function. Note that Expr. (6) is the product of six terms, while the 8 BM consists of eight BMs. To improve the utilization of the 8 BM, we decompose each of the SA field and the DA field into two. Let $X_{SAE}$ be the even bits of SA; $X_{SAO}$ be the odd bits of SA; $X_{DAE}$ be the even bits of DA; and $X_{DAO}$ be the odd bits of DA. Expr. (6) is converted to the product of eight vectorized field functions as follows:
where ..., $A_{DAOi}$, and $B_{DAOi}$ are integers. Note that Expr. (7) is the product of eight sums. Thus, we can efficiently realize Expr. (7) by the 8 BM and the bitwise-AND gate. The programmable routing box shown in Fig. 11 realizes the bitwise-AND gate.
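The idea behind the field-wise realization can be checked with a small sketch: evaluating every rule over all its fields gives the same match vector as evaluating one vectorized field function per field and combining the results by bitwise AND. The two-field rule set below is hypothetical, and the all-ones initial value plays the role of the constant 1s generated by the routing box; the function names are introduced only for this illustration.

```python
from itertools import product


def IN(x, a, b):
    return 1 if a <= x <= b else 0


def rule_wise(header, rules):
    """Match vector computed rule by rule (bit i set when rule i+1 matches)."""
    f = 0
    for i, rule in enumerate(rules):
        if all(IN(x, a, b) for x, (a, b) in zip(header, rule)):
            f |= 1 << i
    return f


def field_vector(x, intervals):
    """Vectorized field function: OR of e_i * IN(x : A_i, B_i) for one field."""
    h = 0
    for i, (a, b) in enumerate(intervals):
        if IN(x, a, b):
            h |= 1 << i
    return h


def field_wise(header, rules):
    """Bitwise AND of the per-field vectors (one BM per field in hardware)."""
    f = (1 << len(rules)) - 1                    # constant 1s for the AND
    for j, x in enumerate(header):
        f &= field_vector(x, [rule[j] for rule in rules])
    return f


# Hypothetical rules over two 4-bit fields, for illustration only.
RULES = [
    [(0, 3), (8, 15)],
    [(8, 15), (0, 15)],
    [(0, 7), (10, 10)],
    [(0, 15), (4, 7)],
]


def check_equivalence():
    """Exhaustively compare both evaluations over all 4-bit header pairs."""
    return all(rule_wise(h, RULES) == field_wise(h, RULES)
               for h in product(range(16), repeat=2))


print(check_equivalence())
```

Because each field vector depends on a single field only, the per-field evaluations are independent and can run on separate BMs in parallel.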
B. Priority Encoder Function Implemented by BM
As shown in Section II-D, we assume that the number of rules is 200. When the number of rules is more than a few hundred, the number of inputs for the priority encoder function is too large, so evaluating it by the BM is too slow. To realize the priority encoder function compactly, we assume the following conditions:
1. Any pair of rules in the same group are disjoint².
2. Any pair of rules that belong to different groups may intersect.
Since rules are mutually disjoint in a group, the 8 BM can realize it without the priority encoder. On the other hand, since rules in different groups may intersect, an additional BM for the priority encoder function is attached to the outputs of the 8 BMs. Since the number of groups is small, the number of inputs for the BM realizing the priority encoder is also small. Thus, the priority encoder function implemented by the BM is fast enough.
C. Packet Classifier Implemented by PBM
Fig. 14 shows the realization of the packet classifier using the PBM8m, where the rules are partitioned into m groups. An 8 BM in the PBM8m realizes a group. The programmable interconnection connects the m 8 BMs, and a BM realizes the priority encoder (in our implementation, m = 4).
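The grouping scheme can be sketched in software as follows. Note that the partitioning below is a simple greedy first-fit heuristic that we use only for illustration; it is not the conversion performed by our tool. Two rules are treated as disjoint when their intervals do not intersect in at least one field, and the rules and function names are hypothetical.

```python
def disjoint(r1, r2):
    """Two rules are disjoint if their intervals do not intersect in some field."""
    return any(a1 > b2 or a2 > b1 for (a1, b1), (a2, b2) in zip(r1, r2))


def partition(rules):
    """Greedy first-fit: put each rule into the first group it is disjoint with."""
    groups = []
    for number, rule in rules:
        for group in groups:
            if all(disjoint(rule, other) for _, other in group):
                group.append((number, rule))
                break
        else:
            groups.append([(number, rule)])
    return groups


def evaluate_group(header, group):
    """At most one rule of a group can match, so no priority encoder is needed."""
    hits = [n for n, rule in group
            if all(a <= x <= b for x, (a, b) in zip(header, rule))]
    assert len(hits) <= 1
    return hits[0] if hits else 0


def classify(header, groups):
    """Priority encoding across groups: the largest rule number wins."""
    return max((evaluate_group(header, g) for g in groups), default=0)


# Hypothetical (rule number, per-field intervals) pairs over two 4-bit fields.
RULES = [
    (1, [(0, 3), (8, 15)]),
    (2, [(8, 15), (0, 15)]),
    (3, [(0, 7), (10, 10)]),
    (4, [(0, 15), (4, 7)]),
]

GROUPS = partition(RULES)
print([[n for n, _ in g] for g in GROUPS])
print(classify((0, 10), GROUPS))
```

In this example the four rules fall into two groups, so the final priority encoder has only two inputs instead of four.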
V. ANALYSIS OF VECTORIZED FIELD FUNCTIONS
By analyzing the vectorized field function, we can estimate the number of steps for the BM and the size of the hardware. First, we define the region for a vectorized field function.
Let $H(X)$ be a vectorized field function, where $0 \le X \le 2^n - 1$, and let r be the number of rules (in other words, the number of interval functions). For each value of H, we assign a region, which is an interval or a set of intervals in $[0, 2^n - 1]$.
² Our tool converts the set of rules into disjoint ones to satisfy this condition.
Example 5.9: Fig. 15 (a) shows the relation of intervals and regions for the source address (SA). H takes five different values {0101, 0011, 0001, 1000, 0000}, and the corresponding regions are [0, 3], [4, 5], [6, 7], [8, 8], and [9, 15], respectively. Fig. 15 (b) shows the relation of intervals and regions for the destination address (DA). In this case, the number of regions is eight, since H takes eight values {0000, 0001, 0101, 0111, 1111, 1011, 0011, 0010}. Note that the region for {0000} consists of two disjoint intervals [0, 3] and [15, 15].
Example 5.9 shows that, when two interval functions have a common element and neither interval contains the other, three new regions are produced. For example, for DA shown in Fig. 15 (b), the interval functions IN(X_DA : 6, 8) and IN(X_DA : 8, 9) produce three new regions ([6, 7], [8, 8], and [9, 9]). In contrast, when one of two intervals contains the other, or when they do not intersect, only two regions are produced. For example, for SA shown in Fig. 15 (a) ...
Theorem 5.2: A vectorized field function consisting of s interval functions has at most 2s regions.
(Proof) We prove it by mathematical induction. When s = 1, the number of regions is at most two. Assume that the number of regions for s interval functions is t ≤ 2s. When we add an additional interval function, the number of regions increases by at most two. Thus, for (s + 1) interval functions, the total number of regions is at most t + 2 ≤ 2s + 2 = 2(s + 1). □
Example 5.10: The DA shown in Fig. 15 (b) has s = 4 interval functions. The number of regions is eight.
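The notion of regions can be explored with a small sketch that enumerates all input values, evaluates the vectorized field function, and collects the maximal intervals on which each value is taken. Exhaustive enumeration is practical only for small n and is used here purely for illustration; the function name is hypothetical.

```python
def regions(intervals, n):
    """Map each value of the vectorized field function to its list of intervals."""
    result = {}
    for x in range(2 ** n):
        h = 0
        for i, (a, b) in enumerate(intervals):
            if a <= x <= b:
                h |= 1 << i                       # e_i for the i-th interval function
        runs = result.setdefault(h, [])
        if runs and runs[-1][1] == x - 1:
            runs[-1] = (runs[-1][0], x)           # extend the current interval
        else:
            runs.append((x, x))                   # start a new interval
    return result


# The two overlapping intervals [6, 8] and [8, 9] discussed above (n = 4).
R = regions([(6, 8), (8, 9)], 4)
for value, runs in sorted(R.items()):
    print(format(value, "02b"), runs)
```

For these two intervals the sketch finds the three regions [6, 7], [8, 8], and [9, 9], plus the region of the all-zero value, which gives four regions in total and meets the bound 2s for s = 2.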
Theorem 5.3: [15] The vectorized field function for FLG (PRT) with s intervals has at most s + 1 regions.
Theorem 5.4: [15] The vectorized field function for the address field with s intervals has at most s + 1 regions.
To derive the number of nodes for the MTBDD, we first introduce the decomposition chart. Each column is labeled by the bound variables $X_B$, while each row is labeled by the free variables $X_F$. The corresponding chart entry denotes the function value. The number of different column patterns in the decomposition chart is the column multiplicity. A column that has two or more different entries is a non-constant column, while a column whose entries are all the same is a constant column.
Example 5.11: Fig. 16 (a) shows the decomposition chart for the vectorized field function of the DA shown in Fig. 15 (b), where $X_B = (x_3, x_2, x_1)$ and $X_F = (x_0)$. Note that the function value is written as a decimal number in Fig. 16 (a), while in Fig. 15 (b) it is written as a binary number. Columns for {011, 100, 111} are the non-constant columns.
The number of nodes for an index of a quasi-reduced MTBDD corresponds to the column multiplicity of a decomposition chart. Also, the column multiplicity is related to the number of regions of the vectorized field function. A non-constant column in a decomposition chart is represented by a non-terminal node in the MTBDD. For example, in Fig. 16, the non-constant columns (α, β, and γ) correspond to the nodes (α, β, and γ), respectively. In contrast, the constant columns correspond to the terminal nodes. Thus, the number of different non-constant columns equals the number of nodes for the corresponding index of the quasi-reduced MTBDD.
Lemma 5.1: [14] The number of different column patterns of the vectorized field function f with t regions is at most t.
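A decomposition chart can be built directly from its definition. The sketch below takes the upper bits as bound variables and the lower bits as free variables, and counts both the distinct column patterns and the distinct non-constant ones. The field function used in the example is a hypothetical one built from the intervals [6, 8] and [8, 9], not the function of Fig. 15 (b), and the function names are introduced only for this illustration.

```python
def column_stats(f, n, free_bits):
    """Return (column multiplicity, number of distinct non-constant columns).

    f         : function from an n-bit integer to a value
    free_bits : number of low-order bits used as free variables X_F
    """
    columns = set()
    for xb in range(2 ** (n - free_bits)):        # one column per bound value X_B
        column = tuple(f((xb << free_bits) | xf) for xf in range(2 ** free_bits))
        columns.add(column)
    non_constant = {c for c in columns if len(set(c)) > 1}
    return len(columns), len(non_constant)


def h(x):
    """Hypothetical vectorized field function for the intervals [6, 8] and [8, 9]."""
    return (1 if 6 <= x <= 8 else 0) | ((1 if 8 <= x <= 9 else 0) << 1)


print(column_stats(h, 4, 1))
```

This function has four regions and three distinct column patterns, which is consistent with Lemma 5.1; only one of the patterns is non-constant and would need a non-terminal node at this index.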
From the above discussion, we have the upper bound of the number of nodes for the MTBDD that realizes the vectorized field function for r rules.
Theorem 5.5: In an arbitrary index for the MTBDD (MTQDD) representing the vectorized field function for r rules, the number of non-terminal nodes is at most 2r.
(Proof) We prove the case for MTBDDs. The proof for the case of MTQDDs is similar. Consider a vectorized field function consisting of r interval functions. From Theorem 5.2, the number of regions is at most 2r. From Lemma 5.1, the number of non-constant column patterns is at most 2r. Since a non-constant column pattern in the decomposition chart corresponds to a non-terminal node in the quasi-reduced MTBDD, we have the theorem. □
Theorem 5.6: Let n be the number of primary inputs, and r be the number of rules for the packet classification table. Then, the number of nodes for the MTQDD representing the vectorized field function is at most

$$\frac{2^p - 1}{3} + \frac{n - p}{2}\, 2r + r + 1,$$

where p is an integer satisfying $2^p \le 2r$.
(Proof) We partition the nodes of the MTBDD into three parts and count the nodes in each part separately. We assume that the root node has the index n, while the terminal nodes have the index zero. In the upper part, for the indices from n to n − p + 1, consider the complete binary tree. Then, the number of nodes is $2^p - 1$. A node of the MTQDD includes three nodes or one node of the MTBDD (Fig. 17(a)). Thus, the number of MTQDD nodes in the upper part is at most

$$\frac{2^p - 1}{3}. \qquad (8)$$

As for the middle part, from Theorem 5.5, for each index, the number of non-terminal nodes is at most 2r (Fig. 17(b)). Since a node of the MTQDD corresponds to two indices of the MTBDD, for the middle part, the number of nodes for the MTQDD is at most

$$\frac{n - p}{2}\, 2r. \qquad (9)$$

In the bottom part, from Fig. 17(c), the number of terminal nodes is at most

$$r + 1. \qquad (10)$$

Therefore, from Exprs. (8), (9), and (10), we have the theorem. □
From Theorem 5.6, we can derive the upper bound on the number of nodes for the MTQDD for a vectorized field function, and also the number of BMs needed to represent the given packet classification function.
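The bound can be evaluated numerically by summing the three parts of the proof: the upper part (2^p − 1)/3, the middle part ((n − p)/2)·2r, and the bottom part r + 1. The sketch below assumes that p is taken as the largest integer satisfying 2^p ≤ 2r, which the text does not state explicitly, and the function name is hypothetical.

```python
def mtqdd_node_bound(n, r):
    """Upper bound on the number of MTQDD nodes for one vectorized field function.

    n : number of input bits of the field
    r : number of rules
    Assumption: p is the largest integer with 2**p <= 2*r.
    """
    p = (2 * r).bit_length() - 1
    upper = (2 ** p - 1) / 3          # upper part, Expr. (8)
    middle = (n - p) / 2 * 2 * r      # middle part, Expr. (9)
    bottom = r + 1                    # terminal nodes, Expr. (10)
    return upper + middle + bottom


# 32-bit address field with the 200 rules assumed in Section II-D.
print(mtqdd_node_bound(32, 200))
# 16-bit port field with the same number of rules.
print(mtqdd_node_bound(16, 200))
```

Since one BM stores up to 256 instruction words, dividing such a bound by the memory available per BM gives a rough estimate of how many BMs a field may require; this last step is our own reading and should be checked against the instruction counts of the actual MTQDDs.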
VI. EXPERIMENTAL RESULTS
A. Implementation of PBM32
We implemented the PBM32 on an Altera FPGA. To control the PBM32, we attached the embedded processor Nios II/f. We used Altera's Cyclone III embedded development
