This paper presents a packet classifier using multiple LUT cascades based on edge-valued multi-valued decision diagrams (EVMDDs (k)). First, a set of rules for a packet classifier is partitioned into groups. Second, they are decomposed into field functions and Cartesian product functions. Third, they are represented by EVMDDs (k), and finally, they are converted to LUT cascades using adders. We implemented the proposed circuit on a Virtex 7 VC707 evaluation board. The system throughput is 345.60 Gbps for minimum packet size (40 Bytes). As for the normalized throughput (efficiency), the proposed one is 7.14 times better than existing FPGA implementations.
INTRODUCTION

Demands of Packet Classifier
Packet classification [20] is a key technology in routers and firewalls. A packet header includes a protocol number, a source address, a destination address, and a port number. The packet classifier performs a predefined action for the corresponding rule. Applications of the packet classifier include a firewall (FW), an access control list (ACL), and an IP chain for an IP masquerading technique.
With the rapid increase of traffic, core routers dissipate a major part of the total network power [22]. Thus, we cannot use ternary content addressable memories (TCAMs), since they dissipate too much power. In addition, a reconfigurable architecture is necessary to update policy rules of the packet classifier. With the rapid growth of the Internet, packet classifiers have become the bottleneck in network traffic management. Recently, core routers have come to work at a 100 Gbps link speed for a minimum packet size (40 bytes). Thus, parallel processing is an effective method to increase the system throughput. In this case, the throughput per area (efficiency) is an important measure [12]. A modern FPGA consists of lookup tables (Slices), on-chip memories (BRAMs), arithmetic circuits (DSP48Es), and so on. A balanced usage of hardware resources in FPGAs is the key to realizing a high throughput per area. This paper considers a memory-based architecture on the FPGA, which dissipates less power than TCAMs.
Contributions of the Paper
This paper proposes an architecture using multiple look-up table (LUT) cascades based on edge-valued multi-valued decision diagrams (EVMDDs (k)). Our contributions are as follows:
1. We propose a compact and high-speed packet classifier using LUT cascades based on EVMDDs (k). Conventional methods use only Slices and BRAMs, while the proposed architecture uses DSP48E blocks in addition to Slices and BRAMs. Thus, the proposed architecture uses the available FPGA resources effectively.
2. We implemented the packet classifier using multiple LUT cascades on an FPGA. Its system throughput is more than 300 Gbps. As for the efficiency measure (throughput per normalized memory size), the proposed architecture outperforms existing methods.
The rest of the paper is organized as follows: Chapter 2 introduces a packet classifier; Chapter 3 shows the LUT cascade based on an MTMDD (k); Chapter 4 shows the LUT cascade based on an EVMDD (k); Chapter 5 shows experimental results; and Chapter 6 concludes the paper.
PACKET CLASSIFIER
5-tuple Packet Classification
A packet classification table consists of a set of rules. Each rule has five input fields: source address (SA), destination address (DA), source port (SP), destination port (DP), and protocol number (PRT). Also, it generates a rule number (Rule). A field has entries. In this paper, since we consider a realization of the packet classifier for the Internet protocol version 4 (IPv4), we assume that SA and DA have 32 bits, SP and DP have 16 bits, and PRT has 8 bits. An entry for SA or DA is specified by an IP address; that for SP or DP is specified by an interval [x, y], where x and y denote port numbers; and that for PRT is specified by a protocol number. SA and DA are detected by a longest prefix match; SP and DP are detected by a range match; and PRT is detected by an exact match. A packet classifier detects matched rules using the packet classification table. In this paper, we assume that the rule with the largest number has the highest priority. Note that any packet matches a default rule whose rule number is zero. Obviously, the default rule has the lowest priority. When two or more rules are matched, the rule having the highest priority is selected. function [16] :
where X, A, and B are integers. Let
. Similarly, any entry for DA can be represented by an interval function. Any entry for PRT is represented by
, where b is a protocol number.
As shown in Example 2.1, multiple rules may match in a packet classification table. In such a case, we use a vectorized interval function. Let r be the number of rules.
, where $e_i$ is a unit vector with r elements, in which only the i-th bit is one and the other bits are zeros.
For each value of H(X), we assign a segment, which is an interval or a set of intervals. Then, we define a field function F(X), which generates a unique integer index $I_i$ corresponding to the i-th segment
where Y = I 1 ×I 2 ×· · ·×I k is a set of Cartesian products of indices generated by field functions. As shown in Fig. 1 , the packet classification table is decomposed into field functions and a Cartesian product function.
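The matching semantics described in this section can be sketched as a brute-force reference model. This is only an illustrative sketch, not the proposed circuit: the rule table `RULES`, the helper names, and the closed-interval reading of the interval function are assumptions made for the example. Prefixes for SA and DA are converted to intervals, PRT is treated as the interval [b, b], and the matched rule with the largest number wins, with rule 0 as the default.

```python
def interval(x, a, b):
    """Interval function: 1 if a <= x <= b, else 0 (closed interval assumed)."""
    return 1 if a <= x <= b else 0

def prefix_to_interval(value, prefix_len, width=32):
    """Convert an address prefix into the interval [lo, hi] that it covers."""
    free = width - prefix_len          # number of don't-care low-order bits
    lo = (value >> free) << free       # clear the don't-care bits
    return lo, lo | ((1 << free) - 1)  # set the don't-care bits for the upper end

ANY32 = (0, 2**32 - 1)
ANY16 = (0, 2**16 - 1)

# Hypothetical rule table: rule number -> (SA, DA, SP, DP, PRT) intervals.
RULES = {
    1: (prefix_to_interval(0x0A000000, 8), ANY32, ANY16, (80, 80), (6, 6)),
    2: (prefix_to_interval(0x0A010000, 16), ANY32, ANY16, (0, 1023), (6, 6)),
}

def classify(rules, packet):
    """Return the largest matching rule number; 0 is the default rule."""
    best = 0
    for number, fields in rules.items():
        if all(interval(x, a, b) for x, (a, b) in zip(packet, fields)):
            best = max(best, number)
    return best

# Packet fields are given in the order (SA, DA, SP, DP, PRT).
print(classify(RULES, (0x0A010203, 0xC0A80001, 40000, 80, 6)))
```

The first packet matches both hypothetical rules, so the rule with the larger number is selected. The field functions and the Cartesian product function compute the same result without scanning the rules one by one.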
We can assign an arbitrary index to a segment. In this paper, we assign indices to make an $M_1$-monotone increasing function [8] to reduce the amount of memory. Let I be a set of integers including 0. An integer function f(X) :

SA   x3x2x1x0   IDX
0    0000       0
1    0001       1
2    0010       2
3    0011       2
4    0100       2
5    0101       2
6    0110       2
7    0111       2
8    1000       3
9    1001       4
10   1010       5
11   1011       5
12   1100       5
13   1101       5
14   1110       5
15   1111       5

[Figure: Conversion of an MTBDD node into an EVBDD node.]
f . Each non-terminal node labeled with a variable x i has two outgoing edges which indicate nodes representing cofactors of f with respect to x i . A multi-terminal BDD (MTBDD) [2] is an extension of a BDD and represents an integer-valued function. In the MTBDD, the terminal nodes are labeled by integers.
Let $X = (X_1, X_2, \ldots, X_u)$ be a partition of the input variables, and $|X_i|$ be the number of inputs in $X_i$. $X_i$ is called a super variable. When the Shannon expansions are performed with respect to super variables $X_i$, where $|X_i| = k$, all the non-terminal nodes have $2^k$ edges. In this case, we have a multi-valued multi-terminal decision diagram (MTMDD (k)) [4]. Note that an MTMDD (1) is an MTBDD. The width of the MDD (k) at height i is the number of edges crossing the section of the MDD (k) between super variables $X_{i+1}$ and $X_i$, where the edges incident to the same node are counted as one.
Let p be the number of rules, and $|X| = n$. An $M_1$-monotone increasing function can be realized by an LUT cascade [17] shown in Fig. 6. Connections between $LUT_i$ and $LUT_{i-1}$ require $r_i = \lceil \log_2 \mu_i \rceil$ rails. Since a modern FPGA has BRAMs and distributed RAMs (realized by Slices), LUT cascades are easy to implement. The amount of memory for $LUT_i$ based on an MTMDD (k) is $r_i \cdot 2^{k + r_{i+1}}$. Thus, the total amount of memory for an LUT cascade is $M = \sum_{i=0}^{u} r_i \cdot 2^{k + r_{i+1}}$. The number of unique indices for the $M_1$-monotone increasing function is equal to the number of segments. A reduction of $r_i$ reduces the amount of memory for an LUT cascade. Thus, to reduce the amount of memory for the LUT cascade, we partition rules into subrules so that the number of segments increases as little as possible. Fig. 4 converts the field function for SP shown in Fig. 3. As for an $M_1$-monotone increasing function, the upper bound on the number of rails in the LUT cascade has been analyzed.
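The memory estimate for an LUT cascade can be illustrated with a short calculation. The sketch below assumes hypothetical widths, lists the LUTs from the root side of the decision diagram, and assumes that the first LUT receives no incoming rails. Each LUT contributes (outgoing rails) x 2^(k + incoming rails) bits, which is the per-LUT term given above.

```python
def rails(width):
    """Number of rails for a given width: ceil(log2(width)) for width >= 1."""
    return (width - 1).bit_length()

def cascade_memory_bits(k, widths):
    """Total memory in bits of an LUT cascade with k inputs per LUT.

    widths[j] is the (hypothetical) width just below the j-th LUT,
    listed from the root side of the decision diagram.
    """
    total = 0
    incoming = 0                  # assumption: the first LUT has no incoming rails
    for w in widths:
        outgoing = rails(w)
        total += outgoing * 2 ** (k + incoming)
        incoming = outgoing       # these rails feed the next LUT
    return total

# Hypothetical widths 3, 6, 8 with k = 2 give rails 2, 3, 3.
print(cascade_memory_bits(2, [3, 6, 8]))
```

Because the incoming rails appear in the exponent, reducing a width below a power of two shrinks the next LUT by half, which is why the partition of rules aims at keeping the number of segments small.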
Example 3.3
Partition of Rules by Greedy Algorithm
Since a field function produces at most $2p + 1$ segments, it is compactly realized by an LUT cascade [10]. However, the Cartesian product function produces $O(p^5)$ segments [19]. Thus, a direct realization by an LUT cascade is hard. To reduce the number of segments, we partition rules into subrules. Then, we realize subrules by the circuits shown in Fig. 7. Since two or more rules may match at the same time, we attach a maximum selector to the output.
Let [x, y] be an entry for a field. Then, y − x is the size of the interval. We propose the following greedy algorithm to partition rules.
Algorithm 3.1 (Partition of rules) Let $R = \{r_1, r_2, \ldots, r_p\}$ be the set of rules, p be the number of rules, $G = \{G_1, G_2, \ldots, G_q\}$ be the partition of rules, and q be the number of groups of rules.
Compute the sum of sizes of intervals d for each rule. Then, sort the rules in decreasing order as
4. Do Steps 4.1 to 4.4 until $i > p$.
For $1 \le j \le q$, decompose $G_j \cup r_i$ by
4.3. If $M_{grp} < M_{single}$, then $G_j \leftarrow G_j \cup r_i$. Otherwise, $G_{q+1} \leftarrow r_i$, and $q \leftarrow q + 1$.
4.4. $i \leftarrow i + 1$.
5. Terminate.
Algorithm 3.1 partitions the packet classification table efficiently by exploiting its properties. The inherent structure of real-life packet classification tables is analyzed in [7]. Since many packet classification tables are maintained by humans, global controls (wide port ranges) are used in the global networks, while detailed controls (narrow port ranges) are used in the local networks. Thus, in practice, the number of rails seldom reaches the worst case. A simple partition algorithm can suppress the increase of segments. As a result, we can reduce the memory size.
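The greedy idea of Algorithm 3.1 can be sketched in software as follows. Several details are assumptions made only for this illustration: the cost function `cost` is a stand-in that multiplies the number of elementary intervals per field (an upper bound on the size of the Cartesian product, not the memory of a real LUT cascade), $M_{grp}$ is taken as the cost of the merged group, $M_{single}$ as the cost of keeping the rule separate, a rule joins the first group for which merging is cheaper, and the fields are hypothetical 4-bit fields.

```python
from math import prod

FIELD_SIZE = 16  # hypothetical 4-bit fields, values 0..15

def interval_size_sum(rule):
    """Sum of interval sizes (y - x) over all fields of a rule."""
    return sum(y - x for x, y in rule)

def segments(group, field):
    """Number of elementary intervals that the group induces on one field."""
    cuts = set()
    for rule in group:
        x, y = rule[field]
        if x > 0:
            cuts.add(x)
        if y + 1 < FIELD_SIZE:
            cuts.add(y + 1)
    return len(cuts) + 1

def cost(group):
    """Stand-in memory cost: product of per-field segment counts (assumption)."""
    return prod(segments(group, f) for f in range(len(group[0])))

def partition_rules(rules):
    """Greedy partition: widest rules first, merge only when it is cheaper."""
    groups = []
    for rule in sorted(rules, key=interval_size_sum, reverse=True):
        for g in groups:
            m_grp = cost(g + [rule])          # cost if the rule joins group g
            m_single = cost(g) + cost([rule])  # cost if it is kept separate
            if m_grp < m_single:
                g.append(rule)
                break
        else:
            groups.append([rule])              # no group accepted the rule
    return groups

A = ((0, 15), (0, 15))
B = ((0, 7), (0, 15))
C = ((8, 15), (0, 15))
D = ((3, 3), (5, 5))
E = ((10, 10), (12, 12))
for group in partition_rules([D, A, E, B, C]):
    print(group)
```

With these hypothetical rules, the wide rules share cut points and end up in one group, while each narrow rule would multiply the number of segments and is therefore placed in its own group.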
LUT CASCADE BASED ON AN EVMDD (K)
To reduce the amount of memory for an LUT cascade, we introduce an LUT cascade based on an edge-valued multi-valued decision diagram (EVMDD (k)) [6], which is an extension of an EVBDD. An EVBDD consists of one terminal node representing zero and non-terminal nodes with a weighted 1-edge, where the weight has an integer value α. An EVBDD is obtained by recursively applying the conversion shown in Fig. 5 to each non-terminal node in an MTBDD. Note that, in the EVBDD, 0-edges have zero weights.
In an $M_\alpha$-monotone increasing function, subfunction $f'$ is obtained by adding α to subfunction $f$. Thus, an EVBDD may have smaller widths by sharing $f$ and $f'$ with an α edge (Fig. 8 (a)). The MTBDD only shares prefixes, while the EVBDD shares both prefixes and postfixes (Fig. 8 (b)). By rewriting the terminals of the MTBDD for the Cartesian product function, we have the $M_1$-monotone increasing function. Fig. 9 shows an example to obtain an $M_1$-monotone increasing function. To recover the original function, we use a translation memory. The size of the translation memory is equal to the number of terminal nodes in the MTBDD. Experimental results show that its amount of memory tends to be small. An edge-valued MDD (k) (EVMDD (k)) is an extension of the MDD (k), and represents a multi-valued input $M_1$-monotone increasing function. It consists of one terminal node representing zero and non-terminal nodes with edges having integer weights, and 0-edges always have zero weights.
Let p be the number of rules, and $|X| = n$. An $M_1$-monotone increasing function is efficiently realized by an LUT cascade with adders [9] shown in Fig. 10. In this case, the rails represent sub-functions in the EVMDD (k). The outputs from each $LUT_i$ other than rails represent the sum of weights of edges. We call such outputs Arails, which consist of $ar_i$ rails. Since the width of the EVMDD (k) for an $M_1$-monotone increasing function is smaller than that of the MTMDD (k), we can reduce the amount of memory for the LUT cascade by using an EVMDD (k). Since we realize the adders by DSP blocks (DSP48Es), FPGA resources are efficiently used.
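The sharing effect of edge weights can be checked on a small function. The sketch below is an illustration for the binary case (k = 1) only, with hypothetical helper names and the most significant bit at the root. It builds a reduced MTBDD and a reduced EVBDD for the 16-entry index function 0, 1, 2, 2, ..., 5 and counts non-terminal nodes. In the EVBDD, the value at the all-zero input is moved out of each subfunction, so the 0-edge weight is zero and the 1-edge carries the difference α.

```python
def build_mtbdd(table, unique):
    """Return a node of a reduced MTBDD; terminals are ("T", value)."""
    if all(v == table[0] for v in table):
        return ("T", table[0])
    half = len(table) // 2
    lo = build_mtbdd(table[:half], unique)
    hi = build_mtbdd(table[half:], unique)
    if lo == hi:                       # redundant node
        return lo
    return unique.setdefault((len(table), lo, hi), ("N", len(unique)))

def build_evbdd(table, unique):
    """Return (offset, node); node 0 is the single terminal representing zero."""
    offset = table[0]
    if all(v == offset for v in table):
        return offset, 0
    half = len(table) // 2
    _, lo = build_evbdd(table[:half], unique)   # its offset equals `offset`
    hi_off, hi = build_evbdd(table[half:], unique)
    alpha = hi_off - offset                     # weight of the 1-edge
    if lo == hi and alpha == 0:                 # redundant node
        return offset, lo
    key = (len(table), lo, alpha, hi)
    return offset, unique.setdefault(key, len(unique) + 1)

def evaluate(offset, root, unique, x):
    """Sum the edge weights along the path selected by the bits of x."""
    nodes = {nid: key for key, nid in unique.items()}
    total, node = offset, root
    while node != 0:
        size, lo, alpha, hi = nodes[node]
        if (x >> (size.bit_length() - 2)) & 1:
            total, node = total + alpha, hi
        else:
            node = lo
    return total

F = [0, 1, 2, 2, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]
mt_nodes, ev_nodes = {}, {}
build_mtbdd(F, mt_nodes)
ev_offset, ev_root = build_evbdd(F, ev_nodes)
print(len(mt_nodes), len(ev_nodes))
```

For this function, the two halves of the table differ only by a constant, so the EVBDD shares their subgraphs through a weighted edge, whereas the MTBDD needs separate nodes because the terminal values differ.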
The amount of memory for $LUT_i$ is $(r_i + ar_i) \cdot 2^{k + r_{i+1}}$. Let $|X| = n$ be the number of inputs, and $k = |X_i|$. The LUT cascade has $u = \lceil n/k \rceil$ LUTs. Thus, the LUT cascade based on an EVMDD (k) requires
$\sum_{i=0}^{u} (r_i + ar_i) \cdot 2^{k + r_{i+1}}$
bits of memory in total. Also, it requires u adders. Generally, an increase of k increases the amount of memory, while it decreases the number of adders. Thus, in this paper, we find a value of k that uses FPGA resources efficiently.
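The trade-off between memory and adders can be made concrete with a small calculation. The following sketch uses hypothetical rail and Arail counts, lists the LUTs from the root side, and assumes that the first LUT has no incoming rails. It returns the number of LUTs (and thus adders) together with the total memory in bits.

```python
def evmdd_cascade_cost(n, k, rails_out, arails_out):
    """Return (u, bits) for an LUT cascade based on an EVMDD (k).

    rails_out[j] and arails_out[j] are the (hypothetical) numbers of rails
    and Arails produced by the j-th LUT, listed from the root side.
    """
    u = -(-n // k)                           # ceil(n / k) LUTs, one adder each
    assert len(rails_out) == len(arails_out) == u
    bits, incoming = 0, 0                    # assumption: no rails enter the first LUT
    for r, ar in zip(rails_out, arails_out):
        bits += (r + ar) * 2 ** (k + incoming)
        incoming = r                         # only rails feed the next LUT
    return u, bits

# Hypothetical example: n = 6 inputs, k = 2 inputs per LUT.
print(evmdd_cascade_cost(6, 2, [1, 2, 0], [1, 2, 3]))
```

Only the rails enter the address of the next LUT. The Arails go to the adders, so they widen the LUT word but do not enlarge the address space of later stages.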
EXPERIMENTAL RESULTS
Implementation Setup
We implemented the proposed circuit on the Virtex 7 VC707 evaluation board (FPGA: Xilinx, XC7VX485T-2FFG, 75,900 Slices, 1,030 36Kb BRAMs, and 2,800 DSP48E Blocks).
We used the Xilinx PlanAhead version 14.4 for the synthesis. As for the LUT cascade implementation, an $LUT_i$ whose size is 36 Kb or more is implemented by 36Kb BRAMs, while an $LUT_i$ whose size is less than 36 Kb is implemented by distributed RAMs using Slices. To increase the system throughput, we set the memory to the dual port mode. By Algorithm 3.1, we partitioned 9,816 ACL rules generated by ClassBench [21] into two: Subrule 1 (9,600 rules) and Subrule 2 (216 rules). Then, each subrule is decomposed into five field functions and a Cartesian product function. Finally, each function is realized by an LUT cascade based on an EVMDD (k). To reduce the widths for an EVMDD (k), we used the shifting method [14].
Comparison of EVMDD (k) with MTMDD (k) to Implement LUT Cascade
We realized the packet classifier by three different methods:
1. A single memory.
2. LUT cascades based on MTMDDs (k).
3. LUT cascades based on EVMDDs (k).
To find the smallest LUT cascade, we changed the size of super variables k from one to four. Fig. 12 compares the memory sizes. It shows that, for all k, EVMDDs (k) produced smaller LUT cascades than MTMDDs (k). Also, the memory size takes its minimum when k = 2. As for Cartesian product functions, EVMDDs (k) required less memory than MTMDDs (k) even when the translation memories are included. Fig. 13 shows the number of adders for EVMDD (k). Although EVMDD (k) requires DSP48Es, it requires less than 3.8% of the available resources. Thus, the usage of DSP48Es is negligible. As shown in this part, the LUT cascade based on EVMDDs (k) efficiently utilizes the resources of an FPGA.
Comparison with Other Methods
According to the result of Section 5.2, we implemented the packet classifier by the LUT cascades based on an EVMDD (k) shown in Fig. 14, which consumes 2,024 Slices (6.7%), 37 BRAMs (3.6%), and 105 DSP48E blocks (3.8%). Since the maximum clock frequency was 543.774 MHz, we set the system clock frequency to 540 MHz. Thus, the system throughput is 0.54 (GHz) × 2 (ports) × 320 (bits) = 345.60 Gbps for minimum packet size (40 Bytes). As shown in Table 2, the efficiency measure of the proposed architecture is 7.14 times higher than that of the Cartesian-Product with Quadtrees method [13], which was the best among the existing methods. In this way, we implemented a high-speed and area-efficient system.
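The throughput figure follows from simple arithmetic, reproduced below as a check. The sketch assumes, as the formula implies, that each of the two memory ports classifies one minimum-size packet per clock cycle. The constant names are chosen only for this illustration.

```python
CLOCK_MHZ = 540              # system clock frequency
PORTS = 2                    # dual-port memories: two packets per cycle
MIN_PACKET_BITS = 40 * 8     # minimum packet size of 40 bytes

# MHz x bits gives Mbps; integer arithmetic keeps the result exact.
throughput_mbps = CLOCK_MHZ * PORTS * MIN_PACKET_BITS
print(throughput_mbps / 1000, "Gbps")
```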
CONCLUSION
In this paper, we showed a method to implement a packet classifier. First, the packet classification table was decomposed into two subrules. Second, they were decomposed into five field functions and a Cartesian product function. Finally, each function was realized by an LUT cascade based on an EVMDD (2). We implemented the proposed architecture on a Virtex 7 VC707 evaluation board. Experimental results showed that the efficiency measure (throughput per normalized area) is 7.14 times higher than that of an existing method.
The rules for the packet classifier should be updated (added and deleted) frequently. The addition and deletion of a registered vector can be done in time that is proportional to the number of cells in the LUT cascade [10]. One future project is to apply this update method to the proposed architecture.
ACKNOWLEDGMENTS
This research is supported in part by the Grants in Aid for Scientific Research of JSPS. Reviewers' comments were useful to improve the paper.
