Abstract-Packet Classification (PC) has been a critical data path function for many emerging networking applications. An interesting approach is the use of TCAM to achieve deterministic, high speed PC. However, apart from high cost and power consumption, due to slow growing clock rate for memory technology in general, PC
Keywords-System Design, Simulations
I. INTRODUCTION
Packet Classification (PC) has wide applications in networking devices to support firewall, access control list (ACL), and quality of service (QoS) in access, edge, and/or core networks. PC involves various matching conditions, e.g., longest prefix matching (LPM), exact matching, and range matching, making it a complicated pattern matching issue. Moreover, since PC lies in the critical data path of a router and has to act upon each and every packet at wire-speed, it creates a potential bottleneck in the router data path, particularly for high speed interfaces. For example, at OC192 (10 Gbps) full line rate, a line card (LC) needs to process about 25 million packets per second (Mpps) in the worst case, when minimum sized packets (40 bytes each) arrive back-to-back. As the aggregate line rate to be supported by an LC is moving towards OC768, it poses significant challenges for the design of packet classifiers to allow wire-speed forwarding.
The existing algorithmic approaches, including geometric algorithms based on the hierarchical trie [1] [2] [3] [4] and most heuristic algorithms [6] [7] [8] [9], generally require a nondeterministic number of memory accesses for each lookup, which makes it difficult to hide the memory access latency, limiting the throughput performance. Moreover, most algorithmic approaches, e.g., geometric algorithms, apply only to 2-dimensional cases. Although some heuristic algorithms address higher dimensional cases, they offer nondeterministic performance, which differs from one case to another.
In contrast, ternary content addressable memory (TCAM) based solutions are more viable to match high speed line rates, while making software design fairly simple. A TCAM finds a matched rule in O(1) clock cycles and therefore offers the highest possible lookup/matching performance. However, despite its superior performance, it is still a challenge for a TCAM based solution to match OC192 to OC768 line rates. For example, a TCAM with a 100 MHz clock rate can perform 100 million (M) TCAM lookups per second. Since each typical 5-tuple policy table matching requires two TCAM lookups, as will be explained in detail later, the TCAM throughput for the 5-tuple matching is 50 Mpps. As aforementioned, to keep up with the OC192 line rate, PC has to keep up with a 25 Mpps lookup rate, which translates into a budget of two 5-tuple matches per packet. The budget reduces to 0.5 matches per packet at OC768. Apparently, with LPM and firewall/ACL competing for the same TCAM resources, a single 100 MHz TCAM would be insufficient for PC while maintaining OC192 to OC768 line rates. Although increasing the TCAM clock rate can improve the performance, it is unlikely that a TCAM technology that matches the OC768 line speed will be available anytime soon, given that the memory speed improves by only 7% each year [17].
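The budget arithmetic above can be checked with a short sketch. The function name and its parameters are illustrative assumptions; only the numbers (100 MHz clock, two lookups per 5-tuple match, 25 Mpps and 100 Mpps packet rates) come from the text.

```python
def match_budget(tcam_clock_mhz, lookups_per_match, packet_rate_mpps):
    """Return how many 5-tuple matches a single TCAM can afford per packet."""
    # One TCAM lookup per clock cycle, so the match rate (in millions per
    # second) is the clock rate divided by the lookups needed per match.
    matches_mpps = tcam_clock_mhz / lookups_per_match
    return matches_mpps / packet_rate_mpps


oc192_budget = match_budget(100, 2, 25)    # OC192 worst case: 25 Mpps
oc768_budget = match_budget(100, 2, 100)   # OC768 worst case: 100 Mpps
print(oc192_budget, oc768_budget)
```

A budget below one match per packet, as at OC768, means a single chip cannot classify every packet even if it does nothing else.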
Instead of striving to reduce the access latency for a single TCAM, a more effective approach is to exploit chip-level parallelism (CLP) to improve overall PC throughput performance. However, a naive approach to realize CLP by simply duplicating the databases to a set of uncoordinated TCAM chips can be costly, given that TCAM is an expensive commodity. In a previous work [16] by two of the authors of the present work, it was demonstrated that by making use of the structure of IPv4 route prefixes, a multi-TCAM solution that exploits CLP can actually achieve high throughput performance gain in supporting LPM with low memory cost.
Another important benefit of using TCAM CLP for PC is its ability to effectively solve the range matching problem. [10] reported that today's real-world policy filtering (PF) tables involve significant percentages of rules with ranges. Supporting rules with ranges, or range matching, in TCAM can lead to very low TCAM storage efficiency, e.g., 16% as reported in [10]. [10] proposed an extended TCAM scheme to improve the TCAM storage efficiency, in which TCAM hierarchy and circuits for range comparisons are introduced. Another widely adopted solution to deal with range matching is to do range preprocessing/encoding by mapping ranges to a short sequence of encoded bits, known as bit-mapping [11]. The application of bit-map based range encoding for packet classification using a TCAM was also reported in [11] [12] [13] [14] [15]. A key challenge for range encoding is the need to encode multiple sub-fields in a search key extracted from the packet to be classified at wire-speed. To achieve high speed search key encoding, parallel search key sub-field encoding was proposed in [11] [13], which, however, assumes the availability of multiple processors and multiple memories for the encoding. To ensure the applicability of the range encoding scheme to any commercial network processors and TCAM coprocessors, the authors of this paper proposed to use the TCAM itself for sequential range encoding [15], which, however, reduces the TCAM throughput performance. Using TCAM CLP for range encoding provides a natural solution to the performance issue encountered in [15].
However, extending the idea in [16] to allow TCAM CLP for general PC is a nontrivial task for the following two reasons: 1) the structure of a general policy rule, such as a 5-tuple rule, is much more complex than that of a route and does not follow a simple structure like a prefix; 2) it involves three different matching conditions, including prefix, range, and exact matches.
In this paper, we propose an efficient TCAM CLP scheme, called Distributed Parallel PC with Range Encoding (DPPC-RE), for the typical 5-tuple PC. First, a rule database partitioning algorithm is designed to allow different partitioned rule groups to be distributed to different TCAMs with minimum redundancy. Then a greedy heuristic algorithm is proposed to evenly balance the traffic load and storage demand among all the TCAMs. On the basis of these algorithms, and combined with the range encoding ideas in [15], both a static algorithm and a fully adaptive algorithm are proposed to deal with range encoding and load balancing simultaneously. The simulation results show that the proposed solution can achieve 100 Mpps throughput performance, matching the OC768 line rate, with just 50% additional TCAM resource compared with a single TCAM solution at about 25 Mpps throughput performance.
The rest of the paper is organized as follows. Section II gives the definitions and theorems which will be used throughout the paper. Section III presents the ideas and algorithms of the DPPC-RE scheme. Section IV presents the implementation details on how to realize DPPC-RE. The performance evaluation of the proposed solution is given in Section V. Finally, Section VI concludes the paper.
II. DEFINITIONS AND THEOREMS
Rules: The match condition of a typical 5-tuple rule consists of five sub-fields: SIP(1-32), DIP(1-32), SPORT(1-16), DPORT(1-16), PROT(1-8), where SIP, DIP, SPORT, DPORT, and PROT represent source IP address, destination IP address, source port, destination port, and protocol number, respectively. DIP and SIP require longest prefix matching (LPM); SPORT and DPORT generally require range matching; and PROT requires exact matching. Except for sub-fields with range matching, any other sub-field in a match condition can be expressed using a single string of ternary bits, i.e., 0, 1, or "don't care" *. Table I gives an example of a typical five-tuple rule table.
Matching: In the context of TCAM based PC, as is the case in this paper, matching refers to ternary matching in the following sense. A search key is said to match a particular match condition if, for each and every corresponding bit position in both the search key and the match condition, either of the following two conditions is met: (1) the bit values are identical; (2) the bit in the match condition is "don't care" or *.
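The ternary matching definition can be written down directly. The following is a minimal sketch; representing keys and match conditions as Python strings over '0', '1' and '*' is an assumption made for illustration and is not how a TCAM stores its entries.

```python
def ternary_match(search_key: str, match_condition: str) -> bool:
    """Return True if the binary search key matches the ternary condition."""
    if len(search_key) != len(match_condition):
        raise ValueError("key and match condition must have the same width")
    # Every bit position must either agree exactly or be "don't care".
    return all(c == "*" or c == k for k, c in zip(search_key, match_condition))


print(ternary_match("0110", "01*0"))  # third bit is don't care
print(ternary_match("0111", "01*0"))  # last bit differs
```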
So far, we have defined the basic terminologies for rule matching. Now we establish some important concepts upon which the distributed TCAM PC is developed.
ID:
The idea of the proposed distributed TCAM PC is to make use of a small number of bit values extracted from certain bit positions in the search key and match condition as IDs to (1) divide match conditions or rules into groups, which are mapped to different TCAMs; (2) direct a search key to a specific TCAM for rule matching.
In this paper, we use P bits picked from given bit positions in the DIP, SIP, and/or PROT sub-fields of a match condition as the rule ID, denoted as Rule-ID, for the match condition, and use P bits extracted from the corresponding search key positions as the key ID, denoted as Key-ID, for the search key. For example, suppose P = 4, and the bits are extracted from SIP(1), DIP(7), DIP(16) and PROT(8). Then the Rule-ID for the match condition <1.1.*.*, 2.*.*.*, *, *, 6> is "01*0", and the Key-ID for a search key is formed from the same four bit positions of the key.
Traffic intensity is used to characterize the workload in the system. As the design is targeted at PC at OC768 line rate, we define traffic intensity as the ratio between the actual traffic load and the worst-case traffic load at OC768 line rate, i.e., 100 Mpps.
Throughput ratio is defined as the ratio between the throughput and the worst-case traffic load at OC768 line rate.
Now, two theorems are established, which state under what conditions the proposed solution ensures correct rule matching and maintains the original ordering of the packets, respectively.
Theorem 1: For each PC, correct rule matching is guaranteed if a) all the rules belonging to the same Key-ID group are placed in the same TCAM with correct priority orders; b) a search key containing a given Key-ID is matched against the rules in the TCAM in which the corresponding Key-ID group is placed.
Proof: On the one hand, a necessary condition for a given search key to match a rule is that the Rule-ID for this rule matches the Key-ID for the search key. On the other hand, any rule that does not belong to this Key-ID group cannot match the search key, because the Key-ID group contains all the rules that match the Key-ID. Hence, a rule match can occur only between the search key and the rules belonging to the Key-ID group corresponding to the search key. As a result, meeting conditions a) and b) will guarantee correct rule matching.
Theorem 2: The original packet ordering for any given application flow is maintained if packets with the same Key-ID are processed in order.
Proof: First, note that packet ordering should be maintained only for packets belonging to the same application flow, and an application flow is in general identified by the five-tuple. Second, note that packets from a given application flow must have the same Key-ID by definition. Hence, the original packet ordering for any given application flow is maintained if packets with the same Key-ID are processed in order.
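The ID construction and the grouping used in Theorem 1 can be illustrated with a small sketch. It reuses the ID bits of the earlier example, SIP(1), DIP(7), DIP(16) and PROT(8), and assumes that bit 1 is the most significant bit of a sub-field. The helper names and the sample packet are hypothetical.

```python
from itertools import product

ID_BITS = [("SIP", 1), ("DIP", 7), ("DIP", 16), ("PROT", 8)]  # from the example


def ip_to_ternary(ip):
    """Expand a dotted address whose octets may be '*' into 32 ternary bits."""
    return "".join("*" * 8 if o == "*" else format(int(o), "08b")
                   for o in ip.split("."))


def extract_id(sip, dip, prot, id_bits=ID_BITS):
    """Pick the ID bits (1-based, MSB first) from the ternary sub-fields."""
    fields = {"SIP": sip, "DIP": dip, "PROT": prot}
    return "".join(fields[name][bit - 1] for name, bit in id_bits)


def key_id_groups(rule_id):
    """List every Key-ID group that a rule with this Rule-ID belongs to."""
    choices = ["01" if b == "*" else b for b in rule_id]
    return ["".join(bits) for bits in product(*choices)]


rule_id = extract_id(ip_to_ternary("1.1.*.*"), ip_to_ternary("2.*.*.*"),
                     format(6, "08b"))
# Hypothetical TCP packet from 1.1.3.4 to 2.5.128.9.
key_id = extract_id(ip_to_ternary("1.1.3.4"), ip_to_ternary("2.5.128.9"),
                    format(6, "08b"))
print(rule_id, key_id_groups(rule_id), key_id)
```

The wildcard in the Rule-ID places the rule in two Key-ID groups, so the rule is stored twice if those groups end up in different TCAMs. This is the redundancy that the ID-bit selection tries to minimize.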
III. ALGORITHMS AND SOLUTIONS
The key problems we aim to solve are 1) how to make use of CLP to achieve high performance with minimum cost; 2) how to solve the TCAM range matching issue to improve the TCAM storage efficiency (consequently controlling the cost and power consumption). A scheme called Distributed Parallel Packet Classification with Range Encoding (DPPC-RE) is proposed.
The idea of DPPC is the following. First, by appropriately selecting the ID bits, a large rule table is partitioned into several Key-ID groups of similar sizes. Second, by applying certain load-balancing and storage-balancing heuristics, the rules (Key-ID groups) are distributed evenly to several TCAM chips. As a result, multiple packet classifications corresponding to different Key-ID groups can be performed simultaneously, which significantly improves PC throughput performance without incurring much additional cost.
The idea of RE is to encode the range sub-fields of the rules and the corresponding sub-fields in a search key into bit-vectors, respectively. In this way, the number of ternary strings (or TCAM entries, which will be defined shortly in Section III.C) required to express a rule with non-trivial ranges can be significantly reduced (e.g., to only one string), improving TCAM storage efficiency. In DPPC-RE, the TCAM chips that are used to perform rule matching are also used to perform search key encoding. This not only offers a natural way for parallel search key encoding, but also makes it possible to develop efficient load-balancing schemes, making DPPC-RE indeed a practical solution. In what follows, we introduce DPPC-RE in detail.
A. ID Bits Selection
The objective of ID-bit selection is to minimize the number of redundant rules (introduced due to the overlapping among Key-ID groups) and to balance the sizes of the Key-ID groups (a large discrepancy among the Key-ID group sizes may result in low TCAM storage utilization).
A brute-force approach to solve the above optimization problem would be to traverse all of the P-bit combinations out of the W-bit rules to get the best solution. However, since the value of W is relatively large (104 bits for the typical 5-tuple rules), the complexity is generally too high to do so. Hence, we introduce a series of empirical rules, based on the analyses of 5 real-world databases [18] that are used throughout the rest of the paper, to simplify the computation as follows:
1) Since the sub-fields DPORT and SPORT in a rule may have non-trivial ranges which need to be encoded, we choose not to take these two sub-fields into account for ID-bit selection;
2) According to the analysis of several real-world rule databases [18], over 70% of the rules have a non-wildcarded PROT sub-field, and over 95% of these non-wildcarded PROT sub-fields are either TCP(6) or UDP(17) (approximately 50% are TCP). Hence, one may select either the 8th or the 5th bit (TCP and UDP PROTs have different values at these two bit positions) of the PROT sub-field as one of the ID bits. All the rest of the bits in the PROT sub-field have a fixed one-to-one mapping relationship with the 8th or 5th bits, and do not lead to any new information about the PROT;
3) Note that the rules with wildcard(s) in their Rule-IDs are actually those incurring redundant storage. The more wildcards a rule has in its Rule-ID, the more Key-ID groups it belongs to and, consequently, the more redundant storage it incurs. In the 5 real-world rule databases, there are over 92% rules whose DIP sub-fields are prefixes no longer than 25 bits and over 90% rules whose SIP sub-fields are prefixes no longer than 25 bits. So we choose not to use the last 7 bits (i.e., the 26th to 32nd bits) of these two sub-fields, since they are wildcards in most cases.
Based on these 3 empirical rules, the traversal is simplified as: choose an optimal (P-1)-bit combination out of the 50 bits of the DIP and SIP sub-fields (DIP(1-25), SIP(1-25)), and then combine these (P-1) bits with PROT(8) or PROT(5) to form the P-bit ID.
Fig. 2 shows an example of the ID-bit selection for Database #5 [18] (with 1550 rules in total). We use an equally weighted sum of two objectives, i.e., the minimization of the variance among the sizes of the Key-ID groups and of the total number of redundant rules, to find the 4-bit combination: PROT(5), DIP(1), DIP(21) and SIP(4). We find that, although the sizes of the Rule-ID groups are unbalanced, the sizes of the Key-ID groups are quite similar, which allows memory-efficient schemes to be developed for the distribution of rules to TCAMs.
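The simplified traversal can be sketched as an exhaustive search over candidate bit positions, scored by the equally weighted sum described above. This is a toy illustration under stated assumptions: rules are ternary strings, positions are 0-based indices into those strings, all P bits are searched (the paper fixes one PROT bit and searches only the remaining P-1), and the function names are hypothetical.

```python
from itertools import combinations, product
from statistics import pvariance


def key_id_group_sizes(rules, positions):
    """Count, for each Key-ID, the rules whose Rule-ID matches it."""
    sizes = {"".join(b): 0 for b in product("01", repeat=len(positions))}
    for rule in rules:
        rule_id = [rule[p] for p in positions]
        for key_id in sizes:
            if all(r in ("*", k) for r, k in zip(rule_id, key_id)):
                sizes[key_id] += 1
    return sizes


def select_id_bits(rules, candidates, p):
    """Return (positions, score) minimizing variance plus redundant rules."""
    best = None
    for positions in combinations(candidates, p):
        sizes = list(key_id_group_sizes(rules, positions).values())
        redundancy = sum(sizes) - len(rules)   # extra copies caused by wildcards
        score = pvariance(sizes) + redundancy  # equally weighted sum
        if best is None or score < best[1]:
            best = (positions, score)
    return best


toy_rules = ["00**", "01**", "10**", "11**"]
best_bits = select_id_bits(toy_rules, candidates=range(4), p=2)
print(best_bits)
```

In the toy table only the first two bit positions are free of wildcards, so choosing them yields four equally sized groups with no duplicated rule.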
B. Distributed Table Construction
The distributed table construction problem is to find a K-division {Q_k, k = 1, ..., K} of the Key-ID groups that balances the storage demand and the traffic load among the K TCAMs. The problem can be shown to be NP-hard. Heuristic algorithms are therefore used: the two algorithms are run to get two solutions, respectively, and the better one is finally chosen.
Capacity First Algorithm (CFA): The load-balancing objective is regarded as a constraint. In this algorithm, the Key-ID groups with relatively more rules are distributed first. In each round, the current Key-ID group is assigned to the TCAM with the least number of rules under the load constraint.
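A minimal sketch of the capacity-first idea follows. Several details are assumptions rather than taken from the paper: Key-ID groups are given as sets of rule identifiers (so a rule shared by two groups in the same TCAM is stored once), loads are integer percentages, the load constraint is a single per-TCAM limit, and a group that fits nowhere falls back to the TCAM with the fewest rules.

```python
def capacity_first(groups, loads, num_tcams, load_limit):
    """Greedy placement: largest Key-ID groups first, fewest-rules TCAM wins.

    groups: Key-ID -> set of rule identifiers in that group
    loads:  Key-ID -> traffic load of that group (integer percent)
    """
    tcam_rules = [set() for _ in range(num_tcams)]
    tcam_load = [0] * num_tcams
    placement = {}
    for key_id in sorted(groups, key=lambda g: len(groups[g]), reverse=True):
        # TCAMs that can take this group without violating the load limit.
        feasible = [t for t in range(num_tcams)
                    if tcam_load[t] + loads[key_id] <= load_limit]
        candidates = feasible or list(range(num_tcams))  # assumed fallback
        target = min(candidates, key=lambda t: len(tcam_rules[t]))
        tcam_rules[target] |= groups[key_id]
        tcam_load[target] += loads[key_id]
        placement[key_id] = target
    return placement, tcam_rules, tcam_load


toy_groups = {"00": {1, 2, 3}, "01": {3, 4}, "10": {5}, "11": {6}}
toy_loads = {"00": 40, "01": 30, "10": 20, "11": 10}
placement, tcam_rules, tcam_load = capacity_first(toy_groups, toy_loads, 2, 60)
print(placement, tcam_load)
```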
We still use the rule database set #5 as an example. Suppose that the traffic load distribution among the Key-ID groups is as depicted in Fig. 4, which is selected intentionally to have large variance to create a difficult case for load-balancing. Note that the ID-bits are PROT(5), DIP(1), DIP(21), and SIP(4).
C. Solutions for Range Matching
Range matching is a critical issue for effective use of TCAM for PC. The real-world databases in [10] showed that the TCAM storage efficiency can be as low as 16% due to the existence of a large number of rules with ranges. We apply our earlier proposed Dynamic Range Encoding Scheme (DRES) [ 
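To see why ranges hurt TCAM storage efficiency, consider the standard alternative to encoding: expanding a range into the minimal set of prefixes, each of which occupies one ternary string. The sketch below shows this well-known prefix expansion, not the DRES scheme itself; the function name is an assumption.

```python
def range_to_prefixes(lo, hi, width=16):
    """Split the integer range [lo, hi] into a minimal list of ternary prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... and that does not run past hi.
        while size > hi - lo + 1:
            size >>= 1
        free_bits = size.bit_length() - 1
        fixed_bits = width - free_bits
        head = format(lo >> free_bits, f"0{fixed_bits}b") if fixed_bits else ""
        prefixes.append(head + "*" * free_bits)
        lo += size
    return prefixes


print(len(range_to_prefixes(1024, 65535)))  # the common ">= 1024" port range
print(len(range_to_prefixes(1, 65534)))     # worst case for a 16-bit field
```

A rule with ranges in both SPORT and DPORT needs the product of the two prefix counts in TCAM entries, which is why encoding each range into a short bit-vector, so that the rule fits in one entry, improves storage efficiency.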
D. Efficient Load-balancing Schemes
Note that the DPPC formulation is static, in the sense that once the Key-ID groups are populated in different TCAMs, the performance is pretty much subject to traffic pattern changes.
The inclusion of Range Encoding provides us a very efficient way to dynamically balance the PC traffic in response to traffic pattern changes. The key idea is to duplicate the range encoding tables to all the TCAMs and hence allow a key encoding (KE) to be performed using any one of the TCAMs to dynamically balance the load. Since the sizes of the range tables are small, e.g., no 
The following two algorithms are proposed to solve the above problem.
Stagger Round Robin (SRR):
The idea is to allocate the KE tasks of the incoming packets whose rule matching (RM) tasks are performed in a specific TCAM to the other TCAM chips in a Round-Robin fashion.
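The round-robin allocation can be sketched as follows. The exact staggering used by SRR is not spelled out in the surviving text, so this is one plausible reading under stated assumptions: TCAMs are numbered from 0, each RM TCAM keeps its own rotating offset, and a KE task is never sent to the TCAM that performs the RM task for the same packet.

```python
class StaggerRoundRobin:
    """Pick the KE TCAM for a packet by rotating over the other K-1 chips."""

    def __init__(self, num_tcams):
        if num_tcams < 2:
            raise ValueError("at least two TCAMs are needed")
        self.k = num_tcams
        self.offset = [0] * num_tcams  # one rotating offset per RM TCAM

    def ke_tcam(self, rm_tcam):
        # The offset lies in [0, K-2], so the target is 1..K-1 steps away
        # from rm_tcam (mod K) and can never be rm_tcam itself.
        target = (rm_tcam + 1 + self.offset[rm_tcam]) % self.k
        self.offset[rm_tcam] = (self.offset[rm_tcam] + 1) % (self.k - 1)
        return target


srr = StaggerRoundRobin(4)
from_tcam0 = [srr.ke_tcam(0) for _ in range(4)]
from_tcam2 = [srr.ke_tcam(2) for _ in range(3)]
print(from_tcam0, from_tcam2)
```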
Comments:
In the case when K=2, the objective is a constant "0". This means that no matter how large the variance of the RM load ratios among all the TCAM chips is, SRR can always perfectly balance the overall traffic load.
For K > 2, the ratio between the two variances is non-negative and below 0.5, which means that in any case SRR can always reduce the variance of the overall load ratio to less than half of that of the RM tasks. Further discussions on the performance of SRR and FA are presented in Section V.
IV. IMPLEMENTATION OF THE DPPC-RE SCHEME
The detailed implementation of the DPPC-RE mechanism is depicted in Fig. 5. Besides the TCAM chips and the associated SRAMs that accommodate the match conditions and the associated actions, three major additional components cooperate with the TCAM chips, i.e., a Distributor, a set of Processing Units (PUs) and a Mapper. Some associated small buffer queues are used as well. Now we describe these components in detail.
A. The Distributor
This component is actually a scheduler. It partitions the PC traffic among the TCAM chips. More specifically, it performs three major tasks. First, it extracts the Key-ID from the 5-tuple received from a network processing unit (NPU). The Key-ID is used as an identifier to dispatch the RM keys to the associated TCAM. The 5-tuple is pushed into the RM FIFO queue of the corresponding TCAM (solid arrows in Fig. 5).
Second, the distributor distributes the KE traffic among the TCAM chips, based on either the FA or the SRR algorithm. The corresponding information, i.e., the SPORT and DPORT, is pushed into the KE FIFO of the selected TCAM (dashed arrows in Fig. 5). A KE FIFO is a small FIFO queue where the information used for KE is held. The format of each unit in the KE FIFO is given in Fig. 6(c).
Third, the distributor maintains K Serial Number (S/N) counters, one for each TCAM. An S/N is used to identify each incoming packet (or more precisely, each incoming five-tuple). Whenever a packet arrives, the distributor adds "1" (cyclically, with modulus equal to the RM FIFO depth) to the S/N counter for the TCAM the packet is mapped to. A Tag is defined as the combination of an S/N and a TCAM number (CAMID). This tag is used to uniquely identify a packet and its associated RM TCAM. The format of the Tag is depicted in Fig. 6(a).
As we shall explain shortly, the tag is used by the Mapper to return the KE results back to the correct TCAM and to allow the PU for that TCAM to establish the association of these results with the corresponding five-tuple in the RM queue.
Fig. 5. DPPC-RE mechanism
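The three tasks of the Distributor can be sketched in a few lines. The field widths (3-bit CAMID, 5-bit S/N, hence an RM FIFO depth of 32) and the reading of FA as "send the KE task to the TCAM with the shortest KE FIFO" are inferred from the worked example given later in this section, not stated as definitions. The class and attribute names are hypothetical.

```python
from collections import deque


class Distributor:
    def __init__(self, key_id_to_tcam, num_tcams, fifo_depth=32, camid_bits=3):
        self.key_id_to_tcam = key_id_to_tcam          # Key-ID -> TCAM number
        self.fifo_depth = fifo_depth
        self.camid_bits = camid_bits
        self.sn_bits = (fifo_depth - 1).bit_length()  # 5 bits for depth 32
        tcams = range(1, num_tcams + 1)               # TCAMs numbered from 1
        self.sn = {t: 0 for t in tcams}               # one S/N counter per TCAM
        self.rm_fifo = {t: deque() for t in tcams}
        self.ke_fifo = {t: deque() for t in tcams}

    def dispatch(self, key_id, five_tuple):
        # Task 1: the Key-ID selects the TCAM that holds the Key-ID group.
        rm_tcam = self.key_id_to_tcam[key_id]
        # Task 3: advance the cyclic S/N and build the Tag (CAMID + S/N).
        self.sn[rm_tcam] = (self.sn[rm_tcam] + 1) % self.fifo_depth
        tag = (format(rm_tcam, f"0{self.camid_bits}b")
               + format(self.sn[rm_tcam], f"0{self.sn_bits}b"))
        self.rm_fifo[rm_tcam].append((tag, five_tuple))
        # Task 2 (assumed FA rule): shortest KE FIFO, lowest number on ties.
        ke_tcam = min(self.ke_fifo, key=lambda t: len(self.ke_fifo[t]))
        sport, dport = five_tuple[2], five_tuple[3]
        self.ke_fifo[ke_tcam].append((tag, sport, dport))
        return tag, rm_tcam, ke_tcam


d = Distributor({"0010": 1}, num_tcams=5)
d.sn[1] = 5                                            # current S/N of TCAM#1
for tcam, backlog in zip(range(1, 6), (2, 0, 1, 1, 1)):
    d.ke_fifo[tcam].extend([None] * backlog)           # pre-existing KE units
result = d.dispatch("0010", ("1.1.3.4", "2.5.128.9", 45535, 80, 6))
print(result)
```

The sample addresses are placeholders; the Key-ID is passed in directly rather than extracted from them.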
B. RM FIFO, KE FIFO, Key Buffer and Tag FIFO
An RM FIFO is a small FIFO queue where the information for RM of the incoming packets is held. The format of each unit in the RM FIFO is given in Fig. 6(b). (The numbers in the brackets indicate the number of memory bits needed for the sub-fields.)
Differing from the RM and KE FIFOs, a Key Buffer is not a FIFO queue, but a fast register file accessed using an S/N as the address. It is where the results of KE (encoded bit vectors of the range sub-fields) are held. The size of a Key Buffer equals the size of the corresponding RM FIFO, with one unit in the Key Buffer corresponding to one unit in the RM FIFO. The format of each unit is given in Fig. 6(d). The Valid bit is used to indicate whether the content is available and up-to-date.
Note that the tags of the key cannot be passed through TCAM chips during the matching operations. Hence a Tag FIFO is designed for each TCAM chip to keep the tag information when the associated keys are being matched.
C. The Processing Unit
Each TCAM is associated with a Processing Unit (PU). The functions of a PU are to (a) schedule the RM and KE tasks assigned to the corresponding TCAM, aiming at maximizing the utilization of the corresponding TCAM; (b) ensure that the results of the incoming packets assigned to this TCAM are returned in order. In what follows, we elaborate on these two functions.
(a) Scheduling between RM and KE tasks: Note that, for any given packet, the RM operation cannot take place until the KE results are returned. Hence, it is apparent that the units in an RM FIFO would wait for a longer time than the units in a KE FIFO. For this reason, RM tasks should be assigned higher priority than KE tasks. However, our analysis (not given here due to the page limitation) indicates that a strict-sense priority scheduler may lead to non-deterministically large processing delay. So we introduce a Weighted-Round-Robin scheme in the PU design.
More specifically, each type of task gains higher priority in turn, based on an asymmetrical Round-Robin mechanism. In other words, the KE tasks gain higher priority for one turn (one turn represents 2 TCAM accesses, for either an RM operation or two successive KE operations) after n turns with the higher priority assigned to RM tasks. Here n is defined as the Round-Robin Ratio (RRR).
(b) Ordered Processing: Apparently, the order of the returned PC results from a specific TCAM is determined by the processing order of the RM operations. Since an RM buffer is a FIFO queue, the PC results can still be returned in the same order as the packet arrivals, although the KE tasks of the packets may not be processed in their original sequence. (This is because the KE tasks whose RM is processed in a specific TCAM may be assigned to different TCAMs to be processed, based on the FA or SRR algorithms.) As a result, if the KE result for a given RM unit returns earlier than those of the units in front of it, this RM unit cannot be executed.
Specifically, the PU for a given TCAM maintains a pointer that points to the position in the Key Buffer that contains the KE result corresponding to the unit at the head of the RM FIFO. The value of the pointer equals the S/N of the unit at the head of the RM FIFO. In each TCAM cycle, the PU queries the valid bit of the position that the pointer points to in the Key Buffer. If the bit is set, meaning that the KE result is ready, and it is RM's turn for execution, the PU reads the KE results out from the Key Buffer and the 5-tuple information out from the RM FIFO queue, and launches the RM operation. Meanwhile, the valid bit of the current unit in the Key Buffer is reset and the pointer is incremented by 1 in a cyclical fashion. Since the S/N for a packet in a specific TCAM is assigned cyclically by the Distributor, the pointer is guaranteed to always point to the unit in the Key Buffer that corresponds to the head unit in the RM FIFO.
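The two PU functions can be combined in one small scheduler sketch. It is a simplified model under stated assumptions: one call to step() stands for one turn, a Key Buffer unit holding None models a cleared valid bit, the turn counter advances even when the PU is idle, and a task of the other type is served whenever the favoured type has nothing ready. The class and its names are hypothetical.

```python
from collections import deque


class ProcessingUnit:
    def __init__(self, depth=32, rrr=3, first_sn=1):
        self.rm_fifo = deque()            # 5-tuples, in arrival order
        self.ke_fifo = deque()            # KE units waiting for this TCAM
        self.key_buffer = [None] * depth  # None models a cleared valid bit
        self.pointer = first_sn % depth   # S/N of the head unit of the RM FIFO
        self.depth, self.rrr, self.turn = depth, rrr, 0

    def rm_ready(self):
        # The head RM unit can run only once its own KE result is valid.
        return bool(self.rm_fifo) and self.key_buffer[self.pointer] is not None

    def step(self):
        # KE gains priority for one turn after rrr turns favouring RM.
        ke_has_priority = self.turn % (self.rrr + 1) == self.rrr
        self.turn += 1
        if self.rm_ready() and not (ke_has_priority and self.ke_fifo):
            five_tuple = self.rm_fifo.popleft()
            encoded = self.key_buffer[self.pointer]
            self.key_buffer[self.pointer] = None             # reset valid bit
            self.pointer = (self.pointer + 1) % self.depth   # cyclic increment
            return ("RM", five_tuple, encoded)
        if self.ke_fifo:
            return ("KE", self.ke_fifo.popleft())
        return None                                          # nothing ready


pu = ProcessingUnit(first_sn=6)
pu.rm_fifo.extend(["P0", "P1"])
pu.key_buffer[7] = "enc1"      # the KE result of P1 returns before that of P0
blocked = pu.step()            # P0 is at the head and not ready, so P1 waits
pu.key_buffer[6] = "enc0"
first, second = pu.step(), pu.step()
print(blocked, first, second)
```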
D. The Mapper
The function of this component is to manage the result returning process of the TCAM chips. According to the processing flow of a PC operation, the mapper has to handle three types of results, i.e., the KE Phase-I results (for the SPORT sub-field), the KE Phase-II results (for the DPORT sub-field), and the RM results. The type of the result is encoded in the result itself.
If the result from any TCAM is an RM result (which is decoded from the result itself), the mapper returns it to the NPU directly. If it is a KE Phase-I result, the mapper stores it in a latch and waits for the Phase-II result, which will come in the next cycle.
If it is a KE-Phase I1 result, the mapper uses the tag information from the Tag FIFO to determine: 1) which Key Buffer (according to the CAMID segment) should this result be returned to, and 2) which unit in the Key Buffer (according to the S/N segment) should this result be written into. 
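The three cases handled by the Mapper can be sketched as a small dispatcher. The string result types, the per-TCAM latch dictionary, and the 3-bit CAMID / 5-bit S/N split of the Tag are illustrative assumptions; in the text the result type is encoded in the result itself.

```python
from collections import deque


class Mapper:
    def __init__(self, key_buffers, camid_bits=3):
        self.key_buffers = key_buffers  # CAMID -> Key Buffer (list indexed by S/N)
        self.camid_bits = camid_bits
        self.latch = {}                 # source TCAM -> pending Phase-I result
        self.to_npu = []                # RM results handed back to the NPU

    def on_result(self, src_tcam, kind, value, tag_fifo):
        if kind == "RM":
            self.to_npu.append(value)           # returned to the NPU directly
        elif kind == "KE1":
            self.latch[src_tcam] = value        # wait for the Phase-II result
        elif kind == "KE2":
            tag = tag_fifo.popleft()            # tag kept while the key was matched
            camid = int(tag[:self.camid_bits], 2)
            sn = int(tag[self.camid_bits:], 2)
            # A non-None unit models a set valid bit in the Key Buffer.
            self.key_buffers[camid][sn] = (self.latch.pop(src_tcam), value)
        else:
            raise ValueError(f"unknown result type: {kind}")


buffers = {1: [None] * 32}
mapper = Mapper(buffers)
tag_fifo_2 = deque(["00100110"])                # Tag FIFO of TCAM#2
mapper.on_result(2, "KE1", "sport_vector", tag_fifo_2)
mapper.on_result(2, "KE2", "dport_vector", tag_fifo_2)
mapper.on_result(1, "RM", "action_for_P0", deque())
print(buffers[1][6], mapper.to_npu)
```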
The processing flow of a packet P0 can be illustrated by the following steps.
1) The 4-bit Key-ID "0010" is extracted by the Distributor.
2) According to the distributed rule table given by TABLE II, Key-ID group "0010" is stored in TCAM#1. Suppose that the current S/N value of TCAM#1 is 5; then the CAMID "001" and the S/N "00110" (5+1) are combined into the Tag "00100110". The 5-tuple together with the Tag is then pushed into the RM FIFO of TCAM#1.
3) Suppose that the current queue sizes of the 5 KE FIFOs are 2, 0, 1, 1, and 1, respectively. According to the FA algorithm, the KE operation of packet P0 is to be performed in TCAM#2. The two range sub-fields <45535, 80>, together with the Tag, are then pushed into the KE FIFO associated with TCAM#2.
4) Suppose that it is now KE's turn or no RM task is ready for execution. PU#2 pops the head unit (<45535, 80> + Tag<00100110>) from the KE FIFO and sends it to TCAM#2 to perform the two range encodings successively. Meanwhile, the corresponding tag is pushed into the Tag FIFO.
5) Suppose that all the packets before packet P0 have been processed, and P0 is now the head unit in the RM FIFO of TCAM#1. Note that packet P0 has S/N "00110". Hence, when it is the RM's turn, PU#1 probes the valid bit of the 6th unit in the Key Buffer.
6) When PU#1 finds that the bit is set, it pops the head unit from the RM FIFO (the 5-tuple) and reads the contents out from the 6th unit of the Key Buffer (the encoded key of the two ranges), and then launches an RM operation in TCAM#1. Meanwhile, the valid bit of the 6th unit in the Key Buffer is reset, and the pointer of PU#1 is incremented by one and points to the 7th unit.
7) When the Mapper receives the RM result, it returns it to the NPU, completing the whole PC process cycle for packet P0.
V. EXPERIMENTAL RESULTS
A. Simulation Results
One can see that at K=5, the OC768 throughput is guaranteed even when the system is heavily loaded (traffic intensity tends to 100%), whether the FA or the SRR algorithm is adopted. This is mainly because the theoretical throughput upper bound at K=5 (5*100M/4 = 125 Mpps) is 1.25 times the OC768 maximum packet rate (100 Mpps). In contrast, at K=4, the throughput falls short of the wire-speed when SRR is used, while FA performs fairly well, indicating that FA has better load-balancing capability than SRR.
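The throughput upper bound quoted above follows from the number of TCAM accesses per classification. The sketch below assumes four accesses per packet (two for RM and two for KE) and a 100 MHz TCAM clock, as used elsewhere in the paper; the function name is hypothetical.

```python
def throughput_upper_bound_mpps(num_tcams, tcam_clock_mhz=100, accesses_per_pc=4):
    """Aggregate classifications per second (in millions) for K parallel TCAMs."""
    return num_tcams * tcam_clock_mhz / accesses_per_pc


for k in (4, 5):
    bound = throughput_upper_bound_mpps(k)
    print(k, bound, bound / 100)   # ratio to the OC768 rate of 100 Mpps
```

At K=4 the bound equals the OC768 packet rate exactly, which leaves no headroom for load imbalance; this is consistent with SRR falling short of wire-speed at K=4 while both algorithms succeed at K=5.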
Delay Performance: According to the processing flow of the DPPC-RE scheme, the minimum delay for each PC is 10 TCAM cycles (5 for RM and 5 for KE). In general, however, additional cycles are needed for a PC because of the queuing effect. We focus on the performance when the system is heavily loaded. Fig. 8 shows the delay distribution for the back-to-back mode, i.e., when packets arrive back-to-back (traffic intensity = 100%). We note that the average delays are reasonably small, except for the case at K=4 when SRR is adopted (average delay > 20 TCAM cycles). In this case, when the offered load reaches the theoretical limit (i.e., 100 Mpps), a large number of packets are dropped due to SRR's inability to effectively balance the load.
The delay distributions for the cases using FA (K=4 or 5) are much more concentrated than those using SRR, suggesting that FA offers much smaller and more deterministic delay than SRR. Note that more deterministic delay performance results in lower buffer/cache requirements and lower implementation complexity for the TCAM classifier as well as for other components in the fast data path.
Change of Traffic Pattern: In order to measure the stability and adaptability of the DPPC-RE scheme when the traffic pattern changes over time, we run the following simulations in the back-to-back mode (traffic intensity = 100%).
The traffic pattern depicted in Fig. 2 TCAM chips should be used in parallel. The total number of TCAM entries required is N x DER. Without a dynamic load-balancing mechanism (which can only be employed when adopting key encoding), its performance is nondeterministic and massive loss may occur when the system is heavily loaded or the traffic pattern changes.
The TCAM Expansion Ratios (ERs, defined as the ratio of the total number of TCAM entries required to the total number of rules in the rule database) are calculated for all five real-world databases based on these four schemes. Distributed classification with range encoding (i.e., the DPPC-RE scheme) provides a very good balance between the worst-case performance guarantee and low memory cost, and is therefore an attractive solution.
VI. DISCUSSION AND CONCLUSION
Insufficient memory bandwidth for a single TCAM chip and the large expansion ratio caused by the range matching problem are the two important issues that have to be solved when adopting TCAM to build high performance and low cost packet classifiers for next generation multi-gigabit router interfaces.
In this paper, a distributed parallel packet classification scheme with range encoding (DPPC-RE) is proposed to achieve OC768 (40 Gbps) wire-speed packet classification with minimum TCAM cost. DPPC-RE includes a rule partition algorithm to distribute rules into different TCAM chips with minimum redundancy, and a heuristic algorithm to balance the traffic load and storage demand among all TCAMs. The implementation details and a comprehensive performance evaluation are also presented.
A key issue that has not been addressed in this paper is how to update the rule and range tables with minimum impact on the packet classification process. The consistent policy
VII. ACKNOWLEDGEMENT
