Abstract: Packet processing is a critical operation in a high-speed router. To achieve memory-efficient and fast $O(1)$ lookup operations, Bloom filters (BFs) have been widely used as packet classifiers that reduce expensive hash table accesses. However, a parallel packet classifier (PPC), which uses all n parallel BFs for a lookup, has been identified as neither power efficient nor throughput efficient for high-speed routers. In this paper, we propose a multitiered packet classifier (MPC) that both saves power and improves throughput with the same memory size as a PPC. While a PPC with n BFs incurs $\Theta(n)$ BF access complexity per lookup, our MPC is designed so that its complexity is, probabilistically, significantly less than $\Theta(n)$. Furthermore, by preprocessing a group of lookups in one cycle, the MPC assigns each lookup to its associated BF on a best-effort basis and consequently obtains a higher throughput. For the same reason, the MPC design saves a significant amount of power by preventing accesses to noninvolved BFs during a lookup. In simulations of flow identification with NLANR traces, we observed that the MPC throughput increases by up to 100 percent compared with a PPC. Additionally, our MPC is 4.2 times as power efficient as an equivalent PPC.

INTRODUCTION
As the demand for high-speed and large-scale routers continues to surge, a class of fast packet processing functions, such as packet classification and IP lookup, has become a critical data path function for many emerging networking applications. These functions are widely applied in networking devices to support firewalls, access control lists, and quality of service in several network domains. This type of packet processing matches a packet at high speed against a prioritized set of rules, which are made up of one or more fields for IP lookup or packet classification, respectively. For example, for a Cisco GSR 12416 router with a 160 Gbps rate, a packet lookup in a rule table can take approximately 2 ns per packet in the worst case of minimum-size 40-byte packets.
In general, a good packet processing scheme has to satisfy the following three criteria:
1. Throughput: A lookup must forward packets fast enough to keep up with the rapidly increasing line rate.
2. Power: The power dissipation of a lookup must scale well so that the cooling system complexity stays reasonable.
3. Memory: Because memory size directly affects the system's cost, lookup speed, and power dissipation, the total memory requirement needs to grow slowly with the number of rules.
Researchers have proposed several efficient packet processing schemes to address one of the above three criteria, and there are currently three major techniques for achieving efficient packet processing: Ternary Content Addressable Memory (TCAM) [1], [2], trie-based [3], and hash-based schemes. TCAMs are known to provide a deterministic, high-speed packet classifier [1]. However, due to their noncommodity nature and brute-force search method, TCAMs' cost and power dissipation tend to become prohibitive for large tables and high line rates. Unlike TCAMs, trie-based schemes use a tree-like data structure to classify a packet successively, a few bits at a time. However, a trie inherently suffers from the additional space needed to hold pointers from nodes to their children and from the sequential memory accesses caused by these pointers.
In contrast, since hash-based schemes do not perform brute-force searches as TCAMs do, they can potentially achieve an order-of-magnitude power saving. Furthermore, unlike tries, hash tables employ a flat data structure, which can potentially achieve smaller memory sizes that are amenable to on-chip SRAM. The Bloom filter (BF) has been widely documented in the networking literature [4], [5], [6], [7], [8], [9] as a hash-based scheme. A BF with k hash functions is essentially a generalized hash mechanism that acts as an approximate membership classifier on a key set. Dharmapurikar et al. [6] introduce the first algorithm to employ BFs working in parallel for IP lookup operations. The authors employ both a set of on-chip BFs for a fast and approximate match and off-chip hash tables for the confirmation of the approximately matched prefix. The primary goal in [6] is to minimize the number of off-chip hash probes per lookup, as one memory access to an off-chip hash table is desired [10]. Similarly, approaches using BFs for a fast lookup are applied to packet inspection for network security [7], to packet classification identifying flows [8], and to other network applications such as content delivery [5].
Although a BF's memory efficiency has warranted its extensive usage in the field, a BF with an m-bit vector is limited by false positives (f-positives). That is, a key's lookup may return "yes" even when the key is not a BF member, because all k bit positions indexed by the key's k hash indexes collide with other keys' hash indexes. Sustaining an extremely low false positive rate and resolving false positives are both necessary in high-speed packet classification. However, this is a challenging task because probing all k bits in parallel for a lookup is not power efficient; in fact, probing the remaining bits is wasteful once any of the k bits is known to be "0." Such power-efficient probing is suggested in a pipelined BF [11]. In each pipeline stage of the scheme in [11], a subgroup of $k'$ hash functions, $k' < k$, is used to access one m-bit vector BF memory with $k'$ read ports, and the probing in the following pipeline stage depends on the previous stage's result.
However, the pipelined BF scheme in [11] suffers from the following: 1) Reduced lookup throughput: Among the pipeline stages of different lookups, a structural hazard happens because the BF memory supports only $k'$ read ports in one clock for a lookup. Thus, if a lookup requires all the pipeline stages, the lookup processing will consume as many clocks as there are pipeline stages. 2) The power saving benefit of a pipelined BF is limited to a single BF.
In addition to the issues of a single BF, for a variety of applications that use a set of n BFs, a key (or packet) query to the n BFs has been designed in an ad hoc manner, such that it probes all n BFs. For instance, in packet classification [8] for a Juniper T640 with 160 router ports, each of the 160 BFs is assigned to a port to record its flows, and all BFs are probed to find the next output port for an incoming flow. The authors of [12], [13] introduce a segmentation scheme, since segmented and load-balanced subtables show better lookup performance. In this scheme, they assign a BF to each subtable so that unnecessary off-chip accesses to subtables are probabilistically eliminated. This elimination is only effective when a BF returns "no" in a lookup. Thus, even if a subset of BFs returns "yes" as f-positives, we still need to conduct a lookup in the remaining BFs.
However, probing all BFs for a packet lookup in a clock cycle is not considered throughput efficient, because only one BF is actually associated with the lookup while the rest of BFs can do other lookups in the clock cycle. Thus, distributing lookup requests to their corresponding BFs without probing the irrelevant BFs during the operation is considered both power and throughput-efficient in packet classification.
In this paper, we propose a multitiered packet classifier (MPC) of n BFs that provides such a lookup distribution to achieve higher power and throughput efficiencies than a parallel packet classifier (PPC) of n parallel BFs [6], [8], [9], [14]. Specifically, a PPC accesses n BFs for one lookup in every clock cycle, while our MPC accesses n BFs for several lookups in every clock cycle, and this is done with the same BF memory allocation as that of a PPC. To build two-tiered BFs as an example of an MPC, the total PPC memory is strategically split into small-sized BFs with one read port for the prestage phase and large-sized BFs with $k - 1$ read ports for the poststage phase. Next, each small-sized BF is logically connected to two large-sized BFs, so that a forest of binary trees is built.
In this forest, a lookup proceeds in a pipeline from its parent BFs in the prestage to its children BFs in the poststage. The lookup proceeds to the two children BFs in the poststage if the lookup in the parent BF returns a positive, either true or false. If a BF lookup in the prestage is a true positive, then accessing its children BFs in the poststage is unavoidable, since a BF lookup never produces a false negative. Conversely, if the lookup in the prestage is a true negative, there is no need to probe the two large-sized BFs, which are irrelevant to the lookup, in contrast to a PPC. Thus, the total number of large-sized BFs reached by the lookup is probabilistically far smaller than $n - 2$, because it depends on the f-positives of the small-sized BFs in the prestage. Since a large number of memory reads consumes significant power, we can improve power efficiency through such a distribution, in which a group of lookups is sent, at best, only to its corresponding BFs in a clock cycle.
In addition to the power efficiency, we can simultaneously achieve better throughput efficiency, since the idle large-sized BFs in the poststage can be utilized for other lookups. That is, if we process several lookups through the multiple ports of a small-sized BF in one prestage clock, we disseminate the lookups to their corresponding large-sized BFs in the poststage, and the lookups are processed by the BFs in parallel. Our MPC for packet classification satisfies the first two criteria (Throughput and Power) better than a PPC while using the same amount of memory, and an MPC is identified as beneficial for any packet processing [6], [9], [15], [16]. This paper makes the following contributions:
• We propose a packet classifier with n BFs, termed an MPC, in a multitiered configuration of BFs with the same memory capacity as that of a PPC.
• By splitting a whole multiport BF memory into small-sized BF memory segments, the MPC fabrication in memory modules does not increase the total area cost as significantly as a PPC.
• We propose new algorithms for the insert, query, and delete operations of an MPC, and they are as easy to implement as those of a PPC.
• We evaluate the packet classification of the MPC scheme using NLANR traces [17]. The proposed MPC scheme is shown to be 4.2 times as power efficient and two times as throughput efficient as a PPC.
The related works using BFs for packet processing are presented in Section 2. Section 3 introduces the problem statement of using n BFs for packet processing. In Section 4, we show the MPC memory architecture in addition to the insert, query, and delete operations. In particular, we consider two kinds of lookups, successful and unsuccessful, and we calculate their probabilities. In Section 5, with IP traces from [17], we measure the power efficiency with different BF configurations. Finally, Section 6 summarizes the benefits of MBFs and our future work.
RELATED WORKS
In this section, we review the BF literature on high power and throughput efficiencies. We also discuss a list of packet processing applications that use BFs, together with their advantages and disadvantages.
A Power-Efficient BF
Suppose an m-bit vector memory for a BF has $k'$ read ports, although the number of hash functions in the BF is $k \gg k'$ for a high-precision lookup. For the pipelining scheme [11], let $k/k' = 3$, so that there are three pipeline stages, as shown in the example in Fig. 1. Among the three lookups in a BF, lookup L1 does not need stages S2 and S3, because probing the BF in S1 reveals that the key is not a BF member. Although preventing unnecessary memory accesses in these stages reduces power, the pipelining scheme needs three clock cycles in the worst case. This occurs for a true or false positive lookup, when the lookup requires all the stages. Since the $m \times 1$ memory supports only $k'$ read ports, the overlapped access to the memory in stage S2 for lookup L2 and in stage S1 for lookup L3 causes a structural hazard [18], as shown in Fig. 1a. Although having k read ports can resolve this hazard, providing a larger number of k read ports is not efficient in terms of the necessary hardware implementation. The other solution is to stall twice, as shown in Fig. 1b. However, these stalls decrease the throughput in processing the three lookups. In contrast, our MPC takes one clock to process a lookup, and an MPC of n BFs in a multitiered and pipelined configuration is designed to process multiple lookups in a clock cycle.
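The staged probing described above can be illustrated with a minimal sketch. It is not the implementation from [11]; the function names `bit_index`, `insert`, and `staged_query` are hypothetical, and SHA-256 is an assumed stand-in for the k hash functions. Each stage reads $k'$ bits and the lookup stops at the first stage that observes a 0.

```python
import hashlib

def bit_index(key: str, i: int, m: int) -> int:
    """Hypothetical i-th hash function: SHA-256 of 'i:key', reduced modulo m."""
    digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def insert(bits: list, key: str, k: int) -> None:
    """Set the k bits selected by the k hash functions."""
    for i in range(k):
        bits[bit_index(key, i, len(bits))] = 1

def staged_query(bits: list, key: str, k: int, k_prime: int):
    """Probe k_prime bits per stage and stop at the first stage that sees a 0.

    Returns (is_member, stages_used, bits_read). All k_prime bits of a stage
    are counted as read, since they are fetched through parallel read ports.
    """
    m = len(bits)
    stages = 0
    reads = 0
    for start in range(0, k, k_prime):
        stages += 1
        group = range(start, min(start + k_prime, k))
        reads += len(group)
        if any(bits[bit_index(key, i, m)] == 0 for i in group):
            return False, stages, reads  # early exit saves the later stages
    return True, stages, reads
```

With k = 6 and $k'$ = 2, a member key uses all three stages, while a lookup that hits a 0 in the first stage reads only two bits; this is the power saving, and the three-stage worst case is the throughput cost discussed above.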
Fast Packet Classification with n BFs
The goal of packet classification is to identify a flow that is characterized by a five-tuple (source IP (SIP), destination IP (DIP), protocol, source port (SP), and destination port (DP)), and then to forward the flow to a corresponding output port. Several types of packet classifiers have been suggested to meet this goal, such as TCAM-based and SRAM-based classifiers [2], [15], [19], [20], [21]. In a hash-based approach, the packet classifier in [8] uses BFs in parallel, so that for a given packet lookup all BFs need to be checked in order to find the packet's associated flow, and the packet is forwarded to the port whose BF returns "yes." However, in a high-speed lookup performed on a BF, the number of memory read ports in the BF must be large enough to provide a sufficiently low f-positive rate. Also, the number of BFs to be probed is as large as the number of ports of a high-speed router, which means that we need to access all BFs in a brute-force way. Unlike the above schemes with $\Theta(n)$ BF access complexity among n BFs, our MPC demands probabilistically less complexity than $\Theta(n)$ for a lookup, which implies that we can save the power otherwise consumed by unnecessary BF accesses.
In addition to the power saving through a sub-$\Theta(n)$ BF access complexity per lookup, our MPC also provides a throughput of multiple lookups per clock cycle. Besides packet processing, applications in other domains have also benefited from BFs, such as dynamic BFs for data management [9], wide-area web caching [22], content delivery across overlay networks [5], IP traceback [23], and query routing in peer-to-peer networks [16]. Power saving is a paramount issue in wireless sensor networks as well; the coordinated packet traceback mechanism in [24] introduces the concept of dimensions in hash algorithms, in which a dimension can be expanded by the number of hash functions, hash tables, or both. However, all these applications simply process one lookup over n BFs in parallel, resulting in $\Theta(n)$ lookup complexity, while our MPC processes several lookups in a clock cycle for a high throughput.
PROBLEM STATEMENT OF FAST PACKET CLASSIFICATION
The issue of how to reduce the number of expensive off-chip accesses through n on-chip BFs is a paramount concern in packet processing [6], [8], [11], [12], [13] as well as in other network applications, including wireless sensor networks [24]. However, in this section, we formalize this issue and restrict it to the packet classification domain. A parallel lookup with n BFs is a common configuration in packet processing [8], [12], [13]. This is shown in Fig. 2, where a five-tuple of SIP, DIP, protocol, SP, and DP is extracted from a packet and a lookup of the five-tuple is performed among the n BFs. Fast on-chip packet processing with n BFs is beneficial, because this approach not only reduces the number of expensive off-chip hash probes [6], [10] but also enhances the load balance in a set of off-chip hash tables [12], [13]. Due to f-positives from the BFs, all positives must be confirmed by a hash table of the recorded flows. We emphasize that providing a perfect match in the off-chip hash table is necessary in packet classification for QoS and security reasons, and consequently, accesses from the BFs contend for the n hash tables. BFs can be fabricated on-chip due to their memory efficiency, while their hash tables are located off-chip due to their large memory requirement, as in other schemes [6], [14], [15]. In this configuration of on-chip/off-chip separation, the packet lookup throughput is bounded by the processing time in the off-chip hash table.
We can calculate the worst-case throughput of the parallel packet classifier engine in Fig. 2 in the following way. Given a lookup of a minimum-size 40-byte packet, there are two kinds of lookups: an unsuccessful lookup (UL), in which a key is searched exhaustively although it does not exist in the BFs, and a successful but time-consuming lookup (SL), in which the searched key does exist in the BFs. Let $t_s$ and $t_u$ denote the processing times in an off-chip hash table (HT) for an SL and a UL, respectively. Then, the packet lookup throughput in n BFs is calculated as follows:
where $p_s$ is the SL rate, and the $nf$ and $(n-1)f$ terms give the expected numbers of f-positives, which are based on the binomial distribution of identical and independent BFs in an SL and a UL, respectively. Based on (1), Fig. 3 shows the throughput where the HT's processing time in an SL, $t_s$, is 1.001 times 2 ns in a modern T-RAM [25] and $t_u$ is set to 0.5 times 2 ns. In the worst case of $p_s = 1$, the lookup throughput with BFs of $k = 10$ read ports can barely keep up with 160 Gbps, while BFs of $k = 15$ read ports can meet the bandwidth requirement. Thus, a large number of read ports in a BF memory is required for obtaining a high throughput, and avoiding accesses to irrelevant BFs with such a large number of ports during a lookup is preferable. In the following section, we present how an MPC achieves this avoidance by distributing lookups through small-sized BFs with a few ports, so that a subset of the lookups is processed in large-sized BFs in one clock cycle for higher power and throughput efficiencies.
A MULTITIERED PACKET CLASSIFIER WITH n BFS
In this section, we first present the basic theory behind a BF and an f-positive. We then introduce the steps to build an MPC and to implement the insert, query, and delete operations in an MPC for better performance.
Bloom Filter Theory
A legacy BF representing a set S of $n_i$ items (or keys) is described by an m-bit array memory with each bit initially set to 0. A BF uses k independent hash functions $h_0, \ldots, h_{k-1}$ with the range $[0, m-1]$. For mathematical convenience, we make the natural assumption that these hash functions map each key in the universe to a random number uniform over the range, as the authors of [26] claim. To insert each key $e_{j'} \in S$, the bits indexed by $h_{k'}(e_{j'})$ are set to 1 for $0 \le k' \le k-1$.
To query whether a key $e'$ is in S, the k bits obtained by k memory reads through $h_{k'}(e')$ should all be 1. If that is the case, the BF returns "yes" for the query of key $e'$. If that is not the case, then clearly $e'$ is not a member of S. Even if a BF returns "yes," there exists a probability of an f-positive, such that the key is falsely believed to belong to set S because the k bits happen to have been set to 1 by other, independent keys.
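The insert and query rules above can be sketched as follows. This is an illustrative sketch only: the class name `BloomFilter` and the use of salted SHA-256 digests as the k hash functions are assumptions, not part of the original design.

```python
import hashlib

class BloomFilter:
    """Minimal legacy BF: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, all initially 0

    def _indexes(self, key: str):
        # Assumed hash family: SHA-256 of 'i:key' for i = 0..k-1, reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        for idx in self._indexes(key):
            self.bits[idx] = 1

    def query(self, key: str) -> bool:
        # "yes" only if all k probed bits are 1; a single 0 proves non-membership.
        return all(self.bits[idx] for idx in self._indexes(key))
```

A key that was inserted is always reported as present (no false negatives), whereas a key that was never inserted may still be reported as present when its k bits were all set by other keys.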
The probability f of an f-positive can be formulated in a straightforward way, given our assumption that the hash functions are perfectly random. Among m bits, the chance that a specific bit is set to 1 by one hash function $h_{k'}$ is $1/m$. After all $n_i$ elements of S are hashed k times into the BF, i.e., $k \cdot n_i$ times in total, the probability that a specific bit is still 0 is asymptotically
$$p = (1 - 1/m)^{k n_i} \approx e^{-k n_i/m}.$$
Then, the probability of an f-positive, obtained by randomly choosing k bits among m bits, is
$$f = (1 - p)^k \approx \left(1 - e^{-k n_i/m}\right)^k. \qquad (2)$$
This probability is bounded, and the optimal k, the number of hash functions that minimizes f, is easily found to be $k = \ln 2 \cdot (m/n_i)$. After some algebraic manipulation, the requirement $f = 2^{-w}$, where w is called the lookup precision, gives
$$m \ge n_i \log_2(1/f)/\ln 2 \approx 1.44\, n_i \log_2(1/f) = 1.44\, n_i w. \qquad (3)$$
From (3), the following important lemma can be derived:
Lemma 1 (Linear Property). A linear property between m and $n_i$ exists in (3): for a given f, m is linearly proportional to $n_i$. Furthermore, in an optimal configuration, k becomes w according to the following derivation:
$$k = \ln 2 \cdot \frac{m}{n_i} = \ln 2 \cdot \frac{n_i w/\ln 2}{n_i} = w, \qquad (4)$$
and for a deterministic $O(1)$ lookup scheme processing 500 M packets per second for a 160 Gbps router, k needs to be at least 29 ($\approx \log_2(500\,\mathrm{M})$).
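As a numeric check of the sizing relations, the following sketch evaluates the standard BF formulas used in this section: the f-positive probability, the optimal number of hash functions, and the memory needed for a lookup precision w. The function names and the example values ($n_i$ = 1000, w = 10) are illustrative assumptions.

```python
import math

def false_positive(m: float, n_i: int, k: float) -> float:
    """f = (1 - e^{-k n_i / m})^k, the asymptotic f-positive probability."""
    return (1.0 - math.exp(-k * n_i / m)) ** k

def optimal_k(m: float, n_i: int) -> float:
    """k = ln 2 * (m / n_i), the number of hash functions that minimizes f."""
    return math.log(2) * m / n_i

def required_bits(n_i: int, w: int) -> float:
    """m = n_i * w / ln 2 (about 1.44 * n_i * w) for a target f = 2^{-w}."""
    return n_i * w / math.log(2)

n_i, w = 1000, 10                  # illustrative values only
m = required_bits(n_i, w)          # about 14,427 bits
k = optimal_k(m, n_i)              # equals w in the optimal configuration
f = false_positive(m, n_i, k)      # equals 2^{-w}

# 160 Gbps with minimum-size 40-byte (320-bit) packets gives 500 M packets/s,
# so the precision needed for roughly one f-positive per second is about 29.
packets_per_second = 160e9 / (40 * 8)
w_router = math.ceil(math.log2(packets_per_second))
```

The values confirm the linear property (m grows linearly in $n_i$ for fixed w) and that the optimal k coincides with w.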
Each hash function corresponds to one random lookup in an m-bit BF. Thus, for high throughput, a BF with k hash functions needs exactly k memory read ports in an m-bit memory module. Although state-of-the-art VLSI technology can fabricate memory modules with multiple ports, supporting more than ten ports is tremendously challenging to implement, as noted in a concise summary of recent embedded memory technologies [27]. Fig. 4 shows this difficulty in terms of the power and area costs measured by CACTI [28] according to the number of read ports in a single memory module. The conclusion from the figure is that the power and area costs are superlinear with respect to the number of read ports. Thus, a BF is a computation-intensive element due to the large value of k for a high-speed router, and reconfiguring such BFs for power-efficient and throughput-efficient lookups is therefore necessary.
Basic Principles of an MPC
Fig. 5 shows the basic principles of constructing an MPC and demonstrates how an MPC is superior to a PPC in power efficiency. Suppose there are four BFs in a PPC, as shown in Fig. 5a, and each BF is equipped with k memory read ports. In this parallel configuration, we need to access k bits in each BF, and the access is performed in one clock cycle. Thus, the PPC's lookup throughput is one per clock cycle, and the PPC needs power for a 4k-bit BF memory access in order to process one lookup.
In contrast, our MPC can reduce the aforementioned power usage by probing only a subset of the k bits in a BF. Suppose there are two smaller BFs with one read port each, and we put them in the prestage of four larger BFs with $k - 1$ read ports each, as shown in Fig. 5b. Then, we conceptually connect each smaller BF in the prestage to two larger BFs in the poststage via a tree relationship. That is, if a smaller BF (or parent BF) returns a positive in a lookup, we need to probe the two larger BFs (or children BFs) that are connected to the smaller BF. Suppose a key A is encoded in a smaller BF and a larger BF, as shown in Fig. 5b, and we search for the key starting from the prestage. Since there is no false negative, the BF that encodes the key A must return "yes" in the key lookup. The second smaller BF in the prestage may return "yes" as a false positive, and its probability is 1/2, based on (2). Thus, our MPC probes on average $2 + 3(k - 1)$ bits in order to process a lookup, while a PPC probes 4k bits, confirming that our MPC can reduce the power usage for the 4k-bit memory access of a PPC.
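The bit-probe comparison in this example can be worked through numerically. The sketch below is an illustration under stated assumptions: it considers a single successful lookup, one-port parent BFs with f-positive probability 1/2, and children with $k - 1$ ports; the function names are hypothetical.

```python
def ppc_bits(n: int, k: int) -> int:
    """A PPC probes k bits in each of its n BFs for every lookup."""
    return n * k

def two_tier_expected_bits(n: int, k: int, f_pre: float = 0.5) -> float:
    """Expected bits probed by a two-tiered classifier for one successful lookup.

    Assumptions: n large BFs with k-1 read ports, n/2 one-port parent BFs,
    and each irrelevant parent answers "yes" with probability f_pre.
    """
    parents = n // 2
    prestage = parents                              # one bit per parent BF
    expected_yes = 1 + (parents - 1) * f_pre        # true-path parent always says yes
    poststage = expected_yes * 2 * (k - 1)          # two children per positive parent
    return prestage + poststage
```

For the four-BF example with k = 10, the two-tiered layout probes 29 bits on average, compared with 40 bits for the PPC.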
In addition to the power saving, our MPC can increase the lookup throughput by using dual read ports for the two smaller BFs in the prestage, as shown in Fig. 5c. Suppose we encode keys A and B into the first and fourth larger BFs in the poststage and into both smaller BFs in the prestage. Next, we assign one read port to lookup A and the other read port to lookup B. Since we can process two lookups in the two smaller BFs in one clock cycle, we can place the two lookups for keys A and B in the four larger BFs during the next clock cycle, under the condition that there is no false positive in the smaller BFs. Thus, with the complementary probability of an f-positive, i.e., 1/2, our MPC can increase the lookup throughput.
In a nutshell, our MPC reduces the power required for BF memory access by preprocessing a lookup with the smaller BFs in the prestage and by confirming the lookup with the larger BFs in the poststage. Note that if we increase the number of read ports of the smaller BFs in the prestage, we can further reduce the power consumption, since the f-positive probability decreases according to (2). In terms of the throughput increase, the smaller BFs in the prestage disseminate lookups to the corresponding larger BFs in the poststage in one clock cycle.
Building a Multitiered Packet Classifier
Fig. 6 shows a detailed configuration example of an MPC, a two-tiered PC (2TPC) built on top of four BFs; it takes the place of the PPC in the dashed box of Fig. 2. Letters A and D denote the address and data ports of a BF memory, respectively. A BF on layer 2, i.e., the prestage, has one read port, while a BF on layer 1, i.e., the poststage, has $k - 1$ read ports. Since we organize an MPC in a pipeline configuration, we access two BFs in stage S2 if their parent BF in stage S1 returns "yes" in a lookup. Similarly, we follow the same lookup steps in a three-tiered PC (3TPC), which is constructed on top of eight BFs, as shown in Fig. 7. Note that all small-sized BFs in S1 and S2 have one read port, while the large-sized BFs in S3 have $k - 2$ read ports; these setups are purposely built this way in order to make a fair memory comparison with a PPC of eight BFs with k read ports.
In addition to these two architecture examples, we show mathematically that an MPC uses the same memory size as a PPC in the general case. Given a desired f-positive $f = 2^{-w}$, the total PPC memory in bits with n BFs is $n \cdot m$, where m is a BF's memory based on (3). However, with the linear property between m and $n_i$ and an additive operation on the memory sizes $m_t$, we can reconfigure the BFs in an $(r + 1)$-tiered way, $r > 0$, while the same memory size, $m_M$, is used for an MPC as follows:
where $m_t$ is the total memory of BFs on layer t, $r + 1$ is the number of tiers, $2^t n_i$ is the number of keys in $B^t_i$, and the lookup precisions of a BF on layer 1 and layer t, $w_1$ and $w_t$, are $w - r$ and 1, respectively. Based on (3), the f-positives of BFs on layers 1 and 2 in a 3TPC are expected to be $2^{-(w-2)}$ and $2^{-1}$, respectively, and the second term in (5a) is the sum of small-sized BFs from layer 2 to layer $r + 1$. Also, a BF from layer 1 covers $n_i$ elements, and a BF from layer 2 covers $2n_i$ keys. In general, B . Thus, the power used to probe them can be saved.
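The equal-memory claim can be checked numerically with the sizing rule (3). The sketch below is an illustration under explicit assumptions: layer 1 holds n BFs of $n_i$ keys with precision $w - r$; each higher layer t holds $n/2^{t-1}$ BFs with precision 1, each covering $2^{t-1} n_i$ keys (consistent with a layer-2 BF covering $2n_i$ keys); and n is divisible by $2^r$. The function names are hypothetical.

```python
import math

def bf_bits(keys: int, precision: int) -> float:
    """Memory of one BF from (3): keys * precision / ln 2 bits."""
    return keys * precision / math.log(2)

def ppc_memory(n: int, n_i: int, w: int) -> float:
    """n parallel BFs, each holding n_i keys at precision w."""
    return n * bf_bits(n_i, w)

def mpc_memory(n: int, n_i: int, w: int, r: int) -> float:
    """(r+1)-tiered layout: large BFs on layer 1, one-port BFs on layers 2..r+1."""
    total = n * bf_bits(n_i, w - r)              # layer 1, precision w - r
    for t in range(2, r + 2):
        fan = 2 ** (t - 1)                       # leaves covered by one layer-t BF
        total += (n // fan) * bf_bits(fan * n_i, 1)
    return total
```

Each of the r upper layers contributes $n \cdot n_i/\ln 2$ bits, which exactly compensates for lowering the layer-1 precision from w to $w - r$.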
In addition to the power concern, we design a throughput-efficient scheme in an MPC configuration. However, higher throughput efficiency cannot be achieved in this setup by simply setting b to a value greater than 1. Although the derivation of (5) shows that an MPC has the same memory size as a PPC, processing a lookup in small-sized BFs with one read port does not provide a higher throughput in the large-sized BFs on a lower layer. For instance, even if b in Fig. 6 with $w_2 = 1$ is set to two, a one-read-port BF on layer 2 cannot process two lookups in one cycle. Thus, the number of read ports in the small-sized BF needs to be the same as b. In general, the number needs to be $b \cdot w_2$ for a throughput-efficient MPC. As suggested in [6], using mini-BFs with fewer read ports is a solution that does not degrade lookup accuracy. However, even if a BF is broken into several mini-BFs, the total number of read ports in the mini-BFs is the same as that of a PPC. Thus, breaking a BF into mini-BFs only makes it possible to fabricate BFs for packet processing, but does not provide the benefit of high throughput. In contrast, our MPC has the two benefits of fewer read ports and a reduced area cost, which make it possible to fabricate small-sized BFs with multiple read ports for a high throughput without introducing area overhead.
Figs. 8a and 8b show these two benefits: the smaller number of fabricated read ports and the smaller die area for a 2TPC. Fig. 8a shows the number of read ports required to fabricate different numbers of BFs for a PPC, a 2TPC, and a 3TPC. A 2TPC and a 3TPC use 4 percent and 10 percent fewer read ports than a PPC, respectively, in all cases. Fig. 8b shows the 2TPC and PPC area costs for different values of w and $n_i$; the area costs are measured with the CACTI model [28], using four mini-BFs per BF in each case. Now, we show how to fabricate multiple ports in a small-sized BF without incurring hardware overhead. There is a noticeable gap between the dotted and solid meshes in Fig. 8b, because fabricating multiple ports in a small-sized memory does not require as much area as in a large-sized memory. In the figure, there is only a small area increase for the multiport memory, compared to a PPC's area. Thus, the buffer size b can be at most five. Also, utilizing dual reads on the falling and rising edges of a clock [18] can double the memory read capacity and the lookup throughput (i.e., the double data rate scheme implemented in DRAM and the AMD Athlon64). Thus, the buffer size doubles, and the maximum b is 10 without incurring memory overhead in an MPC.
Insert Operation in an MPC
The insert operation of a key in a BF on layer 1 is as simple as the key's insertion into a legacy BF, as shown in Section 4.1. Similarly, on layer j, if a key to hash is assigned to $B^j_i$, the key is given to $B^{j+1}_{\lfloor i/2 \rfloor}$ for the insert operation, $1 < j \le s$. The detailed procedure is shown in Procedure insert, which performs $k_j$ memory writes on layer j. Therefore, the memory write complexity of one key insertion is $\sum_{t=1}^{s} k_t = w = k_P$, which is the same as that of a PPC, where $k_P$ is based on (4). Also, note that the first for loop, marked with a vertical line, can be pipelined, because the BF memories on any layer are independent of those on other layers. Thus, one key insertion is performed in every cycle, on the condition that $B^1_i$ on layer 1, $1 \le i \le n$, supports multiple ports.
Query Operation in an MPC
Unlike the insert operation, where only the involved BFs are accessed, the query operation needs to access all BFs to find which BF returns "yes." Except for the one involved BF, the remaining irrelevant BFs may give f-positives, and this may lead to packet misclassification. Since the irrelevant BFs in an MPC are not considered for probing, the BF access complexity of processing a lookup with n BFs is far less than n. To provide such a complexity, we split the memory of a PPC into small-sized BFs and large-sized BFs in multiple tiers, and they are connected in binary trees. Next, accesses to large-sized BFs are made only if their small-sized parent BFs return "yes" (or value 1 in D), as shown in Fig. 6. Additionally, the BFs in multiple tiers can be pipelined, so that there is no performance degradation. Before the detailed procedure, let us introduce the definitions of a true path and a false path in an MPC.
Definition 1 (True Path). In a query operation over the forest shown in Fig. 6, a true path (t-path) for a key is composed of the BFs from a tree root to a leaf that are supposed to return "yes" as true positives. These BFs were involved in the previous insert operation for the key.
For example, if a key is assigned to set B 2 in PBFs, the BFs on the t-path for 2TPCs are the shadow-boxed BFs in Fig. 6. Also, the length of a t-path is 2 in the case of 2TPCs. From the above definition, in a query operation all BFs on the t-path should return "yes" for a given key, just as a legacy BF returns "yes," because each of these BFs has the key as a member.
Unlike a t-path, a false path is made from a group of BFs giving f-positives, so that packet misclassification can occur. The detailed definition of a false path is as follows:
Definition 2 (False Path). In query operation, a group of BFs can give f-positives, and these f-positives with other true positives can produce a false path, f-path, when it is querying along the consecutive layers.
For instance, the series of f-positive BFs can run from the off-branch BFs along a t-path to the bottom of the tree, as shown in the checked boxes of Fig. 6. Alternatively, it can start from the root of a tree, as shown in the checked boxes of Fig. 7. Note that f-positives from BFs that neither stem from a branch of a t-path nor form part of a complete tree path in the forest cannot contribute to an f-path, by definition. Also, note that the number of f-paths indicates the number of packet misclassifications. An important consequence of the above definition is that the probability that an f-path causes a packet misclassification is the product of the probabilities of every f-positive on that f-path.
False Classification in a Successful Lookup
We divide lookups into two types: 1) a successful lookup (SL), and 2) an unsuccessful lookup (UL), as in [29], [30]. In network applications, a router needs to determine any packet's destination, based on a flow table holding the classification information. If the packet's flow is in the table, we call the lookup an SL; otherwise, it is a UL. Now, we proceed to show the misclassification probability in an SL.
By a recursive definition, the probability $P_a(i)$ that the binary tree rooted at $a$ has $i$ packet misclassifications is defined as the product of the following three: the probability of producing an f-positive in root $a$ of the binary tree, the probability that the left subtree has $i - j$ packet misclassifications, and the probability that the right subtree has $j$ packet misclassifications. The above relationship is depicted as the following:
where $f_a$ is the probability of an f-positive from BF $a$, and $P_{B^1}$ serves as the base case.
Finally, the dominant probability $P_s(1)$ of a single packet misclassification occurring across a forest is illustrated in the following:
where $r$ is the number of tiers, the first term is the summation of the probabilities of (7) over the BFs attached to a t-path, and the second term is the summation of the probabilities over the remaining trees in the forest.
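For orientation, the verbal description of the recursion and of the f-path probability can be written out as below. This is a sketch reconstructed only from the wording of the text, not the published equations: the summation over the split $j$, the subscripts $\mathrm{left}(a)$ and $\mathrm{right}(a)$, and the set notation for an f-path are assumptions.

```latex
% Recursion for a tree rooted at BF a (assumed form; j splits the i
% misclassifications between the right and left subtrees):
P_a(i) \;=\; f_a \sum_{j=0}^{i} P_{\mathrm{left}(a)}(i-j)\, P_{\mathrm{right}(a)}(j)

% Probability that one f-path causes a misclassification, taken as the
% product over the f-positive BFs b on that path:
\Pr[\text{f-path}] \;=\; \prod_{b \,\in\, \text{f-path}} f_b
```

Because every factor $f_b$ is small, an f-path that must pass through several f-positive BFs is far less likely than a single f-positive in a flat arrangement.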
False Classification in an Unsuccessful Lookup
Since not all packets belong to specific flows in the flow table, a UL is just as important as an SL. Note that, unlike an SL, a UL has no t-path, which implies that any "yes" a BF returns is an f-positive. The dominant probability $P_u(1)$ of a single packet misclassification happening in a UL is
Procedure query shows the details of the query operation on an MPC. The code marked with a vertical line in Procedure query can be implemented in parallel. It calls the subroutine query_BT, which works recursively and in a pipeline on each layer of the binary tree, to check the BF for the key $e$ just as a legacy BF does. Pipelining across the layers of a binary tree also ensures that the complexity remains $\Theta(1)$, as with a PPC.
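The top-down probing described above can be sketched sequentially as follows. This is a hedged illustration rather than Procedure query: the class `Forest`, the SHA-256-based hash family, the 0-based indices, and the access counter are assumptions, and the sketch models neither the parallel probing of roots nor the pipelining across layers.

```python
import hashlib


class Forest:
    """layers[0] holds the large layer-1 BFs; the last layer holds the tree roots."""

    def __init__(self, n: int, m: list[int], k: list[int]):
        self.m, self.k = m, k
        self.layers, count = [], n
        for m_j in m:
            self.layers.append([bytearray(m_j) for _ in range(count)])
            count = (count + 1) // 2
        self.accesses = 0

    def _positions(self, key: str, j: int) -> list[int]:
        """Hypothetical hash family: k[j] bit positions for key on layer j."""
        return [
            int.from_bytes(hashlib.sha256(f"{j}:{t}:{key}".encode()).digest()[:8], "big")
            % self.m[j]
            for t in range(self.k[j])
        ]

    def insert(self, key: str, i: int) -> None:
        for j in range(len(self.layers)):
            for p in self._positions(key, j):
                self.layers[j][i >> j][p] = 1

    def _probe(self, key: str, j: int, idx: int) -> bool:
        self.accesses += 1
        bf = self.layers[j][idx]
        return all(bf[p] for p in self._positions(key, j))

    def query_bt(self, key: str, j: int, idx: int) -> list[int]:
        """Return the layer-1 indices answering yes beneath BF idx on layer j."""
        if idx >= len(self.layers[j]) or not self._probe(key, j, idx):
            return []          # a "no" prunes the whole subtree
        if j == 0:
            return [idx]
        return self.query_bt(key, j - 1, 2 * idx) + self.query_bt(key, j - 1, 2 * idx + 1)

    def query(self, key: str) -> tuple[list[int], int]:
        """Probe every root, descend only on yes; return (hits, BF accesses)."""
        self.accesses = 0
        top = len(self.layers) - 1
        hits = []
        for root in range(len(self.layers[top])):
            hits += self.query_bt(key, top, root)
        return hits, self.accesses
```

For a two-tier forest with eight layer-1 BFs that contains a single key in set 5, `query` returns `([5], 6)`: four root probes plus the two children of the one root that answered "yes", compared with eight probes when all layer-1 BFs are accessed.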
Based on (7) and (8), the expected packet misclassification, considering SL and UL rates, is
where $p_s$ is the SL rate, and $E_s$ and $E_u$ are the average packet misclassifications for an SL and a UL, respectively. There is a minuscule classification performance degradation in using an MPC. Fig. 9 shows the average packet misclassification of a PPC and a 2TPC, based on (9), for different rates of successful lookup $p_s$. There are three important considerations: 1) Given a desired f-positive, $f$, the larger $n$ is, the larger the average packet misclassification becomes, due to the bigger binomial coefficient value $B(f, n)$. 2) Given the same memory size, the probabilities of PPC-$n$ and 2TPC-$n$ for a UL are the same, while at a dominant SL rate there is a minuscule difference, 2E-9, between them. 3) The difference gets smaller as $n$ gets larger. In conclusion, the larger the number of BFs, $n$, and the rate $p_s$ are, the more negligible the difference in packet misclassifications between a PPC and a 2TPC becomes. The one-packet misclassifications of (7) and (8) show the same phenomenon as in Fig. 9.
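Given the definitions of $p_s$, $E_s$, and $E_u$ above, the expected packet misclassification is naturally read as a rate-weighted average. The form below is a sketch under two assumptions: that the UL rate is $1 - p_s$, and that the left-hand symbol $E$ (not taken from the text) denotes the overall expectation.

```latex
E \;=\; p_s\, E_s \;+\; (1 - p_s)\, E_u
```

Under this reading, when $p_s$ approaches 1 the result is dominated by $E_s$, which is where the small difference between a PPC and a 2TPC appears, and when $p_s$ approaches 0 it is dominated by $E_u$, where the two designs coincide.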
Delete Operation in an MPC
The delete operation is not as easy as insert, because a basic BF as in [6], [8], [9], [14] does not support deletion of a key that was encoded in the BF. Inserting keys into a BF is easy, since one hashes a key $k$ times and sets the indexed bits to 1. Unfortunately, one cannot perform a deletion by reversing the process. If we hash the element to be deleted and set the corresponding bits to 0, we may be setting a location to 0 that is hashed to by some other element in the set. In this case, the BF no longer correctly reflects all keys in the set.
To avoid this problem, Fan et al. [22] introduce the idea of a counting Bloom filter. In a counting Bloom filter (CBF), each entry in the BF is not a single bit but rather a small counter. When a key is inserted, the corresponding counters are incremented; when a key is deleted, the corresponding counters are decremented. The authors claim that a 4-bit counter is large enough to avoid counter overflow. In addition, the authors of [31] propose two schemes, an SRAM-based CBF and a linear-feedback-shift-register-based CBF, where the first uses an SRAM array of counts and a shared counter, and the second utilizes an array of up/down linear feedback shift registers.
If the counting BF of [22] is adopted, the delete operation can be as easy as insertion in the basic BF. Line 4 in Procedure insert, $B^{j}_{\lfloor i/2^{j-1} \rfloor}[h_t(e)] = 1$, shows the bit setting for the basic BF. However, if the delete operation is to be provided, the line needs to be changed to $B^{j}_{\lfloor i/2^{j-1} \rfloor}[h_t(e)]$++, since a counting BF is used, and line 4 of the delete procedure decrements the same counter.
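The increment and decrement behavior of a counting BF can be sketched as follows. This is a minimal single-filter illustration, not the scheme of [22] or [31]: the class name, the SHA-256-based hashes, and the policy of leaving saturated counters untouched are assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose entries are small saturating counters instead of bits."""

    def __init__(self, m: int, k: int, counter_bits: int = 4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1   # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, key: str) -> list[int]:
        """Hypothetical hash family: k counter positions for key."""
        return [
            int.from_bytes(hashlib.sha256(f"{t}:{key}".encode()).digest()[:8], "big") % self.m
            for t in range(self.k)
        ]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            if self.counters[p] < self.max_count:   # saturate instead of overflowing
                self.counters[p] += 1

    def delete(self, key: str) -> None:
        """Only meaningful for keys that were inserted earlier."""
        for p in self._positions(key):
            # Assumption: a saturated counter is never decremented, because its
            # true count is unknown and decrementing could cause false negatives.
            if 0 < self.counters[p] < self.max_count:
                self.counters[p] -= 1

    def query(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))
```

Deleting one key leaves the counters contributed by other keys above zero, so a key that is still a member keeps answering "yes", which is exactly what clearing bits in a basic BF cannot guarantee.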
SIMULATION ENVIRONMENT AND RESULT
CACTI [28] models SRAM architecture in terms of area, access time, and power. With the help of the CACTI model, we measured the throughput and power of a PPC and an MPC with IP traces from NLANR PMA and the Internet Traffic Research Group [17]. We assume that a PPC needs one cycle to process a packet lookup across $n$ parallel BFs, and that in an MPC a small-sized BF with multiports can process a group of lookups in one cycle, while a large-sized BF with multiports processes a lookup in high precision. The IP traces we used are PUR, SDA, FRG, and PSC, which have 19.4K, 29.5K, 39.7K, and 37.9K flows as rules, respectively. In simulation, we tested 193.3K, 292.2K, 337K, and 314.3K packets in flow identification with different numbers of router ports, each holding an equal number of flows.
Experiment for Power
Fig. 9. The average packet misclassification for a PPC-n and a 3TPC-n in a different SL rate.
For power estimation, each pipeline stage is designed to process a single lookup; this is in contrast to the multilookup capability in the throughput experiment of the following section. For the theoretical comparison, we calculate the average number of memory reads per lookup in MPCs based on (9). As suggested in [6], using mini-BFs with fewer read ports is a solution that does not degrade lookup accuracy. However, even if a BF is broken into several mini-BFs, the total number of read ports in the mini-BFs is the same as that of the original BF. Thus, breaking a BF into mini-BFs only improves the possibility of fabricating BFs for high throughput in packet processing; it does not warrant power reduction or lower area costs. In contrast, the proposed MPCs reduce both the number of read ports and the power consumption during a lookup operation. Fig. 10 shows these two benefits: the smaller number of fabricated read ports and the smaller number of memory reads for a lookup in 2TPCs and 3TPCs. Suppose that 15 ports are required in a BF's fabrication in PPCs. The first three solid lines show the required number of read ports in the fabrication of different numbers of BFs for PPCs, 2TPCs, and 3TPCs, respectively. The other two marked lines are the numbers of operational memory reads for a given lookup. In fabrication, 2TPCs and 3TPCs use 4 percent and 10 percent fewer read ports than PPCs. In addition, for a given packet lookup, the average number of operational memory reads with 64 BFs is reduced by factors of 1.9 and 3.8 for 2TPCs and 3TPCs, respectively, compared to PPCs. Thus, we are confident that less power is consumed during a lookup in MPCs in a real packet classification scenario. Table 1 shows the typical power values used in CACTI in the case of the FRG trace.
Based on these values, we measure the power for the other three traces, AMP, PSC, and PUR, as shown in Fig. 11. Fig. 11 shows the average power usage of the four traces over 10 runs under different configurations (PPCs, 2TPCs, and 3TPCs). We set $w = 20$ for a PPC, and the lookup precisions of a large-sized BF on layer 1 are set to 19 and 18 for 2TPCs and 3TPCs, respectively. The power efficiency ratios of 3TPCs against PPCs in AMP, PSC, FRG, and PUR are at most 4.2, 4.1, 3.7, and 3.2, respectively. Also, the power efficiency ratios of 3TPCs against 2TPCs in AMP, PSC, FRG, and PUR are 1.9, 1.9, 1.7, and 1.5, respectively. From these results, we conclude that an MPC is more power efficient than a PPC, and that the power efficiency becomes more significant as the number of tiers increases.
Experiment for Throughput
The throughput is defined as the number of packets over the number of simulation cycles when processing the whole IP traces, and we assume that each small- or large-sized BF takes one clock cycle to process a lookup. Fig. 12 shows the average throughput ratios of the four traces over 10 runs in a 2TPC architecture, where each small-sized BF on layer 2 has a $b$-sized buffer to process $b$ packets in one cycle. Once the small-sized BFs process the packets in their buffers, the results are forwarded to the large-sized BFs on layer 1. A BF on layer 1 works on a partially processed packet only if its parent BF returns "yes" to the packet. Thus, if a BF on layer 2 returns "no" for a packet, its large-sized children BFs can process other following packets, and this behavior leads to a higher throughput. In each subfigure, across various numbers of BFs, we found that the larger the buffer size is, the higher the throughput ratio is, which indicates that our MPC is capable of a higher throughput than a PPC. For example, a throughput of at most 2.0 times was observed in the PSC trace. Although we simulated the scenario with at most 64 BFs, our MPC shows a higher throughput than those in Fig. 12 when a larger number of BFs and a larger buffer size $b$ are used.
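The buffering effect described above can be illustrated with a toy cycle count. This is not the simulator behind Fig. 12: it assumes a single tree in which one layer-2 BF examines up to `b` packets per cycle and one layer-1 stage serves one forwarded packet per cycle, with an unbounded queue between them. The function names and the input list `parent_yes` (whether the layer-2 BF answered "yes" for each packet) are hypothetical.

```python
from collections import deque


def simulate_cycles(parent_yes: list[bool], b: int) -> int:
    """Count cycles until every packet has left the two-stage toy pipeline."""
    pending = deque()    # packets that passed layer 2 and wait for layer 1
    next_packet = 0
    cycles = 0
    while next_packet < len(parent_yes) or pending:
        cycles += 1
        # Layer 1 serves one packet forwarded in an earlier cycle.
        if pending:
            pending.popleft()
        # Layer 2 examines up to b buffered packets in this cycle.
        batch = parent_yes[next_packet:next_packet + b]
        next_packet += len(batch)
        pending.extend(flag for flag in batch if flag)
    return cycles


def throughput(parent_yes: list[bool], b: int) -> float:
    """Packets per cycle, following the definition of throughput in the text."""
    return len(parent_yes) / simulate_cycles(parent_yes, b)
```

In this model, ten packets that are all rejected on layer 2 finish in 5 cycles with `b = 2` (throughput 2.0), while ten packets that all pass layer 2 need 11 cycles, because layer 1 then becomes the bottleneck.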
CONCLUSION AND FUTURE WORK
To achieve power- and throughput-efficient packet processing, we have suggested using an MPC to reconfigure BFs into small-sized BFs and large-sized BFs in a multitiered way, which incurs no memory overhead relative to a PPC. By Lemma 1, in Section 4.1, we showed the necessary steps to build an MPC with the same memory capacity as that of a PPC in Section 4.3. It is observed that the number of fabricated read ports in the BFs' memory, as well as the area cost for an MPC, is reduced with the same memory allocation. Also, we showed that the insert, delete, and query operations are as simple to implement as those of PBFs. Finally, with the same memory capacity as a PPC, our MPC provides the same probability of an f-positive for a UL, while incurring only a minuscule increase in the probability for an SL, as shown in Fig. 9. The simulation with NLANR's IP traces for flow identification shows that an MPC has higher efficiency than a PPC in all traces, by at most 2.0 times in throughput and 4.2 times in power.
The tasks to provide an efficient memory system like banking or pipeline tree in packet processing have been investigated recently [32] , [33] . In future work, we will consider such a memory bank system for greater throughput efficiency in an MPC.
