The Bloom filter is widely used in network packet processing because of its fast lookup and small memory footprint. However, its non-negligible false positive rate and the difficulty of online update still limit wider adoption. In this paper, we propose a cache-based counting Bloom filter architecture, C²BF, which is easy to update online and also supports fast verification for precise matching. We also present a high-speed hardware C²BF architecture with off-chip memory and a fast cache replacement method. The paper makes three contributions: 1) a compressed CBF implementation and its update algorithm; 2) pattern grouping for a higher cache hit rate; 3) an on-chip cache organization and replacement policy. Experiments show that, with the cache design, our C²BF prototype reduces verification processing time by more than 70% compared with traditional schemes without cache.
Introduction and Related Work
Computer and network applications largely involve information representation and lookup. Compressed storage structures and fast query methods can therefore significantly improve processing capability. The Bloom filter is an efficient data representation method and has been widely used in database and network applications. For example, it can be used for network measurement, network security, packet routing or resource routing, web cache sharing and node collaboration in peer-to-peer networks [1]. A Bloom filter uses a number of hash functions to map each input string to entries of a hash array. The input string is considered a pattern if all the mapped entries are equal to one, and it can be safely filtered out if at least one of the entries is zero. However, the matched results include false positives. There are many variations of the regular Bloom filter, such as the Space-code Bloom filter [3] and the Spectral Bloom filter [4]. They improve on the regular Bloom filter by reducing memory consumption or the false positive rate; nevertheless, matching errors are still possible.
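The lookup rule just described can be sketched in a few lines of Python. This is a minimal illustration only: the class name `BloomFilter` and the use of salted SHA-256 to derive the k indexes are assumptions made for the example, not the hash functions used in this paper.

```python
import hashlib


class BloomFilter:
    """Plain one-bit Bloom filter with k hash functions over m entries."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per entry, for readability

    def _indexes(self, item):
        # Assumption: derive k indexes by salting SHA-256 with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True means "possibly a pattern" (false positives are possible);
        # False means the item can be safely filtered out.
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter(m=1024, k=4)
bf.add(b"attack")
print(bf.might_contain(b"attack"))  # an inserted pattern always matches
```

A positive answer from `might_contain` still has to be verified whenever exact matching is required, which is the problem the rest of the paper addresses.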
While most Bloom filter applications tolerate imprecise matching when searching a large data set, some situations require exact matching. For precise matching in routing protocols, H. Song [2] uses an extended Bloom filter to verify whether suspicious matching packets really contain a pattern. False positive analysis is performed by extending the Bloom filter with an additional linked list of signatures. However, that work does not discuss how to handle the increased processing latency of this scheme.
A separate table can be used for precise matching: for each entry with a non-zero counter, it keeps all the patterns mapped to that entry. When a string produces a match by mapping only to non-zero entries, it has to be compared with the entry's pattern list. It is a "false match" if the string does not appear in the pattern list. In fact, we only need to compare against the pattern list of the entry with the smallest counter, rather than the pattern lists of all mapped entries.
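The smallest-counter shortcut can be illustrated with the following sketch. It keeps a Python list of patterns next to each counter; the names `VerifiedCBF` and `indexes`, and the salted SHA-256 hashing, are hypothetical choices for the example rather than the structures used in this paper.

```python
import hashlib


def indexes(item, m, k):
    """Assumed hashing scheme: k salted SHA-256 digests reduced modulo m."""
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + item).digest()[:8], "big") % m
        for i in range(k)
    ]


class VerifiedCBF:
    """Counting Bloom filter whose entries also keep their mapped patterns."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m
        self.lists = [[] for _ in range(m)]

    def insert(self, pattern):
        for idx in indexes(pattern, self.m, self.k):
            self.counters[idx] += 1
            self.lists[idx].append(pattern)

    def lookup(self, text):
        idxs = indexes(text, self.m, self.k)
        if any(self.counters[i] == 0 for i in idxs):
            return "filtered"
        # Only the shortest pattern list needs to be searched: a true
        # pattern is present in the list of every entry it maps to.
        best = min(idxs, key=lambda i: self.counters[i])
        return "match" if text in self.lists[best] else "false match"
```

Choosing the entry with the smallest counter bounds the number of comparisons by the shortest of the k candidate lists.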
However, in high-speed network applications the pattern set is huge and on-chip FPGA memory is normally insufficient, so the pattern table has to be allocated off chip. The drawback is the extra time and energy needed to access it. Hence a block of registers or RAM on the FPGA is used as a cache. The processing engine searches the cache first, and a cache-line holding a number of patterns is brought in on each cache miss. For better cache replacement, consecutive table entries have to be as closely related and as compact as possible. In addition, we exploit the inherent characteristics of patterns to organize the pattern table better and thus achieve a higher cache hit rate.
To be memory, time and energy efficient, this paper makes contributions in three aspects: first, the compressed CBF structure and its update algorithm; second, pattern grouping based on the correlation among patterns in network packets; third, an on-chip cache structure built from a cache index and a cache block, with a suitable replacement policy for a higher cache hit rate. The rest of the paper is organized as follows. Section 2 gives the system overview and Section 3 describes the implementation of CBF array updating. Section 4 presents pattern grouping and cache organization for a higher cache hit rate. The experiments are presented in Section 5. We draw our conclusions in Section 6.
System Overview of CBF with Cache
A Bloom filter (BF) involves three operations: build, lookup and update. A Bloom filter with a one-bit-wide array provides fast lookup, while a counting Bloom filter (CBF) with counters supports the insertion and deletion of patterns. The pattern matching process involves frequent hash table lookups. The traditional BF is therefore used for lookup and the CBF for update. On a hardware platform for high-speed network processing, memory is a limited resource. As a tradeoff, we use an on-chip BF table together with an off-chip CBF table for reference. For precise matching, each counter is associated with a pattern list, similar to [2]. An update of the counter array in the CBF modifies the associated pattern array at the same time.
However, the off-chip access latency is high, about 10 times the FPGA clock cycle. As a compromise, we use a dedicated on-chip area as a cache. Fig. 1 shows an overview of the system structure and the matching verification process in C²BF. The matching verification process queries the pattern list associated with the hashed entries. With the cache, it first searches the cache and returns the pattern list on a cache hit; otherwise, the pattern list is fetched from off-chip memory and a cache-line is refreshed.
Suppose n=5K patterns are mapped to a hash array of m=80K entries by k=8 hash functions, as shown in Fig. 1. Since a pattern is large (10 bytes) and storing k copies of each pattern would be redundant, the associated items such as P21 are stored as offsets into T3; they are 13 bits wide for n=5K. It can be seen that the off-chip CBF array T1 and the pattern table T3 are much larger than the on-chip regular Bloom filter array.
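The arithmetic behind the 13-bit offsets can be checked with a short calculation. The sketch below assumes that 5K means 5000 patterns and compares keeping a full 10-byte copy in each of the k mapped entries against keeping a 13-bit offset per entry plus a single copy of every pattern; the function name `storage_bits` is illustrative.

```python
def storage_bits(n, k, pattern_bytes):
    """Compare k full pattern copies against k offsets plus one stored copy."""
    offset_bits = (n - 1).bit_length()        # bits needed to address n patterns
    full_copies = k * n * pattern_bytes * 8   # each mapped entry keeps the pattern
    offsets = k * n * offset_bits             # each mapped entry keeps an offset
    pattern_table = n * pattern_bytes * 8     # one copy of each pattern
    return offset_bits, full_copies, offsets + pattern_table


offset_bits, naive, compact = storage_bits(n=5000, k=8, pattern_bytes=10)
print(offset_bits, naive, compact)  # prints: 13 3200000 920000
```

Under these assumptions the offset representation needs less than a third of the bits of the naive layout, before counting the counters themselves.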
Fig. 1. System overview and matching verification process
The matched pattern can be reached through a pointer in the corresponding linked list. However, a linked list is not straightforward to implement on an FPGA. Since the CBF is primarily built off-line, a compressed pattern array is used instead. After the CBF table and the linked lists have been built, the associated patterns are placed one by one in an associative pattern list table T2. Fig. 2(a) shows an example: T1 stores a start address next to each counter, with address 0 for entries whose counter is 0, address 1 for the first item in T2, and so on. Each item in T2 stores a tag and a pattern offset into T3. The tag is "1" for the last item belonging to the current counter and "0" otherwise. Since k copies of each pattern are mapped to the CBF, there are k×n = 40K entries in T2.
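The layout of T1, T2 and T3 can be made concrete with a small off-line builder. This is a sketch under stated assumptions: `build_tables` and `pattern_list` are hypothetical helper names, the hash functions are passed in as plain Python callables, T2 addresses are 1-based as in the text (0 is reserved for empty entries), and T3 offsets are 0-based.

```python
def build_tables(patterns, m, hash_fns):
    """Build T1 (counter, start address), T2 (tag, offset) and T3 off-line."""
    t3 = list(patterns)
    per_entry = [[] for _ in range(m)]
    for offset, pattern in enumerate(t3):
        for h in hash_fns:
            per_entry[h(pattern) % m].append(offset)
    t1, t2 = [], []
    for offsets in per_entry:
        if not offsets:
            t1.append((0, 0))                     # counter 0, address 0
            continue
        t1.append((len(offsets), len(t2) + 1))    # 1-based start address in T2
        for j, off in enumerate(offsets):
            tag = 1 if j == len(offsets) - 1 else 0   # 1 marks the last item
            t2.append((tag, off))
    return t1, t2, t3


def pattern_list(t1, t2, t3, entry):
    """Recover the patterns of one CBF entry by walking T2 up to the end tag."""
    counter, start = t1[entry]
    found = []
    pos = start - 1
    while counter > 0:
        tag, off = t2[pos]
        found.append(t3[off])
        if tag == 1:
            break
        pos += 1
    return found
```

With k hash functions and n patterns, `len(t2)` equals k×n, which matches the 40K entries quoted for the running example.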
Update of CBF and pattern array
After the initial setup of T1, T2 and T3 in off-chip memory, update operations modify these tables. First, a CBF deletion finds the hashed entries, decrements their counters by one in T1 and deletes the corresponding pattern items in T2. If the counter is greater than 1, the "1" tag marking the end of the entry may also have to be moved. Fig. 2(b) shows an example of deleting "Pattern_2". In network security applications, deletion normally means removing a recognized characteristic string from the data set, which rarely happens. Since the number of deleted items is not expected to be large, we do not compact T2 or T3 to fill the space left by the k removed copies in T2 or the single removed copy in T3. Instead, we maintain a deletion table T4, which records the invalid entry addresses in T2 and in the pattern table T3. T4 is kept sorted in ascending order and is consulted by the insertion operation.
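One possible reading of this deletion procedure is sketched below. The details are assumptions rather than the paper's exact design: an invalidated T2 slot is marked by setting its offset to `None`, the end tag is moved to the nearest preceding valid item of the same entry, and T4 is modelled as a dictionary holding two sorted address lists. The function name `delete_pattern` is hypothetical.

```python
import bisect


def delete_pattern(t1, t2, t3, t4, offset, entries):
    """Delete the pattern at T3 offset `offset`, which hashes to `entries`.

    t1: list of [counter, start] (start is a 1-based T2 address, 0 if empty)
    t2: list of [tag, offset]; offset None marks an invalidated slot
    t4: {"t2": [...], "t3": [...]} sorted lists of freed addresses
    """
    for e in entries:
        counter, start = t1[e]
        if counter == 0:
            continue
        pos, last_valid = start - 1, None
        while True:
            tag, off = t2[pos]
            if off == offset:
                t2[pos][1] = None                   # invalidate, do not compact
                bisect.insort(t4["t2"], pos + 1)    # remember the freed address
                if tag == 1 and last_valid is not None:
                    t2[pos][0] = 0                  # move the end tag back
                    t2[last_valid][0] = 1
                break
            if off is not None:
                last_valid = pos
            if tag == 1:
                break
            pos += 1
        t1[e][0] = counter - 1
        if t1[e][0] == 0:
            t1[e][1] = 0                            # entry is empty again
    t3[offset] = None
    bisect.insort(t4["t3"], offset)
```

Keeping both lists in T4 sorted means a later insertion can pick a freed slot without scanning T2 or T3.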
Second, a CBF insertion is triggered when a new pattern P_new is added. If T4 is not empty, the insertion takes from T4 the smallest freed pattern number N_min and stores P_new in entry N_min of T3. Otherwise, P_new is appended at the end of T3. P_new is mapped to k entries, whose counters are incremented by one in the CBF table. Fig. 3(a) shows an example of inserting P_new into entry 2. The tag of P22 is changed to "0", and the slot that originally held P22 now points to the new position of P22, which is followed by the inserted pattern P_new. However, this indirection lengthens pattern lookup. Another method is to divide the whole table into multiple sub-tables and to reserve a number of empty items for new patterns between two sub-tables. Fig. 3(b) shows a sub-table for entries 1 to 50. When a new pattern P23 is inserted into entry 2, all the items between the start address "3" of entry 3 and the reserved position are moved backward by one item to leave room for P23. This move can be performed in parallel in one clock cycle. Moreover, the T2 addresses of the affected non-zero entries within 1-50 in T1 need to be incremented by one.
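The sub-table variant can be mimicked in software as follows. This is a sketch only: `insert_into_subtable` is a hypothetical name, reserved slots are modelled as `None` at the tail of the sub-table, the entry is assumed to hold at least one pattern already, and the one-cycle parallel shift of the hardware is replaced by a sequential list insert. Only entries whose items actually move get their start address incremented, which is one interpretation of the address update described above.

```python
def insert_into_subtable(t1, t2, entry, new_offset):
    """Insert a T3 offset at the end of `entry`'s run inside one sub-table.

    t1: list of [counter, start] with 1-based start addresses into t2
    t2: list of [tag, offset] items followed by reserved None slots
    """
    if not t2 or t2[-1] is not None:
        raise ValueError("no reserved slot left in this sub-table")
    counter, start = t1[entry]
    if counter == 0:
        raise ValueError("sketch only covers entries that already hold a pattern")
    end = start - 1
    while t2[end][0] != 1:          # find the current last item of the run
        end += 1
    t2[end][0] = 0                  # old last item is no longer the end
    t2.insert(end + 1, [1, new_offset])   # later items shift by one position
    t2.pop()                        # one reserved slot has been consumed
    t1[entry][0] = counter + 1
    for item in t1:
        if item[0] > 0 and item[1] > end + 1:
            item[1] += 1            # runs behind the insertion point moved
```

The number of reserved slots per sub-table bounds how many insertions can be absorbed before the table has to be rebuilt off-line.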
Pattern grouping and cache design
The cache mechanism takes advantage of the pattern locality observed in matched network packets. Some patterns have similar functionality. Therefore, instead of using a randomly ordered pattern array, we first divide the patterns into groups. Taking the Snort rule set as an example, it consists of categories including ftp, dns, dos, P2P and Trojan Horse. Consecutive sequence numbers are assigned to patterns that belong to one group. The group size depends on the unit of cache replacement. In addition to considering pattern categories before building the pattern table, we can also train the table with real network traffic. Matched packets mostly belong to malicious traffic. Over a short period, malicious network flows are likely to be related and may attack repeatedly. This characteristic creates opportunities for caching. During CBF lookup over a time period, a history table records all the matched patterns one by one. This table can be analyzed off-line to discover potential correlations for better pattern grouping.
Without the cache mechanism, the verification process works as follows. With k hash functions, when a text produces a match in the Bloom filter, it maps to k entries in T1, and the one with the minimum counter value is chosen. The process then looks up T2 to locate the pattern numbers associated with this address. Next, it looks up T3 with each pattern number and returns the pattern to be compared. The process stops when a returned pattern is the same as the text, which means a true positive, or when all of the patterns have been brought in without a match, which means a false positive. The maximum possible delay is 2×max_counter×τ_off-chip. However, τ_off-chip is much larger than the FPGA clock cycle τ, so we want to decrease the verification delay with a proper cache design.
Fig. 4 shows the cache index and cache block structure. The index table indicates whether at least one of the patterns associated with an entry in T1 has been brought into the cache. An index entry includes a full tag, the counter address in T1, the counter value and links to the addresses in the cache block of its associated patterns. By comparing the counter value with the number of attached pattern links, the tag indicates whether all of this counter's associated patterns are in the cache: 1 for full and 0 for non-full. For simpler search, the cache index is organized so that the T1 addresses are in ascending order. The cache-line is the basic replacement unit in the cache block. T1 and T2 are randomly distributed because of the hashing, while T3 can be carefully organized using pattern groups. Accordingly, a cache-line consists of a pattern group, its access times and its age. The latter two fields are used for cache replacement.
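The uncached verification walk and its worst-case cost can be sketched as follows. The function name `verify_without_cache` is hypothetical; the table layout assumed here is T1 as (counter, 1-based T2 start address) pairs, T2 as (tag, T3 offset) pairs and T3 as a list of patterns. Only T2 and T3 reads are counted, which is the accounting that yields the 2×max_counter bound quoted above.

```python
def verify_without_cache(text, entries, t1, t2, t3):
    """Return (is_true_positive, off_chip_accesses) for a Bloom filter match.

    `entries` are the k T1 addresses the text maps to; all of them are
    assumed to have non-zero counters, since the Bloom filter matched.
    """
    best = min(entries, key=lambda e: t1[e][0])   # entry with smallest counter
    _, start = t1[best]
    accesses = 0
    pos = start - 1
    while True:
        tag, off = t2[pos]
        accesses += 1              # one off-chip read of T2
        pattern = t3[off]
        accesses += 1              # one off-chip read of T3
        if pattern == text:
            return True, accesses  # true positive
        if tag == 1:
            return False, accesses # list exhausted: false positive
        pos += 1


t1 = [(0, 0), (2, 1), (1, 3), (1, 4)]
t2 = [(0, 0), (1, 1), (1, 0), (1, 1)]
t3 = [b"ab", b"abc"]
print(verify_without_cache(b"abc", [3, 1], t1, t2, t3))  # prints: (True, 2)
```

A false positive on an entry with counter c costs 2×c accesses in this model, so the cache only needs to hide these T2 and T3 reads to remove most of the delay.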
When a string is preliminarily matched, the mapped addresses in the simple Bloom filter hash array are also the addresses in T1. We first check whether any of them exists in the cache index. If none of them appears, a cache miss occurs and off-chip memory is looked up. If one of the addresses has a full tag, all its linked patterns in the cache block are searched and compared in parallel. At the same time, the access times of their groups are increased by one. If none of the addresses found in the cache index has a full tag, all of the linked patterns are still compared; however, if none of them equals the input string, the lookup still results in a cache miss.
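The decision logic of the cache lookup can be summarized in the following sketch. The class and function names (`IndexEntry`, `CacheLine`, `cache_lookup`) and the three result strings are assumptions for illustration; the parallel comparison of the hardware is replaced by a sequential scan, and the access times are also incremented in the non-full case, which the text does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    counter: int                                  # counter value copied from T1
    links: list = field(default_factory=list)     # (cache-line id, slot) pairs

    @property
    def full(self):
        # Full tag: every pattern associated with this counter is cached.
        return len(self.links) >= self.counter


@dataclass
class CacheLine:
    patterns: list
    access_times: int = 0
    age: int = 0


def cache_lookup(text, addresses, index, lines):
    """Return 'hit-match', 'hit-false-positive' or 'miss'."""
    present = [a for a in addresses if a in index]
    if not present:
        return "miss"
    full = [a for a in present if index[a].full]
    to_check = full[:1] if full else present      # one full entry is enough
    for a in to_check:
        links = index[a].links
        for line_id in {line_id for line_id, _ in links}:
            lines[line_id].access_times += 1      # count one access per group
        if any(lines[l].patterns[s] == text for l, s in links):
            return "hit-match"
    # A full entry proves the text is not a pattern; a partial one does not.
    return "hit-false-positive" if full else "miss"
```

Only a full entry can settle a false positive on chip, which is why the full tag is kept in the index.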
On a cache miss, when a pattern is brought on chip for comparison with the input string, its pattern group is also brought into the cache and written to a new cache-line. As shown in Fig. 4, the pattern is associated with k mapped counter addresses in T1. The update process includes a cache index update and a cache block update. For the cache index update, if Add1 of Pattern 1 is already in the index, the link to the address of Pattern 1 in the cache block is attached to the Add1 entry, whose tag should also be updated. Otherwise, a new entry is inserted into the cache index. For the cache block update, the pattern group is brought into a new cache-line if one is available. Otherwise, an existing cache-line has to be replaced with the new pattern group. Considering the similarity of short-term network traffic, we use the Least Recently Used (LRU) principle as the cache replacement policy. In particular, it can choose the cache-line with the smallest access times. As shown in Fig. 4, the access times field needs to be reset periodically, since network locality only holds for short-term traffic; otherwise, a cache-line that was frequently accessed early on would never be replaced. If more than one cache-line has the smallest access times, the one with the largest age is chosen. As shown in Fig. 4, the age field is incremented periodically to indicate how long the cache-line has been in the cache. The LRU policy can also be implemented in another way, with age priority: it first chooses the cache-lines with the largest age and then, among them, the one with the smallest access times.
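The two victim-selection orders can be written compactly as sort keys. The sketch below is illustrative: `CacheLine`, `choose_victim` and `tick` are hypothetical names, and the periods at which the access times are reset and the age is incremented are left to the caller because the text does not fix them.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    group: int            # pattern group currently held by this cache-line
    access_times: int = 0
    age: int = 0


def choose_victim(lines, priority="access"):
    """Return the index of the cache-line to replace.

    priority="access": smallest access times first, ties broken by largest age.
    priority="age":    largest age first, ties broken by smallest access times.
    """
    if priority == "access":
        key = lambda i: (lines[i].access_times, -lines[i].age)
    else:
        key = lambda i: (-lines[i].age, lines[i].access_times)
    return min(range(len(lines)), key=key)


def tick(lines, reset_access=False):
    """Periodic maintenance: age every line, optionally reset access times."""
    for line in lines:
        line.age += 1
        if reset_access:
            line.access_times = 0


lines = [CacheLine(0, 3, 5), CacheLine(1, 1, 2), CacheLine(2, 1, 4)]
print(choose_victim(lines, "access"), choose_victim(lines, "age"))  # prints: 2 0
```

The example shows how the two orders can disagree: the oldest line is also the most frequently accessed one, so only the age-priority order evicts it.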
Analysis and experiment evaluation
We first analyze the average length of pattern list associated with each counter. A Bloom filter uses k independent hash functions to map each input string to an array of m bits, which is initially trained with n patterns. The false positive rate f is given as

$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$

and parameter k that minimizes f is k_opt = (m/n) ln 2 [1]. The probability that a counter equals i in its corresponding CBF is shown in (1).

$$P(c = i) = \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk - i} \qquad (1)$$
Song et al. [2] show that, theoretically, counters of value 1 account for more than 99% of the non-zero counters. In that case, each matching verification requires only one access to the CBF arrays T1, T2 or T3.
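The standard Bloom filter expressions for the false positive rate and the binomial counter distribution can be evaluated numerically. The sketch below assumes the usual exponential approximation for f and takes 5K, 80K and 100K to mean 5000, 80000 and 100000; the function names are illustrative.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive rate (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def counter_probability(i, m, n, k):
    """Binomial probability that a given CBF counter equals i."""
    return math.comb(n * k, i) * (1.0 / m) ** i * (1.0 - 1.0 / m) ** (n * k - i)


def optimal_k(m, n):
    """Number of hash functions minimizing f: (m/n) ln 2."""
    return (m / n) * math.log(2)


for m in (80_000, 100_000):
    print(m, f"{false_positive_rate(m, 5_000, 8):.1e}")
```

For n = 5000 and k = 8 this gives roughly 5.7×10^-4 at m = 80000 and 1.4×10^-4 at m = 100000.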
Targeting hardware implementation on FPGA, we write our C²BF system in Verilog HDL and run the simulation in ModelSim. For a compact off-chip memory design, three hash functions described in [5], namely H3, BIT and XOR, are compared, so that we can choose the hash function that generates smaller counter values, both on average and at maximum. As in the example of Section 2, the number of hash functions is k=8, the number of patterns n is 5K, and the number of entries in the hash table m is m1=80K or m2=100K, which gives a false positive rate of f1 = 5.7×10^-4 or f2 = 1.4×10^-4, respectively. The minimum required off-chip memory and the counter distribution of the three hash functions are shown in Table 1. The average value of the non-zero counters is slightly more than 1, so match verification requires about one access to off-chip memory. Comparatively, the H3 hash function yields a better distribution of the counters in T1.
To evaluate our cache design, we compare the total processing time of the Bloom filter for precise matching with and without the cache mechanism. We use 40 Kb on the FPGA as the cache block, which holds at most 500 patterns (500×10×8 = 40 Kb). The size of a cache-line is tied to that of a pattern group: during cache replacement, the patterns of one group are brought into one cache-line. The cache hit rate can be increased because traffic flows of the same kind appear close together. Suppose that the average cache hit rate is around 50%, the off-chip access time is 10 times the FPGA clock cycle, and the cache block contains 50 cache-lines, each of which holds 10 patterns. To test our system under different traffic conditions, we simulate network packets with different true positive rates, that is, pattern matching probabilities, of 0.1%, 1%, 10% and 50%. Table 2 shows the processing time comparison for precise matching with and without the cache design.
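To get a feeling for how the hit rate and the 10× off-chip latency interact, a deliberately simplified timing model can be evaluated. Everything in it is an assumption made for illustration and not the measured setup behind Table 2: a cache hit costs one on-chip cycle, a miss costs that cycle plus two off-chip reads (one each for T2 and T3), and the uncached baseline costs the two off-chip reads alone.

```python
def expected_cycles(hit_rate, off_chip_reads=2, off_chip_cost=10, on_chip_cost=1):
    """Average verification time in FPGA cycles under the simplified model."""
    miss_cost = on_chip_cost + off_chip_reads * off_chip_cost
    return hit_rate * on_chip_cost + (1.0 - hit_rate) * miss_cost


def reduction(hit_rate, off_chip_reads=2, off_chip_cost=10, on_chip_cost=1):
    """Fraction of verification time saved relative to the uncached baseline."""
    baseline = off_chip_reads * off_chip_cost
    cached = expected_cycles(hit_rate, off_chip_reads, off_chip_cost, on_chip_cost)
    return 1.0 - cached / baseline


for h in (0.5, 0.8):
    print(h, round(reduction(h), 2))
```

In this model the saving grows linearly with the hit rate, so the choice of pattern grouping and replacement policy, which determine the hit rate, dominates the achievable reduction.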
For a fixed cache size, a larger cache-line means fewer cache-lines: more patterns are brought in on each replacement, but fewer pattern groups reside in the cache. Three schemes are compared: 25 cache-lines with 20 patterns each as Scheme 1, 50 cache-lines with 10 patterns each as Scheme 2, and 100 cache-lines with 5 patterns each as Scheme 3. The cache hit ratio also depends on the traffic relativity of consecutively matched packets. If a pattern of Group n is matched, the probability that one of the next few matched patterns also belongs to Group n is defined as the traffic relativity. We also compare the cache hit ratio of the two LRU replacement policies, with access times priority and with age priority. Table 3 compares the cache hit rates and the reduction ratios of verification time, where the reduction is measured relative to the traditional scheme without cache. The cache hit rate can reach more than 80%, and the cache design reduces verification processing time by more than 70% compared with the traditional scheme without cache. In addition, the LRU replacement policy with access times priority achieves a higher cache hit rate and is more suitable for network applications.
Conclusions
The traditional Bloom filter has two main drawbacks: first, it does not support online update; second, it produces false positives. Accordingly, we design and implement a cache-based CBF system on FPGA for higher performance. To reduce the number of off-chip memory accesses, we design a compressed CBF array and use pattern grouping for cache replacement. Considering pattern relativity and traffic locality, patterns are categorized into groups. For a potential match, the system first checks the cache index to see whether the hash entries exist in the cache; if not, it searches off-chip memory and replaces a cache-line with a pattern group. Experiments show that C²BF significantly reduces match verification time. Several cache schemes and cache replacement policies are also compared under different traffic conditions. The cache hit rate can reach more than 80%, which reduces matching verification time by more than 70% compared with traditional schemes without cache.
