With the wide adoption of the Internet in everyday life, Internet security has become an important issue. Intrusion detection at the network level is an effective way of stopping malicious attacks at the source and preventing viruses and worms from spreading widely. The key component of a successful network intrusion detection system is a high-performance pattern matching engine that can uncover malicious activities in real time. In this paper, we propose a highly parallel, scalable, hardware-based network intrusion detection system that handles variable pattern lengths efficiently and effectively. Pattern matching is completed in O(log M) time, where M is the longest pattern length. The design is implemented on a standard off-the-shelf FPGA. Comparison with other techniques shows promising results.
Introduction
A Network Intrusion Detection System (NIDS) performs packet inspection to identify, prevent, and inhibit malicious attacks over the Internet. It can effectively stop viruses, worms, and spam from spreading widely. Pattern matching is the key component of a network intrusion detection system. The goal of this paper is to design and implement a parallel, high-performance pattern matching engine for network intrusion detection on a modern reconfigurable platform such as an FPGA.
Traditionally, network intrusion detection systems have been implemented entirely in software. Snort [20] is a well-known open-source software network intrusion detection system. It matches a pattern database against each packet to identify malicious target connections. With the rapid growth of both the pattern database and network bandwidth, a software-only solution cannot process Internet traffic at full network link speed. A natural approach is to move the computation-intensive pattern matching to hardware. The main idea is to use specialized hardware resources along with a conventional processor. In this way, the conventional CPU handles the general-purpose computing tasks, and the specialized co-processor handles string pattern matching, where the parallelism and regularity of the computation can be exploited by custom hardware.
Extensive research exists on general pattern matching algorithms. The Boyer-Moore algorithm [6] is widely used for its efficiency in single-pattern matching problems. However, the current implementation of Boyer-Moore in Snort is not efficient at seeking multiple patterns in a given payload [3]. Aho and Corasick [1] proposed an algorithm for concurrently matching multiple strings. Their algorithm uses the structure of a finite automaton that accepts all strings in the set. Two implementations of the Aho-Corasick algorithm have been done for Snort, by Mike Fisk [10] and Marc Norton [17], respectively. Fisk and Varghese [11] presented a multiple-pattern search algorithm that combines the one-pass approach of Aho-Corasick with the skipping feature of Boyer-Moore, as optimized for the average case by Horspool. The work by Tuck et al. [22] takes a different approach to optimizing Aho-Corasick, using bitmap compression and path compression to reduce the amount of memory needed. All these approaches were developed mainly for software implementation. To examine packets in real time at full network link speed, a hardware solution is preferable.
There are two main groups of hardware solutions for fast pattern matching. The first group generally applies finite state machines (FSMs) to process patterns in sets. Aldwairi et al. [2] designed a memory-based accelerator based on the Aho-Corasick algorithm. In their work, rules are divided into smaller sets that generate separate FSMs, which can run in parallel. This significantly reduces the size of the state tables and increases the throughput. Liu et al. [15] designed and implemented a fast string matching algorithm on a network processor. Baker and Prasanna [4] proposed a pipelined, buffered implementation of the Knuth-Morris-Pratt algorithm [13] on an FPGA. Li et al. [14] implemented rule clustering for fast pattern matching based on [9] on an FPGA platform; the pattern size is limited by the available hardware resources. Tan and Sherwood [21] designed a special-purpose architecture that works in conjunction with string matching algorithms optimized for that architecture. Performance improvement in this first group is generally achieved by dividing patterns into smaller sets and deeply pipelining the pattern matching process. However, these approaches all fall short in scalability: when the pattern database grows exponentially, they suffer from extensive resource consumption and cannot maintain the same level of performance. Deep pipelining also has the side effect of increased latency, which is detrimental to some Internet traffic.
The second group of hardware solutions uses hash tables as the foundation for pattern matching. Dharmapurikar et al. [8] used Bloom filters [5] to perform string matching. The strings are compressed by calculating multiple hash functions over each string. The compressed set of strings is then stored in a small memory, which is queried to find out whether a given string belongs to the set. If a string is found to be a member of the Bloom filter, it is declared a possible match and a hash [...]

We use the example in Figure 1 to illustrate our idea. In this example, there are only two patterns to be matched, "A TEST" and "TEST IS". The maximum pattern length is M = 7, hence k = ⌊log2 7⌋ = 2. We slice the patterns into substrings of length 2^2, 2^1, and 2^0. A pattern substring table is built for each substring length. In this example, we show the exact value of each substring for illustration purposes. In reality, these tables are hash tables for fast matching. An example input string is also shown in Figure 1. Using our approach, the matching is done in 3 steps. In step 1, all substrings of length 4 of the input string are matched against the pattern substring table for length 4. In step 2, all substrings of length 2 of the input string are matched against the pattern substring table for length 2. In step 3, all substrings of length 1 of the input string are matched against the pattern substring table for length 1. A match of pattern "A TEST" is declared when both substring "A TE" and substring "ST" are matched during the process. The details of the matching process and the data structures used are presented in the rest of this paper.
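The slicing step in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the hardware design: the function name `slice_pattern` is hypothetical, and the sketch assumes the substrings are taken longest first, following the set bits of the pattern length.

```python
from __future__ import annotations


def slice_pattern(pattern: str) -> list[tuple[int, str]]:
    """Slice a pattern into power-of-two substrings, longest first.

    The slice lengths are the set bits of len(pattern), so a 6-byte
    pattern (binary 110) yields one 4-byte and one 2-byte substring.
    """
    pieces = []
    offset = 0
    length = len(pattern)
    # Walk the bits of the length from most significant to least.
    for bit in range(length.bit_length() - 1, -1, -1):
        size = 1 << bit
        if length & size:
            pieces.append((size, pattern[offset:offset + size]))
            offset += size
    return pieces


print(slice_pattern("A TEST"))   # [(4, 'A TE'), (2, 'ST')]
print(slice_pattern("TEST IS"))  # [(4, 'TEST'), (2, ' I'), (1, 'S')]
```

The two printed lists correspond to the entries that would populate the length-4, length-2, and length-1 pattern substring tables of the example.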
In this paper, we use M to represent the maximum pattern length; example values of M are 256 or 512. We use N to represent the number of patterns. A typical N would be 2k, which can accommodate the current Snort rule set [20]. The main research contributions of this paper are:
- Handles variable pattern lengths efficiently and effectively while using hash tables.
- Finishes matching in O(log M) steps, where M is the maximum pattern length.
- Scales well: pattern matching performance is not affected by the growth of the pattern database.
The remainder of this paper is organized as follows. The architecture of our technique is shown in Section 2. The concepts and data structures used in our approach are introduced in Section 3. Section 4 presents the algorithms. Section 5 presents the implementation on a reconfigurable platform. Section 6 presents the concluding remarks.
Architecture
The block diagram of our proposed architecture is shown in Figure 2. The core of this architecture is an array of Processing Elements (PEs). The number of PEs equals the size of the input string S. A PE matches one substring of the input against all pattern substrings of the same length. The input string is processed in rounds of different substring lengths: each PE first processes a 2^k-byte substring of the input string, then a 2^(k−1)-byte substring, and so on. The design diagram of a PE is shown in Figure 3. The inputs of a PE are a substring and a substring select signal that determines the length of the substring to be worked on. First, the input substring is passed to the hash function block to obtain a hash value. This hash value is used for a hash table lookup. The result of the hash table lookup is passed to the Match Logic block to determine whether there is a match. The design of each PE is kept simple. Duplicated hardware is used for the Match Logic block to increase performance.
Basic Concepts and Data Structures
In this section, we introduce the basic concepts which will be used in the later sections. The data structures used in our approaches are also presented in this section.
Let us first define the problem that we are trying to solve. Assume a packet carries a string S of length L, and we know a set of N patterns, p[1], p[2], ..., p[N]. The goal of a Network Intrusion Detection System (NIDS) is to determine whether any pattern p[i] exactly matches a substring of S. Let M be the maximum pattern length, and let k = ⌊log2 M⌋. The main idea of our approach is to slice each pattern into substrings of length 2^i, where 0 ≤ i ≤ k. The input data string S is read in as a whole and processed in rounds of different substring lengths. First all substrings of length 2^k are processed, then all substrings of length 2^(k−1), and so on down to length 2^0. The whole matching is therefore completed in k + 1 rounds, that is, in O(log M) steps. After finding a match of a substring, we first check whether all the previous substrings in the pattern have been matched. If so, a partial match is identified. We then check whether this is the last substring in the partially matched pattern. If so, a potential exact match is declared, and a red flag is raised by the network intrusion detection system and processed accordingly by the host system.
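The slicing rule can be stated compactly. In the sketch below, the symbols \ell (the length of one pattern) and b_i (the bits of its binary representation) are introduced here for illustration only and do not appear elsewhere in the paper.

```latex
k = \lfloor \log_2 M \rfloor, \qquad
\ell = \sum_{i=0}^{k} b_i \, 2^{i}, \quad b_i \in \{0, 1\}, \quad 1 \le \ell \le M
```

Under this reading, a pattern of length \ell contributes one substring of length 2^i for every b_i = 1. For instance, the 6-byte pattern "A TEST" has 6 = 2^2 + 2^1 and is sliced into one 4-byte and one 2-byte substring. Since i ranges over 0, ..., k, the number of rounds is k + 1 = \lfloor \log_2 M \rfloor + 1.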
Three sets of data structures are used in our approach, and we introduce them one by one. The first data structure is the Pattern Length table. It is an array, indexed by pattern ID, that stores the length of each pattern. The binary representation of each pattern length shows which substrings the pattern will be decomposed into. An example is shown in Figure 4. In this example, the first pattern, with pattern ID 1 and length 33, is sliced into a substring of length 32 and a substring of length 1, as indicated by its binary representation in Figure 4.

The second data structure is a set of hash tables that store the pre-processed information for each substring of each pattern. For pattern substrings of length 1, since there can only be 256 values, no hashing is done. Instead, a table of 256 entries, HASH 0, is created. Each entry contains three elements: the value of the entry, the starting pattern ID, and the number of consecutive patterns, counted from the starting pattern ID, that share this value. An example is shown in Figure 5. In this example, there are three patterns with value "a" as the last byte. Hence, in the HASH 0 table, there is an entry with value "a", starting pattern ID 100, and number of consecutive patterns 3.

For substrings longer than one byte, a HASH i table is used, as shown in Figure 6(a). There are five columns in each hash table. Extra columns are used to handle hashing collisions. There are two sources of potential hashing collisions in our scheme. First, different substrings could be hashed to the same hash value. Second, different patterns could have the same substring. For example, pattern "hell" and pattern "hello" share the 4-byte substring "hell". To handle hashing collisions efficiently, for each hash value we reserve two slots for pattern IDs, in column two and column three respectively. These two pattern IDs are read in the same clock cycle and processed by hardware simultaneously. When more than two substrings hash to the same value, a separate table called the Sup Table is used to record the additional entries. The Sup Table is also shown in Figure 6(a). Column four of the HASH i table points to the starting Sup Table index, and column five gives the number of consecutive entries in the Sup Table that have the same hash value. In the example shown in Figure 6 [...]

The third data structure is the Match Table, a three-dimensional bit array whose length equals the input string length L, whose width equals the number of patterns N, and whose height equals the number of different substring lengths, k + 1. This table records the substring matches found, which are in turn used to determine whole-pattern matches. For each substring match, a "1" is recorded using the substring length, the matched pattern ID, and the position of the substring in the input string S. An example is shown in Figure 6.
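As a rough software analogue of two of these structures, the sketch below builds a HASH 0 style table and an empty Match Table. The function names, the dictionary representation, the choice to number only odd-length patterns, and the index order of the bit array are all assumptions made for illustration; the paper does not specify how IDs are assigned to even-length patterns.

```python
from __future__ import annotations

from itertools import groupby


def build_hash0(patterns: list[str], first_id: int = 1):
    """Build a HASH 0 style table for the odd-length patterns.

    Only odd-length patterns end in a 1-byte substring. They are sorted
    by last byte so that patterns sharing a last byte receive consecutive
    IDs. Returns (table, ids): table maps a last-byte value to
    (starting pattern ID, number of consecutive patterns).
    """
    odd = sorted((p for p in patterns if len(p) % 2 == 1),
                 key=lambda p: p[-1])
    ids = {p: first_id + n for n, p in enumerate(odd)}
    table = {}
    for last_byte, group in groupby(odd, key=lambda p: p[-1]):
        members = list(group)
        table[last_byte] = (ids[members[0]], len(members))
    return table, ids


def new_match_table(num_lengths: int, num_patterns: int, input_len: int):
    """All-zero bit array indexed as [length index][pattern][position]."""
    return [[[0] * input_len for _ in range(num_patterns)]
            for _ in range(num_lengths)]


table, ids = build_hash0(["alpha", "gamma", "sigma", "abc", "beta"],
                         first_id=100)
print(table)  # {'a': (100, 3), 'c': (103, 1)}
```

With three patterns ending in "a" and a starting ID of 100, the entry for "a" mirrors the Figure 5 example: starting pattern ID 100 and three consecutive patterns.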
Algorithms
In this section, the algorithms of our approach are presented. An example is given at the end of this section to show how the algorithms work. There are two main algorithms in our approach. Algorithm Init Matching shown in Algorithm 4.1 handles the initialization of all the necessary data structures. The second algorithm Pattern Matching shown in Algorithm 4.2 processes the input strings for potential matchings.
In algorithm Init Matching, all the odd-length patterns are first sorted by the value of their last byte. This is necessary for building the HASH 0 lookup table. Afterward, we hash each substring of each pattern and store the pattern ID accordingly. Each row of a HASH i table has two slots for pattern IDs. We first try to store the pattern ID for a particular hash value in one of these two slots. If both slots are occupied, we place the pattern ID in the Sup Table and update the last two columns of the HASH i table accordingly. The last step of the Init Matching algorithm populates the HASH 0 table with the sorted pattern information. Adding or removing a pattern can be handled in a fashion similar to Algorithm 4.1.
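The two-slot-plus-overflow layout can be sketched as follows. This is a simplified software model under stated assumptions, not Algorithm 4.1 itself: `HashRow`, `build_hash_table`, and the toy hash function are hypothetical names, the hash value serves as the dictionary key, and all IDs for a hash value are collected before being laid out so that the overflow entries sit consecutively in the Sup Table.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class HashRow:
    id_a: Optional[int] = None  # column 2: first pattern ID slot
    id_b: Optional[int] = None  # column 3: second pattern ID slot
    sup_start: int = 0          # column 4: first index in the Sup Table
    sup_count: int = 0          # column 5: consecutive Sup Table entries


def build_hash_table(substrings, hash_fn):
    """Build one HASH i style table plus its Sup Table.

    substrings: iterable of (pattern_id, substring), all of one length.
    """
    buckets = defaultdict(list)
    for pattern_id, sub in substrings:
        buckets[hash_fn(sub)].append(pattern_id)

    table, sup = {}, []
    for value, pattern_ids in buckets.items():
        row = HashRow(id_a=pattern_ids[0])
        if len(pattern_ids) > 1:
            row.id_b = pattern_ids[1]
        overflow = pattern_ids[2:]
        if overflow:
            # Overflow IDs for one hash value are stored contiguously.
            row.sup_start = len(sup)
            row.sup_count = len(overflow)
            sup.extend(overflow)
        table[value] = row
    return table, sup


def toy_hash(sub: str) -> int:
    """Stand-in hash for the demo only (not the paper's hash)."""
    return ord(sub[-1]) % 8


entries = [(1, "hell"), (2, "hell"), (3, "hell"), (4, "help"), (5, "hell")]
table, sup = build_hash_table(entries, toy_hash)
print(table[toy_hash("hell")], sup)
```

In the demo, four patterns share the substring "hell": two IDs fit in the row's slots and the remaining two go to the Sup Table, with columns four and five recording where they start and how many there are.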
The main algorithm that processes each input string for potential matching patterns is algorithm Pattern Matching. Two functions used in Algorithm 4.2 deserve mention: Pre Substring(pl,i) and Post Substring(pl,i), where pl is the pattern length and i is the current substring length. These two functions determine whether there are other substrings in the current pattern. If there are substrings before the current substring of length i in a pattern of length pl, Pre Substring(pl,i) returns the length of the previous substring. Otherwise, Pre Substring(pl,i) returns "0". Post Substring(pl,i) returns "1" if there is any substring after the current substring of length i, and returns "0" if the current substring is the last substring of the pattern. In the Pattern Matching algorithm, for each substring length and each substring, we first run the hash function to obtain a hash value. The hash value is used to look up the corresponding hash table. If matches are found in the hash table, then for each matched pattern ID we examine its previous and following substrings. If there is no previous substring, or if there is a previous substring and it has also been matched to the same pattern, we mark "1" in the Match Table for this input substring, at this substring length and for this matched pattern. After we mark "1" in the Match Table, [...]

Hashing is used heavily in our approach. Implementing hashing in hardware is relatively inexpensive. A class of universal hash functions called H3, described in [18], was found to be suitable for hardware implementation. Our implementation of the hash function falls into this class.
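The two helper functions and an H3-style hash can be sketched in Python as below. The bit-manipulation formulation is one possible reading of the textual definitions, assuming substrings are ordered longest first. The hash follows the usual H3 construction (XOR of random rows selected by the key bits); the key width, output width, and seed are illustrative assumptions, not the parameters of the actual implementation.

```python
import random


def pre_substring(pl: int, i: int) -> int:
    """Length of the substring just before the length-i one, or 0."""
    higher = pl & ~(2 * i - 1)  # set bits of pl above the bit for i
    return higher & -higher     # lowest of those bits (0 if none)


def post_substring(pl: int, i: int) -> int:
    """1 if a shorter substring follows the length-i one, else 0."""
    return 1 if pl & (i - 1) else 0


def make_h3(key_bits: int, out_bits: int, seed: int = 0):
    """Return an H3-style hash: XOR of random rows chosen by key bits."""
    rng = random.Random(seed)
    rows = [rng.getrandbits(out_bits) for _ in range(key_bits)]

    def h3(key: int) -> int:
        value = 0
        for bit, row in enumerate(rows):
            if (key >> bit) & 1:
                value ^= row
        return value

    return h3


print(pre_substring(6, 2), post_substring(6, 2))  # 4 0
print(pre_substring(7, 1), post_substring(7, 4))  # 2 1
```

For the 6-byte pattern "A TEST", the 2-byte substring is preceded by the 4-byte one and followed by nothing, which is why a match on "ST" can complete the pattern. The XOR-only structure of the hash is what makes this class cheap to build from gates.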
An example of the matching process is shown in Figure 7. This is a continuation of the simple example shown in the introduction section. The detail of the Match Table is shown in Figure 7. In this example, there is one input string S of 14 bytes, two patterns to be matched, and three different substring lengths. During the first round, where we match all substrings of length 4, there are two matches, one for "A TE" and one for "TEST". Both matches lead to a "1" being marked in the Match Table. In the second round, we match all substrings of length 2. Substring "ST" is matched, and since the previous substring of the same pattern is also marked as matched to the same pattern, a "1" is marked for the "ST" substring match. Since substring "ST" has no substring after it, a match is declared. A substring match of " I" is also found. However, since the previous substring is not marked as matched, we do not mark a "1" in the Match Table for the location where " I" is matched.

Table 1. Comparison of throughput, unit size, and performance with other designs.

Design               Throughput    Unit size   Performance
[4]                  2.4 Gb/s      120         20.0
Los Alamos [12]      2.2 Gb/s      243         9.1
Wash U. - DFA [16]   0.952 Gb/s    260         3.7
Wash U. - Bloom [8]  0.8 Gb/s      0.76        1058
UCLA [7]             2.88 Gb/s     160         18.0
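The two-pattern example traced in this section can be reproduced with a small software model. It is a sequential sketch of the round structure, not the parallel hardware: plain dictionaries keyed by substring stand in for the hash tables, a set stands in for the Match Table, and the function names are hypothetical. Because Figure 1 is not reproduced here, the 14-byte input "THIS IS A TEST" is an assumption, chosen because it yields the matches described above.

```python
from __future__ import annotations


def slices(pattern: str) -> list[tuple[int, str]]:
    """Power-of-two slices of a pattern, longest first."""
    out, offset = [], 0
    for bit in range(len(pattern).bit_length() - 1, -1, -1):
        size = 1 << bit
        if len(pattern) & size:
            out.append((size, pattern[offset:offset + size]))
            offset += size
    return out


def pre_substring(pl: int, i: int) -> int:
    higher = pl & ~(2 * i - 1)
    return higher & -higher


def post_substring(pl: int, i: int) -> int:
    return 1 if pl & (i - 1) else 0


def match_patterns(text: str, patterns: list[str]) -> list[tuple[str, int]]:
    """Return (pattern, start position) for every match, by rounds."""
    # One lookup table per substring length: substring -> pattern IDs.
    tables = {}
    for pid, pattern in enumerate(patterns):
        for size, sub in slices(pattern):
            tables.setdefault(size, {}).setdefault(sub, []).append(pid)

    k = max(len(p) for p in patterns).bit_length() - 1
    marked = set()  # entries are (substring length, pattern ID, position)
    found = []
    for size in (1 << bit for bit in range(k, -1, -1)):  # 2^k down to 1
        table = tables.get(size, {})
        for pos in range(len(text) - size + 1):
            for pid in table.get(text[pos:pos + size], []):
                pl = len(patterns[pid])
                prev = pre_substring(pl, size)
                # The previous substring must already be marked just
                # before this position, for the same pattern.
                if prev and (prev, pid, pos - prev) not in marked:
                    continue
                marked.add((size, pid, pos))
                if not post_substring(pl, size):
                    found.append((patterns[pid], pos + size - pl))
    return found


print(match_patterns("THIS IS A TEST", ["A TEST", "TEST IS"]))
# [('A TEST', 8)]
```

In this run, round one marks "A TE" and "TEST", round two marks "ST" and declares the "A TEST" match, and the " I" hit is dropped because no "TEST" mark precedes it.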
Implementation
We have described our design in VHDL and targeted it to the Xilinx Virtex II architecture with -7 speed grade. We use the Xilinx ISE 7.1i and Mentor Graphics ModelSim 6.0 development tools. We have implemented a linear array of these PEs. Using a Virtex II XC2V6000, we are able to accommodate 128 PEs. This allows us to handle an input string length of 128 bytes. The corresponding clock frequency is 220 MHz. Since we need an average of six clock cycles to complete pattern matching for 128 bytes, our average throughput is 0.22 × 128 / 6 = 4.7 Gb/s.

Memory design is very important in achieving high performance. In our design, the Match Table is the key to consolidating partial matches from substrings into full matches. Concurrent read/write accesses to the Match Table could become the performance bottleneck. In the real implementation, we therefore cut the Match Table into thin slices. As shown in Figure 8, we assign one slice of the Match Table to each PE. Each PE writes to its own slice and reads from other slices if needed. The memory requirement of the Match Table thus becomes single write, multiple read instead of multiple write, multiple read. For our implementation on the Xilinx Virtex II, we mapped each slice of the Match Table onto one 18 Kb Block SelectRAM. These 18 Kb Block SelectRAMs provide a total of 3.5 Mb of memory on a chip [23], so all 128 Match Table slices fit easily. The memory implementation of the hash tables can be optimized in the same fashion. We can make multiple copies of the hash tables and distribute them among the PEs. Since we only need to read from the hash tables during the matching process, each copy of the hash tables can be implemented using multi-port memory and shared among several PEs.

The performance of our system can be further improved when more hardware resources are available, in two ways. First, we could use a larger FPGA that can accommodate 6 × 128 PEs. In this way, input strings can be processed in a pipelined fashion: at every clock cycle, a 128-byte string enters and a 128-byte string leaves. Second, multiple copies of the current design can be used in parallel to process multiple input streams at the same time. Either way, scalability can be achieved easily by adding hardware resources.
The throughput, unit size, and performance of our design are compared with several other designs in Table 1. While delivering high throughput, our design also does relatively well in unit size and performance. The real strength of our design shows when the number of patterns grows significantly and network speed increases dramatically: we do not have to make major changes to the design, since adding hardware resources alone lets it scale as needed.
Conclusion
In this paper, we propose a new hardware solution for NIDS. Our solution can handle variable pattern length efficiently while using the hash function approach. Pattern matching is processed in O(log M ) steps, where M is the maximum pattern length. Enabling fast pattern matching is the key component in successful network intrusion detection. As a next step, we plan to explore beyond exact pattern matching, identifying threats that are not exactly the same as the known patterns, but are variants of the known patterns.
