Abstract-We develop GPU adaptations of the Aho-Corasick string matching algorithm for the case when all data reside initially in the GPU memory and the results are to be left in this memory. We consider several refinements to a base GPU implementation and measure the performance gain from each refinement. Experiments conducted on an NVIDIA Tesla GT200 GPU, which has 240 cores and is hosted by a 2.8GHz quad-core Xeon CPU, show that our Aho-Corasick GPU adaptation achieves a speedup between 8.5 and 9.5 relative to a single-thread CPU implementation and between 2.4 and 3.2 relative to the best multithreaded implementation.
I. INTRODUCTION
In multipattern string matching, we are to report all occurrences of a given set or dictionary of patterns in a target string. Multipattern string matching arises in a number of applications including network intrusion detection, digital forensics, business analytics, and natural language processing. For example, the popular open-source network intrusion detection system Snort [1] has a dictionary of several thousand patterns that are matched against the contents of Internet packets, and the open-source file carver Scalpel [2] searches for all occurrences of headers and footers from a dictionary of about 40 header/footer pairs in disks that are many gigabytes in size. In both applications, the performance of the multipattern matching engine is paramount. In the case of Snort, it is necessary to search for thousands of patterns in relatively small packets at Internet speed, while in the case of Scalpel we need to search for tens of patterns in hundreds of gigabytes of disk data. Snort [1] employs the Aho-Corasick [3] multipattern search method while Scalpel [2] uses the Boyer-Moore single-pattern search algorithm [4].
Several researchers have attempted to improve the performance of multipattern matching applications through the use of parallelism. For example, Scarpazza et al. [5], [6] port the deterministic finite automata version of the Aho-Corasick method to the IBM Cell Broadband Engine (CBE) while Zha et al. [7] port a compressed form of the nondeterministic finite automata version of the Aho-Corasick method to the CBE. Jacob et al. [8] port Snort to a GPU (we refer the reader to the full version of this paper [15] for a description of the GPU architecture). However, in their port, they replace the Aho-Corasick search method employed by Snort with the Knuth-Morris-Pratt [9] single-pattern matching algorithm. Specifically, they search for 16 different patterns in a packet in parallel employing 16 GPU cores. Huang et al. [10] do network intrusion detection on a GPU based on the multipattern search algorithm of Wu and Manber [11]. Smith et al. [12] use deterministic finite automata and extended deterministic finite automata to do regular expression matching on a GPU for intrusion detection applications. Marziale et al. [13] propose the use of GPUs and massive parallelism for in-place file carving. However, Zha and Sahni [14] show that the performance of an in-place file carver is limited by the time required to read data from the disk rather than the time required to search for headers and footers (when a fast multipattern matching algorithm is used). Hence, by doing asynchronous disk reads, the pattern matching time is effectively overlapped by the disk read time and the total time for the in-place carving operation equals that of the disk read time. Therefore, this application cannot benefit from the use of a GPU to accelerate pattern matching.
Our focus in this paper is accelerating the Aho-Corasick multipattern string matching algorithm through the use of a GPU. In this paper, we assume that the target string resides in the device memory and the results are to be left in this memory. The case in which the target string is initially in the CPU memory and the results of the matching are to be left in the CPU memory is considered in the full version of this paper [15]. We further assume that the pattern data structure is precomputed and stored in the GPU. Although we researched GPU adaptations of the Boyer-Moore multipattern matching algorithm as well, these adaptations did not perform as well as our GPU adaptations of the Aho-Corasick algorithm. So, we do not report on the Boyer-Moore adaptations here.
The remainder of this paper is organized as follows. In Section II we describe the Aho-Corasick algorithm. Section III describes our GPU adaptation and Section IV discusses our experimental results. We conclude in Section V.
II. THE AHO-CORASICK ALGORITHM
There are two versions, nondeterministic and deterministic, of the Aho-Corasick (AC) [3] multipattern matching algorithm. We use the deterministic version in our work as it makes half as many state transitions as the non-deterministic version. In the deterministic version (DFA), each state has a transition pointer for every character in the alphabet as well as a list of matched patterns. Aho and Corasick [3] show how to compute the transition pointers. The number of state transitions made by a DFA when searching for matches in a string of length n is n. Figure 1 gives the Aho-Corasick DFA for the patterns abcaabb, abcaabbcc, acb, acbccabb, ccabb, bccabc, and bbccabca drawn from the 3-letter alphabet {a,b,c}.
III. GPU ADAPTATION

A. Strategy
The input to the multipattern matcher is a character array input and the output is an array output of states. Both arrays reside in device memory. output[i] gives the state of the AC DFA following the processing of input[i]. Since every state of the AC DFA contains a list of patterns that are matched when this state is reached, output[i] enables us to determine all matching patterns that end at input character i. If we assume that the number of states in the AC DFA is no more than 65536, a state can be encoded using two bytes and the size of the output array is twice that of the input array.
Our computational strategy is to partition the output array into blocks of size S_block (Figure 2 summarizes the notation used in this section). The blocks are numbered (indexed) 0 through n/S_block − 1, where n is the number of output values to be computed. Note that n equals the number of input characters as well. output[b * S_block : (b+1) * S_block − 1] comprises the bth output block. To compute the bth output block, it is sufficient to run AC on input[b * S_block − maxL + 1 : (b + 1) * S_block − 1], where maxL is the length of the longest pattern (for simplicity, we assume that there is a character that is not the first character of any pattern and set input[−maxL + 1 : −1] equal to this character). So, a block actually processes a string whose length is S_block + maxL − 1 and produces S_block elements of the output. The number of blocks is B = n/S_block.
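As an illustration, the convention that input[−maxL + 1 : −1] holds a character starting no pattern can be realized with a few lines of host-side setup. The sketch below is ours, not the paper's code; PAD_CHAR is a hypothetical sentinel byte assumed not to begin any pattern, and the constants mirror those used later in the experiments.

// Illustrative host-side setup (names and constants are ours; PAD_CHAR is a
// hypothetical sentinel assumed not to begin any pattern).
#include <cuda_runtime.h>

const int maxL = 17;                    // longest pattern length used in our experiments
const int S_block = 14592;              // output values computed per thread block
const unsigned char PAD_CHAR = 0xff;    // assumption: no pattern starts with this byte

unsigned char *padInput(const unsigned char *devInput, int n)
{
    // Prepend maxL-1 sentinel characters so that every block, including block 0,
    // can read the maxL-1 characters "preceding" its first output position.
    unsigned char *padded;
    cudaMalloc(&padded, n + maxL - 1);
    cudaMemset(padded, PAD_CHAR, maxL - 1);
    cudaMemcpy(padded + maxL - 1, devInput, n, cudaMemcpyDeviceToDevice);
    return padded;                      // kernels index relative to padded + maxL - 1
}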
Suppose that an output block is computed using T threads. Then, each thread could compute S_thread = S_block/T of the output values to be computed by the block. So, thread t of a block, 0 ≤ t < T, computes output values t * S_thread through (t + 1) * S_thread − 1 of that block. Figure 3 gives the pseudocode for a T-thread computation of block b of the output using the AC DFA. The variables used are self-explanatory and the correctness of the pseudocode follows from the preceding discussion.
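The following CUDA sketch conveys the structure of Algorithm basic under the assumptions above. It is illustrative rather than the paper's exact code: dInput, dOutput, and dDFA are our own names, the DFA is assumed to be stored as a flat table of 256 two-byte successor states per state, and dInput is assumed to be padded so that index −maxL + 1 is valid.

// Sketch of Algorithm basic (illustrative). dDFA[s * 256 + c] is the state
// entered when character c is read in state s.
__global__ void acBasic(const unsigned char *dInput,   // points past the maxL-1 pad bytes
                        unsigned short *dOutput,
                        const unsigned short *dDFA,
                        int S_thread, int maxL)
{
    int b = blockIdx.x, t = threadIdx.x, T = blockDim.x;
    int S_block = S_thread * T;
    int outputStartIndex = b * S_block + t * S_thread;
    int inputStartIndex  = outputStartIndex - (maxL - 1); // maxL-1 characters of run-up

    unsigned short state = 0;                              // initial DFA state
    for (int i = 0; i < S_thread + maxL - 1; i++) {
        state = dDFA[state * 256 + dInput[inputStartIndex + i]];
        if (i >= maxL - 1)                                 // run-up characters produce no output
            dOutput[outputStartIndex + i - (maxL - 1)] = state;
    }
}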
The AC DFA resides in texture memory because texture memory is cached and is sufficiently large to accommodate the DFA. While shared and constant memories will result in better performance, neither is large enough to accommodate the DFA. Note that each state of a DFA has A transitions, where A is the alphabet size. For ASCII, A = 256. Assuming that the total number of states is fewer than 65536, each state transition of a DFA takes 2 bytes. So, a DFA with d states requires 512d bytes. In the 16KB shared memory that our Tesla has, we can store at best a 32-state DFA. The constant memory on the Tesla is 64KB. So, this can handle, at best, a 128-state DFA.
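For illustration, with the CUDA texture-reference API of that hardware generation, the transition table could be placed in texture memory roughly as follows. This is a sketch under the assumption that the table is a flat array of unsigned short transitions; the names are ours.

// Illustrative texture setup for the DFA transition table (not the paper's code).
texture<unsigned short, 1, cudaReadModeElementType> texDFA;  // 1-D texture of 2-byte states

void bindDFA(const unsigned short *dDFA, int numStates)
{
    size_t bytes = (size_t)numStates * 256 * sizeof(unsigned short); // 512 bytes per state
    cudaBindTexture(NULL, texDFA, dDFA, bytes);
}

// Inside the kernel, a transition becomes a cached texture fetch:
//     state = tex1Dfetch(texDFA, state * 256 + c);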
A nice feature of Algorithm basic is that all T threads that work on a single block can execute in lock-step fashion as there is no divergence in the execution paths of these T threads. This makes it possible for an SM of a GPU to efficiently compute an output block using T threads. With 30 SMs, we can compute 30 output blocks at a time. The pseudocode of Figure 3 does, however, have deficiencies that are expected to result in non-optimal performance on a GPU. These deficiencies are listed below.
Deficiency D1: Since the input array resides in device memory, every reference to the array input requires a device memory transaction (in this case a read). There are two sources of inefficiency when the read accesses to input are actually made on the Tesla GPU: (a) Our Tesla GPU performs device-memory transactions for a half warp (16 threads) at a time. The available bandwidth for a single transaction is 128 bytes. Each thread of our code reads 1 byte. So, a half warp reads 16 bytes. Hence, barring any other limitation of our GPU, our code will utilize 1/8th the available bandwidth between device memory and an SM. (b) The Tesla is able to coalesce the device memory transactions from several threads of a half warp into a single transaction. However, coalescing occurs only when the device-memory accesses of two or more threads in a half warp lie in the same 128-byte segment of device memory. When S_thread > 128, the values of inputStartIndex for consecutive threads in a half warp (note that two threads t1 and t2 are in the same half warp iff floor(t1/16) = floor(t2/16)) are more than 128 bytes apart. Consequently, for any given value of the loop index i, the read accesses made to the array input by the threads of a half warp lie in different 128-byte segments and so no coalescing occurs. Although the pseudocode is written to enable all threads to simultaneously access the needed input character from device memory, an actual implementation on the Tesla GPU will serialize these accesses and, in fact, every read from device memory will transmit exactly 1 byte to an SM, resulting in a 1/128 utilization of the available bandwidth.
Deficiency D2: The writes to the array output suffer from deficiencies similar to those identified for the reads from the array input. Assuming that our DFA has no more than 2^16 = 65536 states, each state can be encoded using 2 bytes. So, a half warp writes 64 bytes whereas the available bandwidth for a half warp is 128 bytes. Further, no coalescing takes place as no two threads of a half warp write to the same 128-byte segment. Hence, the writes get serialized and the utilized bandwidth is 2 bytes per transaction, which is 1/64th of the available bandwidth.
Analysis of Total Work
Using the strategy of Figure 3, we essentially do multipattern searches on B * T strings of length S_thread + maxL − 1 each. With a linear complexity for multipattern search, the total work, TW, is roughly equivalent to that done by a sequential algorithm working on an input string of length B * T * (S_thread + maxL − 1) = n * (1 + (maxL − 1)/S_thread).
So, our strategy incurs an overhead of ((maxL − 1)/S_thread) * 100% in terms of the effective length of the string that is to be searched.
B. Addressing the Deficiencies
1) Deficiency D1 - Reading from device memory:
A simple way to improve the utilization of the available bandwidth between the device memory and an SM is to have each thread input 16 characters at a time, process these 16 characters, and write the output values for these 16 characters to device memory. For this, we will need to cast the input array from its native data type unsigned char to the data type uint4 as below:
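A minimal sketch of such a cast and read is given below; the variable names and the index i are illustrative, and T is taken to be a compile-time constant giving the number of threads per block.

// Minimal sketch (names are ours): reinterpret the input as uint4 so that each
// thread fetches 16 bytes with a single 128-bit read.
__shared__ uint4 in4[T];                            // one 16-byte staging slot per thread
const uint4 *inputAsUint4 = (const uint4 *) input;  // input: device array of unsigned char
in4[threadIdx.x] = inputAsUint4[i];                 // reads input[16*i : 16*i+15]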
This read fetches the 16 bytes input[16 * i : 16 * i + 15] and stores them in the variable in4, which is assigned space in shared memory. Since the Tesla is able to read up to 128 bits (16 bytes) at a time for each thread, this simple change increases bandwidth utilization for the reading of the input data from 1/128 of capacity to 1/8 of capacity! However, this increase in bandwidth utilization comes at some cost. To extract the characters from in4 so they may be processed one at a time by our algorithm, we need to do a shift and mask operation on the 4 unsigned int components of in4. We shall see later that this cost may be avoided by doing a recast to unsigned char.
Since a Tesla thread cannot read more than 128 bits at a time, the only way to improve bandwidth utilization further is to coalesce the accesses of multiple threads in a half warp. To get full bandwidth utilization, at least 8 threads in a half warp will need to read uint4s that lie in the same 128-byte segment. However, the data to be processed by different threads do not lie in the same segment. To get around this problem, threads cooperatively read all the data needed to process a block, store this data in shared memory, and finally read and process the data from shared memory. In the pseudocode of Figure 4, T threads cooperatively read the input data for block b. This pseudocode, which is for thread t operating on block b, assumes that S_block and maxL − 1 are divisible by 16 so that a whole number of uint4s are to be read and each read begins at a uint4 boundary (assuming that input[−maxL + 1] begins at a uint4 boundary). In each iteration (except possibly the last one), T threads read a consecutive set of T uint4s from device memory to shared memory and each uint4 is 16 input characters.
In each iteration (except possibly the last one) of the for loop, a half warp reads 16 adjacent uint4s for a total of 256 adjacent bytes. If input[−maxL + 1] is at a 128-byte boundary of device memory, S_block is a multiple of 128, and T is a multiple of 8, then these 256 bytes fall in two 128-byte segments and can be read with two memory transactions. So, bandwidth utilization is 100%. Although 100% utilization is also obtained using uint2s (now each thread reads 8 bytes at a time rather than 16 and a half warp reads 128 bytes in a single memory transaction), the observed performance is slightly better when a half warp reads 256 bytes in 2 memory transactions.
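A CUDA sketch of such a cooperative read, under the stated divisibility and alignment assumptions, is given below; the array name sInputUint4 and the constants are ours.

#define S_BLOCK 14592                 // output values per block (the value used in our experiments)
#define MAXL    17                    // longest pattern length

// Cooperative read of a block's input into shared memory (illustrative sketch;
// assumes S_BLOCK and MAXL-1 are divisible by 16 and input is suitably aligned).
__shared__ uint4 sInputUint4[(S_BLOCK + MAXL - 1) / 16];

int b = blockIdx.x, t = threadIdx.x, T = blockDim.x;
const uint4 *inputAsUint4 =
    (const uint4 *)(input + b * S_BLOCK - (MAXL - 1));
int numUint4 = (S_BLOCK + MAXL - 1) / 16;

// T threads read consecutive uint4s; in each iteration a half warp touches
// 256 adjacent bytes of device memory, so the reads coalesce.
for (int i = t; i < numUint4; i += T)
    sInputUint4[i] = inputAsUint4[i];
__syncthreads();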
Once we have read the data needed to process a block into shared memory, each thread may generate its share of the output array as in Algorithm basic but with the reads being done from shared memory. Thread t will need sInput[t * S_thread : (t+1) * S_thread + maxL − 2] or sInputUint4[t * S_thread/16 : (t + 1) * S_thread/16 + (maxL − 1)/16 − 1], depending on whether a thread reads the input data from shared memory as characters or as uint4s. When the latter is done, we need to do shifts and masks to extract the characters from the 4 unsigned int components of a uint4.
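For instance, the four characters packed into one unsigned int component of a uint4 could be extracted as below (a sketch; on the GPU's little-endian layout the earliest input character occupies the low-order byte).

// Extracting the 4 packed characters of one 32-bit word (illustrative).
uint4 v = sInputUint4[k];                                // k: some uint4 index (hypothetical)
unsigned int w = v.x;                                    // repeat for v.y, v.z and v.w
unsigned char c0 = (unsigned char)( w        & 0xff);    // earliest character
unsigned char c1 = (unsigned char)((w >> 8)  & 0xff);
unsigned char c2 = (unsigned char)((w >> 16) & 0xff);
unsigned char c3 = (unsigned char)( w >> 24);            // latest character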
Although the input scheme of Figure 4 succeeds in reading in the data utilizing 100% of the bandwidth between device memory and an SM, there is potential for shared-memory bank conflicts when the threads read the data from shared memory. Shared memory is partitioned into 16 banks. The ith 32-bit word of shared memory is in bank i mod 16. For maximum performance the threads of a half warp should access data from different banks. Suppose that S_thread = 224 and sInput begins at a 32-bit word boundary. Let tWord = S_thread/4 (tWord = 224/4 = 56 for our example) denote the number of 32-bit words processed by a thread exclusive of the additional maxL − 1 characters needed to properly handle the boundary. In the first iteration of the data processing loop, thread t needs sInput[t * S_thread], 0 ≤ t < T. So, the words accessed by the threads in the half warp 0 ≤ t < 16 are t * tWord, 0 ≤ t < 16, and these fall into banks (t * tWord) mod 16, 0 ≤ t < 16. For our example, tWord = 56 and (t * 56) mod 16 = 0 when t is even and (t * 56) mod 16 = 8 when t is odd. Since each bank is accessed 8 times by the half warp, the reads by a half warp are serialized to 8 shared memory accesses. Further, since on each iteration each thread steps right by one character, the bank conflicts remain on every iteration of the process loop. We observe that whenever tWord is even, at least threads 0 and 8 access the same bank (bank 0) on each iteration of the process loop. Theorem 1 shows that when tWord is odd, there are no shared-memory bank conflicts.
Theorem 1: When tWord is odd, (i * tWord) mod 16 ≠ (j * tWord) mod 16 for 0 ≤ i < j < 16.
Proof: The proof is by contradiction. Assume there exist i and j such that 0 ≤ i < j < 16 and (i * tWord) mod 16 = (j * tWord) mod 16. For this to be true, there must exist nonnegative integers a, b, and c, with a < c and 0 ≤ b < 16, such that i * tWord = 16a + b and j * tWord = 16c + b. So, (j − i) * tWord = 16(c − a). Since tWord is odd and c − a > 0, j − i must be divisible by 16. However, 0 < j − i < 16 and so j − i cannot be divisible by 16. This contradiction implies that our assumption is invalid and the theorem is proved.
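The theorem can be checked quickly by tabulating the bank indices for the two tWord values discussed above; the small host-side check below is ours, not the paper's.

// Host-side check of shared-memory bank indices (illustrative).
// For tWord = 56 (even), threads 0..15 of a half warp hit only banks 0 and 8;
// for tWord = 57 (odd), they hit all 16 banks exactly once.
#include <cstdio>
int main()
{
    int tWords[2] = {56, 57};
    for (int k = 0; k < 2; k++) {
        printf("tWord = %d: banks", tWords[k]);
        for (int t = 0; t < 16; t++)
            printf(" %d", (t * tWords[k]) % 16);
        printf("\n");
    }
    return 0;
}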
It should be noted that even when tWord is odd, the input for every block begins at a 128-byte segment of device memory (assuming that the input for the first block begins at a 128-byte segment) provided T is a multiple of 32. To see this, observe that S_block = 4 * T * tWord, which is a multiple of 128 whenever T is a multiple of 32. As noted earlier, since the Tesla schedules threads in warps of size 32, we normally would choose T to be a multiple of 32.
2) Deficiency D2 - Writing to device memory: We could use the same strategy used to overcome deficiency D1 to improve bandwidth utilization when writing the results to device memory. This would require us to first have each thread write the results it computes to shared memory and then have all threads collectively write the computed results from shared memory to device memory using uint4s. Since the results take twice the space taken by the input, such a strategy would necessitate a reduction in S_block by two-thirds. This reduction in block size increases the total work overhead significantly. We can avoid this increase in total work overhead by doing the following: (a) First, each thread processes the first maxL − 1 characters it is to process. The processing of these characters generates no output and so we need no memory to store output. (b) Next, each thread reads the remaining S_thread characters of input data it needs from shared memory to registers. For this, we declare a register array of unsigned integers and typecast sInput to unsigned integer. Since the T threads have a total of 16,384 registers, we have sufficient registers provided S_block ≤ 4 * 16384 = 64K (in reality, S_block would need to be slightly smaller than 64K as registers are needed to store other values such as loop variables). Since total register memory exceeds the size of shared memory, we always have enough register space to save the input data that is in shared memory. Unless S_block ≤ 4864, we cannot store all the results in shared memory. However, to do 128-byte write transactions to device memory, we need only sets of 64 adjacent results (recall that each result is 2 bytes). So, the shared memory needed to store the results is 128T bytes. Since we are contemplating T = 64, we need only 8K of shared memory to store the results from the processing of 64 characters per thread.
Once each thread has processed 64 characters and stored the corresponding results in shared memory, we may write the results to device memory. The total number of outputs generated by a thread is S_thread = 4 * tWord. These outputs take a total of 8 * tWord bytes. So, when tWord is odd (as required by Theorem 1), the output generated by a thread is a non-integral number of uint4s (recall that each uint4 is 16 bytes). Hence, the output for some of the threads does not begin at the start of a uint4 boundary of the device array output and we cannot write the results to device memory as uint4s. Rather, we need to write as uint2s (a thread generates an integral number, tWord, of uint2s). With each thread writing one uint2 per round, a half warp of 16 threads writes the 128 bytes of output generated by a single thread. So, T threads can write the output generated from the processing of 64 characters/thread in 16 rounds of uint2 writes. One difficulty is that, as noted earlier, when tWord is odd, even though the segment of device memory to which the output from a thread is to be written begins at a uint2 boundary, it does not begin at a uint4 boundary. This means also that this segment does not begin at a 128-byte boundary (note that every 128-byte boundary is also a uint4 boundary). So, even though a half warp of 16 threads is writing to 128 bytes of contiguous device memory, these 128 bytes may not fall within a single 128-byte segment. When this happens, the write is done as two memory transactions. The described procedure to handle 64 characters of input per thread is repeated S_thread/64 times to complete the processing of the entire input block.
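The sketch below illustrates one round of this staged write for the T = 64 configuration. It is an illustration of the idea rather than the paper's code: sOutput, roundBase, and the index arithmetic are our own, b and t denote the block and thread indices, output is the device array of states, and S_BLOCK and S_THREAD are the block and per-thread output sizes.

// One round of the staged output write (illustrative sketch).
// Each of the T = 64 threads has computed 64 states for this round and placed
// them contiguously in shared memory: thread j's results occupy sOutput[64*j .. 64*j+63].
__shared__ unsigned short sOutput[64 * 64];

int lane = t & 15, hw = t >> 4;                   // lane within, and index of, the half warp
const uint2 *sOut2 = (const uint2 *) sOutput;     // 16 uint2s per thread's 128-byte chunk

for (int r = 0; r < 16; r++) {
    int src = hw * 16 + r;                        // whose 128-byte chunk this half warp writes
    // starting index (in 2-byte states) of thread src's results for this round
    int dstState = b * S_BLOCK + src * S_THREAD + roundBase;
    uint2 *dOut2 = (uint2 *) (output + dstState); // 128 contiguous bytes of device memory
    dOut2[lane] = sOut2[src * 16 + lane];         // 16 adjacent uint2 writes: 1 or 2 transactions
}
__syncthreads();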
In case S_thread is not divisible by 64, each thread produces fewer than 64 results in the last round. For example, when S_thread = 228, we have a total of 4 rounds. In each of the first three rounds, each thread processes 64 input characters and produces 64 results. In the last round, each thread processes 36 characters and produces 36 results. In the last round, groups of threads write to contiguous device memory segments of size either 64 or 8 bytes and some of these segments may span 2 128-byte segments of device memory.
As we can see, an odd tWord is required to avoid shared-memory bank conflicts, but using an odd tWord (actually, using a tWord value that is not a multiple of 16) results in suboptimal writes of the results to device memory. To optimize writes to device memory, we need to use a tWord value that is a multiple of 16. Since the Tesla executes threads on an SM in warps of size 32, T would normally be a multiple of 32. Further, to hide memory latency, it is recommended that T be at least 64. With T = 64 and a 16KB shared memory, S_thread can be at most 16 * 1024/64 = 256 and so tWord can be at most 64. However, since a small amount of shared memory is needed for other purposes, tWord < 64. The largest value possible for tWord that is a multiple of 16 is therefore 48. The total work, TW, when tWord = 48 and maxL = 17 is n * (1 + 16/(4 * 48)) ≈ 1.083n. Compared to the case tWord = 57, the total work overhead increases from 7% to 8.3%. Whether we are better off using tWord = 48, which results in optimized writes to device memory but shared-memory bank conflicts and larger work overhead, or with tWord = 57, which has no shared-memory bank conflicts and lower work overhead but suboptimal writes to device memory, can be determined experimentally.
IV. EXPERIMENTAL RESULTS
For all versions of our CUDA code, we set maxL = 17, T = 64, and S_block = 14592. Consequently, S_thread = S_block/T = 228 and tWord = S_thread/4 = 57. Note that since tWord is odd, we will not have shared-memory bank conflicts (Theorem 1). We note that since our code is written using a 1-dimensional grid of blocks and since a grid dimension is required to be < 65536 [17], our code can handle at most 65535 blocks. With the chosen block size, n must be less than 912MB. For larger n, we can rewrite the code using a two-dimensional indexing scheme for blocks.
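Should larger inputs be needed, the kernel's block index computation could be generalized along the following lines (our sketch, not part of the evaluated code); B denotes the total number of blocks.

// Illustrative 2-D block indexing for inputs requiring more than 65535 blocks.
// Launch with dim3 grid(gx, gy) chosen so that gx * gy >= B and gx, gy < 65536.
int b = blockIdx.y * gridDim.x + blockIdx.x;   // linearized block index
if (b >= B) return;                            // skip any padding blocks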
For our experiments, we used a pattern dictionary from [2] that has 33 patterns. The target search strings were extracted from a disk image and we used n = 10MB, 100MB, and 904MB.
3) Aho-Corasick Algorithm: We evaluated the performance of the following versions of our AC algorithm:
AC0: This is Algorithm basic (Figure 3) with the DFA stored in device memory.
AC1: This differs from AC0 only in that the DFA is stored in texture memory.
AC2: The AC1 code is enhanced so that each thread reads 16 characters at a time from device memory rather than 1. This reading is done using a variable of type uint4. The read data is stored in shared memory. The processing of the read data is done by reading it one character at a time from shared memory and writing the resulting state directly to device memory.
AC3: The AC2 code is further enhanced so that threads cooperatively read data from device memory to shared memory as in Figure 4. The read data is processed as in AC2.
AC4: This is the AC3 code with deficiency D2 eliminated using a register array to save the input and cooperative writes as described in Section III-B2.
We experimented with a variant of AC3 in which data was read from shared memory as uints, the 4 characters encoded in a uint were extracted using shifts and masks, and DFA transitions were done on these 4 characters. This variant took about 1% to 2% more time than AC3 and is not reported on further. Also, we considered variants of AC4 in which tWord = 48 and 56; these took approximately 14.78% and 7.8% more time than AC4, respectively. We do not report on these variants further either.
Table I gives the run time for each of our AC versions. As can be seen, the run time decreases noticeably with each enhancement made to the code. Table II gives the speedup attained by each version relative to AC0 and Figure 5 is a plot of this speedup. Simply relocating the DFA from device memory to texture memory, as is done in AC1, results in a speedup of almost 2. Performing all of the enhancements yields a speedup of almost 8 when n = 10MB and almost 9 when n = 904MB.
4) Comparison with Multicore Computing on Host:
For benchmarking purposes, we also programmed a multithreaded version of the AC algorithm and ran it on the quad-core Xeon host to which our GPU is attached. The multithreaded version replicated the AC DFA so that each thread had its own copy to work with. For n = 10MB and 100MB we obtained best performance using 8 threads, while for n = 500MB and 904MB best performance was obtained using 4 threads. The 8-thread code delivered a speedup of 2.67 and 3.59, respectively, for n = 10MB and 100MB relative to the single-threaded code. For n = 500MB and 904MB, the speedup achieved by the 4-thread code was, respectively, 3.88 and 3.92, which is very close to the maximum speedup of 4 that a quad-core can deliver.
AC4 offers speedups of 8.5, 9.2, and 9.5 relative to the single-thread CPU code for n = 10MB, 100MB, and 904MB, respectively. The speedups relative to the best multithreaded quad-core codes were 3.2, 2.6, and 2.4, respectively.
V. CONCLUSION
We have developed a multipattern matching algorithm for GPUs that is based on the Aho-Corasick multipattern matching algorithm AC [3]. Experiments show that our GPU adaptation of AC achieves speedups between 8.5 and 9.5 relative to a single-thread CPU code and speedups between 2.4 and 3.2 relative to a multithreaded code that uses all cores of our quad-core host.
