Accelerating the signature matching function is essential to perform Deep Packet Inspection (DPI) at line rates. The conversion of the signatures into a Deterministic Finite Automaton (DFA) enables this function to be performed in linear time. However, since the DFA is extremely storage inefficient, it is compressed before being stored in the memory. Although state-of-the-art bitmap-based compression algorithms can perform line rate signature matching, they only achieve transition compression of ~90-95%. Addressing the storage inefficiency, two bitmap-based transition compression algorithms were proposed by Subramanian et al. in 2016 to achieve transition compression of over 98%. A theoretical relationship is established in this article between the achievable signature matching throughput and the number of pipeline stages required to perform the decompression through the hardware accelerator based on the proposed techniques. Additional optimizations are proposed and evaluated to improve the per-stream signature matching throughput through the proposed decompression engines. The experimental evaluation of the optimizations shows that the per-stream signature matching throughput can be improved by a factor of 1.2-1.4x. A software model of the proposed decompression engines was designed and evaluated across a multitude of payload byte streams to verify the functional correctness of the proposed compression methods.
Prior Art and Key Contributions
Introduction to DFA
If S′ represents the set of N DFA states and C′ represents the set of characters, each entry in the DFA stores the state transition δ(s, c) = t, where s, t ∈ S′ represent the current state and the next state, respectively; c ∈ C′ represents the character; and δ represents the state transition function. The total number of state transitions generated in the DFA is the product of the total number of states generated and the number of characters for which the transitions are represented in a state, as shown in (1).
Total State Transitions = N × |C′| (1)
Hardware-Oriented Transition Compression Techniques
Hardware-oriented transition compression techniques can be classified into hash-based and bitmap-based techniques (Subramanian et al., 2016). Hash-based solutions identify the non-redundant transitions in a DFA and store them in a hash table, using the current state and character as hash keys. A hash-based technique was proposed in Lunteren & Guanella (2012), where the DFA transitions are converted into rules which are stored in the memory. There can be multiple rules associated with every unique transition and all of them have to be stored in on-chip memory. As the number of signatures increases, the number of customized rules also increases, and the rules eventually have to be stored in the off-chip DRAM. The latency associated with a rule fetch from the DRAM reduces the signature matching throughput, affecting the scalability of the solution. In worst-case scenarios, the throughput achieved by the solution dropped to 2 Gbps, while the maximum throughput that can be achieved by the solution is 73.6 Gbps.
On the other hand, bitmap-based techniques compress adjacent transitions that are identical to each other in a DFA, and a bitmap is used to identify the transition indices which have been compressed. For example, a K-bit bitmap is used to identify whether each transition in a sequence of K state transitions is compressed. An index 'i' in the K-bit bitmap stores a '0' if the transition corresponding to that index is compressed, and a '1' if the transition is not compressed. The transitions that are not compressed are stored in a unique transition list with a unique transition index to identify their location. The unique transition index corresponding to a transition can be found by calculating the number of '1's in the bitmap up to the index of that transition.

Reorganized and Compact DFA (RCDFA) (Wang et al., 2011) is a bitmap-based transition compression technique which performs bitmap-based compression along the state axis. The RCDFA originally achieves transition compression rates of the order of 97-98%. However, in order to reduce the number of bitmaps stored in the memory, RCDFA stores additional redundant state transitions, which reduces the compression rates to 95%, resulting in an increased overall memory footprint of the compressed DFA.
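The bitmap scheme described above can be illustrated with a minimal Python sketch. It assumes that a transition is compressed when it equals its immediate predecessor in the sequence, and that the unique transition index is obtained by counting the '1' bits up to the queried position. The function names `compress_row` and `lookup` are hypothetical and do not come from the cited works.

```python
def compress_row(transitions):
    """Return (bitmap, unique) for one sequence of transitions.

    Assumption: a transition identical to its predecessor is compressed
    (bit 0); any other transition is kept in the unique list (bit 1).
    """
    bitmap, unique = [], []
    for i, t in enumerate(transitions):
        if i > 0 and t == transitions[i - 1]:
            bitmap.append(0)          # compressed: same as previous entry
        else:
            bitmap.append(1)          # kept: stored in the unique list
            unique.append(t)
    return bitmap, unique


def lookup(bitmap, unique, index):
    """Recover the transition at `index` from the compressed form."""
    # Number of '1' bits up to and including `index`, minus one, gives
    # the position in the unique transition list.
    unique_index = sum(bitmap[: index + 1]) - 1
    return unique[unique_index]


row = [5, 5, 5, 2, 2, 7, 5, 5]
bitmap, unique = compress_row(row)
recovered = [lookup(bitmap, unique, i) for i in range(len(row))]
```

In hardware the count of '1' bits would be a population-count circuit rather than a loop; the sketch only shows the indexing relationship.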
Another bitmap-based transition compression technique, called Front-End Acceleration for Content-Aware Network processing (FEACAN), is proposed in Qi et al. (2011). It achieves transition compression of the order of about 90% by grouping states. This technique observes both intra-state and inter-state transition redundancy in a DFA. The intra-state redundancy is removed by compressing the transitions using a bitmap along the character axis.
For example (see Figure 2(a)), the state transition corresponding to character 7 is compressed in states 0, 2, 3, 4 and 7 since it is identical to that of character 6. The inter-state redundancy is removed by grouping the states into subsets and by comparing the compressed transitions within the state groups. After state grouping, one of the states in the group is referred to as the leader state and all other states in the group are called the member states. After the state grouping, a comparison is made at each unique transition index between the leader and member transitions.

To summarize, the memory components in bitmap-based techniques can be split into control and transition memories. The transition compression rates achieved through the bitmap-based techniques are around 90% in the case of FEACAN and 95% in the case of RCDFA. These compression rates lead to inefficient storage of the compressed DFA in the on-chip memories.
Key Contributions
The state-of-the-art bitmap-based compression techniques do not result in efficient transition compression, leading to inefficient usage of on-chip memories to store the compressed transitions. This can either be due to algorithmic limitations, as in the case of Qi et al. (2011), or to the redundant transitions being stored to reduce the number of unique bitmaps stored in memory, as in the case of Wang et al. (2011). Addressing these weaknesses, two bitmap-based transition compression techniques, the Member State Bitmask Technique (MSBT) and the Leader State Compression Technique (LSCT), were proposed in Subramanian et al. (2016).
The key idea behind these two techniques is an additional level of indexing with the introduction of bitmasks, which efficiently index the non-redundant transitions after bitmap-based transition compression. The additional indexing not only results in a reduced transition memory usage, but also reduces the overall memory usage to store the compressed transitions.
The experimental evaluation of the optimizations introduced in the hardware accelerator shows a further increase in the per-stream signature matching throughput by a factor of 1.2 to 1.4x.
• States which share the same bitmap and a certain percentage of identical transitions, defined by the transition threshold, are clustered into a group. After grouping the states into subsets of states, a leader state is identified for each group, while the rest of the states are called the member states. As part of the inter-state compression, the transitions in a member state that are identical to the transitions in a leader state at each unique transition index are compressed.
A member transition that is not identical to the leader transition at the same unique transition index can be identified using a Member Transition Bitmask (MTB), which is stored for each of the member states.
The MTB is composed of a sequence of mask bits, where each bit corresponds to a unique transition index and represents whether the member transition at that index is identical to or different from the leader transition at the same index. If the member and leader transitions are identical at the unique transition index, the corresponding bitmask bit is a '0'; a '1' represents that the member transition at the index is different from the leader transition at the same index. For example (see Figure 3), the bitmask bit at index '3' for the state '2' has a '0', representing that the member transition at the index is the same as the leader transition at the same index.

If a DFA is compressed using the MSBT, a next state transition can be decompressed in the following way. If the current state is a leader state, then the transition at the unique transition index corresponding to the incoming character is directly assigned as the next state. If the current state is a member state, the bitmask bit corresponding to the unique transition index decides the next state. If the bitmask bit at the unique transition index is a '1', the member transition which remains uncompressed corresponding to the unique transition index is assigned as the next state. If the bitmask bit corresponding to the unique transition index is a '0', the leader transition corresponding to the unique transition index is chosen as the next state.
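The decompression rule above can be written down as a short Python sketch. It is only an illustration of the selection logic: the function name `msbt_next_state` is hypothetical, and the uncompressed member transitions are held in a dictionary keyed by unique transition index, which is a simplification and not the memory layout of the hardware engine.

```python
def msbt_next_state(is_leader, unique_index, leader_transitions,
                    mtb=None, member_transitions=None):
    """Select the next state for one (state, character) lookup.

    leader_transitions: list indexed by unique transition index.
    mtb: list of 0/1 mask bits for the member state (None for a leader).
    member_transitions: dict {unique_index: next_state} holding only the
        member transitions that differ from the leader (hypothetical layout).
    """
    if is_leader:
        # Leader state: the leader transition is used directly.
        return leader_transitions[unique_index]
    if mtb[unique_index] == 1:
        # Mask bit '1': the member transition differs and was kept.
        return member_transitions[unique_index]
    # Mask bit '0': the member transition equals the leader transition.
    return leader_transitions[unique_index]


leader = [10, 11, 12, 13]
mtb = [0, 1, 0, 0]
member = {1: 42}
```

With these hypothetical values, a member lookup at unique transition index 1 returns the stored member transition, while every other index falls back to the leader transition.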
The cost paid for the additional compression is the memory used to store the MTB. The maximum width of the unique transition index for a group defines the length of the MTB. For example, the maximum width of the unique transition index corresponding to group '0' is 7, resulting in a 7-bit MTB for each member state in the group. A cumulative sum of all member transitions which remain uncompressed up to each member state in a group is stored in the memory along with the MTBs, as shown in Figure 3(d). For example, a cumulative sum of '3' corresponding to state 7 represents that 3 member transitions are stored in memory before the first uncompressed member transition belonging to state 7 is stored.
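As a rough sketch of how the cumulative sum could be used, the following Python code builds the MTBs, the cumulative sums and a flat list of uncompressed member transitions for one group. The address calculation (cumulative sum plus the number of '1' bits in the MTB before the queried index) is an assumption made for illustration; the names `build_member_table` and `member_fetch` are hypothetical.

```python
def build_member_table(leader, members):
    """Build (mtt, records) for one group.

    leader: list of leader transitions, indexed by unique transition index.
    members: list of member transition lists, in memberID order.
    records[i] = (MTB of member i, cumulative sum of stored transitions
                  belonging to the members before member i).
    """
    mtt, records = [], []
    for row in members:
        mtb = [0 if m == l else 1 for m, l in zip(row, leader)]
        records.append((mtb, len(mtt)))
        # Only transitions that differ from the leader are stored.
        mtt.extend(m for m, bit in zip(row, mtb) if bit == 1)
    return mtt, records


def member_fetch(mtt, records, member_pos, unique_index):
    """Fetch an uncompressed member transition (mask bit must be '1')."""
    mtb, cumulative = records[member_pos]
    if mtb[unique_index] != 1:
        raise ValueError("transition is compressed; use the leader transition")
    # Assumed addressing: entries stored before this member, plus the
    # uncompressed entries of this member that precede the index.
    return mtt[cumulative + sum(mtb[:unique_index])]


leader = [1, 2, 3, 4]
members = [[1, 9, 3, 8], [1, 2, 3, 4], [7, 2, 3, 6]]
mtt, records = build_member_table(leader, members)
```

Here the third member has a cumulative sum of 2, meaning two member transitions are stored before its first uncompressed transition.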
In the MSBT, the states are encoded and represented as a combination of leaderID and memberID, similar to the state encoding technique used in Qi et al. (2011). Figure 3(e) shows the state encoding between the two representations. The leaderID identifies the group to which a state belongs and the memberID identifies the member representation within a group of states. The memberID for the leader state is always kept '0' to easily differentiate between a leader state and other member states.
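A minimal sketch of such an encoding is shown below. Packing the two fields into one integer and the 8-bit memberID width are assumptions chosen for illustration, and the helper names are hypothetical.

```python
MEMBER_BITS = 8  # assumed field width for the memberID


def encode_state(leader_id, member_id):
    """Pack (leaderID, memberID) into a single integer state code."""
    if not 0 <= member_id < (1 << MEMBER_BITS):
        raise ValueError("memberID does not fit in the assumed field width")
    return (leader_id << MEMBER_BITS) | member_id


def decode_state(code):
    """Unpack a state code into (leaderID, memberID)."""
    return code >> MEMBER_BITS, code & ((1 << MEMBER_BITS) - 1)


def is_leader(code):
    """A memberID of 0 marks the leader state of a group."""
    return decode_state(code)[1] == 0
```

Reserving memberID 0 for the leader means the leader check reduces to testing whether the low-order field is zero.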
Functional Description of the Hardware Decompression Engine (MSBT)

Components of the Hardware Decompression Engine
A hardware acceleration engine is proposed to decompress the transitions that are compressed using the MSBT. Figure 4 shows the functional architecture of the signature matching engine.
The engine is split across three processing stages: the Address Lookup Stage (ALS), the Leader Transition and Bitmask Fetch Stage (LTBFS) and the Member Fetch Stage (MFS).
There are four lookup tables across which the compressed transitions and the control information are split. The Leader Transition Table (LTT) belongs to the second stage and stores the transitions which remain uncompressed after the bitmap-based compression among the leader states. The Member Bitmask Table (MBT) also belongs to the second stage and stores the MTB for each member state along with the cumulative sum of transitions. The Member Transition Table (MTT) belongs to the third stage and stores the member transitions which remain uncompressed after the intra- and inter-state compression. The fourth table, the AMT, is looked up in the first stage. The next state is selected between the transition fetched from the LTT and the transition fetched from the MTT, depending on the bitmask bit corresponding to the leader offset and the current state, as discussed previously.
All memories used in the MSBT implementation are single port memories and can be categorized into control and transition memories. The AMT and the MBT belong to the control memories, as they store the control information such as the base addresses, the bitmaps and the bitmasks which are used to compute the location of a compressed transition. On the other hand, the LTT and the MTT are transition memories, as they actually store the compressed state transitions. The basic idea of the proposal is to store more information in the control memory in comparison to the state-of-the-art implementations to improve the transition compression, resulting in an overall reduction in the memory usage.

Example of a Transition Fetch

The transition fetch for the combination of state '4' and character '5' is used as an example to explain the decompression process. The information fetched from various memories is highlighted in green, while the bits of interest in the bitmap and the bitmask are highlighted in red in Figure 5. The leaderID corresponding to state 4 is '0' and is used as the address to fetch the data from the AMT.

The bitmasks corresponding to the leader and member states are shown in Figure 6(d). For example, the entry corresponding to the unique transition index '2' in group '0' has a '0', which represents that the leader transition corresponding to the entry is the most repeated transition. On the other hand, the entry corresponding to the unique transition index '3' in group '0' has a '1', representing that the leader transition corresponding to the entry is not the most repeated transition.
If a DFA is compressed using the LSCT, the next state transition can be decompressed in the following way. If the current state is a leader state, then the bitmask bit corresponding to the unique transition index in the LTB is calculated. If the bitmask bit is '0', then the most repeated transition is assigned as the next state. If the bitmask bit is '1', then the leader transition which remains uncompressed corresponding to the unique transition index is assigned as the next state. If the current state is a member state, the bitmask bit corresponding to the unique transition index in the MTB is identified. If the bitmask bit is '0', then the same procedure is followed as in the case of the leader state. If the bitmask bit is '1', then the member transition which remains uncompressed corresponding to the unique transition index is assigned as the next state.
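The selection logic described above can be sketched in Python as follows. The function name `lsct_next_state` and the dictionary-based storage of the uncompressed leader and member transitions are hypothetical simplifications used only to show the order of the checks.

```python
def lsct_next_state(is_leader, unique_index, most_repeated, ltb,
                    leader_transitions, mtb=None, member_transitions=None):
    """Select the next state for one lookup in an LSCT-compressed group.

    ltb / mtb: lists of 0/1 mask bits indexed by unique transition index.
    leader_transitions: dict {unique_index: next_state} for LTB bits of '1'.
    member_transitions: dict {unique_index: next_state} for MTB bits of '1'.
    """
    if not is_leader and mtb[unique_index] == 1:
        # Member transition differs from the leader and was kept.
        return member_transitions[unique_index]
    # Leader state, or member state whose mask bit is '0':
    # follow the leader procedure.
    if ltb[unique_index] == 0:
        return most_repeated
    return leader_transitions[unique_index]


most_repeated = 0
ltb = [0, 1, 0, 1]
leader = {1: 7, 3: 9}
mtb = [1, 0, 0, 0]
member = {0: 5}
```

A member state whose mask bit is '0' therefore resolves to either the most repeated transition or an uncompressed leader transition, exactly as a leader state would.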
Functional Description of the Hardware Decompression Engine (LSCT)
The AMT stores the address of the first transition that remains uncompressed and the first bitmask for each group. These are referred to as the TT base address and the BT base address, respectively. The AMT also stores the bitmap for each of the groups and the most repeated transition in the leader state. Similar to the MSBT, the AMT and the BT are the control memories and the TT is the transition memory.
How to Fetch a Compressed Transition
The current state and the incoming character are passed as inputs to the first stage to compute the leader offset and the address location to fetch the LTB and the MTB (BT_LTB_ADDR & BT_MTB_ADDR). The leaderID (which is part of the state encoding) corresponding to the current state is used as the address to fetch the data from the AMT. The leader offset is calculated analogous to how the leader offset was computed in the case of MSBT. The BT base address represents the address from which the LTB is fetched. The memberID when added to the BT base address provides the address from which the MTB and the cumulative sum of transitions are fetched. The first stage also provides the TT base address and the most repeated transition as inputs to the second stage.
The addresses generated by the first stage are used to fetch the LTB and the MTB from the BT simultaneously. The bitmask bit corresponding to the leader offset is checked in both the LTB and the MTB; these bits are denoted as the leader bitmask bit and the member bitmask bit. If the current state is a leader state and the leader bitmask bit is '1', or if the current state is a member state and the member bitmask bit is '0', a transition offset is calculated similar to the member offset calculation in the case of MSBT. The transition offset is then added to the TT base address to generate the TT address location. On the other hand, if the current state is a member state and the member bitmask bit is '1', a transition offset is calculated for the member transition. The second stage also indicates to the third stage when the generated address can be used for the TT, based on the above combinations.
The third stage takes the generated TT address and the most repeated transition, and assigns the next state depending on the bitmasks. The transition is fetched from the generated TT address and the next state is multiplexed between the transition fetched from the TT and the most repeated transition, based on the combinations discussed above.
Example of a Transition Fetch
Pipelining vs Throughput
As mentioned in the previous section, each functional stage is a combination of a single memory lookup followed by a combinatorial function which processes the data from the memory. So, it would take 3 clock cycles for a transition to be decompressed from the memory.
For example, if the character is consumed in the first clock cycle as an input, the subsequent character can only be consumed in the 4th clock cycle, as it takes 3 clock cycles to process the character and identify the next state corresponding to it. So, in order to keep the pipeline busy and to fully utilize the hardware resources, characters from multiple data streams can be interleaved as proposed in Basu & Narlikar (2005). Since there are 3 pipeline stages (corresponding to 3 memory accesses), characters from 3 different data streams are passed to the pipeline in an interleaved manner, so that each stream supplies one character every 3 clock cycles.
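The interleaving described above can be illustrated with a small Python sketch that builds a round-robin issue schedule. It assumes one issue slot per clock cycle and as many streams as pipeline stages; the function name `interleave` and the sample streams are hypothetical.

```python
def interleave(streams):
    """Return a list of (cycle, stream_id, char) issue slots.

    One character is issued per clock cycle, visiting the streams in
    round-robin order. With P streams, each stream issues once every
    P cycles, which matches a pipeline of depth P.
    """
    schedule = []
    cycle = 0
    longest = max(len(s) for s in streams)
    for pos in range(longest):
        for stream_id, stream in enumerate(streams):
            if pos < len(stream):
                schedule.append((cycle, stream_id, stream[pos]))
            cycle += 1  # the slot passes even if the stream is exhausted
    return schedule


schedule = interleave(["abc", "xyz", "123"])
stream0_cycles = [c for c, sid, _ in schedule if sid == 0]
```

With three streams, stream 0 issues at cycles 0, 3 and 6 (counting from zero), so its second character enters the pipeline in the 4th clock cycle, once the next state for its first character is known.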
To generalize, if 'P' is assumed to be the total number of pipeline stages to process a character from one data stream, characters from 'P' different data streams have to be interleaved to extract the best from the hardware resources. Since each character corresponds to 8 bits and the system can consume one character every clock cycle, the throughput that can be achieved is a product of the frequency and the character width. Assuming F to be the frequency at which the system is clocked, the throughput T achieved by the decompression system can be generalized as shown in (2).
T = F × 8 bps (2)
Since characters from multiple streams are input to effectively utilize the hardware resources, the maximum throughput that is achieved by one of the interleaved streams is inversely proportional to the number of pipeline stages. Equation (3) shows the maximum achievable throughput Tstream_max_pipeline for a single stream among the interleaved streams during pipelined operation.
Tstream_max_pipeline = (F × 8) / P bps (3)
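As a quick numerical illustration, the sketch below computes the aggregate throughput and the per-stream maximum, assuming the per-stream value is the aggregate throughput divided by the number of pipeline stages P. The clock frequency used is a hypothetical value, not a measured result.

```python
def aggregate_throughput_bps(freq_hz, char_bits=8):
    """Throughput when one character is consumed every clock cycle."""
    return freq_hz * char_bits


def per_stream_max_bps(freq_hz, stages, char_bits=8):
    """Maximum throughput of one of `stages` interleaved streams."""
    return freq_hz * char_bits / stages


freq = 300_000_000                      # hypothetical 300 MHz clock
total = aggregate_throughput_bps(freq)  # all interleaved streams together
single = per_stream_max_bps(freq, stages=3)
```

At the assumed 300 MHz clock, the pipeline sustains 2.4 Gbps in total, while each of the 3 interleaved streams sees at most 0.8 Gbps.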
Based on (2), the throughput T can be increased either by increasing the clock frequency or by increasing the character width, i.e., by sending multiple characters per stream per clock cycle.
Processing multiple characters per stream requires the conversion of the DFA into a multistride DFA, which results in an exponential memory growth and is not a scalable approach (Becchi & Crowley, 2013). Thus, the only way in which the throughput can be improved is by increasing the operating frequency of the transition decompression. The frequency that can be achieved depends on two factors: the latency associated with the SRAM fetch; and the latency associated with the combinatorial processing path in the pipeline. The combinatorial processing logic associated with the various functional stages can be broken down into multiple pipeline stages by introducing additional registers to increase the frequency of operation. However, splitting the combinatorial path into pipeline stages also reduces the maximum throughput that can be achieved by a single stream, since the two are inversely proportional to each other, as shown in (3).
In the case of the MSBT and the LSCT, the number of pipeline stages required to process an incoming character can be broken down into a fixed and a variable count. The former is the number of pipeline stages which is a bare minimum requirement due to the associated memory fetches. In both the MSBT and the LSCT, the fixed pipeline stage count is 3, as there are 3 memory fetches as part of the transition decompression. The latter is a variable component, which results from the combinatorial block in each stage being split into multiple smaller stages to improve the clock frequency. In the case of the MSBT and the LSCT, the variable pipeline stage counts for the first stage are defined as Δ and θ, respectively. For example, if the combinatorial block in the ALS is split into 3 (Δ) stages, it would take 4 pipeline stages (3 (Δ) + 1 stage for AMT lookup) in the MSBT to finish the processing associated with the ALS. On the other hand, in the MSBT and the LSCT, the variable pipeline count associated with the second stage is defined by η, as the processing associated with these stages is exactly the same. Equations (4) and (5), respectively, define the total pipeline stage count PMSBT and PLSCT for MSBT and LSCT after design pipelining.
PMSBT = 3 + Δ + η (4)
PLSCT = 3 + θ + η (5)
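The stage count can be worked through with a small Python sketch. It assumes the total is the fixed count of 3 plus the variable counts of the first and second stages; the value of 3 for the first stage comes from the example in the text, while the value of 1 for the second stage is hypothetical.

```python
FIXED_STAGES = 3  # one pipeline stage per memory fetch


def total_stages(first_stage_extra, second_stage_extra):
    """Total pipeline stages after splitting the combinatorial blocks."""
    return FIXED_STAGES + first_stage_extra + second_stage_extra


def breakeven_clock_scaling(first_stage_extra, second_stage_extra):
    """Clock speed-up needed for one stream to match the unsplit pipeline."""
    return total_stages(first_stage_extra, second_stage_extra) / FIXED_STAGES


p_msbt = total_stages(first_stage_extra=3, second_stage_extra=1)
scaling = breakeven_clock_scaling(3, 1)
```

Under these assumed values the pipeline has 7 stages, so the clock would need to run more than 7/3 times faster than the unsplit design before a single stream benefits from the extra pipelining.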
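Providing multiple streams to the system is a best-case scenario to completely utilize the hardware resources. In the worst-case scenario, when only one stream is available to the pipeline, the maximum throughput that can be achieved is the same as described in (3). When the next state can be assigned directly from the second stage, the memory access in the third stage is not required, and the minimum pipeline stage counts PMSBT_min and PLSCT_min for the MSBT and the LSCT are given by (6) and (7), respectively. The fixed pipeline component in (6) and (7) is represented as 2, as there are only 2 memory accesses required to identify the next state.
PMSBT_min = 2 + Δ + η (6)
PLSCT_min = 2 + θ + η (7)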
P(c)MSBT is defined as the probability that a transition fetched from the third functional stage is assigned as the next state, when a sequence of M bytes is inspected by the MSBT decompression system. Equation (8) shows the throughput Tstream_max_MSBT that can be achieved by a single stream in the worst-case scenario in relation to the probability of the transition fetch. Since PMSBT_min is smaller than PMSBT, a higher throughput can be achieved when P(c)MSBT is smaller. Equation (9) shows the corresponding throughput for the LSCT.
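The dependence on the fetch probability can be explored with the following Python sketch. The model used here is an assumption made for illustration: the average number of cycles per character is taken as the probability-weighted mix of the full and the minimum stage counts. It is not a reproduction of the article's equations, and the function names are hypothetical.

```python
def average_cycles(p_third_stage, stages_full, stages_min):
    """Assumed model: expected cycles spent per character of one stream."""
    return p_third_stage * stages_full + (1 - p_third_stage) * stages_min


def single_stream_throughput_bps(freq_hz, p_third_stage,
                                 stages_full, stages_min, char_bits=8):
    """Single-stream throughput under the assumed average-cycle model."""
    cycles = average_cycles(p_third_stage, stages_full, stages_min)
    return freq_hz * char_bits / cycles


def improvement_factor(p_third_stage, stages_full, stages_min):
    """Speed-up over always paying the full pipeline depth."""
    return stages_full / average_cycles(p_third_stage, stages_full, stages_min)


half = improvement_factor(0.5, stages_full=3, stages_min=2)
always = improvement_factor(1.0, stages_full=3, stages_min=2)
never = improvement_factor(0.0, stages_full=3, stages_min=2)
```

Under this assumed model with 3 and 2 stages, the factor ranges from 1.0 when every transition comes from the third stage to 1.5 when none does, and it is 1.2 when half of the fetches reach the third stage.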
Experimental Evaluation
A compiler was developed which takes the DFA as an input to generate the compressed transitions along with the control information such as the bitmaps and bitmasks as outputs.
The maximum number of states in a group was restricted to 256 states and the transition threshold used for inter-state compression was set to 80%. The same set of states is used as input for all three bitmap-based compression techniques. The compression algorithm was implemented on a Xeon server machine running at 4.4 GHz with 500 GB of main memory.
As proof of concept, the compression scheme was evaluated on DFAs generated across 5 different rule-sets listed in Table 1. The rule-sets were carefully identified to contain both strings and regular expression signatures. Exact match is a group of 500 string signatures synthetically generated from the tool developed by Becchi (2016). The other four rule-sets are extracted from the Snort (Roesch, 1999) and Bro (Paxson, 1999) intrusion detection systems and are a combination of simple strings and complex regular expressions. The signature sets were converted into the DFA using the regex tool (Becchi, 2016) and the custom compiler performs the MSBT and the LSCT on the generated DFAs. Column 3 in Table 1 presents the total DFA states generated, while columns 5 and 6 present the total leader states and member states after state grouping. Column 4 presents the total number of uncompressed transitions in the DFA.

The MTBs in the MSBT index the redundant transitions in the member state which are not stored in the memory. Table 2 compares the average number of transitions that remain uncompressed in a member state after compressing the DFA using FEACAN and the MSBT. The information in Table 3 is extracted from the compilation results after performing the MSBT and the FEACAN compression on the signature sets described in Table 1. It can be seen that about 50 to 80% of the member transitions stored in FEACAN are redundant and can be compressed efficiently through the introduction of MTBs in the MSBT.

transitions in the leader are the single most repeated transitions and can be compressed effectively, which results in an increase in the transition compression.
The A-DFA achieves the best transition compression results when compared with all the other bitmap-based compression techniques and is also close to the theoretical limit. Table 4 shows a comparison of the average number of transitions which have to be fetched from the memory in each of the compression methods, before identifying the compressed state transition corresponding to a state character combination. The data in Table 4 was generated by directly analysing the compressed DFA generated through each of the techniques.

The memory used to store the compressed automata is calculated by computing the sum of the memory used to store the compressed transitions and the memory used to store the control information, such as the base addresses, bitmaps and bitmasks. Table 5 lists the width of the various information stored in memory as part of the various techniques. Figure 12 shows a comparison of the memory usage across different bitmap-based compression techniques. A small improvement of 4-5% in the transition compression ratio seen in Figure 11, in the case of MSBT, translates into an overall reduction in memory of 50% in most of the signature sets. Similarly, in the case of LSCT, even a very small increase in the transition compression results in a significant 5-10% reduction in memory usage when compared with MSBT. Figure 13 shows a comparison of the transition and control memory usage across various techniques. As part of the control memory, FEACAN only stores the base addresses and the bitmaps.

Payload byte streams of 1 MB were generated based on the values chosen for PM, which were 0.35, 0.55, 0.75 and 0.95.
The DFA generated by the regex tool is used as an input to perform the transition compression using the MSBT and the LSCT compilers. A software model of the decompression system was developed for both the MSBT and the LSCT, which was used to perform signature matching on the compressed signatures. Table 6 shows a summary of the signature matching results obtained from the software model of the decompression systems and compares the results with a DFA-based signature matching engine across different signature sets and across the various PM values. The identical signature matching results seen in Table 6 show the functional correctness of the proposed compression methods. In addition to tracking the total number of signature matches, as shown in Table 6, the next state transitions generated for each of the state character combinations were also monitored and the results were identical in all three systems across all the signature sets and across all the PM values (not shown in the results section).

Figure 14. Overview of the software-based simulation environment to verify the decompression system.
Transition Fetch - Dynamic
Figure 15(a) shows the statistics of the number of transitions fetched from the third stage in the case of MSBT as a percentage of total transition fetches, considering traffic traces with various levels of maliciousness. As the probability of maliciousness in the character traces increases, it can be clearly seen that more transitions are fetched from the third functional stage. This can be attributed to two reasons. Firstly, as the maliciousness level in the traces increases, the states which are at higher depths are visited, where the depth of a state refers to the number of positive character matches in a signature. Secondly, most of the states are member states after the state grouping process. In the case of the traces with higher maliciousness levels, the probability that the states traversed are member states is very high. The transitions which lead to a state at a higher depth are generally distinct and cannot be compressed. Thus, these transitions will belong to the member states which remain uncompressed, resulting in more transitions fetched from the third stage.

Figure 15(b) and Figure 16(b), respectively, show the improvement in the per-stream throughput that can be achieved in the worst-case scenarios based on the relationship established in the previous sections. It is assumed that it takes two clock cycles to fetch the compressed state transition from the second stage, while it takes three clock cycles to fetch the compressed transition from the third stage. Based on this assumption, the improvement in the per-stream throughput is calculated based on (8) and (9), as shown in Figure 15(b) and Figure 16(b). As the level of maliciousness increases in the traffic, more transitions are fetched from the third functional stage, which reduces the per-stream throughput in the worst-case scenarios.
In the case of lower levels of maliciousness, the per-stream throughput that can be achieved in the case of LSCT is less than for MSBT, as the probability of a transition fetch from the third functional stage is higher in LSCT. The difference in the throughput gradually reduces as the levels of maliciousness increase, due to the distribution of the compressed transitions. According to Michela et al. (2008) , traffic traces with the highest maliciousness levels are not a common occurrence in network traffic traces. So, with low levels of maliciousness, MSBT can be used to achieve better signature matching throughput than LSCT.
By assigning the next state directly from the second stage, the per-stream signature matching throughput can be increased by a factor of 1.2 to 1.4 times in the case of the MSBT and the LSCT, as shown in Figure 15(b) and Figure 16(b), respectively.
Conclusion
Hardware acceleration of signature matching is a key requirement to perform deep packet inspection at line rates. Subramanian et al. (2016) proposed two bitmap-based transition compression techniques to achieve transition compression rates of over 98%. The transition decompression through the proposed techniques is performed in a hardware accelerator so that the signature matching can be performed at line rates. The fundamental building blocks of the hardware accelerator performing the transition decompression corresponding to these techniques were first proposed in this article. Furthermore, a software model corresponding to the decompression engines was designed and verified to validate the proposed compression methods. The functionality of the decompression engines was verified by injecting multiple 1 MB streams of bytes of different levels of maliciousness, across different signature sets, into the software models and the DFA. The identical signature matching results further validated the functional correctness of the proposed compression methods. In addition, a theoretical relationship was established between the signature matching throughput achieved through these systems and the number of pipeline stages required by the hardware accelerator to perform the transition decompression. Based on this analysis, further optimization methods were proposed to improve the per-stream signature matching throughput. Experimental evaluations showed that the proposed optimizations improve the per-stream signature matching throughput by a factor of 1.2x to 1.4x in comparison to the throughput that is achieved without the optimizations.
