29 research outputs found
Fast and Compact Regular Expression Matching
We study 4 problems in string matching, namely, regular expression matching,
approximate regular expression matching, string edit distance, and subsequence
indexing, on a standard word RAM model of computation that allows
logarithmic-sized words to be manipulated in constant time. We show how to
improve the space and/or remove a dependency on the alphabet size for each
problem using either an improved tabulation technique of an existing algorithm
or by combining known algorithms in a new way
Efficient Multistriding of Large Non-deterministic Finite State Automata for Deep Packet Inspection
Multistride automata speed up input matching because each multistriding transformation halves the size of the input string, leading to a potential 2x speedup. However, up to now little effort has been spent in optimizing the building process of multistride automata, with the result that current algorithms cannot be applied to real-life, large automata such as the ones used in commercial IDSs, because the time and the memory space needed to create the new automaton quickly becomes unfeasible. In this paper, new algorithms for efficient building of multistride NFAs for packet inspection are presented, explaining how these new techniques can outperform the previous algorithms in terms of required time and memory usag
Subsequence Automata with Default Transitions
Let be a string of length with characters from an alphabet of size
. The \emph{subsequence automaton} of (often called the
\emph{directed acyclic subsequence graph}) is the minimal deterministic finite
automaton accepting all subsequences of . A straightforward construction
shows that the size (number of states and transitions) of the subsequence
automaton is and that this bound is asymptotically optimal.
In this paper, we consider subsequence automata with \emph{default
transitions}, that is, special transitions to be taken only if none of the
regular transitions match the current character, and which do not consume the
current character. We show that with default transitions, much smaller
subsequence automata are possible, and provide a full trade-off between the
size of the automaton and the \emph{delay}, i.e., the maximum number of
consecutive default transitions followed before consuming a character.
Specifically, given any integer parameter , , we
present a subsequence automaton with default transitions of size
and delay . Hence, with we
obtain an automaton of size and delay . On
the other extreme, with , we obtain an automaton of size and delay , thus matching the bound for the standard subsequence
automaton construction. Finally, we generalize the result to multiple strings.
The key component of our result is a novel hierarchical automata construction
of independent interest.Comment: Corrected typo
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
We study the approximate string matching and regular expression matching
problem for the case when the text to be searched is compressed with the
Ziv-Lempel adaptive dictionary compression schemes. We present a time-space
trade-off that leads to algorithms improving the previously known complexities
for both problems. In particular, we significantly improve the space bounds,
which in practical applications are likely to be a bottleneck
From Regular Expression Matching to Parsing
Given a regular expression and a string , the regular expression
parsing problem is to determine if matches and if so, determine how it
matches, e.g., by a mapping of the characters of to the characters in .
Regular expression parsing makes finding matches of a regular expression even
more useful by allowing us to directly extract subpatterns of the match, e.g.,
for extracting IP-addresses from internet traffic analysis or extracting
subparts of genomes from genetic data bases. We present a new general
techniques for efficiently converting a large class of algorithms that
determine if a string matches regular expression into algorithms that
can construct a corresponding mapping. As a consequence, we obtain the first
efficient linear space solutions for regular expression parsing
Faster Approximate String Matching for Short Patterns
We study the classical approximate string matching problem, that is, given
strings and and an error threshold , find all ending positions of
substrings of whose edit distance to is at most . Let and
have lengths and , respectively. On a standard unit-cost word RAM with
word size we present an algorithm using time When is
short, namely, or this
improves the previously best known time bounds for the problem. The result is
achieved using a novel implementation of the Landau-Vishkin algorithm based on
tabulation and word-level parallelism.Comment: To appear in Theory of Computing System