5,126 research outputs found
Online Pattern Matching for String Edit Distance with Moves
Edit distance with moves (EDM) is a string-to-string distance measure that
includes substring moves in addition to ordinal editing operations to turn one
string to the other. Although optimizing EDM is intractable, it has many
applications especially in error detections. Edit sensitive parsing (ESP) is an
efficient parsing algorithm that guarantees an upper bound of parsing
discrepancies between different appearances of the same substrings in a string.
ESP can be used for computing an approximate EDM as the L1 distance between
characteristic vectors built by node labels in parsing trees. However, ESP is
not applicable to a streaming text data where a whole text is unknown in
advance. We present an online ESP (OESP) that enables an online pattern
matching for EDM. OESP builds a parse tree for a streaming text and computes
the L1 distance between characteristic vectors in an online manner. For the
space-efficient computation of EDM, OESP directly encodes the parse tree into a
succinct representation by leveraging the idea behind recent results of a
dynamic succinct tree. We experimentally test OESP on the ability to compute
EDM in an online manner on benchmark datasets, and we show OESP's efficiency.Comment: This paper has been accepted to the 21st edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2014
Succinct Dictionary Matching With No Slowdown
The problem of dictionary matching is a classical problem in string matching:
given a set S of d strings of total length n characters over an (not
necessarily constant) alphabet of size sigma, build a data structure so that we
can match in a any text T all occurrences of strings belonging to S. The
classical solution for this problem is the Aho-Corasick automaton which finds
all occ occurrences in a text T in time O(|T| + occ) using a data structure
that occupies O(m log m) bits of space where m <= n + 1 is the number of states
in the automaton. In this paper we show that the Aho-Corasick automaton can be
represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while
still maintaining the ability to answer to queries in O(|T| + occ) time. To the
best of our knowledge, the currently fastest succinct data structure for the
dictionary matching problem uses space O(n log sigma) while answering queries
in O(|T|log log n + occ) time. In this paper we also show how the space
occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)) where H0 is the
empirical entropy of the characters appearing in the trie representation of the
set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The
query time remains unchanged.Comment: Corrected typos and other minor error
Online Self-Indexed Grammar Compression
Although several grammar-based self-indexes have been proposed thus far,
their applicability is limited to offline settings where whole input texts are
prepared, thus requiring to rebuild index structures for given additional
inputs, which is often the case in the big data era. In this paper, we present
the first online self-indexed grammar compression named OESP-index that can
gradually build the index structure by reading input characters one-by-one.
Such a property is another advantage which enables saving a working space for
construction, because we do not need to store input texts in memory. We
experimentally test OESP-index on the ability to build index structures and
search query texts, and we show OESP-index's efficiency, especially
space-efficiency for building index structures.Comment: To appear in the Proceedings of the 22nd edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2015
- …