Succinct Dictionary Matching With No Slowdown
The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant) alphabet of size sigma, build a data structure so that we can find in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T| log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the zeroth-order empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for some constant 0 < epsilon < 1. The query time remains unchanged.
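For orientation, the classical uncompressed Aho-Corasick automaton that this paper compresses can be sketched as a goto function (trie), failure links, and output sets. This is a minimal illustration of the baseline structure only, not the paper's succinct representation; the function names are ours:

```python
from collections import deque

def build_aho_corasick(patterns):
    """Classical Aho-Corasick: trie (goto), failure links, output sets."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    # BFS to set failure links; a node's fail link has smaller depth,
    # so its output set is already complete when we union it in.
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def search(text, goto, fail, out):
    """Report (end_position, pattern) for every occurrence in text."""
    s, occ = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            occ.append((i, p))
    return occ
```

Each state stores its outgoing edges explicitly, which is where the O(m log m) bits of the classical representation come from.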
Block trees
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings. (C) 2020 Elsevier Inc. All rights reserved.
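The core idea — a block whose content already occurred earlier in the string stores only a back-pointer to that earlier occurrence — can be shown in a much-simplified single-level sketch. The names `build_level` and `extract` are ours; the real block tree recurses over O(log(n/z)) levels and locates earlier occurrences more cleverly than a naive `find`:

```python
def build_level(s, b):
    """One level of a (much simplified) block-tree-like structure:
    split s into blocks of length b; a block whose content occurs
    strictly earlier in s keeps only a pointer to that occurrence,
    otherwise it stores its text explicitly."""
    blocks = []
    for i in range(0, len(s), b):
        chunk = s[i:i + b]
        j = s.find(chunk)
        if j < i:
            blocks.append(('ptr', j))      # back-pointer to earlier copy
        else:
            blocks.append(('txt', chunk))  # explicit (leftmost) block
    return blocks

def extract(blocks, b, pos):
    """Character at position pos; pointers always lead strictly left,
    so the recursion terminates at an explicit block."""
    k, off = divmod(pos, b)
    kind, val = blocks[k]
    if kind == 'txt':
        return val[off]
    return extract(blocks, b, val + off)
```

On highly repetitive strings most blocks degenerate to pointers, which is the source of the O(z log(n/z)) space bound.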
Cartesian Tree Matching and Indexing
We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.
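The parent-distance representation can be computed with a monotone stack in linear time: pd[i] is the distance to the nearest previous position whose value is not larger, or 0 if none exists. A sketch (0-indexed here, 1-indexed in the paper), paired with a deliberately naive window-by-window matcher; the paper's actual algorithm reaches O(n+m) via a KMP-style failure function over this representation:

```python
def parent_distance(s):
    """pd[i] = i - j for the largest j < i with s[j] <= s[i], else 0.
    Two strings have the same Cartesian tree iff their
    parent-distance representations are equal."""
    pd, stack = [], []
    for i, v in enumerate(s):
        while stack and s[stack[-1]] > v:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd

def cartesian_matches(text, pattern):
    """Naive O(n*m) check, for illustration only: report every
    position whose window shares the pattern's Cartesian tree."""
    m, pd_p = len(pattern), parent_distance(pattern)
    return [i for i in range(len(text) - m + 1)
            if parent_distance(text[i:i + m]) == pd_p]
```

Note that any strictly increasing window matches any strictly increasing pattern: their Cartesian trees are both right paths.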
String Matching and Indexing Based on Cartesian Trees
Master's thesis, Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2020. Advisor: Kunsoo Park.
We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees.
Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m.
We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching.
We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree.
Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.
Chapter 1 Introduction
Chapter 2 Problem Definition
2.1 Basic notations
2.2 Cartesian tree matching
Chapter 3 Single Pattern Matching in O(n + m) Time
3.1 Parent-distance representation
3.2 Computing parent-distance representation
3.3 Failure function
3.4 Text search
3.5 Computing failure function
3.6 Correctness and time complexity
3.7 Cartesian tree signature
Chapter 4 Multiple Pattern Matching in O((n + m) log k) Time
4.1 Constructing the Aho-Corasick automaton
4.2 Multiple pattern matching
Chapter 5 Cartesian Suffix Tree in Randomized O(n) Time
5.1 Defining Cartesian suffix tree
5.2 Constructing Cartesian suffix tree
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Fast Searching in Packed Strings
Given strings P and Q, the (exact) string matching problem is to find all positions of substrings in Q matching P. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time, which is optimal if we can only read one character at a time. However, most strings are stored in a computer in a packed representation with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. In this paper we study the worst-case complexity of string matching on strings given in packed representation. Let m <= n be the lengths of P and Q, respectively, and let sigma denote the size of the alphabet. On a standard unit-cost word-RAM with logarithmic word size we present an algorithm using time O(n/log_sigma n + m + occ), where occ is the number of occurrences of P in Q. For m = o(n) this improves the bound of the Knuth-Morris-Pratt algorithm. Furthermore, if m = O(n/log_sigma n) our algorithm is optimal, since any algorithm must spend at least Omega((n+m) log sigma / log n + occ) = Omega(n/log_sigma n + occ) time to read the input and report all occurrences. The result is obtained by a novel automaton construction based on the Knuth-Morris-Pratt algorithm combined with a new compact representation of subautomata allowing an optimal tabulation-based simulation.
Comment: To appear in Journal of Discrete Algorithms, Special Issue on CPM.
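For intuition about word-level parallelism in string matching, consider the classical bit-parallel Shift-And matcher of Baeza-Yates and Gonnet. It is a different and much simpler technique than this paper's packed-string algorithm (it still reads one text character per step, but tracks every active pattern prefix in a single machine word):

```python
def shift_and(text, pattern):
    """Bit-parallel Shift-And matcher: bit i of `state` is set iff
    pattern[0..i] is a suffix of the text read so far. Assumes the
    pattern length fits in one machine word."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    state, accept, occ = 0, 1 << (m - 1), []
    for i, c in enumerate(text):
        # extend every active prefix by c, and start a new one (| 1)
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & accept:
            occ.append(i - m + 1)  # full match ends at position i
    return occ
```

One shift and one AND per character update all m prefix states at once, the same word-parallel flavor that the packed-string algorithm pushes further by consuming Theta(log_sigma n) characters per word operation.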