
    Succinct Dictionary Matching With No Slowdown

    The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant-size) alphabet of size sigma, build a data structure so that we can find, in any text T, all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T| log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged.
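    For context on the query model above, the following is a minimal, non-succinct Aho-Corasick sketch in Python; the plain dictionaries used for the goto function correspond to the classical pointer-based representation, not the compressed one proposed in the paper, and the function names are ours.

        from collections import deque

        def build_aho_corasick(patterns):
            """Plain Aho-Corasick tables: goto transitions, failure links, merged outputs."""
            goto, output = [{}], [[]]
            for idx, p in enumerate(patterns):          # insert every pattern into the trie
                state = 0
                for ch in p:
                    if ch not in goto[state]:
                        goto.append({})
                        output.append([])
                        goto[state][ch] = len(goto) - 1
                    state = goto[state][ch]
                output[state].append(idx)
            fail = [0] * len(goto)
            queue = deque(goto[0].values())             # BFS starting from the root's children
            while queue:
                state = queue.popleft()
                for ch, nxt in goto[state].items():
                    queue.append(nxt)
                    f = fail[state]
                    while f and ch not in goto[f]:
                        f = fail[f]
                    fail[nxt] = goto[f].get(ch, 0)
                    output[nxt] += output[fail[nxt]]    # inherit matches ending at the failure state
            return goto, fail, output

        def match(text, patterns):
            """Report (start position, pattern index) for all occurrences in O(|T| + occ) transitions."""
            goto, fail, output = build_aho_corasick(patterns)
            state, hits = 0, []
            for i, ch in enumerate(text):
                while state and ch not in goto[state]:
                    state = fail[state]
                state = goto[state].get(ch, 0)
                for idx in output[state]:
                    hits.append((i - len(patterns[idx]) + 1, idx))
            return hits

        print(match("ushers", ["he", "she", "his", "hers"]))  # [(1, 1), (2, 0), (2, 3)]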

    Block trees

    Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
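    The space bound above is parameterized by the number z of Lempel-Ziv phrases. As a small companion (not the block tree construction itself), here is a quadratic-time Python sketch of the greedy, self-referential LZ77-style parse that defines z; the function name is ours.

        def lz_phrase_count(s):
            """Number of phrases z in a greedy LZ77-style parse of s (O(n^2) sketch).

            Each phrase is either a single fresh character or the longest prefix of the
            remaining suffix that also starts at an earlier position (sources may
            overlap the phrase, as in the classical self-referential parse)."""
            n, i, z = len(s), 0, 0
            while i < n:
                longest = 0
                for j in range(i):                      # try every earlier starting position
                    l = 0
                    while i + l < n and s[j + l] == s[i + l]:
                        l += 1
                    longest = max(longest, l)
                i += max(longest, 1)                    # advance by the phrase, or by one literal
                z += 1
            return z

        # Highly repetitive input: phrases "a", "b", then one long overlapping copy, so z = 3.
        print(lz_phrase_count("ab" * 16))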

    Suffix-Prefix Queries on a Dictionary


    Cartesian Tree Matching and Indexing

    We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.
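    The parent-distance representation mentioned above admits a simple linear-time computation with a monotone stack: position i stores its distance to the nearest earlier position holding a value less than or equal to its own (0 if none), and two strings have the same Cartesian tree exactly when these sequences coincide. A minimal Python sketch (the function name is ours):

        def parent_distance(s):
            """Parent-distance representation: pd[i] = i - j for the rightmost j < i
            with s[j] <= s[i], or 0 if no such j exists. O(n) via a monotone stack."""
            pd, stack = [], []                  # stack keeps indices with non-decreasing values
            for i, v in enumerate(s):
                while stack and s[stack[-1]] > v:
                    stack.pop()                 # larger values can never be the nearest <= later on
                pd.append(i - stack[-1] if stack else 0)
                stack.append(i)
            return pd

        # Strings whose values have the same relative order share a Cartesian tree:
        print(parent_distance([3, 1, 4, 1, 5]))       # [0, 0, 1, 2, 1]
        print(parent_distance([30, 10, 40, 10, 50]))  # [0, 0, 1, 2, 1]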

    String Matching and Indexing Based on Cartesian Trees

    Thesis (Master's) -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, August 2020 (advisor: Kunsoo Park). We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.
    Contents: Chapter 1 Introduction; Chapter 2 Problem Definition (basic notations, Cartesian tree matching); Chapter 3 Single Pattern Matching in O(n + m) Time (parent-distance representation, computing the parent-distance representation, failure function, text search, computing the failure function, correctness and time complexity, Cartesian tree signature); Chapter 4 Multiple Pattern Matching in O((n + m) log k) Time (constructing the Aho-Corasick automaton, multiple pattern matching); Chapter 5 Cartesian Suffix Tree in Randomized O(n) Time (defining the Cartesian suffix tree, constructing the Cartesian suffix tree); Chapter 6 Conclusion; Bibliography; Abstract (in Korean).

    Fast Searching in Packed Strings

    Given strings P and Q, the (exact) string matching problem is to find all positions of substrings in Q matching P. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time, which is optimal if we can only read one character at a time. However, most strings are stored in a computer in a packed representation with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. In this paper we study the worst-case complexity of string matching on strings given in packed representation. Let m <= n be the lengths of P and Q, respectively, and let sigma denote the size of the alphabet. On a standard unit-cost word-RAM with logarithmic word size we present an algorithm using time O(n/log_sigma n + m + occ), where occ is the number of occurrences of P in Q. For m = o(n) this improves the O(n) bound of the Knuth-Morris-Pratt algorithm. Furthermore, if m = O(n/log_sigma n) our algorithm is optimal, since any algorithm must spend at least Omega((n + m) log sigma / log n + occ) = Omega(n/log_sigma n + occ) time to read the input and report all occurrences. The result is obtained by a novel automaton construction based on the Knuth-Morris-Pratt algorithm combined with a new compact representation of subautomata allowing an optimal tabulation-based simulation.
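    The paper's algorithm tabulates a compact KMP-based automaton so that packed characters are consumed several per word operation. As a simpler illustration of word-level parallelism in string matching (not the paper's construction), here is the classical Shift-And (bitap) algorithm in Python, which processes one text character with O(1) word operations whenever m fits in a machine word.

        def shift_and(text, pattern):
            """Classical Shift-And (bitap) matching. State bit j is set iff
            pattern[0..j] matches the text ending at the current position."""
            m = len(pattern)
            if m == 0:
                return []
            masks = {}                                   # per-character bit masks
            for j, c in enumerate(pattern):
                masks[c] = masks.get(c, 0) | (1 << j)
            state, hit_bit, hits = 0, 1 << (m - 1), []
            for i, c in enumerate(text):
                state = ((state << 1) | 1) & masks.get(c, 0)
                if state & hit_bit:
                    hits.append(i - m + 1)               # start of an occurrence ending at i
            return hits

        print(shift_and("ushers", "she"))  # [1]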