
71 research outputs found

Succinct Dictionary Matching With No Slowdown

Author: A.V. Aho
J.I. Munro
K. Sadakane
P. Elias
R.M. Fano
S. Dori
W.-K. Hon
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant) alphabet of size sigma, build a data structure so that we can match in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T| log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged. Comment: Corrected typos and other minor errors
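For orientation, the classical (non-succinct) Aho-Corasick automaton that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration, not the paper's succinct representation: it stores transitions in per-state hash maps, and the names `build_automaton` and `find_all` are made up for this sketch. It assumes all patterns are non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of non-empty patterns."""
    goto = [{}]   # goto[state][char] -> child state (trie edges)
    fail = [0]    # failure link of each state
    out = [[]]    # indices of patterns that end at each state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Breadth-first traversal: a state's failure link points to the state of
    # the longest proper suffix of its string that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches reported by the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

Calling `find_all("ushers", ["he", "she", "his", "hers"])` reports "she" at position 1 and both "he" and "hers" at position 2. The hash maps keep the sketch short but are exactly the kind of pointer-heavy storage that the succinct encoding in the abstract avoids.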

arXiv.org e-Print Archive

CiteSeerX

Crossref

Block trees

Author: Belazzougui Djamal
Caceres Manuel
Gagie Travis
Gawrychowski Pawel
Kaerkkaeinen Juha
Navarro Gonzalo
Ordonez Alberto
Puglisi Simon J.
Tabei Yasuo
Publication date: 01/05/2021

Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings. (C) 2020 Elsevier Inc. All rights reserved. Peer reviewed
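To make the quantity z concrete, here is a naive Lempel-Ziv parse. The abstract does not say which Lempel-Ziv variant is meant, so the phrase rule below is an assumption: each phrase is the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), followed by one fresh character. The function name `lz_parse` is hypothetical, and the quadratic search is for illustration only. It is not the block tree itself.

```python
def lz_parse(s):
    """Greedy Lempel-Ziv parse of s into phrases (naive quadratic-time sketch)."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            k = 0
            # The copy may run past position i, so sources can overlap the phrase.
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        # Phrase = copied part plus one fresh character, if any text remains.
        end = min(n, i + longest + 1)
        phrases.append(s[i:end])
        i = end
    return phrases
```

For the highly repetitive string "abababab" this yields three phrases ("a", "b", "ababab"), so z = 3 while n = 8. The gap between z and n on such inputs is what the O(z log(n/z)) space bound exploits.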

Helsingin yliopiston digitaalinen arkisto

Suffix-Prefix Queries on a Dictionary

Author: Loukides Grigorios
Pissis Solon P.
Thankachan Sharma V.
Zuba Wiktor
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Publication date: 01/01/2023

VU Research Portal

Dagstuhl Research Online Publication Server

King's Research Portal

Cartesian Tree Matching and Indexing

Author: Amir Amihood
Landau Gad M.
Park Kunsoo
Park Sung Gwan
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)
Publication date: 01/01/2019

We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.
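The abstract names the parent-distance representation without defining it. The sketch below assumes the usual definition: entry i is the distance from position i to the nearest position on its left holding a value less than or equal to s[i], or 0 if there is none. Under that assumption two strings have the same Cartesian tree exactly when their parent-distance arrays coincide. The matcher shown is a naive O(nm) check for illustration, not the O(n+m) algorithm of the paper, and both function names are made up here.

```python
def parent_distance(s):
    """Parent-distance array of s, computed in linear time with a stack."""
    pd = []
    stack = []  # indices whose values are non-decreasing from bottom to top
    for i, x in enumerate(s):
        # Discard larger values: they can no longer be the nearest
        # "smaller or equal" element for this or any later position.
        while stack and s[stack[-1]] > x:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd


def cartesian_match_naive(text, pattern):
    """Start positions where a window of text has the pattern's Cartesian tree."""
    m = len(pattern)
    target = parent_distance(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if parent_distance(text[i:i + m]) == target
    ]
```

For example, the pattern [1, 3, 2] matches the text [5, 10, 20, 15, 40, 12] only at position 1, where the window [10, 20, 15] has the same shape: smallest value first, largest in the middle.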

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

String Matching and Indexing Based on Cartesian Trees

Author: 박성관
Publication venue: Seoul National University Graduate School
Publication date: 01/08/2020

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2020. 박근수.

We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.

Contents:
Chapter 1 Introduction
Chapter 2 Problem Definition
2.1 Basic notations
2.2 Cartesian tree matching
Chapter 3 Single Pattern Matching in O(n + m) Time
3.1 Parent-distance representation
3.2 Computing parent-distance representation
3.3 Failure function
3.4 Text search
3.5 Computing failure function
3.6 Correctness and time complexity
3.7 Cartesian tree signature
Chapter 4 Multiple Pattern Matching in O((n + m) log k) Time
4.1 Constructing the Aho-Corasick automaton
4.2 Multiple pattern matching
Chapter 5 Cartesian Suffix Tree in Randomized O(n) Time
5.1 Defining Cartesian suffix tree
5.2 Constructing Cartesian suffix tree
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)

SNU Open Repository and Archive

Fast Searching in Packed Strings

Author: A. Amir
D.E. Knuth
E.W. Myers
G. Navarro
J. Tarhio
K. Fredriksson
R. Baeza-Yates
R.A. Baeza-Yates
R.M. Karp
R.S. Boyer
S. Wu
S.T. Klein
T.A. Welch
V.L. Arlazarov
W. Masek
W. Rytter
Publication date: 01/01/2009

Given strings $P$ and $Q$, the (exact) string matching problem is to find all positions of substrings in $Q$ matching $P$. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time, which is optimal if we can only read one character at a time. However, most strings are stored in a computer in a packed representation with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. In this paper we study the worst-case complexity of string matching on strings given in packed representation. Let $m \leq n$ be the lengths of $P$ and $Q$, respectively, and let $\sigma$ denote the size of the alphabet. On a standard unit-cost word-RAM with logarithmic word size we present an algorithm using time $O\left(\frac{n}{\log_\sigma n} + m + \mathrm{occ}\right)$. Here $\mathrm{occ}$ is the number of occurrences of $P$ in $Q$. For $m = o(n)$ this improves the $O(n)$ bound of the Knuth-Morris-Pratt algorithm. Furthermore, if $m = O(n/\log_\sigma n)$ our algorithm is optimal, since any algorithm must spend at least $\Omega\left(\frac{(n+m)\log \sigma}{\log n} + \mathrm{occ}\right) = \Omega\left(\frac{n}{\log_\sigma n} + \mathrm{occ}\right)$ time to read the input and report all occurrences. The result is obtained by a novel automaton construction based on the Knuth-Morris-Pratt algorithm combined with a new compact representation of subautomata allowing an optimal tabulation-based simulation. Comment: To appear in Journal of Discrete Algorithms. Special Issue on CPM 200
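The equality between the two lower-bound expressions, and the optimality claim that follows from it, can be unpacked in three steps. This is one reading of the abstract rather than the paper's own derivation. It uses only the change-of-base identity, the assumption $m \leq n$, and the observation that the input occupies $(n+m)\log\sigma$ bits while a word holds $\Theta(\log n)$ bits.

```latex
\begin{align*}
\log_\sigma n &= \frac{\log n}{\log \sigma}
  \quad\Longrightarrow\quad
  \frac{n}{\log_\sigma n} = \frac{n \log \sigma}{\log n}, \\
n \le n + m \le 2n
  &\quad\Longrightarrow\quad
  \frac{(n+m)\log \sigma}{\log n}
  = \Theta\!\left(\frac{n \log \sigma}{\log n}\right)
  = \Theta\!\left(\frac{n}{\log_\sigma n}\right), \\
m = O\!\left(\frac{n}{\log_\sigma n}\right)
  &\quad\Longrightarrow\quad
  O\!\left(\frac{n}{\log_\sigma n} + m + \mathrm{occ}\right)
  = O\!\left(\frac{n}{\log_\sigma n} + \mathrm{occ}\right).
\end{align*}
```

The last line shows the upper bound collapsing onto the lower bound once the pattern is short enough, which is the regime in which the abstract claims optimality.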

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

Crossref

Online Research Database In Technology