Space-efficient detection of unusual words

Author: A Apostolico
CAR Hoare
D Belazzougui
J Herold
J Lin
M Crochemore
S Chairungsee
Publication date: 01/01/2015

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that \textit{do not occur} in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.

Comment: arXiv admin note: text overlap with arXiv:1502.0637

arXiv.org e-Print Archive
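
To make the notion of a word occurring "more or less frequently than expected" in the abstract of "Space-efficient detection of unusual words" concrete, here is a minimal Python sketch under an IID model. It is a naive quadratic-time illustration, not the BWT-based algorithm of the paper; the function names and the simple (observed - expected) / sqrt(expected) score are assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def count_occurrences(text, word):
    """Count (possibly overlapping) occurrences of word in text."""
    last_start = len(text) - len(word)
    return sum(1 for i in range(last_start + 1) if text.startswith(word, i))


def iid_expected_count(text, word):
    """Expected number of occurrences of word if characters were drawn
    independently with the empirical character frequencies of text."""
    n = len(text)
    freq = Counter(text)
    prob = 1.0
    for ch in word:
        prob *= freq[ch] / n  # Counter returns 0 for unseen characters
    return (n - len(word) + 1) * prob


def iid_score(text, word):
    """Illustrative score (an assumption, not the paper's measure):
    positive for over-represented words, negative for under-represented."""
    expected = iid_expected_count(text, word)
    if expected == 0:
        return 0.0
    observed = count_occurrences(text, word)
    return (observed - expected) / sqrt(expected)
```

In the text "abababab", the word "ab" scores positive (it occurs more often than the IID model predicts), while "aa" never occurs and scores negative, which is the kind of absent but under-represented candidate the abstract mentions.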

Computing minimal and maximal suffixes of a substring

Author: Babenko Maxim
Gawrychowski Pawel
Kociumaka Tomasz
Kolesnichenko Ignat
Starikovskaia Tatiana
Publication venue: 'Elsevier BV'
Publication date: 25/07/2016

An Almost Optimal Edit Distance Oracle

Author: Charalampopoulos Panagiotis
Gawrychowski Paweł
Mozes Shay
Weimann Oren
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)
Publication date: 01/01/2021

We consider the problem of preprocessing two strings S and T, of lengths m and n, respectively, in order to be able to efficiently answer the following queries: Given positions i,j in S and positions a,b in T, return the optimal alignment score of S[i..j] and T[a..b]. Let N = mn. We present an oracle with preprocessing time N^{1+o(1)} and space N^{1+o(1)} that answers queries in log^{2+o(1)}N time. In other words, we show that we can efficiently query for the alignment score of every pair of substrings after preprocessing the input for almost the same time it takes to compute just the alignment of S and T. Our oracle uses ideas from our distance oracle for planar graphs [STOC 2019] and exploits the special structure of the alignment graph. Conditioned on popular hardness conjectures, this result is optimal up to subpolynomial factors. Our results apply to both edit distance and longest common subsequence (LCS). The best previously known oracle with construction time and size O(N) has slow Ω(√N) query time [Sakai, TCS 2019], and the one with size N^{1+o(1)} and query time log^{2+o(1)}N (using a planar graph distance oracle) has slow Ω(N^{3/2}) construction time [Long & Pettie, SODA 2021]. We improve both approaches by roughly a √N factor.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server
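
The query answered by the oracle in the abstract of "An Almost Optimal Edit Distance Oracle" can be illustrated with a naive baseline that recomputes the alignment from scratch for every query. The Python sketch below is not the oracle from the paper: it uses no preprocessing and takes time proportional to the product of the two substring lengths per query. Unit-cost edit distance, 0-based inclusive positions, and the function names are assumptions made for illustration.

```python
def edit_distance(x, y):
    """Unit-cost edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for r, cx in enumerate(x, 1):
        cur = [r]  # deleting the first r characters of x
        for c, cy in enumerate(y, 1):
            cur.append(min(
                prev[c] + 1,               # delete cx
                cur[c - 1] + 1,            # insert cy
                prev[c - 1] + (cx != cy),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def naive_query(S, T, i, j, a, b):
    """Alignment score of S[i..j] and T[a..b], with 0-based inclusive
    positions (an assumed convention), recomputed on every call."""
    return edit_distance(S[i:j + 1], T[a:b + 1])
```

For example, `naive_query("xkittenx", "ysittingy", 1, 6, 1, 7)` compares "kitten" with "sitting". The oracle described in the abstract answers the same kind of query in log^{2+o(1)}N time after N^{1+o(1)} preprocessing.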