Linear-time Computation of DAWGs, Symmetric Indexing Structures, and MAWs for Integer Alphabets

Authors: Bannai Hideo, Fujishige Yuta, Inenaga Shunsuke, Takeda Masayuki, Tsujimaru Yuki
Publication date: 03/07/2023

The directed acyclic word graph (DAWG) of a string $y$ of length $n$ is the smallest (partial) DFA which recognizes all suffixes of $y$ with only $O(n)$ nodes and edges. In this paper, we show how to construct the DAWG for the input string $y$ from the suffix tree for $y$, in $O(n)$ time for integer alphabets of polynomial size in $n$. In so doing, we first describe a folklore algorithm which, given the suffix tree for $y$, constructs the DAWG for the reversed string of $y$ in $O(n)$ time. Then, we present our algorithm that builds the DAWG for $y$ in $O(n)$ time for integer alphabets, from the suffix tree for $y$. We also show that a straightforward modification to our DAWG construction algorithm leads to the first $O(n)$-time algorithm for constructing the affix tree of a given string $y$ over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. We then discuss how our constructions can lead to linear-time algorithms for building other text indexing structures, such as linear-size suffix tries and symmetric CDAWGs in linear time in the case of integer alphabets. As a further application to our $O(n)$-time DAWG construction algorithm, we show that the set $\mathsf{MAW}(y)$ of all minimal absent words (MAWs) of $y$ can be computed in optimal, input- and output-sensitive $O(n + |\mathsf{MAW}(y)|)$ time and $O(n)$ working space for integer alphabets.

Comment: This is an extended version of the paper "Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets" from MFCS 201

arXiv.org e-Print Archive
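
To make the notion of a minimal absent word in the abstract above concrete, here is a small brute-force sketch in Python. It is not the linear-time DAWG-based algorithm of the paper: it enumerates all substrings and so takes at least quadratic time and space. The function name `minimal_absent_words` and the optional `alphabet` parameter are assumptions made for illustration. It uses the standard characterization that a word of length at least 2 is a MAW exactly when it is absent while its longest proper prefix and longest proper suffix both occur.

```python
def minimal_absent_words(y, alphabet=None):
    """Brute-force MAW enumeration (illustrative only, not linear time).

    A word w is a minimal absent word of y if w does not occur in y
    but every proper substring of w does.
    """
    # Assumption: if no alphabet is given, use the letters that occur in y.
    sigma = sorted(set(y)) if alphabet is None else sorted(alphabet)

    # Every substring of y, including the empty string.
    subs = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}

    maws = set()

    # Length-1 MAWs: alphabet letters that never occur in y.
    for a in sigma:
        if a not in subs:
            maws.add(a)

    # Length >= 2: w = a + u + b where a+u and u+b occur but w does not.
    for left in subs:              # left plays the role of a + u
        if not left:
            continue
        u = left[1:]
        for b in sigma:
            if (u + b) in subs and (left + b) not in subs:
                maws.add(left + b)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

For the string `abaab` over the alphabet `{a, b}` this yields the four words `aaa`, `aaba`, `bab`, and `bb`; passing a larger alphabet adds each unused letter as a MAW of length 1.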

Human Genome Analysis

Author: Kratochvíl Jan
Publication venue: Vysoká škola báňská - Technická univerzita Ostrava
Publication date: 01/01/2019

This thesis describes the implementation of suffix automata used for string searching in long DNA sequences. The first chapter introduces DNA sequencing and mapping. A theoretical primer follows on suffix trees and suffix arrays, which are widely used for searching in long strings. The next chapter introduces suffix automata, followed by compact suffix automata and the design and implementation of this structure. The implementation focuses on splitting the input string into several substrings and constructing a suffix automaton for each substring. Several experiments were conducted on this implementation, and their results are summarized in the closing section.

460 - Katedra informatiky výborn

DSpace at VSB Technical University of Ostrava
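
As a rough illustration of the structure this thesis is about, the following Python sketch builds a suffix automaton with the classic on-line construction and uses it to test whether a pattern occurs in a text. It is a textbook sketch, not the thesis implementation: the class name `SuffixAutomaton`, its field names, and the `contains` method are assumptions, and the per-substring splitting described in the abstract is not modeled here.

```python
class SuffixAutomaton:
    """Suffix automaton (DAWG) built on-line, one character at a time."""

    def __init__(self, text=""):
        self.trans = [{}]     # outgoing transitions of each state
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # length of the longest string in each state
        self.last = 0         # state that represents the whole text so far
        for ch in text:
            self.extend(ch)

    def _new_state(self, trans, length, link):
        self.trans.append(trans)
        self.length.append(length)
        self.link.append(link)
        return len(self.trans) - 1

    def extend(self, ch):
        cur = self._new_state({}, self.length[self.last] + 1, -1)
        p = self.last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in self.trans[p]:
            self.trans[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.trans[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings of q.
                clone = self._new_state(dict(self.trans[q]),
                                        self.length[p] + 1, self.link[q])
                while p != -1 and self.trans[p].get(ch) == q:
                    self.trans[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, pattern):
        """Return True if pattern is a substring of the indexed text."""
        state = 0
        for ch in pattern:
            state = self.trans[state].get(ch)
            if state is None:
                return False
        return True


if __name__ == "__main__":
    sa = SuffixAutomaton("GATTACA")
    print(sa.contains("TTAC"), sa.contains("TAG"))
```

A query walks one transition per pattern character, so a lookup costs time proportional to the pattern length, independent of the text length. A design that builds one automaton per chunk of the input, as the abstract describes, would additionally need to handle occurrences that span chunk boundaries.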

Data Structures for Efficient String Algorithms

Author: Fischer Johannes
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 08/10/2007

This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space- and time-consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix- and LCP-array 2n+o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. 
This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases our algorithms outperform previous approaches by all means.

Digitale Hochschulschriften der LMU
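
To show what the range minimum query interface from the abstract above looks like in practice, here is a Python sketch based on a sparse table. This is a different and simpler technique than the thesis scheme: it answers queries in constant time but needs O(n log n) words of space and preprocessing time, not O(n) time and 2n+o(n) bits. The function names `build_sparse_table` and `rmq` are assumptions made for illustration.

```python
def build_sparse_table(a):
    """table[j][i] is the position of a minimum of a[i : i + 2**j]."""
    n = len(a)
    table = [list(range(n))]          # windows of length 1
    j = 1
    while (1 << j) <= n:
        prev = table[-1]
        half = 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            # Ties go to the left, so the leftmost minimum is reported.
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        j += 1
    return table


def rmq(a, table, lo, hi):
    """Position of a minimum element in a[lo..hi], both ends inclusive."""
    j = (hi - lo + 1).bit_length() - 1    # largest power of two that fits
    left = table[j][lo]
    right = table[j][hi - (1 << j) + 1]
    return left if a[left] <= a[right] else right


if __name__ == "__main__":
    values = [3, 1, 4, 1, 5, 9, 2, 6]
    t = build_sparse_table(values)
    print(rmq(values, t, 2, 5))
```

Each query combines two overlapping power-of-two windows that together cover the interval, which is why no loop is needed at query time.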

On-Line Construction of Compact Directed Acyclic Word Graphs

Authors: A. Blumer, D. Gusfield, E. McCreight, E. Ukkonen, J. Kärkkäinen, M. Crochemore, U. Manber
Publication date: 01/01/2001

A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as the construction of a CDAWG for a set of strings.

CiteSeerX

Crossref

On-line construction of compact directed acyclic word graphs

Authors: A. Shinohara, G. Mauri, G. Pavesi, H. Hoshino, M. Takeda, S. Arikawa, S. Inenaga
Publication venue: Elsevier BV
Publication date: 01/01/2005

Many different index structures, providing efficient solutions to problems related to pattern matching, have been introduced so far. Examples of these structures are suffix trees and directed acyclic word graphs (DAWGs), which can be efficiently constructed in linear time and space. Compact directed acyclic word graphs (CDAWGs) are an index structure preserving some features of both suffix trees and DAWGs, and require less space than both of them. An algorithm which directly constructs CDAWGs in linear time and space was first introduced by Crochemore and Verin, based on McCreight's algorithm for constructing suffix trees. In this work, we present a novel on-line linear-time algorithm that builds the CDAWG for a single string as well as for a set of strings, inspired by Ukkonen's on-line algorithm for constructing suffix trees.

Elsevier - Publisher Connector

AIR Universita degli studi di Milano

On-line construction of compact directed acyclic word graphs

Authors: Apostolico, Ayumi Shinohara, Blumer, Cleary, Crochemore, Giancarlo Mauri, Giulio Pavesi, Grossi, Gusfield, Hiromasa Hoshino, Holub, Inenaga, Kosaraju, Kurtz, Manber, Masayuki Takeda, McCreight, Mäkinen, Setsuo Arikawa, Shunsuke Inenaga, Takeda, Ukkonen, Weiner
Publication venue: Elsevier BV

Crossref