Linear-time Computation of DAWGs, Symmetric Indexing Structures, and MAWs for Integer Alphabets
The directed acyclic word graph (DAWG) of a string y of length n is the
smallest (partial) DFA which recognizes all suffixes of y with only O(n)
nodes and edges. In this paper, we show how to construct the DAWG for the input
string y from the suffix tree for y, in O(n) time for integer alphabets
of polynomial size in n. In so doing, we first describe a folklore algorithm
which, given the suffix tree for y, constructs the DAWG for the reversed
string of y in O(n) time. Then, we present our algorithm that builds the
DAWG for y in O(n) time for integer alphabets, from the suffix tree for
y. We also show that a straightforward modification to our DAWG construction
algorithm leads to the first O(n)-time algorithm for constructing the affix
tree of a given string y over an integer alphabet. Affix trees are a text
indexing structure supporting bidirectional pattern searches. We then discuss
how our constructions can lead to linear-time algorithms for building other
text indexing structures, such as linear-size suffix tries and symmetric CDAWGs,
in the case of integer alphabets. As a further application of
our O(n)-time DAWG construction algorithm, we show that the set MAW(y)
of all minimal absent words (MAWs) of y can be computed in
optimal, input- and output-sensitive O(n + |MAW(y)|) time and O(n)
working space for integer alphabets.
Comment: This is an extended version of the paper "Computing DAWGs and Minimal
Absent Words in Linear Time for Integer Alphabets" from MFCS 201
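To make the notion of a minimal absent word concrete, the following is a brute-force sketch of the definition only, not the linear-time DAWG-based algorithm from the paper. It assumes the alphabet defaults to the characters occurring in the input; the function name `minimal_absent_words` is hypothetical. A word w is treated as a MAW when w does not occur in y but every proper substring of w does.

```python
def minimal_absent_words(y, alphabet=None):
    """Return the set of minimal absent words of y by exhaustive search.

    Illustration of the definition only: this enumerates all substrings,
    so it is far slower than the O(n + |MAW(y)|) bound in the abstract.
    """
    if alphabet is None:
        # Assumption: the alphabet is exactly the characters seen in y.
        alphabet = sorted(set(y))

    # Every substring of y, including the empty string.
    substrings = {y[i:j] for i in range(len(y) + 1)
                  for j in range(i, len(y) + 1)}

    maws = set()

    # A single character that never occurs is a MAW of length 1.
    for c in alphabet:
        if c not in substrings:
            maws.add(c)

    # A longer word a+u+b is a MAW when a+u and u+b both occur but a+u+b
    # does not: every proper substring of a+u+b lies inside a+u or u+b.
    for u in substrings:
        for a in alphabet:
            if a + u not in substrings:
                continue
            for b in alphabet:
                if u + b in substrings and a + u + b not in substrings:
                    maws.add(a + u + b)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

For the string "abaab" over the alphabet {a, b}, the word "aaba" qualifies because "aab" and "aba" both occur while "aaba" itself does not.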
Human Genome Analysis
This thesis describes the implementation of suffix automata used for string searching in DNA sequences. The first part introduces DNA sequencing and mapping. A theoretical part follows, describing the suffix tree and suffix array data structures used for text searching. The next part introduces suffix automata, followed by compact suffix automata and the design and implementation of this structure. The implementation focuses on splitting the input string into several substrings and constructing a suffix automaton for each substring. Several experiments were conducted on the implementation of this data structure. The results of the experiments are summarized in the conclusion of the thesis.
460 - Katedra informatiky (Department of Computer Science)
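As an illustration of the kind of structure the thesis builds on, here is a minimal sketch of the standard on-line suffix automaton construction with a substring query. It is not the thesis's compact or chunked implementation; the class name `SuffixAutomaton` and its methods are hypothetical names chosen for this example.

```python
class SuffixAutomaton:
    """Plain (non-compact) suffix automaton built one character at a time."""

    def __init__(self, text=""):
        self.next = [{}]     # outgoing transitions of each state
        self.link = [-1]     # suffix link of each state (-1 for the root)
        self.length = [0]    # length of the longest string in each state
        self.last = 0        # state that represents the whole text so far
        for ch in text:
            self.extend(ch)

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.next) - 1

    def extend(self, ch):
        """Append one character to the indexed text."""
        cur = self._new_state(self.length[self.last] + 1)
        p = self.last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = self._new_state(self.length[p] + 1)
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, pattern):
        """Return True if pattern occurs as a substring of the text."""
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return False
            state = self.next[state][ch]
        return True


if __name__ == "__main__":
    sa = SuffixAutomaton("GATTACA")
    print(sa.contains("TTAC"), sa.contains("TAT"))
```

If the input were split into chunks with one automaton per chunk, as the abstract describes, a pattern spanning a chunk boundary would be missed unless the chunks overlap or the boundary is handled separately; how the thesis deals with this is not stated in the abstract.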
Data Structures for Efficient String Algorithms
This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n).
It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space- and time-consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix- and LCP-array 2n+o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern.
Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases, our algorithms outperform previous approaches in every respect.
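For readers unfamiliar with constant-time RMQs, the sketch below shows the classic sparse-table approach. This is a swapped-in textbook technique, not the thesis's 2n+o(n)-bit scheme: it uses O(n log n) words of space and O(n log n) preprocessing time, and only the query time matches the constant bound. The class name `SparseTableRMQ` is hypothetical.

```python
class SparseTableRMQ:
    """Range minimum queries via a sparse table of argmin positions."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[k][i] is the position of the leftmost minimum in
        # values[i : i + 2**k].
        self.table = [list(range(n))]
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            row = []
            for i in range(n - 2 * span + 1):
                left, right = prev[i], prev[i + span]
                # On ties keep the left candidate (leftmost minimum).
                if self.values[left] <= self.values[right]:
                    row.append(left)
                else:
                    row.append(right)
            self.table.append(row)
            span *= 2

    def query(self, lo, hi):
        """Position of the leftmost minimum in values[lo..hi], inclusive."""
        if not 0 <= lo <= hi < len(self.values):
            raise IndexError("invalid query interval")
        # Largest power of two that fits into the interval length.
        k = (hi - lo + 1).bit_length() - 1
        left = self.table[k][lo]
        right = self.table[k][hi - (1 << k) + 1]
        return left if self.values[left] <= self.values[right] else right


if __name__ == "__main__":
    rmq = SparseTableRMQ([3, 1, 4, 1, 5, 9, 2, 6])
    print(rmq.query(0, 7), rmq.query(2, 4), rmq.query(4, 7))
```

The two overlapping power-of-two blocks cover the query interval exactly, which is why a single comparison suffices per query. The thesis's contribution is achieving the same query time with only 2n+o(n) bits.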
On-Line Construction of Compact Directed Acyclic Word Graphs
A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as for the construction of a CDAWG for a set of strings.
On-line construction of compact directed acyclic word graphs
Many different index structures providing efficient solutions to problems related to pattern matching have been introduced so far. Examples of these structures are suffix trees and directed acyclic word graphs (DAWGs), which can be efficiently constructed in linear time and space. Compact directed acyclic word graphs (CDAWGs) are an index structure that preserves some features of both suffix trees and DAWGs and requires less space than either of them. An algorithm which directly constructs CDAWGs in linear time and space was first introduced by Crochemore and Verin, based on McCreight's algorithm for constructing suffix trees. In this work, we present a novel on-line linear-time algorithm that builds the CDAWG for a single string as well as for a set of strings, inspired by Ukkonen's on-line algorithm for constructing suffix trees.