    Linear-time Computation of DAWGs, Symmetric Indexing Structures, and MAWs for Integer Alphabets

    The directed acyclic word graph (DAWG) of a string y of length n is the smallest (partial) DFA that recognizes all suffixes of y, and it has only O(n) nodes and edges. In this paper, we show how to construct the DAWG for the input string y from the suffix tree for y, in O(n) time for integer alphabets of polynomial size in n. In so doing, we first describe a folklore algorithm which, given the suffix tree for y, constructs the DAWG for the reversed string of y in O(n) time. Then, we present our algorithm that builds the DAWG for y in O(n) time for integer alphabets, from the suffix tree for y. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. We then discuss how our constructions can lead to linear-time algorithms for building other text indexing structures, such as linear-size suffix tries and symmetric CDAWGs, in the case of integer alphabets. As a further application of our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words (MAWs) of y can be computed in optimal, input- and output-sensitive O(n + |MAW(y)|) time and O(n) working space for integer alphabets. Comment: This is an extended version of the paper "Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets" from MFCS 201
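
    To make the definition of MAW(y) concrete, here is a minimal brute-force sketch in Python. It is not the O(n + |MAW(y)|)-time algorithm of the paper: it enumerates all factors of y explicitly, so it is only suitable for short strings. The function name `minimal_absent_words` and the explicit `alphabet` parameter are assumptions made for illustration. A word w is treated as a MAW when w does not occur in y but every proper factor of w does.

```python
from itertools import product


def minimal_absent_words(y, alphabet):
    """Brute-force MAW computation (illustrative only, not linear time)."""
    # All factors (substrings) of y, including the empty string.
    factors = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}

    # A single letter is a MAW when it does not occur in y at all.
    maws = {a for a in alphabet if a not in factors}

    # w = a + u + b is a MAW when w is absent but both a + u and u + b occur.
    for u in factors:
        for a, b in product(alphabet, repeat=2):
            w = a + u + b
            if w not in factors and a + u in factors and u + b in factors:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
```

    Checking only the two longest proper factors a + u and u + b is enough, because every other proper factor of w is a factor of one of them.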

    Human Genome Analysis

    This thesis describes the implementation of suffix automata used for string searching in long DNA sequences. The first chapter introduces DNA sequencing and mapping. A theoretical primer follows on suffix trees and suffix arrays, data structures that are widely used for searching in long texts. The next chapter introduces suffix automata, followed by compact suffix automata and the design and implementation of this structure. The implementation focuses on splitting the input string into several substrings, where a suffix automaton is constructed for each substring. Several experiments have been conducted on the implementation of this data structure. The results of the experiments are summarized in the closing section. 460 - Department of Computer Science
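
    As a rough illustration of the kind of structure the thesis implements, the following Python sketch builds a suffix automaton with the standard on-line construction and uses it for substring search. It is an assumption-laden sketch, not the thesis's implementation: the class name `SuffixAutomaton`, its method names, and the dictionary-based transitions are all chosen here for readability, and it neither compacts the automaton nor splits the input into several substrings.

```python
class SuffixAutomaton:
    """Suffix automaton (DAWG) of a text, built one character at a time."""

    def __init__(self, text):
        self.next = [{}]      # outgoing transitions of each state
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # length of the longest string reaching each state
        self.last = 0         # state that represents the whole current text
        for ch in text:
            self._extend(ch)

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.length) - 1

    def _extend(self, ch):
        cur = self._new_state(self.length[self.last] + 1)
        p = self.last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps only the shorter strings.
                clone = self._new_state(self.length[p] + 1)
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, pattern):
        """Return True if pattern occurs as a substring of the text."""
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return False
            state = self.next[state][ch]
        return True


if __name__ == "__main__":
    sa = SuffixAutomaton("ACGTACGA")
    print(sa.contains("GTAC"), sa.contains("TT"))
```

    A query takes time proportional to the pattern length (with hashed transitions), independent of the text length, which is what makes the structure attractive for long DNA sequences.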

    Data Structures for Efficient String Algorithms

    This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs, for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved without affecting the time bounds. For the two-dimensional variant of the RMQ problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption by a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space and time consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that, in conjunction with the suffix and LCP arrays, 2n+o(n) bits of additional space (coming from our RMQ scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases our algorithms outperform previous approaches in all respects.
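
    The RMQ interface described above (preprocess once, then answer each minimum-position query in constant time) can be sketched with a classical sparse table. This is a deliberately simpler, swapped-in technique: it uses O(n log n) words of space and O(n log n) preprocessing time, not the 2n+o(n)-bit, O(n)-time scheme of the thesis. The function names `build_sparse_table` and `range_min_pos` are assumptions made for illustration.

```python
def build_sparse_table(values):
    """table[k][i] = position of the leftmost minimum in values[i : i + 2**k]."""
    n = len(values)
    table = [list(range(n))]  # level 0: every window of length 1
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        row = []
        for i in range(n - 2 * span + 1):
            left, right = prev[i], prev[i + span]
            row.append(left if values[left] <= values[right] else right)
        table.append(row)
        span *= 2
    return table


def range_min_pos(values, table, lo, hi):
    """Position of the leftmost minimum in values[lo..hi] (inclusive bounds)."""
    level = (hi - lo + 1).bit_length() - 1  # largest k with 2**k <= window size
    left = table[level][lo]
    right = table[level][hi - (1 << level) + 1]
    return left if values[left] <= values[right] else right


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    table = build_sparse_table(data)
    print(range_min_pos(data, table, 2, 5))
```

    Each query combines two overlapping power-of-two windows that together cover the interval, which is why no loop over the interval is needed at query time.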

    On-Line Construction of Compact Directed Acyclic Word Graphs

    A Compact Directed Acyclic Word Graph (CDAWG) is a space-efficient text indexing structure that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as for the construction of a CDAWG for a set of strings.

    On-line construction of compact directed acyclic word graphs

    Many different index structures that provide efficient solutions to problems related to pattern matching have been introduced so far. Examples of these structures are suffix trees and directed acyclic word graphs (DAWGs), which can be efficiently constructed in linear time and space. Compact directed acyclic word graphs (CDAWGs) are an index structure that preserves some features of both suffix trees and DAWGs and requires less space than either of them. An algorithm which directly constructs CDAWGs in linear time and space was first introduced by Crochemore and Verin, based on McCreight's algorithm for constructing suffix trees. In this work, we present a novel on-line linear-time algorithm that builds the CDAWG for a single string as well as for a set of strings, inspired by Ukkonen's on-line algorithm for constructing suffix trees.