9 research outputs found

    Regular Languages meet Prefix Sorting

    Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017], which extends naturally the concept of prefix sorting to labeled graphs, we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. Interestingly, we characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: when sorted, the strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with $n$ states admits an equivalent Wheeler DFA (WDFA) with at most $2n-1-|\Sigma|$ states that can be computed in $O(n^3)$ time. This is in sharp contrast with general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, a $O(n\log n)$-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By contribution (i), our algorithms can also be used to index any WNFA at the moderate price of doubling the automaton's size. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language of any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in $O(n\log n)$ time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time. Comment: added minimization theorems; uploaded submitted version; New version with new results (W-MH theorem, linear determinization), added author: Giovanna D'Agostin
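As an illustration of what "prefix-sortable states" means, the sketch below checks whether a proposed total order of the nodes of an edge-labeled graph satisfies the Wheeler graph conditions of Gagie et al.: nodes without incoming edges come first, edges with smaller labels lead to smaller targets, and equally labeled edges preserve the order of their sources (non-strictly). The function name `is_wheeler_order` and the edge-list representation are assumptions made for this example. It only verifies a given order with a naive pairwise check and is not one of the sorting algorithms described in the abstract.

```python
from itertools import combinations


def is_wheeler_order(order, edges):
    """Return True if `order` (list of nodes, smallest first) satisfies the
    Wheeler conditions for `edges`, a list of (source, target, label) triples.

    Hypothetical helper: quadratic in the number of edges.
    """
    rank = {node: i for i, node in enumerate(order)}
    has_incoming = {target for (_, target, _) in edges}

    # Nodes with no incoming edge must precede every node that has one.
    seen_incoming = False
    for node in order:
        if node in has_incoming:
            seen_incoming = True
        elif seen_incoming:
            return False

    for e1, e2 in combinations(edges, 2):
        # Normalise so that e1 has the smaller (label, source rank) pair.
        if (e1[2], rank[e1[0]]) > (e2[2], rank[e2[0]]):
            e1, e2 = e2, e1
        (u1, v1, a1), (u2, v2, a2) = e1, e2
        # Smaller label implies strictly smaller target.
        if a1 < a2 and not rank[v1] < rank[v2]:
            return False
        # Same label and smaller source implies target not larger.
        if a1 == a2 and rank[u1] < rank[u2] and not rank[v1] <= rank[v2]:
            return False
    return True


# Small example graph (an assumption, not taken from the paper).
example_edges = [(0, 1, "a"), (0, 2, "b"), (1, 3, "b")]
good = is_wheeler_order([0, 1, 2, 3], example_edges)
bad = is_wheeler_order([0, 2, 1, 3], example_edges)
```

In the example, the order 0, 1, 2, 3 passes, while swapping nodes 1 and 2 violates the label condition because the `a`-labeled edge would then reach a larger node than the `b`-labeled edge.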

    Which Regular Languages can be Efficiently Indexed?

    In the present work, we tackle the regular language indexing problem by first studying the hierarchy of $p$-sortable languages: regular languages accepted by automata of width $p$. We show that the hierarchy is strict and does not collapse, and provide (exponential in $p$) upper and lower bounds relating the minimum widths of equivalent NFAs and DFAs. Our bounds indicate the importance of being able to index NFAs, as they enable indexing regular languages with much faster and smaller indexes. Our second contribution solves precisely this problem, optimally: we devise a polynomial-time algorithm that indexes any NFA with the optimal value $p$ for its width, without explicitly computing $p$ (NP-hard to find). In particular, this implies that we can index in polynomial time the well-studied case $p=1$ (Wheeler NFAs). More generally, in polynomial time we can build an index breaking the worst-case conditional lower bound of $\Omega(|P| m)$, whenever the input NFA's width is $p \in o(\sqrt{m})$. Comment: Extended versio

    On Indexing and Compressing Finite Automata

    An index for a finite automaton is a powerful data structure that supports locating paths labeled with a query pattern, thus solving pattern matching on the underlying regular language. In this paper, we solve the long-standing problem of indexing arbitrary finite automata. Our solution consists in finding a partial co-lexicographic order of the states and proving, as in the total order case, that states reached by a given string form one interval on the partial order, thus enabling indexing. We provide a lower bound stating that such an interval requires $O(p)$ words to be represented, $p$ being the order's width (i.e. the size of its largest antichain). Indeed, we show that $p$ determines the complexity of several fundamental problems on finite automata: (i) Letting $\sigma$ be the alphabet size, we provide an encoding for NFAs using $\lceil\log \sigma\rceil + 2\lceil\log p\rceil + 2$ bits per transition and a smaller encoding for DFAs using $\lceil\log \sigma\rceil + \lceil\log p\rceil + 2$ bits per transition. This is achieved by generalizing the Burrows-Wheeler transform to arbitrary automata. (ii) We show that indexed pattern matching can be solved in $\tilde O(m\cdot p^2)$ query time on NFAs. (iii) We provide a polynomial-time algorithm to index DFAs, while matching the optimal value for $p$. On the other hand, we prove that the problem is NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset construction algorithm for NFA determinization generates an equivalent DFA of size $2^p(n-p+1)-1$, where $n$ is the number of NFA's states.
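To get a feel for how the width $p$ drives these bounds, the following sketch simply evaluates the formulas quoted in the abstract for concrete values. The function names are hypothetical, and the logarithms are assumed to be base 2, since the quantities are bit counts; the abstract itself only writes $\log$.

```python
def ceil_log2(x):
    """Ceiling of log2(x) for an integer x >= 1, using exact integer arithmetic."""
    return (x - 1).bit_length()


def nfa_bits_per_transition(sigma, p):
    # ceil(log sigma) + 2 * ceil(log p) + 2, as stated for NFAs (base 2 assumed).
    return ceil_log2(sigma) + 2 * ceil_log2(p) + 2


def dfa_bits_per_transition(sigma, p):
    # ceil(log sigma) + ceil(log p) + 2, as stated for DFAs (base 2 assumed).
    return ceil_log2(sigma) + ceil_log2(p) + 2


def powerset_dfa_size_bound(n, p):
    # Worst-case size 2^p * (n - p + 1) - 1 of the determinized automaton.
    return 2 ** p * (n - p + 1) - 1


# Example: DNA alphabet (sigma = 4) and width p = 8.
nfa_bits = nfa_bits_per_transition(4, 8)      # 2 + 2*3 + 2
dfa_bits = dfa_bits_per_transition(4, 8)      # 2 + 3 + 2
wheeler_bound = powerset_dfa_size_bound(10, 1)  # 2 * 10 - 1
```

For $p = 1$ (the Wheeler case) the $\log p$ terms vanish and the determinization bound becomes $2n - 1$, i.e. linear in the number of NFA states.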

    Space efficient merging of de Bruijn graphs and Wheeler graphs

    The merging of succinct data structures is a well-established technique for the space-efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost as the state-of-the-art algorithm for the same problem but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family which includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. We show that merging Wheeler graphs is in general a much more difficult problem, and we provide a space-efficient algorithm for the slightly simplified problem of determining whether the union graph has an ordering that satisfies the Wheeler conditions. Comment: 24 pages, 10 figures. arXiv admin note: text overlap with arXiv:1902.0288

    Subpath Queries on Compressed Graphs: A Survey

    Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately making it possible to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.
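The text indexing problem described above can be illustrated with a plain (uncompressed) suffix array: sort all suffixes of T once, then count the occurrences of a pattern with two binary searches. This is only a minimal sketch with hypothetical function names. It uses a naive construction and takes O(m log n) time per query, so it is neither the suffix tree nor any of the compressed indexes the survey covers.

```python
def build_suffix_array(text):
    """Naive construction: starting positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of `pattern` by binary-searching the sorted suffixes."""
    m = len(pattern)

    def first_index(strict):
        # First position whose length-m prefix is >= pattern (strict=False)
        # or > pattern (strict=True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)


text = "banana"
sa = build_suffix_array(text)
ana_count = count_occurrences(text, sa, "ana")
```

All suffixes starting with the pattern are contiguous in sorted order, which is the same interval property that the graph and automata generalizations aim to preserve.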

    On Locating Paths in Compressed Tries

    In this paper, we consider the problem of compressing a trie while supporting the powerful locate queries: to return the pre-order identifiers of all nodes reached by a path labeled with a given query pattern. Our result builds on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and generalizes the r-index locate machinery of Gagie et al. [SODA 2018, JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our first contribution is to propose a suitable generalization of the run-length BWT to tries. We show that this natural generalization enjoys several of the useful properties of its counterpart on strings: in particular, the transform natively supports counting occurrences of a query pattern on the trie's paths and its size $r$ captures the trie's repetitiveness and lower-bounds a natural notion of trie entropy. Our main contribution is a much deeper insight into the combinatorial structure of this object. In detail, we show that a data structure of $O(r\log n) + 2n + o(n)$ bits, where $n$ is the number of nodes, allows locating the $occ$ occurrences of a pattern of length $m$ in nearly-optimal $O(m\log\sigma + occ)$ time, where $\sigma$ is the alphabet's size. Our solution consists in sampling $O(r)$ nodes that can be used as "anchor points" during the locate process. Once the pre-order identifier of the first pattern occurrence (in co-lexicographic order) has been obtained, we show that a constant number of constant-time jumps between those anchor points leads to the identifier of the next pattern occurrence, thus enabling locating in optimal $O(1)$ time per occurrence. Comment: Improved toehold lemma running time; added more detailed proofs that take care of all border cases in the locate strategy; postprint version to appear in SODA 202
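For intuition about the parameter $r$, the sketch below computes the Burrows-Wheeler transform of a plain string and counts its equal-letter runs. This is the string counterpart that the abstract refers to, not the trie generalization the paper proposes. The function names are hypothetical, and the construction sorts all rotations naively, assuming the text ends with a unique sentinel `$` that is smaller than every other character.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via naive rotation sorting.

    Assumes `text` ends with a unique sentinel smaller than all other symbols.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number r of maximal runs of equal consecutive characters."""
    return sum(1 for _ in groupby(s))


transformed = bwt("banana$")
r = count_runs(transformed)
```

Repetitive inputs tend to produce long runs in the transform, which is why $r$ is used as a repetitiveness measure and why structures of size $O(r)$ are attractive.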

    Algorithms and Lower Bounds for Ordering Problems on Strings

    This dissertation presents novel algorithms and conditional lower bounds for a collection of string and text-compression-related problems. These results are unified under the theme of ordering constraint satisfaction. Utilizing the connections to ordering constraint satisfaction, we provide hardness results and algorithms for the following: recognizing a type of labeled graph amenable to text-indexing known as Wheeler graphs, minimizing the number of maximal unary substrings occurring in the Burrows-Wheeler Transformation of a text, minimizing the number of factors occurring in the Lyndon factorization of a text, and finding an optimal reference string for relative Lempel-Ziv encoding.

    Regular Languages meet Prefix Sorting

    Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017], which extends naturally the concept of prefix sorting to labeled graphs, we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. We first characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: the sorted prefixes of strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. We proceed by proving several results related to Wheeler automata: (i) We show that every Wheeler NFA (WNFA) with $n$ states admits an equivalent Wheeler DFA (WDFA) with at most $2n - 1 - |\Sigma|$ states ($\Sigma$ being the alphabet) that can be computed in $O(n^3)$ time. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, a $O(n \log n)$-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language of any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in $O(n \log n)$ time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time. Our contributions imply new results of independent interest. Contributions (i-iii) provide a new class of NFAs for which the minimization problem can be approximated within a constant factor in polynomial time. Contribution (iv) provides a provably minimum-size solution for the well-studied problem of indexing deterministic acyclic graphs for linear-time pattern matching queries.