9 research outputs found
Regular Languages meet Prefix Sorting
Indexing strings via prefix (or suffix) sorting is, arguably, one of the most
successful algorithmic techniques developed in the last decades. Can indexing
be extended to languages? The main contribution of this paper is to initiate
the study of the sub-class of regular languages accepted by an automaton whose
states can be prefix-sorted. Starting from the recent notion of Wheeler graph
[Gagie et al., TCS 2017], which naturally extends the concept of prefix sorting
to labeled graphs, we investigate the properties of Wheeler languages, that is,
regular languages admitting an accepting Wheeler finite automaton.
Interestingly, we characterize this family as the natural extension of regular
languages endowed with the co-lexicographic ordering: when sorted, the strings
belonging to a Wheeler language are partitioned into a finite number of
co-lexicographic intervals, each formed by elements from a single Myhill-Nerode
equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with $n$
states admits an equivalent Wheeler DFA (WDFA) with at most $2n - 1 - |\Sigma|$
states ($\Sigma$ being the alphabet) that can be computed in $O(n^3)$ time. This
is in sharp contrast with
general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper
superset of the WDFAs, an $O(n \log n)$-time online algorithm to sort acyclic
WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By
contribution (i), our algorithms can also be used to index any WNFA at the
moderate price of doubling the automaton's size. (iii) We provide a
minimization theorem that characterizes the smallest WDFA recognizing the same
language as any input WDFA. The corresponding constructive algorithm runs in
optimal linear time in the acyclic case, and in $O(n \log n)$ time in the
general case. (iv) We show how to compute the smallest WDFA equivalent to any
acyclic DFA in nearly-optimal time. Comment: added minimization theorems; uploaded submitted version; New version
with new results (W-MH theorem, linear determinization), added author:
Giovanna D'Agostin
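The co-lexicographic interval property described in this abstract can be illustrated with a small Python sketch. It sorts strings by comparing them right to left and records which state of a toy DFA each string reaches. The automaton `delta`, the helper names, and the length bound of 3 are assumptions chosen for illustration; none of them come from the paper.

```python
from itertools import product


def colex_sorted(strings):
    # Co-lexicographic order compares strings from right to left,
    # so sorting by the reversed string yields that order.
    return sorted(strings, key=lambda s: s[::-1])


def run_states(delta, start, strings):
    """States reached by the co-lex sorted strings, consecutive repeats collapsed."""
    runs = []
    for s in colex_sorted(strings):
        q = start
        for c in s:
            q = delta[(q, c)]
        if not runs or runs[-1] != q:
            runs.append(q)
    return runs


# Hypothetical toy DFA over {a, b}: state 0 is the start state,
# state 1 collects strings ending in 'a', state 2 strings ending in 'b'.
delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 2,
}

# All strings over {a, b} of length at most 3.
strings = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]
colex_runs = run_states(delta, 0, strings)
print(colex_runs)
```

In this toy case every state appears in exactly one run, so the sorted strings split into one interval per state. This is the kind of finite interval partition the abstract refers to.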
Which Regular Languages can be Efficiently Indexed?
In the present work, we tackle the regular language indexing problem by first
studying the hierarchy of $p$-sortable languages: regular languages accepted by
automata of width $p$. We show that the hierarchy is strict and does not
collapse, and provide (exponential in $p$) upper and lower bounds relating the
minimum widths of equivalent NFAs and DFAs. Our bounds indicate the importance
of being able to index NFAs, as they enable indexing regular languages with
much faster and smaller indexes. Our second contribution solves precisely this
problem, optimally: we devise a polynomial-time algorithm that indexes any NFA
with the optimal value $p$ for its width, without explicitly computing $p$
(NP-hard to find). In particular, this implies that we can index in polynomial
time the well-studied case $p = 1$ (Wheeler NFAs). More generally, in polynomial
time we can build an index breaking the worst-case conditional lower bound of
$\Omega(|P| m)$, whenever the input NFA's width is $p \in o(\sqrt{m})$. Comment: Extended versio
On Indexing and Compressing Finite Automata
An index for a finite automaton is a powerful data structure that supports
locating paths labeled with a query pattern, thus solving pattern matching on
the underlying regular language. In this paper, we solve the long-standing
problem of indexing arbitrary finite automata. Our solution consists in finding
a partial co-lexicographic order of the states and proving, as in the total
order case, that states reached by a given string form one interval on the
partial order, thus enabling indexing. We provide a lower bound stating that
such an interval requires $O(p)$ words to be represented, $p$ being the order's
width (i.e. the size of its largest antichain). Indeed, we show that $p$
determines the complexity of several fundamental problems on finite automata:
(i) Letting $\sigma$ be the alphabet size, we provide an encoding for NFAs
using $\lceil \log \sigma \rceil + 2\lceil \log p \rceil + 2$ bits per transition
and a smaller encoding for DFAs using
$\lceil \log \sigma \rceil + \lceil \log p \rceil + 2$ bits per transition. This
is achieved by generalizing the
Burrows-Wheeler transform to arbitrary automata. (ii) We show that indexed
pattern matching can be solved in $\tilde{O}(m \cdot p^2)$ query time on NFAs.
(iii) We provide a polynomial-time algorithm to index DFAs, while matching the
optimal value for $p$. On the other hand, we prove that the problem is
NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset
construction algorithm for NFA determinization generates an equivalent DFA of
size $2^p(n - p + 1) - 1$, where $n$ is the number of the NFA's states.
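For readers unfamiliar with the powerset construction mentioned in point (iv), here is a minimal Python sketch of the textbook algorithm. It is not the paper's analysis or code. The NFA encoding (a dictionary from `(state, symbol)` to a set of states, with no epsilon transitions) and the example automaton are assumptions made for illustration.

```python
from collections import deque


def determinize(nfa, start, alphabet):
    """Textbook subset construction, restricted to reachable subsets.

    nfa maps (state, symbol) -> set of successor states (no epsilon moves).
    Returns the set of reachable DFA states (frozensets) and the DFA
    transition function as a dict (subset, symbol) -> subset.
    """
    start_set = frozenset([start])
    seen = {start_set}
    dfa = {}
    queue = deque([start_set])
    while queue:
        subset = queue.popleft()
        for c in alphabet:
            target = frozenset(t for q in subset for t in nfa.get((q, c), ()))
            if not target:
                continue  # no transition: leave the DFA partial
            dfa[(subset, c)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, dfa


# Hypothetical NFA over {a, b}: state 2 is reached exactly by the strings
# whose second-to-last symbol is 'a'.
nfa = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
}
dfa_states, dfa_delta = determinize(nfa, 0, "ab")
print(len(dfa_states), "reachable DFA states")
```

On this three-state NFA the construction reaches four subsets. The abstract's point is that the width of the co-lexicographic order, and not only the number of states, governs how large this blow-up can get.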
Space efficient merging of de Bruijn graphs and Wheeler graphs
The merging of succinct data structures is a well established technique for
the space efficient construction of large succinct indexes. In the first part
of the paper we propose a new algorithm for merging succinct representations of
de Bruijn graphs. Our algorithm has the same asymptotic cost as the
state-of-the-art algorithm for the same problem but uses less than half of its
working space. A novel important feature of our algorithm, not found in any of
the existing tools, is that it can compute the Variable Order succinct
representation of the union graph within the same asymptotic time/space bounds.
In the second part of the paper we consider the more general problem of merging
succinct representations of Wheeler graphs, a recently introduced graph family
which includes as special cases de Bruijn graphs and many other known succinct
indexes based on the BWT or one of its variants. We show that Wheeler graphs
merging is in general a much more difficult problem, and we provide a space
efficient algorithm for the slightly simplified problem of determining whether
the union graph has an ordering that satisfies the Wheeler conditions. Comment: 24 pages, 10 figures. arXiv admin note: text overlap with
arXiv:1902.0288
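To make "an ordering that satisfies the Wheeler conditions" concrete, the following Python sketch checks a given node ordering against the usual definition of a Wheeler graph: nodes with in-degree zero come first, edges with a smaller label reach earlier nodes, and equally labeled edges preserve the order of their sources. This is a brute-force quadratic check of one candidate ordering. It is not the space-efficient algorithm of the paper, and it does not search for an ordering. The edge-list encoding and the example graph are assumptions.

```python
def is_wheeler_order(order, edges):
    """Check the Wheeler conditions for a candidate node ordering.

    order: list of nodes, smallest first.
    edges: list of (source, target, label) triples with comparable labels.
    """
    pos = {v: i for i, v in enumerate(order)}
    has_incoming = {v for _, v, _ in edges}

    # Nodes with in-degree zero must precede all other nodes.
    seen_incoming = False
    for v in order:
        if v in has_incoming:
            seen_incoming = True
        elif seen_incoming:
            return False

    for u, v, a in edges:
        for u2, v2, a2 in edges:
            # Smaller label implies strictly smaller target.
            if a < a2 and not pos[v] < pos[v2]:
                return False
            # Equal labels: smaller source implies target not larger.
            if a == a2 and pos[u] < pos[u2] and not pos[v] <= pos[v2]:
                return False
    return True


# Hypothetical example graph: node 1 is entered only by 'a', node 2 only by 'b'.
edges = [
    (0, 1, "a"), (0, 2, "b"),
    (1, 1, "a"), (1, 2, "b"),
    (2, 1, "a"), (2, 2, "b"),
]
print(is_wheeler_order([0, 1, 2], edges))
print(is_wheeler_order([0, 2, 1], edges))
```

The first ordering satisfies the conditions. The second fails because an edge labeled `a` would reach a later node than an edge labeled `b`.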
Subpath Queries on Compressed Graphs: A Survey
Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately making it possible to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.
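The counting query described in this abstract can be sketched with backward search over the Burrows-Wheeler transform, the mechanism behind many compressed indexes. The Python version below is deliberately naive. It builds the suffix array by sorting suffixes and answers rank queries by scanning, whereas a real compressed index would use succinct rank structures. The function names and the `$` sentinel (assumed absent from the text and smaller than every text character) are illustrative assumptions.

```python
def build_bwt_index(text):
    """Return (bwt, C) for text + '$', using a naive suffix sort."""
    text += "$"  # sentinel: assumed absent from text and smaller than all its characters
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a suffix is the character preceding it (cyclically).
    bwt = "".join(text[i - 1] for i in suffix_array)
    # C[c] = number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C


def count_occurrences(bwt, C, pattern):
    """Backward search: number of occurrences of pattern in the indexed text."""
    lo, hi = 0, len(bwt)  # half-open suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)  # rank of c in bwt[0:lo], by scanning
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


bwt, C = build_bwt_index("banana")
print(bwt, count_occurrences(bwt, C, "ana"))
```

Each pattern character costs two rank queries. With constant-time or logarithmic-time rank structures, this gives the query time proportional to the pattern's length that the abstract mentions.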
On Locating Paths in Compressed Tries
In this paper, we consider the problem of compressing a trie while supporting
the powerful \emph{locate} queries: to return the pre-order identifiers of all
nodes reached by a path labeled with a given query pattern. Our result builds
on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and
generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018,
JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our
first contribution is to propose a suitable generalization of the run-length
BWT to tries. We show that this natural generalization enjoys several of the
useful properties of its counterpart on strings: in particular, the transform
natively supports counting occurrences of a query pattern on the trie's paths
and its size $r$ captures the trie's repetitiveness and lower-bounds a natural
notion of trie entropy. Our main contribution is a much deeper insight into the
combinatorial structure of this object. In detail, we show that a data
structure of $O(r \log n) + 2n + o(n)$ bits, where $n$ is the number of nodes,
allows locating the $occ$ occurrences of a pattern of length $m$ in
nearly-optimal $O(m \log \sigma + occ)$ time, where $\sigma$ is the alphabet's
size. Our solution consists in sampling $O(r)$ nodes that can be used as
"anchor points" during the locate process. Once the pre-order identifier of the
first pattern occurrence (in co-lexicographic order) has been obtained, we show
that a constant number of constant-time jumps between those anchor points lead
to the identifier of the next pattern occurrence, thus enabling locating in
optimal time per occurrence. Comment: Improved toehold lemma running time; added more detailed proofs that
take care of all border cases in the locate strategy; postprint version to
appear in SODA 202
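The run-length encoded BWT that this abstract generalizes to tries can be illustrated on plain strings. The Python sketch below computes the BWT by sorting rotations and then groups it into runs of equal characters. The number of runs is the repetitiveness measure used by run-length BWT indexes on strings. This is only the string case, with a naive construction. The function names and the `$` sentinel are assumptions, and the trie transform of the paper is not implemented here.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform of text + '$' via sorted rotations (naive)."""
    text += "$"  # sentinel: assumed absent from text and smaller than all its characters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs."""
    return [(c, len(list(group))) for c, group in groupby(s)]


transformed = bwt("banana")
runs = run_length_encode(transformed)
print(transformed, runs, len(runs))
```

Storing the `(character, length)` pairs in place of the full transform is what makes the space depend on the number of runs and not on the text length.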
Algorithms and Lower Bounds for Ordering Problems on Strings
This dissertation presents novel algorithms and conditional lower bounds for a collection of string and text-compression-related problems. These results are unified under the theme of ordering constraint satisfaction. Utilizing the connections to ordering constraint satisfaction, we provide hardness results and algorithms for the following: recognizing a type of labeled graph amenable to text-indexing known as Wheeler graphs, minimizing the number of maximal unary substrings occurring in the Burrows-Wheeler Transform of a text, minimizing the number of factors occurring in the Lyndon factorization of a text, and finding an optimal reference string for relative Lempel-Ziv encoding.
Regular Languages meet Prefix Sorting
Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017], which naturally extends the concept of prefix sorting to labeled graphs, we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. We first characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: the sorted prefixes of strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class.
We proceed by proving several results related to Wheeler automata: (i) We show that every Wheeler NFA (WNFA) with n states admits an equivalent Wheeler DFA (WDFA) with at most 2n - 1 - |Σ| states (Σ being the alphabet) that can be computed in O(n^3) time. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, an O(n log n)-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language as any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in O(n log n) time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time. Our contributions imply new results of independent interest. Contributions (i-iii) provide a new class of NFAs for which the minimization problem can be approximated within a constant factor in polynomial time. Contribution (iv) provides a provably minimum-size solution for the well-studied problem of indexing deterministic acyclic graphs for linear-time pattern matching queries.