    Order preserving pattern matching on trees and DAGs

    The order preserving pattern matching (OPPM) problem is, given a pattern string $p$ and a text string $t$, to find all substrings of $t$ that have the same relative order as $p$. In this paper, we consider two variants of the OPPM problem where a set of text strings is given as a tree or a DAG. We show that the OPPM problem for a single pattern $p$ of length $m$ and a text tree $T$ of size $N$ can be solved in $O(m + N)$ time if the characters of $p$ are drawn from an integer alphabet of polynomial size. The time complexity becomes $O(m \log m + N)$ if the pattern $p$ is over a general ordered alphabet. We then show that the OPPM problem for a single pattern and a text DAG is NP-complete.
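To make the matching condition concrete, here is a minimal brute-force sketch for the plain string case. It is not the tree algorithm from the abstract and runs in roughly O(nm log m) time. The function names `rank_signature` and `oppm_naive` are hypothetical, and returning no matches for an empty pattern is an assumption of this sketch.

```python
def rank_signature(s):
    """Replace every element by its rank among the distinct values of s."""
    ranks = {v: i for i, v in enumerate(sorted(set(s)))}
    return tuple(ranks[v] for v in s)


def oppm_naive(pattern, text):
    """Return the start indices of all windows of text that have the same
    relative order as pattern (brute force, for illustration only)."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []  # assumption: an empty pattern yields no matches
    target = rank_signature(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if rank_signature(text[i:i + m]) == target
    ]


matches = oppm_naive([1, 3, 2], [10, 20, 15, 5, 9, 7, 30])
```

Two windows match exactly when their rank signatures are equal, which also forces ties in the pattern to appear as ties in the text.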

    On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

    Exact pattern matching in labeled graphs is the problem of searching for paths of a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This basic problem lies at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks, where the nodes of some paths must match a sequence of labels or types. We describe a simple conditional lower bound showing that, for any constant $\epsilon > 0$, an $O(|E|^{1-\epsilon}\, m)$-time or an $O(|E|\, m^{1-\epsilon})$-time algorithm for exact pattern matching on graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. The result holds even if restricted to undirected graphs of maximum degree three or to directed acyclic graphs in which the sum of indegree and outdegree is at most three. Although a conditional lower bound of this kind can be derived indirectly from previous results (Backurs and Indyk, FOCS'16), we give a direct reduction from SETH for dissemination purposes, as the result might interest researchers from several areas, such as computational biology, graph databases, and graph mining. Indeed, as approximate pattern matching on graphs can be solved in $O(|E|\, m)$ time, exact and approximate matching are equally hard (quadratic time) on graphs under the SETH assumption. In comparison, the same problems restricted to strings have linear-time and quadratic-time solutions, respectively, where the latter have a matching SETH lower bound on computing the edit distance of two strings (Backurs and Indyk, STOC'15).
    Comment: Using Lemma 12 and Lemma 13 might not be enough to prove Lemma 14. However, the proof of Lemma 14 is correct if you assume that the graph used in the reduction is a DAG. Hence, since the problem is already quadratic for a DAG and a binary alphabet, it has to be quadratic also for a general graph and a binary alphabet.
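For orientation, the quadratic upper bound can be illustrated with a simple dynamic program over pattern positions. This is a minimal sketch, not the algorithm of any cited paper. It assumes single-character node labels and a directed edge list, and it treats paths as walks that may revisit nodes. The name `graph_has_match` is hypothetical.

```python
def graph_has_match(labels, edges, pattern):
    """Decide whether some walk in the node-labeled graph spells pattern.

    labels  -- list of single-character node labels (assumption of this sketch)
    edges   -- list of directed (u, v) pairs of node indices
    pattern -- string to search for
    Runs in O(|V| + m * |E|) time for a pattern of length m.
    """
    if not pattern:
        return True
    # active[v] is True when some walk ending at v spells the prefix read so far
    active = [label == pattern[0] for label in labels]
    for ch in pattern[1:]:
        nxt = [False] * len(labels)
        for u, v in edges:
            if active[u] and labels[v] == ch:
                nxt[v] = True
        active = nxt
    return any(active)


found = graph_has_match(["a", "b", "a"], [(0, 1), (1, 2)], "aba")
```

Each of the m rounds scans every edge once, which is where the O(|E| m) product comes from. The lower bound in the abstract says that, under SETH, neither factor can be improved by a polynomial amount.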

    Extraction and integration of data from semi-structured documents into business applications

    Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond

    The problem of String Matching to Labeled Graphs (SMLG) asks to find all the paths in a labeled graph G = (V, E) whose spellings match that of an input string S ∈ Σ^m. SMLG can be solved in quadratic O(m|E|) time [Amir et al., JALG 2000], which was proven to be optimal by a recent lower bound conditioned on SETH [Equi et al., ICALP 2019]. The lower bound states that no strongly subquadratic time algorithm exists, even if restricted to directed acyclic graphs (DAGs). In this work we present the first parameterized algorithms for SMLG on DAGs. Our parameters capture the topological structure of G. All our results are derived from a generalization of the Knuth-Morris-Pratt algorithm [Park and Kim, CPM 1995] optimized to work in time proportional to the number of prefix-incomparable matches. To obtain the parameterization in the topological structure of G, we first study a special class of DAGs called funnels [Millani et al., JCO 2020] and generalize them to k-funnels and the class ST_k. We present several novel characterizations and algorithmic contributions on both funnels and their generalizations.

    Sparse Dynamic Programming on DAGs with Small Width

    The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log |V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence. 
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log |V|) time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
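To illustrate what longest increasing subsequence means once the input is a DAG, the following naive baseline may help. It is a sketch under stated assumptions and is not the O(k|E| log |V|) path-cover algorithm described above. It stores full ancestor sets, so it can need quadratic space. The name `dag_lis_naive` is hypothetical, and nodes are assumed to be numbered 0..n-1 with one comparable label each.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dag_lis_naive(labels, edges):
    """Length of the longest strictly increasing label subsequence that can be
    read along some path of a node-labeled DAG (naive baseline)."""
    n = len(labels)
    preds = {v: set() for v in range(n)}
    for u, v in edges:
        preds[v].add(u)

    ancestors = [set() for _ in range(n)]  # all proper ancestors of each node
    best = [0] * n                         # best[v]: longest chain ending at v
    for v in TopologicalSorter(preds).static_order():
        for u in preds[v]:
            ancestors[v] |= ancestors[u]
            ancestors[v].add(u)
        best[v] = 1 + max(
            (best[u] for u in ancestors[v] if labels[u] < labels[v]),
            default=0,
        )
    return max(best, default=0)


length = dag_lis_naive([1, 5, 2, 3], [(0, 1), (0, 2), (1, 3), (2, 3)])
```

The recurrence relies on reachability being transitive: any chain of nodes in which each one reaches the next lies on a single path, so it is enough to look at all ancestors with a smaller label. The framework in the abstract replaces these explicit ancestor sets with a minimum path cover and reachability queries.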

    A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression

    The model of generalized automata, introduced by Eilenberg in 1974, allows representing a regular language more concisely than conventional automata by allowing edges to be labeled not only with characters, but also with strings. Giammaresi and Montalbano introduced a notion of determinism for generalized automata [STACS 1995]. While generalized deterministic automata retain many properties of conventional deterministic automata, the uniqueness of a minimal generalized deterministic automaton is lost. In the first part of the paper, we show that the lack of uniqueness can be explained by introducing a set $\mathcal{W}(\mathcal{A})$ associated with a generalized automaton $\mathcal{A}$. By fixing $\mathcal{W}(\mathcal{A})$, we are able to derive for the first time a full Myhill-Nerode theorem for generalized automata, which contains the textbook Myhill-Nerode theorem for conventional automata as a degenerate case. In the second part of the paper, we show that the set $\mathcal{W}(\mathcal{A})$ leads to applications for pattern matching and data compression. Wheeler automata [TCS 2017, SODA 2020] are a popular class of automata that can be compactly stored using $e \log \sigma (1 + o(1)) + O(e)$ bits ($e$ being the number of edges and $\sigma$ the size of the alphabet) in such a way that pattern matching queries can be solved in $\tilde{O}(m)$ time ($m$ being the length of the pattern). In the paper, we show how to extend these results to generalized automata. More precisely, a Wheeler generalized automaton can be stored using $\mathfrak{e} \log \sigma (1 + o(1)) + O(e + rn)$ bits so that pattern matching queries can be solved in $\tilde{O}(rm)$ time, where $\mathfrak{e}$ is the total length of all edge labels, $r$ is the maximum length of an edge label, and $n$ is the number of states.
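As a small illustration of the model itself (not of the Wheeler index or the Myhill-Nerode construction), a generalized automaton can be simulated by consuming a whole edge label at each step. The sketch below assumes edges are given as hypothetical `(source, label, target)` triples. The function name `accepts` is also hypothetical.

```python
def accepts(edges, start, finals, word):
    """Simulate a (possibly nondeterministic) generalized automaton whose
    edges carry string labels, and report whether it accepts word.

    edges  -- list of (source, label, target) triples (assumed encoding)
    start  -- initial state
    finals -- set of accepting states
    """
    seen = set()
    stack = [(start, 0)]  # (current state, number of characters consumed)
    while stack:
        state, pos = stack.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(word) and state in finals:
            return True
        for src, label, dst in edges:
            # an edge is usable when its whole label occurs at position pos
            if src == state and word.startswith(label, pos):
                stack.append((dst, pos + len(label)))
    return False


example_edges = [(0, "ab", 1), (1, "c", 1), (0, "a", 2), (2, "bd", 1)]
ok = accepts(example_edges, 0, {1}, "abcc")
```

Because one edge can consume several characters, this four-edge automaton recognizes words that a conventional automaton would need extra intermediate states to spell out.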