14 research outputs found
Order preserving pattern matching on trees and DAGs
The order preserving pattern matching (OPPM) problem is, given a pattern
string and a text string , find all substrings of which have the
same relative orders as . In this paper, we consider two variants of the
OPPM problem where a set of text strings is given as a tree or a DAG. We show
that the OPPM problem for a single pattern of length and a text tree
of size can be solved in time if the characters of are
drawn from an integer alphabet of polynomial size. The time complexity becomes
if the pattern is over a general ordered alphabet. We
then show that the OPPM problem for a single pattern and a text DAG is
NP-complete
On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree
Exact pattern matching in labeled graphs is the problem of searching paths of
a graph that spell the same string as the pattern . This
basic problem can be found at the heart of more complex operations on variation
graphs in computational biology, of query operations in graph databases, and of
analysis operations in heterogeneous networks, where the nodes of some paths
must match a sequence of labels or types. We describe a simple conditional
lower bound that, for any constant , an -time or an -time algorithm for exact pattern
matching on graphs, with node labels and patterns drawn from a binary alphabet,
cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is
false. The result holds even if restricted to undirected graphs of maximum
degree three or directed acyclic graphs of maximum sum of indegree and
outdegree three. Although a conditional lower bound of this kind can be somehow
derived from previous results (Backurs and Indyk, FOCS'16), we give a direct
reduction from SETH for dissemination purposes, as the result might interest
researchers from several areas, such as computational biology, graph database,
and graph mining, as mentioned before. Indeed, as approximate pattern matching
on graphs can be solved in time, exact and approximate matching are
thus equally hard (quadratic time) on graphs under the SETH assumption. In
comparison, the same problems restricted to strings have linear time vs
quadratic time solutions, respectively, where the latter ones have a matching
SETH lower bound on computing the edit distance of two strings (Backurs and
Indyk, STOC'15).Comment: Using Lemma 12 and Lemma 13 might to be enough to prove Lemma 14.
However, the proof of Lemma 14 is correct if you assume that the graph used
in the reduction is a DAG. Hence, since the problem is already quadratic for
a DAG and a binary alphabet, it has to be quadratic also for a general graph
and a binary alphabe
Extraction and integration of data from semi-structured documents into business applications
Cover title.Includes bibliographical references (p. 8).Ph. Bonnet & S. Bressan
On the Complexity of String Matching for Graphs
Peer reviewe
Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond
The problem of String Matching to Labeled Graphs (SMLG) asks to find all the paths in a labeled graph G = (V, E) whose spellings match that of an input string S ? ?^m. SMLG can be solved in quadratic O(m|E|) time [Amir et al., JALG 2000], which was proven to be optimal by a recent lower bound conditioned on SETH [Equi et al., ICALP 2019]. The lower bound states that no strongly subquadratic time algorithm exists, even if restricted to directed acyclic graphs (DAGs).
In this work we present the first parameterized algorithms for SMLG on DAGs. Our parameters capture the topological structure of G. All our results are derived from a generalization of the Knuth-Morris-Pratt algorithm [Park and Kim, CPM 1995] optimized to work in time proportional to the number of prefix-incomparable matches.
To obtain the parameterization in the topological structure of G, we first study a special class of DAGs called funnels [Millani et al., JCO 2020] and generalize them to k-funnels and the class ST_k. We present several novel characterizations and algorithmic contributions on both funnels and their generalizations
Sparse Dynamic Programming on DAGs with Small Width
The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k vertical bar E vertical bar log vertical bar V vertical bar). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence. The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k vertical bar E vertical bar log vertical bar V vertical bar) time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.Peer reviewe
A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression
The model of generalized automata, introduced by Eilenberg in 1974, allows
representing a regular language more concisely than conventional automata by
allowing edges to be labeled not only with characters, but also strings.
Giammaresi and Montalbano introduced a notion of determinism for generalized
automata [STACS 1995]. While generalized deterministic automata retain many
properties of conventional deterministic automata, the uniqueness of a minimal
generalized deterministic automaton is lost.
In the first part of the paper, we show that the lack of uniqueness can be
explained by introducing a set associated with a generalized
automaton . By fixing , we are able to derive
for the first time a full Myhill-Nerode theorem for generalized automata, which
contains the textbook Myhill-Nerode theorem for conventional automata as a
degenerate case.
In the second part of the paper, we show that the set
leads to applications for pattern matching and data compression. Wheeler
automata [TCS 2017, SODA 2020] are a popular class of automata that can be
compactly stored using bits ( being
the number of edges, being the size of the alphabet) in such a way
that pattern matching queries can be solved in time (
being the length of the pattern). In the paper, we show how to extend these
results to generalized automata. More precisely, a Wheeler generalized automata
can be stored using bits so
that pattern matching queries can be solved in time, where is the total length of all edge labels, is the maximum
length of an edge label and is the number of states
On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree
International audienc