14,722 research outputs found
Approximate sequence alignment
Given a collection of strings and a query string, the goal of the approximate string matching is to efficiently find the strings in the collection, which are similar to the query string. In this paper, we focus on edit distance as a measure to quantify the similarity between two strings. Existing q-gram based methods use inverted lists to index the q-grams of the given string collection. These methods begin with generating the q-grams of the query string, disjoint or overlapping, and then merge the inverted lists of these q-grams. Several filtering techniques have been proposed to segment inverted lists in order to obtain relatively shorter lists, thus reducing the merging cost. The filtering technique we propose in this thesis, which is called position restricted alignment, combines well known length filtering and position filtering to provide more aggressive pruning. We then provide an indexing scheme that integrates the inverted lists storage with the proposed filter. It enables us to auto-filter the inverted lists. We evaluate the effectiveness of the proposed approach by experiments
On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree
Exact pattern matching in labeled graphs is the problem of searching paths of
a graph that spell the same string as the pattern . This
basic problem can be found at the heart of more complex operations on variation
graphs in computational biology, of query operations in graph databases, and of
analysis operations in heterogeneous networks, where the nodes of some paths
must match a sequence of labels or types. We describe a simple conditional
lower bound that, for any constant , an -time or an -time algorithm for exact pattern
matching on graphs, with node labels and patterns drawn from a binary alphabet,
cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is
false. The result holds even if restricted to undirected graphs of maximum
degree three or directed acyclic graphs of maximum sum of indegree and
outdegree three. Although a conditional lower bound of this kind can be somehow
derived from previous results (Backurs and Indyk, FOCS'16), we give a direct
reduction from SETH for dissemination purposes, as the result might interest
researchers from several areas, such as computational biology, graph database,
and graph mining, as mentioned before. Indeed, as approximate pattern matching
on graphs can be solved in time, exact and approximate matching are
thus equally hard (quadratic time) on graphs under the SETH assumption. In
comparison, the same problems restricted to strings have linear time vs
quadratic time solutions, respectively, where the latter ones have a matching
SETH lower bound on computing the edit distance of two strings (Backurs and
Indyk, STOC'15).Comment: Using Lemma 12 and Lemma 13 might to be enough to prove Lemma 14.
However, the proof of Lemma 14 is correct if you assume that the graph used
in the reduction is a DAG. Hence, since the problem is already quadratic for
a DAG and a binary alphabet, it has to be quadratic also for a general graph
and a binary alphabe
Approximating solution structure of the Weighted Sentence Alignment problem
We study the complexity of approximating solution structure of the bijective
weighted sentence alignment problem of DeNero and Klein (2008). In particular,
we consider the complexity of finding an alignment that has a significant
overlap with an optimal alignment. We discuss ways of representing the solution
for the general weighted sentence alignment as well as phrases-to-words
alignment problem, and show that computing a string which agrees with the
optimal sentence partition on more than half (plus an arbitrarily small
polynomial fraction) positions for the phrases-to-words alignment is NP-hard.
For the general weighted sentence alignment we obtain such bound from the
agreement on a little over 2/3 of the bits. Additionally, we generalize the
Hamming distance approximation of a solution structure to approximating it with
respect to the edit distance metric, obtaining similar lower bounds
- …