Approximating LCS and Alignment Distance over Multiple Sequences
We study the problem of aligning multiple sequences with the goal of finding
an alignment that either maximizes the number of aligned symbols (the longest
common subsequence (LCS)), or minimizes the number of unaligned symbols (the
alignment distance (AD)). Multiple sequence alignment is a well-studied problem
in bioinformatics and is used to identify regions of similarity among DNA, RNA,
or protein sequences to detect functional, structural, or evolutionary
relationships among them. It is known that exact computation of the LCS or AD of $m$ sequences, each of length $n$, requires $n^{m-o(1)}$ time unless the Strong Exponential Time Hypothesis is false. In this paper, we provide several results
to approximate LCS and AD of multiple sequences.
If the LCS of $m$ sequences each of length $n$ is $\lambda n$ for some $\lambda \in (0,1]$, then in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 1})$ time, we can return a common subsequence of length at least $\frac{\lambda^2 n}{2+\epsilon}$ for any arbitrary constant $\epsilon > 0$.
It is possible to approximate the AD within a factor of two in $\tilde{O}_m(n^{\lceil m/2 \rceil})$ time. However, going below a factor-2 approximation requires breaking the triangle inequality barrier, which is a major challenge in this area: no such algorithm with a running time of $O(n^{\alpha m})$ for any $\alpha < 1$ is known. If the AD is $\theta n$, then we design an algorithm that approximates the AD within a factor of $\left(2 - \frac{3\theta}{16} + \epsilon\right)$ in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 2})$ time. Thus, if $\theta$ is a constant, we get a below-two approximation in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 2})$ time. Moreover, we show that if just one out of the $m$ sequences is $(p, B)$-pseudorandom, then we get a below-2 approximation in $\tilde{O}_m(nB^{m-1} + n^{\lfloor m/2 \rfloor + 3})$ time irrespective of $\theta$.
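For context, the exact baseline behind the $n^{m-o(1)}$ conditional hardness above is the classical dynamic program over the $m$-dimensional grid of suffix lengths, which runs in $O(n^m)$ time. A minimal Python sketch of that textbook baseline (an illustration, not an algorithm from the paper):

    from functools import lru_cache

    def multi_lcs(seqs):
        # Exact LCS length of m sequences via the classical DP: O(n^m)
        # table entries, each resolved in O(m) time -- the baseline that
        # is essentially optimal under SETH per the hardness result above.
        m = len(seqs)

        @lru_cache(maxsize=None)
        def lcs(suffix_lens):
            if any(k == 0 for k in suffix_lens):
                return 0
            heads = [seqs[i][-suffix_lens[i]] for i in range(m)]
            best = 0
            if all(h == heads[0] for h in heads):
                # align the common head symbol across all m sequences
                best = 1 + lcs(tuple(k - 1 for k in suffix_lens))
            for i in range(m):
                # or leave the head of sequence i unaligned
                shorter = list(suffix_lens)
                shorter[i] -= 1
                best = max(best, lcs(tuple(shorter)))
            return best

        return lcs(tuple(len(s) for s in seqs))

    print(multi_lcs(["ABCD", "ACBD", "ABD"]))  # 3, e.g. "ABD"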
Near-Linear Time Insertion-Deletion Codes and (1+ε)-Approximating Edit Distance via Indexing
We introduce fast-decodable indexing schemes for edit distance which can be
used to speed up edit distance computations to near-linear time if one of the
strings is indexed by an indexing string $I$. In particular, for every length $n$ and every $\varepsilon > 0$, one can in near-linear time construct a string $I \in \Sigma'^n$ with $|\Sigma'| = O_{\varepsilon}(1)$, such that indexing any string $S \in \Sigma^n$, symbol-by-symbol, with $I$ results in a string $S' \in (\Sigma \times \Sigma')^n$ for which edit distance computations are easy, i.e., one can compute a $(1+\varepsilon)$-approximation of the edit distance between $S'$ and any other string in $O(n\,\mathrm{poly}(\log n))$ time.
Our indexing schemes can be used to improve the decoding complexity of
state-of-the-art error correcting codes for insertions and deletions. In
particular, they lead to near-linear time decoding algorithms for the
insertion-deletion codes of [Haeupler, Shahrasbi; STOC '17] and faster decoding algorithms for list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP '18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.
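The indexing operation itself is just symbol-by-symbol pairing over a product alphabet; a toy sketch of that step (the technical core of the result is the construction of a suitable index string $I$, which is not reproduced here):

    def index_string(S, I):
        # Pair each symbol of S with the corresponding symbol of the index
        # string I, producing S' over the product alphabet Sigma x Sigma'.
        assert len(S) == len(I)
        return list(zip(S, I))

    # index_string("banana", "010101") -> [('b','0'), ('a','1'), ('n','0'), ...]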
Theoretical analysis of edit distance algorithms: an applied perspective
Given its status as a classic problem and its importance to both
theoreticians and practitioners, edit distance provides an excellent lens
through which to understand how the theoretical analysis of algorithms impacts
practical implementations. From an applied perspective, the goals of
theoretical analysis are to predict the empirical performance of an algorithm
and to serve as a yardstick to design novel algorithms that perform well in
practice. In this paper, we systematically survey the types of theoretical
analysis techniques that have been applied to edit distance and evaluate the
extent to which each one has achieved these two goals. These techniques include
traditional worst-case analysis, worst-case analysis parametrized by edit
distance or entropy or compressibility, average-case analysis, semi-random
models, and advice-based models. We find that the track record is mixed. On the one hand, two algorithms that are widely used in practice were born out of theoretical analysis, and their empirical performance is captured well by theoretical predictions. On the other hand, none of the algorithms developed since then using theoretical analysis as a yardstick has had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
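As one concrete illustration of theory informing practice (the abstract does not name its two practice-relevant algorithms; the band trick below is a standard candidate of that kind), restricting the quadratic edit distance DP to a diagonal band of width $O(d)$ gives $O(nd)$ time when the distance is at most $d$:

    def banded_edit_distance(a, b, d):
        # Edit distance of a and b if it is at most d, else None.
        # Cells with |i - j| > d cannot lie on a path of cost <= d,
        # so only a band of width O(d) around the diagonal is filled.
        n, m = len(a), len(b)
        if abs(n - m) > d:
            return None
        INF = d + 1
        prev = {j: j for j in range(min(m, d) + 1)}  # DP row 0
        for i in range(1, n + 1):
            cur = {}
            for j in range(max(0, i - d), min(m, i + d) + 1):
                if j == 0:
                    cur[j] = i
                    continue
                cur[j] = min(
                    prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]),  # substitute/match
                    prev.get(j, INF) + 1,                           # delete from a
                    cur.get(j - 1, INF) + 1,                        # insert into a
                )
            prev = cur
        dist = prev.get(m, INF)
        return dist if dist <= d else None

    print(banded_edit_distance("kitten", "sitting", 5))  # 3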
Dynamic Time Warping in Strongly Subquadratic Time: Algorithms for the Low-Distance Regime and Approximate Evaluation
Dynamic time warping distance (DTW) is a widely used distance measure between
time series. The best known algorithms for computing DTW run in near quadratic
time, and conditional lower bounds prohibit the existence of significantly
faster algorithms. The lower bounds do not prevent a faster algorithm for the
special case in which the DTW is small, however. For an arbitrary metric space $\Sigma$ with distances normalized so that the smallest non-zero distance is one, we present an algorithm which computes $\mathrm{dtw}(x, y)$ for two strings $x$ and $y$ over $\Sigma$ in time $O(n \cdot \mathrm{dtw}(x, y))$. We also present an approximation algorithm which computes $\mathrm{dtw}(x, y)$ within a factor of $O(n^{\varepsilon})$ in time $\tilde{O}(n^{2-\varepsilon})$ for $0 < \varepsilon < 1$. The algorithm allows for the strings $x$ and $y$ to be taken over an arbitrary well-separated tree metric with logarithmic depth and at most exponential aspect ratio. Extending our techniques further, we also obtain the first approximation algorithm for edit distance to work with characters taken from an arbitrary metric space, providing an $O(n^{\varepsilon})$-approximation in time $\tilde{O}(n^{2-\varepsilon})$, with high probability. Additionally, we present a simple reduction from
computing edit distance to computing DTW. Applying our reduction to a
conditional lower bound of Bringmann and Künnemann pertaining to edit distance over $\{0, 1\}$, we obtain a conditional lower bound for computing DTW over a three-letter alphabet (with distances of zero and one). This improves on a previous result of Abboud, Backurs, and Williams. With a similar approach, we prove a reduction from computing edit distance to computing longest common subsequence (LCS) length. This means that one can recover conditional lower bounds for LCS directly from those for edit distance, which was not previously thought to be the case.
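For reference, the near-quadratic baseline improved upon in the low-distance regime is the classical DTW dynamic program; a minimal sketch over an arbitrary symbol distance (the textbook recurrence, not the paper's algorithm):

    def dtw(x, y, dist):
        # Classical O(|x| * |y|) dynamic program for dynamic time warping;
        # dist(a, b) is an arbitrary distance on symbols.
        n, m = len(x), len(y)
        INF = float("inf")
        dp = [[INF] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = dist(x[i - 1], y[j - 1]) + min(
                    dp[i - 1][j],      # stay on y[j-1] (repeat it)
                    dp[i][j - 1],      # stay on x[i-1] (repeat it)
                    dp[i - 1][j - 1],  # advance both strings
                )
        return dp[n][m]

    print(dtw([0, 0, 1, 2], [0, 1, 1, 2], lambda a, b: abs(a - b)))  # 0.0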
Hardness Amplification of Optimization Problems
In this paper, we prove a general hardness amplification scheme for optimization problems based on the technique of direct products.
We say that an optimization problem $\Pi$ is direct product feasible if it is possible to efficiently aggregate any $k$ instances of $\Pi$ and form one large instance of $\Pi$ such that, given an optimal feasible solution to the larger instance, we can efficiently find optimal feasible solutions to all the $k$ smaller instances. Given a direct product feasible optimization problem $\Pi$, our hardness amplification theorem may be informally stated as follows:
If there is a distribution $D$ over instances of $\Pi$ of size $n$ such that every randomized algorithm running in time $t(n)$ fails to solve $\Pi$ on a $1/\alpha(n)$ fraction of inputs sampled from $D$, then, assuming some relationships on $\alpha(n)$ and $t(n)$, there is a distribution $D'$ over instances of $\Pi$ of size $O(n \cdot \alpha(n))$ such that every randomized algorithm running in time $t(n)/\mathrm{poly}(\alpha(n))$ fails to solve $\Pi$ on a $99/100$ fraction of inputs sampled from $D'$.
As a consequence of the above theorem, we show hardness amplification of problems in various classes such as NP-hard problems like Max-Clique, Knapsack, and Max-SAT, problems in P such as Longest Common Subsequence, Edit Distance, and Matrix Multiplication, and even problems in TFNP such as Factoring and computing a Nash equilibrium.
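To make "direct product feasible" concrete, here is an illustration for Max-SAT (my example, not the paper's): $k$ instances are aggregated by taking the disjoint union of their clauses over variables renamed apart, and an optimal assignment to the union restricts to an optimal assignment of every part:

    def aggregate_maxsat(instances):
        # Each instance is (num_vars, clauses), clauses being lists of
        # signed 1-based literals. Renaming variables apart makes the big
        # instance decompose: an optimal assignment to it is optimal on
        # each part, so optimal solutions to all k small instances can be
        # read off one optimal solution to the large one.
        big_clauses, offset, offsets = [], 0, []
        for num_vars, clauses in instances:
            offsets.append(offset)
            for clause in clauses:
                big_clauses.append(
                    [lit + offset if lit > 0 else lit - offset for lit in clause])
            offset += num_vars
        return (offset, big_clauses), offsets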
Approximate Hamming distance in a stream
We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an $O(\epsilon^{-2} \sqrt{n} \log n)$ bit randomised one-way communication protocol.
We then go on to develop small space streaming algorithms for $(1+\epsilon)$-approximate Hamming distance which give worst-case running time guarantees per arriving symbol. (1) For binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^2 n)$ space and $O(\epsilon^{-2} \log n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm. (2) For general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^4 n)$ space and $O(\epsilon^{-4} \log^3 n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm.
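For intuition about the $(1+\epsilon)$ guarantee, the toy estimator below samples positions uniformly; it is a folklore estimator, not the paper's streaming algorithm, and it concentrates to a $(1 \pm \epsilon)$ factor only when the true distance is large relative to $n/k$:

    import random

    def approx_hamming(x, y, k, rng=random):
        # Estimate the Hamming distance of equal-length x and y from k
        # positions sampled uniformly with replacement, scaled back up by n/k.
        n = len(x)
        hits = sum(x[p] != y[p] for p in (rng.randrange(n) for _ in range(k)))
        return hits * n / k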
Hardness of Approximate Nearest Neighbor Search
We prove conditional near-quadratic running time lower bounds for approximate
Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance.
Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false,
for every $\delta > 0$ there exists a constant $\epsilon > 0$ such that computing a $(1+\epsilon)$-approximation to the Bichromatic Closest Pair requires $n^{2-\delta}$ time. In particular, this implies a near-linear query time lower bound for Approximate Nearest Neighbor search with polynomial preprocessing time.
Our reduction uses the Distributed PCP framework of [ARW'17], but obtains
improved efficiency using Algebraic Geometry (AG) codes. Efficient PCPs from AG
codes have been constructed in other settings before [BKKMS'16, BCGRS'17], but
our construction is the first to yield new hardness results.
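The quadratic baseline this lower bound addresses is exhaustive pairwise comparison; a minimal sketch, with Hamming distance as the example metric:

    def bichromatic_closest_pair(A, B, dist):
        # Exhaustive O(|A| * |B|) search for the closest red/blue pair; under
        # SETH, the result above rules out strongly subquadratic algorithms
        # even for (1 + eps)-approximation under the listed metrics.
        return min(((a, b) for a in A for b in B), key=lambda p: dist(*p))

    hamming = lambda x, y: sum(c != d for c, d in zip(x, y))
    print(bichromatic_closest_pair(["000", "011"], ["111", "001"], hamming))
    # ('000', '001'), at distance 1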