
    Approximating LCS and Alignment Distance over Multiple Sequences

    We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence, LCS) or minimizes the number of unaligned symbols (the alignment distance, AD). Multiple sequence alignment is a well-studied problem in bioinformatics and is used to identify regions of similarity among DNA, RNA, or protein sequences in order to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of the LCS or AD of $m$ sequences, each of length $n$, requires $\Theta(n^m)$ time unless the Strong Exponential Time Hypothesis is false. In this paper, we provide several results for approximating the LCS and AD of multiple sequences. If the LCS of $m$ sequences, each of length $n$, is $\lambda n$ for some $\lambda \in [0,1]$, then in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 1})$ time we can return a common subsequence of length at least $\frac{\lambda^2 n}{2+\epsilon}$ for any arbitrary constant $\epsilon > 0$. It is possible to approximate the AD within a factor of two in $\tilde{O}_m(n^{\lceil m/2 \rceil})$ time. However, going below a factor of 2 requires breaking the triangle-inequality barrier, which is a major challenge in this area; no such algorithm with a running time of $O(n^{\alpha m})$ for any $\alpha < 1$ is known. If the AD is $\theta n$, then we design an algorithm that approximates the AD within a factor of $\left(2 - \frac{3\theta}{16} + \epsilon\right)$ in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 2})$ time. Thus, if $\theta$ is a constant, we get a below-two approximation in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 2})$ time. Moreover, we show that if just one of the $m$ sequences is $(p,B)$-pseudorandom, then we get a below-two approximation in $\tilde{O}_m(n B^{m-1} + n^{\lfloor m/2 \rfloor + 3})$ time, irrespective of $\theta$.
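
    For context, the following is a minimal Python sketch (ours, not the paper's) of the classical exact dynamic program whose $\Theta(n^m)$ running time the abstract cites as the barrier; the helper name `lcs_multi` is illustrative.

        from itertools import product

        def lcs_multi(seqs):
            # Exact LCS of m sequences by DP over all tuples of prefix
            # lengths: Theta(n^m) states, so feasible only for tiny inputs.
            m = len(seqs)
            lens = [len(s) for s in seqs]
            dp = {}  # dp[idx] = LCS length of the prefixes seqs[j][:idx[j]]
            for idx in product(*(range(l + 1) for l in lens)):
                if min(idx) == 0:
                    dp[idx] = 0
                    continue
                last = [seqs[j][idx[j] - 1] for j in range(m)]
                if all(c == last[0] for c in last):
                    best = dp[tuple(i - 1 for i in idx)] + 1
                else:
                    best = 0
                for j in range(m):  # or drop the last symbol of one sequence
                    best = max(best, dp[idx[:j] + (idx[j] - 1,) + idx[j + 1:]])
                dp[idx] = best
            return dp[tuple(lens)]

        print(lcs_multi(["chimpanzee", "simpanzee", "champion"]))  # 3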

    Near-Linear Time Insertion-Deletion Codes and $(1+\varepsilon)$-Approximating Edit Distance via Indexing

    We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string $I$. In particular, for every length $n$ and every $\varepsilon > 0$, one can in near-linear time construct a string $I \in \Sigma'^n$ with $|\Sigma'| = O_{\varepsilon}(1)$, such that indexing any string $S \in \Sigma^n$, symbol-by-symbol, with $I$ results in a string $S' \in \Sigma''^n$, where $\Sigma'' = \Sigma \times \Sigma'$, for which edit distance computations are easy: one can compute a $(1+\varepsilon)$-approximation of the edit distance between $S'$ and any other string in $O(n\,\mathrm{poly}(\log n))$ time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error-correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC '17] and faster decoding algorithms for the list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP '18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.
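
    The indexing operation itself is simple to state; below is a small Python sketch of the symbol-by-symbol pairing described above. The hard part, constructing a good indexing string $I$, is the paper's contribution and is not reproduced here; the periodic toy index used below is purely illustrative.

        def index_string(S, I):
            # Pair the i-th symbol of S with the i-th symbol of I, giving
            # a string over the product alphabet Sigma x Sigma'.
            assert len(S) == len(I)
            return list(zip(S, I))

        # Toy example with a made-up periodic index; the paper's I carries
        # far stronger guarantees.
        print(index_string("banana", "012012"))
        # [('b','0'), ('a','1'), ('n','2'), ('a','0'), ('n','1'), ('a','2')]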

    Theoretical analysis of edit distance algorithms: an applied perspective

    Given its status as a classic problem and its importance to both theoreticians and practitioners, edit distance provides an excellent lens through which to understand how the theoretical analysis of algorithms impacts practical implementations. From an applied perspective, the goals of theoretical analysis are to predict the empirical performance of an algorithm and to serve as a yardstick for designing novel algorithms that perform well in practice. In this paper, we systematically survey the types of theoretical analysis techniques that have been applied to edit distance and evaluate the extent to which each one has achieved these two goals. These techniques include traditional worst-case analysis; worst-case analysis parametrized by edit distance, entropy, or compressibility; average-case analysis; semi-random models; and advice-based models. We find that the track record is mixed. On the one hand, two algorithms widely used in practice were born out of theoretical analysis, and their empirical performance is captured well by theoretical predictions. On the other hand, none of the algorithms developed since then using theoretical analysis as a yardstick has had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
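
    As a concrete instance of worst-case analysis parametrized by the edit distance (one of the analysis styles surveyed; not necessarily one of the algorithms the paper evaluates), here is a Python sketch of the classic band-doubling dynamic program, which runs in roughly $O(n \cdot d)$ time for true distance $d$.

        def edit_distance_banded(a, b):
            # Band-doubling: guess a threshold k, fill only DP cells within
            # k of the diagonal, and double k until the answer fits.
            # Total time is O((|a| + |b|) * d) for true edit distance d.
            def within(k):
                n, m = len(a), len(b)
                if abs(n - m) > k:
                    return None
                INF = k + 1
                prev = {j: j for j in range(min(m, k) + 1)}  # row i = 0
                for i in range(1, n + 1):
                    cur = {}
                    for j in range(max(0, i - k), min(m, i + k) + 1):
                        if j == 0:
                            cur[j] = i
                            continue
                        sub = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
                        ins = cur.get(j - 1, INF) + 1   # insert b[j-1]
                        dele = prev.get(j, INF) + 1     # delete a[i-1]
                        cur[j] = min(sub, ins, dele)
                    prev = cur
                d = prev.get(m, INF)
                return d if d <= k else None

            k = 1
            while True:
                d = within(k)
                if d is not None:
                    return d
                k *= 2

        print(edit_distance_banded("kitten", "sitting"))  # 3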

    Dynamic Time Warping in Strongly Subquadratic Time: Algorithms for the Low-Distance Regime and Approximate Evaluation

    Dynamic time warping distance (DTW) is a widely used distance measure between time series. The best known algorithms for computing DTW run in near-quadratic time, and conditional lower bounds prohibit the existence of significantly faster algorithms. The lower bounds do not, however, prevent a faster algorithm for the special case in which the DTW is small. For an arbitrary metric space $\Sigma$ with distances normalized so that the smallest non-zero distance is one, we present an algorithm which computes $\operatorname{dtw}(x, y)$ for two strings $x$ and $y$ over $\Sigma$ in time $O(n \cdot \operatorname{dtw}(x, y))$. We also present an approximation algorithm which computes $\operatorname{dtw}(x, y)$ within a factor of $O(n^\epsilon)$ in time $\tilde{O}(n^{2-\epsilon})$ for $0 < \epsilon < 1$. The algorithm allows the strings $x$ and $y$ to be taken over an arbitrary well-separated tree metric with logarithmic depth and at most exponential aspect ratio. Extending our techniques further, we also obtain the first approximation algorithm for edit distance that works with characters taken from an arbitrary metric space, providing an $n^\epsilon$-approximation in time $\tilde{O}(n^{2-\epsilon})$, with high probability. Additionally, we present a simple reduction from computing edit distance to computing DTW. Applying our reduction to a conditional lower bound of Bringmann and Künnemann pertaining to edit distance over $\{0, 1\}$, we obtain a conditional lower bound for computing DTW over a three-letter alphabet (with distances of zero and one). This improves on a previous result of Abboud, Backurs, and Williams. With a similar approach, we prove a reduction from computing edit distance to computing the longest common subsequence (LCS) length. This means that one can recover conditional lower bounds for LCS directly from those for edit distance, which was not previously thought to be the case.
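
    For reference, a short Python sketch (ours) of the textbook quadratic dynamic program for DTW, i.e., the near-quadratic baseline that the abstract's algorithms improve upon in the low-distance regime; `dist` can be any metric on the symbols.

        def dtw(x, y, dist):
            # Textbook O(|x| * |y|) dynamic program for DTW.
            n, m = len(x), len(y)
            INF = float("inf")
            dp = [[INF] * (m + 1) for _ in range(n + 1)]
            dp[0][0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    dp[i][j] = dist(x[i - 1], y[j - 1]) + min(
                        dp[i - 1][j],      # repeat y[j-1]
                        dp[i][j - 1],      # repeat x[i-1]
                        dp[i - 1][j - 1],  # advance both
                    )
            return dp[n][m]

        # Example over the real line with absolute-difference distances.
        print(dtw([1, 2, 3, 3], [1, 2, 2, 3], lambda a, b: abs(a - b)))  # 0.0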

    Hardness Amplification of Optimization Problems

    In this paper, we prove a general hardness amplification scheme for optimization problems based on the technique of direct products. We say that an optimization problem $\Pi$ is direct product feasible if it is possible to efficiently aggregate any $k$ instances of $\Pi$ into one large instance of $\Pi$ such that, given an optimal feasible solution to the larger instance, we can efficiently find optimal feasible solutions to all $k$ smaller instances. Given a direct product feasible optimization problem $\Pi$, our hardness amplification theorem may be informally stated as follows: if there is a distribution $D$ over instances of $\Pi$ of size $n$ such that every randomized algorithm running in time $t(n)$ fails to solve $\Pi$ on a $1/\alpha(n)$ fraction of inputs sampled from $D$, then, assuming some relationships between $\alpha(n)$ and $t(n)$, there is a distribution $D'$ over instances of $\Pi$ of size $O(n \cdot \alpha(n))$ such that every randomized algorithm running in time $t(n)/\mathrm{poly}(\alpha(n))$ fails to solve $\Pi$ on a $99/100$ fraction of inputs sampled from $D'$. As a consequence of the above theorem, we show hardness amplification for problems in various classes: NP-hard problems such as Max-Clique, Knapsack, and Max-SAT; problems in P such as Longest Common Subsequence, Edit Distance, and Matrix Multiplication; and even problems in TFNP such as Factoring and computing a Nash equilibrium.
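
    To make the direct-product notion concrete, here is a small Python illustration (our toy example, not the paper's construction) using Max-SAT: placing $k$ instances on disjoint variable sets makes the optimum of the combined instance split into the optima of the pieces.

        from itertools import product

        def count_sat(clauses, assign):
            # Number of clauses satisfied; a clause is a list of nonzero
            # ints, +v / -v for the positive / negative literal of var v.
            return sum(any((l > 0) == assign[abs(l)] for l in c)
                       for c in clauses)

        def aggregate(instances):
            # Direct-product aggregation for Max-SAT: shift each instance
            # onto a fresh block of variables, take the union of clauses.
            combined, off = [], 0
            for clauses in instances:
                combined += [[l + off if l > 0 else l - off for l in c]
                             for c in clauses]
                off += max(abs(l) for c in clauses for l in c)
            return combined, off

        inst1 = [[1, 2], [-1], [-2]]   # optimum: 2 of 3 clauses
        inst2 = [[1], [-1]]            # optimum: 1 of 2 clauses
        combined, nvars = aggregate([inst1, inst2])

        # Brute-force an optimal assignment of the combined instance; its
        # restriction to either block is optimal for that piece, because
        # the objective is a sum over disjoint variable sets.
        best = max((dict(enumerate(bits, start=1))
                    for bits in product([False, True], repeat=nvars)),
                   key=lambda a: count_sat(combined, a))
        print(count_sat(combined, best))   # 3 = 2 + 1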

    Approximate Hamming distance in a stream

    We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) if Alice and Bob both share the pattern, then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol; (2) if only Alice has the pattern, then there is an $O(\epsilon^{-2}\sqrt{n}\log n)$ bit randomised one-way communication protocol. We then go on to develop small-space streaming algorithms for $(1+\epsilon)$-approximate Hamming distance which give worst-case running time guarantees per arriving symbol: (1) for binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^{2} n)$ space and $O(\epsilon^{-2} \log n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm; (2) for general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^{4} n)$ space and $O(\epsilon^{-4} \log^{3} n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm.
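
    For orientation, a naive Python baseline (ours) for the streaming problem: after each arriving symbol, report the exact Hamming distance between the pattern and the current window. This takes $O(n)$ time per symbol and $O(n)$ space, which is what the protocols and algorithms above improve upon.

        from collections import deque

        def hamming_stream(pattern, stream):
            # Exact baseline: O(n) work per arriving symbol, O(n) space.
            n = len(pattern)
            window = deque(maxlen=n)
            for symbol in stream:
                window.append(symbol)
                if len(window) == n:
                    yield sum(p != w for p, w in zip(pattern, window))

        print(list(hamming_stream("ab", "aabbab")))  # [1, 0, 1, 2, 0]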

    Hardness of Approximate Nearest Neighbor Search

    We prove conditional near-quadratic running time lower bounds for approximate Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance. Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false, for every $\delta > 0$ there exists a constant $\epsilon > 0$ such that computing a $(1+\epsilon)$-approximation to the Bichromatic Closest Pair requires $n^{2-\delta}$ time. In particular, this implies a near-linear lower bound on the query time for Approximate Nearest Neighbor search with polynomial preprocessing time. Our reduction uses the Distributed PCP framework of [ARW'17], but obtains improved efficiency using Algebraic Geometry (AG) codes. Efficient PCPs from AG codes have been constructed in other settings before [BKKMS'16, BCGRS'17], but our construction is the first to yield new hardness results.
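
    For concreteness, a brute-force Python sketch (ours) of Bichromatic Closest Pair under Hamming distance; the lower bound above says that, under SETH, even a $(1+\epsilon)$-approximation cannot beat this quadratic behaviour by a polynomial factor.

        def closest_pair_hamming(A, B):
            # Exhaustive search over all |A| * |B| cross pairs.
            best = None
            for a in A:
                for b in B:
                    d = sum(x != y for x, y in zip(a, b))
                    if best is None or d < best[0]:
                        best = (d, a, b)
            return best

        print(closest_pair_hamming(["0011", "0101"], ["0111", "0000"]))
        # (1, '0011', '0111')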