
    Approximating LCS and Alignment Distance over Multiple Sequences

    We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence, LCS) or minimizes the number of unaligned symbols (the alignment distance, AD). Multiple sequence alignment is a well-studied problem in bioinformatics and is used to identify regions of similarity among DNA, RNA, or protein sequences in order to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of the LCS or AD of $m$ sequences, each of length $n$, requires $\Theta(n^m)$ time unless the Strong Exponential Time Hypothesis is false. In this paper, we provide several results for approximating the LCS and AD of multiple sequences. If the LCS of $m$ sequences, each of length $n$, is $\lambda n$ for some $\lambda \in [0,1]$, then in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 1})$ time we can return a common subsequence of length at least $\frac{\lambda^2 n}{2+\epsilon}$ for any arbitrary constant $\epsilon > 0$. It is possible to approximate the AD within a factor of two in $\tilde{O}_m(n^{\lceil m/2 \rceil})$ time. However, going below a factor of 2 requires breaking the triangle-inequality barrier, which is a major challenge in this area; no such algorithm with a running time of $O(n^{\alpha m})$ for any $\alpha < 1$ is known. If the AD is $\theta n$, then we design an algorithm that approximates the AD within a factor of $\left(2 - \frac{3\theta}{16} + \epsilon\right)$ in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 2})$ time. Thus, if $\theta$ is a constant, we get a below-two approximation in $\tilde{O}_m(n^{\lfloor m/2 \rfloor + 2})$ time. Moreover, we show that if just one of the $m$ sequences is $(p,B)$-pseudorandom, then we get a below-two approximation in $\tilde{O}_m(n B^{m-1} + n^{\lfloor m/2 \rfloor + 3})$ time, irrespective of $\theta$.
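
    For context, the following is a minimal Python sketch (ours, not the paper's) of the classical exact dynamic program whose $\Theta(n^m)$ running time the abstract cites as the barrier; the helper name `lcs_multi` is illustrative.

        from itertools import product

        def lcs_multi(seqs):
            # Exact LCS of m sequences by DP over all tuples of prefix
            # lengths: Theta(n^m) states, so feasible only for tiny inputs.
            m = len(seqs)
            lens = [len(s) for s in seqs]
            dp = {}  # dp[idx] = LCS length of the prefixes seqs[j][:idx[j]]
            for idx in product(*(range(l + 1) for l in lens)):
                if min(idx) == 0:
                    dp[idx] = 0
                    continue
                last = [seqs[j][idx[j] - 1] for j in range(m)]
                if all(c == last[0] for c in last):
                    best = dp[tuple(i - 1 for i in idx)] + 1
                else:
                    best = 0
                for j in range(m):  # or drop the last symbol of one sequence
                    best = max(best, dp[idx[:j] + (idx[j] - 1,) + idx[j + 1:]])
                dp[idx] = best
            return dp[tuple(lens)]

        print(lcs_multi(["chimpanzee", "simpanzee", "champion"]))  # 3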

    Near-Linear Time Insertion-Deletion Codes and $(1+\varepsilon)$-Approximating Edit Distance via Indexing

    We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string $I$. In particular, for every length $n$ and every $\varepsilon > 0$, one can in near-linear time construct a string $I \in \Sigma'^n$ with $|\Sigma'| = O_{\varepsilon}(1)$, such that indexing any string $S \in \Sigma^n$, symbol-by-symbol, with $I$ results in a string $S' \in \Sigma''^n$, where $\Sigma'' = \Sigma \times \Sigma'$, for which edit distance computations are easy: one can compute a $(1+\varepsilon)$-approximation of the edit distance between $S'$ and any other string in $O(n\,\mathrm{poly}(\log n))$ time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error-correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC '17] and faster decoding algorithms for the list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP '18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.
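
    The indexing operation itself is simple to state; below is a small Python sketch of the symbol-by-symbol pairing described above. The hard part, constructing a good indexing string $I$, is the paper's contribution and is not reproduced here; the periodic toy index used below is purely illustrative.

        def index_string(S, I):
            # Pair the i-th symbol of S with the i-th symbol of I, giving
            # a string over the product alphabet Sigma x Sigma'.
            assert len(S) == len(I)
            return list(zip(S, I))

        # Toy example with a made-up periodic index; the paper's I carries
        # far stronger guarantees.
        print(index_string("banana", "012012"))
        # [('b','0'), ('a','1'), ('n','2'), ('a','0'), ('n','1'), ('a','2')]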

    Theoretical analysis of edit distance algorithms: an applied perspective

    Given its status as a classic problem and its importance to both theoreticians and practitioners, edit distance provides an excellent lens through which to understand how the theoretical analysis of algorithms impacts practical implementations. From an applied perspective, the goals of theoretical analysis are to predict the empirical performance of an algorithm and to serve as a yardstick for designing novel algorithms that perform well in practice. In this paper, we systematically survey the types of theoretical analysis techniques that have been applied to edit distance and evaluate the extent to which each one has achieved these two goals. These techniques include traditional worst-case analysis; worst-case analysis parametrized by edit distance, entropy, or compressibility; average-case analysis; semi-random models; and advice-based models. We find that the track record is mixed. On the one hand, two algorithms widely used in practice were born out of theoretical analysis, and their empirical performance is captured well by theoretical predictions. On the other hand, none of the algorithms developed since then using theoretical analysis as a yardstick has had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
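
    As a concrete instance of worst-case analysis parametrized by the edit distance (one of the analysis styles surveyed; not necessarily one of the algorithms the paper evaluates), here is a Python sketch of the classic band-doubling dynamic program, which runs in roughly $O(n \cdot d)$ time for true distance $d$.

        def edit_distance_banded(a, b):
            # Band-doubling: guess a threshold k, fill only DP cells within
            # k of the diagonal, and double k until the answer fits.
            # Total time is O((|a| + |b|) * d) for true edit distance d.
            def within(k):
                n, m = len(a), len(b)
                if abs(n - m) > k:
                    return None
                INF = k + 1
                prev = {j: j for j in range(min(m, k) + 1)}  # row i = 0
                for i in range(1, n + 1):
                    cur = {}
                    for j in range(max(0, i - k), min(m, i + k) + 1):
                        if j == 0:
                            cur[j] = i
                            continue
                        sub = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
                        ins = cur.get(j - 1, INF) + 1   # insert b[j-1]
                        dele = prev.get(j, INF) + 1     # delete a[i-1]
                        cur[j] = min(sub, ins, dele)
                    prev = cur
                d = prev.get(m, INF)
                return d if d <= k else None

            k = 1
            while True:
                d = within(k)
                if d is not None:
                    return d
                k *= 2

        print(edit_distance_banded("kitten", "sitting"))  # 3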

    Dynamic Time Warping in Strongly Subquadratic Time: Algorithms for the Low-Distance Regime and Approximate Evaluation

    Dynamic time warping distance (DTW) is a widely used distance measure between time series. The best known algorithms for computing DTW run in near-quadratic time, and conditional lower bounds prohibit the existence of significantly faster algorithms. The lower bounds do not, however, prevent a faster algorithm for the special case in which the DTW is small. For an arbitrary metric space $\Sigma$ with distances normalized so that the smallest non-zero distance is one, we present an algorithm which computes $\operatorname{dtw}(x, y)$ for two strings $x$ and $y$ over $\Sigma$ in time $O(n \cdot \operatorname{dtw}(x, y))$. We also present an approximation algorithm which computes $\operatorname{dtw}(x, y)$ within a factor of $O(n^\epsilon)$ in time $\tilde{O}(n^{2-\epsilon})$ for $0 < \epsilon < 1$. The algorithm allows the strings $x$ and $y$ to be taken over an arbitrary well-separated tree metric with logarithmic depth and at most exponential aspect ratio. Extending our techniques further, we also obtain the first approximation algorithm for edit distance that works with characters taken from an arbitrary metric space, providing an $n^\epsilon$-approximation in time $\tilde{O}(n^{2-\epsilon})$, with high probability. Additionally, we present a simple reduction from computing edit distance to computing DTW. Applying our reduction to a conditional lower bound of Bringmann and Künnemann pertaining to edit distance over $\{0, 1\}$, we obtain a conditional lower bound for computing DTW over a three-letter alphabet (with distances of zero and one). This improves on a previous result of Abboud, Backurs, and Williams. With a similar approach, we prove a reduction from computing edit distance to computing the longest common subsequence (LCS) length. This means that one can recover conditional lower bounds for LCS directly from those for edit distance, which was not previously thought to be the case.
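
    For reference, a short Python sketch (ours) of the textbook quadratic dynamic program for DTW, i.e., the near-quadratic baseline that the abstract's algorithms improve upon in the low-distance regime; `dist` can be any metric on the symbols.

        def dtw(x, y, dist):
            # Textbook O(|x| * |y|) dynamic program for DTW.
            n, m = len(x), len(y)
            INF = float("inf")
            dp = [[INF] * (m + 1) for _ in range(n + 1)]
            dp[0][0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    dp[i][j] = dist(x[i - 1], y[j - 1]) + min(
                        dp[i - 1][j],      # repeat y[j-1]
                        dp[i][j - 1],      # repeat x[i-1]
                        dp[i - 1][j - 1],  # advance both
                    )
            return dp[n][m]

        # Example over the real line with absolute-difference distances.
        print(dtw([1, 2, 3, 3], [1, 2, 2, 3], lambda a, b: abs(a - b)))  # 0.0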

    Hardness Amplification of Optimization Problems

    In this paper, we prove a general hardness amplification scheme for optimization problems based on the technique of direct products. We say that an optimization problem $\Pi$ is direct product feasible if it is possible to efficiently aggregate any $k$ instances of $\Pi$ into one large instance of $\Pi$ such that, given an optimal feasible solution to the larger instance, we can efficiently find optimal feasible solutions to all $k$ smaller instances. Given a direct product feasible optimization problem $\Pi$, our hardness amplification theorem may be informally stated as follows: if there is a distribution $D$ over instances of $\Pi$ of size $n$ such that every randomized algorithm running in time $t(n)$ fails to solve $\Pi$ on a $1/\alpha(n)$ fraction of inputs sampled from $D$, then, assuming some relationships between $\alpha(n)$ and $t(n)$, there is a distribution $D'$ over instances of $\Pi$ of size $O(n \cdot \alpha(n))$ such that every randomized algorithm running in time $t(n)/\mathrm{poly}(\alpha(n))$ fails to solve $\Pi$ on a $99/100$ fraction of inputs sampled from $D'$. As a consequence of the above theorem, we show hardness amplification for problems in various classes: NP-hard problems such as Max-Clique, Knapsack, and Max-SAT; problems in P such as Longest Common Subsequence, Edit Distance, and Matrix Multiplication; and even problems in TFNP such as Factoring and computing a Nash equilibrium.
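
    To make the direct-product notion concrete, here is a small Python illustration (our toy example, not the paper's construction) using Max-SAT: placing $k$ instances on disjoint variable sets makes the optimum of the combined instance split into the optima of the pieces.

        from itertools import product

        def count_sat(clauses, assign):
            # Number of clauses satisfied; a clause is a list of nonzero
            # ints, +v / -v for the positive / negative literal of var v.
            return sum(any((l > 0) == assign[abs(l)] for l in c)
                       for c in clauses)

        def aggregate(instances):
            # Direct-product aggregation for Max-SAT: shift each instance
            # onto a fresh block of variables, take the union of clauses.
            combined, off = [], 0
            for clauses in instances:
                combined += [[l + off if l > 0 else l - off for l in c]
                             for c in clauses]
                off += max(abs(l) for c in clauses for l in c)
            return combined, off

        inst1 = [[1, 2], [-1], [-2]]   # optimum: 2 of 3 clauses
        inst2 = [[1], [-1]]            # optimum: 1 of 2 clauses
        combined, nvars = aggregate([inst1, inst2])

        # Brute-force an optimal assignment of the combined instance; its
        # restriction to either block is optimal for that piece, because
        # the objective is a sum over disjoint variable sets.
        best = max((dict(enumerate(bits, start=1))
                    for bits in product([False, True], repeat=nvars)),
                   key=lambda a: count_sat(combined, a))
        print(count_sat(combined, best))   # 3 = 2 + 1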

    Approximate Hamming distance in a stream

    We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) if Alice and Bob both share the pattern, then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol; (2) if only Alice has the pattern, then there is an $O(\epsilon^{-2}\sqrt{n}\log n)$ bit randomised one-way communication protocol. We then go on to develop small-space streaming algorithms for $(1+\epsilon)$-approximate Hamming distance which give worst-case running time guarantees per arriving symbol: (1) for binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^{2} n)$ space and $O(\epsilon^{-2} \log n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm; (2) for general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^{4} n)$ space and $O(\epsilon^{-4} \log^{3} n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm.
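
    For orientation, a naive Python baseline (ours) for the streaming problem: after each arriving symbol, report the exact Hamming distance between the pattern and the current window. This takes $O(n)$ time per symbol and $O(n)$ space, which is what the protocols and algorithms above improve upon.

        from collections import deque

        def hamming_stream(pattern, stream):
            # Exact baseline: O(n) work per arriving symbol, O(n) space.
            n = len(pattern)
            window = deque(maxlen=n)
            for symbol in stream:
                window.append(symbol)
                if len(window) == n:
                    yield sum(p != w for p, w in zip(pattern, window))

        print(list(hamming_stream("ab", "aabbab")))  # [1, 0, 1, 2, 0]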

    Hardness of Approximate Nearest Neighbor Search

    We prove conditional near-quadratic running time lower bounds for approximate Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance. Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false, for every $\delta > 0$ there exists a constant $\epsilon > 0$ such that computing a $(1+\epsilon)$-approximation to the Bichromatic Closest Pair requires $n^{2-\delta}$ time. In particular, this implies a near-linear lower bound on the query time for Approximate Nearest Neighbor search with polynomial preprocessing time. Our reduction uses the Distributed PCP framework of [ARW'17], but obtains improved efficiency using Algebraic Geometry (AG) codes. Efficient PCPs from AG codes have been constructed in other settings before [BKKMS'16, BCGRS'17], but our construction is the first to yield new hardness results.
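
    For concreteness, a brute-force Python sketch (ours) of Bichromatic Closest Pair under Hamming distance; the lower bound above says that, under SETH, even a $(1+\epsilon)$-approximation cannot beat this quadratic behaviour by a polynomial factor.

        def closest_pair_hamming(A, B):
            # Exhaustive search over all |A| * |B| cross pairs.
            best = None
            for a in A:
                for b in B:
                    d = sum(x != y for x, y in zip(a, b))
                    if best is None or d < best[0]:
                        best = (d, a, b)
            return best

        print(closest_pair_hamming(["0011", "0101"], ["0111", "0000"]))
        # (1, '0011', '0111')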