1,799 research outputs found

    Small space and streaming pattern matching with k edits

    In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers $n \ge k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$ and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhu [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
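    To make the problem statement concrete, the following is a minimal sketch of the classical dynamic-programming baseline for approximate pattern matching with $k$ edits. It is not the streaming algorithm of the abstract: it keeps the whole pattern in memory and spends $O(m)$ time per text character. The function name `approx_matches` and the convention of reporting 1-based end positions are assumptions made for illustration.

```python
def approx_matches(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Report (end, dist) for each text position `end` (1-based) such that
    some substring of `text` ending there is within edit distance k of
    `pattern`; `dist` is the smallest such distance.

    Baseline only: O(m) space and O(m) time per text character.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current position (empty text so far).
    col = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        # col[0] stays 0: a match may start at any text position.
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # skipped pattern character
            prev_diag = old
        if col[m] <= k:
            hits.append((end, col[m]))
    return hits
```

    For example, `approx_matches("abc", "abd", 1)` reports matches ending at positions 2 and 3, each at distance 1.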

    Edit Distance: Sketching, Streaming and Document Exchange

    We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space. Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016).
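    The "compute the distance if it is at most $K$, otherwise report an error" contract can be illustrated with a threshold-bounded dynamic program. This is only a sketch of that contract under the assumption that both strings are fully available; it involves no sketching or communication, and the name `bounded_edit_distance` is hypothetical. It returns `None` in place of "error".

```python
def bounded_edit_distance(x: str, y: str, k: int):
    """Return the edit distance of x and y if it is at most k, else None.

    Only cells with |i - j| <= k are evaluated, since any cell outside
    this band already has distance greater than k.
    """
    if abs(len(x) - len(y)) > k:
        return None
    inf = k + 1  # stands for "larger than k"
    prev = [j if j <= k else inf for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [inf] * (len(y) + 1)
        if i <= k:
            cur[0] = i  # delete the first i characters of x
        lo, hi = max(1, i - k), min(len(y), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from x
                         cur[j - 1] + 1,      # insertion into x
                         inf)                 # cap values above k
        prev = cur
    d = prev[len(y)]
    return d if d <= k else None
```

    For simplicity each row is allocated at full length, so this sketch does not achieve the $O(K)$ working space that a tighter banded implementation would.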

    Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

    We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language $L$, we are given a string $T$ of length $n$, and the task is to compute the minimal distance to $L$ from every prefix of $T$. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold $k$. In this work, our contribution is twofold. First, we show streaming algorithms, which access the input string $T$ only through a single left-to-right scan. Both for palindromes and squares, our algorithms use $O(k \cdot \mathrm{poly}\log n)$ space and time per character in the Hamming-distance case and $O(k^2 \cdot \mathrm{poly}\log n)$ space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in $n$. Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of $T$. Both for palindromes and squares, our algorithms use $O(k \cdot \mathrm{poly}\log n)$ space and time per character in the Hamming-distance case and $O(k^4 \cdot \mathrm{poly}\log n)$ space and amortised time per character in the edit-distance case. Comment: Accepted to ISAAC'2
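    As a small illustration of the problem (not of the small-space algorithms), the Hamming distance from a string to the nearest palindrome equals the number of mirrored character pairs that disagree. The brute-force sketch below recomputes this for every prefix in quadratic total time; the name `prefix_palindrome_hamming` and the choice to report distances up to and including $k$ are assumptions made for illustration.

```python
def prefix_palindrome_hamming(text: str, k: int) -> list:
    """For each prefix of `text`, return its Hamming distance to the
    nearest palindrome if that distance is at most k, else None.

    Brute force: O(n^2) time overall and full access to the text.
    """
    out = []
    for end in range(1, len(text) + 1):
        # Each mismatching mirrored pair needs exactly one substitution.
        d = sum(1 for i in range(end // 2)
                if text[i] != text[end - 1 - i])
        out.append(d if d <= k else None)
    return out
```

    For instance, with `text = "abcd"` and `k = 1`, the last prefix has two mismatching pairs and is therefore reported as `None`.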

    Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

    Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess $n$ strings of length $d$, to quickly answer queries $q$ of the form: if there is a database string within edit distance $r$ of $q$, return a database string within edit distance $cr$ of $q$. Previous approaches to this problem either rely on very large (superconstant) approximation ratios $c$, or very small search radii $r$. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all $n$ strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time $\tilde{O}(d 3^r n^{1/c})$. The best known practical results require $c \gg r$ to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by $24^r$. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.

    File Updates Under Random/Arbitrary Insertions And Deletions

    A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal. Comment: The paper is an extended version of our paper to appear at ITW 201