17 research outputs found

    Faster algorithms for 1-mappability of a sequence

    Full text link
    In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if m = {\Omega}(log n/ log {\sigma}), where {\sigma} is the alphabet size

    Longest Common Prefixes with kk-Errors and Applications

    Full text link
    Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length nn over a constant-sized alphabet that occurs elsewhere in the string with kk-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant kk and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in O(nlogknloglogn)\mathcal{O}(n \log^k n \log \log n) time on average using O(n)\mathcal{O}(n) space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere

    Longest property-preserved common factor

    Get PDF
    In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider two fundamental string properties: square-free factors and periodic factors under two different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer 1 < k’ ≤ k and we are asked to find a longest periodic factor common to at least k’ strings. We present linear-time solutions for both settings. We anticipate that our paradigm can be extended to other string properties

    ALFRED: A Practical Method for Alignment-Free Distance Computation

    No full text
    Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity O(n log(k) n), the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performances with an expected run-time of O(n logk n). By applying our program to phylogenetic inference as a case study, we find that our program facilitates to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ programming language and the source code is freely available online

    Compressing Dictionary Matching Index via Sparsification Technique

    No full text

    Algorithmic Framework For Approximate Matching Under Bounded Edits With Applications To Sequence Analysis

    No full text
    We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the required number of such edited suffixes to enable the design of efficient algorithms. For a total input size of n, our framework limits the number of generated edited suffixes to no more than a factor of O(log kn) of the input size (for any constant k), and restricts the algorithm to linear space usage by overlapping the generation and processing of edited suffixes. Our framework improves the best known upper bound of n2k1.5/2Ω(logn/k) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015] to yield the first strictly sub-quadratic time algorithm that runs in O(nlog kn) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool, and may inspire the design of practical heuristics and software
    corecore