195 research outputs found

    Linear-Time Algorithm for Long LCF with k Mismatches

    In the Longest Common Factor with k Mismatches (LCF_k) problem, we are given two strings X and Y of total length n, and we are asked to find a pair of maximal-length factors, one of X and the other of Y, such that their Hamming distance is at most k. Thankachan et al. [Thankachan et al. 2016] show that this problem can be solved in O(n log^k n) time and O(n) space for constant k. We consider the LCF_k(l) problem in which we assume that the sought factors have length at least l. We use difference covers to reduce the LCF_k(l) problem with l=Omega(log^{2k+2}n) to a task involving m=O(n/log^{k+1}n) synchronized factors. The latter can be solved in O(m log^{k+1}m) time, which results in a linear-time algorithm for LCF_k(l) with l=Omega(log^{2k+2}n). In general, our solution to the LCF_k(l) problem for arbitrary l takes O(n + n log^{k+1} n/sqrt{l}) time
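
    To make the problem statement concrete, here is a minimal quadratic-time baseline for LCF_k. It is not the difference-cover algorithm from the abstract. It slides a window with at most k mismatches along every diagonal of the alignment grid. The function name and the return convention are assumptions made for this illustration.

```python
def lcf_k_mismatches(x: str, y: str, k: int) -> tuple[int, int, int]:
    """Brute-force LCF_k baseline (illustrative sketch, O(|x| * |y|) time).

    Returns (length, start_in_x, start_in_y) for a longest pair of
    equal-length factors whose Hamming distance is at most k.
    """
    n, m = len(x), len(y)
    best = (0, 0, 0)
    # A diagonal d = i - j aligns x[i] with y[j].
    for d in range(-(m - 1), n):
        i0, j0 = max(d, 0), max(-d, 0)
        length = min(n - i0, m - j0)
        left = 0
        mismatches = 0
        for right in range(length):
            if x[i0 + right] != y[j0 + right]:
                mismatches += 1
            # Shrink from the left until the window fits the budget again.
            while mismatches > k:
                if x[i0 + left] != y[j0 + left]:
                    mismatches -= 1
                left += 1
            window = right - left + 1
            if window > best[0]:
                best = (window, i0 + left, j0 + left)
    return best


if __name__ == "__main__":
    print(lcf_k_mismatches("abcde", "xbcdy", 1))
```

    A baseline like this is mainly useful as a reference when testing a faster implementation on short inputs.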

    Efficient Computation of Sequence Mappability

    Sequence mappability is an important task in genome re-sequencing. In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, our goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in $\mathcal{O}(n \min\{m^k,\log^{k+1} n\})$ time and $\mathcal{O}(n)$ space for $k=\mathcal{O}(1)$. It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show $\mathcal{O}(n^2)$-time algorithms to compute all results for a fixed $m$ and all $k=0,\ldots,m$ or a fixed $k$ and all $m=k,\ldots,n-1$. Finally we show that the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time for $k,m = \Theta(\log n)$ unless the Strong Exponential Time Hypothesis fails. Comment: Accepted to SPIRE 201
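
    The definition of the mappability table can be illustrated with a simple quadratic-time sketch for one fixed pair (k, m). It is not the algorithm of the paper. For each shift between two start positions it maintains the mismatch count of the current pair of windows incrementally. The function name and the 0-based indexing are assumptions made for this illustration.

```python
def mappability(t: str, m: int, k: int) -> list[int]:
    """Naive (k, m)-mappability table (illustrative sketch, O(n^2) time).

    table[i] is the number of start positions j != i such that
    t[i:i+m] and t[j:j+m] differ in at most k positions.
    """
    starts = len(t) - m + 1  # number of length-m substrings
    if m <= 0 or starts <= 0:
        return []
    table = [0] * starts
    for shift in range(1, starts):
        # Mismatches between t[0:m] and t[shift:shift+m].
        mism = sum(t[p] != t[p + shift] for p in range(m))
        i = 0
        while True:
            if mism <= k:
                # The pair (i, i + shift) counts for both positions.
                table[i] += 1
                table[i + shift] += 1
            if i + shift + 1 >= starts:
                break
            # Slide both windows one position to the right.
            mism -= int(t[i] != t[i + shift])
            mism += int(t[i + m] != t[i + m + shift])
            i += 1
    return table


if __name__ == "__main__":
    print(mappability("abcabd", 3, 1))
```

    Counting each unordered pair once and crediting both positions is one simple way to avoid double counting in this brute-force setting.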

    MissMax: Alignment-free sequence comparison with mismatches through filtering and heuristics

    BACKGROUND: Measuring sequence similarity is central for many problems in bioinformatics. In several contexts alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies attempted to bridge the accuracy gap with the introduction of approximate matches in the definition of composition-based similarity measures. RESULTS: In this work we present MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for the computation of two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a “relaxed” version of MissMax that does not guarantee the exact solution, but it is faster in practice and still very precise
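
    The per-suffix statistics described above can be sketched with a brute-force routine. This is not MissMax itself, whose filtering and heuristics the abstract does not detail. For each start position in x it records the longest prefix of that suffix matching some substring of y with at most k mismatches. The function names are hypothetical, and the plain mean used for the average is a simplifying assumption.

```python
def longest_match_per_suffix(x: str, y: str, k: int) -> list[int]:
    """Brute-force per-suffix statistics (illustrative sketch, O(|x| * |y|)).

    stats[i] is the largest length L such that x[i:i+L] matches some
    y[j:j+L] with at most k mismatches.
    """
    n, m = len(x), len(y)
    stats = [0] * n
    for d in range(-(m - 1), n):  # diagonal d aligns x[i] with y[i - d]
        i0, j0 = max(d, 0), max(-d, 0)
        length = min(n - i0, m - j0)
        right = 0        # exclusive end of the current window
        mismatches = 0   # mismatches inside [left, right)
        for left in range(length):
            # Extend the window as far as the mismatch budget allows.
            while right < length:
                cost = int(x[i0 + right] != y[j0 + right])
                if mismatches + cost > k:
                    break
                mismatches += cost
                right += 1
            stats[i0 + left] = max(stats[i0 + left], right - left)
            if right > left:
                # Drop position `left` before advancing the left end.
                mismatches -= int(x[i0 + left] != y[j0 + left])
            else:
                right = left + 1  # window was empty; keep right >= left
    return stats


def average_common_substring(x: str, y: str, k: int) -> float:
    """Plain mean of the per-suffix statistics (assumed normalisation)."""
    stats = longest_match_per_suffix(x, y, k)
    return sum(stats) / len(stats) if stats else 0.0


if __name__ == "__main__":
    print(longest_match_per_suffix("abcde", "xbcdy", 1))
```

    The maximum of the returned list gives the longest common substring with k mismatches, so both similarity measures mentioned above follow from the same array.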

    Approximating Longest Common Substring with k mismatches: Theory and Practice

    In the problem of the longest common substring with k mismatches we are given two strings X, Y and must find the maximal length ℓ such that there is a length-ℓ substring of X and a length-ℓ substring of Y that differ in at most k positions. The length ℓ can be used as a robust measure of similarity between X, Y. In this work, we develop new approximation algorithms for computing ℓ that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound

    CAD Tools for DNA Micro-Array Design, Manufacture and Application

    Motivation: As the human genome project progresses and some microbial and eukaryotic genomes are recognized, numerous biotechnological processes have recently attracted an increasing number of biologists, bioengineers and computer scientists. Biotechnological processes profoundly involve production and analysis of high-throughput experimental data. Numerous sequence libraries of DNA and protein structures of a large number of micro-organisms and a variety of other databases related to biology and chemistry are available. For example, microarray technology, a novel biotechnology, promises to monitor the whole genome at once, so that researchers can study the whole genome on the global level and have a better picture of the expressions among millions of genes simultaneously. Today, it is widely used in many fields: disease diagnosis, gene classification, gene regulatory networks, and drug discovery. For example, designing organism-specific microarrays and analyzing experimental data require combining heterogeneous computational tools that usually differ in the data format, such as GeneMark for ORF extraction, Promide for DNA probe selection, Chip for probe placement on the microarray chip, BLAST to compare sequences, MEGA for phylogenetic analysis, and ClustalX for multiple alignments. Solution: Surprisingly enough, despite huge research efforts invested in DNA array applications, very few works are devoted to computer-aided optimization of DNA array design and manufacturing. Current design practices are dominated by ad-hoc heuristics incorporated in proprietary tools with unknown suboptimality. This will soon become a bottleneck for the new generation of high-density arrays, such as the ones currently being designed at Perlegen [109]. The goal of the already accomplished research was to develop highly scalable tools, with predictable runtime and quality, for cost-effective, computer-aided design and manufacturing of DNA probe arrays. We illustrate the utility of our approach by taking a concrete example of combining the design tools of microarray technology for Herpes B virus DNA data

    Semantizing Complex 3D Scenes using Constrained Attribute Grammars

    We propose a new approach to automatically semantize complex objects in a 3D scene. For this, we define an expressive formalism combining the power of both attribute grammars and constraints. It offers a practical conceptual interface, which is crucial to write large maintainable specifications. As recursion is inadequate to express large collections of items, we introduce maximal operators, which are essential to reduce the parsing search space. Given a grammar in this formalism and a 3D scene, we show how to automatically compute a shared parse forest of all interpretations -- in practice, only a few, thanks to relevant constraints. We evaluate this technique for building model semantization using CAD model examples as well as photogrammetric and simulated LiDAR data

    Longest common substring made fully dynamic

    Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n)-time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to this problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in Õ(n^{2/3}) time, after Õ(n)-time and space preprocessing. This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, they presented an Õ(n)-sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in Õ(1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings. We show that the techniques we develop can be applied to obtain fully dynamic algorithms for all of these variants. The only previously known sublinear-time dynamic algorithms for problems on strings were for maintaining a dynamic collection of strings for comparison queries and for pattern matching, with the most recent advances made by Gawrychowski et al. [SODA 2018] and by Clifford et al. [STACS 2018]. As an intermediate problem we consider computing the solution for a string with a given set of k edits, which leads us, in particular, to answering internal queries on a string. The input to such a query is specified by a substring (or substrings) of a given string. Data structures for answering internal string queries that were proposed by Kociumaka et al. [SODA 2015] and by Gagie et al. [CCCG 2013] are used, along with new ones, based on ingredients such as the suffix tree, heavy-path decomposition, orthogonal range queries, difference covers, and string periodicity
