research

Efficient Computation of Sequence Mappability

Abstract

Sequence mappability is an important task in genome re-sequencing. In the (k,m)(k,m)-mappability problem, for a given sequence TT of length nn, our goal is to compute a table whose iith entry is the number of indices jij \ne i such that length-mm substrings of TT starting at positions ii and jj have at most kk mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k=1k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(nmin{mk,logk+1n})\mathcal{O}(n \min\{m^k,\log^{k+1} n\}) time and O(n)\mathcal{O}(n) space for k=O(1)k=\mathcal{O}(1). It requires a carefu l adaptation of the technique of Cole et al.~[STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n2)\mathcal{O}(n^2)-time algorithms to compute all results for a fixed mm and all k=0,,mk=0,\ldots,m or a fixed kk and all m=k,,n1m=k,\ldots,n-1. Finally we show that the (k,m)(k,m)-mappability problem cannot be solved in strongly subquadratic time for k,m=Θ(logn)k,m = \Theta(\log n) unless the Strong Exponential Time Hypothesis fails.Comment: Accepted to SPIRE 201

    Similar works