Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n over a constant-sized alphabet that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in O(n log^k n log log n) time on average using O(n) space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.
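
To make the problem statement concrete, the following is a minimal brute-force sketch for the Hamming distance variant only. It is not the algorithm from the paper: it takes quadratic-or-worse time and only illustrates what is being computed. The function name `longest_k_mismatch_prefixes` is hypothetical, and the sketch assumes that "elsewhere" means any other starting position, with overlapping occurrences allowed.

```python
def longest_k_mismatch_prefixes(s: str, k: int) -> list[int]:
    """For each suffix s[i:], return the length of its longest prefix that
    also occurs at some other position j != i with at most k mismatches.

    Naive illustration under the Hamming distance model (assumption:
    overlapping occurrences are allowed).
    """
    n = len(s)
    result = [0] * n
    for i in range(n):
        best = 0
        for j in range(n):
            if j == i:
                continue
            mismatches = 0
            length = 0
            # Extend the alignment of s[i:] against s[j:] until the
            # (k+1)-th mismatch or until either suffix runs out.
            while i + length < n and j + length < n:
                if s[i + length] != s[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            best = max(best, length)
        result[i] = best
    return result


if __name__ == "__main__":
    print(longest_k_mismatch_prefixes("aaab", 0))
    print(longest_k_mismatch_prefixes("aaab", 1))
```

For the string "aaab", allowing one mismatch lengthens the answer for the first suffix from 2 to 3, because "aaa" aligns against "aab" with a single mismatch.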

arXiv.org e-Print Archive

Crossref

King's Research Portal

The k-Mappability Problem Revisited

Authors: Amihood Amir, Itai Boneh, Eitan Kondratovsky
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Publication date: 01/01/2021

The k-mappability problem has two integer parameters m and k. For every subword of size m in a text S, we wish to report the number of indices in S in which the word occurs with at most k mismatches. The problem was recently tackled by Alzamel et al. [Mai Alzamel et al., 2018]. For a text with constant alphabet Σ and k ∈ O(1), they present an algorithm with linear space and O(n log^{k+1} n) time. For the case in which k = 1 and a constant size alphabet, a faster algorithm with linear space and O(n log(n) log log(n)) time was presented in [Mai Alzamel et al., 2020]. In this work, we enhance the techniques of [Mai Alzamel et al., 2020] to obtain an algorithm with linear space and O(n log(n)) time for k = 1. Our algorithm removes the constraint of the alphabet being of constant size. We also present linear algorithms for the case of k = 1, |Σ| ∈ O(1) and m = Ω(√n).

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Faster algorithms for 1-mappability of a sequence

Authors: Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, Jakub Radoszewski, Wing-Kin Sung
Publication venue: 'Elsevier BV'
Publication date: 23/05/2019

In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. There exists an algorithm to solve this problem for k = 1 requiring time O(mn log n / log log n) using space O(n). Here we present two new algorithms that require worst-case time O(mn) and O(n log n log log n), respectively, and space O(n), thus greatly improving the previous result. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m = Ω(log_σ n). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m = Ω(k log_σ n), assuming that the letters are independent and uniformly distributed random variables. Finally, we provide an experimental evaluation of our average-case algorithm demonstrating its competitiveness to the state-of-the-art implementation.
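
As an illustration of the definition above, here is a naive reference sketch that counts, for each length-m factor, the other length-m factors within Hamming distance k. It runs in O(n² m) time and is not one of the algorithms described in the abstract. The names `hamming` and `k_mappability` are hypothetical, chosen only for this example.

```python
def hamming(y: str, z: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(y, z))


def k_mappability(x: str, m: int, k: int) -> list[int]:
    """For each length-m factor y of x (by starting position), count the
    other length-m factors of x at Hamming distance at most k from y.

    Brute-force illustration of the problem definition only.
    """
    factors = [x[i:i + m] for i in range(len(x) - m + 1)]
    counts = []
    for i, y in enumerate(factors):
        count = 0
        for j, z in enumerate(factors):
            # Factors are compared by position, so an identical factor
            # starting elsewhere still counts as "another" factor.
            if i != j and hamming(y, z) <= k:
                count += 1
        counts.append(count)
    return counts


if __name__ == "__main__":
    print(k_mappability("abab", 2, 0))
    print(k_mappability("aab", 2, 1))
```

A quadratic baseline like this can be useful for checking a faster implementation on small inputs.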

CWI's Institutional Repository

King's Research Portal