
17 research outputs found

Faster algorithms for 1-mappability of a sequence

Author: A Amir
G Manzini
J Fischer
M Crochemore
MA Bender
ML Fredman
ML Metzker
NA Fonseca
SV Thankachan
T Derrien
U Manber
Publication date: 11/05/2017

In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n / log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if m = Ω(log n / log σ), where σ is the alphabet size.
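The problem statement in this abstract can be made concrete with a brute-force reference. The sketch below is not any of the algorithms the paper presents; it is a naive O(n^2 m) baseline, and it assumes that "other factors" means occurrences at other starting positions of x. The function name `k_mappability` is hypothetical.

```python
def k_mappability(x: str, m: int, k: int = 1) -> list[int]:
    """Naive reference: for each length-m factor of x (by start position),
    count the factors at OTHER start positions within Hamming distance k.

    Assumption: factors are counted per occurrence, not per distinct string.
    """
    num_factors = max(len(x) - m + 1, 0)
    counts = [0] * num_factors
    for i in range(num_factors):
        for j in range(num_factors):
            if i == j:
                continue
            mismatches = 0
            for a, b in zip(x[i:i + m], x[j:j + m]):
                if a != b:
                    mismatches += 1
                    if mismatches > k:
                        break  # already too far apart, stop comparing
            if mismatches <= k:
                counts[i] += 1
    return counts


if __name__ == "__main__":
    # Factors of length 2 in "aabaa": "aa", "ab", "ba", "aa"
    print(k_mappability("aabaa", 2, 1))
```

A baseline like this is mainly useful for checking a faster implementation on small inputs.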

arXiv.org e-Print Archive

Crossref

Longest Common Prefixes with $k$ -Errors and Applications

Author: A Apostolico
AF Smit
B Bollobás
C Leimeister
C Pizzi
DE Willard
G Kucherov
G Manzini
G Navarro
H Alamro
I Ulitsky
J Fischer
KR Rasmussen
M Alzamel
MA Bender
MI Abouelhoda
N Välimäki
P Eades
R Kolpakov
S Faro
S Grabowski
S Karlin
SV Thankachan
T Derrien
T Flouri
TH Cormen
U Manber
Publication date: 01/01/2018

Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n over a constant-sized alphabet that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in O(n log^k n log log n) time on average using O(n) space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.
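For the Hamming distance version of this problem, a naive cubic-time reference can illustrate what is being computed. It is only a sketch for small inputs, not the paper's average-case technique, and it does not cover the edit distance model. The name `longest_prefix_k_mismatches` is hypothetical.

```python
def longest_prefix_k_mismatches(x: str, k: int) -> list[int]:
    """Naive reference: result[i] is the length of the longest prefix of the
    suffix x[i:] that also occurs starting at some other position j != i,
    allowing at most k mismatches (Hamming distance model only)."""
    n = len(x)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            length = 0
            mismatches = 0
            # Extend the match until either string ends or the error budget is spent.
            while i + length < n and j + length < n:
                if x[i + length] != x[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            result[i] = max(result[i], length)
    return result


if __name__ == "__main__":
    print(longest_prefix_k_mismatches("abcabd", 0))
    print(longest_prefix_k_mismatches("abcabd", 1))
```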

arXiv.org e-Print Archive

Crossref

King's Research Portal

Longest property-preserved common factor

Author: D Belazzougui
D Gusfield
H Bannai
J-P Duval
L Chi
M Dumitran
M Farach
M Federico
M Lothaire
P Peterlongo
S Inenaga
SR Chowdhury
SV Thankachan
SW Bae
T Kociumaka
T Starikovskaya
WI Chang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider two fundamental string properties: square-free factors and periodic factors under two different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer 1 < k’ ≤ k and we are asked to find a longest periodic factor common to at least k’ strings. We present linear-time solutions for both settings. We anticipate that our paradigm can be extended to other string properties
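The first setting can be illustrated with a brute-force search for a longest square-free factor common to two strings. This sketch is far from the linear-time solution the abstract announces and builds no data structure over x; it only makes the definition concrete. Both function names are hypothetical.

```python
def is_square_free(s: str) -> bool:
    """Return True if s contains no non-empty factor of the form uu."""
    n = len(s)
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            if s[start:start + half] == s[start + half:start + 2 * half]:
                return False
    return True


def longest_square_free_common_factor(x: str, y: str) -> str:
    """Naive reference: try candidate lengths from longest to shortest and
    return the first factor of x that occurs in y and is square-free."""
    for length in range(min(len(x), len(y)), 0, -1):
        for i in range(len(x) - length + 1):
            candidate = x[i:i + length]
            if candidate in y and is_square_free(candidate):
                return candidate
    return ""


if __name__ == "__main__":
    print(longest_square_free_common_factor("abcabc", "cabca"))
```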

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

Crossref

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

King's Research Portal

ALFRED: A Practical Method for Alignment-Free Distance Computation

Author: ALURU S
APOSTOLICO A
CHOCKALINGAM SP
LIU YC
THANKACHAN SV
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2016

Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged as one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Despite its good worst-case time complexity of O(n log^k n), the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performance with an expected run-time of O(n log^k n). By applying our program to phylogenetic inference as a case study, we find that it exactly reconstructs the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ and the source code is freely available online.

Dspace at IIT Bombay

Compressing Dictionary Matching Index via Sparsification Technique

Author: Hon WK
Ku TH
Lam TW
Shah R
Tam SL
Thankachan SV
Vitter JS
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

HKU Scholars Hub

Algorithmic Framework For Approximate Matching Under Bounded Edits With Applications To Sequence Analysis

Author: A Abboud
A Apostolico
Amir Abboud
B Langmead
C Pizzi
C-A Leimeister
D Burstein
D Gusfield
DD Sleator
EM McCreight
F Guyon
G Chang
G Kucherov
G Manzini
H Li
JT Simpson
M Comin
M Domazet-Lošo
MR Brown
N Välimäki
O Bonham-Carter
R Li
S Aluru
S Burkhardt
SV Thankachan
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2018

We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the required number of such edited suffixes to enable the design of efficient algorithms. For a total input size of n, our framework limits the number of generated edited suffixes to no more than a factor of O(log^k n) of the input size (for any constant k), and restricts the algorithm to linear space usage by overlapping the generation and processing of edited suffixes. Our framework improves the best known upper bound of n^2 k^1.5 / 2^Ω(√(log n / k)) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015] to yield the first strictly sub-quadratic time algorithm that runs in O(n log^k n) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool, and it may inspire the design of practical heuristics and software.

Crossref

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)