
689 research outputs found

Longest common substrings with k mismatches

Author: Flouri Tomas
Giaquinta Emanuele
Kobert Kassian
Ukkonen Esko
Publication date: 01/01/2015

The longest common substring with k-mismatches problem is to find, given two strings $S_1$ and $S_2$, a longest substring $A_1$ of $S_1$ and $A_2$ of $S_2$ such that the Hamming distance between $A_1$ and $A_2$ is at most $k$. Peer reviewed
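The problem statement above can be illustrated with a minimal sketch. It assumes a straightforward quadratic-time approach (a sliding window along each diagonal of the comparison matrix), which is not claimed to be the method of the paper; the function name `longest_common_substring_k_mismatches` and its return convention are hypothetical.

```python
def longest_common_substring_k_mismatches(s1, s2, k):
    """Return (length, i, j) such that s1[i:i+length] and s2[j:j+length]
    have Hamming distance at most k and length is as large as possible.

    Illustrative sketch only: O(len(s1) * len(s2)) time, O(1) extra space.
    """
    n, m = len(s1), len(s2)
    best = (0, 0, 0)
    # Diagonal d pairs s1[i] with s2[i - d].
    for d in range(-(m - 1), n):
        i0 = max(d, 0)            # first s1 index on this diagonal
        j0 = i0 - d               # first s2 index on this diagonal
        length = min(n - i0, m - j0)
        left = 0                  # left end of the current window
        mismatches = 0            # mismatches inside the window
        for right in range(length):
            if s1[i0 + right] != s2[j0 + right]:
                mismatches += 1
            # Shrink from the left until at most k mismatches remain.
            while mismatches > k:
                if s1[i0 + left] != s2[j0 + left]:
                    mismatches -= 1
                left += 1
            window = right - left + 1
            if window > best[0]:
                best = (window, i0 + left, j0 + left)
    return best
```

With k = 0 this reduces to the ordinary longest common substring; larger k lets the window absorb up to k mismatching positions.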

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Crossref

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto

MissMax: Alignment-free sequence comparison with mismatches through filtering and heuristics

Author: PIZZI CINZIA
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

BACKGROUND: Measuring sequence similarity is central to many problems in bioinformatics. In several contexts, alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies have attempted to bridge the accuracy gap by introducing approximate matches into the definition of composition-based similarity measures. RESULTS: In this work we present MissMax, an exact algorithm for computing the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for computing two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a “relaxed” version of MissMax that does not guarantee the exact solution but is faster in practice and still very precise.
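To make the per-suffix statistic concrete, here is a brute-force sketch. It is not MissMax and uses none of its filtering or heuristics; it only computes, for each suffix of x, the longest prefix that matches some substring of y with at most k mismatches. The names `lcs_k_per_suffix` and `longest_and_average` are hypothetical, and the average is taken as a plain mean over suffixes, which is an assumption about normalization.

```python
def lcs_k_per_suffix(x, y, k):
    """For each position i of x, return the length of the longest prefix of
    x[i:] that matches a substring of y with at most k mismatches.

    Brute force, cubic time in the worst case; for illustration only.
    """
    stats = []
    for i in range(len(x)):
        best = 0
        for j in range(len(y)):
            length = 0
            mismatches = 0
            # Extend until the (k+1)-th mismatch or the end of either string.
            while i + length < len(x) and j + length < len(y):
                if x[i + length] != y[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            best = max(best, length)
        stats.append(best)
    return stats


def longest_and_average(stats):
    """Summarize the per-suffix statistics (plain mean, by assumption)."""
    if not stats:
        return 0, 0.0
    return max(stats), sum(stats) / len(stats)
```

For example, `lcs_k_per_suffix("abc", "abd", 1)` yields one value per suffix of "abc", and `longest_and_average` reduces those values to the two summary measures.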

Crossref

Springer - Publisher Connector

PubMed Central

Archivio istituzionale della ricerca - Università di Padova

Efficient Computation of Sequence Mappability

Author: Alzamel Mai
Charalampopoulos Panagiotis
Iliopoulos Costas S.
Kociumaka Tomasz
Pissis Solon P.
Radoszewski Jakub
Straszyński Juliusz
Publication date: 31/07/2018

Sequence mappability is an important task in genome re-sequencing. In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, our goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in $\mathcal{O}(n \min\{m^k,\log^{k+1} n\})$ time and $\mathcal{O}(n)$ space for $k=\mathcal{O}(1)$. It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show $\mathcal{O}(n^2)$-time algorithms to compute all results for a fixed $m$ and all $k=0,\ldots,m$ or a fixed $k$ and all $m=k,\ldots,n-1$. Finally we show that the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time for $k,m = \Theta(\log n)$ unless the Strong Exponential Time Hypothesis fails.

Comment: Accepted to SPIRE 201
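The definition of the (k,m)-mappability table can be illustrated with a simple quadratic-time sketch for one fixed pair (k, m). It is a baseline written for clarity, not any of the algorithms from the paper; the function name `mappability` is hypothetical and positions are 0-based here.

```python
def mappability(t, k, m):
    """Return a list whose i-th entry counts positions j != i such that
    t[i:i+m] and t[j:j+m] differ in at most k positions.

    Sketch: for every offset d = j - i, slide a length-m window and update
    the mismatch count in O(1) per step, giving O(n^2) time overall.
    """
    n = len(t)
    starts = n - m + 1                 # number of length-m substrings
    counts = [0] * max(starts, 0)
    for d in range(1, starts):
        # Mismatches between t[0:m] and t[d:d+m].
        mism = sum(1 for p in range(m) if t[p] != t[p + d])
        for i in range(starts - d):
            if mism <= k:
                counts[i] += 1         # j = i + d is close to i
                counts[i + d] += 1     # and i is close to j
            if i + 1 < starts - d:
                # Slide the window: drop column i, add column i + m.
                mism -= int(t[i] != t[i + d])
                mism += int(t[i + m] != t[i + d + m])
    return counts
```

Each unordered pair {i, j} is visited exactly once (through d = j - i), which is what lets both entries be incremented without double counting.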

arXiv.org e-Print Archive

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Edit Distance: Sketching, Streaming and Document Exchange

Author: Belazzougui Djamal
Zhang Qin
Publication date: 14/07/2016

We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.

Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)
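The "distance if at most K, otherwise error" contract described above can be illustrated on a single machine with a thresholded (banded) dynamic program. This sketch has both strings in hand and involves no sketching, streaming or communication, so it says nothing about the paper's protocol; the name `edit_distance_at_most` is hypothetical, and returning `None` stands in for the "error" output by assumption.

```python
def edit_distance_at_most(x, y, K):
    """Return the edit distance between x and y if it is at most K,
    otherwise None. Only cells within K of the main diagonal are filled,
    so the running time is O(K * len(x)).
    """
    if abs(len(x) - len(y)) > K:
        return None                      # the length gap alone exceeds K
    INF = K + 1                          # any value above K means "too far"
    # Row 0: turning the empty prefix of x into y[:j] costs j insertions.
    prev = {j: j for j in range(min(len(y), K) + 1)}
    for i in range(1, len(x) + 1):
        curr = {}
        lo, hi = max(0, i - K), min(len(y), i + K)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i              # delete all of x[:i]
                continue
            substitute = prev.get(j - 1, INF) + int(x[i - 1] != y[j - 1])
            delete = prev.get(j, INF) + 1
            insert = curr.get(j - 1, INF) + 1
            curr[j] = min(substitute, delete, insert, INF)
        prev = curr
    d = prev.get(len(y), INF)
    return d if d <= K else None
```

Cells outside the band are treated as unreachable, which is safe because any alignment of cost at most K never strays more than K from the main diagonal.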

arXiv.org e-Print Archive

Crossref

Accurate long read mapping using enhanced suffix arrays

Author: Dawyndt Peter
De Schrijver Joachim
Fack Veerle
Van Criekinge Wim
Vyverman Michaël
Publication venue: 'Scitepress'
Publication date: 01/01/2010

With the rise of high-throughput sequencing, new programs have been developed for aligning huge amounts of short-read data to reference genomes. Recent developments in sequencing technology allow longer reads, but the mappers for short reads are not suited to reads of several hundred base pairs. We propose an algorithm for mapping longer reads, which is based on chaining maximal exact matches and uses heuristics and the Needleman-Wunsch algorithm to bridge the gaps. To compute maximal exact matches we use a specialized index structure called an enhanced suffix array. The proposed algorithm is very accurate and can handle large reads with mutations and long insertions and deletions.
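The gap-bridging step mentioned above relies on Needleman-Wunsch global alignment. A minimal score-only sketch follows; the scoring parameters (match +1, mismatch -1, linear gap -1) and the function name `needleman_wunsch_score` are assumptions for illustration and are not taken from the paper, and the chaining of maximal exact matches is not shown.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b under a
    linear gap penalty, keeping only two rows of the DP table.
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                 # column 0: a[:i] against nothing
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap           # gap in b
            left = curr[j - 1] + gap     # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

In a chaining mapper, a routine like this would be applied only to the short stretches between consecutive anchors, which keeps the quadratic cost of the dynamic program small.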

Ghent University Academic Bibliography