Efficient Computation of Sequence Mappability

Author: Alzamel Mai
Charalampopoulos Panagiotis
Iliopoulos Costas S.
Kociumaka Tomasz
Pissis Solon P.
Radoszewski Jakub
Straszyński Juliusz
Publication date: 31/07/2018

Sequence mappability is an important task in genome re-sequencing. In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, our goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in $\mathcal{O}(n \min\{m^k,\log^{k+1} n\})$ time and $\mathcal{O}(n)$ space for $k=\mathcal{O}(1)$. It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show $\mathcal{O}(n^2)$-time algorithms to compute all results for a fixed $m$ and all $k=0,\ldots,m$ or a fixed $k$ and all $m=k,\ldots,n-1$. Finally we show that the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time for $k,m = \Theta(\log n)$ unless the Strong Exponential Time Hypothesis fails.

Comment: Accepted to SPIRE 201
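To make the problem definition concrete, here is a minimal Python sketch that computes the mappability table for a fixed $m$ and every $k=0,\ldots,m$ in quadratic time. It slides a pair of windows along each diagonal $j-i=d$ and updates the Hamming distance in constant time per step. This is one natural way to reach $\mathcal{O}(n^2)$ time and is not claimed to be the algorithm from the paper; the function name `mappability_all_k` is hypothetical, and the sketch uses $\mathcal{O}(nm)$ space for simplicity.

```python
def mappability_all_k(text: str, m: int) -> list[list[int]]:
    """Return table where table[k][i] is the number of j != i such that
    text[i:i+m] and text[j:j+m] differ in at most k positions.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    """
    n = len(text)
    starts = n - m + 1  # number of length-m windows
    if m <= 0 or starts <= 0:
        return []

    # exact[i][h] = number of j != i whose window is at distance exactly h
    exact = [[0] * (m + 1) for _ in range(starts)]

    for d in range(1, starts):
        # Hamming distance between the windows starting at 0 and d
        h = sum(text[p] != text[p + d] for p in range(m))
        exact[0][h] += 1
        exact[d][h] += 1
        for i in range(1, starts - d):
            # Slide both windows one position to the right:
            # drop the leftmost compared pair, add the new rightmost pair.
            h -= text[i - 1] != text[i - 1 + d]
            h += text[i + m - 1] != text[i + m - 1 + d]
            exact[i][h] += 1
            exact[i + d][h] += 1

    # Prefix sums over h turn "exactly h" into "at most k".
    table = [[0] * starts for _ in range(m + 1)]
    for i in range(starts):
        running = 0
        for k in range(m + 1):
            running += exact[i][k]
            table[k][i] = running
    return table
```

For example, with `text = "abab"` and `m = 2` the windows are `ab`, `ba`, `ab`, so the first and last windows each have one exact match and the middle window has none.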

arXiv.org e-Print Archive

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Pattern Matching in Multiple Streams

Author: A. Amir
D. Breslauer
F. Ergun
G.M. Landau
H. Karloff
K. Abrahamson
M. Ružić
R. Clifford
T.S. Jayram
Z. Galil
Publication date: 01/01/2012

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.

Comment: 13 pages, 1 figure
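The O(m+s) space shape for exact matching can be illustrated with a small Python sketch: one failure table for the pattern (O(m) words) is shared by all streams, and each stream keeps a single integer state (O(s) words in total). This sketch uses the classic Knuth-Morris-Pratt automaton, which gives amortized rather than worst-case constant time per symbol, so it is only an assumption-laden illustration of the model and not the algorithm from the paper. The class name `MultiStreamMatcher` and its method `feed` are hypothetical.

```python
class MultiStreamMatcher:
    """Exact matching of one fixed pattern against several streams.

    Hypothetical illustration: shared KMP failure table plus one
    integer of state per stream.
    """

    def __init__(self, pattern: str, num_streams: int):
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        self.fail = self._failure(pattern)  # shared by all streams
        self.state = [0] * num_streams      # matched prefix length per stream

    @staticmethod
    def _failure(p: str) -> list[int]:
        # fail[i] = length of the longest proper border of p[:i+1]
        fail = [0] * len(p)
        k = 0
        for i in range(1, len(p)):
            while k and p[i] != p[k]:
                k = fail[k - 1]
            if p[i] == p[k]:
                k += 1
            fail[i] = k
        return fail

    def feed(self, stream: int, symbol: str) -> bool:
        """Append symbol to the given stream; return True on a new match."""
        q = self.state[stream]
        if q == len(self.pattern):
            # Previous symbol completed a match; fall back to allow overlaps.
            q = self.fail[q - 1]
        while q and symbol != self.pattern[q]:
            q = self.fail[q - 1]
        if symbol == self.pattern[q]:
            q += 1
        self.state[stream] = q
        return q == len(self.pattern)
```

Symbols for different streams can be interleaved in any order, since each call touches only the state of the stream it names.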

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Faster algorithms for 1-mappability of a sequence

Author: A Amir
G Manzini
J Fischer
M Crochemore
MA Bender
ML Fredman
ML Metzker
NA Fonseca
SV Thankachan
T Derrien
U Manber
Publication date: 11/05/2017

In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n / log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if $m = \Omega(\log n / \log \sigma)$, where $\sigma$ is the alphabet size.

arXiv.org e-Print Archive

Crossref

A practical index for approximate dictionary matching with few mismatches

Author: Cisłak Aleksander
Grabowski Szymon
Publication date: 11/02/2016

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in $q$-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.
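The Dirichlet (pigeonhole) principle behind the split index can be sketched in a few lines of Python for the one-mismatch case: if two equal-length words differ in at most one position and each is cut into two parts at the same point, at least one part matches exactly. The sketch below indexes every word under both of its parts and verifies the candidates found through exact part lookups. It is a simplified illustration under these assumptions, not the authors' compact C++ data structure; the names `SplitIndex`, `split_word`, and `hamming` are hypothetical.

```python
from collections import defaultdict


def split_word(word: str) -> tuple[str, str]:
    """Cut a word into two parts at its midpoint."""
    mid = len(word) // 2
    return word[:mid], word[mid:]


def hamming(a: str, b: str) -> int:
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Dictionary index answering queries with at most one mismatch."""

    def __init__(self, words):
        # Key: (word length, part position, part text) -> words with that part
        self.buckets = defaultdict(list)
        for w in set(words):
            for pos, part in enumerate(split_word(w)):
                self.buckets[(len(w), pos, part)].append(w)

    def query(self, q: str) -> list[str]:
        """Return all indexed words at Hamming distance <= 1 from q."""
        found = set()
        for pos, part in enumerate(split_word(q)):
            # Any word within one mismatch shares at least one exact part.
            for cand in self.buckets.get((len(q), pos, part), ()):
                if hamming(cand, q) <= 1:
                    found.add(cand)
        return sorted(found)
```

A query costs two hash lookups plus verification of the candidates in the two buckets, which is why short buckets matter for speed in practice.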

arXiv.org e-Print Archive

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

The streaming $k$-mismatch problem

Author: Clifford Raphaël
Kociumaka Tomasz
Porat Ely
Publication date: 09/04/2018

We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming distance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full mismatch information, that is the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(k\log{n}\log\frac{n}{k})$ bits of space and spends \ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS'09. En route to this solution, we also give a deterministic $O(k (\log \frac{n}{k} + \log |\Sigma|))$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest.

Comment: 27 pages
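As a reference point for what the output of the $k$-mismatch problem looks like, the following Python sketch computes it offline by brute force: for each alignment whose Hamming distance is at most $k$, it returns the mismatch information as a list of (offset, pattern symbol, text symbol) triples. It stores the whole pattern and text and is therefore not a streaming algorithm; it only illustrates the problem statement. The function name `k_mismatch_occurrences` is hypothetical.

```python
def k_mismatch_occurrences(pattern: str, text: str, k: int) -> dict:
    """Map each alignment start with Hamming distance <= k to its
    mismatch information.

    Offline brute-force illustration (hypothetical helper), taking
    O(k) comparisons past the matching prefix per alignment at best
    and O(len(pattern)) at worst.
    """
    n = len(pattern)
    result = {}
    for start in range(len(text) - n + 1):
        mismatches = []
        for offset in range(n):
            if pattern[offset] != text[start + offset]:
                mismatches.append(
                    (offset, pattern[offset], text[start + offset])
                )
                if len(mismatches) > k:
                    break  # too many mismatches, skip this alignment
        else:
            # Loop finished without exceeding k mismatches.
            result[start] = mismatches
    return result
```

The Hamming distance of a reported alignment is simply the length of its mismatch list.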

arXiv.org e-Print Archive

Crossref

Explore Bristol Research