Fully dynamic data structure for LCE queries in compressed space

Authors: Bannai Hideo, I Tomohiro, Inenaga Shunsuke, Nishimoto Takaaki, Takeda Masayuki
Publication date: 01/01/2016

A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of the suffixes starting at two given positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, can support LCE queries in $O(\log N + \log \ell \log^* M)$ time, where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled in constant time under the word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be enhanced to support efficient update operations: after processing $\mathcal{G}$ in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$ into/from an arbitrary position of $T$ in $O((y + \log N \log^* M) f_{\mathcal{A}})$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: we can construct $\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from an uncompressed string $T$; in $O(n \log\log n \log N \log^* M)$ time from a grammar-compressed string $T$ represented by a straight-line program of size $n$; and in $O(z f_{\mathcal{A}} \log N \log^* M)$ time from an LZ77-compressed string $T$ with $z$ factors. On top of the above contributions, we show several applications of our data structures which improve the previous best known results on grammar-compressed string processing.

Comment: arXiv admin note: text overlap with arXiv:1504.0695
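To make the LCE query and the update operations concrete, the following is a minimal Python sketch on an uncompressed string. It is a brute-force baseline, not the signature-encoding structure described in the abstract: the query scans character by character in $O(\ell)$ time, and each update copies the string in $O(N)$ time. The function names `lce_naive`, `insert_substring`, and `delete_substring` are hypothetical, and positions are assumed to be 0-based.

```python
def lce_naive(text: str, i: int, j: int) -> int:
    """Return the length of the longest common prefix of text[i:] and text[j:].

    Brute-force baseline (assumption: 0-based positions); it compares
    characters one by one until a mismatch or the end of the text.
    """
    n = len(text)
    if not (0 <= i <= n and 0 <= j <= n):
        raise IndexError("positions must lie in [0, len(text)]")
    length = 0
    while (i + length < n and j + length < n
           and text[i + length] == text[j + length]):
        length += 1
    return length


def insert_substring(text: str, pos: int, piece: str) -> str:
    """Insert `piece` at position `pos`; O(N) copy, unlike the dynamic structure."""
    if not 0 <= pos <= len(text):
        raise IndexError("insert position out of range")
    return text[:pos] + piece + text[pos:]


def delete_substring(text: str, pos: int, length: int) -> str:
    """Delete `length` characters starting at `pos`; also an O(N) copy."""
    if pos < 0 or length < 0 or pos + length > len(text):
        raise IndexError("deleted range out of bounds")
    return text[:pos] + text[pos + length:]
```

For example, on `"abracadabra"` the suffixes at positions 0 and 7 share the prefix `"abra"`, so `lce_naive("abracadabra", 0, 7)` returns 4. A baseline like this can serve as a reference when testing a compressed implementation.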

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Indexing large genome collections on a PC

Authors: Danek Agnieszka, Deorowicz Sebastian, Grabowski Szymon
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 28/03/2014

Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with existing algorithms due to their large memory requirements.

Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in the exact and approximate matching models, against a collection of thousand(s) of genomes. Its unique feature is the small index size, fitting in a standard computer with 16--32\,GB, or even 8\,GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries are handled in an average time of 39\,$\mu$s, and queries with up to 3 mismatches in 373\,$\mu$s, on the test PC with an index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory, the respective times grow to 76\,$\mu$s and 917\,$\mu$s.

Availability: Software and Supplementary material: \url{http://sun.aei.polsl.pl/mugi}
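The query model in the abstract (report every occurrence of a pattern across a genome collection, exactly or with a bounded number of mismatches) can be illustrated with a small Python sketch. This is a brute-force scan for illustration only; it is not MuGI's index or algorithm, and the function name `find_with_mismatches` and the plain-string representation of genomes are assumptions made here.

```python
from typing import Iterator, Sequence, Tuple


def find_with_mismatches(
    genomes: Sequence[str], pattern: str, max_mismatches: int = 0
) -> Iterator[Tuple[int, int, int]]:
    """Yield (genome_index, start, mismatches) for every occurrence of
    `pattern` with at most `max_mismatches` substitutions.

    Brute-force scan over every alignment (assumption: genomes are plain
    strings and only substitutions count, no insertions or deletions).
    """
    m = len(pattern)
    if m == 0:
        return
    for gid, genome in enumerate(genomes):
        for start in range(len(genome) - m + 1):
            mismatches = 0
            for a, b in zip(genome[start:start + m], pattern):
                if a != b:
                    mismatches += 1
                    if mismatches > max_mismatches:
                        break  # too many mismatches, abandon this alignment
            else:
                # Inner loop finished without break: alignment is accepted.
                yield (gid, start, mismatches)
```

With `max_mismatches=0` this reduces to exact matching; raising it to 3 corresponds to the approximate setting quoted in the abstract. The scan takes time proportional to the total collection length times the pattern length, which is the cost an index is meant to avoid.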

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare