Search CORE

9,578 research outputs found

Improved ESP-index: a practical self-index for highly repetitive texts

Author: F. Claude
F. Claude
G. Navarro
J. Barbay
J.I. Munro
K. Goto
O. Delpratt
S. Maruyama
T. Gagie
T. Gagie
T. Yamamoto
Publication venue
Publication date: 01/01/2014
Field of study

While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real world repetitive texts remains a challenge. ESP-index is a grammar-based self-index on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds of parsing discrepancies between different appearances of the same subtexts in a text. Although ESP-index performs efficient top-down searches of query texts, it has a serious issue on binary searches for finding appearances of variables for a query text, which resulted in slowing down the query searches. We present an improved ESP-index (ESP-index-I) by leveraging the idea behind succinct data structures for large alphabets. While ESP-index-I keeps the same types of efficiencies as ESP-index about the top-down searches, it avoid the binary searches using fast rank/select operations. We experimentally test ESP-index-I on the ability to search query texts and extract subtexts from real world repetitive texts on a large-scale, and we show that ESP-index-I performs better that other possible approaches.Comment: This is the full version of a proceeding accepted to the 11th International Symposium on Experimental Algorithms (SEA2014

arXiv.org e-Print Archive

Crossref

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Author: A Farruggia
C Boucher
C Hoobin
D Belazzougui
H Ferrada
J Ziv
J Ziv
M Léonard
P Ferragina
R Raman
S Deorowicz
S Deorowicz
S Kuruppu
Publication venue
Publication date: 01/01/2016
Field of study

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to handle well also short insertions, deletions and multi-character substitutions. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa