
571 research outputs found

Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

Author: D Okanohara
J Fischer
J Kärkkäinen
J Sirén
JI Munro
JS Vitter
K Sadakane
P Ferragina
R Dementiev
T Beller
T Kasai
U Manber
W Hon
W Szpankowski
Publication date: 01/01/2016

The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length $n$ can be represented as an array of length $n$ words, or, in the presence of the SA, as a bit vector of $2n$ bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of $O(n)$ words (i.e. $O(n \log n)$ bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the $2n$ bit version of the LCP array which uses $O(n \log \sigma)$ bits of additional space in external memory when given a (compressed) BWT with alphabet size $\sigma$ and a sampled inverse suffix array at sampling rate $O(\log n)$. This is often a significant space gain in practice where $\sigma$ is usually much smaller than $n$ or even constant. We also consider the case of computing succinct LCP arrays for circular strings.
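As a rough illustration of the $2n$-bit representation mentioned in this abstract, the sketch below builds the permuted LCP (PLCP) array of a short string in memory and encodes it in unary. It is not the external memory algorithm of the paper: the naive suffix sorting and the function names (`suffix_array`, `plcp_array`, `encode_plcp`, `decode_plcp`) are assumptions chosen for readability. The encoding relies on the standard fact that PLCP[i] + i never decreases with i, so each entry costs one 1-bit plus a run of 0-bits.

```python
def suffix_array(s):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def plcp_array(s):
    """PLCP[i] = LCP of suffix i and the suffix preceding it in SA order."""
    n = len(s)
    sa = suffix_array(s)
    phi = [-1] * n  # phi[i] = start of the suffix just before suffix i in SA order
    for rank in range(1, n):
        phi[sa[rank]] = sa[rank - 1]
    plcp = [0] * n
    l = 0
    for i in range(n):
        j = phi[i]
        if j < 0:
            l = 0  # lexicographically smallest suffix has no predecessor
        else:
            while i + l < n and j + l < n and s[i + l] == s[j + l]:
                l += 1
        plcp[i] = l
        l = max(l - 1, 0)  # PLCP[i + 1] >= PLCP[i] - 1
    return plcp


def encode_plcp(plcp):
    """Unary-encode PLCP[i] + i as runs of 0s, each terminated by a 1."""
    bits = []
    prev = 0  # PLCP[i - 1] + (i - 1), taken as 0 before the first entry
    for i, v in enumerate(plcp):
        bits.extend([0] * (v + i - prev))
        bits.append(1)
        prev = v + i
    return bits


def decode_plcp(bits):
    """Recover PLCP[i] as (position of the i-th 1) - 2 * i."""
    plcp = []
    for pos, bit in enumerate(bits):
        if bit == 1:
            plcp.append(pos - 2 * len(plcp))
    return plcp


text = "banana"
plcp = plcp_array(text)
bits = encode_plcp(plcp)
```

A real succinct implementation would replace the linear scan in `decode_plcp` with a select structure on the bit vector, which is the "asymptotically negligible support data structure" the abstract refers to.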

arXiv.org e-Print Archive

Crossref

MPG.PuRe

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Author: A Farruggia
C Boucher
C Hoobin
D Belazzougui
H Ferrada
J Ziv
M Léonard
P Ferragina
R Raman
S Deorowicz
S Kuruppu
Publication date: 01/01/2016

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times.
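The greedy parsing step described in this abstract can be sketched as follows. This is a minimal, quadratic-time illustration of RLZ parsing with a trailing mismatch character (the variant attributed above to Deorowicz and Grabowski), not RLZAP itself and not any of the cited implementations. The function names `rlz_parse` and `rlz_decode` and the triple layout `(position, length, mismatch)` are assumptions made for this example.

```python
def rlz_parse(reference, text):
    """Greedily parse text into (ref_pos, match_len, mismatch_char) phrases."""
    phrases = []
    i = 0
    while i < len(text):
        best_pos, best_len = 0, 0
        for p in range(len(reference)):
            l = 0
            # Leave at least one character of text for the explicit mismatch.
            while (p + l < len(reference)
                   and i + l < len(text) - 1
                   and reference[p + l] == text[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        phrases.append((best_pos, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the text by copying from the reference and appending mismatches."""
    out = []
    for pos, length, mismatch in phrases:
        out.append(reference[pos:pos + length])
        out.append(mismatch)
    return "".join(out)


reference = "ACGTACGT"
genome = "ACGTTCGTA"
phrases = rlz_parse(reference, genome)
```

In practice the longest match would be found with an index over the reference (for example a suffix array) rather than by scanning every start position.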

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Tree Compression with Top Trees Revisited

Author: F Wang
G Busatto
JI Munro
M Charikar
M Hirakawa
M Lohrey
NJ Larsson
P Ferragina
PJ Downey
S Alstrup
S Gog
S Maneth
S Maruyama
Publication date: 01/01/2015

We revisit tree compression with top trees (Bille et al., ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012). Comment: SEA 201

arXiv.org e-Print Archive

Crossref

KITopen

Repository KITopen

Leicester Research Archive

Lightweight Lempel-Ziv Parsing

Author: D. Okanohara
E. Ohlebusch
G. Chen
G. Navarro
J. Barbay
J. Fischer
J. Kärkkäinen
J. Ziv
M. Crochemore
M.I. Abouelhoda
P. Ferragina
R. Cánovas
S. Kreft
S. Kuruppu
T. Gagie
T. Kasai
T. Starikovskaya
U. Manber
W.I. Chang
Publication date: 01/01/2013

We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest. Comment: 12 page
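For readers unfamiliar with the problem, the sketch below shows what an LZ77 factorization is, using a deliberately naive quadratic-time scan. It does not implement the O(n/d)-space algorithm of the paper. The names `lz77_factorize` and `lz77_decode` are assumptions for this example, and the convention used here (a factor is either a fresh character or a `(source, length)` pair that may overlap the position being encoded) is one common formulation.

```python
def lz77_factorize(text):
    """Split text into factors: a new character or (source, length) of an earlier copy."""
    n = len(text)
    factors = []
    i = 0
    while i < n:
        best_src, best_len = -1, 0
        for j in range(i):  # candidate earlier starting positions
            l = 0
            # The copy may run past position i (self-overlapping factor).
            while i + l < n and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_src, best_len = j, l
        if best_len == 0:
            factors.append(text[i])  # first occurrence of this character
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text, copying character by character so overlaps work."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            src, length = f
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


factors = lz77_factorize("abababab")
```

Highly repetitive inputs, such as the `"abababab"` string here, collapse into very few factors, which is why the abstract singles out repetitive data.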

arXiv.org e-Print Archive

Crossref

Wave Energy: a Pacific Perspective

Author: D. Gusfield
D. Okanohara
G. Manzini
J. Fischer
J. Kärkkäinen
K. Sadakane
M.I. Abouelhoda
P. Ferragina
R. Dementiev
R. Sinha
S.J. Puglisi
T. Kasai
U. Manber
V. Mäkinen
Publication venue: The Royal Society
Publication date: 01/01/2009

This is the author's peer-reviewed final manuscript, as accepted by the publisher. The published article is copyrighted by The Royal Society and can be found at: http://rsta.royalsocietypublishing.org/

This paper illustrates the status of wave energy development in Pacific Rim countries by characterizing the available resource and introducing the region's current and potential future leaders in wave energy converter development. It also describes the existing licensing and permitting process as well as potential environmental concerns. Capabilities of Pacific Ocean testing facilities are described in addition to the region's vision of the future of wave energy.

CiteSeerX

Crossref

ScholarsArchive@OSU

Research Repository RMIT University

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Composite repetition-aware data structures

Author: A Blumer
A Lempel
D Arroyuelo
D Belazzougui
DE Willard
J Radoszewski
J Sirén
J Ziv
M Crochemore
M Raffinot
P Ferragina
S Kreft
T Gagie
V Mäkinen
W Rytter
Publication date: 01/01/2015

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal. Comment: (the name of the third co-author was inadvertently omitted from previous version

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Udine

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

The Tree Inclusion Problem: In Linear Space and Faster

Author: Alstrup S.
Bender M. A.
Cole R.
Demaine E. D.
Ferragina P.
Inge Li Gortz
Muthukrishnan S.
Philip Bille
Schlieder T.
Termier A.
Yang L. H.
Zezula P.
Publication date: 01/01/2011

Given two rooted, ordered, and labeled trees $P$ and $T$, the tree inclusion problem is to determine if $P$ can be obtained from $T$ by deleting nodes in $T$. This problem has recently been recognized as an important query primitive in XML databases. Kilpeläinen and Mannila [SIAM J. Comput. 1995] presented the first polynomial time algorithm using quadratic time and space. Since then several improved results have been obtained for special cases when $P$ and $T$ have a small number of leaves or small depth. However, in the worst case these algorithms still use quadratic time and space. Let $n_S$, $l_S$, and $d_S$ denote the number of nodes, the number of leaves, and the depth of a tree $S \in \{P, T\}$. In this paper we show that the tree inclusion problem can be solved in space $O(n_T)$ and time $O(\min(l_P n_T,\ l_P l_T \log \log n_T + n_T,\ \frac{n_P n_T}{\log n_T} + n_T \log n_T))$. This improves or matches the best known time complexities while using only linear space instead of quadratic. This is particularly important in practical applications, such as XML databases, where the space is likely to be a bottleneck. Comment: Minor updates from last tim

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Compressed Subsequence Matching and Packed Tree Coloring

Author: A. Tiskin
D.D. Sleator
G. Das
H. Mannila
J. Ziv
M. Charikar
M. Crochemore
M. Thorup
M.A. Bender
M.L. Fredman
N.J. Larsson
O. Berkman
P. Cégielski
P. Ferragina
P.F. Dietz
R.A. Baeza-Yates
S. Abiteboul
S. Alstrup
T. Yamamoto
W. Rytter
Z. Troníček
Publication date: 01/01/2014

We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size $n$ compressing a string of size $N$ and a pattern string of size $m$ over an alphabet of size $\sigma$, our algorithm uses $O(n+\frac{n\sigma}{w})$ space and $O(n+\frac{n\sigma}{w}+m\log N\log w\cdot occ)$ or $O(n+\frac{n\sigma}{w}\log w+m\log N\cdot occ)$ time. Here $w$ is the word size and $occ$ is the number of occurrences of the pattern. Our algorithm uses less space than previous algorithms and is also faster for $occ=o(\frac{n}{\log N})$ occurrences. The algorithm uses a new data structure that allows us to efficiently find the next occurrence of a given character after a given position in a compressed string. This data structure in turn is based on a new data structure for the tree color problem, where the node colors are packed in bit strings. Comment: To appear at CPM '1
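The "next occurrence of a given character after a given position" primitive named in this abstract can be illustrated on an ordinary, uncompressed string. The sketch below is only that illustration: it stores a full table of $O(N\sigma)$ entries, whereas the paper answers the same query on a grammar-compressed string in much less space. The names `build_next_table` and `leftmost_embedding` are assumptions for this example.

```python
def build_next_table(text):
    """nxt[i][c] = smallest j >= i with text[j] == c (absent if there is none)."""
    n = len(text)
    nxt = [None] * (n + 1)
    last = {}
    nxt[n] = {}
    for i in range(n - 1, -1, -1):
        last[text[i]] = i
        nxt[i] = dict(last)  # snapshot of next occurrences seen from position i
    return nxt


def leftmost_embedding(nxt, pattern):
    """Return positions of the leftmost occurrence of pattern as a subsequence, or None."""
    positions = []
    pos = 0
    for c in pattern:
        j = nxt[pos].get(c)
        if j is None:
            return None
        positions.append(j)
        pos = j + 1
    return positions


table = build_next_table("abracadabra")
match = leftmost_embedding(table, "bcd")
```

Each pattern character costs one table lookup, so matching takes time proportional to the pattern length once the table is built.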

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research Database In Technology

Efficient and Compact Representations of Some Non-canonical Prefix-Free Codes

Author: A Itai
ES Schwartz
F Claude
G Navarro
JI Munro
P Ferragina
RL Wessner
T Gagie
W Evans
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-46049-9_5

[Abstract] For many kinds of prefix-free codes there are efficient and compact alternatives to the traditional tree-based representation. Since these put the codes into canonical form, however, they can only be used when we can choose the order in which codewords are assigned to characters. In this paper we first show how, given a probability distribution over an alphabet of $\sigma$ characters, we can store a nearly optimal alphabetic prefix-free code in $o(\sigma)$ bits such that we can encode and decode any character in constant time. We then consider a kind of code introduced recently to reduce the space usage of wavelet matrices (Claude, Navarro, and Ordóñez, Information Systems, 2015). They showed how to build an optimal prefix-free code such that the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. We show how to store such a code in $O(\sigma \log L + 2^{\epsilon L})$ bits, where $L$ is the maximum codeword length and $\epsilon$ is any positive constant, such that we can encode and decode any character in constant time under reasonable assumptions. Otherwise, we can always encode and decode a codeword of $\ell$ bits in time $O(\ell)$ using $O(\sigma \log L)$ bits of space.

Ministerio de Economía, Industria y Competitividad; TIN2013-47090-C3-3-P
Ministerio de Economía, Industria y Competitividad; TIN2015-69951-R
Ministerio de Economía, Industria y Competitividad; ITC-20151305
Ministerio de Economía, Industria y Competitividad; ITC-20151247
Xunta de Galicia; GRC2013/053
Chile. Núcleo Milenio Información y Coordinación en Redes; ICM/FIC.P10-024F
COST. IC1302
Academy of Finland; 268324
Academy of Finland; 25034
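The "canonical form" that this abstract contrasts its codes with can be sketched briefly. Given only the codeword lengths, a canonical code assigns consecutive binary values in order of increasing length, which is why the assignment order of codewords to characters cannot be chosen freely. The example below shows that standard construction, not the non-canonical representations of the paper. The function name `canonical_code` and the tie-breaking by symbol order are assumptions for this example, and the input lengths are assumed to satisfy the Kraft inequality.

```python
def canonical_code(lengths):
    """Map each symbol to a canonical codeword, given symbol -> codeword length."""
    # Shorter codewords first; ties broken by symbol order.
    ordered = sorted(lengths.items(), key=lambda item: (item[1], item[0]))
    codes = {}
    code = 0
    prev_len = ordered[0][1]
    for symbol, length in ordered:
        code <<= (length - prev_len)  # pad with zeros when the length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


def is_prefix_free(codes):
    """Check that no codeword is a proper prefix of another."""
    words = sorted(codes.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


codes = canonical_code({"a": 1, "b": 2, "c": 3, "d": 3})
```

Because the codewords are determined by the lengths alone, such a code can be stored compactly, but an alphabetic code (where codeword order must follow character order) generally cannot be put in this form.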

Repositorio da Universidade da Coruña

Crossref

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

VST - VLT Survey Telescope Integration Status

Author: Belfiore C.
Brescia M.
Capaccioli M.
Caputi O.
Castiello G.
Cortecchia F.
Ferragina L.
Fierro D.
Fiume V.
Mancini D.
Mancini G.
Marra G.
Marty L.
Mazzola G.
Parisi L.
Pellone L.
Perrotta F.
Porzio V.
Schipani P.
Sciarretta G.
Sedmak G.
Spirito G.
Valentino M.
Publication date: 01/01/2005

The VLT Survey Telescope (VST) is a 2.6m aperture, wide field, UV to I facility, to be installed at the European Southern Observatory (ESO) on Cerro Paranal, Chile. VST was primarily intended to complement the observing capabilities of VLT with wide-angle imaging for detecting and pre-characterising sources for further observations with the VLT. Comment: 2 pages, 2 figures, conferenc

arXiv.org e-Print Archive

Archivio della ricerca - Università degli studi di Napoli Federico II