678 research outputs found

    Reordering Rows for Better Compression: Beyond the Lexicographic Order

    Get PDF
    Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.Comment: to appear in ACM TOD

    Composite repetition-aware data structures

    Get PDF
    In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.Comment: (the name of the third co-author was inadvertently omitted from previous version

    On multivariable cumulant polynomial sequences with applications

    Full text link
    A new family of polynomials, called cumulant polynomial sequence, and its extensions to the multivariate case is introduced relied on a purely symbolic combinatorial method. The coefficients of these polynomials are cumulants, but depending on what is plugged in the indeterminates, either sequences of moments either sequences of cumulants can be recovered. The main tool is a formal generalization of random sums, also with a multivariate random index and not necessarily integer-valued. Applications are given within parameter estimations, L\'evy processes and random matrices and, more generally, problems involving multivariate functions. The connection between exponential models and multivariable Sheffer polynomial sequences offers a different viewpoint in characterizing these models. Some open problems end the paper.Comment: 17 pages, In pres

    Reordering Columns for Smaller Indexes

    Get PDF
    Column-oriented indexes-such as projection or bitmap indexes-are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing cardinality. Experimentally, sorting based on Hilbert space-filling curves is poor at minimizing the number of runs.Comment: to appear in Information Science

    A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform

    Get PDF
    Indexing sequence data is important in the context of Precision Medicine, where large amounts of "omics"data have to be daily collected and analyzed in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for the computation of Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first that distributes the index computation and not only the input dataset, allowing to fully benefit of the available cloud resources. Copyright © 2020 for this paper by its authors
    • …