
    Efficient compression of large repetitive strings

    When it comes to managing large volumes of data, general-purpose compressors such as gzip are ubiquitous. They are fast, practical and available on every modern platform, from standard desktops to mobile devices. These tools exploit local redundancy in a text using a fixed-size sliding window. This window is usually very small relative to the text, although in principle it can be as large as available memory. The window acts as a dictionary, and compression is achieved by replacing substrings with pointers to previous occurrences found in the dictionary. This approach becomes problematic for collections larger than physical memory, as it fails to capture any non-local redundancy, that is, repetition that occurs outside the search window. With rapid growth in the already enormous amount of data we store and process, there is a pressing need to improve compression effectiveness, reducing both storage requirements and decompression costs. Nevertheless, many systems still apply general-purpose compression tools to large, highly repetitive data collections. In this thesis we address this issue. We explore compression in a variety of domains where large volumes of data must be stored and accessed, and where general-purpose compression tools are the norm. First we discuss our work on web corpus compression; then the implementation of a practical index for repetitive texts that gives strong theoretical bounds on size and access time; and finally our work on compression of high-throughput sequencing reads. We show that in all cases our new methods improve on current techniques in both run time and compression effectiveness, while providing important functionality such as fast decoding and random access.
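    The sketch below illustrates the sliding-window idea the abstract describes: a toy LZ77-style parser in Python that replaces repeated substrings with (offset, length, literal) pointers into a bounded window. It is a minimal illustration only, not gzip's actual algorithm (which adds hash chains and entropy coding); the function names and window size are assumptions made for the example.

```python
# Toy LZ77-style sliding-window compression: each phrase is a pointer
# (offset, length) into the last `window` characters, plus one literal.
# Illustrative only; real tools like gzip are far more engineered.

def lz77_parse(text: str, window: int = 32):
    i, phrases = 0, []
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):   # dictionary = recent window
            length = 0
            while (i + length < len(text) - 1
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        phrases.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decode(phrases):
    out = []
    for off, length, lit in phrases:
        for _ in range(length):
            out.append(out[-off])                # copy, possibly overlapping
        out.append(lit)
    return "".join(out)

text = "abracadabra abracadabra abracadabra"
assert lz77_decode(lz77_parse(text)) == text
```

    With a window much smaller than the distance between repeats, such a parser degenerates to emitting literals, which is exactly the non-local-redundancy failure mode the abstract describes.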

    RLZAP: Relative Lempel-Ziv with Adaptive Pointers

    Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. In Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression, because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation, with comparable random-access times.
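    To make the parsing strategy concrete, here is a hedged Python sketch of RLZ parsing with Deorowicz and Grabowski's mismatch characters: each phrase is the longest match in the reference followed by one explicit literal. The naive quadratic matching loop and all names are illustrative assumptions; practical implementations search with a suffix array or FM-index over the reference, and this paper's adaptive-pointer encoding is not shown.

```python
# Sketch of Relative Lempel-Ziv parsing with one mismatch character per
# phrase: (reference position, match length, literal). Decoding needs only
# the reference and the phrase list, which is what enables random access.

def rlz_parse(reference: str, target: str):
    phrases, i = [], 0
    while i < len(target):
        best_pos, best_len = 0, 0
        for j in range(len(reference)):          # naive longest-match search
            length = 0
            while (j + length < len(reference)
                   and i + length < len(target) - 1
                   and reference[j + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = j, length
        phrases.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return phrases

def rlz_decode(reference: str, phrases):
    return "".join(reference[p:p + l] + c for p, l, c in phrases)

ref = "ACGTACGTTGCA"
tgt = "ACGTTCGTAGCA"    # same species: mostly point substitutions
assert rlz_decode(ref, rlz_parse(ref, tgt)) == tgt
```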

    Relative Lempel-Ziv Compression of Suffix Arrays

    We show that a combination of differential encoding, random sampling, and relative Lempel-Ziv (RLZ) parsing is effective for compressing suffix arrays, while simultaneously allowing very fast decompression of arbitrary suffix array intervals, facilitating pattern matching. The resulting text index, while somewhat larger (5-10x) than the recent r-index of Gagie, Navarro, and Prezza (Proc. SODA '18), still provides significant compression and allows pattern location queries to be answered more than two orders of magnitude faster in practice.
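    The differential-encoding step can be shown in a few lines: suffix arrays of repetitive texts contain long runs where SA[i] = SA[i-1] + 1, so the difference stream is dominated by repeated values that an RLZ parse (not shown) compresses well. This Python sketch uses a naive O(n^2 log n) suffix array construction purely for illustration; the paper's actual sampling scheme and index layout are not reproduced here.

```python
# Differential encoding of a suffix array: repetitions in the text become
# repeated runs in the difference stream, which is then easy to compress.

def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])  # naive, demo only

def diff_encode(sa):
    return [sa[0]] + [sa[i] - sa[i - 1] for i in range(1, len(sa))]

def diff_decode(diffs):
    sa, acc = [], 0
    for d in diffs:
        acc += d
        sa.append(acc)
    return sa

text = "abcabcabcabc$"
sa = suffix_array(text)
assert diff_decode(diff_encode(sa)) == sa
print(diff_encode(sa))   # repeated differences reflect the text's repetitions
```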

    Dry-sliding tribological behaviour of cast irons used for internal-combustion engine cylinder liners

    SIGLE record CNRS TD 15304 / INIST-CNRS - Institut de l'Information Scientifique et Technique, France

    Scalable Reference Genome Assembly from Compressed Pan-Genome Index with Spark

    High-throughput sequencing (HTS) technologies have enabled rapid sequencing of genomes and large-scale genome analytics with massive data sets. Traditionally, genetic variation analyses have been based on the human reference genome, assembled from a relatively small human population. However, genetic variation could be discovered more comprehensively by using a collection of genomes, i.e., a pan-genome, as the reference. Pan-genomic references can be assembled from larger populations or from a specific population under study. Moreover, exploiting pan-genomic references with current bioinformatics tools requires efficient compression and indexing methods, and to leverage the accumulating genomic data, the power of distributed and parallel computing has to be harnessed for new genome analysis pipelines. We propose a scalable distributed pipeline, PanGenSpark, for compressing and indexing pan-genomes and assembling a reference genome from the pan-genomic index. We experimentally show the scalability of PanGenSpark with human pan-genomes in a distributed Spark cluster comprising 448 cores spread over 26 computing nodes. Assembling a consensus genome of a pan-genome of 50 human individuals took 215 min, and of 500 human individuals, 1468 min. In our experiments, the index of the 1.41 TB pan-genome was compressed to 164.5 GB.
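    As a rough illustration of the distributed pattern described above (and decidedly not PanGenSpark's actual code), the PySpark sketch below broadcasts a shared reference to all workers and compresses pan-genome chunks against it in parallel. The chunk size, the trivial pointer/literal codec and all names are assumptions made for the example.

```python
# Distributing relative compression of a pan-genome over a Spark cluster:
# chunk the genomes, broadcast the shared reference, compress in parallel.

from pyspark.sql import SparkSession

def compress_chunk(chunk, reference):
    # Placeholder codec: a chunk found verbatim in the reference becomes a
    # (position, length) pointer; anything else is kept as a literal.
    pos = reference.find(chunk)
    return ("ptr", pos, len(chunk)) if pos >= 0 else ("lit", chunk)

spark = SparkSession.builder.appName("pangenome-sketch").getOrCreate()
sc = spark.sparkContext

reference = "ACGT" * 1000
genomes = ["ACGT" * 1000, "ACGT" * 500 + "ACGA" * 500]  # stand-in pan-genome
chunks = [g[i:i + 256] for g in genomes for i in range(0, len(g), 256)]

ref_bc = sc.broadcast(reference)                 # ship the reference once
compressed = (sc.parallelize(chunks)
                .map(lambda c: compress_chunk(c, ref_bc.value))
                .collect())
spark.stop()
```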