3 research outputs found

MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

Author: Farnoud Farzad
Kim Minji
Ligo Jonathan G.
Milenkovic Olgica
Veeravalli Venugopal V.
Zhang Xiejia
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 19/02/2016

Background: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and in human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1 and 10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. To overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression.
Results: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes.
Conclusions: We described the first architecture for reference-based, lossless compression of metagenomic data. The proposed compression scheme offers significantly improved compression ratios compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel, and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline.
Availability: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a virtual box is set up on a machine with 4 GB of RAM so that users can run a simple demonstration.

PubMed Central

Caltech Authors

Additional file 4 of MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

Author: Farzad Farnoud (3576665)
Jonathan Ligo (3576659)
Minji Kim (1722313)
Olgica Milenkovic (534266)
Venugopal Veeravalli (3576656)
Xiejia Zhang (3576662)

Outcome of MetaCRAM. Additional file 2 illustrates the detailed outcome of MetaCRAM, such as the files and folders produced after compression and decompression, and an example of console output. (PDF 82 kb)

FigShare

MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

Author: Farzad Farnoud
Jonathan G. Ligo
Minji Kim
Olgica Milenkovic
Venugopal V. Veeravalli
Xiejia Zhang
Publication venue: 'Springer Science and Business Media LLC'

Crossref