
    Data compression for sequencing data

    Post-Sanger sequencing methods produce vast amounts of data, and it is generally agreed that the challenge of storing and processing them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. We then answer the questions “what” and “how” by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we return to the question “why compression” and give other, perhaps surprising, answers, demonstrating the pervasiveness of data compression techniques in computational biology.
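
    As a quick, quantitative illustration of the “why compression” question discussed above, the short Python sketch below (not taken from the review; the file name reads.fastq is a placeholder) gzip-compresses a FASTQ file and reports the achieved ratio.

        # Minimal sketch: quantify what general-purpose compression buys on a
        # FASTQ file. "reads.fastq" is a placeholder path, not data from the review.
        import gzip
        import os

        def gzip_ratio(path: str) -> float:
            """Compress `path` with gzip and return original_size / compressed_size."""
            gz_path = path + ".gz"
            with open(path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=9) as dst:
                dst.writelines(src)
            return os.path.getsize(path) / os.path.getsize(gz_path)

        if __name__ == "__main__":
            print(f"compression ratio: {gzip_ratio('reads.fastq'):.2f}x")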

    Indexing large genome collections on a PC

    Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine and in understanding the interaction between genotype and phenotype, to name a few applications. A key operation in such analyses is aligning sequencing reads against a collection of genomes, which is costly with existing algorithms due to their large memory requirements. Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in the exact and approximate matching model, against a collection of thousand(s) of genomes. Its unique feature is a small index size, fitting in a standard computer with 16–32 GB, or even 8 GB, of RAM for the 1000GP collection of 1092 diploid human genomes. The solution is also fast: exact-matching queries are handled in an average time of 39 μs, and queries with up to 3 mismatches in 373 μs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 μs and 917 μs. Availability: Software and Supplementary material: http://sun.aei.polsl.pl/mugi
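
    For readers unfamiliar with the query model, the sketch below is a naive Python reference, not the MuGI index itself: it reports all occurrences of a pattern across a collection of sequences, exactly or with up to k mismatches under the Hamming distance. MuGI answers the same kind of queries from its compact index; the toy sequences here are placeholders, not 1000GP data.

        # Naive reference for the query model (not MuGI): report all occurrences
        # of a pattern across a genome collection, exact or with <= k mismatches.
        from typing import Dict, List, Tuple

        def occurrences(genomes: Dict[str, str], pattern: str, k: int = 0) -> List[Tuple[str, int]]:
            hits, m = [], len(pattern)
            for name, seq in genomes.items():
                for i in range(len(seq) - m + 1):
                    mismatches = sum(1 for a, b in zip(seq[i:i + m], pattern) if a != b)
                    if mismatches <= k:
                        hits.append((name, i))
            return hits

        # Toy collection (placeholder data).
        genomes = {"g1": "ACGTACGTGA", "g2": "ACCTACGT"}
        print(occurrences(genomes, "ACGT", k=0))  # exact matches
        print(occurrences(genomes, "ACGT", k=1))  # up to 1 mismatch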

    Third-generation sequencing data analytics on mobile devices: cache-oblivious and out-of-core approaches as a proof of concept

    Mobile (third-generation) sequencing technologies, including Oxford Nanopore's MinION and SmidgION, have the benefit of outputting long sequence reads (up to hundreds of thousands of bases) in a portable manner. These sequencing devices fit in the palm of a hand and only require a USB connection. Unfortunately, the development of data analysis tools for these technologies is at a nascent stage, impeding the portability of these devices. The objective of this work is to introduce an out-of-core approach to port Nanopore analytics to mobile devices such as tablets or smartphones, which are often used in extreme experimental settings with special ergonomics needs and ease of sterilization. In this paper, we present a serial k-mer parser/counter for FAST5 files and a de Bruijn graph construction method which can run on a hand-held device. To accomplish this portability we develop novel cache-oblivious data structures and out-of-core chunked processing methods. Our toolset, which we refer to as the Nanopore Portable Analytics Library (NanoPAL), was implemented in ISO C++ v.14 and compiled for Android devices. Using MinION data (Zaire Ebolavirus species and others), we evaluate the time required to parse and build the de Bruijn graph with respect to file sizes and RAM allocation, and compare these metrics to those of minimap/miniasm. On an LG Nexus 5 with 2 GB of RAM, a 2 MB L2 cache and 16 GB of storage, the out-of-core NanoPAL processes FAST5 files at about 30 minutes per 0.5 GB, creating sorted k-mer and de Bruijn graph files. The recompiled minimap/miniasm tool cannot complete processing of FAST5 files larger than 170 MB. In conjunction with base calling/error correction, and with the addition of assembly procedures downstream, NanoPAL can be effectively used to perform analyses of MinION/SmidgION data locally on a mobile device.
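
    The sketch below illustrates the out-of-core, chunked idea in plain Python; it is not NanoPAL code (which is C++14 and reads FAST5/HDF5). K-mers are buffered up to a fixed limit, sorted in RAM, spilled to disk, and finally streamed through a k-way merge, so peak memory stays bounded regardless of input size. Reads are taken from a plain text file, one sequence per line, to keep the example self-contained.

        # Out-of-core k-mer counting sketch (not NanoPAL): bounded-memory chunks
        # are sorted in RAM, spilled to disk, then merged with heapq.merge.
        import heapq
        import os
        import tempfile

        K = 15
        CHUNK_KMERS = 1_000_000  # k-mers held in memory per chunk (tunable)

        def kmers(seq, k=K):
            for i in range(len(seq) - k + 1):
                yield seq[i:i + k]

        def spill_sorted_chunk(buf, tmpdir):
            buf.sort()
            fd, path = tempfile.mkstemp(dir=tmpdir)
            with os.fdopen(fd, "w") as f:
                f.write("\n".join(buf) + "\n")
            return path

        def count_kmers_out_of_core(reads_path):
            """Yield (k-mer, count) pairs in sorted order using bounded RAM."""
            tmpdir = tempfile.mkdtemp()
            chunks, buf = [], []
            with open(reads_path) as reads:
                for line in reads:
                    buf.extend(kmers(line.strip()))
                    if len(buf) >= CHUNK_KMERS:
                        chunks.append(spill_sorted_chunk(buf, tmpdir))
                        buf = []
            if buf:
                chunks.append(spill_sorted_chunk(buf, tmpdir))
            files = [open(p) for p in chunks]
            prev, count = None, 0
            for kmer in heapq.merge(*files):  # equal k-mers arrive adjacently
                kmer = kmer.rstrip("\n")
                if kmer == prev:
                    count += 1
                else:
                    if prev is not None:
                        yield prev, count
                    prev, count = kmer, 1
            if prev is not None:
                yield prev, count
            for f in files:
                f.close()

    Writing the sorted (k-mer, count) stream back to disk yields a sorted k-mer file; the same merge pass could, in principle, feed a downstream de Bruijn graph construction step analogous to the pipeline described above.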

    MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

    Background: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and in human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large files, typically ranging from 1 to 10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA- and FASTQ-format metagenomic read processing and lossless compression. Results: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed files were 2–13 percent of the original raw metagenomic file sizes. Conclusions: We described the first architecture for reference-based, lossless compression of metagenomic data. The proposed compression scheme offers significantly improved compression ratios compared to off-the-shelf methods such as zip programs. Furthermore, it runs different components in parallel and provides the user with the taxonomic and assembly information generated during execution of the compression pipeline. Availability: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a VirtualBox setup on a 4 GB RAM machine is provided for users to run a simple demonstration.
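
    The sketch below illustrates one ingredient of reference-based compression in the spirit described above, though it is not MetaCRAM's actual coder: read-reference alignment positions are sorted, gap (delta) encoded, and the gaps are stored as LEB128-style variable-length integers, so clustered positions cost only a byte or two each. The toy positions are placeholders.

        # Illustrative sketch (not MetaCRAM's coder): delta + varint coding of
        # read-reference alignment positions.
        def varint_encode(n):
            """Encode a non-negative integer as a little-endian base-128 varint."""
            out = bytearray()
            while True:
                byte = n & 0x7F
                n >>= 7
                if n:
                    out.append(byte | 0x80)
                else:
                    out.append(byte)
                    return bytes(out)

        def encode_positions(positions):
            """Sort alignment positions and emit varint-coded gaps between them."""
            prev, out = 0, bytearray()
            for p in sorted(positions):
                out += varint_encode(p - prev)
                prev = p
            return bytes(out)

        # Toy example: clustered positions compress to small gap values.
        positions = [10_001, 10_007, 10_150, 10_151, 250_000]
        encoded = encode_positions(positions)
        print(len(encoded), "bytes vs", 4 * len(positions), "bytes as raw 32-bit ints")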