Search CORE

3,110 research outputs found

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Author: A. J. Cox
Chen
Dewey
G. Rosone
Kozanitis
M. J. Bauer
T. Jakobi
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

Motivation: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. Results: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome, and to compare choices of second-stage compression algorithm. We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3 Gbp of sequence to fit into only 8.2 Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further). This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections. Comment: Version here is as submitted to Bioinformatics and is the same as the previously archived version. This submission registers the fact that the advance access version is now available at http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract . Bioinformatics should be considered as the original place of publication of this article; please cite accordingly.
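To make the transform named in this abstract concrete, here is a minimal sketch of the BWT and its inverse in Python. It is not the authors' algorithm: it sorts all rotations in memory, which is only practical for short strings, and the function names and the `$` sentinel are assumptions chosen for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text by sorting all rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel  # the sentinel sorts before letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str) -> str:
    """Recover the original text (without the sentinel) from its BWT."""
    # A stable sort of positions by character links each row of the sorted
    # first column to the row whose rotation starts one character later.
    order = sorted(range(len(last)), key=lambda i: last[i])
    first = sorted(last)
    row = order[0]  # row holding the rotation that starts with the text
    out = []
    for _ in range(len(last) - 1):
        out.append(first[row])
        row = order[row]
    return "".join(out)


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

The transform tends to group equal characters into runs, which is what a second-stage compressor then exploits; the round trip through `inverse_bwt` shows that no information is lost.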

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Publications at Bielefeld University

Recommended from our members

Exome resequencing and GWAS for growth, ecophysiology, and chemical and metabolomic composition of wood of Populus trichocarpa.

Author: Davis Mark F
Famula Randi
Fiehn Oliver
Guerra Fernando P
Holliday Jason
Neale David B
Richards James H
Shuren Richard
Stanton Brian J
Suren Haktan
Sykes Robert
Publication venue: eScholarship, University of California
Publication date: 01/11/2019

Background: Populus trichocarpa is an important forest tree species for the generation of lignocellulosic ethanol. Understanding the genomic basis of biomass production and chemical composition of wood is fundamental in supporting genetic improvement programs. Considerable variation has been observed in this species for complex traits related to growth, phenology, ecophysiology and wood chemistry. Those traits are influenced by both polygenic control and environmental effects, and their genome architecture and regulation are only partially understood. Genome wide association studies (GWAS) represent an approach to advance that aim using thousands of single nucleotide polymorphisms (SNPs). Genotyping using exome capture methodologies represents an efficient approach to identify specific functional regions of genomes underlying phenotypic variation. Results: We identified 813 K SNPs, which were utilized for genotyping 461 P. trichocarpa clones, representing 101 provenances collected from Oregon and Washington, and established in California. A GWAS performed on 20 traits, considering single SNP-marker tests, identified a variable number of significant SNPs (p-value < 6.1479E-8) in association with diameter, height, leaf carbon and nitrogen contents, and δ15N. The number of significant SNPs ranged from 2 to 220 per trait. Additionally, multiple-marker analyses by sliding-window tests detected between 6 and 192 significant windows for the analyzed traits. The significant SNPs resided within genes that encode proteins belonging to different functional classes such as protein synthesis, energy/metabolism and DNA/RNA metabolism, among others. Conclusions: SNP markers within genes associated with traits of importance for biomass production were detected. They help characterize the genomic architecture of P. trichocarpa biomass, which is required to support the development and application of marker breeding technologies.

eScholarship - University of California

Prefix-free parsing for building big BWTs

Author: Boucher C.
Gagie T.
Kuhnle A.
Langmead B.
Manzini G.
Mun T.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice, and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
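The parsing step described in this abstract can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: it covers only the split of T into a dictionary D and a parse P (not the BWT construction from them), it uses `zlib.crc32` on each window where a rolling hash would normally be used, and the window size `w`, modulus `p`, sentinel characters and function names are all assumptions.

```python
import zlib


def prefix_free_parse(text: str, w: int = 4, p: int = 5):
    """Split text into overlapping phrases; return (dictionary, parse)."""
    # Hypothetical sentinels, assumed not to occur in the text.
    s = "\x01" + text + "\x02" * w
    phrases = []
    start = 0
    for end in range(w + 1, len(s) + 1):
        window = s[end - w:end]
        # A phrase ends at a "trigger" window (hash divisible by p)
        # or at the end of the padded text.
        if end == len(s) or zlib.crc32(window.encode("utf-8")) % p == 0:
            phrases.append(s[start:end])
            start = end - w  # consecutive phrases overlap by w characters
    dictionary = sorted(set(phrases))
    rank = {phrase: i for i, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def rebuild(dictionary, parse, w: int = 4) -> str:
    """Reassemble the text from D and P, dropping overlaps and sentinels."""
    s = dictionary[parse[0]]
    for phrase_id in parse[1:]:
        s += dictionary[phrase_id][w:]
    return s[1:-w]


text = "GATTACAGATTACAGATTACA" * 20
D, P = prefix_free_parse(text)
print(len(text), len(D), len(P))
```

On repetitive input the same phrases recur, so the dictionary holds each distinct phrase once while the parse stores only integer identifiers; `rebuild` confirms that D and P together still determine T.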

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

Author: Depristo
Durbin
Gingeras
H. Li
Homer
Idury
Iqbal
Lam
Levy
Myers
Myers
Myers
Peltola
Pevzner
Staden
Zerbino
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested that in a string graph, or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we proposed the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: [email protected] Comment: Rev2: submitted version with minor improvements; 7 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Author: Boucher C.
Gagie T.
Kuhnle A.
Langmead B.
Manzini G.
Mun T.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us access to the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time.
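The two FM-index components described in this abstract (rank over the BWT, plus suffix array access) can be illustrated with a small uncompressed sketch. It is not the method of the paper: it stores the full suffix array and plain prefix-count tables in place of a sampled SA and a run-length compressed BWT, and all names are assumptions made for the example.

```python
from collections import Counter


def build_fm_index(text: str, sentinel: str = "$"):
    """Return (sa, C, occ) for text; naive construction for illustration."""
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # full suffix array
    bwt = "".join(s[i - 1] for i in sa)              # char preceding each suffix
    counts = Counter(s)
    # C[c] = number of characters in s strictly smaller than c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i] (the rank structure)
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern: str, sa, C, occ):
    """Return the half-open SA interval [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


sa, C, occ = build_fm_index("banana")
lo, hi = backward_search("ana", sa, C, occ)
print(hi - lo, sorted(sa[lo:hi]))  # 2 [1, 3]
```

Counting needs only `C` and `occ`; reporting positions reads `sa[lo:hi]`, which is the part a sampled suffix array would replace in a space-efficient index.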

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale