10,387 research outputs found

    Fast and accurate short read alignment with Burrows–Wheeler transform

    Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.

    Fast inexact mapping using advanced tree exploration on backward search methods

    Background: Short sequence mapping methods for Next Generation Sequencing consist of a combination of seeding techniques followed by local alignment based on dynamic programming approaches. Most seeding algorithms are based on backward search alignment, using the Burrows-Wheeler Transform, the Ferragina and Manzini Index or Suffix Arrays. All these backward search algorithms have excellent performance, but their computational cost increases sharply when errors are allowed. In this paper, we discuss an inexact mapping algorithm based on pruning strategies for search tree exploration over genomic data. Results: The proposed algorithm achieves a 13x speed-up over similar algorithms when allowing 6 base errors, including insertions, deletions and mismatches. This algorithm can deal with 400 bp reads with up to 9 errors in a high quality Illumina dataset. In this example, the algorithm works as a preprocessor that reduces by 55% the number of reads to be aligned. Depending on the aligner, the overall execution time is reduced by 20–40%. Conclusions: Although not intended as a complete sequence mapping tool, the proposed algorithm could be used as a preprocessing step to modern sequence mappers. This step significantly reduces the number of reads to be aligned, accelerating overall alignment time. Furthermore, this algorithm could be used for accelerating the seeding step of already available sequence mappers.
    In addition, an out-of-core index has been implemented for working with large genomes on systems without expensive memory configurations.
    The authors would like to thank the Universitat Politecnica de Valencia (Spain) in the frame of the grant "High-performance tools for the alignment of genetic sequences using graphic accelerators (GPGPUs)/Herramientas de altas prestaciones para el alineamiento de secuencias geneticas mediante el uso de aceleradores graficos (GPGPUs)", research program PAID-06-11, code 2025.
    Salavert Torres, J.; Tomás Domínguez, AE.; Tárraga Giménez, J.; Medina Castelló, I.; Dopazo Blazquez, J.; Blanquer Espert, I. (2015). Fast inexact mapping using advanced tree exploration on backward search methods. BMC Bioinformatics. 16(18):1-11. https://doi.org/10.1186/s12859-014-0438-3
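
For readers unfamiliar with the backward search that this abstract builds on, the following is a minimal Python sketch of exact backward search over a Burrows-Wheeler index. It is not the authors' algorithm: it handles exact matches only, builds the index naively by sorting suffixes, and the names `bwt_index` and `backward_search` are illustrative assumptions. The inexact search described above would explore a tree of such interval updates and prune branches that exceed the error budget.

```python
from collections import Counter


def bwt_index(text):
    """Build a naive BWT index: suffix array, C table and occurrence table."""
    text += "$"  # sentinel; assumed absent from text and smaller than any symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # symbol preceding each sorted suffix
    counts = Counter(text)
    # C[c] = number of symbols in text that are strictly smaller than c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open interval over the suffix array
    for c in reversed(pattern):  # consume the pattern from right to left
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:  # empty interval: the pattern does not occur
            return []
    return sorted(sa[lo:hi])
```

Each pattern symbol costs two table lookups, which is why exact search is cheap; allowing errors multiplies the number of intervals to follow, which is the cost the abstract refers to.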

    MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence

    Programs based on hash tables and Burrows-Wheeler are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm, but it can take hours or days to map millions of reads even for bacterial genomes. We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most programs, MaxSSmap identifies a local region of the genome followed by exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates the maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E. coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E. coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes. We show that MaxSSmap attains comparable high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. At short read lengths of 36 and 51, both MaxSSmap and Smith-Waterman have lower accuracy than at longer read lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code is freely available from http://www.cs.njit.edu/usman/MaxSSmap
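
The fragment-selection idea in this abstract can be illustrated with a small CPU-only Python sketch. It is a simplification under stated assumptions, not the MaxSSmap implementation: the +1/-1 scores, the fragment length equal to the read length, and the function names are all hypothetical, and reads that straddle a fragment boundary are ignored. The maximum scoring subsequence is computed with a Kadane-style running sum over per-position match and mismatch scores.

```python
def max_scoring_subsequence(read, fragment, match=1, mismatch=-1):
    """Highest sum over any contiguous run of per-position scores."""
    best = current = 0
    for r, g in zip(read, fragment):
        # Restart the run whenever the running sum drops below zero.
        current = max(0, current + (match if r == g else mismatch))
        best = max(best, current)
    return best


def best_fragment(read, genome, fragment_len=None):
    """Start offset of the disjoint genome fragment with the top score."""
    fragment_len = fragment_len or len(read)  # assumption: fragment = read length
    starts = range(0, len(genome) - fragment_len + 1, fragment_len)
    return max(
        starts,
        key=lambda s: max_scoring_subsequence(read, genome[s:s + fragment_len]),
    )
```

Because each fragment is scored independently, the loop over `starts` is the part that maps naturally onto parallel GPU threads; the selected fragment would then be passed to an exact aligner.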

    Accurate long read mapping using enhanced suffix arrays

    With the rise of high throughput sequencing, new programs have been developed for dealing with the alignment of a huge amount of short read data to reference genomes. Recent developments in sequencing technology allow longer reads, but the mappers for short reads are not suited for reads of several hundreds of base pairs. We propose an algorithm for mapping longer reads, which is based on chaining maximal exact matches and uses heuristics and the Needleman-Wunsch algorithm to bridge the gaps. To compute maximal exact matches we use a specialized index structure, called enhanced suffix array. The proposed algorithm is very accurate and can handle large reads with mutations and long insertions and deletions.
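
As a rough illustration of what chaining maximal exact matches means, the Python sketch below picks the collinear, non-overlapping set of matches that covers the most bases, using a quadratic dynamic program. It is only a sketch under assumptions: the `(read_pos, ref_pos, length)` tuple layout and the scoring by total matched length are hypothetical choices, the matches are taken as given rather than computed from an enhanced suffix array, and the gap-bridging step with Needleman-Wunsch is omitted.

```python
def chain_matches(matches):
    """Return the best collinear chain of (read_pos, ref_pos, length) matches."""
    if not matches:
        return []
    matches = sorted(matches)  # order by position in the read
    n = len(matches)
    score = [m[2] for m in matches]  # best chain score ending at each match
    prev = [-1] * n                  # back-pointers for chain recovery
    for i in range(n):
        qi, ri, li = matches[i]
        for j in range(i):
            qj, rj, lj = matches[j]
            # j may precede i only if it ends before i starts in both sequences
            if qj + lj <= qi and rj + lj <= ri and score[j] + li > score[i]:
                score[i] = score[j] + li
                prev[i] = j
    i = max(range(n), key=score.__getitem__)
    chain = []
    while i != -1:  # walk the back-pointers from the best end point
        chain.append(matches[i])
        i = prev[i]
    return chain[::-1]
```

The regions between consecutive matches in the returned chain are the gaps that a global alignment step would then fill.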

    SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner

    To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60 percent. Real data evaluation using the human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.

    GPU-Accelerated BWT Construction for Large Collection of Short Reads

    Advances in DNA sequencing technology have stimulated the development of algorithms and tools for processing very large collections of short strings (reads). Short-read alignment and assembly are among the most well-studied problems. Many state-of-the-art aligners, at their core, have used the Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome (a typical example being the NCBI human genome). Recently, BWT has also found its use in string-graph assembly, for indexing the reads (i.e., raw data from DNA sequencers). In a typical data set, the volume of reads is tens of times the size of the sequenced genome and can be up to 100 Gigabases. Note that a reference genome is relatively stable and computing the index is not a frequent task. For reads, the index has to be computed from scratch for each given input. Efficient BWT construction therefore becomes a much bigger concern than before. In this paper, we present a practical method called CX1 for constructing the BWT of very large string collections. CX1 is the first tool that can take advantage of the parallelism given by a graphics processing unit (GPU, a relatively cheap device providing a thousand or more primitive cores), as well as, simultaneously, the parallelism from a multi-core CPU and, more interestingly, from a cluster of GPU-enabled nodes. Using CX1, the BWT of a short-read collection of up to 100 Gigabases can be constructed in less than 2 hours using a machine equipped with a quad-core CPU and a GPU, or in about 43 minutes using a cluster with 4 such machines (the speedup is almost linear after excluding the first 16 minutes for loading the reads from the hard disk). The previously fastest tool, BRC, is measured to take 12 hours to process 100 Gigabases on one machine; it is not obvious how BRC could be parallelized to take advantage of a cluster of machines, let alone GPUs.
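
To make concrete what the BWT of a string collection is, here is a deliberately naive Python sketch. It is not the CX1 method: it sorts every suffix in memory on the CPU, which is far too slow for real read sets. It also assumes one common convention, namely that each read ends with a `$` sentinel and that ties between identical suffixes are broken by read order; the function name `collection_bwt` is hypothetical.

```python
def collection_bwt(reads):
    """BWT of a read collection, one '$' sentinel appended per read."""
    suffixes = []
    for idx, read in enumerate(reads):
        s = read + "$"  # sentinel; assumed smaller than every read symbol
        for j in range(len(s)):
            # Store the suffix, its read index (tie-breaker) and the symbol
            # preceding it; s[j - 1] wraps to the sentinel when j == 0.
            suffixes.append((s[j:], idx, s[j - 1]))
    suffixes.sort(key=lambda t: (t[0], t[1]))
    return "".join(prev for _, _, prev in suffixes)
```

The sort is the expensive step; the abstract's point is that this work can be split across GPU cores, CPU cores and cluster nodes.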

    Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

    Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we proposed the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: [email protected]