117 research outputs found

    MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence

    Programs based on hash tables and the Burrows-Wheeler transform are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm, but it can take hours or days to map millions of reads even for bacterial genomes. We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Like most programs, MaxSSmap first identifies a local region of the genome and then performs exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates the maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E.coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E.coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes. We show that MaxSSmap attains comparable high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. At the short read lengths of 36 and 51, both MaxSSmap and Smith-Waterman have lower accuracy than at longer read lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code is freely available from http://www.cs.njit.edu/usman/MaxSSmap
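
    The fragment-selection step described in this abstract can be illustrated with a small sequential sketch. It is not the MaxSSmap implementation: the +1/-1 scores, the function names, and the choice to compare the read only at the start of each fragment are assumptions made for illustration, and the loop over fragments stands in for work that the abstract says runs in parallel on a GPU.

```python
from typing import List, Tuple

MATCH, MISMATCH = 1, -1  # hypothetical scores, not taken from the paper


def max_scoring_subsequence(scores: List[int]) -> int:
    """Return the largest sum over any contiguous run of scores (Kadane's algorithm)."""
    best = current = 0
    for s in scores:
        current = max(0, current + s)  # drop a running sum that has gone negative
        best = max(best, current)
    return best


def best_fragment(read: str, genome: str, frag_len: int) -> Tuple[int, int]:
    """Score the read against each disjoint fragment; return (fragment start, score)."""
    best_start, best_score = -1, -1
    for start in range(0, len(genome), frag_len):
        fragment = genome[start:start + frag_len]
        # Position-by-position comparison; zip stops at the shorter string.
        scores = [MATCH if a == b else MISMATCH for a, b in zip(read, fragment)]
        score = max_scoring_subsequence(scores)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


if __name__ == "__main__":
    genome = "ACGTACGT" + "TTGACCAT" + "GGGGCCCC"
    print(best_fragment("TTGACGAT", genome, 8))
```

    In this toy run the second fragment wins because a single mismatch only dents the running sum, which is the property that makes the score tolerant of divergent reads. The winning fragment would then be passed to an exact aligner such as Smith-Waterman.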

    Fast inexact mapping using advanced tree exploration on backward search methods

    Background: Short sequence mapping methods for Next Generation Sequencing consist of a combination of seeding techniques followed by local alignment based on dynamic programming approaches. Most seeding algorithms are based on backward search alignment, using the Burrows-Wheeler Transform, the Ferragina and Manzini Index or Suffix Arrays. All these backward search algorithms have excellent performance, but their computational cost increases sharply when errors are allowed. In this paper, we discuss an inexact mapping algorithm based on pruning strategies for search tree exploration over genomic data. Results: The proposed algorithm achieves a 13x speed-up over similar algorithms when allowing 6 base errors, including insertions, deletions and mismatches. This algorithm can deal with 400 bp reads with up to 9 errors in a high quality Illumina dataset. In this example, the algorithm works as a preprocessor that reduces by 55% the number of reads to be aligned. Depending on the aligner, the overall execution time is reduced by 20–40%. Conclusions: Although not intended as a complete sequence mapping tool, the proposed algorithm could be used as a preprocessing step to modern sequence mappers. This step significantly reduces the number of reads to be aligned, accelerating overall alignment time. Furthermore, this algorithm could be used for accelerating the seeding step of already available sequence mappers. 
In addition, an out-of-core index has been implemented for working with large genomes on systems without expensive memory configurations. The authors would like to thank the Universitat Politecnica de Valencia (Spain) in the frame of the grant "High-performance tools for the alignment of genetic sequences using graphic accelerators (GPGPUs)/Herramientas de altas prestaciones para el alineamiento de secuencias geneticas mediante el uso de aceleradores graficos (GPGPUs)", research program PAID-06-11, code 2025. Salavert Torres, J.; Tomás Domínguez, AE.; Tárraga Giménez, J.; Medina Castelló, I.; Dopazo Blazquez, J.; Blanquer Espert, I. (2015). Fast inexact mapping using advanced tree exploration on backward search methods. BMC Bioinformatics. 16(18):1-11. https://doi.org/10.1186/s12859-014-0438-3
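
    To make the idea of backward search with a pruned search tree concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a naive suffix-array construction, allows substitutions only (no insertions or deletions), and uses the simplest possible pruning rules: a branch is abandoned when its error budget is spent or its suffix-array interval becomes empty. The names `build_index` and `search` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build a naive suffix array, the C table and Occ counts for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in the text strictly smaller than c.
    c_table: Dict[str, int] = {}
    total = 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i]: occurrences of c in bwt[:i].
    occ: Dict[str, List[int]] = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, c_table, occ


def search(pattern: str, text: str, max_mismatches: int) -> List[int]:
    """Return start positions where pattern occurs with at most max_mismatches substitutions."""
    sa, c_table, occ = build_index(text)
    hits = set()

    def extend(i: int, lo: int, hi: int, budget: int) -> None:
        if i < 0:  # whole pattern consumed: every suffix in [lo, hi) is a hit
            hits.update(sa[lo:hi])
            return
        for ch in c_table:
            if ch == "$":
                continue
            cost = 0 if ch == pattern[i] else 1
            if cost > budget:
                continue  # prune: error budget exhausted
            new_lo = c_table[ch] + occ[ch][lo]
            new_hi = c_table[ch] + occ[ch][hi]
            if new_lo < new_hi:  # prune: empty suffix-array interval
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(pattern) - 1, 0, len(sa), max_mismatches)
    return sorted(hits)


if __name__ == "__main__":
    print(search("TCG", "ACGTACGTTACG", 1))
```

    With a budget of zero only one branch survives at each step, so the cost depends on the pattern length alone. Each allowed error multiplies the number of live branches, which is why the cost grows quickly with the error count and why pruning matters.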

    Genetically improved BarraCUDA.

    BACKGROUND: BarraCUDA is an open source C program which uses the BWA algorithm in parallel with nVidia CUDA to align short next generation DNA sequences against a reference genome. Recently its source code was optimised using "Genetic Improvement". RESULTS: The genetically improved (GI) code is up to three times faster on short paired end reads from The 1000 Genomes Project and 60% more accurate on a short BioPlanet.com GCAT alignment benchmark. GPGPU BarraCUDA running on a single K80 Tesla GPU can align short paired end nextGen sequences up to ten times faster than bwa on a 12 core server. CONCLUSIONS: The speed-up was such that the GI version was adopted and has been regularly downloaded from SourceForge for more than 12 months.

    Inexact Mapping of Short Biological Sequences in High Performance Computational Environments

    Bioinformatics is the application of computer science to the management and analysis of biological data. From 2005 onward, the arrival of new-generation DNA sequencers gave rise to what is known as Next Generation Sequencing, or NGS. A single biological experiment run on an NGS sequencing machine can easily produce hundreds of gigabytes or even terabytes of data. Depending on the technique chosen, this process can take a few hours or days. The availability of affordable local resources, such as multicore processors or the new graphics cards prepared for general-purpose computation, GPGPU (General Purpose Graphic Processing Unit), is a great opportunity to tackle these problems. A frequently addressed topic today is DNA sequence alignment. In bioinformatics, alignment makes it possible to compare two or more DNA sequences, RNA sequences, or primary protein structures, highlighting their regions of similarity. Such similarities may indicate functional or evolutionary relationships between the genes or proteins queried. Moreover, similarities between the sequences of a patient and those of another individual with a detected genetic disease could be used effectively in diagnostic medicine. The problem at the core of this doctoral thesis is locating short sequence fragments within DNA. This is known as sequence mapping. The mapping must tolerate errors, so that sequences can be mapped even in the presence of genetic variability or read errors. Several techniques exist for mapping, but since the advent of NGS the most prominent has been search over prefixes indexed and grouped by means of the Burrows-Wheeler transform [28] (BWT hereafter). 
This transform was originally used in data compression techniques, as in the bzip2 algorithm. Its use as a tool for indexing and subsequently searching information is more recent [22]. Its advantage is that its computational complexity depends only on the length of the sequence to be mapped. On the other hand, many alignment techniques are based on dynamic programming algorithms, either Smith-Waterman or hidden Markov models (HMMs). These provide greater sensitivity and tolerate more errors, but their computational cost is higher and depends on the length of the sequence multiplied by the length of the reference string. Many tools combine a first phase that uses the BWT to search for candidate alignment regions with a second, local alignment phase in which strings are mapped with Smith-Waterman or HMMs. When mapping with few errors allowed, a second phase with a dynamic programming algorithm is too expensive, so an inexact search based on the BWT can be more efficient. The main motivation of the doctoral thesis is to implement an inexact search algorithm based solely on the BWT, adapting it to modern parallel architectures, both CPU and GPGPU. The algorithm will constitute a new branch-and-prune method adapted to genomic information. During the research stay, hidden Markov models will be studied and an implementation will be made on GTA (Aggregate ∘ Test ∘ Generate) functional computation models, together with the shared-memory and distributed-memory parallelization of that functional programming platform. Salavert Torres, J. (2014). Inexact Mapping of Short Biological Sequences in High Performance Computational Environments [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/43721
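
    The cost contrast drawn in this abstract, where dynamic programming scales with the read length multiplied by the reference length, can be seen in a minimal Smith-Waterman scoring sketch. The scoring values and the linear gap penalty are assumptions chosen for illustration and are not taken from the thesis.

```python
def smith_waterman_score(read: str, ref: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score using a linear gap penalty."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):          # one row per read character
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):       # one cell per reference character
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

    The two nested loops visit every (read, reference) cell once, so the work is proportional to the product of the two lengths. This is why tools typically restrict dynamic programming to the candidate regions found by an index-based first phase.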

    Algorithms for the mapping of genome sequences in GPGPU

    This project focuses on using GPGPUs to solve the inexact alignment of short reads against a reference indexed with the Burrows-Wheeler Transform. More specifically, we address alignment that allows up to one error. Seide, D. (2012). Algorithms for the mapping of genome sequences in GPGPU. http://hdl.handle.net/10251/16955

    FPGA acceleration of short read alignment with high-level synthesis

    With the introduction of next-generation sequencing (NGS) technologies, DNA sequencing is becoming an increasingly widespread process. When performed on human patients, it can allow for the prediction and prevention of diseases. An essential part of this bioinformatics pipeline is short read alignment, which refers to aligning short fragments of DNA to the large and expansive reference genome. This can be a very time-consuming process with much room for improvement. This thesis improves on Bowtie 2, an aligner that is already very popular and high-performing. Through the use of OpenCL, it is possible to parallelize this application for both GPU and FPGA by using the same code. Several different levels of parallelism are implemented in order to achieve speedup on Bowtie 2.