38 research outputs found

    Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS

    Motivation: Obtaining large-scale sequence alignments in a fast and flexible way is an important step in the analysis of next-generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. Results: With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) to retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next-generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation. (...
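As a rough illustration of the alignment details mentioned in the abstract above (score, number of gaps and mismatches), the following is a minimal single-threaded Smith-Waterman sketch in plain Python. It is not PaSWAS code: the function name `smith_waterman`, the scoring values and the linear gap penalty are assumptions chosen for the example, and it returns only one best local alignment rather than multiple hits.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b with a linear gap penalty.

    Hypothetical helper for illustration; scoring values are assumptions.
    Returns the score plus counts of matches, mismatches and gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Scoring matrix; row 0 and column 0 stay at zero (local alignment).
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    matches = mismatches = gaps = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                matches += 1
            else:
                mismatches += 1
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            gaps += 1
            i -= 1
        else:
            gaps += 1
            j -= 1

    return {"score": best, "matches": matches,
            "mismatches": mismatches, "gaps": gaps}


if __name__ == "__main__":
    print(smith_waterman("ACGTACGT", "ACGAACGT"))
```

The quadratic matrix fill is the part that GPU implementations parallelize; the traceback is what yields the per-alignment details.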

    Impact of Gut Bacteria on the Infection and Transmission of Pathogenic Arboviruses by Biting Midges and Mosquitoes

    Tripartite interactions among insect vectors, midgut bacteria, and viruses may determine the ability of insects to transmit pathogenic arboviruses. Here, we investigated the impact of gut bacteria on the susceptibility of Culicoides nubeculosus and Culicoides sonorensis biting midges to Schmallenberg virus, and of Aedes aegypti mosquitoes to Zika and chikungunya viruses. Gut bacteria were manipulated by treating the adult insects with antibiotics. The gut bacterial communities were investigated using Illumina MiSeq sequencing of 16S rRNA, and susceptibility to arbovirus infection was tested by feeding insects with an infectious blood meal. Antibiotic treatment led to changes in gut bacteria for all insects. Interestingly, the gut bacterial composition of untreated Ae. aegypti and C. nubeculosus showed Asaia as the dominant genus, which was drastically reduced after antibiotic treatment. Furthermore, antibiotic treatment resulted in relatively more Delftia bacteria in both biting midge species, but not in mosquitoes. Antibiotic treatment and subsequent changes in gut bacterial communities were associated with a significant, 1.8-fold increased infection rate of C. nubeculosus with Schmallenberg virus, but not for C. sonorensis. We did not find any changes in infection rates for Ae. aegypti mosquitoes with Zika or chikungunya virus. We conclude that resident gut bacteria may dampen arbovirus transmission in biting midges, but not so in mosquitoes. Use of antimicrobial compounds at livestock farms might therefore have an unexpected contradictory effect on the health of animals, by increasing the transmission of viral pathogens by biting midges.

    Application of high performance compute technology in bioinformatics

    Bioinformatics and computational biology are driven by growing volumes of data in biological systems that also tend to increase in complexity. The research presented in this thesis focuses on the need to analyze such data volumes in such complexity. The results show that the application of high-performance compute technologies, preferably combined with low-cost hardware, is a successful approach to generate new bioinformatics approaches that allow addressing new types of data analyses and research questions in biology. An overview of the technologies and recent developments in biology and computer science relevant for this thesis (Chapter 1) identifies current high-throughput sequencing platforms as a key technology. Sequencing platforms now deliver data sets up to terabytes in size for elucidating genome structure, gene content, gene activity, as well as gene variants. The concepts and technologies from computer science to handle these large amounts of data include (a) grid technologies for compute parallelization while making more efficient use of existing low-cost infrastructure; (b) graphics cards for increased compute power and (c) graph databases for large data volume storage and advanced methods for analyses. This thesis presents novel applications and added value of these three concepts for bioinformatics research. Small RNAs are important regulators of genome function, yet their prediction in genomes is still a major computational challenge (Chapter 2). They tend to have a minimal free energy (MFE) significantly lower than the MFE of non-small RNA sequences with the same nucleotide composition. Evaluation of many MFEs is, however, too compute-intensive for genome-wide screening. With a local grid infrastructure of desktop computers, MFE distributions of a very large collection of sequence compositions were pre-calculated and used to determine the MFE distribution for any given sequence composition by interpolation. 
This approach allows on-the-fly calculation for any candidate sequence composition and makes genome-wide screening with this characteristic of a pre-miRNA sequence feasible. This way, MFE evaluation can be added as a new parameter for genome-wide selection of potential small RNA candidates (Chapter 2). The concept of large-scale pre-calculation of compute-intensive parameters is one of the options for future bioinformatics analyses. Sequence alignment is essential in the analysis of next-generation sequencing data. The gold standard for sequence alignment is the Smith-Waterman (SW) algorithm. Existing implementations of the full SW algorithm are either not fast enough, or limited to dedicated tasks, usually to optimize for speed, whereas popular heuristic SW versions (such as BLAST) suffer from statistical issues. Graphics hardware is well-suited to speed up SW alignments, but SW on graphics cards does not report the alignment details desired by biologists for further analysis. This thesis presents the CUDA-based Parallel SW Alignment Software (PaSWAS) (Chapter 3). PaSWAS gives (a) easy access to the computational power of NVIDIA-based graphics cards for high-speed sequence alignments, (b) information such as score, number of gaps and mismatches with the accuracy of the full SW algorithm and (c) a report of multiple hits per alignment. Two use cases show the usability and versatility of the new parallel Smith-Waterman implementation for bioinformatics analyses. It demonstrates the added value of the use of low-cost graphics cards in bioinformatics software. To further promote the use of PaSWAS, a new implementation, pyPaSWAS, provides the SW sequence alignment code fully packed in Python and the more widely accepted OpenCL language (Chapter 4). Moreover, pyPaSWAS now supports an affine gap penalty. This way, pyPaSWAS presents an easy Python-based environment for accurate and retrievable parallel SW sequence alignments on GPUs and multi-core systems. 
The strategy of integrating Python with high-performance parallel compute languages to create a developer- and user-friendly environment is worth considering for other computationally intensive bioinformatics algorithms. Thanks to the accuracy and retrieval characteristics of (py)PaSWAS, it was noted that long sequencing reads on the PacBio platform can contain many artificial palindromic sequences. These palindromes are due to errors introduced by whole-genome amplification (WGA). Next-generation sequencing requires sufficient amounts of DNA. If not available, WGA is routinely used to generate the amounts of DNA required. The introduction of artificial palindromic sequences hampers assembly and severely limits the value of long sequencing reads. Pacasus is a novel software tool to identify and resolve such artificial palindromic sequences in long sequencing reads (Chapter 5). Two use cases show that Pacasus markedly improves read mapping and assembly of WGA DNA. In comparison, the quality of mapping and assembly is similar to the quality obtained with non-amplified DNA. Therefore, with Pacasus, long-read technology becomes feasible for the sequencing of samples for which only very small amounts of DNA are available, such as single cells or single chromosomes. Numerous tools and databases exist to annotate and investigate the functions encoded in properly assembled genomes, such as InterProScan, KEGG, GO and many more. Comparison of functionalities across multiple genomes is, however, not trivial. The concept of graph databases is a promising novel approach from computer science for such multi-genome comparisons. For a data set of all (> 150,000) genes of 17 fungal species functionally annotated with InterProScan, the associated KEGG, GO and annotation data are imported and interconnected in a new Neo4j graph database (Chapter 6). Relationships in this database are visualized and mined with a newly refurbished and extended Neo4j plugin for Cytoscape. 
Inspection of (sub)graphs of functional annotations is an attractive way to compare and group functional annotation across species. In the use case of the seventeen fungal genomes, it helped to outline, compare and explain details of the lifestyle of groups of individual species. The general discussion of this thesis provides an outlook on the future of bioinformatics in the context of the results presented here (Chapter 7). A grid infrastructure is recommended as a feasible and attractive cost-effective strategy to create compute power, as is the further inclusion of graphics cards. Full implementation of graph technology is considered necessary for advancing bioinformatics. The work presented in this thesis also shows that the use of grids, graphics cards and graph technology implies the redesign of existing software applications. To be able to create novel stable, predictable and user-friendly applications in bioinformatics, formal training in software engineering principles is highly recommended. Courses and other programs are necessary for the life-long learning that will be crucial for the future of bioinformatics. The main challenges for bioinformatics in the years to come are all data centered: issues with growing data volumes, with more data types and with higher data complexity. To deal with these challenges, further integration of now separate fields of science is warranted in ways we cannot even imagine yet.
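The artificial palindromes mentioned in the thesis abstract above (Chapter 5) are reads in which one part is the reverse complement of another part. The sketch below is a crude k-mer heuristic for flagging such reads; it is an assumption made for illustration and is not how Pacasus itself works. The helper names `reverse_complement` and `shared_kmer_fraction` and the default `k` are hypothetical.

```python
# Translation table for DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def shared_kmer_fraction(read, k=8):
    """Fraction of k-mers in `read` whose reverse complement is also in `read`.

    Hypothetical heuristic: values close to 1.0 suggest that the read contains
    a segment and its own reverse complement. Note that a k-mer which is its
    own reverse complement also counts as a hit, so short or low-complexity
    reads can give misleading values.
    """
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not kmers:
        return 0.0
    kmer_set = set(kmers)
    hits = sum(1 for kmer in kmers if reverse_complement(kmer) in kmer_set)
    return hits / len(kmers)


if __name__ == "__main__":
    segment = "ACGGTCAATGCCTA"
    chimeric = segment + reverse_complement(segment)
    print(shared_kmer_fraction(chimeric, k=5))
    print(shared_kmer_fraction("A" * 20, k=5))
```

An exact approach would align each read against its own reverse complement with a full local alignment; the k-mer count here only serves as a cheap first-pass indicator.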

    pyPaSWAS

    Python-based Parallel Smith-Waterman Alignment Software. It uses pyCUDA and pyOpenCL to perform the alignments on CPUs, GPUs and Xeon Phi.

    Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS

    To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis