    BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics

    © 2019 The Author(s). BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io

    Tandem duplication, circular permutation, molecular adaptation: how Solanaceae resist pests via inhibitors

    Background: The Potato type II (Pot II) family of proteinase inhibitors plays critical roles in the defense system of plants from Solanaceae family against pests. To better understand the evolution of this family, we investigated the correlation between sequence and structural repeats within this family and the evolution and molecular adaptation of Pot II genes through computational analysis, using the putative ancestral domain sequence as the basic repeat unit.
    Results: Our analysis discovered the following interesting findings in Pot II family. (1) We classified the structural domains in Pot II family into three types (original repeat domain, circularly permuted domain, the two-chain domain) according to the existence of two linkers between the two domain components, which clearly show the circular permutation relationship between the original repeat domain and circularly permuted domain. (2) The permuted domains appear more stable than original repeat domain, from available structural information. Therefore, we proposed a multiple-repeat sequence is likely to adopt the permuted domain from contiguous sequence segments, with the N- and C-termini forming a single non-contiguous structural domain, linking the bracelet of tandem repeats. (3) The analysis of nonsynonymous/synonymous substitution rates ratio in Pot II domain revealed heterogeneous selective pressures among amino acid sites: the reactive site is under positive Darwinian selection (providing different specificity to target varieties of proteinases) while the cysteine scaffold is under purifying selection (essential for maintaining the fold). (4) For multi-repeat Pot II genes from Nicotiana genus, the proteolytic processing site is under positive Darwinian selection (which may improve the cleavage efficiency).
    Conclusion: This paper provides comprehensive analysis and characterization of Pot II family, and enlightens our understanding on the strategies (Gene and domain duplication, structural circular permutation and molecular adaptation) of Solanaceae plants for defending pathogenic attacks through the evolution of Pot II genes.

    High performance bioinformatics and computational biology on general-purpose graphics processing units

    Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. 
However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis firstly presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (Inter-task and Intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPP-based implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up improvements compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method however only gives one possible tree which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. 
The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGA) technology.
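
    For readers unfamiliar with the Smith-Waterman algorithm named in the thesis abstract above, the following is a minimal serial sketch in Python. It is not the thesis's GPU design: the function name, the scoring values (match=2, mismatch=-1, gap=-2) and the linear gap penalty are all assumptions chosen for illustration. It only shows the dynamic-programming recurrence that such implementations parallelize.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of strings a and b (linear gap penalty).

    Scoring values are illustrative assumptions, not taken from the thesis.
    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

    With these assumed scores, smith_waterman("TTACGTTT", "GGACGTGG") returns (8, "ACGT", "ACGT"). Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal can be computed independently; that is one common basis for parallelizing within a single alignment, while aligning many sequence pairs at once parallelizes across alignments.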

    Inference of Many-Taxon Phylogenies

    Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood based phylogenetic inference. We have approached this using different strategies: reduction of memory requirements, reduction of running time, and reduction of man-hours.

    Research And Application Of Parallel Computing Algorithms For Statistical Phylogenetic Inference

    Estimating the evolutionary history of organisms, phylogenetic inference, is a critical step in many analyses involving biological sequence data such as DNA. The likelihood calculations at the heart of the most effective methods for statistical phylogenetic analyses are extremely computationally intensive, and hence these analyses become a bottleneck in many studies. Recent progress in computer hardware, specifically the increase in pervasiveness of highly parallel, many-core processors has created opportunities for new approaches to computationally intensive methods, such as those in phylogenetic inference. We have developed an open source library, BEAGLE, which uses parallel computing methods to greatly accelerate statistical phylogenetic inference, for both maximum likelihood and Bayesian approaches. BEAGLE defines a uniform application programming interface and includes a collection of efficient implementations that use NVIDIA CUDA, OpenCL, and C++ threading frameworks for evaluating likelihoods under a wide variety of evolutionary models, on GPUs as well as on multi-core CPUs. BEAGLE employs a number of different parallelization techniques for phylogenetic inference, at different granularity levels and for distinct processor architectures. On CUDA and OpenCL devices, the library enables concurrent computation of site likelihoods, data subsets, and independent subtrees. The general design features of the library also provide a model for software development using parallel computing frameworks that is applicable to other domains. BEAGLE has been integrated with some of the leading programs in the field, such as MrBayes and BEAST, and is used in a diverse range of evolutionary studies, including those of disease causing viruses. The library can provide significant performance gains, with the exact increase in performance depending on the specific properties of the data set, evolutionary model, and hardware. 
In general, nucleotide analyses are accelerated on the order of 10-fold and codon analyses on the order of 100-fold.
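
    To make the "likelihood calculations" mentioned in the abstract above concrete, here is a minimal Python sketch of the standard pruning recursion for partial likelihoods on a fixed three-taxon tree. It does not use or imitate the BEAGLE API: every function name is hypothetical, and the Jukes-Cantor (JC69) model, the tree shape ((1,2),3) and the equal branch lengths are assumptions made only for illustration.

```python
import math

BASES = "ACGT"

def jc69_matrix(t):
    """JC69 transition probabilities for branch length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def tip_partials(base):
    """Partial likelihoods at a tip: 1 for the observed base, 0 otherwise."""
    return [1.0 if b == base else 0.0 for b in BASES]

def combine(left, t_left, right, t_right):
    """Partial likelihoods at a parent node from its two children (one site)."""
    pl, pr = jc69_matrix(t_left), jc69_matrix(t_right)
    return [
        sum(pl[x][y] * left[y] for y in range(4))
        * sum(pr[x][y] * right[y] for y in range(4))
        for x in range(4)
    ]

def site_likelihood(root_partials):
    """Weight root partials by the JC69 stationary frequencies (all 1/4)."""
    return sum(0.25 * p for p in root_partials)

def log_likelihood(seq1, seq2, seq3, t=0.1):
    """Log-likelihood of three aligned sequences on the assumed tree
    ((1,2),3), with every branch of length t."""
    total = 0.0
    for a, b, c in zip(seq1, seq2, seq3):
        inner = combine(tip_partials(a), t, tip_partials(b), t)
        root = combine(inner, t, tip_partials(c), t)
        total += math.log(site_likelihood(root))
    return total
```

    Each alignment column is processed independently inside the loop in log_likelihood, which is why site likelihoods are a natural unit of concurrent computation on many-core hardware.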

    Unexpected mitochondrial genome diversity revealed by targeted single-cell genomics of heterotrophic flagellated protists

    This is the author accepted manuscript. The final version is available from Nature Research via the DOI in this record.
    Data availability: Complete mtDNA sequences assembled from this study are available at GenBank under the accession numbers MK188935 to MK188947, MN082144 and MN082145. Sequencing data are available under NCBI BioProject PRJNA379597. Reads have been deposited at NCBI Sequence Read Archive with accession number SRP102236. Partial mtDNA contigs and other important contigs mentioned in the text are available from Figshare at https://doi.org/10.6084/m9.figshare.7314728. Nuclear SAG assemblies are available from Figshare at https://doi.org/10.6084/m9.figshare.7352966. A protocol is available from protocols.io at: https://doi.org/10.17504/protocols.io.ywpfxdn.
    Code availability: The bioinformatic workflow is available at https://doi.org/10.5281/zenodo.192677; additional statistical analysis code is available at https://doi.org/10.6084/m9.figshare.9884309.
    Most eukaryotic microbial diversity is uncultivated, under-studied and lacks nuclear genome data. Mitochondrial genome sampling is more comprehensive, but many phylogenetically important groups remain unsampled. Here, using a single-cell sorting approach combining tubulin-specific labelling with photopigment exclusion, we sorted flagellated heterotrophic unicellular eukaryotes from Pacific Ocean samples. We recovered 206 single amplified genomes, predominantly from underrepresented branches on the tree of life. Seventy single amplified genomes contained unique mitochondrial contigs, including 21 complete or near-complete mitochondrial genomes from formerly under-sampled phylogenetic branches, including telonemids, katablepharids, cercozoans and marine stramenopiles, effectively doubling the number of available samples of heterotrophic flagellate mitochondrial genomes. Collectively, these data identify a dynamic history of mitochondrial genome evolution including intron gain and loss, extensive patterns of genetic code variation and complex patterns of gene loss. Surprisingly, we found that stramenopile mitochondrial content is highly plastic, resembling patterns of variation previously observed only in plants.
    Funding: Gordon and Betty Moore Foundation; Leverhulme Trust; David and Lucile Packard Foundation; Royal Society; European Molecular Biology Organization; CONICYT FONDECYT; Genome Canada

    BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

    A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms generating mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual characteristics (graphs). The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (or ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called the Boolean analysis or BOOL-AN.