    Inference of Many-Taxon Phylogenies

    Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood based phylogenetic inference. We have approached this using different strategies: reduction of memory requirements, reduction of running time, and reduction of man-hours.

    Diversification In The Neotropics – Evolution And Population Genetics Of The Armored Catfish Hypancistrus Sp. From The Xingu River

    The Xingu River, one of the largest tributaries of the Amazon River, is currently in peril due to the recent construction of hydroelectric dams, but little is known about the numerous fish species it supports. This dissertation focuses on three pleco catfish species belonging to the genus Hypancistrus from the Xingu River with partially overlapping distributions: H. zebra, H. sp. (L174), and H. sp. (L66/333). Chapter 1 is a bibliographic review of Amazonian freshwater fish diversity, with the goal of discussing the hypotheses of speciation mechanisms that can be tested in this system, including the relative importance of ecological adaptation and of vicariance caused by topographical divides, waterfalls, and rapids, and arguing that this is an important but overlooked model for the study of speciation processes. The goal of Chapter 2 was to use genomic data to unravel the basic relationships among eight described and eleven undescribed species belonging to the genus Hypancistrus distributed across the Orinoco and Amazon Basins. The phylogenetic analyses support the existence of two clades, one corresponding to each basin, but relationships among some of the species are poorly supported. Further exploratory analyses in combination with hypothesis testing indicate there are at least four admixed lineages in the Amazon clade. Chapter 3 investigated the evolution of Hypancistrus from the Xingu River based on genomic data. With dense sampling of H. sp. (L66/333), phylogenetic and population genetic analyses reveal a gradient of genetic structure along the river, with introgression from lineages of Hypancistrus from other Amazon River tributaries close to the mouth of the Xingu. On the upstream limit of the distribution of H. sp. (L66/333), a population hybridized with H. sp. (L174) is found just upstream of waterfalls that act as a partial barrier to gene flow.
Tests for past gene flow suggest there is signal for multiple introgression events between these lineages, but the direction, timing, and intensity of these events are still unclear. Overall, these results indicate the evolution of Hypancistrus was exceptionally complex. Fascinating patterns of diversification are emerging from this system, which is unfortunately at risk of extinction due to the impacts of damming.

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    High performance bioinformatics and computational biology on general-purpose graphics processing units

    Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion in the size of biological data at a rate that outpaces the growth in computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of systems biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms.
However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis first presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (Inter-task and Intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPP-based implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up improvements compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed-up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method, however, only gives one possible tree, which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference.
The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed-up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Array (FPGA) technology.

    Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering

    These are the online proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE), which was held in the Trippenhuis, Amsterdam, in August 2012.

    Statistical estimation problems in phylogenomics and applications in microbial ecology

    With the growing awareness of the potential for microbial communities to play a role in human health, environmental remediation and other important processes, the challenge of understanding such a complex population through the lens of high-throughput sequencing output has risen to the fore. For a de novo sequenced community, the first step to understanding the population involves comparing the sequences to a reference database in some form. In this dissertation, we consider some challenges and benefits of organizing the reference data according to evolution, with orthologous genes grouped together and stored as a multiple sequence alignment and phylogenetic tree. First we consider the related problem of estimating the population-level phylogeny of a group of species based on the alignments and phylogenies of several individual genes. Under one common model, species tree estimation is provably statistically consistent by several different methods, but those proofs rely on two separate and potentially shaky assumptions: that every species appears in the data for every gene (i.e., there is no missing data), and that since gene tree estimation is itself consistent, the gene trees used to compute the population-level tree are correct. Second, we explore some novel ways to use a Bayesian MCMC algorithm for jointly estimating alignment and phylogeny. The result is increased accuracy for large alignments, where the MCMC method alone would not be tractable. In the process, we identify a peculiar property of this Bayesian algorithm: it performs very differently on simulated sequences than on sequences from biological alignment benchmarks. No other alignment method tested showed the same divergence. Finally, we present two different practical applications of a reference database containing an alignment and tree for a group of gene families in the context of microbial ecology.
The first is an algorithm that uses the tree and alignment to construct an ensemble of profile hidden Markov models that improves remote homology detection. The second is a data visualization technique that generates an image of the community with a high density of data, but one that makes it naturally easy to compare many different samples at a time, potentially uncovering otherwise elusive patterns in the data.

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Over the last decades a revolution in novel measurement techniques has permeated the biological sciences, filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from the vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields in biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web-interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct out the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods revolutionized the protein-structure prediction field more than 10 years ago, and, until very recently, have retained their importance as crucial input features to deep neural networks.
As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias to the signal can be corrected out in a principled way using a variation of Felsenstein's tree-pruning algorithm applied in combination with an independent-pair assumption to derive pairwise amino acid counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield a competitive performance.

    Phylogeny, taxonomy and biogeography of Ceiba Mill. (Malvaceae: Bombacoideae)

    The Neotropics is the most species-rich area in the world and the mechanisms that generated and maintain its biodiversity are still debated. This thesis contributes to the debate by investigating the evolutionary and biogeographic history of the genus Ceiba. Ceiba comprises 18 mostly neotropical species endemic to two major biomes, seasonally dry tropical forests (SDTFs) and rain forests, and therefore represents an ideal case to shed light on patterns of neotropical plant evolution and diversification. Species of Ceiba, with their swollen, spiny trunks and large, beautiful flowers, are among the most characteristic elements of neotropical SDTF, one of the most threatened biomes in the tropics. Despite this, Ceiba has a historically complex taxonomy with some issues of species delimitation unresolved, especially within a species complex (Ceiba insignis agg.). Initial phylogenetic analyses of DNA sequence data from the nuclear ribosomal internal transcribed spacers (ITS) for 24 accessions representing 14 species of Ceiba recovered the genus as monophyletic and showed geographical and ecological structure in three main clades: (i) a humid forest lineage of three accessions of C. pentandra sister to the remaining species; (ii) a highly supported clade composed of C. schottii and C. aesculifolia from Central American and Mexican SDTF plus two accessions of C. samauma from inter-Andean valleys in Peru; and (iii) a highly supported South American SDTF clade including 10 species showing little sequence variation. Within this South American clade, no species represented by multiple accessions were resolved as monophyletic. To investigate unresolved species relationships further, next-generation hybrid capture was used to sequence 377 loci for 103 accessions representing all 18 Ceiba species.
This data set was assembled using different approaches (de novo and reference mapping) and with different software and settings to assess their impact on downstream phylogenetic analysis. The 377 loci were concatenated and analysed as a single partition under the maximum likelihood framework. The well resolved and sampled NGS phylogenies showed a similar pattern of geographical and ecological structure as inferred using ITS. The genus Neobuchia was recovered within the SDTF Central American and Mexican clade, and should therefore be incorporated within Ceiba. In the South American SDTF clade, there were multiple examples where a monophyletic group recognised as a taxonomic species was nested within another, paraphyletic taxonomic species, which suggests recent, ancestor-descendant species relationships. Within this clade, individual gene trees showed high conflict. Coalescent-based species delimitation analysis and morphological data revealed no clear species boundaries between C. pubiflora and C. glaziovii, and these species should be synonymised. A subset of 111 loci was used to generate a dated phylogeny based on penalised likelihood analysis using the fossil flower of Eriotheca prima from the middle to late Eocene as a primary calibration. The stem node age of Ceiba was estimated as 45 Ma. The rain forest species C. pentandra and C. samauma, and the campos rupestres species C. jasminodora, were resolved with long stem lineages and shallow crown groups. Whilst some SDTF species were very old (e.g., C. trischistandra) and monophyletic, many South American SDTF species were resolved with short stem lineages and relatively deep crown groups, possibly suggesting low rates of extinction in the large Caatinga SDTF region. In addition, several South American SDTF species were not resolved as monophyletic.
Such results of younger, non-monophyletic SDTF species and older, monophyletic rain forest species contrast with recent predictions that rain forest species may, on average, have more recent origins than SDTF species and will more often be non-monophyletic. Ceiba has different and distinctive phylogenetic patterns that contradict recent theoretical predictions. It demonstrates that studies of other clades sampled densely with multiple accessions of each species using a multi-locus approach are needed if we are to understand the nature of species and their boundaries, and the diversification process in neotropical trees.