378 research outputs found
Inference of Many-Taxon Phylogenies
Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood based phylogenetic inference. We have approached this using different strategies: reduction of memory requirements, reduction of running time, and reduction of man-hours
Diversification In The Neotropics – Evolution And Population Genetics Of The Armored Catfish Hypancistrus Sp. From The Xingu River
The Xingu River, one of the largest tributaries of the Amazon River, is currently in peril due to the recent construction of hydroelectric dams, but little is known about the numerous fish species it supports. This dissertation focuses on three pleco catfish species belonging to the genus Hypancistrus from the Xingu River with partially overlapping distributions: H. zebra, H. sp. (L174), and H. sp. (L66/333). Chapter 1 is a bibliographic review of Amazonian freshwater fish diversity, with the goal of discussing the hypotheses of speciation mechanisms that can be tested in this system, including the relative importance of ecological adaptation and vicariance caused by topographical divides, waterfalls, and rapids, and arguing this is an important overlooked model for the study of speciation processes. The goal of Chapter 2 was to use genomic data to unravel the basic relationships among eight described and eleven undescribed species belonging to the genus Hypancistrus distributed across the Orinoco and Amazon Basins. The phylogenetic analyses support the existence of two clades, one corresponding to each basin, but relationships among some of the species are poorly supported. Further exploratory analyses in combination with hypothesis testing indicate there are at least four admixed lineages in the Amazon clade. Chapter 3 investigated the evolution of Hypancistrus from the Xingu River based on genomic data. With dense sampling of H. sp. (L66/333), phylogenetic and population genetic analyses reveal a gradient of genetic structure along the river, with introgression from lineages of Hypancistrus from other Amazon River tributaries close to the mouth of the Xingu. On the upstream limit of the distribution of H. sp. (L66/333), a population hybridized with H. sp. (L174) is found just upstream of waterfalls that act as a partial barrier to gene flow.
Tests for past gene flow suggest there is signal for multiple introgression events between these lineages, but the direction, timing, and intensity of these events are still unclear. Overall, these results indicate the evolution of Hypancistrus was exceptionally complex. Fascinating patterns of diversification are emerging from this system that is unfortunately at risk of extinction due to the impacts of damming
Evolutionary genomics : statistical and computational methods
This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward
High performance bioinformatics and computational biology on general-purpose graphics processing units
Bioinformatics and Computational Biology (BCB) is a relatively new
multidisciplinary field which brings together many aspects of the fields of
biology, computer science, statistics, and engineering. Bioinformatics extracts
useful information from biological data and makes these more intuitive and
understandable by applying principles of information sciences, while
computational biology harnesses computational approaches and technologies
to answer biological questions conveniently. Recent years have seen an
explosion of the size of biological data at a rate which outpaces the rate of
increases in the computational power of mainstream computer technologies,
namely general purpose processors (GPPs). The aim of this thesis is to explore
the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high
performance and efficient implementation of BCB applications in order to meet
the demands of biological data increases at affordable cost.
The thesis presents detailed design and implementations of GPU solutions for
a number of BCB algorithms in two widely used BCB applications, namely
biological sequence alignment and phylogenetic analysis. Biological sequence
alignment can be used to determine the potential information about a newly
discovered biological sequence from other well-known sequences through
similarity comparison. On the other hand, phylogenetic analysis is concerned
with the investigation of the evolution and relationships among organisms,
and has many uses in the fields of system biology and comparative genomics.
In molecular-based phylogenetic analysis, the relationship between species is
estimated by inferring the common history of their genes and then
phylogenetic trees are constructed to illustrate evolutionary relationships
among genes and organisms. However, both biological sequence alignment
and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with
the size of sequence databases.
The thesis firstly presents a multi-threaded parallel design of the Smith-
Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A
novel technique is put forward to solve the restriction on the length of the
query sequence in previous GPU-based implementations of the SW algorithm.
Based on this implementation, the difference between two main task
parallelization approaches (Inter-task and Intra-task parallelization) is
presented. The resulting GPU implementation matches the speed of existing
GPU implementations while providing more flexibility, i.e. flexible length of
sequences in real-world applications. It also outperforms an equivalent GPP-based
implementation by 15x-20x. After this, the thesis presents the first
reported multi-threaded design and GPU implementation of the Gapped
BLAST with Two-Hit method algorithm, which is widely used for aligning
biological sequences heuristically. This achieved speed-ups of up to 3x
compared to the most optimised GPP implementations.
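As a point of reference for the Smith-Waterman recurrence mentioned above, the following is a minimal sequential sketch in Python. It is not the thesis's GPU design: the function name `smith_waterman`, the scoring values, and the linear gap penalty are assumptions chosen for illustration. Only two rows of the scoring matrix are kept, so memory grows with the target length alone.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of two sequences.

    Illustrative scoring (assumed, not from the thesis): +2 for a match,
    -1 for a mismatch, -1 per gap position (linear gap penalty).
    """
    cols = len(target) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols      # current row; column 0 stays 0
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # extend diagonally (match/mismatch)
                prev[j] + gap,       # gap in the target
                curr[j - 1] + gap,   # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Each query-target pair is scored independently of every other pair, which is what makes an inter-task scheme (one alignment per thread) possible; an intra-task scheme instead splits the cells of a single matrix across threads along anti-diagonals.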
The thesis then presents a multi-threaded design and GPU implementation of
a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and
multiple sequence alignment (MSA). This achieves 8x-20x speed up compared
to an equivalent GPP implementation based on the widely used ClustalW
software. The NJ method however only gives one possible tree which strongly
depends on the evolutionary model used. A more advanced method uses
maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte
Carlo (MCMC)-based Bayesian inference. The latter was the subject of another
multi-threaded design and GPU implementation presented in this thesis,
which achieved 4x-8x speed up compared to an equivalent GPP
implementation based on the widely used MrBayes software.
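For readers unfamiliar with Neighbor-Joining, the sketch below shows one joining step of the standard algorithm in Python: it computes the Q criterion from a distance matrix, picks the pair to join, and returns the two new branch lengths. The function name `nj_pick_pair` and the example matrix are hypothetical, and this sequential version says nothing about how the thesis maps the computation onto GPU threads.

```python
def nj_pick_pair(d):
    """One Neighbor-Joining step on a symmetric distance matrix `d`.

    Returns ((i, j), length_i, length_j): the pair of taxa to join and
    the branch lengths from each to the new internal node.
    Requires at least 3 taxa. Ties are broken by taking the first pair.
    """
    n = len(d)
    r = [sum(row) for row in d]  # total distance from each taxon
    best_q, pair = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: small values mark pairs that are close to each
            # other and far from everything else.
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best_q is None or q < best_q:
                best_q, pair = q, (i, j)
    i, j = pair
    length_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return pair, length_i, length_j


# Distances generated from the tree ((A:1,B:2):1,(C:3,D:4)).
example = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
pair, len_a, len_b = nj_pick_pair(example)
```

On this additive example the step joins taxa 0 and 1 (A and B) and recovers their original branch lengths. A full implementation would then replace the joined pair with the new node, update the distances, and repeat until the tree is resolved.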
Finally, the thesis presents a general evaluation of the designs and
implementations achieved in this work as a step towards the evaluation of
GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Array (FPGA)
technology
Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering
These are the online proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE), which was held in the Trippenhuis, Amsterdam, in August 2012
Statistical estimation problems in phylogenomics and applications in microbial ecology
With the growing awareness of the potential for microbial communities to play a role in human health, environmental remediation and other important processes, the challenge of understanding such a complex population through the lens of high-throughput sequencing output has risen to the fore. For a de novo sequenced community, the first step to understanding the population involves comparing the sequences to a reference database in some form. In this dissertation, we consider some challenges and benefits of organizing the reference data according to evolution, with orthologous genes grouped together and stored as a multiple sequence alignment and phylogenetic tree.
First we consider the related problem of estimating the population-level phylogeny of a group of species based on the alignments and phylogenies of several individual genes. Under one common model, species tree estimation is provably statistically consistent by several different methods, but those proofs rely on two separate and potentially shaky assumptions: that every species appears in the data for every gene (i.e., there is no missing data), and that since gene tree estimation is itself consistent, the gene trees used to compute the population-level tree are correct. Second, we explore some novel ways to use a Bayesian MCMC algorithm for jointly estimating alignment and phylogeny. The result is increased accuracy for large alignments, where the MCMC method alone would not be tractable. In the process, we identify a peculiar property of this Bayesian algorithm: it performs much differently on simulated sequences than on sequences from biological alignment benchmarks. No other alignment method tested showed the same divergence.
Finally, we present two practical applications of a reference database containing an alignment and tree for a group of gene families in the context of microbial ecology. The first is an algorithm that uses the tree and alignment to construct an ensemble of profile hidden Markov models that improves remote homology detection. The second is a data visualization technique that generates an image of the community with a high density of data, but one that makes it naturally easy to compare many different samples at a time, potentially uncovering otherwise elusive patterns in the data
Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts
Over the last decades a revolution in novel measurement techniques has permeated the biological sciences filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from the vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields in biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry.
In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web-interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools.
The second part discusses a statistical attempt to correct for the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods revolutionized the protein-structure prediction field more than 10 years ago and, until very recently, retained their importance as crucial input features to deep neural networks. As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias in the signal can be corrected in a principled way using a variation of Felsenstein's tree-pruning algorithm, applied in combination with an independent-pair assumption, to derive pairwise amino acid counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield competitive performance.
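To make the reference to Felsenstein's tree-pruning algorithm concrete, here is a minimal Python sketch of the standard algorithm for a single nucleotide site. It is not the variation described in the abstract: the Jukes-Cantor (JC69) substitution model, the nested-list tree encoding, and all function names are assumptions made for illustration, and the independent-pair extension to amino acid pair counts is not shown.

```python
import math

STATES = "ACGT"


def jc69_prob(same, t):
    """JC69 transition probability over a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial(node):
    """Conditional likelihoods of the data below `node`, one per state.

    Assumed encoding: a leaf is a one-letter string; an internal node is
    a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial(child)
        for x in range(4):
            # Sum over the child's possible states, weighted by the
            # probability of changing from state x along the branch.
            result[x] *= sum(
                jc69_prob(x == y, t) * child_l[y] for y in range(4)
            )
    return result


def site_likelihood(tree):
    """Likelihood of one site, with uniform root frequencies (JC69)."""
    return sum(0.25 * l for l in partial(tree))


# Hypothetical three-taxon tree: ((A:0.1, A:0.1):0.05, C:0.2)
example_tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("C", 0.2)]
likelihood = site_likelihood(example_tree)
```

The recursion visits each node once, so the cost per site is linear in the number of taxa rather than exponential in the number of internal nodes.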
Phylogeny, taxonomy and biogeography of Ceiba Mill. (Malvaceae: Bombacoideae)
The Neotropics is the most species-rich area in the world and the mechanisms that
generated and maintain its biodiversity are still debated. This thesis contributes to the
debate by investigating the evolutionary and biogeographic history of the genus Ceiba.
Ceiba comprises 18 mostly neotropical species endemic to two major biomes, seasonally
dry tropical forests (SDTFs) and rain forests, and therefore represents an ideal case
to shed light on patterns of neotropical plant evolution and diversification. Species of
Ceiba, with their swollen, spiny trunks and large, beautiful flowers are one of the most
characteristic elements of neotropical SDTF, one of the most threatened biomes in the
tropics. Despite this, Ceiba has a historically complex taxonomy with some issues
of species delimitation unresolved, especially within a species complex (Ceiba insignis
agg.).
Initial phylogenetic analyses of DNA sequence data from the nuclear ribosomal
internal transcribed spacers (ITS) for 24 accessions representing 14 species of Ceiba
recovered the genus as monophyletic and showed geographical and ecological structure
in three main clades: (i) a humid forest lineage of three accessions of C. pentandra
sister to the remaining species; (ii) a highly supported clade composed of C. schottii
and C. aesculifolia from Central American and Mexican SDTF plus two accessions of
C. samauma from inter-Andean valleys of Peru; and (iii) a highly supported South
American SDTF clade including 10 species showing little sequence variation. Within
this South American clade, no species represented by multiple accessions were resolved
as monophyletic.
To investigate unresolved species relationships further, next-generation hybrid capture
was used to sequence 377 loci for 103 accessions representing all 18 Ceiba species.
This data set was assembled using different approaches (de novo and reference mapping)
and with different software and settings to assess their impact in downstream
phylogenetic analysis. The 377 loci were concatenated and analysed under the maximum
likelihood framework treated as a single partition. The well resolved and sampled
NGS phylogenies showed a similar pattern of geographical and ecological structure
as inferred using ITS. The genus Neobuchia was recovered within the SDTF Central
American and Mexican clade, and should therefore be incorporated within Ceiba. In
the South American SDTF clade, there were multiple examples where a monophyletic
group recognised as a taxonomic species was nested within another, paraphyletic taxonomic
species, which suggests recent, ancestor-descendant species relationships. Within
this clade, individual gene trees showed high conflict. Coalescent-based species delimitation
analysis and morphological data revealed no clear species boundaries between
C. pubiflora and C. glaziovii, and these species should be synonymised.
A subset of 111 loci was used to generate a dated phylogeny based on penalised likelihood
analysis using the fossil flower of Eriotheca prima from the middle to late Eocene
as a primary calibration. The stem node age of Ceiba was estimated as 45 Ma. The
rain forest species C. pentandra and C. samauma, and the campos rupestres species C.
jasminodora, were resolved with long stem lineages and shallow crown groups. Whilst
some SDTF species were very old (e.g., C. trischistandra) and monophyletic, many
South American SDTF species were resolved with short stem lineages and relatively
deep crown groups, possibly suggesting low rates of extinction in the large Caatinga
SDTF region. In addition, several South American SDTF species were not resolved
as monophyletic. Such results of younger, non-monophyletic SDTF species and older,
monophyletic rain forest species contrast with recent predictions that rain forest species
may, on average, have more recent origins than SDTF species and will more often be
non-monophyletic.
Ceiba has different and distinctive phylogenetic patterns that contradict recent theoretical
predictions. It demonstrates that studies of other clades sampled densely with
multiple accessions of each species using a multi-locus approach are needed if we are to
understand the nature of species and their boundaries, and the diversification process
in neotropical trees