Algorithms in comparative genomics
The field of comparative genomics is abundant with problems of interest to computer scientists. In this thesis, the author presents solutions to three contemporary problems: obtaining better alignments for phylogeny reconstruction, identifying related RNA sequences in genomes, and ranking Single Nucleotide Polymorphisms (SNPs) in genome-wide association studies (GWAS).
Sequence alignment is a basic and widely used task in bioinformatics. Its applications include identifying protein structure, RNAs and transcription factor binding sites in genomes, and phylogeny reconstruction. Phylogenetic descriptions depend not only on the employed reconstruction technique, but also on the underlying sequence alignment. The author has studied and established a simple prescription for obtaining a better phylogeny by improving the underlying alignments used in phylogeny reconstruction. This was achieved by improving upon Gotoh's iterative heuristic by iterating with maximum parsimony guide-trees. This approach has shown an improvement in accuracy over standard alignment programs.
A novel alignment algorithm named Probalign-RNAgenome that can identify non-coding RNAs in genomic sequences was also developed. Non-coding RNAs play critical roles in the cell, such as gene regulation. It is thought that many such RNAs lie undiscovered in the genome. To date, alignment-based approaches have been shown to be more accurate than thermodynamic methods for identifying such non-coding RNAs. Probalign-RNAgenome employs a probabilistic consistency-based approach for aligning a query RNA sequence to its homolog in a genomic sequence. Results show that this approach is more accurate on real data than the widely used BLAST and Smith-Waterman algorithms.
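For orientation, the Smith-Waterman baseline named above can be sketched in a few lines. This is a minimal illustration of local alignment scoring, not Probalign-RNAgenome; the function name and the match, mismatch, and linear gap scores are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the table are kept because the sketch reports the score alone; recovering the aligned region would require the full table and a traceback.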
Within the realm of comparative genomics are also a large number of recently conducted GWAS. GWAS aim to identify regions in the genome that are associated with a given disease. The support vector machine (SVM) provides a discriminative alternative to the widely used chi-square statistic in GWAS. A novel hybrid strategy that combines the chi-square statistic with the SVM was developed and implemented. Its performance was studied on simulated data and the Wellcome Trust Case Control Consortium (WTCCC) studies. Results presented in this thesis show that the hybrid strategy ranks causal SNPs in simulated data significantly higher than the chi-square test and SVM alone. The results also show that the hybrid strategy ranks previously replicated SNPs and associated regions (where applicable) of type 1 diabetes, rheumatoid arthritis, and Crohn's disease higher than the chi-square, SVM, and SVM Recursive Feature Elimination (SVM-RFE)
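As a rough illustration of the chi-square baseline mentioned in this abstract, the sketch below computes the Pearson chi-square statistic for a case/control by genotype contingency table and ranks SNPs by it. It does not implement the hybrid chi-square and SVM strategy, whose details the abstract does not give; the function names and the SNP identifiers in the usage note are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


def rank_snps(tables):
    """Return SNP names sorted from strongest to weakest association."""
    return sorted(tables, key=lambda name: chi_square(tables[name]), reverse=True)
```

With rows for cases and controls and columns for the three genotypes, `rank_snps({"snp_a": [[30, 50, 20], [30, 50, 20]], "snp_b": [[60, 30, 10], [20, 40, 40]]})` places `snp_b` first, since its genotype counts differ between the two groups while those of `snp_a` do not.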
Evolutionary Inference via the Poisson Indel Process
We address the problem of the joint statistical inference of phylogenetic
trees and multiple sequence alignments from unaligned molecular sequences. This
problem is generally formulated in terms of string-valued evolutionary
processes along the branches of a phylogenetic tree. The classical evolutionary
process, the TKF91 model, is a continuous-time Markov chain model comprised of
insertion, deletion and substitution events. Unfortunately this model gives
rise to an intractable computational problem---the computation of the marginal
likelihood under the TKF91 model is exponential in the number of taxa. In this
work, we present a new stochastic process, the Poisson Indel Process (PIP), in
which the complexity of this computation is reduced to linear. The new model is
closely related to the TKF91 model, differing only in its treatment of
insertions, but the new model has a global characterization as a Poisson
process on the phylogeny. Standard results for Poisson processes allow key
computations to be decoupled, which yields the favorable computational profile
of inference under the PIP model. We present illustrative experiments in which
Bayesian inference under the PIP model is compared to separate inference of
phylogenies and alignments. Comment: 33 pages, 6 figures
Phylogenetic Trees and Their Analysis
Determining the best possible evolutionary history, the lowest-cost phylogenetic tree, to fit a given set of taxa and character sequences using maximum parsimony is an active area of research due to its underlying importance in understanding biological processes. As several steps in this process are NP-Hard when using popular, biologically-motivated optimality criteria, significant amounts of resources are dedicated both to heuristics and to making exact methods more computationally tractable. We examine both phylogenetic data and the structure of the search space in order to suggest methods to reduce the number of possible trees that must be examined to find an exact solution for any given set of taxa and associated character data. Our work on four related problems combines theoretical insight with empirical study to improve searching of the tree space. First, we show that there is a Hamiltonian path through tree space for the most common tree metrics, answering Bryant's Challenge for the minimal such path. We next examine the topology of the search space under various metrics, showing that some metrics have local maxima and minima even with perfect data, while some others do not. We further characterize conditions for which sequences simulated under the Jukes-Cantor model of evolution yield well-behaved search spaces. Next, we reduce the search space needed for an exact solution by splitting the set of characters into mutually-incompatible subsets of compatible characters, building trees based on the perfect phylogenies implied by these sets, and then searching in the neighborhoods of these trees. We validate this work empirically. Finally, we compare two approaches to the generalized tree alignment problem, or GTAP: Sequence alignment followed by tree search vs. Direct Optimization, on both biological and simulated data
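The character-splitting step can be illustrated for the simplest case of binary characters, where two characters are compatible exactly when they do not exhibit all four state pairs across the taxa (the four-gamete test). The greedy grouping below is an assumption made for the sketch, not the partitioning procedure used in the dissertation, and the function names are hypothetical.

```python
def compatible(c1, c2):
    """Four-gamete test: binary characters c1 and c2 (one state per taxon)
    are compatible unless all four state pairs 00, 01, 10, 11 occur."""
    return len(set(zip(c1, c2))) < 4


def greedy_compatible_groups(characters):
    """Greedily place each character in the first group whose members are
    all pairwise compatible with it; returns lists of character indices."""
    groups = []
    for idx, char in enumerate(characters):
        for group in groups:
            if all(compatible(char, characters[k]) for k in group):
                group.append(idx)
                break
        else:
            groups.append([idx])  # no existing group fits: start a new one
    return groups
```

For binary characters, pairwise compatibility within a group is enough for the group to admit a perfect phylogeny; for multi-state characters this no longer holds, so the sketch should not be applied there as-is.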
Improved methods for phylogenetics
Phylogenetics is the study of evolutionary relationships. It is a scientific
endeavour to discover history, and it is not easy. Massive amounts of data
together with computationally difficult optimization problems mean that
heuristics are prevalent, and ever better techniques are sought. New
approaches are valuable if they are more accurate, but are considered even more
so if they are faster than pre-existing methods. Improvements to existing
algorithms, whether in terms of space requirements, or faster running times,
are also worthwhile. This dissertation explores three new techniques, each of
which is valuable according to the previous definitions.
The first contribution is TASPI, a system for storing collections of
phylogenetic trees, and performing post-tree analyses. TASPI stores collections
of trees more compactly than the previous method, and this compact structure
lends itself to post-tree analyses. This results in the ability to compute
strict and majority consensus trees faster than common alternatives. As an
added benefit, TASPI is written in ACL2, which allows properties of the
algorithms and data structures to be formally verified.
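To make the consensus computation concrete, the sketch below treats each rooted tree as a set of clusters (each cluster a frozenset of taxon names) and keeps the clusters that occur in more than half of the trees, or in all of them. This flat representation is an assumption for illustration only; it is not TASPI's compact storage format, and the function names are hypothetical.

```python
from collections import Counter


def majority_clusters(trees, threshold=0.5):
    """Clusters present in more than `threshold` of the trees.

    Each tree is an iterable of frozensets of taxon names.
    """
    counts = Counter(cluster for tree in trees for cluster in set(tree))
    cutoff = threshold * len(trees)
    return {cluster for cluster, n in counts.items() if n > cutoff}


def strict_clusters(trees):
    """Clusters present in every tree (expects a non-empty list of trees)."""
    return set.intersection(*(set(tree) for tree in trees))
```

Clusters that each appear in more than half of the trees are pairwise compatible, so the returned set always describes a single tree.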
The second contribution is an improved method to generate phylogenetic trees.
A common methodology involves two steps, first estimating a Multiple Sequence
Alignment (MSA), and then estimating a tree using that MSA. This method
changes the way in which the MSA is estimated, and this leads to improved
accuracy of the resultant trees. In some cases, the time required is also
reduced.
The third contribution is BLuTGEN, a method by which a phylogenetic tree is
estimated from sequence data, but without ever generating an MSA for the full
dataset. BLuTGEN is as accurate as one of the best published tree estimation
techniques (SATé), but takes a novel approach which allows it to be applied
to much larger datasets.
Computer Science
Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference
Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms
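For readers unfamiliar with what alignment filtering does, the sketch below removes columns whose gap fraction exceeds a cutoff. It is a deliberately simple stand-in and not any of the filtering methods evaluated in the study; the function name and the 0.5 default are assumptions.

```python
def filter_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose fraction of '-' exceeds max_gap_fraction.

    `alignment` is a list of equal-length aligned sequences (strings).
    """
    if not alignment:
        return []
    n_seqs = len(alignment)
    keep = [
        j
        for j, column in enumerate(zip(*alignment))
        if column.count("-") / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[j] for j in keep) for seq in alignment]
```

For example, `filter_gappy_columns(["AC-T", "A--T", "ACGT"])` drops only the third column, where two of the three sequences have a gap.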
Novel approaches for large-scale phylogenetics and applications in the context of the amphibian tree of life
During this thesis, I addressed some problems associated with large-scale
phylogenetic analyses by tackling issues related to missing data and careful
handling and addition of novel data in large-scale reconstructions, presenting an
application of this approach in the context of amphibian phylogenetics.
I developed a method (called “Concatabominations”) building on the original Safe
Taxonomic Reduction method (Wilkinson 1995) as an alternative approach to the
issue of identifying rogue taxa. The safe removal of rogue taxa due to missing data
can potentially reduce the terraces in tree space search and improve resolution in
the final consensus tree. From a pragmatic point of view, the new method can help in
targeting taxa that require further sampling during research design.
Novel sequence data for the rediscovered Ericabatrachus baleensis allowed me to explore
its placement in the Amphibian tree of life. I tested the inclusion of novel data using
a backbone alignment from a previous work (de novo analysis) and a backbone
phylogenetic tree (constrained analysis), after careful curation of gene partitions to
include in an analysis. I found that the use of a constrained phylogenetic inference
using a previously accepted tree seems to be a practical solution to the rapid
phylogenetic placement of a taxon in cases of well-supported relationships.
However, a de novo analysis might ensure an optimal alignment and avoid risks
introduced when adding new data.
Finally, I investigated the evolutionary relationships of the three lineages of the
extant amphibians (Anura, Caudata and Gymnophiona) using an independent
source of evidence: miRNAs, recently used to help resolve difficult phylogenetic
problems. The analyses yielded a high number of shared miRNAs using the
Xenopus tropicalis genome, contrasting with a lower number of miRNAs discovered
using the Axolotl transcriptome. This suggests that validating miRNAs without
genomic data is not ideal. Nevertheless, in spite of the limitations, I was able to
find two potential novel miRNAs: one supporting the monophyly of Lissamphibia,
and another supporting the Batrachia hypothesis.
Overall, I hope the work developed in this thesis contributes new insights into
large-scale phylogenetics and, in particular, amphibian phylogenetics