
    Algorithms in comparative genomics

    The field of comparative genomics is abundant with problems of interest to computer scientists. In this thesis, the author presents solutions to three contemporary problems: obtaining better alignments for phylogeny reconstruction, identifying related RNA sequences in genomes, and ranking Single Nucleotide Polymorphisms (SNPs) in genome-wide association studies (GWAS). Sequence alignment is a basic and widely used task in bioinformatics. Its applications include identifying protein structure, RNAs and transcription factor binding sites in genomes, and phylogeny reconstruction. Phylogenetic descriptions depend not only on the employed reconstruction technique, but also on the underlying sequence alignment. The author has studied and established a simple prescription for obtaining a better phylogeny by improving the underlying alignments used in phylogeny reconstruction. This was achieved by improving upon Gotoh's iterative heuristic by iterating with maximum parsimony guide-trees. This approach has shown an improvement in accuracy over standard alignment programs. A novel alignment algorithm named Probalign-RNAgenome that can identify non-coding RNAs in genomic sequences was also developed. Non-coding RNAs play critical roles in the cell, such as gene regulation. It is thought that many such RNAs lie undiscovered in the genome. To date, alignment-based approaches have been shown to be more accurate than thermodynamic methods for identifying such non-coding RNAs. Probalign-RNAgenome employs a probabilistic consistency-based approach for aligning a query RNA sequence to its homolog in a genomic sequence. Results show that this approach is more accurate on real data than the widely used BLAST and Smith-Waterman algorithms. Within the realm of comparative genomics are also a large number of recently conducted GWAS. GWAS aim to identify regions in the genome that are associated with a given disease. 
The support vector machine (SVM) provides a discriminative alternative to the widely used chi-square statistic in GWAS. A novel hybrid strategy that combines the chi-square statistic with the SVM was developed and implemented. Its performance was studied on simulated data and the Wellcome Trust Case Control Consortium (WTCCC) studies. Results presented in this thesis show that the hybrid strategy ranks causal SNPs in simulated data significantly higher than the chi-square test and SVM alone. The results also show that the hybrid strategy ranks previously replicated SNPs and associated regions (where applicable) of type 1 diabetes, rheumatoid arthritis, and Crohn's disease higher than the chi-square, SVM, and SVM Recursive Feature Elimination (SVM-RFE).
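
As a rough illustration of the chi-square baseline mentioned in this abstract, the sketch below ranks SNPs by the Pearson chi-square statistic of a 2x2 allele-count table (cases vs. controls). It is not the thesis's hybrid chi-square/SVM strategy; the function names and the allele counts are hypothetical and chosen only for the example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 d.f., no continuity correction) for a 2x2 table.

    Rows are cases/controls, columns are allele counts:
        cases:    a (risk allele), b (other allele)
        controls: c (risk allele), d (other allele)
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / denom


def rank_snps(tables):
    """Return SNP ids sorted from strongest to weakest association."""
    scores = {snp: chi_square_2x2(*counts) for snp, counts in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical allele counts, invented for illustration only.
example_tables = {
    "snp_a": (30, 70, 10, 90),
    "snp_b": (50, 50, 50, 50),
    "snp_c": (40, 60, 30, 70),
}
ranking = rank_snps(example_tables)
print(ranking)
```

A SNP whose allele frequencies are identical in cases and controls (here `snp_b`) scores zero and falls to the bottom of the ranking.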

    Evolutionary Inference via the Poisson Indel Process

    We address the problem of the joint statistical inference of phylogenetic trees and multiple sequence alignments from unaligned molecular sequences. This problem is generally formulated in terms of string-valued evolutionary processes along the branches of a phylogenetic tree. The classical evolutionary process, the TKF91 model, is a continuous-time Markov chain model comprising insertion, deletion and substitution events. Unfortunately, this model gives rise to an intractable computational problem: the computation of the marginal likelihood under the TKF91 model is exponential in the number of taxa. In this work, we present a new stochastic process, the Poisson Indel Process (PIP), in which the complexity of this computation is reduced to linear. The new model is closely related to the TKF91 model, differing only in its treatment of insertions, but the new model has a global characterization as a Poisson process on the phylogeny. Standard results for Poisson processes allow key computations to be decoupled, which yields the favorable computational profile of inference under the PIP model. We present illustrative experiments in which Bayesian inference under the PIP model is compared to separate inference of phylogenies and alignments. Comment: 33 pages, 6 figures

    Phylogenetic Trees and Their Analysis

    Determining the best possible evolutionary history, the lowest-cost phylogenetic tree, to fit a given set of taxa and character sequences using maximum parsimony is an active area of research due to its underlying importance in understanding biological processes. As several steps in this process are NP-hard when using popular, biologically-motivated optimality criteria, significant amounts of resources are dedicated both to heuristics and to making exact methods more computationally tractable. We examine both phylogenetic data and the structure of the search space in order to suggest methods to reduce the number of possible trees that must be examined to find an exact solution for any given set of taxa and associated character data. Our work on four related problems combines theoretical insight with empirical study to improve searching of the tree space. First, we show that there is a Hamiltonian path through tree space for the most common tree metrics, answering Bryant's Challenge for the minimal such path. We next examine the topology of the search space under various metrics, showing that some metrics have local maxima and minima even with perfect data, while some others do not. We further characterize conditions for which sequences simulated under the Jukes-Cantor model of evolution yield well-behaved search spaces. Next, we reduce the search space needed for an exact solution by splitting the set of characters into mutually-incompatible subsets of compatible characters, building trees based on the perfect phylogenies implied by these sets, and then searching in the neighborhoods of these trees. We validate this work empirically. Finally, we compare two approaches to the generalized tree alignment problem, or GTAP: Sequence alignment followed by tree search vs. Direct Optimization, on both biological and simulated data.
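
To make the "lowest-cost tree" criterion concrete, the sketch below scores a fixed rooted binary tree under maximum parsimony using Fitch's algorithm, a standard textbook method that the abstract itself does not name. The tree encoding (nested tuples), the function names, and the toy alignment are assumptions made for this example; it scores one given topology and does not search tree space.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree: nested 2-tuples whose leaves are taxon names (strings).
    states: dict mapping taxon name -> observed character state.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children agree on at least one state: no extra change needed.
            return common, left_cost + right_cost
        # Disjoint state sets: one change is required on this part of the tree.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_length(tree, alignment):
    """Sum the Fitch score over every column of an aligned set of sequences."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Toy data, invented for illustration only.
toy_alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCGT"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_ab_cd, toy_alignment))
print(parsimony_length(tree_ac_bd, toy_alignment))
```

An exact search would have to compare such scores across all candidate topologies, which is why reducing the number of trees examined matters.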

    Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.
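
For readers unfamiliar with alignment filtering, the sketch below shows one very simple form of it: removing columns whose gap fraction exceeds a threshold, and reporting what share of positions was dropped. This is a generic illustration, not one of the methods the paper evaluates; the function name, the `max_gap_fraction` parameter, and the toy alignment are hypothetical.

```python
def filter_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns (filtered_alignment, fraction_of_columns_removed).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep the indices of columns that are not too gappy.
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    filtered = {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}
    removed = 1 - len(keep) / length if length else 0.0
    return filtered, removed


# Toy alignment, invented for illustration only.
toy_msa = {
    "seq1": "AC-GT",
    "seq2": "A--GT",
    "seq3": "ACTG-",
}
filtered_msa, removed_fraction = filter_gappy_columns(toy_msa, max_gap_fraction=0.5)
print(filtered_msa, removed_fraction)
```

Tracking the removed fraction makes it easy to check whether a given setting stays within the "light filtering" range (up to 20% of positions) discussed in the abstract.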


    Novel approaches for large-scale phylogenetics and applications in the context of the amphibian tree of life

    During this thesis, I addressed some problems associated with large-scale phylogenetic analyses by tackling issues related to missing data and careful handling and addition of novel data in large-scale reconstructions, presenting an application of this approach in the context of amphibian phylogenetics. I developed a method (called “Concatabominations”) building on the original Safe Taxonomic Reduction method (Wilkinson 1995) as an alternative approach to the issue of identifying rogue taxa. The safe removal of rogue taxa due to missing data can potentially reduce the terraces in tree space search and improve resolution in the final consensus tree. From a pragmatic point of view, the new method can help in targeting taxa that require further sampling during research design. Novel sequence data for the rediscovered Ericabatrachus baleensis allowed me to explore its placement in the amphibian tree of life. I tested the inclusion of novel data using a backbone alignment from a previous work (de novo analysis) and a backbone phylogenetic tree (constrained analysis), after careful curation of gene partitions to include in an analysis. I found that constrained phylogenetic inference using a previously accepted tree seems to be a practical solution for the rapid phylogenetic placement of a taxon in cases of well-supported relationships. However, a de novo analysis might ensure an optimal alignment and avoid risks introduced when adding new data. Finally, I investigated the evolutionary relationships of the three lineages of the extant amphibians (Anura, Caudata and Gymnophiona) using an independent source of evidence: miRNAs, recently used to help resolve difficult phylogenetic problems. The analyses yielded a high number of shared miRNAs using the Xenopus tropicalis genome, contrasting with a lower number of miRNAs discovered using the Axolotl transcriptome. This suggests that relying on non-genomic data is not ideal for validating miRNAs. 
Nevertheless, in spite of the limitations, I was able to find two potential novel miRNAs: one supporting the monophyly of Lissamphibia, and another supporting the Batrachia hypothesis. Overall, I hope the work developed in this thesis contributes new insights into large-scale phylogenetics and in particular to amphibian phylogenetics.