31 research outputs found

    Distinguishing regional from within-codon rate heterogeneity in DNA sequence alignments

    Get PDF
    We present an improved phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to (1) recombination and (2) rate heterogeneity. The focus of the present work is on improving the modelling of the latter aspect. Earlier papers have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. This approach fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. We propose an improved model that explicitly distinguishes between these two effects, and we assess its performance on a set of simulated DNA sequence alignments

    Addressing the shortcomings of three recent bayesian methods for detecting interspecific recombination in DNA sequence alignments

    Get PDF
    We address a potential shortcoming of three probabilistic models for detecting interspecific recombination in DNA sequence alignments: the multiple change-point model (MCP) of Suchard et al. (2003), the dual multiple change-point model (DMCP) of Minin et al. (2005), and the phylogenetic factorial hidden Markov model (PFHMM) of Husmeier (2005). These models are based on the Bayesian paradigm, which requires the solution of an integral over the space of branch lengths. To render this integration analytically tractable, all three models make the same assumption that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, we show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the area in the branch lengths configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. We apply two Bayesian hypothesis tests, based on an inter- and an intra-model approach to estimating the marginal likelihood. We then propose a revised model that addresses these shortcomings, and compare it with the aforementioned models on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Get PDF
    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy, therefore it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar.It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state of the art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo

    Parametric inference of recombination in HIV genomes

    Full text link
    Recombination is an important event in the evolution of HIV. It affects the global spread of the pandemic as well as evolutionary escape from host immune response and from drug therapy within single patients. Comprehensive computational methods are needed for detecting recombinant sequences in large databases, and for inferring the parental sequences. We present a hidden Markov model to annotate a query sequence as a recombinant of a given set of aligned sequences. Parametric inference is used to determine all optimal annotations for all parameters of the model. We show that the inferred annotations recover most features of established hand-curated annotations. Thus, parametric analysis of the hidden Markov model is feasible for HIV full-length genomes, and it improves the detection and annotation of recombinant forms. All computational results, reference alignments, and C++ source code are available at http://bio.math.berkeley.edu/recombination/.Comment: 20 pages, 5 figure

    Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments

    Get PDF
    DNA sequence alignments are usually not homogeneous. Mosaic structures may result as a consequence of recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) strains may result in a change of the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Various methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments have recently been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and explore ways of how to improve them. One shortcoming that I have identified is related to an approximation made in various recently proposed Bayesian models. The Bayesian paradigm requires the solution of an integral over the space of parameters. To render this integration analytically tractable, these models assume that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, I show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the area in the branch lengths configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures by using two Bayesian hypothesis tests, based on an inter- and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone. The core model explored in my thesis is a phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to recombination and rate heterogeneity. The focus of my work is on improving the modelling of the latter aspect. Earlier research efforts by other authors have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. Their work fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. I have improved these earlier phylogenetic FHMMs in two respects. Firstly, by sampling the rate vector from the posterior distribution with RJMCMC I have made the modelling of regional rate heterogeneity more flexible, and I infer the number of different degrees of divergence directly from the DNA sequence alignment, thereby dispensing with the need to arbitrarily select this quantity in advance. Secondly, I explicitly model within-codon rate heterogeneity via a separate rate modification vector. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which facilitates the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. I have carried out simulations on synthetic DNA sequence alignments, which have borne out my conjecture. The existing model, which does not explicitly include the within-codon rate variation, has to model both effects with the same modelling mechanism. As expected, it was found to fail to disentangle these two effects. On the contrary, I have found that my new model clearly separates within-codon rate variation from regional rate heterogeneity, resulting in more accurate predictions

    Evidence of animal mtDNA recombination between divergent populations of the potato cyst nematode Globodera pallida

    Get PDF
    Recombination is typically assumed to be absent in animal mitochondrial genomes (mtDNA). However, the maternal mode of inheritance means that recombinant products are indistinguishable from their progenitor molecules. The majority of studies of mtDNA recombination assess past recombination events, where patterns of recombination are inferred by comparing the mtDNA of different individuals. Few studies assess contemporary mtDNA recombination, where recombinant molecules are observed as direct mosaics of known progenitor molecules. Here we use the potato cyst nematode, Globodera pallida, to investigate past and contemporary recombination. Past recombination was assessed within and between populations of G. pallida, and contemporary recombination was assessed in the progeny of experimental crosses of these populations. Breeding of genetically divergent organisms may cause paternal mtDNA leakage, resulting in heteroplasmy and facilitating the detection of recombination. To assess contemporary recombination we looked for evidence of recombination between the mtDNA of the parental populations within the mtDNA of progeny. Past recombination was detected between a South American population and several UK populations of G. pallida, as well as between two South American populations. This suggests that these populations may have interbred, paternal mtDNA leakage occurred, and the mtDNA of these populations subsequently recombined. This evidence challenges two dogmas of animal mtDNA evolution; no recombination and maternal inheritance. No contemporary recombination between the parental populations was detected in the progeny of the experimental crosses. This supports current arguments that mtDNA recombination events are rare. More sensitive detection methods may be required to adequately assess contemporary mtDNA recombination in animals

    Evaluation of methods for detecting conversion events in gene clusters

    Get PDF
    Background: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies due to a lack of accurate evaluation methods. Results: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used this simulation data to evaluate several different programs for detecting gene conversion events. Conclusions: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution

    Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees

    Get PDF
    Genomic regions participating in recombination events may support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. Existing phylogenetic methods for recombination detection are challenged by the enormous number of possible topologies, even for a moderate number of taxa. If, however, the detection analysis is conducted independently between each putative recombinant sequence and a set of reference parentals, potential recombinations between the recombinants are neglected. In this context, a recombination hotspot can be inferred in phylogenetic analyses if we observe several consecutive breakpoints. We developed a distance measure between unrooted topologies that closely resembles the number of recombinations. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies occurring due to recombinations. This model relaxes the assumption of known parental sequences, still common in HIV analysis, allowing the entire dataset to be analyzed at once. On simulated datasets with up to 16 taxa, our method correctly detected recombination breakpoints and the number of recombination events for each breakpoint. The procedure is robust to rate and transition∶transversion heterogeneities for simulations with and without recombination. This recombination distance is related to recombination hotspots. Applying this procedure to a genomic HIV-1 dataset, we found evidence for hotspots and de novo recombination

    Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments

    Get PDF
    DNA sequence alignments are usually not homogeneous. Mosaic structures may result as a consequence of recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) strains may result in a change of the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Various methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments have recently been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and explore ways of how to improve them. One shortcoming that I have identified is related to an approximation made in various recently proposed Bayesian models. The Bayesian paradigm requires the solution of an integral over the space of parameters. To render this integration analytically tractable, these models assume that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, I show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the area in the branch lengths configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures by using two Bayesian hypothesis tests, based on an inter- and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone. The core model explored in my thesis is a phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to recombination and rate heterogeneity. The focus of my work is on improving the modelling of the latter aspect. Earlier research efforts by other authors have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. Their work fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. I have improved these earlier phylogenetic FHMMs in two respects. Firstly, by sampling the rate vector from the posterior distribution with RJMCMC I have made the modelling of regional rate heterogeneity more flexible, and I infer the number of different degrees of divergence directly from the DNA sequence alignment, thereby dispensing with the need to arbitrarily select this quantity in advance. Secondly, I explicitly model within-codon rate heterogeneity via a separate rate modification vector. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which facilitates the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. I have carried out simulations on synthetic DNA sequence alignments, which have borne out my conjecture. The existing model, which does not explicitly include the within-codon rate variation, has to model both effects with the same modelling mechanism. As expected, it was found to fail to disentangle these two effects. On the contrary, I have found that my new model clearly separates within-codon rate variation from regional rate heterogeneity, resulting in more accurate predictions

    Analysis of recombination in molecular sequence data

    Get PDF
    We present the new and fast method Recco for analyzing a multiple alignment regarding recombination. Recco is based on a dynamic program that explains one sequence in the alignment with the other sequences using mutation and recombination. The dynamic program allows for an intuitive visualization of the optimal solution and also introduces a parameter α controlling the number of recombinations in the solution. Recco performs a parametric analysis regarding α and orders all pareto-optimal solutions by increasing number of recombinations. α is also directly related to the Savings value, a quantitative and intuitive measure for the preference of recombination in the solution. The Savings value and the solutions have a simple interpretation regarding the ancestry of the sequences in the alignment and it is usually easy to understand the output of the method. The distribution of the Savings value for non-recombining alignments is estimated by processing column permutations of the alignment and p-values are provided for recombination in the alignment, in a sequence and at a breakpoint position. Recco also uses the p-values to suggest a single solution, or recombinant structure, for the explained sequence. Recco is validated on a large set of simulated alignments and has a recombination detection performance superior to all current methods. The analysis of real alignments confirmed that Recco is among the best methods for recombination analysis and further supported that Recco is very intuitive compared to other methods.Wir präsentieren Recco, eine neue und schnelle Methode zur Analyse von Rekombinationen in multiplen Alignments. Recco basiert auf einem dynamischen Programm, welches eine Sequenz im Alignment durch die anderen Sequenzen im Alignment rekonstruiert, wobei die Operatoren Mutation und Rekombination erlaubt sind. Das dynamische Programm ermöglicht eine intuitive Visualisierung der optimalen Lösung und besitzt einen Parameter α, welcher die Anzahl der Rekombinationsereignisse in der optimalen Lösung steuert. Recco führt eine parametrische Analyse bezüglich des Parameters α durch, so dass alle pareto-optimalen Lösungen nach der Anzahl ihrer Rekombinationsereignisse sortiert werden können. α steht auch direkt in Beziehung mit dem sogenannten Savings-Wert, der die Neigung zum Einfügen von Rekombinationsereignissen in die optimale Lösung quantitativ und intuitiv bemisst. Der Savings-Wert und die optimalen Lösungen haben eine einfache Interpretation bezüglich der Historie der Sequenzen im Alignment, so dass es in der Regel leicht fällt, die Ausgabe von Recco zu verstehen. Recco schätzt die Verteilung des Savings-Werts für Alignments ohne Rekombinationen durch einen Permutationstest, der auf Spaltenpermutationen basiert. Dieses Verfahren resultiert in p-Werten für Rekombination im Alignment, in einer Sequenz und an jeder Position im Alignment. Basierend auf diesen p-Werten schlägt Recco eine optimale Lösung vor, als Schätzer für die rekombinante Struktur der erklärten Sequenz. Recco wurde auf einem großen Datensatz simulierter Alignments getestet und erzielte auf diesem Datensatz eine bessere Vorhersagegüte in Bezug auf das Erkennen von Alignments mit Rekombination als alle anderen aktuellen Verfahren. Die Analyse von realen Datensätzen bestätigte, dass Recco zu den besten Methoden für die Rekombinationsanalyse gehört und im Vergleich zu anderen Methoden oft leichter verständliche Resultate liefert
    corecore