343 research outputs found

    Detection of recombination in DNA multiple alignments with hidden markov models

    Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.
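
To make the HMM idea in the abstract above concrete, here is a minimal sketch in Python of decoding the most likely topology at each alignment column. It is not the paper's implementation. It assumes the per-site log-likelihoods under each candidate topology are already computed (in practice from the tree and its branch lengths), that the initial state distribution is uniform, and that a single hypothetical parameter `rho` gives the probability of switching topology between adjacent sites, split evenly among the other topologies. The function name `viterbi_topologies` is also hypothetical.

```python
import math


def viterbi_topologies(site_loglik, rho):
    """Return the most likely topology index for every alignment column.

    site_loglik[t][k] is log P(column t | topology k), assumed precomputed.
    rho is the assumed probability of changing topology between adjacent sites.
    """
    n_states = len(site_loglik[0])
    log_stay = math.log(1.0 - rho)
    log_switch = math.log(rho / (n_states - 1))

    # Uniform prior over topologies at the first site (an assumption).
    score = [math.log(1.0 / n_states) + ll for ll in site_loglik[0]]
    back = []

    for column in site_loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor state for ending in topology k at this site.
            candidates = [
                score[j] + (log_stay if j == k else log_switch)
                for j in range(n_states)
            ]
            best_prev = max(range(n_states), key=lambda j: candidates[j])
            new_score.append(candidates[best_prev] + column[k])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy input: three topologies, five sites favouring topology 0, then five favouring topology 2.
favour0 = [math.log(0.8), math.log(0.1), math.log(0.1)]
favour2 = [math.log(0.1), math.log(0.1), math.log(0.8)]
toy = [favour0] * 5 + [favour2] * 5
print(viterbi_topologies(toy, rho=0.1))  # -> [0, 0, 0, 0, 0, 2, 2, 2, 2, 2]
```

A change of state along the decoded path marks a putative recombination breakpoint. Estimating `rho` and the branch lengths by EM, as the abstract describes, is outside the scope of this sketch.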

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy, so it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. 
We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees

    Genomic regions participating in recombination events may support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. Existing phylogenetic methods for recombination detection are challenged by the enormous number of possible topologies, even for a moderate number of taxa. If, however, the detection analysis is conducted independently between each putative recombinant sequence and a set of reference parentals, potential recombinations between the recombinants are neglected. In this context, a recombination hotspot can be inferred in phylogenetic analyses if we observe several consecutive breakpoints. We developed a distance measure between unrooted topologies that closely resembles the number of recombinations. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies occurring due to recombinations. This model relaxes the assumption of known parental sequences, still common in HIV analysis, allowing the entire dataset to be analyzed at once. On simulated datasets with up to 16 taxa, our method correctly detected recombination breakpoints and the number of recombination events for each breakpoint. The procedure is robust to rate and transition∶transversion heterogeneities for simulations with and without recombination. This recombination distance is related to recombination hotspots. Applying this procedure to a genomic HIV-1 dataset, we found evidence for hotspots and de novo recombination

    Examining recombination and intra-genomic conflict dynamics in the evolution of anti-microbial resistant bacteria

    The spread of antimicrobial resistance (AMR) among pathogenic bacterial species threatens to undercut much of the progress made in treating infectious diseases. AMR genes can disseminate between and within populations via horizontal gene transfer (HGT). Selfish mobile genetic elements (MGEs) can encode resistance and spread between host cells. Homologous recombination can alter the core genes of pathogens with resistant donors via HGT too. MGEs may be cured from host genomes through transformation. Hence, MGEs may be able to avoid deletion by disrupting transformation. This work aims to understand how the dynamics of these processes affect the epidemiology of AMR pathogens. To understand these dynamics, I co-developed a new version of the popular recombination detection tool Gubbins. Through simulation studies, I find this new version to be both accurate in reconstructing the relationships between isolates, and efficient in terms of its use of computational resources. I then apply Gubbins to both AMR lineages and species-wide datasets of the pathogen Streptococcus pneumoniae. I find that recombination frequently occurs around core genes involved in both drug resistance and the host immune response. Additionally, an MGE was able to successfully spread within a population by disrupting the transformation machinery, preventing its loss from the host. Finally, I investigate two recent examples of MGEs disrupting transformation in the gram-negative species Acinetobacter baumannii and Legionella pneumophila. I find that while these insertions may decrease the efficiency of transformations within cells, the observed recombination rates largely reflect the selection pressures on isolates, with MGEs only partially able to inhibit these observable transformation events. These results show how selection pressures from clinical interventions shape pathogen genomes through diverse, often interspecies, recombination events. 
The spread of MGEs can also be favoured by both these selection pressures and their ability to disrupt host cell machinery.

    Tandem gene arrays in Trypanosoma brucei: Comparative phylogenomic analysis of duplicate sequence variation

    BACKGROUND: The genome sequence of the protistan parasite Trypanosoma brucei contains many tandem gene arrays. Gene duplicates are created through tandem duplication and are expressed through polycistronic transcription, suggesting that the primary purpose of long, tandem arrays is to increase gene dosage in an environment where individual gene promoters are absent. This report presents the first account of the tandem gene arrays in the T. brucei genome, employing several related genome sequences to establish how variation is created and removed. RESULTS: A systematic survey of tandem gene arrays showed that substantial sequence variation existed across the genome; variation from different regions of an array often produced inconsistent phylogenetic affinities. Phylogenetic relationships of gene duplicates were consistent with concerted evolution being a widespread homogenising force. However, tandem duplicates were not usually identical; therefore, any homogenising effect was coincident with divergence among duplicates. Allelic gene conversion was detected using various criteria and was apparently able to both remove and introduce sequence variation. Tandem arrays containing structural heterogeneity demonstrated how sequence homogenisation and differentiation can occur within a single locus. CONCLUSION: The use of multiple genome sequences in a comparative analysis of tandem gene arrays identified substantial sequence variation among gene duplicates. The distribution of sequence variation is determined by a dynamic balance of conservative and innovative evolutionary forces. Gene trees from various species showed that intraspecific duplicates evolve in concert, perhaps through frequent gene conversion, although this does not prevent sequence divergence, especially where structural heterogeneity physically separates a duplicate from its neighbours. 
In describing dynamics of sequence variation that have consequences beyond gene dosage, this survey provides a basis for uncovering the hidden functionality within tandem gene arrays in trypanosomatids

    Genetic variation and dispersal in Penstemon hirsutus and P. tenuiflorus

    Studying plant-pollinator relationships is essential for understanding angiosperm evolution. In the large endemic genus Penstemon (Plantaginaceae), shifts in pollination syndrome are proposed to be important for explaining taxonomic and morphological diversity (Wilson et al., 2004; Wolfe et al., 2006). However, little work has been done to determine the relationship between morphological and genetic divergence within pollination syndromes. This study utilized genetic data to explore whether divergence in corolla morphology among nine closely related, bee-pollinated Penstemon species was consistent with pollinator-driven selection. Bee-pollinated species in Penstemon subsection Penstemon are often divided into two morphological groups based on inflation of the corolla throat. This trait has been proposed to be an important target for pollinator selection (Pennell, 1935; Clements, 1995). Consistent with this theory, phylogenetic analyses of the nuclear gene GBSSI indicated that genetic divergence among species was consistent with morphological divergence. However, a similar pattern was reflected by chloroplast data (noncoding region rps16-trnK), indicating divergence likely occurred during a period of geographic isolation. Therefore, pollinator-driven selection on corolla morphology does not appear to be the primary cause of divergence among morphological groups. Instead, gene flow patterns indicate that pollinator selection may serve to reinforce pre-existing divergence between groups. Population genetic analyses utilized microsatellites in addition to GBSSI and rps16-trnK and focused specifically on two species that share a high degree of morphological similarity. Penstemon hirsutus and P. tenuiflorus have small average population sizes, fragmented geographic distributions and are pollinated by insects (primarily bees) with short foraging ranges. Patterns consistent with geographic isolation of populations were observed in P. hirsutus, but P. 
tenuiflorus nuclear data reflected much lower levels of genetic structure. Although P. hirsutus and P. tenuiflorus are highly similar, these findings imply that pollinator dynamics differ between species. Therefore, a trait other than corolla throat inflation may be important for explaining differences in pollinator-driven selection. Furthermore, P. tenuiflorus may represent an example of Slatkin’s paradox. Higher than expected pollen-mediated gene flow among populations of P. tenuiflorus suggests that long-distance foragers such as moths may be more important than previously realized.

    Integration of Alignment and Phylogeny in the Whole-Genome Era

    With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is a large amount of MSA data on the internet, such as the UCSC genome database, but how to effectively use this information to solve classical and new problems is still an area lacking exploration. In this thesis, we explored how to use this information in four problems, i.e. the sequence orthology search problem, the multiple alignment improvement problem, the short read mapping problem, and the genome rearrangement inference problem. For the first problem, we developed an EM algorithm to iteratively align a query with a multiple alignment database with the information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query's location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve the accuracies for both problems. For the second problem, we developed an optimization algorithm to iteratively refine the multiple alignment quality. Experimental results showed that our algorithm is very stable in terms of the resulting alignments. The results showed that our method is more accurate than existing methods, i.e. Mafft, Clustal-O, and Mavid, on test data from three sets of species from the UCSC genome database. For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. 
PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experimental results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, BOWTIE and BLAST). For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidean metric to measure the distance between two sequence orders in the tree optimization goal function will lead to a degenerate solution where the inferred order will be the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences.

    Recombination drives genomic divergence and cooperation across bacterial populations

    Genetic recombination is a major driving evolutionary force across the Tree of Life, but it plays a special role in bacteria. In the absence of sexual reproduction, bacteria rely on horizontal transmission to recombine their genetic material. Horizontally transferred genes can spread rapidly across a population, greatly expanding the niches available to bacteria. We are only beginning to understand the degree to which recombination varies between and even within bacterial taxa. In the first part of this dissertation (Chapter 1 and Chapter 2), I investigate how patterns of recombination influence bacterial diversity. In Chapter 1, I studied the pan-genome of Cronobacter sakazakii, an emerging neonatal pathogen. I identified a suite of frequently recombined genes that may contribute to the success of C. sakazakii as a pathogen, with many of these genes playing roles in virulence, antibiotic resistance, and niche-specific adaptation. In Chapter 2, I compare the pan-genomes of Streptococcus agalactiae, Streptococcus pyogenes, and Streptococcus suis, three opportunistic pathogens. Although all three species exhibit recombination, I identified differences among them in the quantity of recombined genes, the clinical relevance of these genes, and the mobile genetic elements used in their spread. The second part of this dissertation (Chapter 3) covers how recombination affects cooperation within bacterial populations. While cooperation is widespread among bacteria, it can quickly degrade in the absence of a way to enforce it. I propose that recombination can function as an enforcement mechanism for cooperation, that increased cooperation can also lead to conditions favorable to recombination, and that recombination itself is an altruistic activity. 
Overall, this dissertation covers the many ways genetic recombination can affect the evolutionary dynamics of bacterial populations, from promoting genome diversity and adaptation in pathogens to enforcing uniformity in cooperative local populations

    Nouvel algorithme pour évaluer l'influence environnementale du Coronavirus par le biais d'une analyse phylogéographique

    Abstract: This thesis presents a comprehensive exploration of phylogeographic methodologies designed to unravel intricate interplays between divergence patterns within coronaviruses and relevant environmental attributes. The study encompasses the integration of genetic and climatic factors to discern the complex relationships underlying viral evolution and distribution. The research commences with the development of a Python-based phylogeographic analysis pipeline, facilitating the investigation of the relationship between genetic diversity and geographic distribution. The pipeline employs a sliding window approach to identify regions within viral genetic sequences aligning with regional climatic conditions. This unified system orchestrates a range of analytical operations and is cross-platform compatible, catering to various operating systems. Building upon this foundation, an application is developed, enhancing the reproducibility and accessibility of the analysis. Neo4j and Snakemake technologies are leveraged to empower researchers in data preprocessing, parameter tuning, result saving, and data visualization. Real-world data, including genomic sequences, lineage information, population statistics, and climate data, are curated and integrated into the Neo4j graph database. To broaden the scope to encompass various coronaviruses, the study incorporates host-virus cophylogeny analysis, horizontal gene transfer analysis, and other strategies, thereby enriching the research landscape. Moreover, the study addresses scalability and efficiency concerns, crucial for accommodating expanding datasets and evolving research requirements. The enhanced workflow facilitates parallel task execution, significantly boosting performance. The outcomes highlight key fragments correlating with specific environmental factors, reinforcing the platform's utility in deciphering complex evolutionary dynamics. 
As a result, this research makes a substantial contribution to the field of phylogeography, providing researchers with a powerful toolkit for investigating species distribution patterns and environmental influences. The insights derived from this study have the potential to reveal fundamental principles governing the interplay between genetic variation and geographical attributes across various species.
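
The sliding window idea mentioned in this abstract can be sketched roughly as follows. This is an illustrative assumption, not the thesis's pipeline: it scores each window by the Pearson correlation between pairwise genetic p-distances inside the window and pairwise absolute differences in one climate variable. The function names, the input dictionaries, and the choice of statistic are all hypothetical.

```python
from itertools import combinations
import math


def p_distance(a, b):
    """Fraction of differing positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pearson(xs, ys):
    """Pearson correlation, or None when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def sliding_window_scan(alignment, climate, window, step):
    """Return (start, end, correlation) for each window along the alignment.

    alignment: dict mapping sample name -> aligned sequence (equal lengths).
    climate:   dict mapping sample name -> value of one climate variable.
    """
    names = sorted(alignment)
    pairs = list(combinations(names, 2))
    climate_dist = [abs(climate[a] - climate[b]) for a, b in pairs]
    length = len(alignment[names[0]])

    results = []
    for start in range(0, length - window + 1, step):
        end = start + window
        genetic_dist = [
            p_distance(alignment[a][start:end], alignment[b][start:end])
            for a, b in pairs
        ]
        results.append((start, end, pearson(genetic_dist, climate_dist)))
    return results


# Toy data: only the second half of the alignment varies with the climate variable.
toy_alignment = {"A": "AAAAAAAA", "B": "AAAAAAAC", "C": "AAAACCCC"}
toy_climate = {"A": 10.0, "B": 12.0, "C": 18.0}
for start, end, r in sliding_window_scan(toy_alignment, toy_climate, window=4, step=4):
    print(start, end, r)
```

Windows with a high correlation would be candidate regions for closer analysis. A window with no variation yields `None` rather than a correlation.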

    Population Genomic Inferences from Sparse High-Throughput Sequencing of Two Populations of Drosophila melanogaster

    Short-read sequencing techniques provide the opportunity to capture genome-wide sequence data in a single experiment. A current challenge is to identify questions that shallow-depth genomic data can address successfully and to develop corresponding analytical methods that are statistically sound. Here, we apply the Roche/454 platform to survey natural variation in strains of Drosophila melanogaster from an African (n = 3) and a North American (n = 6) population. Reads were aligned to the reference D. melanogaster genomic assembly, single nucleotide polymorphisms were identified, and nucleotide variation was quantified genome wide. Simulations and empirical results suggest that nucleotide diversity can be accurately estimated from sparse data with as little as 0.2× coverage per line. The unbiased genomic sampling provided by random short-read sequencing also allows insight into distributions of transposable elements and copy number polymorphisms found within populations and demonstrates that short-read sequencing methods provide an efficient means to quantify variation in genome organization and content. Continued development of methods for statistical inference of shallow-depth genome-wide sequencing data will allow such sparse, partial data sets to become the norm in the emerging field of population genomics
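
As a rough illustration of estimating nucleotide diversity from sparsely covered lines, the sketch below averages, over sites covered in at least two lines, the probability that two distinct covered lines differ. It is a textbook per-site estimator and not necessarily the one used in the paper. The function name, the `N` symbol for uncovered positions, and the one-haploid-sequence-per-line input are assumptions, and sequencing error is ignored.

```python
from collections import Counter


def nucleotide_diversity(sequences, missing="N"):
    """Mean per-site pairwise diversity over sites covered in >= 2 lines.

    sequences: equal-length aligned strings, one per line (strain);
    positions without coverage are marked with `missing`.
    Returns None when no site is covered in at least two lines.
    """
    total = 0.0
    usable_sites = 0
    for column in zip(*sequences):
        alleles = [base for base in column if base != missing]
        n = len(alleles)
        if n < 2:
            continue  # cannot compare lines at this site
        counts = Counter(alleles)
        # Probability that two distinct covered lines carry the same base.
        same = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
        total += 1.0 - same
        usable_sites += 1
    return total / usable_sites if usable_sites else None


# Toy example: three lines, four sites, one uncovered position.
print(nucleotide_diversity(["ACGT", "ACGA", "NCTA"]))
```

Restricting each site to the lines that cover it is what lets the estimate use sparse data, at the cost of higher variance at sites covered by only a few lines.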