193 research outputs found

    Modèles et algorithmes pour la segmentation de séquences biologiques et la reconstruction de leurs histoires évolutives

    Computing is increasingly used to solve problems in diverse fields. With the growth of biological data generated by high-throughput experimental techniques, bioinformatics steps in to exploit these masses of data and contribute to the advancement of knowledge in the biological sciences. Bioinformatics is an interdisciplinary field whose aim is to study and solve computational problems arising from the biological sciences. One of the enduring problems studied in bioinformatics is the reconstruction of the evolutionary history of genomes, which essentially implies that of genes. Genes are the carriers of genetic information and the basic units of heredity. Today, a large number of diseases, such as cancers, have a genetic basis. A good understanding of gene evolution would make it possible to better understand the processes involved in these diseases in order to treat them better. Moreover, knowledge about gene evolution is useful for the prediction and annotation of new genes. It has been shown that eukaryotic genes undergo alternative splicing, a phenomenon that allows genes to produce several different transcripts in order to diversify functionally. This doctoral thesis is set in this context. The objective of the thesis is to define efficient and accurate models and algorithms for the segmentation of biological sequences and the reconstruction of their evolutionary histories, taking alternative splicing into account. In this thesis, I contributed to increasing scientific knowledge by introducing and formalizing models of transcript and gene evolution. We proposed two algorithms for the segmentation of alternative transcripts.
We also proposed a tool for simulating the evolution of biological sequences and a tool for visualizing coevolution. For each of the proposed models and algorithms, we developed applications to allow easy use of our tools

    Inferring angiosperm phylogeny from EST data with widespread gene duplication

    BACKGROUND: Most studies inferring species phylogenies use sequences from single copy genes or sets of orthologs culled from gene families. For taxa such as plants, with very high levels of gene duplication in their nuclear genomes, this has limited the exploitation of nuclear sequences for phylogenetic studies, such as those available in large EST libraries. One rarely used method of inference, gene tree parsimony, can infer species trees from gene families undergoing duplication and loss, but its performance has not been evaluated at a phylogenomic scale for EST data in plants. RESULTS: A gene tree parsimony analysis based on EST data was undertaken for six angiosperm model species and Pinus, an outgroup. Although a large fraction of the tentative consensus sequences obtained from the TIGR database of ESTs was assembled into homologous clusters too small to be phylogenetically informative, some 557 clusters contained promising levels of information. Based on maximum likelihood estimates of the gene trees obtained from these clusters, gene tree parsimony correctly inferred the accepted species tree with strong statistical support. A slight variant of this species tree was obtained when maximum parsimony was used to infer the individual gene trees instead. CONCLUSION: Despite the complexity of the EST data and the relatively small fraction eventually used in inferring a species tree, the gene tree parsimony method performed well in the face of very high apparent rates of duplication

    The Orthology Road: Theory and Methods in Orthology Analysis

    The evolution of biological species depends on changes in genes. Among these changes are the gradual accumulation of DNA mutations, insertions and deletions, duplication of genes, movements of genes within and between chromosomes, gene losses and gene transfer. As two populations of the same species evolve independently, they will eventually become reproductively isolated and become two distinct species. The evolutionary history of a set of related species through the repeated occurrence of this speciation process can be represented as a tree-like structure, called a phylogenetic tree or a species tree. Since duplicated genes in a single species also independently accumulate point mutations, insertions and deletions, they drift apart in composition in the same way as genes in two related species. The divergence of all the genes descended from a single gene in an ancestral species can also be represented as a tree, a gene tree that takes into account both speciation and duplication events. In order to reconstruct the evolutionary history from the study of extant species, we use sets of similar genes, with a relatively high degree of DNA similarity and usually with some functional resemblance, that appear to have been derived from a common ancestor. The degree of similarity among different instances of the “same gene” in different species can be used to explore their evolutionary history via the reconstruction of gene family histories, namely gene trees. Orthology refers specifically to the relationship between two genes that arose by a speciation event, recent or remote, rather than duplication. Comparing orthologous genes is essential to the correct reconstruction of species trees, so detecting and identifying orthologous genes is an important problem, and a longstanding challenge, in comparative and evolutionary genomics as well as phylogenetics. A variety of orthology detection methods have been devised in recent years.
Although many of these methods are dependent on generating gene and/or species trees, it has been shown that orthology can be estimated at acceptable levels of accuracy without having to infer gene trees and/or reconciling gene trees with species trees. Therefore, there is good reason to look at the connection of trees and orthology from a different angle: How much information about the gene tree, the species tree, and their reconciliation is already contained in the orthology relation among genes? Intriguingly, a solution to the first part of this question has already been given by Boecker and Dress [Boecker and Dress, 1998] in a different context. In particular, they completely characterized certain maps which they called symbolic ultrametrics. Semple and Steel [Semple and Steel, 2003] then presented an algorithm that can be used to reconstruct a phylogenetic tree from any given symbolic ultrametric. In this thesis we investigate a new characterization of orthology relations, based on symbolic ultrametrics for recovering the gene tree. According to Fitch’s definition [Fitch, 2000], two genes are (co-)orthologous if their last common ancestor in the gene tree represents a speciation event. On the other hand, when their last common ancestor is a duplication event, the genes are paralogs. The orthology relation on a set of genes is therefore determined by the gene tree and an “event labeling” that identifies each interior vertex of that tree as either a duplication or a speciation event. In the context of analyzing orthology data, the problem of reconciling event-labeled gene trees with a species tree appears as a variant of the reconciliation problem where gene trees have no labels in their internal vertices. When reconciling a gene tree with a species tree, it can be assumed that the species tree is correct or, in the case of an unknown species tree, it can be inferred. Therefore it is crucial to know for a given gene tree whether there even exists a species tree.
In this thesis we characterize event-labelled gene trees for which a species tree exists and species trees to which event-labelled gene trees can be mapped. Reconciliation methods are not always the best options for detecting orthology. A fundamental problem is that, aside from multicellular eukaryotes, evolution does not seem to have conformed to the descent-with-modification model that gives rise to tree-like phylogenies. Examples include many cases of prokaryotes and viruses whose evolution involved horizontal gene transfer. To treat the problem of distinguishing orthology and paralogy within a more general framework, graph-based methods have been proposed to detect and differentiate among evolutionary relationships of genes in those organisms. In this work we introduce a measure of orthology that can be used to test graph-based methods and reconciliation methods that detect orthology. Using these results, we devise a new algorithm, BOTTOM-UP, that determines whether a map from the set of vertices of a tree to a set of events is a symbolic ultrametric. Additionally, we present a simulation environment designed to generate large gene families with complex duplication histories, on which reconstruction algorithms can be tested and software tools can be benchmarked
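
As an illustration of Fitch's definition quoted in this abstract, the following is a minimal sketch, not taken from the thesis, of how an event-labeled gene tree determines the orthology relation. The toy tree, the gene names (`a1`, `a2`, `b1`) and all function names are hypothetical and chosen only for the example.

```python
from itertools import combinations

# Hypothetical toy gene tree stored as child -> parent.
# Root "s" is a speciation; its child "d" is a duplication with leaves a1, a2.
PARENT = {"a1": "d", "a2": "d", "d": "s", "b1": "s"}
# Event labeling of the interior vertices (an assumption of this example).
EVENT = {"d": "duplication", "s": "speciation"}


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def last_common_ancestor(x, y):
    """Return the first vertex on y's path to the root that is also above x."""
    seen = set(ancestors(x))
    for vertex in ancestors(y):
        if vertex in seen:
            return vertex
    raise ValueError("genes are not in the same tree")


def relation(x, y):
    """Orthologs if the last common ancestor is a speciation, else paralogs."""
    event = EVENT[last_common_ancestor(x, y)]
    return "orthologs" if event == "speciation" else "paralogs"


def classify_pairs(leaves):
    """Classify every unordered pair of distinct genes."""
    return {(x, y): relation(x, y) for x, y in combinations(leaves, 2)}


if __name__ == "__main__":
    for pair, rel in classify_pairs(["a1", "a2", "b1"]).items():
        print(pair, rel)
```

In this toy tree, `a1` and `a2` descend from a duplication vertex and are therefore paralogs, while each of them is orthologous to `b1` because their last common ancestor is the speciation at the root.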

    Squamata phylogenomics and molecular evolution of venom proteins in Toxicofera

    How frequent is convergent evolution? This fundamental question of evolutionary biology is challenging to address as it requires mapping innovations on a phylogeny. Phylogeny reconstruction methods, however, aim at reconstructing the tree with the minimum number of such events. Squamata, the order of scaled reptiles composed of lizards, snakes, and amphisbaenians, offers a striking example of such a conundrum. The Toxicofera hypothesis states that all venomous squamates, such as iguanas, anguimorphs, and snakes, are a monophyletic group, and that venom evolved only once in their last common ancestor, therefore constituting the only synapomorphy legitimating this group. Morphological and molecular phylogenetic analyses of squamates, in particular those of mitochondrial genes, however, yield distinct phylogenies that support multiple convergent origins of venomousness, also because not all Toxicofera are venomous. Venom is composed of different proteins that are recruited into the venom from their original function after gene duplication. Thus, homologs of venom proteins are also found in non-venomous taxa. Thereby, the composition of Toxicofera venom resembles that of various other taxa, which evolved venomousness convergently multiple times. Here, I aim to study the molecular evolution of two venom proteins by first establishing a phylogenetic framework for the squamates with a phylogenomic approach that makes use of all protein families in the RefSeq database of the NCBI that are available for at least 15 squamates, resulting in a dataset containing 768 protein families for 272 species. I then use the resulting phylogeny to study the molecular evolution of two venom proteins independent of their single-gene phylogenies. I apply selection models of codon sequence evolution to detect variations in selection pressure between venomous and non-venomous clades.
Additionally, I expect positively selected sites to be fast-evolving, co-adapting surface residues. Even though mitochondrial and nuclear phylogenies diverge considerably, the results reveal evidence for multiple convergent evolutions of venom in Colubroidea, Anguimorpha, and Iguania. Venom proteins experience positive selection in snakes and anguimorphs but not in iguanas. Among positively selected sites are fast-evolving surface residues that are co-adapting with other residues. I conclude that selection pressure acting on venom proteins is stronger in all Toxicofera except Iguania than in other squamates. This difference is not necessarily a consequence of heritability but is to some extent affected by ecological factors such as differences in diet

    OrthoMaM: A database of orthologous genomic markers for placental mammal phylogenetics

    Background: Molecular sequence data have become the standard in modern-day phylogenetics. In particular, several long-standing questions of mammalian evolutionary history have been recently resolved thanks to the use of molecular characters. Yet, most studies have focused on only a handful of standard markers. The availability of an ever increasing number of whole genome sequences is a gold mine for modern systematics. Genomic data now provide the opportunity to select new markers that are potentially relevant for further resolving branches of the mammalian phylogenetic tree at various taxonomic levels. Description: The EnsEMBL database was used to determine a set of orthologous genes from 12 available complete mammalian genomes. As targets for possible amplification and sequencing in additional taxa, more than 3,000 exons of length > 400 bp have been selected, among which 118, 368, 608, and 674 are respectively retrieved for 12, 11, 10, and 9 species. A bioinformatic pipeline has been developed to provide evolutionary descriptors for these candidate markers in order to assess their potential phylogenetic utility. The resulting OrthoMaM (Orthologous Mammalian Markers) database can be queried and alignments can be downloaded through a dedicated web interface http://kimura.univ-montp2.fr/orthomam. Conclusion: The importance of marker choice in phylogenetic studies has long been stressed. Our database centered on complete genome information now makes it possible to select promising markers for a given phylogenetic question or a systematic framework by querying a number of evolutionary descriptors. The usefulness of the database is illustrated with two biological examples. First, two potentially useful markers were identified for rodent systematics based on relevant evolutionary parameters and sequenced in additional species.
Second, a complete, gapless 94 kb supermatrix of 118 orthologous exons was assembled for 12 mammals. Phylogenetic analyses using probabilistic methods unambiguously supported the new placental phylogeny by retrieving the monophyly of Glires, Euarchontoglires, Laurasiatheria, and Boreoeutheria. Muroid rodents thus do not represent a basal placental lineage, as was mistakenly reasserted in some recent phylogenomic analyses based on fewer taxa. We expect the OrthoMaM database to be useful for further resolving the phylogenetic tree of placental mammals and for better understanding the evolutionary dynamics of their genomes, i.e., the forces that shaped coding sequences in terms of selective constraints.