Function prediction in plant genomes from large scale phylogenomics analyses : P0932
With the increasing number of plant genomes being sequenced, a major challenge is to accurately transfer annotations from well-characterized genomes to newly obtained sequences. GreenPhylDB is a database designed for comparative and functional genomics based on complete genome-derived gene sequences. The database currently includes gene families of protein sequences from 22 plant species, including socio-economically important crops such as rice, sorghum, maize, cassava and banana. Genes from all these species are organized in clusters based on sequence similarity. The clusters are manually annotated (i.e. properly named and classified), and the sequences in each cluster are characterized by phylogenetic analysis to elucidate evolutionary relationships (e.g. orthologs, super-orthologs, in/out-paralogs) among genes. GreenPhyl provides a reliable and stable catalog of gene families useful for annotating new plant genome sequences. With its improved user interface, the new release of GreenPhyl retains the gene clustering quality of the previous release and introduces additional features such as specific search engines (quick search, deep search, InterPro domain combination and GO family browser). The GreenPhyl pipeline relies on RapGreen, a new version of the RAP reconciliation tool (Dufayard et al., 2005) that allows us to root gene trees and infer orthology relationships between sequences of a family. GreenPhyl version 3 is available at http://www.greenphyl.org and is a collaborative resource of SouthGreen (southgreen.cirad.fr), a bioinformatics platform dedicated to the analysis of genetic and genomic resources of South and Mediterranean plants. (Author's abstract)
MS-DMind : multiscale data "minding" for molecular process related to biotic and abiotic stresses: pilot study with the nsLTP superfamily of proteins. Genomique Edition 2008
Exploring predicted musa genes using the greenphyl comparative genomics database : W078
With the increasing number of plant genomes being sequenced, a major challenge is to accurately transfer annotations from well-characterized genomes to newly obtained sequences. GreenPhylDB is a database designed for comparative and functional genomics based on complete genome-derived gene sequences (Conte et al., 2008; Rouard, Guignon et al., 2011). The database currently includes gene sequences from 22 plant species, including Musa (representative of bananas and plantains). Genes from all these species are organized in clusters based on sequence similarity. The clusters (or families) are manually annotated (i.e. properly named and classified), and the sequences in each cluster are characterized by phylogenetic analysis to elucidate evolutionary relationships (e.g. orthologs, super-orthologs, in/out-paralogs) among genes. GreenPhyl provides a reliable (Martinez, 2011) and stable catalog of gene families useful for annotating new plant genome sequences. GreenPhyl has been particularly useful for studying the transcription factors of the recently published Musa acuminata (Doubled Haploid Pahang) genome sequence (D'Hont et al., 2012). With its improved user interface, the new release of GreenPhyl (available at http://www.greenphyl.org) retains the gene clustering quality of the previous release and introduces additional features such as specific search engines (quick search, deep search, InterPro domain combination and GO family browser). This talk will present the latest developments in GreenPhyl version 3 and give a few examples of gene family analyses in Musa. (Author's abstract)
Survey of Branch Support Methods Demonstrates Accuracy, Power, and Robustness of Fast Likelihood-based Approximation Schemes
Phylogenetic inference and evaluating support for inferred relationships is at the core of many studies testing evolutionary hypotheses. Despite the popularity of nonparametric bootstrap frequencies and Bayesian posterior probabilities, the interpretation of these measures of tree branch support remains a source of discussion. Furthermore, both methods are computationally expensive and become prohibitive for large data sets. Recent fast approximate likelihood-based measures of branch supports (approximate likelihood ratio test [aLRT] and Shimodaira-Hasegawa [SH]-aLRT) provide a compelling alternative to these slower conventional methods, offering not only speed advantages but also excellent levels of accuracy and power. Here we propose an additional method: a Bayesian-like transformation of aLRT (aBayes). Considering both probabilistic and frequentist frameworks, we compare the performance of the three fast likelihood-based methods with the standard bootstrap (SBS), the Bayesian approach, and the recently introduced rapid bootstrap. Our simulations and real data analyses show that with moderate model violations, all tests are sufficiently accurate, but aLRT and aBayes offer the highest statistical power and are very fast. With severe model violations aLRT, aBayes and Bayesian posteriors can produce elevated false-positive rates. With data sets for which such violation can be detected, we recommend using SH-aLRT, the nonparametric version of aLRT based on a procedure similar to the Shimodaira-Hasegawa tree selection. In general, the SBS seems to be excessively conservative and is much slower than our approximate likelihood-based method
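As a rough illustration of the quantities named in this abstract, the sketch below computes an aLRT-style statistic and an aBayes-style support value from the log-likelihoods of the three nearest-neighbor-interchange configurations around one internal branch. The abstract does not give the formulas; the code assumes the commonly described forms (twice the log-likelihood difference between the best and second-best configuration, and the best configuration's likelihood normalized over the three configurations under a flat prior). The function names are hypothetical and do not correspond to any PhyML API.

```python
import math


def alrt_statistic(lnl_best, lnl_second):
    """Assumed aLRT form: 2 * (lnL_best - lnL_second_best)."""
    return 2.0 * (lnl_best - lnl_second)


def abayes_support(log_likelihoods):
    """Assumed aBayes form: L1 / (L1 + L2 + L3) under a flat prior.

    log_likelihoods[0] is the configuration being scored; the other
    entries are the two alternative configurations around the branch.
    """
    # Subtract the maximum before exponentiating to avoid underflow
    # (log-likelihoods of real alignments are large negative numbers).
    shift = max(log_likelihoods)
    weights = [math.exp(lnl - shift) for lnl in log_likelihoods]
    return weights[0] / sum(weights)


if __name__ == "__main__":
    lnls = [-1000.0, -1005.0, -1010.0]  # made-up values for illustration
    print(alrt_statistic(lnls[0], lnls[1]))
    print(abayes_support(lnls))
```

Both values use only likelihoods that a tree search already evaluates around each branch, which is consistent with the speed advantage the abstract reports over resampling-based support.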
Structural evolution drives diversification of the large LRR-RLK gene family
Cells are continuously exposed to chemical signals that they must discriminate between and respond to appropriately. In embryophytes, the leucine‐rich repeat receptor‐like kinases (LRR‐RLKs) are signal receptors critical in development and defense. LRR‐RLKs have diversified to hundreds of genes in many plant genomes. Although intensively studied, a well‐resolved LRR‐RLK gene tree has remained elusive. To resolve the LRR‐RLK gene tree, we developed an improved gene discovery method based on iterative hidden Markov model searching and phylogenetic inference. We used this method to infer complete gene trees for each of the LRR‐RLK subclades and reconstructed the deepest nodes of the full gene family. We discovered that the LRR‐RLK gene family is even larger than previously thought, and that protein domain gains and losses are prevalent. These structural modifications, some of which likely predate embryophyte diversification, led to misclassification of some LRR‐RLK variants as members of other gene families. Our work corrects this misclassification. Our results reveal ongoing structural evolution generating novel LRR‐RLK genes. These new genes are raw material for the diversification of signaling in development and defense. Our methods also enable phylogenetic reconstruction in any large gene family
GreenPhylDB: phylogenomic resources for comparative and functional genomics in plants
Poster presented at 9th PlantGEM 2011. Istanbul (Turkey), 4-7 May 201
Application du système GenFam à la réponse au stress des plantes : intégration de l'identification d'éléments cis spécifiques
UMR AGAP, ID team (Data Integration). GenFam is an integrative system for gene family analysis. It allows users (i) to build gene families from complete genomes, (ii) to run a phylogenetic analysis of the family through the Galaxy workflow manager in order to define homology relationships, (iii) to study evolutionary events using synteny blocks precomputed with the SynMap workflow of the comparative genomics platform CoGe, and (iv) to integrate these results into a summary visualization interface. The first application of GenFam is to identify candidate genes for tolerance to environmental stresses. This requires detecting cis-regulatory sequences specific to the stress response (such as ABRE and DRE). In this context, we need to integrate new tools to discover and search for transcription factor binding sites (TFBS) in the promoter sequences of the genes belonging to the family under study. This Galaxy workflow first selects the 5' or 3' flanking regions of the genes of interest, according to the user's choice. The flanking regions are then analyzed to discover and search for stress-response-specific cis-regulatory sequence motifs using complementary methods such as MEME, STIF and PHYME. These results, together with the functional annotation of genes tagged as involved in the stress response, will be integrated into the visualization interface. This work should prompt reflection on the notion of functional orthology and enable translational research from model species to species of agronomic interest (e.g. identifying candidate genes for the stress response in coffee from functional information known in Arabidopsis)
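The flanking-region selection and motif search steps described in this abstract can be sketched as follows. This is a minimal, hypothetical illustration and not the GenFam workflow itself: the function names, the 0-based half-open coordinate convention, the default flank length, and the motif patterns used in the example (an ABRE-like core `ACGTG` and a DRE-like core `[AG]CCGAC`) are all assumptions chosen for the demonstration.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanking_region(chrom_seq, start, end, strand, side="5", length=1000):
    """Extract the 5' or 3' flanking region of a gene.

    start/end are assumed 0-based, half-open, on the forward strand.
    The result is returned in the gene's own orientation.
    """
    if side not in ("5", "3"):
        raise ValueError("side must be '5' or '3'")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # The region left of the gene on the forward strand is the 5' flank
    # of a '+' gene and the 3' flank of a '-' gene.
    take_left = (side == "5") == (strand == "+")
    if take_left:
        region = chrom_seq[max(0, start - length):start]
    else:
        region = chrom_seq[end:end + length]
    return region if strand == "+" else reverse_complement(region)


def find_motif(seq, pattern):
    """Return start positions of all (possibly overlapping) matches."""
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]


if __name__ == "__main__":
    # Toy chromosome: 10 bp upstream, a 6 bp "gene", 10 bp downstream.
    chrom = "TTACGTGAAA" + "ATGCCC" + "GGGTCGGTAA"
    promoter = flanking_region(chrom, 10, 16, "+", side="5", length=10)
    print(promoter, find_motif(promoter, "ACGTG"))
```

Motif discovery tools such as those named in the abstract work very differently from this fixed-pattern scan; the regular-expression search only illustrates the "search for known motifs" half of the step.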
Comparative genomics of gene families in relation with metabolic pathway for gene candidates highlighting
The study of gene families is an important field of comparative genomics: analyzing their evolutionary history makes it possible to identify homology relationships and gene losses, and helps with annotation transfer. Adding metabolic information improves the identification of candidate genes by contributing functional and gene network data. We propose new web systems to facilitate and improve gene family analysis for candidate gene searches in plants. GenFam is dedicated to the manual and precise analysis of gene families and includes specific workflows running on a Galaxy platform. It brings together several data sources, analysis tools and visualization tools in order to (i) build custom families, (ii) run workflows dedicated to the analysis of gene families, and (iii) visualize analysis results and functional evidence through a dedicated visualization dashboard. In addition to integrating data sources, tools and visualizations, we suggest a new way to find evidence of evolutionary events through syntenic analyses. The IDEVEN algorithm studies the syntenic blocks linked to a gene family to identify speciation events, whole genome duplication (WGD) events and other duplications in the family's history. The identified events will be reported on the phylogeny and aim to provide complementary evidence for a clearer view of the evolutionary history of a gene family. To extend this tool to the analysis of multiple gene families and to integrate metabolic pathway data, it has been integrated into genesPath, which will allow in-depth identification and highlighting of candidate genes of interest for a specific project called “Biomass For the Future (BFF)”. This online tool will soon be available and could notably be used to search for candidate genes involved in the biosynthesis of lignin and cellulose in various plant species (such as maize and sorghum). (Author's abstract)
Genfam: integrative system for gene family analysis, including a method of evolutionary event identification and evidences for an involvement in environmental stress response
Considerable research effort goes into characterizing mechanisms of biological interest, such as stress tolerance, through gene family studies. Identifying these families supports the functional annotation of genes, since genes belonging to the same family are expected to have similar or related functions. We have developed an online comparative genomics application, GenFam, that allows the user to build custom gene families from several data sources, run analysis workflows, and gather the results into a summary visualization. It displays functional evidence and evolutionary history information obtained with widely used tools and with the new algorithm IDEVEN. This algorithm complements phylogenetic analyses by using syntenic block data and synonymous substitution rates (dS). The system contains additional modules oriented towards the study of stress response, such as (differential) gene expression data, functional annotations (gene ontologies), and the identification of specific stress-related cis elements in promoter regions. The aim of this work is to facilitate knowledge representation and functional inference for scientists working on gene families, and to provide an easier way to link stress-related evidence within a gene family tree. It can therefore highlight adaptive evolution within gene families in relation to stress-prone environments and identify candidate genes for drought tolerance in non-model crops through translational studies. (Author's abstract)
New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0
PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phym
