13 research outputs found

    Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty?

    Background: The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes.
    Results: We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover segments of 150 amino acids that are composed of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra.
    Conclusions: While most TR-proteins have originated from prediction tools and are still awaiting experimental validation, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.

    K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

    The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
    Instituto de Biotecnología
    Affiliation: Contreras-Moreira, Bruno. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
    Affiliation: Filippi, Carla Valeria. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular (IABIMO); Argentina
    Affiliation: Filippi, Carla Valeria. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Filippi, Carla Valeria. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
    Affiliation: Naamati, Guy. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
    Affiliation: García Girón, Carlos. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
    Affiliation: Allen, James E. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
    Affiliation: Flicek, Paul. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom

    Genetic Diversity, Population Structure and Linkage Disequilibrium Assessment among International Sunflower Breeding Collections

    Sunflower germplasm collections are valuable resources for broadening the genetic base of commercial hybrids and ameliorating the risk of climate events. Nowadays, the most studied worldwide sunflower pre-breeding collections belong to INTA (Argentina), INRA (France), and USDA-UBC (United States of America-Canada). In this work, we assess the amount and distribution of genetic diversity (GD) available within and between these collections to estimate the distribution pattern of global diversity. A mixed genotyping strategy was implemented, combining proprietary genotyping-by-sequencing data with public whole-genome-sequencing data, to generate an integrative matrix of 11,834 common single nucleotide polymorphisms covering the three breeding collections. In general, the GD estimates obtained were moderate. An analysis of molecular variance provided evidence of population structure between breeding collections. However, the optimal number of subpopulations, studied via discriminant analysis of principal components (K = 12), the Bayesian STRUCTURE algorithm (K = 6) and distance-based methods (K = 9), remains unclear, since no single unifying characteristic is apparent for any of the inferred groups. Different overall patterns of linkage disequilibrium (LD) were observed across chromosomes, with Chr10, Chr17, Chr5, and Chr2 showing the highest LD. This work represents the largest and most comprehensive inter-breeding collection analysis of genomic diversity for cultivated sunflower conducted to date.
    Affiliation: Filippi, Carla Valeria. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Merino, Gabriela Alejandra. Universidad Nacional de Entre Ríos. Instituto de Investigación y Desarrollo en Bioingeniería y Bioinformática - Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación y Desarrollo en Bioingeniería y Bioinformática; Argentina
    Affiliation: Montecchia, Juan Francisco. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Aguirre, Natalia Cristina. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Rivarola, Maximo Lisandro. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Naamati, Guy. European Molecular Biology Laboratory. European Bioinformatics Institute; United Kingdom
    Affiliation: Fass, Mónica Irina. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Alvarez, Daniel. Instituto Nacional de Tecnología Agropecuaria. Centro Regional Córdoba. Estación Experimental Agropecuaria Manfredi; Argentina
    Affiliation: Di Rienzo, Julio Alejandro. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias; Argentina
    Affiliation: Heinz, Ruth Amelia. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Contreras Moreira, Bruno. European Molecular Biology Laboratory. European Bioinformatics Institute; United Kingdom
    Affiliation: Lia, Verónica Viviana. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina
    Affiliation: Paniego, Norma Beatriz. Instituto Nacional de Tecnologia Agropecuaria. Centro de Investigacion En Ciencias Veterinarias y Agronomicas. Instituto de Agrobiotecnologia y Biologia Molecular. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Pque. Centenario. Instituto de Agrobiotecnologia y Biologia Molecular; Argentina

    Gene expression variability across cells and species shapes innate immunity.

    As the first line of defence against pathogens, cells mount an innate immune response, which varies widely from cell to cell. The response must be potent but carefully controlled to avoid self-damage. How these constraints have shaped the evolution of innate immunity remains poorly understood. Here we characterize the transcriptional divergence of the innate immune response between species and its variability in expression among cells. Using bulk and single-cell transcriptomics in fibroblasts and mononuclear phagocytes from different species, challenged with immune stimuli, we map the architecture of the innate immune response. Transcriptionally diverging genes, including those that encode cytokines and chemokines, vary across cells and have distinct promoter structures. Conversely, genes that are involved in the regulation of this response, such as those that encode transcription factors and kinases, are conserved between species and display low cell-to-cell variability in expression. We suggest that this expression pattern, which is observed across species and conditions, has evolved as a mechanism for fine-tuned regulation to achieve an effective but balanced response.

    GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation.

    Crop pangenomes made from individual cultivar assemblies promise easy access to conserved genes, but genome content variability and inconsistent identifiers hamper their exploration. To address this, we define pangenes, which summarize a species' coding potential and link back to original annotations. The protocol get_pangenes performs whole genome alignments (WGA) to call syntenic gene models based on coordinate overlaps. A benchmark with small and large plant genomes shows that pangenes recapitulate phylogeny-based orthologies and produce complete soft-core gene sets. Moreover, WGAs support lift-over and help confirm gene presence-absence variation. Source code and documentation: https://github.com/Ensembl/plant-scripts

    Additional file 1 of GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation

    Additional file 1:
    Table S1. Other Whole Genome Alignment stats for minimap2 and GSAlign algorithms.
    Table S2. Summary of BUSCO completeness analyses of individual genomes that are part of datasets in this paper.
    Table S3. Collinear genes found between Arabidopsis thaliana and A. lyrata within 23 blocks of the Ancestral Crucifer Karyotype based on Whole Genome Alignments produced with minimap2 and GSAlign.
    Table S4. Excerpt from BED-like pangene matrix produced during the analysis of dataset rice3.
    Table S5. Summary of Whole Genome Alignment (WGA) evidence for the gene models in CDS cluster Horvu_MOREX_1H01G011400 resulting from the analysis of dataset barley20.
    Figure S1. Overlap ratio of collinear gene models in rice, wheat and barley.
    Figure S2. Dot plots of collinear gene models called in rice, wheat and barley genomes.
    Figure S3. Venn diagrams of pangene clusters based on minimap2 and GSAlign Whole Genome Alignments of the rice3 dataset.
    Figure S4. Sequence identity among sequences in rice3 pangene clusters based on minimap2 (left) and GSAlign (right).
    Figure S5. Example of pangene cluster where the cDNA sequences have a long local alignment but the encoded CDS sequences cannot be aligned.
    Figure S6. Examples of rice pangene clusters not matched by Ensembl Compara orthogroups.
    Figure S7. Example of pangene cluster where the encoded protein sequences do not share protein domains.
    Figure S8. Flowchart of script check_evidence.pl, which uses as input a cluster in FASTA format and precomputed collinearity evidence in TSV format.
    Figure S9. Partial deletion of locus HvFT3/Ppd-H2 in barley cultivar Igri.
    Figure S10. Genomic context of pangene cluster HORVU.MOREX.r3.2HG0166090 (cluster members indicated with green arrows), which corresponds to barley locus HvCEN.
    Figure S11. Multiple alignment of protein sequences of pangene cluster HORVU.MOREX.r3.2HG0184740, which corresponds to barley locus Vrs1.
    Figure S12. Multiple alignment of protein sequences of pangene cluster HORVU.MOREX.r3.3HG0311160, which corresponds to barley locus HvOS2.
    Figure S13. Genomic context of pangene cluster gene:HORVU.MOREX.r3.7HG0752640, an example with tandem copies (cluster members indicated with green arrows), which encode acidic proteins.

    Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR

    In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of metadata in Variant Call Format (VCF) files. The flexibility of the VCF format specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous machine-actionable data flow, generic elements need to be further specified.
    We strongly support the merits of the FAIR principles and see the need to facilitate them also through technical implementation specifications. They form a basis for the VCF extensions proposed here. We have learned from the existing application of VCF that the definition of relevant metadata using controlled standards and vocabulary, and the consistent use of cross-references via resolvable (machine-readable) identifiers, are particularly necessary, and we propose their encoding.
    VCF is an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant data (for example, the HapMap and the gVCF formats), but none currently have the reach of VCF. For the sake of simplicity, we will only discuss VCF and our recommendations for its use, but these recommendations could also be applied to gVCF. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields are frequently added that have not been agreed upon within the community, which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.

    Gramene 2021: harnessing the power of comparative genomics and pathways for plant research.

    Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR data principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting gene function information from publications. The current release, #63 (October 2020), hosts 93 reference genomes, with over 3.9 million genes in 122 947 families with orthologous and paralogous classifications. Plant Reactome portrays pathway networks using a combination of manual biocuration in rice (320 reference pathways) and orthology-based projections to 106 species. The Reactome platform facilitates comparison between reference and projected pathways, gene expression analyses and overlays of gene–gene interactions. Gramene integrates ontology-based protein structure–function annotation; information on genetic, epigenetic, expression, and phenotypic diversity; and gene functional annotations extracted from plant-focused journals using DIVE. We train plant researchers in biocuration of genes and pathways; host curated maize gene structures as tracks in the maize genome browser; and integrate curated rice genes and pathways in the Plant Reactome.