    Phylogeny-aware identification and correction of taxonomically mislabeled sequences

    Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (‘mislabels’) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa
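
    The sensitivity and precision figures quoted in the abstract above are standard set-based measures. As a minimal sketch of how such figures are computed (not SATIVA's own evaluation code), the snippet below compares a set of truly mislabeled sequence IDs with the set a method flags. The function name and the toy sequence IDs are hypothetical and chosen only for illustration.

```python
def sensitivity_and_precision(true_mislabels, flagged):
    """Return (sensitivity, precision) for a set of flagged sequence IDs.

    sensitivity = flagged true mislabels / all true mislabels
    precision   = flagged true mislabels / all flagged sequences
    """
    true_mislabels = set(true_mislabels)
    flagged = set(flagged)
    true_positives = len(true_mislabels & flagged)
    sensitivity = true_positives / len(true_mislabels) if true_mislabels else 0.0
    precision = true_positives / len(flagged) if flagged else 0.0
    return sensitivity, precision


# Hypothetical toy data: four sequences are truly mislabeled,
# and the method flags four sequences, three of them correctly.
truth = {"seq1", "seq2", "seq3", "seq4"}
flagged = {"seq1", "seq2", "seq3", "seq9"}

sens, prec = sensitivity_and_precision(truth, flagged)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

    In a simulation study the set of true mislabels is known by construction, which is what makes both measures computable.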

    Minimum Information about a Biosynthetic Gene cluster

    A wide variety of enzymatic pathways that produce specialized metabolites in bacteria, fungi and plants are known to be encoded in biosynthetic gene clusters. Information about these clusters, pathways and metabolites is currently dispersed throughout the literature, making it difficult to exploit. To facilitate consistent and systematic deposition and retrieval of data on biosynthetic gene clusters, we propose the Minimum Information about a Biosynthetic Gene cluster (MIBiG) data standard.
    Funding: Netherlands Organization for Scientific Research (NWO)/Rubicon/825.13.001; EU/FP7/Joint Call OCEAN; Biotechnology and Biological Sciences Research Council (BBSRC); Natural Environment Research Council (UK); National Institute for Energy Ethics and Society (NIEeS; UK); Gordon and Betty Moore Foundation; National Science Foundation (NSF; US); US Department of Energy; Engineering and Physical Sciences Research Council (EPSRC)

    Megx.net: integrated database resource for marine ecological genomics

    Megx.net is a database and portal that provides integrated access to georeferenced marker genes, environment data and marine genome and metagenome projects for microbial ecological genomics. All data are stored in the Microbial Ecological Genomics DataBase (MegDB), which is subdivided to hold both sequence and habitat data and global environmental data layers. The extended system provides access to several hundreds of genomes and metagenomes from prokaryotes and phages, as well as over a million small and large subunit ribosomal RNA sequences. With the refined Genes Mapserver, all data can be interactively visualized on a world map and statistics describing environmental parameters can be calculated. Sequence entries have been curated to comply with the proposed minimal standards for genomes and metagenomes (MIGS/MIMS) of the Genomic Standards Consortium. Access to data is facilitated by Web Services. The updated megx.net portal offers microbial ecologists greatly enhanced database content, and new features and tools for data analysis, all of which are freely accessible from our webpage http://www.megx.net

    Identification of Habitat-Specific Biomes of Aquatic Fungal Communities Using a Comprehensive Nearly Full-Length 18S rRNA Dataset Enriched with Contextual Data

    Molecular diversity surveys have demonstrated that aquatic fungi are highly diverse, and that they play fundamental ecological roles in aquatic systems. Unfortunately, comparative studies of aquatic fungal communities are few and far between, due to the scarcity of adequate datasets. We combined all publicly available fungal 18S ribosomal RNA (rRNA) gene sequences with new sequence data from a marine fungi culture collection. We further enriched this dataset by adding validated contextual data. Specifically, we included data on the habitat type of the samples, assigning fungal taxa to ten different habitat categories. This dataset has been created with the intention to serve as a valuable reference dataset for aquatic fungi, including a phylogenetic reference tree. The combined data enabled us to infer fungal community patterns in aquatic systems. Pairwise habitat comparisons showed significant phylogenetic differences, indicating that habitat strongly affects fungal community structure. Fungal taxonomic composition differed considerably even at the phylum and class level. The freshwater fungal assemblage was most different from all other habitat types and was dominated by basal fungal lineages. For most communities, phylogenetic signals indicated clustering of sequences, suggesting that environmental factors were the main drivers of fungal community structure, rather than species competition. Thus, the diversification process of aquatic fungi must be highly clade specific in some cases

    Data shopping in an open marketplace: introducing the Ontogrator web application for marking up data using ontologies and browsing using facets

    In the future, we hope to see an open and thriving data market in which users can find and select data from a wide range of data providers. In such an open access market, data are products that must be packaged accordingly. Increasingly, eCommerce sellers present heterogeneous product lines to buyers using faceted browsing. Using this approach we have developed the Ontogrator platform, which allows for rapid retrieval of data in a way that would be familiar to any online shopper. Using Knowledge Organization Systems (KOS), especially ontologies, Ontogrator uses text mining to mark up data and faceted browsing to help users navigate, query and retrieve data. Ontogrator offers the potential to impact scientific research in two major ways: 1) by significantly improving the retrieval of relevant information; and 2) by significantly reducing the time required to compose standard database queries and assemble information for further research. Here we present a pilot implementation developed in collaboration with the Genomic Standards Consortium (GSC) that includes content from the StrainInfo, GOLD, CAMERA, Silva and Pubmed databases. This implementation demonstrates the power of ontogration and highlights that the usefulness of this approach is fully dependent on both the quality of data and the KOS (ontologies) used. Ideally, the use and further expansion of this collaborative system will help to surface issues associated with the underlying quality of annotation and could lead to a systematic means for accessing integrated data resources
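
    The faceted browsing described in the abstract above can be illustrated with a minimal sketch. This is not Ontogrator's implementation: it assumes records have already been marked up with facet values (the text-mining step is skipped), and the field names, record IDs and helper functions are hypothetical. Only the database names GOLD and Silva come from the abstract.

```python
from collections import Counter

# Hypothetical records, already tagged with facet values.
records = [
    {"id": "r1", "source": "GOLD", "habitat": "marine"},
    {"id": "r2", "source": "Silva", "habitat": "marine"},
    {"id": "r3", "source": "GOLD", "habitat": "soil"},
]


def filter_records(records, selected):
    """Keep records matching every selected facet value."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(records, facet):
    """Count how many records carry each value of a facet."""
    return Counter(r[facet] for r in records if facet in r)


# A user clicks the "marine" habitat facet; the remaining records
# are then summarized by source so the next refinement can be offered.
marine = filter_records(records, {"habitat": "marine"})
source_counts = facet_counts(marine, "source")
print([r["id"] for r in marine], dict(source_counts))
```

    Recomputing the counts after each selection is what lets a faceted interface show only refinements that still return results.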

    Meeting report: GBIF hackathon-workshop on Darwin Core and sample data (22-24 May 2013)

    © The Author(s), 2014. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Standards in Genomic Sciences 9 (2014): 585-598, doi:10.4056/sigs.4898640. The workshop-hackathon was convened by the Global Biodiversity Information Facility (GBIF) at its secretariat in Copenhagen over 22-24 May 2013 with additional support from several projects (RCN4GSC, EAGER, VertNet, BiSciCol, GGBN, and Micro B3). It assembled a team of experts to address the challenge of adapting the Darwin Core standard for a wide variety of sample data. Topics addressed in the workshop included 1) a review of outstanding issues in the Darwin Core standard, 2) issues relating to publishing of biodiversity data through Darwin Core Archives, 3) use of Darwin Core Archives for publishing sample and monitoring data, 4) the case for modifying the Darwin Core Text Guide specification to support many-to-many relations, and 5) the generalization of the Darwin Core Archive to a “Biodiversity Data Archive”. A wide variety of use cases were assembled and discussed in order to inform further developments. We gratefully acknowledge support from the Global Biodiversity Information Facility (GBIF), from the Global Genome Biodiversity Network (GGBN), from the EU 7FP Biodiversity, Bioinformatics, Biotechnology project (Micro B3), and from the US National Science Foundation (NSF) through the following grants: DBI-0840989 [Research Coordination Network for the Genomic Standards Consortium (RCN4GSC)], IIS-1255035 [EAGER: An Interoperable Information Infrastructure for Biodiversity Research (I3BR)], ABI Development: Collaborative Research: VertNet, a New Model for Biodiversity Networks (DBI-1062193), and Collaborative Research: BiSciCol Tracker: Towards a tagging and tracking infrastructure for biodiversity science collections (DBI-0956426)

    Meeting Report: GBIF hackathon-workshop on Darwin Core and sample data (22-24 May 2013)

    This is the published version, also available at http://dx.doi.org/10.4056/sigs.4898640. The workshop-hackathon was convened by the Global Biodiversity Information Facility (GBIF) at its secretariat in Copenhagen over 22-24 May 2013 with additional support from several projects (RCN4GSC, EAGER, VertNet, BiSciCol, GGBN, and Micro B3). It assembled a team of experts to address the challenge of adapting the Darwin Core standard for a wide variety of sample data. Topics addressed in the workshop included 1) a review of outstanding issues in the Darwin Core standard, 2) issues relating to publishing of biodiversity data through Darwin Core Archives, 3) use of Darwin Core Archives for publishing sample and monitoring data, 4) the case for modifying the Darwin Core Text Guide specification to support many-to-many relations, and 5) the generalization of the Darwin Core Archive to a “Biodiversity Data Archive”. A wide variety of use cases were assembled and discussed in order to inform further developments
