
    Making open data work for plant scientists

    Despite the clear demand for open data sharing, its implementation within plant science is still limited. This is, at least in part, because open data sharing raises several unanswered questions and challenges to current research practices. In this commentary, some of the challenges encountered by plant researchers at the bench when generating, interpreting, and attempting to disseminate their data have been highlighted. The difficulties involved in sharing sequencing, transcriptomics, proteomics, and metabolomics data are reviewed. The benefits and drawbacks of three data-sharing venues currently available to plant scientists are identified and assessed: (i) journal publication; (ii) university repositories; and (iii) community and project-specific databases. It is concluded that community and project-specific databases are the most useful to researchers interested in effective data sharing, since these databases are explicitly created to meet the researchers’ needs, support extensive curation, and embody a heightened awareness of what it takes to make data reusable by others. Such bottom-up and community-driven approaches need to be valued by the research community, supported by publishers, and provided with long-term sustainable support by funding bodies and government. At the same time, these databases need to be linked to generic databases where possible, in order to be discoverable to the majority of researchers and thus promote effective and efficient data sharing. As we look forward to a future that embraces open access to data and publications, it is essential that data policies, data curation, data integration, data infrastructure, and data funding are linked together so as to foster data access and research productivity.

    Establishment of a integrative multi-omics expression database CKDdb in the context of chronic kidney disease (CKD)

    Complex human traits such as chronic kidney disease (CKD) are a major health and financial burden in modern societies. Currently, the description of the CKD onset and progression at the molecular level is still not fully understood. Meanwhile, the prolific use of high-throughput omic technologies in disease biomarker discovery studies yielded a vast amount of disjointed data that cannot be easily collated. Therefore, we aimed to develop a molecule-centric database featuring CKD-related experiments from available literature publications. We established the Chronic Kidney Disease database CKDdb, an integrated and clustered information resource that covers multi-omic studies (microRNAs, genomics, peptidomics, proteomics and metabolomics) of CKD and related disorders by performing literature data mining and manual curation. The CKDdb database contains differential expression data from 49395 molecule entries (redundant), of which 16885 are unique molecules (non-redundant) from 377 manually curated studies of 230 publications. This database was intentionally built to allow disease pathway analysis through a systems approach in order to yield biological meaning by integrating all existing information and therefore has the potential to unravel and gain an in-depth understanding of the key molecular events that modulate CKD pathogenesis

    Large-scale event extraction from literature with multi-level gene normalization

    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

    MycoBank gearing up for new horizons.

    MycoBank, a registration system for fungi established in 2004 to capture all taxonomic novelties, acts as a coordination hub between repositories such as Index Fungorum and Fungal Names. Since January 2013, registration of fungal names is a mandatory requirement for valid publication under the International Code of Nomenclature for algae, fungi and plants (ICN). This review explains the database innovations that have been implemented over the past few years, and discusses new features such as advanced queries, registration of typification events (MBT numbers for lecto, epi- and neotypes), the multi-lingual database interface, the nomenclature discussion forum, annotation system, and web services with links to third parties. MycoBank has also introduced novel identification services, linking DNA sequence data to numerous related databases to enable intelligent search queries. Although MycoBank fills an important void for taxon registration, challenges for the future remain to improve links between taxonomic names and DNA data, and to also introduce a formal system for naming fungi known from DNA sequence data only. To further improve the quality of MycoBank data, remote access will now allow registered mycologists to act as MycoBank curators, using Citrix software

    Chemical information matters: an e-Research perspective on information and data sharing in the chemical sciences

    Recently, a number of organisations have called for open access to scientific information and especially to the data obtained from publicly funded research, among which the Royal Society report and the European Commission press release are particularly notable. It has long been accepted that building research on the foundations laid by other scientists is both effective and efficient. Regrettably, some disciplines, chemistry being one, have been slow to recognise the value of sharing and have thus been reluctant to curate their data and information in preparation for exchanging it. The very significant increases in both the volume and the complexity of the datasets produced have encouraged the expansion of e-Research, and stimulated the development of methodologies for managing, organising, and analysing "big data". We review the evolution of cheminformatics, the amalgam of chemistry, computer science, and information technology, and assess the wider e-Science and e-Research perspective. Chemical information does matter, as do matters of communicating data and collaborating with data. For chemistry, unique identifiers, structure representations, and property descriptors are essential to the activities of sharing and exchange. Open science entails the sharing of more than mere facts: for example, the publication of negative outcomes can facilitate better understanding of which synthetic routes to choose, an aspiration of the Dial-a-Molecule Grand Challenge. The protagonists of open notebook science go even further and exchange their thoughts and plans. We consider the concepts of preservation, curation, provenance, discovery, and access in the context of the research lifecycle, and then focus on the role of metadata, particularly the ontologies on which the emerging chemical Semantic Web will depend. Among our conclusions, we present our choice of the "grand challenges" for the preservation and sharing of chemical information.

    The devices, experimental scaffolds, and biomaterials ontology (DEB): a tool for mapping, annotation, and analysis of biomaterials' data

    The size and complexity of the biomaterials literature make systematic data analysis an excruciating manual task. A practical solution is creating databases and information resources. Implant design and biomaterials research can greatly benefit from an open database for systematic data retrieval. Ontologies are pivotal to knowledge base creation, serving to represent and organize domain knowledge. To name but two examples, GO, the Gene Ontology, and ChEBI, the Chemical Entities of Biological Interest ontology, and their associated databases are central resources for their respective research communities. The creation of the devices, experimental scaffolds, and biomaterials ontology (DEB), an open resource for organizing information about biomaterials, their design, manufacture, and biological testing, is described. It is developed using text analysis for identifying ontology terms from a biomaterials gold standard corpus, systematically curated to represent the domain's lexicon. Topics covered are validated by members of the biomaterials research community. The ontology may be used for searching terms, performing annotations for machine learning applications, standardized meta-data indexing, and other cross-disciplinary data exploitation. The input of the biomaterials community to this effort to create data-driven open-access research tools is encouraged and welcomed.

    mockrobiota: a Public Resource for Microbiome Bioinformatics Benchmarking.

    Mock communities are an important tool for validating, optimizing, and comparing bioinformatics methods for microbial community analysis. We present mockrobiota, a public resource for sharing, validating, and documenting mock community data resources, available at http://caporaso-lab.github.io/mockrobiota/. The materials contained in mockrobiota include data set and sample metadata, expected composition data (taxonomy or gene annotations or reference sequences for mock community members), and links to raw data (e.g., raw sequence data) for each mock community data set. mockrobiota does not supply physical sample materials directly, but the data set metadata included for each mock community indicate whether physical sample materials are available. At the time of this writing, mockrobiota contains 11 mock community data sets with known species compositions, including bacterial, archaeal, and eukaryotic mock communities, analyzed by high-throughput marker gene sequencing. IMPORTANCE The availability of standard and public mock community data will facilitate ongoing method optimizations, comparisons across studies that share source data, and greater transparency and access and eliminate redundancy. These are also valuable resources for bioinformatics teaching and training. This dynamic resource is intended to expand and evolve to meet the changing needs of the omics community

    Integrative biological simulation praxis: Considerations from physics, philosophy, and data/model curation practices

    Integrative biological simulations have a varied and controversial history in the biological sciences. From computational models of organelles, cells, and simple organisms, to physiological models of tissues, organ systems, and ecosystems, a diverse array of biological systems have been the target of large-scale computational modeling efforts. Nonetheless, these research agendas have yet to prove decisively their value among the broader community of theoretical and experimental biologists. In this commentary, we examine a range of philosophical and practical issues relevant to understanding the potential of integrative simulations. We discuss the role of theory and modeling in different areas of physics and suggest that certain sub-disciplines of physics provide useful cultural analogies for imagining the future role of simulations in biological research. We examine philosophical issues related to modeling which consistently arise in discussions about integrative simulations and suggest a pragmatic viewpoint that balances a belief in philosophy with the recognition of the relative infancy of our state of philosophical understanding. Finally, we discuss community workflow and publication practices to allow research to be readily discoverable and amenable to incorporation into simulations. We argue that there are aligned incentives in widespread adoption of practices which will both advance the needs of integrative simulation efforts as well as other contemporary trends in the biological sciences, ranging from open science and data sharing to improving reproducibility.

    Repository of NSF Funded Publications and Data Sets: "Back of Envelope" 15 year Cost Estimate

    In this back of envelope study we calculate the 15 year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers that are published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated at a sliding scale of $150 - $100 per new dataset for up-front curation, or $4.87 - $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs as efficiency and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect costs return (ICR). After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted by use of strategic preservation.
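
    The headline figures in this abstract can be roughly re-derived from the inputs it states. The sketch below is only a sanity check, not the authors' cost model: the function names are hypothetical, decimal units (1 PB = 1,000,000 GB) are assumed, and the 3% annual decrease in curation cost is assumed to compound from a first-year value of $150.

```python
# Back-of-envelope check of figures quoted in the abstract above.
# Assumption: decimal units (1 PB = 1,000,000 GB); the abstract does not say.
GB_PER_PB = 1_000_000

PAPERS_PER_YEAR = 64_340      # from the abstract
AVG_DATASET_GB = 32           # from the abstract
YEARS = 15                    # from the abstract
TOTAL_COST_USD = 167_000_000  # from the abstract


def annual_volume_pb(papers=PAPERS_PER_YEAR, dataset_gb=AVG_DATASET_GB):
    """Data added per year in PB, assuming one dataset per paper."""
    return papers * dataset_gb / GB_PER_PB


def total_datasets(papers=PAPERS_PER_YEAR, years=YEARS):
    """Number of datasets curated over the whole period."""
    return papers * years


def cost_per_gb(total_cost_usd, total_pb):
    """Overall cost per GB for a given total cost and stored volume."""
    return total_cost_usd / (total_pb * GB_PER_PB)


def curation_cost(year, start_usd=150.0, annual_decrease=0.03):
    """Hypothetical per-dataset curation cost in a given year (year 1 = start),
    assuming the 3% annual decrease compounds."""
    return start_usd * (1 - annual_decrease) ** (year - 1)


if __name__ == "__main__":
    print(f"Annual volume: {annual_volume_pb():.2f} PB")
    print(f"Datasets over {YEARS} years: {total_datasets():,}")
    print(f"Cost per GB at 30 PB: ${cost_per_gb(TOTAL_COST_USD, 30):.2f}")
    print(f"Curation cost in year {YEARS}: ${curation_cost(YEARS):.2f}")
```

    Under these assumptions the yearly volume comes to about 2.06 PB and the 15-year dataset count to 965,100, in line with the "two petabytes per year" and "close to one million" figures. Dividing $167 million by the rounded 30 PB gives about $5.57 per GB, close to the $5.56 reported, and a compounding 3% decrease takes the per-dataset curation cost from $150 to roughly $98 by year 15, consistent with the quoted $150 - $100 range.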