
    A posteriori metadata from automated provenance tracking: Integration of AiiDA and TCOD

    In order to make results of computational scientific research findable, accessible, interoperable and re-usable, it is necessary to decorate them with standardised metadata. However, there are a number of technical and practical challenges that make this process difficult to achieve in practice. Here the implementation of a protocol is presented to tag crystal structures with their computed properties, without the need for human intervention to curate the data. This protocol leverages the capabilities of AiiDA, an open-source platform to manage and automate scientific computational workflows, and TCOD, an open-access database storing computed materials properties using a well-defined and exhaustive ontology. Based on these, the complete procedure to deposit computed data in the TCOD database is automated. All relevant metadata are extracted from the full provenance information that AiiDA tracks and stores automatically while managing the calculations. Such a protocol also enables reproducibility of scientific data in the field of computational materials science. As a proof of concept, the AiiDA-TCOD interface is used to deposit 170 theoretical structures together with their computed properties and their full provenance graphs, consisting of over 4600 AiiDA nodes.
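
    The abstract's core idea, harvesting deposition metadata from a stored provenance graph rather than from manual curation, can be sketched roughly as follows. This is a minimal, self-contained illustration: the Node class, attribute names and the flattening scheme are assumptions made for the example, not the actual AiiDA data model or the TCOD ontology.

```python
# Illustrative sketch only: a toy provenance graph and a metadata collector.
# The real AiiDA-TCOD deposition plugin uses AiiDA's own graph and the TCOD
# ontology; the classes and field names below are hypothetical stand-ins.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A provenance node: a calculation, code, or data object."""
    label: str
    attributes: Dict[str, str] = field(default_factory=dict)
    inputs: List["Node"] = field(default_factory=list)


def collect_metadata(result: Node) -> Dict[str, str]:
    """Walk the provenance graph upstream from a result node and flatten
    every ancestor's attributes into one deposition-ready record."""
    record: Dict[str, str] = {}
    stack = [result]
    seen = set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        for key, value in node.attributes.items():
            record[f"{node.label}.{key}"] = value
        stack.extend(node.inputs)
    return record


# Toy example: a relaxed structure produced by a DFT calculation.
code = Node("code", {"name": "quantum-espresso", "version": "6.x"})
calc = Node("calculation", {"total_energy_eV": "-1021.7"}, inputs=[code])
structure = Node("structure", {"formula": "Si2"}, inputs=[calc])

print(collect_metadata(structure))
```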

    Sparse cross-products of metadata in scientific simulation management

    Managing scientific data is by no means a trivial task, even in a single-site environment with a small number of researchers involved. We discuss some issues concerned with posing well-specified experiments in terms of parameters or instrument settings and the metadata framework that arises from doing so. We are particularly interested in parallel computer simulation experiments, where very large quantities of warehouse-able data are involved. We consider SQL databases and other framework technologies for manipulating experimental data. Our framework manages the outputs from parallel runs that arise from large cross-products of parameter combinations. Considerable useful experiment planning and analysis can be done with the sparse metadata without fully expanding the parameter cross-products. Extra value can be obtained from simulation output that can subsequently be data-mined. We have particular interests in running large-scale Monte-Carlo physics model simulations. Finding ourselves overwhelmed by the problems of managing data and compute resources, we have built a prototype tool using Java and MySQL that addresses these issues. We use this example to discuss type-space management and other fundamental ideas for implementing a laboratory information management system.
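
    A rough sketch of the sparse cross-product idea described above, assuming a toy schema: the parameter axes are kept as metadata, completed runs are recorded individually, and planning queries work against the sparse records instead of materialising the full cross-product. The table layout and column names are illustrative, not the schema of the authors' Java/MySQL prototype.

```python
# Sketch of "sparse cross-product" experiment metadata: store the parameter
# axes, record only the runs that were actually performed, and plan against
# the gap between the two.
import itertools
import sqlite3

axes = {
    "temperature": [0.5, 1.0, 2.0, 4.0],
    "lattice_size": [64, 128, 256],
    "seed": list(range(10)),
}

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE runs (temperature REAL, lattice_size INTEGER, "
    "seed INTEGER, output_path TEXT)"
)

# Only a sparse subset of the full cross-product has been simulated so far.
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [(1.0, 64, s, f"runs/T1.0_L64_s{s}.dat") for s in range(3)],
)

# Planning query: which parameter combinations are still missing?
done = set(con.execute("SELECT temperature, lattice_size, seed FROM runs"))
full = itertools.product(axes["temperature"], axes["lattice_size"], axes["seed"])
missing = [combo for combo in full if combo not in done]
print(f"{len(missing)} of {4 * 3 * 10} combinations still to run")
```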

    Metadata on Biodiversity: Definition and Implementation

    SINP (Information system on nature and landscape) and ECOSCOPE (Observation for research on biodiversity data hub) are two distinct scientific infrastructures on biodiversity relying on different data sources and producers. Their main objective is to document and share information on biodiversity in France. INPN (https://inpn.mnhn.fr) is the reference information system for data related to nature. It manages and disseminates the reference data of the "geodiversity and biodiversity" part of the SINP, and delivers the metadata and data to GBIF (Global Biodiversity Information Facility). For the SINP and ECOSCOPE projects, working groups composed of scientific organisations have defined two compatible metadata profiles, also compliant with the INSPIRE Directive, to describe data on this theme. These profiles are implemented using existing metadata standards: ISO 19115/19139 (for geographic metadata) for SINP, and EML (Ecological Metadata Language) together with ISO 19115/19139 for ECOSCOPE. A mapping between the two profiles has also been produced, and several thesauri for keywords as well as a classification system for taxonomic identification are used, so as to ensure interoperability between the systems. The profiles are implemented in web applications for editing and managing data (GeoSource/GeoNetwork for SINP and an ad hoc application for ECOSCOPE). These applications allow the harvesting of metadata using the OGC CSW (Catalog Service for the Web) standard. Next steps will support increased metadata visibility through the automation of web services.
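
    The field-level mapping between the two metadata profiles might be pictured with a sketch like the one below; the element paths are deliberately simplified stand-ins for the real EML and ISO 19115/19139 structures, and the mapping entries are assumptions made for the example rather than the projects' actual crosswalk.

```python
# Toy crosswalk between a flat EML-like record and ISO 19139-like fields.
# Element names are simplified for illustration; real profiles map full XML
# paths and controlled vocabularies.
EML_TO_ISO = {
    "dataset/title": "identificationInfo/citation/title",
    "dataset/abstract": "identificationInfo/abstract",
    "dataset/keywordSet/keyword": "identificationInfo/descriptiveKeywords/keyword",
    "dataset/creator/organizationName": "identificationInfo/pointOfContact/organisationName",
    "dataset/coverage/temporalCoverage": "identificationInfo/extent/temporalElement",
}


def eml_to_iso(eml_record: dict) -> dict:
    """Translate the fields of a flat EML-like record into the corresponding
    ISO 19139-like fields, dropping anything without a mapping."""
    return {EML_TO_ISO[k]: v for k, v in eml_record.items() if k in EML_TO_ISO}


record = {
    "dataset/title": "Breeding bird survey (example record)",
    "dataset/keywordSet/keyword": "biodiversity",
}
print(eml_to_iso(record))
```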

    Harvesting for disseminating, open archives and role of academic libraries

    The scholarly communication system is at a critical stage, due to a number of factors. The Open Access movement is perhaps the most interesting response that the scientific community has tried to give to this problem. The paper examines strengths and weaknesses of the Open Access strategy in general and, more specifically, of the Open Archives Initiative, discussing experiences, criticisms and barriers. All authors that have faced the problems of implementing an OAI-compliant e-print server agree that technical and practical problems are not the most difficult to overcome and that the real problem is the change in cultural attitude required. In this scenario the university library is possibly the standard bearer for the advent and implementation of e-print archives and Open Archives services. To ensure the successful implementation of this service the library has a number of distinct roles to play.
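
    For readers unfamiliar with how OAI-compliant archives expose their records, the sketch below shows a bare-bones harvest of one page of Dublin Core records over OAI-PMH. The endpoint URL is a placeholder; the verb, metadata prefix and namespaces are those defined by the protocol itself.

```python
# Minimal OAI-PMH harvesting sketch: fetch one page of Dublin Core records
# from an OAI-compliant repository and print identifiers and titles.
# The repository URL is a placeholder; any OAI-PMH endpoint would do.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # placeholder endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

url = BASE_URL + "?verb=ListRecords&metadataPrefix=oai_dc"
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

for record in tree.findall(".//oai:record", NS):
    identifier = record.find(".//oai:identifier", NS)
    title = record.find(".//dc:title", NS)
    print(identifier.text if identifier is not None else "?",
          title.text if title is not None else "(no title)")
```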

    Towards Exascale Scientific Metadata Management

    Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually in an ad hoc manner by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset to permeate metadata-oblivious processes and to query metadata through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling and neuroscience, and then discuss research challenges and possible solutions.
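
    One concrete, deliberately simplified reading of point (2), storing metadata within each dataset and querying it through standard data access interfaces, is the self-describing file: for example, HDF5 attributes kept next to the arrays they describe. The sketch below illustrates that principle only; it is not the metadata infrastructure proposed in the paper, and the attribute names are invented for the example.

```python
# Self-describing HDF5 file: provenance-style metadata is stored as
# attributes next to the arrays and read back through the same library,
# so it travels with the data rather than living in a separate store.
import h5py
import numpy as np

with h5py.File("simulation_output.h5", "w") as f:
    dset = f.create_dataset("density", data=np.random.rand(64, 64))
    dset.attrs["code"] = "toy-plasma-model"      # hypothetical producer
    dset.attrs["timestep"] = 1200
    dset.attrs["input_deck"] = "run_042.cfg"     # hypothetical input file

# Any downstream consumer can later read the annotations through the
# standard HDF5 interface, whether or not it is metadata-aware.
with h5py.File("simulation_output.h5", "r") as f:
    for key, value in f["density"].attrs.items():
        print(key, "=", value)
```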

    Data Representation Model for Management and Distribution of Scientific Data

    Scientific tools and computer simulations enable the rapid creation of various types of data, and a number of studies have been conducted on data provenance and web-based data representation models to enhance the distribution, reproduction and reusability of scientific data. An ontology is a knowledge representation model that is also used as a data and workflow technology for data provenance. In this study, as part of work on managing and distributing scientific data, a metadata and data representation model was defined for the online management and distribution of the Visible Korean data. In addition, the metadata required for re-distributing user data created through the Visible Korean study is defined using an ontology-based data representation model, and an RDFa-based web page generation method is proposed to search and extract data from existing web pages. This study makes it possible to manage and distribute the Visible Korean data online, which had previously been managed and distributed offline, and supports a virtuous cycle of distributing research results as well.
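
    The RDFa approach mentioned above can be illustrated with a small sketch that renders a metadata record into an HTML fragment annotated with Dublin Core terms, so the values remain machine-extractable from an ordinary web page. The field values and URLs are placeholders, not Visible Korean records or the paper's actual page templates.

```python
# Sketch of RDFa-style page generation: embed machine-readable metadata
# about a dataset directly into an HTML fragment using Dublin Core terms.
from string import Template

TEMPLATE = Template("""\
<div vocab="http://purl.org/dc/terms/">
  <h2 property="title">$title</h2>
  <p property="description">$description</p>
  <span property="creator">$creator</span>
  <a property="source" href="$url">$url</a>
</div>""")

# Placeholder values standing in for a real dataset record.
print(TEMPLATE.substitute(
    title="Sectioned anatomical images (sample)",
    description="Example metadata record rendered as RDFa.",
    creator="Example research group",
    url="https://example.org/dataset/123",
))
```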

    Extracting, Transforming and Archiving Scientific Data

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, and (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research. Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
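
    A hypothetical sketch of an Extract-Transform-Archive style pipeline, assuming simple file-based stages: extract a legacy file into staging, apply a trivial normalisation as a stand-in for a real format conversion, and archive the result with a JSON metadata sidecar for later indexing. None of the paths, stages or field names come from the ETA paper itself.

```python
# Illustrative Extract-Transform-Archive pipeline with invented stages.
import hashlib
import json
import shutil
from pathlib import Path


def extract(source: Path, staging: Path) -> Path:
    """Copy a legacy data file into a staging area."""
    staging.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, staging))


def transform(path: Path) -> Path:
    """Normalise line endings as a stand-in for a real format conversion."""
    text = path.read_text(errors="replace").replace("\r\n", "\n")
    normalised = path.with_suffix(path.suffix + ".norm")
    normalised.write_text(text)
    return normalised


def archive(path: Path, vault: Path) -> Path:
    """Store the file in the archive with a JSON metadata sidecar."""
    vault.mkdir(parents=True, exist_ok=True)
    stored = Path(shutil.copy2(path, vault))
    sidecar = {
        "filename": stored.name,
        "sha256": hashlib.sha256(stored.read_bytes()).hexdigest(),
        "source": str(path),
    }
    stored.with_suffix(".json").write_text(json.dumps(sidecar, indent=2))
    return stored


if __name__ == "__main__":
    # Example run; assumes a hypothetical legacy/readings.csv file exists.
    raw = extract(Path("legacy/readings.csv"), Path("staging"))
    archive(transform(raw), Path("vault"))
```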

    A practitioners guide to managing geoscience information

    In the UK the Natural Environment Research Council (NERC) manages its scientific data holdings through a series of Environmental Data Centres covering Atmosphere, Bioinformatics, Earth Sciences, Earth Observation, Hydrology, Marine Science and Polar Science. Within the Earth Science sector the National Geoscience Data Centre (NGDC), a component of the British Geological Survey (BGS), is responsible for managing the geosciences data resource. The purpose of the NGDC is to maintain the national geoscience database and to ensure efficient and effective delivery by providing geoscientists with ready access to data and information that is timely, fit for purpose, and in which the user has confidence. The key benefits that NERC derives from this approach are: risk reduction; increased productivity; and higher quality science. The paper briefly describes the key benefits of managing geoscientific information effectively and describes how these benefits are realised within the NGDC and BGS.