A posteriori metadata from automated provenance tracking: Integration of AiiDA and TCOD
In order to make results of computational scientific research findable,
accessible, interoperable and re-usable, it is necessary to decorate them with
standardised metadata. However, there are a number of technical and practical
challenges that make this process difficult to achieve in practice. Here the
implementation of a protocol is presented to tag crystal structures with their
computed properties, without the need of human intervention to curate the data.
This protocol leverages the capabilities of AiiDA, an open-source platform to
manage and automate scientific computational workflows, and TCOD, an
open-access database storing computed materials properties using a well-defined
and exhaustive ontology. Based on these, the complete procedure to deposit
computed data in the TCOD database is automated. All relevant metadata are
extracted from the full provenance information that AiiDA tracks and stores
automatically while managing the calculations. Such a protocol also enables
reproducibility of scientific data in the field of computational materials
science. As a proof of concept, the AiiDA-TCOD interface is used to deposit 170
theoretical structures together with their computed properties and their full
provenance graphs, consisting of over 4,600 AiiDA nodes
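The deposition protocol described above can be pictured, in a highly simplified form, as a walk over a provenance graph that gathers the computed properties attached to a structure. The sketch below uses plain dictionaries; the node layout and field names are hypothetical and do not reflect AiiDA's actual API:

```python
# Sketch: assemble a TCOD-style metadata record by walking a toy
# provenance graph. Node kinds and field names are illustrative only.
def collect_metadata(nodes, structure_id):
    """Gather computed properties linked to a given structure node."""
    record = {"structure": structure_id, "properties": {}}
    for node in nodes:
        if node.get("input") == structure_id and node["kind"] == "calculation":
            record["properties"].update(node.get("results", {}))
    return record

graph = [
    {"id": "calc-1", "kind": "calculation", "input": "struct-42",
     "results": {"total_energy_eV": -1021.7}},
    {"id": "calc-2", "kind": "calculation", "input": "struct-42",
     "results": {"band_gap_eV": 1.1}},
]
print(collect_metadata(graph, "struct-42"))
```

The point of the real protocol is that no human assembles this record: every field is recovered from provenance that the workflow engine recorded automatically.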
Sparse cross-products of metadata in scientific simulation management
Managing scientific data is by no means a trivial task even in a single site environment
with a small number of researchers involved. We discuss some issues concerned with posing
well-specified experiments in terms of parameters or instrument settings and the metadata
framework that arises from doing so. We are particularly interested in parallel computer
simulation experiments, where very large quantities of warehouse-able data are involved. We
consider SQL databases and other framework technologies for manipulating experimental data.
Our framework manages the outputs from parallel runs that arise from large cross-products
of parameter combinations. Considerable useful experiment planning and analysis can be done
with the sparse metadata without fully expanding the parameter cross-products. Extra value
can be obtained from simulation output that can subsequently be data-mined. We have
a particular interest in running large-scale Monte-Carlo physics model simulations. Finding
ourselves overwhelmed by the problems of managing data and compute resources, we have
built a prototype tool using Java and MySQL that addresses these issues. We use this example
to discuss type-space management and other fundamental ideas for implementing a laboratory
information management system
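The idea of planning over sparse metadata without expanding the full parameter cross-product can be sketched with a relational store. The schema and parameter names below are assumptions for illustration, not the paper's Java/MySQL prototype:

```python
import itertools
import sqlite3

# Sketch: record only the parameter combinations actually run, rather
# than materialising the full cross-product up front.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (temperature REAL, field REAL, seed INTEGER)")

temperatures = [0.5, 1.0, 2.0]
fields = [0.0, 0.1]
seeds = range(100)

# The full cross-product would be 3 * 2 * 100 = 600 runs; suppose only
# a sparse subset (5 seeds per cell) was executed so far.
executed = [(t, h, s)
            for t, h, s in itertools.product(temperatures, fields, seeds)
            if s < 5]                                   # 30 rows, not 600
db.executemany("INSERT INTO runs VALUES (?, ?, ?)", executed)

# Experiment planning over sparse metadata: which (T, h) cells have runs?
cells = db.execute("SELECT DISTINCT temperature, field FROM runs").fetchall()
print(len(cells))   # 6 populated cells
```

Queries like the last one answer planning questions ("which corners of the design space are still empty?") without ever enumerating the 600-row cross-product.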
Metadata on Biodiversity: Definition and Implementation
SINP (Information system on nature and landscape) and
ECOSCOPE (Observation for research on biodiversity data hub) are two distinct
scientific infrastructures on biodiversity relying on different data sources and
producers. Their main objective is to document and share information on
biodiversity in France. INPN (https://inpn.mnhn.fr) is the reference information
system for data related to nature. It manages and disseminates the reference
data of the "geodiversity and biodiversity" part of the SINP, and delivers the
metadata and data to GBIF (Global Biodiversity Information Facility). For the SINP
and ECOSCOPE projects, working groups composed of scientific organisations have
defined two mutually compatible metadata profiles, also compliant with the INSPIRE
Directive, to describe data on this theme. These profiles are implemented using existing
metadata standards: ISO 19115/19139 (for geographic metadata) for SINP and EML
(Ecological Metadata Language) and ISO 19115/19139 for ECOSCOPE. A mapping
between the two profiles has also been produced, and several thesauri for
keywords together with a classification system for taxonomic identification are
used to ensure interoperability between systems. The profiles are implemented in
web applications for editing and managing data (GeoSource/GeoNetwork for SINP
and an ad hoc application for ECOSCOPE). These applications allow the harvesting
of metadata using OGC/CSW (Catalog Service for the Web) standard. Next steps
will support increased metadata visibility through the automation of
web services
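A metadata profile of this kind is ultimately serialised as structured XML. The fragment below is a drastically simplified, ISO 19115-flavoured record built with the Python standard library; real ISO 19139 documents use the gmd/gco namespaces and many more mandatory elements, and the identifier and field names here are invented:

```python
import xml.etree.ElementTree as ET

# Sketch: a minimal ISO 19115-style metadata record. Real ISO 19139
# XML is namespaced and far richer; this only shows the shape.
def make_record(identifier, title, keywords):
    md = ET.Element("MD_Metadata")
    ET.SubElement(md, "fileIdentifier").text = identifier
    ident = ET.SubElement(md, "identificationInfo")
    ET.SubElement(ident, "title").text = title
    for kw in keywords:
        ET.SubElement(ident, "keyword").text = kw
    return ET.tostring(md, encoding="unicode")

xml = make_record("sinp-0001", "Amphibian occurrences, France",
                  ["biodiversity", "amphibia"])
print(xml)
```

Records in this shape are what a CSW endpoint would expose for harvesting by catalogues such as GeoNetwork.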
Harvesting for disseminating, open archives and role of academic libraries
The scholarly communication system is in a critical stage, due to a number of factors. The Open Access movement is perhaps the most interesting response that the scientific community has tried to give to this problem. The paper examines strengths and weaknesses of the Open Access strategy in general and, more specifically, of the Open Archives Initiative, discussing experiences, criticisms and barriers. All authors who have faced the problems of implementing an OAI-compliant e-print server agree that technical and practical problems are not the most difficult to overcome, and that the real problem is the change in cultural attitude required. In this scenario the university library is possibly the standard bearer for the advent and implementation of e-print archives and Open Archives services. To ensure the successful implementation of this service the library has a number of distinct roles to play
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still need
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions
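The two-part vision (capture metadata automatically at production time, and keep it inside the dataset so metadata-oblivious tools still work) can be illustrated with a toy container. Everything below is an assumption for illustration; the paper targets self-describing formats such as HDF5 or netCDF rather than Python objects:

```python
import json

# Sketch: metadata is captured transparently on the write path and
# stored alongside the data in the same container, so it travels with
# the dataset instead of living in an external catalogue.
class Dataset:
    def __init__(self, name):
        self.name, self.data, self.meta = name, [], {}

    def write(self, values, **annotations):
        self.data.extend(values)
        # rich metadata recorded automatically while data is produced
        self.meta.update(annotations)
        self.meta["n_records"] = len(self.data)

ds = Dataset("plasma_shot_17")
ds.write([0.1, 0.4, 0.9], instrument="interferometer", units="1e19 m^-3")
print(json.dumps(ds.meta, sort_keys=True))
```

A metadata-aware tool can query `ds.meta` through the same handle it uses for the data, while a naive consumer that only reads `ds.data` is unaffected.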
Data Representation Model for Management and Distribution of Scientific Data
Scientific tools and computer simulations enable rapid creation of various types of data, and a number of studies have been conducted on data provenance and web-based data representation models to enhance the distribution, reproduction and reusability of scientific data. Ontology is a knowledge representation model, which is also used as a data and workflow technology for data provenance. In this study, as part of research on managing and distributing scientific data, a metadata and data representation model was defined for the online management and distribution of the Visible Korean data. In addition, the additional metadata required for redistributing the user data created through the Visible Korean study is defined using an ontology-based data representation model, and an RDFa-based web page generation method is proposed to search and extract data from existing web pages. This study makes it possible to manage and distribute online the Visible Korean data, which had previously been managed and distributed offline, and supports a virtuous cycle of redistributing research results as well
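RDFa works by embedding machine-readable properties in ordinary HTML attributes, so a generic crawler can extract metadata from a human-readable page. The fragment generator below is a minimal sketch using schema.org terms; the dataset name, creator and URL are placeholders, not values from the study:

```python
# Sketch: emit an RDFa-annotated HTML fragment describing a dataset.
# Vocabulary terms are from schema.org; the concrete values are
# placeholders for illustration.
def rdfa_fragment(name, creator, url):
    return (
        f'<div vocab="https://schema.org/" typeof="Dataset">'
        f'<span property="name">{name}</span> by '
        f'<span property="creator">{creator}</span>, '
        f'<a property="url" href="{url}">download</a></div>'
    )

html = rdfa_fragment("Visible Korean slice set", "Example Lab",
                     "https://example.org/vk")
print(html)
```

The same page renders normally in a browser, while an RDFa parser recovers the (name, creator, url) triples for search and extraction.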
Extracting, Transforming and Archiving Scientific Data
It is becoming common to archive research datasets that are not only large
but also numerous. In addition, their corresponding metadata and the software
required to analyse or display them need to be archived. Yet the manual
curation of research data can be difficult and expensive, particularly in very
large digital repositories, hence the importance of models and tools for
automating digital curation tasks. The automation of these tasks faces three
major challenges: (1) research data and data sources are highly heterogeneous,
(2) future research needs are difficult to anticipate, (3) data is hard to
index. To address these problems, we propose the Extract, Transform and Archive
(ETA) model for managing and mechanizing the curation of research data.
Specifically, we propose a scalable strategy for addressing the research-data
problem, ranging from the extraction of legacy data to its long-term storage.
We review some existing solutions and propose novel avenues of research. Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
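The Extract, Transform and Archive model is a staged pipeline, and its shape can be sketched in a few lines. The stage names follow the paper's ETA model; the record format, normalisation rule and content-addressed keys below are assumptions added for illustration:

```python
import hashlib
import json

# Sketch of the Extract-Transform-Archive pipeline: pull records from
# a heterogeneous source, normalise them, archive them under an index
# key. Content-addressed keys help with indexing and deduplication.
def extract(source):
    return [line for line in source if line.strip()]

def transform(records):
    return [{"text": r.strip().lower()} for r in records]

def archive(records, store):
    for r in records:
        key = hashlib.sha256(
            json.dumps(r, sort_keys=True).encode()).hexdigest()[:12]
        store[key] = r
    return store

store = archive(transform(extract(["  Alpha\n", "", "Beta\n"])), {})
print(len(store))   # 2 archived records
```

Keeping the three stages separate is what lets the strategy scale: each stage can be mechanised and swapped independently, from legacy-data extraction through to long-term storage.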
A practitioners guide to managing geoscience information
In the UK the Natural Environment Research Council (NERC) manages its scientific data holdings through a series of Environmental Data Centres covering Atmosphere, Bioinformatics, Earth Sciences, Earth Observation, Hydrology, Marine Science and Polar Science. Within the Earth Science sector the National Geoscience Data Centre (NGDC), a component of the British Geological Survey (BGS), is responsible for managing the geosciences data resource. The purpose of the NGDC is to maintain the national geoscience database and to ensure efficient and effective delivery by providing geoscientists with ready access to data and information that is timely, fit for purpose, and in which the user has confidence. The key benefits that NERC derives from this approach are:
- Risk Reduction;
- Increased Productivity; and
- Higher Quality Science.
The paper briefly describes the key benefits of managing geoscientific information effectively and describes how these benefits are realised within the NGDC and BGS