Data Management and Metadata
Are you flooded with files? Drowning in data? Swimming in spreadsheets? Just add metadata! Doing research can generate a lot of data quickly, so quickly that it can easily get out of hand. Well-planned metadata can help you use, organize, search, share, and manage your data effectively! In this session, we'll review some standards and best practices for metadata, and then get some hands-on practice creating and using metadata for different types of data.
The presentation slides are available by clicking the Download button on the right.
Supporting materials for a workshop activity are listed as additional files below and are available for download. The materials include:
1. Instructions about the activity
2. Two datasets in the form of zip files, each of which holds:
   - A description of the dataset
   - Images of library pets in different file formats
   - A sample spreadsheet for entering metadata
3. Two spreadsheets of compiled data (for sets A and B) in a zip file, for use in part 2 of the activity
The datasets are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided that the dataset creator and source are credited, that changes (if any) are clearly indicated, and that users distribute their datasets under the Creative Commons Attribution-ShareAlike 4.0 International License.
The video and audio files of this workshop are listed as additional files below and are available for download.
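For readers who want to try the activity programmatically, here is a minimal sketch of recording per-file metadata in a spreadsheet, as the workshop activity does by hand. It is an illustration only, not part of the workshop materials; the column names, file names, and values are hypothetical (the fields are loosely inspired by Dublin Core).

```python
import csv

# Hypothetical metadata fields; the actual workshop spreadsheets
# may use different columns.
FIELDS = ["filename", "title", "creator", "date_created", "format", "description"]

records = [
    {
        "filename": "pet01.jpg",  # hypothetical file from the dataset
        "title": "Library cat at the reference desk",
        "creator": "Workshop participant",
        "date_created": "2020-01-15",
        "format": "image/jpeg",
        "description": "Tabby cat sitting on the circulation counter.",
    },
]

# Write the records to a CSV file that any spreadsheet software can open.
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
```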
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually, in an ad hoc manner, by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet, a ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset so that it permeates metadata-oblivious processes and can be queried through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling, and neuroscience, and then discuss research challenges and possible solutions.
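To make point (2) concrete, here is a minimal sketch of the general pattern of storing metadata inside the dataset itself, using HDF5 attributes as an example of an established, standardized data access interface. This illustrates the pattern, not the authors' system; the file, dataset, and attribute names are hypothetical.

```python
import numpy as np
import h5py  # HDF5: a widely used, standardized scientific data format

# Hypothetical simulation output: a small density field.
density = np.random.rand(64, 64)

# Write the data and embed descriptive metadata directly in the file,
# so any HDF5-aware tool can read both through the same interface.
with h5py.File("simulation_output.h5", "w") as f:
    dset = f.create_dataset("density", data=density)
    dset.attrs["units"] = "kg/m^3"
    dset.attrs["code_version"] = "v1.2.0"  # provenance: producing code
    dset.attrs["timestep"] = 42            # provenance: simulation state

# Later, an analysis process queries the embedded metadata through the
# standard HDF5 access interface; no separate metadata store is needed.
with h5py.File("simulation_output.h5", "r") as f:
    print(dict(f["density"].attrs.items()))
```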
Towards information profiling: data lake content metadata management
There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently a lack of a systematic approach to such metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
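As a rough illustration of what an informational profile might contain, the sketch below computes a few per-column statistics for a tabular file and stores them as metadata next to the raw data. It is a simplification of the paper's process, with hypothetical file names; the actual framework defines a richer set of profiling activities, including discovering relationships across datasets.

```python
import json
import pandas as pd

# Hypothetical data lake file; in practice the data might come from a
# repository such as OpenML rather than a local path.
df = pd.read_csv("datalake/sales.csv")

# Build a simple informational profile: per-column type, null rate,
# and distinct-value count.
profile = {
    col: {
        "dtype": str(df[col].dtype),
        "null_fraction": float(df[col].isna().mean()),
        "distinct_values": int(df[col].nunique()),
    }
    for col in df.columns
}

# Persist the profile as metadata alongside the raw file.
with open("datalake/sales.profile.json", "w") as f:
    json.dump(profile, f, indent=2)
```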
DAS: a data management system for instrument tests and operations
The Data Access System (DAS) is a metadata and data management software system, providing a reusable solution for the storage of data acquired both from telescopes and auxiliary data sources during the instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing, and quick-look analysis of data acquired from scientific instruments. The DAS provides a data access layer mainly targeted to software applications: quick-look displays, pre-processing pipelines, and scientific workflows. It is logically organized in three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, intended for user-defined data types; an Application Programming Interface (DAS API), automatically adding classes and methods supporting the DDL data types, and providing an object-oriented query language; and a data management component, which maps the metadata of the DDL data types into a relational Database Management System (DBMS) and stores the data in a shared (network) file system.
With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query, and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data-model-specific one in C, C++, and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle, and PostgreSQL.
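The storage split the abstract describes, metadata in a relational DBMS and bulk data in a shared file system, can be sketched in a few lines. The snippet below uses SQLite as a stand-in for the supported DBMSs, with hypothetical table and column names; it illustrates the storage pattern only, not the actual DAS API or DDL.

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("das_data")  # stand-in for the shared network file system
DATA_DIR.mkdir(exist_ok=True)

# Metadata lives in a relational database (SQLite here as a stand-in for
# MySQL/Oracle/PostgreSQL). The table and columns are hypothetical.
conn = sqlite3.connect("das_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS image_frame (
        id INTEGER PRIMARY KEY,
        instrument TEXT,
        exposure_s REAL,
        file_path TEXT
    )
""")

# Store the bulk data as a file; record its metadata and location in the DB.
payload = DATA_DIR / "frame_0001.raw"
payload.write_bytes(b"\x00" * 1024)  # placeholder for real instrument data
conn.execute(
    "INSERT INTO image_frame (instrument, exposure_s, file_path) VALUES (?, ?, ?)",
    ("test_camera", 0.5, str(payload)),
)
conn.commit()

# A query over the metadata locates the data file without touching the data.
row = conn.execute(
    "SELECT file_path FROM image_frame WHERE exposure_s > 0.1"
).fetchone()
print(row[0])
```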
Observation Centric Sensor Data Model
Management of sensor data requires metadata to understand the semantics of observations. While e-science researchers place high demands on metadata, they are selective about entering it. This paper argues for focusing on the essentials, i.e., the actual observations, described by location, time, owner, instrument, and measurement. The applicability of this approach is demonstrated in two very different case studies.
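The observation-centric model reduces to a small record type. Here is a minimal sketch in Python using exactly the five descriptors the abstract names; the field types and example values are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Observation:
    """One sensor observation, described by the five essentials named
    in the abstract. Field types are illustrative assumptions."""
    location: tuple[float, float]  # (latitude, longitude)
    time: datetime                 # when the observation was made
    owner: str                     # who is responsible for the sensor
    instrument: str                # which device produced the value
    measurement: float             # the observed value itself

obs = Observation(
    location=(52.2, 6.9),
    time=datetime(2021, 6, 1, 12, 0),
    owner="field-team-a",
    instrument="thermometer-17",
    measurement=21.4,
)
print(obs)
```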
Metadata
Metadata, or data about data, play a crucial role in the social sciences, ensuring that high-quality documentation and community knowledge are properly captured and surround the data across its entire life cycle, from the early stages of production to secondary analysis by researchers or use by policy makers and other key stakeholders. The paper provides an overview of the social sciences metadata landscape, best practices, and related information technologies. It particularly focuses on two specifications, the Data Documentation Initiative (DDI) and the Statistical Data and Metadata Exchange standard (SDMX), seen as central to a global metadata management framework for social data and official statistics. It also highlights current directions, outlines typical integration challenges, and provides a set of high-level recommendations for producers, archives, researchers, and sponsors in order to foster the adoption of metadata standards and best practices in the years to come.
Keywords: social sciences, metadata, data, statistics, documentation, data quality, XML, DDI, SDMX, archive, preservation, production, access, dissemination, analysis
Database support of detector operation and data analysis in the DEAP-3600 Dark Matter experiment
The DEAP-3600 detector searches for dark matter interactions on a 3.3 tonne liquid argon target. Over nearly a decade, from the start of detector construction through the end of the data analysis phase, well over 200 scientists will have contributed to the project. The DEAP-3600 detector will amass in excess of 900 TB of data representing more than 10 particle interactions, a few of which could be from dark matter. At the same time, metadata exceeding 80 GB will be generated. This metadata is crucial for organizing and interpreting the dark matter search data and contains both structured and unstructured information.
The scale of the data collected, the important role of metadata in interpreting it, the number of people involved, and the long lifetime of the project necessitate an industrialized approach to metadata management.
We describe how the CouchDB and PostgreSQL database systems were integrated into the DEAP detector operation and analysis workflows. This integration provides unified, distributed access to both structured (PostgreSQL) and unstructured (CouchDB) metadata at runtime of the data analysis software. It also supports operational and reporting requirements.