    Data Management and Metadata

    Are you flooded with files? Drowning in data? Swimming in spreadsheets? Just add metadata! Doing research can generate a lot of data quickly, so quickly that it can easily get out of hand. Well-planned metadata can help you use, organize, search, share, and manage your data effectively! In this session, we'll review some standards and best practices for metadata, and then get some hands-on practice creating and using metadata for different types of data. The presentation slides are available by clicking the Download button on the right. Supporting materials for a workshop activity are listed as additional files below and are available for download. The materials include:
    1. Instructions for the activity
    2. Two datasets in the form of zip files, each of which holds: a description of the dataset, images of library pets in different file formats, and a sample spreadsheet for entering metadata
    3. Two spreadsheets of compiled data (for sets A and B) in a zip file, for use in part 2 of the activity
    The datasets are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided that the dataset creator and source are credited, that changes (if any) are clearly indicated, and that users distribute derived datasets under the same license. The video and audio files of this workshop are listed as additional files below and are available for download.
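    As a minimal sketch of the kind of spreadsheet-style metadata the activity involves (the field names and values below are assumptions for illustration, not the workshop's actual template):

        import csv

        # Hypothetical descriptive-metadata fields for image files; the
        # workshop's actual spreadsheet template may use different columns.
        FIELDS = ["filename", "title", "creator", "date_created",
                  "file_format", "description"]

        records = [
            {
                "filename": "pet01.jpg",   # example value, not from the dataset
                "title": "Library pet portrait",
                "creator": "Workshop participant",
                "date_created": "2024-01-15",
                "file_format": "JPEG",
                "description": "Sample image used for metadata practice",
            },
        ]

        # Write one metadata row per file, with a header row of field names.
        with open("metadata.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(records)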

    Towards Exascale Scientific Metadata Management

    Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines capture a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually, in an ad hoc manner, by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet a ubiquitous, domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is conspicuously absent. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed, and (2) stores metadata within each dataset so that it permeates metadata-oblivious processes and can be queried through established, standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling, and neuroscience, and then discuss research challenges and possible solutions.
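    One concrete way to store metadata within a dataset so that it travels through metadata-oblivious processes and remains queryable via a standardized interface is HDF5 attributes. The sketch below is an illustration of that general idea, not the authors' system; the file, dataset, and attribute names are assumptions:

        import h5py
        import numpy as np

        # Attach descriptive metadata directly to a dataset so any HDF5-aware
        # tool can read it through the standard access interface.
        with h5py.File("simulation_output.h5", "w") as f:
            dset = f.create_dataset("temperature", data=np.random.rand(64, 64))
            dset.attrs["units"] = "kelvin"
            dset.attrs["producer"] = "plasma-sim v2.1"   # hypothetical producer tag
            dset.attrs["timestep"] = 1200

        # A downstream consumer that copies or reads the file need not know
        # about the metadata, yet can still query it via the same API.
        with h5py.File("simulation_output.h5", "r") as f:
            for key, value in f["temperature"].attrs.items():
                print(key, "=", value)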

    Towards information profiling: data lake content metadata management

    There is currently a burst of Big Data (BD) being processed and stored in huge raw-data repositories, commonly called Data Lakes (DL). These data require new techniques of data integration and schema alignment in order to make them usable by their consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe the lake's content. However, there is currently no systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for profiling the informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate alternative techniques and the performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
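    A rough sketch of what an informational profile of one data-lake file might look like (the chosen statistics and names are assumptions for illustration, not the paper's actual profile schema):

        import pandas as pd

        def profile_dataframe(df: pd.DataFrame) -> dict:
            """Compute a simple per-column informational profile."""
            profile = {}
            for col in df.columns:
                series = df[col]
                profile[col] = {
                    "dtype": str(series.dtype),
                    "distinct": int(series.nunique()),
                    "nulls": int(series.isna().sum()),
                    "sample": series.dropna().head(3).tolist(),
                }
            return profile

        # The profile would be stored as metadata alongside the raw file, so
        # later consumers can discover content and candidate relationships
        # (e.g., shared keys) without scanning the data itself.
        df = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "a"]})
        print(profile_dataframe(df))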

    DAS: a data management system for instrument tests and operations

    The Data Access System (DAS) is a metadata and data management software system, providing a reusable solution for the storage of data acquired both from telescopes and from auxiliary data sources during instrument development phases and operations. It is part of the Customizable Instrument WorkStation system (CIWS-FW), a framework for the storage, processing, and quick-look analysis of data acquired from scientific instruments. The DAS provides a data access layer mainly targeted at software applications: quick-look displays, pre-processing pipelines, and scientific workflows. It is logically organized in three main components: an intuitive and compact Data Definition Language (DAS DDL) in XML format, aimed at user-defined data types; an Application Programming Interface (DAS API), which automatically adds classes and methods supporting the DDL data types and provides an object-oriented query language; and a data management component, which maps the metadata of the DDL data types into a relational Database Management System (DBMS) and stores the data in a shared (network) file system. With the DAS DDL, developers define the data model for a particular project, specifying for each data type the metadata attributes, the data format and layout (if applicable), and named references to related or aggregated data types. Together with the DDL user-defined data types, the DAS API acts as the only interface to store, query, and retrieve the metadata and data in the DAS system, providing both an abstract interface and a data-model-specific one in C, C++, and Python. The mapping of metadata in the back-end database is automatic and supports several relational DBMSs, including MySQL, Oracle, and PostgreSQL.
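    The general pattern the DAS implements, metadata attributes of a user-defined type kept in a relational DBMS while the bulk data lives on a shared file system, can be sketched roughly as follows. The table layout, column names, and values are assumptions, not the DAS's actual schema, and sqlite3 stands in here for MySQL/Oracle/PostgreSQL:

        import sqlite3

        # Illustrative schema: one metadata row per stored dataset; the bulk
        # payload is referenced by a file path on shared storage.
        conn = sqlite3.connect("das_metadata.db")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS dataset (
                id INTEGER PRIMARY KEY,
                type_name TEXT,     -- DDL-defined data type
                instrument TEXT,
                acquired_at TEXT,
                data_path TEXT      -- payload location on the shared file system
            )
        """)
        conn.execute(
            "INSERT INTO dataset (type_name, instrument, acquired_at, data_path) "
            "VALUES (?, ?, ?, ?)",
            ("RawFrame", "telescope-A", "2014-06-01T12:00:00",
             "/shared/frames/0001.bin"),
        )
        conn.commit()

        # Metadata can be queried without touching the bulk data at all.
        for row in conn.execute("SELECT type_name, data_path FROM dataset"):
            print(row)
        conn.close()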

    Observation Centric Sensor Data Model

    Management of sensor data requires metadata to understand the semantics of observations. While e-science researchers have high demands on metadata, they are selective in entering it. This paper argues for focusing on the essentials: the actual observations, described by location, time, owner, instrument, and measurement. The applicability of this approach is demonstrated in two very different case studies.
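    The five essentials named in the abstract map naturally onto a minimal observation record; this sketch uses assumed field types and example values:

        from dataclasses import dataclass
        from datetime import datetime

        @dataclass
        class Observation:
            """Observation-centric record built from the five essentials:
            location, time, owner, instrument, measurement. Field types are
            assumptions for illustration."""
            location: tuple[float, float]   # (latitude, longitude)
            time: datetime
            owner: str
            instrument: str
            measurement: float

        obs = Observation(
            location=(52.0, 5.1),
            time=datetime(2024, 1, 15, 10, 30),
            owner="field-team-1",
            instrument="thermometer-07",
            measurement=21.4,
        )
        print(obs)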

    Metadata

    Metadata, or data about data, play a crucial role in the social sciences, ensuring that high-quality documentation and community knowledge are properly captured and accompany the data across its entire life cycle, from the early stages of production to secondary analysis by researchers or use by policy makers and other key stakeholders. The paper provides an overview of the social sciences metadata landscape, best practices, and related information technologies. It particularly focuses on two specifications, the Data Documentation Initiative (DDI) and the Statistical Data and Metadata Exchange standard (SDMX), seen as central to a global metadata management framework for social data and official statistics. It also highlights current directions, outlines typical integration challenges, and provides a set of high-level recommendations for producers, archives, researchers, and sponsors in order to foster the adoption of metadata standards and best practices in the years to come.
    Keywords: social sciences, metadata, data, statistics, documentation, data quality, XML, DDI, SDMX, archive, preservation, production, access, dissemination, analysis

    Database support of detector operation and data analysis in the DEAP-3600 Dark Matter experiment

    The DEAP-3600 detector searches for dark matter interactions on a 3.3 tonne liquid argon target. Over nearly a decade, from the start of detector construction through the end of the data analysis phase, well over 200 scientists will have contributed to the project. The DEAP-3600 detector will amass in excess of 900 TB of data representing more than 10^10 particle interactions, a few of which could be from dark matter. At the same time, metadata exceeding 80 GB will be generated. This metadata is crucial for organizing and interpreting the dark matter search data and contains both structured and unstructured information. The scale of the data collected, the important role of metadata in interpreting it, the number of people involved, and the long lifetime of the project necessitate an industrialized approach to metadata management. We describe how the CouchDB and PostgreSQL database systems were integrated into the DEAP detector operation and analysis workflows. This integration provides unified, distributed access to both structured (PostgreSQL) and unstructured (CouchDB) metadata at runtime of the data analysis software. It also supports operational and reporting requirements.
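    A rough sketch of what runtime access to both stores could look like from analysis code. Connection details, database names, table names, and document IDs are hypothetical, and psycopg2 plus CouchDB's HTTP/JSON API stand in for the experiment's actual access layer:

        import psycopg2    # structured metadata in PostgreSQL
        import requests    # CouchDB is queried over its HTTP/JSON API

        # Structured query: e.g., look up a calibration constant for a run.
        pg = psycopg2.connect("dbname=deap_meta user=analysis")  # hypothetical DSN
        with pg.cursor() as cur:
            cur.execute("SELECT value FROM calibration WHERE run_id = %s", (12345,))
            calibration = cur.fetchone()
        pg.close()

        # Unstructured query: fetch a free-form run-log document by its ID.
        resp = requests.get("http://localhost:5984/runlogs/run-12345")  # hypothetical
        run_log = resp.json()

        # Analysis code can now combine both kinds of metadata at runtime.
        print(calibration, run_log.get("shift_comment"))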