    Research Data Management in the Lab

    Research, especially in science, is increasingly data-driven (Hey & Trefethen, 2003). The obvious type of research data is the raw data produced by experiments (by means of sensors and other lab equipment). However, other types of data are highly relevant as well: calibration and configuration settings, analyzed and aggregated data, and data generated by simulations. Today, nearly all of this data is born-digital. Following the recommendations for "good scientific practice", researchers are required to keep their data for a long time; in Germany, the DFG demands 8-10 years for published results (Deutsche Forschungsgemeinschaft, 1998). Ideally, data should not only be kept and made accessible upon request, but be published as well - either as part of the publication proper, or as references to data sets stored in dedicated data repositories. Another emerging trend is data publication journals, e.g. the Earth System Science Data Journal (http://www.earth-system-science-data.net/).

    In contrast to these high-level requirements, many research institutes still lack well-established and structured data management. Extremely data-intensive disciplines like high-energy physics or climate research have built powerful grid infrastructures, which they provide to their respective communities. But for most "small sciences", such complex and highly specialized compute and storage infrastructures are missing and may not even be adequate. Consequently, the burden of setting up a data management infrastructure and of establishing and enforcing data curation policies lies with each institute or university. The ANDS project has shown that this approach is even preferable to a central (e.g., national or discipline-specific) data repository (The ANDS Technical Working Group, 2007). However, delegating the task of proper data curation to the head of a department or working group adds a huge workload to their daily work, and they typically have little training and experience in data acquisition and cataloging. The library has expertise in cataloging and describing textual publications with metadata, but typically lacks the discipline-specific knowledge needed to assess data objects in their semantic meaning and importance. Trying to link raw data with calibration and configuration data at the end of a project is challenging or impossible, even for dedicated "data curators" and for the researchers themselves. Consequently, researchers focus on their (mostly textual) publications and have no established procedures for dealing with data objects after the end of a project or a publication (Helly, Staudigel, & Koppers, 2003).

    This dilemma can be resolved by acquiring and storing the data automatically at the earliest opportunity, i.e. during the course of an experiment. Only at this point in time is all the contextual information available that can be used to generate additional metadata. Deploying a data infrastructure to store and maintain the data in a generic way helps to enforce organization-wide data curation policies. Here, repository systems like Fedora (http://www.fedora-commons.org/) (Lagoze, Payette, Shin, & Wilper, 2005) or eSciDoc (https://www.escidoc.org/) (Dreyer, Bulatovic, Tschida, & Razum, 2007) come into play. However, organization-wide data management has only limited added value for the researcher in the lab. Data acquisition should therefore take place in a non-invasive manner, so that it does not interfere with the established work processes of researchers and thus poses a minimal threshold to the scientist.
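
    To make this concrete, the following minimal sketch illustrates one way such non-invasive acquisition could look: new raw data files are picked up from a watched instrument folder, enriched with contextual metadata that is only available at acquisition time, and staged for ingest into a repository. The folder paths, metadata fields, and the deposit stub are hypothetical placeholders, not taken from Fedora, eSciDoc, or any specific lab setup.

```python
import hashlib
import json
import time
from pathlib import Path

WATCH_DIR = Path("/lab/instrument-01/output")   # hypothetical instrument output folder
STAGING_DIR = Path("/lab/staging")              # hypothetical hand-over area for repository ingest


def contextual_metadata(data_file: Path) -> dict:
    """Collect context that is readily available at acquisition time but hard to reconstruct later."""
    return {
        "filename": data_file.name,
        "acquired": time.strftime("%Y-%m-%dT%H:%M:%S",
                                  time.localtime(data_file.stat().st_mtime)),
        "instrument": "instrument-01",          # placeholder: read from the device configuration
        "operator": "jdoe",                     # placeholder: read from the lab booking system
        "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
    }


def deposit(data_file: Path, metadata: dict) -> None:
    """Stand-in for an ingest call into a repository system such as Fedora or eSciDoc."""
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    (STAGING_DIR / data_file.name).write_bytes(data_file.read_bytes())
    (STAGING_DIR / f"{data_file.name}.meta.json").write_text(json.dumps(metadata, indent=2))


def sweep() -> None:
    """Pick up new raw data files without requiring any action from the researcher."""
    for data_file in sorted(WATCH_DIR.glob("*.dat")):
        deposit(data_file, contextual_metadata(data_file))


if __name__ == "__main__":
    sweep()
```

    Run periodically (e.g. from a scheduler), such a sweep captures data and context without changing how the experiment itself is carried out.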

    Networking Resources for Research and Scientific Education in BW-eLabs

    Design and Implementation of a Research Data Management System: The CRC/TR32 Project Database (TR32DB)

    Research data management (RDM) includes all processes and measures which ensure that research data are well-organised, documented, preserved, stored, backed up, accessible, available, and re-usable. Corresponding RDM systems or repositories form the technical framework that supports the collection, accurate documentation, storage, back-up, sharing, and provision of research data created in a specific environment, such as a research group or institution. The measures required to implement an RDM system vary according to the discipline or the purpose of data (re-)use. In the context of RDM, the documentation of research data is an essential duty. It has to be carried out with accurate, standardized, and interoperable metadata to ensure the interpretability, understandability, shareability, and long-lasting usability of the data. RDM is becoming increasingly important as the amount of digital information grows. New technologies make it possible to create ever more digital data, often automatically, and the volume of digital data, including big data and small data, is expected to roughly double in size every two years. With regard to e-science, this growth of data has been predicted and termed the "data deluge". Furthermore, the paradigm shift in science has led to data-intensive science. In particular, policy makers, funding agencies, journals, and other institutions increasingly demand that publicly funded scientific data be archived, documented, provided, or even made openly accessible. RDM can prevent the loss of data; without it, around 80-90 % of the generated research data disappear and are not available for re-use or further studies, leading to empty archives or RDM systems. The reasons for this are well known and are of a technical, socio-cultural, and ethical nature, such as missing user participation and data-sharing knowledge, as well as a lack of time or resources. In addition, the fear of exploitation and the missing or limited reward for publishing and sharing data play an important role.

    This thesis presents an approach to handling the research data of the collaborative, multidisciplinary, long-term DFG-funded research project Collaborative Research Centre/Transregio 32 (CRC/TR32) “Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation”. In this context, an RDM system, the so-called CRC/TR32 project database (TR32DB), was designed and implemented. The TR32DB considers the demands of the project participants (e.g. heterogeneous data from different disciplines with various file sizes) and the requirements of the DFG, as well as general challenges in RDM. For this purpose, an RDM system was established that comprises a well-described, self-designed metadata schema, a file-based data storage, a well-elaborated database of metadata, and a corresponding user-friendly web interface. The whole system was developed in close cooperation with the local Regional Computing Centre of the University of Cologne (RRZK), where it is also hosted. The documentation of the research data with accurate metadata is of key importance. For this purpose, a specific TR32DB Metadata Schema was designed, consisting of multi-level metadata properties. It distinguishes between general and data-type-specific (e.g. data, publication, report) properties and was developed according to the project background, the demands of the various data types, and recent associated metadata standards and principles.
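
    As an illustration of such a schema, the sketch below models general properties shared by all data types and a data-type-specific extension, with mandatory, optional, and automatically generated fields and a small controlled vocabulary. The property names are illustrative placeholders and do not reproduce the actual TR32DB Metadata Schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Illustrative controlled vocabulary; the real TR32DB lists are project-specific.
DATA_TYPES = {"data", "publication", "report"}


@dataclass
class GeneralMetadata:
    """General properties shared by all data types (field names are illustrative)."""
    title: str                      # mandatory
    creator: str                    # mandatory
    data_type: str                  # mandatory, taken from a controlled vocabulary
    description: Optional[str] = None                        # optional
    upload_date: date = field(default_factory=date.today)    # generated automatically

    def __post_init__(self) -> None:
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")


@dataclass
class DatasetMetadata(GeneralMetadata):
    """Data-type-specific extension for the 'data' type (fields are illustrative)."""
    measurement_site: Optional[str] = None
    temporal_extent: Optional[str] = None
```

    Splitting the schema into a general core and type-specific extensions keeps shared properties consistent across data, publications, and reports while still allowing each type its own descriptive fields.
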
    The schema is thus interoperable with recent metadata standards such as Dublin Core and the DataCite Metadata Schema, as well as with core elements of the ISO 19115:2003 metadata standard and the INSPIRE Directive. Furthermore, the schema supports optional, mandatory, and automatically generated metadata properties, and it provides predefined, obligatory, and self-established controlled vocabulary lists. The integrated mapping to the DataCite Metadata Schema facilitates the straightforward assignment of a Digital Object Identifier (DOI) to a dataset.

    The file-based data storage is organized in a folder system that corresponds to the structure of the CRC/TR32 and additionally distinguishes between several data types (e.g. data, publication, report). It is embedded in the Andrew File System hosted by the RRZK. The file system is capable of storing and backing up all data, is highly scalable, supports location independence, and enables easy administration via Access Control Lists. In addition, the relational database management system MySQL stores the metadata according to the previously mentioned TR32DB Metadata Schema, as well as further necessary administrative data.

    A user-friendly, web-based graphical user interface provides access to the TR32DB system. The web interface supports metadata input, search, and download of data, while the visualization of important geodata is handled by an internal WebGIS. The web interface, like the entire RDM system, is self-developed and adjusted to the specific demands. Overall, the TR32DB system is developed according to the needs and requirements of the CRC/TR32 scientists, fits the demands of the DFG, and also considers general problems and challenges of RDM. With regard to the changing demands of the CRC/TR32 and to technological advances, the system is being and will continue to be further developed. The established TR32DB approach has already been successfully applied to another interdisciplinary research project. Thus, the approach is transferable and generally capable of archiving all data generated by the CRC/TR32 with accurate, interoperable metadata to ensure the re-use of the data beyond the end of the project.
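
    To show what such a mapping can look like, the following sketch converts a minimal internal record into the mandatory DataCite kernel properties needed when requesting a DOI. The internal field names, the publisher string, and the example values are hypothetical, and the actual TR32DB mapping covers far more properties.

```python
from datetime import date
from typing import Optional


def to_datacite(title: str, creator: str, data_type: str, doi: str,
                publication_year: Optional[int] = None) -> dict:
    """Map a minimal internal metadata record to the mandatory DataCite kernel properties."""
    return {
        "identifier": {"identifier": doi, "identifierType": "DOI"},
        "creators": [{"creatorName": creator}],
        "titles": [{"title": title}],
        "publisher": "CRC/TR32 Project Database (TR32DB)",   # illustrative publisher string
        "publicationYear": str(publication_year or date.today().year),
        "resourceType": {"resourceTypeGeneral": "Dataset", "resourceType": data_type},
    }


# Hypothetical example record; 10.5072 is a test DOI prefix, not a real registration.
payload = to_datacite(
    title="Example soil moisture time series",
    creator="Doe, Jane",
    data_type="data",
    doi="10.5072/tr32db-example-0001",
)
```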