
    Measuring FAIR Principles to Inform Fitness for Use

    For open science to flourish, data and any related digital outputs should be discoverable and re-usable by a variety of potential consumers. The recent FAIR Data Principles produced by the Future of Research Communication and e-Scholarship (FORCE11) collective provide a compilation of considerations for making data findable, accessible, interoperable, and re-usable. The principles serve as guideposts to ‘good’ management and stewardship of data and/or metadata. On a conceptual level, the principles codify best practices that data managers and stewards would agree with, that appear in other data quality metrics, and that many already implement. This paper reports on a secondary purpose of the principles: to inform assessment of data’s FAIR-ness or, put another way, data’s fitness for use. Assessment of FAIR-ness likely requires more stratification across data types and among various consumer communities, as how data are found, accessed, interoperated, and re-used differs depending on types and purposes. This paper’s purpose is to present a method for qualitatively measuring the FAIR Data Principles by operationalizing findability, accessibility, interoperability, and re-usability from a re-user’s perspective. The findings may inform assessments that could also be used to develop situationally relevant fitness-for-use frameworks.
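    The paper's instrument is qualitative, but to make the idea of operationalizing FAIR-ness concrete, here is a minimal sketch of how per-principle assessment questions might be tallied into a fitness-for-use score. The question wording, scoring scheme, and dataset identifier are hypothetical illustrations, not the authors' rubric.

```python
# Hypothetical FAIR-ness rubric; the questions, scoring scheme, and dataset
# identifier below are illustrative, not the paper's actual instrument.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FairQuestion:
    principle: str            # "F", "A", "I", or "R"
    text: str
    passed: bool = False

@dataclass
class FairAssessment:
    dataset_id: str           # placeholder identifier
    questions: List[FairQuestion] = field(default_factory=list)

    def score(self) -> Dict[str, float]:
        """Fraction of questions passed, grouped by FAIR principle."""
        totals: Dict[str, int] = {}
        passed: Dict[str, int] = {}
        for q in self.questions:
            totals[q.principle] = totals.get(q.principle, 0) + 1
            passed[q.principle] = passed.get(q.principle, 0) + int(q.passed)
        return {p: passed[p] / totals[p] for p in totals}

assessment = FairAssessment("dataset-0001", [
    FairQuestion("F", "Does the dataset have a globally unique, persistent identifier?", True),
    FairQuestion("A", "Can the dataset be retrieved by that identifier over a standard protocol?", True),
    FairQuestion("I", "Does the metadata use a formal, shared vocabulary?", False),
    FairQuestion("R", "Is the dataset released under a clear usage license?", True),
])
print(assessment.score())     # {'F': 1.0, 'A': 1.0, 'I': 0.0, 'R': 1.0}
```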

    Designated Community: Uncertainty and Risk

    Purpose: This article explores the tension between the concept of a Designated Community as a foundational element in Trustworthy Digital Repository certification and curators’ uncertainty about how to interpret and apply this concept in practice.
    Design/methodology/approach: This research employs a qualitative research design involving in-depth semi-structured interviews with stakeholders in the Trustworthy Digital Repository Audit and Certification process.
    Findings: Our findings indicate that stakeholders in the audit and certification process viewed their uncertainty about how to apply the concept of a Designated Community in the context of an audit as a source of risk for digital repositories and their collections.
    Originality: This article brings new insights to digital preservation by applying social theories of risk to Trustworthy Digital Repository audit and certification processes, with an emphasis on the concept of Designated Community.

    Repositories for Taxonomic Data: Where We Are and What is Missing

    Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term (ideally perpetual) data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasi-facts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach: linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated ≀2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
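    To illustrate the cyberspecimen idea of linking heterogeneous records through a unique specimen identifier, here is a minimal sketch; the identifier format, record types, and URIs are placeholder assumptions, not a published schema.

```python
# Illustrative sketch of specimen-centered data linkage ("cyberspecimen"):
# heterogeneous records become findable through one persistent specimen ID.
# The identifier format, record types, and URIs below are placeholders.
from collections import defaultdict
from typing import Dict, List

class CyberspecimenIndex:
    def __init__(self) -> None:
        self._records: Dict[str, List[dict]] = defaultdict(list)

    def link(self, specimen_id: str, record_type: str, uri: str) -> None:
        """Attach a data record (image, DNA barcode, 3D scan, ...) to a specimen."""
        self._records[specimen_id].append({"type": record_type, "uri": uri})

    def resolve(self, specimen_id: str) -> List[dict]:
        """Return every record linked to a specimen identifier."""
        return self._records.get(specimen_id, [])

index = CyberspecimenIndex()
index.link("SPEC-0001", "image", "https://repo.example/img/0001.tif")       # placeholder URI
index.link("SPEC-0001", "dna_barcode", "https://repo.example/seq/0001.fa")  # placeholder URI
print(index.resolve("SPEC-0001"))
```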

    Site-based data curation: bridging data collection protocols and curatorial processes at scientifically significant sites

    Research conducted at scientifically significant sites produces an abundance of important and highly valuable data. Yet, though sites are logical points for coordinating the curation of these data, their unique needs have been under-supported. Previous studies have shown that two principal stakeholder groups – scientific researchers and local resource managers – both need information that is most effectively collected and curated early in research workflows. However, well-designed site-based data curation interventions are necessary to accomplish this. Additionally, further research is needed to understand and align the data curation needs of researchers and resource managers, and to guide coordination of the data collection protocols used by researchers in the field and the data curation processes applied later by resource managers. This dissertation develops two case studies of research and curation at scientifically significant sites: geobiology at Yellowstone National Park and paleontology at the La Brea Tar Pits. The case studies investigate: What information do different stakeholders value about the natural sites at which they work? How do these values manifest in data collection protocols, curatorial processes, and infrastructures? And how are sometimes conflicting stakeholder priorities mediated through the use and development of shared information infrastructures? The case studies are developed through interviews with researchers and resource managers, as well as participatory methods to collaboratively develop “minimum information frameworks” – high-level models of the information needed by all stakeholders. Approaches from systems analysis are adapted to model data collection and curation workflows, identifying points of curatorial intervention early in the processes of generating and working with data. Additionally, a general information model for site-based data collections is proposed, with three classes of information documenting key aspects of the research project, a site’s structure, and individual specimens and measurements. This research contributes to our understanding of how data from scientifically significant sites can be aggregated, integrated, and reused over the long term, and how both researcher and resource manager needs can be reflected and supported during information modeling, workflow documentation, and the development of data infrastructure policy. It contributes prototypes of minimum information frameworks for both sites, as well as a general model that can serve as the basis for later site-based standards and infrastructure development.
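    The proposed information model groups documentation into three classes: the research project, the site's structure, and individual specimens and measurements. A schematic sketch of such a three-class model follows; the concrete field names are illustrative assumptions, not the dissertation's formal specification.

```python
# Schematic sketch of a three-class information model for site-based data
# collections (project, site structure, specimen/measurement); the field
# names are illustrative assumptions, not the dissertation's specification.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResearchProject:
    """Who collected the data, under what authority, and why."""
    project_id: str
    investigators: str
    permit: str                         # e.g. a park research permit reference

@dataclass
class SiteStructure:
    """Where within the site the data were collected."""
    site_id: str
    locality: str                       # named spring, pit, outcrop, etc.
    parent_site: Optional[str] = None   # allows nested site hierarchies

@dataclass
class SpecimenMeasurement:
    """What was collected or measured, linked to its project and site."""
    specimen_id: str
    project_id: str                     # link to ResearchProject
    site_id: str                        # link to SiteStructure
    measurement: str
    value: float
    unit: str
```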

    Towards an Ecological Trait-data Standard

    Trait-based approaches are widespread throughout ecological research, offering great potential for trait data to deliver general and mechanistic conclusions. Accordingly, a wealth of trait data is available for many organism groups, but, due to a lack of standardisation, these data come in heterogeneous formats. We review current initiatives and infrastructures for standardising trait data and discuss the importance of standardisation for trait data hosted in distributed open-access repositories. In order to facilitate the standardisation and harmonisation of distributed trait datasets, we propose a general and simple vocabulary as well as a simple data structure for storing and sharing ecological trait data. Additionally, we provide an R-package that enables the transformation of any tabular dataset into the proposed format. This also allows trait datasets from heterogeneous sources to be harmonised and merged, thus facilitating data compilation for any particular research focus. With these decentralised tools for trait-data harmonisation, we intend to facilitate the exchange and analysis of trait data within ecological research and enable global syntheses of traits across a wide range of taxa and ecosystems.
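    The paper provides an R package for this transformation; the Python sketch below is only a rough analogue of the underlying idea, melting a wide trait table (one column per trait) into a long, one-measurement-per-row structure. The column names echo the spirit of the proposed vocabulary but should be treated as assumptions here, not the standard's exact terms.

```python
# Rough Python analogue of harmonising a wide trait table into a long,
# one-measurement-per-row format; the paper itself provides an R package
# for this, so this function and its column names are only an illustration.
import pandas as pd

def to_long_trait_table(df: pd.DataFrame, taxon_col: str, trait_cols: dict) -> pd.DataFrame:
    """Melt a wide table into long format; trait_cols maps source columns
    to (trait name, unit) pairs."""
    long = df.melt(id_vars=[taxon_col], value_vars=list(trait_cols),
                   var_name="sourceColumn", value_name="traitValue")
    long["traitName"] = long["sourceColumn"].map(lambda c: trait_cols[c][0])
    long["traitUnit"] = long["sourceColumn"].map(lambda c: trait_cols[c][1])
    return (long.rename(columns={taxon_col: "scientificName"})
                .drop(columns="sourceColumn"))

wide = pd.DataFrame({"species": ["Carabus auratus"],
                     "body_mm": [22.0], "mass_mg": [310.0]})
print(to_long_trait_table(wide, "species",
                          {"body_mm": ("body length", "mm"),
                           "mass_mg": ("body mass", "mg")}))
```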

    Raster Time Series: Learning and Processing

    As the amount of remote sensing data is increasing at a high rate, due to great improvements in sensor technology, efficient processing capabilities are of utmost importance. Remote sensing data from satellites is crucial in many scientific domains, like biodiversity and climate research. Because weather and climate are of particular interest for almost all living organisms on Earth, the efficient classification of clouds is one of the most important problems. Geostationary satellites such as Meteosat Second Generation (MSG) offer the only possibility to generate long-term cloud data sets with high spatial and temporal resolution. This work, therefore, addresses research problems on efficient and parallel processing of MSG data to enable new applications and insights. First, we address the lack of a suitable processing chain to generate a long-term Fog and Low Stratus (FLS) time series. We present an efficient MSG data processing chain that processes multiple tasks simultaneously, and raster data in parallel using the Open Computing Language (OpenCL). The processing chain delivers a uniform FLS classification that combines day and night approaches in a single method. As a result, a full year of FLS rasters can be calculated with relative ease. The second topic presents the application of Convolutional Neural Networks (CNN) for cloud classification. Conventional approaches to cloud detection often only classify single pixels and ignore the fact that clouds are highly dynamic and spatially continuous entities. Therefore, we propose a new method based on deep learning. Using a CNN image segmentation architecture, the presented Cloud Segmentation CNN (CS-CNN) classifies all pixels of a scene simultaneously. We show that CS-CNN is capable of processing multispectral satellite data to identify continuous phenomena such as highly dynamic clouds. The proposed approach provides excellent results on MSG satellite data in terms of quality, robustness, and runtime, in comparison to Random Forest (RF), another widely used machine learning method. Finally, we present the processing of raster time series with a system for Visualization, Transformation, and Analysis (VAT) of spatio-temporal data. It enables data-driven research with explorative workflows and uses time as an integral dimension. The combination of various raster and vector data time series enables new applications and insights. We present an application that combines weather information and aircraft trajectories to identify patterns in bad weather situations.
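    As a rough illustration of per-pixel, all-pixels-at-once cloud classification on multispectral scenes, here is a minimal fully convolutional sketch; the layer sizes, band count, and class count are placeholders and do not reproduce the published CS-CNN architecture.

```python
# Minimal fully convolutional sketch of per-pixel cloud classification on
# multispectral scenes, in the spirit of the CS-CNN described above; layer
# sizes, band count, and class count are placeholders, not the published model.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_bands: int = 11, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_classes, kernel_size=1),   # per-pixel class logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, bands, height, width) -> (batch, classes, height, width)
        return self.net(x)

model = TinySegNet()
scene = torch.randn(1, 11, 64, 64)   # one synthetic multispectral scene
labels = model(scene).argmax(dim=1)  # classify all pixels of the scene at once
print(labels.shape)                  # torch.Size([1, 64, 64])
```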

    Building the knowledge base for environmental action and sustainability
