A provenance-based semantic approach to support understandability, reproducibility, and reuse of scientific experiments
Understandability and reproducibility of scientific results are vital in every field of science. Several reproducibility measures are being taken to make the data used in publications findable and accessible. However, scientists face many challenges from the beginning of an experiment to its end, particularly in data management. The explosive growth of heterogeneous research data, and understanding how this data has been derived, is one of the research problems faced in this context. Interlinking the data, the steps, and the results from the computational and non-computational processes of a scientific experiment is important for reproducibility. We introduce the notion of "end-to-end provenance management" of scientific experiments to help scientists understand and reproduce experimental results. The main contributions of this thesis are: (1) We propose a provenance model, "REPRODUCE-ME", to describe scientific experiments using semantic web technologies by extending existing standards. (2) We study computational reproducibility and the important aspects required to achieve it. (3) Building on the REPRODUCE-ME provenance model and the study of computational reproducibility, we introduce our tool, ProvBook, which is designed and developed to support computational reproducibility. It captures and stores the provenance of Jupyter notebooks and helps scientists compare and track the results of different executions. (4) We provide a framework, CAESAR (CollAborative Environment for Scientific Analysis with Reproducibility), for end-to-end provenance management. This collaborative framework allows scientists to capture, manage, query, and visualize the complete path of a scientific experiment, consisting of computational and non-computational steps, in an interoperable way. We apply our contributions to a set of scientific experiments in microscopy research projects.
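The abstract does not show ProvBook's actual capture mechanism, but the core idea (recording each execution of a notebook cell together with its source, timestamps, and result, so that different runs can be compared) can be sketched in a few lines of plain Python. All names here are illustrative assumptions, including the convention that a cell exposes its output in a variable called `result`:

```python
import hashlib
import time

def run_with_provenance(source, env, log):
    """Execute a code-cell string and append a provenance record.

    The record keeps the cell source, a hash for change detection,
    start/end times, and the value of `result` (an output convention
    assumed here purely for illustration).
    """
    record = {
        "cell_hash": hashlib.sha256(source.encode()).hexdigest(),
        "source": source,
        "started": time.time(),
    }
    exec(source, env)                      # run the cell in a shared namespace
    record["ended"] = time.time()
    record["result"] = env.get("result")   # hypothetical output convention
    log.append(record)
    return record

# Two executions of an edited cell can now be compared via the log.
env, log = {}, []
run_with_provenance("result = 2 + 2", env, log)
run_with_provenance("result = 2 + 3", env, log)
assert log[0]["result"] == 4 and log[1]["result"] == 5
assert log[0]["cell_hash"] != log[1]["cell_hash"]  # source changed between runs
```

In a real tool the log would be serialized alongside the notebook (ProvBook stores it with the `.ipynb` file) rather than kept in memory.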
Document Automation Architectures: Updated Survey in Light of Large Language Models
This paper surveys the current state of the art in document automation (DA).
The objective of DA is to reduce the manual effort during the generation of
documents by automatically creating and integrating input from different
sources and assembling documents conforming to defined templates. There have
been reviews of commercial solutions of DA, particularly in the legal domain,
but to date there has been no comprehensive review of the academic research on
DA architectures and technologies. The current survey of DA reviews the
academic literature and provides a clearer definition and characterization of
DA and its features, identifies state-of-the-art DA architectures and
technologies in academic research, and provides ideas that can lead to new
research opportunities within the DA field in light of recent advances in
generative AI and large language models.

Comment: The current paper is the updated version of an earlier survey on document automation [Ahmadi Achachlouei et al. 2021]. Updates in the current paper are as follows: we shortened almost all sections to reduce the size of the main paper (without references) from 28 pages to 10 pages, added a review of selected papers on large language models, and removed certain sections and most of the diagrams. arXiv admin note: substantial text overlap with arXiv:2109.1160
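The survey's definition of DA (creating and integrating input from different sources, then assembling a document that conforms to a defined template) can be illustrated with a minimal sketch using Python's standard `string.Template`. The field names and source systems below are invented for the example, not taken from the paper:

```python
from string import Template

# A fixed template the assembled document must conform to.
template = Template(
    "CONTRACT\nParty: $party\nEffective date: $date\nGoverning law: $law\n"
)

# Inputs arriving from different (hypothetical) sources.
crm_data = {"party": "Acme Corp"}          # e.g. a customer database
legal_defaults = {"law": "New York"}       # e.g. a clause library
user_input = {"date": "2024-01-01"}        # e.g. a web form

# Integrate the sources (later dicts override earlier defaults) and render.
merged = {**legal_defaults, **crm_data, **user_input}
document = template.substitute(merged)
print(document)
```

Real DA systems replace each piece with something richer (rule engines for clause selection, layout engines for rendering, and, per this survey's update, LLMs for drafting free text), but the integrate-then-instantiate pattern is the same.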
Reusing digital collections from GLAM institutions
For some decades now, Galleries, Libraries, Archives and Museums (GLAM) institutions have published and provided access to information resources in digital format. Recently, innovative approaches have appeared, such as the concept of Labs within GLAM institutions, which facilitates the adoption of innovative and creative tools for content delivery and user engagement. In addition, new methods have been proposed to address the publication of digital collections as data sets amenable to computational use. In this article, we propose a methodology to create machine-actionable collections following a set of steps. This methodology is then applied to several use cases based on data sets published by relevant GLAM institutions. It intends to encourage institutions to adopt the publication of data sets that support computationally driven research as a core activity.

This work has been partially supported by ECLIPSE-UA RTI2018-094283-B-C32 (Spanish Ministry of Education and Science).
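The article's steps are not listed in the abstract, but publishing a collection "as data" typically means pairing the records with machine-readable documentation: a licence, a record count, and an integrity checksum. A minimal sketch of that packaging step, with invented example records and manifest fields, might look like:

```python
import hashlib
import json

# Illustrative catalogue records (field names are assumptions).
records = [
    {"id": "obj-001", "title": "Map of Alicante", "year": 1890},
    {"id": "obj-002", "title": "Harbour photograph", "year": 1923},
]

# Serialize deterministically so the checksum is stable across runs.
dataset = json.dumps(records, ensure_ascii=False, sort_keys=True)

# A small manifest documenting licence, size, and integrity.
manifest = {
    "licence": "CC0-1.0",
    "record_count": len(records),
    "sha256": hashlib.sha256(dataset.encode("utf-8")).hexdigest(),
}
print(json.dumps(manifest, indent=2))
```

The checksum lets downstream researchers verify they are computing over exactly the published version of the collection, which is the kind of guarantee computational reuse depends on.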
Notebook-as-a-VRE (NaaVRE): From private notebooks to a collaborative cloud virtual research environment
Virtual Research Environments (VREs) provide user-centric support in the
lifecycle of research activities, e.g., discovering and accessing research
assets, or composing and executing application workflows. A typical VRE is
often implemented as an integrated environment, which includes a catalog of
research assets, a workflow management system, a data management framework, and
tools for enabling collaboration among users. Notebook environments, such as
Jupyter, allow researchers to rapidly prototype scientific code and share their
experiments as online accessible notebooks. Jupyter can support several popular
languages that are used by data scientists, such as Python, R, and Julia.
However, such notebook environments do not have seamless support for running
heavy computations on remote infrastructure or finding and accessing software
code inside notebooks. This paper investigates the gap between a notebook
environment and a VRE and proposes an embedded VRE solution for the Jupyter
environment called Notebook-as-a-VRE (NaaVRE). The NaaVRE solution provides
functional components via a component marketplace and allows users to create a
customized VRE on top of the Jupyter environment. From the VRE, a user can
search research assets (data, software, and algorithms), compose workflows,
manage the lifecycle of an experiment, and share the results among users in the
community. We demonstrate how such a solution can enhance a legacy workflow
that uses Light Detection and Ranging (LiDAR) data from country-wide airborne
laser scanning surveys for deriving geospatial data products of ecosystem
structure at high resolution over broad spatial extents. This enables users to
scale out the processing of multi-terabyte LiDAR point clouds for ecological
applications to more data sources in a distributed cloud environment.

Comment: A revised version has been published in the journal Software: Practice and Experience.
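NaaVRE's central move is turning notebook cells into reusable workflow components with declared inputs and outputs, which can then be composed and run on remote infrastructure. A toy, in-memory sketch of that component model (all names and the linear-chaining scheme are illustrative, not NaaVRE's actual API) is:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Component:
    """A notebook cell promoted to a named, reusable workflow step."""
    name: str
    func: Callable[[dict], dict]  # maps named inputs to named outputs

def run_workflow(components, inputs):
    """Chain components linearly, threading a shared data dictionary."""
    data = dict(inputs)
    for comp in components:
        data.update(comp.func(data))
    return data

# Two toy components standing in for cells of a LiDAR pipeline.
load = Component("load_points", lambda d: {"points": list(range(d["n"]))})
stats = Component("summarise", lambda d: {"count": len(d["points"])})

result = run_workflow([load, stats], {"n": 5})
assert result["count"] == 5
```

In the real system each component is containerized and scheduled on cloud resources rather than called in-process, which is what allows multi-terabyte point clouds to be processed at scale.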
Reproducibility and Replicability in Unmanned Aircraft Systems and Geographic Information Science
Multiple scientific disciplines face a so-called crisis of reproducibility and replicability (R&R) in which the validity of methodologies is questioned due to an inability to confirm experimental results. Trust in information technology (IT)-intensive workflows within geographic information science (GIScience), remote sensing, and photogrammetry depends on solutions to R&R challenges affecting multiple computationally driven disciplines. To date, there have only been very limited efforts to overcome R&R-related issues in remote sensing workflows in general, let alone those tied to disruptive technologies such as unmanned aircraft systems (UAS) and machine learning (ML). To accelerate an understanding of this crisis, a review was conducted to identify the issues preventing R&R in GIScience. Key barriers included: (1) awareness of time and resource requirements, (2) accessibility of provenance, metadata, and version control, (3) conceptualization of geographic problems, and (4) geographic variability between study areas. As a case study, a replication of a GIScience workflow utilizing YOLOv3 algorithms to identify objects in UAS imagery was attempted. Despite the ability to access source data and workflow steps, it was discovered that the lack of accessibility to provenance and metadata of each small step of the work prohibited the ability to successfully replicate the work. Finally, a novel method for provenance generation was proposed to address these issues. It was found that artificial intelligence (AI) could be used to quickly create robust provenance records for workflows that do not exceed time and resource constraints and provide the information needed to replicate work. Such information can bolster trust in scientific results and provide access to cutting-edge technology that can improve everyday life.
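The paper's own provenance-generation method is not detailed in the abstract, but the kind of record it argues for (which inputs a workflow step used and which outputs it generated, in the spirit of W3C PROV's used / wasGeneratedBy relations) can be sketched by hand. The identifiers below are invented for illustration:

```python
import datetime

def prov_record(activity, used, generated):
    """Return minimal PROV-style statements for one workflow step."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        ("activity", activity, {"endedAtTime": now}),
        ("used", activity, used),                  # activity consumed this input
        ("wasGeneratedBy", generated, activity),   # output traces back to activity
    ]

# Hypothetical step from an object-detection workflow over UAS imagery.
records = prov_record(
    activity="detect_objects_yolov3_run_01",
    used="uas_imagery_tile_0421.tif",
    generated="detections_tile_0421.geojson",
)
for triple in records:
    print(triple)
```

Capturing such statements automatically at every small step is precisely what the case study found missing, and what made the replication attempt fail despite open data and documented workflow steps.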