
    Active provenance for Data-Intensive workflows: engaging users and developers

    We present a practical approach for provenance capture in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream, and it offers control over lineage precision, combining automation with specified adaptations. We address provenance tasks such as the extraction of domain metadata, the injection of custom annotations, accuracy, and the integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution to prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters; this enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system used by computational seismologists, based on an extended provenance model, S-PROV.
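
    To make the idea of programmable Provenance Types concrete, here is a minimal, hypothetical sketch of how a developer might re-define extraction hooks so that domain metadata contextualises the lineage stream. The class and method names (ProvenanceType, extract_metadata, on_write) and the ObsPy-like trace attributes are illustrative assumptions, not the actual S-PROV API.

# Hypothetical sketch of a programmable Provenance Type; all names are
# illustrative assumptions, not the actual S-PROV API.

class ProvenanceType:
    """Base hook points that a workflow component can override."""

    def extract_metadata(self, data):
        # By default, no domain metadata is added to the lineage record.
        return {}

    def on_write(self, record, data):
        # Enrich the provenance record emitted for each output item.
        record.setdefault("metadata", {}).update(self.extract_metadata(data))
        return record


class SeismicTraceProvenance(ProvenanceType):
    """Contextualises lineage with seismic domain metadata."""

    def extract_metadata(self, trace):
        # 'trace' is assumed to expose ObsPy-like stats attributes.
        return {
            "station": trace.stats.station,
            "channel": trace.stats.channel,
            "sampling_rate": trace.stats.sampling_rate,
        }

    In this style, the Provenance Configuration step would simply attribute SeismicTraceProvenance to the relevant workflow components and group them into semantic clusters before execution.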

    Active provenance for data intensive research

    The role of provenance information in data-intensive research is a significant topic of discussion among technical experts and scientists. Typical use cases addressing traceability, versioning and reproducibility of research findings are extended with more interactive scenarios in support, for instance, of computational steering and results management. In this thesis we investigate the impact that lineage records can have on the early phases of the analysis, for instance performed through near-real-time systems and Virtual Research Environments (VREs) tailored to the requirements of a specific community. By positioning provenance at the centre of the computational research cycle, we highlight the importance of having mechanisms on the data scientists' side that, by integrating with the abstractions offered by the processing technologies, such as scientific workflows and data-intensive tools, facilitate the experts' contribution to the lineage at runtime. Ultimately, by encouraging tuning and use of provenance for rapid feedback, the thesis aims to improve the synergy between different user groups to increase productivity and understanding of their processes. We present a model of provenance, called S-PROV, that uses and further extends PROV and ProvONE. The relationships and properties characterising the workflow's abstractions and their concrete executions are re-elaborated to include aspects related to delegation, distribution and steering of stateful streaming operators. The model is supported by the Active framework for tuneable and actionable lineage, which ensures users' engagement by fostering rapid exploitation. Here, concepts such as provenance types, configuration and explicit state management allow users to capture complex provenance scenarios and activate selective controls based on domain and user-defined metadata. We outline how the traces are recorded in a new comprehensive system, called S-ProvFlow, enabling different classes of consumers to explore the provenance data with services and tools for monitoring, in-depth validation and comprehensive visual analytics. The work of this thesis is discussed in the context of an existing computational framework and the experience gained in implementing provenance-aware tools for seismology and climate VREs. It will continue to evolve through newly funded projects, thereby providing generic and user-centred solutions for data-intensive research.
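
    As a point of reference for the lineage records the thesis builds on, below is a minimal sketch of a W3C PROV document built with the Python prov package. S-PROV's additional terms for delegation, distribution and stateful streaming operators are not shown, and the namespace and identifiers are made up for illustration.

# Minimal W3C PROV example using the 'prov' Python package; the S-PROV
# extensions described in the thesis are not modelled here.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/lineage/")  # illustrative namespace

# A workflow operator invocation, its input/output data and the responsible user.
run = doc.activity("ex:cross-correlation-run-42")
raw = doc.entity("ex:raw-traces")
out = doc.entity("ex:correlation-stack")
user = doc.agent("ex:seismologist-alice")

doc.used(run, raw)
doc.wasGeneratedBy(out, run)
doc.wasAssociatedWith(run, user)
doc.wasDerivedFrom(out, raw)

print(doc.get_provn())  # serialise the lineage in PROV-N notation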

    DARE Platform: a Developer-Friendly and Self-Optimising Workflows-as-a-Service Framework for e-Science on the Cloud

    The DARE platform, developed as part of the H2020 DARE project (grant agreement No 777413), enables the seamless development and reusability of scientific workflows and applications, and the reproducibility of experiments. Further, it provides Workflow-as-a-Service (WaaS) functionality and dynamic loading of execution contexts in order to hide technical complexity from its end users. This archive includes v3.5 of the DARE platform.

    Data integration and FAIR data management in Solid Earth Science

    Integrated use of multidisciplinary data is nowadays a recognized trend in scientific research, in particular in the domain of solid Earth science, where the understanding of a physical process is improved and made complete by different types of measurements – for instance, ground acceleration, SAR imaging, crustal deformation – describing a physical phenomenon. FAIR principles are recognized as a means to foster data integration by providing a common set of criteria for building data stewardship systems for Open Science. However, the implementation of FAIR principles raises issues along dimensions such as governance and legal aspects, beyond, of course, the technical one. In the latter, in particular, the development of FAIR data provision systems is often delegated to Research Infrastructures or data providers, with support in terms of metrics and best practices offered by cluster projects or dedicated initiatives. In the current work, we describe the approach to FAIR data management in the European Plate Observing System (EPOS), a distributed research infrastructure in the solid Earth science domain that includes more than 250 individual research infrastructures across 25 countries in Europe. We focus in particular on the technical aspects, while also covering governance, policies and organizational elements, by describing the architecture of the EPOS delivery framework from both the organizational and technical point of view and by outlining the key principles used in the technical design. We describe how a combination of approaches, namely rich metadata and service-based systems design, is required to achieve data integration. We show the system architecture and the basic features of the EPOS data portal, which integrates data from more than 220 services in a FAIR way. The construction of such a portal was driven by the EPOS FAIR data management approach, which, by defining a clear roadmap for compliance with the FAIR principles, produced a number of best practices and technical solutions. This work, which spans more than a decade but concentrates its key efforts in the last five years with the EPOS Implementation Phase project and the establishment of EPOS-ERIC, was carried out in synergy with other EU initiatives dealing with FAIR data. On the basis of the EPOS experience, future directions are outlined, emphasizing the need to provide i) FAIR reference architectures that make it easier for data practitioners and engineers from the domain communities to adopt FAIR principles and build FAIR data systems; ii) a FAIR data management framework addressing FAIR through the entire data lifecycle, including reproducibility and provenance; and iii) the extension of the FAIR principles to the policy and governance dimensions.
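
    To illustrate the "rich metadata" half of that combination, the snippet below sketches how a single dataset and its web service distribution could be described with DCAT terms using rdflib. The URIs, chosen properties and values are assumptions for illustration only and do not reproduce the EPOS-DCAT-AP profile or the actual EPOS catalogue contents.

# Illustrative DCAT-style metadata record built with rdflib; URIs and
# endpoints are invented placeholders, not EPOS resources.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()

ds = URIRef("https://example.org/epos/dataset/gnss-positions")  # assumed URI
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("GNSS daily position time series")))
g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))

dist = URIRef("https://example.org/epos/dist/gnss-positions-api")
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.accessURL, URIRef("https://example.org/api/gnss")))  # assumed endpoint
g.add((ds, DCAT.distribution, dist))

print(g.serialize(format="turtle"))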

    Comprehensible Control for Researchers and Developers facing Data Challenges

    The DARE platform enables researchers and their developers to exploit more capabilities to handle complexity and scale in data, computation and collaboration. Today's challenges pose increasing and urgent demands for this combination of capabilities. To meet technical, economic and governance constraints, application communities must use shared digital infrastructure, principally via virtualisation and mapping. This requires precise abstractions that retain their meaning while their implementations and infrastructures change. Giving specialists direct control over these capabilities, with detail relevant to each discipline, is necessary for adoption. Research agility, improved power and retained return on intellectual investment incentivise that adoption. We report on an architecture for establishing and sustaining the necessary optimised mappings, and on early evaluations of its feasibility with two application communities.
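
    As a purely conceptual illustration of "precise abstractions whose implementations can change", the sketch below registers alternative concrete mappings behind one abstract operation. It is not DARE code; all names are invented.

# Conceptual sketch: one abstract operation, several concrete mappings that
# can be swapped per target infrastructure. Names are invented for illustration.
from typing import Callable, Dict

MAPPINGS: Dict[str, Callable[[str], None]] = {}

def mapping(target: str):
    """Register a concrete implementation for an execution target."""
    def register(fn: Callable[[str], None]):
        MAPPINGS[target] = fn
        return fn
    return register

@mapping("local")
def run_local(workflow: str):
    print(f"running {workflow} sequentially for testing")

@mapping("cluster")
def run_cluster(workflow: str):
    print(f"submitting {workflow} to an MPI cluster")

def execute(workflow: str, target: str):
    # The abstraction ("execute a workflow") keeps its meaning while the
    # implementation is chosen per infrastructure.
    MAPPINGS[target](workflow)

execute("cross_correlation", "local")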

    Methodology to sustain common information spaces for research collaborations

    Information and knowledge sharing collaborations are essential for scientific research and innovation. They provide opportunities to pool expertise and resources, and they are required to draw on today's wealth of data to address pressing societal challenges. Establishing effective collaborations depends on the alignment of intellectual and technical capital. In this thesis we investigate implications and influences of socio-technical aspects of research collaborations to identify methods of facilitating their formation and sustained success. We draw on our experience acquired in an international federated seismological context, and in a large research infrastructure for solid-Earth sciences. We recognise the centrality of the users and propose a strategy to sustain their engagement as actors participating in the collaboration. Our approach promotes and enables their active contribution to the construction and maintenance of Common Information Spaces (CISs). These are shaped by conceptual agreements that are captured and maintained to facilitate mutual understanding and to underpin their collaborative work. A user-driven approach shapes the evolution of a CIS based on the requirements of the communities involved in the collaboration. Active user engagement is pursued by partitioning concerns and by targeting their interests. For instance, application domain experts focus on scientific and conceptual aspects; data and information experts address knowledge representation issues; and architects and engineers build the infrastructure that populates the common space. We introduce a methodology to sustain CISs and a conceptual framework that has its foundations in a set of agreed Core Concepts forming a Canonical Core (CC). A representation of such a CC is also introduced that leverages and promotes reuse of existing standards: EPOS-DCAT-AP. The application of our methodology shows promising results, with good uptake and adoption by the targeted communities. This encourages us to continue applying and evaluating such a strategy in the future.

    dispel4py: An Open-Source Python library for Data-Intensive Seismology

    Scientific workflows are a necessary tool for many scientific communities, as they enable easy composition and execution of applications on computing resources while scientists can focus on their research without being distracted by computation management. Nowadays, scientific communities (e.g. Seismology) have access to a large variety of computing resources, and their computational problems are best addressed using parallel computing technology. However, successful use of these technologies requires a lot of additional machinery whose use is not straightforward for non-experts: different parallel frameworks (MPI, Storm, multiprocessing, etc.) must be used depending on the computing resources (local machines, grids, clouds, clusters) where applications are run. This implies that, to achieve the best application performance, users usually have to change their codes depending on the features of the platform selected for running them. This work presents dispel4py, a new open-source Python library for describing abstract stream-based workflows for distributed data-intensive applications. Special care has been taken to provide dispel4py with the ability to map abstract workflows to different platforms dynamically at run-time. Currently dispel4py has four mappings: Apache Storm, MPI, multiprocessing and sequential. The main goal of dispel4py is to provide an easy-to-use tool for developing and testing workflows on local resources, using the sequential mode with a small dataset. Later, once a workflow is ready for long runs, it can be automatically executed on different parallel resources; dispel4py takes care of the underlying mappings by performing an efficient parallelisation. Processing Elements (PEs) represent the basic computational activities of any dispel4py workflow; a PE can be a seismological algorithm or a data transformation process. To create a dispel4py workflow, users only have to write very few lines of code to describe their PEs and how they are connected, using Python, which is widely supported on many platforms and is popular in many scientific domains, such as the geosciences. Once a dispel4py workflow is written, a user only has to select which mapping they would like to use, and everything else (parallelisation, distribution of data) is carried out by dispel4py without any cost to the user. Among all dispel4py features we would like to highlight the following:
    * The PEs are connected by streams, not by writing to and reading from intermediate files, avoiding many IO operations.
    * The PEs can be stored in a registry, so different users can recombine PEs in many different workflows.
    * dispel4py has been enriched with a provenance mechanism to support runtime provenance analysis. We have adopted the W3C PROV data model, which is accessible via a prototype browser-based user interface and a web API. It supports users with the visualisation of graphical products and offers combined operations to access and download the data, which may be selectively stored at runtime into dedicated data archives.
    dispel4py has already been used by seismologists in the VERCE project to develop different seismic workflows. One of them is the Seismic Ambient Noise Cross-Correlation workflow, which preprocesses and cross-correlates traces from several stations. First, this workflow was tested on a local machine using a small number of stations as input data. Later, it was executed on different parallel platforms (the SuperMUC cluster and the Terracorrelator machine), automatically scaling up by using the MPI and multiprocessing mappings with up to 1000 stations as input data. The results show that dispel4py achieves scalable performance with both mappings tested on different parallel platforms.
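
    A minimal sketch of what such a workflow looks like in practice is given below. It follows the PE-and-graph style described above; the exact class, method and command names reflect dispel4py's documented usage as we understand it and should be checked against the installed version.

# Sketch of a tiny dispel4py stream workflow: a producer PE connected to a
# transformation PE. Imports and CLI flags are stated as assumptions.
from dispel4py.core import GenericPE
from dispel4py.workflow_graph import WorkflowGraph


class NumberProducer(GenericPE):
    """Root PE: emits a small stream of integers."""

    def __init__(self):
        GenericPE.__init__(self)
        self._add_output("output")

    def _process(self, inputs):
        for i in range(10):
            self.write("output", i)


class SquarePE(GenericPE):
    """Consumes the stream and squares each item."""

    def __init__(self):
        GenericPE.__init__(self)
        self._add_input("input")
        self._add_output("output")

    def _process(self, inputs):
        self.write("output", inputs["input"] ** 2)


producer = NumberProducer()
square = SquarePE()

graph = WorkflowGraph()
graph.connect(producer, "output", square, "input")

# The same abstract graph can then be run with different mappings, e.g.:
#   dispel4py simple this_module -i 1    (sequential test run)
#   dispel4py multi  this_module -n 4    (multiprocessing)
#   dispel4py mpi    this_module         (MPI cluster)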

    DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud

    The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained, Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base comprising multiple internal catalogues, registries and semantics, while the platform supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform, as well as directions for future development.
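
    To give a feel for the Workflow-as-a-Service usage pattern described here, the snippet below sketches how a client might register and run a dispel4py workflow through a REST API. Every endpoint path, payload field and the token handling are hypothetical placeholders, not the DARE platform's actual API.

# Purely illustrative Workflow-as-a-Service client sketch; the base URL,
# endpoints and payloads are invented, not the DARE platform's real API.
import requests

BASE = "https://dare.example.org/api"                 # hypothetical deployment
token = {"Authorization": "Bearer <access-token>"}    # placeholder credential

# Register the workflow source in the platform's catalogue.
with open("cross_correlation.py") as f:
    requests.post(f"{BASE}/workflows", headers=token,
                  json={"name": "cross-correlation", "source": f.read()})

# Launch an execution with run-time parameters; provenance is assumed to be
# captured server-side and queried afterwards.
run = requests.post(f"{BASE}/workflows/cross-correlation/runs",
                    headers=token, json={"stations": ["AQU", "CAMP"]})
print(run.json())                                     # e.g. run id and status URL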
