16 research outputs found
VERCE delivers a productive e-Science environment for seismology research
The VERCE project has pioneered an e-Infrastructure to support researchers
using established simulation codes on high-performance computers in conjunction
with multiple sources of observational data. This is accessed and organised via
the VERCE science gateway that makes it convenient for seismologists to use
these resources from any location via the Internet. Their data handling is made
flexible and scalable by two Python libraries, ObsPy and dispel4py and by data
services delivered by ORFEUS and EUDAT. Provenance-driven tools enable rapid
exploration of results and of the relationships between data, which accelerates
understanding and method improvement. These powerful facilities are integrated
and draw on many other e-Infrastructures. This paper presents the motivation
for building such systems, reviews how solid-Earth scientists can make
significant research progress using them, and explains the architecture and
mechanisms that make their construction and operation achievable. We conclude
with a summary of the achievements to date and identify the crucial steps
needed to extend the capabilities for seismologists, for solid-Earth scientists
and for similar disciplines.
Comment: 14 pages, 3 figures. Pre-publication version of the paper accepted and
published at the IEEE eScience 2015 conference in Munich, with substantial
additions, particularly in the analysis of issues.
DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud
The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained, Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base, comprising multiple internal catalogues, registries and semantics, while it supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform and provides directions for future development. Published: San Diego (CA, USA).
Comprehensible Control for Researchers and Developers facing Data Challenges
The DARE platform enables researchers and their developers to exploit more capabilities to handle complexity and scale in data, computation and collaboration. Today’s challenges pose increasing and urgent demands for this combination of capabilities. To meet technical, economic and governance constraints, application communities must use shared digital infrastructure, principally via virtualisation and mapping. This requires precise abstractions that retain their meaning while their implementations and infrastructures change. Giving specialists direct control over these capabilities, with detail relevant to each discipline, is necessary for adoption. Research agility, improved power and retained return on intellectual investment incentivise that adoption. We report on an architecture for establishing and sustaining the necessary optimised mappings, and early evaluations of its feasibility with two application communities. Published: San Diego (CA, USA).
Dispel4Py: A Python Framework for Data-intensive eScience
We present dispel4py, a novel data-intensive and high-performance computing middleware provided as a standard Python library for describing stream-based workflows. It allows its users to develop their scientific applications locally and then run them on a wide range of HPC infrastructures without any changes to the code. Moreover, it provides automated and efficient parallel mappings to MPI, multiprocessing, Storm and Spark frameworks, commonly used in big data applications. It builds on the wide availability of Python in many environments and only requires familiarity with basic Python syntax. We show dispel4py's advantages by walking through an example, and conclude by demonstrating how dispel4py can be employed as an easy-to-use tool for designing scientific applications using real-world scenarios.
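The stream-based composition this abstract describes can be illustrated with a minimal, framework-free sketch. The names here (StreamPE, Doubler, Squarer, Pipeline) are illustrative inventions for this example, not dispel4py's actual API:

```python
# Minimal sketch of stream-based workflow composition, loosely inspired by
# dispel4py's processing-element (PE) model. Class names are illustrative,
# not part of the real dispel4py library.

class StreamPE:
    """A processing element: transforms each item of an input stream."""
    def process(self, item):
        raise NotImplementedError

class Doubler(StreamPE):
    def process(self, item):
        return item * 2

class Squarer(StreamPE):
    def process(self, item):
        return item ** 2

class Pipeline:
    """Connects PEs in sequence; each PE consumes the previous PE's output."""
    def __init__(self, *pes):
        self.pes = pes

    def run(self, stream):
        for item in stream:
            for pe in self.pes:
                item = pe.process(item)
            yield item

# Compose once, then feed any stream through the abstract workflow.
results = list(Pipeline(Doubler(), Squarer()).run([1, 2, 3]))
# each input x becomes (2*x) ** 2
```

The point of the abstraction is that the workflow definition (the PEs and their connections) is independent of how the stream is enacted, which is what lets a real framework swap in MPI or Storm behind the same graph.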
Active provenance for data intensive research
The role of provenance information in data-intensive research is a significant topic of
discussion among technical experts and scientists. Typical use cases addressing traceability,
versioning and reproducibility of the research findings are extended with more
interactive scenarios in support, for instance, of computational steering and results
management. In this thesis we investigate the impact that lineage records can have on
the early phases of the analysis, for instance performed through near-real-time systems
and Virtual Research Environments (VREs) tailored to the requirements of a specific
community. By positioning provenance at the centre of the computational research
cycle, we highlight the importance of having mechanisms on the data scientists'
side that integrate with the abstractions offered by processing technologies,
such as scientific workflows and data-intensive tools, and facilitate the experts' contribution to
the lineage at runtime. Ultimately, by encouraging tuning and use of provenance for
rapid feedback, the thesis aims at improving the synergy between different user groups
to increase productivity and understanding of their processes.
We present a model of provenance, called S-PROV, that uses and further extends
PROV and ProvONE. The relationships and properties characterising the workflow’s
abstractions and their concrete executions are re-elaborated to include aspects related
to delegation, distribution and steering of stateful streaming operators. The model is
supported by the Active framework for tuneable and actionable lineage ensuring the
user’s engagement by fostering rapid exploitation. Here, concepts such as provenance
types, configuration and explicit state management allow users to capture complex
provenance scenarios and activate selective controls based on domain and user-defined
metadata. We outline how the traces are recorded in a new comprehensive system,
called S-ProvFlow, enabling different classes of consumers to explore the provenance
data with services and tools for monitoring, in-depth validation and comprehensive
visual-analytics. The work of this thesis will be discussed in the context of an existing
computational framework and the experience matured in implementing provenance-aware
tools for seismology and climate VREs. It will continue to evolve through
newly funded projects, thereby providing generic and user-centred solutions for data-intensive
research.
WFCatalog: a catalogue for seismological waveform data
This paper reports advances in seismic waveform description and discovery leading to a new seismological service, and presents the key steps in its design, implementation and adoption. This service, named WFCatalog, which stands for waveform catalogue, accommodates features of seismological waveform data. It therefore meets seismologists' need to select waveform data based on seismic waveform features as well as sensor geolocations and temporal specifications. We describe the collaborative design methods and the technical solution, showing the central role of seismic feature catalogues in framing the technical and operational delivery of the new service. We also provide an overview of the complex environment in which this endeavour is scoped and discuss the related challenges. As multi-disciplinary, multi-organisational and global collaboration is necessary to address today's challenges, canonical representations can provide a focus for collaboration and conceptual tools for agreeing directions. Such collaborations can be fostered and formalised by rallying intellectual effort into the design of novel scientific catalogues and the services that support them. This work offers an example of the benefits generated by involving cross-disciplinary skills (e.g. data and domain expertise) from the early stages of design, and by sustaining engagement with the target community throughout the delivery and deployment process.
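The feature-based selection WFCatalog enables can be sketched as a simple filter over waveform metadata records. The field names, stations and thresholds below are made up for illustration; they are not the actual WFCatalog schema:

```python
# Illustrative sketch of feature-based waveform selection, the kind of query
# a waveform catalogue enables. Record fields and values are hypothetical.
from datetime import datetime

records = [
    {"station": "STA1", "start": datetime(2015, 3, 1), "sample_rate": 100.0,
     "percent_availability": 99.2, "num_gaps": 0},
    {"station": "STA2", "start": datetime(2015, 3, 1), "sample_rate": 100.0,
     "percent_availability": 71.5, "num_gaps": 12},
]

def select(records, min_availability, max_gaps):
    """Select waveforms by quality features, not just by name and time window."""
    return [r for r in records
            if r["percent_availability"] >= min_availability
            and r["num_gaps"] <= max_gaps]

good = select(records, min_availability=95.0, max_gaps=2)
```

The design point is that quality features are computed once and catalogued, so selection happens against the catalogue rather than by downloading and inspecting the waveforms themselves.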
<i>Active</i> provenance for Data-Intensive workflows: engaging users and developers
We present a practical approach for provenance capturing in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream. It offers control over lineage precision, combining automation with specified adaptations. We address provenance tasks such as extraction of domain metadata, injection of custom annotations, accuracy, and integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution to prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV. Published: San Diego (CA, USA).
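The core idea of recording lineage enriched with injected domain metadata can be sketched as a decorator that traces each component invocation. This is a hypothetical illustration of the pattern, not the S-PROV or Active framework API; all names are invented for the example:

```python
# Hypothetical sketch of provenance capture with domain contextualisation:
# a wrapper records a lineage entry, enriched with user-supplied domain
# metadata, for every item a component processes.

lineage = []  # in a real system this would stream to a provenance store

def with_provenance(component_name, extract_metadata):
    """Attach lineage capture to a processing function.

    extract_metadata plays the role of a 'provenance type': it injects
    domain-specific annotations into each lineage record.
    """
    def wrap(fn):
        def traced(item):
            result = fn(item)
            lineage.append({
                "component": component_name,
                "input": item,
                "output": result,
                "domain": extract_metadata(result),  # domain contextualisation
            })
            return result
        return traced
    return wrap

@with_provenance("amplitude_filter",
                 lambda r: {"units": "m/s", "kept": r is not None})
def amplitude_filter(sample):
    """Keep only samples whose magnitude exceeds a threshold."""
    return sample if abs(sample) > 0.5 else None

for s in [0.2, 0.9, -1.1]:
    amplitude_filter(s)
```

Selective controls, as described in the abstract, would then amount to filtering or grouping these records by the domain metadata rather than by workflow structure alone.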
dispel4py: A Python framework for data-intensive scientific computing
This paper presents dispel4py, a new Python framework for describing abstract stream-based workflows for distributed data-intensive applications. These combine the familiarity of Python programming with the scalability of workflows. Data streaming is used to gain performance, rapid prototyping and applicability to live observations. dispel4py enables scientists to focus on their scientific goals, avoiding distracting details and retaining flexibility over the computing infrastructure they use. The implementation, therefore, has to map dispel4py abstract workflows optimally onto target platforms chosen dynamically. We present four dispel4py mappings: Apache Storm, message-passing interface (MPI), multi-threading and sequential, showing two major benefits: a) smooth transitions from local development on a laptop to scalable execution for production work, and b) scalable enactment on significantly different distributed computing infrastructures. Three application domains are reported and measurements on multiple infrastructures show the optimisations achieved; they have provided demanding real applications and helped us develop effective training. dispel4py.org is an open-source project to which we invite participation. The effective mapping of dispel4py onto multiple target infrastructures demonstrates exploitation of data-intensive and high-performance computing (HPC) architectures and consistent scalability.
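The "one abstract workflow, many mappings" benefit described here can be sketched in a few lines: the same workflow definition is enacted either sequentially or in parallel, and produces identical results. The function names and the `mapping` switch are illustrative, not dispel4py's interface (whose real mappings, per the abstract, include Storm, MPI, multi-threading and sequential):

```python
# Sketch of enacting one abstract workflow under different target mappings.
# 'sequential' and 'threads' stand in for dispel4py's richer mapping set.
from concurrent.futures import ThreadPoolExecutor

def workflow(item):
    """The abstract workflow: a pure per-item transformation."""
    return (item + 1) * 10

def run(items, mapping="sequential"):
    if mapping == "sequential":          # local development on a laptop
        return [workflow(x) for x in items]
    if mapping == "threads":             # parallel enactment, same definition
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(workflow, items))
    raise ValueError(f"unknown mapping: {mapping}")

# The workflow definition never changes; only the mapping does.
assert run([1, 2, 3], "sequential") == run([1, 2, 3], "threads")
```

Because the transformation is pure and the mapping is chosen at enactment time, moving from laptop to cluster requires no change to the workflow code, which is exactly the transition the paper reports.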
Scientific Workflows: Moving Across Paradigms
Modern scientific collaborations have opened up the opportunity to solve complex problems that require both multidisciplinary expertise and large-scale computational experiments. These experiments typically consist of a sequence of processing steps that need to be executed on selected computing platforms. Execution poses a challenge, however, due to (1) the complexity and diversity of applications, (2) the diversity of analysis goals, (3) the heterogeneity of computing platforms, and (4) the volume and distribution of data. A common strategy to make these in silico experiments more manageable is to model them as workflows and to use a workflow management system to organize their execution. This article looks at the overall challenge posed by a new order of scientific experiments and the systems they need to be run on, and examines how this challenge can be addressed by workflows and workflow management systems. It proposes a taxonomy of workflow management system (WMS) characteristics, including aspects previously overlooked. This frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies research needed to maintain progress in this area
Methodology to sustain common information spaces for research collaborations
Information and knowledge sharing collaborations are essential for scientific research
and innovation. They provide opportunities to pool expertise and resources. They are
required to draw on today’s wealth of data to address pressing societal challenges.
Establishing effective collaborations depends on the alignment of intellectual and
technical capital.
In this thesis we investigate implications and influences of socio-technical aspects
of research collaborations to identify methods of facilitating their formation and
sustained success. We draw on our experience acquired in an international federated
seismological context, and in a large research infrastructure for solid-Earth sciences.
We recognise the centrality of the users and propose a strategy to sustain their
engagement as actors participating in the collaboration. Our approach promotes and
enables their active contribution in the construction and maintenance of Common
Information Spaces (CISs). These are shaped by conceptual agreements that are
captured and maintained to facilitate mutual understanding and to underpin their
collaborative work.
A user-driven approach shapes the evolution of a CIS based on the requirements of
the communities involved in the collaboration. Active users’ engagement is pursued by
partitioning concerns and by targeting their interests. For instance, application domain
experts focus on scientific and conceptual aspects; data and information experts address
knowledge representation issues; and architects and engineers build the infrastructure
that populates the common space.
We introduce a methodology to sustain CIS and a conceptual framework that has
its foundations on a set of agreed Core Concepts forming a Canonical Core (CC). A
representation of such a CC is also introduced that leverages and promotes reuse of
existing standards: EPOS-DCAT-AP.
The application of our methodology shows promising results with a good uptake
and adoption by the targeted communities. This encourages us to continue applying
and evaluating such a strategy in the future.