16 research outputs found
VERCE delivers a productive e-Science environment for seismology research
The VERCE project has pioneered an e-Infrastructure to support researchers
using established simulation codes on high-performance computers in conjunction
with multiple sources of observational data. This is accessed and organised via
the VERCE science gateway that makes it convenient for seismologists to use
these resources from any location via the Internet. Their data handling is made
flexible and scalable by two Python libraries, ObsPy and dispel4py and by data
services delivered by ORFEUS and EUDAT. Provenance-driven tools enable rapid
exploration of results and of the relationships between data, which accelerates
understanding and method improvement. These powerful facilities are integrated
and draw on many other e-Infrastructures. This paper presents the motivation
for building such systems, reviews how solid-Earth scientists can make
significant research progress using them, and explains the architecture and
mechanisms that make their construction and operation achievable. We conclude
with a summary of the achievements to date and identify the crucial steps
needed to extend the capabilities for seismologists, for solid-Earth scientists
and for similar disciplines.
Comment: 14 pages, 3 figures. Pre-publication version of the paper accepted and
published at the IEEE eScience 2015 conference in Munich, with substantial
additions, particularly in the analysis of issues.
DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud
The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained, Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base, comprising multiple internal catalogues, registries and semantics, while it supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform and provides directions for future development. Published: San Diego (CA, USA).
Comprehensible Control for Researchers and Developers facing Data Challenges
The DARE platform enables researchers and their developers to exploit more capabilities to handle complexity and scale in data, computation and collaboration. Today’s challenges pose increasing and urgent demands for this combination of capabilities. To meet technical, economic and governance constraints, application communities must use shared digital infrastructure, principally via virtualisation and mapping. This requires precise abstractions that retain their meaning while their implementations and infrastructures change. Giving specialists direct control over these capabilities, with detail relevant to each discipline, is necessary for adoption. Research agility, improved power and retained return on intellectual investment incentivise that adoption. We report on an architecture for establishing and sustaining the necessary optimised mappings, and early evaluations of its feasibility with two application communities. Published: San Diego (CA, USA).
Dispel4Py: A Python Framework for Data-intensive eScience
We present dispel4py, a novel data-intensive and high-performance computing middleware provided as a standard Python library for describing stream-based workflows. It allows its users to develop their scientific applications locally and then run them on a wide range of HPC infrastructures without any changes to the code. Moreover, it provides automated and efficient parallel mappings to MPI, multiprocessing, Storm and Spark frameworks, commonly used in big data applications. It builds on the wide availability of Python in many environments and only requires familiarity with basic Python syntax. We show dispel4py's advantages by walking through an example, and conclude by demonstrating how dispel4py can be employed as an easy-to-use tool for designing scientific applications using real-world scenarios.
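The stream-based composition this abstract describes can be illustrated with a minimal, framework-free sketch. The names here (StreamPE, Doubler, Squarer, Pipeline) are illustrative inventions for this example, not dispel4py's actual API:

```python
# Minimal sketch of stream-based workflow composition, loosely inspired by
# dispel4py's processing-element (PE) model. Class names are illustrative,
# not part of the real dispel4py library.

class StreamPE:
    """A processing element: transforms each item of an input stream."""
    def process(self, item):
        raise NotImplementedError

class Doubler(StreamPE):
    def process(self, item):
        return item * 2

class Squarer(StreamPE):
    def process(self, item):
        return item ** 2

class Pipeline:
    """Connects PEs in sequence; each PE consumes the previous PE's output."""
    def __init__(self, *pes):
        self.pes = pes

    def run(self, stream):
        for item in stream:
            for pe in self.pes:
                item = pe.process(item)
            yield item

# Compose once, then feed any stream through the abstract workflow.
results = list(Pipeline(Doubler(), Squarer()).run([1, 2, 3]))
# each input x becomes (2*x) ** 2
```

The point of the abstraction is that the workflow definition (the PEs and their connections) is independent of how the stream is enacted, which is what lets a real framework swap in MPI or Storm behind the same graph.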
Active provenance for data intensive research
The role of provenance information in data-intensive research is a significant topic of
discussion among technical experts and scientists. Typical use cases addressing traceability,
versioning and reproducibility of the research findings are extended with more
interactive scenarios in support, for instance, of computational steering and results
management. In this thesis we investigate the impact that lineage records can have on
the early phases of the analysis, for instance performed through near-real-time systems
and Virtual Research Environments (VREs) tailored to the requirements of a specific
community. By positioning provenance at the centre of the computational research
cycle, we highlight the importance of having mechanisms on the data scientists'
side that integrate with the abstractions offered by processing technologies,
such as scientific workflows and data-intensive tools, and facilitate the experts' contribution to
the lineage at runtime. Ultimately, by encouraging tuning and use of provenance for
rapid feedback, the thesis aims at improving the synergy between different user groups
to increase productivity and understanding of their processes.
We present a model of provenance, called S-PROV, that uses and further extends
PROV and ProvONE. The relationships and properties characterising the workflow’s
abstractions and their concrete executions are re-elaborated to include aspects related
to delegation, distribution and steering of stateful streaming operators. The model is
supported by the Active framework for tuneable and actionable lineage ensuring the
user’s engagement by fostering rapid exploitation. Here, concepts such as provenance
types, configuration and explicit state management allow users to capture complex
provenance scenarios and activate selective controls based on domain and user-defined
metadata. We outline how the traces are recorded in a new comprehensive system,
called S-ProvFlow, enabling different classes of consumers to explore the provenance
data with services and tools for monitoring, in-depth validation and comprehensive
visual-analytics. The work of this thesis will be discussed in the context of an existing
computational framework and the experience matured in implementing provenance-aware
tools for seismology and climate VREs. It will continue to evolve through
newly funded projects, thereby providing generic and user-centred solutions for data-intensive
research.
WFCatalog: a catalogue for seismological waveform data
This paper reports advances in seismic waveform description and discovery leading to a new seismological service, and presents the key steps in its design, implementation and adoption. This service, named WFCatalog, which stands for waveform catalogue, accommodates features of seismological waveform data. It therefore meets seismologists' need to select waveform data based on seismic waveform features as well as sensor geolocations and temporal specifications. We describe the collaborative design methods and the technical solution, showing the central role of seismic feature catalogues in framing the technical and operational delivery of the new service. We also provide an overview of the complex environment in which this endeavour is scoped and discuss the related challenges. As multi-disciplinary, multi-organisational and global collaboration is necessary to address today's challenges, canonical representations can provide a focus for collaboration and conceptual tools for agreeing directions. Such collaborations can be fostered and formalised by rallying intellectual effort into the design of novel scientific catalogues and the services that support them. This work offers an example of the benefits generated by involving cross-disciplinary skills (e.g. data and domain expertise) from the early stages of design, and by sustaining engagement with the target community throughout the delivery and deployment process.
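The feature-based selection WFCatalog enables can be sketched as a simple filter over waveform metadata records. The field names, stations and thresholds below are made up for illustration; they are not the actual WFCatalog schema:

```python
# Illustrative sketch of feature-based waveform selection, the kind of query
# a waveform catalogue enables. Record fields and values are hypothetical.
from datetime import datetime

records = [
    {"station": "STA1", "start": datetime(2015, 3, 1), "sample_rate": 100.0,
     "percent_availability": 99.2, "num_gaps": 0},
    {"station": "STA2", "start": datetime(2015, 3, 1), "sample_rate": 100.0,
     "percent_availability": 71.5, "num_gaps": 12},
]

def select(records, min_availability, max_gaps):
    """Select waveforms by quality features, not just by name and time window."""
    return [r for r in records
            if r["percent_availability"] >= min_availability
            and r["num_gaps"] <= max_gaps]

good = select(records, min_availability=95.0, max_gaps=2)
```

The design point is that quality features are computed once and catalogued, so selection happens against the catalogue rather than by downloading and inspecting the waveforms themselves.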
<i>Active</i> provenance for Data-Intensive workflows: engaging users and developers
We present a practical approach for provenance capturing in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream. It offers control over lineage precision, combining automation with specified adaptations. We address provenance tasks such as extraction of domain metadata, injection of custom annotations, accuracy, and integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution to prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV. Published: San Diego (CA, USA).
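The core idea of recording lineage enriched with injected domain metadata can be sketched as a decorator that traces each component invocation. This is a hypothetical illustration of the pattern, not the S-PROV or Active framework API; all names are invented for the example:

```python
# Hypothetical sketch of provenance capture with domain contextualisation:
# a wrapper records a lineage entry, enriched with user-supplied domain
# metadata, for every item a component processes.

lineage = []  # in a real system this would stream to a provenance store

def with_provenance(component_name, extract_metadata):
    """Attach lineage capture to a processing function.

    extract_metadata plays the role of a 'provenance type': it injects
    domain-specific annotations into each lineage record.
    """
    def wrap(fn):
        def traced(item):
            result = fn(item)
            lineage.append({
                "component": component_name,
                "input": item,
                "output": result,
                "domain": extract_metadata(result),  # domain contextualisation
            })
            return result
        return traced
    return wrap

@with_provenance("amplitude_filter",
                 lambda r: {"units": "m/s", "kept": r is not None})
def amplitude_filter(sample):
    """Keep only samples whose magnitude exceeds a threshold."""
    return sample if abs(sample) > 0.5 else None

for s in [0.2, 0.9, -1.1]:
    amplitude_filter(s)
```

Selective controls, as described in the abstract, would then amount to filtering or grouping these records by the domain metadata rather than by workflow structure alone.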
dispel4py: A Python framework for data-intensive scientific computing
This paper presents dispel4py, a new Python framework for describing abstract stream-based workflows for distributed data-intensive applications. These combine the familiarity of Python programming with the scalability of workflows. Data streaming is used to gain performance, rapid prototyping and applicability to live observations. dispel4py enables scientists to focus on their scientific goals, avoiding distracting details and retaining flexibility over the computing infrastructure they use. The implementation, therefore, has to map dispel4py abstract workflows optimally onto target platforms chosen dynamically. We present four dispel4py mappings: Apache Storm, message-passing interface (MPI), multi-threading and sequential, showing two major benefits: a) smooth transitions from local development on a laptop to scalable execution for production work, and b) scalable enactment on significantly different distributed computing infrastructures. Three application domains are reported and measurements on multiple infrastructures show the optimisations achieved; they have provided demanding real applications and helped us develop effective training. dispel4py.org is an open-source project to which we invite participation. The effective mapping of dispel4py onto multiple target infrastructures demonstrates exploitation of data-intensive and high-performance computing (HPC) architectures and consistent scalability.
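The "one abstract workflow, many mappings" benefit described here can be sketched in a few lines: the same workflow definition is enacted either sequentially or in parallel, and produces identical results. The function names and the `mapping` switch are illustrative, not dispel4py's interface (whose real mappings, per the abstract, include Storm, MPI, multi-threading and sequential):

```python
# Sketch of enacting one abstract workflow under different target mappings.
# 'sequential' and 'threads' stand in for dispel4py's richer mapping set.
from concurrent.futures import ThreadPoolExecutor

def workflow(item):
    """The abstract workflow: a pure per-item transformation."""
    return (item + 1) * 10

def run(items, mapping="sequential"):
    if mapping == "sequential":          # local development on a laptop
        return [workflow(x) for x in items]
    if mapping == "threads":             # parallel enactment, same definition
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(workflow, items))
    raise ValueError(f"unknown mapping: {mapping}")

# The workflow definition never changes; only the mapping does.
assert run([1, 2, 3], "sequential") == run([1, 2, 3], "threads")
```

Because the transformation is pure and the mapping is chosen at enactment time, moving from laptop to cluster requires no change to the workflow code, which is exactly the transition the paper reports.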
Scientific Workflows: Moving Across Paradigms
Modern scientific collaborations have opened up the opportunity to solve complex problems that require both multidisciplinary expertise and large-scale computational experiments. These experiments typically consist of a sequence of processing steps that need to be executed on selected computing platforms. Execution poses a challenge, however, due to (1) the complexity and diversity of applications, (2) the diversity of analysis goals, (3) the heterogeneity of computing platforms, and (4) the volume and distribution of data. A common strategy to make these in silico experiments more manageable is to model them as workflows and to use a workflow management system to organize their execution. This article looks at the overall challenge posed by a new order of scientific experiments and the systems they need to be run on, and examines how this challenge can be addressed by workflows and workflow management systems. It proposes a taxonomy of workflow management system (WMS) characteristics, including aspects previously overlooked. This frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies research needed to maintain progress in this area
Methodology to sustain common information spaces for research collaborations
Information and knowledge sharing collaborations are essential for scientific research
and innovation. They provide opportunities to pool expertise and resources. They are
required to draw on today’s wealth of data to address pressing societal challenges.
Establishing effective collaborations depends on the alignment of intellectual and
technical capital.
In this thesis we investigate implications and influences of socio-technical aspects
of research collaborations to identify methods of facilitating their formation and
sustained success. We draw on our experience acquired in an international federated
seismological context, and in a large research infrastructure for solid-Earth sciences.
We recognise the centrality of the users and propose a strategy to sustain their
engagement as actors participating in the collaboration. Our approach promotes and
enables their active contribution in the construction and maintenance of Common
Information Spaces (CISs). These are shaped by conceptual agreements that are
captured and maintained to facilitate mutual understanding and to underpin their
collaborative work.
A user-driven approach shapes the evolution of a CIS based on the requirements of
the communities involved in the collaboration. Active users’ engagement is pursued by
partitioning concerns and by targeting their interests. For instance, application domain
experts focus on scientific and conceptual aspects; data and information experts address
knowledge representation issues; and architects and engineers build the infrastructure
that populates the common space.
We introduce a methodology to sustain CIS and a conceptual framework that has
its foundations on a set of agreed Core Concepts forming a Canonical Core (CC). A
representation of such a CC is also introduced that leverages and promotes reuse of
existing standards: EPOS-DCAT-AP.
The application of our methodology shows promising results with a good uptake
and adoption by the targeted communities. This encourages us to continue applying
and evaluating such a strategy in the future.