Active provenance for data intensive research
The role of provenance information in data-intensive research is a significant topic of
discussion among technical experts and scientists. Typical use cases addressing traceability,
versioning and reproducibility of research findings are being extended with more
interactive scenarios, supporting, for instance, computational steering and results
management. In this thesis we investigate the impact that lineage records can have on
the early phases of the analysis, for instance when performed through near-real-time systems
and Virtual Research Environments (VREs) tailored to the requirements of a specific
community. By positioning provenance at the centre of the computational research
cycle, we highlight the importance of having mechanisms at the data-scientists’ side
that, by integrating with the abstractions offered by the processing technologies, such
as scientific workflows and data-intensive tools, facilitate the experts’ contribution to
the lineage at runtime. Ultimately, by encouraging tuning and use of provenance for
rapid feedback, the thesis aims to improve the synergy between different user groups
and to increase productivity and understanding of their processes.
We present a model of provenance, called S-PROV, that uses and further extends
PROV and ProvONE. The relationships and properties characterising the workflow’s
abstractions and their concrete executions are re-elaborated to include aspects related
to delegation, distribution and steering of stateful streaming operators. The model is
supported by the Active framework for tuneable and actionable lineage ensuring the
user’s engagement by fostering rapid exploitation. Here, concepts such as provenance
types, configuration and explicit state management allow users to capture complex
provenance scenarios and activate selective controls based on domain and user-defined
metadata. We outline how the traces are recorded in a new comprehensive system,
called S-ProvFlow, enabling different classes of consumers to explore the provenance
data with services and tools for monitoring, in-depth validation and comprehensive
visual-analytics. The work of this thesis will be discussed in the context of an existing
computational framework and the experience gained in implementing provenance-aware
tools for seismology and climate VREs. The work will continue to evolve through
newly funded projects, thereby providing generic and user-centred solutions for data-intensive
research.
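The idea of tuneable, selective lineage described above can be sketched in plain Python. This is an illustrative stand-in only, with hypothetical names (`ProvenanceType`, `selectivity`), not the actual S-PROV / Active API: a wrapper records a lineage entry per output granule, and a user-defined predicate on the data activates or suppresses recording.

```python
# Illustrative sketch only: hypothetical names, not the actual S-PROV / Active API.
import uuid

class ProvenanceType:
    """Wraps a streaming operator and records a lineage entry per output granule."""
    def __init__(self, operator, trace, selectivity=None):
        self.operator = operator          # the wrapped processing function
        self.trace = trace                # shared list standing in for S-ProvFlow storage
        self.selectivity = selectivity    # optional predicate on domain metadata

    def process(self, granule):
        out_data = self.operator(granule["data"])
        out = {"id": str(uuid.uuid4()), "data": out_data,
               "derived_from": [granule["id"]]}
        record = {"operator": self.operator.__name__, "output": out["id"],
                  "inputs": out["derived_from"]}
        # Selective control: record lineage only when the predicate accepts the output.
        if self.selectivity is None or self.selectivity(out_data):
            self.trace.append(record)
        return out

trace = []
def double(x): return 2 * x
pe = ProvenanceType(double, trace, selectivity=lambda v: v > 5)
pe.process({"id": "g1", "data": 2})   # output 4: predicate rejects, no lineage record
pe.process({"id": "g2", "data": 4})   # output 8: predicate accepts, recorded
```

In this sketch the predicate plays the role of the domain- and user-defined metadata controls: only granules the expert deems relevant generate traces, keeping the lineage actionable rather than exhaustive.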
WS-PGRADE/gUSE in European Projects
Besides core project partners, the SCI-BUS project also supported several external user communities in developing and setting up customized science gateways. The focus was on large communities, typically represented by other European research projects. However, smaller local efforts with the potential to generalize the solution to wider communities were also supported. This chapter gives an overview of support activities related to user communities external to the SCI-BUS project. A generic overview of such activities is provided, followed by a detailed description of three gateways developed in collaboration with European projects: the agINFRA Science Gateway for Workflows for agricultural research, the VERCE Science Gateway for seismology, and the DRIHM Science Gateway for weather research and forecasting.
dispel4py: A Python framework for data-intensive scientific computing
This paper presents dispel4py, a new Python framework for describing abstract stream-based workflows for distributed data-intensive applications. These combine the familiarity of Python programming with the scalability of workflows. Data streaming is used to gain performance, rapid prototyping and applicability to live observations. dispel4py enables scientists to focus on their scientific goals, avoiding distracting details and retaining flexibility over the computing infrastructure they use. The implementation, therefore, has to map dispel4py abstract workflows optimally onto target platforms chosen dynamically. We present four dispel4py mappings: Apache Storm, message-passing interface (MPI), multi-threading and sequential, showing two major benefits: a) smooth transitions from local development on a laptop to scalable execution for production work, and b) scalable enactment on significantly different distributed computing infrastructures. Three application domains are reported, and measurements on multiple infrastructures show the optimisations achieved; these domains have provided demanding real applications and helped us develop effective training. dispel4py.org is an open-source project to which we invite participation. The effective mapping of dispel4py onto multiple target infrastructures demonstrates exploitation of data-intensive and high-performance computing (HPC) architectures and consistent scalability.
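The stream-based composition model described above can be sketched with plain Python generators. This is a minimal illustration of the idea, not the actual dispel4py API; the function names (`source`, `scale`, `collect`) are hypothetical.

```python
# Minimal plain-Python sketch of stream-based workflow composition,
# in the spirit of dispel4py's processing elements (PEs). Names are illustrative.

def source(values):
    """Producer PE: emits a stream of data items."""
    for v in values:
        yield v

def scale(stream, factor):
    """Transforming PE: consumes one stream, emits another, item by item."""
    for item in stream:
        yield item * factor

def collect(stream):
    """Consumer PE: materialises the stream at the end of the pipeline."""
    return list(stream)

# Composing PEs into an abstract pipeline. In dispel4py, a mapping
# (Apache Storm, MPI, multi-threading or sequential) would decide how
# each stage is actually enacted; here everything runs sequentially.
result = collect(scale(source(range(5)), 10))
```

The point of the abstraction is that the composition above says nothing about where each stage runs; swapping the enactment target changes the mapping, not the workflow description.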
Optimisation of the enactment of fine-grained distributed data-intensive workflows
The emergence of data-intensive science as the fourth science paradigm has posed a
data-deluge challenge for enacting scientific workflows. The scientific community is
facing an imminent flood of data from the next generation of experiments and simulations,
while also dealing with the heterogeneity and complexity of data, applications and
execution environments. New scientific workflows involve execution on distributed and
heterogeneous computing resources across organisational and geographical boundaries,
processing gigabytes of live data streams and petabytes of archived and simulation data,
in various formats and from multiple sources. Managing the enactment of such workflows requires not only larger storage space and faster machines, but also the capability to
support the scalability and diversity of the users, applications, data, computing resources
and enactment technologies.
We argue that the enactment process can be made efficient using optimisation techniques
in an appropriate architecture. This architecture should support the creation
of diversified applications and their enactment on diversified execution environments,
with a standard interface, i.e., a workflow language. The workflow language should
be both human-readable and suitable for communication between the enactment environments.
The data-streaming model central to this architecture provides a scalable
approach to large-scale data exploitation. Data-flow between computational elements
in the scientific workflow is implemented as streams. To cope with the exploratory
nature of scientific workflows, the architecture should support fast workflow prototyping
and the re-use of workflows and workflow components. Above all, the enactment
process should be easily repeated and automated.
In this thesis, we present a candidate data-intensive architecture that includes an intermediate
workflow language, named DISPEL. We create a new fine-grained measurement
framework to capture performance-related data during enactments, and design
a performance database to organise these data systematically. We propose a new enactment
strategy to demonstrate that optimisation of data-streaming workflows can be
automated by exploiting performance data gathered during previous enactments.
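The fine-grained measurement idea can be illustrated with a small sketch: a timing probe wraps each processing element and feeds observations into a simple in-memory store standing in for the performance database. The names (`measured`, `perf_db`) are hypothetical, not the thesis's actual framework.

```python
# Hedged sketch: a per-PE timing probe feeding an in-memory performance store,
# standing in for the measurement framework and performance database described above.
import time
from collections import defaultdict

perf_db = defaultdict(list)   # PE name -> list of (items_processed, seconds) observations

def measured(name, fn):
    """Wrap a processing element so each invocation logs its processing time."""
    def wrapper(items):
        start = time.perf_counter()
        out = [fn(x) for x in items]
        perf_db[name].append((len(items), time.perf_counter() - start))
        return out
    return wrapper

square = measured("square", lambda x: x * x)
result = square([1, 2, 3])
```

Observations accumulated this way across enactments are exactly the raw material an optimiser needs to estimate per-element cost and choose a better enactment strategy next time.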
Scaling full seismic waveform inversions
The main goal of this research study is to scale full seismic waveform inversions using the adjoint-state method to the data volumes that are nowadays available in seismology. Practical issues hinder the routine application of this method, which is, to a certain extent, theoretically well understood. To a large part this comes down to outdated, or simply missing, tools and ways to automate the highly iterative procedure reliably.
This thesis tackles these issues in three successive stages. It first introduces a modern and properly designed data-processing framework sitting at the very core of all subsequent developments. The ObsPy toolkit is a Python library providing a bridge into the scientific Python ecosystem for seismology, giving seismologists effortless I/O and a powerful signal-processing library, amongst other things.
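The kind of routine trace processing such a framework streamlines can be sketched in a few lines of pure Python. This is illustrative only, a stand-in for ObsPy's Stream/Trace methods rather than its actual API: a demean step followed by a simple smoothing filter.

```python
# Illustrative stand-in for routine seismic trace processing
# (ObsPy's Stream/Trace API provides this and much more); pure Python for clarity.

def demean(samples):
    """Remove the mean offset from a seismic trace."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def moving_average(samples, window=3):
    """Simple smoothing filter, a placeholder for proper bandpass filtering."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

trace = demean([1.0, 2.0, 3.0, 2.0, 2.0])   # mean is 2.0
smooth = moving_average(trace)
```

In practice these steps (and instrument response removal, tapering, filtering) must be applied identically to observed and synthetic data, which is why a shared, well-tested processing library matters for iterative inversions.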
The following chapter deals with a framework designed to handle the specific data management and organization issues arising in full seismic waveform inversions, the Large-scale Seismic Inversion Framework. It has been created to orchestrate the various pieces of data accruing in the course of an iterative waveform inversion.
Then, the Adaptable Seismic Data Format, a new, self-describing, and scalable data format for seismology is introduced along with the rationale why it is needed for full waveform inversions in particular and seismology in general.
Finally, these developments are put into service to construct a novel full seismic waveform inversion model for elastic subsurface structure beneath the North American continent and the Northern Atlantic, extending well into Europe. The spectral element method is used for the forward and adjoint simulations, coupled with windowed time-frequency phase misfit measurements. Later iterations use 72 events, all occurring after the USArray project commenced, resulting in approximately 150,000 three-component recordings that are inverted. 20 L-BFGS iterations yield a model that can produce complete seismograms in a period range between 30 and 120 seconds while comparing favorably to observed data.
eScience gateway stimulating collaboration in rock physics and volcanology
Earth scientists observe many facets of the planet's crust and integrate their resulting data to better understand the processes at work. We report on a new data-intensive science gateway designed to bring rock physicists and volcanologists into a collaborative framework that enables them to accelerate their research and integrate well with other Earth scientists. The science gateway supports three major functions: 1) sharing data from laboratories, observatories, experimental facilities and computational model runs; 2) sharing computational models and methods for analysing experimental and observational data; and 3) supporting recurrent tasks, such as data collection and running applications in real time. Our prototype gateway has worked with two exemplar projects, giving experience of data gathering, model sharing and data analysis. The geoscientists found that the gateway accelerated their work, triggered new practices and provided a good platform for long-term collaboration.
3rd EGEE User Forum
We have organized this book as a sequence of chapters, each associated with an application or technical theme and introduced by an overview of the contents and a summary of the main conclusions from the Forum on that topic. The first chapter gathers all the plenary-session keynote addresses, and following this there is a sequence of chapters covering the application-flavoured sessions. These are followed by chapters with the flavour of Computer Science and Grid Technology. The final chapter covers the large number of practical demonstrations and posters exhibited at the Forum. Much of the work presented has a direct link to specific areas of science, and so we have created a Science Index, presented below. In addition, at the end of this book, we provide a complete list of the institutes and countries involved in the User Forum.