Towards optimising distributed data streaming graphs using parallel streams
Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the Life Sciences, EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies, and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.
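To make the pipelined-streaming model concrete, here is a minimal sketch using Python threads and queues: processing elements run concurrently so task executions overlap, and a round-robin splitter introduces two parallel streams for one stage. The element names, the splitter, and the toy squaring task are illustrative assumptions, not the paper's implementation.

```python
import threading
import queue

SENTINEL = object()  # end-of-stream marker

def source(out_q, items):
    # Source element: emit a stream of data items, then close the stream.
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)

def splitter(in_q, out_qs):
    # Round-robin incoming items across parallel downstream streams.
    i = 0
    while True:
        item = in_q.get()
        if item is SENTINEL:
            for q in out_qs:
                q.put(SENTINEL)
            return
        out_qs[i % len(out_qs)].put(item)
        i += 1

def transform(in_q, out_q, fn):
    # Incrementally apply fn to each item; execution overlaps with
    # upstream and downstream elements because each runs in its own thread.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put(fn(item))

def sink(in_q, n_streams, results):
    # Merge n parallel streams, stopping once every stream has closed.
    closed = 0
    while closed < n_streams:
        item = in_q.get()
        if item is SENTINEL:
            closed += 1
        else:
            results.append(item)

if __name__ == "__main__":
    q_in, q_a, q_b, q_out = (queue.Queue() for _ in range(4))
    results = []
    workers = [
        threading.Thread(target=source,    args=(q_in, range(10))),
        threading.Thread(target=splitter,  args=(q_in, [q_a, q_b])),
        threading.Thread(target=transform, args=(q_a, q_out, lambda x: x * x)),
        threading.Thread(target=transform, args=(q_b, q_out, lambda x: x * x)),
        threading.Thread(target=sink,      args=(q_out, 2, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(sorted(results))  # [0, 1, 4, 9, ..., 81]
```

Bounded queues in place of the unbounded ones above would add back-pressure between elements, which matters when stages run at different speeds.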
Workflow level parametric study support by MOTEUR and the P-GRADE portal
Many large-scale scientific applications require the processing of complete data sets made of individual data segments that can be manipulated independently following a single analysis procedure. Workflow managers have been designed for describing and controlling such complex application control flows. However, when considering very data-intensive applications, there is a large potential parallelism that should be properly exploited to ensure efficient processing. Distributed systems such as Grid infrastructures are promising for handling the load resulting from parallel data analysis and manipulation. Workflow managers can help in exploiting the infrastructure parallelism, given that they are able to handle the data flow resulting from the application's execution.
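As a rough illustration of the parallelism such a data set exposes, the sketch below fans a single analysis procedure out over independent segments with a local process pool. The segment layout and the placeholder analyse function are assumptions; a real workflow manager would schedule this over Grid resources rather than local processes.

```python
from multiprocessing import Pool

def analyse(segment):
    # Single analysis procedure applied to one independent data segment.
    # Placeholder computation; a real workflow would run the full
    # analysis pipeline here.
    return sum(segment)

if __name__ == "__main__":
    # A complete data set made of independently processable segments.
    data_set = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    with Pool(processes=4) as pool:
        # The workflow manager's job, in miniature: fan the same
        # procedure out over all segments in parallel, gather results.
        results = pool.map(analyse, data_set)
    print(results)
```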
Optimisation of the enactment of fine-grained distributed data-intensive work-flows
The emergence of data-intensive science as the fourth science paradigm has posed a data-deluge challenge for enacting scientific work-flows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, besides dealing with the heterogeneity and complexity of data, applications and execution environments. New scientific work-flows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such work-flows requires not only larger storage space and faster machines, but also the capability to support the scalability and diversity of the users, applications, data, computing resources and the enactment technologies.
We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a work-flow language. The work-flow language should be both human-readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation: data-flow between computational elements in the scientific work-flow is implemented as streams. To cope with the exploratory nature of scientific work-flows, the architecture should support fast work-flow prototyping and the re-use of work-flows and work-flow components. Above all, the enactment process should be easily repeated and automated.
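A minimal sketch of that data-streaming model, assuming Python generators stand in for the computational elements: each element consumes one stream and produces another, so elements compose into work-flows and can be re-used unchanged. The record format and element names are illustrative, not part of the thesis.

```python
def source(records):
    # Source element: emit records one at a time rather than
    # materialising the whole data set.
    for rec in records:
        yield rec

def parse(stream):
    # Transformation element: consume one stream, produce another;
    # work on successive items overlaps under lazy evaluation.
    for rec in stream:
        yield rec.split(",")

def keep_valid(stream):
    # Filter element: pass through only well-formed rows.
    for row in stream:
        if len(row) == 3:
            yield row

# Elements compose into a work-flow by connecting their streams; each
# element is reusable in other pipelines without modification.
raw = ["a,1,x", "b,2", "c,3,y"]
pipeline = keep_valid(parse(source(raw)))
print(list(pipeline))  # [['a', '1', 'x'], ['c', '3', 'y']]
```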
In this thesis, we present a candidate data-intensive architecture that includes an intermediate work-flow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise them systematically. We propose a new enactment strategy to demonstrate that the optimisation of data-streaming work-flows can be automated by exploiting performance data gathered during previous enactments.
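The following sketch suggests how such a fine-grained measurement framework might look, assuming a streaming enactment in Python: a wrapper passes items through a processing element's output stream while counting them and timing the stream's lifetime, then records the result in a small SQLite "performance database". The schema, the names, and the timing granularity are assumptions, not DISPEL's actual instrumentation.

```python
import sqlite3
import time

def make_db(path=":memory:"):
    # A minimal "performance database": one row per processing element
    # per enactment, recording items handled and elapsed wall-clock time.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS measurements (
                      enactment_id TEXT,
                      element      TEXT,
                      n_items      INTEGER,
                      seconds      REAL)""")
    return db

def instrument(db, enactment_id, element_name, stream):
    # Wrap a streaming element's output: pass items through unchanged
    # while counting them; record a measurement once the stream closes.
    # 'seconds' is the stream's total lifetime, including downstream time.
    start, n = time.perf_counter(), 0
    for item in stream:
        n += 1
        yield item
    db.execute("INSERT INTO measurements VALUES (?, ?, ?, ?)",
               (enactment_id, element_name, n, time.perf_counter() - start))
    db.commit()

db = make_db()
squares = (x * x for x in range(1000))       # a stand-in processing element
measured = instrument(db, "run-1", "square", squares)
total = sum(measured)                        # enact: drain the stream
for row in db.execute("SELECT * FROM measurements"):
    print(row)  # ('run-1', 'square', 1000, <seconds>)
```

Accumulating such rows across enactments gives an optimiser historical data, e.g. per-element rates, from which to choose where to split pipelines or add parallel streams in later runs.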