Scientific Workflow Applications on Amazon EC2
The proliferation of commercial cloud computing providers has generated
significant interest in the scientific computing community. Much recent
research has attempted to determine the benefits and drawbacks of cloud
computing for scientific applications. Although clouds have many attractive
features, such as virtualization, on-demand provisioning, and "pay as you go"
usage-based pricing, it is not clear whether they are able to deliver the
performance required for scientific applications at a reasonable price. In this
paper we examine the performance and cost of clouds from the perspective of
scientific workflow applications. We use three characteristic workflows to
compare the performance of a commercial cloud with that of a typical HPC
system, and we analyze the various costs associated with running those
workflows in the cloud. We find that the performance of clouds is not
unreasonable given the hardware resources provided, and that performance
comparable to HPC systems can be achieved given similar resources. We also find
that the cost of running workflows on a commercial cloud can be reduced by
storing data in the cloud rather than transferring it from outside.
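The storage-versus-transfer trade-off noted above can be sketched with a toy
pay-as-you-go cost model. All prices and workload figures below are illustrative
assumptions, not values from the paper:

```python
# Hypothetical cost model for one workflow run under usage-based pricing.
# Prices (per CPU-hour, per GB-month stored, per GB transferred in) are
# illustrative placeholders, not actual provider rates.

def workflow_cost(cpu_hours, gb_stored, gb_transferred,
                  price_cpu_hour=0.10, price_gb_month=0.03,
                  price_gb_transfer=0.09, months=1):
    """Total cost of one workflow run: compute + storage + data transfer."""
    compute = cpu_hours * price_cpu_hour
    storage = gb_stored * price_gb_month * months
    transfer = gb_transferred * price_gb_transfer
    return compute + storage + transfer

# Transferring 500 GB of input from outside the cloud for the run:
transfer_in = workflow_cost(cpu_hours=100, gb_stored=0, gb_transferred=500)

# Keeping the same 500 GB inside the cloud and paying only storage:
store_in_cloud = workflow_cost(cpu_hours=100, gb_stored=500, gb_transferred=0)

# Under these assumed prices, storing the data in the cloud is cheaper.
assert store_in_cloud < transfer_in
```

Which strategy wins depends entirely on the price ratio and how often the
workflow is re-run against the same inputs.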
Metadata and provenance management
Scientists today collect, analyze, and generate terabytes and petabytes of
data. These data are often shared, and further processed and analyzed, among
collaborators. To facilitate sharing and interpretation, data need to carry
with them metadata about how they were collected or generated, and provenance
information about how they were processed. This chapter describes metadata and
provenance in the context of the data lifecycle. It also gives an overview of
approaches to metadata and provenance management, followed by examples of how
applications use metadata and provenance in their scientific processes.
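The idea of data carrying its own metadata and provenance through the lifecycle
can be sketched minimally as follows; the record layout and field names here
are assumptions for illustration, not the chapter's actual model:

```python
# A minimal sketch: each data product is a record bundling the data itself,
# descriptive metadata, and a provenance chain of the operations applied.
import datetime

def derive(data, operation, parent):
    """Produce a new data product that records how it was derived."""
    return {
        "data": data,
        "metadata": {
            "created": datetime.datetime.now(datetime.timezone.utc).isoformat()
        },
        # Provenance grows by appending the operation to the parent's chain.
        "provenance": parent["provenance"] + [operation],
    }

raw = {"data": [3, 1, 2],
       "metadata": {"source": "instrument-A"},   # hypothetical instrument
       "provenance": ["collected"]}

cleaned = derive(sorted(raw["data"]), "sort", raw)
print(cleaned["provenance"])  # ['collected', 'sort']
```

A collaborator receiving `cleaned` can read off both how the data were
collected and every processing step applied since.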
Doctor of Philosophy dissertation
Visualization has emerged as an effective means to quickly obtain insight from raw data. While simple computer programs can generate simple visualizations, and while there has been constant progress in sophisticated algorithms and techniques for generating insightful pictorial descriptions of complex data, the process of building visualizations remains a major bottleneck in data exploration. In this thesis, we present the main design and implementation aspects of VisTrails, a system designed around the idea of transparently capturing the exploration process that leads to a particular visualization. In particular, VisTrails explores the idea of provenance management in visualization systems: keeping extensive metadata about how the visualizations were created and how they relate to one another. This thesis presents the provenance data model in VisTrails, which can be easily adopted by existing visualization systems and libraries. This lightweight model entirely captures the exploration process of the user, and it can be seen as an electronic analogue of the scientific notebook. The provenance metadata collected during the creation of pipelines can be reused to suggest similar content in related visualizations and guide semi-automated changes. This thesis presents the idea of building visualizations by analogy in a system that allows users to change many visualizations at once, without requiring them to interact with the visualization specifications. It then proposes techniques to help users construct pipelines by consensus, automatically suggesting completions based on a database of previously created pipelines. By presenting these predictions in a carefully designed interface, users can create visualizations and other data products more efficiently because they can augment their normal work patterns with the suggested completions. VisTrails leverages the workflow specifications to identify and avoid redundant operations.
This optimization is especially useful while exploring multiple visualizations. When variations of the same pipeline need to be executed, substantial speedups can be obtained by caching the results of overlapping subsequences of the pipelines. We present the design decisions behind the execution engine, and how it easily supports the execution of arbitrary third-party modules. These specifications also facilitate the reproduction of previous results. We present a description of an infrastructure that makes the workflows a complete description of the computational processes, including information necessary to identify and install necessary system libraries. In an environment where effective visualization and data analysis tasks combine many different software packages, this infrastructure can mean the difference between being able to replicate published results and getting lost in a sea of software dependencies and missing libraries. The thesis concludes with a discussion of the system architecture, design decisions and lessons learned in VisTrails. This discussion is meant to clarify the issues present in creating a system based around a provenance tracking engine, and should help implementors decide how to best incorporate these notions into their own systems.
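The caching of overlapping subsequences can be illustrated with a small sketch
in which a pipeline is a list of named, pure modules and results are keyed by
the upstream subsequence. This representation is an assumption made for
illustration, not the actual VisTrails execution engine:

```python
# Sketch: identical upstream subpipelines run once; their results are reused
# across pipeline variations. Assumes pure modules and a fixed initial input,
# so the sequence of module names identifies a subpipeline's result.

cache = {}
calls = []  # records which modules actually executed

def run_pipeline(modules, value):
    """Execute (name, fn) modules in order, caching by upstream subsequence."""
    key = ()
    for name, fn in modules:
        key = key + (name,)          # identity of the subpipeline so far
        if key not in cache:
            calls.append(name)       # cache miss: really run the module
            cache[key] = fn(value)
        value = cache[key]
    return value

double = ("double", lambda x: x * 2)
inc = ("inc", lambda x: x + 1)
square = ("square", lambda x: x * x)

run_pipeline([double, inc], 3)       # runs double, then inc
run_pipeline([double, square], 3)    # reuses double's result, runs only square
print(calls)  # ['double', 'inc', 'square']
```

Only three modules execute across the two pipeline variations, rather than
four, which is the source of the speedups described above.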
Publish/subscribe scientific workflow interoperability framework (PS-SWIF)
Different or similar workflow systems, hosted anywhere on a network, written in any language and running on different operating systems, can easily use the full range of PS-SWIF tools to interoperate with each other. The PS-SWIF approach provides interoperability among a wide range of scientific workflow systems.
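The core of a publish/subscribe approach is that systems exchange messages
through topics on a broker, without knowing each other's language or location.
This in-process sketch (the topic name and message shape are assumptions, not
PS-SWIF's actual protocol) shows the decoupling:

```python
# Minimal topic-based publish/subscribe broker: publishers and subscribers
# never reference each other directly, only shared topic names.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []

# One workflow system subscribes to completion events; another publishes them.
broker.subscribe("workflow/completed", lambda msg: received.append(msg))
broker.publish("workflow/completed",
               {"workflow": "align-genomes", "status": "ok"})  # hypothetical
print(received)  # [{'workflow': 'align-genomes', 'status': 'ok'}]
```

In a real deployment the broker would sit on the network (e.g. over a message
bus), so heterogeneous workflow systems interoperate by agreeing only on topic
names and message formats.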
Semantics and planning based workflow composition and execution for video processing
Traditional workflow systems have several drawbacks, e.g. in their inability to
rapidly react to changes, to construct workflows automatically (or with user
involvement) and to improve performance autonomously (or with user involvement)
in an incremental manner according to specified goals. Overcoming these
limitations would be highly beneficial for complex domains where such
adversities are exhibited. Video processing is one such domain that
increasingly requires attention, as larger amounts of images and videos are
becoming available to people who are not technically adept in modelling the
processes involved in constructing complex video processing workflows.
Conventional video and image processing systems, on the other hand, are
developed by programmers possessing image processing expertise. These systems
are tailored to produce highly specialised hand-crafted solutions for very
specific tasks, making them rigid and non-modular. The knowledge-based vision
community has attempted to produce more modular solutions by incorporating
ontologies. However, ontologies have not been fully utilised to encompass
aspects such as application context descriptions (e.g. lighting and clearness
effects) and qualitative measures.
This thesis aims to tackle some of the research gaps yet to be addressed by the
workflow and knowledge-based image processing communities by proposing a novel
workflow composition and execution approach within an integrated framework.
This framework distinguishes three levels of abstraction via the design,
workflow and processing layers. The core technologies that drive the workflow
composition mechanism are ontologies and planning. Video processing problems
provide a fitting domain for investigating the effectiveness of this integrated
method, as tackling such problems has not been fully explored by the workflow,
planning and ontological communities despite the combined strengths these
fields offer in confronting this known hard problem. In addition, the
pervasiveness of video data has amplified the need for more automated
assistance for users naive to image processing, but no adequate support has
been provided as yet.
A video and image processing ontology that comprises three sub-ontologies was
constructed to capture the goals, video descriptions and capabilities (video and image
processing tools). The sub-ontologies are used for representation and inference. In
particular, they are used in conjunction with an enhanced, domain-independent
Hierarchical Task Network (HTN) planner to help with performance-based
selection of solution steps according to preconditions, effects and
postconditions. The planner, in turn, makes use of process models contained in
a process library when deliberating on the
steps and then consults the capability ontology to retrieve a suitable tool at
each step. Two key features of the planner are its ability to support workflow
execution (interleaving planning with execution) and its ability to operate in
automatic or semi-automatic (interactive) mode. The first feature is highly
desirable for video processing problems because the execution of image
processing steps yields visual results that are intuitive and verifiable by the
human user, as automatic validation is non-trivial. In semi-automatic mode, the
planner is interactive and prompts the user to make a tool selection when there
is more than one tool available to perform a task. The user makes the tool
selection based on the recommended descriptions provided by the workflow
system. Once planning is complete, the result of applying the chosen tool is
presented to the user textually and visually for verification. This plays a
pivotal role in giving the user control and the ability to make informed
decisions. Hence, the planner extends the capabilities of typical planners by
guiding the user towards better solutions. Video processing problems can also
be solved in more modular, reusable and adaptable ways than with conventional
image processing systems.
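The composition loop described above — decompose a task via a process library,
consult a capability ontology for a tool per step, and optionally ask the user
to choose among candidate tools — can be sketched as follows. All task names,
tools and data structures here are hypothetical illustrations, not the thesis's
actual ontology or planner:

```python
# Toy HTN-style composition: a process library decomposes tasks into steps,
# and a capability table maps each step to candidate tools. Names below are
# invented for illustration.

process_library = {
    "classify_video": ["enhance_frame", "detect_objects", "classify"],
}

capabilities = {  # step -> candidate tools (stands in for the ontology)
    "enhance_frame": ["histogram_equalise"],
    "detect_objects": ["background_subtraction", "edge_detector"],
    "classify": ["nearest_neighbour"],
}

def plan_and_execute(task, choose=lambda step, tools: tools[0]):
    """Decompose the task and bind a tool to each step, one step at a time.

    `choose` defaults to fully automatic selection (first candidate); passing
    an interactive callback models the semi-automatic mode, where the user
    picks among tools when more than one is available.
    """
    trace = []
    for step in process_library[task]:
        tool = choose(step, capabilities[step])
        trace.append((step, tool))  # a real system would run the tool here
                                    # and show the result for verification
    return trace

print(plan_and_execute("classify_video"))
```

Interleaving is modelled by binding and (notionally) executing each step before
moving to the next, so a user could inspect intermediate visual results.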
The integrated approach was evaluated on a test set of videos of varying
quality originating from an open-sea environment. Experiments to evaluate the
efficiency, adaptability to users' changing needs, and learnability of this
approach were conducted with users who did not possess image processing
expertise. The findings indicate that this integrated workflow composition and
execution method: 1) provides a speed-up of over 90% in execution time for
video classification tasks using fully automatic processing compared to manual
methods, without loss of accuracy; 2) is more flexible and adaptable in
response to changes in user requests (be it in the task, constraints on the
task, or descriptions of the video) than modifying existing image processing
programs when the domain descriptions are altered; and 3) assists the user in
selecting optimal solutions by providing recommended descriptions.