20 research outputs found
Provenance support for service-based infrastructure
Service-based architectures represent the next evolutionary step in the development of e-science, namely, the transformation of the Internet from a commercial marketplace to a mechanism for sharing multidisciplinary scientific resources. Although scientists in many disciplines have become increasingly reliant on distributed computing technologies for data processing and dissemination, the record of the processing history and origin of a data product, that is, its data provenance, is often nonexistent, incomplete or impossible for potential users to recover. This thesis aims to address data provenance issues in service-based environments, particularly to answer how a scientist who performs a workflow execution in such an environment can (1) document the data provenance for a data item created by the execution, and (2) use the provenance documentation as a recipe to re-execute the workflow. This thesis proposes a provenance model for delivering data provenance support in a service-based environment. Through the use of an example scenario of a scientific workflow in the Astrophysics domain, we explore and identify components of the provenance model. The provenance model proposes a technique to collect and record data provenance for service-based workflow executions. The technique facilitates the collection of data provenance of workflow execution at runtime. In order to record the collected data provenance, the thesis also proposes a specification to represent provenance, describing the processing history whereby a piece of data was derived. The thesis also proposes query interfaces that allow recorded provenance to be queried, formulates a technique to construct provenance graphs, and supports the re-execution of past workflows. The provenance representation specification, the collection technique, and the query interfaces have been used to implement a prototype system to demonstrate the proposed model.
The thesis also experimentally evaluates the scalability of the components implemented.
EThOS - Electronic Theses Online Service, United Kingdom (GB)
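As a rough illustration of how recorded provenance can act as a recipe for re-execution, the sketch below stores one record per service invocation, builds a provenance graph from those records, and walks the graph backwards from a data item to list the invocations needed to derive it again. This is not the thesis's representation specification: the `Invocation` record, the function names, and the astrophysics-flavoured service and data names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Invocation:
    """One recorded service call (hypothetical record layout)."""
    service: str
    inputs: Tuple[str, ...]
    output: str


def build_graph(records: List[Invocation]) -> Dict[str, Invocation]:
    """Map each derived data item to the invocation that produced it."""
    return {rec.output: rec for rec in records}


def recipe(graph: Dict[str, Invocation], item: str) -> List[Invocation]:
    """Return the invocations needed to re-derive `item`, inputs first."""
    ordered: List[Invocation] = []
    seen = set()

    def visit(data_id: str) -> None:
        rec = graph.get(data_id)
        if rec is None or data_id in seen:
            return  # raw input with no recorded producer, or already handled
        seen.add(data_id)
        for parent in rec.inputs:
            visit(parent)
        ordered.append(rec)

    visit(item)
    return ordered


# Hypothetical provenance collected at runtime for a three-step workflow.
records = [
    Invocation("calibrate", ("raw_image",), "calibrated_image"),
    Invocation("extract_sources", ("calibrated_image",), "source_catalogue"),
    Invocation("cross_match", ("source_catalogue", "reference_catalogue"),
               "matched_catalogue"),
]
graph = build_graph(records)
steps = [rec.service for rec in recipe(graph, "matched_catalogue")]
print(steps)
```

Replaying the returned invocations in order, with the same raw inputs, is one simple way to read "provenance as a recipe"; a real system would also need to record service versions and parameters.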
BRIL - Capturing Experiments in the Wild
This presentation describes a project to embed a repository system (based on Fedora) within the complex, experimental processes of a number of researchers in biophysics and structural biology. The project is capturing not just individual datasets but entire experimental workflows as complex objects, incorporating provenance information based on the Open Provenance Model, to support reproduction and validation of published results. The repository is integrated within these experimental processes, so that data capture is as far as possible automatic and invisible to the researcher. A particular challenge is that the researchers' work takes place in local environments within the department, entirely decoupled from the repository. In meeting this challenge, the project is bridging the gap between the "wild", ad hoc and independent environment of the researcher's desktop, and the curated, sustainable, institutional environment of the repository, and in the process the project crosses the boundary between several of the pairs of polar opposites identified in the call.
Consulting (in Writing) to the Corporation: Principles and Pragmatics
Provenance information provides a useful basis to verify whether a particular application behavior has been adhered to. This is particularly useful to evaluate the basis for a particular outcome, as a result of a process, and to verify whether the process involved in making the decision conforms to some pre-defined set of rules. This is significant in a healthcare scenario, where it is necessary to demonstrate that patient data has been processed in a particular way. Understanding how provenance information may be recorded, stored, and subsequently analyzed by a decision maker is therefore significant in a service-oriented architecture, which involves the use of third-party services over which the decision maker does not have control. The aggregation of data from multiple sources of patient information plays an important part in subsequent treatments that are proposed for a patient. A tool to navigate through and analyze such provenance information is proposed, based on the use of a portal framework that allows different views on provenance information to co-exist. The portal enables users to add custom portlets, enabling application-specific views that would facilitate particular decision making.
Trust Assessment Using Provenance in Service Oriented Applications
Workflow forms a key part of many existing service-oriented applications, involving the integration of services that may be made available at distributed sites. It is possible to distinguish between an "abstract" workflow description, outlining which services must be involved in a workflow execution, and a "physical" workflow description, outlining the particular instances of services that were used in a particular enactment. Provenance information provides a useful way to capture the physical workflow description automatically, especially if this information is captured in a standard format. Subsequent analysis of this provenance information may be used to evaluate whether the abstract workflow description has been adhered to, and to enable a user executing a workflow-based application to establish "trust" in the outcome.
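The comparison between an abstract and a physical workflow description can be pictured with a minimal sketch. It assumes, purely for illustration, that the abstract description is an ordered list of service types and that the recorded provenance yields a list of (service type, instance address) pairs; the function name, step names, and `example.org` addresses are hypothetical and do not come from the paper.

```python
def check_enactment(abstract_steps, enactment):
    """Return a list of deviations of a recorded enactment from the abstract workflow.

    abstract_steps: ordered service types the workflow must involve.
    enactment: (service_type, instance_address) pairs taken from provenance.
    An empty result means the abstract description was adhered to.
    """
    problems = []
    used_types = [service_type for service_type, _ in enactment]

    for step in abstract_steps:
        if step not in used_types:
            problems.append(f"missing step: {step}")
    for service_type in used_types:
        if service_type not in abstract_steps:
            problems.append(f"unexpected step: {service_type}")

    # Compare ordering only over the steps both descriptions share.
    expected_order = [s for s in abstract_steps if s in used_types]
    actual_order = [s for s in used_types if s in abstract_steps]
    if expected_order != actual_order:
        problems.append("steps executed out of order")
    return problems


# Hypothetical abstract workflow and two recorded enactments.
abstract_steps = ["anonymise", "aggregate", "analyse"]
good = [
    ("anonymise", "https://example.org/anon-1"),
    ("aggregate", "https://example.org/agg-3"),
    ("analyse", "https://example.org/stats-2"),
]
bad = [
    ("aggregate", "https://example.org/agg-3"),
    ("analyse", "https://example.org/stats-2"),
]
print(check_enactment(abstract_steps, good))
print(check_enactment(abstract_steps, bad))
```

A trust assessment would likely weigh more than step presence and order, for example which instance was used; the sketch only shows the adherence check itself.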
MetaTools - Investigating Metadata Generation Tools - Final Report
Automatic metadata generation has sometimes been posited as a solution to the "metadata bottleneck" that repositories and portals are facing as they struggle to provide resource discovery metadata for a rapidly growing number of new digital resources. Unfortunately there is no registry or trusted body of documentation that rates the quality of metadata generation tools or identifies the most effective tool(s) for any given task. The aim of the first stage of the project was to remedy this situation by developing a framework for evaluating tools used for the purpose of generating Dublin Core metadata. A range of intrinsic and extrinsic metrics (standard tests or measurements) that capture the attributes of good metadata from various perspectives were identified from the research literature and evaluated in a report. A test program was then implemented using metrics from the framework. It evaluated the quality of metadata generated from (1) Web pages (HTML) and (2) scholarly works (PDF) by four of the more widely known metadata generation tools: Data Fountains, DC-dot, SamgI, and the Yahoo! Term Extractor. The intention was also to test PaperBase, a prototype for generating metadata for scholarly works, but its developers ultimately preferred to conduct tests in-house. Some interesting comparisons with their results were nonetheless possible and were included in the stage 2 report. It was found that the output from Data Fountains was generally superior to that of the other tools that the project tested. However, the output from all of the tools was considered to be disappointing and markedly inferior to the quality of metadata that Tonkin and Muller report that PaperBase has extracted from scholarly works. Overall, the prospects for generating high-quality metadata for scholarly works appear to be brighter because of their more predictable layout.
It is suggested that JISC should particularly encourage research into auto-generation methods that exploit the structural and syntactic features of scholarly works in PDF format, as exemplified by PaperBase, and strongly consider funding the development of tools in this direction. In the third stage of the project, SOAP and RESTful Web Service interfaces were developed for three metadata generation tools: Data Fountains, SamgI and Kea. This had a dual purpose. Firstly, the creation of an optimal metadata record usually requires the merging of output from several tools, each of which, until now, had to be invoked separately because of the ad hoc nature of their interfaces. As Web services, they will be available for use in a network such as the Web with well-defined interfaces that are implementation-independent. These services will be exposed for use by clients without them having to be concerned with how the service will execute their requests. Repositories should be able to plug them into their own cataloguing environments and experiment with automatic metadata generation under more "real-life" circumstances than hitherto. Secondly, and more importantly (in view of the relatively poor quality of current tools), they enabled the project to experiment with the use of a high-level ontology for describing metadata generation tools. The value of an ontology being used in this way should be felt as higher quality tools (such as PaperBase?) emerge. The high-level ontology is part of a MetaTools system architecture that consists of various components to describe, register and discover services. Low-level definitions within a service ontology are mapped to higher-level human-understandable semantic descriptions contained within a MetaTools ontology. A user interface enables service providers to register their service in a public registry. This registry is used by consumers to find services that match certain criteria.
If the registry has such a service, it provides the consumer with a contract and an endpoint address for that service. The terms in the MetaTools ontology can, in turn, be part of a higher-level ontology that describes the preservation domain as a whole. The team believes that an ontology-aided approach to service discovery, as employed by the MetaTools project, is a practical solution. A stage 3 technical report was also written.
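The register-then-discover flow described above can be sketched in a few lines. This is only a toy in-memory stand-in, not the MetaTools registry or its ontology: the `ServiceEntry` fields, the matching by exact attribute equality, the tool names, and the `example.org` addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ServiceEntry:
    """Description a provider registers (hypothetical fields)."""
    name: str
    input_format: str
    output_schema: str
    contract: str   # e.g. address of an interface description
    endpoint: str   # address where the service is invoked


class Registry:
    """Toy public registry: providers register, consumers search by criteria."""

    def __init__(self) -> None:
        self._entries: List[ServiceEntry] = []

    def register(self, entry: ServiceEntry) -> None:
        self._entries.append(entry)

    def find(self, **criteria: str) -> Optional[Tuple[str, str]]:
        """Return (contract, endpoint) of the first matching service, else None."""
        for entry in self._entries:
            if all(getattr(entry, key, None) == value
                   for key, value in criteria.items()):
                return entry.contract, entry.endpoint
        return None


registry = Registry()
registry.register(ServiceEntry(
    name="tool-a", input_format="html", output_schema="dublin-core",
    contract="https://example.org/tool-a/contract",
    endpoint="https://example.org/tool-a/generate"))
registry.register(ServiceEntry(
    name="tool-b", input_format="pdf", output_schema="dublin-core",
    contract="https://example.org/tool-b/contract",
    endpoint="https://example.org/tool-b/generate"))

match = registry.find(input_format="pdf", output_schema="dublin-core")
print(match)
```

An ontology-aided registry would match on semantic descriptions rather than literal field values, so a consumer could ask for a capability without knowing a tool's exact vocabulary.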
Support for Provenance in a Service-based Computing Grid
There is an underlying need to support data provenance in a service-based computing environment such as the Grid, where web services may be automatically discovered, composed, and then consumed using innovative workflow management systems. In many scientific scenarios a succession of data transformations occurs, producing data of added scientific value. The provenance of such data needs to be recognized for other potential users to verify the data before using it in further studies. In this paper we present general requirements and implementation issues in provenance data collection, recording and reasoning, and discuss how these reflect on what aspects of information are essential for an effective and scalable system.