    The Origin of Data: Enabling the Determination of Provenance in Multi-institutional Scientific Systems through the Documentation of Processes

    The Oxford English Dictionary defines provenance as (i) the fact of coming from some particular source or quarter; origin, derivation; (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concr., a record of the ultimate derivation and passage of an item through its various owners. In art, knowing the provenance of an artwork lends weight and authority to it while providing a context for curators and the public to understand and appreciate the work's value. Without such a documented history, the work may be misunderstood, unappreciated, or undervalued. In computer systems, knowing the provenance of digital objects would give them greater weight, authority, and context, just as it does for works of art. Specifically, if the provenance of digital objects could be determined, then users could understand how documents were produced, how simulation results were generated, and why decisions were made. Provenance is of particular importance in science, where experimental results are reused, reproduced, and verified. However, science is increasingly being done through large-scale collaborations that span multiple institutions, which makes the problem of determining the provenance of scientific results significantly harder. Current approaches to this problem are not designed specifically for multi-institutional scientific systems and their evolution towards more dynamic, peer-to-peer topologies. Therefore, this thesis advocates a new approach, namely, that through the autonomous creation, scalable recording, and principled organisation of documentation of systems' processes, the determination of the provenance of results produced by complex multi-institutional scientific systems is enabled. The dissertation makes four contributions to the state of the art. First is the idea that provenance is a query performed over documentation of a system's past processes. Thus, the problem is one of how to collect and collate documentation from multiple distributed sources and organise it in a manner that enables the provenance of a digital object to be determined. Second is an open, generic, shared, principled data model for documentation of processes, which enables its collation so that it provides high-quality evidence that a system's processes occurred. Once documentation has been created, it is recorded into specialised repositories called provenance stores using a formally specified protocol, which ensures the documentation has high-quality characteristics; furthermore, patterns and techniques are given to permit the distributed deployment of provenance stores. The protocol and patterns are the third contribution. The fourth contribution is a characterisation of the use of documentation of process to answer questions related to the provenance of digital objects, and of the impact recording has on application performance. Specifically, in the context of a bioinformatics case study, it is shown that six different provenance use cases can be answered at an overhead of 13% on experiment run-time. Beyond the case study, the solution has been applied to other applications including fault tolerance in service-oriented systems, aerospace engineering, and organ transplant management.
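
    To make the first contribution concrete, a minimal sketch follows of provenance treated as a query over recorded process documentation. It is an illustration only, not the thesis's actual data model, recording protocol, or API: the names ProcessStep and ProvenanceStore, and the recursive backward walk, are hypothetical stand-ins.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProcessStep:
        """One piece of documentation: an actor produced outputs from inputs."""
        actor: str
        inputs: tuple
        outputs: tuple

    class ProvenanceStore:
        """Repository collating documentation recorded by distributed actors."""
        def __init__(self):
            self.steps = []

        def record(self, step):
            self.steps.append(step)

        def provenance(self, item, seen=None):
            """Provenance as a query: walk documented steps back from `item`."""
            seen = set() if seen is None else seen
            history = []
            for step in self.steps:
                if item in step.outputs and step not in seen:
                    seen.add(step)
                    history.append(step)
                    for source in step.inputs:
                        history.extend(self.provenance(source, seen))
            return history

    store = ProvenanceStore()
    store.record(ProcessStep("sequencer", ("sample",), ("reads",)))
    store.record(ProcessStep("aligner", ("reads", "reference"), ("alignment",)))
    print(store.provenance("alignment"))  # the item's full documented derivation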

    Context of processes: achieving thorough documentation in provenance systems through context awareness

    To fully understand real-world processes, having evidence which is as comprehensive as possible is essential. Comprehensive evidence enables the reviewer to have some confidence that they are aware of the nuances of a past scenario and can act appropriately upon them in the future. There are examples of this throughout everyday life: the outcome of a court case could be affected by the available evidence, or an antique could be considered more valuable if certain facts about its history are known. Similarly, in computer systems, evidence of processes allows users to make more informed decisions than if it were not captured. Where computer-based experimentation has enabled scientists to perform complicated experiments quickly and with ease, understanding the precise circumstances of the process which created a particular set of results is important. Significant recent research has sought to address the problem of understanding the provenance of a data item, that is, the process which led to that data item. Increasingly, these experiments are being performed using systems which are distributed, large-scale and open. Comprehensive evidence in these environments is achieved when both documentation of the actions performed and the circumstances in which they occur are captured. Therefore, in order for a user to achieve confidence in results, we argue the importance of documenting the context of a process. This thesis addresses the problem of how context may be suitably modelled, captured and queried to later answer questions concerning data origin. We begin by defining context as any information describing a scenario which has some bearing on a process's outcome. Based on a number of use cases from a Functional Magnetic Resonance Imaging (fMRI) workflow, we present a model for the representation of context. Our model treats each actor in a process as capable of progressing over a number of finite states as it performs actions. We show that each state can be encoded using a set of monitored variables from an actor's host; each transition between states is therefore a series of variable changes, and this model is shown to be capable of measuring similarity of context when comparing multiple executions of the same process. It also allows us to consider future state changes for actors based on their past execution. We evaluate through the use of our own context capture system, which allows common monitoring tools to be used as an indication of state change, recording context transparently from stakeholders. Our experimental findings suggest our approach to be acceptable both in terms of performance (with an overhead of 4–8% against a non-context-capturing approach) and use-case satisfaction.
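
    The state-based model lends itself to a compact illustration. The sketch below is a hedged reading of the abstract, not the thesis's capture system: the monitored variables, the set-difference view of transitions, and the Jaccard-style similarity over visited states are all illustrative assumptions.

    def state(**monitored):
        """Encode an actor's state as a snapshot of monitored host variables."""
        return frozenset(monitored.items())

    def transitions(execution):
        """Each transition between consecutive states is a set of variable changes."""
        return [after - before for before, after in zip(execution, execution[1:])]

    def similarity(exec_a, exec_b):
        """Compare two executions of the same process by overlap of their states."""
        a, b = set(exec_a), set(exec_b)
        return len(a & b) / len(a | b)

    run1 = [state(cpu="idle", mem_mb=512), state(cpu="busy", mem_mb=900)]
    run2 = [state(cpu="idle", mem_mb=512), state(cpu="busy", mem_mb=2048)]
    print(transitions(run1))       # the variable changes between the two states
    print(similarity(run1, run2))  # 1/3: one shared state of three distinct ones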

    How Does User Behavior Evolve During Exploratory Visual Analysis?

    Exploratory visual analysis (EVA) is an essential stage of the data science pipeline, where users often lack clear analysis goals at the start and iteratively refine them as they learn more about their data. Accurate models of users' exploration behavior are becoming increasingly vital to developing responsive and personalized tools for exploratory visual analysis. Yet we observe a discrepancy between the static view of human exploration behavior adopted by many computational models and the dynamic nature of EVA. In this paper, we explore potential parallels between the evolution of users' interactions with visualization tools during data exploration and assumptions made in popular online learning techniques. Through a series of empirical analyses, we seek to answer the question: how might users' exploration behavior evolve in response to what they have learned from the data during EVA? We present our findings and discuss their implications for the future of user modeling for system design.
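
    One way to picture the parallel drawn here (a generic sketch, not the paper's model) is an online multiplicative-weights update over a user's attribute interests, which drift as the user learns from the data:

    import math

    def update_interests(weights, engaged_attr, eta=0.5):
        """After each interaction, boost the attribute the user engaged with."""
        weights = dict(weights)
        weights[engaged_attr] *= math.exp(eta)
        total = sum(weights.values())
        return {attr: w / total for attr, w in weights.items()}  # renormalise

    # A short exploration session over three attributes of a dataset.
    interests = {"price": 1 / 3, "year": 1 / 3, "rating": 1 / 3}
    for interaction in ["price", "price", "rating"]:
        interests = update_interests(interests, interaction)
    print(interests)  # mass shifts toward 'price' as the session evolves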

    Optimisation of the enactment of fine-grained distributed data-intensive workflows

    The emergence of data-intensive science as the fourth science paradigm has posed a data-deluge challenge for enacting scientific workflows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, while also dealing with the heterogeneity and complexity of data, applications and execution environments. New scientific workflows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such workflows requires not only larger storage space and faster machines, but also the capability to support scalability and diversity of the users, applications, data, computing resources and enactment technologies. We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a workflow language. The workflow language should be both human-readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation: data-flow between computational elements in the scientific workflow is implemented as streams. To cope with the exploratory nature of scientific workflows, the architecture should support fast workflow prototyping and the re-use of workflows and workflow components. Above all, the enactment process should be easily repeated and automated. In this thesis, we present a candidate data-intensive architecture that includes an intermediate workflow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise these data systematically. We propose a new enactment strategy to demonstrate that optimisation of data-streaming workflows can be automated by exploiting performance data gathered during previous enactments.
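
    A rough sketch of the data-streaming enactment model and its fine-grained measurement follows. It is illustrative only: DISPEL is a workflow language, whereas this generic pipeline merely mimics the idea with Python generators, and the measured wrapper and performance_db below stand in for the thesis's measurement framework and performance database.

    import time

    performance_db = []  # stand-in for a systematically organised performance database

    def measured(name, element):
        """Wrap a processing element so the time to produce each item is recorded."""
        def wrapper(stream):
            it = element(stream)
            while True:
                start = time.perf_counter()
                try:
                    item = next(it)
                except StopIteration:
                    return
                performance_db.append((name, time.perf_counter() - start))
                yield item
        return wrapper

    def read(stream):       # source element: emits raw records
        yield from stream

    def clean(stream):      # transform element: filters out bad records
        for record in stream:
            if record >= 0:
                yield record

    def aggregate(stream):  # sink element: folds the whole stream into one value
        yield sum(stream)

    # Enact the pipeline: data flows between the elements as streams.
    stream = iter([3, -1, 4, 1, -5, 9])
    for name, element in [("read", read), ("clean", clean), ("aggregate", aggregate)]:
        stream = measured(name, element)(stream)
    print(list(stream))     # [17]
    print(performance_db)   # per-element timings gathered during the enactment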