
    Enhancing Workflow with a Semantic Description of Scientific Intent

    Peer reviewed. Preprint.

    Publishing Linked Data - There is no One-Size-Fits-All Formula

    Publishing Linked Data is a process that involves several design decisions and technologies. Although some initial guidelines have already been provided by Linked Data publishers, these are still far from covering all the steps that are necessary (from data source selection to publication) or from giving enough detail about these steps, technologies, intermediate products, etc. Furthermore, given the variety of data sources from which Linked Data can be generated, we believe that it is possible to have a single and unified method for publishing Linked Data, but we should rely on different techniques, technologies and tools for particular datasets of a given domain. In this paper we present a general method for publishing Linked Data and the application of the method to cover different sources from different domains.
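
    To make concrete the kind of step such a method covers, here is a minimal, hypothetical sketch (not taken from the paper) of one common stage of a Linked Data publishing pipeline: converting a tabular record into RDF with Python's rdflib. The base URI, class and record are invented for illustration.

        # Hypothetical sketch: one record from a tabular source becomes RDF.
        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF, RDFS

        EX = Namespace("http://example.org/dataset/")  # invented base URI

        g = Graph()
        g.bind("ex", EX)

        record = {"id": "station-42", "label": "Weather station 42"}  # source row
        subject = EX[record["id"]]

        g.add((subject, RDF.type, EX.Station))  # invented class
        g.add((subject, RDFS.label, Literal(record["label"])))

        # Serialize for publication behind a dereferenceable URI.
        print(g.serialize(format="turtle"))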

    A Semantic Workflow Mechanism to Realize Experimental Goals and Constraints

    Postprint.

    PAV ontology: provenance, authoring and versioning

    Provenance is a critical ingredient for establishing trust of published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supportive evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose, and they allow and encourage extensions to cover more specific needs. We identify the specific need for identifying and distinguishing between the various roles assumed by agents manipulating digital artifacts, such as author, contributor and curator. We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing just enough descriptions essential for tracking the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations, in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV, illustrating their usage through concrete examples. Moreover, we present mappings that show how PAV extends the PROV-O ontology to support broader interoperability. The authors strived to keep PAV lightweight and compact by including only those terms that have been demonstrated to be pragmatically useful in existing applications, and by recommending terms from existing ontologies when plausible. We analyze and compare PAV with related approaches, namely the Provenance Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their differences with PAV, outlining strengths and weaknesses of our proposed model. We specify SKOS mappings that align PAV with DC Terms.
    Comment: 22 pages (incl. 5 tables and 19 figures). Submitted to Journal of Biomedical Semantics 2013-04-26 (#1858276535979415). Revised article submitted 2013-08-30. Second revised article submitted 2013-10-06. Accepted 2013-10-07. Author proofs sent 2013-10-09 and 2013-10-16. Published 2013-11-22. Final version 2013-12-06. http://www.jbiomedsem.com/content/4/1/3
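
    For illustration, a small Python/rdflib sketch of how PAV's role distinctions might be applied to a versioned dataset. The property names come from the published PAV namespace (http://purl.org/pav/); the dataset and agent URIs are made up.

        # PAV terms are real (http://purl.org/pav/); all other URIs are invented.
        from rdflib import Graph, Literal, Namespace, URIRef

        PAV = Namespace("http://purl.org/pav/")
        EX = Namespace("http://example.org/")

        g = Graph()
        g.bind("pav", PAV)

        doc = URIRef(EX["dataset/v2"])
        g.add((doc, PAV.authoredBy, EX["people/alice"]))   # intellectual content
        g.add((doc, PAV.curatedBy, EX["people/bob"]))      # maintenance and oversight
        g.add((doc, PAV.createdBy, EX["people/carol"]))    # this digital representation
        g.add((doc, PAV.version, Literal("2.0")))
        g.add((doc, PAV.previousVersion, EX["dataset/v1"]))
        g.add((doc, PAV.retrievedFrom, URIRef("http://example.org/source.csv")))

        print(g.serialize(format="turtle"))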

    Curated Reasoning by Formal Modeling of Provenance

    The core problem addressed in this research is the current lack of an ability to repurpose and curate scientific data among interdisciplinary scientists within a research enterprise environment. Explosive growth in sensor technology, together with the cost of collecting ocean data and airborne measurements, has led to exponential increases in scientific data collection and in the enterprise resources required for it. There is currently no framework for efficiently curating this scientific data for repurposing or intergenerational use. Several factors explain why this problem has eluded solution to date: the competitive requirements for funding and publication; the multiple vocabularies used across scientific disciplines; the number of disciplines and the variation among their workflow processes; the lack of a framework flexible enough to accommodate diverse vocabularies and data while still unifying their exploitation; and, until recently, a lack of affordable computing resources. Enabling the sharing of scientific data among interdisciplinary scientists is an exceptionally challenging problem: varying vocabularies must be combined, the provenance of the associated data must be maintained, any additional workload on the originating scientist's project and time must be minimized, publication and credit must be protected to reward scientific creativity, and priority must be secured for a long-term goal, intergenerational and interdisciplinary curation, that likely offers the greatest potential for high-impact discoveries in the future. This research focuses on the core technical problem of formally modeling interdisciplinary scientific data provenance as the enabling, and so far missing, component for demonstrating the potential of repurposing interdisciplinary scientific data. It develops a framework that combines varying vocabularies in a formal manner, allowing the provenance information to be used as a key for reasoning and thus for manageable curation. The consequence of this research is a pioneering approach to formally modeling provenance within an interdisciplinary research enterprise, demonstrating that intergenerational curation can be aided at the machine level so that reasoning and repurposing occur with minimal impact on data collectors and maximum benefit to other scientists.
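
    As a rough, hypothetical sketch of the vocabulary-combination idea (not the dissertation's actual framework), the following Python/rdflib snippet declares a SKOS mapping between two invented disciplinary terms and resolves either term to its equivalents.

        # Invented vocabularies for two disciplines, bridged by one SKOS mapping.
        from rdflib import Graph, Namespace, URIRef
        from rdflib.namespace import SKOS

        OCEAN = Namespace("http://example.org/ocean#")
        AIR = Namespace("http://example.org/airborne#")

        g = Graph()
        g.add((OCEAN.SeaSurfaceTemp, SKOS.exactMatch, AIR.SST))

        def aligned_terms(term: URIRef):
            """Yield `term` plus every term equivalent to it under the mappings."""
            yield term
            yield from g.objects(term, SKOS.exactMatch)
            yield from g.subjects(SKOS.exactMatch, term)

        # Either community's term now resolves to the same set of concepts.
        print(list(aligned_terms(OCEAN.SeaSurfaceTemp)))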

    Big Data Analytics in Static and Streaming Provenance

    Thesis (Ph.D.), Indiana University, Informatics and Computing, 2016. With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view on over-time behavior under study. However, provenance can be overwhelming in both volume and complexity, and the forecasting potential that provenance now offers creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data at high receiving rates. While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that can reduce the high dimensionality while effectively supporting mining tasks such as clustering, classification and association rule mining; this temporal representation can also be applied to streaming provenance. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.
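
    The forward-query idea can be sketched in a few lines of Python: when the query target is known before the application starts, provenance edges can be filtered as they stream past instead of being stored for later querying. The edge format and event data below are invented, and the sketch assumes edges arrive in causal order.

        # Toy forward-query filter; edge format and events are invented.
        from typing import Iterable, Iterator, Tuple

        Edge = Tuple[str, str, str]  # (source node, relation, target node)

        def forward_filter(stream: Iterable[Edge], root: str) -> Iterator[Edge]:
            """Keep only edges reachable from `root`, assuming causal arrival order."""
            reached = {root}
            for src, rel, dst in stream:
                if src in reached:
                    reached.add(dst)
                    yield (src, rel, dst)

        events = [
            ("sensor1", "generated", "reading42"),
            ("other", "generated", "noise"),
            ("reading42", "usedBy", "forecast7"),
        ]
        print(list(forward_filter(events, "sensor1")))  # keeps 2 of the 3 edges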

    A new approach for publishing workflows: abstractions, standards, and linked data

    In recent years, a variety of systems have been developed that export the workflows used to analyze data and make them part of published articles. We argue that the workflows published in current approaches are dependent on the specific codes used for execution, the specific workflow system used, and the specific workflow catalogs where they are published. In this paper, we describe a new approach that addresses these shortcomings and makes workflows more reusable through: 1) the use of abstract workflows to complement executable workflows, making them reusable when the execution environment is different; 2) the publication of both abstract and executable workflows using standards, such as the Open Provenance Model, that can be imported by other workflow systems; 3) the publication of workflows as Linked Data, resulting in open, web-accessible workflow repositories. We illustrate this approach using a complex workflow that we re-created from an influential publication that describes the generation of 'drugomes'.
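
    A hypothetical sketch of the publication model in Python/rdflib: an abstract workflow and an executable workflow are published as Linked Data and linked to each other, so that a different system can reuse the abstract version. All URIs and the linking property are invented; the paper itself relies on Open Provenance Model-based vocabularies.

        # All URIs and the linking property below are invented for illustration.
        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF, RDFS

        WF = Namespace("http://example.org/workflows#")

        g = Graph()
        g.bind("wf", WF)

        g.add((WF.drugome_abstract, RDF.type, WF.AbstractWorkflow))
        g.add((WF.drugome_run1, RDF.type, WF.ExecutableWorkflow))
        # Tie the concrete run to its reusable, system-independent abstraction.
        g.add((WF.drugome_run1, WF.implementsTemplate, WF.drugome_abstract))
        g.add((WF.drugome_run1, RDFS.comment,
               Literal("Bound to specific codes and an execution environment")))

        print(g.serialize(format="turtle"))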

    Provenance in bioinformatics workflows

    In this work, we used the PROV-DM model to manage data provenance in workflows of genome projects. This provenance model allows the storage of details of a workflow execution, e.g., raw and produced data, and the computational tools used, with their versions and parameters. Using this model, biologists can access details of one particular execution of a workflow, compare results produced by different executions, and plan new experiments more efficiently. In addition, a provenance simulator was created, which facilitates the inclusion of provenance data from a genome project workflow execution. Finally, we discuss a case study, which aims to identify genes involved in specific metabolic pathways of Bacillus cereus, as well as to compare this isolate with other phylogenetically related bacteria from the Bacillus group. B. cereus is an extremophilic bacterium, collected from warm water in the Midwestern Region of Brazil; its DNA samples were sequenced with an NGS machine.
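
    As a small illustration (assumed setup, not the authors' system), the Python `prov` package implements PROV-DM and can record one such workflow step: the raw reads consumed, an assembly activity with tool version and parameters, and the contigs produced. All identifiers below are hypothetical.

        # Assumed names throughout; `prov` is a Python PROV-DM implementation.
        from prov.model import ProvDocument

        doc = ProvDocument()
        doc.add_namespace("ex", "http://example.org/genome/")

        reads = doc.entity("ex:raw_reads")    # raw sequencing data
        contigs = doc.entity("ex:contigs")    # produced data
        assembly = doc.activity("ex:assembly_run_1", other_attributes={
            "ex:tool": "assembler-x",         # hypothetical tool
            "ex:version": "1.4.2",
            "ex:parameters": "--kmer 31",
        })

        doc.used(assembly, reads)
        doc.wasGeneratedBy(contigs, assembly)

        print(doc.get_provn())  # PROV-N view of this one execution record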