
    Estimating the Quality of Data Using Provenance: A Case Study in eScience

    Data quality assessment is a key factor in data-intensive domains. The data deluge is aggravated by an increasing need for interoperability and cooperation across groups and organizations. New alternatives must be found to select the data that best satisfy users' needs in a given context. This paper presents a strategy to provide information that supports the evaluation of the quality of data sets. The strategy combines metadata on the provenance of a data set (derived from the workflows that generate it) with quality dimensions defined by the set's users for the desired context of use. Our solution, validated via a case study, takes advantage of a semantic model to preserve data provenance related to applications in a specific domain. © 2013 by the AIS/ICIS Administrative Office. All rights reserved.
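    To make the idea concrete, here is a minimal Python sketch of scoring a data set by combining provenance metadata with user-weighted quality dimensions. All names, fields, and formulas below are illustrative assumptions, not the paper's actual semantic model:

        from dataclasses import dataclass
        from datetime import datetime, timezone

        @dataclass
        class ProvenanceRecord:
            # Metadata captured from the workflow run that produced the data set
            # (hypothetical fields, standing in for the paper's semantic model).
            agent: str               # who or what generated the data
            generated_at: datetime   # when it was generated
            source_trust: float      # 0..1, trust placed in the source

        def timeliness(rec: ProvenanceRecord) -> float:
            """Fresher data scores closer to 1.0 (illustrative linear decay)."""
            age_days = (datetime.now(timezone.utc) - rec.generated_at).days
            return max(0.0, 1.0 - age_days / 730.0)  # two-year horizon, arbitrary

        def believability(rec: ProvenanceRecord) -> float:
            return rec.source_trust

        # Quality dimensions and weights chosen by the user for a given context of use.
        DIMENSIONS = {"timeliness": (timeliness, 0.4), "believability": (believability, 0.6)}

        def quality_score(rec: ProvenanceRecord) -> float:
            return sum(w * f(rec) for f, w in DIMENSIONS.values())

        rec = ProvenanceRecord("kepler-workflow-17", datetime(2012, 6, 1, tzinfo=timezone.utc), 0.8)
        print(round(quality_score(rec), 3))

    Changing the weights re-ranks the same data sets for a different context of use, which is the context-dependence the paper emphasizes.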

    Distributed Management of Grid-based Scientific Workflows

    Grids and service-oriented technologies are emerging as dominant approaches for distributed systems. With the evolution of these technologies, scientific workflows have been introduced as a tool for scientists to assemble highly specialized applications and to exchange large heterogeneous datasets in order to automate and accelerate complex scientific tasks. Several Scientific Workflow Management Systems (SWfMS) have been designed to support the specification, execution, and monitoring of scientific workflows, yet they still face key challenges from two perspectives: system usability and system efficiency. From the usability perspective, current SWfMS are not simple enough for scientists with limited IT knowledge, and there is no easy mechanism by which scientists can share and reuse scientific experiments that others have already designed and validated. From the efficiency perspective, existing SWfMS coordinate and execute workflows in a centralized fashion using a single scheduler and/or workflow enactor. This creates a single point of failure, forms a scalability bottleneck, and enforces centralized fault handling. In addition, they do not consider load balancing when mapping abstract jobs onto computational nodes. A further challenge stems from the common nature of scientific workflow applications, which need to exchange huge amounts of data during execution. Some SWfMS use a mediator-based approach in which all data must first be transferred to a centralized data manager, which is inefficient. Others apply a peer-to-peer approach via data references; even this is insufficient for scientific workflows, since a single complex scientific activity can produce an extensive amount of data. In this thesis, we introduce the SWIMS (Scientific Workflow Integration and Management System) framework, which employs Web Services technology to build a distributed management system for data-intensive scientific workflows. SWIMS aims to overcome these challenges through a set of salient features: i) distributed execution and management of workflows, ii) reduced communication traffic, iii) smart re-run, iv) distributed fault handling and load balancing, v) ease of use, and vi) extensive sharing of scientific workflows. We discuss the motivation, design, and implementation of the SWIMS framework, and evaluate it through the Montage application from the astronomy domain.
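    A toy Python sketch of the peer-to-peer, reference-passing idea the thesis contrasts with mediator-based transfer; the classes and registry here are illustrative stand-ins, not SWIMS's actual Web Services interfaces:

        class Node:
            """A workflow node that stores its own outputs and hands out references."""
            def __init__(self, name: str):
                self.name = name
                self.store = {}  # local output store: key -> payload

            def produce(self, key: str, payload: bytes) -> dict:
                self.store[key] = payload
                # Downstream nodes receive only a small reference, not the payload.
                return {"node": self.name, "key": key}

            def fetch(self, ref: dict, registry: dict) -> bytes:
                # Resolve the reference directly against the producing node,
                # bypassing any centralized data manager.
                return registry[ref["node"]].store[ref["key"]]

        registry = {}
        a, b = Node("A"), Node("B")
        registry["A"], registry["B"] = a, b
        ref = a.produce("mosaic-tile-1", b"...large FITS payload...")
        print(len(b.fetch(ref, registry)))  # B pulls the data straight from A

    Only the small reference dictionary travels through the workflow coordination layer; the bulk payload moves once, directly between the peers that produce and consume it.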

    Conceptual Framework and Methodology for Analysing Previous Molecular Docking Results

    Modern drug discovery relies on in-silico computational simulations such as molecular docking. Molecular docking models biochemical interactions to predict where and how two molecules would bind. The results of large-scale molecular docking simulations can provide valuable insight into the relationship between two molecules, which is useful to a biomedical scientist before conducting in-vitro or in-vivo wet-lab experiments. Although this field has seen great advancements, feedback from biomedical scientists shows that there is a need for storage and further analysis of molecular docking results. To meet this need, biomedical scientists need access to computing, data, and network resources, and require specific knowledge or skills they might lack. Therefore, a conceptual framework specifically tailored to enable biomedical scientists to reuse molecular docking results, together with a methodology that uses regular input from scientists, has been proposed. The framework is composed of 5 types of elements and 13 interfaces. The methodology is lightweight and relies on frequent communication between biomedical science and computer science experts, specified by particular roles. It shows how developers can benefit from using the framework, which allows them to determine whether a scenario fits the framework, whether an already implemented element can be reused, or whether a newly proposed tool can be used as an element. Three scenarios that show the versatility of this new framework and the methodology based on it have been identified and implemented. A methodical planning and design approach was used, and it was shown that the implementations are at least as usable as existing solutions. To eliminate the need for access to expensive computing infrastructure, state-of-the-art cloud computing techniques are used. The implementations enable faster identification of new molecules for use in docking, direct querying of existing databases, and simpler learning of good molecular docking practice without the need to manually run multiple tools. Thus, the framework and methodology enable more user-friendly implementations and less error-prone use of computational methods in drug discovery. Their use could lead to more effective discovery of new drugs.
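    As an illustration of how a framework element might look behind one of its interfaces, here is a small Python sketch; the base class, method names, and scoring convention are hypothetical, not taken from the thesis:

        from abc import ABC, abstractmethod

        class Element(ABC):
            """Hypothetical base type for a framework element."""
            @abstractmethod
            def run(self, inputs: dict) -> dict: ...

        class DockingResultStore(Element):
            """Illustrative storage element: indexes past docking scores by ligand id."""
            def __init__(self):
                self.scores = {}

            def run(self, inputs: dict) -> dict:
                if "store" in inputs:
                    ligand, score = inputs["store"]
                    self.scores[ligand] = score
                # Lower (more negative) docking energy is better.
                return {"best": min(self.scores.items(), key=lambda kv: kv[1], default=None)}

        store = DockingResultStore()
        store.run({"store": ("ZINC000123", -9.4)})
        store.run({"store": ("ZINC000456", -7.1)})
        print(store.run({})["best"])  # ('ZINC000123', -9.4)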

    A knowledge-based approach to scientific workflow composition

    Scientific Workflow Systems have been developed as a means to enable scientists to carry out complex analysis operations on local and remote data sources in order to achieve their research goals. Such systems typically provide a large number of components and facilities to support this analysis and have matured to a point where they offer many complex capabilities. This complexity makes it difficult for scientists working with these systems to readily achieve their goals. In this thesis we describe the increasing burden of knowledge required of these scientists in order for them to specify the outcomes they wish to achieve within the workflow systems. We consider ways in which the challenges presented by these systems can be reduced, focusing on the following questions: How can metadata describing the available resources assist users in composing workflows? Can automated assistance be provided to guide users through the composition process? Can such an approach be implemented so as to work with the resources provided by existing Scientific Workflow Systems? We have developed a new approach to workflow composition which makes use of: an ontology for recording metadata about workflow components; a set of algorithms that analyse the state of a workflow composition and, based on this metadata, suggest how to progress; an API that enables both the algorithms and the metadata to utilise the resources provided by existing Scientific Workflow Systems; and a prototype user interface that demonstrates how the proposed approach to workflow composition can work in practice. We evaluate the system to show that the approach is valid and capable of reducing some of the difficulties presented by existing systems, but that limitations exist regarding the complexity of the workflows which can be composed, and regarding the challenge of initially populating the metadata ontology.
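    A simplified Python sketch of the kind of suggestion step such algorithms perform: rank candidate components by how well the current workflow's outputs satisfy their declared inputs. The flat dictionaries below stand in for the thesis's ontology-backed metadata:

        def suggest_next(available_outputs: set, components: list) -> list:
            """Rank components by the fraction of their inputs already satisfiable."""
            ranked = []
            for comp in components:
                satisfied = [p for p in comp["inputs"] if p in available_outputs]
                if satisfied:
                    ranked.append((len(satisfied) / len(comp["inputs"]), comp["name"]))
            return [name for _, name in sorted(ranked, reverse=True)]

        # Hypothetical component metadata; the thesis records this in an ontology.
        components = [
            {"name": "AlignSequences", "inputs": ["FASTA"]},
            {"name": "BuildTree", "inputs": ["Alignment", "Model"]},
        ]
        print(suggest_next({"FASTA", "Model"}, components))  # ['AlignSequences', 'BuildTree']

    An ontology-driven matcher would additionally treat subclass and equivalence relations between data types as matches, rather than requiring exact string equality as this sketch does.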

    Mining Taverna's semantic web of provenance

    Taverna is a workflow workbench developed as part of the UK's myGrid project. Taverna's provenance model captures both internal provenance generated locally in Taverna and external provenance gathered from third-party data providers. The model also supports overlaying secondary provenance on top of the primary logs and lineage, a design motivated by the particular properties of the bioinformatics data and services used in Taverna. A Semantic Web of provenance, Ouzo, is built to combine these different kinds of provenance by means of semantic annotations. This paper shows how Ouzo can be mined by a provenance usage component, Provenance Query and Answer (ProQA). ProQA supports provenance retrieval as well as provenance abstraction, aggregation, and semantic reasoning. It is implemented as a suite of APIs which can be deployed as provenance services and composed into provenance workflows that analyse experiment results using the provenance records. We show how these features of Taverna's provenance support enable us to answer the questions from the Provenance Challenge workshop and a set of additional provenance queries.
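    A minimal sketch of a lineage query of the kind ProQA answers, written in Python with rdflib; the vocabulary is a made-up example namespace, not Ouzo's actual schema or ProQA's API:

        from rdflib import Graph, Literal, Namespace

        EX = Namespace("http://example.org/prov#")  # hypothetical vocabulary
        g = Graph()
        # Primary lineage recorded from a workflow run.
        g.add((EX.result1, EX.wasDerivedFrom, EX.blastOutput))
        g.add((EX.blastOutput, EX.wasDerivedFrom, EX.inputSeq))
        # Secondary (semantic) provenance overlaid on the primary log.
        g.add((EX.blastOutput, EX.annotatedAs, Literal("significant hit")))

        # Transitive lineage: everything result1 directly or indirectly derives from.
        q = """
        PREFIX ex: <http://example.org/prov#>
        SELECT ?src WHERE { ex:result1 ex:wasDerivedFrom+ ?src }
        """
        for row in g.query(q):
            print(row.src)  # ex:blastOutput, then ex:inputSeq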