
    Fine-Grained Provenance And Applications To Data Analytics Computation

    Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but track provenance only at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks over data types such as strings and images. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, enabling future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, provenance collected over time, or within a single query computation, tends to be partially repeated (i.e., the same operation applied to the same input records in an intermediate computation step), so we want provenance storage that compresses repeated results. We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
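
    To make the storage idea concrete, here is a minimal, hypothetical Python sketch (not PROVision's actual code) of how repeated provenance records can be deduplicated and made tamper-evident with cryptographic hashes; the ProvStore class and the record layout are invented for illustration.

        import hashlib
        import json

        class ProvStore:
            """Toy content-addressed store: identical provenance records
            share one entry, and tampering with stored bytes is detectable
            because the record's key is a hash of its content."""

            def __init__(self):
                self.records = {}  # sha256 hex digest -> canonical bytes

            def put(self, record: dict) -> str:
                # Canonical JSON so logically equal records hash identically.
                blob = json.dumps(record, sort_keys=True).encode("utf-8")
                key = hashlib.sha256(blob).hexdigest()
                # Repeated records are stored once: this is the compression.
                self.records.setdefault(key, blob)
                return key

            def verify(self, key: str) -> bool:
                # Tamper check: re-hash the stored bytes and compare to the key.
                return hashlib.sha256(self.records[key]).hexdigest() == key

        store = ProvStore()
        k1 = store.put({"op": "join", "inputs": ["r17", "s42"], "output": "t3"})
        k2 = store.put({"op": "join", "inputs": ["r17", "s42"], "output": "t3"})
        assert k1 == k2 and len(store.records) == 1  # duplicate stored once
        assert store.verify(k1)

    Because the key is derived from the content, deduplication and tamper evidence come from the same mechanism, which is the property the archival layer described above relies on.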

    AVENTIS - An architecture for event data analysis

    Time-stamped event data is being generated at an exponential rate from various sources (sensor networks, e-markets, etc.), stored in event logs, and made available to researchers. Despite this data deluge and the evolution of a plethora of tools and technologies, the science behind exploratory analysis and knowledge discovery lags. There are several reasons for this. In conducting event data analysis, researchers typically detect a pattern or trend in the data through computation of time-series measures and apply the computed measures to several mathematical models to glean information from the data. This is a complex and time-consuming process covering a range of activities, from data capture (from a broad array of data sources) to interpretation and dissemination of experimental results, forming a pipeline of activities. Further, data analysis is conducted by domain users, who are typically non-IT experts, whereas data processing tools and applications are largely developed by application developers. End users not only lack the critical skills to build a structured analysis pipeline but are also perplexed by the number of different ways available to derive the necessary information. Consequently, this thesis proposes AVENTIS (Architecture for eVENT Data analysIS), a novel framework to guide the design of analytic solutions that facilitate time-series analysis of event data, tailored to the needs of domain users. The framework comprises three components: a knowledge base, a model-driven analytic methodology, and an accompanying software architecture that provides the necessary technical and operational requirements. The research contribution lies, first, in the ability of the framework to express analysis requirements at a level of abstraction consistent with the domain users and to readily make available the information sought without the users having to build the analysis process themselves. Second, the framework provides an abstract design space in which domain experts can build conceptual models of their experiments as sequences of structured tasks in a technology-neutral manner, and transparently translates these abstract process models into executable implementations, as sketched below. To evaluate the AVENTIS framework, a prototype based on AVENTIS was implemented and tested with case studies taken from the financial research domain.
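
    As a rough illustration of the model-driven idea (not AVENTIS itself), the following hypothetical Python sketch shows how an abstract, technology-neutral task sequence might be translated into an executable pipeline; the TASKS registry, the task names, and the toy financial data are all invented.

        from typing import Any, Callable, Dict, List

        # Hypothetical registry mapping abstract task names (what a domain
        # user writes) to concrete implementations (what developers supply).
        TASKS: Dict[str, Callable[[Any], Any]] = {
            "load_trades":   lambda _: [("AAPL", 101.2), ("AAPL", 99.8), ("AAPL", 103.1)],
            "to_timeseries": lambda rows: [price for _, price in rows],
            "volatility":    lambda xs: (sum((x - sum(xs) / len(xs)) ** 2
                                             for x in xs) / len(xs)) ** 0.5,
        }

        def run_pipeline(model: List[str], seed: Any = None) -> Any:
            """Translate an abstract task sequence into an execution by
            chaining the registered implementations in order."""
            value = seed
            for task in model:
                value = TASKS[task](value)
            return value

        # A conceptual model of an experiment, written as structured tasks
        # with no reference to the underlying technology.
        print(run_pipeline(["load_trades", "to_timeseries", "volatility"]))

    The domain user's "conceptual model" here is just the list of task names; swapping an implementation in the registry changes the execution without changing the model, which is the separation the framework aims for.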

    Modern Systems for Large-scale Genomics Data Analysis in the Cloud

    Genomics researchers increasingly turn to cloud computing as a means of accomplishing large-scale analyses efficiently and cost-effectively. Successful operation in the cloud requires careful instrumentation and management to avoid common pitfalls, such as resource bottlenecks and low utilisation, that can both drive up costs and extend the timeline of a scientific project. We developed the Butler framework for large-scale scientific workflow management in the cloud to meet these challenges. The cornerstones of Butler's design are support for multiple clouds, declarative infrastructure configuration management, scalable and fault-tolerant operation, comprehensive resource monitoring, and automated error detection and recovery. Butler relies on industry-strength open-source components to deliver a framework that is robust and scalable to thousands of compute cores and millions of workflow executions. Butler's error detection and self-healing capabilities are unique among scientific workflow frameworks and ensure that analyses are carried out with minimal human intervention. Butler has been used to analyse over 725TB of DNA sequencing data on the cloud, using 1500 CPU cores and 6TB of RAM, delivering results with 43% increased efficiency compared to other tools. The flexible design of this framework allows easy adoption within other fields of the life sciences and ensures that it will scale together with the demand for scientific analysis in the cloud for years to come. Because many bioinformatics tools were developed in the context of small sample sizes, they often struggle to keep up with the demands of the large-scale data processing required by modern research and clinical sequencing projects. The Rheos software system is designed specifically with these large data sets in mind. Utilising the elastic compute capacity of modern academic and commercial clouds, Rheos takes a service-oriented, containerised approach to the implementation of modern bioinformatics algorithms, which allows the software to achieve the scalability and ease of use required to succeed under the increased operational load of massive data sets generated by projects like International Cancer Genome Consortium (ICGC) ARGO and the All of Us initiative. Rheos algorithms are based on an innovative stream-based approach to processing genomic data, which enables Rheos to make faster decisions about the presence of genomic mutations that drive diseases such as cancer, improving the tool's efficacy and relevance to clinical sequencing applications. Our testing of the novel germline Single Nucleotide Polymorphism (SNP) and deletion variant calling algorithms developed within Rheos indicates that Rheos achieves ~98% accuracy in SNP calling and ~85% accuracy in deletion calling, comparable with other leading tools such as the Genome Analysis Toolkit (GATK), freebayes, and Delly. The two frameworks we developed make important contributions to the ever-growing need for large-scale genomic data analysis in the cloud: Butler enables more effective use of existing tools, while Rheos provides a new, more dynamic, real-time approach to genomic analysis.
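
    As a rough illustration of the stream-based idea (not Rheos's actual algorithm), the following hypothetical Python sketch emits a SNP call as soon as per-position evidence crosses a threshold, rather than waiting for the full dataset; the thresholds, read format, and reference string are invented.

        from collections import Counter, defaultdict

        def stream_snp_calls(read_stream, reference, min_depth=4, min_frac=0.8):
            """Toy streaming caller: update per-position base counts as
            aligned bases arrive, and emit a SNP the moment the evidence
            threshold is met instead of after a full batch pass."""
            pileup = defaultdict(Counter)
            called = set()
            for pos, base in read_stream:          # one aligned base at a time
                pileup[pos][base] += 1
                depth = sum(pileup[pos].values())
                allele, count = pileup[pos].most_common(1)[0]
                if (pos not in called and depth >= min_depth
                        and allele != reference[pos]
                        and count / depth >= min_frac):
                    called.add(pos)
                    yield pos, reference[pos], allele   # early SNP decision

        reference = "ACGTACGT"
        reads = [(3, "A")] * 5 + [(1, "C")] * 3   # pos 3 shows T->A evidence
        for pos, ref, alt in stream_snp_calls(iter(reads), reference):
            print(f"SNP at {pos}: {ref}->{alt}")   # fires after the 4th read

    The call at position 3 is emitted after only four supporting reads, which is the kind of early decision a streaming design makes possible.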

    Reconstructing Data Provenance from Log Files

    Data provenance describes the derivation history of data, capturing details such as the entities involved and the relationships between them. Knowledge of data provenance can be used to address issues such as data quality assurance, data auditing, and system security. However, current computer systems are usually not equipped with means to acquire data provenance, and modifying underlying systems or introducing new monitoring software for provenance logging may be too invasive for production systems. As a result, data provenance may not always be available. This thesis investigates the completeness and correctness of data provenance reconstructed from log files with respect to the actual derivation history. To accomplish this, we designed and tested a solution that first extracts and models information from log files into provenance relations and then reconstructs the data provenance from those relations. The reconstructed output is then evaluated against the ground-truth provenance. The thesis also details the methodology used to construct a dataset for provenance reconstruction research. Experimental results revealed that data provenance completely capturing the ground truth can be reconstructed from system-layer log files. However, the outputs are susceptible to errors generated during event logging and errors induced by program dependencies. Results also show that using log files of different granularities collected from the system can help resolve the logging errors described. Experiments with removing suspected program dependencies using approaches such as blacklisting and clustering (see the sketch below) have shown that the number of errors can be reduced by a factor of one hundred. Conclusions drawn from this research contribute towards the work on using reconstruction as an alternative approach for acquiring data provenance from computer systems.
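
    A minimal, hypothetical sketch of the reconstruction idea (not the thesis's actual pipeline): extract process/file relations from system-layer log lines, drop blacklisted paths that reflect program dependencies rather than data flow, and derive file-to-file provenance edges. The log format, paths, and blacklist are invented for illustration.

        import re
        from collections import defaultdict

        # Hypothetical system-layer log format: "<pid> <operation> <path>".
        LOG = """\
        101 open /data/raw.csv
        101 write /data/clean.csv
        202 open /data/clean.csv
        202 write /report/out.pdf
        303 open /etc/ld.so.cache
        """

        # Suspected program dependency (e.g., loader cache), not data flow.
        BLACKLIST = {"/etc/ld.so.cache"}

        def reconstruct(log_text):
            """Model log lines as (process, operation, file) relations,
            then derive file-to-file provenance edges per process."""
            reads, writes = defaultdict(set), defaultdict(set)
            for line in log_text.splitlines():
                pid, op, path = re.match(r"\s*(\d+) (\w+) (\S+)", line).groups()
                if path in BLACKLIST:
                    continue  # blacklisting removes dependency-induced edges
                (reads if op == "open" else writes)[pid].add(path)
            # A written file is derived from every file its process read.
            return {(src, dst) for pid in writes
                    for dst in writes[pid] for src in reads[pid]}

        print(sorted(reconstruct(LOG)))
        # [('/data/clean.csv', '/report/out.pdf'),
        #  ('/data/raw.csv', '/data/clean.csv')]

    Without the blacklist, process 303's loader-cache read would add spurious ancestry; filtering it out is a one-line stand-in for the blacklisting and clustering strategies evaluated in the thesis.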