7 research outputs found

    Decibel: the relational dataset branching system

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.
    National Science Foundation (U.S.) (1513972); National Science Foundation (U.S.) (1513407); National Science Foundation (U.S.) (1513443); Intel Science and Technology Center for Big Dat
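The contrast drawn in this abstract, between full copies per analysis stage and branches inside one store, can be illustrated with a small sketch. It is a toy model under stated assumptions, not one of the three storage engine designs evaluated in the paper: the class `VersionedTable`, its method names, and the branch name `master` are all hypothetical. Records are stored once in an append-only list, and each branch holds only a mapping from primary key to record position.

```python
class VersionedTable:
    """Toy versioned table (hypothetical, not Decibel's design).

    Records live once in a shared append-only store; a branch is only
    a mapping from primary key to a position in that store.
    """

    def __init__(self):
        self._records = []               # shared, append-only record store
        self._branches = {"master": {}}  # branch name -> {key: record position}

    def branch(self, parent, name):
        # Copy the small key -> position index, never the records themselves.
        self._branches[name] = dict(self._branches[parent])

    def upsert(self, branch, key, row):
        # Write a new record and point this branch's key at it.
        self._records.append(row)
        self._branches[branch][key] = len(self._records) - 1

    def delete(self, branch, key):
        # Drop the key from this branch only; other branches are unaffected.
        self._branches[branch].pop(key, None)

    def scan(self, branch):
        # Materialize the branch as {key: row}.
        return {k: self._records[i] for k, i in self._branches[branch].items()}

    def diff(self, a, b):
        # Keys that point at different stored records in the two branches.
        ia, ib = self._branches[a], self._branches[b]
        return sorted(k for k in ia.keys() | ib.keys() if ia.get(k) != ib.get(k))

    def stored_records(self):
        return len(self._records)


table = VersionedTable()
table.upsert("master", 1, {"name": "a"})
table.upsert("master", 2, {"name": "b"})
table.branch("master", "cleaning")
table.upsert("cleaning", 2, {"name": "B"})  # edit visible only in "cleaning"
```

After these calls the store holds three records rather than the four a full copy would need, and `table.diff("master", "cleaning")` reports only key 2, which is the kind of relationship tracking that separate copies lose.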

    A generic provenance framework to document public policy making processes

    Public policies affect the day-to-day activities of individuals. Effective public policy outcomes result in general acceptance among the community. Transparency in the policy making process and participation during policy creation play a significant role in developing trust among the community. Established domains such as e-health employ provenance to create transparency and trust among researchers. Public policy making can also use provenance to develop trust and transparency in its processes. At present, however, public policy makers employ various means to manage public policy making. The lack of a unified platform for the policy making process makes it hard to search for and locate the evidence that was used during policy creation and to ensure trust and transparency among actors. The absence of such support also presents challenges for participation in public policy making. To address these challenges, this research presents a provenance framework that manages public policy making provenance data and enables the participation of diverse actors.
    Because public policy making is dynamic, a provenance framework needs to be adaptable. Therefore, a model-driven approach has been used to frame the public policy making provenance framework. In addition to a model-driven approach, a mechanism is required that can capture public policy processes. However, the knowledge-intensive dynamics of public policy making present challenges for process-based solutions. Therefore, this research work describes a process-agnostic approach, inspired by network packet switching, for tracking policy making processes. Managing public policy provenance data is not the only facet that develops trust; citizens and non-government bodies must also be able to take part in the policy creation process. Therefore, the provenance framework has been designed by considering the principles of smart governance, which results in a smart cities solution.
    To evaluate the framework, a proof-of-concept has been designed and implemented. An evaluation has been carried out to determine the suitability of a model-driven and process-agnostic approach for a policy making provenance framework in smart cities. For the evaluation, three public policy making case studies were employed: Shops Opening Hours’ Resolution, Air Quality Monitoring, and Neighbourhood Planning. The three case studies were used to derive various experiments to test the provenance framework. The experiments captured the dynamic and knowledge-intensive aspects of the provenance framework. The results of the experiments demonstrated the aptness of the process-agnostic and model-driven approaches for the policy making provenance framework. Lastly, an end-user evaluation was carried out to assess the effectiveness of the provenance framework. The positive responses of end users showed the usefulness of the provenance framework.

    Effective data versioning for collaborative data analytics

    With the massive proliferation of datasets in a variety of sectors, data science teams in these sectors spend vast amounts of time collaboratively constructing, curating, and analyzing these datasets. Versions of datasets are routinely generated during this data science process, via various data processing operations like data transformation and cleaning, feature engineering and normalization, among others. However, no existing systems enable us to effectively store, track, and query these versioned datasets, leading to massive redundancy in versioned data storage and making true collaboration and sharing impossible. In this thesis, we develop solutions for versioned data management for collaborative data analytics.
    In the first part of this thesis, we extend a relational database to support versioning of structured data. Specifically, we build a system, OrpheusDB, on top of a relational database with a carefully designed data representation and an intelligent partitioning algorithm for fast version control operations. OrpheusDB inherits many of the benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. However, OrpheusDB implicitly makes a few assumptions, namely: (a) the SQL assumption: a SQL-like language is the best fit for querying data and versioning information; (b) the structural assumption: the data is in a relational format with a regular structure; (c) the from-scratch assumption: users adopt OrpheusDB from the very beginning of their project and register each data version along with full metadata in the system.
    In the second part of this thesis, we remove each of these assumptions, one at a time. First, we remove the SQL assumption and propose a generalized query language for querying data along with versioning and provenance information. Second, we remove the structural assumption and develop solutions for compact storage and fast retrieval of arbitrary data representations. Finally, we remove the from-scratch assumption by developing techniques to infer lineage relationships among versions residing in an existing data repository.
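To make the last step of this abstract concrete, here is a minimal sketch of what inferring lineage among unregistered versions could look like. It is not the technique developed in the thesis, which the abstract does not describe. It swaps in a simple heuristic: each version is treated as a set of rows, and its parent is guessed to be the earlier version with the highest Jaccard similarity. The function names `jaccard` and `infer_parents`, the sample versions, and the assumption that creation order is known are all illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity of two row sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def infer_parents(versions):
    """Guess a parent for each version.

    `versions` is a list of (name, set_of_rows) pairs, ASSUMED to be in
    creation order. The parent is the earlier version with the highest
    Jaccard similarity; ties go to the earliest candidate.
    Returns {name: parent_name or None}.
    """
    parents = {}
    for i, (name, rows) in enumerate(versions):
        best, best_sim = None, -1.0
        for prev_name, prev_rows in versions[:i]:
            sim = jaccard(rows, prev_rows)
            if sim > best_sim:
                best, best_sim = prev_name, sim
        parents[name] = best
    return parents


# Hypothetical repository of four dataset versions, each a set of (id, value) rows.
versions = [
    ("v1", {(1, "a"), (2, "b"), (3, "c")}),
    ("v2", {(1, "a"), (2, "b"), (3, "c"), (4, "d")}),            # v1 plus one row
    ("v3", {(1, "a"), (2, "B"), (3, "c")}),                      # v1 with one row cleaned
    ("v4", {(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")}),  # v2 plus one row
]
lineage = infer_parents(versions)
# lineage == {"v1": None, "v2": "v1", "v3": "v1", "v4": "v2"}
```

A set-overlap heuristic like this only sees row-level edits; transformations that rewrite every row (for example, normalization) would need a different similarity signal.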

    16th SC@RUG 2019 proceedings 2018-2019
