108 research outputs found


    Version control systems were designed primarily to track and control changes to source code, and have since become an excellent way to share and edit files in a collaborative setting. The recent surge in data-driven decision making has resulted in a proliferation of datasets, elevating them to the level of source code, which in turn has led data analysts to resort to version control systems for storing and managing datasets and their versions over time. Unfortunately, existing version control systems handle large datasets poorly, primarily because they assume that the stored files are relatively small text files with localized changes. Moreover, the algorithms used by these systems tend to be fairly simple, leading to suboptimal performance when applied to large datasets. To address these shortcomings, a key requirement is a Dataset Version Control System (DVCS) that serves as a common platform enabling data analysts to efficiently store and query dataset versions, track changes to datasets, and share datasets between users with ease. Towards this goal, we address the fundamental problem of designing storage layouts for a wide range of datasets to serve as the primary building block for an efficient and scalable DVCS. The key problem in this setting is to compactly store a large number of dataset versions and efficiently retrieve any specific version (or a collection of partial versions). We begin our study by considering storage-retrieval trade-offs for versions of unstructured datasets such as text files and blobs, where the notion of a partial version is not well-defined. Next, we consider array datasets, i.e., collections of temporal snapshots (or versions) of multi-dimensional arrays, where the data is predominantly represented in single-precision or double-precision format. The primary challenge here is to develop efficient compression techniques for floating-point data, which is hard to compress because of its high degree of entropy. We observe that the underlying techniques developed for unstructured or array datasets are not well suited for more structured dataset versions, where a version is defined by a collection of records, each of which is uniquely addressable. We carefully explore the design space for building such a system and the various storage-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. Next, we formulate several problems that trade off version storage cost against retrieval cost in various ways, and design several offline storage layout algorithms that effectively minimize the storage costs while keeping the retrieval costs low. In addition to version retrieval queries, our system also supports record provenance queries. Through extensive experiments on large datasets, we demonstrate that our proposed designs can operate at the scale required in most practical scenarios.
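The storage-retrieval trade-off mentioned in this abstract can be illustrated with a small sketch. It is not the authors' system or algorithm: `VersionStore`, `make_delta`, `apply_delta`, and the `max_chain` parameter are hypothetical names chosen for illustration, and the line-based delta built on Python's `difflib` is an assumed stand-in for whatever differencing a real DVCS would use. Each version is stored either in full or as a delta against its predecessor; longer delta chains use less storage but take more steps to reconstruct a version.

```python
import difflib


def make_delta(old, new):
    """Encode `new` relative to `old` as copy/insert operations (illustrative)."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # store the new lines verbatim
        # "delete" needs no operation: the old lines are simply not copied
    return ops


def apply_delta(old, ops):
    """Rebuild a version from its parent's lines and a delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class VersionStore:
    """Linear history; a full copy is stored once a delta chain would exceed max_chain."""

    def __init__(self, max_chain):
        self.max_chain = max_chain
        self.entries = []  # ("full", lines) or ("delta", parent_index, ops)
        self.depth = []    # number of deltas to apply when checking out each version

    def commit(self, lines):
        lines = list(lines)
        if not self.entries or self.depth[-1] + 1 > self.max_chain:
            self.entries.append(("full", lines))
            self.depth.append(0)
        else:
            parent = len(self.entries) - 1
            ops = make_delta(self.checkout(parent), lines)
            self.entries.append(("delta", parent, ops))
            self.depth.append(self.depth[parent] + 1)
        return len(self.entries) - 1

    def checkout(self, index):
        entry = self.entries[index]
        if entry[0] == "full":
            return list(entry[1])
        return apply_delta(self.checkout(entry[1]), entry[2])

    def stored_lines(self):
        """Rough storage cost: number of lines physically kept."""
        total = 0
        for entry in self.entries:
            if entry[0] == "full":
                total += len(entry[1])
            else:
                total += sum(len(op[1]) for op in entry[2] if op[0] == "insert")
        return total
```

Setting `max_chain=0` stores every version in full (cheapest retrieval, highest storage), while a large `max_chain` stores mostly deltas (lowest storage, retrieval cost growing with chain depth). The abstract's offline layout algorithms address the harder problem of choosing such layouts well; this sketch only makes the two extremes concrete.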

    Designing Traceability into Big Data Systems

    Providing an appropriate level of accessibility and traceability to data or process elements (so-called Items) in large volumes of data, often Cloud-resident, is an essential requirement in the Big Data era. Enterprise-wide data systems need to be designed from the outset to support usage of such Items across the spectrum of business use rather than from any specific application view. The design philosophy advocated in this paper is to drive the design process using a so-called description-driven approach which enriches models with meta-data and description and focuses the design process on Item re-use, thereby promoting traceability. Details are given of the description-driven design of big data systems at CERN, in health informatics and in business process management. Evidence is presented that the approach leads to design simplicity and consequent ease of management thanks to loose typing and the adoption of a unified approach to Item management and usage.
    Comment: 10 pages; 6 figures in Proceedings of the 5th Annual International Conference on ICT: Big Data, Cloud and Security (ICT-BDCS 2015), Singapore July 2015. arXiv admin note: text overlap with arXiv:1402.5764, arXiv:1402.575

    Decibel: the relational dataset branching system

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.
    National Science Foundation (U.S.) (1513972); National Science Foundation (U.S.) (1513407); National Science Foundation (U.S.) (1513443); Intel Science and Technology Center for Big Dat

    Parallel Rendering and Large Data Visualization

    We are living in the big data age: an ever-increasing amount of data is being produced through data acquisition and computer simulations. While large-scale analysis and simulations have received significant attention for cloud and high-performance computing, software to efficiently visualise large data sets is struggling to keep up. Visualization has proven to be an efficient tool for understanding data; in particular, visual analysis is a powerful tool to gain intuitive insight into the spatial structure and relations of 3D data sets. Large-scale visualization setups are becoming ever more affordable, and high-resolution tiled display walls are in reach even for small institutions. Virtual reality has arrived in the consumer space, making it accessible to a large audience. This thesis addresses these developments by advancing the field of parallel rendering. We formalise the design of system software for large data visualization through parallel rendering, provide a reference implementation of a parallel rendering framework, introduce novel algorithms to accelerate the rendering of large amounts of data, and validate this research and development with new applications for large data visualization. Applications built using our framework enable domain scientists and large data engineers to better extract meaning from their data, making it feasible to explore more data and enabling the use of high-fidelity visualization installations to see more detail of the data.
    Comment: PhD thesi