227 research outputs found

    A Survey on Array Storage, Query Languages, and Systems

    Full text link
    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention.Comment: 44 page

    CloudNotes: Annotation Management in Cloud-Based Platforms

    Get PDF
    We present an annotation management system for cloud-based platforms, which is called “CloudNotes�. CloudNotes enables the annotation management feature in the scalable Hadoop and MapRedue platforms. In CloudNotes system, every piece of data may have one or more annotations associate with it, and these annotations will be propagated when the data is being transformed through the MapReduce jobs. Such an annotation management system is important for understanding the provenance and quality of data, especially in applications that deal with integration of scientific and biological data at unprecedented scale and complexity. We propose several extensions to the Hadoop platform that allow end-users to add and retrieve annotations seamlessly. Annotations in CloudNotes will be generated, propagated and managed in a distributed manner. We address several challenges that include attaching annotations to data at various granularities in Hadoop, annotating data in flat files with no known schema until query time, and creating and storing the annotations is a distributed fashion. We also present new storage mechanisms and novel indexing techniques that enable adding the annotations in small increments although Hadoop’s file system is optimized for large batch processing

    Pattern Reification as the Basis for Description-Driven Systems

    Full text link
    One of the main factors driving object-oriented software development for information systems is the requirement for systems to be tolerant to change. To address this issue in designing systems, this paper proposes a pattern-based, object-oriented, description-driven system (DDS) architecture as an extension to the standard UML four-layer meta-model. A DDS architecture is proposed in which aspects of both static and dynamic systems behavior can be captured via descriptive models and meta-models. The proposed architecture embodies four main elements - firstly, the adoption of a multi-layered meta-modeling architecture and reflective meta-level architecture, secondly the identification of four data modeling relationships that can be made explicit such that they can be modified dynamically, thirdly the identification of five design patterns which have emerged from practice and have proved essential in providing reusable building blocks for data management, and fourthly the encoding of the structural properties of the five design patterns by means of one fundamental pattern, the Graph pattern. A practical example of this philosophy, the CRISTAL project, is used to demonstrate the use of description-driven data objects to handle system evolution.Comment: 20 pages, 10 figure

    ArrayBridge: Interweaving declarative array processing with high-performance computing

    Full text link
    Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files. This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats, that aims to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation in NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine.Comment: 12 pages, 13 figure

    Storing and querying evolving knowledge graphs on the web

    Get PDF
    corecore