4,252 research outputs found

    A Programming Model and Foundation for Lineage-Based Distributed Computation

    Get PDF
    The most successful systems for "big data'' processing have all adopted functional APIs. We present a new programming model we call function passing designed to provide a more principled substrate, or middleware, upon which to build data-centric distributed systems like Spark. A key idea is to build up a persistent functional data structure representing transformations on distributed immutable data by passing well-typed serializable functions over the wire and applying them to this distributed data. Thus, the function passing model can be thought of as a persistent functional data structure that is distributed, where transformations performed on distributed data are stored in its nodes rather than the distributed data itself. One advantage of this model is that failure recovery is simplified by design--data can be recovered by replaying function applications atop immutable data loaded from stable storage. Deferred evaluation is also central to our model; by incorporating deferred evaluation into our design only at the point of initiating network communication, the function passing model remains easy to reason about while remaining efficient in time and memory. Moreover, we provide a complete formalization of the programming model in order to study the foundations of lineage-based distributed computation. In particular, we develop a theory of safe, mobile lineages based on a subject reduction theorem for a typed core language. Furthermore, we formalize a progress theorem which guarantees the finite materialization of remote, lineage-based data. Thus, the formal model may serve as a basis for further developments of the theory of data-centric distributed programming, including aspects such as fault tolerance. We provide an open-source implementation of our model in and for the Scala programming language, along with a case study of several example frameworks and end-user programs written atop of this model

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Full text link
    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications

    Towards Autopoietic Computing

    Full text link
    A key challenge in modern computing is to develop systems that address complex, dynamic problems in a scalable and efficient way, because the increasing complexity of software makes designing and maintaining efficient and flexible systems increasingly difficult. Biological systems are thought to possess robust, scalable processing paradigms that can automatically manage complex, dynamic problem spaces, possessing several properties that may be useful in computer systems. The biological properties of self-organisation, self-replication, self-management, and scalability are addressed in an interesting way by autopoiesis, a descriptive theory of the cell founded on the concept of a system's circular organisation to define its boundary with its environment. In this paper, therefore, we review the main concepts of autopoiesis and then discuss how they could be related to fundamental concepts and theories of computation. The paper is conceptual in nature and the emphasis is on the review of other people's work in this area as part of a longer-term strategy to develop a formal theory of autopoietic computing.Comment: 10 Pages, 3 figure

    A Survey on Array Storage, Query Languages, and Systems

    Full text link
    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention.Comment: 44 page
    • …
    corecore