174,054 research outputs found

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Full text link
    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications

    RepFlow: Minimizing Flow Completion Times with Replicated Flows in Data Centers

    Full text link
    Short TCP flows that are critical for many interactive applications in data centers are plagued by large flows and head-of-line blocking in switches. Hash-based load balancing schemes such as ECMP aggravate the matter and result in long-tailed flow completion times (FCT). Previous work on reducing FCT usually requires custom switch hardware and/or protocol changes. We propose RepFlow, a simple yet practically effective approach that replicates each short flow to reduce the completion times, without any change to switches or host kernels. With ECMP the original and replicated flows traverse distinct paths with different congestion levels, thereby reducing the probability of having long queueing delay. We develop a simple analytical model to demonstrate the potential improvement of RepFlow. Extensive NS-3 simulations and Mininet implementation show that RepFlow provides 50%--70% speedup in both mean and 99-th percentile FCT for all loads, and offers near-optimal FCT when used with DCTCP.Comment: To appear in IEEE INFOCOM 201

    The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis

    Get PDF
    Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But correctness of many applications depends on strong consistency properties - something that can impose substantial overheads, since it requires coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks

    Leveraging OpenStack and Ceph for a Controlled-Access Data Cloud

    Full text link
    While traditional HPC has and continues to satisfy most workflows, a new generation of researchers has emerged looking for sophisticated, scalable, on-demand, and self-service control of compute infrastructure in a cloud-like environment. Many also seek safe harbors to operate on or store sensitive and/or controlled-access data in a high capacity environment. To cater to these modern users, the Minnesota Supercomputing Institute designed and deployed Stratus, a locally-hosted cloud environment powered by the OpenStack platform, and backed by Ceph storage. The subscription-based service complements existing HPC systems by satisfying the following unmet needs of our users: a) on-demand availability of compute resources, b) long-running jobs (i.e., >30> 30 days), c) container-based computing with Docker, and d) adequate security controls to comply with controlled-access data requirements. This document provides an in-depth look at the design of Stratus with respect to security and compliance with the NIH's controlled-access data policy. Emphasis is placed on lessons learned while integrating OpenStack and Ceph features into a so-called "walled garden", and how those technologies influenced the security design. Many features of Stratus, including tiered secure storage with the introduction of a controlled-access data "cache", fault-tolerant live-migrations, and fully integrated two-factor authentication, depend on recent OpenStack and Ceph features.Comment: 7 pages, 5 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, US

    An Infrastructure for the Dynamic Distribution of Web Applications and Services

    Full text link
    This paper presents the design and implementation of an infrastructure that enables any Web application, regardless of its current state, to be stopped and uninstalled from a particular server, transferred to a new server, then installed, loaded, and resumed, with all these events occurring "on the fly" and totally transparent to clients. Such functionalities allow entire applications to fluidly move from server to server, reducing the overhead required to administer the system, and increasing its performance in a number of ways: (1) Dynamic replication of new instances of applications to several servers to raise throughput for scalability purposes, (2) Moving applications to servers to achieve load balancing or other resource management goals, (3) Caching entire applications on servers located closer to clients.National Science Foundation (9986397

    Blazes: Coordination Analysis for Distributed Programs

    Full text link
    Distributed consistency is perhaps the most discussed topic in distributed systems today. Coordination protocols can ensure consistency, but in practice they cause undesirable performance unless used judiciously. Scalable distributed architectures avoid coordination whenever possible, but under-coordinated systems can exhibit behavioral anomalies under fault, which are often extremely difficult to debug. This raises significant challenges for distributed system architects and developers. In this paper we present Blazes, a cross-platform program analysis framework that (a) identifies program locations that require coordination to ensure consistent executions, and (b) automatically synthesizes application-specific coordination code that can significantly outperform general-purpose techniques. We present two case studies, one using annotated programs in the Twitter Storm system, and another using the Bloom declarative language.Comment: Updated to include additional materials from the original technical report: derivation rules, output stream label
    • …
    corecore