174,054 research outputs found
Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Scientific analyses commonly compose multiple single-process programs into a
dataflow. An end-to-end dataflow of single-process programs is known as a
many-task application. Typically, tools from the HPC software stack are used to
parallelize these analyses. In this work, we investigate an alternate approach
that uses Apache Spark -- a modern big data platform -- to parallelize
many-task applications. We present Kira, a flexible and distributed astronomy
image processing toolkit using Apache Spark. We then use the Kira toolkit to
implement a Source Extractor application for astronomy images, called Kira SE.
With Kira SE as the use case, we study the programming flexibility, dataflow
richness, scheduling capacity and performance of Apache Spark running on the
EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an
equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon
EC2 cloud. Furthermore, we show that by leveraging software originally designed
for big data infrastructure, Kira SE achieves competitive performance to the C
implementation running on the NERSC Edison supercomputer. Our experience with
Kira indicates that emerging Big Data platforms such as Apache Spark are a
performant alternative for many-task scientific applications
RepFlow: Minimizing Flow Completion Times with Replicated Flows in Data Centers
Short TCP flows that are critical for many interactive applications in data
centers are plagued by large flows and head-of-line blocking in switches.
Hash-based load balancing schemes such as ECMP aggravate the matter and result
in long-tailed flow completion times (FCT). Previous work on reducing FCT
usually requires custom switch hardware and/or protocol changes. We propose
RepFlow, a simple yet practically effective approach that replicates each short
flow to reduce the completion times, without any change to switches or host
kernels. With ECMP the original and replicated flows traverse distinct paths
with different congestion levels, thereby reducing the probability of having
long queueing delay. We develop a simple analytical model to demonstrate the
potential improvement of RepFlow. Extensive NS-3 simulations and Mininet
implementation show that RepFlow provides 50%--70% speedup in both mean and
99-th percentile FCT for all loads, and offers near-optimal FCT when used with
DCTCP.Comment: To appear in IEEE INFOCOM 201
The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis
Datastores today rely on distribution and replication to achieve improved
performance and fault-tolerance. But correctness of many applications depends
on strong consistency properties - something that can impose substantial
overheads, since it requires coordinating the behavior of multiple nodes. This
paper describes a new approach to achieving strong consistency in distributed
systems while minimizing communication between nodes. The key insight is to
allow the state of the system to be inconsistent during execution, as long as
this inconsistency is bounded and does not affect transaction correctness. In
contrast to previous work, our approach uses program analysis to extract
semantic information about permissible levels of inconsistency and is fully
automated. We then employ a novel homeostasis protocol to allow sites to
operate independently, without communicating, as long as any inconsistency is
governed by appropriate treaties between the nodes. We discuss mechanisms for
optimizing treaties based on workload characteristics to minimize
communication, as well as a prototype implementation and experiments that
demonstrate the benefits of our approach on common transactional benchmarks
Leveraging OpenStack and Ceph for a Controlled-Access Data Cloud
While traditional HPC has and continues to satisfy most workflows, a new
generation of researchers has emerged looking for sophisticated, scalable,
on-demand, and self-service control of compute infrastructure in a cloud-like
environment. Many also seek safe harbors to operate on or store sensitive
and/or controlled-access data in a high capacity environment.
To cater to these modern users, the Minnesota Supercomputing Institute
designed and deployed Stratus, a locally-hosted cloud environment powered by
the OpenStack platform, and backed by Ceph storage. The subscription-based
service complements existing HPC systems by satisfying the following unmet
needs of our users: a) on-demand availability of compute resources, b)
long-running jobs (i.e., days), c) container-based computing with
Docker, and d) adequate security controls to comply with controlled-access data
requirements.
This document provides an in-depth look at the design of Stratus with respect
to security and compliance with the NIH's controlled-access data policy.
Emphasis is placed on lessons learned while integrating OpenStack and Ceph
features into a so-called "walled garden", and how those technologies
influenced the security design. Many features of Stratus, including tiered
secure storage with the introduction of a controlled-access data "cache",
fault-tolerant live-migrations, and fully integrated two-factor authentication,
depend on recent OpenStack and Ceph features.Comment: 7 pages, 5 figures, PEARC '18: Practice and Experience in Advanced
Research Computing, July 22--26, 2018, Pittsburgh, PA, US
An Infrastructure for the Dynamic Distribution of Web Applications and Services
This paper presents the design and implementation of an infrastructure that enables any Web application, regardless of its current state, to be stopped and uninstalled from a particular server, transferred to a new server, then installed, loaded, and resumed, with all these events occurring "on the fly" and totally transparent to clients. Such functionalities allow entire applications to fluidly move from server to server, reducing the overhead required to administer the system, and increasing its performance in a number of ways: (1) Dynamic replication of new instances of applications to several servers to raise throughput for scalability purposes, (2) Moving applications to servers to achieve load balancing or other resource management goals, (3) Caching entire applications on servers located closer to clients.National Science Foundation (9986397
Blazes: Coordination Analysis for Distributed Programs
Distributed consistency is perhaps the most discussed topic in distributed
systems today. Coordination protocols can ensure consistency, but in practice
they cause undesirable performance unless used judiciously. Scalable
distributed architectures avoid coordination whenever possible, but
under-coordinated systems can exhibit behavioral anomalies under fault, which
are often extremely difficult to debug. This raises significant challenges for
distributed system architects and developers. In this paper we present Blazes,
a cross-platform program analysis framework that (a) identifies program
locations that require coordination to ensure consistent executions, and (b)
automatically synthesizes application-specific coordination code that can
significantly outperform general-purpose techniques. We present two case
studies, one using annotated programs in the Twitter Storm system, and another
using the Bloom declarative language.Comment: Updated to include additional materials from the original technical
report: derivation rules, output stream label
- …