2,764 research outputs found
HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges
High Performance Computing (HPC) clouds are becoming an alternative to
on-premise clusters for executing scientific applications and business
analytics services. Most research efforts in HPC cloud aim to understand the
cost-benefit of moving resource-intensive applications from on-premise
environments to public cloud platforms. Industry trends show hybrid
environments are the natural path to get the best of the on-premise and cloud
resources---steady (and sensitive) workloads can run on on-premise resources
and peak demand can leverage remote resources in a pay-as-you-go manner.
Nevertheless, there are plenty of questions to be answered in HPC cloud, which
range from how to extract the best performance of an unknown underlying
platform to what services are essential to make its usage easier. Moreover, the
discussion on the right pricing and contractual models to fit small and large
users is relevant for the sustainability of HPC clouds. This paper brings a
survey and taxonomy of efforts in HPC cloud and a vision on what we believe is
ahead of us, including a set of research challenges that, once tackled, can
help advance businesses and scientific discoveries. This becomes particularly
relevant due to the fast increasing wave of new HPC applications coming from
big data and artificial intelligence.Comment: 29 pages, 5 figures, Published in ACM Computing Surveys (CSUR
Resource provisioning in Science Clouds: Requirements and challenges
Cloud computing has permeated into the information technology industry in the
last few years, and it is emerging nowadays in scientific environments. Science
user communities are demanding a broad range of computing power to satisfy the
needs of high-performance applications, such as local clusters,
high-performance computing systems, and computing grids. Different workloads
are needed from different computational models, and the cloud is already
considered as a promising paradigm. The scheduling and allocation of resources
is always a challenging matter in any form of computation and clouds are not an
exception. Science applications have unique features that differentiate their
workloads, hence, their requirements have to be taken into consideration to be
fulfilled when building a Science Cloud. This paper will discuss what are the
main scheduling and resource allocation challenges for any Infrastructure as a
Service provider supporting scientific applications
Virtualizing the Stampede2 Supercomputer with Applications to HPC in the Cloud
Methods developed at the Texas Advanced Computing Center (TACC) are described
and demonstrated for automating the construction of an elastic, virtual cluster
emulating the Stampede2 high performance computing (HPC) system. The cluster
can be built and/or scaled in a matter of minutes on the Jetstream self-service
cloud system and shares many properties of the original Stampede2, including:
i) common identity management, ii) access to the same file systems, iii)
equivalent software application stack and module system, iv) similar job
scheduling interface via Slurm.
We measure time-to-solution for a number of common scientific applications on
our virtual cluster against equivalent runs on Stampede2 and develop an
application profile where performance is similar or otherwise acceptable. For
such applications, the virtual cluster provides an effective form of "cloud
bursting" with the potential to significantly improve overall turnaround time,
particularly when Stampede2 is experiencing long queue wait times. In addition,
the virtual cluster can be used for test and debug without directly impacting
Stampede2. We conclude with a discussion of how science gateways can leverage
the TACC Jobs API web service to incorporate this cloud bursting technique
transparently to the end user.Comment: 6 pages, 0 figures, PEARC '18: Practice and Experience in Advanced
Research Computing, July 22--26, 2018, Pittsburgh, PA, US
Survey and Analysis of Production Distributed Computing Infrastructures
This report has two objectives. First, we describe a set of the production
distributed infrastructures currently available, so that the reader has a basic
understanding of them. This includes explaining why each infrastructure was
created and made available and how it has succeeded and failed. The set is not
complete, but we believe it is representative.
Second, we describe the infrastructures in terms of their use, which is a
combination of how they were designed to be used and how users have found ways
to use them. Applications are often designed and created with specific
infrastructures in mind, with both an appreciation of the existing capabilities
provided by those infrastructures and an anticipation of their future
capabilities. Here, the infrastructures we discuss were often designed and
created with specific applications in mind, or at least specific types of
applications. The reader should understand how the interplay between the
infrastructure providers and the users leads to such usages, which we call
usage modalities. These usage modalities are really abstractions that exist
between the infrastructures and the applications; they influence the
infrastructures by representing the applications, and they influence the ap-
plications by representing the infrastructures
VM-MAD: a cloud/cluster software for service-oriented academic environments
The availability of powerful computing hardware in IaaS clouds makes cloud
computing attractive also for computational workloads that were up to now
almost exclusively run on HPC clusters.
In this paper we present the VM-MAD Orchestrator software: an open source
framework for cloudbursting Linux-based HPC clusters into IaaS clouds but also
computational grids. The Orchestrator is completely modular, allowing flexible
configurations of cloudbursting policies. It can be used with any batch system
or cloud infrastructure, dynamically extending the cluster when needed. A
distinctive feature of our framework is that the policies can be tested and
tuned in a simulation mode based on historical or synthetic cluster accounting
data.
In the paper we also describe how the VM-MAD Orchestrator was used in a
production environment at the FGCZ to speed up the analysis of mass
spectrometry-based protein data by cloudbursting to the Amazon EC2. The
advantages of this hybrid system are shown with a large evaluation run using
about hundred large EC2 nodes.Comment: 16 pages, 5 figures. Accepted at the International Supercomputing
Conference ISC13, June 17--20 Leipzig, German
Transparent Orchestration of Task-based Parallel Applications in Containers Platforms
This paper presents a framework to easily build and execute parallel applications in container-based distributed computing platforms in a user-transparent way. The proposed framework is a combination of the COMP Superscalar (COMPSs) programming model and runtime, which provides a straightforward way to develop task-based parallel applications from sequential codes, and containers management platforms that ease the deployment of applications in computing environments (as Docker, Mesos or Singularity). This framework provides scientists and developers with an easy way to implement parallel distributed applications and deploy them in a one-click fashion. We have built a prototype which integrates COMPSs with different containers engines in different scenarios: i) a Docker cluster, ii) a Mesos cluster, and iii) Singularity in an HPC cluster. We have evaluated the overhead in the building phase, deployment and execution of two benchmark applications compared to a Cloud testbed based on KVM and OpenStack and to the usage of bare metal nodes. We have observed an important gain in comparison to cloud environments during the building and deployment phases. This enables better adaptation of resources with respect to the computational load. In contrast, we detected an extra overhead during the execution, which is mainly due to the multi-host Docker networking.This work is partly supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316 project, by the Generalitat de Catalunya under contracts 2014-SGR-1051 and 2014-SGR-1272, and by the European Union through the Horizon 2020 research and innovation program under grant 690116 (EUBra-BIGSEA Project). Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation.Peer ReviewedPostprint (author's final draft
A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds
Cloud platforms have emerged as a prominent environment to execute high
performance computing (HPC) applications providing on-demand resources as well
as scalability. They usually offer different classes of Virtual Machines (VMs)
which ensure different guarantees in terms of availability and volatility,
provisioning the same resource through multiple pricing models. For instance,
in Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs
are unused instances available for lower price. Despite the monetary
advantages, a spot VM can be terminated, stopped, or hibernated by EC2 at any
moment.
Using both hibernation-prone spot VMs (for cost sake) and on-demand VMs, we
propose in this paper a static scheduling for HPC applications which are
composed by independent tasks (bag-of-task) with deadline constraints. However,
if a spot VM hibernates and it does not resume within a time which guarantees
the application's deadline, a temporal failure takes place. Our scheduling,
thus, aims at minimizing monetary costs of bag-of-tasks applications in EC2
cloud, respecting its deadline and avoiding temporal failures. To this end, our
algorithm statically creates two scheduling maps: (i) the first one contains,
for each task, its starting time and on which VM (i.e., an available spot or
on-demand VM with the current lowest price) the task should execute; (ii) the
second one contains, for each task allocated on a VM spot in the first map, its
starting time and on which on-demand VM it should be executed to meet the
application deadline in order to avoid temporal failures. The latter will be
used whenever the hibernation period of a spot VM exceeds a time limit.
Performance results from simulation with task execution traces, configuration
of Amazon EC2 VM classes, and VMs market history confirms the effectiveness of
our scheduling and that it tolerates temporal failures
- …