Search CORE

1,454 research outputs found

Resource provisioning in Science Clouds: Requirements and challenges

Author: Afgan
Antcheva
Armbrust
Beloglazov
Birkenheuer
Blomer
Calheiros
Campos Plasencia
Chauhan
Chen
Corradi
Expósito
Fernández Albor
Gunarathne
Hardt
Ismail
Jung
Juve
Kune
Lee
Manvi
McNab
Mell
Michelotto
Mogul
Montero
Oesterle
Oliveira
Orgerie
Ostermann
Rehr
Rodriguez
Rodríguez-Marrero
Shamsi
Smanchat
Somasundaram
Sotomayor
Srirama
Szabo
Tan
Tchernykh
Walker
Publication venue: 'Wiley'
Publication date: 01/01/2017
Field of study

Cloud computing has permeated into the information technology industry in the last few years, and it is emerging nowadays in scientific environments. Science user communities are demanding a broad range of computing power to satisfy the needs of high-performance applications, such as local clusters, high-performance computing systems, and computing grids. Different workloads are needed from different computational models, and the cloud is already considered as a promising paradigm. The scheduling and allocation of resources is always a challenging matter in any form of computation and clouds are not an exception. Science applications have unique features that differentiate their workloads, hence, their requirements have to be taken into consideration to be fulfilled when building a Science Cloud. This paper will discuss what are the main scheduling and resource allocation challenges for any Infrastructure as a Service provider supporting scientific applications

arXiv.org e-Print Archive

Crossref

Digital.CSIC

funcX: A Federated Function Serving Fabric for Science

Author: Akkus I. E.
CMS
Forde J.
Fox G.
Hightower K.
Hindman B.
Malawski M.
Merkel D.
Spillner J.
Stubbs J.
Wang L.
Waterman D. G.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 07/05/2020
Field of study

Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX---a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than more than 130000 concurrent workers.Comment: Accepted to ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020). arXiv admin note: substantial text overlap with arXiv:1908.0490

arXiv.org e-Print Archive

Crossref

Recommended from our members

Transiency-driven Resource Management for Cloud Computing Platforms

Author: Sharma Prateek
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018
Field of study

Modern distributed server applications are hosted on enterprise or cloud data centers that provide computing, storage, and networking capabilities to these applications. These applications are built using the implicit assumption that the underlying servers will be stable and normally available, barring for occasional faults. In many emerging scenarios, however, data centers and clouds only provide transient, rather than continuous, availability of their servers. Transiency in modern distributed systems arises in many contexts, such as green data centers powered using renewable intermittent sources, and cloud platforms that provide lower-cost transient servers which can be unilaterally revoked by the cloud operator. Transient computing resources are increasingly important, and existing fault-tolerance and resource management techniques are inadequate for transient servers because applications typically assume continuous resource availability. This thesis presents research in distributed systems design that treats transiency as a first-class design principle. I show that combining transiency-specific fault-tolerance mechanisms with resource management policies to suit application characteristics and requirements, can yield significant cost and performance benefits. These mechanisms and policies have been implemented and prototyped as part of software systems, which allow a wide range of applications, such as interactive services and distributed data processing, to be deployed on transient servers, and can reduce cloud computing costs by up to 90\%. This thesis makes contributions to four areas of computer systems research: transiency-specific fault-tolerance, resource allocation, abstractions, and resource reclamation. For reducing the impact of transient server revocations, I develop two fault-tolerance techniques that are tailored to transient server characteristics and application requirements. For interactive applications, I build a derivative cloud platform that masks revocations by transparently moving application-state between servers of different types. Similarly, for distributed data processing applications, I investigate the use of application level periodic checkpointing to reduce the performance impact of server revocations. For managing and reducing the risk of server revocations, I investigate the use of server portfolios that allow transient resource allocation to be tailored to application requirements. Finally, I investigate how resource providers (such as cloud platforms) can provide transient resource availability without revocation, by looking into alternative resource reclamation techniques. I develop resource deflation, wherein a server\u27s resources are fractionally reclaimed, allowing the application to continue execution albeit with fewer resources. Resource deflation generalizes revocation, and the deflation mechanisms and cluster-wide policies can yield both high cluster utilization and low application performance degradation

ScholarWorks@UMass Amherst

Notes on Cloud computing principles

Author: Lee Dongman
Sandholm Thomas
Publication venue
Publication date: 01/01/2014
Field of study

This letter provides a review of fundamental distributed systems and economic Cloud computing principles. These principles are frequently deployed in their respective fields, but their inter-dependencies are often neglected. Given that Cloud Computing first and foremost is a new business model, a new model to sell computational resources, the understanding of these concepts is facilitated by treating them in unison. Here, we review some of the most important concepts and how they relate to each other

arXiv.org e-Print Archive

Springer - Publisher Connector