Request Replication for FaaS Fault Tolerance

Authors: Bouizem Yasmina
Dib Djawida
Lahfa Fedoua
Morin Christine
Parlavantzas Nikos
Publication venue: HAL CCSD
Publication date: 04/01/2022

Function-as-a-Service (FaaS) is a popular programming model for building serverless applications, supported by all major cloud providers and many open-source software frameworks. One of the main challenges for FaaS providers is providing fault tolerance for the deployed applications. The basic fault-tolerance mechanism in current FaaS platforms is automatically retrying function invocations. Although the retry mechanism is well suited to transient faults, it delays recovery from other types of faults, such as node crashes. This paper proposes integrating a Request Replication mechanism into FaaS platforms and describes how this integration was implemented in a well-known open-source platform. The paper provides a detailed experimental comparison of the proposed mechanism with the retry mechanism and an Active-Standby mechanism under different failure scenarios.

INRIA a CCSD electronic archive server
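
The abstract contrasts retrying an invocation with replicating it. A minimal sketch of that difference follows. It assumes each function instance is a plain Python callable, and a crashed node is modelled as an instance that fails only after a detection delay. `make_replica`, `invoke_with_retry` and `invoke_replicated` are hypothetical names. This is not the platform integration described in the paper.

```python
import concurrent.futures as cf
import time


def make_replica(name, delay_s, crashed=False):
    """Hypothetical stand-in for a function instance on one node.

    It sleeps for `delay_s` seconds, then returns a result or, if `crashed`
    is True, raises ConnectionError (modelling failure-detection time).
    """
    def invoke(payload):
        time.sleep(delay_s)
        if crashed:
            raise ConnectionError(f"{name} crashed")
        return f"{name}:{payload.upper()}"
    return invoke


def invoke_with_retry(replicas, payload, max_attempts=3):
    """Retry: one instance at a time; after a failure, try the next one."""
    last_error = None
    for attempt in range(max_attempts):
        replica = replicas[attempt % len(replicas)]
        try:
            return replica(payload)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError("all attempts failed") from last_error


def invoke_replicated(replicas, payload):
    """Request replication: send the same request to every instance at once
    and return the first successful response."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, payload) for replica in replicas]
        errors = []
        for future in cf.as_completed(futures):
            try:
                return future.result()
            except ConnectionError as err:
                errors.append(err)
        raise RuntimeError(f"all {len(errors)} replicas failed")
    finally:
        # Return without waiting for slower (or failing) replicas.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    replicas = [
        make_replica("node-a", 0.5, crashed=True),
        make_replica("node-b", 0.05),
    ]
    for strategy in (invoke_with_retry, invoke_replicated):
        start = time.perf_counter()
        result = strategy(replicas, "hello")
        print(f"{strategy.__name__}: {result} in {time.perf_counter() - start:.2f}s")
```

Both strategies return the same answer, but the retry path pays the crashed node's detection delay first. The replicated path pays for it in extra load instead of extra latency.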

Using Unused: Non-Invasive Dynamic FaaS Infrastructure with HPC-Whisk

Authors: Malawski Maciej
Pawlik Maciej
Przybylski Bartłomiej
Rzadca Krzysztof
Łagosz Bartłomiej
Żuk Paweł
Publication date: 01/11/2022

Modern HPC workload managers and their careful tuning contribute to the high utilization of HPC clusters. However, because of inevitable uncertainty, it is impossible to avoid node idleness completely. Although such idle slots are usually too short for any HPC job, they are too long to ignore. The Function-as-a-Service (FaaS) paradigm promises to fill this gap and can be a good match, as typical FaaS functions last seconds, not hours. Here we show how to build a FaaS infrastructure on idle nodes in an HPC cluster without significantly affecting the performance of the HPC jobs. We dynamically adapt to a changing set of idle physical machines by integrating the open-source software Slurm and OpenWhisk. We designed and implemented a prototype solution that allowed us to cover up to 90% of the idle time slots on a 50k-core cluster that runs production workloads.

arXiv.org e-Print Archive
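
The abstract's argument is that idle slots too short for HPC jobs can still be worth filling with functions. A toy model makes this concrete. It assumes a FaaS worker needs a fixed start-up time in each idle slot before it can serve functions. This per-slot overhead model, `IdleSlot`, `idle_nodes_at` and `coverage` are all hypothetical illustrations. None of them is part of HPC-Whisk, Slurm or OpenWhisk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleSlot:
    """A period (in seconds) during which the HPC scheduler leaves a node idle."""
    node: str
    start: float
    end: float

    @property
    def length(self):
        return self.end - self.start


def idle_nodes_at(slots, t):
    """The changing set of nodes that could host FaaS workers at time t."""
    return {s.node for s in slots if s.start <= t < s.end}


def coverage(slots, startup_s):
    """Fraction of total idle time usable by functions, assuming (hypothetically)
    that a worker needs `startup_s` seconds to become ready in every slot."""
    total = sum(s.length for s in slots)
    usable = sum(max(0.0, s.length - startup_s) for s in slots)
    return usable / total if total else 0.0


if __name__ == "__main__":
    slots = [IdleSlot("n1", 0, 100), IdleSlot("n2", 10, 30), IdleSlot("n3", 50, 55)]
    print(idle_nodes_at(slots, 20))      # n1 and n2 are idle at t = 20
    print(coverage(slots, startup_s=5))  # (95 + 15 + 0) / 125 = 0.88
```

In this model, coverage is limited only by start-up overhead relative to slot length. This is one reason second-long functions suit idle windows that hour-long HPC jobs cannot use.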

No Provisioned Concurrency: Fast RDMA-codesigned Remote Fork for Serverless Computing

Authors: Chen Haibo
Chen Rong
Gu Jinyu
Lu Fangming
Wang Tianxia
Wei Xingda
Yang Yuhan
Publication date: 16/09/2022

Serverless platforms essentially face a tradeoff between container startup time and provisioned concurrency (i.e., cached instances), which is further exacerbated by the frequent need for remote container initialization. This paper presents MITOSIS, an operating system primitive that provides fast remote fork by exploiting a deep codesign of the OS kernel with RDMA. By leveraging the fast remote read capability of RDMA and partial state transfer across serverless containers, MITOSIS bridges the performance gap between local and remote container initialization. MITOSIS is the first to fork over 10,000 new containers from one instance across multiple machines within a second, while allowing the new containers to efficiently transfer the pre-materialized states of the forked one. We have implemented MITOSIS on Linux and integrated it with FN, a popular serverless platform. Under load spikes in real-world serverless workloads, MITOSIS reduces the function tail latency by 89% with orders of magnitude lower memory usage. For serverless workflows that require state transfer, MITOSIS improves execution time by 86%.
Comment: To appear in OSDI'2

arXiv.org e-Print Archive
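
The "partial state transfer" idea can be illustrated at a purely conceptual level. In the sketch below, a forked child copies a parent page only on its first read and keeps writes private. This lets many children start without copying the parent's whole memory. The in-process dict standing in for remote memory is an assumption, as are the classes `ParentSnapshot` and `ForkedChild`. The sketch involves no RDMA, no kernel code and no FN integration.

```python
class ParentSnapshot:
    """Pre-materialized memory of the parent container, keyed by page number.

    Stand-in for state that would be read from a remote machine; here it is a
    local dict and `remote_reads` simply counts page fetches.
    """

    def __init__(self, pages):
        self._pages = dict(pages)
        self.remote_reads = 0

    def read_page(self, page_no):
        self.remote_reads += 1
        return self._pages[page_no]


class ForkedChild:
    """A child 'forked' from a snapshot without copying its memory up front.

    Pages are fetched on first read (on-demand, partial state transfer) and
    writes go to a private copy, so the parent snapshot is never modified.
    """

    def __init__(self, parent):
        self._parent = parent
        self._local = {}  # pages this child has fetched or written

    def read(self, page_no):
        if page_no not in self._local:
            self._local[page_no] = self._parent.read_page(page_no)
        return self._local[page_no]

    def write(self, page_no, data):
        # Whole-page overwrite: the old contents never need to be fetched.
        self._local[page_no] = data


if __name__ == "__main__":
    parent = ParentSnapshot({n: bytes([n % 256]) * 4096 for n in range(1000)})
    children = [ForkedChild(parent) for _ in range(100)]
    for child in children:
        child.read(0)  # e.g. state every function invocation touches
        child.read(1)
    # 200: two page fetches per child instead of 1,000 pages copied per child.
    print(parent.remote_reads)
```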

Coordination Protocols for Verifiable Consistency in Distributed Storage Systems

Author: Ramaswamy Ashwin
Publication venue: SJSU ScholarWorks
Publication date: 01/01/2022

Achieving consistency in a highly available distributed storage system has been formally proven impossible when the system faces network partitions and faulty processes. The complexity increases when the system allows concurrent processes to send transactions to all the other servers and to coordinate the consistent commitment of different transactions. During a partition, each server may allow clients to request updates involving the current state of the data, which makes replicated consistency hard to achieve. Consensus protocols are commonly used to resolve such inconsistencies, but they have strict requirements for making progress and are not guaranteed to ever converge to a single value. Moreover, the coordination required to restore consistency after a partition is extremely high, because each node must compare transaction times and conflicting data with every other server in the system. To address this inconsistency, this thesis proposes a new coordination protocol that combines four ideas to let clients verify the consistency of data: (1) a universal timestamp signatory that certifies the global order of events, (2) a relative consistency indicator that determines relative consistency during partitions, (3) an operation-based, recency-weighted conflict resolution algorithm that simplifies the coordination needed for global consistency, and (4) a rejection-oriented distributed transaction commit protocol that removes the guarantees required by atomic commit protocols and verifies local consistency. The thesis evaluates and analyzes various issues related to coordination and concurrency under network partitions. The experimental results show that the proposed methods provide verifiable consistency for the servers of distributed storage systems.

SJSU ScholarWorks
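
Ideas (1), (3) and (4) of the abstract can be sketched with a deliberately simplified model. Every write carries a globally certified timestamp, and partitions are reconciled by replaying all operation logs in that order. A transaction is rejected if data it read was overwritten afterwards. Plain last-writer-wins is swapped in here for the thesis's recency-weighted conflict resolution, whose details the abstract does not give. `Op`, `merge_after_partition` and `accept_commit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Op:
    """One write, stamped by a (hypothetical) universal timestamp signatory."""
    timestamp: int  # globally certified order of events
    server_id: str  # tie-breaker so that the order is total
    key: str
    value: str


def merge_after_partition(*logs):
    """Rebuild one state from the operation logs of every partition.

    Operations are de-duplicated, replayed in certified timestamp order, and
    later writes overwrite earlier ones (last-writer-wins, a simplification).
    Any server merging the same logs, in any order, reaches the same state.
    """
    state = {}
    for op in sorted(set().union(*logs)):
        state[op.key] = op.value
    return state


def accept_commit(read_ts, read_keys, last_write_ts):
    """Rejection-style validation: refuse a transaction if any key it read
    was overwritten after the timestamp at which it read."""
    return all(last_write_ts.get(key, 0) <= read_ts for key in read_keys)


if __name__ == "__main__":
    left = [Op(1, "s1", "x", "a"), Op(4, "s1", "y", "d")]
    right = [Op(2, "s2", "x", "b"), Op(3, "s2", "y", "c")]
    print(merge_after_partition(left, right))  # {'x': 'b', 'y': 'd'}
    print(accept_commit(3, ["x"], {"x": 2, "y": 4}))       # True
    print(accept_commit(3, ["x", "y"], {"x": 2, "y": 4}))  # False
```

Because the replay order depends only on the certified timestamps, servers need not compare their histories pairwise to converge. That pairwise comparison is the coordination cost the abstract identifies.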