Fine-Grained Scheduling for Containerized HPC Workloads in Kubernetes Clusters
Containerization technology offers lightweight OS-level virtualization and
enables portability, reproducibility, and flexibility by packaging applications
with low performance overhead and little effort to maintain and scale them.
Moreover, container orchestrators (e.g., Kubernetes) are widely used in the
Cloud to manage large clusters running many containerized applications.
However, scheduling policies that account for the performance nuances of
containerized High Performance Computing (HPC) workloads have not yet been
well explored. This paper proposes fine-grained scheduling policies for
containerized HPC workloads in Kubernetes clusters, focusing in particular on
partitioning each job into a suitable multi-container deployment according to
the application profile. We implement our scheduling schemes at different
layers of management (application and infrastructure), so that each component
has its own focus and algorithms but still collaborates with others. Our
results show that our fine-grained scheduling policies outperform both a
baseline policy and a baseline with CPU/memory affinity enabled, reducing the
overall response time by 35% and 19%, respectively, and improving the makespan
by 34% and 11%, respectively. They also provide better usability and flexibility
to specify HPC workloads than other comparable HPC Cloud frameworks, while
providing better scheduling efficiency thanks to their multi-layered approach.
Comment: HPCC202
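The abstract does not spell out the partitioning algorithm itself. As a purely hypothetical illustration of the idea, a profile-driven split of a job's ranks into container sizes (assuming, for instance, that compute-bound jobs benefit from NUMA-sized containers while communication-bound jobs prefer fewer, larger ones) might look like this; all names and thresholds are assumptions, not the paper's:

```python
def partition_job(n_procs, cores_per_numa, profile="compute"):
    """Split a job of n_procs ranks into container sizes.

    Hypothetical heuristic: compute-bound jobs get containers that
    fill one NUMA domain (locality); communication-bound jobs get
    fewer, larger containers to keep more ranks on shared memory.
    """
    if profile == "compute":
        size = cores_per_numa
    else:  # "communication": halve the container count
        size = cores_per_numa * 2
    full, rest = divmod(n_procs, size)
    return [size] * full + ([rest] if rest else [])
```

For example, a 20-rank compute-bound job on nodes with 8-core NUMA domains would be split into containers of 8, 8, and 4 ranks.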
FfDL : A Flexible Multi-tenant Deep Learning Platform
Deep learning (DL) is becoming increasingly popular in many application
domains and has made new application features involving computer vision,
speech recognition and synthesis, self-driving automobiles, drug design, and
more both feasible and accurate. As a result, large-scale on-premise and
cloud-hosted deep learning platforms have become essential infrastructure in
many organizations. These systems accept, schedule, manage and execute DL
training jobs at scale.
This paper describes the design, implementation and our experiences with
FfDL, a DL platform used at IBM. We describe how our design balances
dependability with scalability, elasticity, flexibility and efficiency. We
examine FfDL qualitatively through a retrospective look at the lessons learned
from building, operating, and supporting FfDL; and quantitatively through a
detailed empirical evaluation of FfDL, including the overheads introduced by
the platform for various deep learning models, the load and performance
observed in a real case study using FfDL within our organization, the frequency
of various faults observed including unanticipated faults, and experiments
demonstrating the benefits of various scheduling policies. FfDL has been
open-sourced.
Comment: MIDDLEWARE 201
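The abstract mentions experiments with various scheduling policies but does not detail them. One toy sketch of a plausible policy for a platform of this kind, FIFO with backfilling over a fixed GPU pool, could look like the following (purely illustrative, not FfDL's actual algorithm; job fields are assumed names):

```python
def schedule(jobs, free_gpus):
    """Pick jobs to launch from a FIFO queue, skipping jobs that
    do not fit the currently free GPUs (a simple backfill-style
    policy; a production scheduler would also weigh fairness,
    priorities, and gang scheduling)."""
    launched = []
    for job in jobs:
        if job["gpus"] <= free_gpus:
            free_gpus -= job["gpus"]
            launched.append(job["name"])
    return launched, free_gpus
```

With 6 free GPUs and a queue requesting 4, 8, and 2 GPUs, the 8-GPU job is skipped and the 4- and 2-GPU jobs are backfilled around it.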
Cost-Efficient Resource Provisioning for Cloud-Enabled Schedulers
Over the last decade, public cloud platforms have rapidly become the de facto computing platform for our society. To support a wide range of users and their diverse applications, public cloud platforms offer the same VMs under many purchasing options that differ in cost, performance, availability, and time commitment. Popular purchasing options include on-demand, reserved, and transient VM types. Reserved VMs require long time commitments, whereas users can acquire and release on-demand (and transient) VMs at any time. While transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time. In general, the stronger the commitment, i.e., the longer and less flexible, the lower the price. However, longer and less flexible time commitments can increase cloud costs if future workloads cannot utilize the VMs users committed to buying. Interestingly, this wide range of purchasing options provides opportunities for cost savings. However, large cloud customers often find it challenging to choose the right mix of purchasing options to minimize their long-term costs while retaining the ability to adjust capacity up and down in response to workload variations. Thus, optimizing cloud costs requires users to select a mix of VM purchasing options based on their short- and long-term expectations of workload utilization. Notably, hybrid clouds combine multiple VM purchasing options, or private clusters with public cloud VMs, to optimize cloud costs based on workload expectations. In this thesis, we address the challenge of choosing a mix of different VM purchasing options for large cloud customers, thereby optimizing their cloud costs.
To this end, we make the following contributions: (i) design and implement a container orchestration platform (using Kubernetes) to optimize the cost of executing mixed interactive and batch workloads on cloud platforms using on-demand and transient VMs; (ii) develop simple analytical models of different straggler mitigation techniques to better understand the cost of synchronization in distributed machine learning workloads and compare their cost and performance on on-demand and transient VMs; (iii) design multiple policies to optimize long-term cloud costs by selecting a mix of VM purchasing options based on short- and long-term expectations of workload utilization (with no job waiting); (iv) introduce the concept of a waiting policy for cloud-enabled schedulers, and show that provisioning long-term resources (e.g., reserved VMs) to optimize cloud costs depends on it; and (v) design and implement speculative execution and ML-based waiting time predictions (for waiting policies) to show that optimizing job waiting in the cloud is possible without accurate job runtime predictions.
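As a rough illustration of the kind of decision behind contribution (iii), the classic break-even rule for sizing a reserved-VM pool from a demand history can be sketched as follows. This is a simplification of my own, not the thesis's policies, which additionally consider transient VMs and job waiting:

```python
def reserved_count(hourly_demand, price_od, price_res):
    """Number of VM slots worth reserving: slot k pays off as a
    reservation if it is busy (demand > k) for more than a
    price_res/price_od fraction of hours, since a reservation is
    billed whether used or not while on-demand is billed per hour.
    """
    threshold = price_res / price_od
    hours = len(hourly_demand)
    k = 0
    while sum(d > k for d in hourly_demand) / hours > threshold:
        k += 1
    return k
```

For example, if reserved capacity costs half the on-demand rate, any slot that is busy more than 50% of the time should be reserved; the rest is cheaper to buy on demand as needed.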
GreenCourier: Carbon-Aware Scheduling for Serverless Functions
This paper presents GreenCourier, a novel scheduling framework that enables
the runtime scheduling of serverless functions across geographically
distributed regions based on their carbon efficiencies. Our framework
incorporates an intelligent scheduling strategy for Kubernetes and supports
Knative as the serverless platform. To obtain real-time carbon information for
different geographical regions, our framework supports multiple marginal carbon
emissions sources such as WattTime and the Carbon-aware SDK. We comprehensively
evaluate the performance of our framework using the Google Kubernetes Engine
and production serverless function traces for scheduling functions across
Spain, France, Belgium, and the Netherlands. Results from our experiments show
that compared to other approaches, GreenCourier reduces carbon emissions per
function invocation by an average of 13.25%.Comment: Accepted at the ACM 9th International Workshop on Serverless
Computing (WoSC@Middleware'23
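The core region-selection step such a scheduler performs can be illustrated minimally. The sketch below assumes per-region marginal carbon intensities have already been fetched from a source such as WattTime; it shows only the selection, not the Kubernetes integration the paper describes:

```python
def pick_region(carbon_g_per_kwh):
    """Choose the region with the lowest reported marginal carbon
    intensity (gCO2/kWh). A carbon-aware scheduler would layer a
    signal like this into its placement decision at invocation
    time; values here would come from an external data source."""
    return min(carbon_g_per_kwh, key=carbon_g_per_kwh.get)
```

With intensities for the four evaluated regions, the function simply routes the invocation to the cleanest one at that moment.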
Achieving Continuous Delivery of Immutable Containerized Microservices with Mesos/Marathon
In recent years, DevOps methodologies have been introduced to extend traditional agile principles, bringing about a paradigm shift that migrates applications towards cloud-native architectures. Today, microservices, containers, and Continuous Integration/Continuous Delivery (CI/CD) have become critical to any organization's transformation journey: developing lean artifacts and meeting the growing demand to push new features and iterate rapidly to keep customers happy. Traditionally, applications have been packaged and delivered in virtual machines, but with the adoption of microservices architectures, containerized applications are becoming the standard way to deploy services to production. Thanks to container orchestration tools like Marathon, containers can now be deployed and monitored at scale with ease. Microservices and containers, along with container orchestration tools, disrupt and redefine DevOps, especially the delivery pipeline.
This Master’s thesis project focuses on deploying highly scalable microservices packaged as immutable containers onto a Mesos cluster using a container orchestration framework called Marathon. This is achieved by implementing a CI/CD pipeline and bringing into play current practices and tools such as Docker, Terraform, Jenkins, Consul, Vault, and Prometheus. The thesis aims to show why systems should be designed around a microservices architecture, how cloud-native applications are packaged into containers, how service discovery works, and many other trends within the DevOps realm that contribute to the continuous delivery pipeline. At BetterDoctor Inc., this project improved the average release cycle time, increased team members’ productivity and collaboration, and reduced infrastructure costs and deployment failure rates. With the CD pipeline in place, along with container orchestration tools, the organisation could achieve hyperscale computing as and when business demands.
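In Marathon, an immutable containerized service is described by an app definition, a JSON document POSTed to Marathon's /v2/apps endpoint. A minimal sketch of building such a definition follows; the service name, image, and resource values are illustrative placeholders, not the thesis's configuration:

```python
def marathon_app(app_id, image, instances=2, cpus=0.5, mem=256):
    """Build a minimal Marathon app definition (the JSON body sent
    to POST /v2/apps). Rolling out a new release means pushing a
    new immutable image tag, never mutating a running container."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,
        "mem": mem,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
        "healthChecks": [{"protocol": "HTTP", "path": "/health"}],
    }
```

A CI/CD pipeline would typically render this definition with the freshly built image tag and submit it to Marathon as the final deployment step.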
Serverless Computing: A Security Perspective
Serverless Computing is a virtualisation-related paradigm that promises to
simplify application management and to solve one of the last architectural
challenges in the field: scale down. The implied cost reduction, coupled with a
simplified management of underlying applications, are expected to further push
the adoption of virtualisation-based solutions, including cloud computing.
However, in this quest for efficiency, security is not ranked among the top
priorities, also because of the (misleading) belief that current solutions
developed for virtualised environments could be applied to this new paradigm.
Unfortunately, this is not the case, due to the highlighted idiosyncratic
features of serverless computing.
In this paper, we review the current serverless architectures, abstract their
founding principles, and analyse them from the point of view of security. We
show the security shortcomings of the analysed serverless architectural
paradigms, and point to possible countermeasures. We believe that our
contribution, other than being valuable on its own, also paves the way for
further research in this domain, a challenging and relevant one for both
industry and academia.
QoS and efficiency for FaaS platforms
Serverless computing, or function-as-a-service (FaaS), provides a way to write applications composed of scalable and manageable independent tasks that communicate seamlessly without developer involvement. Strict performance guarantees or service-level agreements (SLAs) provided by cloud vendors demand predictable performance from serverless applications. Performance predictability in a datacenter environment suffers due to contention for hardware resources. In this study, we evaluate the effects of contention on two FaaS platforms: AWS Lambda, an industry leader in serverless, and the open-source OpenFaaS serverless stack. We develop a complete set of microbenchmarks as well as end-to-end applications composed of multiple functions as a benchmark suite to facilitate our study.
We quantify the baseline system costs of these applications across both stacks given traditional orchestration mechanisms in an isolated system. We also quantify the same with co-located workloads in a datacenter-like setting with Kubernetes orchestration. We show, via experiments, that significant performance slack exists at low to moderate loads and that we can intelligently colocate workloads to maximize hardware utilization while still meeting QoS target latencies. Finally, we present a contention-aware static scheduling solution for FaaS platforms with predictable performance and compare it to static versions of baseline related works. We find that an intelligent FaaS orchestrator can be built along similar lines (using similar hardware-level features) as a microservices one.