306 research outputs found

    Energy-Efficient GPU Clusters Scheduling for Deep Learning

    Training deep neural networks (DNNs) is a major workload in datacenters today, driving a tremendously fast growth in energy consumption. It is important to reduce energy consumption while still completing DL training jobs early. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion Time (JCT) under an energy budget. We first present performance models for DL training jobs that predict throughput and energy consumption under different configurations. Based on these performance models, PowerFlow dynamically allocates GPUs and adjusts the GPU-level or job-level configurations of DL training jobs. PowerFlow applies network packing and buddy allocation to job placement, thus avoiding the extra energy consumed by cluster fragmentation. Evaluation results show that, under the same energy consumption, PowerFlow improves the average JCT by 1.57-3.39x compared to competitive baselines.
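
    The fragmentation-avoidance mechanism mentioned above builds on buddy allocation. Below is a minimal sketch of a buddy allocator for power-of-two GPU blocks; the `BuddyGpuAllocator` class is an illustrative assumption for this listing, not PowerFlow's actual implementation.

```python
import math

class BuddyGpuAllocator:
    """Classic buddy allocation over a power-of-two pool of GPUs."""
    def __init__(self, total_gpus: int):
        assert total_gpus & (total_gpus - 1) == 0, "pool must be a power of two"
        self.levels = int(math.log2(total_gpus)) + 1
        # free_lists[k] holds start offsets of free blocks of size 2**k
        self.free_lists = [[] for _ in range(self.levels)]
        self.free_lists[self.levels - 1].append(0)

    def alloc(self, n_gpus: int):
        """Return the start offset of a free block of >= n_gpus GPUs, or None."""
        k = max(0, math.ceil(math.log2(n_gpus)))
        j = k
        while j < self.levels and not self.free_lists[j]:
            j += 1  # look for the smallest free block that fits
        if j == self.levels:
            return None  # nothing large enough: the job must queue
        start = self.free_lists[j].pop()
        while j > k:  # split larger blocks, keeping the upper buddy free
            j -= 1
            self.free_lists[j].append(start + 2 ** j)
        return start

    def free(self, start: int, n_gpus: int):
        k = max(0, math.ceil(math.log2(n_gpus)))
        # coalesce with the buddy whenever the buddy is also free
        while k < self.levels - 1:
            buddy = start ^ (2 ** k)
            if buddy not in self.free_lists[k]:
                break
            self.free_lists[k].remove(buddy)
            start = min(start, buddy)
            k += 1
        self.free_lists[k].append(start)

pool = BuddyGpuAllocator(total_gpus=16)
a = pool.alloc(4)   # offsets 0-3
b = pool.alloc(3)   # rounded up to a 4-GPU block, offsets 4-7
pool.free(a, 4)
```

    Rounding requests up to power-of-two blocks keeps free space in large contiguous chunks, which is what limits the fragmentation (and the stranded, energy-wasting GPUs) the abstract refers to.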

    Serverless Computing Strategies on Cloud Platforms

    With the development of Cloud Computing, the delivery of virtualized resources over the Internet has grown greatly in recent years. Functions as a Service (FaaS), one of the newest service models within Cloud Computing, allows the development and deployment of event-based applications that cover managed services in public and on-premises Clouds. Public Cloud providers adopt the FaaS model within their catalogs to provide event-driven, highly scalable computing for applications. Meanwhile, developers specialized in this technology focus on creating open-source serverless frameworks to avoid lock-in with public Cloud providers. Despite the progress achieved by serverless computing, there are fields related to data processing and execution performance optimization where its full potential has not yet been explored. This doctoral thesis defines three serverless computing strategies that demonstrate the benefits of this technology for data processing. The implemented strategies enable data analysis, with the integration of accelerator devices, for the efficient execution of scientific applications on public and on-premises Cloud platforms. First, the CloudTrail-Tracker platform was developed to extract and process learning analytics in the Cloud. CloudTrail-Tracker is an event-driven open-source platform for serverless data processing that can automatically scale up and down, featuring the ability to scale to zero to minimize operational costs. Next, the integration of GPUs in an event-driven on-premises serverless platform for scalable data processing is presented. The platform supports the execution of applications as serverless functions in response to the upload of a file to a storage system, allowing the parallel execution of applications according to the available resources. This processing is managed by an elastic Kubernetes cluster that automatically grows and shrinks according to processing needs. Approaches based on GPU virtualization technologies such as rCUDA and NVIDIA-Docker are evaluated to speed up the execution time of the functions. Finally, another solution based on the serverless model is implemented to run the inference phase of previously trained machine learning models, on the Amazon Web Services platform and on a private platform with the OSCAR framework. The system grows elastically according to demand and scales to zero to minimize costs, while the front-end provides the user with a simplified experience for obtaining predictions from machine learning models. To demonstrate the functionality and advantages of the solutions proposed in this thesis, several case studies are presented, covering different fields of knowledge such as learning analytics and Artificial Intelligence. This shows the wide range of applications where serverless computing can bring great benefits. The results obtained endorse the use of the serverless model in simplifying the design of data-intensive architectures for complex applications.
    Naranjo Delgado, DM. (2021). Serverless Computing Strategies on Cloud Platforms [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/160916
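
    The event-driven pattern the thesis describes (a function fires on a file upload, runs inference with a pre-trained model, and writes the result back) can be sketched as follows. This is a minimal illustration: the `handler` signature, the event fields, and `DummyModel` are assumptions for the sketch, not the actual interfaces of OSCAR or AWS Lambda.

```python
import json
import pathlib

class DummyModel:
    """Stand-in for a previously trained model (an assumption of this sketch)."""
    def predict(self, data: bytes) -> float:
        return len(data) / 1024.0  # trivial placeholder "inference"

MODEL = None  # loaded lazily so only cold starts pay the loading cost

def handler(event: dict) -> dict:
    """Entry point fired by a storage event (a file upload)."""
    global MODEL
    if MODEL is None:
        MODEL = DummyModel()  # a real deployment would deserialize the trained model
    input_path = pathlib.Path(event["input_file"])  # the file that triggered the event
    prediction = MODEL.predict(input_path.read_bytes())
    output_path = input_path.with_suffix(".pred.json")
    output_path.write_text(json.dumps({"prediction": prediction}))
    return {"status": "ok", "output": str(output_path)}
```

    Because each upload triggers an independent invocation, the platform can run many such functions in parallel and scale the backing cluster to zero when no events arrive, which is where the cost savings come from.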

    Network Contention-Aware Cluster Scheduling with Reinforcement Learning

    With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can significantly degrade training throughput. However, widely used scheduling policies often face limitations because they are agnostic to network contention between jobs. In this paper, we present a new approach to mitigating network contention in GPU clusters using reinforcement learning. We formulate GPU cluster scheduling as a reinforcement learning problem and learn a network contention-aware scheduling policy that efficiently captures contention sensitivities and dynamically adapts scheduling decisions through continuous evaluation and improvement. We show that, compared to widely used scheduling policies, our approach reduces the average job completion time by up to 18.2% and cuts the tail job completion time by up to 20.7%, while allowing a preferable trade-off between average job completion time and resource utilization.
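
    As a minimal illustration of such an RL formulation, the toy environment below encodes the state as per-node communication load plus the next job's profile, an action places the job on a node, and the reward penalizes contention-inflated completion time. The `ClusterEnv` interface and reward shaping are assumptions for the sketch, not the paper's actual formulation.

```python
import random

class ClusterEnv:
    """Toy GPU-cluster scheduling environment with a contention penalty."""
    def __init__(self, nodes: int, jobs: list):
        self.nodes = nodes
        self.pending = list(jobs)          # each job: {"comm": float, "len": float}
        self.placed = {n: [] for n in range(nodes)}

    def state(self):
        # per-node aggregate communication intensity plus the next job's profile
        load = [sum(j["comm"] for j in self.placed[n]) for n in range(self.nodes)]
        nxt = self.pending[0]
        return load + [nxt["comm"], nxt["len"]]

    def step(self, action: int):
        """Place the next job on node `action`; the reward penalizes contention."""
        job = self.pending.pop(0)
        contention = sum(j["comm"] for j in self.placed[action]) * job["comm"]
        self.placed[action].append(job)
        reward = -job["len"] * (1.0 + contention)  # contention inflates completion time
        done = not self.pending
        return ([] if done else self.state()), reward, done

env = ClusterEnv(nodes=4, jobs=[{"comm": random.random(), "len": 1.0} for _ in range(8)])
obs, done = env.state(), False
while not done:
    obs, reward, done = env.step(random.randrange(4))  # a learned policy goes here
```

    Training a policy (e.g. with a policy-gradient method) against episodes of this kind is what lets the scheduler learn each job's contention sensitivity instead of being agnostic to it.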

    KungFu: Making Training in Distributed Machine Learning Adaptive

    When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe KungFu, a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.
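
    The following sketch illustrates the Adaptation Policy idea: monitored metrics flow in each step, and a control action (here, cluster rescaling) fires when a condition holds. The `Cluster` and `GrowOnHighNoisePolicy` classes are illustrative assumptions, not KungFu's actual API.

```python
class Cluster:
    """Stand-in for a runtime that can rescale the worker set."""
    def __init__(self, workers: int):
        self.workers = workers

    def resize(self, workers: int) -> None:
        print(f"rescaling {self.workers} -> {workers} workers")
        self.workers = workers

class GrowOnHighNoisePolicy:
    """Grow the cluster when the gradient noise scale suggests a larger
    effective batch (i.e. more workers) would still train efficiently."""
    def __init__(self, cluster: Cluster, noise_threshold: float, max_workers: int):
        self.cluster = cluster
        self.noise_threshold = noise_threshold
        self.max_workers = max_workers

    def on_metrics(self, step: int, noise_scale: float) -> None:
        # invoked by monitoring operators embedded in the dataflow graph
        if noise_scale > self.noise_threshold and self.cluster.workers < self.max_workers:
            self.cluster.resize(min(self.cluster.workers * 2, self.max_workers))

# usage: the training loop (or a monitoring operator) feeds metrics each step
cluster = Cluster(workers=4)
policy = GrowOnHighNoisePolicy(cluster, noise_threshold=8.0, max_workers=32)
for step, noise in enumerate([2.0, 5.0, 9.5, 3.0]):
    policy.on_metrics(step, noise)
```

    Expressing the trigger as a small policy object, rather than a fixed schedule baked into the training script, is what the abstract means by adapting hyper- and system parameters during training.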

    Elastic Distributed Training of Deep Neural Networks

    Master's thesis, Graduate School of Seoul National University, College of Engineering, Department of Computer Science and Engineering, August 2021 (advisor: 이경근). As the training of Deep Neural Network (DNN) models relies more and more heavily on shared GPU clusters or cloud computing services, elastic training of DNNs offers potential gains for both the users and the managers of shared clusters, such as better idle-resource utilization, shorter job completion time (JCT), and improved responsiveness. However, making a distributed DNN training job elastic is not a trivial problem, because the job's state must be handled appropriately upon scaling events. Moreover, it is even more challenging to achieve both an efficient scaling mechanism and correct job state management, which are two conflicting goals. In this thesis, we discuss the problem of state management in elastic distributed DNN training jobs, and propose a design for a fast and safe elastic DNN training system that can support various types of training jobs. We implemented an elastic training framework, named Elastic Parallax, and validated our system on data-parallel training workloads.
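
    The thesis classifies job state into replicated, partitioned, and singleton constraints, each needing a different synchronization operation when the worker set changes. The sketch below illustrates that idea; the `JobState` layout and the resync logic are assumptions for illustration, not Elastic Parallax's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class JobState:
    replicated: dict = field(default_factory=dict)   # e.g. model weights: identical on every worker
    partitioned: list = field(default_factory=list)  # e.g. dataset shards: split across workers
    singleton: dict = field(default_factory=dict)    # e.g. checkpoint writer: exactly one owner

def resync_on_rescale(state: JobState, new_workers: int):
    """Recompute per-worker assignments so each state class stays consistent."""
    # replicated: broadcast from a surviving worker so newcomers catch up
    replicas = {w: state.replicated for w in range(new_workers)}
    # partitioned: re-shard round-robin so every element has exactly one owner
    shards = {w: state.partitioned[w::new_workers] for w in range(new_workers)}
    # singleton: re-elect a single owner (here, simply the lowest rank)
    owner = 0
    return replicas, shards, owner

state = JobState(replicated={"weights": [0.1, 0.2]},
                 partitioned=list(range(10)),
                 singleton={"ckpt_writer": True})
replicas, shards, owner = resync_on_rescale(state, new_workers=4)
```

    The conflict the abstract mentions shows up here: resynchronizing every state class on each scaling event is what makes rescaling safe, but doing it quickly enough to keep scaling cheap is the hard engineering problem.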