649 research outputs found

    POLCA: Power Oversubscription in LLM Cloud Providers

    Full text link
    Recent innovation in large language models (LLMs), and their myriad use-cases have rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud providers and other enterprises have made substantial plans of growth in their datacenters to support these new workloads. One of the key bottleneck resources in datacenters is power, and given the increasing model sizes of LLMs, they are becoming increasingly power intensive. In this paper, we show that there is a significant opportunity to oversubscribe power in LLM clusters. Power oversubscription improves the power efficiency of these datacenters, allowing more deployable servers per datacenter, and reduces the deployment time, since building new datacenters is slow. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the inference and training power consumption patterns. Based on our analysis of these LLMs, we claim that the average and peak power utilization in LLM clusters for inference should not be very high. Our deductions align with the data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment, makes it challenging to have a reliable and robust power oversubscription mechanism. We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in the same GPU cluster for inference, with minimal performance los

    On Reliability-Aware Server Consolidation in Cloud Datacenters

    Full text link
    In the past few years, datacenter (DC) energy consumption has become an important issue in technology world. Server consolidation using virtualization and virtual machine (VM) live migration allows cloud DCs to improve resource utilization and hence energy efficiency. In order to save energy, consolidation techniques try to turn off the idle servers, while because of workload fluctuations, these offline servers should be turned on to support the increased resource demands. These repeated on-off cycles could affect the hardware reliability and wear-and-tear of servers and as a result, increase the maintenance and replacement costs. In this paper we propose a holistic mathematical model for reliability-aware server consolidation with the objective of minimizing total DC costs including energy and reliability costs. In fact, we try to minimize the number of active PMs and racks, in a reliability-aware manner. We formulate the problem as a Mixed Integer Linear Programming (MILP) model which is in form of NP-complete. Finally, we evaluate the performance of our approach in different scenarios using extensive numerical MATLAB simulations.Comment: International Symposium on Parallel and Distributed Computing (ISPDC), Innsbruck, Austria, 201
    corecore