6,054 research outputs found

    Learning Scheduling Algorithms for Data Processing Clusters

    Full text link
    Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load

    Robust resource management for time-critical tasks in the cloud-edge continuum

    Get PDF
    As an emerging distributed computing paradigm, the Cloud-edge continuum (CEC) leverages the strengths of both cloud computing and edge computing to provide efficient and effective services to end-users. CEC enables faster processing of data and provides multiple benefits, including scalability, data security, and improved quality of service. With the increasing demand for real-time data processing, the proliferation of the Internet of Things (IoT) devices, and the growing need for data privacy and security, CEC has been developing, evolving, and adapting quickly. Cloud computing provides scalable and flexible computing infrastructure, while edge computing offers low latency and location-awareness capabilities. How to schedule the tasks in a CEC among its exploding amount of resources is a challenge for both service providers and users. QoS (quality of service) or QoE (Quality of experience) are metrics that describe this process and are often adopted as the optimization objective. Among all kinds of resource management optimization approaches, learning-based task scheduling and offloading have gained popularity in recent years. To overcome these limitations, researchers have turned to machine learning techniques to develop more intelligent and adaptive resource management algorithms. However, the machine learning-based methods in CEC also face several challenges: 1. The performance of learning-based resource management is difficult to maintain when the pattern of time-critical tasks is dynamically changing;2. Learning-based resource management strategies are difficult to adapt when continuum resources are highly heterogeneous;3. Learning-based resource management suffers from low robustness when optimizing multiple objectives.My thesis tackles these challenges, and we propose a Meta-Learning-based resource management framework to deal with time-critical requests spanning from independent tasks to complex workflows in a dynamic cloud-edge continuum. Our goal is to improve the robustness and adaptivity of the resource management framework in highly changing environments

    Robust resource management for time-critical tasks in the cloud-edge continuum

    Get PDF
    As an emerging distributed computing paradigm, the Cloud-edge continuum (CEC) leverages the strengths of both cloud computing and edge computing to provide efficient and effective services to end-users. CEC enables faster processing of data and provides multiple benefits, including scalability, data security, and improved quality of service. With the increasing demand for real-time data processing, the proliferation of the Internet of Things (IoT) devices, and the growing need for data privacy and security, CEC has been developing, evolving, and adapting quickly. Cloud computing provides scalable and flexible computing infrastructure, while edge computing offers low latency and location-awareness capabilities. How to schedule the tasks in a CEC among its exploding amount of resources is a challenge for both service providers and users. QoS (quality of service) or QoE (Quality of experience) are metrics that describe this process and are often adopted as the optimization objective. Among all kinds of resource management optimization approaches, learning-based task scheduling and offloading have gained popularity in recent years. To overcome these limitations, researchers have turned to machine learning techniques to develop more intelligent and adaptive resource management algorithms. However, the machine learning-based methods in CEC also face several challenges: 1. The performance of learning-based resource management is difficult to maintain when the pattern of time-critical tasks is dynamically changing;2. Learning-based resource management strategies are difficult to adapt when continuum resources are highly heterogeneous;3. Learning-based resource management suffers from low robustness when optimizing multiple objectives.My thesis tackles these challenges, and we propose a Meta-Learning-based resource management framework to deal with time-critical requests spanning from independent tasks to complex workflows in a dynamic cloud-edge continuum. Our goal is to improve the robustness and adaptivity of the resource management framework in highly changing environments
    • …
    corecore