    Learning Scheduling Algorithms for Data Processing Clusters

    Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load.
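    As a rough, hedged illustration of the kind of policy the Decima abstract describes (not its actual implementation), the Python sketch below embeds a job's dependency DAG with a toy message-passing step and picks a runnable stage with a softmax policy; the weights, per-stage features, DAG, and runnable set are all made-up placeholders.

        import numpy as np

        rng = np.random.default_rng(0)
        W_msg = rng.normal(size=(4, 4))    # message-passing weights (toy values)
        W_score = rng.normal(size=(4,))    # scoring weights (toy values)

        # Job DAG: stage -> list of downstream stages, plus a small feature
        # vector per stage (e.g. remaining work, number of tasks).
        job_dag = {0: [2], 1: [2], 2: [3], 3: []}
        features = {n: rng.normal(size=4) for n in job_dag}

        def embed(node, dag, feats, memo):
            # Embed a stage as tanh(own features + W_msg @ sum of child embeddings).
            if node not in memo:
                agg = sum((embed(c, dag, feats, memo) for c in dag[node]), np.zeros(4))
                memo[node] = np.tanh(feats[node] + W_msg @ agg)
            return memo[node]

        # Softmax policy over the stages that are currently runnable.
        runnable = [0, 1]
        memo = {}
        scores = np.array([W_score @ embed(n, job_dag, features, memo) for n in runnable])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        action = rng.choice(runnable, p=probs)
        print("schedule stage", action, "with probabilities", probs.round(3))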

    System Optimisation for Multi-access Edge Computing Based on Deep Reinforcement Learning

    Multi-access edge computing (MEC) is an emerging and important distributed computing paradigm that aims to extend cloud service to the network edge to reduce network traffic and service latency. Proper system optimisation and maintenance are crucial to maintaining high Quality of Service (QoS) for end-users. However, with the increasing complexity of the architecture of MEC and mobile applications, effectively optimising MEC systems is non-trivial. Traditional optimisation methods are generally based on simplified mathematical models and fixed heuristics, which rely heavily on expert knowledge. As a consequence, when facing dynamic MEC scenarios, considerable human effort and expertise are required to redesign the model and tune the heuristics, which is time-consuming. This thesis aims to develop deep reinforcement learning (DRL) methods to handle system optimisation problems in MEC. Instead of developing fixed heuristic algorithms for the problems, this thesis aims to design DRL-based methods that enable systems to learn optimal solutions on their own. This research demonstrates the effectiveness of DRL-based methods on two crucial system optimisation problems: task offloading and service migration. Specifically, this thesis first investigates the dependent task offloading problem that considers the inner dependencies of tasks. This research builds a DRL-based method combining a sequence-to-sequence (seq2seq) neural network to address the problem. Experimental results demonstrate that our method outperforms existing heuristic algorithms and achieves near-optimal performance. To further enhance the learning efficiency of the DRL-based task offloading method for unseen learning tasks, this thesis then integrates meta reinforcement learning to handle the task offloading problem. Our method can adapt quickly to new environments with a small number of gradient updates and samples. Finally, this thesis explores a DRL-based solution for the service migration problem in MEC, considering user mobility. This research models the service migration problem as a Partially Observable Markov Decision Process (POMDP) and proposes a tailored actor-critic algorithm combining Long Short-Term Memory (LSTM) to solve the POMDP. Results from extensive experiments based on real-world mobility traces demonstrate that our method consistently outperforms both the heuristic and state-of-the-art learning-driven algorithms across various MEC scenarios.
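    The following Python sketch illustrates only the dependent-task-offloading setting the thesis describes, not its DRL or seq2seq method: it evaluates the makespan of one offloading decision vector on a small task DAG, i.e. the kind of cost a learned offloading policy would be trained to minimize. The speeds, transmission cost, DAG, and workloads are assumed for the example.

        def topo_order(dag):
            # Simple DFS topological order; dag maps task -> list of parent tasks.
            order, seen = [], set()
            def visit(t):
                if t in seen:
                    return
                seen.add(t)
                for p in dag[t]:
                    visit(p)
                order.append(t)
            for t in dag:
                visit(t)
            return order

        def finish_times(dag, work, decisions, local_speed=1.0, edge_speed=4.0, tx=0.5):
            # decisions[i]: 0 = run task i locally, 1 = offload it to the edge server.
            # Returns per-task finish times honouring the dependency order.
            done = {}
            for t in topo_order(dag):
                ready = max((done[p] for p in dag[t]), default=0.0)   # parents finish first
                speed = edge_speed if decisions[t] else local_speed
                comm = tx if decisions[t] else 0.0                    # upload cost
                done[t] = ready + comm + work[t] / speed
            return done

        # Toy application: task -> parent tasks, plus per-task workload.
        dag = {0: [], 1: [0], 2: [0], 3: [1, 2]}
        work = {0: 2.0, 1: 4.0, 2: 1.0, 3: 3.0}
        plan = [0, 1, 0, 1]   # offload tasks 1 and 3
        print("makespan:", max(finish_times(dag, work, plan).values()))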

    Robust resource management for time-critical tasks in the cloud-edge continuum

    As an emerging distributed computing paradigm, the Cloud-edge continuum (CEC) leverages the strengths of both cloud computing and edge computing to provide efficient and effective services to end-users. CEC enables faster processing of data and provides multiple benefits, including scalability, data security, and improved quality of service. With the increasing demand for real-time data processing, the proliferation of Internet of Things (IoT) devices, and the growing need for data privacy and security, CEC has been developing, evolving, and adapting quickly. Cloud computing provides scalable and flexible computing infrastructure, while edge computing offers low latency and location-awareness capabilities. How to schedule tasks in a CEC across its rapidly growing pool of resources is a challenge for both service providers and users. Quality of Service (QoS) and Quality of Experience (QoE) are metrics that describe this process and are often adopted as the optimization objective. Among resource management optimization approaches, learning-based task scheduling and offloading have gained popularity in recent years: to overcome the limitations of fixed models and heuristics, researchers have turned to machine learning techniques to develop more intelligent and adaptive resource management algorithms. However, machine learning-based methods in CEC also face several challenges: 1. The performance of learning-based resource management is difficult to maintain when the pattern of time-critical tasks is dynamically changing; 2. Learning-based resource management strategies are difficult to adapt when continuum resources are highly heterogeneous; 3. Learning-based resource management suffers from low robustness when optimizing multiple objectives. My thesis tackles these challenges, and we propose a Meta-Learning-based resource management framework to deal with time-critical requests spanning from independent tasks to complex workflows in a dynamic cloud-edge continuum. Our goal is to improve the robustness and adaptivity of the resource management framework in highly dynamic environments.
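    As a hedged sketch of the meta-learning idea behind such a framework (not the thesis implementation), the snippet below runs a Reptile-style loop that learns an initialization for a scheduling-policy parameter vector so that a few gradient steps adapt it to a newly sampled workload pattern; the quadratic per-workload loss is a toy stand-in for the real scheduling objective.

        import numpy as np

        rng = np.random.default_rng(1)

        def task_loss_grad(theta, target):
            # Toy per-workload objective: gradient of 0.5 * ||theta - target||^2.
            return theta - target

        theta = np.zeros(3)                      # meta-initialization of the policy
        inner_lr, outer_lr, inner_steps = 0.1, 0.5, 5

        for meta_iter in range(100):
            target = rng.normal(size=3)          # sample a workload pattern ("task")
            phi = theta.copy()
            for _ in range(inner_steps):         # fast adaptation on this workload
                phi -= inner_lr * task_loss_grad(phi, target)
            theta += outer_lr * (phi - theta)    # Reptile meta-update toward adapted phi

        print("meta-learned initialization:", theta.round(3))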

    Dynamic Resource Allocation in Industrial Internet of Things (IIoT) using Machine Learning Approaches

    In today's era of rapid smart equipment development and the Industrial Revolution, the application scenarios for Internet of Things (IoT) technology are expanding widely. The combination of IoT and industrial manufacturing systems gives rise to the Industrial IoT (IIoT). However, due to resource limitations such as computational units and battery capacity in IIoT devices (IIEs), it is crucial to execute computationally intensive tasks efficiently. The dynamic and continuous generation of tasks poses a significant challenge to managing the limited resources in the IIoT environment. This paper proposes a collaborative approach for optimal offloading and resource allocation of highly sensitive industrial IoT tasks. Firstly, the computation-intensive IIoT tasks are transformed into a directed acyclic graph (DAG). Then, task offloading is treated as an optimization problem, taking into account the models of processor resources and energy consumption for the offloading scheme. Lastly, a dynamic resource allocation approach is introduced to allocate computing resources to the edge-cloud server for the execution of computation-intensive tasks. The proposed joint offloading and scheduling (JOS) algorithm constructs this DAG and prepares an offloading queue. The queue is built using collaborative Q-learning-based reinforcement learning, and optimal resources are allocated to JOS for the execution of the tasks in the offloading queue. For this, a machine learning approach is used to predict and allocate resources. The paper compares conventional and machine learning-based resource allocation methods. The machine learning approach performs better in terms of response time, delay, and energy consumption. The proposed algorithm shows that energy usage increases with task size, and response time increases with the number of users. Among the algorithms compared, JOS has the lowest waiting time, followed by DQN, while Q-learning performs the worst. Based on these findings, the paper recommends adopting the machine learning approach, specifically the JOS algorithm, for joint offloading and resource allocation.
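    To make the Q-learning component concrete, here is a minimal sketch under stated assumptions (not the paper's JOS implementation): a tabular Q-learning agent chooses between local execution and offloading based on the length of the offloading queue, with toy dynamics for queue drain, delay, and energy cost.

        import random

        random.seed(0)
        N_STATES, ACTIONS = 6, (0, 1)            # queue length 0..5; 0 = local, 1 = offload
        Q = [[0.0, 0.0] for _ in range(N_STATES)]
        alpha, gamma, eps = 0.1, 0.9, 0.1

        def step(state, action):
            # Toy dynamics: offloading drains the queue faster but costs energy.
            drain = 2 if action == 1 else 1
            nxt = max(0, min(N_STATES - 1, state - drain + random.randint(0, 2)))
            reward = -nxt - (0.3 if action == 1 else 0.0)   # queueing delay + energy cost
            return nxt, reward

        state = N_STATES - 1
        for _ in range(20000):
            if random.random() < eps:                        # epsilon-greedy exploration
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[state][a])
            nxt, r = step(state, action)
            Q[state][action] += alpha * (r + gamma * max(Q[nxt]) - Q[state][action])
            state = nxt

        policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
        print("greedy action per queue length:", policy)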

    An Improved Scheduling with Advantage Actor-Critic for Storm Workloads

    Various resources are the essential elements of data centers, and completion time is vital to users. Given the persistence, periodicity, and spatial-temporal dependence of stream workloads, a new Storm scheduler based on Advantage Actor-Critic is proposed to improve resource utilization and minimize completion time. A new weighted embedding with a Graph Neural Network is designed to comprehensively capture the features of a job, including the dependencies, types, and positions of its tasks. An improved Advantage Actor-Critic that integrates task selection and executor assignment is proposed to schedule tasks to executors for better resource utilization. The status of tasks and executors is then updated for the next scheduling round. Compared to existing methods, experimental results show that the proposed Storm scheduler improves resource utilization; completion time is reduced by almost 17% on the TPC-H data set and by almost 25% on the Alibaba data set.
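    For readers unfamiliar with the update at the core of such a scheduler, the sketch below shows one advantage actor-critic step for assigning a task to one of K executors, using a linear softmax policy and a linear value function. It is an illustrative toy under assumed features and rewards, not the paper's GNN-based Storm scheduler.

        import numpy as np

        rng = np.random.default_rng(2)
        K, D = 3, 5                             # executors, feature size
        theta = rng.normal(size=(K, D)) * 0.1   # policy weights (softmax over executors)
        w = np.zeros(D)                         # value-function weights
        lr_pi, lr_v, gamma = 0.01, 0.05, 0.99

        def softmax(z):
            z = z - z.max()
            e = np.exp(z)
            return e / e.sum()

        x = rng.normal(size=D)                  # state features (task + executor status)
        probs = softmax(theta @ x)
        a = rng.choice(K, p=probs)              # sample an executor assignment
        reward = -1.0                           # e.g. negative completion-time increment
        x_next = rng.normal(size=D)             # features after the assignment

        # Advantage estimated as the TD error under the current value function.
        advantage = reward + gamma * (w @ x_next) - (w @ x)

        # Critic: TD(0) update.  Actor: policy-gradient step scaled by the advantage.
        w += lr_v * advantage * x
        grad_logp = -np.outer(probs, x)         # d log pi(a|x) / d theta, part 1
        grad_logp[a] += x                       # add x to the chosen executor's row
        theta += lr_pi * advantage * grad_logp
        print("updated assignment probabilities:", softmax(theta @ x).round(3))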