9 research outputs found

    Toward Robust Long Range Policy Transfer

    Full text link
    Humans can master a new task within a few trials by drawing upon skills acquired through prior experience. To mimic this capability, hierarchical models combining primitive policies learned from prior tasks have been proposed. However, these methods fall short of the human range of transferability. We propose a method that leverages the hierarchical structure to alternately train the combination function and adapt the set of diverse primitive policies, efficiently producing a range of complex behaviors on challenging new tasks. We also design two regularization terms to improve the diversity and utilization rate of the primitives in the pre-training phase. We demonstrate that our method outperforms other recent policy transfer methods by combining and adapting these reusable primitives in tasks with continuous action spaces. The experimental results further show that our approach provides a broader transfer range, and an ablation study shows that the regularization terms are critical for long-range policy transfer. Finally, we show that our method consistently outperforms other methods when the quality of the primitives varies. Comment: Accepted by AAAI 202
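
    The abstract describes alternately training a combination function over a set of frozen primitive policies, with regularization that keeps the primitives diverse. The following is a minimal, hypothetical sketch of that idea, not the authors' code: a softmax gate mixes the primitives' continuous actions, and an illustrative penalty rewards pairwise action diversity. All names, dimensions, and the exact regularizer are assumptions.

    import torch
    import torch.nn as nn

    class Gate(nn.Module):
        """Maps a state to mixture weights over K frozen primitive policies."""
        def __init__(self, state_dim, num_primitives):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.Tanh(),
                nn.Linear(64, num_primitives),
            )

        def forward(self, state):
            return torch.softmax(self.net(state), dim=-1)   # weights sum to 1

    def composite_action(state, primitives, gate):
        """Weighted combination of the primitives' continuous actions."""
        weights = gate(state)                                    # (K,)
        actions = torch.stack([p(state) for p in primitives])    # (K, act_dim)
        return (weights.unsqueeze(-1) * actions).sum(dim=0)      # (act_dim,)

    def diversity_penalty(state, primitives):
        """Illustrative regularizer: push primitives toward distinct actions."""
        actions = torch.stack([p(state) for p in primitives])    # (K, act_dim)
        return -torch.cdist(actions, actions).mean()             # lower = more diverse

    # Purely illustrative usage with random linear "primitives":
    primitives = [nn.Linear(8, 2) for _ in range(4)]
    gate = Gate(state_dim=8, num_primitives=4)
    action = composite_action(torch.randn(8), primitives, gate)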

    DeepSoCS: A Neural Scheduler for Heterogeneous System-on-Chip (SoC) Resource Scheduling

    Full text link
    In this paper, we present a novel scheduling solution for a class of System-on-Chip (SoC) systems in which heterogeneous chip resources (DSP, FPGA, GPU, etc.) must be efficiently scheduled for continuously arriving hierarchical jobs whose tasks are represented by a directed acyclic graph. Heuristic algorithms have traditionally been used across many resource scheduling domains, and Heterogeneous Earliest Finish Time (HEFT) has long been the dominant state-of-the-art technique for heterogeneous resource scheduling. Despite their long-standing popularity, HEFT-like algorithms are known to be vulnerable to even small amounts of noise in the environment. Our Deep Reinforcement Learning (DRL)-based SoC Scheduler (DeepSoCS), which learns the "best" task ordering under dynamic environment changes, overcomes the brittleness of rule-based schedulers such as HEFT and achieves significantly higher performance across different job types. We describe the DeepSoCS design process using a real-time heterogeneous SoC scheduling emulator, discuss the major challenges, and present two novel neural network design features that lead to outperforming HEFT: (i) hierarchical job- and task-graph embedding; and (ii) efficient use of real-time task information in the state space. Furthermore, we introduce effective techniques to address two fundamental challenges in our environment: delayed consequences and joint actions. Through an extensive simulation study, we show that DeepSoCS achieves significantly better job execution time than HEFT, with a higher level of robustness under realistic noise conditions. We conclude with a discussion of potential improvements to the DeepSoCS neural scheduler. Comment: 18 pages, Accepted by Electronics 202
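
    As an illustration only, not the DeepSoCS architecture, the sketch below shows the general shape of a learned task-ordering step: a small network scores the ready tasks of a DAG from a few hand-picked features, and the top-scoring task is dispatched to the earliest-available heterogeneous resource. The feature layout, dictionary structures, and network size are assumptions.

    import torch
    import torch.nn as nn

    scorer = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

    def schedule_step(ready_tasks, resource_free_at, now):
        """Pick the next (task, resource) pair.

        ready_tasks: list of dicts with per-task features (assumed layout).
        resource_free_at: dict mapping resource name -> time it becomes free.
        """
        feats = torch.tensor(
            [[t["exec_time"], t["out_degree"], t["depth"], now - t["release_time"]]
             for t in ready_tasks],
            dtype=torch.float32)
        scores = scorer(feats).squeeze(-1)            # one learned score per ready task
        task = ready_tasks[int(scores.argmax())]      # learned task ordering
        resource = min(resource_free_at, key=resource_free_at.get)  # earliest-free resource
        return task, resource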

    Comyco: Quality-Aware Adaptive Video Streaming via Imitation Learning

    Full text link
    Learning-based Adaptive Bit Rate (ABR) methods, which aim to learn strong strategies without presumptions, have become a research hotspot for adaptive streaming. However, they typically suffer from several issues, namely low sample efficiency and lack of awareness of video quality information. In this paper, we propose Comyco, a video quality-aware ABR approach that substantially improves on learning-based methods by tackling these issues. Comyco trains its policy by imitating expert trajectories given by an instant solver, which not only avoids redundant exploration but also makes better use of the collected samples. Meanwhile, Comyco attempts to pick the chunk with the higher perceptual video quality rather than the higher bitrate. To achieve this, we construct Comyco's neural network architecture, video datasets, and QoE metrics around video quality features. Using trace-driven and real-world experiments, we demonstrate significant improvements in Comyco's sample efficiency compared to prior work: a 1700x reduction in the number of samples required and a 16x reduction in training time. Moreover, the results show that Comyco outperforms previously proposed methods, with improvements in average QoE of 7.5%-16.79%. In particular, Comyco surpasses the state-of-the-art approach Pensieve by 7.37% in average video quality under the same rebuffering time. Comment: ACM Multimedia 201
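
    The core training idea, imitating expert trajectories produced by an instant solver, can be sketched as a plain supervised update. The following is a hypothetical illustration, not Comyco's network or feature set: the policy's logits over quality levels are pushed toward the expert's choice with a cross-entropy loss. The feature size and number of quality levels are assumptions.

    import torch
    import torch.nn as nn

    NUM_LEVELS = 6   # assumed number of selectable quality levels
    policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, NUM_LEVELS))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

    def imitation_update(states, expert_choices):
        """One supervised step toward the expert's chunk choices.

        states: (B, 16) tensor of observed throughput/buffer/quality features.
        expert_choices: (B,) tensor of quality-level indices from the instant solver.
        """
        logits = policy(states)
        loss = nn.functional.cross_entropy(logits, expert_choices)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()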

    Learning Scheduling Algorithms for Data Processing Clusters

    Full text link
    Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly efficient policies automatically. Our system, Decima, uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to a 2x improvement during periods of high cluster load.
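
    To give a feel for the dependency-graph representation the abstract refers to, here is a minimal sketch, with assumptions throughout and not Decima's actual embedding: each DAG node's embedding sums messages from its children, and a linear layer then scores each stage for the scheduling policy.

    import torch
    import torch.nn as nn

    FEAT = 8                                            # assumed per-node feature size
    msg = nn.Sequential(nn.Linear(FEAT, FEAT), nn.ReLU())
    score = nn.Linear(FEAT, 1)

    def embed_dag(node_feats, children):
        """node_feats: (N, FEAT) tensor; children[i] lists the child indices of node i."""
        emb = [None] * len(node_feats)

        def visit(i):
            if emb[i] is None:
                child_msgs = [msg(visit(c)) for c in children[i]]
                agg = torch.stack(child_msgs).sum(0) if child_msgs else torch.zeros(FEAT)
                emb[i] = node_feats[i] + agg            # node embedding absorbs its subtree
            return emb[i]

        for i in range(len(node_feats)):
            visit(i)
        return torch.stack(emb)                         # (N, FEAT)

    def stage_scores(node_feats, children):
        """Per-stage scores that a scheduling policy could turn into a dispatch decision."""
        return score(embed_dag(node_feats, children)).squeeze(-1)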

    Evaluation of blood glucose level control in Type 1 diabetic patients using online and offline reinforcement learning

    Get PDF
    Patients with Type 1 diabetes are required to closely monitor their blood glucose levels and administer insulin to manage them. Automated glucose control methods that eliminate the need for human intervention have been proposed, and recently reinforcement learning, a type of machine learning algorithm, has been used as an effective control method in simulated environments. Currently, the methods used by diabetes patients, such as the basal-bolus regimen and continuous glucose monitors, have limitations and still require manual intervention. PID controllers are widely used for their simplicity and robustness, but they are sensitive to external factors that affect their effectiveness. Existing work in the research literature has mainly focused on improving the accuracy of these control algorithms, yet there is still room for improvement in adaptability to individual patients. The next phase of research aims to further optimize current methods and adapt the algorithms to better control blood glucose levels. Machine learning proposals have partially paved the way, but they can produce generic models with limited adaptability. One potential solution is to use reinforcement learning (RL) to train the algorithms on individual patient data. In this thesis, we propose a closed-loop control of blood glucose levels based on deep reinforcement learning. We describe an initial evaluation of several alternatives conducted on a realistic simulator of the glucoregulatory system and propose a particular implementation strategy based on reducing the frequency of the observations and rewards passed to the agent and using a simple reward function. We train agents with this strategy for three groups of patient classes, evaluate them, and compare them with alternative control baselines. Our results show that our method with Proximal Policy Optimization outperforms traditional baselines as well as similar recent proposals, achieving longer periods of safe and low-risk glycemic state. Extending this contribution, we note that the practical application of blood glucose control algorithms would require trial-and-error interaction with patients, which is a limitation for training the system effectively. As an alternative, offline reinforcement learning does not require interaction with subjects, and preliminary research suggests that promising results can be achieved with datasets collected offline, as with classical machine learning algorithms. However, the application of offline reinforcement learning to glucose control has not yet been evaluated. Thus, in this thesis, we comprehensively evaluate two offline reinforcement learning algorithms for blood glucose control and examine their potential and limitations. We assess the impact on training and performance of the method used to generate the training datasets, the type of trajectories employed (sequences of states, actions, and rewards experienced by an agent in an environment over time, from a single method or mixed), the quality of those trajectories, and the size of the datasets, and we compare the algorithms with commonly used baselines such as PID and Proximal Policy Optimization.
Our results demonstrate that one of the offline reinforcement learning algorithms evaluated, Trajectory Transformer, is able to perform at the same level as the baselines, but without the need for interaction with real patients during training. Escuela Internacional de Doctorado de la Universidad Politécnica de Cartagena. Universidad Politécnica de Cartagena. Programa de Doctorado en Tecnologías de la Información y las Comunicaciones.
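
    The implementation strategy mentioned above, reduced observation frequency and a simple reward function, could look roughly like the sketch below. The thresholds, penalties, and sampling factor are assumptions, not the thesis' exact formulation.

    def glucose_reward(glucose_mg_dl):
        """Piecewise reward over blood glucose (mg/dL); thresholds are illustrative."""
        if 70 <= glucose_mg_dl <= 180:
            return 1.0       # safe glycemic range
        if glucose_mg_dl < 70:
            return -2.0      # hypoglycemia, penalized more heavily
        return -1.0          # hyperglycemia

    def downsample(observations, every_n=6):
        """Keep one observation out of every `every_n`, e.g. 5-minute CGM readings
        reduced to 30-minute agent steps."""
        return observations[::every_n]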