8 research outputs found

    Deep Quality-Value (DQV) Learning

    We introduce a novel Deep Reinforcement Learning (DRL) algorithm called Deep Quality-Value (DQV) Learning. DQV uses temporal-difference learning to train a Value neural network and uses this network to train a second Quality-value network that learns to estimate state-action values. We first test DQV's update rules with Multilayer Perceptrons as function approximators on two classic RL problems, and then extend DQV with Deep Convolutional Neural Networks, 'Experience Replay' and 'Target Neural Networks' to tackle four games of the Atari Arcade Learning environment. Our results show that DQV learns significantly faster and better than Deep Q-Learning and Double Deep Q-Learning, suggesting that our algorithm can potentially be a better performing synchronous temporal difference algorithm than what is currently present in DRL.
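
    The idea of training a state-value estimate with temporal-difference learning and reusing it as the target for the state-action estimate can be sketched in tabular form. This is only an illustration under stated assumptions: the abstract describes neural networks, whereas the sketch uses lookup tables, and the function name `qv_update`, the learning rates `alpha` and `beta`, and the discount `gamma` are hypothetical choices that do not come from the listing.

```python
from collections import defaultdict


def qv_update(V, Q, s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=0.99):
    """One tabular QV-style update (illustrative, not the paper's code).

    Both tables move toward the same target, which bootstraps from the
    state-value table V rather than from Q.
    """
    # TD target built from the value table; no bootstrap on terminal steps.
    target = r if done else r + gamma * V[s_next]
    # Temporal-difference update of the state-value estimate.
    V[s] += beta * (target - V[s])
    # The state-action estimate is trained toward the V-based target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target


# Usage: a single transition from state 0 to state 1 with reward 1.0.
V = defaultdict(float)
Q = defaultdict(float)
target = qv_update(V, Q, s=0, a=1, r=1.0, s_next=1, done=False)
```

    In a deep variant, the two tables would be replaced by function approximators trained by regression toward the same target.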

    Reinforcement learning in continuous state and action spaces

    Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, we mainly focus on methods that explicitly update a representation of a value function, a policy or both. We discuss considerations in choosing an appropriate representation for these functions and discuss gradient-based and gradient-free ways to update the parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and actor-critic methods. We discuss the advantages of different approaches and empirically compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy.

    The QV Family Compared to Other Reinforcement Learning Algorithms

    This paper describes several new online model-free reinforcement learning (RL) algorithms. We designed three new reinforcement algorithms, namely: QV2, QVMAX, and QVMAX2, that are all based on the QV-learning algorithm, but in contrast to QV-learning, QVMAX and QVMAX2 are off-policy RL algorithms and QV2 is a new on-policy RL algorithm. We experimentally compare these algorithms to a large number of different RL algorithms, namely: Q-learning, Sarsa, R-learning, Actor-Critic, QV-learning, and ACLA. We show experiments on five maze problems of varying complexity. Furthermore, we show experimental results on the cart pole balancing problem. The results show that for different problems, there can be large performance differences between the different algorithms, and that there is not a single RL algorithm that always performs best, although on average QV-learning scores highest.

    Sustainable scheduling policies for radio access networks based on LTE technology

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. In LTE access networks, Radio Resource Management (RRM) is one of the most important modules and is responsible for the overall management of radio resources. The packet scheduler is a sub-module which assigns the existing radio resources to each user in order to deliver the requested services in the most efficient manner. Data packets are scheduled dynamically at every Transmission Time Interval (TTI), a time window used to take the user’s requests and to respond to them accordingly. The scheduling procedure is conducted by using scheduling rules which select different users to be scheduled at each TTI based on some priority metrics. Various scheduling rules exist and they behave differently by balancing the scheduler performance in the direction imposed by one of the following objectives: increasing the system throughput, maintaining the user fairness, respecting the Guaranteed Bit Rate (GBR), Head of Line (HoL) packet delay, packet loss rate and queue stability requirements. Most of the static scheduling rules follow the sequential multi-objective optimization in the sense that when the first targeted objective is satisfied, then other objectives can be prioritized. When the targeted scheduling objective(s) can be satisfied at each TTI, the LTE scheduler is considered to be optimal or feasible. So, the scheduling performance depends on the exploited rule being focused on particular objectives. This study aims to increase the percentage of feasible TTIs for a given downlink transmission by applying a mixture of scheduling rules instead of using one discipline adopted across the entire scheduling session.
    Two types of optimization problems are proposed in this sense: Dynamic Scheduling Rule based Sequential Multi-Objective Optimization (DSR-SMOO) when the applied scheduling rules address the same objective and Dynamic Scheduling Rule based Concurrent Multi-Objective Optimization (DSR-CMOO) if the pool of rules addresses different scheduling objectives. The best way of solving such complex optimization problems is to adapt and to refine scheduling policies which are able to call different rules at each TTI based on the best matching scheduler conditions (states). The idea is to develop a set of non-linear functions which map the scheduler state at each TTI to optimal probability distributions for selecting the best scheduling rule. Due to the multi-dimensional and continuous characteristics of the scheduler state space, the scheduling functions should be approximated. Moreover, the function approximations are learned through the interaction with the RRM environment. The Reinforcement Learning (RL) algorithms are used in this sense in order to evaluate and to refine the scheduling policies for the considered DSR-SMOO/CMOO optimization problems. The neural networks are used to train the non-linear mapping functions based on the interaction among the intelligent controller, the LTE packet scheduler and the RRM environment. In order to enhance the convergence in the feasible state and to reduce the scheduler state space dimension, meta-heuristic approaches are used for the channel statement aggregation. Simulation results show that the proposed aggregation scheme is able to outperform other heuristic methods.
    When the aggregation scheme of the channel statements is exploited, the proposed DSR-SMOO/CMOO problems focusing on different objectives which are solved by using various RL approaches are able to: increase the mean percentage of feasible TTIs, minimize the number of TTIs when the RL approaches punish the actions taken TTI-by-TTI, and minimize the variation of the performance indicators when different simulations are launched in parallel. This way, the obtained scheduling policies being focused on the multi-objective criteria are sustainable.
    Keywords: LTE, packet scheduling, scheduling rules, multi-objective optimization, reinforcement learning, channel, aggregation, scheduling policies, sustainable