32,467 research outputs found

    Off-Policy Actor-Critic

    This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action-value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
    Comment: Full version of the paper, appendix and errata included; Proceedings of the 2012 International Conference on Machine Learning.
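
    As a rough illustration of the kind of update the abstract describes, the sketch below performs one incremental off-policy actor-critic step with linear function approximation, importance weighting, and eligibility traces. It simplifies the critic to an importance-weighted TD(lambda) update rather than the paper's gradient-TD critic, and the step sizes and function names (pi_prob, b_prob, grad_log_pi) are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of one incremental off-policy actor-critic update with linear
# function approximation, importance weighting, and eligibility traces.  The
# critic is simplified relative to the paper's gradient-TD critic; all step
# sizes and callables are illustrative.

def offpac_step(x, x_next, a, r, gamma, lam,
                v, e_v, u, e_u,
                pi_prob, b_prob, grad_log_pi,
                alpha_v=0.01, alpha_u=0.001):
    """One update from a transition generated by the behaviour policy b."""
    rho = pi_prob(x, a, u) / b_prob(x, a)                  # importance-sampling ratio
    delta = r + gamma * np.dot(v, x_next) - np.dot(v, x)   # TD error of the linear critic

    # Critic: importance-weighted eligibility trace and weight update.
    e_v = rho * (gamma * lam * e_v + x)
    v = v + alpha_v * delta * e_v

    # Actor: trace over the score function, also weighted by rho.
    e_u = rho * (gamma * lam * e_u + grad_log_pi(x, a, u))
    u = u + alpha_u * delta * e_u
    return v, e_v, u, e_u
```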

    Episodic self-imitation learning with hindsight

    Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state–action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode during the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in continuous-control environments with sparse rewards. In the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving performance comparable to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With the capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems with continuous action spaces, such as robot guidance and manipulation.
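
    To make the episode-level idea concrete, the sketch below relabels a whole episode with the goal it actually achieved and then filters transitions with a simple selection rule. The dictionary keys, discount, and the "keep transitions whose relabelled return beats the critic's estimate" criterion are assumptions for illustration, not the paper's exact module.

```python
# Illustrative sketch: hindsight relabelling of an entire episode plus a simple
# selection filter.  Data layout, discount, and the filtering rule are assumed.

GAMMA = 0.98

def relabel_with_final_goal(episode, compute_reward):
    """Replace every transition's goal with the goal actually achieved at the end."""
    final_goal = episode[-1]["achieved_goal"]
    return [{**t,
             "goal": final_goal,
             "reward": compute_reward(t["achieved_goal"], final_goal)}
            for t in episode]

def select_informative(relabelled, value_fn, margin=0.0):
    """Keep transitions whose relabelled return exceeds the critic's estimate."""
    returns, G = [], 0.0
    for t in reversed(relabelled):
        G = t["reward"] + GAMMA * G      # discounted return-to-go, computed backwards
        returns.append(G)
    returns.reverse()
    return [t for t, G in zip(relabelled, returns)
            if G - value_fn(t["state"], t["goal"]) > margin]
```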

    On-Policy Trust Region Policy Optimisation with Replay Buffers

    Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of improving on-policy reinforcement learning by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all information about the policies that existed before. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create a method that combines the advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the Q-, value, and advantage functions to data from multiple policies. The method uses trust region optimisation while avoiding some common problems of algorithms such as TRPO and ACKTR: it replaces the trust region selection heuristics with hyperparameters and uses a trainable covariance matrix instead of a fixed one. In many cases, the method improves results not only compared to state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy counterpart DDPG.
    Engineering and Physical Sciences Research Council (EPSRC)
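
    A rough sketch of the kind of objective such a method can optimise is given below: an importance-weighted policy surrogate over samples drawn from several recent policies, with a KL term standing in for the trust-region constraint. The stored log-probabilities come from whichever policy generated each sample; the KL coefficient is an illustrative hyperparameter, not the paper's exact formulation.

```python
import torch

# Hedged sketch: importance-weighted surrogate over multi-policy replay data.
# behav_logp holds log-probabilities under the (possibly different) policy that
# collected each sample; kl_to_behav is a per-sample KL estimate.

def multi_policy_surrogate(new_logp, behav_logp, advantages, kl_to_behav,
                           kl_coef=1.0):
    """Loss to minimise: negative weighted advantage plus a KL regulariser."""
    ratio = torch.exp(new_logp - behav_logp)   # pi_new(a|s) / pi_behaviour(a|s)
    return -(ratio * advantages).mean() + kl_coef * kl_to_behav.mean()
```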

    Sparse temporal difference learning via alternating direction method of multipliers

    Recent work in off-line reinforcement learning has focused on efficient algorithms that incorporate feature selection, via ℓ1-regularization, into the Bellman-operator fixed-point estimators. These developments mean that over-fitting can now be avoided when the number of samples is small compared to the number of features. However, it remains unclear whether existing algorithms can offer good approximations for the task of policy evaluation and improvement. In this paper, we propose a new algorithm for approximating the fixed point based on the Alternating Direction Method of Multipliers (ADMM). We demonstrate, with experimental results, that the proposed algorithm is more stable for policy iteration than prior work. Furthermore, we derive a theoretical result stating that the proposed algorithm obtains a solution which satisfies the optimality conditions for the fixed-point problem.
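
    For orientation, the sketch below shows the standard ADMM iteration for a generic ℓ1-regularised least-squares problem, min_w 0.5·||Aw − b||² + λ||w||₁, which stands in for the regularised fixed-point estimator discussed above; A, b, λ, and the penalty parameter ρ are illustrative, not the paper's specific Bellman construction.

```python
import numpy as np

# Minimal ADMM sketch for min_w 0.5*||A w - b||^2 + lam*||w||_1, used here only
# to illustrate the alternating-direction structure; inputs are illustrative.

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1(A, b, lam, rho=1.0, iters=200):
    n = A.shape[1]
    w, z, u = (np.zeros(n) for _ in range(3))
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))   # factor once, reuse every iteration
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        w = np.linalg.solve(L.T, np.linalg.solve(L, rhs))   # w-update (ridge solve)
        z = soft_threshold(w + u, lam / rho)                # z-update (proximal step)
        u = u + w - z                                       # scaled dual update
    return z
```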

    Introducing Self-Learning into Robotic Arm Using Deep Reinforcement Learning

    With the growing trend of autonomous machines, the combination of supervised and unsupervised machine learning techniques has been explored to provide optimal solutions for self-learning. In robotics, the curse of dimensionality makes convergence of machine learning difficult, whether supervised or unsupervised. Therefore, reinforcement learning, which, like unsupervised learning, often requires a large number of trials for effective learning, faces serious challenges in robotic applications. Consequently, choosing an appropriate algorithm that performs well is of utmost importance. In this work, a robotic arm with 6 degrees of freedom combines supervised and unsupervised learning techniques through deep reinforcement learning, which helps the robot become autonomous. It uses a camera image as input to generate state observations and a distance measure for the reward signal. It learns the optimum policy for selecting, given a particular state observation, the action that achieves the maximum reward. The off-policy Deep Q Network (DQN) algorithm is implemented in this design and deployed on the robotic arm to independently learn the optimum movements for achieving a given task in a controlled environment.
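
    As a hedged sketch of the core DQN ingredients described above, the snippet below shows a one-step TD loss with a target network over discretised arm actions and a dense reward derived from the gripper-to-target distance. The network architecture, discount, tensor shapes, and the exact reward rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hedged sketch: one-step DQN TD loss with a target network, plus a dense
# distance-based reward.  Shapes and hyperparameters are assumed.

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One-step TD loss on a batch of (image, action, reward, next_image, done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) for taken actions
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values         # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.smooth_l1_loss(q_sa, target)

def distance_reward(gripper_xyz, target_xyz):
    """Dense reward: negative Euclidean distance from end-effector to target."""
    return -torch.norm(gripper_xyz - target_xyz, dim=-1)
```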