Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in Mujoco robot simulation tasks, the first empirical
success of emphatic algorithms in prevailing deep RL benchmarks.
Comment: NeurIPS 201
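The abstract above builds on off-policy policy-gradient estimation. The paper's own estimator uses the emphatic machinery of the Generalized Off-Policy Policy Gradient Theorem, which is not reproduced here; as a hedged illustration only, the sketch below shows the standard per-step importance-sampling correction that off-policy actor-critic methods generally build on (all names and numbers are illustrative, not the paper's algorithm).

```python
def is_weight(pi_prob, mu_prob):
    # rho_t = pi(a_t|s_t) / mu(a_t|s_t): the importance ratio that
    # corrects for acting under a behavior policy mu while estimating
    # gradients for a target policy pi.
    return pi_prob / mu_prob


def off_policy_pg_term(rho, advantage, grad_log_pi):
    # One sample of a generic off-policy policy-gradient estimator:
    # rho_t * A_t * grad log pi(a_t|s_t). Geoff-PAC additionally
    # reweights such samples with emphatic traces (not shown here).
    return rho * advantage * grad_log_pi


# Example: target policy is twice as likely as behavior to pick this action.
rho = is_weight(0.5, 0.25)            # -> 2.0
sample = off_policy_pg_term(rho, 1.5, 0.5)
```

The emphatic reweighting in the paper replaces the plain ratio with a trace that accounts for how often states are visited under the target policy, which is what makes the gradient of the counterfactual objective unbiased.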
A deep reinforcement learning based homeostatic system for unmanned position control
Deep Reinforcement Learning (DRL) has proven capable of learning optimal control policies by minimising error in dynamic systems. In many real-world operations, however, the exact behaviour of the environment is unknown: random changes cause the system to reach different states for the same action. Applying DRL in such unpredictable environments is therefore difficult, as the states of the world cannot be known when the transition and reward functions are non-stationary. In this paper, a mechanism to encapsulate the randomness of the environment is proposed using a novel bio-inspired homeostatic approach based on a hybrid of the Receptor Density Algorithm (an anomaly-detection method based on artificial immune systems) and a Plastic Spiking Neuronal model. DRL is then introduced to run in conjunction with this hybrid model. The system is tested on a vehicle that must autonomously re-position itself in an unpredictable environment. Our results show that the DRL-based process control raised the accuracy of the hybrid model by 32%.
All-Optical Reinforcement Learning in Solitonic X-Junctions
Ethology has shown that animal groups or colonies can perform complex calculations by distributing simple decision-making processes to the group members. For example, ant colonies can optimize trajectories toward food by both reinforcing (or cancelling) pheromone traces and switching from one path to another with a stronger pheromone trail. These ant processes can be implemented in photonic hardware to reproduce stigmergic signal processing. We present innovative, fully integrated X-junctions realized using solitonic waveguides that can provide both of these decision-making processes. The proposed X-junctions can switch from symmetric (50/50) to asymmetric (80/20) behavior using optical feedback, suppressing unused output channels or reinforcing the used ones.
A Shared Task on Bandit Learning for Machine Translation
We introduce and describe the results of a novel shared task on bandit
learning for machine translation. The task was organized jointly by Amazon and
Heidelberg University for the first time at the Second Conference on Machine
Translation (WMT 2017). The goal of the task is to encourage research on
learning machine translation from weak user feedback instead of human
references or post-edits. On each of a sequence of rounds, a machine
translation system is required to propose a translation for an input, and
receives a real-valued estimate of the quality of the proposed translation for
learning. This paper describes the shared task's learning and evaluation setup,
using services hosted on Amazon Web Services (AWS), the data and evaluation
metrics, and the results of various machine translation architectures and
learning protocols.
Comment: Conference on Machine Translation (WMT) 201
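The round-based protocol described above can be sketched as a simple bandit loop: each round, the system proposes a translation for an input and observes only a scalar quality estimate, never a reference or post-edit. The sketch below is a toy illustration under assumed names; the policy, candidates, and update rule are placeholders, not the shared task's actual AWS-hosted service or any participant's system.

```python
import random


def propose_translation(source, weights):
    # Placeholder "policy": score a few trivial candidate strings with a
    # toy per-hypothesis weight table plus tiny random tie-breaking noise.
    candidates = [source.upper(), source.lower(), source.title()]
    scores = [weights.get(c, 0.0) + random.random() * 1e-3 for c in candidates]
    return max(zip(scores, candidates))[1]


def bandit_round(source, weights, quality_fn, lr=0.1):
    # One round of the protocol: propose a translation, receive a
    # real-valued quality estimate, and update from that feedback alone.
    hyp = propose_translation(source, weights)
    reward = quality_fn(hyp)  # scalar feedback, e.g. a BLEU-like proxy
    weights[hyp] = weights.get(hyp, 0.0) + lr * reward
    return hyp, reward


weights = {}
hyp, reward = bandit_round("hello world", weights, lambda h: 1.0)
```

The key constraint the task enforces is visible in the signature of `bandit_round`: the learner never sees a reference translation, only the reward for the single hypothesis it proposed.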
Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments
In the NIPS 2017 Learning to Run challenge, participants were tasked with
building a controller for a musculoskeletal model to make it run as fast as
possible through an obstacle course. Top participants were invited to describe
their algorithms. In this work, we present eight solutions that used deep
reinforcement learning approaches, based on algorithms such as Deep
Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region
Policy Optimization. Many solutions use similar relaxations and heuristics,
such as reward shaping, frame skipping, discretization of the action space,
symmetry, and policy blending. However, each of the eight teams implemented
different modifications of the known algorithms.
Comment: 27 pages, 17 figures
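Of the relaxations listed above, frame skipping is the easiest to make concrete: the policy repeats each chosen action for several simulator steps, accumulating reward, so it acts at a coarser timescale. Below is a minimal sketch with an interface loosely modeled on Gym-style environments; `ToyEnv` and all names are illustrative stand-ins, not the NIPS 2017 challenge's musculoskeletal environment.

```python
class ToyEnv:
    # Trivial stand-in environment: each step returns the step count as
    # the observation, reward 1.0, and terminates after 10 steps.
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10


class FrameSkip:
    # Wrapper implementing the frame-skipping relaxation: repeat the same
    # action `skip` times and sum the rewards collected along the way.
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def step(self, action):
        total_reward, obs, done = 0.0, None, False
        for _ in range(self.skip):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done


env = FrameSkip(ToyEnv(), skip=4)
obs, reward, done = env.step(0)  # one policy step = 4 simulator steps
```

Skipping frames shortens the effective horizon the policy must plan over, which is one reason several teams found it helpful in the physically detailed musculoskeletal simulation.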