Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in Mujoco robot simulation tasks, the first empirical
success of emphatic algorithms in prevailing deep RL benchmarks.
Comment: NeurIPS 201
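The abstract above builds on off-policy policy-gradient estimation. The paper's own estimator uses the emphatic machinery of the Generalized Off-Policy Policy Gradient Theorem, which is not reproduced here; as a hedged illustration only, the sketch below shows the standard per-step importance-sampling correction that off-policy actor-critic methods generally build on (all names and numbers are illustrative, not the paper's algorithm).

```python
def is_weight(pi_prob, mu_prob):
    # rho_t = pi(a_t|s_t) / mu(a_t|s_t): the importance ratio that
    # corrects for acting under a behavior policy mu while estimating
    # gradients for a target policy pi.
    return pi_prob / mu_prob


def off_policy_pg_term(rho, advantage, grad_log_pi):
    # One sample of a generic off-policy policy-gradient estimator:
    # rho_t * A_t * grad log pi(a_t|s_t). Geoff-PAC additionally
    # reweights such samples with emphatic traces (not shown here).
    return rho * advantage * grad_log_pi


# Example: target policy is twice as likely as behavior to pick this action.
rho = is_weight(0.5, 0.25)            # -> 2.0
sample = off_policy_pg_term(rho, 1.5, 0.5)
```

The emphatic reweighting in the paper replaces the plain ratio with a trace that accounts for how often states are visited under the target policy, which is what makes the gradient of the counterfactual objective unbiased.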
A deep reinforcement learning based homeostatic system for unmanned position control
Deep Reinforcement Learning (DRL) has proven capable of learning optimal control policies by minimising error in dynamic systems. In many real-world operations, however, the exact behaviour of the environment is unknown: random changes cause the system to reach different states for the same action. Applying DRL in such unpredictable environments is therefore difficult, as the states of the world cannot be known when the transition and reward functions are non-stationary. In this paper, a mechanism to encapsulate the randomness of the environment is proposed using a novel bio-inspired homeostatic approach based on a hybrid of the Receptor Density Algorithm (an anomaly-detection method based on artificial immune systems) and a Plastic Spiking Neuronal model. DRL is then introduced to run in conjunction with this hybrid model. The system is tested on a vehicle that must autonomously re-position itself in an unpredictable environment. Our results show that the DRL-based process control raised the accuracy of the hybrid model by 32%.
All-Optical Reinforcement Learning in Solitonic X-Junctions
Ethology has shown that animal groups or colonies can perform complex calculations by distributing simple decision-making processes to the group members. For example, ant colonies can optimize trajectories toward food by both reinforcing (or cancelling) pheromone traces and switching from one path to another with a stronger pheromone trail. These ant processes can be implemented in photonic hardware to reproduce stigmergic signal processing. We present innovative, fully integrated X-junctions realized using solitonic waveguides that can provide both of these decision-making processes. The proposed X-junctions can switch from symmetric (50/50) to asymmetric (80/20) behavior using optical feedback, suppressing unused output channels or reinforcing the used ones.
A Shared Task on Bandit Learning for Machine Translation
We introduce and describe the results of a novel shared task on bandit
learning for machine translation. The task was organized jointly by Amazon and
Heidelberg University for the first time at the Second Conference on Machine
Translation (WMT 2017). The goal of the task is to encourage research on
learning machine translation from weak user feedback instead of human
references or post-edits. On each of a sequence of rounds, a machine
translation system is required to propose a translation for an input, and
receives a real-valued estimate of the quality of the proposed translation for
learning. This paper describes the shared task's learning and evaluation setup,
using services hosted on Amazon Web Services (AWS), the data and evaluation
metrics, and the results of various machine translation architectures and
learning protocols.
Comment: Conference on Machine Translation (WMT) 201
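The round-based protocol described above can be sketched as a simple bandit loop: each round, the system proposes a translation for an input and observes only a scalar quality estimate, never a reference or post-edit. The sketch below is a toy illustration under assumed names; the policy, candidates, and update rule are placeholders, not the shared task's actual AWS-hosted service or any participant's system.

```python
import random


def propose_translation(source, weights):
    # Placeholder "policy": score a few trivial candidate strings with a
    # toy per-hypothesis weight table plus tiny random tie-breaking noise.
    candidates = [source.upper(), source.lower(), source.title()]
    scores = [weights.get(c, 0.0) + random.random() * 1e-3 for c in candidates]
    return max(zip(scores, candidates))[1]


def bandit_round(source, weights, quality_fn, lr=0.1):
    # One round of the protocol: propose a translation, receive a
    # real-valued quality estimate, and update from that feedback alone.
    hyp = propose_translation(source, weights)
    reward = quality_fn(hyp)  # scalar feedback, e.g. a BLEU-like proxy
    weights[hyp] = weights.get(hyp, 0.0) + lr * reward
    return hyp, reward


weights = {}
hyp, reward = bandit_round("hello world", weights, lambda h: 1.0)
```

The key constraint the task enforces is visible in the signature of `bandit_round`: the learner never sees a reference translation, only the reward for the single hypothesis it proposed.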
Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments
In the NIPS 2017 Learning to Run challenge, participants were tasked with
building a controller for a musculoskeletal model to make it run as fast as
possible through an obstacle course. Top participants were invited to describe
their algorithms. In this work, we present eight solutions that used deep
reinforcement learning approaches, based on algorithms such as Deep
Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region
Policy Optimization. Many solutions use similar relaxations and heuristics,
such as reward shaping, frame skipping, discretization of the action space,
symmetry, and policy blending. However, each of the eight teams implemented
different modifications of the known algorithms.
Comment: 27 pages, 17 figures
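Of the relaxations listed above, frame skipping is the easiest to make concrete: the policy repeats each chosen action for several simulator steps, accumulating reward, so it acts at a coarser timescale. Below is a minimal sketch with an interface loosely modeled on Gym-style environments; `ToyEnv` and all names are illustrative stand-ins, not the NIPS 2017 challenge's musculoskeletal environment.

```python
class ToyEnv:
    # Trivial stand-in environment: each step returns the step count as
    # the observation, reward 1.0, and terminates after 10 steps.
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10


class FrameSkip:
    # Wrapper implementing the frame-skipping relaxation: repeat the same
    # action `skip` times and sum the rewards collected along the way.
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def step(self, action):
        total_reward, obs, done = 0.0, None, False
        for _ in range(self.skip):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done


env = FrameSkip(ToyEnv(), skip=4)
obs, reward, done = env.step(0)  # one policy step = 4 simulator steps
```

Skipping frames shortens the effective horizon the policy must plan over, which is one reason several teams found it helpful in the physically detailed musculoskeletal simulation.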