69 research outputs found
Using the online cross-entropy method to learn relational policies for playing different games
By defining a video-game environment as a collection of objects, relations, actions and rewards, the relational reinforcement learning algorithm presented in this paper generates and optimises a set of concise, human-readable relational rules for achieving maximal reward. Rule learning is achieved using a combination of incremental specialisation of rules and a modified online cross-entropy method, which dynamically adjusts the rate of learning as the agent progresses. The algorithm is tested on the Ms. Pac-Man and Mario environments, with results indicating that the agent learns an effective policy for acting within each environment.
Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces
Policy optimization methods have shown great promise in solving complex
reinforcement and imitation learning tasks. While model-free methods are
broadly applicable, they often require many samples to optimize complex
policies. Model-based methods greatly improve sample-efficiency but at the cost
of poor generalization, requiring a carefully handcrafted model of the system
dynamics for each task. Recently, hybrid methods have been successful in
trading off applicability for improved sample-complexity. However, these have
been limited to continuous action spaces. In this work, we present a new hybrid
method based on an approximation of the dynamics as an expectation over the
next state under the current policy. This relaxation allows us to derive a
novel hybrid policy gradient estimator, combining score function and pathwise
derivative estimators, that is applicable to discrete action spaces. We show
significant gains in sample complexity, ranging between and ,
when learning parameterized policies on Cart Pole, Acrobot, Mountain Car and
Hand Mass. Our method is applicable to both discrete and continuous action
spaces, whereas competing pathwise methods are limited to the latter.
Comment: In AAAI 2018 proceeding
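To make the term "score function estimator" concrete, here is a minimal Python sketch for a softmax policy over a handful of discrete actions. It illustrates only the plain score-function (REINFORCE-style) half, not the hybrid estimator the abstract proposes. The names `score_function_gradient` and `exact_gradient`, the fixed per-action reward vector, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def score_function_gradient(logits, rewards, n_samples, rng):
    """Monte Carlo estimate of d/d(logits) E_{a ~ softmax(logits)}[rewards[a]].

    Uses the identity grad E[f(a)] = E[f(a) * grad log pi(a)].
    For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    """
    probs = softmax(logits)
    actions = rng.choice(len(probs), size=n_samples, p=probs)
    one_hot = np.eye(len(probs))[actions]
    grad_log_prob = one_hot - probs
    return (rewards[actions][:, None] * grad_log_prob).mean(axis=0)

def exact_gradient(logits, rewards):
    """Closed-form gradient for comparison: p_k * (f_k - E[f])."""
    probs = softmax(logits)
    return probs * (rewards - probs @ rewards)

# Hypothetical toy problem: three actions with fixed rewards.
logits = np.array([0.1, -0.3, 0.5])
rewards = np.array([1.0, 0.0, 2.0])
rng = np.random.default_rng(0)
estimate = score_function_gradient(logits, rewards, 200_000, rng)
exact = exact_gradient(logits, rewards)
```

With enough samples the estimate should approach the closed-form gradient. The estimator needs no derivative of the reward, which is why it applies to discrete actions, but it typically has higher variance than pathwise estimators.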
Automatic Grammar Augmentation for Robust Voice Command Recognition
This paper proposes a novel pipeline for automatic grammar augmentation that
provides a significant improvement in the voice command recognition accuracy
for systems with a small-footprint acoustic model (AM). The improvement is
achieved by augmenting the user-defined voice command set, also called grammar
set, with alternate grammar expressions. For a given grammar set, a set of
potential grammar expressions (candidate set) for augmentation is constructed
from an AM-specific statistical pronunciation dictionary that captures the
consistent patterns and errors in the decoding of AM induced by variations in
pronunciation, pitch, tempo, accent, ambiguous spellings, and noise conditions.
Using this candidate set, greedy optimization based and cross-entropy-method
(CEM) based algorithms are considered to search for an augmented grammar set
with improved recognition accuracy utilizing a command-specific dataset. Our
experiments show that the proposed pipeline along with algorithms considered in
this paper significantly reduce the mis-detection and mis-classification rate
without increasing the false-alarm rate. Experiments also demonstrate the
consistently superior performance of CEM over greedy-based algorithms.
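As a rough illustration of how a cross-entropy method can search over subsets of a candidate set, the Python sketch below keeps one inclusion probability per candidate, samples subsets, and moves the probabilities toward the best-scoring samples. It is a generic CEM under stated assumptions, not the paper's pipeline: `cem_subset_search`, `toy_gains`, and `toy_score` are hypothetical names, and the toy score stands in for recognition accuracy measured on a command-specific dataset.

```python
import numpy as np

def cem_subset_search(score_fn, n_items, n_iters=30, pop_size=100,
                      elite_frac=0.2, smoothing=0.7, seed=0):
    """Search for a high-scoring subset with the cross-entropy method.

    score_fn takes a boolean mask of length n_items and returns a number
    to maximise. Returns the best mask seen and its score.
    """
    rng = np.random.default_rng(seed)
    probs = np.full(n_items, 0.5)          # independent inclusion probabilities
    n_elite = max(1, int(pop_size * elite_frac))
    best_mask, best_score = None, -np.inf
    for _ in range(n_iters):
        # Sample a population of candidate subsets.
        samples = rng.random((pop_size, n_items)) < probs
        scores = np.array([score_fn(mask) for mask in samples])
        # Keep the top-scoring samples and refit the probabilities to them.
        elite = samples[np.argsort(scores)[-n_elite:]]
        probs = smoothing * elite.mean(axis=0) + (1 - smoothing) * probs
        probs = np.clip(probs, 0.02, 0.98)  # avoid premature collapse
        top = int(np.argmax(scores))
        if scores[top] > best_score:
            best_mask, best_score = samples[top].copy(), float(scores[top])
    return best_mask, best_score

# Hypothetical stand-in objective: each candidate adds or removes accuracy.
toy_gains = np.array([3.0, -1.0, 2.0, -2.0, 0.5, -0.5])

def toy_score(mask):
    return toy_gains[mask].sum()

best_mask, best_score = cem_subset_search(toy_score, len(toy_gains))
```

In a real setting the score function would be far more expensive and would couple the candidates, which is where sampling whole subsets can help relative to adding one candidate at a time.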
Controlling Level of Unconsciousness by Titrating Propofol with Deep Reinforcement Learning
Reinforcement Learning (RL) can be used to fit a mapping from patient state
to a medication regimen. Prior studies have used deterministic and value-based
tabular learning to learn a propofol dose from an observed anesthetic state.
Deep RL replaces the table with a deep neural network and has been used to
learn medication regimens from registry databases. Here we perform the first
application of deep RL to closed-loop control of anesthetic dosing in a
simulated environment. We use the cross-entropy method to train a deep neural
network to map an observed anesthetic state to a probability of infusing a
fixed propofol dosage. During testing, we implement a deterministic policy that
transforms the probability of infusion to a continuous infusion rate. The model
is trained and tested on simulated pharmacokinetic/pharmacodynamic models with
randomized parameters to ensure robustness to patient variability. The deep RL
agent significantly outperformed a proportional-integral-derivative controller
(median absolute performance error 1.7% +/- 0.6 versus 3.4% +/- 1.2, respectively). Modeling
continuous input variables instead of a table affords more robust pattern
recognition and utilizes our prior domain knowledge. Deep RL learned a smooth
policy with a natural interpretation to data scientists and anesthesia care
providers alike.
Comment: International Conference on Artificial Intelligence in Medicine 202