Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning
Reinforcement learning is a growing field in AI with considerable potential.
Intelligent behavior is learned automatically through trial and error in
interaction with the environment. However, this learning process is often
costly. Using variational quantum circuits as function approximators can reduce
this cost. To this end, we propose the quantum natural policy
gradient (QNPG) algorithm -- a second-order gradient-based routine that takes
advantage of an efficient approximation of the quantum Fisher information
matrix. We experimentally demonstrate that QNPG outperforms first-order
training on Contextual Bandits environments in terms of convergence speed and
stability, thereby reducing the sample complexity. Furthermore, we provide
evidence for the practical feasibility of our approach by training on a
12-qubit hardware device.

Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible. 7 pages, 5 figures, 1 table.
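The update at the heart of a natural policy gradient can be sketched classically as follows. This is only an illustrative sketch, not the paper's quantum routine: the Fisher matrix here stands in for the quantum Fisher information approximation, and all names are hypothetical.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """One natural-gradient update: theta <- theta + lr * F^{-1} grad.

    `fisher` plays the role of the (quantum) Fisher information matrix;
    `damping` regularizes the matrix inversion, as is common in practice.
    """
    f_reg = fisher + damping * np.eye(len(theta))
    return theta + lr * np.linalg.solve(f_reg, grad)

# Toy example: with F = I (and no damping) the step reduces to
# vanilla gradient ascent.
theta = np.zeros(3)
grad = np.array([1.0, 0.0, -1.0])
new_theta = natural_gradient_step(theta, grad, np.eye(3), lr=0.1, damping=0.0)
```

The damped solve rather than an explicit inverse mirrors how second-order methods are usually implemented for numerical stability.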
Deep Reinforcement Learning for Multi-Agent Interaction
The development of autonomous agents which can interact with other agents to
accomplish a given task is a core area of research in artificial intelligence
and machine learning. Towards this goal, the Autonomous Agents Research Group
develops novel machine learning algorithms for autonomous systems control, with
a specific focus on deep reinforcement learning and multi-agent reinforcement
learning. Research problems include scalable learning of coordinated agent
policies and inter-agent communication; reasoning about the behaviours, goals,
and composition of other agents from limited observations; and sample-efficient
learning based on intrinsic motivation, curriculum learning, causal inference,
and representation learning. This article provides a broad overview of the
ongoing research portfolio of the group and discusses open problems for future
directions.

Comment: Published in AI Communications Special Issue on Multi-Agent Systems
Research in the U
Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning
Finding unified complexity measures and algorithms for sample-efficient
learning is a central topic of research in reinforcement learning (RL). The
Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al.
(2021) as a necessary and sufficient complexity measure for sample-efficient
no-regret RL. This paper makes progress towards a unified theory for RL with
the DEC framework. First, we propose two new DEC-type complexity measures:
Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are
necessary and sufficient for sample-efficient PAC learning and reward-free
learning, thereby extending the original DEC which only captures no-regret
learning. Next, we design new unified sample-efficient algorithms for all three
learning goals. Our algorithms instantiate variants of the
Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model
estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA
improves upon the algorithms of Foster et al. (2021) which require either
bounding a variant of the DEC which may be prohibitively large, or designing
problem-specific estimation subroutines. As applications, we recover existing
and obtain new sample-efficient learning results for a wide range of tractable
RL problems using essentially a single algorithm. We also generalize the DEC to
give sample-efficient algorithms for all-policy model estimation, with
applications for learning equilibria in Markov Games. Finally, as a connection,
we re-analyze two existing optimistic model-based algorithms based on Posterior
Sampling or Maximum Likelihood Estimation, showing that they enjoy similar
regret bounds as E2D-TA under similar structural conditions as the DEC.
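For reference, the original DEC of Foster et al. (2021) can be written, up to notational conventions, as the following minimax quantity; this is a sketch of the standard no-regret form, not this paper's extended variants.

```latex
\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})
  \;=\; \inf_{p \in \Delta(\Pi)} \; \sup_{M \in \mathcal{M}} \;
    \mathbb{E}_{\pi \sim p}\!\left[
      f^{M}(\pi_{M}) - f^{M}(\pi)
      - \gamma \, D_{\mathrm{H}}^{2}\!\big(M(\pi), \bar{M}(\pi)\big)
    \right]
```

Here \(\Pi\) is the policy class, \(\pi_M\) the optimal policy and \(f^M\) the value under model \(M\), \(\bar{M}\) a reference model, and \(D_{\mathrm{H}}^{2}\) the squared Hellinger distance commonly used in this framework.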
Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient
Offline reinforcement learning, which aims at optimizing sequential
decision-making strategies with historical data, has been extensively applied
in real-life applications. State-of-the-art algorithms usually leverage
powerful function approximators (e.g. neural networks) to alleviate the sample
complexity hurdle for better empirical performances. Despite the successes, a
more systematic understanding of the statistical complexity for function
approximation remains lacking. Towards bridging the gap, we take a step by
considering offline reinforcement learning with differentiable function class
approximation (DFA). This function class naturally incorporates a wide range of
models with nonlinear/nonconvex structures. Most importantly, we show offline
RL with differentiable function approximation is provably efficient by
analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results
provide the theoretical basis for understanding a variety of practical
heuristics that rely on Fitted Q-Iteration style design. In addition, we
further improve our guarantee with a tighter instance-dependent
characterization. We hope our work could draw interest in studying
reinforcement learning with differentiable function approximation beyond the
scope of current research.
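A toy tabular sketch of a pessimistic fitted Q-iteration loop is given below, assuming a simple count-based uncertainty bonus. The names and the bonus form are illustrative, not the paper's PFQL instantiation or its instance-dependent characterization.

```python
import numpy as np

def pessimistic_fitted_q(dataset, n_states, n_actions, counts,
                         gamma=0.9, beta=1.0, iters=100):
    """Toy tabular pessimistic fitted Q-iteration.

    dataset: list of (s, a, r, s') transitions from a fixed offline dataset
    counts:  visitation counts used to form an uncertainty bonus
    The bonus beta / sqrt(count) shrinks value estimates for rarely
    observed states, enforcing pessimism about unseen behavior.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        target = np.zeros_like(q)
        for s, a, r, s2 in dataset:
            bonus = beta / np.sqrt(counts[s2])       # uncertainty penalty
            v_next = max(q[s2].max() - bonus, 0.0)   # pessimistic next value
            target[s, a] = r + gamma * v_next        # Bellman backup
        q = target
    return q

# Two-state toy: action 0 in state 0 yields reward 1 and leads to an
# absorbing state 1 with zero reward.
data = [(0, 0, 1.0, 1), (1, 0, 0.0, 1)]
counts = np.array([1.0, 4.0])
q = pessimistic_fitted_q(data, n_states=2, n_actions=1, counts=counts)
```

In the full setting the tabular backup is replaced by regression over a differentiable function class, which is exactly the step the paper analyzes.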
Universal Trading for Order Execution with Oracle Policy Distillation
As a fundamental problem in algorithmic trading, order execution aims at
fulfilling a specific trading order, either liquidation or acquirement, for a
given instrument. Towards effective execution strategy, recent years have
witnessed the shift from the analytical view with model-based market
assumptions to model-free perspective, i.e., reinforcement learning, due to its
nature of sequential decision optimization. However, the noisy and imperfect
market information that can be leveraged by the policy has made it quite
challenging to build sample-efficient reinforcement learning methods that
achieve effective order execution. In this paper, we propose a novel
universal trading policy optimization framework to bridge the gap between the
noisy yet imperfect market states and the optimal action sequences for order
execution. In particular, the framework leverages a policy distillation method
in which an oracle teacher with perfect information guides the learning of the
common policy towards practically optimal execution, approximating the optimal
trading strategy. Extensive experiments have shown significant
improvements of our method over various strong baselines, with reasonable
trading actions.

Comment: Accepted at AAAI 2021; the code and the supplementary materials are
at https://seqml.github.io/opd
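One common way to combine an RL objective with a distillation term that pulls a student policy toward a privileged teacher can be sketched as follows. This is an illustrative objective under assumed names, not the paper's exact formulation.

```python
import numpy as np

def distillation_loss(student_probs, teacher_probs, rl_loss,
                      alpha=0.5, eps=1e-8):
    """Sketch of an oracle-distillation objective: the student's own RL
    loss plus a KL term pulling the student's action distribution toward
    an oracle teacher trained with perfect information.
    """
    kl = np.sum(teacher_probs * (np.log(teacher_probs + eps)
                                 - np.log(student_probs + eps)))
    return rl_loss + alpha * kl

# When the student already matches the teacher, the KL term vanishes
# and only the RL loss remains.
p = np.array([0.7, 0.3])
loss = distillation_loss(p, p, rl_loss=1.0, alpha=0.5)
```

At deployment only the student is kept, since the teacher's perfect-information inputs are unavailable in the noisy market setting.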
Sample-efficient Reinforcement Learning Representation Learning with Curiosity Contrastive Forward Dynamics Model
Developing an agent in reinforcement learning (RL) that is capable of
performing complex control tasks directly from high-dimensional observations
such as raw pixels remains a challenge, as efforts continue towards improving
sample efficiency and generalization. This paper considers a learning framework
for Curiosity Contrastive Forward Dynamics Model (CCFDM) in achieving a more
sample-efficient RL based directly on raw pixels. CCFDM incorporates a forward
dynamics model (FDM) and performs contrastive learning to train its deep
convolutional neural network-based image encoder (IE) to extract conducive
spatial and temporal information, improving sample efficiency for RL.
In addition, during training, CCFDM provides intrinsic rewards based on the
FDM prediction error, encouraging the curiosity of the RL agent and improving
exploration. The diverse and less-repetitive observations provided by our
exploration strategy, together with the data augmentation used in contrastive
learning, improve not only sample efficiency but also generalization. Existing
model-free RL methods such as Soft Actor-Critic, when built on top of CCFDM,
outperform prior state-of-the-art pixel-based RL methods on the DeepMind
Control Suite benchmark.
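The two ingredients described above, a curiosity bonus from forward-dynamics prediction error and a contrastive (InfoNCE-style) loss on latent codes, can be sketched as follows. These are illustrative forms over latent vectors; the paper's exact losses and encoder may differ.

```python
import numpy as np

def intrinsic_reward(z_pred, z_next, scale=1.0):
    """Curiosity bonus: squared error between the forward-dynamics
    prediction and the encoded next observation."""
    return scale * float(np.sum((z_pred - z_next) ** 2))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Minimal InfoNCE contrastive loss over latent vectors: pull the
    positive pair together, push negatives apart."""
    def sim(a, b):
        return float(a @ b) / temperature
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives])
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# A perfect forward-dynamics prediction earns zero intrinsic reward.
z = np.array([0.5, -0.5])
r_int = intrinsic_reward(z, z)
```

In a full agent the intrinsic reward is added to the environment reward, while the contrastive loss trains the shared image encoder.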