Model-Based Reinforcement Learning with Multinomial Logistic Function Approximation
We study model-based reinforcement learning (RL) for episodic Markov decision
processes (MDP) whose transition probability is parametrized by an unknown
transition core with features of state and action. Despite much recent progress
in analyzing algorithms in the linear MDP setting, the understanding of more
general transition models remains very limited. In this paper, we establish a
provably efficient RL algorithm for the MDP whose state transition is given by
a multinomial logistic model. To balance the exploration-exploitation
trade-off, we propose an upper confidence bound-based algorithm. We show that
our proposed algorithm achieves a regret bound of $\tilde{\mathcal{O}}(d\sqrt{H^3 T})$,
where $d$ is the dimension of the transition core, $H$ is the horizon,
and $T$ is the total number of steps. To the best of our knowledge, this is the
first provably efficient model-based RL algorithm with multinomial logistic
function approximation. We also comprehensively evaluate our proposed
algorithm numerically and show that it consistently outperforms the existing
methods, hence achieving both provable efficiency and superior practical
performance.
Comment: Accepted at AAAI 2023 (Main Technical Track)
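Since the abstract only names the model class, a minimal sketch of a multinomial logistic transition model follows, assuming a finite set of candidate next states and illustrative names (mnl_transition_probs, the feature matrix phi); this shows only the softmax form the abstract describes, not the paper's algorithm.

import numpy as np

def mnl_transition_probs(theta, features):
    """Softmax transition probabilities for one (s, a) pair.

    theta:    (d,) transition core vector (unknown to the learner).
    features: (num_next_states, d) matrix whose rows are varphi(s, a, s').
    Returns:  (num_next_states,) vector P(. | s, a).
    """
    logits = features @ theta
    logits = logits - logits.max()  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy usage: d = 3 core dimensions, 4 candidate next states.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
phi = rng.normal(size=(4, 3))
print(mnl_transition_probs(theta, phi))  # entries sum to 1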
Quantum attacks on Bitcoin, and how to protect against them
The key cryptographic protocols used to secure the internet and financial
transactions of today are all susceptible to attack by the development of a
sufficiently large quantum computer. One particular area at risk is
cryptocurrencies, a market currently worth over 150 billion USD. We investigate
the risk of Bitcoin, and other cryptocurrencies, to attacks by quantum
computers. We find that the proof-of-work used by Bitcoin is relatively
resistant to substantial speedup by quantum computers in the next 10 years,
mainly because specialized ASIC miners are extremely fast compared to the
estimated clock speed of near-term quantum computers. On the other hand, the
elliptic curve signature scheme used by Bitcoin is much more at risk, and could
be completely broken by a quantum computer as early as 2027, by the most
optimistic estimates. We analyze an alternative proof-of-work called Momentum,
based on finding collisions in a hash function, that is even more resistant to
speedup by a quantum computer. We also review the available post-quantum
signature schemes to see which one would best meet the security and efficiency
requirements of blockchain applications.
Comment: 21 pages, 6 figures. For a rough update on the progress of quantum
devices and prognostications on the time until digital signatures can be
broken, see https://www.quantumcryptopocalypse.com/quantum-moores-law
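For readers unfamiliar with Momentum, a minimal sketch of its collision-based proof-of-work follows, assuming truncated SHA-256 digests and toy parameters; the exact Momentum specification differs, and all names here are illustrative.

import hashlib

def momentum_collision(header: bytes, nonce_bits: int = 20, out_bits: int = 24):
    """Search nonces 0 .. 2**nonce_bits - 1 for a pair whose truncated
    digests collide (a birthday collision). Returns (a, b) or None."""
    seen = {}
    for nonce in range(1 << nonce_bits):
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        key = int.from_bytes(digest, "big") >> (256 - out_bits)  # keep top bits
        if key in seen:
            return seen[key], nonce  # two nonces hashing to the same value
        seen[key] = nonce
    return None

print(momentum_collision(b"example-block-header"))

The birthday bound is what matters here: with 2^24 possible truncated outputs, a collision is expected after only about 2^12 trials, and a quantum computer gains comparatively little over this memory-heavy search, which is the abstract's point.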
Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning
Provably efficient Model-Based Reinforcement Learning (MBRL) based on
optimism or posterior sampling (PSRL) is guaranteed to attain global
optimality asymptotically by introducing a complexity measure of the model.
However, this complexity can grow exponentially even for the simplest
nonlinear models, in which case global convergence is impossible within
finitely many iterations. When the model suffers a large generalization error,
quantitatively measured by the model complexity, the uncertainty can be large.
The sampled model that the current policy is greedily optimized against will
thus be unstable, resulting in aggressive policy updates and over-exploration.
In this work, we
propose Conservative Dual Policy Optimization (CDPO) that involves a
Referential Update and a Conservative Update. The policy is first optimized
under a reference model, which imitates the mechanism of PSRL while offering
more stability. A conservative range of randomness is guaranteed by maximizing
the expectation of model value. Without harmful sampling procedures, CDPO can
still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic
policy improvement and global optimality simultaneously. Empirical results also
validate the exploration efficiency of CDPO.
Comment: Published at NeurIPS 2022
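To make the Referential/Conservative split concrete, here is a toy, hedged instantiation on a tiny tabular MDP, approximating the posterior by a handful of sampled models; the routine names, the restriction to stationary deterministic policies, and the exhaustive search are simplifications for illustration, not the paper's algorithm.

import itertools
import numpy as np

S, A, H = 2, 2, 5  # states, actions, horizon

def value(pi, P, R):
    """Finite-horizon value of deterministic policy pi (one action per
    state) under transition tensor P[s, a, s'] and reward R[s, a]."""
    v = np.zeros(S)
    for _ in range(H):
        v = np.array([R[s, pi[s]] + P[s, pi[s]] @ v for s in range(S)])
    return v[0]  # value from start state 0

def optimize_policy(P, R):
    """Exhaustive search for the optimal deterministic policy (toy-sized)."""
    return max(itertools.product(range(A), repeat=S),
               key=lambda pi: value(pi, P, R))

rng = np.random.default_rng(0)
posterior = [(rng.dirichlet(np.ones(S), size=(S, A)),  # sampled P
              rng.uniform(size=(S, A)))                 # sampled R
             for _ in range(10)]

# 1) Referential Update: plan against a stable reference model
#    (here, the posterior mean), imitating PSRL's mechanism.
P_ref = np.mean([P for P, _ in posterior], axis=0)
R_ref = np.mean([R for _, R in posterior], axis=0)
ref_policy = optimize_policy(P_ref, R_ref)

# 2) Conservative Update: among candidate policies, maximize the value
#    *in expectation over the posterior*, avoiding aggressive updates
#    driven by a single sampled model.
candidates = list(itertools.product(range(A), repeat=S))
policy = max(candidates,
             key=lambda pi: np.mean([value(pi, P, R) for P, R in posterior]))
print("reference:", ref_policy, "conservative:", policy)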
One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration
In online reinforcement learning (online RL), balancing exploration and
exploitation is crucial for finding an optimal policy in a sample-efficient
way. To achieve this, existing sample-efficient online RL algorithms typically
consist of three components: estimation, planning, and exploration. However, in
order to cope with general function approximators, most of them involve
impractical algorithmic components to incentivize exploration, such as
optimization within data-dependent level-sets or complicated sampling
procedures. To address this challenge, we propose an easy-to-implement RL
framework called Maximize to Explore (MEX), which only needs to optimize,
without constraints, a single objective that integrates the estimation and
planning components while balancing exploration and exploitation
automatically. Theoretically, we prove that MEX achieves sublinear regret
with general function approximations for Markov decision processes (MDPs)
and is further extendable to two-player zero-sum Markov games (MGs).
Meanwhile, we adapt deep RL baselines to design practical versions of MEX,
in both model-free and model-based manners, which can outperform baselines
by a stable margin in various MuJoCo environments with sparse rewards.
Compared with existing sample-efficient online RL algorithms with general
function approximations, MEX achieves similar sample efficiency while
enjoying a lower computational cost and greater compatibility with modern
deep RL methods.
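As a rough illustration of fusing estimation and planning into one unconstrained objective, the toy bandit sketch below scores each hypothesis by its optimal value minus eta times its estimation loss, then acts greedily under the maximizer; the finite hypothesis grid, Bernoulli bandit, and fixed eta are assumptions for illustration, not the paper's construction.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.7])                  # unknown 2-armed bandit
grid = np.linspace(0.05, 0.95, 10)
hypotheses = [(p, q) for p in grid for q in grid]  # candidate mean pairs

def nll(h, counts):
    """Negative log-likelihood of per-arm (successes, failures) counts under h."""
    return -sum(s * np.log(h[a]) + f * np.log(1.0 - h[a])
                for a, (s, f) in enumerate(counts))

counts = [[0, 0], [0, 0]]  # [successes, failures] per arm
eta = 1.0
for _ in range(200):
    # One unconstrained objective fusing planning (max over arms gives the
    # value of acting greedily under h) and estimation (fit to the data);
    # the value term rewards optimistic hypotheses until data refutes them.
    f = max(hypotheses, key=lambda h: max(h) - eta * nll(h, counts))
    arm = int(np.argmax(f))
    reward = rng.random() < true_means[arm]
    counts[arm][0 if reward else 1] += 1

print("per-arm (successes, failures):", counts)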
Making Linear MDPs Practical via Contrastive Representation Learning
It is common to address the curse of dimensionality in Markov decision
processes (MDPs) by exploiting low-rank representations. This motivates much of
the recent theoretical study on linear MDPs. However, most approaches require a
given representation under unrealistic assumptions about the normalization of
the decomposition or introduce unresolved computational challenges in practice.
Instead, we consider an alternative definition of linear MDPs that
automatically ensures normalization while allowing efficient representation
learning via contrastive estimation. The framework also admits
confidence-adjusted index algorithms, enabling an efficient and principled
approach to incorporating optimism or pessimism in the face of uncertainty. To
the best of our knowledge, this provides the first practical representation
learning method for linear MDPs that achieves both strong theoretical
guarantees and empirical performance. Theoretically, we prove that the proposed
algorithm is sample efficient in both the online and offline settings.
Empirically, we demonstrate superior performance over existing state-of-the-art
model-based and model-free algorithms on several benchmarks.
Comment: ICML 2022. The first two authors contributed equally
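As one plausible reading of the contrastive estimation used here, the sketch below computes an InfoNCE-style loss in which the observed next state is the positive and the other batch elements serve as negatives, so the implied transition model is normalized by construction; the shapes and the exact loss form are illustrative assumptions, not the paper's estimator.

import numpy as np

def contrastive_loss(phi_sa, mu_next):
    """InfoNCE-style loss for a batch of transitions.

    phi_sa:  (B, d) features of the observed (s, a) pairs.
    mu_next: (B, d) embeddings of the observed next states; row i is the
             positive for row i of phi_sa, all other rows act as negatives.
    """
    logits = phi_sa @ mu_next.T                     # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives on the diagonal

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 4)), rng.normal(size=(8, 4))))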
Finding and Certifying (Near-)Optimal Strategies in Black-Box Extensive-Form Games
Often -- for example in war games, strategy video games, and financial
simulations -- the game is given to us only as a black-box simulator in which
we can play it. In these settings, since the game may have unknown nature
action distributions (from which we can only obtain samples) and/or be too
large to expand fully, it can be difficult to compute strategies with
guarantees on exploitability. Recent work \cite{Zhang20:Small} resulted in a
notion of certificate for extensive-form games that allows exploitability
guarantees while not expanding the full game tree. However, that work assumed
that the black box could sample or expand arbitrary nodes of the game tree at
any time, and that a series of exact game solves (via, for example, linear
programming) can be conducted to compute the certificate. Each of those two
assumptions severely restricts the practical applicability of that method. In
this work, we relax both of these assumptions. We show that high-probability
certificates can be obtained with a black box that can do nothing more than
play through games, using only a regret minimizer as a subroutine. As a bonus,
we obtain an equilibrium-finding algorithm with a $\tilde{O}(1/\sqrt{T})$
convergence rate in the extensive-form game setting that does not rely on a
sampling strategy with lower-bounded reach probabilities (which MCCFR assumes).
We demonstrate experimentally that, in the black-box setting, our methods are
able to provide nontrivial exploitability guarantees while expanding only a
small fraction of the game tree.
Comment: AAAI 2021
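To illustrate the "regret minimizer as a subroutine" idea, the sketch below runs regret matching in self-play on a toy zero-sum matrix game and reads off a duality-gap exploitability bound from the average strategies; the matrix-game setting is a deliberate simplification of the extensive-form, black-box case the paper treats.

import numpy as np

A = np.array([[0.0, 1.0], [1.0, 0.5]])  # row player's payoff (toy game)

def regret_matching(regrets):
    """Play proportionally to positive cumulative regret (uniform if none)."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(regrets), 1.0 / len(regrets))

T = 5000
rx, ry = np.zeros(2), np.zeros(2)
avg_x, avg_y = np.zeros(2), np.zeros(2)
for _ in range(T):
    x, y = regret_matching(rx), regret_matching(ry)
    avg_x += x
    avg_y += y
    rx += A @ y - x @ A @ y      # row regrets (maximizer)
    ry += -(x @ A) + x @ A @ y   # column regrets (minimizer)
avg_x /= T
avg_y /= T

# Exploitability certificate: how much either player could gain by deviating.
gap = (A @ avg_y).max() - (avg_x @ A).min()
print("duality gap:", gap)  # shrinks at roughly O(1/sqrt(T))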