High-confidence error estimates for learned value functions
Estimating the value function for a fixed policy is a fundamental problem in
reinforcement learning. New policy evaluation algorithms continue to be
developed to improve convergence rates, improve stability, and handle
variability, particularly for off-policy learning. To
understand the properties of these algorithms, the experimenter needs
high-confidence estimates of the accuracy of the learned value functions. For
environments with small, finite state-spaces, such as chains, the true value
function can be computed exactly, making accuracy easy to measure. For large or
continuous state-spaces, however, this is no longer feasible. In this paper, we address
the largely open problem of how to obtain these high-confidence estimates, for
general state-spaces. We provide a high-confidence bound on the deviation of an
empirical estimate of the value error from the true value error. We use this bound to
design an offline sampling algorithm, which stores the required quantities to
repeatedly compute value error estimates for any learned value function. We
provide experiments investigating the number of samples required by this
offline algorithm in simple benchmark reinforcement learning domains, and
highlight that there are still many open questions to be solved for this
important problem.
Comment: Presented at Uncertainty in Artificial Intelligence (UAI) 201
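As a toy illustration of the quantity this abstract is about, the following sketch estimates the value error of a learned value function by Monte Carlo on a small chain, with a Hoeffding-style half-width. All names, the environment, and the crude range bound are illustrative assumptions; in particular, comparing against single sampled returns inflates the estimate by the return variance, and handling that gap rigorously is exactly what the paper's bound addresses.

```python
import math
import random

def rollout_return(start, gamma=0.9, horizon=50, rng=random):
    """Monte Carlo return of a uniform random-walk policy on a 5-state chain
    (reward 1 on reaching the rightmost state, then the episode ends)."""
    s, g, disc = start, 0.0, 1.0
    for _ in range(horizon):
        s = max(0, min(4, s + rng.choice([-1, 1])))
        g += disc * (1.0 if s == 4 else 0.0)
        disc *= gamma
        if s == 4:
            break
    return g

def empirical_value_error(v_hat, n_samples=500, delta=0.05, seed=0):
    """Empirical squared value error of a learned value function v_hat over a
    uniform state distribution, plus a Hoeffding-style confidence half-width.
    NOTE: a sketch, not the paper's offline sampling algorithm or bound."""
    rng = random.Random(seed)
    errs = []
    for _ in range(n_samples):
        s = rng.randrange(5)
        errs.append((v_hat[s] - rollout_return(s, rng=rng)) ** 2)
    mean = sum(errs) / n_samples
    b = max(errs)  # crude range bound for the Hoeffding term
    half_width = b * math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))
    return mean, half_width
```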
Adaptively Truncating Backpropagation Through Time to Control Gradient Bias
Truncated backpropagation through time (TBPTT) is a popular method for
learning in recurrent neural networks (RNNs) that saves computation and memory
at the cost of bias by truncating backpropagation after a fixed number of lags.
In practice, choosing the optimal truncation length is difficult: TBPTT will
not converge if the truncation length is too small, or will converge slowly if
it is too large. We propose an adaptive TBPTT scheme that converts the problem
from choosing a temporal lag to one of choosing a tolerable amount of gradient
bias. For many realistic RNNs, the TBPTT gradients decay geometrically in
expectation for large lags; under this condition, we can control the bias by
varying the truncation length adaptively. For RNNs with smooth activation
functions, we prove that this bias controls the convergence rate of SGD with
biased gradients for our non-convex loss. Using this theory, we develop a
practical method for adaptively estimating the truncation length during
training. We evaluate our adaptive TBPTT method on synthetic data and language
modeling tasks and find that our adaptive TBPTT ameliorates the computational
pitfalls of fixed TBPTT.
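The core idea of this abstract, trading a fixed truncation length for a bias tolerance under geometric gradient decay, can be sketched as follows. The decision rule and names are illustrative; the paper's estimator is developed adaptively during training.

```python
def choose_truncation(grad_norms, bias_tolerance=0.05):
    """Pick the smallest truncation length K such that the estimated tail of
    the gradient beyond K is at most `bias_tolerance` of the total.

    grad_norms[k] is the norm of the gradient contribution at lag k, assumed
    (approximately) geometrically decaying for large k, as in the abstract."""
    total = sum(grad_norms)
    if total == 0:
        return 1
    tail = total
    for k, g in enumerate(grad_norms, start=1):
        tail -= g
        if tail <= bias_tolerance * total:
            return k
    return len(grad_norms)
```

With geometrically decaying contributions 0.5, 0.25, 0.125, ..., a 5% bias tolerance yields a truncation length of 5 lags.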
A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation
Temporal difference learning (TD) is a simple iterative algorithm used to
estimate the value function corresponding to a given policy in a Markov
decision process. Although TD is one of the most widely used algorithms in
reinforcement learning, its theoretical analysis has proved challenging and few
guarantees on its statistical efficiency are available. In this work, we
provide a simple and explicit finite time analysis of temporal difference
learning with linear function approximation. Except for a few key insights, our
analysis mirrors standard techniques for analyzing stochastic gradient descent
algorithms, and therefore inherits the simplicity and elegance of that
literature. Final sections of the paper show how all of our main results extend
to the study of TD learning with eligibility traces, known as TD(λ), and to
Q-learning applied in high-dimensional optimal stopping problems.
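The algorithm this analysis concerns is the standard TD(0) update with linear function approximation, theta ← theta + α(r + γ·φ(s')ᵀθ − φ(s)ᵀθ)·φ(s). A minimal runnable sketch on a 5-state random walk (the environment and one-hot features are illustrative choices, not from the paper):

```python
import random

def phi(s, n_states=5):
    """One-hot state features; any linear feature map works the same way."""
    f = [0.0] * n_states
    f[s] = 1.0
    return f

def td0_linear(n_episodes=5000, alpha=0.1, gamma=0.9, seed=0):
    """TD(0) with linear function approximation on a 5-state random walk.
    Episodes terminate off either end; reward 1 for exiting on the right."""
    rng = random.Random(seed)
    theta = [0.0] * 5
    for _ in range(n_episodes):
        s = 2  # start in the middle
        while True:
            s2 = s + rng.choice([-1, 1])
            done = s2 < 0 or s2 > 4
            r = 1.0 if s2 > 4 else 0.0
            v_s = sum(t * f for t, f in zip(theta, phi(s)))
            v_s2 = 0.0 if done else sum(t * f for t, f in zip(theta, phi(s2)))
            delta = r + gamma * v_s2 - v_s  # TD error
            theta = [t + alpha * delta * f for t, f in zip(theta, phi(s))]
            if done:
                break
            s = s2
    return theta
```

The learned weights increase toward the rewarding right end of the chain, as the true value function does.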
Classification-based Approximate Policy Iteration: Experiments and Extended Discussions
Tackling large approximate dynamic programming or reinforcement learning
problems requires methods that can exploit regularities, or intrinsic
structure, of the problem at hand. Most current methods are geared towards
exploiting the regularities of either the value function or the policy. We
introduce a general classification-based approximate policy iteration (CAPI)
framework, which encompasses a large class of algorithms that can exploit
regularities of both the value function and the policy space, depending on what
is advantageous. This framework has two main components: a generic value
function estimator and a classifier that learns a policy based on the estimated
value function. We establish theoretical guarantees for the sample complexity
of CAPI-style algorithms, which allow the policy evaluation step to be
performed by a wide variety of algorithms (including temporal-difference-style
methods), and can handle nonparametric representations of policies. Our bounds
on the estimation error of the performance loss are tighter than existing
results. We also illustrate this approach empirically on several problems,
including a large HIV control task.
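A minimal instance of the CAPI recipe described above, a value estimator labeling sampled states with greedy actions and a classifier fitting those labels, might look like this. The 1-D decision-stump "classifier" and the toy Q-function in the usage below are illustrative assumptions standing in for the framework's generic components.

```python
def capi_iteration(q_estimate, states, actions):
    """One classification-based policy iteration step: label each sampled
    state with its greedy action under the current value estimate, then fit
    a classifier (here, a 1-D threshold stump) to those labels."""
    labels = [(s, max(actions, key=lambda a: q_estimate(s, a))) for s in states]
    # Fit the stump: choose threshold t minimizing misclassification of the
    # rule "take 'right' iff s >= t".
    best_t, best_err = None, float("inf")
    for t in states:
        err = sum((s >= t) != (a == "right") for s, a in labels)
        if err < best_err:
            best_t, best_err = t, err
    return lambda s: "right" if s >= best_t else "left"
```

For example, with a toy Q-estimate that favors "right" above state 5, the fitted stump reproduces the greedy policy.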
A Self-Adaptive Proposal Model for Temporal Action Detection based on Reinforcement Learning
Existing action detection algorithms usually generate action proposals
through an extensive search over the video at multiple temporal scales, which
brings about huge computational overhead and deviates from the human perception
procedure. We argue that the process of detecting actions should be naturally
one of observation and refinement: observe the current window and refine the
span of attended window to cover true action regions. In this paper, we propose
an active action proposal model that learns to find actions through
continuously adjusting the temporal bounds in a self-adaptive way. The whole
process can be viewed as an agent that is first placed at a random position in
the video and then applies a sequence of transformations to the current
attended region to discover actions according to a learned policy. We utilize
reinforcement learning, specifically the deep Q-learning algorithm, to learn the
agent's decision policy. In addition, we use temporal pooling operation to
extract more effective feature representation for the long temporal window, and
design a regression network to adjust the position offsets between predicted
results and the ground truth. Experimental results on THUMOS 2014 validate the
effectiveness of the proposed approach, which achieves competitive performance
with current action detection algorithms using far fewer proposals.
Comment: Deep Reinforcement Learning, Temporal Action Detection, Temporal
Location Regression
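A tabular toy stand-in for the Q-learning setup described above, an agent shifting a fixed-length window to maximize temporal overlap (IoU) with a target action span, could look like the sketch below. The discretized timeline, action set, and hyperparameters are all illustrative; the paper's agent operates on deep video features.

```python
import random

def iou(a, b):
    """Temporal IoU of two intervals given as (start, end) tuples."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def train_window_agent(target=(30, 50), episodes=3000, seed=0):
    """Tabular Q-learning for an agent that shifts a length-20 window along
    a discretized timeline; stopping yields the IoU with the target span."""
    rng = random.Random(seed)
    actions = ["left", "right", "stop"]
    q = {}

    def qv(s, a):
        return q.get((s, a), 0.0)

    for _ in range(episodes):
        s = rng.randrange(0, 17) * 5  # window start in {0, 5, ..., 80}
        for _ in range(30):  # cap episode length
            explore = rng.random() < 0.2
            a = rng.choice(actions) if explore else \
                max(actions, key=lambda x: qv(s, x))
            if a == "stop":
                r, s2, done = iou((s, s + 20), target), s, True
            else:
                s2 = max(0, min(80, s + (5 if a == "right" else -5)))
                r, done = 0.0, False
            best_next = 0.0 if done else max(qv(s2, x) for x in actions)
            q[(s, a)] = qv(s, a) + 0.3 * (r + 0.9 * best_next - qv(s, a))
            if done:
                break
            s = s2
    return q
```

After training, stopping at the perfectly aligned window (start 30) carries a high Q-value, since its stop reward is IoU = 1.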
Sequential anatomy localization in fetal echocardiography videos
Fetal heart motion is an important diagnostic indicator for structural
detection and functional assessment of congenital heart disease. We propose an
approach towards integrating deep convolutional and recurrent architectures
that utilize localized spatial and temporal features of different anatomical
substructures within a global spatiotemporal context for interpretation of
fetal echocardiography videos. We formulate our task as a cardiac structure
localization problem with convolutional architectures for aggregating global
spatial context and detecting anatomical structures on spatial region
proposals. This information is aggregated temporally by recurrent architectures
to quantify the progressive motion patterns. We experimentally show that the
resulting architecture combines anatomical landmark detection at the
frame-level over multiple video sequences with the temporal progress of the
associated anatomical motions to encode local spatiotemporal fetal heart
dynamics and is validated on a real-world clinical dataset.
Comment: To appear in ISBI 201
Online Monotone Games
Algorithmic game theory (AGT) focuses on the design and analysis of
algorithms for interacting agents, with interactions rigorously formalized
within the framework of games. Results from AGT find applications in domains
such as online bidding auctions for web advertisements and network routing
protocols. Monotone games are games where agent strategies naturally converge
to an equilibrium state. Previous results in AGT have been obtained for convex,
socially-convex, or smooth games, but not monotone games. Our primary
theoretical contributions are defining the monotone game setting and its
extension to the online setting, a new notion of regret for this setting, and
accompanying algorithms that achieve sub-linear regret. We demonstrate the
utility of online monotone game theory on a variety of problem domains
including variational inequalities, reinforcement learning, and generative
adversarial networks.
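To make the notion of sub-linear regret concrete, here is textbook online gradient descent with step size η_t ∝ 1/√t, which attains O(√T) regret on bounded convex losses. This is a generic illustration of the regret framework, not the paper's monotone-game algorithm.

```python
import math

def online_gradient_descent(grads, project, x0, radius=1.0):
    """Online gradient descent: at round t, play x_t, observe the gradient
    of the round's loss, and step with eta_t = radius / sqrt(t). Returns the
    sequence of iterates played."""
    x, xs = x0, []
    for t, grad in enumerate(grads, start=1):
        xs.append(x)
        eta = radius / math.sqrt(t)
        x = project(x - eta * grad(x))
    return xs
```

For a fixed quadratic loss (x − 0.5)² on [0, 1], the iterates settle at the minimizer, so the average regret vanishes.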
End-to-end Learning of Action Detection from Frame Glimpses in Videos
In this work we introduce a fully end-to-end approach for action detection in
videos that learns to directly predict the temporal bounds of actions. Our
intuition is that the process of detecting actions is naturally one of
observation and refinement: observing moments in video, and refining hypotheses
about when an action is occurring. Based on this insight, we formulate our
model as a recurrent neural network-based agent that interacts with a video
over time. The agent observes video frames and decides both where to look next
and when to emit a prediction. Since backpropagation is not adequate in this
non-differentiable setting, we use REINFORCE to learn the agent's decision
policy. Our model achieves state-of-the-art results on the THUMOS'14 and
ActivityNet datasets while observing only a fraction (2% or less) of the video
frames.
Comment: Update to version in CVPR 2016 proceedings
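The REINFORCE estimator used to train the agent can be illustrated on a two-armed bandit: weight the score-function gradient ∇log π(a) by the observed reward. The softmax policy and reward values below are toy assumptions, not the paper's video model.

```python
import math
import random

def reinforce_bandit(n_steps=2000, lr=0.1, seed=0):
    """Minimal REINFORCE: learn a softmax policy over two actions using
    reward-weighted score-function gradients, grad log pi(a) * reward."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per action
    for _ in range(n_steps):
        m = max(theta)
        exps = [math.exp(t - m) for t in theta]  # stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.2  # action 1 is better (toy rewards)
        # d log pi(a) / d theta_k = 1{k == a} - probs[k]
        for k in range(2):
            theta[k] += lr * r * ((1.0 if k == a else 0.0) - probs[k])
    return theta
```

The logit of the higher-reward action grows, so the policy concentrates on it.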
Nonparametric risk bounds for time-series forecasting
We derive generalization error bounds for traditional time-series forecasting
models. Our results hold for many standard forecasting tools including
autoregressive models, moving average models, and, more generally, linear
state-space models. These non-asymptotic bounds need only weak assumptions on
the data-generating process, yet allow forecasters to select among competing
models and to guarantee, with high probability, that their chosen model will
perform well. We motivate our techniques with and apply them to standard
economic and financial forecasting tools: a GARCH model for predicting equity
volatility and a dynamic stochastic general equilibrium (DSGE) model, the
standard tool in macroeconomic forecasting. We demonstrate in particular how
our techniques can aid forecasters and policy makers in choosing models which
behave well under uncertainty and mis-specification.
Comment: 34 pages, 3 figures
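The model-selection use of such bounds can be sketched in miniature: fit a forecasting model by least squares, then compare models by empirical risk plus a complexity term that shrinks like √(1/n). The AR(1) fit and the specific penalty below are illustrative assumptions, not the paper's bound.

```python
import math
import random

def simulate_ar1(a=0.8, n=500, noise=0.1, seed=0):
    """Generate a synthetic AR(1) series x_t = a * x_{t-1} + Gaussian noise."""
    rng = random.Random(seed)
    x = [1.0]
    for _ in range(n):
        x.append(a * x[-1] + noise * rng.gauss(0, 1))
    return x

def fit_ar1(series):
    """Least-squares AR(1) coefficient for the model x_t ~ a * x_{t-1}."""
    num = sum(y * x for y, x in zip(series[1:], series[:-1]))
    den = sum(x * x for x in series[:-1])
    return num / den

def penalized_risk(series, a, penalty_scale=1.0):
    """Empirical one-step squared forecast error plus a sqrt(1/n)-style
    complexity term, mimicking how a non-asymptotic bound trades fit
    against complexity when selecting among forecasting models."""
    n = len(series) - 1
    risk = sum((y - a * x) ** 2 for y, x in zip(series[1:], series[:-1])) / n
    return risk + penalty_scale * math.sqrt(1.0 / n)
```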
Memory shapes time perception and intertemporal choices
There is a consensus that human and non-human subjects experience temporal
distortions in many stages of their perceptual and decision-making systems.
Similarly, intertemporal choice research has shown that decision-makers
undervalue future outcomes relative to immediate ones. Here we combine
techniques from information theory and artificial intelligence to show how both
temporal distortions and intertemporal choice preferences can be explained as a
consequence of the coding efficiency of sensorimotor representation. In
particular, the model implies that interactions that constrain future behavior
are perceived as being both longer in duration and more valuable. Furthermore,
using simulations of artificial agents, we investigate how memory constraints
enforce a renormalization of the perceived timescales. Our results show that
qualitatively different discount functions, such as exponential and hyperbolic
discounting, arise as a consequence of an agent's probabilistic model of the
world.
Comment: 24 pages, 4 figures, 2 tables. Submitted
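The qualitative difference between discount functions mentioned above, that hyperbolic discounting produces preference reversals which exponential discounting cannot, is easy to check directly. The reward magnitudes and delays below are arbitrary illustrations.

```python
def exp_discount(delay, gamma=0.9):
    """Exponential discounting: value multiplier gamma**delay."""
    return gamma ** delay

def hyper_discount(delay, k=0.5):
    """Hyperbolic discounting: value multiplier 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)

def prefers_smaller_sooner(discount, small=5.0, large=10.0, gap=5, front=0):
    """True if the agent prefers `small` delivered at time `front` over
    `large` delivered `gap` steps later."""
    return small * discount(front) > large * discount(front + gap)
```

A hyperbolic discounter takes the smaller-sooner reward when it is imminent but switches to the larger-later one when both are pushed into the future; an exponential discounter never reverses.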